A client that allows inference against existing OctoAI endpoints. Sets various headers, establishes clients for Chat under Client.chat, AssetLibrary under Client.asset, and FineTuningClient under Client.tune, and checks the OCTOAI_TOKEN environment variable if no token is provided.


OctoAIClientError - For client-side failures (throttled, no token)


OctoAIServerError - For server-side failures (unreachable, etc)


You can create an OctoAI API token by following the guide at How to Create an OctoAI Access Token


  • Constructor for the Client class.


    • Optional token: null | string

OctoAI token. If none is provided, the OCTOAI_TOKEN environment variable is checked; if that is also unset, defaults to null.

    • secureLink: boolean = false

Set to true to use the SecureLink API instead of the public API.

    Returns Client
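The token-resolution order described above can be sketched in plain TypeScript. This is a sketch of the documented behavior, not the SDK source; `resolveToken` is a hypothetical helper name.

```typescript
// Sketch of the token resolution described above; resolveToken is a
// hypothetical helper, not part of the SDK.
function resolveToken(token?: string | null): string | null {
  // An explicit token wins; otherwise fall back to the OCTOAI_TOKEN
  // environment variable, and finally to null.
  return token ?? process.env.OCTOAI_TOKEN ?? null;
}
```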


asset: AssetLibrary

The AssetLibrary client, accessible with Client.asset.

chat: Chat

The Chat client, accessible with Client.chat.

completions: CompletionsAPI

The CompletionsAPI client, accessible with Client.completions.

headers: {
    Accept: string;
    Authorization: string;
    Content-Type: string;
    User-Agent: string;
    X-OctoAI-Async: string;
}

Headers used to interact with OctoAI servers. Communicates authorization and request type.

Type declaration

  • Accept: string
  • Authorization: string
  • Content-Type: string
  • User-Agent: string
  • X-OctoAI-Async: string
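The client populates these headers for you; a representative shape looks like the following. The concrete values here (User-Agent string, async flag default) are illustrative assumptions, not the SDK's actual constants.

```typescript
// Illustrative header set matching the type declaration above; the
// concrete values are assumptions, not the SDK's actual strings.
const token = process.env.OCTOAI_TOKEN ?? "";
const headers: Record<string, string> = {
  Accept: "application/json",
  Authorization: `Bearer ${token}`,
  "Content-Type": "application/json",
  "User-Agent": "octoai-typescript-sdk", // hypothetical value
  "X-OctoAI-Async": "false",
};
```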
secureLink: boolean

Set to true to use the SecureLink API.

tune: FineTuningClient

The FineTuningClient, accessible with Client.tune.


  • Check the health of an endpoint using a GET request, retrying until the timeout elapses.


    • endpointUrl: string

      Target URL to run the health check.

    • timeoutMS: number = 900000

      Milliseconds before request times out. Default is 15 minutes.

    • intervalMS: number = 1000

      Interval in milliseconds between successive health check queries.

    Returns Promise<number>

    HTTP status code.


    The default timeout is set to 15 minutes to allow for potential cold start.

    For custom containers, please follow Health Check Paths in Custom Containers to set a health check endpoint.

    Information about health check endpoint URLs are available on relevant QuickStart Templates.
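The retry-until-timeout behavior can be sketched in plain TypeScript. This is a minimal sketch of the documented behavior, not the SDK source; `pollHealth` is a hypothetical name, and the fetch function is injected so the loop can be exercised without a live endpoint.

```typescript
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

// Hypothetical sketch of healthCheck's polling loop, not the SDK source.
async function pollHealth(
  endpointUrl: string,
  timeoutMS = 900_000, // 15-minute default allows for cold starts
  intervalMS = 1_000,
  fetchFn: FetchLike = fetch
): Promise<number> {
  const deadline = Date.now() + timeoutMS;
  while (true) {
    const res = await fetchFn(endpointUrl);
    // Stop on a healthy response, or once the timeout elapses.
    if (res.ok || Date.now() >= deadline) return res.status;
    await new Promise((resolve) => setTimeout(resolve, intervalMS));
  }
}
```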

  • Send a request to the given endpoint with inputs as request body. For LLaMA2 LLMs, this requires "stream": false in the inputs. To stream for LLMs, please see the inferStream method.

    Type Parameters

    • T


    • endpointUrl: string

      Target URL to run inference

    • inputs: Record<string, any>

      Necessary inputs for the endpointUrl to run inference

    Returns Promise<T>

    JSON outputs from the endpoint
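Under the hood this amounts to a JSON POST to the endpoint. The sketch below shows that assumed behavior; `inferOnce` is a hypothetical name, not the SDK method, and the POST function is injected so the round trip can be verified without a live endpoint.

```typescript
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ json(): Promise<any> }>;

// Hypothetical sketch of a single synchronous inference call.
async function inferOnce<T>(
  endpointUrl: string,
  inputs: Record<string, any>,
  token: string,
  postFn: PostFn = fetch
): Promise<T> {
  const res = await postFn(endpointUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(inputs), // inputs become the request body
  });
  return (await res.json()) as T;
}
```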

  • Execute an inference in the background on the server.


    • endpointUrl: string

      Target URL to send inference request.

    • inputs: Record<string, any>

      Contains necessary inputs for endpoint to run inference.

    Returns Promise<InferenceFuture>

    Future allows checking if results are ready then accessing them.


    Please read the Async Inference Reference for more information. Client.inferAsync returns an InferenceFuture, which can then be used with Client.isFutureReady to see the status. Once it returns true, you can use the Client.getFutureResult to get the response for your InferenceFuture.

    Assuming you have a variable with your target endpoint URL and the inputs the model needs, and an OCTOAI_TOKEN set as an environment variable, you can run a server-side asynchronous inference from QuickStart Template endpoints with something like the below.

    const client = new Client();
    const future = await client.inferAsync(url, inputs);
    if (await client.isFutureReady(future) === true) {
      return await client.getFutureResult(future);
    }
  • Stream text event response body for supporting endpoints. This is an alternative to loading all response body into memory at once. Recommended for use with LLM models. Requires "stream": true in the inputs for LLaMA2 LLMs.


    • endpointUrl: string

      Target URL to run inference

    • inputs: Record<string, any>

      Necessary inputs for the endpointUrl to run inference

    Returns Promise<Response>

    Compatible with getReader method.


    This allows you to stream back tokens from the LLMs. Below is an example of how to do this with a LLaMA2 LLM using a completions-style API.

    HuggingFace-style APIs will usually use the done variable below to indicate the end of the stream. OpenAI-style APIs will often send the string "data: [DONE]\n" in the stream to indicate the stream is complete.

    This example concatenates all values from the tokens into a single text variable. How you choose to use the tokens will likely be different, so please modify the code.

    This example assumes:

    1. You've followed the guide at How to Create an OctoAI Access Token to create and set your OctoAI access token
    2. Either that you will set this token as an OCTOAI_TOKEN envvar or edit the snippet to pass it as a value in the {@link Client.constructor}.
    3. You have assigned your endpoint URL and inputs into variables named llamaEndpoint and streamInputs.
    const client = new Client();
    const readableStream = await client.inferStream(llamaEndpoint, streamInputs);
    let text = ``;
    const streamReader = readableStream.getReader();
    for (
      let { value, done } = await streamReader.read();
      !done;
      ({ value, done } = await streamReader.read())
    ) {
      const decoded = new TextDecoder().decode(value);
      if (
        decoded === "data: [DONE]\n" ||
        decoded.includes('"finish_reason": "')
      ) {
        break;
      }
      const token = JSON.parse(decoded.substring(5));
      if (token.object === "chat.completion.chunk") {
        text += token.choices[0].delta.content;
      }
    }

    The const token = JSON.parse(decoded.substring(5)) line strips the "data:" prefix from the returned text/event-stream message, then parses the remaining JSON as an object.

Generated using TypeDoc