A client that allows inference against existing OctoAI endpoints. Sets various headers, establishes clients for Chat under Client.chat, AssetLibrary under Client.asset, and FineTuningClient under Client.tune, and checks the OCTOAI_TOKEN environment variable if no token is provided.


OctoAIClientError - For client-side failures (throttled, no token)


OctoAIServerError - For server-side failures (unreachable, etc)


You can create an OctoAI API token by following the guide at How to Create an OctoAI Access Token


  • Constructor for the Client class.


    • Optional token: null | string

OctoAI token. If none is provided, the OCTOAI_TOKEN environment variable is checked; if that is also unset, defaults to null.

    • secureLink: boolean = false

Set to true to use the SecureLink API instead of the public API.

    Returns Client
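The token-resolution order described above can be sketched in plain TypeScript. This is a sketch of the documented behavior, not the SDK source; `resolveToken` is a hypothetical helper name.

```typescript
// Sketch of the token resolution described above; resolveToken is a
// hypothetical helper, not part of the SDK.
function resolveToken(token?: string | null): string | null {
  // An explicit token wins; otherwise fall back to the OCTOAI_TOKEN
  // environment variable, and finally to null.
  return token ?? process.env.OCTOAI_TOKEN ?? null;
}
```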


asset: AssetLibrary

The AssetLibrary client, accessible with Client.asset.

chat: Chat

The Chat client, accessible with Client.chat.

completions: CompletionsAPI

The CompletionsAPI client, accessible with Client.completions.

headers: {
    Accept: string;
    Authorization: string;
    Content-Type: string;
    User-Agent: string;
    X-OctoAI-Async: string;
}

Headers used to interact with OctoAI servers. Communicates authorization and request type.

Type declaration

  • Accept: string
  • Authorization: string
  • Content-Type: string
  • User-Agent: string
  • X-OctoAI-Async: string
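The client populates these headers for you; a representative shape looks like the following. The concrete values here (User-Agent string, async flag default) are illustrative assumptions, not the SDK's actual constants.

```typescript
// Illustrative header set matching the type declaration above; the
// concrete values are assumptions, not the SDK's actual strings.
const token = process.env.OCTOAI_TOKEN ?? "";
const headers: Record<string, string> = {
  Accept: "application/json",
  Authorization: `Bearer ${token}`,
  "Content-Type": "application/json",
  "User-Agent": "octoai-typescript-sdk", // hypothetical value
  "X-OctoAI-Async": "false",
};
```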
secureLink: boolean

Set to true to use the SecureLink API.

tune: FineTuningClient

The FineTuningClient, accessible with Client.tune.


  • Check the health of an endpoint using a GET request, retrying until the timeout elapses.


    • endpointUrl: string

      Target URL to run the health check.

    • timeoutMS: number = 900000

      Milliseconds before request times out. Default is 15 minutes.

    • intervalMS: number = 1000

      Interval in milliseconds between successive health check queries.

    Returns Promise<number>

    HTTP status code.


    The default timeout is set to 15 minutes to allow for potential cold start.

    For custom containers, please follow Health Check Paths in Custom Containers to set a health check endpoint.

    Information about health check endpoint URLs are available on relevant QuickStart Templates.
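The retry-until-timeout behavior can be sketched in plain TypeScript. This is a minimal sketch of the documented behavior, not the SDK source; `pollHealth` is a hypothetical name, and the fetch function is injected so the loop can be exercised without a live endpoint.

```typescript
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

// Hypothetical sketch of healthCheck's polling loop, not the SDK source.
async function pollHealth(
  endpointUrl: string,
  timeoutMS = 900_000, // 15-minute default allows for cold starts
  intervalMS = 1_000,
  fetchFn: FetchLike = fetch
): Promise<number> {
  const deadline = Date.now() + timeoutMS;
  while (true) {
    const res = await fetchFn(endpointUrl);
    // Stop on a healthy response, or once the timeout elapses.
    if (res.ok || Date.now() >= deadline) return res.status;
    await new Promise((resolve) => setTimeout(resolve, intervalMS));
  }
}
```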

  • Send a request to the given endpoint with inputs as request body. For LLaMA2 LLMs, this requires "stream": false in the inputs. To stream for LLMs, please see the inferStream method.

    Type Parameters

    • T


    • endpointUrl: string

      Target URL to run inference

    • inputs: Record<string, any>

      Necessary inputs for the endpointUrl to run inference

    Returns Promise<T>

    JSON outputs from the endpoint
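Under the hood this amounts to a JSON POST to the endpoint. The sketch below shows that assumed behavior; `inferOnce` is a hypothetical name, not the SDK method, and the POST function is injected so the round trip can be verified without a live endpoint.

```typescript
type PostFn = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ json(): Promise<any> }>;

// Hypothetical sketch of a single synchronous inference call.
async function inferOnce<T>(
  endpointUrl: string,
  inputs: Record<string, any>,
  token: string,
  postFn: PostFn = fetch
): Promise<T> {
  const res = await postFn(endpointUrl, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify(inputs), // inputs become the request body
  });
  return (await res.json()) as T;
}
```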

  • Execute an inference in the background on the server.


    • endpointUrl: string

      Target URL to send inference request.

    • inputs: Record<string, any>

      Contains necessary inputs for endpoint to run inference.

    Returns Promise<InferenceFuture>

    Future allows checking if results are ready then accessing them.


    Please read the Async Inference Reference for more information. Client.inferAsync returns an InferenceFuture, which can then be used with Client.isFutureReady to see the status. Once it returns true, you can use the Client.getFutureResult to get the response for your InferenceFuture.

    Assuming you have a variable with your target endpoint URL and the inputs the model needs, and an OCTOAI_TOKEN set as an environment variable, you can run a server-side asynchronous inference from QuickStart Template endpoints with something like the below.

    const client = new Client();
    const future = await client.inferAsync(url, inputs);
    if (await client.isFutureReady(future) === true) {
      return await client.getFutureResult(future);
    }
  • Stream text event response body for supporting endpoints. This is an alternative to loading all response body into memory at once. Recommended for use with LLM models. Requires "stream": true in the inputs for LLaMA2 LLMs.


    • endpointUrl: string

      Target URL to run inference

    • inputs: Record<string, any>

      Necessary inputs for the endpointUrl to run inference

    Returns Promise<Response>

    Compatible with getReader method.


    This allows you to stream back tokens from the LLMs. Below is an example of how to do this with a LLaMA2 LLM using a completions-style API.

    HuggingFace-style APIs will usually use the done variable below to indicate the end of the stream. OpenAI-style APIs will often send the string "data: [DONE]\n" in the stream to indicate the stream is complete.

    This example concatenates all values from the tokens into a single text variable. How you choose to use the tokens will likely be different, so please modify the code.

    This example assumes:

    1. You've followed the guide at How to Create an OctoAI Access Token to create and set your OctoAI access token
    2. Either that you will set this token as an OCTOAI_TOKEN envvar or edit the snippet to pass it as a value in the {@link Client.constructor}.
    3. You have assigned your endpoint URL and inputs into variables named llamaEndpoint and streamInputs.
    const client = new Client();
    const readableStream = await client.inferStream(llamaEndpoint, streamInputs);
    let text = ``;
    const streamReader = readableStream.getReader();
    for (
      let { value, done } = await streamReader.read();
      !done;
      ({ value, done } = await streamReader.read())
    ) {
      const decoded = new TextDecoder().decode(value);
      if (
        decoded === "data: [DONE]\n" ||
        decoded.includes('"finish_reason": "')
      ) {
        break;
      }
      const token = JSON.parse(decoded.substring(5));
      if (token.object === "chat.completion.chunk") {
        text += token.choices[0].delta.content;
      }
    }

    The const token = JSON.parse(decoded.substring(5)) line strips the "data:" prefix from the returned text/event-stream message, then parses the remaining JSON as an object.

Generated using TypeDoc