Ollama provides an HTTP-based API that allows developers to programmatically interact with its models. This guide will walk you through the detailed usage of the Ollama API, including request formats, response formats, and example code.

Starting the Ollama Service

Before using the API, ensure the Ollama service is running. You can start it with the following command:

ollama serve

By default, the service runs at http://localhost:11434, and all API endpoints below are relative to this base URL.

Conventions

Model Names

Model names follow a model:tag format. The model part can include an optional namespace, like example/model. For instance, deepseek-r1:14b and llama3.2:1b are valid examples. The tag is optional and defaults to latest if not specified. It’s used to pinpoint a specific version of the model.

Durations

All durations are measured and returned in nanoseconds.

Streaming Responses

Some endpoints stream responses as JSON objects. You can disable streaming by passing {"stream": false} in the request for these endpoints.

API Endpoints

Ollama offers several key API endpoints:

Generate Text

Sends a prompt to the model and retrieves the generated text.

HTTP Method: POST

URL: /api/generate

Parameters

  • model: (required) the model name
  • prompt: the prompt to generate a response for
  • suffix: the text after the model response
  • images: (optional) a list of base64-encoded images (for multimodal models such as llava)

Advanced Parameters (Optional):

  • format: the format to return a response in. Format can be json or a JSON schema
  • options: additional model parameters listed in the documentation for the Modelfile such as temperature
  • system: system message (overrides what is defined in the Modelfile)
  • template: the prompt template to use (overrides what is defined in the Modelfile)
  • stream: if false the response will be returned as a single response object, rather than a stream of objects
  • raw: if true no formatting will be applied to the prompt. You may choose to use the raw parameter if you are specifying a full templated prompt in your request to the API
  • keep_alive: controls how long the model will stay loaded into memory following the request (default: 5m)
  • context (deprecated): the context parameter returned from a previous request to /generate; this can be used to keep a short conversational memory
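
A minimal Python sketch that exercises a few of these optional parameters follows; the system prompt text and the option values are illustrative, not required settings. The generic request and response formats are shown next.

import requests

# Illustrative request: the system, keep_alive, and options values are just examples.
payload = {
    "model": "llama3.2:1b",
    "prompt": "Explain what a Modelfile is in one sentence.",
    "system": "You are a concise technical assistant.",  # overrides the Modelfile system message
    "keep_alive": "10m",                                  # keep the model loaded for 10 minutes
    "stream": False,
    "options": {"temperature": 0.2, "num_predict": 64}    # Modelfile-style options
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])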

Request Format:

{
  "model": "<model-name>",    // Model name
  "prompt": "<input-text>",   // Input prompt
  "stream": false,            // Disable streaming (the default is true)
  "options": {                // Optional model parameters
    "temperature": 0.7,       // Temperature setting
    "num_predict": 100        // Maximum number of tokens to generate
  }
}

Response Format:

{
  "response": "<generated-text>", // Generated text
  "done": true                    // Whether the task is complete
}

Important
When using the JSON format, it’s important to instruct the model to use JSON in the prompt. Otherwise, the model may generate large amounts of whitespace.
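
Beyond the literal string "json", the format field also accepts a JSON schema, in which case the output is constrained to match that schema. Below is a minimal Python sketch of this usage; the schema and its field names are illustrative only.

import requests

# Illustrative schema: ask the model to return a small structured object.
schema = {
    "type": "object",
    "properties": {
        "language": {"type": "string"},
        "lines_of_code": {"type": "integer"}
    },
    "required": ["language", "lines_of_code"]
}

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "prompt": "Describe a bubble sort implementation. Respond in JSON.",
        "format": schema,   // a JSON schema instead of the string "json"
        "stream": False
    }
)
print(response.json()["response"])  # a JSON string that should match the schema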

Request Example:

curl http://localhost:11434/api/generate --data '{
  "model": "llama3.2:1b",
  "prompt": "Write a bubble sort algorithm in Python",
  "format": "json",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 100
  }
}'

Response (Success):

Because stream is set to false, a single JSON object is returned:

{
"model": "llama3.2:1b",
"created_at": "2025-02-08T10:58:03.1357634Z",
"response": "{\n\n\n \"bubble_sort\": [\n {\"algorithm\": \"Bubble Sort\", \"description\": \"Sorts the list by repeatedly swapping the adjacent elements if they are in wrong order\"},\n {\"steps\": {\n \"n\": 10, \"list\": [4, 2, 9, 6, 5, 1, 8, 3, 7, 0],\n \"expected\": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n }}\n ]\n}\n\n \n\n\n ",
"done": true,
"done_reason": "stop",
"context": [128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,8144,264,24529,3460,12384,304,13325,128009,128006,78191,128007,271,54732,262,330,78589,18942,794,2330,286,5324,20266,794,330,76878,16347,498,330,4789,794,330,10442,82,279,1160,555,19352,64819,279,24894,5540,422,814,527,304,5076,2015,7260,286,5324,25047,794,341,310,330,77,794,220,605,11,220,330,1638,794,510,19,11,220,17,11,220,24,11,220,21,11,220,20,11,220,16,11,220,23,11,220,18,11,220,22,11,220,15,1282,310,330,7475,794,510,15,11,220,16,11,220,17,11,220,18,11,220,19,11,220,20,11,220,21,11,220,22,11,220,23,11,220,24,933,286,8256,262,5243,633,62539,262],
"total_duration": 6768781400,
"load_duration": 22301300,
"prompt_eval_count": 32,
"prompt_eval_duration": 249000000,
"eval_count": 126,
"eval_duration": 6494000000
}

The final response in the stream (or the single response object when streaming is disabled, as in this example) also includes additional data about the generation:

  • total_duration: time spent generating the response
  • load_duration: time spent in nanoseconds loading the model
  • prompt_eval_count: number of tokens in the prompt
  • prompt_eval_duration: time spent in nanoseconds evaluating the prompt
  • eval_count: number of tokens in the response
  • eval_duration: time in nanoseconds spent generating the response
  • context: an encoding of the conversation used in this response; this can be sent in the next request to keep a conversational memory
  • response: empty if the response was streamed; if not streamed, this will contain the full response

To calculate how fast the response is generated in tokens per second (token/s), compute eval_count / eval_duration * 10^9.
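
For example, with the figures from the response above:

eval_count = 126             # tokens in the response
eval_duration = 6494000000   # nanoseconds spent generating

tokens_per_second = eval_count / eval_duration * 10**9
print(round(tokens_per_second, 1))  # ≈ 19.4 token/s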

Chat

Supports multi-turn conversations; the conversation context is maintained by sending the full message history with each request.

HTTP Method: POST

URL: /api/chat

Parameters:

  • model: (required) the model name
  • messages: the messages of the chat; this can be used to keep a chat memory
  • tools: list of tools in JSON for the model to use, if supported (a tool-calling sketch appears at the end of this section)

The message object has the following fields:

  • role: the role of the message, either system, user, assistant, or tool
  • content: the content of the message
  • images (optional): a list of images to include in the message (for multimodal models such as llava)
  • tool_calls (optional): a list of tools in JSON that the model wants to use

Advanced parameters (optional):

  • format: the format to return a response in. Format can be json or a JSON schema.
  • options: additional model parameters listed in the documentation for the Modelfile such as temperature
  • stream: if false the response will be returned as a single response object, rather than a stream of objects
  • keep_alive: controls how long the model will stay loaded into memory following the request (default: 5m)

Request Format:

{
  "model": "<model-name>",        // Model name
  "messages": [                   // List of messages
    {
      "role": "user",             // User role
      "content": "<input-text>"   // User input
    }
  ],
  "stream": false,                // Disable streaming
  "options": {                    // Optional model parameters
    "temperature": 0.7,
    "num_predict": 100
  }
}

Response Format:

{
  "message": {
    "role": "assistant",           // Assistant role
    "content": "<generated-text>"  // Generated text
  },
  "done": true
}
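
When a model supports tool calling, the tools parameter described above exposes functions the model may ask to call. The sketch below is illustrative: get_current_weather is a hypothetical function (Ollama never executes it; your code would), and the model used must actually support tools.

import requests

# Hypothetical tool definition; it is only described to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"]
        }
    }
}]

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",  # assumption: substitute any tool-capable model you have pulled
        "messages": [{"role": "user", "content": "What is the weather in Toronto?"}],
        "tools": tools,
        "stream": False
    }
)
# If the model decides to call a tool, the reply carries tool_calls instead of content.
print(response.json()["message"].get("tool_calls"))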

List Local Models

Lists all locally downloaded models.

HTTP Method: GET

URL: /api/tags

Response Format:

{
  "models": [
    {
      "name": "<model-name>",      // Model name
      "size": "<model-size>",      // Model size
      "modified_at": "<timestamp>" // Last modified timestamp
    }
  ]
}
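
A minimal Python sketch that lists the locally available models:

import requests

response = requests.get("http://localhost:11434/api/tags")
for model in response.json().get("models", []):
    print(model["name"], model["size"], model["modified_at"])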

Pull a Model

Downloads a model from the model repository.

HTTP Method: POST

URL: /api/pull

Parameters

  • model: name of the model to pull
  • insecure: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
  • stream: (optional) if false the response will be returned as a single response object, rather than a stream of objects

Request Format:

{
  "model": "<model-name>" // Model name
}

Response Format:

{
  "status": "downloading",   // Download status
  "digest": "<model-digest>" // Model digest
}
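
Because the pull endpoint streams status objects by default, a client can report download progress as it arrives. A minimal Python sketch, assuming the default streaming behavior:

import json
import requests

with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "deepseek-coder"},
    stream=True
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        status = json.loads(line)
        # While downloading, status objects carry "total" and "completed" byte counts.
        print(status.get("status"), status.get("completed"), status.get("total"))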

Usage Examples

Generate Text

Using curl to send a request:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder",
  "prompt": "Can you help me write some code?",
  "stream": false
}'

Multi-turn Chat

Using curl to send a request:

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-coder",
  "messages": [
    {
      "role": "user",
      "content": "Can you help me write a Python script?"
    }
  ],
  "stream": false
}'

List Local Models

Using curl to send a request:

curl http://localhost:11434/api/tags

Pull a Model

Using curl to send a request:

curl http://localhost:11434/api/pull -d '{
  "model": "deepseek-coder"
}'

Streaming Responses

Ollama supports streaming responses, which is useful for real-time text generation.

Enabling Streaming

Set "stream": true in the request (or simply omit the parameter, since streaming is the default) to receive the response line by line.

Example:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder",
  "prompt": "Can you help me write some code?",
  "stream": true
}'

Response Format

Each line returns a JSON object:

{
  "response": "<partial-text>", // Partially generated text
  "done": false                 // Whether the task is complete
}
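
In a program, each streamed line can be parsed as it arrives. A minimal Python sketch:

import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-coder",
        "prompt": "Can you help me write some code?",
        "stream": True
    },
    stream=True
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()  # the final chunk carries timing statistics instead of text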

Programming Language Examples

Python (using requests library)

Generate Text:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-coder",
        "prompt": "Can you help me write some code?",
        "stream": False
    }
)
print(response.json())

Multi-turn Chat:

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-coder",
        "messages": [
            {
                "role": "user",
                "content": "Can you help me write a Python script?"
            }
        ],
        "stream": False
    }
)
print(response.json())
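
The example above sends a single user message. To actually carry context across turns, append the assistant's reply to the message list and send the full history with the next request; a minimal sketch (the follow-up prompt is illustrative):

import requests

history = [{"role": "user", "content": "Can you help me write a Python script?"}]

first = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "deepseek-coder", "messages": history, "stream": False}
).json()

# Keep the assistant's answer in the history so the next turn has context.
history.append(first["message"])
history.append({"role": "user", "content": "Now add error handling to it."})

second = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "deepseek-coder", "messages": history, "stream": False}
).json()
print(second["message"]["content"])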

JavaScript (using fetch API)

Generate Text:

fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-coder",
    prompt: "Can you help me write some code?",
    stream: false
  })
})
  .then(response => response.json())
  .then(data => console.log(data));

Multi-turn Chat:

fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-coder",
    messages: [
      {
        role: "user",
        content: "Can you help me write a Python script?"
      }
    ],
    stream: false
  })
})
  .then(response => response.json())
  .then(data => console.log(data));