Ollama provides an HTTP-based API that allows developers to programmatically interact with its models. This guide will walk you through the detailed usage of the Ollama API, including request formats, response formats, and example code.

Starting the Ollama Service

Before using the API, ensure the Ollama service is running. You can start it with the following command:

ollama serve

By default, the service runs at http://localhost:11434, and all API endpoints below are relative to this base URL.

Conventions

Model Names

Model names follow a model:tag format. The model part can include an optional namespace, like example/model. For instance, deepseek-r1:14b and llama3.2:1b are valid examples. The tag is optional and defaults to latest if not specified. It’s used to pinpoint a specific version of the model.

Durations

All durations are measured and returned in nanoseconds.

Streaming Responses

Some endpoints stream responses as JSON objects. You can disable streaming by passing {"stream": false} in the request for these endpoints.

API Endpoints

Ollama offers several key API endpoints:

Generate Text

Sends a prompt to the model and retrieves the generated text.

HTTP Method: POST

URL: /api/generate

Parameters

  • model: (required) the model name
  • prompt: the prompt to generate a response for
  • suffix: the text after the model response
  • images: (optional) a list of base64-encoded images (for multimodal models such as llava)

Advanced Parameters (Optional):

  • format: the format to return a response in. Format can be json or a JSON schema
  • options: additional model parameters listed in the documentation for the Modelfile such as temperature
  • system: system message (overrides what is defined in the Modelfile)
  • template: the prompt template to use (overrides what is defined in the Modelfile)
  • stream: if false the response will be returned as a single response object, rather than a stream of objects
  • raw: if true no formatting will be applied to the prompt. You may choose to use the raw parameter if you are specifying a full templated prompt in your request to the API
  • keep_alive: controls how long the model will stay loaded into memory following the request (default: 5m)
  • context (deprecated): the context parameter returned from a previous request to /generate; this can be used to keep a short conversational memory
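
A minimal Python sketch that exercises a few of these optional parameters follows; the system prompt text and the option values are illustrative, not required settings. The generic request and response formats are shown next.

import requests

# Illustrative request: the system, keep_alive, and options values are just examples.
payload = {
    "model": "llama3.2:1b",
    "prompt": "Explain what a Modelfile is in one sentence.",
    "system": "You are a concise technical assistant.",  # overrides the Modelfile system message
    "keep_alive": "10m",                                  # keep the model loaded for 10 minutes
    "stream": False,
    "options": {"temperature": 0.2, "num_predict": 64}    # Modelfile-style options
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
print(response.json()["response"])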

Request Format:

{
  "model": "<model-name>",    // Model name
  "prompt": "<input-text>",   // Input prompt
  "stream": false,            // Disable streaming (the default is true)
  "options": {                // Optional model parameters
    "temperature": 0.7,       // Temperature setting
    "num_predict": 100        // Maximum number of tokens to generate
  }
}

Response Format:

{
  "response": "<generated-text>", // Generated text
  "done": true                    // Whether the task is complete
}

Important
When using the JSON format, it’s important to instruct the model to use JSON in the prompt. Otherwise, the model may generate large amounts of whitespace.
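
Beyond the literal string "json", the format field also accepts a JSON schema, in which case the output is constrained to match that schema. Below is a minimal Python sketch of this usage; the schema and its field names are illustrative only.

import requests

# Illustrative schema: ask the model to return a small structured object.
schema = {
    "type": "object",
    "properties": {
        "language": {"type": "string"},
        "lines_of_code": {"type": "integer"}
    },
    "required": ["language", "lines_of_code"]
}

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2:1b",
        "prompt": "Describe a bubble sort implementation. Respond in JSON.",
        "format": schema,   // a JSON schema instead of the string "json"
        "stream": False
    }
)
print(response.json()["response"])  # a JSON string that should match the schema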

Request Example:

curl http://localhost:11434/api/generate --data '{
  "model": "llama3.2:1b",
  "prompt": "Write a bubble sort algorithm in Python",
  "format": "json",
  "stream": false,
  "options": {
    "temperature": 0.7,
    "num_predict": 100
  }
}'

Response (Success):

Because stream is set to false, a single JSON object is returned:

{
"model": "llama3.2:1b",
"created_at": "2025-02-08T10:58:03.1357634Z",
"response": "{\n\n\n \"bubble_sort\": [\n {\"algorithm\": \"Bubble Sort\", \"description\": \"Sorts the list by repeatedly swapping the adjacent elements if they are in wrong order\"},\n {\"steps\": {\n \"n\": 10, \"list\": [4, 2, 9, 6, 5, 1, 8, 3, 7, 0],\n \"expected\": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n }}\n ]\n}\n\n \n\n\n ",
"done": true,
"done_reason": "stop",
"context": [128006,9125,128007,271,38766,1303,33025,2696,25,6790,220,2366,18,271,128009,128006,882,128007,271,8144,264,24529,3460,12384,304,13325,128009,128006,78191,128007,271,54732,262,330,78589,18942,794,2330,286,5324,20266,794,330,76878,16347,498,330,4789,794,330,10442,82,279,1160,555,19352,64819,279,24894,5540,422,814,527,304,5076,2015,7260,286,5324,25047,794,341,310,330,77,794,220,605,11,220,330,1638,794,510,19,11,220,17,11,220,24,11,220,21,11,220,20,11,220,16,11,220,23,11,220,18,11,220,22,11,220,15,1282,310,330,7475,794,510,15,11,220,16,11,220,17,11,220,18,11,220,19,11,220,20,11,220,21,11,220,22,11,220,23,11,220,24,933,286,8256,262,5243,633,62539,262],
"total_duration": 6768781400,
"load_duration": 22301300,
"prompt_eval_count": 32,
"prompt_eval_duration": 249000000,
"eval_count": 126,
"eval_duration": 6494000000
}

The final response in the stream (or the single response object when streaming is disabled, as in this example) also includes additional data about the generation:

  • total_duration: time spent generating the response
  • load_duration: time spent in nanoseconds loading the model
  • prompt_eval_count: number of tokens in the prompt
  • prompt_eval_duration: time spent in nanoseconds evaluating the prompt
  • eval_count: number of tokens in the response
  • eval_duration: time in nanoseconds spent generating the response
  • context: an encoding of the conversation used in this response; this can be sent in the next request to keep a conversational memory
  • response: empty if the response was streamed; if not streamed, this will contain the full response

To calculate how fast the response is generated in tokens per second (token/s), compute eval_count / eval_duration * 10^9.
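
For example, with the figures from the response above:

eval_count = 126             # tokens in the response
eval_duration = 6494000000   # nanoseconds spent generating

tokens_per_second = eval_count / eval_duration * 10**9
print(round(tokens_per_second, 1))  # ≈ 19.4 token/s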

Chat

Supports multi-turn conversations; the conversation context is maintained by sending the full message history with each request.

HTTP Method: POST

URL: /api/chat

Parameters:

  • model: (required) the model name
  • messages: the messages of the chat; this can be used to keep a chat memory
  • tools: list of tools in JSON for the model to use, if supported (a tool-calling sketch appears at the end of this section)

The message object has the following fields:

  • role: the role of the message, either system, user, assistant, or tool
  • content: the content of the message
  • images (optional): a list of images to include in the message (for multimodal models such as llava)
  • tool_calls (optional): a list of tools in JSON that the model wants to use

Advanced parameters (optional):

  • format: the format to return a response in. Format can be json or a JSON schema.
  • options: additional model parameters listed in the documentation for the Modelfile such as temperature
  • stream: if false the response will be returned as a single response object, rather than a stream of objects
  • keep_alive: controls how long the model will stay loaded into memory following the request (default: 5m)

Request Format:

{
  "model": "<model-name>",        // Model name
  "messages": [                   // List of messages
    {
      "role": "user",             // User role
      "content": "<input-text>"   // User input
    }
  ],
  "stream": false,                // Disable streaming
  "options": {                    // Optional model parameters
    "temperature": 0.7,
    "num_predict": 100
  }
}

Response Format:

{
  "message": {
    "role": "assistant",           // Assistant role
    "content": "<generated-text>"  // Generated text
  },
  "done": true
}
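
When a model supports tool calling, the tools parameter described above exposes functions the model may ask to call. The sketch below is illustrative: get_current_weather is a hypothetical function (Ollama never executes it; your code would), and the model used must actually support tools.

import requests

# Hypothetical tool definition; it is only described to the model.
tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"]
        }
    }
}]

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",  # assumption: substitute any tool-capable model you have pulled
        "messages": [{"role": "user", "content": "What is the weather in Toronto?"}],
        "tools": tools,
        "stream": False
    }
)
# If the model decides to call a tool, the reply carries tool_calls instead of content.
print(response.json()["message"].get("tool_calls"))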

List Local Models

Lists all locally downloaded models.

HTTP Method: GET

URL: /api/tags

Response Format:

{
  "models": [
    {
      "name": "<model-name>",      // Model name
      "size": "<model-size>",      // Model size
      "modified_at": "<timestamp>" // Last modified timestamp
    }
  ]
}
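
A minimal Python sketch that lists the locally available models:

import requests

response = requests.get("http://localhost:11434/api/tags")
for model in response.json().get("models", []):
    print(model["name"], model["size"], model["modified_at"])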

Pull a Model

Downloads a model from the model repository.

HTTP Method: POST

URL: /api/pull

Parameters

  • model: name of the model to pull
  • insecure: (optional) allow insecure connections to the library. Only use this if you are pulling from your own library during development.
  • stream: (optional) if false the response will be returned as a single response object, rather than a stream of objects

Request Format:

{
  "model": "<model-name>" // Model name
}

Response Format:

{
  "status": "downloading",   // Download status
  "digest": "<model-digest>" // Model digest
}
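
Because the pull endpoint streams status objects by default, a client can report download progress as it arrives. A minimal Python sketch, assuming the default streaming behavior:

import json
import requests

with requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "deepseek-coder"},
    stream=True
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        status = json.loads(line)
        # While downloading, status objects carry "total" and "completed" byte counts.
        print(status.get("status"), status.get("completed"), status.get("total"))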

Usage Examples

Generate Text

Using curl to send a request:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder",
  "prompt": "Can you help me write some code?",
  "stream": false
}'

Multi-turn Chat

Using curl to send a request:

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-coder",
  "messages": [
    {
      "role": "user",
      "content": "Can you help me write a Python script?"
    }
  ],
  "stream": false
}'

List Local Models

Using curl to send a request:

curl http://localhost:11434/api/tags

Pull a Model

Using curl to send a request:

curl http://localhost:11434/api/pull -d '{
  "model": "deepseek-coder"
}'

Streaming Responses

Ollama supports streaming responses, which is useful for real-time text generation.

Enabling Streaming

Set "stream": true in the request (or simply omit the parameter, since streaming is the default) to receive the response line by line.

Example:

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder",
  "prompt": "Can you help me write some code?",
  "stream": true
}'

Response Format

Each line returns a JSON object:

{
  "response": "<partial-text>", // Partially generated text
  "done": false                 // Whether the task is complete
}
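
In a program, each streamed line can be parsed as it arrives. A minimal Python sketch:

import json
import requests

with requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-coder",
        "prompt": "Can you help me write some code?",
        "stream": True
    },
    stream=True
) as response:
    for line in response.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            print()  # the final chunk carries timing statistics instead of text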

Programming Language Examples

Python (using requests library)

Generate Text:

import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "deepseek-coder",
        "prompt": "Can you help me write some code?",
        "stream": False
    }
)
print(response.json())

Multi-turn Chat:

import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "deepseek-coder",
        "messages": [
            {
                "role": "user",
                "content": "Can you help me write a Python script?"
            }
        ],
        "stream": False
    }
)
print(response.json())
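
The example above sends a single user message. To actually carry context across turns, append the assistant's reply to the message list and send the full history with the next request; a minimal sketch (the follow-up prompt is illustrative):

import requests

history = [{"role": "user", "content": "Can you help me write a Python script?"}]

first = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "deepseek-coder", "messages": history, "stream": False}
).json()

# Keep the assistant's answer in the history so the next turn has context.
history.append(first["message"])
history.append({"role": "user", "content": "Now add error handling to it."})

second = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "deepseek-coder", "messages": history, "stream": False}
).json()
print(second["message"]["content"])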

JavaScript (using fetch API)

Generate Text:

fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-coder",
    prompt: "Can you help me write some code?",
    stream: false
  })
})
  .then(response => response.json())
  .then(data => console.log(data));

Multi-turn Chat:

fetch("http://localhost:11434/api/chat", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "deepseek-coder",
    messages: [
      {
        role: "user",
        content: "Can you help me write a Python script?"
      }
    ],
    stream: false
  })
})
  .then(response => response.json())
  .then(data => console.log(data));