Using Ollama with Python
Ollama provides a Python SDK that allows you to interact with locally running models directly from your Python environment. This SDK makes it easy to integrate natural language processing tasks into your Python projects, enabling operations like text generation, conversational AI, and model management—all without the need for manual command-line interactions.
Installing the Python SDK
To get started, you’ll need to install the Ollama Python SDK. You can do this using pip:
```bash
pip install ollama
```
Make sure you have Python 3.x installed and that your environment can access the Ollama local service.
Starting the Local Service
Before using the Python SDK, ensure that the Ollama local service is up and running. You can start it using the command line:
```bash
ollama serve
```
Once the local service is running, the Python SDK will communicate with it to perform tasks like model inference.
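If you want to confirm that the service is reachable before running any SDK code, a quick sketch (assuming the default address http://localhost:11434) is to request the server's root endpoint, which replies with a short status message:

```python
import urllib.request

# Assumes the default local service address; a running server typically answers "Ollama is running"
with urllib.request.urlopen('http://localhost:11434') as resp:
    print(resp.read().decode())
```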
Using the Ollama Python SDK for Inference
After installing the SDK and starting the local service, you can interact with Ollama using Python code. Here’s how:
- Import the necessary modules:
```python
from ollama import chat, ChatResponse
```
- Send requests to a specified model to generate text or dialogue:
```python
response: ChatResponse = chat(model='deepseek-r1:1.5b', messages=[
    {
        'role': 'user',
        'content': 'Who are you?',
    },
])

# Print the response content
print(response['message']['content'])

# Alternatively, access the response object directly
# print(response.message.content)
```
When you run this code, the output might look like this:
```
Greetings! I'm DeepSeek-R1, an artificial intelligence assistant created by DeepSeek. I'm at your service and would be delighted to assist you with any inquiries or tasks you may have.
```
Streaming Responses
The Ollama SDK also supports streaming responses, which you enable by setting stream=True when sending a request.
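A minimal sketch, reusing the deepseek-r1:1.5b model from the earlier example, prints each chunk as soon as it arrives:

```python
from ollama import chat

stream = chat(
    model='deepseek-r1:1.5b',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
    stream=True,
)

# With stream=True, chat returns an iterator of partial messages
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```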
Custom Clients
For more control over request configurations, such as custom headers or specifying the local service URL, you can create a custom client with the Client class.
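A sketch, assuming the default local service address http://localhost:11434; the extra header is purely illustrative:

```python
from ollama import Client

# Point the client at the local service and attach any custom headers you need
client = Client(
    host='http://localhost:11434',
    headers={'x-some-header': 'some-value'},  # illustrative header
)

response = client.chat(model='deepseek-r1:1.5b', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'},
])
print(response['message']['content'])
```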
Asynchronous Clients
If you need to handle requests asynchronously, you can use the AsyncClient class, which is ideal for high-concurrency scenarios.
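A sketch built on asyncio; the model and prompt mirror the earlier examples:

```python
import asyncio
from ollama import AsyncClient

async def main():
    message = {'role': 'user', 'content': 'Why is the sky blue?'}
    # Await the chat call instead of blocking the event loop
    response = await AsyncClient().chat(model='deepseek-r1:1.5b', messages=[message])
    print(response['message']['content'])

asyncio.run(main())
```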
Asynchronous Streaming
For asynchronous streaming responses, pass stream=True to the asynchronous client; the awaited call then yields its output as an asynchronous generator.
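A sketch along the same lines as the synchronous streaming example above, again with a placeholder prompt:

```python
import asyncio
from ollama import AsyncClient

async def main():
    message = {'role': 'user', 'content': 'Why is the sky blue?'}
    # stream=True turns the awaited call into an async generator of chunks
    async for part in await AsyncClient().chat(
        model='deepseek-r1:1.5b',
        messages=[message],
        stream=True,
    ):
        print(part['message']['content'], end='', flush=True)

asyncio.run(main())
```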
Here, the response is returned in parts asynchronously, allowing you to process each part immediately.
Common API Methods
The Ollama Python SDK provides several useful API methods for managing and interacting with models:
- Chat: Generate conversational responses.

```python
ollama.chat(model='deepseek-r1:1.5b', messages=[{'role': 'user', 'content': 'Why is the sky blue?'}])
```

- Generate: Generate text based on a prompt.

```python
ollama.generate(model='deepseek-r1:1.5b', prompt='Why is the sky blue?')
```

- List: List all locally available models.

```python
ollama.list()
```

- Show: Display details about a specific model.

```python
ollama.show('deepseek-r1:1.5b')
```

- Create: Create a new model from an existing one.

```python
ollama.create(model='example', from_='deepseek-r1:1.5b', system="You are Mario from Super Mario Bros.")
```

- Copy: Copy a model under a new name.

```python
ollama.copy('deepseek-r1:1.5b', 'user/deepseek-r1:1.5b')
```

- Delete: Delete a model.

```python
ollama.delete('deepseek-r1:1.5b')
```

- Pull: Download a model from a remote repository.

```python
ollama.pull('deepseek-r1:1.5b')
```

- Push: Upload a model to a remote repository.

```python
ollama.push('user/deepseek-r1:1.5b')
```

- Embed: Generate text embeddings.

```python
ollama.embed(model='deepseek-r1:1.5b', input='The sky is blue because of Rayleigh scattering')
```

- Ps: List currently running models.

```python
ollama.ps()
```
Error Handling
The Ollama SDK raises errors when a request fails or a streaming problem occurs. You can handle these errors with try-except blocks.
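A sketch that deliberately requests a model that has not been pulled (the model name here is a placeholder) and pulls it when the server reports a 404:

```python
import ollama

model = 'does-not-yet-exist'  # placeholder model name

try:
    ollama.chat(model=model, messages=[{'role': 'user', 'content': 'Hello'}])
except ollama.ResponseError as e:
    print('Error:', e.error)
    if e.status_code == 404:
        # The model is not available locally, so download it
        ollama.pull(model)
```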
In this example, if the model doesn’t exist, a ResponseError is raised, and you can choose to pull the model or handle the error accordingly.