ollama

Get up and running with Llama 3.2, Mistral, Gemma 2, and other large language models.

MIT License

Downloads: 219
Stars: 92.5K
Committers: 310

ollama - v0.2.0

Published by github-actions[bot] 4 months ago

Concurrency

Ollama 0.2.0 is now available with concurrency support. This unlocks two major features:

Parallel requests

Ollama can now serve multiple requests at the same time, using only a little bit of additional memory for each request. This enables use cases such as:

  • Handling multiple chat sessions at the same time
  • Hosting a code completion LLM for your internal team
  • Processing different parts of a document simultaneously
  • Running several agents at the same time.
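
With concurrency enabled, requests to the same loaded model can be issued side by side. A minimal sketch using curl against the default local API (llama3 is just an example model, assumed to be pulled already):

curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Why is the sky blue?"}' &
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Write a haiku about the sea."}' &
wait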

https://github.com/ollama/ollama/assets/251292/9772a5f1-c072-41db-be6c-dd3c621aa2fd

Multiple models

Ollama now supports loading different models at the same time, dramatically improving:

  • Retrieval Augmented Generation (RAG): both the embedding and text completion models can be loaded into memory simultaneously.
  • Agents: multiple agents can now run simultaneously
  • Running large and small models side-by-side

Models are automatically loaded and unloaded based on requests and how much GPU memory is available.
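
As a sketch of the RAG case above, an embedding model and a chat model can stay resident and be queried back to back (model names are examples):

curl http://localhost:11434/api/embeddings -d '{"model": "all-minilm", "prompt": "Ollama supports concurrency."}'
curl http://localhost:11434/api/generate -d '{"model": "llama3", "prompt": "Summarize: Ollama supports concurrency."}'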

To see which models are loaded, run ollama ps:

% ollama ps
NAME                    ID              SIZE    PROCESSOR       UNTIL
gemma:2b                030ee63283b5    2.8 GB  100% GPU        4 minutes from now
all-minilm:latest       1b226e2802db    530 MB  100% GPU        4 minutes from now
llama3:latest           365c0bd3c000    6.7 GB  100% GPU        4 minutes from now

For more information on concurrency, see the FAQ.
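
The degree of parallelism and the number of resident models can be tuned with the OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS environment variables, introduced as experimental flags in v0.1.33 (see below). For example:

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=2 ollama serve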

New models

  • GLM-4: A strong multilingual general language model with performance competitive with Llama 3.
  • CodeGeeX4: A versatile model for AI software development scenarios, including code completion.
  • Gemma 2: Improved output quality, and base text generation models are now available.

What's Changed

  • Improved Gemma 2
    • Fixed issue where the model would generate invalid tokens after hitting the context window
    • Fixed inference output issues with gemma2:27b
    • Re-downloading the model may be required: ollama pull gemma2 or ollama pull gemma2:27b
  • Ollama will now show a better error if a model architecture isn't supported
  • Improved handling of quotes and spaces in Modelfile FROM lines
  • Ollama will now return an error if the system does not have enough memory to run a model on Linux

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.48...v0.2.0

ollama - v0.1.48

Published by github-actions[bot] 4 months ago

Gemma 2

What's Changed

  • Fixed issue where Gemma 2 would continuously output when reaching context limits
  • Fixed out of memory and core dump errors when running Gemma 2
  • /show info will now show additional model information in ollama run
  • Fixed issue where ollama show would result in an error on certain vision models

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.47...v0.1.48

ollama - v0.1.47

Published by github-actions[bot] 4 months ago

Ollama Gemma 2 illustration

What's Changed

  • Added support for Google Gemma 2 models (9B and 27B)
  • Fixed issues with ollama create when importing from Safetensors

A special thank you to the Google Cloud and DeepMind team members for Gemma 2 support.

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.46...v0.1.47

ollama - v0.1.46

Published by github-actions[bot] 4 months ago

ollama run

What's Changed

  • Increased model loading speed with ollama run, especially if running an already-loaded model
  • Improved performance of /api/show including for large models
  • Fixed issue where the --quantize flag in ollama create would lead to an error
  • Improved model loading times when models would not completely fit in system memory on Linux
  • Fixed issue where certain Modelfile parameters would not be parsed correctly

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.45...v0.1.46

ollama - v0.1.45

Published by github-actions[bot] 4 months ago

New models

  • DeepSeek-Coder-V2: A 16B & 236B open-source Mixture-of-Experts code language model that achieves performance comparable to GPT-4 Turbo in code-specific tasks.

ollama show

ollama show will now show model details such as context length, parameters, embedding size, license and more:

% ollama show llama3
  Model
    arch                llama
    parameters          8.0B
    quantization        Q4_0
    context length      8192
    embedding length    4096

  Parameters
    num_keep    24
    stop        "<|start_header_id|>"
    stop        "<|end_header_id|>"
    stop        "<|eot_id|>"

  License
    META LLAMA 3 COMMUNITY LICENSE AGREEMENT
    Meta Llama 3 Version Release Date: April 18, 2024

What's Changed

  • ollama show <model> will now show model information such as context window size
  • Model loading on Windows with CUDA GPUs is now faster
  • Setting seed in the /v1/chat/completions OpenAI compatibility endpoint no longer changes temperature (see the example after this list)
  • Enhanced GPU discovery and multi-gpu support with concurrency
  • The Linux install script will now skip searching for network devices
  • Introduced a workaround for AMD Vega RX 56 SDMA support on Linux
  • Fixed memory prediction for deepseek-v2 and deepseek-coder-v2 models
  • The /api/show endpoint now returns extensive model metadata
  • GPU configuration variables are now reported in ollama serve
  • Updated Linux ROCm to v6.1.1
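
A minimal sketch of a request through the OpenAI-compatible endpoint where seed is pinned without overriding the requested temperature (the model name is an example):

curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3",
  "messages": [{"role": "user", "content": "Say hello"}],
  "seed": 101,
  "temperature": 0.7
}'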

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.44...v0.1.45

ollama - v0.1.44

Published by github-actions[bot] 4 months ago

What's Changed

  • Fixed issue where unicode characters such as emojis would not be loaded correctly when running ollama create
  • Fixed certain cases where Nvidia GPUs would not be detected and reported as compute capability 1.0 devices

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.43...v0.1.44

ollama - v0.1.43

Published by github-actions[bot] 4 months ago

Ollama honest work

What's Changed

  • New import.md guide for converting and importing models to Ollama
  • Fixed issue where embedding vectors resulting from /api/embeddings would not be accurate
  • JSON mode responses will no longer include invalid escape characters
  • Removing a model will no longer show incorrect "File not found" errors
  • Fixed issue where running ollama create would result in an error on Windows with certain file formatting

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.42...v0.1.43

ollama - v0.1.42

Published by github-actions[bot] 4 months ago

New models

  • Qwen 2: a new series of large language models from the Alibaba Group

What's Changed

  • Fixed issue where qwen2 would output erroneous text such as GGG on Nvidia and AMD GPUs
  • ollama pull is now faster if it detects a model is already downloaded
  • ollama create will now automatically detect prompt templates for popular model architectures such as Llama, Gemma, Phi, and more (see the sketch after this list)
  • Ollama can now be accessed from local apps built with Electron and Tauri, as well as when developing apps using local HTML files
  • Updated the welcome prompt on Windows to use llama3
  • Fixed issues where /api/ps and /api/tags would show invalid timestamps in responses
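
A sketch of the template auto-detection when importing a model (the GGUF filename and model name are hypothetical):

# No TEMPLATE directive is needed in the Modelfile for recognized architectures
echo 'FROM ./my-llama3-finetune.gguf' > Modelfile
ollama create my-finetune -f Modelfile
ollama run my-finetune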

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.41...v0.1.42

ollama - v0.1.41

Published by github-actions[bot] 5 months ago

What's Changed

  • Fixed an error on Windows 10 and 11 systems with Intel CPUs that have integrated GPUs

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.40...v0.1.41

ollama - v0.1.40

Published by github-actions[bot] 5 months ago

ollama continuing to capture bugs

New models

  • Codestral: Mistral AI’s first-ever code model, designed for code generation tasks.
  • IBM Granite Code: now in 3B and 8B parameter sizes.
  • Deepseek V2: a strong, economical, and efficient Mixture-of-Experts language model

What's Changed

  • Fixed out of memory and incorrect token issues when running Codestral on 16GB Macs
  • Fixed issue where full-width characters (e.g. Japanese, Chinese, Russian) were deleted at the end of the line when using ollama run

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.39...v0.1.40

ollama - v0.1.39

Published by github-actions[bot] 5 months ago

New models

  • Cohere Aya 23: A new state-of-the-art, multilingual LLM covering 23 different languages.
  • Mistral 7B 0.3: A new version of Mistral 7B with initial support for function calling.
  • Phi-3 Medium: a 14B-parameter, lightweight, state-of-the-art open model by Microsoft.
  • Phi-3 Mini 128K and Phi-3 Medium 128K: versions of the Phi-3 models that support a context window size of 128K tokens
  • Granite Code: a family of open foundation models by IBM for code intelligence

Llama 3 import

It is now possible to import and quantize Llama 3 and its finetunes from Safetensors format to Ollama.

First, clone a Hugging Face repo with a Safetensors model:

git clone https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct
cd Meta-Llama-3-8B-Instruct

Next, create a Modelfile:

FROM .

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

PARAMETER stop <|start_header_id|>
PARAMETER stop <|end_header_id|>
PARAMETER stop <|eot_id|>

Then, create and quantize a model:

ollama create --quantize q4_0 -f Modelfile my-llama3 
ollama run my-llama3

What's Changed

  • Fixed display issues with wide characters in languages such as Chinese, Korean, Japanese, and Russian
  • Added a new OLLAMA_NOHISTORY=1 environment variable that can be set to disable history when using ollama run
  • New experimental OLLAMA_FLASH_ATTENTION=1 flag for ollama serve that improves token generation speed on Apple Silicon Macs and NVIDIA graphics cards (see the examples after this list)
  • Fixed error that would occur on Windows running ollama create -f Modelfile
  • ollama create can now create models from I-Quant GGUF files
  • Fixed EOF errors when resuming downloads via ollama pull
  • Added a Ctrl+W shortcut to ollama run
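
Both flags are plain environment variables set on the relevant command; a minimal sketch (llama3 is just an example model):

# Disable prompt history for an interactive session
OLLAMA_NOHISTORY=1 ollama run llama3

# Start the server with the experimental flash attention path enabled
OLLAMA_FLASH_ATTENTION=1 ollama serve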

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.38...v0.1.39

ollama - v0.1.38

Published by github-actions[bot] 5 months ago

New Models

  • Falcon 2: A new 11B parameters causal decoder-only model built by TII and trained on over 5T tokens.
  • Yi 1.5: A new high-performing version of Yi, now licensed as Apache 2.0. Available in 6B, 9B and 34B sizes.

What's Changed

ollama ps

A new command is now available: ollama ps. This command displays currently loaded models, their memory footprint, and the processors used (GPU or CPU):

% ollama ps
NAME                 ID              SIZE    PROCESSOR        UNTIL
mixtral:latest       7708c059a8bb    28 GB   47%/53% CPU/GPU  Forever
llama3:latest        a6990ed6be41    5.5 GB  100% GPU         4 minutes from now
all-minilm:latest    1b226e2802db    585 MB  100% GPU         4 minutes from now

/clear

To clear the chat history for a session when running ollama run, use /clear:

>>> /clear
Cleared session context

  • Fixed issue where switching loaded models on Windows would take several seconds
  • Running /save will no longer abort the chat session if an incorrect name is provided
  • The /api/tags API endpoint will now correctly return an empty list [] instead of null if no models are present (a quick check follows below)
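
A quick check of that endpoint on a host with no models installed (a sketch; the response shown is abbreviated):

curl http://localhost:11434/api/tags
# expected: {"models":[]}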

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.37...v0.1.38

ollama - v0.1.37

Published by github-actions[bot] 5 months ago

What's Changed

  • Fixed issue where models with uppercase characters in the name would not show with ollama list
  • Fixed usage string for ollama create
  • Fixed finish_reason being "" instead of null in the OpenAI-compatible chat API

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.36...v0.1.37

ollama - v0.1.36

Published by github-actions[bot] 5 months ago

What's Changed

  • Fixed exit status 0xc0000005 error with AMD graphics cards on Windows
  • Fixed rare out of memory errors when loading a model to run with CPU

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.35...v0.1.36

ollama - v0.1.35

Published by github-actions[bot] 5 months ago

New models

  • Llama 3 ChatQA: A model from NVIDIA based on Llama 3 that excels at conversational question answering (QA) and retrieval-augmented generation (RAG).

What's Changed

  • Quantization: ollama create can now quantize models when importing them using the --quantize or -q flag (see the sketch after this note):

ollama create -f Modelfile --quantize q4_0 mymodel

[!NOTE]
--quantize works when importing float16 or float32 models:

  • From binary GGUF files (e.g. FROM ./model.gguf)
  • From a library model (e.g. FROM llama3:8b-instruct-fp16)
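
For example, a float16 library model can be re-quantized through a one-line Modelfile (the model and target names are examples):

echo 'FROM llama3:8b-instruct-fp16' > Modelfile
ollama create -f Modelfile --quantize q4_0 llama3-q4
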
  • Fixed issue where inference subprocesses wouldn't be cleaned up on shutdown.
  • Fixed a series of out-of-memory errors when loading models on multi-GPU systems
  • Ctrl+J characters will now properly add newlines in ollama run
  • Fixed issues when running ollama show for vision models
  • OPTIONS requests to the Ollama API will no longer result in errors
  • Fixed issue where partially downloaded files wouldn't be cleaned up
  • Added a new done_reason field in responses describing why generation stopped
  • Ollama will now more accurately estimate how much memory is available on multi-GPU systems, especially when running different models one after another

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.34...v0.1.35

ollama - v0.1.34

Published by github-actions[bot] 6 months ago

Ollama goes on an adventure to hunt down bugs

New models

  • Llava Llama 3: A new high-performing LLaVA model fine-tuned from Llama 3 Instruct.
  • Llava Phi 3: A new small LLaVA model fine-tuned from Phi 3.
  • StarCoder2 15B Instruct: A new instruct fine-tune of the StarCoder2 model
  • CodeGemma 1.1: A new release of the CodeGemma model.
  • StableLM2 12B: A new 12B version of the StableLM 2 model from Stability AI
  • Moondream 2: runtime parameters have been improved for better responses

What's Changed

  • Fixed issues with LLaVa models where they would respond incorrectly after the first request
  • Fixed out of memory errors when running large models such as Llama 3 70B
  • Fixed various issues with Nvidia GPU discovery on Linux and Windows
  • Fixed a series of Modelfile errors when running ollama create
  • Fixed "no slots available" error that occurred when cancelling a request and then sending follow-up requests
  • Improved AMD GPU detection on Fedora
  • Improved reliability when using the experimental OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS flags
  • ollama serve will now shut down quickly, even if a model is loading

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.33...v0.1.34

ollama - v0.1.33

Published by github-actions[bot] 6 months ago

New models

  • Llama 3: a new model by Meta, and the most capable openly available LLM to date
  • Phi 3 Mini: a new 3.8B-parameter, lightweight, state-of-the-art open model by Microsoft.
  • Dolphin Llama 3: The uncensored Dolphin model, trained by Eric Hartford and based on Llama 3 with a variety of instruction, conversational, and coding skills.
  • Qwen 110B: The first Qwen model over 100B parameters in size with outstanding performance in evaluations

What's Changed

  • Fixed issues where the model would not terminate, causing the API to hang.
  • Fixed a series of out of memory errors on Apple Silicon Macs
  • Fixed out of memory errors when running Mixtral architecture models

Experimental concurrency features

New concurrency features are coming soon to Ollama. They are available now as experimental features:

  • OLLAMA_NUM_PARALLEL: Handle multiple requests simultaneously for a single model
  • OLLAMA_MAX_LOADED_MODELS: Load multiple models simultaneously

To enable these features, set the environment variables when starting ollama serve. For more information, see this guide:

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.32...v0.1.33-rc5

ollama - v0.1.32

Published by github-actions[bot] 6 months ago

What's Changed

  • Support for larger models such as mixtral:8x22b and command-r-plus
  • Ollama will now better estimate memory utilization when loading models, leading to fewer out-of-memory errors, as well as better GPU utilization
  • Fixed several issues where Ollama would hang upon encountering an error
  • Fixed issue where using quotes in OLLAMA_ORIGINS would cause an error

To install this pre-release version on Linux:

curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.1.32-rc1 sh

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.31...v0.1.32-rc1

ollama - v0.1.31

Published by github-actions[bot] 7 months ago

New models

  • Qwen 1.5 32B: A new 32B multilingual model competitive with larger models such as Mixtral
  • StarlingLM Beta: A 7B model that ranks above other 7B models on popular benchmarks and includes a permissive Apache 2.0 license.

What's Changed

  • Fixed issue where Ollama would hang when using unicode characters in the prompt such as emojis

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.30...v0.1.31

ollama - v0.1.30

Published by github-actions[bot] 7 months ago

New models

  • Command R: a large language model optimized for conversational interaction and long-context tasks.
  • mxbai-embed-large: A new state-of-the-art large embedding model

What's Changed

  • Fixed various issues with ollama run on Windows
    • History now will work when pressing up and down arrow keys
    • Right and left arrow keys will now move the cursor appropriately
    • Pasting multi-line strings will now work on Windows
  • Fixed issue where mounting or sharing files between Linux and Windows (e.g. via WSL or Docker) would cause errors due to having : in the filename.
  • Improved support for AMD MI300 and MI300X Accelerators
  • Improved cleanup of temporary files resulting in better space utilization

Important change

For filesystem compatibility, Ollama has changed model data filenames to use - instead of :. This change will be applied automatically. If downgrading from 0.1.30 to 0.1.29 or lower (on Linux or macOS only), run:

find ~/.ollama/models/blobs -type f -exec bash -c 'mv "$0" "${0//-/:}"' {} \;

Full Changelog: https://github.com/ollama/ollama/compare/v0.1.29...v0.1.30
