mistral-inference

Official inference library for Mistral models

Apache-2.0 License

Downloads: 12.8K
Stars: 9.4K


mistral-inference - v1.3.0 Mistral-Nemo

Published by patrickvonplaten 3 months ago

Welcome Mistral-Nemo from Mistral 🤝 NVIDIA

Read more about Mistral-Nemo here.

Install

pip install "mistral-inference>=1.3.0"

Download

export NEMO_MODEL=$HOME/12B_NEMO_MODEL
wget https://models.mistralcdn.com/mistral-nemo-2407/mistral-nemo-instruct-2407.tar
mkdir -p $NEMO_MODEL
tar -xf mistral-nemo-instruct-2407.tar -C $NEMO_MODEL
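
If you prefer to stay in Python, the raw weights can also be fetched with huggingface_hub instead of wget. This is a minimal sketch, not taken from this release note: the repo id mistralai/Mistral-Nemo-Instruct-2407 and the file list are assumptions, so check the model card before relying on them.

import os
from pathlib import Path

from huggingface_hub import snapshot_download

# Assumed Hugging Face repo id and file names for the raw mistral-inference
# weights; adjust to whatever the model card actually lists.
nemo_dir = Path(os.environ.get("NEMO_MODEL", str(Path.home() / "12B_NEMO_MODEL")))
nemo_dir.mkdir(parents=True, exist_ok=True)

snapshot_download(
    repo_id="mistralai/Mistral-Nemo-Instruct-2407",
    allow_patterns=["params.json", "consolidated.safetensors", "tekken.json"],
    local_dir=nemo_dir,
)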

Chat

mistral-chat $NEMO_MODEL --instruct --max_tokens 1024

or directly in Python:

import os
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

tokenizer = MistralTokenizer.from_model("mistral-nemo")
model = Transformer.from_folder(os.environ.get("NEMO_MODEL"))

prompt = "How expensive would it be to ask a window cleaner to clean all windows in Paris. Make a reasonable guess in US Dollar."

completion_request = ChatCompletionRequest(messages=[UserMessage(content=prompt)])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=1024, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])

print(result)

Function calling:

import os

from mistral_common.protocol.instruct.tool_calls import Function, Tool
from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest


tokenizer = MistralTokenizer.from_model("mistral-nemo")
model = Transformer.from_folder(os.environ.get("NEMO_MODEL"))

completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
        ],
)

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=256, temperature=0.35, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.decode(out_tokens[0])

print(result)
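
The decoded result should contain the model's tool call as JSON (the exact wrapping, e.g. a leading [TOOL_CALLS] marker, depends on the tokenizer version). Below is a hypothetical sketch of dispatching such a call to a local function; the get_current_weather stub and the way result is parsed are assumptions for illustration, not part of mistral-inference.

import json

# Hypothetical local implementation of the tool declared above.
def get_current_weather(location: str, format: str) -> str:
    return f"It is 22 degrees {format} and sunny in {location}."  # stubbed value

TOOLS = {"get_current_weather": get_current_weather}

# Assumption: the decoded output ends with a JSON list such as
# [{"name": "get_current_weather", "arguments": {"location": "Paris, FR", "format": "celsius"}}]
start = result.find("[{")
if start != -1:
    for call in json.loads(result[start:]):
        tool_result = TOOLS[call["name"]](**call["arguments"])
        print(tool_result)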

Summary

The Mistral-Nemo-Instruct-2407 Large Language Model (LLM) is an instruct fine-tuned version of Mistral-Nemo-Base-2407. Trained jointly by Mistral AI and NVIDIA, it significantly outperforms existing models of smaller or similar size.

For more details about this model please refer to our release blog post.

Key features

  • Released under the Apache 2.0 License
  • Pre-trained and instruction-tuned versions
  • Trained with a 128k context window
  • Trained on a large proportion of multilingual and code data
  • Drop-in replacement for Mistral 7B

Model Architecture

Mistral Nemo is a transformer model, with the following architecture choices:

  • Layers: 40
  • Dim: 5,120
  • Head dim: 128
  • Hidden dim: 14,336
  • Activation Function: SwiGLU
  • Number of heads: 32
  • Number of kv-heads: 8 (GQA)
  • Vocabulary size: 2**17 ~= 128k
  • Rotary embeddings (theta = 1M)
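
As a rough sanity check on these numbers, the sketch below estimates the parameter count implied by the table (norm weights are ignored and an untied output head is assumed); it lands near the 12B suggested by the model name.

# Back-of-the-envelope parameter estimate from the architecture table above.
n_layers, dim, head_dim = 40, 5120, 128
hidden_dim, n_heads, n_kv_heads = 14336, 32, 8
vocab = 2**17  # ~128k

attn = dim * n_heads * head_dim            # Wq
attn += 2 * dim * n_kv_heads * head_dim    # Wk, Wv (GQA)
attn += n_heads * head_dim * dim           # Wo
mlp = 3 * dim * hidden_dim                 # SwiGLU: gate, up and down projections
embeddings = 2 * vocab * dim               # input embedding + untied output head

total = n_layers * (attn + mlp) + embeddings
print(f"~{total / 1e9:.1f}B parameters")   # roughly 12.2B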

Metrics

Main Benchmarks

Benchmark                  Score
HellaSwag (0-shot)         83.5%
Winogrande (0-shot)        76.8%
OpenBookQA (0-shot)        60.6%
CommonSenseQA (0-shot)     70.4%
TruthfulQA (0-shot)        50.3%
MMLU (5-shot)              68.0%
TriviaQA (5-shot)          73.8%
NaturalQuestions (5-shot)  31.2%

Multilingual Benchmarks (MMLU)

Language    Score
French      62.3%
German      62.7%
Spanish     64.6%
Italian     61.3%
Portuguese  63.3%
Russian     59.2%
Chinese     59.0%
Japanese    59.0%

What's Changed

Full Changelog: https://github.com/mistralai/mistral-inference/compare/v1.2.0...v1.3.0

mistral-inference - v1.2.0 Add Mamba

Published by patrickvonplaten 3 months ago

Welcome 🐍 Codestral-Mamba and 🔢 Mathstral

pip install "mistral-inference>=1.2.0"

Codestral-Mamba

pip install packaging mamba-ssm causal-conv1d transformers

1. Download

export MAMBA_CODE=$HOME/7B_MAMBA_CODE
wget https://models.mistralcdn.com/codestral-mamba-7b-v0-1/codestral-mamba-7B-v0.1.tar
mkdir -p $MAMBA_CODE
tar -xf codestral-mamba-7B-v0.1.tar -C $MAMBA_CODE

2. Chat

mistral-chat $HOME/7B_MAMBA_CODE --instruct --max_tokens 256

Mathstral

1. Download

export MATHSTRAL=$HOME/7B_MATH
wget https://models.mistralcdn.com/mathstral-7b-v0-1/mathstral-7B-v0.1.tar
mkdir -p $MATHSTRAL
tar -xf mathstral-7B-v0.1.tar -C $MATHSTRAL

2. Chat

mistral-chat $HOME/7B_MATH --instruct --max_tokens 256
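
As with Mistral-Nemo above, the mistral-chat CLI can be swapped for the Python API. Here is a minimal sketch for Mathstral, assuming $MATHSTRAL points at the folder extracted above and contains the tokenizer.model.v3 file shipped in the tar; depending on your mistral-inference version, Transformer may live in mistral_inference.transformer (as in the Nemo example) or mistral_inference.model (as in the LoRA example below).

import os

from mistral_inference.transformer import Transformer  # mistral_inference.model on older versions
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest

# Folder extracted from mathstral-7B-v0.1.tar in the step above.
model_dir = os.environ["MATHSTRAL"]
tokenizer = MistralTokenizer.from_file(f"{model_dir}/tokenizer.model.v3")
model = Transformer.from_folder(model_dir)

request = ChatCompletionRequest(
    messages=[UserMessage(content="Compute the derivative of x**3 * sin(x).")]
)
tokens = tokenizer.encode_chat_completion(request).tokens

out_tokens, _ = generate(
    [tokens], model, max_tokens=256, temperature=0.0,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0]))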

Blogs:
Blog Codestral Mamba 7B: https://mistral.ai/news/codestral-mamba/
Blog Mathstral 7B: https://mistral.ai/news/mathstral/

What's Changed

New Contributors

Full Changelog: https://github.com/mistralai/mistral-inference/compare/v1.1.0...v1.2.0

mistral-inference - v1.1.0 Add LoRA

Published by patrickvonplaten 5 months ago

mistral-inference==1.1.0 supports running LoRA adapters trained with https://github.com/mistralai/mistral-finetune.

Once you have trained a LoRA on the 7B base model, you can run it with mistral-inference as follows:

from mistral_inference.model import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest


MODEL_PATH = "path/to/downloaded/7B_base_dir"

tokenizer = MistralTokenizer.from_file(f"{MODEL_PATH}/tokenizer.model.v3")  # change to extracted tokenizer file
model = Transformer.from_folder(MODEL_PATH)  # change to extracted model dir
model.load_lora("/path/to/run_lora_dir/checkpoints/checkpoint_000300/consolidated/lora.safetensors")

completion_request = ChatCompletionRequest(messages=[UserMessage(content="Explain Machine Learning to me in a nutshell.")])

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)

mistral-inference - Mistral-inference

Published by patrickvonplaten 5 months ago

Mistral-inference is the official inference library for all Mistral models: 7B, 8x7B, 8x22B.

Install with:

pip install mistral-inference

Run with:

from mistral_inference.model import Transformer
from mistral_inference.generate import generate

from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.protocol.instruct.tool_calls import Function, Tool

tokenizer = MistralTokenizer.from_file("/path/to/tokenizer/file")  # change to extracted tokenizer file
model = Transformer.from_folder("./path/to/model/folder")  # change to extracted model dir

completion_request = ChatCompletionRequest(
    tools=[
        Tool(
            function=Function(
                name="get_current_weather",
                description="Get the current weather",
                parameters={
                    "type": "object",
                    "properties": {
                        "location": {
                            "type": "string",
                            "description": "The city and state, e.g. San Francisco, CA",
                        },
                        "format": {
                            "type": "string",
                            "enum": ["celsius", "fahrenheit"],
                            "description": "The temperature unit to use. Infer this from the users location.",
                        },
                    },
                    "required": ["location", "format"],
                },
            )
        )
    ],
    messages=[
        UserMessage(content="What's the weather like today in Paris?"),
        ],
)

tokens = tokenizer.encode_chat_completion(completion_request).tokens

out_tokens, _ = generate([tokens], model, max_tokens=64, temperature=0.0, eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id)
result = tokenizer.instruct_tokenizer.tokenizer.decode(out_tokens[0])

print(result)