Phi-tiny-MoE is a lightweight Mixture of Experts (MoE) model with 3.8B total parameters and 1.1B activated parameters. It is compressed and distilled from the base model shared by Phi-3.5-MoE and GRIN-MoE using the SlimMoE approach, then post-trained via supervised fine-tuning and direct preference optimization for instruction following and safety. The model is trained on Phi-3 synthetic data and filtered public documents, with a focus on high-quality, reasoning-dense content. It is part of the SlimMoE series, which includes a larger variant, Phi-mini-MoE, with 7.6B total and 2.4B activated parameters.
References:
- 📖 SlimMoE Paper
- 📖 Phi-3 Technical Report
- 📖 GRIN-MoE
The model is intended for commercial and research use in English. It is designed for general-purpose AI systems and applications that require memory/compute-constrained environments or latency-bound scenarios.
Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using the model within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws and regulations (including privacy and trade compliance laws) that are relevant to their use case.
Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
Given the nature of the training data, the Phi-tiny-MoE model is best suited for prompts using the chat format as follows:
<|system|>
You are a helpful assistant.<|end|>
<|user|>
How to explain Internet for a medieval knight?<|end|>
<|assistant|>
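For reference, the same prompt string can be produced with the tokenizer's chat template. This is a minimal sketch, assuming the checkpoint ships a chat template (which the pipeline example below also relies on):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-tiny-MoE-instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
]

# Render the chat-format prompt shown above; add_generation_prompt appends the
# trailing <|assistant|> tag so generation continues as the assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```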
After obtaining the Phi-tiny-MoE model checkpoints, users can run the following sample code for inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

# Fix the seed for reproducible generation
torch.random.manual_seed(0)

# Load the model; trust_remote_code allows the repo's custom modeling code to load
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-tiny-MoE-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-tiny-MoE-instruct")

# Multi-turn conversation in the chat format shown above
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Can you provide ways to eat combinations of bananas and dragonfruits?"},
    {"role": "assistant", "content": "Sure! Here are some ways to eat bananas and dragonfruits together: 1. Banana and dragonfruit smoothie: Blend bananas and dragonfruits together with some milk and honey. 2. Banana and dragonfruit salad: Mix sliced bananas and dragonfruits together with some lemon juice and honey."},
    {"role": "user", "content": "What about solving a 2x + 3 = 7 equation?"},
]

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Greedy decoding (do_sample=False); temperature has no effect here
generation_args = {
    "max_new_tokens": 500,
    "return_full_text": False,
    "temperature": 0.0,
    "do_sample": False,
}

output = pipe(messages, **generation_args)
print(output[0]["generated_text"])
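The model is also runnable with vLLM. Below is a minimal offline-inference sketch; `LLM.chat` and `trust_remote_code` are standard vLLM API, but compatibility of this particular MoE checkpoint with your installed vLLM version is assumed here rather than verified.

```python
# Hedged sketch: run the same chat with vLLM's offline LLM API.
# Assumes the installed vLLM build can load this MoE checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(model="microsoft/Phi-tiny-MoE-instruct", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=500)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
]

# vLLM applies the tokenizer's chat template before generating
outputs = llm.chat(messages, sampling)
print(outputs[0].outputs[0].text)
```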
To understand its capabilities, we compare Phi-tiny-MoE with a set of models on a variety of benchmarks using lm-evaluation-harness (see the example invocation after the table). Detailed evaluation settings can be found in the SlimMoE paper.
| Model | # Total params | # Act. params | MMLU | MMLU-Pro | BBH | ARC-C (chat) | HumanEval | GSM8K | MT-Bench |
|---|---|---|---|---|---|---|---|---|---|
| MoE Models | | | | | | | | | |
| Phi 3.5-MoE | 42B | 6.6B | 78.36 | 59.38 | 63.93 | 91.38 | 81.70 | 87.87 | 8.34 |
| Qwen 1.5 MoE | 14B | 2.7B | 60.73 | 26.49 | 42.65 | 67.24 | 46.30 | 53.07 | 6.55 |
| DeepSeek V2 Lite | 16B | 2.4B | 56.69 | 17.89 | 36.30 | 61.09 | 54.40 | 63.23 | 6.82 |
| OL-MoE | 7B | 1.3B | 54.27 | 20.87 | 38.00 | 55.63 | 37.80 | 71.49 | 6.60 |
| Granite 3.0 MoE | 3.4B | 0.8B | 50.06 | 4.82 | 39.65 | 56.06 | 51.80 | 60.12 | 6.91 |
| Dense Models | | | | | | | | | |
| LLaMA 3.1 8B | 8B | 8B | 68.71 | 45.28 | 50.86 | 82.42 | 69.50 | 84.84 | 8.03 |
| Qwen 2.5 7B | 7.6B | 7.6B | 73.47 | 56.24 | 53.74 | 88.82 | 81.70 | 84.84 | 8.34 |
| Phi 3 small | 7.4B | 7.4B | 75.35 | 52.06 | 62.07 | 84.30 | 70.10 | 84.84 | 8.03 |
| Gemma 3 4B | 4B | 4B | 59.49 | 40.13 | 49.45 | 75.85 | 67.10 | 78.92 | 8.28 |
| Phi 3 mini | 3.8B | 3.8B | 69.94 | 45.65 | 54.94 | 85.58 | 72.60 | 84.61 | 7.46 |
| LLaMA 3.2 3B | 3.2B | 3.2B | 61.73 | 36.70 | 45.46 | 75.77 | 52.40 | 77.41 | 7.46 |
| Qwen 2.5 3B | 3B | 3B | 65.06 | 41.00 | 46.61 | 80.20 | 73.80 | 76.57 | 7.60 |
| Gemma 3 1B | 1B | 1B | 40.80 | 14.70 | 34.80 | 37.46 | 41.50 | 41.77 | 6.67 |
| LLaMA 3.2 1B | 1B | 1B | 46.30 | 18.67 | 35.18 | 49.91 | 35.40 | 44.96 | 5.23 |
| Our (SlimMoE) Models | | | | | | | | | |
| Phi-mini-MoE | 7.6B | 2.4B | 70.68 | 49.68 | 55.27 | 84.91 | 73.80 | 84.89 | 7.59 |
| Phi-tiny-MoE | 3.8B | 1.1B | 60.83 | 36.34 | 45.58 | 76.37 | 58.50 | 78.47 | 7.05 |
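For a rough local reproduction, the same style of evaluation can be run with lm-evaluation-harness. A minimal sketch using its `simple_evaluate` Python API is shown below; the task selection, few-shot count, and batch size are illustrative defaults, not the exact settings behind the numbers above (those are documented in the SlimMoE paper).

```python
# Hedged sketch: score the model on one benchmark with lm-evaluation-harness.
# Settings here are illustrative and will not exactly match the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-tiny-MoE-instruct,trust_remote_code=True,dtype=auto",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```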
- **Architecture:** Phi-tiny-MoE has 3.8B total parameters and 1.1B active parameters. The model is a mixture-of-experts decoder-only Transformer using a tokenizer with a vocabulary size of 32,064 (see the configuration check after this list).
- **Inputs:** Text. It is best suited for prompts using the chat format.
- **Context length:** 4k tokens
- **GPUs:** 64 A100-80G
- **Training time:** 11 days
- **Training data:** 400B tokens
- **Outputs:** Generated text in response to the input
- **Dates:** Trained between September 2024 and March 2025
- **Status:** This is a static model trained on an offline dataset with a cutoff date of October 2023 for publicly available data.
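As a quick sanity check, the reported vocabulary size and context length can be read back from the model config. A minimal sketch, assuming the config exposes the usual `vocab_size` and `max_position_embeddings` attributes:

```python
# Hedged sketch: confirm vocabulary size and context length from the config.
# Attribute names follow common transformers conventions; this checkpoint may differ.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("microsoft/Phi-tiny-MoE-instruct", trust_remote_code=True)
print(config.vocab_size)               # expected 32,064 per the card
print(config.max_position_embeddings)  # expected ~4k context length per the card
```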
Our training data is a 400B-token subset of the Phi-3 datasets, which include a wide variety of sources, combining filtered publicly available documents with synthetic data that emphasizes high-quality, reasoning-dense content.
More details about data can be found in the Phi-3 Technical Report.
Like other language models, Phi-tiny-MoE can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:
Developers should apply responsible AI best practices, including mapping, measuring, and mitigating risks associated with their specific use case and cultural, linguistic context. Important areas for consideration include:
Note that by default, the Phi-tiny-MoE model uses flash attention, which requires certain types of GPU hardware to run. We have tested on the following GPU types:
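If your GPU does not support flash attention, one possible fallback is to request a different attention implementation at load time. The sketch below uses transformers' standard `attn_implementation` argument; whether this checkpoint's modeling code honors it is an assumption, not something verified here.

```python
# Hedged sketch: load the model without flash attention for unsupported GPUs.
# Assumes the modeling code respects the standard attn_implementation switch.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-tiny-MoE-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="eager",  # or "sdpa"; flash attention is the default described above
)
```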
The model is licensed under the MIT license.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
Data summary card: https://huggingface.co/microsoft/Phi-tiny-MoE-instruct/blob/main/data_summary_card.md