Model Prebuilts

MLC-LLM is a universal solution for deploying different language models. Any language model that can be described in TVM Relax (a general representation for neural networks, importable from models written in PyTorch) can be recognized by MLC-LLM and thus deployed to different backends with the help of TVM Unity.

The community has already added support for several LLM architectures (LLaMA, GPT-NeoX, etc.) and has prebuilt some models (Vicuna, RedPajama, etc.) that you can use off the shelf. With the goal of democratizing LLM deployment, we eagerly anticipate further contributions from the community to expand the range of supported model architectures.

This page lists the prebuilt models for our CLI (command line interface) app and for the iOS and Android apps. The models have undergone extensive testing on various devices, and their performance has been optimized by developers with the help of TVM.

Prebuilt Models for CLI

| Model code                          | Model Series | Quantization Mode                                       | Hugging Face repo |
| ----------------------------------- | ------------ | ------------------------------------------------------- | ----------------- |
| Llama-2-7b-q4f16_1                  | Llama        | int4 weight storage, float16 runtime, symmetric          | link              |
| vicuna-v1-7b-q3f16_0                | Vicuna       | int3 weight storage, float16 runtime, symmetric          | link              |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama    | int4 weight storage, float16 runtime, symmetric          | link              |
| rwkv-raven-1b5-q8f16_0              | RWKV         | uint8 weight storage, float16 runtime, symmetric         | link              |
| rwkv-raven-3b-q8f16_0               | RWKV         | uint8 weight storage, float16 runtime, symmetric         | link              |
| rwkv-raven-7b-q8f16_0               | RWKV         | uint8 weight storage, float16 runtime, symmetric         | link              |
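The suffix of each model code encodes its quantization mode as q<A>f<B>_<id>: A is the number of bits used to store weights, B is the number of bits used at runtime, and the trailing number selects a variant of the quantization scheme. The shell sketch below is purely illustrative (it is not part of the MLC-LLM tooling) and simply decodes such a suffix:

# Purely illustrative: decode the quantization suffix of a model code.
model_code="Llama-2-7b-q4f16_1"
quant="${model_code##*-}"                                   # q4f16_1
weight_bits="${quant%%f*}"; weight_bits="${weight_bits#q}"  # 4
rest="${quant#*f}"                                          # 16_1
runtime_bits="${rest%%_*}"                                  # 16
variant="${rest#*_}"                                        # 1
echo "${weight_bits}-bit weights, float${runtime_bits} runtime, scheme variant ${variant}"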

To download and run a model with the CLI, follow the instructions below:

# Create a conda environment and install the CLI if you have not done so already.
conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
conda install git git-lfs
git lfs install

# Download the prebuilt model binary libraries from GitHub if you have not done so already.
mkdir -p dist/prebuilt
git clone https://github.com/mlc-ai/binary-mlc-llm-libs.git dist/prebuilt/lib

# Download the prebuilt model weights and run the CLI.
cd dist/prebuilt
git clone https://huggingface.co/mlc-ai/mlc-chat-[model-code]
cd ../..
mlc_chat_cli --local-id [model-code]

# e.g.,
# cd dist/prebuilt
# git clone https://huggingface.co/mlc-ai/mlc-chat-rwkv-raven-7b-q8f16_0
# cd ../..
# mlc_chat_cli --local-id rwkv-raven-7b-q8f16_0

Prebuilt Models for iOS

| Model code                          | Model Series | Quantization Mode                               | Hugging Face repo |
| ----------------------------------- | ------------ | ------------------------------------------------ | ----------------- |
| Llama-2-7b-q3f16_1                  | Llama        | int3 weight storage, float16 runtime, symmetric   | link              |
| vicuna-v1-7b-q3f16_0                | Vicuna       | int3 weight storage, float16 runtime, symmetric   | link              |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | RedPajama    | int4 weight storage, float16 runtime, symmetric   | link              |

The downloadable iOS app has built-in support for the RedPajama-3B model. To add a model to the iOS app, follow the steps below:

Open the "MLCChat" app and click "Add model variant".

(Screenshot: https://raw.githubusercontent.com/mlc-ai/web-data/main/images/mlc-llm/tutorials/iPhone-custom-1.png)

The iOS app integrates the following model libraries, which can be reused directly when you want to run a model you compiled yourself on iOS, as long as the model belongs to a supported model family and is compiled with a supported quantization mode. For example, if you compile OpenLLaMA-7B with quantization mode q3f16_0, you can run the compiled model on iPhone without rebuilding the iOS app by reusing the vicuna-v1-7b-q3f16_0 model library (see the sketch after the table below). Please check the model distribution page for detailed instructions.

Prebuilt model libraries integrated in the iOS app:

| Model library name                  | Model Family | Quantization Mode                               |
| ----------------------------------- | ------------ | ------------------------------------------------ |
| vicuna-v1-7b-q3f16_0                | LLaMA        | int3 weight storage, float16 runtime, symmetric   |
| RedPajama-INCITE-Chat-3B-v1-q4f16_1 | GPT-NeoX     | int4 weight storage, float16 runtime, symmetric   |
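
For instance, a compile command along the following lines would produce an OpenLLaMA build that reuses the vicuna-v1-7b-q3f16_0 library. This is a sketch only: the exact build.py flags are documented on the model compilation page and may differ between versions of MLC-LLM.

# Sketch: compile OpenLLaMA-7B with quantization mode q3f16_0 for the iPhone
# target so it can reuse the prebuilt vicuna-v1-7b-q3f16_0 model library.
# Verify the flags against the model compilation page for your checkout.
python3 build.py --hf-path openlm-research/open_llama_7b \
                 --quantization q3f16_0 \
                 --target iphone \
                 --max-seq-len 768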

Prebuilt Models for Android

| Model code                          | Model Series | Quantization Mode                               | Hugging Face repo |
| ----------------------------------- | ------------ | ------------------------------------------------ | ----------------- |
| vicuna-v1-7b-q4f16_1                | Vicuna       | int4 weight storage, float16 runtime, symmetric   | link              |
| RedPajama-INCITE-Chat-3B-v1-q4f16_0 | RedPajama    | int4 weight storage, float16 runtime, symmetric   | link              |


You can check the MLC-LLM pull requests to track ongoing work on new models. We encourage users to upload their compiled models to Hugging Face and share them with the community.
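
Uploading could look like the sketch below; the repo name, username, and dist/ file layout are placeholders, and huggingface-cli behavior may vary across versions.

# Hypothetical example: publish compiled weights to a new Hugging Face repo.
# The repo name, username, and dist/ layout below are placeholders.
huggingface-cli login
huggingface-cli repo create mlc-chat-my-model-q4f16_1
git clone https://huggingface.co/<username>/mlc-chat-my-model-q4f16_1
cp dist/my-model-q4f16_1/params/* mlc-chat-my-model-q4f16_1/
cd mlc-chat-my-model-q4f16_1
git lfs install && git lfs track "*.bin"
git add . && git commit -m "Add compiled MLC-LLM weights"
git push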

Supported Model Architectures

MLC-LLM supports the following model architectures:

| Category Code | Series     | Model Definition | Variants    |
| ------------- | ---------- | ---------------- | ----------- |
| llama         | LLaMA      | Relax Code       | Vicuna      |
| gpt-neox      | GPT-NeoX   | Relax Code       | RedPajama   |
| gptj          | GPT-J      | Relax Code       |             |
| rwkv          | RWKV       | Relax Code       | RWKV-raven  |
| minigpt       | MiniGPT    | Relax Code       |             |
| gpt_bigcode   | GPTBigCode | Relax Code       |             |

For models built on these architectures, check the model compilation page for how to compile models. Please create a new issue if you want to request a new model architecture. Our tutorial Define New Models introduces how to bring a new model architecture to MLC-LLM.

Contribute Models to MLC-LLM

Ready to contribute your compiled models or a new model architecture? Awesome! Please check Contribute New Models to MLC-LLM for how to contribute new models to MLC-LLM.