CLI and C++ API

MLCChat CLI is the command-line tool for running MLC-compiled LLMs out of the box. You can install it from the prebuilt package we provide, or compile it from source.

Option 1. Conda Prebuilt

The prebuilt package supports Metal on macOS and Vulkan on Linux and Windows, and can be installed via a Conda one-liner.

To use other GPU runtimes, e.g. CUDA, please build it from source instead.

conda create -n mlc-chat-venv -c mlc-ai -c conda-forge mlc-chat-cli-nightly
conda activate mlc-chat-venv
mlc_chat_cli --help

After installation, activating the mlc-chat-venv environment in Conda makes the mlc_chat_cli command available.

Note

The prebuilt package supports Metal on macOS and Vulkan on Linux and Windows. It is possible to use other GPU runtimes such as CUDA by compiling MLCChat CLI from source.

Option 2. Build MLC Runtime from Source

We also provide options to build the MLC runtime libraries and mlc_chat_cli from source. This step is useful when you want to directly obtain a version of the MLC runtime library and the CLI. Please click the details below to see the instructions.

Details

Step 1. Set up build dependency. To build from source, you need to ensure that the following build dependencies are satisfied:

  • CMake >= 3.24

  • Git

  • Rust and Cargo, required by Hugging Face’s tokenizer

  • One of the GPU runtimes:

    • CUDA >= 11.8 (NVIDIA GPUs)

    • Metal (Apple GPUs)

    • Vulkan (NVIDIA, AMD, Intel GPUs)

Set up build dependencies in Conda
# make sure to start with a fresh environment
conda env remove -n mlc-chat-venv
# create the conda environment with build dependency
conda create -n mlc-chat-venv -c conda-forge \
    "cmake>=3.24" \
    rust \
    git
# enter the build environment
conda activate mlc-chat-venv

Note

The TVM Unity compiler is not a dependency of MLCChat CLI. Only its runtime is required, which is automatically included in 3rdparty/tvm.

Step 2. Configure and build. A standard git-based workflow is recommended to download MLC LLM, after which you can specify build requirements with our lightweight config generation tool:

Configure and build
# clone from GitHub
git clone --recursive https://github.com/mlc-ai/mlc-llm.git && cd mlc-llm/
# create build directory
mkdir -p build && cd build
# generate build configuration
python3 ../cmake/gen_cmake_config.py
# build `mlc_chat_cli`
cmake .. && cmake --build . --parallel $(nproc) && cd ..

Step 3. Validate installation. You can verify that MLCChat CLI was compiled successfully using the following commands:

Validate installation
# expected to see `mlc_chat_cli`, `libmlc_llm.so` and `libtvm_runtime.so`
ls -l ./build/
# expected to see help message
./build/mlc_chat_cli --help

Run Models through MLCChat CLI

Once mlc_chat_cli is installed, you can run any MLC-compiled model on the command line.

Ensure Model Exists. Because the compiled model is the input to mlc_chat_cli, it is always good to double-check that it exists.

Details

If you downloaded prebuilt models from MLC LLM, by default:

  • Model lib should be placed at ./dist/prebuilt/lib/$(local_id)-$(arch).$(suffix).

  • Model weights and chat config are located under ./dist/prebuilt/mlc-chat-$(local_id)/.

Example
>>> ls -l ./dist/prebuilt/lib
Llama-2-7b-chat-hf-q4f16_1-metal.so  # Format: $(local_id)-$(arch).$(suffix)
Llama-2-7b-chat-hf-q4f16_1-vulkan.so
...
>>> ls -l ./dist/prebuilt/mlc-chat-Llama-2-7b-chat-hf-q4f16_1  # Format: ./dist/prebuilt/mlc-chat-$(local_id)/
# chat config:
mlc-chat-config.json
# model weights:
ndarray-cache.json
params_shard_*.bin
...

Run the Model. Next, run mlc_chat_cli on the command line:

# `local_id` is `$(model_name)-$(quantize_mode)`
# In this example, `model_name` is `Llama-2-7b-chat-hf`, and `quantize_mode` is `q4f16_1`
>>> mlc_chat_cli --local-id Llama-2-7b-chat-hf-q4f16_1
Use MLC config: "....../mlc-chat-config.json"
Use model weights: "....../ndarray-cache.json"
Use model library: "....../Llama-2-7b-chat-hf-q4f16_1-metal.so"
...

Have fun chatting with the MLC-compiled LLM!

Advanced: Build Apps with C++ API

MLC-compiled models can be integrated into any C++ project using TVM’s C/C++ API without going through the command line.

Step 1. Create libmlc_llm. Both static and shared libraries are available via the CMake instructions, and the downstream developer may link either one into the C++ project as needed.

Step 2. Call into the model from your C++ project. Use the tvm::runtime::Module API from the TVM runtime to interact with MLC LLM without going through MLCChat.

Note

DLPack, which ships with TVM, is an in-memory representation of tensors in deep learning. It is widely adopted by NumPy, PyTorch, JAX, TensorFlow, etc.
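
For reference, the first argument of the example below is a DLDeviceType from dlpack.h. The following sketch lists the enumerators that correspond to the GPU runtimes mentioned earlier; the constant names on the left are purely illustrative, and the enumerator names follow recent DLPack releases (older bundled copies may differ):

#include <dlpack/dlpack.h>

// DLDeviceType values from dlpack.h matching the GPU runtimes listed above;
// pick the one that matches the runtime your model library was compiled for.
constexpr DLDeviceType kChatDeviceCUDA   = kDLCUDA;    // NVIDIA GPUs via CUDA
constexpr DLDeviceType kChatDeviceVulkan = kDLVulkan;  // NVIDIA, AMD, Intel GPUs via Vulkan
constexpr DLDeviceType kChatDeviceMetal  = kDLMetal;   // Apple GPUs via Metal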

Using MLCChat APIs in Your Own Programs

Below is a minimal example of using MLCChat C++ APIs.

#define TVM_USE_LIBBACKTRACE 0
#define DMLC_USE_LOGGING_LIBRARY <tvm/runtime/logging.h>

#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>

#include <cassert>
#include <string>

// DLPack is a widely adopted in-memory representation of tensors in deep learning.
#include <dlpack/dlpack.h>

void ChatModule(
  const DLDeviceType& device_type, // from dlpack.h
  int device_id, // which one if there are multiple devices, usually 0
  const std::string& path_model_lib,
  const std::string& path_weight_config
) {
  // Step 0. Make sure the following files exist:
  // - model lib  : `$(path_model_lib)`
  // - chat config: `$(path_weight_config)/mlc-chat-config.json`
  // - weights    : `$(path_weight_config)/ndarray-cache.json`
  using tvm::runtime::PackedFunc;

  // Step 1. Call `mlc.llm_chat_create`
  // This method will exist if `libmlc_llm` is successfully loaded or linked as a shared or static library.
  const PackedFunc* llm_chat_create = tvm::runtime::Registry::Get("mlc.llm_chat_create");
  assert(llm_chat_create != nullptr);
  tvm::runtime::Module mlc_llm = (*llm_chat_create)(
    static_cast<int>(device_type),
    device_id
  );
  // Step 2. Obtain all available functions in `mlc_llm`
  PackedFunc prefill = mlc_llm->GetFunction("prefill");
  PackedFunc decode = mlc_llm->GetFunction("decode");
  PackedFunc stopped = mlc_llm->GetFunction("stopped");
  PackedFunc get_message = mlc_llm->GetFunction("get_message");
  PackedFunc reload = mlc_llm->GetFunction("reload");
  PackedFunc get_role0 = mlc_llm->GetFunction("get_role0");
  PackedFunc get_role1 = mlc_llm->GetFunction("get_role1");
  PackedFunc runtime_stats_text = mlc_llm->GetFunction("runtime_stats_text");
  PackedFunc reset_chat = mlc_llm->GetFunction("reset_chat");
  PackedFunc process_system_prompts = mlc_llm->GetFunction("process_system_prompts");
  // Step 3. Load the model lib containing optimized tensor computation
  tvm::runtime::Module model_lib = tvm::runtime::Module::LoadFromFile(path_model_lib);
  // Step 4. Inform MLC LLM to use `model_lib`
  reload(model_lib, path_weight_config);
}
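
To complete the picture, below is a minimal sketch of how the functions obtained in Step 2 could drive a single chat turn. The helper name ChatOneTurn is hypothetical, and the calling conventions assumed here (prefill takes the user prompt as a string, decode generates one token per call, stopped reports completion, and get_message returns the text generated so far) mirror how MLCChat CLI uses these functions; verify them against llm_chat.cc in your checkout.

#include <iostream>
#include <string>

#include <tvm/runtime/packed_func.h>

// Hypothetical helper: run one chat turn using the PackedFuncs obtained in Step 2,
// assuming `reload` has already been called as in the example above.
std::string ChatOneTurn(tvm::runtime::PackedFunc prefill,
                        tvm::runtime::PackedFunc decode,
                        tvm::runtime::PackedFunc stopped,
                        tvm::runtime::PackedFunc get_message,
                        const std::string& user_input) {
  prefill(user_input);                 // feed the user prompt through the prefill stage
  bool finished = stopped();
  while (!finished) {
    decode();                          // generate the next token
    finished = stopped();
  }
  std::string reply = get_message();   // the full reply for this turn
  std::cout << reply << std::endl;
  return reply;
}

In a real program you would keep the mlc_llm module (or the individual PackedFuncs) from Step 2 alive and feed user inputs into a loop like this, which is essentially what MLCChat CLI does.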

Note

MLCChat CLI can be considered a single-file project that serves as a good example of using MLC LLM in any C++ project.

Step 3. Set up compilation flags. To compile the code above, you will have to set up the compiler flags properly in your own C++ project:

  • Make sure the following directories are on the include path, where TVM_HOME is /path/to/mlc-llm/3rdparty/tvm:

    • TVM runtime: ${TVM_HOME}/include,

    • Header-only DLPack: ${TVM_HOME}/3rdparty/dlpack/include,

    • Header-only DMLC core: ${TVM_HOME}/3rdparty/dmlc-core/include.

  • Make sure to link either the static or the shared libtvm_runtime library, which is provided via CMake.