To run Llama 2 easily on an Apple Silicon MacBook, it is highly recommended to use a quantized model.
There is a C/C++ port of LLaMA for exactly this: llama.cpp.
Download llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make
Convert the model to GGML format
cd llama.cpp
python3 -m venv llama2
source llama2/bin/activate
python3 -m pip install -r requirements.txt
The conversion process consists of two steps:
- convert the model to f16 format
- convert the f16 model to a quantized GGML model
Convert to f16 format
mkdir -p models/7B
python3 convert.py --outfile models/7B/ggml-model-f16.bin \
--outtype f16 \
../llama2/llama/llama-2-7b-chat \
--vocab-dir ../llama2/llama
Before running the conversion, create the output directory (e.g. models/7B).
--outfile specifies the output file name
--outtype specifies the output type, f16 in this case
--vocab-dir specifies the directory containing the tokenizer.model file
Convert the f16 model to quantized GGML
This step is called quantizing the model.
./quantize ./models/7B/ggml-model-f16.bin \
./models/7B/ggml-model-q4_0.bin q4_0
After quantizing, the model file becomes much smaller.
mzc01-choonhoson@MZC01-CHOONHOSON 7B % ls -alh
total 33831448
drwxr-xr-x@ 4 mzc01-choonhoson staff 128B 9 12 17:23 .
drwxr-xr-x@ 5 mzc01-choonhoson staff 160B 9 12 16:50 ..
-rw-r--r--@ 1 mzc01-choonhoson staff 13G 9 12 17:23 ggml-model-f16.bin
-rw-r--r--@ 1 mzc01-choonhoson staff 3.6G 9 12 17:23 ggml-model-q4_0.bin
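The roughly 3.6x reduction matches the bit widths: f16 stores 2 bytes per weight, while q4_0 packs weights into 4-bit blocks with a small per-block scale (about 4.5 bits per weight in GGML's q4_0 layout). A quick back-of-the-envelope check, assuming ~7 billion weights and ignoring metadata:

# rough size estimate for a 7B model (assumption: ~4.5 bits per weight for q4_0)
python3 -c "p=7e9; print(f'f16 : {p*2/2**30:.1f} GiB'); print(f'q4_0: {p*4.5/8/2**30:.1f} GiB')"

This prints about 13.0 GiB and 3.7 GiB, which lines up with the ls output above.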
Example
All done. Run the example binary!
./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt
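Here -i starts an interactive chat, -r "User:" hands control back to you at that reverse prompt, -f seeds the conversation from a prompt file, and -n caps the number of generated tokens. If you just want a one-off completion instead of a chat session, a minimal non-interactive sketch (the prompt text and token count below are only example values):

./main -m ./models/7B/ggml-model-q4_0.bin -p "Explain 4-bit quantization in one sentence." -n 128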
References
GGML - Large Language Models for Everyone
https://github.com/rustformers/llm/blob/main/crates/ggml/README.md
Series
Llama 2 in Apple Silicon MacBook (1/3)
https://dev.to/choonho/llama-2-in-apple-silicon-macbook-13-54h
Llama 2 in Apple Silicon MacBook (2/3)
https://dev.to/choonho/llama-2-in-apple-silicon-macbook-23-2j51
Llama 2 in Apple Silicon MacBook (3/3)
https://dev.to/choonho/llama-2-in-apple-silicon-macbook-33-3hb7