Llama 2 in Apple Silicon Macbook (2/3)

sunshout 2023. 10. 29. 15:17

To work with Llama 2 easily on a MacBook, it is highly recommended to convert it to a quantized model.

The llama.cpp repository, a C/C++ port of Llama, provides the tools for this.

Download llama.cpp

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Convert model to GGML format

python3 -m venv llama2
source llama2/bin/activate
python3 -m pip install -r requirements.txt

The conversion process consists of two steps.

  1. convert model to f16 format
  2. convert f16 model to ggml
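
The two steps can be sketched as a small Python helper that only builds the command lines and does not execute anything. The function name build_pipeline and its parameters are hypothetical and exist only for this illustration; the file names and options mirror the commands used in this post.

```python
from pathlib import Path


def build_pipeline(model_dir, vocab_dir, out_dir, quant_type="q4_0"):
    """Return the two command lines (as argv lists) for convert + quantize.

    Hypothetical helper for illustration; it does not run anything.
    """
    out = Path(out_dir)
    f16_file = out / "ggml-model-f16.bin"
    quantized_file = out / f"ggml-model-{quant_type}.bin"

    # Step 1: convert the original checkpoint to an f16 model file.
    convert_cmd = [
        "python3", "convert.py",
        "--outfile", str(f16_file),
        "--outtype", "f16",
        str(model_dir),
        "--vocab-dir", str(vocab_dir),
    ]
    # Step 2: quantize the f16 file down to the requested type.
    quantize_cmd = ["./quantize", str(f16_file), str(quantized_file), quant_type]
    return [convert_cmd, quantize_cmd]


if __name__ == "__main__":
    steps = build_pipeline("../llama2/llama/llama-2-7b-chat",
                           "../llama2/llama", "models/7B")
    for cmd in steps:
        print(" ".join(cmd))
```

Keeping the commands as lists makes it easy to pass them to subprocess.run later, once the paths have been checked.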

convert to f16 format

mkdir -p models/7B
python3 convert.py --outfile models/7B/ggml-model-f16.bin \
--outtype f16 \
../llama2/llama/llama-2-7b-chat \
--vocab-dir ../llama2/llama

Before running the conversion, create the output directory (e.g. models/7B).

--outfile specifies the output file name
--outtype specifies the output type, here f16
--vocab-dir specifies the directory containing the tokenizer.model file
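
As a small sanity check before starting the conversion, the two preconditions mentioned here (an existing output directory and a tokenizer.model file in the vocab directory) can be verified up front. This is a minimal sketch; the function name preflight is hypothetical and is not part of llama.cpp.

```python
from pathlib import Path


def preflight(vocab_dir, outfile):
    """Return a list of problems that would likely make the conversion fail."""
    problems = []
    # --vocab-dir must point at the directory that holds tokenizer.model.
    if not (Path(vocab_dir) / "tokenizer.model").is_file():
        problems.append(f"tokenizer.model not found in {vocab_dir}")
    # The directory for --outfile has to exist before the conversion runs.
    out_parent = Path(outfile).parent
    if not out_parent.is_dir():
        problems.append(f"output directory {out_parent} does not exist")
    return problems


if __name__ == "__main__":
    for problem in preflight("../llama2/llama", "models/7B/ggml-model-f16.bin"):
        print("problem:", problem)
```

An empty list means both checks passed.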

convert f16 model to ggml

This step is called quantizing the model. The quantize binary must be built first, for example by running make in the llama.cpp directory.

./quantize ./models/7B/ggml-model-f16.bin \
./models/7B/ggml-model-q4_0.bin q4_0
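
To give an idea of what 4-bit quantization does to the weights, here is a simplified pure-Python sketch in the spirit of q4_0: each block of values is stored as one scale plus a 4-bit code (0 to 15) per value. It is an illustration under assumptions, not the actual ggml implementation, which as far as I understand works on blocks of 32 weights, stores the scale as fp16 and packs two codes per byte. The function names are made up for this example.

```python
def quantize_block(values):
    """Quantize one block of floats to 4-bit codes plus a single scale."""
    # Take the element with the largest magnitude, keeping its sign.
    extreme = max(values, key=abs)
    scale = extreme / -8.0  # the extreme value maps to code 0
    if scale == 0.0:
        # All-zero block: every value gets the midpoint code 8.
        return 0.0, [8] * len(values)
    # Shift by 8 so the codes are non-negative, round, and clamp to 4 bits.
    codes = [min(15, int(v / scale + 8.5)) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from the scale and the 4-bit codes."""
    return [(c - 8) * scale for c in codes]


if __name__ == "__main__":
    block = [0.1, -0.4, 0.25, 0.8]
    scale, codes = quantize_block(block)
    restored = dequantize_block(scale, codes)
    for original, approx in zip(block, restored):
        print(f"{original:+.3f} -> {approx:+.3f}")
```

The reconstructed values are only approximations of the originals; that loss of precision is the price for the smaller file.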

After quantization, the model file is much smaller: 3.6G instead of 13G for the f16 file.

mzc01-choonhoson@MZC01-CHOONHOSON 7B % ls -alh
total 33831448
drwxr-xr-x@ 4 mzc01-choonhoson  staff   128B  9 12 17:23 .
drwxr-xr-x@ 5 mzc01-choonhoson  staff   160B  9 12 16:50 ..
-rw-r--r--@ 1 mzc01-choonhoson  staff    13G  9 12 17:23 ggml-model-f16.bin
-rw-r--r--@ 1 mzc01-choonhoson  staff   3.6G  9 12 17:23 ggml-model-q4_0.bin
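
The sizes in the listing above can be roughly reproduced with a back-of-the-envelope calculation. The sketch below assumes about 6.74 billion parameters for the 7B model and a q4_0 layout of 18 bytes per block of 32 weights (a 2-byte scale plus 32 four-bit codes); both numbers are assumptions for this estimate. It ignores file metadata and any tensors stored at a different precision, so it will not match the real files exactly.

```python
# Assumptions for this estimate (not taken from the post itself):
PARAMS = 6.74e9          # approximate parameter count of Llama 2 7B
BLOCK_WEIGHTS = 32       # weights per q4_0 block
BLOCK_BYTES_Q4_0 = 2 + 16  # 2-byte scale + 32 codes of 4 bits each

BITS_F16 = 16.0
BITS_Q4_0 = BLOCK_BYTES_Q4_0 * 8 / BLOCK_WEIGHTS  # effective bits per weight


def size_gib(n_params, bits_per_weight):
    """Estimated file size in GiB for a given precision."""
    return n_params * bits_per_weight / 8 / 2**30


if __name__ == "__main__":
    print(f"f16 : {size_gib(PARAMS, BITS_F16):.2f} GiB")
    print(f"q4_0: {size_gib(PARAMS, BITS_Q4_0):.2f} GiB "
          f"({BITS_Q4_0} bits per weight)")
```

Both estimates land close to the 13G and 3.6G shown by ls, which is consistent with the quantized file being a bit more than a quarter of the f16 size.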


All done. Run the example binary!

./main -m ./models/7B/ggml-model-q4_0.bin -n 1024 --repeat_penalty 1.0 --color -i -r "User:" -f ./prompts/chat-with-bob.txt


GGML - Large Language Models for Everyone


Llama 2 in Apple Silicon Macbook (1/3)

Llama 2 in Apple Silicon Macbook (2/3)

Llama 2 in Apple Silicon Macbook (3/3)
