setting up
- to use a large language model from your own source code you need a few components (a minimal end-to-end sketch closes these notes):
- a model
- a runtime to run the model
- an abstraction in your programming language
model
- a model that fits your hardware
- models are loaded into RAM in their entirety, so the model size has to fit in the available memory
- e.g. a Mac M1 with 16 GB of memory can run Llama 2 7B at Q8 (see the sizing sketch after this list)
- the model should be in a format the runtime can read (e.g. GGUF)
- the model should have a suitable quantisation (e.g. Q8 for 8-bit precision)
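
A rough back-of-the-envelope sketch for checking whether a model fits in RAM; the ~20% overhead factor for the KV cache and runtime buffers is an assumption, not a measured figure:

```python
def approx_ram_gb(n_params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough RAM estimate: parameter count * bits per weight, plus ~20% overhead
    for the KV cache and runtime buffers (the overhead factor is an assumption)."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"Llama 2 7B @ Q8   ~ {approx_ram_gb(7, 8):.1f} GB")   # fits comfortably in 16 GB
print(f"Llama 2 7B @ FP16 ~ {approx_ram_gb(7, 16):.1f} GB")  # roughly twice as much
```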
format
- since August 2023 the llama.cpp team has been using GGUF instead of GGML
- GGUF has better tokenisation and support for special tokens
- GGUF format
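
A practical consequence: GGUF files start with the 4-byte magic `GGUF`, so you can sanity-check a download with a few lines of Python (the model path below is just a placeholder):

```python
def looks_like_gguf(path: str) -> bool:
    """Check the 4-byte magic at the start of the file ('GGUF' for the GGUF format)."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# hypothetical local path to a downloaded model
print(looks_like_gguf("models/llama-2-7b.Q8_0.gguf"))
```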
quantisation
Quantization is a compression technique that involves mapping high-precision values to lower-precision ones. For an LLM, that means reducing the precision of its weights and activations, making it less memory intensive.
LLMs are generally trained with full (float32) or half (float16) precision floating point numbers. One float16 value takes 16 bits, which is 2 bytes, so a one-billion-parameter model trained in FP16 requires two gigabytes. —Nithin Devanand
- quantisation means reducing the precision of values to fewer bits (downsampling / rounding)
- models are often offered at several quantisation levels, such as Q2 to Q8 (2-bit to 8-bit precision)
- they differ in the resulting model size on disk and in inference speed (a small illustration follows this list)
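
To make "mapping high precision to lower precision" concrete, here is a minimal symmetric 8-bit quantisation sketch using numpy; it is only illustrative and not the exact block-wise scheme llama.cpp uses:

```python
import numpy as np

weights = np.random.randn(8).astype(np.float32)   # pretend these are model weights

scale = np.abs(weights).max() / 127               # one shared scale factor
q = np.round(weights / scale).astype(np.int8)     # stored as 8-bit ints: 4x smaller than float32
restored = q.astype(np.float32) * scale           # dequantised again at inference time

print(weights)
print(restored)   # close to the originals, with a small rounding error
```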
runtime
- llama.cpp
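
A minimal sketch of the third piece, the abstraction in your programming language, driving the llama.cpp runtime through the llama-cpp-python bindings; the model path is a placeholder for whichever GGUF file you downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="models/llama-2-7b.Q8_0.gguf",  # hypothetical local path
    n_ctx=2048,                                # context window size
)

out = llm("Q: What is quantisation in one sentence? A:", max_tokens=48, stop=["Q:"])
print(out["choices"][0]["text"])
```

One GGUF file, the llama.cpp runtime underneath, and a few lines of Python cover the three abstractions listed at the top of these notes.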