setting up
- to use a large language model from your own source code you need a few components (a minimal end-to-end sketch closes these notes):
- a model
- a runtime to run the model
- an abstraction in your programming language
model
- a model that fits your hardware
- models are loaded into RAM in their entirety, so the model size has to fit in the available memory
- e.g. a Mac M1 with 16 GB of memory can run Llama 2 7B at Q8 (see the sizing sketch after this list)
- the model should be in a format the runtime can read (e.g. GGUF)
- the model should have a suitable quantisation (e.g. Q8 for 8-bit precision)
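
A rough back-of-the-envelope sketch for checking whether a model fits in RAM; the ~20% overhead factor for the KV cache and runtime buffers is an assumption, not a measured figure:

```python
def approx_ram_gb(n_params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough RAM estimate: parameter count * bits per weight, plus ~20% overhead
    for the KV cache and runtime buffers (the overhead factor is an assumption)."""
    weight_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

print(f"Llama 2 7B @ Q8   ~ {approx_ram_gb(7, 8):.1f} GB")   # fits comfortably in 16 GB
print(f"Llama 2 7B @ FP16 ~ {approx_ram_gb(7, 16):.1f} GB")  # roughly twice as much
```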
format
- since August 2023 the llama.cpp team has been using GGUF instead of GGML
- GGUF has better tokenisation and support for special tokens
- GGUF format
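
A practical consequence: GGUF files start with the 4-byte magic `GGUF`, so you can sanity-check a download with a few lines of Python (the model path below is just a placeholder):

```python
def looks_like_gguf(path: str) -> bool:
    """Check the 4-byte magic at the start of the file ('GGUF' for the GGUF format)."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# hypothetical local path to a downloaded model
print(looks_like_gguf("models/llama-2-7b.Q8_0.gguf"))
```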
quantisation
Quantization is a compression technique that involves mapping high-precision values to lower-precision ones. For an LLM, that means reducing the precision of its weights and activations, making it less memory intensive.
LLMs are generally trained with full (float32) or half (float16) precision floating point numbers. One float16 value takes 16 bits, which is 2 bytes, so a one-billion-parameter model trained in FP16 requires two gigabytes. —Nithin Devanand
- quantisation means reducing the precision of values to fewer bits (downsampling / rounding)
- models are often offered at several quantisation levels, such as Q2 to Q8 (2-bit to 8-bit precision)
- they differ in the resulting model size on disk and in inference speed (a small illustration follows this list)
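
To make "mapping high precision to lower precision" concrete, here is a minimal symmetric 8-bit quantisation sketch using numpy; it is only illustrative and not the exact block-wise scheme llama.cpp uses:

```python
import numpy as np

weights = np.random.randn(8).astype(np.float32)   # pretend these are model weights

scale = np.abs(weights).max() / 127               # one shared scale factor
q = np.round(weights / scale).astype(np.int8)     # stored as 8-bit ints: 4x smaller than float32
restored = q.astype(np.float32) * scale           # dequantised again at inference time

print(weights)
print(restored)   # close to the originals, with a small rounding error
```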
runtime
- llama.cpp
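
A minimal sketch of the third piece, the abstraction in your programming language, driving the llama.cpp runtime through the llama-cpp-python bindings; the model path is a placeholder for whichever GGUF file you downloaded:

```python
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="models/llama-2-7b.Q8_0.gguf",  # hypothetical local path
    n_ctx=2048,                                # context window size
)

out = llm("Q: What is quantisation in one sentence? A:", max_tokens=48, stop=["Q:"])
print(out["choices"][0]["text"])
```

One GGUF file, the llama.cpp runtime underneath, and a few lines of Python cover the three abstractions listed at the top of these notes.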