Talk given at BudapestData 2024 on using the local `llama.cpp` model runner, demonstrating Llama 2, Microsoft Phi-2, Llama 3, CodeLlama, and LLaVA (multimodal) answering queries on a laptop with CPU and/or GPU:
https://budapestdata.hu/2024/en/program/schedule/#