
Llama.cpp for fun (and maybe profit)

March 21, 2024


A short talk on ways to use llama.cpp, accompanied by live demos (I used MS Phi-2, llava 1.5, codellama), which are more recent than some of the models in the slides. This was given to Citibank as a playful "what you might do" talk, with notes on the various ways these models can also go wrong.
I'll be writing this up in my newsletter too: https://buttondown.email/NotANumber/archive/




  1. Engineering & Testing for Pythonic Research · Fast Pandas & Higher Performance Python · Successful Data Science Projects
     You – post in Chat if you’re using LLMs?
     By [ian]@ianozsvald[.com] Ian Ozsvald
  2. llama.cpp: no need for a GPU+VRAM – it runs on CPU+RAM.
     Nothing sent off your machine.
  3. Why use local models? Experiment with models as they’re published.
     Use client data/src code – no data sent off your machine.
  4. Why use local models? See the wackiness early on.
     What’s your strategy to catch varied outputs?
  5. MS Phi-2 can “reason” (IS IT RIGHT?)
     I had confident answers: 125.2 m/s (good Python), 17.2 m/s (partial Python with comments that had mistakes), 40 m/s and 31.3 m/s (as teacher). Which one to believe?
     My model is quantised (Q5) but random variation exists anyway… The MS post didn’t disclose the prompt they used: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
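Those four different confident answers are what sampling looks like from the outside: each generated token is drawn from a temperature-scaled softmax over the model’s logits, so repeated runs diverge. A minimal sketch of that mechanism – the logits below are invented for illustration, not taken from Phi-2:

```python
import math
import random

def sample_token(logits: list[float], temperature: float,
                 rng: random.Random) -> int:
    """Pick one token index from temperature-scaled softmax weights."""
    if temperature == 0:  # greedy decoding: always the arg-max token
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights)[0]

logits = [2.0, 1.8, 0.5]  # three candidate "tokens", made-up scores
greedy = {sample_token(logits, 0, random.Random(s)) for s in range(20)}
sampled = {sample_token(logits, 1.0, random.Random(s)) for s in range(20)}
print(greedy)   # → {0}: temperature 0 is deterministic
print(sampled)  # typically a mix of tokens across runs
```

With temperature 0 every run agrees; at temperature 1 the two strongest candidates are nearly tied, which is one way four runs can each sound confident yet disagree.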
  6. Quantisation: similar to JPG compression – shrink the trained model 32→16→8→7/6/5/4/3/2 bits.
     Fewer bits → worse text completion. “Q5 generally an acceptable level”.
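Those bit-widths translate almost directly into file size: a quantised model is roughly parameters × bits-per-weight. A back-of-envelope sketch (it ignores the small overhead of K-quant block scales and file metadata):

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough size of a quantised model: parameters × bits / 8, in GB.
    Ignores metadata and K-quant block-scale overhead."""
    return n_params * bits_per_weight / 8 / 1e9

# A 7B-parameter model at different quantisation levels:
for bits in (16, 8, 5, 4, 2):
    print(f"{bits}-bit: ~{approx_model_size_gb(7e9, bits):.1f} GB")
# e.g. Q5 of a 7B model ≈ 4.4 GB
```

This is consistent with the ~4GB Q4 llava-7b file and the ~23GB Q5 codellama-34b file mentioned on later slides.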
  7. What about image queries? Experiment with multi-modal, e.g. OCR and checking a photo meets requirements.
  8. Llava multi-modal – extract facts from images?
     llava-v1.5-7b-Q4_K.gguf: 4GB on disk & RAM, 5s for the example. llama.cpp provides ./server.
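./server exposes an HTTP completion endpoint, so any client can drive a local model. A minimal sketch using only the standard library – the field names (prompt, n_predict, temperature) and the /completion route follow llama.cpp’s server API at the time of the talk, and the port is the default, but check the README of your build:

```python
import json
import urllib.request

def build_completion_request(prompt: str, n_predict: int = 128,
                             temperature: float = 0.2) -> dict:
    """Assemble the JSON payload for POST /completion."""
    return {"prompt": prompt,
            "n_predict": n_predict,      # max tokens to generate
            "temperature": temperature}

def ask_server(prompt: str,
               url: str = "http://localhost:8080/completion") -> str:
    """Send the request; only works while ./server is running locally."""
    body = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]
```

ask_server needs a running server (e.g. `./server -m llava-v1.5-7b-Q4_K.gguf`); build_completion_request can be inspected without one.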
  9. Can they help with coding? Trial code-support. Code review? “Is this test readable?”
  10. Can you explain this function please?
     codellama-34b-python.Q5_K_M.gguf: 23GB on disk & RAM, 30s for the example. Can we use this as a “code reviewer” for internal code?
     codellama answer: “The function test_uniform_distribution creates a list of 10 zeros, then increments the position in that list indicated by the murmurhash3_32() digest of i. It does this 100000 times and then checks if the means of those incremented values are uniformly distributed (i.e., if they're all roughly the same).” (surprisingly clear!)
     https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/tests/test_murmurhash.py
  11. My experiment for code assist: give test functions (e.g. Pandas) to codellama. Ask it “is this a good test function?” Try to get it to propose new test functions. Check using pytest and coverage tools. Shortcut human effort at project maintenance?
  12. Why try llama.cpp? Run quantised models on client data locally. Experience the wackiness – mitigation? Use the Python API to see tokens + perplexity + more.
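On the “tokens + perplexity” point: given per-token log-probabilities (which the Python bindings can expose), perplexity is just the exponential of the negative mean log-prob – lower means the model found the text less surprising. A sketch with made-up numbers, not a real model run:

```python
import math

def perplexity(logprobs: list[float]) -> float:
    """Perplexity over a sequence of natural-log token probabilities."""
    return math.exp(-sum(logprobs) / len(logprobs))

# A model that assigns every token probability 0.25 has perplexity 4 –
# it behaves as if choosing among 4 equally likely tokens each step.
print(round(perplexity([math.log(0.25)] * 10), 6))  # → 4.0
```

Comparing perplexity on your own documents is one cheap way to see how much a given quantisation level is hurting a model.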
  13. Summary: let me know if training/strategy chats would be useful. Discuss: how do we measure correctness? What’s the worst (!) that could go wrong with your projects?
  14. Appendix – ask Mixtral to challenge my Monte Carlo estimation approach.
     Mixtral gave 5 points and some items I should be careful about; ChatGPT 3.5 gave 7 points – both felt similar.
  15. WizardCoder is good (tuned llama2).
     wizardcoder-python-34b-v1.0.Q5_K_S.gguf: 22GB on disk & RAM, 15s for the example. You can replace CoPilot with this for completions.
  16. Quantisation: original fp16 models. Bigger models with higher quantisation still have lower perplexity than smaller, less-quantised models – choose the biggest you can. K-quants PR: https://github.com/ggerganov/llama.cpp/pull/1684