values • Weights: ~factors that influence how a value is transformed between layers • Training: fiddle with the weights until the output approximates the known, expected output • Result of training → a collection of weights (numbers)
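A minimal sketch of the idea, assuming a toy one-weight "model" (not how real LLMs are trained): nudge the weight until the output approximates the expected output.

```kotlin
// Toy sketch (hypothetical): a single weight transforming a value between
// two "layers", adjusted until the output approximates a known target.
fun main() {
    var weight = 0.5                 // the "model": one weight
    val input = 2.0                  // known input
    val target = 6.0                 // known, expected output
    val learningRate = 0.01

    repeat(200) {                    // "fiddle with the weights"
        val output = input * weight              // forward pass
        val error = output - target              // how wrong are we?
        weight -= learningRate * error * input   // nudge the weight (gradient step)
    }
    println("trained weight = $weight")          // ~3.0: the result of training
}
```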
on previous words • Context window • Training: show text, mask the next word, calculate accuracy of the prediction ♻ • Inference: start with an input sentence, predict the next word, add it to the input sentence ♻
token based on previous tokens • Context window • Training: show tokens, mask the next token, calculate accuracy of the prediction ♻ • Inference: start with input tokens, predict the next token, add it to the input tokens ♻
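A sketch of the inference loop, with the model abstracted as a function from context tokens to the next token (the real model would be, e.g., a llama.cpp call):

```kotlin
// Autoregressive inference: predict the next token, append it, repeat ♻.
fun generate(
    prompt: List<Int>,
    maxNewTokens: Int,
    contextWindow: Int,
    predictNext: (List<Int>) -> Int,   // the trained model, abstracted away
): List<Int> {
    val tokens = prompt.toMutableList()
    repeat(maxNewTokens) {
        val context = tokens.takeLast(contextWindow) // model only sees a limited window
        tokens += predictNext(context)               // predict, append, repeat
    }
    return tokens
}

fun main() {
    // Dummy "model" that just emits last token + 1, for illustration only
    val result = generate(prompt = listOf(1, 2, 3), maxNewTokens = 5, contextWindow = 4) {
        it.last() + 1
    }
    println(result) // [1, 2, 3, 4, 5, 6, 7, 8]
}
```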
Can predict next token, not an assistant ◦ 💸💸💸💸 (1000s of GPUs, $5-100 million) • Fine tuning: continue training to create an assistant model ◦ Training text is ideal examples of structured conversations (~10k-100k+ examples) • Human preference tuning ◦ Further fine tuning by generating responses and selecting the best ones ◦ “Alignment”, “Safety”, etc. Types of models
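Roughly what one fine-tuning example could look like; the exact chat template varies per model family, and the roles and content below are illustrative:

```kotlin
// Hypothetical structured-conversation training example for fine tuning.
data class Turn(val role: String, val content: String)

val trainingExample = listOf(
    Turn("system", "You are a helpful assistant."),
    Turn("user", "How do I reverse a list in Kotlin?"),
    Turn("assistant", "Use the reversed() extension: listOf(1, 2, 3).reversed()"),
)
// Fine tuning continues training on ~10k-100k+ such examples, so the
// next-token predictor learns to behave like an assistant.
```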
◦ Reduces accuracy, but is generally OK ◦ Other optimisations possible • Train smaller models (fewer parameters) • Both • Or: train a specialized, ultra-small model (llama2.c) (megabytes vs gigabytes) Shrinking LLMs
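A sketch of the idea behind quantization, assuming simple symmetric int8 (real schemes like the 4-bit GGUF formats are more involved): store weights as bytes plus a scale factor, trading some accuracy for roughly 4x less memory than float32.

```kotlin
import kotlin.math.abs
import kotlin.math.roundToInt

// Symmetric int8 quantization: map the largest |weight| to 127.
fun quantize(weights: FloatArray): Pair<ByteArray, Float> {
    val scale = weights.maxOf { abs(it) } / 127f
    val q = ByteArray(weights.size) { i ->
        (weights[i] / scale).roundToInt().coerceIn(-127, 127).toByte()
    }
    return q to scale
}

// Recover approximate float weights at inference time.
fun dequantize(q: ByteArray, scale: Float): FloatArray =
    FloatArray(q.size) { i -> q[i] * scale }

fun main() {
    val w = floatArrayOf(0.12f, -0.5f, 0.33f, 0.9f)
    val (q, scale) = quantize(w)
    println(dequantize(q, scale).toList()) // close to, not exactly, the originals
}
```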
source research model • 2.7B parameters w/ the performance of a 13B-parameter model • Quantized model can run on a phone • https://huggingface.co/microsoft/phi-2 Small LLM: Phi-2
used for training ◦ Except for enterprise editions • OpenAI API → data not used for training • Google Gemini…it depends ◦ TL;DR: the Gemini API (not available in the EU) may train on your users’ data ◦ The Google Cloud Platform Vertex AI API has different terms and does not
& GPU ◦ Supports many model architectures, like Phi-2 and Gemma ◦ Moves very fast (multiple releases a day) • Candle → ML framework from 🤗 written in Rust 😅 • Gemma.cpp • Mediapipe (crashes on a real device ) • Gemini nano → Only on select high end devices, currently EAP
of source ◦ (Sometimes) easier for simple cases ◦ Can be convenient for testing, e.g. running cmd-line tools on device, importing a .so w/ cmake ◦ Hard to maintain https://developer.android.com/ndk/guides/cmake#command-line
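Hooking vendored CMake source into the Android build could look roughly like this in a module-level build.gradle.kts (paths and ABI list are illustrative):

```kotlin
// build.gradle.kts: building native source (e.g. llama.cpp checked in
// under src/main/cpp) with cmake via the NDK.
android {
    externalNativeBuild {
        cmake {
            path = file("src/main/cpp/CMakeLists.txt")  // illustrative path
        }
    }
    defaultConfig {
        ndk {
            abiFilters += listOf("arm64-v8a")  // skip ABIs you don't ship
        }
    }
}
```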
use cmake ◦ Even if they do…😬 • Options: precompile dependencies or include all source • NDK prefab packages ◦ AAR containing libs + headers ◦ Mixed results producing and consuming with cmake
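Producing and consuming a prefab package, roughly, per the AGP DSL (module name and header path are illustrative):

```kotlin
// Producing: build.gradle.kts of the library module that publishes the AAR
android {
    buildFeatures {
        prefabPublishing = true
    }
    prefab {
        create("mylib") {
            headers = "src/main/cpp/include"
        }
    }
}

// Consuming: build.gradle.kts of the app module
android {
    buildFeatures {
        prefab = true
    }
}
// ...then in the app's CMakeLists.txt:
//   find_package(mylib REQUIRED CONFIG)
//   target_link_libraries(app mylib::mylib)
```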
file size • Acceptable performance with room for improvement • llama.cpp runs open-source, fine-tuned (by you) models, no restrictions • For real apps, maybe prefer a JNI layer exposing only what you use Conclusions
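A thin JNI surface could look like the sketch below; all names are hypothetical, and the native side would wrap only the llama.cpp calls the app actually needs:

```kotlin
// Minimal JNI surface: expose only what you use.
object NativeLlm {
    init {
        System.loadLibrary("myllm")  // loads libmyllm.so built via cmake (hypothetical name)
    }

    external fun loadModel(path: String): Long                          // returns a native handle
    external fun generate(handle: Long, prompt: String, maxTokens: Int): String
    external fun free(handle: Long)                                     // release native resources
}

// Usage, e.g. from a background thread or coroutine:
//   val handle = NativeLlm.loadModel("/data/local/tmp/model.gguf")
//   val reply = NativeLlm.generate(handle, "Hello", 64)
//   NativeLlm.free(handle)
```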