
Things you never dared to ask about LLMs — v2

Large Language Models (LLMs) have taken the world by storm, powering applications from chatbots to content generation.
Yet, beneath the surface, these models remain enigmatic.

This presentation will “delve” into the hidden corners of LLM technology that often leave developers scratching their heads.
It’s time to ask those questions you’ve never dared ask about the mysteries underpinning LLMs.

Here are some questions we’ll answer:
Do you wonder why LLMs spit out tokens instead of words? Where do those tokens come from?
What’s the difference between a “foundation” / “pre-trained” model, and an “instruction-tuned” one?
We’re often tweaking (hyper)parameters like temperature, top-p, and top-k, but do you know how they really affect how tokens are picked?
Quantization makes models smaller, but what are all those number encodings like fp32, bfloat16, int8, etc.?
LLMs are good at translation, right? Do you speak the Base64 language too?

We’ll realize together that LLMs are far from perfect:
We’ve all heard about hallucinations, or should we say confabulations?
What is this reversal curse that makes LLMs ignore some facts from a different viewpoint?
You’d think that LLMs are deterministic at low temperature, but you’d be surprised by how the context influences LLMs’ answers…

Buckle up, it’s time to dispel the magic of LLMs, and ask those questions we never dared to ask!

Guillaume Laforge

May 26, 2025

Transcript

  1. Things you never dared to ask about LLMs – Guillaume Laforge, Developer Advocate. “Two R’s or not two R’s, that is the question.” glaforge · glaforge.dev · @glaforge · @glaforge.dev · @[email protected]
  2. LLM, the big black box with lots of potential (beyond just chat…)
    LANGUAGE • Writing • Summarization • Classification • Ideation • Sentiment analysis • Data extraction • Semantic search
    CODING • Code completion • Code generation • Code chat • PR review • Security & quality suggestions
    SPEECH • Speech to text • Text to speech • Audio transcription • Live voice multimodal streaming chat
    VISION & VIDEO • Image generation • Image editing • Captioning • Image Q&A • Image search • Video description • Video generation
  3. Lots & lots of tokens… LLMs process language in chunks… or actually, tokens. GPT-4 is rumored to have been trained on ~13 trillion tokens. Roughly: 4 tokens == 3 words. A human reading 8 hours a day would need about 450,000 years to read 13 trillion tokens.
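    A quick back-of-the-envelope check of that reading-time figure, under an assumption that is mine rather than the slide's: a slow, careful reading speed of about 125 words per minute.

    # Rough check of the "450,000 years" claim.
    # Assumption (not stated on the slide): ~125 words per minute reading speed.
    tokens = 13e12
    words = tokens * 3 / 4        # "4 tokens == 3 words"
    minutes = words / 125         # assumed reading speed
    days = minutes / 60 / 8       # reading 8 hours a day
    years = days / 365
    print(f"{years:,.0f} years")  # ≈ 445,000 years, in the ballpark of the slide's 450,000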
  4. Why tokens and not letters? Or even words? For efficiency! Better for memory, and it keeps attention high. Tokens are the sweet spot between letters and words. reddit.com/r/learnmachinelearning/comments/1d0sopa/why_not_use_words_instead_of_tokens
  5. How does tokenization work? Most common algorithms:
    • BPE (Byte-Pair Encoding), used by GPTs
    • WordPiece, used by BERT
    • Unigram, often used in SentencePiece
    • SentencePiece, used by Gemini & Gemma
    Some require pre-tokenization, or don’t offer reversible tokenization. (A tokenizer sketch follows below.) huggingface.co/learn/nlp-course/chapter6/4#algorithm-overview
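    As a hands-on illustration, here is a minimal sketch using OpenAI's tiktoken library to inspect a BPE encoding; the exact token splits depend on the chosen encoding and are not guaranteed to match the examples in these slides.

    # Inspecting BPE tokenization with the tiktoken library (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a BPE encoding used by GPT-4-era models
    text = "Tokenization splits unrelated words into subword units."
    token_ids = enc.encode(text)

    # Show which byte sequence each token id maps to
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(token_ids)                      # integer token ids
    print(pieces)                         # the corresponding "sub-word" byte chunks
    print(enc.decode(token_ids) == text)  # this BPE tokenization is reversible: True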
  6. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at. Split: u-n-re-l-a-t-e-d
  7. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at. Split: u-n-re-l-at-e-d
  8. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed. Split: u-n-re-l-at-e-d
  9. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed. Split: u-n-re-l-at-ed
  10. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un. Split: u-n-re-l-at-ed
  11. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un. Split: un-re-l-at-ed
  12. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated. Split: un-re-l-at-ed
  13. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated. Split: un-re-l-ated
  14. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel. Split: un-re-l-ated
  15. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel. Split: un-rel-ated
  16. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel, rel + ated → related. Split: un-rel-ated
  17. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel, rel + ated → related. Split: un-related
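    The merge progression from slides 6-17 can be reproduced with a small sketch that simply applies the slides' merge rules in order; this is an illustration, not a real BPE tokenizer (no training, no byte-level handling).

    # Applying the BPE merge rules from the slides to the word "unrelated".
    def apply_bpe_merges(symbols, merges):
        """Merge adjacent symbol pairs following the ordered merge rules."""
        for left, right in merges:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    merged.append(left + right)  # apply this merge rule
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
            print(f"after {left} + {right}: {'-'.join(symbols)}")
        return symbols

    # Merge rules in the order shown on the slides
    merges = [("r", "e"), ("a", "t"), ("e", "d"), ("u", "n"),
              ("at", "ed"), ("re", "l"), ("rel", "ated")]
    print(apply_bpe_merges(list("unrelated"), merges))  # ['un', 'related']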
  18. Weird ▁ character in front of tokens? ‘▁’ (U+2581, Lower One Eighth Block) is not your regular ‘_’: SentencePiece tokenization uses this character to denote white space. Example: Sentence, Piece, ▁token, ization, ▁uses, ▁this, ▁character, ▁to, ▁denote, ▁white, ▁space, . github.com/google/sentencepiece/tree/master#whitespace-is-treated-as-a-basic-symbol
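    A quick way to confirm that ‘▁’ is a distinct Unicode code point and not an underscore (plain Python, no SentencePiece model needed):

    # '▁' (U+2581, LOWER ONE EIGHTH BLOCK) vs the ASCII underscore '_' (U+005F)
    print("\u2581", hex(ord("\u2581")))  # ▁ 0x2581
    print("_", hex(ord("_")))            # _ 0x5f
    print("\u2581" == "_")               # False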
  19. Acronyms, abbreviations, and jargon are hard…
    RATP → R, ATP
    Devoxx → Devo, xx
    trihexyphenidyle → tri, he, xy, phen, id, yle
    Meum athamanticum → Me, um, ath, aman, ticum (Me, um, ath, aman, ticum?)
  20. How do LLMs know when to stop generating tokens? Because we told them so? LLM text generation stops when it has reached:
    • the max output tokens count
    • its <|endoftext|> or <end_of_turn> model tokens from its instruction tuning set
    (A decoding-loop sketch follows below.) www.louisbouchard.ai/how-llms-know-when-to-stop/
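    A minimal sketch of a decoding loop implementing both stopping criteria; next_token, the stop-token names, and the limit are placeholders here, not a real model API.

    # Decoding loop with the two stopping criteria mentioned above.
    from typing import Callable, List

    STOP_TOKENS = {"<|endoftext|>", "<end_of_turn>"}  # special tokens learned during tuning
    MAX_OUTPUT_TOKENS = 256                           # cap set by the caller / API

    def generate(prompt: List[str], next_token: Callable[[List[str]], str]) -> List[str]:
        context, output = list(prompt), []
        while len(output) < MAX_OUTPUT_TOKENS:        # criterion 1: max output tokens count
            token = next_token(context)
            if token in STOP_TOKENS:                  # criterion 2: end-of-text / end-of-turn token
                break
            output.append(token)
            context.append(token)
        return output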
  21. Choosing the next token: “The cat is a…” Candidate next tokens have different probabilities of being picked (e.g. a 56% chance for one, a 2% chance for another).
  22. Top-P (cumulative probability), also called “nucleus sampling”. “The cat is a…” With top-p = 0.9, pick the smallest set of most-likely tokens whose cumulative probability exceeds 0.9: here 0.56 + 0.28 + 0.11 = 0.95 > 0.9.
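    A rough sketch of how temperature, top-K, and top-P shape the choice of the next token; illustrative only, since real implementations apply this to the model's logits inside the decoding loop.

    # Next-token sampling with temperature, top-k and top-p (nucleus) filtering.
    import numpy as np

    def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
        logits = logits / max(temperature, 1e-6)  # lower temperature => sharper distribution
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()

        order = np.argsort(probs)[::-1]           # candidate tokens, most likely first
        p = probs[order]
        if top_k > 0:                             # top-k: keep only the k most likely tokens
            p[top_k:] = 0.0
        if top_p < 1.0:                           # top-p: keep the smallest prefix whose
            cutoff = np.searchsorted(np.cumsum(p), top_p) + 1  # cumulative prob. exceeds top_p
            p[cutoff:] = 0.0
        p /= p.sum()                              # renormalize what is left
        return int(np.random.choice(order, p=p))

    # Probabilities from the slide: 0.56 + 0.28 + 0.11 = 0.95 > 0.9, so top-p=0.9 keeps 3 tokens
    logits = np.log(np.array([0.56, 0.28, 0.11, 0.03, 0.02]))
    print(sample_next_token(logits, temperature=1.0, top_k=4, top_p=0.9))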
  23. Deterministic… to a certain extent. If I set temperature at 0 and top-K at 1, do I have a deterministic output? What about if there’s a seed parameter? Well, no, because of:
    • Floating point numbers’ (non-)associativity: (a + b) + c ≠ a + (b + c) due to rounding errors (a quick check follows below)
    • Parallel execution order: GPUs reorder reductions (like sums) across threads and cores
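    The floating-point part is easy to verify with plain Python floats (IEEE 754 doubles):

    # Rounding makes addition non-associative: the grouping changes the result.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                 # 0.6000000000000001
    print(a + (b + c))                 # 0.6
    print((a + b) + c == a + (b + c))  # False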
  24. You can convince an LLM it’s wrong! Even when it was actually right in the first place! “Damn, I was right this time!”
  25. There are other things I’m not good at… …and I can’t forget what I know! ⚠ Pay attention to the data you feed to your LLM: 🚰 fine-tuning with customer data can leak private information, and LLMs don’t know about ACLs, dates, and more.