Things you never dared to ask about LLMs

Things you never dared to ask about LLMs
Two R’s or not two R’s, that is the question
Guillaume Laforge, Developer Advocate
@[email protected]

🚧 Be warned 🚧
1. I don’t have a PhD in Machine Learning.
2. This talk may contain hallucinations!
3. Please correct me if I’m wrong 🙏 I’m eager to learn.

How many R’s in “strawberry”? Two or Three?

How many R’s in “strawberry”?
LLMs process language in chunks, not individual letters.

LLMs process language in chunks… or actually tokens.
Lots & lots of tokens…
• Roughly: 4 tokens == 3 words
• GPT-4 is rumored to have been trained on ~2 trillion tokens.
• A human reading 8 hours a day would need 44,000 years to read 2 trillion tokens.
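The reading-time figure can be sanity-checked with a few lines of arithmetic. This is only a rough sketch: the slide does not state a reading speed, so the 200 words per minute used here is an assumption, and with it the estimate lands in the same range as the 44,000 years quoted above.

```python
# Back-of-the-envelope check of the reading-time estimate.
TOKENS = 2e12              # ~2 trillion tokens (figure from the slide)
WORDS_PER_TOKEN = 3 / 4    # "4 tokens == 3 words" (from the slide)
WORDS_PER_MINUTE = 200     # ASSUMPTION: typical adult reading speed, not from the slide
HOURS_PER_DAY = 8          # from the slide

words = TOKENS * WORDS_PER_TOKEN
minutes = words / WORDS_PER_MINUTE
years = minutes / 60 / HOURS_PER_DAY / 365

print(f"{years:,.0f} years of reading")
```

A slightly slower assumed reading speed (around 195 words per minute) gives a result closer to 44,000 years.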

LLMs reason on tokens, not letters
x.com/karpathy/status/1816637781659254908

What you see… What the LLM sees…
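To make the contrast concrete, here is a toy sketch of the two views. The vocabulary, the way “strawberry” is split, and the token IDs are all hypothetical, chosen for illustration; a real tokenizer has its own vocabulary and may split the word differently.

```python
# Toy greedy longest-match tokenizer over a HYPOTHETICAL vocabulary.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}  # made-up pieces and IDs

def tokenize(text, vocab):
    """Split text into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

word = "strawberry"
pieces = tokenize(word, TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]

print("What you see:     ", list(word), "->", word.count("r"), "R's")
print("What the LLM sees:", ids, "(pieces:", pieces, ")")
```

Counting letters is trivial on the character view, but the model only receives the ID sequence, in which the letter R never appears as a separate unit.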

Why tokens and not letters? Or even words?
For efficiency!
• Better for memory
• Keeps attention high
• Tokens are the sweet spot between letters and words
reddit.com/r/learnmachinelearning/comments/1d0sopa/why_not_use_words_instead_of_tokens

How does tokenization work?
Most common algorithms:
• BPE (Byte-Pair Encoding), used by GPTs
• WordPiece, used by BERT
• Unigram, often used in SentencePiece
• SentencePiece, used by Gemini & Gemma
Some require pre-tokenization, or don’t offer reversible tokenization.
huggingface.co/learn/nlp-course/chapter6/4#algorithm-overview

Zooming on Byte-Pair Encoding
Start from individual characters: u-n-r-e-l-a-t-e-d
Merge rules, applied one after the other:
• r + e → re: u-n-re-l-a-t-e-d
• a + t → at: u-n-re-l-at-e-d
• e + d → ed: u-n-re-l-at-ed
• u + n → un: un-re-l-at-ed
• at + ed → ated: un-re-l-ated
• re + l → rel: un-rel-ated
• rel + ated → related: un-related
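The merge walkthrough above can be replayed in code. This is a minimal sketch of the merge-application step only, using the merge rules from the slides; how those rules are learned from a corpus is not shown, and the function name is just illustrative.

```python
# Apply an ordered list of BPE merge rules to a word split into characters.
MERGE_RULES = [
    ("r", "e"), ("a", "t"), ("e", "d"), ("u", "n"),
    ("at", "ed"), ("re", "l"), ("rel", "ated"),
]

def apply_merges(symbols, merge_rules):
    """Fuse every adjacent (left, right) pair, one rule at a time, in order."""
    history = ["-".join(symbols)]
    for left, right in merge_rules:
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
        history.append("-".join(symbols))
    return symbols, history

tokens, history = apply_merges(list("unrelated"), MERGE_RULES)
for step in history:
    print(step)
```

Each printed line corresponds to one step of the walkthrough, ending with the two tokens `un` and `related`.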

Weird ▁ character in front of tokens?
‘▁’ is not your regular ‘_’: it is (U+2581) Lower One Eighth Block.
SentencePiece uses this character to denote white space.
Sentence, Piece, ▁uses, ▁this, ▁character, ▁to, ▁denote, ▁white, ▁space, .
github.com/google/sentencepiece/tree/master#whitespace-is-treated-as-a-basic-symbol
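Treating white space as an ordinary symbol is what lets the pieces be glued back into the original text. The sketch below only mimics that idea with plain string operations; it does not call the SentencePiece library, and the helper names are made up for the example.

```python
# Mimic the whitespace marker with plain string operations (not the real library).
WHITESPACE_MARKER = "\u2581"  # '▁' LOWER ONE EIGHTH BLOCK, not the underscore '_'

def mark_whitespace(text):
    """Replace each space with the visible marker before splitting into pieces."""
    return text.replace(" ", WHITESPACE_MARKER)

def restore_text(pieces):
    """Concatenate pieces and turn the markers back into spaces."""
    return "".join(pieces).replace(WHITESPACE_MARKER, " ")

# The pieces shown on the slide.
PIECES = ["Sentence", "Piece", "▁uses", "▁this", "▁character",
          "▁to", "▁denote", "▁white", "▁space", "."]

print(mark_whitespace("white space"))
print(restore_text(PIECES))
```

Because the marker travels inside the tokens, no separate rule is needed to decide where spaces go when decoding.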

Stochastic parrots
A very talkative and imaginative parrot! Speaking endlessly…
Blah, blah, blah…

Foundation or instruction-tuned models?

How do LLMs know when to stop generating tokens?
Because we told them so? LLM text generation stops when it reaches:
• the max output tokens count
• an end token such as <|endoftext|> or <end_of_turn>, learned from its instruction tuning set
www.louisbouchard.ai/how-llms-know-when-to-stop/
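The two stopping conditions can be sketched as a generation loop. The “model” here is a scripted stand-in that replays a fixed list of tokens, purely for illustration; the function names are hypothetical and no real inference API is involved.

```python
# Toy generation loop showing the two stopping conditions.
END_TOKEN = "<|endoftext|>"

def scripted_model(script):
    """Stand-in for a model: returns the next token of a fixed script."""
    def next_token(tokens_so_far):
        position = len(tokens_so_far)
        return script[position] if position < len(script) else END_TOKEN
    return next_token

def generate(next_token, max_output_tokens):
    """Generate until the end token is produced or the token budget is spent."""
    output = []
    while len(output) < max_output_tokens:
        token = next_token(output)
        if token == END_TOKEN:
            return output, "end token"
        output.append(token)
    return output, "max output tokens"

model = scripted_model(["The", " cat", " sat", ".", END_TOKEN])
full_output, full_reason = generate(model, max_output_tokens=10)
cut_output, cut_reason = generate(model, max_output_tokens=2)
print(full_output, full_reason)
print(cut_output, cut_reason)
```

With a generous budget the loop ends on the end token; with a budget of 2 it is cut off mid-sentence, which is what a truncated answer looks like in practice.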

How do LLMs choose the next token? I’m just playing
dice!

Choosing the next token
The cat is a…
[Chart: candidate next tokens with their probabilities, from a 56% chance of being picked down to a 2% chance of being picked]
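Picking the next token amounts to a weighted dice roll over the candidates. In this sketch the candidate words are hypothetical, and only some of the probabilities (0.56, 0.28, 0.11 and 0.02) come from the slides; the remaining 0.03 is filled in so the distribution sums to 1.

```python
import random
from collections import Counter

# HYPOTHETICAL candidates for "The cat is a…"; words are invented for illustration.
NEXT_TOKEN_PROBS = {
    "pet": 0.56,
    "feline": 0.28,
    "mammal": 0.11,
    "predator": 0.03,  # filler value so the probabilities sum to 1
    "robot": 0.02,
}

def sample_next_token(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = list(probs.values())
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(42)  # seeded so the demo is repeatable
counts = Counter(sample_next_token(NEXT_TOKEN_PROBS, rng) for _ in range(10_000))
for token, count in counts.most_common():
    print(f"{token:10s} {count / 10_000:.2%}")
```

Over many draws the observed frequencies approach the probabilities, but any single draw can still land on an unlikely token.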

Flying through hyperspace… Err… Hyperparameters To max output tokens, and
beyond!

Top-K: just pick from the top k tokens
The cat is a…
Pick the top 2
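A minimal sketch of top-K filtering: keep the k most probable candidates, renormalize, then sample among them. The candidate words and the 0.03 filler probability are hypothetical; 0.56, 0.28, 0.11 and 0.02 are the values shown on the slides.

```python
# Top-K filtering over a HYPOTHETICAL next-token distribution.
NEXT_TOKEN_PROBS = {"pet": 0.56, "feline": 0.28, "mammal": 0.11,
                    "predator": 0.03, "robot": 0.02}

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize so they sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {token: p / total for token, p in top}

top2 = top_k_filter(NEXT_TOKEN_PROBS, k=2)
print(top2)
```

With k = 1 only the most likely token survives, which is why top-K at 1 is often described as greedy decoding.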

Top-P: cumulative probability
Also called “nucleus sampling”
The cat is a…
Pick top 0.9: 0.56 + 0.28 + 0.11 = 0.95 > 0.9
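A minimal sketch of top-P filtering: walk down the candidates from most to least probable and stop as soon as the running total reaches P. The candidate words and the 0.03 filler probability are hypothetical; the other probabilities are the ones on the slides.

```python
# Top-P (nucleus) filtering over a HYPOTHETICAL next-token distribution.
NEXT_TOKEN_PROBS = {"pet": 0.56, "feline": 0.28, "mammal": 0.11,
                    "predator": 0.03, "robot": 0.02}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    kept = {}
    cumulative = 0.0
    for token, prob in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    return {token: prob / total for token, prob in kept.items()}

nucleus = top_p_filter(NEXT_TOKEN_PROBS, p=0.9)
print(nucleus)
```

Unlike top-K, the number of surviving tokens adapts to the shape of the distribution: a confident model keeps few candidates, an uncertain one keeps many.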

What is temperature exactly? It’s hot, that’s all I know!

Temperature: flattening / sharpening the curve
Temp = 0.2
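One common way to apply temperature is to divide the model's raw scores (logits) by it before the softmax. The sketch below assumes that formulation, and the logit values are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [logit / temperature for logit in logits]
    highest = max(scaled)                      # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

LOGITS = [2.0, 1.3, 0.4, -0.9, -1.3]           # HYPOTHETICAL scores for 5 candidate tokens

sharp = softmax_with_temperature(LOGITS, 0.2)    # low temperature: sharper curve
neutral = softmax_with_temperature(LOGITS, 1.0)  # unchanged distribution
flat = softmax_with_temperature(LOGITS, 2.0)     # high temperature: flatter curve

for name, probs in [("T=0.2", sharp), ("T=1.0", neutral), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

At 0.2 almost all of the probability mass moves to the top token; at 2.0 the less likely tokens get a much better chance of being picked.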

Some more space LLM oddities

Deterministic… to a certain extent
If I set temperature to 0 and top-K to 1, do I have a deterministic output? What about if there’s a seed parameter?
Well, no, because of:
• Floating point numbers (non-)associativity: (a + b) + c ≠ a + (b + c) due to rounding errors
• Parallel execution order: GPUs reorder reductions (like sums) in threads and cores
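The floating point point is easy to reproduce on a CPU. This sketch only shows that the grouping of additions changes the result; it does not simulate a GPU, where the grouping is decided by how the work happens to be split across threads.

```python
# Floating point addition is not associative: grouping changes the result.
a, b, c = 0.1, 0.2, 0.3
left_first = (a + b) + c
right_first = a + (b + c)
print(left_first, right_first, left_first == right_first)

# A more dramatic case: the small value is lost when added to a huge one first.
values = [1e16, 1.0, -1e16]
in_order = (values[0] + values[1]) + values[2]   # 1.0 is rounded away
reordered = (values[0] + values[2]) + values[1]  # huge values cancel first
print(in_order, reordered)
```

If two tokens have nearly identical scores, a difference this small is enough to flip which one ranks first, and from there the generated text can diverge.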

Hallucinations or confabulation?
• Anthropomorphism
• Fill the gaps in one’s memory
High token probability, or more deterministic output, doesn’t mean being correct.
news.ycombinator.com/item?id=33841672
www.linkedin.com/pulse/hallucinating-confabulating-peter-mcelwaine-johnn/

You can convince an LLM it’s wrong!
Even when it was actually right in the first place!
Damn, I was right this time!

Context matters, LLMs are easily influenced… Mbappé, of course!

The reversal curse
arxiv.org/pdf/2309.12288
the-decoder.com/language-models-know-tom-cruises-mother-but-not-her-son/

Do you speak Base64?
Base64 is just like any other (human) language!
RW5qb3kgZG90QUkh
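For readers who do not speak Base64 fluently, the string on the slide (written without any white space) can be decoded with Python's standard library:

```python
import base64

ENCODED = "RW5qb3kgZG90QUkh"  # the Base64 string from the slide

decoded = base64.b64decode(ENCODED).decode("utf-8")
print(decoded)

# Encoding the decoded text again gives back the same string.
round_trip = base64.b64encode(decoded.encode("utf-8")).decode("ascii")
print(round_trip == ENCODED)
```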

Thanks for your attention
It’s all you need!
@[email protected]

Illustrations courtesy of Imagen 3
