
Things you never dared to ask about LLMs — v2

Large Language Models (LLMs) have taken the world by storm, powering applications from chatbots to content generation.
Yet, beneath the surface, these models remain enigmatic.

This presentation will “delve” into the hidden corners of LLM technology that often leave developers scratching their heads.
It’s time to ask those questions you’ve never dared ask about the mysteries underpinning LLMs.

Here are some questions we’ll answer:
Do you wonder why LLMs spit out tokens instead of words? Where do those tokens come from?
What’s the difference between a “foundation” / “pre-trained” model, and an “instruction-tuned” one?
We’re often tweaking (hyper)parameters like temperature, top-p, and top-k, but do you know how they really affect how tokens are picked?
Quantization makes models smaller, but what are all those number encodings like fp32, bfloat16, int8, etc.?
LLMs are good at translation, right? Do you speak the Base64 language too?

We’ll realize together that LLMs are far from perfect:
We’ve all heard about hallucinations, or should we say confabulations?
What is this reversal curse that makes LLMs ignore some facts from a different viewpoint?
You’d think that LLMs are deterministic at low temperature, but you’d be surprised by how the context influences LLMs’ answers…

Buckle up, it’s time to dispel the magic of LLMs, and ask those questions we never dared to ask!

Guillaume Laforge

May 26, 2025

Transcript

  1. Things you never dared to ask about LLMs – Guillaume Laforge, Developer Advocate. “Two R’s or not two R’s, that is the question.” glaforge · glaforge.dev · @glaforge · @glaforge.dev · @[email protected]
  2. LLM, the big black box with lots of potential (beyond just chat…)
    LANGUAGE • Writing • Summarization • Classification • Ideation • Sentiment analysis • Data extraction • Semantic search
    CODING • Code completion • Code generation • Code chat • PR review • Security & quality suggestions
    SPEECH • Speech to text • Text to speech • Audio transcription • Live voice multimodal streaming chat
    VISION & VIDEO • Image generation • Image editing • Captioning • Image Q&A • Image search • Video description • Video generation
  3. Lots & lots of tokens… LLMs process language in chunks… or actually, tokens. GPT-4 is rumored to have been trained on ~13 trillion tokens. Roughly: 4 tokens == 3 words. A human reading 8 hours a day would need about 450,000 years to read 13 trillion tokens.
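    A quick back-of-the-envelope check of that reading-time figure, under an assumption that is mine rather than the slide's: a slow, careful reading speed of about 125 words per minute.

    # Rough check of the "450,000 years" claim.
    # Assumption (not stated on the slide): ~125 words per minute reading speed.
    tokens = 13e12
    words = tokens * 3 / 4        # "4 tokens == 3 words"
    minutes = words / 125         # assumed reading speed
    days = minutes / 60 / 8       # reading 8 hours a day
    years = days / 365
    print(f"{years:,.0f} years")  # ≈ 445,000 years, in the ballpark of the slide's 450,000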
  4. Why tokens and not letters? Or even words? For efficiency! Better for memory, and it keeps attention high. Tokens are the sweet spot between letters and words. reddit.com/r/learnmachinelearning/comments/1d0sopa/why_not_use_words_instead_of_tokens
  5. How does tokenization work? Most common algorithms:
    • BPE (Byte-Pair Encoding), used by GPTs
    • WordPiece, used by BERT
    • Unigram, often used in SentencePiece
    • SentencePiece, used by Gemini & Gemma
    Some require pre-tokenization, or don’t offer reversible tokenization. (A tokenizer sketch follows below.) huggingface.co/learn/nlp-course/chapter6/4#algorithm-overview
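    As a hands-on illustration, here is a minimal sketch using OpenAI's tiktoken library to inspect a BPE encoding; the exact token splits depend on the chosen encoding and are not guaranteed to match the examples in these slides.

    # Inspecting BPE tokenization with the tiktoken library (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")  # a BPE encoding used by GPT-4-era models
    text = "Tokenization splits unrelated words into subword units."
    token_ids = enc.encode(text)

    # Show which byte sequence each token id maps to
    pieces = [enc.decode_single_token_bytes(t) for t in token_ids]
    print(token_ids)                      # integer token ids
    print(pieces)                         # the corresponding "sub-word" byte chunks
    print(enc.decode(token_ids) == text)  # this BPE tokenization is reversible: True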
  6. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at. Split: u-n-re-l-a-t-e-d
  7. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at. Split: u-n-re-l-at-e-d
  8. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed. Split: u-n-re-l-at-e-d
  9. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed. Split: u-n-re-l-at-ed
  10. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un. Split: u-n-re-l-at-ed
  11. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un. Split: un-re-l-at-ed
  12. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated. Split: un-re-l-at-ed
  13. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated. Split: un-re-l-ated
  14. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel. Split: un-re-l-ated
  15. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel. Split: un-rel-ated
  16. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel, rel + ated → related. Split: un-rel-ated
  17. Zooming on Byte-Pair Encoding. Merge rules: r + e → re, a + t → at, e + d → ed, u + n → un, at + ed → ated, re + l → rel, rel + ated → related. Split: un-related
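    The merge progression from slides 6-17 can be reproduced with a small sketch that simply applies the slides' merge rules in order; this is an illustration, not a real BPE tokenizer (no training, no byte-level handling).

    # Applying the BPE merge rules from the slides to the word "unrelated".
    def apply_bpe_merges(symbols, merges):
        """Merge adjacent symbol pairs following the ordered merge rules."""
        for left, right in merges:
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                    merged.append(left + right)  # apply this merge rule
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            symbols = merged
            print(f"after {left} + {right}: {'-'.join(symbols)}")
        return symbols

    # Merge rules in the order shown on the slides
    merges = [("r", "e"), ("a", "t"), ("e", "d"), ("u", "n"),
              ("at", "ed"), ("re", "l"), ("rel", "ated")]
    print(apply_bpe_merges(list("unrelated"), merges))  # ['un', 'related']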
  18. Weird ▁ character in front of tokens? ‘▁’ (U+2581, Lower One Eighth Block) is not your regular ‘_’: SentencePiece tokenization uses this character to denote white space. Example: Sentence, Piece, ▁token, ization, ▁uses, ▁this, ▁character, ▁to, ▁denote, ▁white, ▁space, . github.com/google/sentencepiece/tree/master#whitespace-is-treated-as-a-basic-symbol
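    A quick way to confirm that ‘▁’ is a distinct Unicode code point and not an underscore (plain Python, no SentencePiece model needed):

    # '▁' (U+2581, LOWER ONE EIGHTH BLOCK) vs the ASCII underscore '_' (U+005F)
    print("\u2581", hex(ord("\u2581")))  # ▁ 0x2581
    print("_", hex(ord("_")))            # _ 0x5f
    print("\u2581" == "_")               # False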
  19. Acronyms, abbreviations, and jargon are hard…
    RATP → R, ATP
    Devoxx → Devo, xx
    trihexyphenidyle → tri, he, xy, phen, id, yle
    Meum athamanticum → Me, um, ath, aman, ticum (Me, um, ath, aman, ticum?)
  20. How do LLMs know when to stop generating tokens? Because we told them so? LLM text generation stops when it has reached:
    • the max output tokens count
    • its <|endoftext|> or <end_of_turn> model tokens from its instruction tuning set
    (A decoding-loop sketch follows below.) www.louisbouchard.ai/how-llms-know-when-to-stop/
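    A minimal sketch of a decoding loop implementing both stopping criteria; next_token, the stop-token names, and the limit are placeholders here, not a real model API.

    # Decoding loop with the two stopping criteria mentioned above.
    from typing import Callable, List

    STOP_TOKENS = {"<|endoftext|>", "<end_of_turn>"}  # special tokens learned during tuning
    MAX_OUTPUT_TOKENS = 256                           # cap set by the caller / API

    def generate(prompt: List[str], next_token: Callable[[List[str]], str]) -> List[str]:
        context, output = list(prompt), []
        while len(output) < MAX_OUTPUT_TOKENS:        # criterion 1: max output tokens count
            token = next_token(context)
            if token in STOP_TOKENS:                  # criterion 2: end-of-text / end-of-turn token
                break
            output.append(token)
            context.append(token)
        return output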
  21. Choosing the next token: “The cat is a…” Candidate next tokens have different probabilities of being picked (e.g. a 56% chance for one, a 2% chance for another).
  22. Top-P (cumulative probability), also called “nucleus sampling”. “The cat is a…” With top-p = 0.9, pick the smallest set of most-likely tokens whose cumulative probability exceeds 0.9: here 0.56 + 0.28 + 0.11 = 0.95 > 0.9.
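    A rough sketch of how temperature, top-K, and top-P shape the choice of the next token; illustrative only, since real implementations apply this to the model's logits inside the decoding loop.

    # Next-token sampling with temperature, top-k and top-p (nucleus) filtering.
    import numpy as np

    def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
        logits = logits / max(temperature, 1e-6)  # lower temperature => sharper distribution
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()

        order = np.argsort(probs)[::-1]           # candidate tokens, most likely first
        p = probs[order]
        if top_k > 0:                             # top-k: keep only the k most likely tokens
            p[top_k:] = 0.0
        if top_p < 1.0:                           # top-p: keep the smallest prefix whose
            cutoff = np.searchsorted(np.cumsum(p), top_p) + 1  # cumulative prob. exceeds top_p
            p[cutoff:] = 0.0
        p /= p.sum()                              # renormalize what is left
        return int(np.random.choice(order, p=p))

    # Probabilities from the slide: 0.56 + 0.28 + 0.11 = 0.95 > 0.9, so top-p=0.9 keeps 3 tokens
    logits = np.log(np.array([0.56, 0.28, 0.11, 0.03, 0.02]))
    print(sample_next_token(logits, temperature=1.0, top_k=4, top_p=0.9))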
  23. Deterministic… to a certain extent. If I set temperature at 0 and top-K at 1, do I have a deterministic output? What about if there’s a seed parameter? Well, no, because of:
    • Floating point numbers’ (non-)associativity: (a + b) + c ≠ a + (b + c) due to rounding errors (a quick check follows below)
    • Parallel execution order: GPUs reorder reductions (like sums) across threads and cores
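    The floating-point part is easy to verify with plain Python floats (IEEE 754 doubles):

    # Rounding makes addition non-associative: the grouping changes the result.
    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                 # 0.6000000000000001
    print(a + (b + c))                 # 0.6
    print((a + b) + c == a + (b + c))  # False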
  24. You can convince an LLM it’s wrong! Even when it was actually right in the first place! “Damn, I was right this time!”
  25. There are other things I’m not good at… …and I can’t forget what I know! ⚠ Pay attention to the data you feed to your LLM: 🚰 fine-tuning with customer data can leak private information, and LLMs don’t know about ACLs, dates, and more.