
Valuable Lessons Learned on Kaggle’s ARC AGI LLM challenge

ianozsvald
November 04, 2024

Having worked on Kaggle's LLM-based ARC AGI program-writing challenge for 6 months using Llama 3, I'll give reflections on the lessons learned: making an automatic program generator, evaluating it, coming up with strong representations for the challenge, chain-of-thought and program-of-thought styles, and some multi-stage critical-thinking approaches. You'll get ideas for how to tune your own prompts, plus shortcuts to help you evaluate your own LLM usage with greater assurance in the face of non-deterministic outcomes.

Given at: https://www.meetup.com/pydata-london-meetup/events/304178729/?eventOrigin=group_upcoming_events

Get updates via: https://notanumber.email/


Transcript

  1. Strategist/Trainer/Speaker/Author, 20+ years. Slight LLM fanatic
     now. Interim Chief Data Scientist. 3rd Edition!
  2. 6 months of evening work on ARC AGI. Do you have the right text
     representation? Can we score (feedback) our way to the truth? Might
     we have an interesting collaboration? What are some LLM limits?
     https://arcprize.org/play
  3. Abstraction & Reasoning Challenge (ARC AGI) on Kaggle. Can LLMs
     reason? Abstract JSON “initial → target” grids: 400 public problems
     with 3-4 examples each plus 1 test case, and 100 private problems
     on Kaggle (so no OpenAI etc.).
  4. Llama 3.0. The 8B model (Q8) is 10 GB at ~20 sec/problem; the 70B
     (Q4) is 40 GB at ~3 min/problem. Can I solve simpler challenges on
     an RTX 3090 (24 GB)? Why not harder challenges?
  5. Overall process (repeat many times): generate a prompt, adding
     specific hints per problem; call Llama.cpp and get a response; run
     the generated code through a subprocess (clean environment); if it
     runs, check whether it got it right, else capture the exception; if
     valid, try it separately on the test example (writing everything to
     a db). A sketch of this loop follows below.
  6. 42% of solutions: pretty good! It counts! Comments! Reasonable
     numpy! Correct substitution! They correctly solve the train and
     test problems.
  7. Do many independent runs: repeat 50 times (70B, overnight) or 250
     times (8B, hours).
  8. What about representation? Maybe a JSON list isn’t optimal? What
     about a block of numbers? Separated numbers? Or CSV-like? Or Excel?
     AUDIENCE: WHICH IS BEST?
  9. What about representation? JSON-only (J): 12% success. JSON+number
     block: 42%. JSON+numbers: 30%. JSON+Excel: 14%. JSON+CSV-like: 54%.
     My early “local optimum”. Also tried combos, single quotes, double
     quotes, no commas etc. - all worse. (n=50, 70B Q4 model,
     9565186b.json, prompt fixed except for the representations.)
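     The slides don't show the exact formats, so the renderings below
     are guesses at what the "number block", "separated numbers",
     "CSV-like" and "Excel" variants of one grid might look like; the
     point is one small function per representation so they are trivial
     to swap in the prompt:

       import json

       def as_json(grid):
           # e.g. [[0, 1, 2], [3, 4, 5]]
           return json.dumps(grid)

       def as_number_block(grid):
           # rows of digits with no separators, e.g. "012" / "345"
           return "\n".join("".join(str(c) for c in row) for row in grid)

       def as_separated_numbers(grid):
           # space-separated cells, e.g. "0 1 2"
           return "\n".join(" ".join(str(c) for c in row) for row in grid)

       def as_csv_like(grid):
           # comma-separated cells, one row per line, e.g. "0,1,2"
           return "\n".join(",".join(str(c) for c in row) for row in grid)

       def as_excel(grid):
           # spreadsheet-style coordinates, e.g. "A1=0 B1=1 C1=2"
           # (assumes grids no wider than 26 columns)
           cols = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           return "\n".join(" ".join(f"{cols[x]}{y + 1}={v}"
                                     for x, v in enumerate(row))
                            for y, row in enumerate(grid))

       grid = [[0, 1, 2], [3, 4, 5]]
       # "JSON+CSV-like" pairs two renderings of the same grid:
       prompt_grid = as_json(grid) + "\n\n" + as_csv_like(grid)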
  10. Representation: just like in ML, features matter! Do you have a
      way to change your features? YAML / JSON / XML / descriptions /
      markdown / ...?
  11. What if we try to add feedback? We know we can have a
      conversation, so what if we tell it “that didn’t work, try
      better”? Does it actually improve if we show it the mistakes?
  12. Results for iteration: 54% baseline (no feedback, i.e. “1
      iteration”); say “do better next time” and it improves to ~65%.
      (n=50, 70B Q4 model, 9565186b.json, prompt fixed except for the
      feedback section.)
  13. Results for iteration + feedback: >75% success rate if we add
      guidance, and it’s faster than the previous method too. A sketch
      of the loop follows below.
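      A minimal sketch of the iteration + feedback loop, with a
      chat-style query_llm taking a message list and a hypothetical
      score helper that runs the code and diffs its output against the
      ground truth; the guidance string is illustrative, not the talk's
      exact wording:

        def iterate_with_feedback(problem, build_prompt, query_llm,
                                  score, max_iters=3):
            """Multi-turn loop: show the model its failed attempt plus
            concrete guidance, then ask it to try again."""
            messages = [{"role": "user", "content": build_prompt(problem)}]
            for _ in range(max_iters):
                code = query_llm(messages)
                ok, diff = score(code, problem)  # run code, diff vs truth
                if ok:
                    return code
                messages.append({"role": "assistant", "content": code})
                messages.append({"role": "user", "content": (
                    "That didn't work. Here is how your output differed "
                    f"from the target:\n{diff}\nFix the rule and try again.")})
            return None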
  14. Can you give feedback and iterate? We jumped from 12% to 75%
      correctness by changing the representation and by giving useful
      feedback. Code-based execution (rules/Python) plus an oracle feels
      pretty powerful versus a hallucinatory LLM.
  15. Next steps? Solutions fixate: e.g. on a85 it fixates on mode!
      Recording the history and requesting “don’t do this again” avoids
      the same mistake, but do we actually approach success? Try
      summarised successes as “seed” ideas? A sketch of history-aware
      prompting follows below.
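      One way the recorded history might feed back into the prompt (a
      sketch; failed_rules and seed_ideas would be pulled from the db of
      earlier logged runs, and both names are hypothetical):

        def prompt_with_history(problem, build_prompt, failed_rules,
                                seed_ideas):
            """Append 'don't do this again' history and summarised
            successes as seed ideas to the base prompt."""
            prompt = build_prompt(problem)
            if failed_rules:
                prompt += ("\nThese rules were already tried and failed - "
                           "do not propose them again:\n"
                           + "\n".join(f"- {r}" for r in failed_rules))
            if seed_ideas:
                prompt += ("\nIdeas that partially worked before, as seeds:\n"
                           + "\n".join(f"- {s}" for s in seed_ideas))
            return prompt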
  16. Conclusion: representations matter (like classical ML); scoring +
      iteration with ground truth enables improvement (like classical
      ML). I’m open to (work) collaborations. See my NotANumber.email
      for updates. https://arcprize.org/play
  17. Next steps: give it a history (“all this reasoning history hasn’t
      worked…”); try Llama 3.2 vision (vs “telling it what to focus
      on”); extract a library of “helper functions”? More
      scoring+feedback (e.g. mask->example)? But how to get it to “see”
      what’s there? https://arcprize.org/play
  18. Hints seem to be critical. Show hints. What happens if you have
      the wrong hint? And what’s the point if the human is solving the
      hard part of the problem?
  19. No hints – no joy (iteration + feedback). Given no hints, it'll
      come up with rules and keep modifying them, but it never “locks
      on” to a valid solution, even with iterations of self-critique.
      Example run:
      CUDA_VISIBLE_DEVICES=0 nice -n 1 python system4_iterations.py -p 9565186b.json -m Meta-Llama-3-70B-Instruct-IQ4_XS.gguf -i 50 -l -t 0.4 --prompt_iterations 3