
Valuable Lessons Learned on Kaggle’s ARC AGI LLM challenge

ianozsvald
November 04, 2024

Having worked on Kaggle's LLM-based ARC AGI program-writing challenge for 6 months using Llama 3, I'll give reflections on the lessons learned: making an automatic program generator, evaluating it, coming up with strong representations for the challenge, chain-of-thought and program-of-thought styles, and some multi-stage critical-thinking approaches. You'll get ideas for how to tune your own prompts, plus shortcuts to help you evaluate your own LLM usage with greater assurance in the face of non-deterministic outcomes.

Given at: https://www.meetup.com/pydata-london-meetup/events/304178729/?eventOrigin=group_upcoming_events

Get updates via: https://notanumber.email/


Transcript

  1. Strategist/Trainer/Speaker/Author, 20+ years. Slight LLM fanatic
     now. Interim Chief Data Scientist. 3rd Edition!
  2. 6 months of evening work on ARC AGI. Do you have the right text
     representation? Can we score (feedback) our way to the truth? Might
     we have an interesting collaboration? What are some LLM limits?
     https://arcprize.org/play
  3. Abstraction & Reasoning Challenge (ARC AGI) on Kaggle. Can LLMs
     reason? Abstract JSON “initial → target” grids: 400 public problems
     with 3-4 examples each plus 1 test case, and 100 private problems
     on Kaggle (so no OpenAI etc.).
  4. Llama 3.0. The 8B model (Q8) is 10 GB at ~20 sec/problem; the 70B
     (Q4) is 40 GB at ~3 min/problem. Can I solve simpler challenges on
     an RTX 3090 (24 GB)? Why not harder challenges?
  5. Overall process (repeat many times): generate a prompt, adding
     specific hints per problem; call Llama.cpp and get a response; run
     the generated code through a subprocess (clean environment); if it
     runs, check whether it got it right, else capture the exception; if
     valid, try it separately on the test example (writing everything to
     a db). A sketch of this loop follows below.
  6. 42% of solutions: pretty good! It counts! Comments! Reasonable
     numpy! Correct substitution! They correctly solve the train and
     test problems.
  7. Do many independent runs: repeat 50 times (70B, overnight) or 250
     times (8B, hours).
  8. What about representation? Maybe a JSON list isn’t optimal? What
     about a block of numbers? Separated numbers? Or CSV-like? Or Excel?
     AUDIENCE: WHICH IS BEST?
  9. What about representation? JSON-only (J): 12% success. JSON+number
     block: 42%. JSON+numbers: 30%. JSON+Excel: 14%. JSON+CSV-like: 54%.
     My early “local optimum”. Also tried combos, single quotes, double
     quotes, no commas etc. - all worse. (n=50, 70B Q4 model,
     9565186b.json, prompt fixed except for the representations.)
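     The slides don't show the exact formats, so the renderings below
     are guesses at what the "number block", "separated numbers",
     "CSV-like" and "Excel" variants of one grid might look like; the
     point is one small function per representation so they are trivial
     to swap in the prompt:

       import json

       def as_json(grid):
           # e.g. [[0, 1, 2], [3, 4, 5]]
           return json.dumps(grid)

       def as_number_block(grid):
           # rows of digits with no separators, e.g. "012" / "345"
           return "\n".join("".join(str(c) for c in row) for row in grid)

       def as_separated_numbers(grid):
           # space-separated cells, e.g. "0 1 2"
           return "\n".join(" ".join(str(c) for c in row) for row in grid)

       def as_csv_like(grid):
           # comma-separated cells, one row per line, e.g. "0,1,2"
           return "\n".join(",".join(str(c) for c in row) for row in grid)

       def as_excel(grid):
           # spreadsheet-style coordinates, e.g. "A1=0 B1=1 C1=2"
           # (assumes grids no wider than 26 columns)
           cols = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
           return "\n".join(" ".join(f"{cols[x]}{y + 1}={v}"
                                     for x, v in enumerate(row))
                            for y, row in enumerate(grid))

       grid = [[0, 1, 2], [3, 4, 5]]
       # "JSON+CSV-like" pairs two renderings of the same grid:
       prompt_grid = as_json(grid) + "\n\n" + as_csv_like(grid)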
  10. Representation: just like in ML, features matter! Do you have a
      way to change your features? YAML / JSON / XML / descriptions /
      markdown / ...?
  11. What if we try to add feedback? We know we can have a
      conversation, so what if we tell it “that didn’t work, try
      better”? Does it actually improve if we show it the mistakes?
  12. Results for iteration: 54% baseline (no feedback, i.e. “1
      iteration”); say “do better next time” and it improves to ~65%.
      (n=50, 70B Q4 model, 9565186b.json, prompt fixed except for the
      feedback section.)
  13. Results for iteration + feedback: >75% success rate if we add
      guidance, and it’s faster than the previous method too. A sketch
      of the loop follows below.
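      A minimal sketch of the iteration + feedback loop, with a
      chat-style query_llm taking a message list and a hypothetical
      score helper that runs the code and diffs its output against the
      ground truth; the guidance string is illustrative, not the talk's
      exact wording:

        def iterate_with_feedback(problem, build_prompt, query_llm,
                                  score, max_iters=3):
            """Multi-turn loop: show the model its failed attempt plus
            concrete guidance, then ask it to try again."""
            messages = [{"role": "user", "content": build_prompt(problem)}]
            for _ in range(max_iters):
                code = query_llm(messages)
                ok, diff = score(code, problem)  # run code, diff vs truth
                if ok:
                    return code
                messages.append({"role": "assistant", "content": code})
                messages.append({"role": "user", "content": (
                    "That didn't work. Here is how your output differed "
                    f"from the target:\n{diff}\nFix the rule and try again.")})
            return None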
  14. Can you give feedback and iterate? We jumped from 12% to 75%
      correctness by changing the representation and by giving useful
      feedback. Code-based execution (rules/Python) plus an oracle feels
      pretty powerful versus a hallucinatory LLM.
  15. Next steps? Solutions fixate: e.g. on a85 it fixates on mode!
      Recording the history and requesting “don’t do this again” avoids
      the same mistake, but do we actually approach success? Try
      summarised successes as “seed” ideas? A sketch of history-aware
      prompting follows below.
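      One way the recorded history might feed back into the prompt (a
      sketch; failed_rules and seed_ideas would be pulled from the db of
      earlier logged runs, and both names are hypothetical):

        def prompt_with_history(problem, build_prompt, failed_rules,
                                seed_ideas):
            """Append 'don't do this again' history and summarised
            successes as seed ideas to the base prompt."""
            prompt = build_prompt(problem)
            if failed_rules:
                prompt += ("\nThese rules were already tried and failed - "
                           "do not propose them again:\n"
                           + "\n".join(f"- {r}" for r in failed_rules))
            if seed_ideas:
                prompt += ("\nIdeas that partially worked before, as seeds:\n"
                           + "\n".join(f"- {s}" for s in seed_ideas))
            return prompt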
  16. Conclusion: representations matter (like classical ML); scoring +
      iteration with ground truth enables improvement (like classical
      ML). I’m open to (work) collaborations. See my NotANumber.email
      for updates. https://arcprize.org/play
  17. Next steps: give it a history (“all this reasoning history hasn’t
      worked…”); try Llama 3.2 vision (vs “telling it what to focus
      on”); extract a library of “helper functions”? More
      scoring+feedback (e.g. mask->example)? But how to get it to “see”
      what’s there? https://arcprize.org/play
  18. Hints seem to be critical. Show hints. What happens if you have
      the wrong hint? And what’s the point if the human is solving the
      hard part of the problem?
  19. No hints – no joy (iteration + feedback). Given no hints, it'll
      come up with rules and keep modifying them, but it never “locks
      on” to a valid solution, even with iterations of self-critique.
      Example run:
      CUDA_VISIBLE_DEVICES=0 nice -n 1 python system4_iterations.py -p 9565186b.json -m Meta-Llama-3-70B-Instruct-IQ4_XS.gguf -i 50 -l -t 0.4 --prompt_iterations 3