IA for theory

AI for Theory State of the art, uses and actions
for the community Gabriel Peyré CNRS and ENS, Université PSL July 2, 2026 github.com/gpeyre/ia4maths

Talk roadmap LLMs and mathematics AI for research in mathematics
Experience reports Impact on the community Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 2 / 20

AI for mathematics AI for scientific discovery PINNs to explore
PDEs: Navier–Stokes / Euler, instabilities, counterexamples. AI for theorem proving Lean: verified proof; Viazovska’s optimal E8 packing. Focus: mathematical reasoning, from informal to formal. Source: Wang–Lai–Gómez-Serrano–Buckmaster, arXiv:2201.06780 ; Hariharan et al., arXiv:2604.23468. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 3 / 20

LLMs for reasoning and mathematics 2017 2020 2022 2024 2025
2025 2026 Transformer architecture GPT-3 scale ChatGPT mass adoption Reasoning o1/o3 DeepSeek-R1 Gemini Deep Think Claude reasoning Agentic / code Codex Claude Code Mistral vibe OpenClaws Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 4 / 20

Why mathematics became central for LLMs Problem : mathematics as
a testbed for reasoning. ▶ Training: next-token prediction, evaluation oriented toward reasoning. ▶ Benchmarks: GSM8K, MATH, AIME/IMO-style as strategic signals. ▶ Reasoning: reinforcement learning and mathematical rewards. Benchmark snippets GSM8K word arithmetic: “Alice buys 3 notebooks... how many are left?” MATH contest problem: algebra / geometry / analysis, structured solution. AIME / IMO short statement, long search: determine, optimize, prove. Takeaway : agentic AI: local answer to full pipeline – plan, code, verify, write. Source: Cobbe et al. 2021; Hendrycks et al. 2021; OpenAI Codex 2025/2026. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 5 / 20

Simple problems: mathematical olympiads Results 2024: formal reasoning AlphaProof +
AlphaGeometry 2, translation to formal languages then proof: 28/42. Results 2025: informal reasoning Gemini Deep Think, general-purpose models and informal proofs: 35/42. Example of IMO 2025 statement Source: DeepMind 25/07/2024 and 21/07/2025; official IMO 2025 PDF (Deep Think). Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 6 / 20

Research level: First Proof, from pilot to benchmark Problem :
reproducible measurement of research-level proving. Task: unseen problems, available human solutions, full write-up; criterion: anonymized acceptance. Results : First Batch → Second Batch First Batch (02/2026) Signal: ∼2 clear successes (Q9,Q10). Gray zone: Q5/Q8 close or repairable; no formal review. Second Batch (03–06/2026) OK: 17/39 submissions passed. Coverage: 7/10 problems had ≥ 1 OK solution. Criterion: flawless or minor revisions after anonymized expert review. Takeaway : scale shift, but human validation remains central. ▶ Capability: serious proofs possible on nontrivial problems. ▶ Costs: batch 2, $117–$4.8k for 10 answers; ∼$10–$480/answer. ▶ Bottleneck: expert review, skipped lemmas, “standard arguments”, fragile citations. ▶ Benchmark: prompts, logs, tokens, time, costs and acceptance criteria. Source: ressources/first-proof-summary.md ; 1stProof ; arXiv:2602.05192 ; arXiv:2606.18119. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 8 / 20

Research level: unit distance conjecture Problem : u(n), maximum number
of unit-distance pairs among n points in the plane. OpenAI prompt (summary) “Resolve Erdős’s planar unit-distance problem completely.” Prove either u(n) ≤ n1+O(1/ log log n), or counterexamples to every bound n1+C/ log log n. No partial progress. Results : OpenAI counterexample, then human verification and optimization. 1 + o(1) ⇝ 1.0152 Class. OpenAI Sawin Emmerich Exp. 1 + o(1) > 1 1.014 1.0152 Takeaway : LLM + harness ⇒ PhD-student level on an open problem. grid Emmerich n increases Sources: OpenAI proof PDF p.3; Alon et al. 2605.20695; Sawin 2605.20579; Emmerich 2606.03419. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 9 / 20

Formal mathematics: Viazovska theorem (Lean) Problem : formalize an existing
major theorem. ▶ Object: optimal packing in dimension 8 (Viazovska), extension to dimension 24. ▶ Stakes: transfer a very high-level proof into Lean. Results : public Math, Inc. repository Sphere-Packing-Lean. ▶ Status: public, massive Lean project around certification of the E8 / Leech packing proof. ▶ Scope: geometry of numbers, coding theory, 2022 Fields Medal. ▶ Current size: 830 files, 180 661 Lean lines; FLT Lean reference: 117 files, 15 411 lines. Takeaway : large-scale formal certification, but sensitive governance. ▶ Governance point: academic (EPFL) → private (Math, Inc.) transfer was highly controversial. ▶ Tension: scientific credit, code maintenance, open Lean infrastructure. Source: GitHub repositories; Hariharan et al., arXiv:2604.23468; local count 09/03/2026 with rg + wc -l. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 10 / 20

Formal mathematics: analysis / PDEs (Lean) Problem : formalize modern
analysis, still only sparsely covered by mathlib. Standard use case: math expert who is not a Lean specialist, guided by LLM + formal checking. Results : Scott Armstrong (PDEs) and Julia Kempe; Harnack, Hölder and reusable analysis foundations in scottnarmstrong/DeGiorgi. Short excerpt : Harnack.lean theorem harnack (A : NormalizedEllipticCoeff d (ball 0 1)) (hsol : IsSolution A.1 u) : essSup u µ1/2 ≤ exp(C_harnack d * A.1.Λ1/2) * essInf u µ1/2 := by ... Sobolev Weak Harnack Harnack Hölder Takeaway ▶ Feasibility: modern analysis with Claude/Codex pro accounts. ▶ Expertise: math guidance essential; impossible outside the domain. ▶ Cost: many tokens, manageable with a solid blueprint. ▶ Tension: math expert vs Lean idiom / core developers. ▶ Goal: formalized foundations to put Lean in the loop. Source: Armstrong–Kempe, arXiv:2604.05984; scottnarmstrong/DeGiorgi, especially Harnack.lean; Armstrong blog post 07/04/2026. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 11 / 20

My research experience Problem : make progress on a fairly
well-scoped problem where one is stuck; use Lean as a non-specialist. Results : personal report from intensive use. ▶ Unlock: open problem solved in 2 weeks through intensive GPT-5 prompting. ▶ Preprint: https://arxiv.org/abs/2602.01372 ▶ Agents: Codex / Claude Code in daily use. ▶ Library: https://github.com/gpeyre/flow-sinkhorn Takeaway : two regimes and workflow gains. ▶ Informal: unlocks open problems. ▶ Formal (Lean): works, but costs ∼ 10× more tokens. ▶ Writing: faster prototyping. ▶ Technical: automated repetitive tasks. ▶ Reasoning: Lean certifies the proof without directly improving the mathematical idea. Source: Personal experience + arXiv/GitHub (accessed 09/03/2026). Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 13 / 20

Group experience report Problem : collective evaluation at CSD (ENS).
▶ Protocol: same prompt, multiple answers. ▶ Task: semi-open question, notebook, code, paper, formalization. ▶ Audit: submissions evaluated across several criteria. Prompt excerpt Consider flow matching with a stochastic interpolant a(t)*X0+b(t)*X1. In python/ do an indepth numerical simulation ... X0 sim N(0,Id), X1 is a mixture of three Dirac. In paper/ write a detailed LaTeX article ... compute in closed form Sigma_t and the flow map T_t. Snapshot: paper page from the codex-gabriel/ submission. Source: ressources/agentic-benchs/todo.md; comparative report from 18/06/2026; codex-gabriel submission. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 14 / 20

Group experience report (continued) Problem : multicriteria evaluation. Results :
LLM-as-judge aggregation Submission Math Code Overall emmanuel 9.3 8.8 9.1 codex-gabriel 9.0 9.0 8.8 openclaw 8.2 8.6 8.2 hermes 7.8 8.0 7.7 claude-gabriel 6.5 7.5 7.1 codex-kimia 7.0 7.0 7.0 codex-clement 7.0 5.5 6.2 antigravity-a. 5.8 6.0 5.8 antigravity-o. 5.5 5.0 5.4 Overall = aggregation of math, code, paper and robustness criteria. Results : robust signals ▶ Math: Codex ≫ Claude ≫ rest. ▶ Code: Claude ≫ Codex ≫ rest. ▶ Budget: performance ≃ tokens. Takeaway : effect of token budget Budget Typical outcome Small incorrect proof Medium proof OK, statement imperfect Large complete solution Huge Lean viable Source: CSD experiment; audit ressources/agentic-benchs/report. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 15 / 20

Zoom: Emmanuel Sérié / Noogram Problem : CSD prompt →
full scientific artifact. Results : Noogram contribution ▶ Theorem: Gaussian FM = OT iff covariances commute. ▶ Artifacts: site, code, paper, figures, Lean, report. ▶ Audit: false proof found and fixed; score 9.1. Takeaway : agent harnessing for science. ▶ Chain: generate, test, refute, publish. ▶ Rigor: Lean as anchor; expert required. ▶ Cost: heavy, but auditable end to end. Sources: Noogram site; GitHub noogram-labs/flow-matching-gaussians; CSD audit. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 16 / 20

Impact on evaluation and teaching Reviewing: AI as first reader
▶ Automation: first report, local flaws, suggestions. ▶ Weak signal: triage, not scientific validation. ▶ Human value: validity, novelty, taste. Teaching: preserve human training ▶ Level: PhD-level writing, code, local critique. ▶ Risk: delegating before learning to reason. ▶ Loss: future teachers, reviewers, AI evaluators. ▶ Protect: oral exams, proof critique, AI audit. Source: PaperReview.ai, Stanford ML Group (screenshot 02/07/2026) ; qualitative synthesis. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 18 / 20

Actions for the community Structure the community ▶ CNRS RT:
bridge AI, math, formalization, CS (INSMI/INS2I). ▶ AISSAI: “AI for mathematics” quarter with the RT. Open access to models ▶ Repos: academic projects connected to models. ▶ Models: control and auditability. Dialogue with industry ▶ DeepMind: funds AISSAI postdoc program. ▶ Mistral: Emmy tools for CNRS to develop further. Do not forget training ▶ Faculty: design + evaluation. ▶ Evaluators: preserve human training. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 19 / 20

Conclusion 1. Acceleration already visible ▶ Performance: very fast progress.
▶ Workflows: code, verify, write. 2. Concrete scientific impact ▶ Informal: open problems unlocked. ▶ Formal: Lean progressing, high token cost. 3. Human role displaced ▶ Positioning: augment, not replace. ▶ Role: researcher → judge of proof quality. 4. Collective stakes ▶ Access: controllable and auditable models. ▶ Tension: academic–industry exchanges. ▶ Sovereignty: scientific and institutional framing. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 20 / 20

IA for theory

IA for theory

Gabriel Peyré

More Decks by Gabriel Peyré

Other Decks in Research

Featured

Transcript

AI for Theory State of the art, uses and actions

Talk roadmap LLMs and mathematics AI for research in mathematics

AI for mathematics AI for scientific discovery PINNs to explore

LLMs for reasoning and mathematics 2017 2020 2022 2024 2025

Why mathematics became central for LLMs Problem : mathematics as

Simple problems: mathematical olympiads Results 2024: formal reasoning AlphaProof +

Talk roadmap LLMs and mathematics AI for research in mathematics

Research level: First Proof, from pilot to benchmark Problem :

Research level: unit distance conjecture Problem : u(n), maximum number

Formal mathematics: Viazovska theorem (Lean) Problem : formalize an existing

Formal mathematics: analysis / PDEs (Lean) Problem : formalize modern

Talk roadmap LLMs and mathematics AI for research in mathematics

My research experience Problem : make progress on a fairly

Group experience report Problem : collective evaluation at CSD (ENS).

Group experience report (continued) Problem : multicriteria evaluation. Results :

Zoom: Emmanuel Sérié / Noogram Problem : CSD prompt →

Talk roadmap LLMs and mathematics AI for research in mathematics

Impact on evaluation and teaching Reviewing: AI as first reader

Actions for the community Structure the community ▶ CNRS RT:

Conclusion 1. Acceleration already visible ▶ Performance: very fast progress.