PDEs: Navier–Stokes / Euler, instabilities, counterexamples. AI for theorem proving Lean: verified proof; Viazovska’s optimal E8 packing. Focus: mathematical reasoning, from informal to formal. Source: Wang–Lai–Gómez-Serrano–Buckmaster, arXiv:2201.06780 ; Hariharan et al., arXiv:2604.23468. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 3 / 20
2025 2026 Transformer architecture GPT-3 scale ChatGPT mass adoption Reasoning o1/o3 DeepSeek-R1 Gemini Deep Think Claude reasoning Agentic / code Codex Claude Code Mistral vibe OpenClaws Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 4 / 20
a testbed for reasoning. ▶ Training: next-token prediction, evaluation oriented toward reasoning. ▶ Benchmarks: GSM8K, MATH, AIME/IMO-style as strategic signals. ▶ Reasoning: reinforcement learning and mathematical rewards. Benchmark snippets GSM8K word arithmetic: “Alice buys 3 notebooks... how many are left?” MATH contest problem: algebra / geometry / analysis, structured solution. AIME / IMO short statement, long search: determine, optimize, prove. Takeaway : agentic AI: local answer to full pipeline – plan, code, verify, write. Source: Cobbe et al. 2021; Hendrycks et al. 2021; OpenAI Codex 2025/2026. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 5 / 20
AlphaGeometry 2, translation to formal languages then proof: 28/42. Results 2025: informal reasoning Gemini Deep Think, general-purpose models and informal proofs: 35/42. Example of IMO 2025 statement Source: DeepMind 25/07/2024 and 21/07/2025; official IMO 2025 PDF (Deep Think). Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 6 / 20
reproducible measurement of research-level proving. Task: unseen problems, available human solutions, full write-up; criterion: anonymized acceptance. Results : First Batch → Second Batch First Batch (02/2026) Signal: ∼2 clear successes (Q9,Q10). Gray zone: Q5/Q8 close or repairable; no formal review. Second Batch (03–06/2026) OK: 17/39 submissions passed. Coverage: 7/10 problems had ≥ 1 OK solution. Criterion: flawless or minor revisions after anonymized expert review. Takeaway : scale shift, but human validation remains central. ▶ Capability: serious proofs possible on nontrivial problems. ▶ Costs: batch 2, $117–$4.8k for 10 answers; ∼$10–$480/answer. ▶ Bottleneck: expert review, skipped lemmas, “standard arguments”, fragile citations. ▶ Benchmark: prompts, logs, tokens, time, costs and acceptance criteria. Source: ressources/first-proof-summary.md ; 1stProof ; arXiv:2602.05192 ; arXiv:2606.18119. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 8 / 20
of unit-distance pairs among n points in the plane. OpenAI prompt (summary) “Resolve Erdős’s planar unit-distance problem completely.” Prove either u(n) ≤ n1+O(1/ log log n), or counterexamples to every bound n1+C/ log log n. No partial progress. Results : OpenAI counterexample, then human verification and optimization. 1 + o(1) ⇝ 1.0152 Class. OpenAI Sawin Emmerich Exp. 1 + o(1) > 1 1.014 1.0152 Takeaway : LLM + harness ⇒ PhD-student level on an open problem. grid Emmerich n increases Sources: OpenAI proof PDF p.3; Alon et al. 2605.20695; Sawin 2605.20579; Emmerich 2606.03419. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 9 / 20
major theorem. ▶ Object: optimal packing in dimension 8 (Viazovska), extension to dimension 24. ▶ Stakes: transfer a very high-level proof into Lean. Results : public Math, Inc. repository Sphere-Packing-Lean. ▶ Status: public, massive Lean project around certification of the E8 / Leech packing proof. ▶ Scope: geometry of numbers, coding theory, 2022 Fields Medal. ▶ Current size: 830 files, 180 661 Lean lines; FLT Lean reference: 117 files, 15 411 lines. Takeaway : large-scale formal certification, but sensitive governance. ▶ Governance point: academic (EPFL) → private (Math, Inc.) transfer was highly controversial. ▶ Tension: scientific credit, code maintenance, open Lean infrastructure. Source: GitHub repositories; Hariharan et al., arXiv:2604.23468; local count 09/03/2026 with rg + wc -l. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 10 / 20
analysis, still only sparsely covered by mathlib. Standard use case: math expert who is not a Lean specialist, guided by LLM + formal checking. Results : Scott Armstrong (PDEs) and Julia Kempe; Harnack, Hölder and reusable analysis foundations in scottnarmstrong/DeGiorgi. Short excerpt : Harnack.lean theorem harnack (A : NormalizedEllipticCoeff d (ball 0 1)) (hsol : IsSolution A.1 u) : essSup u µ1/2 ≤ exp(C_harnack d * A.1.Λ1/2) * essInf u µ1/2 := by ... Sobolev Weak Harnack Harnack Hölder Takeaway ▶ Feasibility: modern analysis with Claude/Codex pro accounts. ▶ Expertise: math guidance essential; impossible outside the domain. ▶ Cost: many tokens, manageable with a solid blueprint. ▶ Tension: math expert vs Lean idiom / core developers. ▶ Goal: formalized foundations to put Lean in the loop. Source: Armstrong–Kempe, arXiv:2604.05984; scottnarmstrong/DeGiorgi, especially Harnack.lean; Armstrong blog post 07/04/2026. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 11 / 20
well-scoped problem where one is stuck; use Lean as a non-specialist. Results : personal report from intensive use. ▶ Unlock: open problem solved in 2 weeks through intensive GPT-5 prompting. ▶ Preprint: https://arxiv.org/abs/2602.01372 ▶ Agents: Codex / Claude Code in daily use. ▶ Library: https://github.com/gpeyre/flow-sinkhorn Takeaway : two regimes and workflow gains. ▶ Informal: unlocks open problems. ▶ Formal (Lean): works, but costs ∼ 10× more tokens. ▶ Writing: faster prototyping. ▶ Technical: automated repetitive tasks. ▶ Reasoning: Lean certifies the proof without directly improving the mathematical idea. Source: Personal experience + arXiv/GitHub (accessed 09/03/2026). Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 13 / 20
▶ Protocol: same prompt, multiple answers. ▶ Task: semi-open question, notebook, code, paper, formalization. ▶ Audit: submissions evaluated across several criteria. Prompt excerpt Consider flow matching with a stochastic interpolant a(t)*X0+b(t)*X1. In python/ do an indepth numerical simulation ... X0 sim N(0,Id), X1 is a mixture of three Dirac. In paper/ write a detailed LaTeX article ... compute in closed form Sigma_t and the flow map T_t. Snapshot: paper page from the codex-gabriel/ submission. Source: ressources/agentic-benchs/todo.md; comparative report from 18/06/2026; codex-gabriel submission. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 14 / 20
▶ Automation: first report, local flaws, suggestions. ▶ Weak signal: triage, not scientific validation. ▶ Human value: validity, novelty, taste. Teaching: preserve human training ▶ Level: PhD-level writing, code, local critique. ▶ Risk: delegating before learning to reason. ▶ Loss: future teachers, reviewers, AI evaluators. ▶ Protect: oral exams, proof critique, AI audit. Source: PaperReview.ai, Stanford ML Group (screenshot 02/07/2026) ; qualitative synthesis. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 18 / 20
bridge AI, math, formalization, CS (INSMI/INS2I). ▶ AISSAI: “AI for mathematics” quarter with the RT. Open access to models ▶ Repos: academic projects connected to models. ▶ Models: control and auditability. Dialogue with industry ▶ DeepMind: funds AISSAI postdoc program. ▶ Mistral: Emmy tools for CNRS to develop further. Do not forget training ▶ Faculty: design + evaluation. ▶ Evaluators: preserve human training. Gabriel Peyré CNRS and ENS, Université PSL AI for Theory July 2, 2026 19 / 20