[mercari GEARS 2025] Exploring LLM-Driven Formal Verification for Robust Continuous Integration of Services

Exploring LLM-Driven Formal Verification for Robust Continuous Integration of Services
Cheng-Hui Weng, Mercari / Product Engineer

• Product engineer (BE) since 2019
• R4D-supported Ph.D. student since 2022, Graduate School of Mathematics, Nagoya University
• A newbie father since Jan 2025

Thoughts: Don't try this at home.

The title has lots of buzzwords, but actually…
• LLM-Driven ≈ Always using LLMs to automate boring, heavy-labor tasks
• Formal Verification ≈ Check specifications logically with a machine
• Continuous Integration ≈ Continuously check every new change

Once upon a time, there was an incident
• Our data storage of item attributes randomly dropped attributes
• It only occurred on production
• There were no logs or errors at all
Actual: random attributes are dropped silently
• System writes the change to the database
• No logs or errors occur
[Figure: an item with "Color: Dark blue" receives a new attribute, "Authentication: Verified"]

Finding and fixing the issue took 180+ days
• It is a production-only bug.
• There were no errors and no logs when the drop happened.
• Finding issues is costly, and each time only a few items were impacted.
• The issue was resolved after the incident was identified.

LLM + Formal Verification: Identify the root cause within 1 day

Bug detection from 180 days to 1 day
Setup
1. Check out the commit before the bug fix
2. No hint
3. Give symptoms
4. Give possible files
5. Ask the LLM service to build the formal spec to reveal the bug

Why formal models (specs)?
• No logs or errors occur
• Code tests can't identify specification logic issues
• Formal verification ≈ checking specifications logically by machine
[Diagram: Natural-language spec (ambiguous) → Formal spec (TLA+) → TLC model checker → Invariants (requirements) hold / Bug found; Implementation code ≈ state machine]
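The idea of a model checker testing an invariant in every reachable state can be sketched in a few lines of Python. This is only an illustration, not the TLA+ spec from the talk: the talk does not state the root cause, so the model below assumes a hypothetical lost-update race in which two writers (`w1`, `w2`, both invented names) each add one attribute through an unsynchronized read-modify-write. The breadth-first search plays the role that TLC plays for a real TLA+ spec.

```python
from collections import deque

# Hypothetical model, NOT the root cause from the talk: two writers each add
# one attribute to the same item via an unsynchronized read-modify-write.
WRITERS = (("w1", "Color"), ("w2", "Authentication"))
IDLE, DONE = "idle", "done"
INIT = (frozenset(), (IDLE, IDLE))  # (stored attributes, per-writer local state)


def _with_local(locals_, i, value):
    """Return a copy of locals_ with entry i replaced by value."""
    return locals_[:i] + (value,) + locals_[i + 1:]


def next_states(state, atomic):
    """Yield (label, successor) pairs: the transition relation of the model."""
    stored, locals_ = state
    for i, (name, attr) in enumerate(WRITERS):
        local = locals_[i]
        if local == DONE:
            continue
        if local == IDLE and atomic:
            # Read and write happen in one indivisible step.
            yield (f"{name} updates atomically",
                   (stored | {attr}, _with_local(locals_, i, DONE)))
        elif local == IDLE:
            # Read step: the writer keeps a snapshot of the stored attributes.
            yield f"{name} reads", (stored, _with_local(locals_, i, stored))
        else:
            # Write step: the whole record is overwritten from the snapshot.
            yield (f"{name} writes",
                   (local | {attr}, _with_local(locals_, i, DONE)))


def invariant(state):
    """Every writer that has finished must still find its attribute stored."""
    stored, locals_ = state
    return all(attr in stored
               for (_, attr), local in zip(WRITERS, locals_) if local == DONE)


def check(atomic=False):
    """Breadth-first search over all reachable states.

    Returns the shortest list of step labels that leads to an invariant
    violation, or None if the invariant holds in every reachable state.
    """
    parent = {INIT: None}
    queue = deque([INIT])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while parent[state] is not None:
                state, label = parent[state]
                trace.append(label)
            return trace[::-1]
        for label, nxt in next_states(state, atomic):
            if nxt not in parent:
                parent[nxt] = (state, label)
                queue.append(nxt)
    return None


if __name__ == "__main__":
    print("racy model:  ", check(atomic=False))
    print("atomic model:", check(atomic=True))
```

In the racy model the search returns a four-step interleaving in which both writers read before either writes, so the second write silently drops the first attribute. No log line or exception is involved, which is why ordinary code tests tend to miss this class of bug. With `atomic=True` the same invariant holds in every reachable state.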

Invariant violation identified by TLA+

Reality
[Diagram: Natural-language spec (ambiguous) → Formal spec (TLA+) → TLC model checker → Invariants (requirements) hold / Bug found; Implementation code]
• No time to learn new concepts
• No time to apply the knowledge
• Duplicated spec writing
• Extra work before coding
• Delivery time gets longer
• Code may still be buggy
• Tests are still required

Therefore: Coding first, then verification in CI
Flow: Git commit, JIRA ticket, Natural-language spec (ambiguous) → Prompts to LLM service → Formal spec (TLA+) → ModelFuzz* (model-guided fuzz test against the implementation code) → Invariants (requirements) hold / Bug found
• Spec per change
• Less duplication work
• Coverage for every change guaranteed by fuzz test
* Gulcan, E. B., Ozkan, B. K., Majumdar, R., & Nagendra, S. (2025). Model-guided Fuzzing of Distributed Systems. https://arxiv.org/abs/2410.02307
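A rough sketch of what "model-guided" means for a fuzz test: the fuzzer runs the implementation, maps each concrete state to an abstract model state, and keeps only those inputs that reach model states not seen before. The code below is a simplified illustration of that feedback loop and does not use the ModelFuzz tool or its API. The toy store, the schedule encoding (a list of writer indices), and the assumed race are all hypothetical.

```python
import random

ATTRS = ("Color", "Authentication")  # writer i adds ATTRS[i]; hypothetical workload


def run(schedule):
    """Execute one schedule against a toy, deliberately racy store.

    Returns (coverage, violated): the set of abstract model states visited,
    and whether a finished writer's attribute went missing.
    """
    stored = frozenset()
    snapshots = [None] * len(ATTRS)
    phase = [0] * len(ATTRS)  # 0 = idle, 1 = has read, 2 = done
    coverage = set()
    for w in schedule:
        if phase[w] == 0:
            snapshots[w], phase[w] = stored, 1  # read step
        elif phase[w] == 1:
            # Write step: overwrite the record from a possibly stale snapshot.
            stored, phase[w] = snapshots[w] | {ATTRS[w]}, 2
        # Abstract the concrete state into a model state (snapshots dropped).
        coverage.add((stored, tuple(phase)))
        if any(p == 2 and a not in stored for p, a in zip(phase, ATTRS)):
            return coverage, True
    return coverage, False


def fuzz(seed, iterations, rng):
    """Mutate schedules, keeping those that reach new abstract model states.

    Returns a schedule that violates the invariant, or None if none was found.
    """
    coverage, violated = run(seed)
    if violated:
        return list(seed)
    corpus, seen = [list(seed)], set(coverage)
    for _ in range(iterations):
        child = list(rng.choice(corpus))
        child[rng.randrange(len(child))] = rng.randrange(len(ATTRS))
        coverage, violated = run(child)
        if violated:
            return child
        if not coverage <= seen:  # new model coverage: keep as a future parent
            seen |= coverage
            corpus.append(child)
    return None


if __name__ == "__main__":
    print(fuzz([0, 0, 0, 0], 20000, random.Random(0)))
```

Starting from a schedule in which only writer 0 runs, the violating interleavings are two mutations away. Keeping the intermediate schedule that first reaches the "both writers have read" model state is what lets the search get there step by step.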

Why not use an LLM fuzz test directly?
Answer from ChatGPT

Current progress
For Bugs
• A TLA+ spec for the silent attribute-dropping bug shows the invariant violation, written fully by an LLM service within one day
For Features
• Another TLA+ spec, also written fully by an LLM service, for an API flow change to prove its integrity
Next Step
• Complete the ModelFuzz setup and run it for the whole attributes data store service to showcase the method

Takeaway
🧩 Formal verification isn't just theory
⚙ LLM + TLA+ = Faster, logical debugging
🧠 Automation can make correctness part of the workflow
🚀 Code first, verify continuously
🌱 Next step: make "robust CI" the default.

Takeaway (2)
AI presentation is awesome

Thank You! 
