Understanding Formal Reasoning Failures in LLMs as Abstract Interpreters
Published in Workshop on Language Models and Programming Languages at SPLASH (LMPL), 2025
Recommended citation: Mitchell, J., Kim, B. H., Zhou, C., & Wang, C. (2025, October). Understanding Formal Reasoning Failures in LLMs as Abstract Interpreters. In Proceedings of the 1st ACM SIGPLAN International Workshop on Language Models and Programming Languages (pp. 71-83). https://dl.acm.org/doi/10.1145/3759425.3763389
Large language models (LLMs) are increasingly used for program verification, yet little is known about how they reason about program semantics during this process. In this work, we focus on abstract interpretation-based reasoning for invariant generation and introduce two novel prompting strategies that aim to elicit such reasoning from LLMs. We evaluate these strategies across several state-of-the-art LLMs on 22 programs from SV-COMP, a benchmark suite widely used in software verification. We analyze both the soundness of the generated invariants and the key thematic patterns in the models' reasoning errors. This work highlights new research opportunities at the intersection of LLMs and program verification, both for applying LLMs to verification tasks and for advancing their reasoning capabilities in this domain.
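To give a sense of the kind of reasoning involved, the following is a minimal sketch of interval abstract interpretation for loop-invariant generation. It is purely illustrative and not from the paper: it infers an interval invariant for the loop `x = 0; while x < 10: x += 1` by iterating to a fixpoint with the standard interval widening operator.

```python
# Illustrative sketch (not the paper's method): interval abstract
# interpretation inferring a loop invariant for:
#   x = 0; while x < 10: x += 1
NEG_INF, POS_INF = float("-inf"), float("inf")

def join(a, b):
    """Least upper bound of two intervals (lo, hi)."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def widen(a, b):
    """Standard interval widening: unstable bounds jump to infinity."""
    lo = a[0] if b[0] >= a[0] else NEG_INF
    hi = a[1] if b[1] <= a[1] else POS_INF
    return (lo, hi)

def analyze():
    inv = (0, 0)                      # abstract state of x on loop entry
    while True:
        # Restrict to the loop guard x < 10, then apply the body x += 1.
        lo, hi = inv[0], min(inv[1], 9)
        body = (lo + 1, hi + 1)
        new = widen(inv, join(inv, body))
        if new == inv:
            return inv                # fixpoint reached: a loop invariant
        inv = new

print(analyze())                      # → (0, inf), i.e. the invariant x >= 0
```

Widening guarantees termination of the fixpoint iteration at the cost of precision (here it loses the upper bound `x <= 10`, which a subsequent narrowing pass could recover); a sound invariant generator, whether classical or LLM-driven, must produce intervals that over-approximate all reachable states in this way.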
Links: proceedings | arXiv
