MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

English | 中文

We introduce MA-ProofBench, to the best of our knowledge, the first formal benchmark for evaluating large language models (LLMs) on theorem proving in Mathematical Analysis. It contains 200 rigorously formalized theorem-proving problems in Lean 4 (v4.28.0), split into two difficulty tiers:

Tier	Description	Source	Count
Level I	Undergraduate	Basic Textbook Exercises	100
Level II	Ph.D.	Exam Problems from Top-Tier Universities	100

The problems span 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis, among other categories. MA-ProofBench targets areas that are underrepresented in prior benchmarks and require deep reasoning about continuity, limits, and topological structures. Each problem is built through a human-led, LLM-assisted formalization pipeline with independent expert blind review to ensure mathematical fidelity.

Category Distribution

Problems are classified following the Mathematics Subject Classification (MSC) scheme:

Category	Level I	Level II
Real Functions	44	12
Functional Analysis	15	31
Functions of a Complex Variable	19	16
Measure & Integration	13	17
Operator Theory	4	23
Sequences, Series, Summability	5	1

Evaluation

The evaluation/ package generates proofs via an OpenAI-compatible API and verifies them with Kimina Lean Server.

First, install the evaluation dependencies:

pip install -r requirements.txt

Then set up and start the Lean server (make sure it uses Lean v4.28.0):

cd kimina-lean-server

cp .env.template .env # Optional
bash setup.sh # Installs Lean, repl and mathlib4
pip install -r requirements.txt
pip install .
prisma generate
python -m server

With the Lean server running, launch the evaluation script with an OpenAI-compatible API endpoint:

EXTERNAL_API_URL=https://your-api/v1/chat/completions \
EXTERNAL_API_KEY=sk-xxx \
EVAL_MODEL_PATH=your-model-name \
NUM_SAMPLES=8 \
EVAL_MAX_TOKENS=32768 \
EVAL_TEMPERATURE=1.0 \
bash evaluation/run.sh

To run a stage manually:

# Inference only
EXTERNAL_API_URL=https://your-api/v1/chat/completions \
EXTERNAL_API_KEY=sk-xxx \
python -m evaluation.main --mode infer \
  --model_path "$EVAL_MODEL_PATH" \
  --output_path predictions.jsonl \
  --split level1 level2 --num_samples 8 \
  --eval_max_tokens 32768 --eval_temperature 1.0

# Verification only
python -m evaluation.main --mode eval \
  --predictions_path predictions.jsonl \
  --output_path results.jsonl \
  --split level1 level2 --num_samples 8

Citation

@article{ma-proofbench,
  title={MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis},
  author={Lushi Pu and Weiming Zhang and Xinheng Xie and Zixuan Fu and Bingxiang He and Hongya Lyu and Xin Li and Jie Zhou and Yudong Wang},
  year={2026},
  eprint={2606.13782},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.13782}, 
}

License

This project is released under the MIT License.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
assets		assets
benchmark		benchmark
evaluation		evaluation
kimina-lean-server @ fb2393d		kimina-lean-server @ fb2393d
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
README_zh.md		README_zh.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

Category Distribution

Evaluation

Citation

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

Category Distribution

Evaluation

Citation

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages