English | 中文
We introduce MA-ProofBench, to the best of our knowledge, the first formal benchmark for evaluating large language models (LLMs) on theorem proving in Mathematical Analysis. It contains 200 rigorously formalized theorem-proving problems in Lean 4 (v4.28.0), split into two difficulty tiers:
| Tier | Description | Source | Count |
|---|---|---|---|
| Level I | Undergraduate | Basic Textbook Exercises | 100 |
| Level II | Ph.D. | Exam Problems from Top-Tier Universities | 100 |
The problems span 6 core topics and 27 subcategories, including measure and integration theory, complex analysis, and functional analysis, among other categories. MA-ProofBench targets areas that are underrepresented in prior benchmarks and require deep reasoning about continuity, limits, and topological structures. Each problem is built through a human-led, LLM-assisted formalization pipeline with independent expert blind review to ensure mathematical fidelity.
Problems are classified following the Mathematics Subject Classification (MSC) scheme:
| Category | Level I | Level II |
|---|---|---|
| Real Functions | 44 | 12 |
| Functional Analysis | 15 | 31 |
| Functions of a Complex Variable | 19 | 16 |
| Measure & Integration | 13 | 17 |
| Operator Theory | 4 | 23 |
| Sequences, Series, Summability | 5 | 1 |
The evaluation/ package generates proofs via an OpenAI-compatible API and verifies them with Kimina Lean Server.
First, install the evaluation dependencies:
pip install -r requirements.txtThen set up and start the Lean server (make sure it uses Lean v4.28.0):
cd kimina-lean-server
cp .env.template .env # Optional
bash setup.sh # Installs Lean, repl and mathlib4
pip install -r requirements.txt
pip install .
prisma generate
python -m serverWith the Lean server running, launch the evaluation script with an OpenAI-compatible API endpoint:
EXTERNAL_API_URL=https://your-api/v1/chat/completions \
EXTERNAL_API_KEY=sk-xxx \
EVAL_MODEL_PATH=your-model-name \
NUM_SAMPLES=8 \
EVAL_MAX_TOKENS=32768 \
EVAL_TEMPERATURE=1.0 \
bash evaluation/run.shTo run a stage manually:
# Inference only
EXTERNAL_API_URL=https://your-api/v1/chat/completions \
EXTERNAL_API_KEY=sk-xxx \
python -m evaluation.main --mode infer \
--model_path "$EVAL_MODEL_PATH" \
--output_path predictions.jsonl \
--split level1 level2 --num_samples 8 \
--eval_max_tokens 32768 --eval_temperature 1.0
# Verification only
python -m evaluation.main --mode eval \
--predictions_path predictions.jsonl \
--output_path results.jsonl \
--split level1 level2 --num_samples 8@article{ma-proofbench,
title={MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis},
author={Lushi Pu and Weiming Zhang and Xinheng Xie and Zixuan Fu and Bingxiang He and Hongya Lyu and Xin Li and Jie Zhou and Yudong Wang},
year={2026},
eprint={2606.13782},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2606.13782},
}This project is released under the MIT License.


