Continuous evaluation of LLM reasoning on competitive programming problems

TheThinkbench is a benchmarking platform for continuous evaluation of large language models (LLMs) on competitive programming challenges. It focuses on measuring true reasoning, algorithmic thinking, and problem‑solving ability using standardized tasks from Codeforces, a widely used competitive programming site.
The platform targets machine learning researchers, engineers, and practitioners who need transparent, model‑to‑model comparisons for code generation under test constraints. A public dashboard and archives provide navigable results with problem identifiers, difficulty ratings, verdicts, runtime, and success ratios. Last updated: December 21, 2025.
TheThinkbench evaluates LLMs on a curated set of Codeforces problems spanning entry‑level to advanced difficulty. For each model and configuration, the system generates candidate solutions, runs them against the problems' tests, and records the outcome. Results include the original problem identifier and title, the difficulty rating, and a direct link to the Codeforces statement for verification.
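The generate‑then‑judge flow described above can be sketched in a few lines. This is only an illustration under stated assumptions: `generate_solution`, `run_test`, and the record fields are hypothetical names for the model call, the test runner, and the stored result, not TheThinkbench's actual API.

```python
# Illustrative sketch of the per-problem evaluation loop (hypothetical names,
# not TheThinkbench's actual API).
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvaluationRecord:
    problem_id: str        # Codeforces identifier, e.g. "1850A" (example)
    rating: Optional[int]  # Codeforces difficulty rating, when available
    verdict: str           # outcome label as reported
    total_seconds: float   # combined generation + test execution time
    score: float           # passed test cases / total hidden tests


def evaluate_problem(
    problem_id: str,
    rating: Optional[int],
    statement: str,
    hidden_tests: List[object],
    generate_solution: Callable[[str], str],  # model call (assumed interface)
    run_test: Callable[[str, object], bool],  # sandboxed judge (assumed interface)
) -> EvaluationRecord:
    start = time.monotonic()
    source = generate_solution(statement)                       # candidate solution
    passed = sum(run_test(source, case) for case in hidden_tests)
    elapsed = time.monotonic() - start
    return EvaluationRecord(
        problem_id=problem_id,
        rating=rating,
        verdict="Accepted" if passed == len(hidden_tests) else "Failed",
        total_seconds=elapsed,
        score=passed / len(hidden_tests),
    )
```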
Each evaluation captures standardized metrics: a verdict label (e.g., checkAccepted, closeFailed), total time measured in seconds for combined generation and test execution, and a score reported as the ratio of passed cases to the total hidden tests. Results are aggregated by model, with per‑model evaluation counts and optional configuration notes such as temperature and reasoning mode when available.
| Metric | Meaning |
|---|---|
| Problem | Unique Codeforces identifier and title with a link to the challenge |
| Rating | Codeforces difficulty rating (800–3500) when available |
| Verdict | Outcome label; “Accepted” indicates all test cases passed; other labels (e.g., checkAccepted, closeFailed) are reported as shown |
| Time | Total seconds for generation and test execution |
| Score | Ratio of passed test cases to total hidden tests |
| Model Config | Notes such as temperature and reasoning mode when provided |
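As a rough illustration of how the per‑model aggregates mentioned above could be computed from records with these fields, the sketch below groups results by model and reports evaluation counts, accepted counts, and mean score and time. The dictionary keys and sample data are assumptions for illustration, not TheThinkbench's actual schema.

```python
# Hypothetical aggregation of result records by model; field names and the
# sample data are illustrative only.
from collections import defaultdict
from statistics import mean


def aggregate_by_model(records):
    """Summarize per-model evaluation counts, accepted verdicts, score, and time."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec["model"]].append(rec)
    return {
        model: {
            "evaluations": len(recs),
            "accepted": sum(r["verdict"] == "Accepted" for r in recs),
            "mean_score": mean(r["score"] for r in recs),
            "mean_time_s": mean(r["total_seconds"] for r in recs),
        }
        for model, recs in grouped.items()
    }


# Example with made-up records:
sample = [
    {"model": "model-a", "verdict": "Accepted", "score": 1.0, "total_seconds": 42.0},
    {"model": "model-a", "verdict": "Failed", "score": 0.4, "total_seconds": 95.0},
]
print(aggregate_by_model(sample))
```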
TheThinkbench provides objective, reproducible comparisons for teams selecting or tuning models for code‑related tasks. By aligning evaluations to established competitive programming benchmarks, it exposes strengths and limitations in reasoning and algorithmic performance.
Practical applications include model selection for programming agents, monitoring regressions across model versions, studying trade‑offs between accuracy and runtime, and supporting research on failure modes in algorithmic reasoning. Educators and practitioners can use the archives to illustrate concrete examples of where models succeed or fail under rigorous test conditions.