A Comprehensive Benchmark towards Optimization Problem Solving
While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, complex optimization tasks remain highly challenging. Existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation.
To address this, we introduce OptiVerse, a benchmark of 1,000 carefully curated optimization problems spanning six distinct domains, four of which prior benchmarks largely neglect. Our extensive experiments with 22 LLMs reveal sharp performance degradation on hard problems, where even the strongest models fail to exceed 27% accuracy.
| Benchmark | Size | Table | Graph | Answer Form | MP | CO | SO | DO | OC | GO |
|---|---|---|---|---|---|---|---|---|---|---|
| ComplexOR | 37 | ❌ | ❌ | Scalar | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| NLP4LP | 269 | ❌ | ❌ | Scalar | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| MAMO | 863 | ✅ | ✅ | Scalar | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| OptiBench | 605 | ✅ | ❌ | Scalar | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| OptiVerse (Ours) | 1000 | ✅ | ✅ | Vector | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |

MP = Mathematical Programming, CO = Combinatorial Optimization; SO, DO, OC, and GO denote the four additional domains covered only by OptiVerse.
Error analysis reveals that Modeling & Logic errors remain the primary bottleneck across all LLMs. Motivated by this finding, we propose the Dual-View Auditor Agent (DVA-Agent). Unlike simple syntax checkers, DVA-Agent acts as an adversarial evaluator built on Semantic Triangulation.
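The mechanics of DVA-Agent are not spelled out in this excerpt, so the sketch below is only one possible reading: it assumes Semantic Triangulation means pairwise cross-checking of the problem statement, the formal model, and the solver output. Every name here (`triangulate`, `dva_agent`, the `llm` and `solve` callables) is hypothetical.

```python
# A hypothetical sketch, NOT the authors' implementation of DVA-Agent.
# Assumed reading of "Semantic Triangulation": pairwise cross-checks among
# the problem statement, the formal model, and the solver's solution.

def triangulate(llm, problem: str, model_code: str, solution: str) -> list[str]:
    """Ask an adversarial auditor LLM to attack each pair of views; return
    the inconsistencies it reports (an empty list means all views agree)."""
    views = {
        "problem vs. model": (problem, model_code),
        "model vs. solution": (model_code, solution),
        "solution vs. problem": (solution, problem),
    }
    issues = []
    for name, (a, b) in views.items():
        verdict = llm(
            "You are an adversarial auditor. Find any semantic inconsistency "
            f"between these two views ({name}), or reply CONSISTENT.\n\n"
            f"View A:\n{a}\n\nView B:\n{b}"
        )
        if "CONSISTENT" not in verdict:
            issues.append(f"{name}: {verdict}")
    return issues


def dva_agent(llm, solve, problem: str, max_rounds: int = 3) -> str:
    """Re-model and re-solve until the auditor stops finding inconsistencies."""
    feedback = ""
    for _ in range(max_rounds):
        model_code = llm(f"Write a formal optimization model.{feedback}\n{problem}")
        solution = solve(model_code)   # e.g. hand the model to a solver
        issues = triangulate(llm, problem, model_code, solution)
        if not issues:
            return solution            # audit passed on all three edges
        feedback = " Fix these audit findings: " + "; ".join(issues)
    return solution                    # best effort after max_rounds
```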
We conduct an extensive empirical study evaluating 22 LLMs across scales, from 8B-parameter open models to flagship frontier systems. Answers are scored with a robust two-stage LLM-as-a-Judge framework under a strict relative numerical error tolerance of ≤ 0.1%.
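For concreteness, the check below shows one way the final numerical comparison of such a pipeline could be implemented. The judging code is not given here, so the function names and the per-component treatment of OptiVerse's vector-valued answers are our assumptions; only the 0.1% tolerance is taken from the text.

```python
REL_TOL = 1e-3  # the paper's stated tolerance: relative error ≤ 0.1%

def within_tolerance(pred: float, ref: float, rel_tol: float = REL_TOL) -> bool:
    """Scalar check: relative error against the reference, with an
    absolute-error fallback when the reference value is zero."""
    if ref == 0.0:
        return abs(pred) <= rel_tol
    return abs(pred - ref) / abs(ref) <= rel_tol

def vector_match(pred: list[float], ref: list[float]) -> bool:
    """Vector-valued answers: assume every component must pass the check."""
    return len(pred) == len(ref) and all(
        within_tolerance(p, r) for p, r in zip(pred, ref)
    )

# A component off by 0.05% passes; one off by 1% would not.
assert vector_match([100.05, 19.99], [100.0, 20.0])
assert not vector_match([101.0, 20.0], [100.0, 20.0])
```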
Accuracy (%) of each model by domain (MP–GO) and by difficulty (Easy/Med./Hard); Avg. is the overall average.

| Large Language Model | MP | CO | SO | DO | OC | GO | Easy | Med. | Hard | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| **Open-Source Non-Thinking Models** | | | | | | | | | | |
| Qwen3-8B-Instruct | 23.98 | 22.69 | 20.83 | 14.38 | 6.41 | 13.73 | 42.67 | 12.25 | 7.67 | 20.00 |
| Qwen3-30B-Instruct | 38.96 | 36.13 | 31.67 | 34.25 | 19.23 | 15.69 | 70.33 | 21.75 | 14.00 | 34.00 |
| Qwen3-235B-Instruct | 44.41 | 47.06 | 45.83 | 41.10 | 42.31 | 41.18 | 78.33 | 39.75 | 16.67 | 44.40 |
| DeepSeek-V3.2-Chat | 51.23 | 47.48 | 45.00 | 43.15 | 21.79 | 35.29 | 79.67 | 39.25 | 19.00 | 45.30 |
| **Open-Source Thinking Models** | | | | | | | | | | |
| Qwen3-30B-Thinking | 49.59 | 46.64 | 46.67 | 46.58 | 30.77 | 43.14 | 80.67 | 40.00 | 20.33 | 46.30 |
| DeepSeek-V3.2-Reasoner | 53.68 | 52.52 | 45.00 | 49.32 | 28.21 | 49.02 | 84.33 | 44.50 | 21.33 | 49.50 |
| Qwen3-235B-Thinking | 53.68 | 51.26 | 49.17 | 52.74 | 43.59 | 49.02 | 87.33 | 47.00 | 21.33 | 51.40 |
| **Closed-Source Models** | | | | | | | | | | |
| Gemini-2.5-Pro | 53.68 | 49.16 | 50.00 | 47.95 | 38.46 | 43.14 | 87.00 | 43.75 | 20.00 | 49.60 |
| o3 | 52.04 | 53.78 | 45.83 | 56.85 | 39.74 | 45.10 | 86.67 | 47.25 | 20.67 | 51.10 |
| GPT-5.2 | 55.86 | 57.14 | 50.00 | 57.53 | 50.00 | 54.90 | 91.00 | 50.75 | 25.33 | 55.20 |
| Gemini-3-Pro | 58.04 | 57.14 | 56.67 | 56.85 | 42.31 | 50.98 | 89.00 | 52.75 | 27.00 | 55.90 |