OptiVerse

A Comprehensive Benchmark towards Optimization Problem Solving

Overview

While Large Language Models (LLMs) demonstrate remarkable reasoning capabilities, complex optimization tasks remain highly challenging. Existing benchmarks focus narrowly on Mathematical Programming and Combinatorial Optimization, hindering comprehensive evaluation.


To address this, we introduce OptiVerse, a comprehensive benchmark of 1,000 carefully curated optimization problems spanning six distinct and often neglected domains. Our extensive experiments with 22 LLMs reveal sharp performance degradation on hard problems, where even advanced models struggle to exceed 27% accuracy.

Benchmark Details

Key Components

Figure 1: The hierarchical taxonomy of the OptiVerse benchmark across six optimization domains.
Table 1: Comparison with Existing Benchmarks

| Benchmark | Size | Answer Form |
|---|---|---|
| ComplexOR | 37 | Scalar |
| NLP4LP | 269 | Scalar |
| MAMO | 863 | Scalar |
| OptiBench | 605 | Scalar |
| OptiVerse (Ours) | 1,000 | Vector |
Figure 2: Domain distribution in the OptiVerse benchmark.

Dual-View Auditor Agent (DVA-Agent)

Error analysis reveals that Modeling & Logic errors remain the primary bottleneck across all LLMs. To address this, we propose the Dual-View Auditor Agent (DVA-Agent). Unlike simple syntax checkers, DVA-Agent acts as an adversarial evaluator using Semantic Triangulation, as illustrated in Figure 3.


Figure 3: The workflow of the Dual-View Auditor Agent.
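The audit loop in Figure 3 can be sketched as follows. This is a hypothetical illustration only: the view prompts, the `judge` callable, and the agreement rule are assumptions for exposition, not the DVA-Agent's released implementation.

```python
def dual_view_audit(problem: str, model_code: str, solution: str, judge) -> bool:
    """Audit a candidate answer from two independent views and accept it
    only when both views agree.

    `judge` is any callable that sends a prompt to an LLM and returns its
    verdict text (a stand-in for a real API call).
    """
    # View 1 (modeling view): does the formalized model faithfully
    # encode the original problem statement?
    modeling_verdict = judge(
        f"Problem:\n{problem}\n\nModel:\n{model_code}\n"
        "Does the model faithfully encode every constraint and the "
        "objective of the problem? Answer YES or NO."
    )
    # View 2 (solution view): is the reported solution actually feasible
    # and optimal for the model that was written down?
    solution_verdict = judge(
        f"Model:\n{model_code}\n\nSolution:\n{solution}\n"
        "Is the solution feasible and optimal for this model? "
        "Answer YES or NO."
    )
    # Triangulation: disagreement between the two views flags a likely
    # modeling/logic error and sends the candidate back for revision.
    return (modeling_verdict.strip().upper().startswith("YES")
            and solution_verdict.strip().upper().startswith("YES"))
```

Checking the model against the problem and the solution against the model separately is what lets the auditor catch modeling errors that a single end-to-end check would miss.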

Experimental Results

We conducted an extensive empirical study evaluating 22 LLMs across varying scales, from 8B-parameter models to flagship frontier systems. We employ a robust two-stage LLM-as-a-Judge evaluation framework with a strict relative numerical error tolerance of ≤ 0.1%.
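The numerical side of the judging stage can be sketched as a simple tolerance check; the function name, the component-wise rule for vector answers, and the absolute-error fallback at a zero reference are illustrative assumptions, not the released evaluation code.

```python
def within_tolerance(pred, ref, rel_tol=1e-3):
    """Check a predicted answer against the reference value with a
    relative error tolerance (0.1% by default, matching the benchmark's
    setting).

    Handles both scalar answers and vector answers (OptiVerse's answer
    form): a vector prediction passes only if every component passes.
    """
    if isinstance(ref, (list, tuple)):
        if not isinstance(pred, (list, tuple)) or len(pred) != len(ref):
            return False
        return all(within_tolerance(p, r, rel_tol) for p, r in zip(pred, ref))
    if ref == 0:
        # Relative error is undefined at zero; fall back to absolute error.
        return abs(pred) <= rel_tol
    return abs(pred - ref) / abs(ref) <= rel_tol
```

For example, `within_tolerance(100.05, 100.0)` passes (0.05% error) while `within_tolerance(101.0, 100.0)` fails (1% error).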

Figure 4: Distribution of error types across representative models, showing Modeling & Logic errors as the primary bottleneck.
Table 2: Main model performance comparisons on the OptiVerse benchmark (%)

| Large Language Model | MP | CO | SO | DO | OC | GO | Easy | Med. | Hard | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|
| *Open-Source Non-Thinking Models* | | | | | | | | | | |
| Qwen3-8B-Instruct | 23.98 | 22.69 | 20.83 | 14.38 | 6.41 | 13.73 | 42.67 | 12.25 | 7.67 | 20.00 |
| Qwen3-30B-Instruct | 38.96 | 36.13 | 31.67 | 34.25 | 19.23 | 15.69 | 70.33 | 21.75 | 14.00 | 34.00 |
| Qwen3-235B-Instruct | 44.41 | 47.06 | 45.83 | 41.10 | 42.31 | 41.18 | 78.33 | 39.75 | 16.67 | 44.40 |
| DeepSeek-V3.2-Chat | 51.23 | 47.48 | 45.00 | 43.15 | 21.79 | 35.29 | 79.67 | 39.25 | 19.00 | 45.30 |
| *Open-Source Thinking Models* | | | | | | | | | | |
| Qwen3-30B-Thinking | 49.59 | 46.64 | 46.67 | 46.58 | 30.77 | 43.14 | 80.67 | 40.00 | 20.33 | 46.30 |
| DeepSeek-V3.2-Reasoner | 53.68 | 52.52 | 45.00 | 49.32 | 28.21 | 49.02 | 84.33 | 44.50 | 21.33 | 49.50 |
| Qwen3-235B-Thinking | 53.68 | 51.26 | 49.17 | 52.74 | 43.59 | 49.02 | 87.33 | 47.00 | 21.33 | 51.40 |
| *Closed-Source Models* | | | | | | | | | | |
| Gemini-2.5-Pro | 53.68 | 49.16 | 50.00 | 47.95 | 38.46 | 43.14 | 87.00 | 43.75 | 20.00 | 49.60 |
| o3 | 52.04 | 53.78 | 45.83 | 56.85 | 39.74 | 45.10 | 86.67 | 47.25 | 20.67 | 51.10 |
| GPT-5.2 | 55.86 | 57.14 | 50.00 | 57.53 | 50.00 | 54.90 | 91.00 | 50.75 | 25.33 | 55.20 |
| Gemini-3-Pro | 58.04 | 57.14 | 56.67 | 56.85 | 42.31 | 50.98 | 89.00 | 52.75 | 27.00 | 55.90 |

Quick Links & Resources

📄 Paper

Read the full paper on arXiv

🤗 Dataset

Access OptiVerse on Hugging Face

💻 Code

View evaluation code and prompts on GitHub