Main Leaderboard


| Model | Sahara Score | Text Classification Tasks | Text Generation Tasks | MCCR Tasks | Token-Level Tasks |
|---|---|---|---|---|---|
| 🏆 gpt-5-2025-08-07 | 50.03 | 48.19 | 11.10 | 75.77 | 65.07 |
| 🥈 claude-4-sonnet-20250514 | 40.82 | 47.28 | 10.59 | 60.53 | 44.86 |
| 🥉 gpt-4.1 | 36.04 | 48.07 | 11.06 | 50.98 | 34.05 |
| CohereForAI/c4ai-command-a-03-2025 | 29.93 | 38.64 | 10.36 | 45.55 | 25.16 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 29.19 | 39.02 | 11.45 | 47.36 | 18.94 |
| google/gemma-3-27b-it | 28.07 | 44.44 | 8.19 | 43.20 | 16.45 |
| meta-llama/Llama-3.3-70B-Instruct | 27.05 | 37.41 | 9.80 | 44.77 | 16.24 |
| meta-llama/Llama-3.1-70B-Instruct | 26.57 | 35.96 | 11.15 | 43.66 | 15.51 |
| Qwen/Qwen3-30B-Instruct-2507 | 24.73 | 30.01 | 9.79 | 42.28 | 16.85 |
| deepseek-ai/DeepSeek-R1-Distill-Llama-70B | 24.42 | 34.96 | 10.91 | 39.50 | 12.31 |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-32B | 24.08 | 29.27 | 9.61 | 39.70 | 17.74 |
| google/gemma-2-27b-it | 23.80 | 24.53 | 9.01 | 40.93 | 20.72 |
| CohereForAI/c4ai-command-r-plus-08-2024 | 23.21 | 33.72 | 10.85 | 32.45 | 15.82 |
| google/gemma-2-9b-it | 22.80 | 28.10 | 8.54 | 37.88 | 16.68 |
| google/gemma-3-12b-it | 22.25 | 33.55 | 8.36 | 35.95 | 11.16 |
| Qwen/Qwen3-4B-Instruct-2507 | 21.49 | 28.88 | 9.93 | 33.74 | 13.41 |
| Tower-Babel/Babel-83B-Chat | 21.25 | 22.89 | 7.52 | 37.27 | 17.31 |
| meta-llama/Llama-3.1-8B-Instruct | 21.08 | 31.41 | 9.92 | 33.88 | 9.11 |
| Tower-Babel/Babel-9B-Chat | 19.61 | 24.63 | 9.35 | 33.73 | 10.73 |
| CohereForAI/c4ai-command-r7b-12-2024 | 18.06 | 23.53 | 5.84 | 28.34 | 14.53 |
| CohereForAI/aya-23-35B | 17.75 | 22.36 | 7.06 | 29.56 | 12.00 |
| CohereForAI/aya-23-8B | 17.30 | 21.94 | 6.55 | 26.86 | 13.86 |
| google/gemma-2-2b-it | 17.16 | 17.69 | 6.73 | 30.30 | 13.92 |
| google/gemma-3-4b-it | 17.10 | 14.65 | 6.98 | 32.84 | 13.95 |
| microsoft/Phi-4-mini-instruct | 16.78 | 16.50 | 5.10 | 33.73 | 11.78 |
| meta-llama/Llama-3.2-3B-Instruct | 16.49 | 18.08 | 6.56 | 29.33 | 12.00 |
| meta-llama/Llama-3.2-1B-Instruct | 14.64 | 16.34 | 6.39 | 27.96 | 7.87 |
| microsoft/Phi-3.5-mini-instruct | 14.19 | 17.96 | 6.03 | 25.23 | 7.52 |

Citation

If you use the Sahara benchmark in a scientific publication, or if you find the resources on this website useful, please cite our ACL 2025 paper.