Real-world benchmark statistics and performance metrics for leading Large Language Models
Key metrics tracked: Models Tested · Tests Conducted · Avg Response Time · Success Rate
| Model | MMLU | HumanEval | GSM8K | HellaSwag | Latency | Cost/1M tokens |
|---|---|---|---|---|---|---|
| GPT-4 Turbo | 86.4% | 87.1% | 92.0% | 95.3% | 1.8s | $10.00 |
| Claude 3.5 Sonnet | 88.7% | 92.0% | 91.6% | 89.0% | 1.2s | $3.00 |
| Gemini 1.5 Pro | 85.9% | 74.4% | 90.8% | 92.5% | 1.5s | $7.00 |
| Llama 3.1 405B | 85.2% | 89.0% | 95.1% | 89.2% | 2.1s | $5.00 |
| Mistral Large 2 | 84.0% | 76.2% | 87.5% | 88.4% | 0.9s | $2.00 |
| Qwen 2.5 72B | | | | | | |
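One way to read the table above is to weigh benchmark quality against price. The sketch below is an illustrative analysis, not part of the original benchmark suite: it copies the reported figures verbatim and ranks models by average benchmark score per dollar of cost per 1M tokens (the `score_per_dollar` metric is an assumption made here for demonstration; Qwen 2.5 72B is omitted because its figures are not reported).

```python
# Illustrative sketch: rank the benchmarked models by average score
# (MMLU, HumanEval, GSM8K, HellaSwag) per dollar per 1M tokens.
# Figures are copied from the table above; the ranking metric itself
# is a hypothetical choice, not the document's methodology.

models = {
    # name: (MMLU, HumanEval, GSM8K, HellaSwag, latency_s, cost_per_1m_usd)
    "GPT-4 Turbo":       (86.4, 87.1, 92.0, 95.3, 1.8, 10.00),
    "Claude 3.5 Sonnet": (88.7, 92.0, 91.6, 89.0, 1.2, 3.00),
    "Gemini 1.5 Pro":    (85.9, 74.4, 90.8, 92.5, 1.5, 7.00),
    "Llama 3.1 405B":    (85.2, 89.0, 95.1, 89.2, 2.1, 5.00),
    "Mistral Large 2":   (84.0, 76.2, 87.5, 88.4, 0.9, 2.00),
}

def avg_score(row):
    # Mean of the four benchmark percentages.
    return sum(row[:4]) / 4

def score_per_dollar(row):
    # Average benchmark points bought per dollar of cost per 1M tokens.
    return avg_score(row) / row[5]

ranked = sorted(models.items(),
                key=lambda kv: score_per_dollar(kv[1]),
                reverse=True)
for name, row in ranked:
    print(f"{name}: avg {avg_score(row):.1f}%, "
          f"{score_per_dollar(row):.1f} pts per $")
```

On these numbers, the cheaper models dominate the cost-efficiency view (Mistral Large 2 first, GPT-4 Turbo last) even though GPT-4 Turbo and Claude 3.5 Sonnet have the highest raw averages.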