Jul 11, 2025
Quick Evaluation of Grok 4

Neha Patel
AI Engineer
Grok 4, released by xAI on July 9, 2025, is the latest in the company's large language model series and aims to outperform top-tier models such as Google's Gemini 2.5 Pro and OpenAI's GPT-4o. It is built with a strong focus on multi-step logic and advanced reasoning, and excels at tasks requiring chain-of-thought and academic-style problem solving. Grok 4 Heavy, a more advanced variant of the chatbot, uses a multi-agent architecture in which several agents work independently on the same task, then compare results and settle on a final output. Grok 4 distinguishes itself from other large language models, such as GPT-4o and Claude 3.5, with its PhD-level fluency in areas such as physics, coding, and scientific research.
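To make the multi-agent idea concrete, here is a toy sketch of the general pattern: several agents answer the same task independently, and the answers are then compared. This is not xAI's implementation. The majority vote used to pick the output is an assumption, and `run_agents` and the stand-in agents are hypothetical names invented for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_agents(agents, task):
    """Run every agent on the same task independently, then majority-vote.

    Assumption: answers are compared by a simple majority vote. How Grok 4
    Heavy actually reconciles its agents is not described in this post.
    """
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        # pool.map preserves the order of the agents in the result list.
        answers = list(pool.map(lambda agent: agent(task), agents))
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes, answers


# Hypothetical stand-in agents; a real system would call a model here.
agents = [lambda task: "42", lambda task: "42", lambda task: "41"]
winner, votes, answers = run_agents(agents, "What is 6 * 7?")
print(winner, votes, answers)
```

The point of the sketch is that no agent sees another's answer until the comparison step, so one agent's mistake can be outvoted by the others.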
Background
Grok 4 has posted strong results across a variety of benchmark tests. Grok 4 Heavy achieved a perfect 100% score on AIME25 (the 2025 American Invitational Mathematics Examination), just surpassing OpenAI's o3, which scored 98.4%. It also scored 38.6% on Humanity's Last Exam, a set of 2,500 PhD-level questions, demonstrating deep academic reasoning capabilities.
Benchmarks
A recent benchmark conducted by Alex Olteanu at DataCamp tested Grok 4 on a variety of tasks, including maths, coding, and long-context multimodal analysis.
Grok 4 performed well on a challenging maths puzzle: form three numbers x, y, and z from the digits 0-9 such that x + y = z, using each digit exactly once. It wrote Python code to generate all permutations of the digits and returned the 96 valid solutions. It also explored alternative combinations and searched the web to learn more about the puzzle and confirm its answer, taking 157 seconds in total. This demonstrated Grok 4's strong computational reasoning and flexibility in problem solving.
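A brute-force search of this kind can be sketched in a few lines. This is not the code Grok 4 produced. It assumes that x and y have three digits and z has four, with no leading zeros, and that x + y and y + x count as separate solutions. The function name `pandigital_sums` is made up for this example.

```python
from itertools import permutations


def pandigital_sums():
    """Find (x, y, z) with x + y == z that use each digit 0-9 exactly once.

    Assumptions (not stated in the original puzzle text): x and y are
    three-digit numbers, z is a four-digit number, no leading zeros.
    """
    solutions = []
    # Pick six distinct digits for x and y; z is then fixed by the sum.
    for a, b, c, d, e, f in permutations(range(10), 6):
        if a == 0 or d == 0:
            continue  # skip leading zeros in x or y
        x = 100 * a + 10 * b + c
        y = 100 * d + 10 * e + f
        z = x + y
        # Valid only if x, y and z together contain every digit exactly once.
        if sorted(f"{x}{y}{z}") == list("0123456789"):
            solutions.append((x, y, z))
    return solutions


solutions = pandigital_sums()
print(len(solutions), solutions[:3])
```

Whether the count printed here matches the 96 solutions reported in the benchmark depends on these assumptions matching the original prompt. One valid triple under them is 246 + 789 = 1035.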
Next, a 167-page PDF was uploaded, and Grok 4 was prompted to analyse the report, identify the three most informative graphs, summarise them, and return their page numbers. This was where Grok 4 struggled the most. Its analysis was incomplete: the model focused only on the first 50 pages and misidentified pages and figures.
These tests suggest that Grok 4 handles maths, code, and text-only problems well, but currently falls short on visual understanding.
Pricing
Grok 4 comes at a higher price point than other language models: $300 per year for the standard model and $3,000 per year for the more advanced Grok 4 Heavy. This pricing suggests that xAI is targeting higher-value users who require top-tier performance. For those users the cost may be justified: Grok 4 consistently outperforms leading competitors on several reasoning benchmark tests, making it a compelling option for developers, researchers, and anyone working on complex, logic-heavy tasks.