Grok 4, released by xAI on July 9, 2025, is the latest in the company's series of large language models, and aims to outperform top-tier models such as Google's Gemini 2.5 Pro and OpenAI's GPT-4o. It is built with a strong focus on multi-step logic and advanced reasoning, and excels at tasks requiring chain-of-thought reasoning and academic-style problem solving. Grok 4 Heavy, a more advanced variant, uses a multi-agent architecture in which several agents work independently on the same task, then compare their results and agree on a final output. Grok 4 aims to distinguish itself from other large language models, such as GPT-4o and Claude 3.5, through PhD-level fluency in areas such as physics, coding, and scientific research.

Background

Grok 4 has posted strong results across a variety of benchmark tests. Grok 4 Heavy achieved a perfect 100% score on AIME25 (the 2025 American Invitational Mathematics Examination), just surpassing OpenAI's o3, which scored 98.4%. Grok 4 also scored 38.6% on Humanity's Last Exam, a set of 2,500 PhD-level questions, demonstrating deep academic reasoning capabilities.

Benchmarks

A recent set of tests conducted by Alex Olteanu at DataCamp evaluated Grok 4 across a variety of tasks, including maths, coding, and long-context multimodal analysis.

In a challenging maths puzzle (form three numbers x, y and z from the digits 0-9, using each digit exactly once, such that x + y = z), Grok 4 performed well. It wrote Python code to generate all permutations of the digits and returned 96 valid solutions. It also explored alternative combinations and searched the web to learn more about the puzzle and confirm its answer, taking 157 seconds in total. This demonstrated Grok 4's strong computational reasoning and flexible problem solving.
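
The brute-force approach described above can be sketched as follows. This is a minimal illustration, not the code Grok 4 actually produced: the function name `pandigital_sums`, the 3-digit + 3-digit = 4-digit split, and the rule that no number starts with 0 are all assumptions. Rather than permuting all ten digits, the sketch enumerates only the digits of x and y and derives z from them, which gives the same results for this split with far fewer iterations.

```python
from itertools import permutations

DIGITS = "0123456789"

def pandigital_sums(x_len=3, y_len=3):
    """Return all (x, y, z) with x + y == z that use each digit 0-9 exactly once.

    Assumptions (not stated in the article): x and y have the given lengths,
    z takes the remaining digits, and no number has a leading zero.
    """
    solutions = []
    for perm in permutations(DIGITS, x_len + y_len):
        x_digits, y_digits = perm[:x_len], perm[x_len:]
        # Skip candidates with a leading zero.
        if x_digits[0] == "0" or y_digits[0] == "0":
            continue
        x = int("".join(x_digits))
        y = int("".join(y_digits))
        z = x + y
        # z is valid only if it consists of exactly the unused digits.
        remaining = sorted(set(DIGITS) - set(perm))
        if sorted(str(z)) == remaining:
            solutions.append((x, y, z))
    return solutions

solutions = pandigital_sums()
print(len(solutions), "solutions, for example", solutions[0])
```

The number of solutions reported depends on the assumptions above, for instance whether swapping x and y counts as a distinct solution or whether other digit splits are allowed, so the printed count should be read with those choices in mind.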

Next, a 167-page PDF was uploaded, and Grok 4 was prompted to analyse the report, identify the three most informative graphs, summarise them, and return their page numbers. This was where Grok 4 struggled the most. Its analysis was incomplete: the model focused only on the first 50 pages and misidentified pages and figures.

These tests suggest that Grok 4 handles maths, code, and text-only problems well, but currently falls short on visual understanding.

Pricing

Grok 4 comes at a higher price point than other language models: $300 per year for the standard model, and $3,000 per year for the more advanced Grok 4 Heavy. This pricing suggests that xAI is targeting higher-value users who require top-tier performance. Even at this price, Grok 4's results on several reasoning benchmarks, where it outperforms leading competitors, make it a compelling option for developers, researchers, and anyone working on complex, logic-heavy tasks.
