
Hands on with Gemini 2.5 Pro: why it might be the most useful reasoning model yet




Unfortunately for Google, the release of its latest flagship language model, Gemini 2.5 Pro, was buried under the Studio Ghibli AI image storm that sucked the air out of the AI space. And perhaps wary after previous rocky launches, Google cautiously presented it as “Our most intelligent AI model” rather than following other AI labs in billing its new model as the best in the world.

However, hands-on experiments with real-world tasks show that Gemini 2.5 Pro is genuinely impressive and might currently be the best reasoning model available. This opens the way for many new applications and possibly puts Google at the forefront of the generative AI race.

[Image: Polymarket’s AI race market. Source: Polymarket]

Long context with good coding capabilities

The standout feature of Gemini 2.5 Pro is its very long context window and output length. The model can process up to 1 million input tokens (with 2 million coming soon), making it possible to fit multiple long documents or an entire code repository into the prompt when necessary. It also has an output limit of 64,000 tokens, versus around 8,000 for other Gemini models.
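To get a feel for what a 1-million-token window holds, a rough character-based heuristic can help: English text and code average roughly four characters per token. This is an approximation, not Gemini’s actual tokenizer, but it is enough for a quick back-of-the-envelope check of whether a set of files might fit:

```python
CHARS_PER_TOKEN = 4  # rough heuristic; real tokenizers vary by language and content
CONTEXT_LIMIT = 1_000_000  # Gemini 2.5 Pro's advertised input window


def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(texts: list[str]) -> bool:
    """Check whether the concatenated texts likely fit in the context window."""
    return sum(estimate_tokens(t) for t in texts) <= CONTEXT_LIMIT


# Under this heuristic, a 1M-token window holds roughly 4 MB of source text.
print(estimate_tokens("x" * 4_000_000))  # → 1000000
```

In practice you would read each repository file into `texts`; provider APIs also expose exact token-counting endpoints, which are the authoritative check before sending a large prompt.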

The long context window also allows for extended conversations, as each interaction with a reasoning model can generate tens of thousands of tokens, especially if it involves code, images and video (I’ve run into this issue with Claude 3.7 Sonnet, which has a 200,000-token context window).

For example, software engineer Simon Willison used Gemini 2.5 Pro to create a new feature for his website. Willison wrote in a blog post: “It crunched through my entire codebase and figured out all of the places I needed to change—18 files in total, as you can see in the resulting PR. The whole project took about 45 minutes from start to finish—averaging less than three minutes per file I had to modify. I’ve thrown a whole bunch of other coding challenges at it, and the bottleneck on evaluating them has become my own mental capacity to review the resulting code!”

Impressive multimodal reasoning

Gemini 2.5 Pro also has impressive reasoning abilities over unstructured text, images and video. For example, I provided it with the text of my recent article about sampling-based search and prompted it to create an SVG graphic that depicts the algorithm described in the text. Gemini 2.5 Pro correctly extracted key information from the article and created a flowchart for the sampling and search process, even getting the conditional steps right. (For reference, the same task took multiple interactions with Claude 3.7 Sonnet, and I eventually maxed out the token limit.)

[Image: the SVG flowchart generated by Gemini 2.5 Pro]

The rendered image had some visual errors (misplaced arrowheads) and could use a facelift, so I next tested Gemini 2.5 Pro with a multimodal prompt: I gave it a screenshot of the rendered SVG file along with the code and asked it to improve the graphic. The results were impressive. It corrected the arrowheads and improved the visual quality of the diagram.
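Misplaced arrowheads are a common SVG failure mode, and the usual fix is structural: define a `<marker>` element once in `<defs>`, position its tip with `refX`/`refY`, let `orient="auto"` rotate it along the line, and attach it with `marker-end`. A minimal hand-written illustration (my own sketch, not the model’s actual output) that builds such an SVG in Python:

```python
def arrow_svg() -> str:
    """Build a tiny SVG with one correctly attached arrowhead.

    The <marker> is defined once in <defs>; refX places the triangle's
    tip at the end of the line, and orient="auto" rotates the marker
    to follow the line's direction.
    """
    return """<svg xmlns="http://www.w3.org/2000/svg" width="200" height="60">
  <defs>
    <marker id="arrow" markerWidth="10" markerHeight="10"
            refX="9" refY="3" orient="auto">
      <path d="M0,0 L9,3 L0,6 Z" fill="black"/>
    </marker>
  </defs>
  <line x1="10" y1="30" x2="180" y2="30"
        stroke="black" stroke-width="2" marker-end="url(#arrow)"/>
</svg>"""


print(arrow_svg())
```

When a model emits arrowheads as separately positioned triangles instead of markers, they drift out of place as soon as the lines move; switching to `marker-end` is exactly the kind of correction the multimodal prompt produced.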

[Image: the improved diagram after the multimodal prompt]

Other users have had similar experiences with multimodal prompts. For example, in their tests, DataCamp replicated the runner game example presented on the Google blog, then provided the code and a video recording of the game to Gemini 2.5 Pro and prompted it to make some changes to the game’s code. The model could reason over the visuals, find the part of the code that needed to change, and make the correct modifications.

It is worth noting, however, that like other generative models, Gemini 2.5 Pro is prone to making mistakes such as modifying unrelated files and code segments. The more precise your instructions are, the lower the risk of the model making incorrect changes.

Data analysis with useful reasoning trace

Finally, I tested Gemini 2.5 Pro on my classic messy data analysis test for reasoning models. I provided it with a file containing a mix of plain text and raw HTML data I had copied and pasted from different stock history pages in Yahoo! Finance. Then I prompted it to calculate the value of a portfolio that would invest $140 at the beginning of each month, spread evenly across the Magnificent 7 stocks, from January 2024 to the latest date in the file.

The model correctly identified which stocks it had to pick from the file (Amazon, Apple, Nvidia, Microsoft, Tesla, Alphabet and Meta), extracted the financial information from the HTML data, and calculated the value of each investment based on the price of the stocks at the beginning of each month. It responded with a well-formatted table showing each stock’s and the portfolio’s value for every month, and provided a breakdown of what the entire investment was worth at the end of the period.
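The underlying arithmetic is simple dollar-cost averaging, but it is easy to fumble across dozens of data points: each month, $140 is split into $20 per stock, each $20 buys shares at that month’s price, and the final value is accumulated shares times the latest price. A sketch with two stocks over three months, using made-up prices (not real Magnificent 7 quotes):

```python
# Illustrative dollar-cost-averaging calculation; the prices below are
# invented for the example, not actual market data.
monthly_price = {
    "AAPL": [185.0, 190.0, 180.0],  # price at the start of each month
    "NVDA": [480.0, 620.0, 790.0],
}
MONTHLY_BUDGET = 140.0
per_stock = MONTHLY_BUDGET / 7  # $20 per stock, as in the full 7-stock test

# Accumulate fractional shares bought each month
shares = {ticker: 0.0 for ticker in monthly_price}
for month in range(3):
    for ticker, prices in monthly_price.items():
        shares[ticker] += per_stock / prices[month]

# Value the accumulated shares at the latest price in the data
final_value = sum(shares[t] * monthly_price[t][-1] for t in monthly_price)
print(f"${final_value:.2f}")  # → $136.81
```

Only two of the seven monthly $20 slices are tracked here; the real test repeats the same loop over all seven tickers and fifteen months, which is where a model without a careful reasoning trace tends to drop or double-count an entry.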

[Image: the model’s portfolio value table]

More importantly, I found the reasoning trace to be very useful. It is not clear whether Google reveals the raw chain-of-thought (CoT) tokens for Gemini 2.5 Pro, but the reasoning trace is very detailed. You can clearly see how the model is reasoning over the data, extracting different bits of information, and calculating the results before generating the answer. This can help troubleshoot the model’s behavior and steer it in the right direction when it makes mistakes.

[Image: an excerpt from the model’s reasoning trace]

Enterprise-grade reasoning?

One concern about Gemini 2.5 Pro is that it is only available in reasoning mode, which means the model always goes through the “thinking” process, even for very simple prompts that could be answered directly, adding latency and token costs where no deliberation is needed.

Gemini 2.5 Pro is currently in preview release. Once the full model is released and pricing information is available, we will have a better understanding of how much it will cost to build enterprise applications over the model. However, as inference costs continue to fall, we can expect it to become practical at scale.

Gemini 2.5 Pro might not have had the splashiest debut, but its capabilities demand attention. Its massive context window, impressive multimodal reasoning and detailed reasoning chain offer tangible advantages for complex enterprise workloads, from codebase refactoring to nuanced data analysis. 


