DeepMind’s new inference-time scaling technique improves planning accuracy in LLMs




Inference-time scaling is one of the big themes of artificial intelligence in 2025, and AI labs are attacking it from different angles. In its latest research paper, Google DeepMind introduced the concept of “Mind Evolution,” a technique that optimizes responses of large language models (LLMs) for planning and reasoning tasks. 

Inference-time scaling techniques try to improve LLMs’ performance by allowing them to “think” more when generating their answers. Practically, this means that instead of generating its answer in one go, a model is allowed to generate several answers, review and correct its answers, and explore different ways to solve the problem. 

Evolving LLM responses

Mind Evolution relies on two key components: search and genetic algorithms. Search algorithms are a common component in many inference-time scaling techniques; they let an LLM explore many reasoning paths and keep the most promising ones. Genetic algorithms are inspired by natural selection: they create and evolve a population of candidate solutions to optimize a goal, as measured by a "fitness function."

Mind Evolution algorithm (source: arXiv)

Mind Evolution starts by creating a population of candidate solutions expressed in natural language. The solutions are generated by an LLM that has been given a description of the problem along with useful information and instructions. The LLM then evaluates each candidate and refines any that do not meet the solution criteria.

The algorithm then selects the parents for the next generation of solutions by sampling from the existing population, with higher-quality solutions having a greater chance of being selected. It next creates new solutions through crossover (choosing parent pairs and combining their elements to create a new solution) and mutation (making random changes to newly created solutions). It reuses the evaluation method to refine the new solutions.

The cycle of evaluation, selection and recombination continues until the algorithm reaches the optimal solution or exhausts a preset number of iterations.
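The loop described above can be sketched in a few dozen lines. The following is a toy illustration of the evolutionary structure only, not DeepMind's code: real candidates are natural-language plans produced and refined by an LLM, while here each candidate is a list of numbers and "fitness" is simply how close its sum is to a target.

```python
import random

TARGET = 100  # stand-in for "the plan satisfies all constraints"

def fitness(candidate):
    """Higher is better; 0 means the candidate is optimal."""
    return -abs(sum(candidate) - TARGET)

def refine(candidate):
    """Stand-in for LLM self-correction: nudge one element toward the target."""
    c = candidate[:]
    i = random.randrange(len(c))
    c[i] += 1 if sum(c) < TARGET else -1
    return c if fitness(c) > fitness(candidate) else candidate

def select_parent(population):
    """Tournament selection: fitter candidates are more likely to be parents."""
    a, b = random.sample(population, 2)
    return a if fitness(a) > fitness(b) else b

def crossover(p1, p2):
    """Combine elements of two parents into a new solution."""
    cut = random.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def mutate(candidate, rate=0.2):
    """Make small random changes to a newly created solution."""
    return [g + random.choice([-1, 0, 1]) if random.random() < rate else g
            for g in candidate]

def mind_evolution_sketch(pop_size=20, genes=5, generations=200):
    population = [[random.randint(0, 40) for _ in range(genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        best = max(population, key=fitness)
        if fitness(best) == 0:  # optimal solution reached, stop early
            return best
        children = []
        for _ in range(pop_size):
            child = crossover(select_parent(population), select_parent(population))
            child = refine(mutate(child))  # evaluate-and-improve step
            children.append(child)
        population = children
    return max(population, key=fitness)

random.seed(0)
best = mind_evolution_sketch()
print(sum(best))
```

In the actual system, `refine` corresponds to the LLM critiquing and rewriting a candidate plan using the evaluator's feedback, and `fitness` is the programmatic solution evaluator described below.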

Refinement process for proposed solutions in the Mind Evolution algorithm (source: arXiv)

One of the most important parts of Mind Evolution is the evaluation function. Evaluators for inference-time scaling techniques often require the problem to be formalized from natural language into a structured, symbolic representation that a solver program can process. Formalizing a problem demands significant domain expertise and a deep understanding of which key elements must be represented symbolically and how they relate to one another, which limits the approach's applicability.

In Mind Evolution, the fitness function is designed to work with natural language planning tasks where solutions are expressed in natural language. This allows the system to avoid formalizing problems, as long as a programmatic solution evaluator is available. It also provides textual feedback in addition to a numerical score, which allows the LLM to understand specific issues and make targeted improvements.
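Conceptually, such an evaluator checks a plan against constraints and returns both a numerical score and natural-language feedback the LLM can act on. Here is a minimal sketch; the constraint names, plan fields, and thresholds are invented for illustration:

```python
def evaluate_trip_plan(plan, budget=2000, required_days=5):
    """Return (score, feedback) for a toy trip-plan dict.

    The numeric score drives selection; the textual feedback tells the
    LLM which specific constraint was violated so it can make a
    targeted fix rather than a blind retry.
    """
    score, feedback = 0, []
    if plan.get("total_cost", float("inf")) <= budget:
        score += 1
    else:
        feedback.append(f"Plan exceeds the ${budget} budget.")
    if plan.get("days") == required_days:
        score += 1
    else:
        feedback.append(f"Itinerary must cover exactly {required_days} days.")
    return score, feedback

score, feedback = evaluate_trip_plan({"total_cost": 2500, "days": 5})
print(score, feedback)
```

Because the evaluator operates on the plan itself rather than a symbolic encoding of the problem, no formalization step is needed.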

“We focus on evolving solutions in natural language spaces instead of formal spaces. This removes the requirement of task formalization, which requires significant effort and expert knowledge for each task instance,” the researchers write.

Mind Evolution also uses an “island” approach to make sure it explores a diverse set of solutions. At each stage, the algorithm creates separate groups of solutions that evolve within themselves. It then “migrates” optimal solutions from one group to another to combine and create new ones.
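The island mechanism can be illustrated with a small sketch (the migration schedule and replacement policy below are assumptions, not the paper's exact scheme): each island evolves on its own, and periodically the best candidate of each island is copied into a neighboring island, replacing that island's weakest member.

```python
def migrate(islands, fitness):
    """Copy each island's best member into the next island, replacing its worst."""
    bests = [max(isl, key=fitness) for isl in islands]  # snapshot before any changes
    for i, isl in enumerate(islands):
        incoming = bests[(i - 1) % len(islands)]        # best of the previous island
        worst = min(range(len(isl)), key=lambda j: fitness(isl[j]))
        isl[worst] = incoming
    return islands

# Toy usage: candidates are numbers, and fitness peaks at 50.
fitness = lambda x: -abs(x - 50)
islands = [[10, 20, 30, 40], [45, 55, 70, 90], [5, 48, 60, 99]]
islands = migrate(islands, fitness)
print(islands)
```

Keeping the islands separate between migrations preserves diversity, while the occasional exchange lets strong partial solutions spread and recombine.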

Mind Evolution in planning tasks

The researchers tested Mind Evolution against baselines such as 1-pass, where the model generates only one answer; Best-of-N, where the model generates multiple answers and chooses the best one; and Sequential-Revision+, a revision technique where 10 candidate solutions are proposed independently and then revised separately for 80 turns. Sequential-Revision+ is the closest to Mind Evolution, though it lacks the genetic algorithm component that combines the best parts of discovered solutions. For reference, they also include an additional 1-pass baseline that uses OpenAI's o1-preview.
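For contrast with the evolutionary loop, the Best-of-N baseline amounts to sampling independently and keeping the top-scoring candidate. In this minimal sketch, `sample` and `score` stand in for an LLM generator and a programmatic evaluator (both illustrative, not the paper's code):

```python
def best_of_n(sample, score, n):
    """Draw n independent candidates and return the highest-scoring one."""
    candidates = [sample() for _ in range(n)]
    return max(candidates, key=score)

# Toy usage: pick the candidate closest to a target value of 42.
samples = iter([10, 80, 41, 95, 30])
answer = best_of_n(sample=lambda: next(samples),
                   score=lambda x: -abs(x - 42),
                   n=5)
print(answer)  # → 41
```

Unlike Mind Evolution, nothing learned from one candidate ever feeds into the next: there is no refinement, crossover, or mutation, only independent draws.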

Performance on the Trip Planning benchmark. As the complexity of the task increases, the gap between Mind Evolution and other methods grows (source: arXiv).

The researchers carried out most tests on the fast and affordable Gemini 1.5 Flash. They also explored a two-stage approach, where the Gemini 1.5 Pro model is used when the Flash model can’t address the problem. This two-stage approach provides better cost-efficiency than using the Pro model on every problem instance.
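The two-stage strategy is a simple fallback pattern: try the cheap model first and escalate only when its answer fails validation. The sketch below uses hypothetical stand-ins (`call_flash`, `call_pro`, `is_valid`) rather than any real Gemini API:

```python
def call_flash(problem):
    """Hypothetical stand-in for solving with the cheaper Flash model."""
    return {"model": "gemini-1.5-flash", "plan": problem.upper()}

def call_pro(problem):
    """Hypothetical stand-in for solving with the stronger Pro model."""
    return {"model": "gemini-1.5-pro", "plan": problem.upper()}

def is_valid(solution):
    """Toy validation rule: pretend long plans from Flash fail the evaluator."""
    return len(solution["plan"]) <= 10 or solution["model"].endswith("pro")

def solve_two_stage(problem):
    solution = call_flash(problem)      # stage 1: cheap, fast model
    if not is_valid(solution):
        solution = call_pro(problem)    # stage 2: escalate only on failure
    return solution

print(solve_two_stage("short")["model"])
print(solve_two_stage("a very long itinerary")["model"])
```

Because most instances are resolved in stage 1, the expensive model is billed only for the hard residue, which is where the cost-efficiency gain comes from.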

The researchers tested Mind Evolution on several natural-language planning benchmarks for tasks such as trip and meeting planning. Previous research shows that LLMs can’t achieve good performance on these tasks without the aid of formal solvers.

For example, Gemini 1.5 Flash and o1-preview achieve success rates of only 5.6% and 11.7%, respectively, on TravelPlanner, a benchmark that simulates organizing a trip based on user preferences and constraints expressed in natural language. Even with Best-of-N over 800 independently generated responses, Gemini 1.5 Flash only achieves 55.6% success on TravelPlanner.

Performance on the TravelPlanner benchmark. As the complexity of the task increases, Mind Evolution remains consistently high-performing while other methods falter (source: arXiv).

In all their tests, Mind Evolution outperformed the baselines by a wide margin, especially as the tasks got more difficult. 

For example, Mind Evolution achieves a 95% success rate on TravelPlanner. On the Trip Planning benchmark, which involves creating an itinerary of cities to visit with a number of days in each, Mind Evolution achieved 94.1% on the test instances while other methods reached a maximum success rate of 77%. Interestingly, the gap between Mind Evolution and other techniques widens as the number of cities grows, indicating its ability to handle more complex planning tasks. With the two-stage process, Mind Evolution reached near-perfect success rates on all benchmarks.

Mind Evolution also proved a cost-effective approach for solving natural-language planning problems, using a fraction of the number of tokens used by Sequential-Revision+, the only other technique that comes close to its performance. 

“Overall, these results demonstrate a clear advantage of an evolutionary strategy that combines a broad search, through stochastic exploration, with a deep search that leverages an LLM for solution refinement,” the researchers write.


