baby steps, baby steps

Telling AI model to “take a deep breath” causes math scores to soar in study

DeepMind used AI models to optimize their own prompts, with surprising results.

Benj Edwards – Sep 19, 2023 5:38 pm

Google DeepMind researchers recently developed a technique to improve math ability in AI language models like ChatGPT by using other AI models to improve prompting, the written instructions that tell the AI model what to do. The study found that human-style encouragement improved math skills dramatically, in line with earlier results.

In a paper called "Large Language Models as Optimizers" listed this month on arXiv, DeepMind scientists introduced Optimization by PROmpting (OPRO), a method to improve the performance of large language models (LLMs) such as OpenAI’s ChatGPT and Google’s PaLM 2. This new approach sidesteps the limitations of traditional math-based optimizers by using natural language to guide LLMs in problem-solving. "Natural language" is a fancy way of saying everyday human speech.

"Instead of formally defining the optimization problem and deriving the update step with a programmed solver," the researchers write, "we describe the optimization problem in natural language, then instruct the LLM to iteratively generate new solutions based on the problem description and the previously found solutions."

Typically, in machine learning, techniques using algorithms such as derivative-based optimizers act as a guide for improving an AI model's performance. Imagine a model's performance as a curve on a graph: The goal is to find the lowest point on this curve because that's where the model makes the fewest mistakes. By using the slope of the curve to make adjustments, the optimizer helps the model get closer and closer to that ideal low point, making it more accurate and efficient at whatever task it's designed to do.
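To make the "follow the slope downhill" idea concrete, here is a minimal sketch of a derivative-based optimizer in Python. The curve, the starting point, the learning rate, and the step count are all illustrative assumptions, not anything taken from the DeepMind paper.

```python
# Toy example (assumed for illustration): the "error curve" is (x - 3)^2,
# whose lowest point sits at x = 3.

def loss(x: float) -> float:
    """Height of the curve at x; lower means fewer mistakes."""
    return (x - 3.0) ** 2


def slope(x: float) -> float:
    """Derivative of the curve at x."""
    return 2.0 * (x - 3.0)


def gradient_descent(start: float, learning_rate: float = 0.1, steps: int = 100) -> float:
    """Repeatedly step against the slope to move toward the low point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * slope(x)
    return x


best_x = gradient_descent(start=0.0)
print(best_x, loss(best_x))
```

Each step moves x a little in the direction that lowers the curve, so the value settles very close to 3. Real models adjust millions or billions of values this way rather than one.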

Rather than relying on formal mathematical definitions to perform this task, OPRO uses "meta-prompts" described in natural language to set the stage for the optimization process. The LLM then generates candidate solutions based on the problem’s description and previous solutions, and it tests them by assigning each a quality score.

In OPRO, two large language models play different roles: a scorer LLM evaluates the objective function, such as accuracy, while an optimizer LLM generates new solutions based on past results and a natural language description. Different pairings of scorer and optimizer LLMs are evaluated, including models like PaLM 2 and GPT variants. OPRO can optimize prompts for the scorer LLM by having the optimizer iteratively generate higher-scoring prompts. These scores help the system identify the best solutions, which are then added back into the "meta-prompt" for the next round of optimization.
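The scorer and optimizer loop described above can be sketched roughly as follows. This is not DeepMind's code: both "LLMs" are replaced with hypothetical stubs so the example runs on its own. The stub scorer simply looks up the GSM8K accuracies reported in this article, and every function name, along with the meta-prompt wording, is an assumption made for illustration.

```python
from typing import Callable, List, Tuple

# Stub data: accuracies reported in this article for PaLM 2 on GSM8K.
# A real scorer LLM would be run on a set of math problems instead.
REPORTED_ACCURACY = {
    "": 34.0,
    "Let's think step by step": 71.8,
    "Take a deep breath and work on this problem step by step": 80.2,
}


def stub_scorer(prompt: str) -> float:
    """Hypothetical stand-in for the scorer LLM: returns an accuracy score."""
    return REPORTED_ACCURACY.get(prompt, 0.0)


def stub_optimizer(meta_prompt: str) -> List[str]:
    """Hypothetical stand-in for the optimizer LLM: proposes one unseen prompt."""
    for candidate in REPORTED_ACCURACY:
        if candidate and f'"{candidate}"' not in meta_prompt:
            return [candidate]
    return []


def build_meta_prompt(task: str, history: List[Tuple[str, float]], top_k: int = 5) -> str:
    """Describe the task and list the best prompts found so far with their scores."""
    best = sorted(history, key=lambda pair: pair[1])[-top_k:]
    lines = [task, "Previous prompts and their scores:"]
    lines += [f'"{prompt}" -> {score}' for prompt, score in best]
    lines.append("Write a new prompt that scores higher.")
    return "\n".join(lines)


def opro_loop(
    task: str,
    scorer: Callable[[str], float],
    optimizer: Callable[[str], List[str]],
    rounds: int = 5,
) -> Tuple[str, float]:
    """Alternate between proposing prompts and scoring them; return the best pair."""
    history = [("", scorer(""))]  # start from the no-instruction baseline
    for _ in range(rounds):
        meta_prompt = build_meta_prompt(task, history)
        candidates = optimizer(meta_prompt)
        if not candidates:
            break  # the stub optimizer has nothing new to propose
        history.extend((c, scorer(c)) for c in candidates)
    return max(history, key=lambda pair: pair[1])


best_prompt, best_score = opro_loop(
    "Find an instruction that improves accuracy on grade-school math word problems.",
    stub_scorer,
    stub_optimizer,
)
print(best_prompt, best_score)
```

In a real setup, the two stubs would be replaced by calls to actual language models, and the scorer would measure accuracy on a sample of problems rather than consult a table.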

“Take a deep breath and work on this problem step by step”

Perhaps the most intriguing part of the DeepMind study is the impact of specific phrases on the output. Phrases like "let's think step by step" prompted each AI model to produce more accurate results when tested against math problem data sets. (This technique became widely known in May 2022 thanks to a now-famous paper titled "Large Language Models are Zero-Shot Reasoners.")

Consider a simple word problem, such as, "Beth bakes four two-dozen batches of cookies in a week. If these cookies are shared among 16 people equally, how many cookies does each person consume?" The 2022 paper found that instead of feeding a chatbot a word problem like this by itself, you get better results if you prefix it with "Let's think step by step" and then paste in the problem. The accuracy of the AI model's results almost always improves, and the technique works well with ChatGPT.
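For reference, the cookie problem works out to six cookies per person: four batches of two dozen is 96 cookies, and 96 shared among 16 people is 6 each. The short Python sketch below checks that arithmetic and shows one way the prompt could be assembled, following this article's description of placing the phrase before the problem. The variable names and the exact layout of the prompt are assumptions for illustration.

```python
# Worked arithmetic for the example word problem.
batches = 4
cookies_per_batch = 2 * 12  # two dozen
people = 16

total_cookies = batches * cookies_per_batch
cookies_per_person = total_cookies // people

# Assembling the prompt as described in the article: instruction first,
# then the word problem. The blank-line separator is an assumption.
instruction = "Let's think step by step"
problem = (
    "Beth bakes four two-dozen batches of cookies in a week. "
    "If these cookies are shared among 16 people equally, "
    "how many cookies does each person consume?"
)
prompt = f"{instruction}\n\n{problem}"

print(total_cookies, cookies_per_person)
print(prompt)
```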

Interestingly, in this latest study, DeepMind researchers found "Take a deep breath and work on this problem step by step" to be the most effective prompt when used with Google's PaLM 2 language model. The phrase achieved the top accuracy score of 80.2 percent in tests against GSM8K, which is a data set of grade-school math word problems. By comparison, PaLM 2, without any special prompting, scored only 34 percent accuracy on GSM8K, and the classic "Let’s think step by step" prompt scored 71.8 percent accuracy.

So why does this work? Obviously, large language models can't take a deep breath because they don't have lungs or bodies. They don't think and reason like humans, either. What "reasoning" they do (and "reasoning" is a contentious term among some, though it is readily used as a term of art in AI) is borrowed from a massive data set of language phrases scraped from books and the web. That includes things like Q&A forums, which include many examples of "let's take a deep breath" or "think step by step" before showing more carefully reasoned solutions. Those phrases may help the LLM tap into better answers or produce better examples of reasoning or problem-solving from the data set it absorbed into its neural network during training.

Even though working out the best ways to give LLMs human-like encouragement is slightly puzzling to us, that's not a problem for OPRO because the technique utilizes large language models to discover these more effective prompting phrases. DeepMind researchers think that the biggest win for OPRO is its ability to sift through many possible prompts to find the one that gives the best results for a specific problem. This could allow people to produce far more useful or accurate results from LLMs in the future.

Benj Edwards Senior AI Reporter

Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.