Large reasoning models research at COLM 2025
State of research in scaling reasoning, the current paradigm for improving LLMs
The Conference on Language Models (COLM) will take place in Montreal from October 7-10, 2025. Of the more than 400 accepted research papers, around 70 relate to “reasoning”1. Find all the accepted papers with a reasoning tag, abstracts, authors, PDF links, and review scores in this Google Sheet.
Peer-reviewed papers from an upcoming top LLM conference give a broad sense of the research area. As Large Reasoning Models (LRMs)2 are undeniably the cutting edge of language models, this first post on COLM 2025 papers introduces the latest research on LRMs, a subset of the 70 reasoning papers, covering research on:
Improving reasoning in LRMs,
Applications of LRMs in different domains & tasks,
Multi-modal reasoning enabled by LRMs,
Many limitations of LRMs, and
Understanding the reasoning in LRMs.
I aim to provide clear segues with friendly context into the different research papers, and in particular, how they tie into the narrative that “scaling reasoning is the paradigm that pushed the frontier this time around”.
0. LRMs represent the cutting edge of Language Models
Large Language Models (LLMs) have been improving rapidly since 2020. The initial progress came from the paradigm of scaling pretraining, with more data and larger models, giving LLMs robust natural language understanding and instruction following, with models like GPT-4. While these models had reasoning capabilities never seen before in an AI, they were not human-level in many reasoning and planning tasks. Scaling pretraining didn’t help with this, as its gains have plateaued.
Scaling reasoning/thinking tokens with inference-time compute has emerged as the new paradigm that improves LLMs, evident in the pioneering Large Reasoning Models (LRMs) OpenAI O1 and DeepSeek R1. This improved the capability to perform complex multi-step reasoning tasks and extended tool use, enabling several applications of LLM agents, including software engineering and computer use. We have not yet seen a plateau in scaling reasoning, but we could be racing towards one!
Rather than speculation, the research on LRMs at COLM 2025 sheds some light on where we stand. In summary: several methods have been proposed to improve reasoning and inference-time compute, suggesting that scaling reasoning might still have enough headroom. Reasoning models have been trained for several tasks in multiple domains, confirming it is an effective technique. Reasoning was also key to enabling a few multi-modal reasoning tasks, suggesting its versatility. Notwithstanding spectacular achievements like earning a gold medal at IMO 2025, LRMs have many limitations, suggesting that scaling reasoning might not be a panacea. Fortunately, there are several investigations into scientifically understanding LRMs, which should be a guiding light to understand limitations, improve on them, or look for the next paradigm.
1. Improving the reasoning of LRMs
The first striking evidence, with OpenAI O1, that scaling reasoning would be the next frontier was self-verification and backtracking in language models’ reasoning chains. Hence, Large Reasoning Models. Then came the aha moment from DeepSeek R1, which showed how to achieve this long reasoning, a recipe others are still following. Still, there are avenues to improve reasoning.
Algorithmic synthetic data for long reasoning
The spontaneous self-correction method distills the self-verification capability of multi-turn, multi-agent collaboration into a single-turn generation. After distillation with Supervised Fine-Tuning (SFT), online RL is used to further improve self-verification.
Another work generates diverse reasoning traces on graph problems, which exercise the broad reasoning abilities required for mathematical and scientific reasoning. Training on these reasoning traces also showed ~20% improvement on non-mathematical reasoning tasks like logical and commonsense reasoning.
Similarly, Step-Wise RL (SWiRL) applies synthetic data filtering and RL optimisation to improve math reasoning and question answering with retrieval.
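The data-filtering step common to these synthetic-data pipelines can be sketched as rejection sampling: keep only the sampled reasoning traces whose final answer matches the gold answer. Here is a minimal sketch; the trace format and the `extract_answer` heuristic are my assumptions for illustration, not the actual SWiRL implementation:

```python
def extract_answer(trace: str) -> str:
    # Assume each trace ends with a line like "Answer: 42".
    for line in reversed(trace.strip().splitlines()):
        if line.startswith("Answer:"):
            return line.removeprefix("Answer:").strip()
    return ""

def filter_traces(samples, gold_answers):
    """Keep (question, trace) pairs whose final answer matches the gold answer."""
    kept = []
    for (question, trace), gold in zip(samples, gold_answers):
        if extract_answer(trace) == gold:
            kept.append((question, trace))
    return kept

samples = [
    ("2+2?", "2 plus 2 is 4.\nAnswer: 4"),
    ("3*3?", "3 times 3 is 6.\nAnswer: 6"),  # wrong answer, gets filtered out
]
kept = filter_traces(samples, ["4", "9"])  # keeps only the first trace
```

The filtered traces then serve as SFT data before the RL stage.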
Verifiers and multi-agent systems for better reasoning
R1-style training incentivises long reasoning with rewards and RL (GRPO). Another way to obtain self-verification capability is to build multi-agent systems with generators & verifiers, which is long reasoning in effect.
Putting the value back in RL augments RL methods without value functions, like GRPO, with generative verifiers to enable 8-32x more efficient test-time compute scaling.
Another work, Iterative DPO, does away with RL altogether and achieves comparable performance by training both generator and verifier with just pairwise preference optimisation.
MeMAD enables self-verification with a separate reflection module and memory bank during the debate of multiple agents.
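For context on GRPO, which several of these works build on or compare against: it avoids a learned value function by normalising each completion’s reward against the group of completions sampled for the same prompt. A minimal sketch of just the advantage computation (clipping, KL penalty, and the policy-gradient update are omitted):

```python
import statistics

def grpo_advantages(group_rewards):
    """Group-relative advantages: standardise each reward against the
    mean and standard deviation of its own sample group."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in group_rewards]

# Four completions for one prompt, rewarded by a verifier (1 = correct).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])  # → [1.0, -1.0, -1.0, 1.0]
```

Correct completions get positive advantages and incorrect ones negative, with no critic network needed; this is why verifiable rewards pair so naturally with GRPO.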
Novel approaches for reasoning
Most LRMs reason by generating natural language tokens. We don’t know whether this is an effective and comprehensive way of thinking in language models. COCONUT proposes a new way of reasoning in latent space: generating a sequence of continuous vectors, without generating, or being trained to generate, natural language tokens. The authors showed that continuous thought is better than NL thought when substantial search is needed in a reasoning task, like finding a valid path in a big graph.
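The mechanism can be sketched schematically: instead of sampling a token at each step and feeding its embedding back in, the model’s last hidden state is fed back directly as the next input “embedding”. This toy sketch uses a random matrix as a stand-in for the transformer; it illustrates the control flow, not the paper’s actual architecture or training:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) / np.sqrt(d)  # toy stand-in for the model

def step(h):
    return np.tanh(W @ h)  # one "forward pass"

def continuous_thoughts(h0, n_latent_steps):
    """COCONUT-style loop: feed the last hidden state back as the next
    input, producing latent thoughts with no token sampled in between."""
    h = h0
    thoughts = []
    for _ in range(n_latent_steps):
        h = step(h)
        thoughts.append(h)  # a continuous "thought", never decoded to text
    return h, thoughts

h_final, thoughts = continuous_thoughts(rng.normal(size=d), 4)
# Only after the latent steps would the model switch back to token decoding.
```

The design point is that latent thoughts can keep multiple candidate paths superposed in the vector, which plain token decoding forces into a single discrete choice at every step.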
While reasoning is generally serialized, Adaptive Parallel Reasoning (APR) tries to spawn multiple reasoning chains and learns to coordinate among them with a new end-to-end RL strategy.
Apparently, the hidden states of reasoning tokens in LRMs carry information about answer correctness and about phases like execution, reflection, and transition thoughts. SEAL reduces excessive reflection and transition thoughts with reasoning steering vectors, while another work exits reasoning early by probing correctness from hidden states.
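The probing idea is simple to sketch: fit a linear classifier on hidden states to predict whether the eventual answer is correct, and exit early once it is confident. The sketch below uses synthetic “hidden states” (two Gaussian clusters) rather than real model activations, and a hand-rolled logistic-regression probe:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

# Synthetic stand-in for hidden states: correct vs. incorrect traces
# are shifted along opposite directions.
direction = rng.normal(size=d)
X = np.vstack([rng.normal(size=(200, d)) + direction,
               rng.normal(size=(200, d)) - direction])
y = np.array([1] * 200 + [0] * 200)  # 1 = trace leads to a correct answer

# Train a linear probe with gradient descent on the logistic loss.
w = np.zeros(d)
for _ in range(200):
    z = np.clip(X @ w, -30, 30)       # avoid overflow in exp
    p = 1 / (1 + np.exp(-z))          # predicted correctness probability
    w -= 0.1 * X.T @ (p - y) / len(y) # gradient step

acc = float(((X @ w > 0) == (y == 1)).mean())
# An early-exit policy would stop generating once p exceeds a threshold.
```

On real activations the signal is of course weaker than in this toy setup, but the finding is that it is strong enough to be useful.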
2. Applications of LRMs in different domains & tasks
DeepSeek R1-style training to incentivise reasoning has become the norm for improving complex reasoning. Long thoughts, more reasoning steps, and extended tool calls underlie the improved performance in all the following tasks.
LRMs for Medical knowledge reasoning. FineMedLM-o1 is trained on medical long-form reasoning data to enable advanced dialogue and deep reasoning capabilities supporting differential diagnosis and medication recommendations.
LRMs for Law. Towards reasoning-aware legal AI systems, LawFlow collects lawyers’ thought processes and compares them to LRM’s thought processes.
LRMs for logical reasoning. Performing algorithmic reasoning on the 3-SAT problem, the authors found that “Unlike other LLMs (GPT-4o & DeepSeek V3), R1 shows signs of having learned the underlying reasoning”. LRMs internalised the search process required for algorithmic reasoning!
LRMs for story generation. A model trained with RL with verifiable rewards, similar to R1, reasons over a story and plans the next chapter, enabling the generation of high-quality stories spanning thousands of tokens.
LRMs for search. Search-R1. “LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval.”
LRMs for Text-to-SQL. Reasoning-SQL. Improving reasoning in Text-to-SQL with rewards tailored to SQL generation, such as syntax checks, schema-linking, and n-gram similarity. RL-trained 14B model outperforms O3-mini by 4%.
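Two of the named reward components, syntax checks and n-gram similarity, are easy to sketch as a composite reward. The weights, the tiny stand-in schema, and the use of SQLite’s `EXPLAIN` as a cheap validity check are my assumptions for illustration, not Reasoning-SQL’s actual reward design:

```python
import sqlite3

def syntax_ok(sql: str) -> bool:
    """Check the query compiles against a stand-in schema via EXPLAIN.
    (Slightly stricter than pure syntax: unknown tables also fail.)"""
    try:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

def ngram_overlap(pred: str, ref: str, n: int = 2) -> float:
    """Fraction of the reference query's n-grams found in the prediction."""
    def grams(s):
        toks = s.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    gp, gr = grams(pred), grams(ref)
    return len(gp & gr) / len(gr) if gr else 0.0

def sql_reward(pred: str, ref: str) -> float:
    # Assumed 50/50 weighting between the two signals.
    return 0.5 * syntax_ok(pred) + 0.5 * ngram_overlap(pred, ref)

ref = "SELECT name FROM users WHERE id = 1"
r_good = sql_reward("SELECT name FROM users WHERE id = 1", ref)  # 1.0
r_bad = sql_reward("SELEC name users", ref)                       # 0.0
```

Such partial rewards give the RL signal gradient even when execution accuracy alone would be a flat 0, which is the motivation for tailoring rewards to SQL generation.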
LRMs for summarisation. This work proposes SumFeed-CoT, a long-CoT dataset with reflective reasoning. Trained on it, ReFeed enhances summarization refinement along multiple dimensions through reflective reasoning on feedback.
3. Multi-modal reasoning enabled by LRMs
Long thoughts are a characteristic of system-2 reasoning, and it is shown to improve visual reasoning as well, which is thought to be system-1 heavy. (pun intended :))
LongPerceptualThoughts are long-thought traces for perceptual tasks, made verifiable with a three-stage framework that starts from multiple-choice questions and expands them into CoTs. Training on these thoughts achieves +3.4 points across 5 vision benchmarks.
BigCharts-R1 is the SOTA chart reasoning model trained R1-style. The authors tackle this hard-to-verify task by rendering diverse chart images from real-world charts and data; rendered charts have ground truths by construction, thus providing rewards.
4. Many limitations of LRMs
With all the improvements in reasoning and the new tasks it enables, it is tempting to conclude that long reasoning is enough for any task. Not yet, at least. LRMs fall short in several abilities, as language models in general and sometimes as reasoning models specifically.
Reasoning is not all you need?! Corrupted by Reasoning finds that LRMs struggle significantly with cooperation compared to traditional LLMs. In LLM-as-a-judge settings, although LRMs are better at fact-checking, they show a “superficial reflection bias” where phrases mimicking reasoning (e.g., “wait, let me think…”) significantly influence model judgements.
Limited in uncertainty awareness. Despite their reasoning tokens, language models struggle with uncertainty awareness. This leads to unfaithful reasoning traces and to divergent conclusions on logically inconsistent knowledge that only uncertainty could disambiguate. Two similar works use formal reasoning topology and the weights of logical rules to measure uncertainty, and find that SOTA LLMs/LRMs are limited.
Limited in mathematical reasoning. Really?! Despite unparalleled focus on mathematical reasoning in scaling reasoning, Brains vs. Bytes found that O3-mini drops from 48.3% to 14.3% if evaluated on complete proofs, not just the final answer. Through manual expert analysis on 455 IMO problems, they found that accuracy dropped for models including DeepSeek, Gemini, and OpenAI.
Limited in code reasoning. O3-mini (high) couldn’t identify counterexamples for over 90% of incorrect solutions, where expert humans can. In long-horizon SWE agent tasks, LLMs are not robust to corrupted functions and fail at system-level reasoning.
Language models are categorically worse than humans in visual cognition, social reasoning, and spatial reasoning. Also, LLMs are limited in linguistic reasoning in low-resource languages.
5. Understanding the reasoning in LRMs
Soon after the success of R1, several related RL training experiments were spawned to incentivise long reasoning, in pursuit of understanding its essence. Several works found that long reasoning emerges only if base models already have properties like verification, subgoal setting, and backtracking, as Qwen 2.5 models do and Llama 3.2 models do not.
SimpleRL-Zoo finds that by adjusting format reward and controlling query difficulty, most model families show long reasoning performance boosts, albeit with distinct patterns during training.
Style over substance also found that small models learn reasoning style, rather than substance, from distillation, and thus achieve similar performance whether trained with correct or incorrect answers. A framework to investigate the factors that enable long reasoning found that priming with few-shot examples containing reasoning patterns also elicited long reasoning in most models, even when those patterns led to incorrect answers.
The content-vs-form debate goes far into the past, and into many domains! It might offer a plausible hypothesis for other limitations of LLMs reported at COLM 2025: LRMs go on to overthink even with a missing premise; maybe they are just following the reasoning style. LRMs couldn’t identify inconsistencies in FlawedFictions either, maybe because the high-level linguistic structure still makes sense.
With all the findings and conclusions from scientific investigations in this post, it is imperative to give a shout-out to all the reproducible efforts in these papers. A related work, A sober look at progress, proposed an evaluation framework to understand the sensitivity of findings to subtle implementation choices, including random seeds, prompt formatting, and decoding parameters. It found that SFT shows consistently stronger generalisation than RL, contrary to popular belief.
§. Looking forward
Autonomous scientific discovery worthy of a Nobel prize would be a great goal for superhuman AI. Can LRMs alone (including multi-agent systems powered by them) take us there? Or do we need a new paradigm or two to push the frontier?
1. Including: reasoning or thinking in Large Reasoning Models (LRMs); mathematical, algorithmic, social, inductive, visual, and other reasoning capabilities of LLMs.
2. LRMs are also LLMs. I don’t like this distinction so much, but it helps in this article.


