The frontier of LLMs @ NeurIPS '25
5 research themes and 70+ papers to understand the jagged line of what's now and what's next
We are increasingly talking about AGI/ASI¹ as LLM systems achieve breakthroughs in mathematics, coding, and even scientific discovery. However, the techniques underlying these breakthroughs, the current state of LLM capabilities, and the diverse research directions that push the frontier remain relatively inaccessible.
In this blog post, I try to surface interesting research to shed light on the frontier of LLMs. I do that through the latest peer-reviewed research published at NeurIPS ‘25. Roughly 6,000 papers are presented at this conference, covering virtually every topic in ML. You can use Paper Finder to find specific NeurIPS papers by topic and poster session, along with their posters/slides.
These are the topics I explore:
Scaling reasoning and RL environments
Self-improving LLM systems
Memory & long-context LLMs
Computer use & web agents
AI-driven scientific discovery
1. Scaling reasoning and RL environments
While the recipe of DeepSeek-R1 is prominent for eliciting reasoning, the most cited paper at the conference observed that RL doesn’t really incentivize reasoning capacity beyond the base model: the performance gains from RL can be recovered by sampling ~1,000 responses from the base model. The authors, in turn, call for scaling RL data and compute, and for agents that use tools and external information, among other directions for improving reasoning with RL.
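To make the k-sampling framing concrete, here is a minimal sketch (not from the paper) of the standard unbiased pass@k estimator, which is how such comparisons are typically computed; the counts in the usage example are hypothetical.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n is correct, given that c of the n samples are correct."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical numbers for illustration only: a base model that solves a
# problem in 12 out of 1,000 samples has a low pass@1 but a very high
# pass@1000, which is the headroom RL training is said to be recovering.
n, c = 1000, 12
print(round(pass_at_k(n, c, 1), 4))     # 0.012 (one sample rarely succeeds)
print(round(pass_at_k(n, c, 1000), 4))  # 1.0   (some sample is correct)
```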
It just so happens that other papers at the conference showed improvements exactly in these directions. Reasoning Gym created 100+ reasoning environments with verifiable rewards for RL training. This paper created a cross-domain RL-for-reasoning dataset and showed the efficacy of mixed-domain RL training.
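As a loose illustration of what “verifiable rewards” means in these environments, here is a minimal sketch assuming a hypothetical answer-tag output format; real environments such as Reasoning Gym implement their own task-specific checkers.

```python
import re

def verifiable_reward(response: str, ground_truth: str) -> float:
    """Deterministic reward for RL-with-verifiable-rewards setups: extract
    the final answer from the model's response and compare it to the known
    correct answer. No learned reward model is involved."""
    match = re.search(r"<answer>(.*?)</answer>", response, flags=re.DOTALL)
    if match is None:
        return 0.0  # no parseable answer -> no reward
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Usage with a hypothetical arithmetic task:
print(verifiable_reward("Let me think... <answer>42</answer>", "42"))  # 1.0
```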
RL for reasoning was improved along other dimensions as well: Open-Reasoner-Zero, with an alternative to GRPO, showed improvements over DeepSeek-R1-Zero. ProRL proposed a variant of GRPO with KL divergence control. Thinking vs Doing improves agent reasoning by scaling environment interaction steps. General-Reasoner elicited reasoning in non-verifiable domains with an LLM-based verifier. This and this work investigated learning token-efficient reasoning, while, amusingly, this paper trained models for reasoning with just one training example.
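For readers unfamiliar with GRPO, here is a rough, simplified sketch of its core ingredients, group-normalized advantages plus a KL penalty to a reference policy (the knob that KL-control variants like ProRL tune). This is my illustration, not any paper’s released code, and it operates on per-response log-probabilities for a single group.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: normalize the rewards of all responses
    sampled for the same prompt (one group) by the group mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_objective(logp_new, logp_old, logp_ref, rewards, clip=0.2, beta=0.01):
    """Simplified GRPO-style objective: clipped importance-weighted advantages
    minus a KL penalty that keeps the policy close to a reference model
    (the 'KL divergence control')."""
    adv = grpo_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)
    policy_term = np.minimum(ratio * adv,
                             np.clip(ratio, 1 - clip, 1 + clip) * adv)
    # k3 estimator of KL(policy || reference)
    kl = np.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return (policy_term - beta * kl).mean()
```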
A similar recipe for eliciting reasoning has been adapted to many more tasks, including long-horizon reasoning in generative reward models (Think-RM), tool-integrated reasoning, reasoning over long videos, omnimodal reasoning (Omni-R1), vision-language model reasoning (VL-Rethinker), reasoning for real-world software engineering (SWE-RL), visual perception (Perception-R1), SQL (SQL-R1), and embodied reasoning (Robot-R1).
2. Self-improving LLMs
A lot of research showed self-improvement on several tasks. The recipe usually goes like this (a rough sketch of the loop follows the list):
1. The LLM generates a full or partial response
2. The LLM self-corrects or self-verifies its output (this can also take the form of rewards or preferences)
3. The LLM is trained on the positive data, or with preference learning on positive and negative data.
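Here is that loop as a minimal sketch, with hypothetical `generate`, `verify`, and `finetune` stand-ins for whatever sampling, self-verification, and training code a given paper uses.

```python
def self_improvement_round(model, prompts, generate, verify, finetune,
                           num_samples=8):
    """One round of the generate -> self-verify -> train recipe."""
    positives, negatives = [], []
    for prompt in prompts:
        for _ in range(num_samples):
            response = generate(model, prompt)           # step 1: generate
            if verify(model, prompt, response):           # step 2: self-verify
                positives.append((prompt, response))
            else:
                negatives.append((prompt, response))
    # step 3: SFT on positives, or preference learning on both sets
    return finetune(model, positives, negatives)
```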
At NeurIPS, there are many papers showing self-improvement. Embodied foundation models self-improve by learning with a steps-to-go prediction objective (self-verification). Guided-ReST self-improves with guided, reinforced self-training (training on positively predicted examples). Self-adapting LLMs create data and train themselves based on new input. Self-challenging LLM agents generate Code-as-Task data together with a verification function, and train themselves on correct samples. SPC fine-tunes a sneaky generator and a critic, enabling improvement through self-play. Sherlock self-corrects its visual reasoning and improves. MM-UPT defines a continual self-improvement loop with a self-rewarding mechanism based on majority voting. ExPO unlocks hard reasoning with self-explanations. SPRO uses self-play to improve image generation. SwS is even self-aware of its weaknesses, synthesizes problems, and trains itself. On and on and on.
If self-improvement is possible, where is the singularity? As the most-cited paper at NeurIPS and other related work suggest, there is evidence that RL training and self-improvement mostly lift performance at k=1 up to the base model’s performance at k=n, i.e., one response from the trained model matches what the base model achieves with n responses. Intuitively, the base model’s performance with finite n responses is bounded, so there is no singularity yet. Better self-improvement methods might leap that barrier, if not for the problem below.
The Feedback Friction paper shows that LLMs consistently resist feedback, even when the feedback is correct. This is a clear limitation that stunts self-improvement.
3. Memory & long-context LLMs
The positioning for the recently released GPT-5.1-Codex-Max is that it can work on a task for more than 24 hours continuously, over millions of tokens, which is more than the context length of the model. This is possible because it “automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.”
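A loose sketch of such a compaction loop, my illustration of the quoted behavior with hypothetical `agent_step` and `summarize` helpers, could look like this:

```python
def run_with_compaction(agent_step, summarize, task, context_limit=200_000):
    """Keep an agent working past its context window: when the transcript
    nears the limit, replace it with a compact summary and continue."""
    transcript = [f"Task: {task}"]
    while True:
        action, done = agent_step(transcript)   # one model call / tool step
        transcript.append(action)
        if done:
            return transcript
        if sum(len(t) for t in transcript) > context_limit:
            # compact the session into a fresh, short context window
            transcript = [f"Task: {task}", summarize(transcript)]
```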
Scaling the effective context length and designing “memory” architectures are an important line of research. An investigation at NeurIPS shows that long-context training improves reasoning even for tasks with short inputs, underscoring the importance of long-context capabilities. Several works argue that, despite the perceived importance, the evaluation of long-context abilities is neither objective nor comprehensive. MemSim uses a Bayesian Relation Network (BRNet) to automatically create a dataset for evaluating LLM hallucinations and the ability to memorize information from user messages. LongBioBench uses artificial biographies to comprehensively evaluate the long-context capabilities of text models, while MMLongBench does the same for vision-language models.
There are many types of memory architectures:
1. Templatic compression of long context. Agentic plan caching reuses structured plan templates from planning states of agent applications to enable the memory feature. AdmTree compresses lengthy context into hierarchical memory while summarizing and storing it as leaves in a semantic binary tree.
2. Indexing knowledge snippets that the LLM can query (a toy sketch of this style follows the list). A-Mem creates queryable, interconnected knowledge networks with the Zettelkasten method. G-Memory organizes memories in a three-tier graph hierarchy and performs bi-directional traversal to retrieve memories at different levels; it is specifically designed to encode prior collaboration experiences of multi-agent systems.
3. Memory retrieval based on model activations. PaceLLM, inspired by mechanisms in the prefrontal cortex, designs Persistent Activity (PA) to retrieve previous states and Cortical Expert (CE) to reorganize previous states into semantic modules. This work uses a Vision-Language Model (VLM) itself to encode and retrieve memories for a VLM, while Memory Decoder uses a plug-and-play trained memory decoder module. Similarly, 3DLLM-Mem retrieves past interactions based on current observations.
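To make the indexing style (type 2) concrete, here is a toy sketch of a Zettelkasten-like memory store; the keyword-based linking and scoring are placeholders of my own, not what A-Mem or G-Memory actually do (they use LLMs/embeddings and richer graph structure).

```python
from dataclasses import dataclass, field

@dataclass
class Note:
    text: str
    links: set = field(default_factory=set)  # ids of related notes

class MemoryIndex:
    """Toy Zettelkasten-style memory: notes linked by shared keywords,
    retrieved by keyword overlap with the query."""
    def __init__(self):
        self.notes = []

    def add(self, text: str) -> int:
        new_id = len(self.notes)
        note = Note(text)
        for i, other in enumerate(self.notes):
            # link notes that share at least one keyword
            if set(text.lower().split()) & set(other.text.lower().split()):
                note.links.add(i)
                other.links.add(new_id)
        self.notes.append(note)
        return new_id

    def retrieve(self, query: str, k: int = 3):
        q = set(query.lower().split())
        scored = sorted(self.notes,
                        key=lambda n: len(q & set(n.text.lower().split())),
                        reverse=True)
        return [n.text for n in scored[:k]]
```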
There is also research that adaptively gates which parts of the context each attention head attends to, simulating learned memory retrieval and forgetting. Coincidentally, at NeurIPS, both MoBA and SeerAttention propose this type of gating.
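A simplified, single-query, single-head sketch of this block-gating idea, my illustration rather than the MoBA or SeerAttention implementation, might look like this:

```python
import numpy as np

def gated_block_attention(q, K, V, block_size=64, top_k=4):
    """Pool keys into blocks, score each block against the query with a
    simple gate, and run softmax attention only over the top-k blocks."""
    n, d = K.shape
    blocks = [(s, min(s + block_size, n)) for s in range(0, n, block_size)]
    # gate: similarity between the query and each block's mean-pooled key
    gate_scores = np.array([q @ K[s:e].mean(axis=0) for s, e in blocks])
    chosen = np.argsort(gate_scores)[-top_k:]
    idx = np.concatenate([np.arange(*blocks[i]) for i in sorted(chosen)])
    # standard attention restricted to the selected blocks
    logits = (K[idx] @ q) / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ V[idx]

# Usage with random toy tensors:
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
out = gated_block_attention(q, K, V)
```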
4. Computer use/GUI and web agents
There is a lot of great research on this topic at NeurIPS: building datasets and benchmarks for computer use agents, methods to train and improve computer use and web agents, and investigations into safety risks with these agents.
Datasets and Benchmarks. To evaluate a computer use agent comprehensively, the dataset should contain demonstrations spanning several operating systems, applications, and websites. OpenCUA presents data with 3 operating systems and 200+ applications and websites, along with an annotation infrastructure. OSWorld-G creates a training dataset specific to different interface elements that enables compositional generalization to novel interfaces. macOSWorld creates a multi-lingual benchmark with 5 languages. REAL creates a dataset with real websites and programmatic checks as deterministic success criteria. TheAgentCompany simulates real-world tasks of a digital worker, including communicating with other coworkers.
Building computer use/GUI agents. GUI-G1 sets up DeepSeek-R1-Zero-style training, with which models surpass all prior models of similar size. R1-style training everywhere! GUI Exploration Lab improves agents with multi-turn reinforcement learning as opposed to single-turn. UI-Genie takes this a step further and defines a self-improving loop with a reward model, UI-Genie-RM. GUI-Rise uses structured reasoning with GRPO, along with history summarization and specialized rewards. BTL-UI proposes a brain-inspired Blink-Think-Link framework that demonstrates competitive performance against other methods.
Safety. WASP investigates prompt injection attacks on autonomous UI agents and finds that “even top-tier AI models can be deceived by simple, low-effort human-written injections in very realistic scenarios”. AgentDAM investigates the inadvertent use of unnecessary sensitive information. OS-Harm tests models for deliberate user data misuse, prompt injection attacks, and model misbehavior. RiOSWorld investigates risks associated with computer-use agents in two major categories: user-originated risks and environmental risks. MIP against Agent uses adversarially perturbed image patches to test the robustness of multimodal OS agents.
5. AI-driven scientific discovery
A position paper at NeurIPS argues “that foundation models are driving a fundamental shift in the scientific process”, from paradigm enhancement to paradigm transition. In line with that, there is a lot of work at NeurIPS ‘25 that evaluates autonomous scientific discovery with LLMs, first in data science and machine learning research, then in biology, physics, mathematics, and finance. The results right now are mixed, but with room for improvement.
Not surprisingly, ML researchers first want to automate their own tasks! AI-Researcher defines complete research pipelines, from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation, and proposes Scientist-Bench. The authors claim that the AI-Researcher approach achieves human-level quality on this benchmark. Another work proposes methods that perform better on MLE-Bench, which comprises problems from Kaggle competitions. Similarly, but with a multi-modal pipeline, this work proposes a system for automated model discovery for a given dataset.
MLR-Bench sources 201 research tasks from NeurIPS, ICLR, and ICML workshops, with solutions evaluated by MLR-Judge. Similarly, MLRC-Bench curates a suite of 7 competition tasks that reveal significant challenges for LLM agents: even the best agent closes only 9.3% of the gap between the baseline and the top human participant. Another work investigated language modeling itself by using LLMs to come up with efficient architectures for LLMs. The LLM Speedrunning benchmark tests LLM agents on reproducing NanoGPT improvements and concludes that “LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark.” Another interesting test for LLMs was predicting empirical AI research outcomes. Surprisingly, a system with a fine-tuned GPT-4.1 and a paper retrieval agent beats human experts by a large margin!
LLMs are also evaluated for R&D and scientific discovery in other domains: R&D-Agent-Quant for quantitative finance, scPilot for single-cell analysis, PhysGym for interactive physics discovery, CIDD for drug design, AstroVisBench for scientific computing and visualization in astronomy, and LabUtopia for testing embodied LLM or VLA agents in laboratory settings.
Beyond benchmarks with deterministic success criteria, a large part of R&D is non-verifiable and should be evaluated by subject matter experts. SciArena built a platform similar to Chatbot Arena, where human researchers across diverse scientific domains judge the answer quality of different LLMs on scientific literature-grounded tasks.
Outro
I am excited about all the interesting papers at NeurIPS and can’t wait to visit the posters for the above papers in San Diego next week.
¹ AGI: Artificial General Intelligence; ASI: Artificial Super Intelligence. The community has used many practically interchangeable terms; others are Human-level AI and Transformative AI, used by Sam Bowman when discussing a checklist for AI safety. There are a lot of definitions floating around, but let’s go with Sam Bowman’s: “AI that could perform as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&D.”


