<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[AI PaperTrails]]></title><description><![CDATA[AI PaperTrails aims to make AI research accessible. State of research from top conferences (NeurIPS, ICLR, COLM, & others), the evolution of techniques, and future outlook for AI research.]]></description><link>https://www.aipapertrails.com</link><image><url>https://substackcdn.com/image/fetch/$s_!3UqA!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F880c3f3d-5940-423d-9e6e-a4e0af266374_1024x1024.png</url><title>AI PaperTrails</title><link>https://www.aipapertrails.com</link></image><generator>Substack</generator><lastBuildDate>Fri, 17 Apr 2026 05:55:33 GMT</lastBuildDate><atom:link href="https://www.aipapertrails.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Prakash Kagitha]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[aipapertrails@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[aipapertrails@substack.com]]></itunes:email><itunes:name><![CDATA[Prakash Kagitha]]></itunes:name></itunes:owner><itunes:author><![CDATA[Prakash Kagitha]]></itunes:author><googleplay:owner><![CDATA[aipapertrails@substack.com]]></googleplay:owner><googleplay:email><![CDATA[aipapertrails@substack.com]]></googleplay:email><googleplay:author><![CDATA[Prakash Kagitha]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The frontier of LLMs @ NeurIPS '25]]></title><description><![CDATA[5 research themes and 70+ papers to understand the jagged line of what's now and what's next]]></description><link>https://www.aipapertrails.com/p/the-frontier-of-llms-neurips-25</link><guid isPermaLink="false">https://www.aipapertrails.com/p/the-frontier-of-llms-neurips-25</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Sun, 30 Nov 2025 15:59:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!iweC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F012152fd-551c-45d2-ab96-3c0aa8fd67e6_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!iweC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F012152fd-551c-45d2-ab96-3c0aa8fd67e6_1536x1024.png" width="1456" height="971" alt=""></figure></div>
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We are increasingly talking about AGI/ASI<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> as LLM systems achieve breakthroughs in <a href="https://deepmind.google/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">mathematics</a>, <a href="https://x.com/antigravity/status/1994106813491630543?s=20">coding</a>, and even <a href="https://edisonscientific.com/articles/announcing-kosmos">scientific discovery</a>. However, the techniques underlying these breakthroughs, the current state of LLM capabilities, and the diverse research directions that push the frontier remain relatively inaccessible.</p><p>In this blog post, I try to surface interesting research to shed light on the frontier of LLMs. I do that with the latest peer-reviewed research published at NeurIPS &#8216;25. Around ~6000 papers are published at this conference on virtually every topic in ML. You can use <a href="https://neurips.aipapertrails.com">Paper Finder</a> to find specific NeurIPS papers by topic and poster session, along with their posters/slides.</p><p>These are the topics I explore:</p><ol><li><p>Scaling reasoning and RL environments</p></li><li><p>Memory &amp; long-context LLMs</p></li><li><p>Self-improving LLM systems</p></li><li><p>Computer use &amp; web Agents</p></li><li><p>AI-driven scientific discovery</p></li></ol><h3>1. Scaling reasoning and RL environments</h3><p>While the recipe of <a href="https://arxiv.org/abs/2501.12948">DeepSeek-R1</a> is prominent for eliciting reasoning, <a href="https://neurips.aipapertrails.com/paper/4OsgYD7em5">the most cited paper</a> at the conference observed that RL doesn&#8217;t really incentivize reasoning capacity beyond the base model. It turns out the performance gains from RL could be recovered if one samples ~1000 responses from the base model. The authors, in turn, call for scaling RL data &amp; compute, and for agents to use tools &amp; external info, among other directions that improve reasoning with RL.</p><p>It just so happens that other papers at the conference showed improvement exactly in these directions. <a href="https://neurips.aipapertrails.com/paper/GqYSunGmp7">Reasoning Gym</a> created 100+ RL reasoning environments with verifiable rewards to do RL for reasoning. <a href="https://neurips.aipapertrails.com/paper/xUBgfvyip3">This paper</a> created a cross-domain RL-for-reasoning dataset and showed the efficacy of a mixed-domain RL training approach.</p><p>There were other dimensions, RL for reasoning is improved: <a href="https://neurips.aipapertrails.com/paper/NFM8F5cV0V">Open-Reasoner-Zero</a>, with an alternative to GRPO, showed improvements compared to DeepSeek-R1-Zero. <a href="https://neurips.aipapertrails.com/paper/YPsJha5HXQ">ProRL</a> proposed a variant of GRPO with KL divergence control. <a href="https://neurips.aipapertrails.com/paper/un1TRwNgiv">Thinking vs Doing</a> improves agent reasoning by scaling environment interaction steps. 
<a href="https://neurips.aipapertrails.com/paper/pBFVoll8Xa">General-Reasoner</a> elicited reasoning in non-verifiable domains with an LLM-based verifier. <a href="https://neurips.aipapertrails.com/paper/NhAi1w3s8Z">This</a> and <a href="https://neurips.aipapertrails.com/paper/AiZxn84Wdo">this</a> work investigated learning token-efficient reasoning, while, amusingly, <a href="https://neurips.aipapertrails.com/paper/IBrRNLr6JA">this paper</a> trained models for reasoning with just one training example.</p><p>Similar recipe of eliciting reasoning is adapted for many more tasks, including long-horizon reasoning in generative reward models (<a href="https://neurips.aipapertrails.com/paper/UfQAFbP6xq">Think-RM</a>), <a href="https://neurips.aipapertrails.com/paper/kXieirlPjF">tool-integrated reasoning</a>, reasoning on <a href="https://neurips.aipapertrails.com/paper/TxedB8hI5O">long video</a>, omnimodal reasoning (<a href="https://neurips.aipapertrails.com/paper/7Q1ApHpX31">Omni-R1</a>), vision-language model reasoning (<a href="https://neurips.aipapertrails.com/paper/4oYxzssbVg">VL-Rethinker</a>), reasoning for real-world software engineering (<a href="https://neurips.aipapertrails.com/paper/ULblO61XZ0">SWE-RL</a>), visual perception (<a href="https://neurips.aipapertrails.com/paper/BeXcXrXetA">Perception-R1</a>), SQL (<a href="https://neurips.aipapertrails.com/paper/hgJQcuDwm1">SQL-R1</a>), and embodied reasoning (<a href="https://neurips.aipapertrails.com/paper/N2bLuwofZ0">Robot-R1</a>).</p><h3>2. Self-improving LLMs</h3><p>A lot of research showed self-improvement on several tasks. The recipe usually goes like this: <br>1. LLM generates a full or partial response <br>2. LLM self-corrects or self-verifies itself (also could be rewards or preferences) <br>3. Train the LLM on positive data or preference learning on positive and negative data.</p><p>At NeurIPS, there are many papers showing self-improvement. Self-improving <a href="https://neurips.aipapertrails.com/paper/KXMIIVUB9U">embodied foundations models</a> by learning with steps-to-go prediction objectives (self-verify). <a href="https://neurips.aipapertrails.com/paper/Awl3ZhDHuQ">Guided-ReST</a> self-improves with guided, reinforced self-training. (training on positive predicted examples). <a href="https://neurips.aipapertrails.com/paper/JsNUE84Hxi">Self-adapting LLMs</a> create data and train themselves based on new input. <a href="https://neurips.aipapertrails.com/paper/9yusqX9DpR">Self-challenging LLM agents</a> that generate Code-as-Task data and a verification function, and train themselves on correct samples. <a href="https://neurips.aipapertrails.com/paper/JddJvNSiHk">SPC</a> fine-tunes a sneaky generator and critic that enables improvement with self-play. <a href="https://neurips.aipapertrails.com/paper/aLGgz4SOyu">Sherlock</a> self-corrects its visual reasoning and improves. <a href="https://neurips.aipapertrails.com/paper/HL1j92hb6z">MM-UPT</a> defines a continual self-improvement loop with a self-rewarding mechanism based on majority voting. <a href="https://neurips.aipapertrails.com/paper/D1PeGJtVEu">ExPO</a> unlocks hard reasoning with self-explanations. <a href="https://neurips.aipapertrails.com/paper/D1PeGJtVEu">SPRO</a> uses self-play for improving image generation. <a href="https://neurips.aipapertrails.com/paper/0jQUNQsZra">SwS</a> is even self-aware of its weaknesses and synthesizes problems and trains itself. 
<a href="https://neurips.aipapertrails.com/paper/gfX1nqBKtu">on</a> and <a href="https://neurips.aipapertrails.com/paper/D88yhLFzDR">on</a> and <a href="https://neurips.aipapertrails.com/paper/WdL3O58gde">on</a>.</p><p><strong>If self-improvement is possible, where is the singularity?</strong> Like the most-rated paper at NeurIPS and other related work, there is evidence that RL training or self-improvement only brings LLM performance at k=n to k=1, i.e, the better performance with n responses of LLM to the  performance of one response of an LLM. Intuitively, in the base model, the better performance with finite n responses is not infinite! So there is no singularity yet. However, better self-improvement methods might leap that barrier, only if not for the problem below.</p><p><a href="https://neurips.aipapertrails.com/paper/mGEPbyJ8OT">Feedback Friction</a> paper shows that LLMs consistently show resistance to feedback, even with correct feedback. <strong>This is a clear limitation that stunts self-improvement.</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipapertrails.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI PaperTrails! Subscribe for free to receive more NeurIPS &#8216;25 content and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>3. Memory &amp; long-context LLMs</h3><p>The positioning for the <a href="https://x.com/polynoamial/status/1991212955250327768?s=20">recently released</a> GPT-5.1-Codex-Max is that it can work on a task for more than 24 hours continuously over millions of tokens, which is more than the context length of the model. This is possible because it &#8220;automatically compacts its session when it approaches its context window limit, giving it a fresh context window. It repeats this process until the task is completed.&#8220;</p><p>Scaling the effective context length and designing &#8220;memory&#8220; architecture is an important line of research. <a href="https://neurips.aipapertrails.com/paper/wtcv48HImz">An investigation</a> at NeurIPS shows that long-context model training improves reasoning for even tasks with short input lengths, showing the importance of long-context capabilities. Several works argue that, despite the perceived importance, the evaluation of long-context abilities is not objective and comprehensive. <a href="https://neurips.aipapertrails.com/paper/vAT2xlaWJY">MemSim</a> uses a Bayesian Relation Network (BRNet) to automatically create a dataset to evaluate LLM hallucinations and the capability of memorizing information from user messages. <a href="https://neurips.aipapertrails.com/paper/atjpGqjG73">LongBioBench</a> with artificial biography aims to comprehensively evaluate the long-context capabilities of text models, while <a href="https://neurips.aipapertrails.com/paper/EPQi0v0OxL">MMLongBench</a> does that for vision-language models.</p><p>There are many types of memory architectures: </p><p>1. <strong>Templatic compression of long context</strong>. 
<a href="https://neurips.aipapertrails.com/paper/n4V3MSqK77">Agentic plan caching</a> reuses structured plan templates from planning states of agent applications to enable the memory feature. <a href="https://neurips.aipapertrails.com/paper/l3Qq5MU5VX">AdmTree</a> compresses lengthy context into hierarchical memory while summarizing and storing it as leaves in a semantic binary tree.</p><p>2. <strong>Indexing the knowledge snippets that LLM could query</strong>. <a href="https://neurips.aipapertrails.com/paper/FiM0M8gcct">A-Mem</a> creates queryable interconnected knowledge networks with the Zettelkasten method. <a href="https://neurips.aipapertrails.com/paper/mmIAp3cVS0">G-memory</a> organizes memories with a three-tier graph hierarchy and performs bi-directional memory traversal to retrieve different levels. This is specifically to encode prior collaboration experiences of multi-agent systems. </p><p>3. <strong>Memory retrieval based on model activations</strong>. <a href="https://neurips.aipapertrails.com/paper/1PvMSoKvZG">PaceLLM</a>, inspired by mechanisms in the prefrontal cortex, designs Persistent Activity (PA) to retrieve previous states and Cortical Expert (CE) to reorganize previous states into semantic modules. <a href="https://neurips.aipapertrails.com/paper/YaQnKRtTdh">This work</a> uses a Vision-Language Model (VLM) itself to encode and retrieve memories for a VLM, while <a href="https://neurips.aipapertrails.com/paper/ARJpQtLXfe">Memory Decoder</a> uses a plug-and-play trained memory decoder module. Similarly, <a href="https://neurips.aipapertrails.com/paper/q5QaTQcUbS">3DLLM-Mem</a> retrieves past interactions based on current observations.</p><p>There is also research that adaptively attends to different heads in self-attention to simulate learned memory retrieval and forgetting. Coincidentally, at NeurIPS, both <a href="https://neurips.aipapertrails.com/paper/RlqYCpTu1P">MoBA</a> and <a href="https://neurips.aipapertrails.com/paper/Nf8yfPDFTl">SeerAttention</a> propose this type of gating.</p><h3>4. Computer use/GUI and web agents</h3><p>There is a lot of great research on this topic at NeurIPS: building datasets and benchmarks for computer use agents, methods to train and improve computer use and web agents, and investigations into safety risks with these agents.</p><p><strong>Datasets and Benchmarks</strong>. To evaluate a computer use agent comprehensively, the dataset should contain demonstrations spanning several operating systems, applications, and websites. <a href="https://neurips.aipapertrails.com/paper/6iRZvJiC9Q">OpenCUA</a> presents data with 3 operating systems and 200+ applications and websites, along with an  annotation infrastructure.<a href="https://neurips.aipapertrails.com/paper/FS0voKxELr"> OSWorld-G </a>creates a training dataset specific to different interface elements that enables compositional generalization to novel interfaces. <a href="https://neurips.aipapertrails.com/paper/YJxGJP8feU">macOSWorld</a> creates a multi-lingual benchmark with 5 languages. <a href="https://neurips.aipapertrails.com/paper/Un1sWxmZuI">REAL</a> creates a  dataset with real websites and programmatic checks as deterministic success criteria. <a href="https://neurips.aipapertrails.com/paper/LZnKNApvhG">TheAgentCompany</a> simulates real-world tasks of a digital worker, including communicating with other coworkers.</p><p><strong>Building computer use/GUI agents</strong>. 
<a href="https://neurips.aipapertrails.com/paper/1XLjrmKZ4p">GUI-G1</a> sets up DeepSeek-R1-Zero type of training, where models surpass all prior models of similar size. R1-style training everywhere! <a href="https://neurips.aipapertrails.com/paper/XVm8KOO3Ri">GUI Exploration Lab</a> improves agents with multi-turn reinforcement learning as opposed to single-turn. <a href="https://neurips.aipapertrails.com/paper/3uUmJzSSOW">UI-Genie</a> takes this a step forward and defines are self-improving loop with a reward model, UI-Genie-RM. <a href="https://neurips.aipapertrails.com/paper/YMPYLesItf">GUI-Rise</a> uses structured reasoning with GRPO, along with history summarization and specialized rewards. <a href="https://neurips.aipapertrails.com/paper/B0Gfxhr8V5">BTL-UI</a> proposes a brain-inspired framework with Blink, Think, and Link that demonstrates competitive performance with other methods.</p><p><strong>Safety</strong>. <a href="https://neurips.aipapertrails.com/paper/Ip1cCUAllL">WASP</a> investigates prompt injection attacks on autonomous UI agents and finds &#8220;even top-tier AI models can be deceived by simple, low-effort human-written injections in very realistic scenarios&#8220;. <a href="https://neurips.aipapertrails.com/paper/qaxf7q41aK">AgentDAM</a> investigates the inadvertent use of unnecessary sensitive information. <a href="https://neurips.aipapertrails.com/paper/Di30GwhQSX">OS-Harm</a> tests models for deliberate user data misuse, prompt injection attacks, and model misbehavior. <a href="https://neurips.aipapertrails.com/paper/aPBBkHDK4W">RiOSWorld</a> investigations risk associated with computer-use agents in two major categories: user-originated risks and environmental risks. <a href="https://neurips.aipapertrails.com/paper/ToNRHqX6xq">MIP against Agent</a> uses adversarially perturbed image patches to test the robustness of multimodel OS agents.</p><h3>5. AI-driven scientific discovery</h3><p><a href="https://neurips.aipapertrails.com/paper/PegEYWWXvx">A position paper</a> at NeurIPS argues &#8220;that foundation models are driving a fundamental shift in the scientific process&#8221;. From paradigm enhancement to paradigm transition. To align with that, there is a lot of work at NeurIPS &#8216;25 that evaluates autonomous scientific discovery with LLMs, first data science and machine learning research, then biology, physics, mathematics, and finance. <strong>The results right now are mixed, but with a scope for improvement.</strong></p><p>Not surprisingly, ML researchers first want to automate their tasks! <a href="https://neurips.aipapertrails.com/paper/kQWyOYUAC4">AI-Researcher</a> defines complete research pipelines, from literature review and hypothesis generation to algorithm implementation and publication-ready manuscript preparation, and proposes Scientist-Bench. The authors claim that the AI-Researcher approach achieves human-level quality on this benchmark. <a href="https://neurips.aipapertrails.com/paper/RwfrdKSgCE">Another work</a> proposes better performance methods on MLE-Bench, which comprises problems from Kaggle competitions. Similarly, but with a  multi-modal pipeline, <a href="https://neurips.aipapertrails.com/paper/qGFvTIMS3W">this work</a> proposes a system for automated model discovery for a given dataset. </p><p><a href="https://neurips.aipapertrails.com/paper/JX9DE6colf">MLR-Bench</a> sources 201 research tasks from NeurIPS, ICLR, and ICML workshops, which are evaluated by MLR-Judge. 
Similarly, <a href="https://neurips.aipapertrails.com/paper/t8Okk2PRWU">MLRC-Bench</a> curates a suite of 7 competition tasks that reveal significant challenges for LLM agents. It shows that even the best agent closes only 9.3% of the gap between baseline and the top human participant. <a href="https://neurips.aipapertrails.com/paper/VrCdsZBbIg">Another work</a> investigated language modeling itself by using LLMs to come up with efficient architectures for LLMs. <a href="https://neurips.aipapertrails.com/paper/w98hMEjzu8">LLM Speedrunning</a> benchmark tests LLM agents against reproducing NanoGPT improvements. It concludes with &#8220;LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark.&#8220; <a href="https://neurips.aipapertrails.com/paper/a64D9Vl7wK">Another interesting</a> test for LLM was predicting empirical AI research outcomes. Surprisingly, a system with fine-tuned GPT-4.1 and a paper retrieval agent beats human experts by a large margin!</p><p>LLMs are also evaluated for R&amp;D and scientific discovery in other domains. <a href="https://neurips.aipapertrails.com/paper/9VxTXAUH7G">R&amp;D-Agent-Quant</a> for quantitative finance, <a href="https://neurips.aipapertrails.com/paper/Vzi96rTe4w">scPilot</a> for single-cell analysis, <a href="https://neurips.aipapertrails.com/paper/w8uII2qAmd">PhysGym</a> for interactive physics discovery, <a href="https://neurips.aipapertrails.com/paper/7k7cubl1iL">CIDD</a> for drug design, <a href="https://neurips.aipapertrails.com/paper/qXiTFAgEx4">AstroVisBench</a> for scientific computing and visualization in Astronomy. <a href="https://neurips.aipapertrails.com/paper/AIOq1vWSgK">LabUtopia</a> for scientific embodied agent testing LLM or VLA models in laboratory settings.</p><p>Beyond benchmarks with deterministic success criteria, a large part of R&amp;D is non-verifiable and should be evaluated by subject matter experts. <a href="https://neurips.aipapertrails.com/paper/am6RR85mnc">SciArena</a> built a platform similar to Chatbot Arena, where human researchers across diverse scientific domains judge the answer quality of different LLMs on scientific literature-grounded tasks.</p><h3>Outro</h3><p>I am excited about all the interesting papers at NeurIPS and can&#8217;t wait to visit the posters for the above papers in San Diego next week.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>AGI: Artificial General Intelligence; ASI: Artificial Super Intelligence. The community has used many practically interchangeable terms. Another is Human-level AI or Transformative AI by Sam Bowman, talking about a <a href="https://sleepinyourhat.github.io/checklist/">checklist for AI safety</a>. 
There are a lot of definitions floating around, but let&#8217;s go with Sam Bowman&#8217;s definition: &#8220;AI that could serve as a drop-in replacement for humans in all remote-work-friendly jobs, including AI R&amp;D.&#8221;</p></div></div>]]></content:encoded></item><item><title><![CDATA[Chain of thought is computation]]></title><description><![CDATA[Framework to understand "thinking" in LLMs, from Dale Schuurmans's keynote at Reinforcement Learning Conference.]]></description><link>https://www.aipapertrails.com/p/chain-of-thought-is-computation</link><guid isPermaLink="false">https://www.aipapertrails.com/p/chain-of-thought-is-computation</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Fri, 07 Nov 2025 07:04:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Jf3h!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f62bca-74ff-46cd-bbdf-cc9d569d2636_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Assume there is a giant directed graph and we want to find the shortest path from node A to node B. If we try to write down the <em>first</em> edge of the <em>optimal</em> path, that would be the hardest edge to choose: we need to know the optimal path from the second node to node B to choose the first edge optimally. Classic dynamic programming. In other words, we must have done <em>most of the computation</em> even before identifying the first edge of the optimal path. This computation scales linearly with the problem size: linear-time computation.</p>
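<p>To make that concrete, here is a small dynamic-programming sketch (Bellman-Ford style). The point to notice: <code>first_edge</code> can only be chosen after <code>dist</code> has been propagated over the whole graph, i.e., after most of the computation is already done. The graph below is a made-up toy example:</p><pre><code>def shortest_path_values(edges, target, num_nodes):
    """dist[v] = cost of the optimal path from v to target,
    computed by rounds of edge relaxation (dynamic programming)."""
    dist = [float("inf")] * num_nodes
    dist[target] = 0.0
    for _ in range(num_nodes - 1):      # rounds scale with problem size
        for u, v, w in edges:           # relax every edge
            if dist[u] > dist[v] + w:
                dist[u] = dist[v] + w
    return dist

def first_edge(edges, source, dist):
    """The first edge of the optimal path needs dist[] for the rest
    of the graph: the "hardest" choice comes after the DP."""
    return min((w + dist[v], v) for u, v, w in edges if u == source)[1]

edges = [(0, 1, 1.0), (0, 2, 4.0), (1, 2, 1.0), (2, 3, 1.0), (1, 3, 5.0)]
dist = shortest_path_values(edges, target=3, num_nodes=4)
print(first_edge(edges, 0, dist))  # 1: the cheap detour, known only post-DP</code></pre>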
<p>Analogically, if an LLM tries to generate the <em>final answer</em> in one shot, it would have to somehow compress a multi-step computation into a single fixed-depth forward pass (when generating the first token). But if LLMs were to generate a variable-size chain of thought before generating an answer, linear-time computation is possible. <strong>Chain of thought </strong><em><strong>is</strong></em><strong> computation</strong>.</p><p><a href="https://webdocs.cs.ualberta.ca/~dale/">Dale Schuurmans</a> (U. Alberta &amp; Google DeepMind), in <a href="https://www.youtube.com/watch?v=hMEViRcF7o0">his keynote at the Reinforcement Learning Conference</a>, offered a perspective to understand LLMs as compute engines. Specifically, it explains why chain-of-thought is necessary for reasoning and how it can be treated as computation. This perspective answers core questions like &#8220;Can LLMs reason?&#8221; and offers a framework for understanding pre-training and post-training in LLMs. It also highlights the inherent limits of learning a formal program from finite data and raises questions about out-of-distribution generalization. The following depicts a characterization of the framework:</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Jf3h!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3f62bca-74ff-46cd-bbdf-cc9d569d2636_1536x1024.png" width="1456" height="971" alt=""></figure></div>
pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipapertrails.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI PaperTrails! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>Linear-time &#8800; one-shot</h3><p>We <em>cannot</em> distill linear-time reasoning into constant-time architectures. As seen above, shortest paths need dynamic programming type of computation and we can&#8217;t compress this inherently iterative computation into a <strong>single constant-depth forward pass</strong>. In a similar case of linear-time computation problem, propositional logic (shown in the below figure), training with CoT that follow back substiution improved the performance considerably above random.</p><p>As Dale says, training input-output pairs (without COT) to produce the answer &#8220;in one pass&#8221; was doomed: </p><blockquote><p>&#8220;This cannot work. It&#8217;s an impossibility&#8230; It&#8217;s not a statistical problem. It&#8217;s a computer science problem. 
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!bY-l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42c2e2af-8eed-4fa4-9266-2f6d5b18e86c_1136x1214.png" width="1136" height="1214" alt=""></figure></div>
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>There is an alternative, as seen in <a href="https://x.com/prakashkagitha/status/1985824668599873630">NeurIPS 2025 papers</a>, to token-by-token CoT: <strong>continuous, latent-space thinking</strong>. Instead of reasoning tokens, the model allocates computation inside its vectors. </p><p><a href="http://arxiv.org/abs/2505.12514v2">Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought</a> argues that continuous CoT can encode multiple search frontiers in superposition, enabling something like parallel BFS with fewer explicit steps than discrete CoT. This doesn&#8217;t make compute vanish, it changes where the compute lives (in vectors rather than many printed tokens) and shows how a small number of &#8220;thinking updates&#8221; can still represent multi-path exploration. </p><p><a href="http://arxiv.org/abs/2502.05171v2">Scaling up Test-Time Compute with Latent Reasoning</a> proposes iterating a recurrent block at test time, no special CoT data, so the model scales depth as needed (adaptive, budgeted compute) without spitting out long text traces<em>. </em></p><p>Read together, these results pressure-test token CoT by showing that compute doesn&#8217;t disappear, it <strong>moves</strong>: from explicit text to a handful of latent updates. They don&#8217;t refute &#8220;linear-time &#8800; one-shot&#8221;; they offer tighter <strong>allocation and compression</strong> of the required iterative compute.</p><h3>Pre-training and post-training</h3><p>If the internet is a giant directed graph, <strong>pre-training</strong> learns the <em>edges</em> (token-transition statistics). <strong>Post-training</strong> learns how to <em>traverse</em> the graph for a goal. Schuurmans&#8217;s metaphor is crisp: &#8220;Pre-training is learning the graph&#8230; Post-training is about turning that into a policy that finds paths.&#8221;  This elicits reasoning which executes the policy learned. 
<h3>Pre-training and post-training</h3><p>If the internet is a giant directed graph, <strong>pre-training</strong> learns the <em>edges</em> (token-transition statistics). <strong>Post-training</strong> learns how to <em>traverse</em> the graph for a goal. Schuurmans&#8217;s metaphor is crisp: &#8220;Pre-training is learning the graph&#8230; Post-training is about turning that into a policy that finds paths.&#8221; Post-training thus elicits reasoning, which executes the learned policy. Talking about the performance of reasoning models on logical reasoning tasks, <a href="https://openreview.net/forum?id=MPTlWIVSMU#discussion">this paper</a> says:</p><blockquote><p>&#8220;Unlike other LLMs (GPT-4o &amp; DeepSeek V3), R1 shows signs of having learned the underlying reasoning&#8221;</p></blockquote><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!N2Fh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F19b630dc-3dc6-44c2-89b6-5f256dd88e66_2284x1246.png" width="1456" height="794" alt=""></figure></div>
<h3>Out-of-distribution generalization &amp; Correctness</h3><p>Even though an LLM is universal <em>in principle</em>, program induction from finite data guarantees there will be inputs just outside what the learned program covers. Schuurmans makes the broader point that machine learning promises are mostly <em>in-distribution</em>, while program execution aims for <em>all instances</em>. If you ignore the difference, you misdiagnose computational failures as &#8220;OOD coverage problems.&#8221;</p><p>Also, how can we characterise CoTs with self-verification &amp; backtracking? Are they verifying the program and choosing a different program to run? And isn&#8217;t it the case that program correctness is undecidable?</p><p>There are many interesting questions within this framework of treating an LLM as a computer. In Dale Schuurmans&#8217;s words:</p><blockquote><p>&#8220;machine learning is awesome&#8230; but computer science matters&#8230; If you ignore those laws, I predict disappointment.&#8221;</p></blockquote>
]]></content:encoded></item><item><title><![CDATA[Large reasoning models research at COLM 2025]]></title><description><![CDATA[State of research in scaling reasoning, the current paradigm for improving LLMs]]></description><link>https://www.aipapertrails.com/p/large-reasoning-models-research-at</link><guid isPermaLink="false">https://www.aipapertrails.com/p/large-reasoning-models-research-at</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Sat, 13 Sep 2025 23:48:38 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/9d36227b-ce8c-4b56-ae0f-ad3b8474ccdf_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!Nr3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd069c01-5c3f-49af-bc6c-b1b75760c167_1024x768.png" width="1024" height="768" alt=""></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!Nr3v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd069c01-5c3f-49af-bc6c-b1b75760c167_1024x768.png 424w, https://substackcdn.com/image/fetch/$s_!Nr3v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd069c01-5c3f-49af-bc6c-b1b75760c167_1024x768.png 848w, https://substackcdn.com/image/fetch/$s_!Nr3v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd069c01-5c3f-49af-bc6c-b1b75760c167_1024x768.png 1272w, https://substackcdn.com/image/fetch/$s_!Nr3v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd069c01-5c3f-49af-bc6c-b1b75760c167_1024x768.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Conference on Language Models (COLM) will take place in Montreal from October 7-10, 2025. There are around 70, out of over 400, research papers related to &#8220;reasoning&#8221;<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a>. Find all the accepted papers with a reasoning tag, abstracts, authors, PDF links, and review scores in this <a href="https://docs.google.com/spreadsheets/d/1PHWsM6GWlRPlVEPQlfqJhXn7l5gQlWZg0F0jmxbGnXQ/edit?usp=sharing">Google Sheet</a>.</p><p>Peer-reviewed papers from an upcoming top LLM conference must give a broad sense of the research area. 
As Large Reasoning Models (LRMs)<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> are undeniably the cutting edge of language models, this first post on COLM 2025 papers introduces the latest research on LRMs, a subset of the 70 reasoning papers, covering:</p><ol><li><p>Improving reasoning in LRMs,</p></li><li><p>Applications of LRMs in different domains &amp; tasks,</p></li><li><p>Multi-modal reasoning enabled by LRMs,</p></li><li><p>Many limitations of LRMs, and</p></li><li><p>Understanding the reasoning in LRMs.</p></li></ol><p>I aim to provide clear segues with friendly context to the different research papers, particularly how they tie into the narrative of &#8220;scaling reasoning is the paradigm that pushed the frontier this time around&#8221;.</p><h3>0. LRMs represent the cutting edge of Language Models</h3><p>Large Language Models (LLMs) have been improving rapidly since 2020. The initial progress came from the paradigm of <strong>scaling pretraining</strong>, more data &amp; model size, giving LLMs robust natural language understanding and instruction following, with models like GPT-4. While these models had never-seen-before reasoning capabilities for an AI, they were not human-level in many reasoning and planning tasks. Scaling pretraining didn&#8217;t help with this, as the gains plateaued.</p><p><strong>Scaling reasoning</strong>/thinking tokens with inference-time compute has emerged as the new paradigm for improving LLMs, evident in the pioneering Large Reasoning Models (LRMs) OpenAI O1 and DeepSeek R1. This improved the capability to perform complex multi-step reasoning tasks and extended tool use, enabling several applications of LLM agents, including software engineering and computer use. We still haven&#8217;t seen a plateau in scaling reasoning, but we could be racing towards it!</p><p>Rather than speculation, research on LRMs at COLM 2025 sheds some light on where we stand. In summary: several methods have been proposed to improve reasoning and inference-time compute, suggesting that scaling reasoning might still have enough headroom. Reasoning models have been trained for several tasks in multiple domains, confirming it as an effective technique. Reasoning was also key to enabling a few multi-modal reasoning tasks, suggesting its versatility. Notwithstanding spectacular achievements like <a href="https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">earning a gold medal at IMO 2025</a>, LRMs have many limitations, suggesting that scaling reasoning might not be a panacea. Fortunately, there are several investigations into scientifically understanding LRMs, which should be a guiding light to understand limitations, improve, or look for the next paradigm.</p><h3>1. Improving the reasoning of LRMs</h3><p>The first striking evidence with <a href="https://openai.com/index/learning-to-reason-with-llms/">OpenAI O1</a> that scaling reasoning would be the next frontier was self-verification and backtracking in language models&#8217; reasoning chains. Hence, Large Reasoning Models. Then came the aha-moment from DeepSeek R1, which showed how to achieve this long reasoning in a recipe that others are still following.
Still, there are avenues to improve reasoning further.</p><h4>Algorithmic synthetic data for long reasoning</h4><p>The <a href="https://openreview.net/forum?id=9DCQAGBoII#discussion">Spontaneous self-correction</a> method distills the self-verification capability of multi-turn, multi-agent collaboration into single-turn generation. After distillation with Supervised Fine-Tuning (SFT), online RL is used to further improve self-verification.</p><p><a href="https://openreview.net/forum?id=6vMRcaYbU7#discussion">Another work</a> generates diverse reasoning traces on graph problems that exercise the broad reasoning abilities required for math and scientific reasoning. Training on these traces also showed a ~20% improvement on non-mathematical reasoning tasks like logical and commonsense reasoning.</p><p>Similarly, <a href="https://openreview.net/forum?id=oN9STRYQVa#discussion">Step-Wise RL (SWiRL</a>) applies synthetic data filtering and RL optimisation to improve math reasoning and question answering with retrieval.</p><h4>Verifiers and multi-agent systems for better reasoning</h4><p>R1-style training incentivises long reasoning with rewards and RL (GRPO). Another way to obtain self-verification capability is to build multi-agent systems with generators &amp; verifiers, which is long reasoning in effect.</p><p><a href="https://openreview.net/forum?id=kVOrGZM5N7#discussion">Putting the value back in RL</a> augments RL methods without value functions, like GRPO, with generative verifiers to enable 8-32x more efficient test-time compute scaling.</p><p>Another work does away with RL entirely and gets comparable performance by training both generator and verifier with just pair-wise preference optimisation: <a href="https://openreview.net/forum?id=OgWh4J7bkT#discussion">Iterative DPO</a>.</p><p><a href="https://openreview.net/forum?id=zLbmsdyTiN#discussion">MeMAD</a> enables self-verification with a separate reflection module and a memory bank during multi-agent debate.</p><h4>Novel approaches for reasoning</h4><p>Most LRMs reason by generating natural language tokens. We don&#8217;t know whether this is an effective and comprehensive way for language models to think. <a href="https://openreview.net/forum?id=Itxz7S4Ip3#discussion">COCONUT</a> proposes reasoning in latent space instead: generating a sequence of continuous vectors rather than natural language tokens, roughly as sketched below. The authors showed that continuous thought beats natural-language thought when a reasoning task needs substantial search, like finding a valid path in a big graph.</p>
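<p>To make the mechanism concrete, here is a minimal sketch of the continuous-thought idea, assuming a HuggingFace-style model interface; it illustrates the technique rather than reproducing the paper&#8217;s code, and it omits COCONUT&#8217;s multi-stage training curriculum.</p><pre><code>import torch

def continuous_thoughts(model, input_embeds, n_thoughts=4):
    # COCONUT-style latent reasoning: instead of sampling a token and
    # re-embedding it, feed the last hidden state back as the next input
    # embedding, so each "thought" stays a continuous vector.
    for _ in range(n_thoughts):
        out = model(inputs_embeds=input_embeds, output_hidden_states=True)
        thought = out.hidden_states[-1][:, -1:, :]  # one continuous thought
        input_embeds = torch.cat([input_embeds, thought], dim=1)
    # From here, the final answer is decoded autoregressively as usual.
    return input_embeds</code></pre>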
<a href="https://openreview.net/forum?id=klPszYDIRT#discussion">SEAL</a> reduces excessive reflection and transition thoughts with reasoning steering vectors, and <a href="https://openreview.net/forum?id=O6I0Av7683#discussion">the other work</a> early exits reasoning by probing correctness from hidden states.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.aipapertrails.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading AI PaperTrails! Subscribe for free to receive future posts on COLM 2025 papers and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h3>2. Applications of LRMs in different domains &amp; tasks</h3><p>DeepSeek R1-style of training to incentivise reasoning has become the norm to improve complex reasoning. Long thoughts, more reasoning steps, and extended tool calls underlie improved performance in all the following tasks.</p><p><strong>LRMs for Medical knowledge reasoning</strong>. <a href="https://openreview.net/forum?id=7ZwuGZCopw#discussion">FineMedLM-o1</a> is trained on medical long-form reasoning data to enable advanced dialogue and deep reasoning capabilities supporting differential diagnosis and medication recommendations.</p><p><strong>LRMs for Law. </strong>Towards reasoning-aware legal AI systems, <a href="https://openreview.net/forum?id=MsgdEkcLRz#discussion">LawFlow</a> collects lawyers&#8217; thought processes and compares them to LRM&#8217;s thought processes.</p><p><strong>LRMs for <a href="https://openreview.net/forum?id=MPTlWIVSMU#discussion">logical reasoning</a></strong>. Performing algorithmic reasoning for the 3-SAT problem, the authors found that &#8220;Unlike other LLMs (GPT-4o &amp; DeepSeek V3), R1 shows signs of having learned the underlying reasoning&#8220;. LRMs internalized search process required for algorithmic reasoning!</p><p><strong>LRMs for <a href="https://openreview.net/forum?id=dr3eg5ehR2#discussion">story generation</a>. </strong>Trained with RL with verifiable rewards, similar to R1, to reason over a story and plan for the next chapter, enabling the generation of high-quality stories spanning thousands of tokens.</p><p><strong>LRMs for search</strong>. <a href="https://openreview.net/forum?id=Rwhi91ideu#discussion">Search-R1.</a> &#8220;LLM learns to autonomously generate (multiple) search queries during step-by-step reasoning with real-time retrieval.&#8221;</p><p><strong>LRMs for Text-to-SQL. </strong><a href="https://openreview.net/forum?id=HbwkIDWQgN#discussion">Reasoning-SQL</a>. Improving reasoning in Text-to-SQL with rewards tailored to SQL generation, such as syntax checks, schema-linking, and n-gram similarity. RL-trained 14B model outperforms O3-mini by 4%.</p><p><strong>LRMs for summarisation</strong>. This work proposes a long-COT dataset with reflective reasoning, SumFeed-CoT. Trained on this, <a href="https://openreview.net/forum?id=6BGDGKZN7q#discussion">ReFeed</a> enhances summarization refinement in multiple dimensions through reflective reasoning on feedback.</p><h3>3. 
<p><strong>LRMs for summarisation</strong>. This work proposes a long-CoT dataset with reflective reasoning, SumFeed-CoT. Trained on it, <a href="https://openreview.net/forum?id=6BGDGKZN7q#discussion">ReFeed</a> enhances summarization refinement along multiple dimensions through reflective reasoning on feedback.</p><h3>3. Multi-modal reasoning enabled by LRMs</h3><p>Long thoughts are a characteristic of system-2 reasoning, and they are shown to improve visual reasoning as well, which is thought (pun intended :)) to be system-1 heavy.</p><p><a href="https://openreview.net/forum?id=SrKdi4MsUW#discussion">LongPerceptualThoughts</a> are long-thought traces for perceptual tasks, made verifiable with a three-stage framework that expands multiple-choice questions into CoTs. Training on these thoughts achieves +3.4 points across 5 vision benchmarks.</p><p><a href="https://openreview.net/forum?id=19fydz1QnW#discussion">BigCharts-R1</a> is the SOTA chart reasoning model trained in R1 style. The authors tackle this hard-to-verify task by rendering diverse chart images from real-world charts and data, which by construction come with ground truths, thus providing rewards.</p><h3>4. Many limitations of LRMs</h3><p>With all the improvements in reasoning and the new tasks it enables, it is tempting to conclude that long reasoning is enough for any task. It is not, at least not yet. LRMs fall short in several abilities, both as language models in general and sometimes as reasoning models specifically.</p><p><strong>Reasoning is not all you need?!</strong> <a href="https://openreview.net/forum?id=kH6LOHGjEl#discussion">Corrupted by Reasoning</a> finds that LRMs struggle significantly with cooperation compared to traditional LLMs. In LLM-as-a-judge settings, although LRMs are better at fact-checking, they show a &#8220;<a href="https://openreview.net/forum?id=SlRtFwBdzP#discussion">superficial reflection bias</a>&#8220; where phrases mimicking reasoning (e.g., &#8220;wait, let me think&#8230;&#8220;) significantly influence model judgements.</p><p><strong>Limited in uncertainty awareness. </strong>Despite having reasoning tokens, language models struggle with uncertainty awareness. This leads to unfaithful reasoning traces and to divergent conclusions when given logically inconsistent knowledge that can only be disambiguated by uncertainty. Two similar works use a <a href="https://openreview.net/forum?id=p4wZfBFgyI#discussion">formal reasoning topology</a> and <a href="https://openreview.net/forum?id=9pzNFfgtyk#discussion">weights of logical rules</a> for uncertainty measurement, and both find that SOTA LLMs/LRMs are limited here.</p><p><strong>Limited in mathematical reasoning</strong>. Really?! Despite the unparalleled focus on mathematical reasoning in scaling reasoning, <a href="https://openreview.net/forum?id=uXR2KsA4L9#discussion">Brains vs. Bytes</a> found that O3-mini drops from 48.3% to 14.3% when evaluated on complete proofs rather than just the final answer. Through manual expert analysis of 455 IMO problems, they found that accuracy dropped for models including DeepSeek, Gemini, and OpenAI.</p><p><strong>Limited in code reasoning</strong>. O3-mini (high) <a href="https://openreview.net/forum?id=M7cl4Ldw61#discussion">couldn&#8217;t identify counterexamples</a> for over 90% of incorrect solutions where expert humans can. In long-horizon SWE agent tasks, LLMs are not robust to corrupted functions and <a href="https://openreview.net/forum?id=GQNojroNCH#discussion">fail at system-level reasoning</a>.</p><p>Language models are categorically worse than humans in <a href="https://openreview.net/forum?id=78lTuD6wiO#discussion">visual cognition</a>, <a href="https://openreview.net/forum?id=vNJbDhgrM4#discussion">social reasoning</a>, and <a href="https://openreview.net/forum?id=qbWpEufkqk#discussion">spatial reasoning</a>.
Also, LLMs are limited in <a href="https://openreview.net/forum?id=fcRcl1EXc4#discussion">linguistic reasoning</a> in low-resource languages.</p><h3>5. Understanding the reasoning in LRMs</h3><p>Soon after the success of R1, several related RL training experiments were spawned to incentivise long reasoning, in pursuit of understanding its essence. Several works found that long reasoning emerges only if the base models themselves already have properties like verification, subgoal setting, and backtracking (as Qwen 2.5 models do, unlike Llama 3.2 models).</p><p><a href="https://openreview.net/forum?id=vSMCBUgrQj#discussion">SimpleRL-Zoo</a> finds that by adjusting the format reward and controlling query difficulty, most model families show long-reasoning performance boosts, albeit with distinct patterns during training.</p><p><a href="https://openreview.net/forum?id=5wAfbEs34A#discussion">Style over substance</a> found that small models learn reasoning style from distillation rather than substance, and thus achieve similar performance whether trained with correct or incorrect answers. A<a href="https://openreview.net/forum?id=QGJ9ttXLTy#discussion"> framework to investigate the factors</a> that enable long reasoning found that priming with few-shot examples containing reasoning patterns also elicited long reasoning in most models, even when those patterns led to incorrect answers.</p><p>The content-vs-form debate goes far back in time and cuts across many domains! It suggests a plausible hypothesis for other LLM limitations reported at COLM 2025: LRMs <a href="https://openreview.net/forum?id=ufozo2Wc9e#discussion">go on to overthink</a> even when a premise is missing; maybe they are just following the reasoning style. LRMs also <a href="https://openreview.net/forum?id=ptmgWRCWmu#discussion">couldn&#8217;t identify</a> inconsistencies in FlawedFictions, maybe because the high-level linguistic structure still makes sense.</p><p>Given all the findings and conclusions from scientific investigations in this post, it is imperative to give a shout-out to all reproducibility efforts. A related work, <a href="https://openreview.net/forum?id=90UrTTxp5O#discussion">Sober look at progress</a>, proposes an evaluation framework to understand the sensitivity of findings to subtle implementation choices, including random seeds, prompt formatting, and decoding parameters, and finds that SFT shows consistently stronger generalisation than RL, contrary to popular belief.</p>
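<p>The spirit of that framework fits in a few lines. The sketch below is a hypothetical harness, not the paper&#8217;s code; the <code>dummy_eval</code> function and the configuration values are illustrative assumptions.</p><pre><code>import itertools, statistics

def sensitivity_sweep(evaluate, seeds=(0, 1, 2),
                      prompts=("plain", "markdown"), temps=(0.0, 0.7)):
    # Measure how far a benchmark score moves under implementation
    # choices alone; `evaluate` returns accuracy for one configuration.
    scores = [evaluate(seed=s, prompt_style=p, temperature=t)
              for s, p, t in itertools.product(seeds, prompts, temps)]
    return min(scores), max(scores), statistics.mean(scores)

def dummy_eval(seed, prompt_style, temperature):
    # Stand-in for a real benchmark run, jittering with the configuration.
    return (0.70 + 0.01 * seed
            + (0.02 if prompt_style == "markdown" else 0.0)
            - 0.03 * temperature)

print(sensitivity_sweep(dummy_eval))  # (min, max, mean) across configs</code></pre>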
<h3>&#167;. Looking forward</h3><p>Autonomous scientific discovery that wins a Nobel prize would be a great goal for superhuman AI. Can LRMs alone (including multi-agent systems powered by them) take us there? Or do we need a new paradigm or two to push the frontier?</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Including: reasoning or thinking in Large Reasoning Models (LRMs), and the mathematical, algorithmic, social, inductive, visual, and other reasoning capabilities of LLMs.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>LRMs are also LLMs. I don&#8217;t like this distinction so much, but it helps in this article.</p></div></div>]]></content:encoded></item><item><title><![CDATA[ICML 2024 Orals (top 10%): Summaries of interesting papers]]></title><description><![CDATA[ICML, one of the top machine learning conferences, is happening this week.]]></description><link>https://www.aipapertrails.com/p/icml-2024-orals-top-10-summaries-of-interesting-papers-53c415a891de</link><guid isPermaLink="false">https://www.aipapertrails.com/p/icml-2024-orals-top-10-summaries-of-interesting-papers-53c415a891de</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Tue, 23 Jul 2024 16:14:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5e68f981-8c09-4395-b357-d643e40b430a_1024x576.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>ICML, one of the top machine learning conferences, is happening this week. I am excited about a few tutorials, orals &amp; posters, and most importantly the workshops towards the end of the week.
I found the following oral papers interesting and wrote friendly summaries.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X5XK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bab5bf6-c88e-4882-9a25-7de9ee875744_1024x576.jpeg"><img src="https://substackcdn.com/image/fetch/$s_!X5XK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6bab5bf6-c88e-4882-9a25-7de9ee875744_1024x576.jpeg" alt=""></a><figcaption class="image-caption">Created by&nbsp;author</figcaption></figure></div><h4><a href="https://openreview.net/forum?id=ghNRg2mEgN">Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision</a> (From&nbsp;OpenAI)</h4>
class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hLzZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hLzZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 424w, https://substackcdn.com/image/fetch/$s_!hLzZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 848w, https://substackcdn.com/image/fetch/$s_!hLzZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 1272w, https://substackcdn.com/image/fetch/$s_!hLzZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hLzZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!hLzZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 424w, https://substackcdn.com/image/fetch/$s_!hLzZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 848w, https://substackcdn.com/image/fetch/$s_!hLzZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 1272w, https://substackcdn.com/image/fetch/$s_!hLzZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f8ecd26-5723-4a6b-8513-ab29af88d7ee_1018x472.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>Currently, LLMs are aligned with Reinforcement Learning with Human Feedback (RLHF), i.e. we label whether a response followed the human intent in the query or whether a response is safe. But when LLMs become superhuman we cannot label the responses of superhuman LLMs to make them better or aligned to human values. 
For instance, if an LLM generated a code repository of 1 million lines, we cannot easily label whether the code is safe or whether it followed the user intent. In this scenario, humans, the weak agents, have to somehow align strong agents, the superhuman LLMs.</p><p>This paper from OpenAI presents empirical investigations into this question. Analogous to using human labels to align/train superhuman LLMs, the authors set up their experiments to train a strong model, GPT-4, with the labels of a weak model, GPT-2. If GPT-4 learns from the GPT-2 labels, they call it weak-to-strong generalization/performance.</p><p>The paper starts by defining three concepts. (1) <strong>Weak performance</strong>: GPT-2 trained on ground-truth labels for a task; (2) <strong>Strong ceiling performance</strong>: GPT-4 trained on ground-truth labels; (3) <strong>Weak-to-strong performance</strong>: GPT-4 trained with labels from GPT-2 (first, GPT-2 is trained on a task and labels data points from a held-out dataset; this labeled dataset is then used to train&nbsp;GPT-4).</p><p>The passage below, from the paper, defines the performance gap recovered (PGR): the extent to which the strong student (GPT-4) recovers the strong ceiling performance using only labels from the weak supervisor (GPT-2). High PGR means that weak supervisors can train strong students, or that humans could align superhuman LLMs in the&nbsp;future.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cAml!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12f3d96-b3de-45f7-8ee3-3a52f747de60_806x702.png"><img src="https://substackcdn.com/image/fetch/$s_!cAml!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12f3d96-b3de-45f7-8ee3-3a52f747de60_806x702.png" alt=""></a></figure></div>
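<p>In code, PGR boils down to one line; a minimal sketch (the formula follows the paper&#8217;s definition, while the example numbers are made up):</p><pre><code>def performance_gap_recovered(weak, weak_to_strong, strong_ceiling):
    # PGR = (weak-to-strong - weak) / (strong ceiling - weak).
    # 1.0: the student fully recovers the ceiling from weak labels alone;
    # 0.0: it learns nothing beyond the weak supervisor.
    return (weak_to_strong - weak) / (strong_ceiling - weak)

# Illustrative numbers: weak 60%, weak-to-strong 75%, ceiling 80% -> 0.75
print(performance_gap_recovered(0.60, 0.75, 0.80))</code></pre>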
class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cAml!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12f3d96-b3de-45f7-8ee3-3a52f747de60_806x702.png 424w, https://substackcdn.com/image/fetch/$s_!cAml!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12f3d96-b3de-45f7-8ee3-3a52f747de60_806x702.png 848w, https://substackcdn.com/image/fetch/$s_!cAml!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12f3d96-b3de-45f7-8ee3-3a52f747de60_806x702.png 1272w, https://substackcdn.com/image/fetch/$s_!cAml!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12f3d96-b3de-45f7-8ee3-3a52f747de60_806x702.png 1456w" sizes="100vw"></picture><div></div></div></a></figure></div><p>Fortunately, weak-to-strong generalization is possible as shown across diverse tasks. 80% in NLP tasks and over 60 % in reward modeling tasks as seen in the below figure. Chess Puzzles saw the least weak-to-strong generalization with below 40&nbsp;percent.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bskS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!bskS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 424w, https://substackcdn.com/image/fetch/$s_!bskS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 848w, https://substackcdn.com/image/fetch/$s_!bskS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 1272w, https://substackcdn.com/image/fetch/$s_!bskS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bskS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!bskS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 424w, https://substackcdn.com/image/fetch/$s_!bskS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 848w, https://substackcdn.com/image/fetch/$s_!bskS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 1272w, https://substackcdn.com/image/fetch/$s_!bskS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79cb82f2-25ad-421c-a8a0-50f87997c6c7_1024x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The paper also proposed approaches to improve weak-to-strong generalization:</p><p><strong>Auxiliary loss based on strong student&#8217;s confidence</strong>: While training with labels from a weak supervisor (GPT-2), sometimes a strong student (GPT-4) might overfit to the mistakes of a weak supervisor. The paper proposed to add a loss term based on the confidence of the strong student helping it to stick to its own prediction when in conflict with the weak supervisor.</p><p><strong>Bootstrapping</strong>: Instead of a weak model (GPT-2) supervising the strong model (GPT-4) directly, this approach utilizes intermediate models in between the weak and strong. For instance, if the weak model is 1B and the strong model is 100B, a sequence of models that are 2B, 4B, 8B, 16B, 32B, and 64B would be used. The bootstrap process starts by training the 2B model with weak labels, the 4B model with 2B model labels, the 8B model with 4B model, and so on. The strong model would be trained with labels of the model before it in the above sequence.</p><p><strong>Unsupervised generative fine-tuning</strong>: Sometimes just training with the next token prediction on the training data without labels would help in learning the right representation or eliciting the required behavior for that task. This approach includes unsupervised fine-tuning with all data points without labels before fine-tuning.</p><p>The paper showed that the above approach improved weak-to-strong generalization compared to standard fine-tuning with weak labels. 
<p>The paper showed that these approaches improve weak-to-strong generalization compared to standard fine-tuning with weak labels (notice the gap between the dotted and solid&nbsp;lines in the figures below).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i0Im!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83720a2-318d-44f2-b93e-36a12aa062e2_842x444.png"><img src="https://substackcdn.com/image/fetch/$s_!i0Im!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83720a2-318d-44f2-b93e-36a12aa062e2_842x444.png" alt=""></a></figure></div>
href="https://substackcdn.com/image/fetch/$s_!nFH0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nFH0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 424w, https://substackcdn.com/image/fetch/$s_!nFH0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 848w, https://substackcdn.com/image/fetch/$s_!nFH0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 1272w, https://substackcdn.com/image/fetch/$s_!nFH0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nFH0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dfab6644-287c-4b92-90b9-55c79b53689b_800x416.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!nFH0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 424w, https://substackcdn.com/image/fetch/$s_!nFH0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 848w, https://substackcdn.com/image/fetch/$s_!nFH0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 1272w, https://substackcdn.com/image/fetch/$s_!nFH0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdfab6644-287c-4b92-90b9-55c79b53689b_800x416.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tly9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!tly9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 424w, https://substackcdn.com/image/fetch/$s_!tly9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 848w, https://substackcdn.com/image/fetch/$s_!tly9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 1272w, https://substackcdn.com/image/fetch/$s_!tly9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tly9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!tly9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 424w, https://substackcdn.com/image/fetch/$s_!tly9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 848w, https://substackcdn.com/image/fetch/$s_!tly9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 1272w, https://substackcdn.com/image/fetch/$s_!tly9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F39465c8f-b64c-4abc-9252-712d0e91d2e5_816x442.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4><a href="https://openreview.net/forum?id=CfOtiepP8s">Interpreting and Improving Large Language Models in Arithmetic Calculation</a></h4><p>Although Large Language Models (LLMs) show impressive performance in solving math word problems and perform descent arithmetic calculations, we don&#8217;t yet understand how they do mathematical reasoning. 
This paper shines a light on the mechanisms that underlie mathematical reasoning.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rsXx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff07d98b3-0653-4988-b6d2-1ac07e75edf7_1024x704.png"><img src="https://substackcdn.com/image/fetch/$s_!rsXx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff07d98b3-0653-4988-b6d2-1ac07e75edf7_1024x704.png" alt=""></a></figure></div><p>For templates of arithmetic calculations shown in the figure above, the authors tested LLaMA2-7B and 13B models, where the input, for instance, would be &#8216;3 + 5 =&#8217; and the output would be &#8216;8&#8217;.
The first step in their investigation is to understand which attention heads affect the prediction of the result token&nbsp;&#8216;8&#8217;.</p><p>Through a technique called <a href="https://arxiv.org/abs/2304.05969">path patching</a>, they observed that fewer than 5% of attention heads are important for performing arithmetic calculation. You can see in the figure below that very few heads or MLPs have darker&nbsp;colors.</p><p>They also observed that removing the identified &#8220;arithmetic&#8221; heads hurt performance catastrophically compared to removing the same number of random attention heads, as the toy sketch below illustrates.</p>
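<p>The knockout idea itself is simple; here is a toy, self-contained illustration, not the paper&#8217;s path-patching code, and the tensor layout is an assumption:</p><pre><code>import torch

def ablate_heads(head_outputs, heads_to_knock_out):
    # head_outputs: (batch, n_heads, seq_len, head_dim), the per-head
    # attention outputs before the output projection mixes them.
    # Zero the chosen heads, re-run the forward pass, and compare task
    # accuracy with and without them to estimate their importance.
    ablated = head_outputs.clone()
    ablated[:, heads_to_knock_out] = 0.0
    return ablated

x = torch.randn(1, 32, 16, 128)                    # toy activations
print(ablate_heads(x, [3, 17])[0, 3].abs().sum())  # tensor(0.)</code></pre>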
<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ktxH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203193d9-918a-406c-bfcf-d4c74967d3f4_984x910.png"><img src="https://substackcdn.com/image/fetch/$s_!ktxH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F203193d9-918a-406c-bfcf-d4c74967d3f4_984x910.png" alt=""></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-rVV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc1351c-c4f8-49f4-bf32-0306cf1783ba_990x858.png"><img src="https://substackcdn.com/image/fetch/$s_!-rVV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7fc1351c-c4f8-49f4-bf32-0306cf1783ba_990x858.png" alt=""></a><figcaption class="image-caption">Figure 2: Only a few attention heads
and MLPs affect the prediction (darker-colored units). Figure 3: Performance drops when the identified &#8220;arithmetic&#8221; heads are removed, as opposed to random attention heads</figcaption></figure></div><p>Interestingly, they observed specific attention heads that attend to operators (+, -, ...) and operands (numbers). Figure 5 below shows attention weights being high at the positions of operators and operands respectively.</p><p>The authors also investigated whether the same &#8220;arithmetic&#8221; heads play a crucial role in mathematical reasoning on other datasets. Figure 4 below shows that after removing (knocking out) the &#8220;arithmetic&#8221; heads, the LLM produces wrong answers for data points that were predicted correctly before.</p>
<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gV3e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1392bf3-ba18-4bf5-9e7d-d9ce8bc72f8c_972x976.png"><img src="https://substackcdn.com/image/fetch/$s_!gV3e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff1392bf3-ba18-4bf5-9e7d-d9ce8bc72f8c_972x976.png" alt=""></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1WWy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9043beb-e107-4f42-aeb0-213279401092_966x722.png"><img src="https://substackcdn.com/image/fetch/$s_!1WWy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff9043beb-e107-4f42-aeb0-213279401092_966x722.png" alt=""></a></figure></div>
<p>Having identified the components that perform arithmetic reasoning, they propose to fine-tune only the &#8220;arithmetic&#8221; heads (about 10% of the parameters) to improve performance on mathematical reasoning tasks such as GSM8K and SVAMP. Dubbed Precise SFT in the figure below, it shows improvements over full SFT with roughly 3x faster training. It is also interesting that precise fine-tuning on mathematical tasks didn&#8217;t hurt performance on generic tasks as much as full fine-tuning did (a sketch of this selective fine-tuning follows below).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!KV2d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1092e72c-e8f1-41e2-bbdb-5f136db07b1c_1024x494.png" alt=""></figure></div>
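<p>A minimal sketch of such selective (&#8220;precise&#8221;) fine-tuning, under the assumption that a list of (layer, head) pairs has already been identified by an analysis like the paper&#8217;s. For simplicity it only unfreezes each head&#8217;s slice of the attention output projection, whereas a faithful version would cover all of a head&#8217;s parameters; all names and indices are placeholders.</p><pre><code class="language-python"># Hypothetical sketch of "precise" SFT: train only identified attention heads.
import torch
from collections import defaultdict
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
head_dim = model.config.n_embd // model.config.n_head
arithmetic_heads = [(10, 7), (10, 2), (11, 3)]  # hypothetical identified heads

for p in model.parameters():        # freeze everything first
    p.requires_grad = False

heads_by_layer = defaultdict(list)
for layer, head in arithmetic_heads:
    heads_by_layer[layer].append(head)

for layer, heads in heads_by_layer.items():
    w = model.transformer.h[layer].attn.c_proj.weight  # [n_embd, n_embd]
    w.requires_grad = True
    mask = torch.zeros_like(w)
    for h in heads:                 # rows that read from each chosen head
        mask[h * head_dim:(h + 1) * head_dim, :] = 1.0
    # zero out gradients everywhere except the chosen heads' slices
    w.register_hook(lambda grad, m=mask: grad * m)

n = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {n}")  # then run a standard SFT training loop
</code></pre>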
<h4><a href="https://openreview.net/forum?id=bX3J7ho18S">Monitoring AI-Modified Content at Scale: A Case Study on the Impact of ChatGPT on AI Conference Peer&nbsp;Reviews</a></h4><p>Motivated by a disproportionate increase in adjectives such as &#8220;intricate&#8221; and &#8220;meticulous&#8221; in peer reviews, the authors devised a statistical parameter-inference approach to estimate the fraction of AI-generated text at prominent deep learning conferences and in Nature portfolio journals. As suspected, they estimate that 10.6% of ICLR 2024 review sentences and 16.9% of EMNLP 2023 review sentences are substantially AI-generated.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!5f3d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68de04d-55b9-46ca-8fda-27514b671981_812x732.png" alt=""></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!CSBl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png" alt=""></figure></div>
srcset="https://substackcdn.com/image/fetch/$s_!5f3d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68de04d-55b9-46ca-8fda-27514b671981_812x732.png 424w, https://substackcdn.com/image/fetch/$s_!5f3d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68de04d-55b9-46ca-8fda-27514b671981_812x732.png 848w, https://substackcdn.com/image/fetch/$s_!5f3d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68de04d-55b9-46ca-8fda-27514b671981_812x732.png 1272w, https://substackcdn.com/image/fetch/$s_!5f3d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc68de04d-55b9-46ca-8fda-27514b671981_812x732.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CSBl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CSBl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png 424w, https://substackcdn.com/image/fetch/$s_!CSBl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png 848w, https://substackcdn.com/image/fetch/$s_!CSBl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png 1272w, https://substackcdn.com/image/fetch/$s_!CSBl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CSBl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!CSBl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F62a453af-fc39-4a02-a5de-c70cd58a7297_820x734.png 424w, 
<p>The paper starts by laying out the extreme difficulty of identifying LLM-generated content instance by instance. Techniques such as zero-shot LLM detection, fine-tuned classifiers for LLM-generated content, or LLM watermarking all attempt to label each individual document as LLM-generated or not. However, the first two have been shown to perform close to random predictors, and watermarking reduces the coherence of LLM generations.</p><p>Alternatively, the authors propose to estimate the fraction of AI-generated sentences/documents in a corpus rather than classifying them at the instance level. They assume the corpus is generated from a mixture of the human-generated distribution P and the LLM-generated distribution Q, as in the figure below, with alpha being the fraction of LLM-generated content. The log-likelihood of any corpus can then be written as a function of alpha, as shown&nbsp;below.</p><p>The authors estimate the true P and Q distributions from training corpora with a known fraction (alpha) of human- vs. AI-generated content. Once P and Q are known, the alpha of any corpus can be estimated by maximum likelihood estimation (MLE); in simple terms, search over alpha for the value that best fits the given corpus (a minimal sketch follows below).</p>
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!ab8p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fddecebd7-8073-4051-89bb-b64df1bf8cad_800x356.png" alt=""></figure></div><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!W7Fg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png" alt=""></figure></div>
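<p>A minimal sketch of that estimation, assuming per-sentence likelihoods under the estimated P and Q are already available (the paper parameterizes these via word-occurrence probabilities; the arrays below are toy stand-ins):</p><pre><code class="language-python"># Hypothetical sketch of the mixture MLE described above.
# p_human[i] and q_llm[i] are the likelihoods of sentence i under
# the human distribution P and the LLM distribution Q.
import numpy as np

def corpus_log_likelihood(alpha, p_human, q_llm):
    # log-likelihood under the mixture (1 - alpha) * P + alpha * Q
    return np.sum(np.log((1 - alpha) * p_human + alpha * q_llm))

def estimate_alpha(p_human, q_llm, grid=np.linspace(0.0, 1.0, 1001)):
    # "perturb alpha many times": a simple grid search for the MLE
    scores = [corpus_log_likelihood(a, p_human, q_llm) for a in grid]
    return grid[int(np.argmax(scores))]

rng = np.random.default_rng(0)
p = rng.uniform(0.2, 1.0, size=5000)  # toy stand-ins for real likelihoods
q = rng.uniform(0.2, 1.0, size=5000)
print(estimate_alpha(p, q))
</code></pre>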
href="https://substackcdn.com/image/fetch/$s_!W7Fg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W7Fg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 424w, https://substackcdn.com/image/fetch/$s_!W7Fg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 848w, https://substackcdn.com/image/fetch/$s_!W7Fg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 1272w, https://substackcdn.com/image/fetch/$s_!W7Fg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W7Fg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!W7Fg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 424w, https://substackcdn.com/image/fetch/$s_!W7Fg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 848w, https://substackcdn.com/image/fetch/$s_!W7Fg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 1272w, https://substackcdn.com/image/fetch/$s_!W7Fg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f93a12b-fc0b-4933-83a4-18579fe3c8f6_804x436.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The entire process is depicted in the following figure: 1, Create human and LLM-generated corpus 2, Temporally split training corpus and validation corpus 3, Estimate the true P and Q distributions with training corpus 4, Validate the estimated distributions on validation corpus where alpha is known 5, Finally, estimate the alpha on the corpus of interest, here ICLR 2024 or NeurIPS 2023 and so on. 
<div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!3EH9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F197cb820-b83f-4c64-8b08-c868e6a10b17_1024x516.png" alt=""></figure></div>
<p>The paper also reports interesting correlations between the fraction of LLM-generated content and different aspects of peer review. The estimated fraction of LLM-generated text is higher in reviews that contain no scholarly citations (the reference effect) and in reviews from reviewers who are less likely to respond to author rebuttals (the lower reply rate effect). LLM-generated text also tends to be homogeneous, which reduces the value-add of having multiple reviews per paper (the homogenization effect).</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!JnJh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8638059d-9e5a-4f3a-a2e9-5bb3f0659a1a_824x534.png" alt=""></figure></div>
loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!waCH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!waCH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 424w, https://substackcdn.com/image/fetch/$s_!waCH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 848w, https://substackcdn.com/image/fetch/$s_!waCH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 1272w, https://substackcdn.com/image/fetch/$s_!waCH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!waCH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75295116-82e3-4fe0-b28b-449430c705f4_800x546.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!waCH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 424w, https://substackcdn.com/image/fetch/$s_!waCH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 848w, https://substackcdn.com/image/fetch/$s_!waCH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 1272w, https://substackcdn.com/image/fetch/$s_!waCH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75295116-82e3-4fe0-b28b-449430c705f4_800x546.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!gbwj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gbwj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 424w, https://substackcdn.com/image/fetch/$s_!gbwj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 848w, https://substackcdn.com/image/fetch/$s_!gbwj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 1272w, https://substackcdn.com/image/fetch/$s_!gbwj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gbwj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!gbwj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 424w, https://substackcdn.com/image/fetch/$s_!gbwj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 848w, https://substackcdn.com/image/fetch/$s_!gbwj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 1272w, https://substackcdn.com/image/fetch/$s_!gbwj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d5cfb9d-8901-4567-8ad0-bbcfff02cea6_798x578.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>The paper also has a nice Box with all the main findings.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8a3B!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8a3B!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 424w, https://substackcdn.com/image/fetch/$s_!8a3B!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 848w, https://substackcdn.com/image/fetch/$s_!8a3B!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 1272w, https://substackcdn.com/image/fetch/$s_!8a3B!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8a3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!8a3B!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 424w, https://substackcdn.com/image/fetch/$s_!8a3B!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 848w, https://substackcdn.com/image/fetch/$s_!8a3B!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 1272w, https://substackcdn.com/image/fetch/$s_!8a3B!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdd8d579-eef3-4cd7-9930-6edc3af4ee96_798x904.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Other interesting papers:</h4><h4><a href="https://openreview.net/forum?id=frA0NNBS1n">Probabilistic Inference in Language Models via Twisted Sequential Monte&nbsp;Carlo</a></h4><h4><a href="https://openreview.net/forum?id=QBj7Uurdwf">Learning Useful Representations of Recurrent Neural Network Weight&nbsp;Matrices</a></h4><h4><a href="https://openreview.net/forum?id=BwAkaxqiLB">Evolution of Heuristics: Towards Efficient Automatic Algorithm Design Using Large Language&nbsp;Model</a></h4><h4><a href="https://openreview.net/forum?id=jKYyFbH8ap">Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks</a></h4><h4><a href="https://openreview.net/forum?id=hYHsrKDiX7">GaLore: Memory-Efficient LLM Training by Gradient Low-Rank 
Projection</a></h4><h4><a href="https://openreview.net/forum?id=uDkXoZMzBv">Compressible Dynamics in Deep Overparameterized Low-Rank Learning &amp; Adaptation</a></h4><h4><a href="https://openreview.net/forum?id=MFPYCvWsNR">Bottleneck-Minimal Indexing for Generative Document Retrieval</a></h4><h4><a href="https://openreview.net/forum?id=685vj0lC9z">How do Large Language Models Navigate Conflicts between Honesty and Helpfulness?</a></h4><h4><a href="https://openreview.net/forum?id=oowQ8LPA12">Theoretical Analysis of Learned Database Operations under Distribution Shift through Distribution Learnability</a></h4><h4><a href="https://openreview.net/forum?id=6XH8R7YrSk">Is DPO Superior to PPO for LLM Alignment? A Comprehensive Study</a></h4><h4><a href="https://openreview.net/forum?id=gPStP3FSY9">Discovering Environments with&nbsp;XRM</a></h4><h4><a href="https://openreview.net/forum?id=MdPBVWTfwG">I/O Complexity of Attention, or How Optimal is FlashAttention?</a></h4><h4><a href="https://openreview.net/forum?id=byxXa99PtF">Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling</a></h4><h4><a href="https://openreview.net/forum?id=hlvKd7Vdxm">ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking</a></h4><h4><a href="https://openreview.net/forum?id=wGtzp4ZT1n">CompeteAI: Understanding the Competition Dynamics of Large Language Model-based Agents</a></h4><h4><a href="https://openreview.net/forum?id=moyG54Okrj">Repoformer: Selective Retrieval for Repository-Level Code Completion</a></h4><h4><a href="https://openreview.net/forum?id=FPnUhsQJ5B">Scaling Rectified Flow Transformers for High-Resolution Image Synthesis</a></h4><h4><a href="https://openreview.net/forum?id=4jqOV6NlUz">Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation</a></h4><h4><a href="https://openreview.net/forum?id=p225Od0aYt">PRISE: LLM-Style Sequence Compression for Learning Temporal Action Abstractions in&nbsp;Control</a></h4><h4><a href="https://openreview.net/forum?id=HPXRzM9BYZ">LCA-on-the-Line: Benchmarking Out of Distribution Generalization with Class Taxonomies</a></h4><h4><a href="https://openreview.net/forum?id=7dP6Yq9Uwv">Learning to Model the World With&nbsp;Language</a></h4><h4><a href="https://openreview.net/forum?id=KviM5k8pcP">AI Control: Improving Safety Despite Intentional Subversion</a></h4><h4><a href="https://openreview.net/forum?id=tl2qmO5kpD">Offline Actor-Critic Reinforcement Learning Scales to Large&nbsp;Models</a></h4><h4><a href="https://openreview.net/forum?id=Zc22RDtsvP">MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions</a></h4><h4><a href="https://openreview.net/forum?id=VE3yWXt3KB">Stealing part of a production language&nbsp;model</a></h4><h4><a href="https://openreview.net/forum?id=iLCZtl7FTa">Debating with More Persuasive LLMs Leads to More Truthful&nbsp;Answers</a></h4><h4><a href="https://openreview.net/forum?id=JSYN891WnB">Image Clustering with External&nbsp;Guidance</a></h4><h4><a href="https://openreview.net/forum?id=dBqHGZPGZI">A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and&nbsp;Toxicity</a></h4><h4><a href="https://openreview.net/forum?id=bJbSbJskOS">Genie: Generative Interactive Environments</a></h4><h4><a href="https://openreview.net/forum?id=qz1Vx1v9iK">Test-Time Model Adaptation with Only Forward&nbsp;Passes</a></h4><h4><a href="https://openreview.net/forum?id=uTC9AFXIhg">GPTSwarm: Language Agents as Optimizable 
Graphs</a></h4>]]></content:encoded></item><item><title><![CDATA[Knowledge Distillation Research Review]]></title><description><![CDATA[The latest advances and insights for Neural Network model compression with Knowledge Distillation]]></description><link>https://www.aipapertrails.com/p/knowledge-distillation-research-review-2020-e0333f554863</link><guid isPermaLink="false">https://www.aipapertrails.com/p/knowledge-distillation-research-review-2020-e0333f554863</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Fri, 05 Feb 2021 14:46:44 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/934/1*hJT_wRUdsPrL0WKpoVcSBg.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>The latest advances and insights for Neural Network model compression with Knowledge Distillation</h4><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!NtfD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68f1e0c-4ccd-4be7-b14e-38f21a097859_934x343.png" alt=""><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_912d2b1c7b2826caf99687388d2e8f7c.html">the&nbsp;paper.</a></figcaption></figure></div><p>Knowledge Distillation is a process where a smaller/less complex model is trained to imitate the behavior of a larger/more complex&nbsp;model.</p><p>Particularly when deploying NN models on mobile or edge devices, distillation, and model compression in general, is desirable and often the only plausible way to deploy, as the memory and computational budgets of these devices are very&nbsp;limited.</p><p>Why not use the potentially infinite memory and computational power of cloud machines? While a lot of NN models run in the cloud even now, latency is often not low enough for mobile/edge applications, which hinders utility; moreover, data has to be transferred to the cloud, which raises serious privacy concerns.</p><h4>Preliminaries of Neural Network Knowledge Distillation</h4><p><strong>Response (Output logits) Distillation vs. Feature Distillation vs. Relation Distillation:</strong> This distinction is based on the knowledge that is transferred from a teacher network to a student network. If a student network is trained to reproduce the teacher&#8217;s output logit distribution, it is called Response Distillation (a minimal example follows after these definitions). If a student network is trained to reproduce intermediate representations, it is called Feature Distillation. If a student network is trained to reproduce the relative responses on pairs of inputs, it is called Relation Distillation.</p><p><strong>Offline Distillation vs. Online Distillation:</strong> If the teacher model is already trained and a student model is trained afterwards by transferring knowledge from the teacher, it is called Offline Distillation. If both the teacher and student models are trained in parallel, it is called Online Distillation. (Online Distillation can also take place between models of the same size and complexity, where there is no fixed teacher or student.)</p><p><strong>Distilling from ensembles of models:</strong> In this paradigm, a single model is obtained by distilling the knowledge from an ensemble of models, to lower inference costs.</p><p><strong>Self-distillation:</strong> Sometimes a model is trained to mimic the feature outputs of its Nth layer at its (N/3)th or (N/2)th layer. Then only N/2 or N/3 layers of the model are sufficient to approximate the results of the full N-layer network. This can be seen as distilling a smaller model from a bigger model, where the smaller model is a sub-network of the bigger model&nbsp;itself.</p>
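<p>To ground the terminology, here is a minimal sketch of the most common variant, response (output-logit) distillation with a temperature-softened KL term; the temperature and loss weighting below are illustrative choices, not values from any particular paper:</p><pre><code class="language-python"># Minimal sketch of response (output-logit) distillation.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """alpha weights imitating the teacher vs. fitting the true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # the usual T^2 factor keeps gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# usage, with `teacher` frozen and `student` being trained on batches (x, y):
# with torch.no_grad():
#     t_logits = teacher(x)
# loss = distillation_loss(student(x), t_logits, y)
</code></pre>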
<h4><a href="https://iclr.cc/virtual_2020/poster_SkgpBJrtvS.html">Contrastive Representation Distillation</a> (ICLR&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> Instead of training a student model to imitate just the class-independent output probabilities of a teacher model, training it to imitate the teacher&#8217;s representations transfers more knowledge from the teacher to the&nbsp;student.</p><p>Even though previous methods dealt with transferring the teacher&#8217;s representations, their loss functions, built on just a dot product between representations, are not designed to model the correlations and higher-order dependencies in the representational space. A dot product also constrains the feature vectors of the student and the teacher to be of the same size, which rules out the most desirable case: smaller student networks without any constraints on their architecture.</p><p>This work instead employs a contrastive learning approach based on mutual information to capture correlations in the representational space. The similarity between representations is estimated with a critic model, trained along with the student, which takes two representations and outputs a similarity between 0 and 1, making it possible for the teacher and student to have feature vectors of different sizes.</p><p><strong>Proposed solution</strong>: Instead of just maximizing the dot product between the representations of the teacher and the student, or minimizing the L1 distance between them, this work proposes mutual information as the criterion to maximize.</p><p>Mutual information is a measure between two variables that tells how much information one variable carries about the&nbsp;other.</p><p>Adopting a contrastive learning framework, the authors treat student and teacher representations of the same input as positive pairs, increasing the mutual information between them, and student and teacher representations of different inputs as negative pairs, decreasing the mutual information between them (a minimal sketch follows below).</p>
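<p>A minimal sketch of the idea, not the paper&#8217;s exact NCE objective (which draws many negatives from a memory buffer): a small learned critic projects student and teacher features into a shared space, and an InfoNCE-style loss treats same-input pairs as positives and in-batch mismatches as negatives; the feature sizes are placeholders.</p><pre><code class="language-python"># Hypothetical sketch of a contrastive representation-distillation loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveCritic(nn.Module):
    """Projects differently sized student/teacher features into a shared space."""
    def __init__(self, dim_student, dim_teacher, dim_shared=128):
        super().__init__()
        self.proj_s = nn.Linear(dim_student, dim_shared)
        self.proj_t = nn.Linear(dim_teacher, dim_shared)

    def forward(self, feat_s, feat_t, tau=0.07):
        z_s = F.normalize(self.proj_s(feat_s), dim=-1)  # [B, d]
        z_t = F.normalize(self.proj_t(feat_t), dim=-1)  # [B, d]
        logits = z_s @ z_t.t() / tau                    # pairwise similarities
        targets = torch.arange(z_s.size(0), device=z_s.device)
        # diagonal entries are positive (same-input) pairs; the rest are negatives
        return F.cross_entropy(logits, targets)

critic = ContrastiveCritic(dim_student=256, dim_teacher=512)
loss = critic(torch.randn(32, 256), torch.randn(32, 512))
</code></pre>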
<p><strong>Results and Conclusions:</strong> The output probabilities of a student network trained with the proposed method achieve higher correlations with the output probabilities of the teacher&nbsp;network.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!JzgM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e3bb186-2d55-4b8b-b216-50af71a90ad5_1024x504.png" alt=""><figcaption class="image-caption">Image reproduced from <a href="https://iclr.cc/virtual_2020/poster_SkgpBJrtvS.html">the&nbsp;paper.</a></figcaption></figure></div>
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from <a href="https://iclr.cc/virtual_2020/poster_SkgpBJrtvS.html">the&nbsp;paper.</a></figcaption></figure></div><p>The proposed method outperforms many of the recently proposed Knowledge Distillations methods.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 424w, https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 848w, https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 1272w, https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 424w, https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 848w, https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 1272w, https://cdn-images-1.medium.com/max/1024/1*TeRjTiXrCn0Gb6xIBDSijw.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from <a href="https://iclr.cc/virtual_2020/poster_SkgpBJrtvS.html">the&nbsp;paper</a>.</figcaption></figure></div><p>Also, the proposed method outperforms all other methods in cases where the architectures of the teacher and student models are very different.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 424w, https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 848w, https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 1272w, https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 424w, https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 848w, https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 1272w, https://cdn-images-1.medium.com/max/1024/1*ncJ432YLBiAshwWVFYe6FQ.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from <a href="https://iclr.cc/virtual_2020/poster_SkgpBJrtvS.html">the&nbsp;paper</a>.</figcaption></figure></div><h4><a href="https://icml.cc/virtual/2020/poster/5785">Feature-map-level Online Adversarial Knowledge Distillation</a> (ICML&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> Even though it is established that transferring knowledge from feature maps is more efficient than just transferring the knowledge of output logit distributions, no method exists to transfer the feature-map-level in an online knowledge distillation setting (where multiple models are trained simultaneously to predict the ground truth label and also to mimic each other&#8217;s behavior).</p><p>It is particularly hard to transfer the feature-map-level knowledge because the feature maps change more rapidly than the output logits and pose a major problem to distillation methods having to transfer representations from this moving&nbsp;target.</p><p>The current method overcomes this challenge with adversarial training by trying to make the representational distributions similar rather than individual representations.</p><p><strong>Proposed solution:</strong> Given two networks, this work proposes three types of losses in the entire objective. 1. Cross-entropy between predicted and ground truth labels for each network. 2. KL divergence losses between the output logits of the two networks (both ways). 3. 
Discriminator losses, in which the feature representations of a given network are treated as fake and the representations from the other network are treated as&nbsp;real.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*YOE3xhyVQyDLwVoVJ-Gqmw.png"><img src="https://cdn-images-1.medium.com/max/1024/1*YOE3xhyVQyDLwVoVJ-Gqmw.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://icml.cc/virtual/2020/poster/5785">the&nbsp;paper</a>.</figcaption></figure></div><p>The cross-entropy loss brings in the information from the dataset, and the KL divergence acts as standard knowledge distillation of the output logits. Finally, the discriminator losses let each network keep learning from the other even while the actual representations are changing, by pushing the entire feature distributions to be similar.</p>
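<p>A minimal PyTorch sketch of this three-part objective, assuming two classifier networks that expose feature maps and logits, plus one sigmoid-output discriminator per network; the function names, loss weights, and training schedule here are illustrative, not the paper&#8217;s exact implementation:</p><pre><code>import torch
import torch.nn.functional as F

def adversarial_losses(feat1, feat2, d1, d2):
    # Discriminator d1 treats network 1's feature maps as fake and
    # network 2's as real; d2 is symmetric. detach() keeps the feature
    # extractors fixed while the discriminators are updated.
    def d_term(d, fake_feat, real_feat):
        p_fake, p_real = d(fake_feat.detach()), d(real_feat.detach())
        return (F.binary_cross_entropy(p_fake, torch.zeros_like(p_fake))
                + F.binary_cross_entropy(p_real, torch.ones_like(p_real)))
    d_loss = d_term(d1, feat1, feat2) + d_term(d2, feat2, feat1)

    # Each network then tries to fool its discriminator, which pushes its
    # feature distribution toward the other network's distribution.
    p1, p2 = d1(feat1), d2(feat2)
    g_loss = (F.binary_cross_entropy(p1, torch.ones_like(p1))
              + F.binary_cross_entropy(p2, torch.ones_like(p2)))
    return g_loss, d_loss

def network_loss(logits1, logits2, labels, g_loss, T=3.0):
    # 1. Cross-entropy against the ground-truth labels for both networks.
    ce = F.cross_entropy(logits1, labels) + F.cross_entropy(logits2, labels)
    # 2. Mutual KL divergence between softened logits (both directions);
    #    each network treats the other's prediction as a fixed target.
    log_p1 = F.log_softmax(logits1 / T, dim=1)
    log_p2 = F.log_softmax(logits2 / T, dim=1)
    kl = (F.kl_div(log_p1, log_p2.exp().detach(), reduction="batchmean")
          + F.kl_div(log_p2, log_p1.exp().detach(), reduction="batchmean"))
    # 3. g_loss is the adversarial feature-map term from above.
    return ce + kl + g_loss
</code></pre>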
<p><strong>Results and Conclusions:</strong> The proposed method outperforms the direct alignment-based methods&nbsp;(L1).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*9O0DNcn8hlq2Tai0-WTANw.png"><img src="https://cdn-images-1.medium.com/max/1024/1*9O0DNcn8hlq2Tai0-WTANw.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://icml.cc/virtual/2020/poster/5785">the&nbsp;paper</a>.</figcaption></figure></div><p>The proposed method also outperforms previous online distillation methods, both when the two networks share the same architecture and when they differ.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*Mh0S5Z7LOfA14ziCbaNy8Q.png"><img src="https://cdn-images-1.medium.com/max/1024/1*Mh0S5Z7LOfA14ziCbaNy8Q.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://icml.cc/virtual/2020/poster/5785">the&nbsp;paper</a>.</figcaption></figure></div>
title="" srcset="https://cdn-images-1.medium.com/max/1024/1*Mh0S5Z7LOfA14ziCbaNy8Q.png 424w, https://cdn-images-1.medium.com/max/1024/1*Mh0S5Z7LOfA14ziCbaNy8Q.png 848w, https://cdn-images-1.medium.com/max/1024/1*Mh0S5Z7LOfA14ziCbaNy8Q.png 1272w, https://cdn-images-1.medium.com/max/1024/1*Mh0S5Z7LOfA14ziCbaNy8Q.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from <a href="https://icml.cc/virtual/2020/poster/5785">the&nbsp;paper</a>.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 424w, https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 848w, https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 424w, https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 848w, https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*cL_22SPlCBo8LOWfNEokbg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from<a href="https://icml.cc/virtual/2020/poster/5785"> the&nbsp;paper</a>.</figcaption></figure></div><h4><a href="https://nips.cc/virtual/2020/public/poster_3f5ee243547dee91fbd053c1c4a845aa.html">MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers</a> (NeurIPS&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> Existing methods dealt with distilling the task-specific models where the pre-trained transformer-based language models are fine-tuned, but no work exists which deals with task-agnostic distillations of pre-trained models which are then fine-tuned with fewer resources compared to the large pre-trained models.</p><p>The existing distillation methods for transformers either constrain the number of layers, or the hidden representation size of the student model, or adopts layer-to-layer distillation which needs extra parameters.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/795/1*_lA4l7HakUv_YRnrtWtfYw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/795/1*_lA4l7HakUv_YRnrtWtfYw.png 424w, https://cdn-images-1.medium.com/max/795/1*_lA4l7HakUv_YRnrtWtfYw.png 
<p><strong>Proposed solution:</strong> This work proposes a distillation method that mimics only the behavior of the last transformer layer, which removes the constraint on the number of layers of the student&nbsp;network.</p><p>The novelty of this work is that, in addition to the attention weight distributions of the self-attention layer that previous methods transfer, it also distills so-called value-relational knowledge: the student mimics the value&#8211;value scaled dot-product of the teacher model. See the figure&nbsp;below.</p>
<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/775/1*t8dea2kfbOdQflbeiHbcFw.png"><img src="https://cdn-images-1.medium.com/max/775/1*t8dea2kfbOdQflbeiHbcFw.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_3f5ee243547dee91fbd053c1c4a845aa.html">the&nbsp;paper</a>.</figcaption></figure></div><p>Because value-relational knowledge is pair-wise relational knowledge (a seq-len x seq-len relation matrix rather than raw hidden vectors), it also removes the constraint on the hidden representation size of the student&nbsp;model.</p>
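<p>A minimal PyTorch sketch of the two relation losses, assuming last-layer query/key/value tensors shaped [batch, heads, seq_len, head_dim] and an equal number of attention heads in teacher and student; the names are ours and the paper&#8217;s loss weighting is omitted:</p><pre><code>import torch
import torch.nn.functional as F

def minilm_loss(q_t, k_t, v_t, q_s, k_s, v_s):
    def relation(a, b):
        # Scaled dot-product relations: [batch, heads, seq_len, seq_len].
        return (a @ b.transpose(-1, -2)) / (a.size(-1) ** 0.5)

    # Self-attention distribution transfer (QK^T relations).
    att = F.kl_div(F.log_softmax(relation(q_s, k_s), dim=-1),
                   F.softmax(relation(q_t, k_t), dim=-1),
                   reduction="batchmean")
    # Value-relation transfer (VV^T relations). Both relation matrices are
    # seq_len x seq_len, so the student's head_dim (and hence its hidden
    # size) does not have to match the teacher's.
    vr = F.kl_div(F.log_softmax(relation(v_s, v_s), dim=-1),
                  F.softmax(relation(v_t, v_t), dim=-1),
                  reduction="batchmean")
    return att + vr
</code></pre>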
class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/800/1*IbKU8wA3usaW6W6k7d17CA.png 424w, https://cdn-images-1.medium.com/max/800/1*IbKU8wA3usaW6W6k7d17CA.png 848w, https://cdn-images-1.medium.com/max/800/1*IbKU8wA3usaW6W6k7d17CA.png 1272w, https://cdn-images-1.medium.com/max/800/1*IbKU8wA3usaW6W6k7d17CA.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_3f5ee243547dee91fbd053c1c4a845aa.html">the&nbsp;paper</a>.</figcaption></figure></div><p>The proposed method also outperforms different methods for 3-, 4-, and 6-layer transformers for small hidden representation sizes as&nbsp;well.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 424w, https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 848w, https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 1272w, https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 424w, https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 848w, https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 1272w, https://cdn-images-1.medium.com/max/975/1*FaMQTFJdbxn84xR354GaAw.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_3f5ee243547dee91fbd053c1c4a845aa.html">the&nbsp;paper</a>.</figcaption></figure></div><h4><a href="https://nips.cc/virtual/2020/public/poster_657b96f0592803e25a4f07166fff289a.html">Residual Distillation: Towards Portable Deep Neural Networks without Shortcuts</a> (NeurIPS&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> Residual connections (a layer takes the output of the last layer as well as the input of the last layer, i.e. F(x)+x) avoid vanishing gradient problem and allow us to train very deep neural networks (ResNet-152!). But these residual connections occupy 40% of the total memory usage because of the need to store more activations and gradients.</p><p>Plain-CNN would have low latency and a low memory footprint due to the removal of residual connections with a 20&#8211;30% reduction. 
<p>A distillation method that learns a network without these residual connections, but with the same accuracy, is therefore very desirable.</p><p><strong>Proposed solution:</strong> This work distills the knowledge of a teacher ResNet, with residual connections, into a plain-CNN student with all of the residual connections removed.</p><p>To avoid the vanishing gradient problem, the method trains the plain-CNN jointly with the corresponding ResNet. Specifically, during the forward pass, the output of the plain-CNN&#8217;s initial layers is passed both into the deeper layers of the plain-CNN and into the deeper layers of the&nbsp;ResNet.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*EDzNwuZpkO-PuFF6HLrJNw.png"><img src="https://cdn-images-1.medium.com/max/1024/1*EDzNwuZpkO-PuFF6HLrJNw.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_657b96f0592803e25a4f07166fff289a.html">the&nbsp;paper</a>.</figcaption></figure></div><p>Thus, gradients reaching the initial layers of the plain-CNN flow through the residual connections of the ResNet, but inference after training uses only the plain-CNN, avoiding the memory and computation overhead of residual connections.</p>
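<p>A toy sketch of the joint forward pass, assuming the plain-CNN and the ResNet are split into stage lists with compatible feature shapes and share a classifier head; this is our reading of the routing, not the paper&#8217;s code:</p><pre><code>def joint_forward(x, plain_stages, res_stages, head):
    """Run every prefix of the plain-CNN, then finish each partial forward
    through the ResNet's remaining residual stages. Gradients from each
    hybrid path reach the student's early layers via shortcut connections."""
    hybrid_logits = []
    feats = x
    for i, stage in enumerate(plain_stages):
        feats = stage(feats)                  # shortcut-free student stage
        out = feats
        for res_stage in res_stages[i + 1:]:  # teacher's residual deep stages
            out = res_stage(out)
        hybrid_logits.append(head(out))
    student_logits = head(feats)              # pure plain-CNN path
    return student_logits, hybrid_logits
</code></pre><p>At inference time only <code>plain_stages</code> and <code>head</code> are kept, which is where the memory and latency savings come from.</p>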
<p><strong>Results and Conclusions:</strong> The plain-CNN trained with this method performs competitively with the corresponding ResNet and outperforms other knowledge distillation methods.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*-wVL3SKVIULHhc65aKci8w.png"><img src="https://cdn-images-1.medium.com/max/1024/1*-wVL3SKVIULHhc65aKci8w.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_657b96f0592803e25a4f07166fff289a.html">the&nbsp;paper</a>.</figcaption></figure></div><p>Even with pruning on top of knowledge distillation, the plain-CNN performs competitively, and sometimes better than the correspondingly pruned ResNet, at pruning ratios of 30 and 60&nbsp;percent.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/729/1*n6Nay83pLvlwoZ9For5RUg.png"><img src="https://cdn-images-1.medium.com/max/729/1*n6Nay83pLvlwoZ9For5RUg.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_657b96f0592803e25a4f07166fff289a.html">the&nbsp;paper</a>.</figcaption></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/729/1*n6Nay83pLvlwoZ9For5RUg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/729/1*n6Nay83pLvlwoZ9For5RUg.png 424w, https://cdn-images-1.medium.com/max/729/1*n6Nay83pLvlwoZ9For5RUg.png 848w, https://cdn-images-1.medium.com/max/729/1*n6Nay83pLvlwoZ9For5RUg.png 1272w, https://cdn-images-1.medium.com/max/729/1*n6Nay83pLvlwoZ9For5RUg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_657b96f0592803e25a4f07166fff289a.html">the&nbsp;paper</a>.</figcaption></figure></div><h4><a href="https://nips.cc/virtual/2020/public/poster_912d2b1c7b2826caf99687388d2e8f7c.html">Kernel-Based Progressive Distillation for Adder Neural Networks</a> (NeurIPS&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> Adder Neural Networks (ANNs) replace the convolutional operation in CNNs with additions (L1 operator) thus reducing the computational resources amenable to deploying in low-resource environments such as mobile phones and&nbsp;cameras.</p><p>In practice it is observed that ANNs are difficult to optimize and thus the performance of ANNs is still lower than CNNs. This calls for methods that can train ANNs efficiently and increase their performance to be competitive with traditional CNNs.</p><p><strong>Proposed solution:</strong> This work proposes a method to distill the knowledge from CNN into an ANN, both the knowledge of output logit distributions and intermediate representations.</p><p>First, the authors observe that the optimal weight initialization of ANNs is based on Laplace distribution, but Gaussian distribution is used for CNNs and that the representation distributions in ANNs and CNNs are of different characteristics.</p><p>Thus, they devise a kernel method to transform the intermediate features from ANNs and CNNs into one vector space to enable distillation. 
<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/934/1*hJT_wRUdsPrL0WKpoVcSBg.png"><img src="https://cdn-images-1.medium.com/max/934/1*hJT_wRUdsPrL0WKpoVcSBg.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_912d2b1c7b2826caf99687388d2e8f7c.html">the&nbsp;paper</a>.</figcaption></figure></div><p><strong>Results and Conclusions:</strong> The PKKD ANN outperforms Binary Neural Networks (BNNs) and other ANN variants on CIFAR-10 and CIFAR-100.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/846/1*G8R3cdGh6liz6Kh-uBWFcQ.png"><img src="https://cdn-images-1.medium.com/max/846/1*G8R3cdGh6liz6Kh-uBWFcQ.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_912d2b1c7b2826caf99687388d2e8f7c.html">the&nbsp;paper</a>.</figcaption></figure></div>
<p>And also on the ImageNet&nbsp;dataset.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/786/1*IJ4dMuuHQ7UixgtPnwNkng.png"><img src="https://cdn-images-1.medium.com/max/786/1*IJ4dMuuHQ7UixgtPnwNkng.png" alt=""></a><figcaption class="image-caption">Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_912d2b1c7b2826caf99687388d2e8f7c.html">the&nbsp;paper</a>.</figcaption></figure></div><h4>Conclusion</h4><p>Research on neural network distillation, and NN compression more generally, is evolving to be more scientific and rigorous. One reason is, undoubtedly, the interaction between the wide adoption of deep learning methods in computer vision, NLP, and elsewhere, and the increasing amounts of memory, energy, and compute required by state-of-the-art methods.</p><p>From the research in 2020, we learned that imitating feature representations is more efficient than imitating logits alone. We can now turn a ResNet into a feed-forward network without much degradation in performance, and we have a more efficient, task-agnostic way to distill large pre-trained language&nbsp;models.</p><p>Going into 2021, it would be great to see approaches that combine distillation with pruning and quantization directly, rather than simply performing two or three of them in sequence.</p><p><em>Editor&#8217;s Note: <a href="https://heartbeat.comet.ml/">Heartbeat</a> is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners.
We&#8217;re committed to supporting and inspiring developers and engineers from all walks of&nbsp;life.</em></p><p><em>Editorially independent, Heartbeat is sponsored and published by <a href="http://comet.ml/?utm_campaign=heartbeat-statement&amp;utm_source=blog&amp;utm_medium=medium">Comet</a>, an MLOps platform that enables data scientists &amp; ML teams to track, compare, explain, &amp; optimize their experiments. We pay our contributors, and we don&#8217;t sell&nbsp;ads.</em></p><p><em>If you&#8217;d like to contribute, head on over to our <a href="https://heartbeat.fritz.ai/call-for-contributors-october-2018-update-fee7f5b80f3e">call for contributors</a>. You can also sign up to receive our weekly newsletters (<a href="https://www.deeplearningweekly.com/">Deep Learning Weekly</a> and the <a href="https://info.comet.ml/newsletter-signup/">Comet Newsletter</a>), join us on <a href="https://join.slack.com/t/cometml/shared_invite/zt-49v4zxxz-qHcTeyrMEzqZc5lQb9hgvw">Slack</a>, and follow Comet on <a href="https://twitter.com/Cometml">Twitter</a> and <a href="https://www.linkedin.com/company/comet-ml/">LinkedIn</a> for resources, events, and much more that will help you build better ML models,&nbsp;faster.</em></p><div><hr></div><p><a href="https://heartbeat.comet.ml/knowledge-distillation-research-review-2020-e0333f554863">Knowledge Distillation Research Review</a> was originally published in <a href="https://heartbeat.comet.ml">Heartbeat</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded></item><item><title><![CDATA[Neural Network Quantization Research Review]]></title><description><![CDATA[The latest advances and insights for Neural Network model compression with Quantization]]></description><link>https://www.aipapertrails.com/p/neural-network-quantization-research-review-2020-6d72b06f09b1</link><guid isPermaLink="false">https://www.aipapertrails.com/p/neural-network-quantization-research-review-2020-6d72b06f09b1</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Tue, 19 Jan 2021 13:49:45 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>The latest advances and insights for Neural Network model compression with Quantization</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png"><img src="https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png" alt=""></a></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png 424w, https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png 848w, https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png 1272w, https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><p>Neural network quantization is a process of reducing the precision of the weights in the neural network, thus reducing the memory, computation, and energy bandwidths.</p><p>Particularly when deploying NN models on mobile or edge devices, quantization, and model compression in general, is desirable and often the only plausible way to deploy a mobile model as the memory and computational budget of these devices is very&nbsp;limited.</p><p>Why not use potential infinite virtual memory and computational power from cloud machines? While a lot of NN models are running on the cloud even now, latency is not low enough for mobile/edge devices, which hinders utility and requires data to be transferred to the cloud which rises a lot of privacy concerns.</p><h4>Preliminaries of Neural Network Quantization</h4><p>There are a lot of quantization methods in the literature today and to best understand the pros and cons of each method it helps to classify quantization methods. Quantization in a general sense is assigning a discrete value from a pre-specified set of values for a given&nbsp;input.</p><p><strong>Quantization-aware training vs. post-training quantization:</strong> When the quantization operations are embedded into a neural network and then trained it is called quantization-aware training. And if quantization is performed on a model after training, for example with rounding mechanisms, it is called post-training quantization.</p><p><strong>Scalar quantization vs. vector quantization vs. product quantization:</strong> If the input is scalar and the pre-specified set is also a scalar, then it is called scalar quantization (for example, making FP32 floating point into INT8 integer). If the input is a vector and it is assigned to a vector, it is called vector quantization. Product quantization is a form of vector quantization with increased granularity as it works with sub-vectors rather than the entire&nbsp;vector.</p><p><strong>Unified quantization or non-unified quantization: </strong>During quantization, if there is a restriction that discrete values that can be assigned should be at equal step size, then it is called unified quantization. Converting FP32 to INT8 is an example of unified quantization. Alternately, if the discrete set of values to be assigned doesn&#8217;t have any restrictions it is called non-unified quantization. Vector quantization and product quantization are examples of this&nbsp;type.</p><p><strong>Fixed-precision quantization vs. 
<p>Now we&#8217;ll dive into six research papers that addressed these problems in&nbsp;2020.</p><h4><a href="https://iclr.cc/virtual_2020/poster_rJehVyrKwH.html">And the Bit Goes Down: Revisiting the Quantization of Neural Networks</a> (ICLR&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> State-of-the-art deep learning models take a lot of memory. Even ResNet-50, which is no longer state-of-the-art, takes ~100MB, and Faster-RCNN takes&nbsp;~200MB.</p><p>Compressing the model weights alone is desirable for deploying deep learning solutions in low-memory environments.</p><p><strong>Proposed solution:</strong> This work applies Product Quantization (PQ) to the weights of the network, representing the whole network with very few floating-point numbers.</p><p><strong>Product Quantization:</strong> First, each column of the weight matrix is split into d sub-vectors, so a matrix with M columns yields a total of M*d sub-vectors. These sub-vectors are grouped into clusters, sometimes while minimizing a pre-specified loss. To compress the matrix, each sub-vector is then replaced with the centroid of the cluster it falls into. This achieves stark compression, as we store only the cluster centroid vectors and a cluster label per sub-vector, rather than all the sub-vectors of the&nbsp;matrix.</p><p>This work applies the same process to the weight matrix of each convolutional layer, which is of size (<strong>C_out</strong> x <strong>C_in</strong> x <strong>K</strong> x <strong>K</strong>), where C_out and C_in are the numbers of channels going out of and coming into a layer and K is the height and width of a square&nbsp;kernel.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*D6HYeeo1EHKVGoIB3rLjMA.png"><img src="https://cdn-images-1.medium.com/max/1024/1*D6HYeeo1EHKVGoIB3rLjMA.png" alt=""></a><figcaption class="image-caption">Figure reproduced from <a href="https://iclr.cc/virtual_2020/poster_rJehVyrKwH.html">the&nbsp;paper</a>.</figcaption></figure></div>
<p>First, the weight matrix is reshaped into a 2D matrix of size (<strong>C_out</strong> x <strong>C_in*K*K</strong>). Columns of length C_in*K*K are then split into C_in sub-vectors, each of length K*K, resulting in C_out*C_in sub-vectors.</p><p>The sub-vector labels are learned with weighted k-means, with the objective of reducing the offset (reconstruction loss) between the activations produced by the original weight matrix and the activations produced by the matrix in which every sub-vector is replaced by its cluster centroid.</p><p>(They also show that reconstructing the activations of a layer is more effective than reconstructing the actual weights of the network.)</p>
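<p>A bare-bones PyTorch sketch of the splitting and clustering step, using plain k-means for brevity; the paper&#8217;s version instead weights the objective by activation reconstruction error and refines centroids with an EM-style procedure:</p><pre><code>import torch

def pq_compress_conv(weight, n_centroids=256, iters=10):
    # weight: [C_out, C_in, K, K] -> C_out*C_in sub-vectors of length K*K.
    c_out, c_in, k, _ = weight.shape
    subvecs = weight.reshape(c_out * c_in, k * k)
    # Naive k-means (random init; k-means++ and activation weighting omitted).
    centroids = subvecs[torch.randperm(len(subvecs))[:n_centroids]].clone()
    for _ in range(iters):
        assign = torch.cdist(subvecs, centroids).argmin(dim=1)
        for j in range(n_centroids):
            members = subvecs[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(dim=0)
    return centroids, assign  # store these instead of the full weight tensor

def pq_reconstruct(centroids, assign, shape):
    return centroids[assign].reshape(shape)
</code></pre>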
data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 424w, https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 848w, https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 424w, https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 848w, https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*nKXlA7jnAZ2N1vWtcwe3Kg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure reproduced from <a href="https://iclr.cc/virtual_2020/poster_rJehVyrKwH.html">the&nbsp;paper</a>.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 424w, https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 848w, https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 1272w, https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 424w, https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 848w, https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 1272w, https://cdn-images-1.medium.com/max/1024/1*9VpUbfS0q49tDKMTIhGQ-Q.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure reproduced from <a href="https://iclr.cc/virtual_2020/poster_rJehVyrKwH.html">the current&nbsp;paper</a>.</figcaption></figure></div><h4><a href="https://iclr.cc/virtual_2020/poster_rygfnn4twS.html">AutoQ: Automated 
<p><strong>Outstanding problem:</strong> Kernel-wise quantization (a different bit-width for each kernel) has been shown to be more effective than layer-wise quantization, which in turn is more effective than network-wise quantization.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png"><img src="https://cdn-images-1.medium.com/max/763/1*WiOvCAqNnUj0pgNLr1OXeQ.png" alt=""></a><figcaption class="image-caption">Figure reproduced from <a href="https://iclr.cc/virtual_2020/poster_rygfnn4twS.html">the&nbsp;paper</a>.</figcaption></figure></div><p>Because the search space of all possible bit-width combinations is huge, manual engineering of kernel-wise bit-widths is sub-optimal, and it has been shown that even a generic RL agent like DDPG can&#8217;t find a good-enough policy. This work proposes a method that learns kernel-wise bit-widths more efficiently than any existing&nbsp;method.</p><p><strong>Proposed solution:</strong> This work proposes a Hierarchical Reinforcement Learning (HRL) method that first chooses layer-wise bit-widths with a High-Level Controller (HLC) and then chooses kernel-wise bit-widths with a Low-Level Controller (LLC) within each&nbsp;layer.</p><p>The HLC chooses the bit-width for the activations of a particular layer and also the mean bit-width of all that layer&#8217;s kernels (termed the &#8216;goal&#8217; for the LLC). The LLC then chooses bit-widths for all the kernels in that particular layer.</p>
<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*qGJfYLrPe7y-r2-gWxgluQ.png"><img src="https://cdn-images-1.medium.com/max/1024/1*qGJfYLrPe7y-r2-gWxgluQ.png" alt=""></a><figcaption class="image-caption">Figure reproduced from <a href="https://iclr.cc/virtual_2020/poster_rygfnn4twS.html">the&nbsp;paper</a>.</figcaption></figure></div><p>The reward for this HRL agent is devised so that lower FLOPs and memory requirements and higher test accuracy yield a higher&nbsp;reward.</p><p>To accelerate training, the LLC is given an additional intrinsic reward when it completes the goal set by the HLC, i.e., when it chooses bit-widths for the kernels such that their mean equals the mean the HLC predicted for that particular&nbsp;layer.</p>
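<p>A toy sketch of this reward shaping; the coefficients, normalization, and exact functional form in the paper differ, so treat this purely as an illustration:</p><pre><code>def hlc_reward(accuracy, flops, memory, lam=0.1, mu=0.1):
    # Higher test accuracy is rewarded; compute and memory are penalized.
    return accuracy - lam * flops - mu * memory

def llc_reward(extrinsic, kernel_bits, goal_mean_bits, beta=1.0):
    # Intrinsic bonus: the chosen kernel bit-widths should average out to
    # the mean bit-width (the "goal") the HLC set for this layer.
    mean_bits = sum(kernel_bits) / len(kernel_bits)
    return extrinsic - beta * abs(mean_bits - goal_mean_bits)
</code></pre>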
<p><strong>Results and conclusion:</strong> The proposed method outperforms previous RL methods like DDPG. It also outperforms HIRO, even though the implementation is based on HIRO, thanks to the intrinsic reward designed for the&nbsp;LLC.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/610/1*38l3UJxt6Kh1BMo_LBmxbA.png"><img src="https://cdn-images-1.medium.com/max/610/1*38l3UJxt6Kh1BMo_LBmxbA.png" alt=""></a><figcaption class="image-caption">Figure reproduced from <a href="https://iclr.cc/virtual_2020/poster_rygfnn4twS.html">the&nbsp;paper</a>.</figcaption></figure></div><p>They also show that kernel-wise quantization is the right granularity at which to choose bit-widths, with an experiment showing that sub-kernel-wise quantization doesn&#8217;t improve latency (inference time) beyond kernel-wise quantization.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*XB3M75UnyrYWmsc7oA1vVQ.png"><img src="https://cdn-images-1.medium.com/max/1024/1*XB3M75UnyrYWmsc7oA1vVQ.png" alt=""></a><figcaption class="image-caption">Figure reproduced from <a href="https://iclr.cc/virtual_2020/poster_rygfnn4twS.html">the&nbsp;paper</a>.</figcaption></figure></div>
<h4><a href="https://icml.cc/virtual/2020/poster/6429">Differentiable Product Quantization for End-to-End Embedding Compression</a></h4><p><strong>Outstanding problem:</strong> The embedding layer alone accounts for 95% of all parameters in a medium-sized LSTM with the vocabulary of the PTB language-modeling dataset.</p><p>Compressing embedding layers without a loss in model performance would make such models mobile- and edge-device-friendly.</p><p><strong>Proposed solution:</strong> This method adopts Product Quantization (PQ), where the columns of a matrix are split into sub-vectors, the sub-vectors are clustered, and each is finally replaced with the centroid sub-vector of the cluster it belongs to.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*khjTW1tG21SsOCc-_2UjOA.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/6429">the paper</a>.</figcaption></figure></div><p>The notable aspect of this method is that it makes PQ differentiable by approximating the non-differentiable operators with stop-gradient operators.</p><p>Given a weight matrix to compress (the query matrix Q), the proposed method learns a codebook that contains the centroid vectors of each cluster, along with the cluster label of each sub-vector.</p>
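<p>As a concrete illustration, here is a minimal sketch of plain (non-differentiable) product quantization of an embedding matrix; the sizes and the tiny k-means loop are illustrative assumptions, and the paper&#8217;s contribution is precisely to make the assignment step differentiable with stop-gradient-style approximations.</p><pre><code>
import numpy as np

# Minimal sketch of product quantization of an embedding matrix.
rng = np.random.default_rng(0)
V, D, G, K = 1000, 64, 8, 16            # vocab, dim, groups, centroids/group
E = rng.normal(size=(V, D)).astype(np.float32)

subdim = D // G
codes = np.empty((V, G), dtype=np.int64)        # discrete representation
codebook = np.empty((G, K, subdim), np.float32)

for g in range(G):
    X = E[:, g * subdim:(g + 1) * subdim]       # sub-vectors for this group
    C = X[rng.choice(V, K, replace=False)]      # init centroids
    for _ in range(10):                         # a few k-means steps
        d = ((X[:, None, :] - C[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(K):
            if (assign == k).any():
                C[k] = X[assign == k].mean(0)
    codes[:, g] = assign
    codebook[g] = C

# At inference, the codebook maps codes back to continuous vectors.
E_hat = np.concatenate([codebook[g][codes[:, g]] for g in range(G)], axis=1)
print(E_hat.shape, np.mean((E - E_hat) ** 2))   # reconstruction error
</code></pre>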
<p>At inference, the codebook is used to map the discrete representation (a list of cluster labels for each column) back to a continuous vector representation.</p><p><strong>Results and conclusion:</strong> The proposed method outperforms recently proposed embedding compression methods.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*5Axjs4qQ9TPFmiHTd4KfKA.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/6429">the paper</a>.</figcaption></figure></div><p>Also, it outperforms traditional embedding compression approaches such as scalar quantization, standard product quantization, and low-rank approximation.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/745/1*rHvATuLPf4R7Pz6tqVWDjQ.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/6429">the paper</a>.</figcaption></figure></div>
srcset="https://cdn-images-1.medium.com/max/745/1*rHvATuLPf4R7Pz6tqVWDjQ.png 424w, https://cdn-images-1.medium.com/max/745/1*rHvATuLPf4R7Pz6tqVWDjQ.png 848w, https://cdn-images-1.medium.com/max/745/1*rHvATuLPf4R7Pz6tqVWDjQ.png 1272w, https://cdn-images-1.medium.com/max/745/1*rHvATuLPf4R7Pz6tqVWDjQ.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/6429">the&nbsp;paper.</a></figcaption></figure></div><h4><a href="https://icml.cc/virtual/2020/poster/5787">Towards Accurate Post-training Network Quantization via Bit-Split and Stitching</a></h4><p><strong>Outstanding problem: </strong>Quantizing the neural network before training or while training (Quantization-aware training; with quantized operations embedded in the network) has to device special mechanisms for back-propagating through discrete quantized entities and needs domain expertise for hyperparameter tuning and also can&#8217;t work with models that are already&nbsp;trained.</p><p>On the other hand, post-training quantization avoids all of that but is shown to be ineffective for bit-widths lower than 8. TF-Lite is the previous state-of-the-art method that only uses 8-bit post-training quantization.</p><p><strong>Proposed solution:</strong> This work proposes an optimization process to calibrate the network with a few unlabeled data and find the optimal low-bit integers to replace the floating-point values in the weight tensors. If the bit-width desired is m, the task is to find the best possible m-bit integer [-2^m-1, +2^m-1] for every floating-point FP32 weight in the weight matrix of a layer (so that it doesn&#8217;t result in performance degradation).</p><p>As this calibration process has a lot of search-space to optimize over, this work proposes to break the m-bit optimization process into m bits and then perform the more tractable optimization step of these individual bits.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png 424w, https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png 848w, https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png 424w, https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png 848w, https://cdn-images-1.medium.com/max/1024/1*TKSe6mrCQTCTmoBGR__Ecg.png 1272w, 
<p>After the optimization, which can be performed analytically under sensible assumptions, the bits are stitched back together by summing them, each multiplied by its associated power-of-two term. Hence, this process is called bit-split and stitching.</p><p><strong>Results and conclusion:</strong> Compared to TF-Lite, the previous state-of-the-art methodology, the proposed method is effective even for 3-bit quantization of a network on the ImageNet dataset.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*b2PGvffvmjuplJyaTohr0Q.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/5787">the paper</a>.</figcaption></figure></div><p>The proposed method also outperforms many recent methods for post-training quantization.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*78mOZ0m-8INzCffhxbA71w.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/5787">the paper</a>.</figcaption></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*78mOZ0m-8INzCffhxbA71w.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*78mOZ0m-8INzCffhxbA71w.png 424w, https://cdn-images-1.medium.com/max/1024/1*78mOZ0m-8INzCffhxbA71w.png 848w, https://cdn-images-1.medium.com/max/1024/1*78mOZ0m-8INzCffhxbA71w.png 1272w, https://cdn-images-1.medium.com/max/1024/1*78mOZ0m-8INzCffhxbA71w.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/5787">the&nbsp;paper</a>.</figcaption></figure></div><p>Also, the proposed method achieves 4-bit weight quantization and 8-bit activation quantization of RetinaNet and Mask-RCNN for object detection and segmentation with only 0.8&#8211;1.2% mAP degradation.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 424w, https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 848w, https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 1272w, https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 424w, https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 848w, https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 1272w, https://cdn-images-1.medium.com/max/645/1*8JtTmt0P_EkYUD3b_rnfcw.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Figure reproduced from<a href="https://icml.cc/virtual/2020/poster/5787"> the&nbsp;paper</a>.</figcaption></figure></div><h4><a href="https://neurips.cc/virtual/2020/public/poster_3f13cf4ddf6fc50c0d39a1d5aeb57dd8.html">Bayesian Bits: Unifying Quantization and&nbsp;Pruning</a></h4><p><strong>Outstanding problem:</strong> Learning channel-wise bit-widths to perform adaptive quantization is very hard. 
Reinforcement learning and hierarchical reinforcement learning approaches have tried to learn effective strategies, but they fall short. This work learns channel-wise bit-widths effectively.</p><p>Also, when one needs to perform both quantization and pruning, methods that unify the two are more effective than applying separate methods one after the other. This work unifies quantization and pruning.</p><p>While many mixed-precision quantization methods find arbitrary bit-widths for each weight tensor, making the network incompatible with standard GPU/TPU accelerators, the proposed method produces strategies with only power-of-two bit-widths, making it hardware-friendly.</p><p><strong>Proposed solution:</strong> First, this work formulates the M-bit quantized value as the sum of the (M/2)-bit quantized value and the residual error between the M-bit and (M/2)-bit values. Similarly, the (M/2)-bit quantized value can be decomposed into the (M/4)-bit quantized value plus the residual between the (M/2)-bit and (M/4)-bit values, and so on.</p><p>Starting from 2-bit quantization, this decomposition allows binary gates that control whether to move to a higher bit-width or not, yielding a mixed-precision strategy. The method learns this strategy, i.e., the binary gates for each weight tensor, in a variational inference framework with a few sensible approximations.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*_tL4-3jIYsKJfrrNS4G7bA.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://neurips.cc/virtual/2020/public/poster_3f13cf4ddf6fc50c0d39a1d5aeb57dd8.html">the paper</a>.</figcaption></figure></div><p>By starting the decomposition with 0-bit quantization (which is practically pruning), the method unifies quantization and pruning in an effective way.</p>
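<p>A hedged sketch of the gated residual decomposition described above; the uniform quantizer and the hard-coded gate values are illustrative assumptions, and the paper learns the gates with variational inference.</p><pre><code>
import numpy as np

def quantize(x, bits, x_max=1.0):
    step = 2 * x_max / (2 ** bits - 1)       # uniform symmetric quantizer
    return np.round(x / step) * step

x = np.random.default_rng(0).uniform(-1, 1, size=5)
x2, x4, x8 = (quantize(x, b) for b in (2, 4, 8))

# Binary gates; the 8-bit residual only counts if the 4-bit gate fires.
z4, z8 = 1.0, 0.0
xq = x2 + z4 * ((x4 - x2) + z8 * (x8 - x4))
assert np.allclose(xq, x4)                   # with these gates, xq == x4
</code></pre>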
<p><strong>Results and conclusion:</strong> The proposed method outperforms many existing fixed-precision and mixed-precision methods on the MNIST and CIFAR-10 datasets.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*tmG1SK6hhEsSEs864pbRWg.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://neurips.cc/virtual/2020/public/poster_3f13cf4ddf6fc50c0d39a1d5aeb57dd8.html">the paper</a>.</figcaption></figure></div><p>Also, it outperforms the existing state-of-the-art methods on the ImageNet dataset with the ResNet-18 and MobileNet V2 architectures.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*ja8GwIw0LRr8jQsofRHJNw.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://neurips.cc/virtual/2020/public/poster_3f13cf4ddf6fc50c0d39a1d5aeb57dd8.html">the paper</a>.</figcaption></figure></div>
<h4><a href="https://icml.cc/virtual/2020/poster/6482">Up or Down? Adaptive Rounding for Post-Training Quantization</a></h4><p><strong>Outstanding problem:</strong> It has previously been shown that when quantizing a value from higher to lower precision, the rounding mechanism used has a large impact on performance.</p><p>Specifically, quantization with nearest-value rounding, always-ceiling or always-flooring rounding, and stochastic rounding varies widely in performance. See the figure below.</p><p>(Stochastic rounding rounds a value down to the lower-precision value below it with probability 1-p, and up with probability p, where p is the normalized distance between the value and the lower-precision value below it.)</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/666/1*rxi3B7zjorscTW7AQSix5g.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/6482">the paper</a>.</figcaption></figure></div><p>This work proposes a better post-training rounding mechanism that requires only a small calibration on a few unlabelled examples.</p><p><strong>Proposed solution:</strong> This work decides whether to round each weight up or down by formulating the choice as an optimization problem that minimizes the reconstruction loss of the network&#8217;s weights.</p><p>The inherently discrete optimization problem is turned into a continuous one through careful formulations and sensible approximations based on a Taylor series expansion.</p>
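<p>A minimal sketch of the rounding choices being compared; the step size and the variable h are illustrative assumptions, and in the paper h is optimized during calibration rather than drawn at random.</p><pre><code>
import numpy as np

rng = np.random.default_rng(0)
w, s = rng.normal(size=5), 0.1
x = w / s                                    # weight in units of the step size

nearest = np.round(x)
always_down, always_up = np.floor(x), np.ceil(x)
p = x - np.floor(x)                          # distance to the value below
stochastic = np.floor(x) + (rng.uniform(size=x.shape) &lt; p)

h = rng.uniform(size=x.shape)                # AdaRound learns a per-weight
adaround = np.floor(x) + (h > 0.5)           # up/down choice instead
print(nearest, stochastic, adaround)
</code></pre>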
<p><strong>Results and conclusion:</strong> The proposed (seemingly simple) method shows considerable effectiveness at post-training quantization and outperforms existing post-training quantization methods.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*YLBLnxUh8iudQ8v1IXB8TQ.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/6482">the paper</a>.</figcaption></figure></div><p>The authors also show that the effectiveness of their method increases with the number of unlabeled examples used for calibration, across different datasets.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/642/1*n7wczONkzkTSZ7fJet51AQ.png" alt=""><figcaption class="image-caption">Figure reproduced from <a href="https://icml.cc/virtual/2020/poster/6482">the paper</a>.</figcaption></figure></div>
<h4>Conclusion</h4><p>Research on Neural Network Quantization, and NN compression more generally, is evolving to be more scientific and rigorous. One of the reasons is, undoubtedly, the tension between the wide adoption of deep learning methods in Computer Vision &amp; NLP and elsewhere, and the increasing amount of memory, energy, and computational resources required by state-of-the-art methods.</p><p>Because of the research in NN Quantization in 2020, we now have layer-wise and channel-wise quantization methods rather than only network-wise ones, we have post-training quantization techniques that improve on the methods offered by TF-Lite, and we have approaches that combine pruning and quantization under one framework and give the advantages of both techniques.</p><p>Going into the future, we will see these advanced and efficient methods made more accessible to developers and researchers by including them in tools like TF-Lite and other NN compression libraries. And it will be exciting to see the research on this topic in 2021, which will have to beat the already strong and efficient approaches introduced this year.</p><p><em>Editor&#8217;s Note: <a href="https://heartbeat.comet.ml/">Heartbeat</a> is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We&#8217;re committed to supporting and inspiring developers and engineers from all walks of life.</em></p><p><em>Editorially independent, Heartbeat is sponsored and published by <a href="http://comet.ml/?utm_campaign=heartbeat-statement&amp;utm_source=blog&amp;utm_medium=medium">Comet</a>, an MLOps platform that enables data scientists &amp; ML teams to track, compare, explain, &amp; optimize their experiments. We pay our contributors, and we don&#8217;t sell ads.</em></p><p><em>If you&#8217;d like to contribute, head on over to our <a href="https://heartbeat.fritz.ai/call-for-contributors-october-2018-update-fee7f5b80f3e">call for contributors</a>.
You can also sign up to receive our weekly newsletters (<a href="https://www.deeplearningweekly.com/">Deep Learning Weekly</a> and the <a href="https://info.comet.ml/newsletter-signup/">Comet Newsletter</a>), join us on <a href="https://join.slack.com/t/cometml/shared_invite/zt-49v4zxxz-qHcTeyrMEzqZc5lQb9hgvw">Slack</a>, and follow Comet on <a href="https://twitter.com/Cometml">Twitter</a> and <a href="https://www.linkedin.com/company/comet-ml/">LinkedIn</a> for resources, events, and much more that will help you build better ML models, faster.</em></p><div><hr></div><p><a href="https://heartbeat.comet.ml/neural-network-quantization-research-review-2020-6d72b06f09b1">Neural Network Quantization Research Review</a> was originally published in <a href="https://heartbeat.comet.ml">Heartbeat</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded></item><item><title><![CDATA[Neural Network Pruning Research Review 2020]]></title><description><![CDATA[The latest advances and insights towards Neural Network model compression with Pruning]]></description><link>https://www.aipapertrails.com/p/neural-network-pruning-research-review-2020-bc21a77f0295</link><guid isPermaLink="false">https://www.aipapertrails.com/p/neural-network-pruning-research-review-2020-bc21a77f0295</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Thu, 24 Dec 2020 14:13:34 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/685/1*p2AMGLzaDaFI4eBLvCTKRg.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>The latest advances and insights towards Neural Network model compression with Pruning</h4><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/685/1*p2AMGLzaDaFI4eBLvCTKRg.png" alt=""></figure></div>
fetchpriority="high"></picture><div></div></div></a></figure></div><p>Neural Network (NN) Pruning is a task of reducing the size of a Neural Network by removing some of its parameters/weights.</p><p>Pruning is often performed with the objective of reducing the memory, computational, and energy bandwidths required for training and deploying NN models which are notorious for their large model size, computational expense, and energy consumption.</p><p>Particularly when deploying NN models on mobiles or edge devices, Pruning, and model compression in general, is desirable and often the only plausible way to deploy as the memory, energy, and computational bandwidths are very&nbsp;limited.</p><p>But, one can ask, why not use potentially infinite virtual memory and computational power from the cloud? While a lot of NN models are running on the cloud even now, latency is not low enough for mobile/edge devices which hinders utility and requires data to be transferred to the cloud, which raises several privacy concerns.</p><p>The year 2020 has been a great year for NN Pruning research and this article discusses six new approaches/insights published at premier peer-reviewed conferences ICLR, ICML, and NeurIPS in&nbsp;2020.</p><h4>Preliminaries of NN Pruning&nbsp;research</h4><p><em>The workflow of a typical Pruning method:</em> A set of parameters of a trained model are zeroed out according to a pre-defined score (usually absolute magnitude) and the network remained is trained again (retraining), thus approximating the accuracy of the original model with this sparse, pruned model requiring drastically lower, when compared to the original model, memory, computational and energy bandwidths because of the fewer parameters.</p><p>To understand the advances in Pruning research it helps to place new techniques as different types across different dimensions (ways to classify) and their unique strengths and weaknesses. The following are a few standard dimensions Pruning methods could be classified into:</p><ol><li><p><strong>Structured vs. Unstructured Pruning</strong></p></li></ol><p>If NN weights are pruned individually in the pruning process, then it is called Unstructured Pruning.</p><p>Zeroing out the parameters randomly gives memory efficiency (model stored in sparse matrices) but may not always give a better computational performance, as we end up doing the same number of matrix multiplications as before. Because matrix dimensions didn&#8217;t change, they are just sparse. Although we could get computational advantages by replacing dense matrix multiplications with sparse matrix multiplications, it is not trivial to accelerate sparse operations on traditional GPUs/TPUs.</p><p>Alternately, in Structured Pruning, the parameters are removed in a structured way to reduce the overall computation required. 
<p>Alternately, in Structured Pruning, the parameters are removed in a structured way that reduces the overall computation required. For example, some of the channels of a CNN or neurons in a feedforward layer are removed, which directly reduces the computation.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/685/1*p2AMGLzaDaFI4eBLvCTKRg.png" alt=""><figcaption class="image-caption">Structured Pruning (pruning neurons) and Unstructured Pruning (pruning connections) illustrated. Image reproduced from <a href="https://www.researchgate.net/publication/334489669_Deep_Learning_With_Edge_Computing_A_Review">the source</a>.</figcaption></figure></div><p><strong>2. Scoring of parameters</strong></p><p>Pruning methods may differ in how they score each parameter, which determines which parameters are chosen over others. Absolute magnitude is the standard score, but one could devise a new scoring method that increases the efficiency of Pruning.</p><p><strong>3. One-shot vs. Iterative</strong></p><p>Rather than pruning the desired amount at once, which is called One-shot Pruning, some approaches repeat the process of pruning the network a little and retraining it until the desired pruning rate is reached, which is called Iterative Pruning.</p><p><strong>4. Scheduling of pruning and fine-tuning</strong></p><p>In the paradigm of Iterative Pruning, previous work showed that learning or designing when to prune, and by how much, helps. A pruning schedule specifies the pruning ratio after each epoch, along with the number of epochs for which the model is retrained whenever it is pruned.</p>
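<p>A hedged sketch of one such iterative schedule follows; the geometric per-step rate and the no-op retraining stub are illustrative assumptions.</p><pre><code>
import numpy as np

def magnitude_prune(W, fraction):
    # Prune the given fraction of the *remaining* nonzero weights.
    thr = np.quantile(np.abs(W[W != 0]), fraction)
    W[np.abs(W) &lt; thr] = 0.0

def retrain(W, epochs):
    pass  # placeholder: fine-tune the remaining weights here

W = np.random.default_rng(0).normal(size=(128, 128))
target, steps = 0.9, 5
per_step = 1 - (1 - target) ** (1 / steps)   # reaches ~target after `steps`
for _ in range(steps):
    magnitude_prune(W, per_step)
    retrain(W, epochs=2)
print(f"final sparsity: {(W == 0).mean():.1%}")
</code></pre>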
<h4><a href="https://nips.cc/virtual/2020/public/poster_703957b6dd9e3a7980e040bee50ded65.html">Neuron-level Structured Pruning using Polarization Regularizer</a> (NeurIPS 2020)</h4><p><strong>Outstanding problem:</strong> In one variant of Structured Pruning, used to remove entire filters of a CNN or all weights associated with a particular neuron, a learnable scaling factor is attached to each filter/neuron and multiplied with that filter&#8217;s/neuron&#8217;s weights in the forward pass.</p><p>To prune the model, all filters/neurons with a scaling factor below a threshold are removed, thus accomplishing Structured Pruning.</p><p>During training, an L1 regularizer is added to the main training loss to push all the scaling factors toward zero, with the aim of keeping scaling factors above the threshold only for filters/neurons that are absolutely necessary.</p><p>This work shows that the L1 regularizer is not effective at contrasting useful with unnecessary filters/neurons and tries to address this inefficiency.</p><p><strong>Proposed solution:</strong> The L1 regularizer tries to push all scaling factors to zero, which is inconsistent with the intuition that only some of the filters are useless while others are useful. It is hard to pick a threshold value that clearly separates the useful filters from the rest, as seen in the left plot of the figure below.</p><p>This work proposes the Polarization Regularizer, which pushes the scaling factors toward the extremes, either 0 or 1, aligning better with the objective of removing some filters while retaining others.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*9uCB0RMGwO5fE30gCRSxvw.png" alt=""><figcaption class="image-caption">L1 regularization yields scaling factors that make it very hard to come up with an optimal threshold, while polarization yields scaling factors for which an optimal threshold is easy to find. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_703957b6dd9e3a7980e040bee50ded65.html">the source</a>.</figcaption></figure></div>
<p>Specifically, this work adds a term to the L1 regularizer that is at its maximum when all the scaling factors are close to each other, thereby pushing the scaling factors toward the two extremes.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*D1xUPIh5DBcXJJfgEjIr_g.png" alt=""><figcaption class="image-caption">The second term in equation two pushes the scaling factors to either extreme, because its value is highest when all the scaling factors are the same. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_703957b6dd9e3a7980e040bee50ded65.html">the source</a>.</figcaption></figure></div>
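<p>The polarization idea can be sketched as follows; treat the exact form and the coefficient t as illustrative of the equation above rather than a faithful copy of the paper&#8217;s notation.</p><pre><code>
import numpy as np

# An L1 term pulls all scaling factors toward zero, while the subtracted
# mean-deviation term penalizes factors that huddle together.
def polarization_reg(gamma, t=1.2):
    return t * np.abs(gamma).sum() - np.abs(gamma - gamma.mean()).sum()

gamma_uniform = np.full(8, 0.5)                  # all factors alike
gamma_polar = np.array([0.0] * 4 + [1.0] * 4)    # factors at the extremes
print(polarization_reg(gamma_uniform))           # 4.8 (higher penalty)
print(polarization_reg(gamma_polar))             # 0.8 (lower penalty)
</code></pre>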
<p><strong>Results and conclusions:</strong> The proposed method beats many existing methods, reducing the FLOPs required by ResNet-50 by 54% with just a 0.52-point drop in accuracy. The authors even show that they can prune the already efficient MobileNet V2 by 28% with just a 0.2-point reduction in accuracy.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*1Q0WpqQw08QSeiRegot84g.png" alt=""><figcaption class="image-caption">FLOPs required are reduced by 54% for ResNet-50 and by 28% for MobileNet V2. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_703957b6dd9e3a7980e040bee50ded65.html">the source</a>.</figcaption></figure></div>
<p>It also allows for aggressive pruning, where structured pruning methods are usually not effective, without compromising on accuracy compared to existing models.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/1024/1*3P8TwvMXHyy-OHi-Z3JNkg.png" alt=""><figcaption class="image-caption">Results of aggressive pruning of ResNet variants and MobileNet V2. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_703957b6dd9e3a7980e040bee50ded65.html">the source</a>.</figcaption></figure></div><h4><a href="https://iclr.cc/virtual_2020/poster_S1gSj0NKvB.html">Comparing Rewinding and Fine-tuning in Neural Network Pruning</a> (ICLR 2020)</h4><p><strong>Outstanding problem:</strong> After training and pruning, there is no evidence in the literature as to which form of retraining is most efficient: Rewinding or Fine-tuning.</p><p><strong>Proposed solution</strong>: This work performs comprehensive empirical studies evaluating the efficiency of Rewinding and Fine-tuning, and finds that Rewinding is more efficient.</p><p>After training and pruning, if the remaining network is retrained from the trained weights, it is called Fine-tuning. Note that while fine-tuning, the same learning rate schedule continues as though it were further training.</p><p>If the remaining network is instead retrained from the weights at initialization (obtained by rewinding), rather than the trained weights, it is called Rewinding.
Note that in Rewinding, since the weights are the network&#8217;s actual initialization weights from before training, the learning rate schedule also starts afresh, as at the start of training, only now with the small pruned network.</p><p>The authors also propose a new retraining technique, Learning Rate Rewinding, to replace standard weight rewinding, and show that it outperforms both Rewinding and Fine-tuning. Learning Rate Rewinding is a hybrid of Fine-tuning and Rewinding: the trained weights are retrained as in Fine-tuning, but the learning rate schedule is rewound as in Rewinding.</p><div class="captioned-image-container"><figure><img src="https://cdn-images-1.medium.com/max/957/1*50OJHnfBlhGcf1og3RyyVg.png" alt=""><figcaption class="image-caption">The step-by-step process of how to perform Learning Rate Rewinding. Image reproduced from <a href="https://iclr.cc/virtual_2020/poster_S1gSj0NKvB.html">the source</a>.</figcaption></figure></div>
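<p>A schematic contrast of the three retraining schemes, following the description above; the 80-epoch schedule and the 40-epoch retraining budget are illustrative assumptions.</p><pre><code>
schedule = [0.1] * 40 + [0.01] * 40      # original learning-rate schedule
k = 40                                   # retraining budget (epochs)

retrain = {
    # Fine-tuning: trained weights, keep running at the final (low) LR.
    "fine-tuning":      ("trained weights theta_T", [schedule[-1]] * k),
    # Weight rewinding: weights rewound to initialization, LR schedule
    # replayed from the start.
    "weight rewinding": ("initial weights theta_0", schedule[:k]),
    # LR rewinding: trained weights, but the LR schedule is rewound and
    # replayed as if training were starting over.
    "LR rewinding":     ("trained weights theta_T", schedule[:k]),
}
for name, (start, lrs) in retrain.items():
    print(f"{name}: start from {start}; LR {lrs[0]} -> {lrs[-1]}")
</code></pre>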
This behavior is consistent whether the pruning is structured or unstructured, and whether it is one-shot or iterative.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 424w, https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 848w, https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 1272w, https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 424w, https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 848w, https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 1272w, https://cdn-images-1.medium.com/max/933/1*DSOs2zeVbhJ8IlrIKLepBw.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Learning Rate Rewinding, proposed in this work, outperforms both weight Rewinding and Fine-tuning, while weight Rewinding outperforms Fine-tuning. Image reproduced from <a href="https://iclr.cc/virtual_2020/poster_S1gSj0NKvB.html">the&nbsp;source</a>.</figcaption></figure></div><h4><a href="https://nips.cc/virtual/2020/public/poster_ccb1d45fb76f7c5a0bf619f979c6cf36.html">Pruning Filter in Filter</a> (NeurIPS&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> Unstructured Pruning achieves more sparsity with minimal accuracy drop, but achieving computational efficiency with it is not straightforward: sparse weights don&#8217;t necessarily reduce the number of matrix multiplications, and sparse matrix multiplications are not usually accelerated well by standard GPUs/TPUs.</p><p>On the other hand, Structured Pruning, which removes entire filters or channels, reduces computation but doesn't afford as much sparsity as Unstructured Pruning&nbsp;methods.</p><p>The tradeoff between structure and sparsity has not been fully exploited. This work shows its importance with the observation that only some parts of a particular filter are useful.
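</p><p>A small sketch of the two granularities discussed above, for a single convolution weight of shape [out_channels, in_channels, k, k] (the shapes and the 90% sparsity target are illustrative assumptions, not numbers from the paper):</p><pre><code>import torch

w = torch.randn(64, 32, 3, 3)  # one conv layer's weights

# Unstructured: zero the individually smallest-magnitude weights.
k = int(0.9 * w.numel())  # aim for ~90% sparsity
cutoff = w.abs().flatten().kthvalue(k).values
unstructured = w * w.abs().gt(cutoff).float()  # very sparse, but irregular

# Structured: drop whole filters with the smallest L1 norms.
filter_norms = w.abs().sum(dim=(1, 2, 3))  # one score per filter
keep = filter_norms.argsort(descending=True)[:32]
structured = w[keep]  # dense [32, 32, 3, 3]: still regular matmuls
</code></pre><p>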
The authors show that removing an entire filter is not optimal.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 424w, https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 848w, https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 1272w, https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 424w, https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 848w, https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 1272w, https://cdn-images-1.medium.com/max/1024/1*H_cJbtzk2bmzbCmTmFhqhQ.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Optimal pruning of the weights in each filter in VGG19. It shows that removing or retaining the entire filter is not optimal. The current work explores this optimality while retaining the computational efficiency of structured pruning methods.
Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_ccb1d45fb76f7c5a0bf619f979c6cf36.html">the&nbsp;source</a>.</figcaption></figure></div><p><strong>Proposed solution: </strong>This work proposes a method called Filter Skeleton, a stripe-wise pruning method that is more granular than Structured Pruning but can still be reduced to fewer dense matrix multiplications, thus giving more sparsity while retaining the computational advantages of Structured Pruning&nbsp;methods.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 424w, https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 848w, https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 1272w, https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 424w, https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 848w, https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 1272w, https://cdn-images-1.medium.com/max/788/1*QBWxO_aDKvDWXxVjS-t8iQ.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Visualization of different pruning methods. The method proposed in this work is Stripe-wise pruning, which is intuitive and has computational advantages. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_ccb1d45fb76f7c5a0bf619f979c6cf36.html">the&nbsp;source</a>.</figcaption></figure></div><p><strong>Results and conclusion: </strong>Stripe-wise pruning outperforms the existing Group-wise pruning method, another Structured Pruning method with finer granularity.
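</p><p>As a rough sketch of the granularity involved (assuming, as in the paper's figures, that a stripe is the set of weights at one spatial position of a filter; the median-threshold rule here is a simplification, not the paper's learned criterion):</p><pre><code>import torch

w = torch.randn(64, 32, 3, 3)              # [out, in, k, k] conv weights
stripe_norms = w.abs().sum(dim=1)          # [64, 3, 3]: one L1 score per stripe
mask = stripe_norms.gt(stripe_norms.median())  # keep roughly the top half
w_pruned = w * mask.unsqueeze(1).float()   # zeros whole stripes, not filters
# Surviving stripes can be regathered into a compact dense computation,
# which is what preserves the efficiency of structured methods.
</code></pre><p>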
See fig.&nbsp;below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 424w, https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 848w, https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 1272w, https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 424w, https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 848w, https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 1272w, https://cdn-images-1.medium.com/max/1024/1*1vdL9HK24q_V7w_MNUR7zw.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Stripe-wise pruning outperforms Group-wise pruning. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_ccb1d45fb76f7c5a0bf619f979c6cf36.html">the&nbsp;source</a>.</figcaption></figure></div><p>The method proposed reduces the required FLOPs of ResNet-18 on ImageNet dataset by 50.4% while increasing the accuracy by&nbsp;0.23%.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 424w, https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 848w, https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 1272w, https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 424w, https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 848w, 
https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 1272w, https://cdn-images-1.medium.com/max/1024/1*kGTSZdzu7_NiAnalg4TtyA.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The FLOPs required for ResNet-18 on the ImageNet dataset are reduced by 50.4% while <strong>increasing</strong> the accuracy by 0.23%. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_ccb1d45fb76f7c5a0bf619f979c6cf36.html">the&nbsp;source</a>.</figcaption></figure></div><h4><a href="https://nips.cc/virtual/2020/public/poster_46a4378f835dc8040c8057beb6a2da52.html">Pruning neural networks without any data by iteratively conserving synaptic flow</a> (NeurIPS&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> Methods for Pruning at initialization, without any training whatsoever, are not efficient.</p><p>This work identifies that the problem with current methods for Pruning at initialization is Layer-dropping, where a method prunes a particular layer completely, making the network untrainable. This occurs when the sparsity we want to achieve is higher than the layer-drop threshold. In the figure below, layer-drop happens in all the methods except the proposed method, SynFlow.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 424w, https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 848w, https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 1272w, https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 424w, https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 848w, https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 1272w, https://cdn-images-1.medium.com/max/501/1*yPKNx8lxrKXNghmeotmSHw.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">For all the previous methods, accuracy drops at a particular compression ratio because the method removes a layer entirely (Layer-drop). But the proposed method SynFlow achieves maximal compression, avoiding the Layer-drop.
Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_46a4378f835dc8040c8057beb6a2da52.html">the&nbsp;source</a>.</figcaption></figure></div><p><strong>Proposed solution:</strong> This work shows theoretically that Iterative Magnitude Pruning avoids Layer-drop through its iterative training and pruning, by achieving a property the authors call Maximal Critical Compression. With these insights, they propose a method that achieves Maximal Critical Compression without any&nbsp;data.</p><p><strong>Results and conclusion:</strong> SynFlow consistently outperforms previous methods for Pruning at initialization in 12 combinations of datasets and models. See the figure&nbsp;below.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 424w, https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 848w, https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 1272w, https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 424w, https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 848w, https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 1272w, https://cdn-images-1.medium.com/max/959/1*-O5rWaQoC0L-CV46IPRwEg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">SynFlow outperforms previous methods in 12 combinations of datasets and models. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_46a4378f835dc8040c8057beb6a2da52.html">the&nbsp;source</a>.</figcaption></figure></div><h4><a href="https://nips.cc/virtual/2020/public/poster_b6af2c9703f203a2794be03d443af2e3.html">The Lottery Ticket Hypothesis for Pre-trained BERT Networks</a> (NeurIPS&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> The lottery ticket hypothesis says that there is a sub-network within the original, complete network that can be trained to achieve the same performance as the entire network. Such sub-networks are called lottery tickets, and they are represented by binary masks that tell which of the weights should be zeroed&nbsp;out.</p><p><strong>Proposed solution:</strong> The authors used Iterative Magnitude Pruning (IMP) to find lottery tickets in the pre-trained BERT&nbsp;model.</p><p>IMP trains a network, removes the weights with the lowest magnitudes, and then repeats this process until the required sparsity is achieved.
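</p><p>A minimal sketch of one IMP round (the helper train() is hypothetical; the rewinding step, described next, reverts the surviving weights to the pre-trained values):</p><pre><code>import torch

def imp_round(model, mask, rewind_state, prune_frac=0.2):
    train(model, mask)  # hypothetical: trains while pruned weights stay zero
    for name, p in model.named_parameters():
        alive = mask[name].bool()
        k = int(prune_frac * int(alive.sum()))  # prune 20% of survivors
        if k == 0:
            continue
        scores = p.detach().abs().masked_fill(~alive, float("inf"))
        cutoff = scores.flatten().kthvalue(k).values
        mask[name] = mask[name] * scores.gt(cutoff).float()  # shrink the ticket
    model.load_state_dict(rewind_state)  # rewind, e.g. to pre-trained BERT weights
    return mask
</code></pre><p>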
It uses Rewinding, where the weights of the pruned network are reverted to their initialization values, here the pre-trained BERT&nbsp;weights.</p><p>The lottery ticket here is the subnetwork of the original pre-trained BERT model which shows similar performance. Note that, because of the Rewinding in IMP, wherever the lottery ticket is not pruned its weights are identical to the pre-trained BERT&nbsp;weights.</p><p><strong>Results and conclusion: </strong>This work shows that the lottery ticket hypothesis holds for pre-trained BERT models as well. It also found subnetworks at 40% to 90% sparsity for a range of downstream tasks.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 424w, https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 848w, https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 1272w, https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 424w, https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 848w, https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 1272w, https://cdn-images-1.medium.com/max/1024/0*z50lKatd3WuSV-AR.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The last row corresponds to the approach introduced in this paper. Even though it is 40%-90% sparse, performance is comparable to the full BERT base. Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_b6af2c9703f203a2794be03d443af2e3.html">the&nbsp;source</a>.</figcaption></figure></div><p>In a work similar to this one, it is shown that lottery tickets exist for NLP and RL as well. See the paper <a href="https://iclr.cc/virtual_2020/poster_S1xnXRVFwH.html">Playing the lottery with rewards and multiple languages: lottery tickets in RL and&nbsp;NLP</a>.</p><h4><a href="https://nips.cc/virtual/2020/public/poster_e3a72c791a69f87b05ea7742e04430ed.html">HYDRA: Pruning Adversarially Robust Neural Networks</a> (NeurIPS&nbsp;2020)</h4><p><strong>Outstanding problem:</strong> In resource-constrained and safety-critical environments, both robustness and compactness are required at the same time. It has been shown that to increase the robustness of a neural network, we have to increase its size. So, robustness with fewer parameters is very desirable.</p><p>This work conjectures and shows evidence that even low-magnitude weights are important for robustness.
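</p><p>One hypothetical way to probe this conjecture (not the paper's experiment): magnitude-prune a robustly trained model and compare its accuracy under a simple FGSM attack before and after.</p><pre><code>import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM adversarial examples."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def prune_smallest(model, frac=0.5):
    """Zero out the lowest-magnitude weights in every parameter tensor."""
    with torch.no_grad():
        for p in model.parameters():
            k = max(int(frac * p.numel()), 1)
            cutoff = p.abs().flatten().kthvalue(k).values
            p.mul_(p.abs().gt(cutoff).float())

# If accuracy on fgsm(...) inputs drops sharply after prune_smallest(model),
# the removed low-magnitude weights were carrying robustness.
</code></pre><p>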
As most dedicated pruning techniques remove low-magnitude weights, this could hinder robust accuracy. It suggests that dealing with robustness and pruning at the same time leads to more effective methods, of which the current method is a great&nbsp;example.</p><p><strong>Proposed solution: </strong>This work proposes a pruning method, HYDRA, which decides which weights to remove by minimizing the desired loss, here derived from the robust training objective.</p><p>Finding an optimal pruning mask is formulated as an importance-score-based optimization, where each weight gets a floating-point importance score between 0 and 1. After the optimization, weights with lower importance scores are&nbsp;removed.</p><p><strong>Results and conclusion:</strong> The proposed technique prunes the network while showing less degradation in robust accuracy compared to the dedicated pruning technique LWM.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 424w, https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 848w, https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 1272w, https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 424w, https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 848w, https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 1272w, https://cdn-images-1.medium.com/max/793/1*d6p3Z8tt97B6mBBeZ99PXQ.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">For the proposed method, robust accuracy doesn&#8217;t drastically decrease with pruning, unlike with the previous method LWM.
Image reproduced from<a href="https://nips.cc/virtual/2020/public/poster_e3a72c791a69f87b05ea7742e04430ed.html"> the&nbsp;source</a>.</figcaption></figure></div><p>The authors also observe that the proposed method retains low-magnitude weights while retaining better robustness accuracy which shows the importance of low-magnitude weights towards robustness.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 424w, https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 848w, https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 424w, https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 848w, https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*hhpLTcONRgMk3RpRgVPIzg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The proposed method retains low-magnitude weights (shown as black) and shows good robust accuracy while the existing pruning methods prune low-magnitude weights whether they are useful for robustness or not. 
Image reproduced from <a href="https://nips.cc/virtual/2020/public/poster_e3a72c791a69f87b05ea7742e04430ed.html">the&nbsp;source</a>.</figcaption></figure></div><p>This method shows state-of-the-art performance with ResNet50 on ImageNet with adversarial training.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 424w, https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 848w, https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 1272w, https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 424w, https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 848w, https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 1272w, https://cdn-images-1.medium.com/max/930/1*PUgQQXrEWHNCk-y6V2o0Kw.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">HYDRA, the proposed method, outperforms the previous best method Adv-LWM. Shown as {accuracy/robust accuracy}. Image reproduced from<a href="https://nips.cc/virtual/2020/public/poster_e3a72c791a69f87b05ea7742e04430ed.html"> the&nbsp;source</a>.</figcaption></figure></div><h4>Conclusion and Future&nbsp;work</h4><p>Neural Network Pruning research is evolving to be more scientific and rigorous. One of the reasons is, undoubtedly, the interaction between the wide adoption of deep learning methods in Computer Vision &amp; NLP and elsewhere and the increasing amount of memory, energy, and computational resources required for the state-of-the-art methods.</p><p>Going into 2021, we know how to prune a network efficiently without any data, how to strike a balance between Structured and Unstructured Pruning methods while retaining the benefits of both worlds, how to find lottery tickets (pruned networks) in diverse domains and how to best retrain them, and how to prune a model even when adversarial robustness is&nbsp;needed.</p><p>And as with any field which is scientific and rigorous, Pruning research is poised to become more objective/authoritative (as opposed to having very little comparative analysis), with validation of new techniques over diverse datasets, architectures, and domains not just on, as seen in many cases, ImageNet dataset with a ResNet variant. 
This will inform us plenty about how to work with deep learning methods across different tasks even in low-resource environments.</p><div><hr></div><p><a href="https://heartbeat.comet.ml/neural-network-pruning-research-review-2020-bc21a77f0295">Neural Network Pruning Research Review 2020</a> was originally published in <a href="https://heartbeat.comet.ml">Heartbeat</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded></item><item><title><![CDATA[NeurIPS 2020 Papers: Takeaways of a Deep Learning Engineer (Part 2 of 3)— Computer Vision]]></title><description><![CDATA[NeurIPS 2020 Papers: Takeaways of a Deep Learning Engineer&#8212; Computer Vision]]></description><link>https://www.aipapertrails.com/p/neurips-2020-papers-takeaways-of-a-deep-learning-engineer-part-2-of-3-computer-vision-ef5ea1abe525</link><guid isPermaLink="false">https://www.aipapertrails.com/p/neurips-2020-papers-takeaways-of-a-deep-learning-engineer-part-2-of-3-computer-vision-ef5ea1abe525</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Sat, 05 Dec 2020 15:59:41 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>NeurIPS 2020 Papers: Takeaways of a Deep Learning Engineer&#8212;&#8202;Computer&nbsp;Vision</h3><h4>Techniques and insights for applied deep learning (computer vision) from papers published at NeurIPS&nbsp;2020</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 424w, 
https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 848w, https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 1272w, https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 424w, https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 848w, https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 1272w, https://cdn-images-1.medium.com/max/1024/1*KVWKMsVviBNpPQQp1RhCow.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Image by&nbsp;Author</figcaption></figure></div><p>As mentioned in part 1&#8212; the most important thing:)&#8202;&#8212;&#8202;I went through all the titles of NeurIPS 2020 papers (more than 1900!) and read abstracts of 175 papers, and <strong>extracted DL engineer relevant insights from the following papers</strong>.</p><p>This is part 2. See the part 1&nbsp;below.</p><p><a href="https://towardsdatascience.com/neurips-2020-papers-a-deep-learning-engineers-takeaway-4f3066523151">NeurIPS 2020 Papers: A Deep Learning Engineer&#8217;s Takeaway</a></p><h4><a href="https://neurips.cc/virtual/2020/public/poster_27e9661e033a73a6ad8cefcde965c54d.html">Rethinking Pre-training and Self-training</a></h4><p>Using other datasets to better solve the target dataset is ubiquitous in deep learning practice. It could be supervised pre-training (Classification; <strong>ImageNet pre-trained</strong>) or self-supervised pre-training (SimCLR on unlabeled data) or <strong>self-training</strong>.</p><p>(Self-training is a process where an intermediate model (teacher model), which is trained on target dataset, is used to create &#8216;labels&#8217; (thus called pseudo labels) for another dataset and then the final model (student model) is trained with both target dataset and the pseudo labeled dataset.)</p><p>Building on the previous work, the current work shows that the usefulness of ImageNet pre-training (starting with pre-trained weights rather than random) or self-supervised pre-training <strong>decreases with the size of the target dataset and the strength of the data augmentation</strong>. 
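</p><p>To make the self-training recipe above concrete, here is a minimal sketch (train(), model_fn(), and the datasets are hypothetical placeholders, not from the paper):</p><pre><code>import torch

teacher = train(model_fn(), labeled_data)      # 1. teacher on the target data
pseudo_labeled = []
with torch.no_grad():                          # 2. pseudo-label extra images
    for x in unlabeled_images:
        y_hat = teacher(x.unsqueeze(0)).argmax(1).item()
        pseudo_labeled.append((x, y_hat))
student = train(model_fn(), labeled_data + pseudo_labeled)  # 3. student on both
</code></pre><p>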
ImageNet pre-training didn&#8217;t help, and in some cases even hurt, when training on the COCO dataset for object detection.</p><p>But self-training helped in both low-data and high-data regimes, and with both strong and weak data augmentation strategies.<strong> It helped when pre-training didn&#8217;t and improved on it when it&nbsp;did</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 424w, https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 848w, https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 1272w, https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 424w, https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 848w, https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 1272w, https://cdn-images-1.medium.com/max/1024/1*nNTVprQe981LvX9hOsCMww.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">ImageNet pre-training vs Self-training when the strength of the data augmentation is changed.
Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_27e9661e033a73a6ad8cefcde965c54d.html">current&nbsp;paper</a>.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 424w, https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 848w, https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 1272w, https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 424w, https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 848w, https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 1272w, https://cdn-images-1.medium.com/max/965/1*hP7wTS2tKkJr5yGgqoo2Gg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Descriptions for Labels in the images above and below. 
Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_27e9661e033a73a6ad8cefcde965c54d.html">current&nbsp;paper</a>.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 424w, https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 848w, https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 1272w, https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 424w, https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 848w, https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 1272w, https://cdn-images-1.medium.com/max/1024/1*aprjRi-a3xAH0KmJm2R4LQ.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Random Init vs ImageNet pre-training vs Self-training. Self-training helps when pre-training doesn&#8217;t help and improves on it when it does help. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_27e9661e033a73a6ad8cefcde965c54d.html">current&nbsp;paper</a>.</figcaption></figure></div><blockquote><p><strong>Takeaway:</strong> When you want to leverage other datasets in training a model on a target dataset, use self-training rather than ImageNet pretraining. But keep in mind that self-training takes more resources than just initializing your model with ImageNet pre-trained weights.</p></blockquote><h4><a href="https://neurips.cc/virtual/2020/public/poster_9d684c589d67031a627ad33d59db65e5.html">RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder</a></h4><p>Different object detection models employ<strong> different intermediate representations</strong> from which the bounding box predictions are&nbsp;made.</p><p>For example, RetinaNet uses a <strong>bounding box (anchors)</strong> representational format, where it creates feature maps for each bounding box instance created by anchor boxes at each position of the feature grid. 
If a feature grid is H x W, RetinaNet takes 9 anchor boxes (<strong>pre-specified aspect ratios</strong>) at each position of the feature grid, giving us 9 x H x W bounding box instances on which to do IOU thresholding, class and sub-pixel offset prediction, and NMS, among other things, to get the final set of bounding boxes for an&nbsp;image.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 424w, https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 848w, https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 1272w, https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 424w, https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 848w, https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 1272w, https://cdn-images-1.medium.com/max/1024/1*V-0cF8HTC-2jw-vBXS6WeA.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Different intermediate representations for different models. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_9d684c589d67031a627ad33d59db65e5.html">current&nbsp;paper</a>.</figcaption></figure></div><p>FCOS and CenterNet use<strong> a center point</strong> as the representation format and estimate bounding boxes by predicting x and y offsets from the center point. All the other processing steps are very similar in objective to RetinaNet or any other object detection model.</p><p>CornerNet instead uses <strong>corner points</strong> as its representation format (top left and bottom right) and creates a bounding box from those corner&nbsp;points.</p><p>Different representations are prevalent in object detection because each representation is good at some specific thing compared to all others. Bounding box representation is better aligned with annotation formats of datasets and is <strong>better at classification</strong>. Center point representation is <strong>better for detecting small objects</strong>. Corner point representation is <strong>better at localization</strong>.</p><p>This current work aims to combine the strengths of all these different representations.
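</p><p>For intuition on the anchor-based representation above, a sketch of enumerating RetinaNet-style anchors (the scales, ratios, and stride are illustrative values, not the paper's):</p><pre><code>import itertools, torch

H, W, stride = 50, 80, 16
scales, ratios = [32, 64, 128], [0.5, 1.0, 2.0]  # 3 x 3 = 9 anchors per cell
anchors = []
for cy, cx in itertools.product(range(H), range(W)):
    for s, r in itertools.product(scales, ratios):
        w, h = s * r ** 0.5, s / r ** 0.5        # box with area s^2, aspect r
        anchors.append([(cx + 0.5) * stride, (cy + 0.5) * stride, w, h])
anchors = torch.tensor(anchors)                  # [9 * H * W, 4] box instances
</code></pre><p>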
For a particular object detection model, <strong>they improve the features of its primary representation</strong>, the bounding box for RetinaNet, by also taking into account features from other <strong>auxiliary representations</strong>, here center points and corner&nbsp;points.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 424w, https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 848w, https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 1272w, https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 424w, https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 848w, https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 1272w, https://cdn-images-1.medium.com/max/951/1*L7Cccu2xeWboFzNJEikG_Q.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Illustration of how BVR works, combining bounding box, center point, and corner point representations. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_9d684c589d67031a627ad33d59db65e5.html">current&nbsp;paper</a>.</figcaption></figure></div><p>The authors propose a <strong>Transformer</strong>-style module.
When given a feature vector of primary representation for a location on a feature grid (query) it calculates <strong>attention weights with feature vectors of auxiliary representations</strong> at relevant locations and returns a weighted average of these auxiliary representations.</p><p>The model, called Bridging Visual Representations (BVR),<strong> will use both</strong> the feature vector for primary representation and the weighted average of feature vectors from auxiliary representations to do classification and localization thus combining the strengths and expressive power of different representational choices.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 424w, https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 848w, https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 1272w, https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 424w, https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 848w, https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 1272w, https://cdn-images-1.medium.com/max/1024/1*jM5VjHA_z5ekeKQvzf0_LA.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">RelationNet++ outperforms every other method. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_9d684c589d67031a627ad33d59db65e5.html">current&nbsp;paper</a>.</figcaption></figure></div><blockquote><p><strong>Takeaway:</strong> This is the state-of-the-art model and it makes sense. Any approach which combines the strengths of multiple solutions non-trivially would be valuable for a long time. Use this method when you train your next object detection model. (Too many good things for object detection!)</p></blockquote><h4><a href="https://neurips.cc/virtual/2020/public/poster_98dce83da57b0395e163467c9dae521b.html">Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning</a></h4><p>Without a downstream task, it is hard to quantitatively evaluate image representations, i.e. the clusters formed with image representations for their semantic coherence and natural language describability.</p><p>This work formulates these tasks, learnability and describability of the clusters, as a forced-prediction problem and evaluates humans as predictors avoiding the issue of subjectivity which is a major problem with existing approaches. 
(Even when clusters are coherent, sometimes they can&#8217;t be described; and even when they are describable, different people might use different words and phrases.)</p><p>After seeing a few samples of a cluster, a human should be able to discriminate images of that cluster among images of other clusters. This means that clusters are separated in a human-interpretable way. The extent to which a human can do this is the <strong>metric for learnability</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 424w, https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 848w, https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 424w, https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 848w, https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*Hj0eJm7jWa_eiAB2Uvnmgg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_98dce83da57b0395e163467c9dae521b.html">current&nbsp;paper</a>.</figcaption></figure></div><p>After seeing the description of a cluster, a human should be able to discriminate images of that cluster among images of other clusters. This means the given cluster is describable. (The description is sampled randomly from a manually populated set of descriptions for that cluster.)
The extent to which a human can do this is the <strong>metric for describability</strong>.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 424w, https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 848w, https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 1272w, https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 424w, https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 848w, https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 1272w, https://cdn-images-1.medium.com/max/1024/1*fTexd5_Xw0W3lSWpR_iT9w.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Clusters and their descriptions from the self-supervised model SeLa. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_98dce83da57b0395e163467c9dae521b.html">current&nbsp;paper</a>.</figcaption></figure></div><p>The authors also created a model that generates automated descriptions for a cluster, so that it could replace the human in the above describability metric.</p><blockquote><p><strong>Takeaway:</strong> If you have clusters of images with no labels, the extent to which you can discriminate whether other images belong to a particular cluster, after seeing a few images from it, is a good metric of how well your clusters are separated. The same goes for describability.</p></blockquote><h4><a href="https://neurips.cc/virtual/2020/public/poster_b2eeb7362ef83deff5c7813a67e14f0a.html">A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection</a></h4><p>There are a lot of outstanding problems to deal with in object detection. Prominent among those dealt with in this work&nbsp;are:</p><ul><li><p><strong>Class imbalance problem</strong> between foreground/background (positive/negative) bounding boxes. Focal loss in RetinaNet helps but not&nbsp;enough.</p></li><li><p>Difficulty in <strong>tuning hyperparameters in the loss function </strong>(faster-RCNN has 9 of them to&nbsp;tune).</p></li><li><p>Discrepancy created by having <strong>separate localization and classification heads</strong>; e.g.,
<h4><a href="https://neurips.cc/virtual/2020/public/poster_b2eeb7362ef83deff5c7813a67e14f0a.html">A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection</a></h4><p>There are many outstanding problems in object detection. Prominent among those addressed by this work are:</p><ul><li><p>The <strong>class imbalance problem</strong> between foreground/background (positive/negative) bounding boxes. Focal loss in RetinaNet helps, but not&nbsp;enough.</p></li><li><p>The difficulty of <strong>tuning hyperparameters in the loss function</strong> (Faster R-CNN has 9 of them to&nbsp;tune).</p></li><li><p>The discrepancy created by having <strong>separate localization and classification heads</strong>; for example, the classification loss does not depend on the IoU or localization of the&nbsp;object.</p></li></ul><p>And this is how the paper deals with them:</p><ul><li><p>The ranking-based loss function for classification <strong>is more stable and learns without overfitting</strong> compared to cross-entropy or a weighted cross-entropy variant such as focal&nbsp;loss.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/961/1*IXHuyvDXU1N0m_yoLoHUhg.png"><img src="https://cdn-images-1.medium.com/max/961/1*IXHuyvDXU1N0m_yoLoHUhg.png" alt=""></a><figcaption class="image-caption">Comparison of the number of hyperparameters in each loss function. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_b2eeb7362ef83deff5c7813a67e14f0a.html">current&nbsp;paper</a>.</figcaption></figure></div><ul><li><p>The proposed loss function has <strong>one hyperparameter</strong>, and even that parameter <strong>doesn&#8217;t need tuning</strong>
(the results in the paper are reported without any tuning, and the loss still outperformed the baselines).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*LA2mreJG79FgM10-8sT85w.png"><img src="https://cdn-images-1.medium.com/max/1024/1*LA2mreJG79FgM10-8sT85w.png" alt=""></a><figcaption class="image-caption">aLRP loss (the current paper) is stable and does not overfit, compared to Cross-Entropy Loss and Focal Loss. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_b2eeb7362ef83deff5c7813a67e14f0a.html">current&nbsp;paper</a>.</figcaption></figure></div><ul><li><p>As the proposed ranking-based loss function is <strong>not differentiable</strong>, the authors provide equations for the gradients of the loss with respect to the parameters of the localization head and of the classification head. Here, <strong>the gradient update on the parameters of the classification head is affected by the outputs of both the classification head and the localization head (and vice versa)</strong>, so that an instance with low IoU with the ground truth is penalized even if its ground-truth class label is predicted confidently by the classification head.
This makes the classification head work well where the localization head works well, and vice versa, <strong>which gives the model more capacity</strong> to get better at both classification precision and localization.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/939/1*hnVuyxvyKrpnmEN_74wP8Q.png"><img src="https://cdn-images-1.medium.com/max/939/1*hnVuyxvyKrpnmEN_74wP8Q.png" alt=""></a><figcaption class="image-caption">Gradients with respect to parameters in the classification head contain outputs of the localization head, and vice versa.
Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_b2eeb7362ef83deff5c7813a67e14f0a.html">current&nbsp;paper</a>.</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/617/1*lY7YAb7ncXr9LjcEbfyK8g.png"><img src="https://cdn-images-1.medium.com/max/617/1*lY7YAb7ncXr9LjcEbfyK8g.png" alt=""></a><figcaption class="image-caption">aLRP Loss beats other losses for Faster R-CNN. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_b2eeb7362ef83deff5c7813a67e14f0a.html">current&nbsp;paper</a>.</figcaption></figure></div><blockquote><p><strong>Takeaway:</strong> Stability during training and having fewer hyperparameters to tune are much desired in practice. I can remember a lot of scenarios where results were not reproducible. This type of work is especially valuable for a deep learning engineer, and I recommend using it when training your next object detection model.</p></blockquote>
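<p>To give a feel for what a ranking-based objective couples together, here is a toy sketch in the spirit of LRP; it is deliberately simplified and is <em>not</em> the paper&#8217;s exact aLRP formulation (see the paper for that):</p><pre><code>import numpy as np

def ranking_loc_error(scores_pos, ious_pos, scores_neg):
    # scores_pos/ious_pos: confidence and IoU of positive detections;
    # scores_neg: confidence of negatives. For each positive, the error
    # mixes (a) negatives ranked above it (classification part) and
    # (b) the localisation quality of positives ranked at or above it.
    errors = []
    for s_i in scores_pos:
        above_pos = scores_pos >= s_i
        fp_above = (scores_neg > s_i).sum()
        rank = above_pos.sum() + fp_above
        cls_err = fp_above / rank
        loc_err = (1.0 - ious_pos[above_pos]).mean()
        errors.append(cls_err + loc_err)
    return float(np.mean(errors))

# A poorly localised but confident positive raises the error assigned at
# every rank below it, which is the coupling described above.
print(ranking_loc_error(np.array([0.9, 0.6]), np.array([0.95, 0.5]),
                        np.array([0.7, 0.3])))
</code></pre>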
<h4><a href="https://neurips.cc/virtual/2020/public/poster_b5d17ed2b502da15aa727af0d51508d6.html">Disentangling Human Error from the Ground Truth in Segmentation of Medical&nbsp;Images</a></h4><p>Labeling in the medical imaging domain is cost-intensive and has large inter-observer variability. A method that <strong>combines annotations from different annotators while modeling each annotator across images</strong>, so that we can train with only a few annotations per image, is desirable. This is that&nbsp;method.</p><p>Given an image with 3 ground-truth masks labeled by three different annotators A1, A2, and A3, this work, which also models the biases of each annotator, <strong>predicts three different versions of the segmentation mask</strong>, one per annotator, and <strong>backpropagates the loss between these 3 predicted masks and the 3 ground-truth&nbsp;masks</strong>.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/963/1*Rw-LKg8qDI4jSYv1Y-2Sug.png"><img src="https://cdn-images-1.medium.com/max/963/1*Rw-LKg8qDI4jSYv1Y-2Sug.png" alt=""></a><figcaption class="image-caption">Illustration showing the prediction of the estimated true-label segmentation mask, then the three annotator-specific masks derived from it, and finally backpropagation through the three branches. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_b5d17ed2b502da15aa727af0d51508d6.html">current&nbsp;paper</a>.</figcaption></figure></div><p>As these annotator-specific segmentation masks are created by distorting (via a confusion matrix for each annotator) <strong>the estimated true label, which is predicted first</strong>, we take the segmentation mask of the estimated true label as the model&#8217;s prediction at inference time.</p><blockquote><p><strong>Takeaway:</strong> If your application has high inter-observer variability and you have the bandwidth to get multiple annotations per image, this seems to be the go-to method right now to get one ground truth out of&nbsp;many.</p></blockquote>
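<p>A minimal sketch of the annotator-noise idea (shapes and names are my own, and I omit the paper&#8217;s regularisation details): one learned confusion matrix per annotator distorts the estimated true-label distribution into annotator-specific predictions:</p><pre><code>import torch
import torch.nn as nn

class AnnotatorNoise(nn.Module):
    # One (C x C) confusion matrix per annotator, initialised near the
    # identity (annotators are assumed to be mostly correct).
    def __init__(self, n_classes, n_annotators):
        super().__init__()
        eye = torch.eye(n_classes).repeat(n_annotators, 1, 1)
        self.cm_logits = nn.Parameter(eye * 5.0)

    def forward(self, true_probs):                   # (B, C, H, W)
        # Column-stochastic: cms[a, r, c] = P(annotator a says r | true c).
        cms = torch.softmax(self.cm_logits, dim=1)
        # Per-pixel distortion of the estimated true-label distribution;
        # returns one mask per annotator, shape (B, A, C, H, W).
        return torch.einsum('arc,bchw->barhw', cms, true_probs)

# Training: cross-entropy between each annotator branch and that
# annotator's mask. Inference: use true_probs itself as the prediction.
</code></pre>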
<h4><a href="https://neurips.cc/virtual/2020/public/poster_bacadc62d6e67d7897cef027fa2d416c.html">Variational Amodal Object Completion</a></h4><p>Predicting segmentation maps for a complete object when it is occluded is called <strong>Amodal Object Completion</strong>.</p><p>This work presents Amodal-VAE, which <strong>encodes the partial mask into a latent vector and predicts a complete mask by decoding that latent vector</strong>. This work <strong>doesn&#8217;t require full-object segmentation annotations</strong> for training, which makes it desirable, as previous works needed complete segmentation masks to be annotated.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/942/1*UUK4BreUAi0mOuyaUz0yCQ.png"><img src="https://cdn-images-1.medium.com/max/942/1*UUK4BreUAi0mOuyaUz0yCQ.png" alt=""></a><figcaption class="image-caption">Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_bacadc62d6e67d7897cef027fa2d416c.html">current&nbsp;paper</a>.</figcaption></figure></div><p>To train without complete masks, they carefully train Amodal-VAE in three stages (a sketch of the resulting skeleton follows the list).</p><ul><li><p>In stage I, a decoder P(y_complete|z) is pre-trained <strong>with only masks that are complete</strong>, thus learning a mapping from the latent space to the space of complete&nbsp;masks.</p></li><li><p>In stage II, <strong>occluded partial masks are synthetically generated</strong> from complete masks by randomly overlaying other (foreground) objects on them, giving us pairs of partial and complete masks. A <strong>VAE is trained with the pre-trained, frozen decoder</strong> to learn an encoder P(z|y_partial).</p></li><li><p>Finally, in stage III, the encoder P(z|y_partial) is fine-tuned so that it can <strong>encode the more complex occlusions</strong> that occur in real-world datasets, while the loss is propagated only from the visible part of the object, expecting the decoder to predict P(y_vis|z). (The decoder is not trained in this step,
and the encoder trades off <strong>some capability to produce latent vectors that predict full masks</strong> for the capability of encoding real-world/complex occlusions.)</p></li></ul><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/980/1*UUYX5qV1LFJthsXvZQz87w.png"><img src="https://cdn-images-1.medium.com/max/980/1*UUYX5qV1LFJthsXvZQz87w.png" alt=""></a><figcaption class="image-caption">Different tasks made possible by Amodal-VAE. Image from the pdf of the<a href="https://neurips.cc/virtual/2020/public/poster_bacadc62d6e67d7897cef027fa2d416c.html"> current&nbsp;paper</a>.</figcaption></figure></div><blockquote><p><strong>Takeaway:</strong> Practically, knowing the complete extent of occluded objects would help track multiple people and decrease the ID swaps that we see even in SOTA tracking models. It should also be interesting if you want to do smart photo editing. More importantly, this is the kind of problem whose use cases are limited only by our creativity.</p></blockquote>
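<p>Below is a rough three-stage training skeleton under toy assumptions (tiny MLPs on flattened 32&#215;32 masks; the paper uses conv nets, and the loss weights here are arbitrary), just to make the staging concrete:</p><pre><code>import torch
import torch.nn as nn
import torch.nn.functional as F

D, Z = 32 * 32, 64                   # flattened-mask size, latent size
decoder = nn.Sequential(nn.Linear(Z, 256), nn.ReLU(), nn.Linear(256, D))
encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, 2 * Z))

def encode(mask):                    # mask: (B, D) flattened binary mask
    mu, logvar = encoder(mask).chunk(2, dim=-1)
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp(), mu, logvar

def kl(mu, logvar):
    return -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()

# Stage I: pre-train the decoder on complete, unoccluded masks only,
# so that z maps into the space of complete masks (loop omitted).

# Stage II: freeze the decoder; synthesise partial masks by overlaying
# random foreground objects, then train the encoder so that
# encode(partial) decodes to the known complete mask.
for p in decoder.parameters():
    p.requires_grad_(False)

def stage2_loss(partial, complete):
    z, mu, logvar = encode(partial)
    return (F.binary_cross_entropy_with_logits(decoder(z), complete)
            + 1e-3 * kl(mu, logvar))

# Stage III: fine-tune the encoder on real occlusions, supervising the
# frozen decoder's output only on pixels known to be visible.
def stage3_loss(partial, visible):
    z, _, _ = encode(partial)
    per_pixel = F.binary_cross_entropy_with_logits(
        decoder(z), partial, reduction='none')
    return (per_pixel * visible).sum() / visible.sum().clamp(min=1)
</code></pre>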
<h4><a href="https://neurips.cc/virtual/2020/public/poster_d85b63ef0ccb114d0a3bb7b7d808028f.html">RandAugment: Practical Automated Data Augmentation with a Reduced Search&nbsp;Space</a></h4><p>Automated data augmentation needs to find <strong>the probability of applying each transformation and the magnitude to be used</strong> for each of these transformations.</p><p>With many possible values for the probabilities and magnitudes of each transformation, <strong>the search space becomes intractable</strong>. The recent method AutoAugment used <strong>RL to find an optimal sequence of transformations and their magnitudes</strong>. More recent variants of AutoAugment use more efficient learning algorithms to find the optimal sequence of transformations <strong>efficiently</strong>.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*Ab_mqxTRhbcto2JOh_RR2w.png"><img src="https://cdn-images-1.medium.com/max/1024/1*Ab_mqxTRhbcto2JOh_RR2w.png" alt=""></a><figcaption class="image-caption">Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_d85b63ef0ccb114d0a3bb7b7d808028f.html">current&nbsp;paper</a>.</figcaption></figure></div>
<div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*zFIlD7yqYxgf5nysBc4kRA.png"><img src="https://cdn-images-1.medium.com/max/1024/1*zFIlD7yqYxgf5nysBc4kRA.png" alt=""></a><figcaption class="image-caption">Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_d85b63ef0ccb114d0a3bb7b7d808028f.html">current&nbsp;paper</a>.</figcaption></figure></div><p>Nonetheless, the number of iterations of training a model with a candidate set of transformations, in order to find optimal probability and magnitude values, is still intractable in practice for large-scale models and large-scale datasets. So <strong>proxy tasks are set up</strong>, with small models and less data among other tweaks, meant to be representative of the target task. Optimal probabilities and magnitudes are <strong>found on the proxy tasks and then used for the target&nbsp;task</strong>.</p><p>But these proxy tasks are not actually representative of the full target task. This work shows that the &#8220;<strong>optimal magnitude of augmentation depends on the size of the model and the training&nbsp;set.</strong>&#8221;</p><p>To make this policy search feasible, the current work proposes RandAugment, which is just a grid search over two parameters, with a <strong>search space ~30 orders of magnitude smaller</strong>. This is, for sure, one of the few <strong>simple-but-powerful, back-to-basics</strong> kinds of work you will&nbsp;find.</p><p>First, RandAugment <strong>picks transformations with uniform probability</strong>, because the authors observed that the optimal policies from AutoAugment were making <strong>the dataset visually diverse</strong> rather than selecting a preferred subset of transformations (different probabilities for different transformations).</p><p>Second, RandAugment uses the same magnitude for all the transformations,
because they observed that the optimal policies from an AutoAugment variant had <strong>similar magnitudes for all the transformations</strong>.</p><p>After these adjustments, automated data augmentation becomes a simple hyperparameter-tuning task that can be done with a grid search, and the whole algorithm can be <strong>written comfortably in 3&nbsp;lines</strong> (see the sketch below).</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/680/1*yWjxCIolZHh5GDx2ZfYJQA.png"><img src="https://cdn-images-1.medium.com/max/680/1*yWjxCIolZHh5GDx2ZfYJQA.png" alt=""></a><figcaption class="image-caption">3-line code for RandAugment. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_d85b63ef0ccb114d0a3bb7b7d808028f.html">current&nbsp;paper</a>.</figcaption></figure></div><blockquote><p><strong>Takeaway</strong>: Automated data augmentation has evolved to the point where it is feasible to use in our &#8216;everyday&#8217; models. If you have the resources for hyperparameter tuning, tune these two parameters as well (N and M, the number of transformations and their global magnitude) and get state-of-the-art results.</p></blockquote>
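<p>For reference, here is a minimal sketch of the same idea; the three example transforms and their magnitude-to-parameter mappings are my own rough choices, not the paper&#8217;s exact set:</p><pre><code>import random
from PIL import Image, ImageOps, ImageEnhance

# The full method uses ~14 transforms; three are enough to show the idea.
def shear_x(img, mag):   # mag in [0, 10], mapped to a shear factor
    return img.transform(img.size, Image.AFFINE, (1, 0.03 * mag, 0, 0, 1, 0))

def solarize(img, mag):
    return ImageOps.solarize(img, 256 - int(25.6 * mag))

def contrast(img, mag):
    return ImageEnhance.Contrast(img).enhance(1 + 0.09 * mag)

TRANSFORMS = [shear_x, solarize, contrast]

def rand_augment(img, n=2, m=9):
    # N transforms sampled uniformly, all applied at global magnitude M.
    for op in random.choices(TRANSFORMS, k=n):
        img = op(img, m)
    return img
</code></pre>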
<h4>Learning Loss for Test-Time Augmentation</h4><p><a href="https://neurips.cc/virtual/2020/public/poster_2ba596643cbbbc20318224181fa46b28.html">https://neurips.cc/virtual/2020/public/poster_2ba596643cbbbc20318224181fa46b28.html</a></p><p>Let&#8217;s assume you want to test your model on a rotated image, but the images in your training set are <strong>never rotated, and rotation data augmentation was not used</strong> during training. The best thing we can do is undo the rotation at test time so that the images are no longer rotated. And with 10 commonly used, naturally occurring transformations, this <strong>could happen without you&nbsp;knowing</strong>.</p><p>So, what is the solution? While training, have a <strong>separate network that predicts the loss of the model for each of the transformations</strong>, if it were applied to the&nbsp;image.</p><p>Using this model, apply only the transformations that yield lower predicted loss values at test&nbsp;time.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*X3ADz8vWePJAOtLlh-Djgg.png"><img src="https://cdn-images-1.medium.com/max/1024/1*X3ADz8vWePJAOtLlh-Djgg.png" alt=""></a><figcaption class="image-caption">Proposed test-time augmentation (b) using a loss-prediction model at inference. Image from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_2ba596643cbbbc20318224181fa46b28.html">current&nbsp;paper</a>.</figcaption></figure></div>
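<p>A minimal sketch of the selection logic, under my own naming assumptions (the paper&#8217;s architecture details differ):</p><pre><code>import torch
import torch.nn as nn

class LossPredictor(nn.Module):
    # Predicts the task model's loss for each of K candidate
    # transformations from image features.
    def __init__(self, feat_dim, n_transforms):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                  nn.Linear(128, n_transforms))

    def forward(self, feats):
        return self.head(feats)         # (B, K) predicted losses

# Training: apply each transform, measure the task model's actual loss,
# and regress the predictor's outputs onto those losses.
# Test time: apply only the k transforms with the lowest predicted loss.
def select_transforms(loss_pred, feats, k=3):
    pred = loss_pred(feats)             # (B, K)
    return pred.topk(k, dim=1, largest=False).indices
</code></pre>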
<blockquote><p><strong>Takeaway:</strong> Didn&#8217;t train your model with the necessary data augmentations? Want the best possible results on the test set? Use this test-time augmentation.</p></blockquote><div><hr></div><p><a href="https://medium.com/data-science/neurips-2020-papers-takeaways-of-a-deep-learning-engineer-part-2-of-3-computer-vision-ef5ea1abe525">NeurIPS 2020 Papers: Takeaways of a Deep Learning Engineer (Part 2 of 3)&#8212; Computer Vision</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded></item><item><title><![CDATA[NeurIPS 2020 Papers: A Deep Learning Engineer’s Takeaway]]></title><description><![CDATA[NeurIPS 2020 Papers: Takeaways for a Deep Learning Engineer]]></description><link>https://www.aipapertrails.com/p/neurips-2020-papers-a-deep-learning-engineers-takeaway-4f3066523151</link><guid isPermaLink="false">https://www.aipapertrails.com/p/neurips-2020-papers-a-deep-learning-engineers-takeaway-4f3066523151</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Fri, 27 Nov 2020 19:50:21 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/1024/1*P6eCrob_2P1cF8GJDo7law.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h3>NeurIPS 2020 Papers: Takeaways for a Deep Learning&nbsp;Engineer</h3><h4>Techniques and insights for applied deep learning from papers published at NeurIPS&nbsp;2020</h4><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*P6eCrob_2P1cF8GJDo7law.png"><img src="https://cdn-images-1.medium.com/max/1024/1*P6eCrob_2P1cF8GJDo7law.png" alt=""></a><figcaption class="image-caption">Image by&nbsp;Author</figcaption></figure></div><p>Advances in Deep Learning research are of great utility to a Deep Learning engineer working on real-world problems, as most Deep Learning research is empirical, with new techniques and theories validated on datasets that <strong>closely resemble real-world datasets/tasks</strong> (ImageNet pre-trained weights are still useful!).</p><p>But churning through a vast amount of research to acquire the techniques, insights, and
perspectives that are relevant to a DL engineer is time-consuming, stressful, and, <strong>not least, overwhelming</strong>.</p><p>For whatever reason, I am crazy (I mean, really crazy! See Exhibit A <a href="https://towardsdatascience.com/reading-abstracts-from-nips-neurips-2018-here-is-what-i-learned-16352d83416f">here</a> and <a href="https://twitter.com/prakashkagitha/status/1198503868977385472?s=20">here</a>) about Deep Learning research, and I also have to justify a Deep Learning engineer&#8217;s role to earn my living. So, I am in a great position to do this engineer-relevant research churning.</p><p>Therefore, I went through all the titles of the NeurIPS 2020 papers (more than 1900!), read the abstracts of 175 papers, and <strong>extracted DL-engineer-relevant insights from the following papers</strong>.</p><p>Now, sit back and&nbsp;enjoy.</p><p>This is part 1. See the other parts&nbsp;below.</p><p><a href="https://towardsdatascience.com/neurips-2020-papers-takeaways-of-a-deep-learning-engineer-part-2-of-3-computer-vision-ef5ea1abe525">NeurIPS 2020 Papers: Takeaways of a Deep Learning Engineer (Part 2 of 3)&#8212; Computer Vision</a></p><h4><a href="https://neurips.cc/virtual/2020/public/poster_a1140a3d0df1c81e24ae954d935e8926.html">Accelerating Training of Transformer-Based Language Models with Progressive Layer&nbsp;Dropping</a></h4><p>2.5x faster pre-training with <strong>Switchable Transformers (ST)</strong> compared to standard Transformers.</p><p>Equipped with Switchable Gates (G in the figure below), some layers are skipped randomly according to a 0 or 1 sampled from a Bernoulli distribution, which is 25% more time-efficient per&nbsp;sample.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/988/1*l1mzl6SYOIoedGknBKSZSA.png"><img src="https://cdn-images-1.medium.com/max/988/1*l1mzl6SYOIoedGknBKSZSA.png" alt=""></a><figcaption class="image-caption">(a) Standard Transformer (b) Reordering to make it PreLN (c) Switchable Gates (G) to decide whether to include a layer or not.
(The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_a1140a3d0df1c81e24ae954d935e8926.html">current&nbsp;paper</a>.)</figcaption></figure></div><p>And remarkably, it is shown to reach the same validation error as the baselines with 53% fewer training&nbsp;samples.</p><p>Combining the time and sample efficiency, <strong>pre-training is 2.5x faster</strong> with comparable, and sometimes better, performance on downstream tasks.</p><blockquote><p><strong>Takeaway:</strong> When you want to pretrain or finetune a transformer, try Switchable Transformers for faster training along with lower inference times.</p></blockquote>
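<p>A minimal sketch of the gating idea (the paper additionally schedules the keep probability across depth and over training; that progressive schedule is omitted here):</p><pre><code>import torch
import torch.nn as nn

class SwitchableLayer(nn.Module):
    # Wraps a transformer block with a Bernoulli gate: during training,
    # the block is sometimes bypassed entirely for this step, saving its
    # forward and backward compute. At inference all layers are used.
    def __init__(self, block, p_keep=0.75):
        super().__init__()
        self.block, self.p_keep = block, p_keep

    def forward(self, x):
        if self.training and torch.rand(()) > self.p_keep:
            return x              # gate closed: skip the layer this step
        return self.block(x)      # residual-style blocks make this safe
</code></pre>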
<h4><a href="https://neurips.cc/virtual/2020/public/poster_8493eeaccb772c0878f99d60a0bd2bb3.html">Coresets for Robust Training of Neural Networks against Noisy&nbsp;Labels</a></h4><p>It has been shown before that the Jacobian between the neural network weights (W) and clean data (X), after some training, <strong>approximates a low-rank matrix</strong>, with a few large singular values and many very small singular&nbsp;values.</p><p>Also, learning that generalizes (i.e., from clean data) happens in a low-dimensional space called the <strong>Information space (I)</strong>, and learning that doesn&#8217;t generalize (i.e., from noisy labels, mostly memorization) happens in a high-dimensional space called the <strong>Nuisance space&nbsp;(N)</strong>.</p><p>The current work introduces a technique that <strong>creates sets of mostly clean data (coresets) to train a model on</strong>, and shows a significant increase in performance on noisy datasets, e.g., a 7% increase on mini WebVision with 50% noisy labels compared to the state-of-the-art.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*Q8RAROrANTkHpSyHTl6WKw.png"><img src="https://cdn-images-1.medium.com/max/1024/1*Q8RAROrANTkHpSyHTl6WKw.png" alt=""></a><figcaption class="image-caption">The method introduced in this work, CRUST, performs significantly better than the state-of-the-art. (The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_8493eeaccb772c0878f99d60a0bd2bb3.html">current&nbsp;paper.</a>)</figcaption></figure></div><blockquote><p><strong>Takeaway:</strong> When you suspect that the dataset you collected has noisy/mislabeled data points, use CRUST to train the model only on the clean data and improve performance and robustness.</p></blockquote><h4><a href="https://neurips.cc/virtual/2020/public/poster_b6af2c9703f203a2794be03d443af2e3.html">The Lottery Ticket Hypothesis for Pre-trained BERT&nbsp;Networks</a></h4><p>There exists <strong>a sub-network that exhibits performance comparable to the original complete network</strong> under the same training process. These sub-networks are called lottery tickets and are defined by masks that tell which weights are zeroed out in the original&nbsp;network.</p><p>The current work adopts <strong>Iterative Magnitude Pruning (IMP)</strong>, which trains a subnetwork for some time and then prunes the k% of weights with the smallest magnitude. This process is repeated until the target sparsity is reached. The important detail is that after every training iteration, the model starts again from its initial parameters rather than the weights updated so far; this is called rewinding. (A sketch of IMP follows this section.)</p><p>Here, the pre-trained weights of BERT are the initialization we start IMP with. And the lottery ticket, a subnetwork of the pre-trained BERT, also contains <strong>the same pre-trained weights</strong>, with some of them zeroed&nbsp;out.</p><p>This work shows that the lottery ticket hypothesis holds for pre-trained BERT models as well, and <strong>finds subnetworks at 40% to 90% sparsity</strong> for a range of downstream tasks.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*xjykLz_3fqFoKrybWxmeiQ.png"><img src="https://cdn-images-1.medium.com/max/1024/1*xjykLz_3fqFoKrybWxmeiQ.png" alt=""></a><figcaption class="image-caption">The last row corresponds to the approach introduced in this paper.
Even though it is 40%&#8211;90% sparse, its performance is comparable to the full BERT base. (The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_b6af2c9703f203a2794be03d443af2e3.html">current&nbsp;paper.</a>)</figcaption></figure></div><p>Also, the authors found a pre-trained BERT ticket with 70% sparsity that transfers to many downstream tasks, performing at least as well as a 70% sparse ticket found for each particular downstream task.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*WgFg_EquX7_tekSIMmw5Vg.png"><img src="https://cdn-images-1.medium.com/max/1024/1*WgFg_EquX7_tekSIMmw5Vg.png" alt=""></a><figcaption class="image-caption">The second-to-last row, (IMP) MLM (70%), shows that there is a general 70% sparse BERT ticket that generalizes to all the downstream tasks, being at least as good as the 70% sparse ticket of each particular task. (The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_b6af2c9703f203a2794be03d443af2e3.html">current&nbsp;paper.</a>)</figcaption></figure></div><blockquote><p><strong>Takeaway:</strong> A Deep Learning engineer working on NLP has to finetune a pre-trained BERT on a downstream task very often. Instead of starting from a full-size BERT, start fine-tuning from the 70% sparse lottery ticket found on the MLM task (second-to-last row) to train faster and decrease inference time and memory bandwidth without losing out on performance. It&#8217;s a no-brainer!</p></blockquote>
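<p>A compact sketch of IMP with rewinding, assuming a hypothetical <code>train_fn</code> that trains while keeping masked weights at zero (my own interface, not the paper&#8217;s code):</p><pre><code>import copy
import torch

def imp(model, train_fn, prune_rounds=5, prune_frac=0.2):
    # Iterative Magnitude Pruning with rewinding. The "initialization"
    # here is whatever the model starts with (pre-trained BERT weights
    # in this paper's setting).
    init_state = copy.deepcopy(model.state_dict())          # rewind target
    masks = {n: torch.ones_like(p) for n, p in model.named_parameters()}
    for _ in range(prune_rounds):
        train_fn(model, masks)
        # Prune the smallest-magnitude surviving weights in each tensor.
        for n, p in model.named_parameters():
            alive = p[masks[n].bool()].abs()
            if alive.numel() == 0:
                continue
            thresh = alive.quantile(prune_frac)
            masks[n] = masks[n] * (p.abs() > thresh).float()
        # Rewind: restart from the original weights, keeping the masks.
        model.load_state_dict(init_state)
    return masks        # the "lottery ticket" = init weights + masks
</code></pre>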
<h4><a href="https://neurips.cc/virtual/2020/public/poster_c3a690be93aa602ee2dc0ccab5b7b67e.html">MPNet: Masked and Permuted Pre-training for Language Understanding</a></h4><p>MPNet is a hybrid of Masked Language Modeling (MLM) and auto-regressive Permuted Language Modeling (PLM), <strong>adopting the strengths and avoiding the limitations</strong> of each of its constituents.</p><p>Masked language modeling, as in BERT-style models, masks out ~15% of the tokens and tries to predict them. As the dependency between the masked tokens is not modeled, this leads to a pretrain-finetune discrepancy, which is termed the <strong>Output Dependency</strong> problem.</p><p>On the other side, auto-regressive permuted language modeling, as in XLNet, doesn&#8217;t have full information about the input sentence: when predicting, say, the 5th element of an 8-element sequence, the model doesn&#8217;t know that there are 8 elements in the sequence. This also leads to a pretrain-finetune discrepancy (the model sees the entire input sentence/paragraph in downstream tasks), which is termed the <strong>Input Consistency</strong> problem.</p><p>MPNet combines both of them. The XLNet-like architecture is modified <strong>by adding additional masks up to the end of the sentence</strong>, so that the prediction at any position attends to N tokens, where N is the length of the sequence, with some of them being&nbsp;masks.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*4TGY4m6O3xaMD7LD6ZTi2w.png"><img src="https://cdn-images-1.medium.com/max/1024/1*4TGY4m6O3xaMD7LD6ZTi2w.png" alt=""></a><figcaption class="image-caption">Illustrative example showing how MPNet combines MLM and PLM.
(The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_c3a690be93aa602ee2dc0ccab5b7b67e.html">current&nbsp;paper.</a>)</figcaption></figure></div><p>They use the two-stream self-attention introduced in XLNet to enable auto-regressive-style prediction <strong>in one go</strong>, where the content at any position is masked for the prediction at that step but visible for predictions at later&nbsp;steps.</p><p>&#8220;MPNet outperforms MLM and PLM by a large margin and achieves better results on tasks including GLUE and SQuAD compared with previous state-of-the-art pre-trained methods (e.g., BERT, XLNet, RoBERTa).&#8221;</p><blockquote><p><strong>Takeaway:</strong> If you ever want to pretrain a language model on your domain-specific data, or with more data than the state-of-the-art, use MPNet, which is shown to have the best of both the MLM and PLM&nbsp;worlds.</p></blockquote><h4><a href="https://neurips.cc/virtual/2020/public/poster_c6102b3727b2a7d8b1bb6981147081ef.html">Identifying Mislabeled Data using the Area Under the Margin&nbsp;Ranking</a></h4><p>Mislabeled data is common in large-scale datasets, as they are crowdsourced or scraped from the internet, which is noise-prone.</p><p>This work formulates a simple, intuitive idea. Let&#8217;s say there are 100 dog images, but 20 of them are labeled &#8216;bird&#8217;; and similarly, 100 bird images, but 20 of them are labeled &#8216;dog&#8217;.</p><p>After some training, for an image of a dog wrongly labeled &#8216;bird&#8217;, the model gives a considerable probability to the label &#8216;dog&#8217; because of <strong>generalization from the 80 correctly labeled images</strong>. The model also gives a considerable probability to the label &#8216;bird&#8217; because of <strong>memorizing those 20 wrongly labeled&nbsp;images</strong>.</p><p>Now, the difference between the score of &#8216;dog&#8217; and the score of &#8216;bird&#8217;, accumulated over training, is called the <strong>Area Under the Margin (AUM)</strong>. This work recommends that if the AUM is below a pre-defined threshold, we should treat the sample as wrongly labeled and remove it from training.</p><p>If we can&#8217;t settle on one threshold value, we can <strong>intentionally mislabel some data</strong> and see what the AUM is for those examples. That gives us our threshold.</p><p>&#8220;On the WebVision50 classification task, this method removes 17% of training data, yielding a 1.6% (absolute) drop in test error. On CIFAR100, removing 13% of the data leads to a 1.2% drop in&nbsp;error.&#8221;</p><blockquote><p><strong>Takeaway:</strong> When creating a dataset, noisy/mislabeled data samples are mostly unavoidable. Use the AUM method to find the mislabeled samples and remove them from the final training&nbsp;dataset.</p></blockquote>
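<p>A small sketch of tracking AUM during training; the paper computes the margin between the assigned label&#8217;s logit and the largest other logit, averaged over training, and the names below are my own:</p><pre><code>import torch

class AUMTracker:
    # Accumulates, per sample, the margin between the assigned-label
    # logit and the largest other logit across training steps.
    def __init__(self, n_samples):
        self.total = torch.zeros(n_samples)
        self.count = torch.zeros(n_samples)

    @torch.no_grad()
    def update(self, sample_ids, logits, labels):
        assigned = logits.gather(1, labels[:, None]).squeeze(1)
        others = logits.scatter(1, labels[:, None], float('-inf'))
        margin = assigned - others.max(dim=1).values
        self.total[sample_ids] += margin.cpu()
        self.count[sample_ids] += 1

    def aum(self):
        # Low (especially negative) AUM suggests a mislabeled sample;
        # calibrate the cutoff with intentionally mislabeled examples.
        return self.total / self.count.clamp(min=1)
</code></pre>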
<h4><a href="https://neurips.cc/virtual/2020/public/poster_e025b6279c1b88d3ec0eca6fcb6e6280.html">Rethinking the Value of Labels for Improving Class-Imbalanced Learning</a></h4><p>Do we need labels when the existing labels are class-imbalanced (some classes have more labeled examples than others) and we have a lot of unlabeled data?</p><p>Positive: yes, we need labels. <strong>Self-train on the unlabeled data and you will be golden</strong>. (Self-training is a process where an intermediate model, trained on the human-labeled data, is used to create labels (thus, pseudo-labels), and then the final model is trained on both the human-labeled and the model-labeled&nbsp;data.)</p><p>Negative: we may do away with the labels. One can <strong>use self-supervised pretraining on all the available data to learn meaningful representations</strong> and then learn the actual classification task. This approach is shown to improve performance.</p><blockquote><p><strong>Takeaway:</strong> If you have class-imbalanced labels and more unlabeled data, do self-training or self-supervised pretraining. (That said, self-training is shown to beat self-supervised learning on CIFAR-10-LT.)</p></blockquote><h4>Big Bird: Transformers for Longer Sequences</h4><p><a href="https://neurips.cc/virtual/2020/public/poster_c8512d142a2d849725f31a9a7a361ab9.html">https://neurips.cc/virtual/2020/public/poster_c8512d142a2d849725f31a9a7a361ab9.html</a></p><p>Self-attention in standard Transformers has quadratic complexity (in memory and computation) with respect to sequence length. So, training on longer sequences is not feasible.</p><p>Enter Big Bird. It uses sparse attention, where a particular position attends only to <strong>a few randomly selected tokens and some neighboring tokens</strong>.</p><p>That&#8217;s not all that makes it work, though. Big Bird has multiple CLS tokens that attend to the entire sequence, and a token at any position attends to these CLS tokens, which gives it relevant context, dependencies, and who knows what else self-attention layers&nbsp;learn.</p><div class="captioned-image-container"><figure><a class="image-link" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png"><img src="https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png" alt=""></a><figcaption class="image-caption">Different types of attention in sparse attention: (a) random attention (b) window/neighborhood attention (c) global attention on added CLS tokens.
<p>Negative. We may do away with the labels. One can <strong>use self-supervised pretraining on all the available data to learn meaningful representations</strong> and then learn the actual classification task. It is shown that this approach improves performance.</p><blockquote><p><strong>Takeaway:</strong> If you have class-imbalanced labels and more unlabeled data, do self-training or self-supervised pretraining. (It is shown that self-training beats self-supervised learning on CIFAR-10-LT, though.)</p></blockquote><h4><a href="https://neurips.cc/virtual/2020/public/poster_c8512d142a2d849725f31a9a7a361ab9.html">Big Bird: Transformers for Longer Sequences</a></h4><p>Self-attention in standard Transformers has quadratic complexity (in memory and computation) with respect to sequence length. So, training on longer sequences is not feasible.</p><p>Enter Big Bird. It uses sparse attention, where a particular position only attends to <strong>a few randomly selected tokens and some neighboring tokens</strong>.</p><p>That alone is not what makes it work, though. Big Bird has multiple CLS tokens that attend to the entire sequence. And a token in any position attends to these CLS tokens, which give it relevant context, dependencies, and who knows what else self-attention layers&nbsp;learn.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 424w, https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 848w, https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 1272w, https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 424w, https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 848w, https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 1272w, https://cdn-images-1.medium.com/max/1024/1*HtLzvFNiEuSKuYj5A23w7A.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Different types of attention in sparse attention (a) Random attention (b) Window neighborhood attention (c) Global attention on added CLS tokens. (The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_c8512d142a2d849725f31a9a7a361ab9.html">current&nbsp;paper</a>.)</figcaption></figure></div><p>&#8220;Big Bird&#8217;s sparse attention can handle<strong> sequences of length up to 8x of what was previously possible using similar hardware</strong>. As a consequence of the capability to handle longer context, Big Bird drastically improves performance on various NLP tasks such as question answering, summarization, and novel applications to genomics&nbsp;data.&#8221;</p><blockquote><p><strong>Takeaway:</strong> If you are working with longer sequences, as in summarization or applications to genomic data, use Big Bird for feasible training and respectable inference times. Even with shorter sequences, use Big Bird. I will take linear-complexity self-attention rather than quadratic any&nbsp;day!</p></blockquote>
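<p>For intuition, here is a minimal sketch of the sparse pattern as a boolean mask. The window, random, and global sizes below are illustrative assumptions, not the paper&#8217;s settings:</p><pre><code>import numpy as np

def bigbird_mask(seq_len, window=3, n_random=2, n_global=2, seed=0):
    # mask[i, j] == True means position i may attend to position j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # (b) neighboring tokens inside a sliding window
        mask[i, max(0, i - window):i + window + 1] = True
        # (a) a few randomly selected tokens
        mask[i, rng.choice(seq_len, size=n_random, replace=False)] = True
    # (c) global tokens (the added CLS tokens) see and are seen by everyone.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask
</code></pre><p>Each query row attends to O(window + n_random + n_global) keys, a constant, so the overall cost grows linearly with sequence length instead of quadratically.</p>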
<h4><a href="https://neurips.cc/virtual/2020/public/poster_dc49dfebb0b00fd44aeff5c60cc1f825.html">Improving Auto-Augment via Augmentation-Wise Weight&nbsp;Sharing</a></h4><p>Choosing a sequence of transformations, and their magnitudes, for data augmentation on a particular task is domain-specific and time-consuming.</p><p>Auto-Augment is a technique to <strong>learn an optimal sequence of transformations</strong>, where the reward is the negated validation loss. Usually, RL is used to learn this policy. One iteration in learning this optimal policy involves training a model to completion, and thus it is a very expensive process.</p><p>So, the current work tries to make this process more efficient. It is based on the previously shown insight that, while training with a sequence of transformations, the effect of the transformations is only prominent at the later stage of training.</p><p>In this work, for each iteration that evaluates a particular policy (sequence of transformations), <strong>most of the training is done with a shared policy</strong>, and only the last part of the training is done with the current policy to be evaluated. This is called Augmentation-Wise Weight&nbsp;Sharing.</p><p>As the training with the shared policy is <strong>done only once for all the iterations</strong>, this method is efficient in learning an optimal&nbsp;policy.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 424w, https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 848w, https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 424w, https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 848w, https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*KQL7FxK_-NBGh5mhGucCRg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Two stages of training the model when evaluating the given policy. (The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_dc49dfebb0b00fd44aeff5c60cc1f825.html">current&nbsp;paper</a>.)</figcaption></figure></div><p>&#8220;On CIFAR-10, this method achieves a top-1 error rate of 1.24%, which is currently the best performing single model without extra training data. On ImageNet, this method gets a top-1 error rate of 20.36% for ResNet-50, which leads to a 3.34% absolute error rate reduction over the baseline augmentation.&#8221;</p><blockquote><p><strong>Takeaway:</strong> When you have the resources to use an optimal sequence of data augmentations to increase the performance of a model, use this method to train the RL agent that learns the optimal policy. It is far more efficient, making Auto-Augment feasible even for large datasets.</p></blockquote>
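<p>In pseudocode, the search loop looks something like the sketch below. Every helper name here is hypothetical; the real method drives the loop with an RL controller over augmentation policies:</p><pre><code># Stage 1: run once. Train with the shared policy for most of the schedule.
shared_model = train(new_model(), data, policy=shared_policy, epochs=many)

# Stage 2: per candidate policy, fine-tune only the short tail of training.
for policy in candidate_policies:
    candidate = finetune(copy(shared_model), data, policy=policy, epochs=few)
    reward = -validation_loss(candidate)  # reward signal for the policy search
</code></pre>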
<h4><a href="https://neurips.cc/virtual/2020/public/poster_f6a8dd1c954c8506aadc764cc32b895e.html">Fast Transformers with Clustered Attention</a></h4><p>Like Big Bird above, Fast Transformers approximate standard self-attention to bring its quadratic dependency on sequence length down to linear.</p><p>To do this, instead of calculating attention all-to-all (O(sequence_length*sequence_length)), queries are clustered and the <strong>attention values are calculated only for the centroids</strong>. All the queries in a particular cluster then get the same attention values. This makes the overall computation of self-attention linear with respect to sequence length: O(num_clusters*sequence_length).</p><p>To improve this approximation, and to handle the case where some keys have a large dot product with the centroid query but not with some of the cluster-member queries, the authors take the top-k keys that the centroid query attended to most and <strong>calculate the exact key-value attention for all the queries in the cluster over those top-k keys</strong>. This increases computation and memory but is still better than all-to-all.</p><p>&#8220;This paper shows that Fast Transformers can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pre-trained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.&#8221;</p><blockquote><p><strong>Takeaway:</strong> This is not as elegant as the Big Bird approach we saw above, but one has to try every option to bring the quadratic complexity of self-attention down to&nbsp;linear.</p></blockquote>
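<p>Here is a toy NumPy sketch of the basic idea, without the top-k refinement. The cluster assignment below is deliberately crude; the paper clusters the queries properly, not with this one-shot stand-in:</p><pre><code>import numpy as np

def clustered_attention(Q, K, V, n_clusters, seed=0):
    # Approximate attention: group the queries, attend once per centroid.
    rng = np.random.default_rng(seed)
    n, d = Q.shape
    centers = Q[rng.choice(n, size=n_clusters, replace=False)]
    assign = np.argmax(Q @ centers.T, axis=1)  # toy nearest-center assignment
    out = np.zeros((n, V.shape[1]))
    for c in range(n_clusters):
        members = np.where(assign == c)[0]
        if members.size == 0:
            continue
        centroid = Q[members].mean(axis=0)
        scores = centroid @ K.T / np.sqrt(d)   # one attention row per cluster
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        out[members] = weights @ V             # members share the same result
    return out  # n_clusters * seq_len score rows instead of seq_len * seq_len
</code></pre>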
<h4><a href="https://neurips.cc/virtual/2020/public/poster_ff4dfdf5904e920ce52b48c1cef97829.html">Limits to Depth Efficiencies of Self-Attention</a></h4><p>To scale transformers, it has been shown empirically that increasing the width (the dimension of the internal representation) is about as efficient as increasing the depth (the number of self-attention layers).</p><p>More concretely, and in contrast, this work establishes that we can scale transformers in depth up to a <strong>&#8216;depth threshold&#8217;, which is the base-3 logarithm of the width</strong>. If the depth is below this threshold, increasing depth is more efficient than increasing the width. This is termed depth efficiency.</p><p>And if the depth is above this threshold, increasing depth will hurt compared to increasing the width. This is termed depth inefficiency.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 424w, https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 848w, https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 424w, https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 848w, https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*qBy18dKw_QQHhHDBRJGECg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The number of parameters is directly proportional to the width of the network when the number of layers is constant. The figure shows that depth is more useful when the network has sufficient width, i.e., in the depth-efficiency phase. (The image is reproduced from the pdf of the <a href="https://neurips.cc/virtual/2020/public/poster_ff4dfdf5904e920ce52b48c1cef97829.html">current&nbsp;paper</a>.)</figcaption></figure></div><p>&#8220;By identifying network width as a limiting factor, our analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.&#8221;</p><blockquote><p><strong>Takeaway:</strong> When you want to scale the Transformer architecture for the next big language model, keep in mind that if the width is not large enough, increasing depth doesn&#8217;t help. Depth should always be less than the &#8216;depth threshold&#8217;, which is the base-3 logarithm of the width. So, increase the width before increasing the depth to scale your transformers to almost insane&nbsp;depth.</p></blockquote>
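<p>As a quick back-of-the-envelope illustration of what that bound implies (my own arithmetic, not a computation from the paper):</p><pre><code>import math

def depth_threshold(width):
    # Depth below which adding layers beats adding width: log base 3 of width.
    return math.log(width, 3)

# A width of 1024 gives a threshold of about 6.3, so going deeper than
# roughly 6 layers at that width calls for widening the network first.
print(depth_threshold(1024))  # ~6.31
</code></pre>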
<div><hr></div><p><a href="https://medium.com/data-science/neurips-2020-papers-a-deep-learning-engineers-takeaway-4f3066523151">NeurIPS 2020 Papers: A Deep Learning Engineer&#8217;s Takeaway</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded></item><item><title><![CDATA[ICLR 2020 papers: On Compositional/Systematic generalization (including ours)]]></title><description><![CDATA[An introduction to systematic generalization and a summary of 4 papers from ICLR 2020]]></description><link>https://www.aipapertrails.com/p/iclr-2020-papers-on-compositional-systematic-generalization-including-ours-54a45badfcae</link><guid isPermaLink="false">https://www.aipapertrails.com/p/iclr-2020-papers-on-compositional-systematic-generalization-including-ours-54a45badfcae</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Sun, 26 Apr 2020 05:56:46 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h4>An introduction to systematic generalization and a summary of 4 papers from ICLR&nbsp;2020</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 424w, https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 848w, https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 1272w, https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 424w, https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 848w, https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 1272w, https://cdn-images-1.medium.com/max/1024/0*HHerFTDHFQ9ryWD5 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@mvdheuvel?utm_source=medium&amp;utm_medium=referral">Maarten van den Heuvel</a> on&nbsp;<a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure></div><h4>Systematicity as an argument against deep&nbsp;learning</h4><p>In 1988, Jerry Fodor leveled a concern against connectionist models (deep learning) as explanations/models of human language understanding and cognition: they are not systematic. 
That is, some data points (like sentences) are systematically similar to some of the other data points, and humans can understand all of these data points given that they understand one data&nbsp;point.</p><p>For example, if we understand the sentence &#8216;John loves Mary&#8217;, we would also understand &#8216;Mary loves John&#8217; or, for that matter, any sentence of the pattern &#8216;NP Vt NP&#8217;, because the underlying knowledge (concepts?) in understanding all these sentences is one and the same, i.e., understanding the syntax &#8216;NP Vt&nbsp;NP&#8217;.</p><p>As per Fodor, the same behavior should be expected of a model explaining language understanding: it should understand all systematically similar data points if it can understand (learn) one.</p><p>As systematicity in human cognition is very strong, this argument against connectionist models has historically been very prominent, stimulating debate and a great exchange of scientific arguments.</p><h4>Modernization of systematicity by Lake et al.&nbsp;2017</h4><p><a href="https://arxiv.org/abs/1711.00350">Lake et al. 2017</a> instantiated this expectation as an aspect of generalization (systematic generalization) in sequence-to-sequence (Seq2Seq) learning models with the SCAN dataset. The model is tested on new combinations of already learned concepts as a requirement for the model to be systematic.</p><p>For example, the model trained on the input-output pairs (walk, WALK), (jump, JUMP), and (walk left, LTURN WALK) is tested with the pair (jump left, LTURN JUMP). The standard seq2seq models based on GRU, LSTM, and their variants with attention failed catastrophically on the SCAN dataset, which the above examples are&nbsp;from.</p>
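<p>To make the split concrete, here is a toy rendering of the kind of train/test pairs involved (illustrative pairs in the spirit of the examples above, not the actual SCAN files):</p><pre><code># Training pairs: primitives alone, plus a modifier seen with other primitives.
train = {
    "walk": "WALK",
    "jump": "JUMP",
    "walk left": "LTURN WALK",
}

# Test pairs recombine a known primitive with a known modifier,
# a combination never seen during training.
test = {
    "jump left": "LTURN JUMP",
}
</code></pre>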
<p>After this, different datasets were proposed to test deep learning models on Visual Question Answering (VQA). Most recently, <a href="https://arxiv.org/abs/1912.05783">Bahdanau et al. 2019</a> created CLOSURE, a variant of the CLEVR dataset, which tests a model&#8217;s performance on questions that contain familiar parts in a more complex context. They showed that existing VQA models don&#8217;t perform well on this dataset and proposed a variant of Neural Module Networks to improve the performance.</p><h4>ICLR 2020 and systematic generalization</h4><p>It is interesting that systematic generalization is being explored across different tasks and domains: there are four papers related to systematic generalization published at ICLR 2020, with a new way of solving systematic generalization, a new way of creating train-test splits to test for it, an investigation of the drivers of systematic generalization in an RL agent, and finally, our work showing that standard LSTM+attn models also exhibit systematic generalization.</p><p><strong><a href="https://iclr.cc/virtual/poster_SklGryBtwr.html">Environmental drivers of systematicity and generalization in a situated&nbsp;agent</a></strong></p><p>In this work, an RL agent is trained to follow commands like &#8216;<em>find Obj</em>&#8217; or &#8216;<em>lift Obj</em>&#8217; in a 2D/3D environment with different objects. After training &#8216;<em>find Obj</em>&#8217; with all the objects in the environment and training &#8216;<em>lift Obj</em>&#8217; with a subset of objects, the agent has to generalize to commands containing &#8216;<em>lift</em>&#8217; with new objects (&#8216;<em>lift NewObj</em>&#8217;).</p><p>This is analogous to the SCAN dataset, where the model is tested on new combinations of already learned concepts. Here, the concept of a particular object is learned from the &#8216;<em>find</em>&#8217; command, and the concept of &#8216;<em>lift</em>&#8217; is learned by training &#8216;lift&#8217; combined with the objects in the trainset.</p><p>The following are the drivers of systematicity found in this investigation, which give a lot of insight into the systematic generalization of deep learning models. Check out the paper for more information.</p><blockquote><p>&#8220;[a] the number of object/word experiences in the training&nbsp;set;</p></blockquote><blockquote><p>[b] the visual invariances afforded by the agent&#8217;s perspective, or frame of reference; and</p></blockquote><blockquote><p>[c] the variety of visual input inherent in the perceptual aspect of the agent&#8217;s perception.&#8221;</p></blockquote><p>Point [a] is the starting point for our own investigation, also published at ICLR 2020 (workshop&#8202;&#8212;&#8202;Bridging AI and Cognitive Science), the last paper we discuss in this post. (see&nbsp;below)</p><p><strong><a href="https://iclr.cc/virtual/poster_SygcCnNKwr.html">Measuring Compositional Generalization: A Comprehensive Method on Realistic Data</a></strong></p><p>As we saw earlier, the test set for measuring systematic/compositional generalization in the SCAN dataset contains all the combinations of the primitive <em>jump</em> with modifiers, combinations that never occurred in the trainset. The model needs to understand the concept of <em>jump</em> from the input-output pair (jump, JUMP) and the concept of different modifiers from their combinations with other primitives.</p><p>This type of train/test split strategy, however, may not be feasible, or even efficient, for measuring compositional generalization in some cases. This is the premise and the problem this work addresses, by formulating a method that automatically creates train/test splits with lower atom divergence and higher compound divergence. See a paragraph from the paper&nbsp;below.</p><blockquote><p>&#8220;We use the term compositionality experiment to mean a particular way of splitting the data into train and test sets with the goal of measuring compositional generalization. Based on the notions of atoms and compounds described above, we say that an ideal compositionality experiment should adhere to the following two principles</p></blockquote><blockquote><p>Similar atom distribution: All atoms present in the test set are also present in the train set, and the distribution of atoms in the train set is as similar as possible to their distribution in the test&nbsp;set.</p></blockquote><blockquote><p>Different compound distribution: The distribution of compounds in the train set is as different as possible from the distribution in the test&nbsp;set.&#8221;</p></blockquote>
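<p>For a sense of the mechanics: the divergences are, as I understand the paper, based on the Chernoff coefficient between the train and test distributions of atoms and of compounds. A minimal sketch follows; the alpha values and input arrays are my assumptions for illustration, not figures quoted above:</p><pre><code>import numpy as np

def divergence(p, q, alpha):
    # 1 - Chernoff coefficient between two discrete distributions p and q.
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 1.0 - np.sum(p**alpha * q**(1.0 - alpha))

# train_atoms/test_atoms and train_compounds/test_compounds would be
# normalized frequency vectors over atoms and compounds respectively.
# A good compositional split keeps atom divergence low and compound
# divergence high, e.g.:
# divergence(train_atoms, test_atoms, alpha=0.5)          # want close to 0
# divergence(train_compounds, test_compounds, alpha=0.1)  # want close to 1
</code></pre>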
<p>With the method they proposed, they created multiple splits from the SCAN dataset and showed that these are better at evaluating compositional generalization than the original train/test splits.</p><p>They also created a dataset for semantic parsing called Compositional Freebase Questions (CFQ), the largest known dataset for studying compositional generalization in natural language. I recommend checking out their very nice post about this work on the Google AI&nbsp;blog.</p><p><a href="https://ai.googleblog.com/2020/03/measuring-compositional-generalization.html">Measuring Compositional Generalization</a></p><p><strong><a href="https://iclr.cc/virtual/poster_SylVNerFvr.html">Permutation Equivariant Models for Compositional Generalization in&nbsp;Language</a></strong></p><p>This is a new way of thinking about systematic/compositional generalization, with the hypothesis that it can be seen as a form of group invariance. They designed this group-invariance property right into the Seq2Seq model and solved different tasks in the SCAN&nbsp;dataset.</p><p>The following is the hypothesis they made and showed good evidence for by solving the SCAN dataset with the group-invariance property.</p><blockquote><p>&#8220;Models achieving the compositional generalization required in certain SCAN tasks are equivariant with respect to permutation group operations in the input and output languages.&#8221;</p></blockquote><p><strong><a href="https://baicsworkshop.github.io/program/baics_11.html">Systematic generalization emerges in Seq2Seq models with variability in data</a> (Our&nbsp;Paper)</strong></p><p>The first paper we discussed showed that an RL agent systematically generalizes to a new object if a command like &#8216;lift&#8217; is trained with more objects, i.e., it generalizes to &#8216;lift NewObj&#8217; after being trained on &#8216;lift Obj1&#8217;, &#8216;lift Obj2&#8217;&#8230; &#8216;lift ObjN&#8217;, where N is a large enough number of&nbsp;objects.</p><p>It turns out that just an LSTM+attn learns 6 modifiers in SCAN (out of 8 modifiers &amp; 2 conjunctions) when the number of distinct primitives in the dataset is increased, and it generalizes to commands with new primitives never trained with any modifiers, like (&#8216;jump twice&#8217;, &#8216;JUMP JUMP&#8217;), which was not shown&nbsp;before.</p><p>Helping us better understand the characteristics of systematic generalization in deep learning models, we found another behavior that is highly correlated with systematic generalization: instance-independent representations of modifiers. 
That is, if we subtract &#8216;walk&#8217; from the command &#8216;walk twice&#8217; and add &#8216;jump&#8217;, the model gives the output of &#8216;jump twice&#8217;. (The vectors added/subtracted are the encoder&#8217;s final hidden states.) This behavior and systematic generalization are highly correlated, with a Pearson coefficient of&nbsp;0.99.</p><blockquote><p>Training a modifier on a greater number of distinct primitives gives rise to systematic generalization and also makes the model represent the modifier independently of any instance it operated on: instance-independent representations for modifiers. And this is highly correlated with systematic generalization.</p></blockquote><p>We also showed that, with 300 distinct primitives in the dataset, models trained on primitive variables (like &#8216;jump twice&#8217;) generalized to compound variables like &#8216;{jump twice} twice&#8217;, though in a limited way. Approaches like Syntactic Attention and Meta-seq2seq learning, which solved the SCAN dataset, didn&#8217;t show this behavior, more interestingly, even when trained with 300 different primitives.</p><h4>Systematic generalization and the&nbsp;future</h4><p>We are still at the initial stage of understanding the systematicity of human cognition and of exploring systematic generalization in deep learning models, by evaluating for it and finding the inductive biases that enable&nbsp;it.</p><p>For now, it is clear that systematicity will be a concern in many domains and tasks, like language understanding, abstract and analogical reasoning, and semantic scene analysis, if we aim to build models that interact with the world efficiently the way humans&nbsp;do.</p><p>Finally, check out the ICLR 2020 workshop where our work is published, which is a great place to explore all the amazing work at the intersection of cognitive science and&nbsp;AI.</p><p><a href="https://baicsworkshop.github.io/index.html">Bridging AI and Cognitive Science (BAICS)</a></p><p>Feel free to discuss any work related to the systematicity of human cognition and systematic generalization. 
Follow me here for more updates and recent work at the intersection of cognitive science and deep learning.</p>]]></content:encoded></item><item><title><![CDATA[Computer Vision and Computer Vision applied]]></title><description><![CDATA[Computer Vision is a scientific endeavor that aims to automate the human visual system: not necessarily to imitate it, but to emulate all the abilities of the human visual system and beyond.]]></description><link>https://www.aipapertrails.com/p/computer-vision-and-computer-vision-applied-2ef2cc0f7116</link><guid isPermaLink="false">https://www.aipapertrails.com/p/computer-vision-and-computer-vision-applied-2ef2cc0f7116</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Thu, 24 Oct 2019 09:08:17 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 424w, https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 848w, https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 1272w, https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 424w, https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 848w, https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 1272w, https://cdn-images-1.medium.com/max/1024/0*77QZ-aSQ-5FIB39G 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@rohanmakhecha?utm_source=medium&amp;utm_medium=referral">Rohan Makhecha</a> on&nbsp;<a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure></div><p>Computer Vision is a scientific endeavor that aims to automate the human visual system: not necessarily to imitate it, but to emulate all the abilities of the human visual system and beyond. So, as is the case with any field that has the potential to change the course of humanity, Computer vision has a lot of history, made up of the likes of obsessive human effort, hard problems, victories, failures, and immense&nbsp;hope.</p><p>In spite of the complexity of visual scenes in the world around us, Computer vision is now capable of detecting practically any type of object, including people, vehicles, and so on. 
Not only can it detect objects, it can also recognize characteristics of those objects: identity and gestures in the case of humans, deformations/defects in the case of objects, intrusions across boundaries, and much more. This is just a small part of the ocean that is applied Computer&nbsp;Vision.</p><p>Computer vision is the visual system that lets computers play very strategic games like Go and Starcraft better than humans [1]. It can pave the way for self-driving cars to take automatic actions so that they won&#8217;t run into other cars or people. It seems that Computer vision can do anything as long as we have data and computation. Models that learn to classify discriminatively and models that actually learn the whole distribution that generated the dataset have both shown significant progress in solving many real-world problems.</p><p>As a matter of fact, we at <a href="http://akaiketech.com">Akaike Technologies</a> believe that it can solve a lot of problems as long as we can formulate them. As a deep learning services company, we have deployed various applications involving object detection, image segmentation, and machine inspection, among others, with computer vision. Alongside that, at the frontiers of computer vision, we know what the current limitations are and are ambitious to take the field further towards the collective vision of the community.</p><p>Now, we wish to delve into how computer vision evolved to be so successful: how what started as an application in itself became a great scientific endeavor, and also what its limitations and prospects are, leaning towards its application to real-world problems around&nbsp;us.</p><h3>Initial aspirations for Computer Vision (1950s-2012)</h3><p>Computer vision emerged not much later than the field that aimed to make intelligent machines, Artificial Intelligence. In the 1950s and 1960s, there were small computer vision projects in the labs of AI pioneers like Marvin Minsky and Seymour Papert, among others. We can see the optimism in 1966, when Seymour Papert wrote a proposal for imparting visual intelligence to machines, describing it as a research project for a single summer [2]. Of course, it turned out that there is a lot more to&nbsp;it.</p><p>There was great work done in the lab of David Marr, who dealt with the problem from a neuroscience perspective [3]. Very prominent people like Marvin Minsky and Patrick Winston, among others, also worked on visual systems for robots and were betting on the idea that even visual intelligence could be symbolized and then reasoned about, a top-down approach to visual intelligence [4], which, despite its potential, didn&#8217;t deliver any considerable development in Computer vision considering how much we have been able to do now. Alternatively, the bottom-up approach, loosely modeled on processing units in the brain, neurons, composed with end-to-end learning mechanisms, came to dominate the field, as we discuss in the next&nbsp;section.</p><p>From the 1980s to even 2010, the efficient way to do computer vision, more specifically object detection, scene understanding, or classifying images on different schemas, was to develop mechanisms to extract some broad range of relevant features with expert domain knowledge and apply traditional statistics/machine learning on top of those feature characteristics [5]. 
These models, even though they worked in some cases, needed a lot of domain-specific knowledge and manual engineering, and were brittle outside their very specific&nbsp;problem.</p><h3>The success of CNNs for Computer&nbsp;Vision</h3><p>In 2012, the paradigm shift famously happened when Convolutional Neural Networks (CNNs), a variant of neural networks with characteristics suited to performing very well on images, reduced the error rate on a widely studied object recognition dataset called ImageNet by 50% [6]. From around that point, all or most of computer vision turned to convolutional neural networks. This is more dramatic than we might imagine. Yann LeCun, the inventor of convolutional neural networks, was told, even though his approach (CNNs) achieved state of the art, that it wasn&#8217;t worth a publication in CVPR (a very respectable conference in Computer Vision) because it didn&#8217;t tell us anything about the visual system. That was in 2010. Now, it&#8217;s hard to find a CVPR paper that doesn&#8217;t contain Convolutional neural networks.</p><p>This very successful approach didn&#8217;t emerge overnight, even if its recognition seemed to. The notion of creating intelligent machines by connecting brain-like units, neurons, in groups is very old. It used to be called Connectionism [7]. The notion of convolutions, which turned out to be efficient for computer vision, is also not new. Fukushima is often credited with formalizing the idea of convolutions in a network of neurons to make the model shift-invariant [8]. But a significant structure to the idea of convolutional neural networks came from the work of Yann LeCun, credited as the inventor of CNNs, in the late&nbsp;1980s.</p><p>Given that Prof. LeCun&#8217;s paper got rejected even in 2010, imagine the ideology in the preceding decades. Many people devoted their lives to Neural networks (and CNNs), but they didn&#8217;t know any practical way to train them so that they could perform on par with manually engineered systems or systems based on classical machine learning. 
Adding to that, there was a symbolic approach to creating intelligent machines with very prominent supporters, as we talked about in the last section, which slowed the progress of connectionism.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 424w, https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 848w, https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 1272w, https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 424w, https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 848w, https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 1272w, https://cdn-images-1.medium.com/max/600/1*zh8tEde1Qur9LWYxmGYMPg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 424w, https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 848w, https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 1272w, https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 424w, https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 848w, https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 1272w, https://cdn-images-1.medium.com/max/572/1*OtW4yIHw7p2srKAv9sFIKA.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The upper part shows how a 
kernel is applied to an image, turning it into a feature map of compressed size. The lower part of the image describes the sequence of those operations (hence the name deep) used to classify which number an image contains. This is a Deep Convolutional Neural Network, conceptually very similar to the networks we use today. This work is from&nbsp;1990.</figcaption></figure></div><p>With the effort of several researchers over several decades, the efficacy of connectionism (CNNs) became evident in smaller, specific problems. Yann LeCun, in 1990, used convolutional neural networks to recognize digits, automatically reading handwritten zip codes on mail, at AT&amp;T [9]. Among many people in the community, Yoshua Bengio and Geoffrey Hinton contributed significantly to the learning algorithms behind these networks, to insights about their shortcomings, and to techniques to overcome them [10][11], eventually causing the rise of Neural Networks to be, effectively, an answer for everything.</p><p>In 2012, two students from Geoffrey Hinton&#8217;s lab applied these techniques to an object recognition dataset called Imagenet [6]. This is the most dramatic moment that started the Deep Learning revolution we see around us. The availability of data and the economic feasibility of computation are big factors in the whole field of deep learning becoming practical and disruptive.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 424w, https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 848w, https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 1272w, https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 424w, https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 848w, https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 1272w, https://cdn-images-1.medium.com/max/831/1*00uIns6ZpfzToAwy-4-zzg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">The 2012 paper that caused a 50% decrease in the error rate on the Imagenet dataset. (Left) Some of the examples from the evaluation set. (Right) Rows of nearest neighbors of last hidden layer encodings. The leftmost image is the query and the others are the top N results&nbsp;[6].</figcaption></figure></div>
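<p>For readers who haven&#8217;t seen a convolution spelled out, here is a bare-bones sketch of the operation described in the 1990 figure&#8217;s caption above: a kernel sliding over an image to produce a smaller feature map. Real CNNs learn the kernel values from data instead of hand-coding them:</p><pre><code>import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image (no padding) to build a feature map.
    kh, kw = kernel.shape
    ih, iw = image.shape
    fmap = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(fmap.shape[0]):
        for x in range(fmap.shape[1]):
            fmap[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return fmap

# A hand-coded vertical-edge detector; a CNN learns filters like this one.
edge = np.array([[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]])
</code></pre>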
<p>Astonishingly, the reason why Yann LeCun&#8217;s CVPR paper got rejected in 2010 is the very reason for the success of computer vision. Specifically, we can credit the success to one fundamental aspect of the deep learning revolution, i.e., models learning everything end-to-end, without any of the hand-engineering that was necessary for computer vision until then. With CNNs, models automatically learn the features, either discriminatively, when classifying, or generatively, when they actually have to generate natural images. We just have to define a task, prepare some data, and the model will learn everything that&#8217;s needed. This is the formula that brought about the revolution of learning machines and, presumably, a change in the course of humanity.</p><p>As shown below, the model learns to detect simple features and then effectively combines them to detect more complex features in a subsequent layer. In the end, if you add another layer to predict the probability of each image being in a particular class, we have a model that can classify images. The same paradigm applies even when you are doing a very complicated classification, such as whether an image of an object shows a defect or not. These models can also be extended to mark the boundaries of different objects present in the image, to mask objects, and then to track them. <strong>Effectively, everything comes down to automatically learning the features specific to the task, end to&nbsp;end.</strong></p><p>Below is a visualization of filters that learned to detect specific things in each layer of a Deep Convolutional Neural&nbsp;Network:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 424w, https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 848w, https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 1272w, https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 424w, https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 848w, https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 1272w, https://cdn-images-1.medium.com/max/1024/1*bFsBBcFVXk9cxtfAxa7A-g.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Layer 1 learned to detect plane surfaces and edges. 
Layer 2 weights learned to detect some contours&nbsp;[12].</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 424w, https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 848w, https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 424w, https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 848w, https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 1272w, https://cdn-images-1.medium.com/max/1024/1*ZbT8Kk8MHr1qOuL-eu6Dyg.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Layer 3 learned to detect a combination of contours, sometimes meaningful parts of objects&nbsp;[12].</figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 424w, https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 848w, https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 1272w, https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 424w, https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 848w, https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 1272w, https://cdn-images-1.medium.com/max/1024/1*XlMWFuofKzCznL-iMwRbfw.png 1456w" sizes="100vw" 
loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Layers 4 and 5 learned to detect parts or whole objects of some class&nbsp;[12].</figcaption></figure></div><p>The Computer Vision community as a whole bet that this is the way to deal with everything. The formula is to define a task, get some data, and apply Neural Networks (CNNs). It will get the job done. This is what all the data science/machine learning teams in the world do. If you want to detect cracks or defects on windmills or segment different types of land from aerial images alone, this is the way. If you aim to give personalized health insights or do personalized marketing from electronic health data, this is the way. If you have to show personalized ads in an optimal way, this is the way. If you have to detect people, vehicles, and street signs even a km away, this is the way. If you have to detect intrusion into your property or a supermarket, this is the way. If you want to build intelligent systems to assist radiologists, this is the way. In fact, at <a href="http://akaiketech.com">Akaike Technologies</a>, we have used computer vision to deal with all of these problems and had a great impact on efficiency and&nbsp;revenue.</p><p>These are some of the ways companies are using computer vision; they give a sense of how broad it is. <a href="https://www.aes.com/">AES</a> is using drones and computer vision to make inspecting energy assets safer and more efficient. <a href="http://lgcns.co.kr/">LG CNS</a> is using Computer Vision to accurately detect defects in various products on the assembly line. <a href="https://shop.nordstrom.com/">Nordstrom</a> is using Computer Vision-based product search to enable shoppers to easily find products simply by taking a photo. <a href="https://www.unilever.com/">Unilever</a> is using Computer Vision to gain new insights on consumer behavior and improve ad campaign effectiveness. <a href="https://www.idexx.com/">IDEXX</a> is using Computer Vision to automatically organize medical imagery and improve the productivity of their radiologists. All of these systems are based on Convolutional Neural Networks. Most businesses could bring about a great change by adopting computer&nbsp;vision.</p><p>But don&#8217;t be deceived: it takes a lot of skill to formulate the problem as one that aligns with the strengths of current computer vision models, to design a model specific to a problem and to train it, to iterate on it till it is production-ready, and to scale the solutions to impact billions of people. The skill is a sound combination of research and implementation, adopting the current literature into production while at the same time contributing to this scientific endeavor. At <a href="http://akaiketech.com">Akaike Technologies</a>, we have proved to be excellent at solving the hardest of problems. You can see some of our accomplishments below at the&nbsp;end.</p><h3>The prospects of Computer&nbsp;Vision</h3><p>As successful as computer vision is now, we are nowhere near the ultimate aim of this research endeavor, i.e., automating the visual intelligence we possess. There is quite a lot of discussion in this space about how we can make sure that current computer vision techniques acquire comprehensive scene understanding. 
Apart from the statements where this term is used more liberally, comprehensive scene understanding is in some sense the ultimate fulfillment of computer&nbsp;vision.</p><p>To be intentionally and sensibly critical about the models we have today: simply put, we can say that they don&#8217;t have a clue about how the world works or how humans behave and feel. A model can detect people in a scene and identify them, but it doesn&#8217;t know whether someone is a customer or an intruder based on their behavior, which a human would recognize in most cases within moments. Intuitively, it might seem that if we have data for each thing, we can learn everything. Theoretically, this is true, but nobody can find data on what happens if people do one thing rather than another (as a single instance, say, teasing about throwing something at you as opposed to actually throwing it), and nobody can bring enough data on visual behavior for a model to understand how a person is feeling, or their aims/goals/motivations.</p><p>Sometimes it even feels practical to get the data and make machines learn the things described above, but that is only because we singled out a simple instance of the many great skills humans possess: the ability to understand a concept with minimal supervision, to combine different concepts to understand more complex ones, to have knowledge of the intuitive physics of the world and the intuitive psychology of humans, the ability to learn, and so on. Aiming to learn these things the way we model data now feels, as rightly argued by so many people,&nbsp;stupid.</p><p>This is evident to the computer vision (deep learning) community at large as well. And the community is striving towards models that can learn from data without many human labels, sometimes even unsupervised; models that learn to learn different tasks at the same time; and models that can interactively acquire data and then reason about it to make decisions. Also, to solve computer vision completely, things that feel distinct from it, like decision making, attention, and interactive learning, are nevertheless paramount in building systems that compete with a very efficiently evolved human visual intelligence.</p><p>Along with this, there are a lot of investigations from cognitive science, neuroscience, and philosophy flowing into computer vision (deep learning) conferences to equip current models with the abilities that surface when thinking about computer vision critically [13], [14]. We, at <a href="http://akaiketech.com">Akaike Technologies</a>, see imparting these abilities to computer vision as one of the important directions in the next five to ten years, and we actively participate and contribute along with the community.</p><p>Computer vision emerged as an application of learning systems, but it has now grown into a significant &#8216;scientific&#8217; endeavor to model visual intelligence. There is now a distinct sense to computer vision and computer vision applied. We are happy that the field has separated these different goals and their approaches, and that they are moving forward complementing each&nbsp;other.</p><p>Comprehensive scene understanding, at large, can impart potentially huge value to applied computer vision as well. 
Next-generation intruder detection, disaster management, and behavior profile analytics systems are just some of the low-hanging fruit for the community, along with improvements in just about everything we do&nbsp;now.</p><p>We believe that to excel at applied computer vision (or applied deep learning) beyond current successes, and to be sustainable in five to ten years, one has to exploit current technology with the skill to discern its strengths and weaknesses, and stay on the horizon of advancements by being critical and asking questions nobody else dares to&nbsp;ask.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 424w, https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 848w, https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 1272w, https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 424w, https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 848w, https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 1272w, https://cdn-images-1.medium.com/max/556/1*3jAez1aNEKhsJ94pzz5vmA.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Akaike Technologies</h4><p>We are a group of rapidly growing deep learning experts with two decades of experience in a wide range of domains. We have a proven track record of solving hard problems with computer vision, NLP, and deep learning. We strive to always be at the edge of deep learning, exploiting and exploring.</p><h4>References</h4><p>[1] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. <em>nature</em>, <em>529</em>(7587), p.484.</p><p>[2] <a href="https://en.wikipedia.org/wiki/Seymour_Papert">Papert, Seymour</a> (1966&#8211;07&#8211;01). &#8220;The Summer Vision Project&#8221;. <em>MIT AI Memos (1959&#8211;2004)</em>. <a href="https://en.wikipedia.org/wiki/Handle_System">HDL</a>:<a href="https://hdl.handle.net/1721.1%2F6125">1721.1/6125</a>.</p><p>[3] Marr, D., 1982. Vision: A computational investigation into the human representation and processing of visual information.</p><p>[4] Minsky, M., 1990. Logical vs. Analogical or Symbolic vs. Connectionist or Neat vs. Scruffy Artificial Intelligence at MIT. Expanding Frontiers, Patrick H. Winston&nbsp;(Ed.).</p><p>[5] Jiang, X., 2009, August. 
Feature extraction for image recognition and computer vision. In <em>2009 2nd IEEE International Conference on Computer Science and Information Technology</em> (pp. 1&#8211;15).&nbsp;IEEE.</p><p>[6] Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In <em>Advances in neural information processing systems</em> (pp. 1097&#8211;1105).</p><p>[7] McClelland, J.L., Rumelhart, D.E., and PDP Research Group, 1986. Parallel distributed processing. <em>Explorations in the Microstructure of Cognition</em>, <em>2</em>, pp.216&#8211;271.</p><p>[8] Fukushima, K., 1980. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. <em>Biological Cybernetics</em>, <em>36</em>(4), pp.193&#8211;202.</p><p>[9] Le Cun, Y., Matan, O., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D. and Baird, H.S., 1990, June. Handwritten zip code recognition with multilayer networks. In <em>Proc. 10th International Conference on Pattern Recognition</em> (Vol. 2, pp.&nbsp;35&#8211;40).</p><p>[10] Glorot, X. and Bengio, Y., 2010, March. Understanding the difficulty of training deep feedforward neural networks. In <em>Proceedings of the thirteenth international conference on artificial intelligence and statistics</em> (pp. 249&#8211;256).</p><p>[11] Rumelhart, D.E., Hinton, G.E., and Williams, R.J., 1985. <em>Learning internal representations by error propagation</em> (No. ICS-8506). California Univ San Diego La Jolla Inst for Cognitive Science.</p><p>[12] Zeiler, M.D., and Fergus, R., 2014, September. Visualizing and understanding convolutional networks. In <em>European conference on computer vision</em> (pp. 818&#8211;833). Springer, Cham.</p><p>[13] Hassabis, D., Kumaran, D., Summerfield, C., and Botvinick, M., 2017. Neuroscience-inspired artificial intelligence. <em>Neuron</em>, <em>95</em>(2), pp.245&#8211;258.</p><p>[14] Lake, B.M., Ullman, T.D., Tenenbaum, J.B. and Gershman, S.J., 2017. Building machines that learn and think like people. 
<em>Behavioral and Brain Sciences</em>, <em>40</em>.</p><div><hr></div><p><a href="https://medium.com/akaike-technologies/computer-vision-and-computer-vision-applied-2ef2cc0f7116">Computer Vision and Computer Vision applied</a> was originally published in <a href="https://medium.com/akaike-technologies">Akaike Technologies</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded></item><item><title><![CDATA[NIPS/NeurIPS 2018: Best* of the First Two Poster Sessions]]></title><description><![CDATA[NeurIPS is a great conference attracting the state of the art in almost every aspect of machine learning research.]]></description><link>https://www.aipapertrails.com/p/neurips-2018-reading-list-from-tue-poster-sessions-a-b-fce561e56be8</link><guid isPermaLink="false">https://www.aipapertrails.com/p/neurips-2018-reading-list-from-tue-poster-sessions-a-b-fce561e56be8</guid><dc:creator><![CDATA[Prakash Kagitha]]></dc:creator><pubDate>Tue, 04 Dec 2018 13:15:09 GMT</pubDate><enclosure url="https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 424w, https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 848w, https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 1272w, https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 1456w" sizes="100vw"><img src="https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX" data-attrs="{&quot;src&quot;:&quot;https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:null,&quot;width&quot;:null,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 424w, https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 848w, https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 1272w, https://cdn-images-1.medium.com/max/1024/0*MhcsDC_2hD_xwUrX 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a><figcaption class="image-caption">Photo by <a href="https://unsplash.com/@rohanmakhecha?utm_source=medium&amp;utm_medium=referral">Rohan Makhecha</a> on&nbsp;<a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure></div><p>NeurIPS is a great conference attracting the state of the art in almost every aspect of machine learning research.</p><p>There are a few things that a researcher in the field should certainly pay attention to at a conference. In my view, they cluster research articles into groups like these: <strong>1.</strong> <strong>Understanding, 2. Essentials, 3. Progress, 4. Big problems/future</strong></p><p>So I grouped everything I found influential into these categories. 
These are posters from the NeurIPS 2018 Tue Poster Sessions A&amp;B, with their abstracts.</p><h3>1. Understanding</h3><h4><a href="http://papers.nips.cc/paper/7350-are-gans-created-equal-a-large-scale-study">Are GANs Created Equal? A Large-Scale Study</a></h4><p><em>Retrospective</em></p><p>Despite a very rich research activity leading to numerous interesting GAN algorithms, it is still very hard to assess which algorithm(s) perform better than others. We conduct a neutral, multi-faceted large-scale empirical study on state-of-the-art models and evaluation measures. We find that most models can reach similar scores with enough hyperparameter optimization and random restarts. This suggests that improvements can arise from a higher computational budget and tuning more than fundamental algorithmic changes. To overcome some limitations of the current metrics, we also propose several data sets on which precision and recall can be computed. Our experimental results suggest that future GAN research should be based on more systematic and objective evaluation procedures. Finally, we did not find evidence that any of the tested algorithms consistently outperforms the non-saturating GAN introduced in Goodfellow et al. (2014).</p><h4><a href="http://papers.nips.cc/paper/8169-an-intriguing-failing-of-convolutional-neural-networks-and-the-coordconv-solution">An intriguing failing of convolutional neural networks and the CoordConv solution</a></h4><p><em>Interesting, About&nbsp;time</em></p><p>We have shown the curious inability of CNNs to model the coordinate transform task, shown a simple fix in the form of the CoordConv layer, and given results that suggest including these layers can boost performance in a wide range of applications. Using CoordConv in a GAN produced less mode collapse as the transform between high-level spatial latents and pixels becomes easier to learn. A Faster R-CNN detection model trained on MNIST detection showed 24% better IOU when using CoordConv, and in the Reinforcement Learning (RL) domain agents playing Atari games benefit significantly from the use of CoordConv layers.</p>
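<p>Since the CoordConv fix is so simple, it is worth seeing in code. Below is a minimal numpy sketch (mine, not the authors&#8217; release) of the coordinate channels a CoordConv layer concatenates before applying an otherwise ordinary convolution; the function name and the [-1, 1] scaling follow the paper&#8217;s description, everything else is illustrative.</p><pre><code>import numpy as np

def add_coord_channels(x):
    """Append normalized i/j coordinate channels to a batch of
    feature maps, as a CoordConv layer does before its convolution.
    x has shape (batch, channels, height, width)."""
    b, _, h, w = x.shape
    ii = np.linspace(-1.0, 1.0, h).reshape(1, 1, h, 1)  # row coordinate
    jj = np.linspace(-1.0, 1.0, w).reshape(1, 1, 1, w)  # column coordinate
    ii = np.broadcast_to(ii, (b, 1, h, w))
    jj = np.broadcast_to(jj, (b, 1, h, w))
    # A standard convolution applied to this tensor becomes a CoordConv.
    return np.concatenate([x, ii, jj], axis=1)

x = np.random.randn(2, 8, 16, 16)
print(add_coord_channels(x).shape)  # (2, 10, 16, 16)</code></pre>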
<h4><a href="http://papers.nips.cc/paper/7519-a-linear-speedup-analysis-of-distributed-deep-learning-with-sparse-and-quantized-communication">A Linear Speedup Analysis of Distributed Deep Learning with Sparse and Quantized Communication</a></h4><p><em>Efficiency</em></p><p>The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that <em>O</em>(1/&#8730;<em>MK</em>) convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the <em>O</em>(1/&#8730;<em>MK</em>) convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-communication SGD with only 3%&#8722;5% communication data&nbsp;size.</p><h4><a href="http://papers.nips.cc/paper/7368-on-the-dimensionality-of-word-embedding">On the Dimensionality of Word Embedding</a></h4><p><em>Elegant</em></p><p>In this paper, we provide a theoretical understanding of word embedding and its dimensionality. Motivated by the unitary-invariance of word embedding, we propose the Pairwise Inner Product (PIP) loss, a novel metric on the dissimilarity between word embeddings. Using techniques from matrix perturbation theory, we reveal a fundamental bias-variance trade-off in dimensionality selection for word embeddings. This bias-variance trade-off sheds light on many empirical observations which were previously unexplained, for example the existence of an optimal dimensionality. Moreover, new insights and discoveries, like when and how word embeddings are robust to over-fitting, are revealed. By optimizing over the bias-variance trade-off of the PIP loss, we can explicitly answer the open question of dimensionality selection for word embedding.</p>
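<p>The PIP loss itself is a one-liner, so here is a small numpy sketch built only from the abstract&#8217;s definition (not the authors&#8217; code): the PIP matrix collects all pairwise inner products of the word vectors, which makes the comparison invariant to the unitary rotations that plague direct embedding comparison.</p><pre><code>import numpy as np

def pip_loss(e1, e2):
    """PIP loss between two embedding matrices (rows = words):
    the Frobenius distance between their pairwise-inner-product
    matrices. E @ E.T is unchanged by any unitary rotation of E."""
    return np.linalg.norm(e1 @ e1.T - e2 @ e2.T)

rng = np.random.default_rng(0)
e = rng.standard_normal((100, 50))
q, _ = np.linalg.qr(rng.standard_normal((50, 50)))  # a random rotation
print(pip_loss(e, e @ q))  # ~0: a rotated embedding is "the same" embedding</code></pre>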
<h4><a href="http://papers.nips.cc/paper/7647-adversarial-examples-that-fool-both-computer-vision-and-time-limited-humans">Adversarial Examples that Fool both Computer Vision and Time-Limited Humans</a></h4><p><em>Fundamental</em></p><p>Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.</p><h4><a href="http://papers.nips.cc/paper/8089-dendritic-cortical-microcircuits-approximate-the-backpropagation-algorithm">Dendritic cortical microcircuits approximate the backpropagation algorithm</a></h4><p><em>Insight</em></p><p>Deep learning has seen remarkable developments over the last years, many of them inspired by neuroscience. However, the main learning mechanism behind these advances&#8202;&#8212;&#8202;error backpropagation&#8202;&#8212;&#8202;appears to be at odds with neurobiology. Here, we introduce a multilayer neuronal network model with simplified dendritic compartments in which error-driven synaptic plasticity adapts the network towards a global desired output. In contrast to previous work our model does not require separate phases and synaptic learning is driven by local dendritic prediction errors continuously in time. Such errors originate at apical dendrites and occur due to a mismatch between predictive input from lateral interneurons and activity from actual top-down feedback. Through the use of simple dendritic compartments and different cell-types our model can represent both error and normal activity within a pyramidal neuron. We demonstrate the learning capabilities of the model in regression and classification tasks, and show analytically that it approximates the error backpropagation algorithm. Moreover, our framework is consistent with recent observations of learning between brain areas and the architecture of cortical microcircuits. Overall, we introduce a novel view of learning on dendritic cortical circuits and on how the brain may solve the long-standing synaptic credit assignment problem.</p><h4><a href="http://papers.nips.cc/paper/7999-on-neuronal-capacity">On Neuronal&nbsp;Capacity</a></h4><p><em>Novel formulation</em></p><p>We define the capacity of a learning machine to be the logarithm of the number (or volume) of the functions it can implement. We review known results, and derive new results, estimating the capacity of several neuronal models: linear and polynomial threshold gates, linear and polynomial threshold gates with constrained weights (binary weights, positive weights), and ReLU neurons. We also derive capacity estimates and bounds for fully recurrent networks and layered feedforward networks.</p><h4><a href="http://papers.nips.cc/paper/8277-bias-and-generalization-in-deep-generative-models-an-empirical-study">Bias and Generalization in Deep Generative Models: An Empirical Study</a></h4><p><em>True understanding</em></p><p>In high dimensional settings, density estimation algorithms rely crucially on their inductive bias. Despite recent empirical success, the inductive bias of deep generative models is not well understood. In this paper we propose a framework to systematically investigate bias and generalization in deep generative models of images by probing the learning algorithm with carefully designed training datasets. By measuring properties of the learned distribution, we are able to find interesting patterns of generalization. We verify that these patterns are consistent across datasets, common models and architectures.</p><h4><a href="http://papers.nips.cc/paper/7515-how-does-batch-normalization-help-optimization">How Does Batch Normalization Help Optimization?</a></h4><p><em>Perspective</em></p><p>Batch Normalization (BatchNorm) is a widely adopted technique that enables faster and more stable training of deep neural networks (DNNs). Despite its pervasiveness, the exact reasons for BatchNorm&#8217;s effectiveness are still poorly understood. The popular belief is that this effectiveness stems from controlling the change of the layers&#8217; input distributions during training to reduce the so-called &#8220;internal covariate shift&#8221;. In this work, we demonstrate that such distributional stability of layer inputs has little to do with the success of BatchNorm. Instead, we uncover a more fundamental impact of BatchNorm on the training process: it makes the optimization landscape significantly smoother. This smoothness induces a more predictive and stable behavior of the gradients, allowing for faster training.</p>
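<p>To keep the object of study concrete, here is the standard training-mode BatchNorm forward pass in numpy (the running-statistics inference path is omitted). The paper&#8217;s claim is about <em>why</em> this operation helps: landscape smoothing rather than fixing internal covariate shift.</p><pre><code>import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standard batch normalization over a minibatch, per feature:
    normalize with the batch mean/variance, then scale and shift."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 4) * 3.0 + 7.0  # badly scaled activations
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 and ~1</code></pre>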
<h4><a href="http://papers.nips.cc/paper/7938-a-probabilistic-population-code-based-on-neural-samples">A probabilistic population code based on neural&nbsp;samples</a></h4><p><em>Neural Encoding</em></p><p>Sensory processing is often characterized as implementing probabilistic inference: networks of neurons compute posterior beliefs over unobserved causes given the sensory inputs. How these beliefs are computed and represented by neural responses is much-debated (Fiser et al. 2010, Pouget et al. 2013). A central debate concerns the question of whether neural responses represent samples of latent variables (Hoyer &amp; Hyvarinnen 2003) or parameters of their distributions (Ma et al. 2006) with efforts being made to distinguish between them (Grabska-Barwinska et al. 2013). A separate debate addresses the question of whether neural responses are proportionally related to the encoded probabilities (Barlow 1969), or proportional to the logarithm of those probabilities (Jazayeri &amp; Movshon 2006, Ma et al. 2006, Beck et al. 2012). Here, we show that these alternatives&#8202;&#8212;&#8202;contrary to common assumptions&#8202;&#8212;&#8202;are not mutually exclusive and that the very same system can be compatible with all of them. As a central analytical result, we show that modeling neural responses in area V1 as samples from a posterior distribution over latents in a linear Gaussian model of the image implies that those neural responses form a linear Probabilistic Population Code (PPC, Ma et al. 2006). In particular, the posterior distribution over some experimenter-defined variable like &#8220;orientation&#8221; is part of the exponential family with sufficient statistics that are linear in the neural sampling-based firing&nbsp;rates.</p><h3>2. Essentials</h3><h4><a href="http://papers.nips.cc/paper/7957-training-deep-models-faster-with-robust-approximate-importance-sampling">Training Deep Models Faster with Robust, Approximate Importance Sampling</a></h4><p><em>Solution</em></p><p>In theory, importance sampling speeds up stochastic gradient algorithms for supervised learning by prioritizing training examples. In practice, the cost of computing importances greatly limits the impact of importance sampling. We propose a robust, approximate importance sampling procedure (RAIS) for stochastic gradient descent. By approximating the ideal sampling distribution using robust optimization, RAIS provides much of the benefit of exact importance sampling with drastically reduced overhead. Empirically, we find RAIS-SGD and standard SGD follow similar learning curves, but RAIS moves faster through these paths, achieving speed-ups of at least 20% and sometimes much&nbsp;more.</p><h4><a href="http://papers.nips.cc/paper/7371-dropmax-adaptive-variational-softmax">DropMax: Adaptive Variational Softmax</a></h4><p><em>Elegant</em></p><p>We propose DropMax, a stochastic version of softmax classifier which at each iteration drops non-target classes according to dropout probabilities adaptively decided for each instance. Specifically, we overlay binary masking variables over class output probabilities, which are input-adaptively learned via variational inference. This stochastic regularization has an effect of building an ensemble classifier out of exponentially many classifiers with different decision boundaries. Moreover, the learning of dropout rates for non-target classes on each instance allows the classifier to focus more on classification against the most confusing classes. We validate our model on multiple public datasets for classification, on which it obtains significantly improved accuracy over the regular softmax classifier and other baselines. Further analysis of the learned dropout probabilities shows that our model indeed selects confusing classes more often when it performs classification.</p><h4><a href="http://papers.nips.cc/paper/7338-how-to-start-training-the-effect-of-initialization-and-architecture">How to Start Training: The Effect of Initialization and Architecture</a></h4><p><em>Utility</em></p><p>We identify and study two common failure modes for early training in deep ReLU nets. For each, we give a rigorous proof of when it occurs and how to avoid it, for fully connected, convolutional, and residual architectures. We show that the first failure mode, exploding or vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly scaling the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided. In contrast, for fully connected nets, we prove that this failure mode can happen and is avoided by keeping constant the sum of the reciprocals of layer widths. We demonstrate empirically the effectiveness of our theoretical results in predicting when networks are able to start training. In particular, we note that many popular initializations fail our criteria, whereas correct initialization and architecture allows much deeper networks to be&nbsp;trained.</p>
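<p>The variance-2/fan-in criterion is easy to check for yourself. A minimal numpy sketch (mine, illustrating the criterion rather than reproducing the paper&#8217;s experiments): with this scaling, the mean activation length through a deep ReLU stack stays around a constant instead of exploding or vanishing.</p><pre><code>import numpy as np

def init_fc_weights(layer_widths, rng):
    """Draw weights for fully connected ReLU layers from a symmetric
    distribution with variance 2/fan-in (the paper's criterion for
    avoiding exploding/vanishing mean activation length)."""
    return [rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
            for fan_in, fan_out in zip(layer_widths[:-1], layer_widths[1:])]

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 512))
for w in init_fc_weights([512] * 20, rng):
    x = np.maximum(x @ w, 0.0)  # ReLU layer
# Root-mean-square activation length after 19 layers: still order 1.
print(np.sqrt((x ** 2).mean()))</code></pre>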
<h4><a href="http://papers.nips.cc/paper/7481-regularizing-by-the-variance-of-the-activations-sample-variances">Regularizing by the Variance of the Activations&#8217; Sample-Variances</a></h4><p><em>Simple-not-trivial</em></p><p>Normalization techniques play an important role in supporting efficient and often more effective training of deep neural networks. While conventional methods explicitly normalize the activations, we suggest to add a loss term instead. This new loss term encourages the variance of the activations to be stable and not vary from one random mini-batch to the next. Finally, we are able to link the new regularization term to the batchnorm method, which provides it with a regularization perspective. Our experiments demonstrate an improvement in accuracy over the batchnorm technique for both CNNs and fully connected networks.</p><h4><a href="http://papers.nips.cc/paper/8242-mesh-tensorflow-deep-learning-for-supercomputers">Mesh-TensorFlow: Deep Learning for Supercomputers</a></h4><p><em>Solution</em></p><p>Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the &#8220;batch&#8221; dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing SOTA results on WMT&#8217;14 English-to-French translation task and the one-billion-word Language modeling benchmark. Mesh-TensorFlow is available at <a href="https://github.com/tensorflow/mesh">https://github.com/tensorflow/mesh</a>.</p>
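<p>The core distinction is easy to see in plain numpy (a conceptual sketch only; it deliberately does not use the real Mesh-TensorFlow API): data-parallelism splits the batch dimension across processors, while a Mesh-TensorFlow-style layout can split any dimension, such as the output dimension of a weight matrix, with a collective joining the pieces.</p><pre><code>import numpy as np

x = np.random.randn(8, 32)   # (batch, d_in) activations
w = np.random.randn(32, 64)  # (d_in, d_out) weights

# Data-parallel: each of two "processors" computes half the batch.
y_data = np.concatenate([x[:4] @ w, x[4:] @ w], axis=0)

# Model-parallel: each processor holds half of d_out's columns; the
# concat below stands in for an allgather collective across the mesh.
y_model = np.concatenate([x @ w[:, :32], x @ w[:, 32:]], axis=1)

print(np.allclose(y_data, x @ w), np.allclose(y_model, x @ w))  # True True</code></pre>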
<h4><a href="http://papers.nips.cc/paper/7287-structure-aware-convolutional-neural-networks">Structure-Aware Convolutional Neural&nbsp;Networks</a></h4><p><em>Real world</em></p><p>Convolutional neural networks (CNNs) are inherently subject to invariable filters that can only aggregate local inputs with the same topological structures. It causes that CNNs are allowed to manage data with Euclidean or grid-like structures (e.g., images), not ones with non-Euclidean or graph structures (e.g., traffic networks). To broaden the reach of CNNs, we develop structure-aware convolution to eliminate the invariance, yielding a unified mechanism of dealing with both Euclidean and non-Euclidean structured data. Technically, filters in the structure-aware convolution are generalized to univariate functions, which are capable of aggregating local inputs with diverse topological structures. Since infinite parameters are required to determine a univariate function, we parameterize these filters with numbered learnable parameters in the context of the function approximation theory. By replacing the classical convolution in CNNs with the structure-aware convolution, Structure-Aware Convolutional Neural Networks (SACNNs) are readily established. Extensive experiments on eleven datasets strongly evidence that SACNNs outperform current models on various machine learning tasks, including image classification and clustering, text categorization, skeleton-based action recognition, molecular activity detection, and taxi flow prediction.</p><h4><a href="http://papers.nips.cc/paper/7875-visualizing-the-loss-landscape-of-neural-nets">Visualizing the Loss Landscape of Neural&nbsp;Nets</a></h4><p><em>Perspective</em></p><p>Neural network training relies on our ability to find &#8220;good&#8221; minimizers of highly non-convex loss functions. It is well known that certain network architecture designs (e.g., skip connections) produce loss functions that train easier, and well-chosen training parameters (batch size, learning rate, optimizer) produce minimizers that generalize better. However, the reasons for these differences, and their effect on the underlying loss landscape, is not well understood. In this paper, we explore the structure of neural loss functions, and the effect of loss landscapes on generalization, using a range of visualization methods. First, we introduce a simple &#8220;filter normalization&#8221; method that helps us visualize loss function curvature, and make meaningful side-by-side comparisons between loss functions. Then, using a variety of visualizations, we explore how network architecture affects the loss landscape, and how training parameters affect the shape of minimizers.</p><h4><a href="http://papers.nips.cc/paper/7327-training-dnns-with-hybrid-block-floating-point">Training DNNs with Hybrid Block Floating&nbsp;Point</a></h4><p><em>Craving to&nbsp;solve</em></p><p>The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them. These accelerators typically employ densely packed full-precision floating-point arithmetic to maximize performance per area. Ongoing research efforts seek to further increase that performance density by replacing floating-point with fixed-point arithmetic. However, a significant roadblock for these attempts has been fixed point&#8217;s narrow dynamic range, which is insufficient for DNN training convergence. We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic. Unfortunately, BFP alone introduces several limitations that preclude its direct applicability. In this work, we introduce HBFP, a hybrid BFP-FP approach, which performs all dot products in BFP and other operations in floating point. HBFP delivers the best of both worlds: the high accuracy of floating point at the superior hardware density of fixed point. For a wide variety of models, we show that HBFP matches floating point&#8217;s accuracy while enabling hardware implementations that deliver up to 8.5x higher throughput.</p>
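<p>Block floating point is simple enough to sketch in a few lines. This is a toy numpy illustration of the number format only (shared exponent per block, fixed-point mantissas), not the paper&#8217;s HBFP training pipeline; the mantissa width is an arbitrary choice here.</p><pre><code>import numpy as np

def to_block_floating_point(x, mantissa_bits=8):
    """Quantize a block of values to block floating point: one shared
    exponent for the whole block plus per-value fixed-point mantissas,
    so arithmetic within the block can use fixed-point logic."""
    shared_exp = np.ceil(np.log2(np.abs(x).max() + 1e-30))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1)
    mantissa = np.clip(np.round(x / scale), -limit, limit - 1)
    return mantissa * scale  # dequantized view of the BFP block

x = np.random.randn(64)
print(np.abs(x - to_block_floating_point(x)).max())  # small rounding error</code></pre>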
<h4><a href="http://papers.nips.cc/paper/7408-frage-frequency-agnostic-word-representation">FRAGE: Frequency-Agnostic Word Representation</a></h4><p><em>Interesting</em></p><p>Continuous word representation (aka word embedding) is a basic building block in many neural network-based models used in natural language processing tasks. Although it is widely accepted that words with similar semantics should be close to each other in the embedding space, we find that word embeddings learned in several tasks are biased towards word frequency: the embeddings of high-frequency and low-frequency words lie in different subregions of the embedding space, and the embedding of a rare word and a popular word can be far from each other even if they are semantically similar. This makes learned word embeddings ineffective, especially for rare words, and consequently limits the performance of these neural network models. In order to mitigate the issue, in this paper, we propose a neat, simple yet effective adversarial training method to blur the boundary between the embeddings of high-frequency words and low-frequency words. We conducted comprehensive studies on ten datasets across four natural language processing tasks, including word similarity, language modeling, machine translation and text classification. Results show that we achieve higher performance than the baselines in all&nbsp;tasks.</p><h4><a href="http://papers.nips.cc/paper/7965-unsupervised-cross-modal-alignment-of-speech-and-text-embedding-spaces">Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces</a></h4><p><em>At the&nbsp;core</em></p><p>Recent research has shown that word embedding spaces learned from text corpora of different languages can be aligned without any parallel data supervision. Inspired by the success in unsupervised cross-lingual word embeddings, in this paper we target learning a cross-modal alignment between the embedding spaces of speech and text learned from corpora of their respective modalities in an unsupervised fashion. The proposed framework learns the individual speech and text embedding spaces, and attempts to align the two spaces via adversarial training, followed by a refinement procedure. We show how our framework could be used to perform the tasks of spoken word classification and translation, and the experimental results on these two tasks demonstrate that the performance of our unsupervised alignment approach is comparable to its supervised counterpart. 
Our framework is especially useful for developing automatic speech recognition (ASR) and speech-to-text translation systems for low- or zero-resource languages, which have little parallel audio-text data for training modern supervised ASR and speech-to-text translation models, but account for the majority of the languages spoken across the&nbsp;world.</p><h4><a href="http://papers.nips.cc/paper/7886-compact-generalized-non-local-network">Compact Generalized Non-local Network</a></h4><p><em>Clean</em></p><p>The non-local module is designed for capturing long-range spatio-temporal dependencies in images and videos. Although having shown excellent performance, it lacks the mechanism to model the interactions between positions across channels, which are of vital importance in recognizing fine-grained objects and actions. To address this limitation, we generalize the non-local module and take the correlations between the positions of any two channels into account. This extension utilizes the compact representation for multiple kernel functions with Taylor expansion that makes the generalized non-local module in a fast and low-complexity computation flow. Moreover, we implement our generalized non-local method within channel groups to ease the optimization. Experimental results illustrate the clear-cut improvements and practical applicability of the generalized non-local module on both fine-grained object recognition and video classification. Code is available at: <a href="https://github.com/KaiyuYue/cgnl-network.pytorch">https://github.com/KaiyuYue/cgnl-network.pytorch</a>.</p><h3>3. Progress</h3><h4><a href="http://papers.nips.cc/paper/8226-generalizing-point-embeddings-using-the-wasserstein-space-of-elliptical-distributions">Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions</a></h4><p><em>Stretching</em></p><p>A novel framework for embeddings which are numerically flexible and which extends point embeddings to elliptical embeddings in Wasserstein space. Wasserstein elliptical embeddings are more intuitive and yield tools that are better behaved numerically than the alternative choice of Gaussian embeddings with the Kullback-Leibler divergence. The paper demonstrates the advantages of elliptical embeddings by using them for visualization, to compute embeddings of words, and to reflect entailment or hypernymy.</p><h4><a href="http://papers.nips.cc/paper/7356-fishnet-a-versatile-backbone-for-image-region-and-pixel-level-prediction">FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction</a></h4><p><em>Fundamentals</em></p><p>The basic principles in designing convolutional neural network (CNN) structures for predicting objects on different levels, e.g., image-level, region-level, and pixel-level, are diverging. Generally, network structures designed specifically for image classification are directly used as default backbone structure for other tasks including detection and segmentation, but there is seldom backbone structure designed under the consideration of unifying the advantages of networks designed for pixel-level or region-level predicting tasks, which may require very deep features with high resolution. Towards this goal, we design a fish-like network, called FishNet. In FishNet, the information of all resolutions is preserved and refined for the final task. Besides, we observe that existing works still cannot <em>directly</em> propagate the gradient information from deep layers to shallow layers. 
Our design can better handle this problem. Extensive experiments have been conducted to demonstrate the remarkable performance of the FishNet. In particular, on ImageNet-1k, the accuracy of FishNet is able to surpass the performance of DenseNet and ResNet with fewer parameters. FishNet was applied as one of the modules in the winning entry of the COCO Detection 2018 challenge. The code is available at <a href="https://github.com/kevin-ssy/FishNet">https://github.com/kevin-ssy/FishNet</a>.</p><h4><a href="http://papers.nips.cc/paper/8003-towards-robust-interpretability-with-self-explaining-neural-networks">Towards Robust Interpretability with Self-Explaining Neural&nbsp;Networks</a></h4><p><em>Without side&nbsp;effects</em></p><p>Most recent work on interpretability of complex machine learning models has focused on estimating a-posteriori explanations for previously trained models around specific predictions. Self-explaining models where interpretability plays a key role already during learning have received much less attention. We propose three desiderata for explanations in general&#8202;&#8212;&#8202;explicitness, faithfulness, and stability&#8202;&#8212;&#8202;and show that existing methods do not satisfy them. In response, we design self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models. Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.</p><h4><a href="http://papers.nips.cc/paper/7960-relational-recurrent-neural-networks">Relational recurrent neural&nbsp;networks</a></h4><p><em>Revolutionary</em></p><p>Memory-based neural networks model temporal data by leveraging an ability to remember information for long periods. It is unclear, however, whether they also have an ability to perform complex relational reasoning with the information they remember. Here, we first confirm our intuitions that standard memory architectures may struggle at tasks that heavily involve an understanding of the ways in which entities are connected&#8202;&#8212;&#8202;i.e., tasks involving relational reasoning. We then improve upon these deficits by using a new memory module&#8202;&#8212;&#8202;a Relational Memory Core (RMC)&#8202;&#8212;&#8202;which employs multi-head dot product attention to allow memories to interact. Finally, we test the RMC on a suite of tasks that may profit from more capable relational reasoning across sequential information, and show large gains in RL domains (BoxWorld &amp; Mini PacMan), program evaluation, and language modeling, achieving state-of-the-art results on the WikiText-103, Project Gutenberg, and GigaWord datasets.</p><h4><a href="http://papers.nips.cc/paper/7381-neural-symbolic-vqa-disentangling-reasoning-from-vision-and-language-understanding">Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding</a></h4><p><em>Encompassing everything</em></p><p>We marry two powerful ideas: deep representation learning for visual recognition and language understanding, and symbolic program execution for reasoning. Our neural-symbolic visual question answering (NS-VQA) system first recovers a structural scene representation from the image and a program trace from the question. It then executes the program on the scene representation to obtain an answer. Incorporating symbolic structure as prior knowledge offers three unique advantages. First, executing programs on a symbolic space is more robust to long program traces; our model can solve complex reasoning tasks better, achieving an accuracy of 99.8% on the CLEVR dataset. Second, the model is more data- and memory-efficient: it performs well after learning on a small number of training data; it can also encode an image into a compact representation, requiring less storage than existing methods for offline question answering. Third, symbolic program execution offers full transparency to the reasoning process; we are thus able to interpret and diagnose each execution step.</p>
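<p>The execution half of NS-VQA is the part that is easy to make concrete. A toy sketch (mine, not the authors&#8217; code, and far simpler than their CLEVR programs): once a scene has been parsed into a structural representation and a question into a program trace, answering is just running small interpretable ops.</p><pre><code># Toy symbolic execution in the spirit of NS-VQA's second stage.
scene = [  # a structural scene representation recovered from an image
    {"shape": "cube",     "color": "red",  "size": "large"},
    {"shape": "sphere",   "color": "blue", "size": "small"},
    {"shape": "cylinder", "color": "red",  "size": "small"},
]

def filter_attr(objs, attr, value):
    return [o for o in objs if o[attr] == value]

def count(objs):
    return len(objs)

OPS = {"filter_attr": filter_attr, "count": count}

# "How many red things are there?" parses to a two-step program trace.
program = [("filter_attr", "color", "red"), ("count",)]

state = scene
for op, *args in program:
    state = OPS[op](state, *args)
print(state)  # 2, and every intermediate step is inspectable</code></pre>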
<h4><a href="http://papers.nips.cc/paper/8206-neural-voice-cloning-with-a-few-samples">Neural Voice Cloning with a Few&nbsp;Samples</a></h4><p><em>Artificial need</em></p><p>Voice cloning is a highly desired feature for personalized speech interfaces. We introduce a neural voice cloning system that learns to synthesize a person&#8217;s voice from only a few audio samples. We study two approaches: speaker adaptation and speaker encoding. Speaker adaptation is based on fine-tuning a multi-speaker generative model. Speaker encoding is based on training a separate model to directly infer a new speaker embedding, which will be applied to a multi-speaker generative model. In terms of naturalness of the speech and similarity to the original speaker, both approaches can achieve good performance, even with a few cloning audios. While speaker adaptation can achieve slightly better naturalness and similarity, cloning time and required memory for the speaker encoding approach are significantly less, making it more favorable for low-resource deployment.</p><h4><a href="http://papers.nips.cc/paper/8027-neural-arithmetic-logic-units">Neural Arithmetic Logic&nbsp;Units</a></h4><p><em>Being normal</em></p><p>Neural networks can learn to represent and manipulate numerical information, but they seldom generalize well outside of the range of numerical values encountered during training. To encourage more systematic numerical extrapolation, we propose an architecture that represents numerical quantities as linear activations which are manipulated using primitive arithmetic operators, controlled by learned gates. We call this module a neural arithmetic logic unit (NALU), by analogy to the arithmetic logic unit in traditional processors. Experiments show that NALU-enhanced neural networks can learn to track time, perform arithmetic over images of numbers, translate numerical language into real-valued scalars, execute computer code, and count objects in images. In contrast to conventional architectures, we obtain substantially better generalization both inside and outside of the range of numerical values encountered during training, often extrapolating orders of magnitude beyond trained numerical ranges.</p>
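<p>Since the NALU equations are compact, here is a numpy sketch of a single unit&#8217;s forward pass, written from the paper&#8217;s description (parameter shapes and names are mine, and the parameters below are random rather than trained): the constructed weights are drawn toward {-1, 0, 1} so the unit can express exact addition and subtraction, a log-space path covers multiplication and division, and a learned gate switches between the two.</p><pre><code>import numpy as np

def nalu(x, w_hat, m_hat, g_weight, eps=1e-7):
    """Forward pass of a neural arithmetic logic unit (sketch)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    w = np.tanh(w_hat) * sigmoid(m_hat)   # weights biased toward -1, 0, 1
    add_path = x @ w                      # addition / subtraction
    mul_path = np.exp(np.log(np.abs(x) + eps) @ w)  # multiplication / division
    gate = sigmoid(x @ g_weight)          # soft switch between the two paths
    return gate * add_path + (1.0 - gate) * mul_path

rng = np.random.default_rng(0)
x = rng.uniform(1.0, 10.0, size=(4, 2))
y = nalu(x, rng.standard_normal((2, 1)), rng.standard_normal((2, 1)),
         rng.standard_normal((2, 1)))
print(y.shape)  # (4, 1); training shapes w and the gate toward one operation</code></pre>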
<h3>4. Big Problems/Future</h3><h4><a href="http://papers.nips.cc/paper/7473-embedding-logical-queries-on-knowledge-graphs">Embedding Logical Queries on Knowledge Graphs</a></h4><p><em>Minority approach</em></p><p>Learning low-dimensional embeddings of knowledge graphs is a powerful approach used to predict unobserved or missing edges between entities. However, an open challenge in this area is developing techniques that can go beyond simple edge prediction and handle more complex logical queries, which might involve multiple unobserved edges, entities, and variables. For instance, given an incomplete biological knowledge graph, we might want to predict &#8220;what drugs are likely to target proteins involved with both diseases X and Y?&#8221;&#8202;&#8212;&#8202;a query that requires reasoning about all possible proteins that might interact with diseases X and Y. Here we introduce a framework to efficiently make predictions about conjunctive logical queries&#8202;&#8212;&#8202;a flexible but tractable subset of first-order logic&#8202;&#8212;&#8202;on incomplete knowledge graphs. In our approach, we embed graph nodes in a low-dimensional space and represent logical operators as learned geometric operations (e.g., translation, rotation) in this embedding space. By performing logical operations within a low-dimensional embedding space, our approach achieves a time complexity that is linear in the number of query variables, compared to the exponential complexity required by a naive enumeration-based approach. We demonstrate the utility of this framework in two application studies on real-world datasets with millions of relations: predicting logical relationships in a network of drug-gene-disease interactions and in a graph-based representation of social interactions derived from a popular web&nbsp;forum.</p><h4><a href="http://papers.nips.cc/paper/7334-multi-task-learning-as-multi-objective-optimization">Multi-Task Learning as Multi-Objective Optimization</a></h4><p><em>The whole&nbsp;deal</em></p><p>In multi-task learning, multiple tasks are solved jointly, sharing inductive bias between them. Multi-task learning is inherently a multi-objective problem because different tasks may conflict, necessitating a trade-off. A common compromise is to optimize a proxy objective that minimizes a weighted linear combination of per-task losses. However, this workaround is only valid when the tasks do not compete, which is rarely the case. In this paper, we explicitly cast multi-task learning as multi-objective optimization, with the overall objective of finding a Pareto optimal solution. To this end, we use algorithms developed in the gradient-based multi-objective optimization literature. These algorithms are not directly applicable to large-scale learning problems since they scale poorly with the dimensionality of the gradients and the number of tasks. We therefore propose an upper bound for the multi-objective loss and show that it can be optimized efficiently. We further prove that optimizing this upper bound yields a Pareto optimal solution under realistic assumptions. We apply our method to a variety of multi-task deep learning problems including digit classification, scene understanding (joint semantic segmentation, instance segmentation, and depth estimation), and multi-label classification. Our method produces higher-performing models than recent multi-task learning formulations or per-task training.</p>
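<p>For intuition, the two-task case of the underlying multiple-gradient-descent idea has a closed form: take the minimum-norm convex combination of the task gradients as a common descent direction. The sketch below is my own illustration of that two-task special case (the paper itself works with an upper bound and scales to many tasks), with a tiny epsilon added for numerical safety.</p><pre><code>import numpy as np

def min_norm_combination(g1, g2):
    """Minimum-norm point on the segment between two task gradients:
    argmin over alpha in [0, 1] of ||alpha*g1 + (1-alpha)*g2||."""
    diff = g1 - g2
    alpha = np.dot(g2 - g1, g2) / (np.dot(diff, diff) + 1e-12)
    alpha = float(np.clip(alpha, 0.0, 1.0))
    return alpha * g1 + (1.0 - alpha) * g2

g1 = np.array([1.0, 0.0])  # gradient of task 1
g2 = np.array([0.0, 1.0])  # gradient of task 2 (conflicting direction)
print(min_norm_combination(g1, g2))  # [0.5 0.5], a shared descent direction</code></pre>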
<h4><a href="http://papers.nips.cc/paper/8014-rendernet-a-deep-convolutional-network-for-differentiable-rendering-from-3d-shapes">RenderNet: A deep convolutional network for differentiable rendering from 3D&nbsp;shapes</a></h4><p><em>Imminent</em></p><p>Traditional computer graphics rendering pipelines are designed for procedurally generating 2D images from 3D shapes with high performance. The nondifferentiability due to discrete operations (such as visibility computation) makes it hard to explicitly correlate rendering parameters and the resulting image, posing a significant challenge for inverse rendering tasks. Recent work on differentiable rendering achieves differentiability either by designing surrogate gradients for non-differentiable operations or via an approximate but differentiable renderer. These methods, however, are still limited when it comes to handling occlusion, and restricted to particular rendering effects. We present RenderNet, a differentiable rendering convolutional network with a novel projection unit that can render 2D images from 3D shapes. Spatial occlusion and shading calculation are automatically encoded in the network. Our experiments show that RenderNet can successfully learn to implement different shaders, and can be used in inverse rendering tasks to estimate shape, pose, lighting and texture from a single&nbsp;image.</p><h4><a href="http://papers.nips.cc/paper/8163-e-snli-natural-language-inference-with-natural-language-explanations">e-SNLI: Natural Language Inference with Natural Language Explanations</a></h4><p><em>Big dreams</em></p><p>In order for machine learning to garner widespread public adoption, models must be able to provide interpretable and robust explanations for their decisions, as well as learn from human-provided explanations at train time. In this work, we extend the Stanford Natural Language Inference dataset with an additional layer of human-annotated natural language explanations of the entailment relations. We further implement models that incorporate these explanations into their training process and output them at test time. We show how our corpus of explanations, which we call e-SNLI, can be used for various goals, such as obtaining full sentence justifications of a model&#8217;s decisions, improving universal sentence representations and transferring to out-of-domain NLI datasets. Our dataset thus opens up a range of research directions for using natural language explanations, both for improving models and for asserting their&nbsp;trust.</p><h4><a href="http://papers.nips.cc/paper/7592-speaker-follower-models-for-vision-and-language-navigation">Speaker-Follower Models for Vision-and-Language Navigation</a></h4><p><em>Future</em></p><p>Navigation guided by natural language instructions presents a challenging reasoning problem for instruction followers. Natural language instructions typically identify only a few high-level decisions and landmarks rather than complete low-level motor behaviors; much of the missing information must be inferred based on perceptual context. In machine learning settings, this is doubly challenging: it is difficult to collect enough annotated data to enable learning of this reasoning process from scratch, and also difficult to implement the reasoning process using generic sequence models. Here we describe an approach to vision-and-language navigation that addresses both these issues with an embedded speaker model. We use this speaker model to (1) synthesize new instructions for data augmentation and to (2) implement pragmatic reasoning, which evaluates how well candidate action sequences explain an instruction. Both steps are supported by a panoramic action space that reflects the granularity of human-generated instructions. Experiments show that all three components of this approach&#8202;&#8212;&#8202;speaker-driven data augmentation, pragmatic reasoning and panoramic action space&#8202;&#8212;&#8202;dramatically improve the performance of a baseline instruction follower, more than doubling the success rate over the best existing approach on a standard benchmark.</p><h4><a href="http://papers.nips.cc/paper/7617-neural-code-comprehension-a-learnable-representation-of-code-semantics">Neural Code Comprehension: A Learnable Representation of Code Semantics</a></h4><p><em>The alchemy</em></p><p>With the recent success of embeddings in natural language processing, research has been conducted into applying similar methods to code analysis. Most works attempt to process the code directly or use a syntactic tree representation, treating it like sentences written in a natural language. However, none of the existing methods are sufficient to comprehend program semantics robustly, due to structural features such as function calls, branching, and interchangeable order of statements. In this paper, we propose a novel processing technique to learn code semantics, and apply it to a variety of program analysis tasks. In particular, we stipulate that a robust distributional hypothesis of code applies to both human- and machine-generated programs. Following this hypothesis, we define an embedding space, inst2vec, based on an Intermediate Representation (IR) of the code that is independent of the source programming language. We provide a novel definition of contextual flow for this IR, leveraging both the underlying data- and control-flow of the program. We then analyze the embeddings qualitatively using analogies and clustering, and evaluate the learned representation on three different high-level tasks. 
We show that even without fine-tuning, a single RNN architecture and fixed inst2vec embeddings outperform specialized approaches for performance prediction (compute device mapping, optimal thread coarsening); and algorithm classification from raw code (104 classes), where we set a new state-of-the-art.</p><h4><a href="http://papers.nips.cc/paper/8068-generalisation-of-structural-knowledge-in-the-hippocampal-entorhinal-system">Generalisation of structural knowledge in the hippocampal-entorhinal system</a></h4><p><em>Moonshot</em></p><p>A central problem to understanding intelligence is the concept of generalisation. This allows previously learnt structure to be exploited to solve tasks in novel situations differing in their particularities. We take inspiration from neuroscience, specifically the hippocampal-entorhinal system known to be important for generalisation. We propose that to generalise structural knowledge, the representations of the structure of the world, i.e. how entities in the world relate to each other, need to be separated from representations of the entities themselves. We show, under these principles, artificial neural networks embedded with hierarchy and fast Hebbian memory, can learn the statistics of memories and generalise structural knowledge. Spatial neuronal representations mirroring those found in the brain emerge, suggesting spatial cognition is an instance of more general organising principles. We further unify many entorhinal cell types as basis functions for constructing transition graphs, and show these representations effectively utilise memories. We experimentally support model assumptions, showing a preserved relationship between entorhinal grid and hippocampal place cells across environments.</p><h4><a href="http://papers.nips.cc/paper/7419-where-do-you-think-youre-going-inferring-beliefs-about-dynamics-from-behavior">Where Do You Think You&#8217;re Going?: Inferring Beliefs about Dynamics from&nbsp;Behavior</a></h4><p><em>Internals</em></p><p>Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules&#8202;&#8212;&#8202;the dynamics&#8202;&#8212;&#8202;governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user&#8217;s internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. 
We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.</p><h4><a href="http://papers.nips.cc/paper/7775-task-driven-convolutional-recurrent-models-of-the-visual-system">Task-Driven Convolutional Recurrent Models of the Visual&nbsp;System</a></h4><p><em>Big problems/Future</em></p><p>Feed-forward convolutional neural networks (CNNs) are currently state-of-the-art for object classification tasks such as ImageNet. Further, they are quantitatively accurate models of temporally-averaged responses of neurons in the primate brain&#8217;s visual system. However, biological visual systems have two ubiquitous architectural features not shared with typical CNNs: local recurrence within cortical areas, and long-range feedback from downstream areas to upstream areas. Here we explored the role of recurrence in improving classification performance. We found that standard forms of recurrence (vanilla RNNs and LSTMs) do not perform well within deep CNNs on the ImageNet task. In contrast, novel cells that incorporated two structural features, bypassing and gating, were able to boost task accuracy substantially. We extended these design principles in an automated search over thousands of model architectures, which identified novel local recurrent cells and long-range feedback connections useful for object recognition. Moreover, these task-optimized ConvRNNs matched the dynamics of neural activity in the primate visual system better than feedforward networks, suggesting a role for the brain&#8217;s recurrent connections in performing difficult visual behaviors.</p><div><hr></div><p><a href="https://medium.com/data-science/neurips-2018-reading-list-from-tue-poster-sessions-a-b-fce561e56be8">NIPS/NeurIPS 2018: Best* of the First Two Poster Sessions</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded></item></channel></rss>