ICLR – notes


Day 1 – Thursday

Keynote talk

The Challenges of Human-Centered AI and Robotics: What We Want, Need, and are Getting From Human-Machine Interaction

Maja Matarić

Why do I/we work on AI?

  • Intellectual curiosity 

Intelligence is a symphony of

physical + cognitive + social/socio-emotional abilities.

Human intelligence is:

  • Body + environment
  • Embodied and situated

Current model training for embodied intelligence is limited

Socio-emotional intelligence is still overlooked, yet is a key challenge & enabler. 

“Information does not drive change, motivation drives changes.”

  • Only life can influence life.

Physical embodiment is critical. 

  • Different parts of the brain activate when we are just talking vs. physically touching.
  • The puppeteer example: a baby-sized robot is used to attract a baby’s attention and coach the baby to kick its legs for exercise. The baby mimics the robot’s body movement.

From embodiment to persona: modeling personality

  • Evaluation in an exercise study: people liked the robots with a persona more.
  • People prefer robots whose personality is similar to their own.

Vulnerability elicits humans’ willingness to help the robot.

So the core idea of this talk is embodied, socio-emotionally centered AI.

  • Vocab:
    • “I’m pragmatic, personally”
    • Humanoids?
    • Pavlovian

Oral session 

Overthinking Reduction with Decoupled Rewards and Curriculum Data Scheduling

A longer thinking sequence does not necessarily increase model performance.

How do we shorten the thinking sequence?

Label whether a reasoning chunk contains the ground-truth answer.

Instead of directly using a strong model for labeling, use a smaller model and distill the labeling knowledge from a stronger teacher model.

They trained a non-necessary-reasoning chunk detector (NRP detector), which detects whether a reasoning chunk is redundant.

They then designed a loss function to penalize redundant reasoning chunks, sketched below.
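A hedged sketch of how such a penalty could be wired into a token-level loss, assuming a per-token mask from the NRP detector. This is my reconstruction for illustration, not the paper’s actual objective:

```python
import jax
import jax.numpy as jnp
import optax

def redundancy_penalized_loss(logits, targets, redundant_mask, penalty=1.0):
    """Token-level cross-entropy with an unlikelihood-style penalty on tokens
    the NRP detector flags as redundant.

    logits: [seq, vocab]; targets: [seq] int token ids;
    redundant_mask: [seq] floats, 1.0 inside redundant chunks.
    Reconstruction only; the paper's actual loss may differ.
    """
    ce = optax.softmax_cross_entropy_with_integer_labels(logits, targets)
    keep = (1.0 - redundant_mask) * ce  # learn the necessary tokens as usual
    # Push probability mass away from redundant tokens (unlikelihood term).
    p_target = jnp.take_along_axis(
        jax.nn.softmax(logits, axis=-1), targets[:, None], axis=-1)[:, 0]
    unlikelihood = -jnp.log(jnp.clip(1.0 - p_target, 1e-6, 1.0))
    return jnp.mean(keep + penalty * redundant_mask * unlikelihood)
```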

Question:

  • Are there cases where the reasoning chunk after a “wait, but” is actually helpful for model performance, e.g. when the chunk before the “wait, but” was wrong?

Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents

Active reasoning 

  • When the model receives partial or incomplete information, it actively uses tools or asks the user for more information before responding.

Why can LLM agents get trapped during active reasoning?

  • Why do LLMs lose track of what they have asked before and keep asking the same question?
  • Because even if a previously asked question and its answer are within the model’s context window, the model may still not pay enough attention to them.

Thu 23 Apr, 3:15 p.m. to 5:45 p.m. (UTC-03)

Poster P3- #1716

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent

Addresses the challenge of long-context tasks through enhanced memory management.

How does this work? 

Maintain a fixed-token-length memory and read the document chunk by chunk.
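As a sketch of the loop (the `llm` callable and the prompts are hypothetical, not the paper’s actual interface):

```python
def answer_long_document(llm, question, document, chunk_size=4096):
    """MemAgent-style loop: fixed-length memory, read the document chunk by chunk.

    `llm` is a hypothetical callable (prompt -> text), not the paper's API.
    """
    memory = ""
    for start in range(0, len(document), chunk_size):
        chunk = document[start:start + chunk_size]
        # The model overwrites its own memory within a fixed token budget,
        # keeping only what is relevant to the question.
        memory = llm(
            f"Question: {question}\nCurrent memory: {memory}\n"
            f"New chunk: {chunk}\n"
            "Rewrite the memory (max 1024 tokens) to keep question-relevant facts."
        )
    return llm(f"Question: {question}\nMemory: {memory}\nAnswer the question.")
```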

P4 #4902 3:15-5:45PM 

Verifying Chain-of-Thought Reasoning via Its Computational Graph

Revela: Dense Retriever Learning via Language Modeling

Task analogy:

  • In LM: next-token prediction.
  • In retrieval: next-chunk prediction.

RAIN-Merging: A Gradient-Free Method to Enhance Instruction Following in Large Reasoning Models with Preserved Thinking Format

  • Merges a large reasoning model with an instruction fine-tuned model.

RL 

  • When sampling, the reward model can be rule-based; it does not necessarily have to be an LLM.
  • When facing a very hard question where correct trajectories are sampled at a very low ratio, use the few correct trajectories as a golden set for a temporary round of SFT.
  • The loss functions of SFT and RL are different.

When sampling in RL, we can let the model first come up with 5 different strategies, and then, based on each of these strategies, come up with 5 responses per strategy (see the sketch below).
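A sketch of that two-stage sampling (the `generate` sampler and the prompts are hypothetical):

```python
def sample_by_strategy(generate, question, n_strategies=5, n_per_strategy=5):
    """Two-stage RL rollout sampling: strategies first, then responses.

    `generate` is a hypothetical prompt -> text sampler.
    """
    strategies = [
        generate(f"Propose one distinct high-level strategy for: {question}")
        for _ in range(n_strategies)
    ]
    rollouts = []
    for strategy in strategies:
        for _ in range(n_per_strategy):
            rollouts.append(
                generate(f"Question: {question}\nStrategy: {strategy}\nSolve it.")
            )
    return rollouts  # 25 rollouts, typically more diverse than flat sampling
```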

Actions Speak Louder than Prompts: A Large-Scale Study of LLMs for Graph Inference

  • For graph inference, do not put the whole graph in the prompt; instead, give the model tool extensions.
  • Let the model call the tools to retrieve related graph-node information step by step.

Day 2 – Oral Session 3C: ML architectures and training I

WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

In LLM pretraining:

  • Limitation of the standard cosine LRS.
  • WSD: warmup, stable, decay.

Limitation: you must decide in advance at which point to decay the learning rate. If more data arrives later, training has to be restarted from scratch.

In this paper, they propose WSM, a warmup-stable-merge framework for LLM pretraining LR scheduling. Instead of a decay phase, they merge checkpoints (a weighted merge) to simulate the effect of learning rate decay.

The benefit is flexibility. 

  • Does it still need the warmup stage for the learning rate?
    • Yes.

I can use this for a JAX learning rate schedule experiment.
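A minimal JAX sketch of the merging step, assuming checkpoints are parameter pytrees (uniform weights are my placeholder; WSM studies the weighting scheme and merge window themselves):

```python
import jax

def merge_checkpoints(params_list, weights=None):
    """Weighted average of parameter pytrees from several checkpoints.

    Uniform weights are a placeholder assumption, not the paper's recipe.
    """
    if weights is None:
        weights = [1.0 / len(params_list)] * len(params_list)

    def combine(*leaves):
        return sum(w * leaf for w, leaf in zip(weights, leaves))

    return jax.tree_util.tree_map(combine, *params_list)
```

Usage would be: train at a constant LR, save checkpoints every N steps, then evaluate `merge_checkpoints([...])` in place of a decayed final checkpoint.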

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

The task-sensitive TPP (tokens per parameter) trade-off

  • In pretraining, tokens per parameter (TPP) is the ratio of total pretraining tokens to total parameters; e.g. a 7B-parameter model pretrained on 140B tokens has TPP = 20.

For memorization tasks, low TPP is better.

For reasoning tasks, ~20 TPP performs best.

MoE activates only a subset of parameters/experts per token.

How sparse MoE works, at training and at inference (see the sketch after this list):

  • There is a router, a linear layer before the experts, that computes a score for each expert. Only the top-K experts with the highest scores are activated.
  • During training, an auxiliary loss is added to the router to prevent it from always routing to the same expert.
    • Each expert should be activated roughly the same number of times within a batch.
    • Gaussian noise is added to the activation scores for each expert.
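A minimal JAX sketch of such a noisy top-K router with a Switch-Transformer-style load-balancing auxiliary loss (the exact formulation varies between MoE papers; this is illustrative):

```python
import jax
import jax.numpy as jnp

def noisy_topk_route(router_w, x, key, k=2, noise_std=1.0):
    """Noisy top-k routing for a batch of tokens.

    x: [tokens, d_model]; router_w: [d_model, n_experts].
    Returns chosen expert ids, their gate weights, and a load-balancing
    auxiliary loss (Switch-Transformer style; formulations vary).
    """
    logits = x @ router_w                                       # expert scores
    logits += noise_std * jax.random.normal(key, logits.shape)  # exploration noise
    probs = jax.nn.softmax(logits, axis=-1)
    gate_vals, expert_ids = jax.lax.top_k(probs, k)             # keep top-k experts

    # Encourage each expert to receive roughly equal traffic in the batch:
    n_experts = probs.shape[-1]
    load = jax.nn.one_hot(expert_ids[:, 0], n_experts).mean(axis=0)
    importance = probs.mean(axis=0)
    aux_loss = n_experts * jnp.sum(load * importance)

    return expert_ids, gate_vals, aux_loss
```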

GRPO: what is it?

Reasoning tasks are “data-hungry”: reasoning ability peaks at TPP ≈ 20 (20 tokens per parameter). If you blindly increase the number of experts just to lower the loss (driving TPP too low), each expert sees too little data and reasoning ability actually degrades.

Does tokens per parameter mean how many tokens are assigned to each parameter during training? Yes, in pretraining.

Total tokens per parameter and the number of active parameters are key factors in reasoning performance. 

How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining

Data curriculum: sort the data in ascending order of quality score. The quality score measures content quality; it is not the difficulty of next-token prediction.

The warmup-stable-decay learning rate schedule wastes the high-quality data the model sees late in training.

Their solution is to fix the learning rate after the stable stage and then do model weight averaging(?).

Use case:

  • For JAX model training: fix the learning rate, do not decay it, and then do model averaging(?).
  • The latest training data in Discover has the highest quality.

Why do we warm up the learning rate, then hold it stable, and then decay it?

  • Why not a constant LR all the time?
  • Why not only warmup and then constant?

Warmup: avoids the weights exploding early in training.

Stable: steady learning at the full rate.

Decay: finally lands on a more precise, better minimum.
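The warmup-stable-decay shape is easy to reproduce from optax primitives; a sketch with placeholder step counts and learning rates (my numbers, not from any paper):

```python
import optax

def wsd_schedule(peak_lr=3e-4, warmup_steps=2_000, stable_steps=100_000,
                 decay_steps=20_000, end_lr=3e-5):
    """Warmup-stable-decay LR schedule built from optax primitives.

    Step counts and LR values are placeholders, not tuned numbers.
    """
    return optax.join_schedules(
        [
            optax.linear_schedule(0.0, peak_lr, warmup_steps),    # warmup
            optax.constant_schedule(peak_lr),                     # stable
            optax.linear_schedule(peak_lr, end_lr, decay_steps),  # decay
        ],
        boundaries=[warmup_steps, warmup_steps + stable_steps],
    )
```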

Find his poster session. 

  • P3-#521

Why does putting the highest-quality data in the middle achieve the highest performance?

  • No: this work puts the highest-quality data at the end of pretraining.

We cannot put the highest-quality data at the beginning because of catastrophic forgetting.

Softmax Transformers are Turing-Complete

Transformers are Turing-complete; a transformer is essentially an interpreter for arbitrary programs.

This ability comes from the chain-of-thought mechanism: each output token is fed back into the transformer, so the CoT can serve as unbounded memory.

Pre-training under infinite compute

What are the power law and the scaling law?

  • Scaling law: a power law relating training loss to model size and data size.

The pretraining loss decreases as a power law in compute, model size, and data size.
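For reference, a standard parametric form of this power law is the Chinchilla-style fit from Hoffmann et al. (2022); the constants are dataset- and setup-dependent:

```latex
% N = parameter count, D = training tokens, E = irreducible loss,
% A, B, \alpha, \beta = fitted constants.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```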

Invited talk: Images of the hidden universe 

Katie Bouman 

Caltech 


ThinKV: Thought-Adaptive KV Cache Compression for Efficient Reasoning Models

Problem: the KV cache size explodes in large reasoning models since the CoT is long. 

Solution:

  • Categorize thinking/reasoning tokens into 3 categories: reasoning, execution, and transition.
  • Think before you quantize: use more bits for the KV cache of reasoning and execution tokens, and fewer bits for transition tokens (a sketch of the quantization idea follows this list).
  • Think Before You Evict (TBE): a “proactive eviction” scheme. When the model hits a transition thought (indicating a change in reasoning direction), it aggressively prunes or “anneals” older, less relevant tokens from previous segments.
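A toy sketch of the mixed-precision idea (symmetric uniform quantization; the quantizer and bit widths are illustrative, not ThinKV’s actual design):

```python
import jax.numpy as jnp

def quantize_kv(kv, bits):
    """Symmetric uniform quantization of a KV-cache tensor to `bits` bits.

    Illustrates the mixed-precision idea only; ThinKV's actual quantizer
    and bit allocation may differ.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = jnp.max(jnp.abs(kv)) / qmax
    q = jnp.clip(jnp.round(kv / scale), -qmax, qmax).astype(jnp.int8)
    return q, scale  # dequantize as q.astype(kv.dtype) * scale

# Illustrative allocation per thought category (not the paper's numbers):
#   reasoning / execution tokens -> quantize_kv(kv, bits=8)
#   transition tokens            -> quantize_kv(kv, bits=4)
```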

What does “anneals” mean here?

MrRoPE: Mixed-radix Rotary Position Embedding

Coupling Experts and Routers in Mixture-of-Experts via an Auxiliary Loss

https://openreview.net/pdf?id=MpeyjgWbKt

Poster session

ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models

Day 3 – Reinforcement learning


Semi-Supervised Preference Optimization with Limited Feedback

  • Problem: RL-labeled preference data is scarce.
  • Solution: label SFT data as positive and negative and use it for RL.
  • Unresolved: how does the priming work? It uses a small amount of human preference data, …

Multiplayer Nash Preference Optimization

The Art of Scaling Reinforcement Learning Compute for LLMs

  • Problem

Off-policy vs. on-policy

How does the data quality score work?

  • Quality score: favors text that teaches the model reasoning, factual knowledge, and human language patterns, while minimizing noise and harmful content.
  • High-quality pages: Wikipedia pages, open-source code tutorials.
  • Low-quality pages: celebrity gossip blogs.

What is Mamba, the state space model (SSM)?

Mamba is a high-performance, linear-time sequence modeling architecture designed as a faster alternative to Transformers. It uses selective state space models (SSMs) to process sequences, achieving linear complexity, meaning it handles long contexts efficiently with lower memory usage and faster inference.
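Not Mamba’s selective mechanism, but the core linear-time recurrence can be shown with a toy diagonal SSM scanned in JAX: the state update costs O(1) per token, so a whole sequence is O(length) rather than attention’s O(length²):

```python
import jax
import jax.numpy as jnp

def diagonal_ssm(a, b, c, xs):
    """Toy diagonal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = c . h_t.

    a, b, c: [state_dim] vectors; xs: [seq_len] scalar inputs.
    Mamba makes (a, b, c) input-dependent ("selective"); this fixed version
    only demonstrates the linear-time recurrence.
    """
    def step(h, x):
        h = a * h + b * x
        return h, jnp.dot(c, h)

    _, ys = jax.lax.scan(step, jnp.zeros_like(a), xs)
    return ys  # [seq_len] outputs in O(seq_len) time
```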

Negotiating my salary

Mindset: the risk is not losing the offer. The risk is saying YES to the wrong offer.

Scaling law vs Power law

Building LLMs from scratch – Percy Liang

https://cs336.stanford.edu

Time Series Workshop

2014 – 2016: Scalable Bayesian time series models

2016 – 2023: Deep probabilistic models: from data-driven to hybrid models

2023 – present: Foundation models

How the journey began:

  • Predicting users’ next buying behavior at Amazon.

How to use large language models to solve time series problems

What does “standard time series models often assume the future follows a specific shape, like a bell curve (Gaussian)” mean?

  • “Often assume the future follows a specific shape”: presumably that the predictive distribution is restricted to a parametric family (e.g. Gaussian), so the model outputs only a mean and a variance and cannot represent multimodal futures.

Chronos – time series language model

Workshop day 1

Recursive self-improvement AI 

  • Can I use it to build an agent for my Grandpa? 

Workshop day 2

Attention sink: the model attends very strongly to the first few tokens.

The positional encoding penalizes long-distance attention, so the model also attends more to the end of the context window.

  • Positional encodings (see the ALiBi sketch after this list):
    • ALiBi (Attention with Linear Biases): the literal penalty.
    • RoPE (Rotary Position Embedding): the “spinning” vectors.
    • The original sinusoidal encoding proposed by the Attention Is All You Need paper was abandoned because:
      • It assigns a position encoding based on each token’s absolute position, whereas in language modeling we care more about tokens’ relative positions.
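A minimal JAX sketch of ALiBi’s penalty (single slope for simplicity; the real method uses a different fixed slope per head):

```python
import jax.numpy as jnp

def alibi_bias(seq_len, slope=0.5):
    """ALiBi's literal penalty: subtract slope * distance from attention logits.

    The real method uses a different fixed slope per head; one slope here.
    """
    pos = jnp.arange(seq_len)
    distance = pos[:, None] - pos[None, :]            # query index - key index
    bias = -slope * jnp.maximum(distance, 0)          # penalize far-back keys
    causal = jnp.where(distance >= 0, 0.0, -jnp.inf)  # mask future keys
    return bias + causal  # add to raw attention scores before softmax
```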

GEPA optimizer: is it sensitive to initialization states?

GRPO optimizer: what is it?
