Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham

Language Model ICML 2026 Workshop

AgentVidBench: A Multi-Hop Video Question Answering Benchmark for Evaluating MLLM Agents

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

Language Model ICML 2026 Workshop

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

Language Model ICML 2026 Workshop

Identifiable Token Correspondence for World Models

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

Language Model ICML 2026

How to Correctly Report LLM-as-a-Judge Evaluations

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Language Model ICML 2026

ReJump: A Tree-Jump Representation for Analyzing and Improving LLM Reasoning

Yuchen Zeng, Shuibai Zhang, Wonjun Kang, Shutong Wu, Lynnix Zou, Ying Fan, Heeju Kim, Ziqian Lin, Jungtaek Kim, Hyung Il Koo, Dimitris Papailiopoulos, Kangwook Lee

Language Model ICML 2026

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Seojeong Park*, Jiho Choi*, Junyong Kang, Seonho Lee, Jaeyo Shin, Hyunjung Shim

Language Model ICML 2026

Lookahead Unmasking Elicits Accurate Decoding in Diffusion Language Models

Sanghyun Lee, Seungryong Kim, Jongho Park, Dongmin Park

Language Model ICML 2026

Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

Dongmin Park, Minkyu Kim, Beongjun Choi, Junhyuck Kim, Keon Lee, Jonghyun Lee, Inkyu Park, Byeong-Uk Lee, Jaeyoung Hwang, Jaewoo Ahn, Ameya S. Mahabaleshwarkar, Bilal Kartal, Pritam Biswas, Yoshi Suhara, Kangwook Lee, Jaewoong Cho

Language Model

Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Sungmin Cha, Minjoon Seo

Language Model ACL 2026
