Zirui Song — PhD @ MBZUAI · NLP & Multimodal AI

Zirui Song /ˈziːˌruː.i/ /sɔːŋ/

First-year PhD student in NLP at Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi.

PhD @ MBZUAI · NLP · Multimodal AI · Geolocation · Trustworthy MLLMs

I am supervised by Prof. Xiuying Chen and Prof. Xiaojun Chang. I received my Bachelor of Engineering (Honours) in Software Engineering with First Class Honours from the University of Technology Sydney, where I also received the Dean's List 2025 prize (Top 2% of students).

Before that, I was a member of UTS-NLP from Oct 2023, where I was fortunate to be advised by Prof. Ling Chen and mentored by Prof. Meng Fang.

Currently working on multimodal reasoning, geolocation, embodied agents, and trustworthy MLLMs.

Google Scholar GitHub Email CV (PDF)

2026-04-29 · 2 papers were accepted by ICML 2026.
2026-04-07 · 6 papers (1 oral) were accepted by ACL 2026.
2025-11-08 · One paper was accepted by AAAI 2026.
2025-09-19 · One paper was accepted by NeurIPS 2025.
2025-08-21 · 3 papers (1 oral) were accepted by EMNLP 2025.
2025-07-25 · I was admitted to the degree of Bachelor of Engineering (Honours) in Software Engineering with First Class Honours.
2025-07-06 · I won the Dean's List 2025 prize (Top 2% of students) from UTS.
2025-07-02 · One paper was accepted by ECAI 2025.
2025-05-16 · One paper was accepted by ACL 2025.
2025-04-20 · One paper was accepted by Nature Computational Science.
2025-03-01 · Admitted to MBZUAI, where I commenced my PhD studies in August 2025.
2025-01-23 · One paper was accepted by NAACL 2025.
2025-01-02 · One paper was accepted by Communications Chemistry.
2024-09-25 · First day as a visiting student at MBZUAI under the supervision of Prof. Xiuying Chen.
2024-09-20 · One paper was accepted by EMNLP 2024.
2024-07-01 · One paper was accepted by ECCV 2024.

Earlier news

2023-11-29 · Prof. Ling Chen accepted me as an undergraduate research assistant at the Australian Artificial Intelligence Institute (AAII).
2023-07-01 · I was selected as an international exchange student majoring in Software Engineering at UTS.

Five selected works — first-author or co-first-author. The full list, including the freshest ICML 2026 and ACL 2026 acceptances, lives on Google Scholar.

ACL 2026 18 citations

Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models

Zirui Song, Qian Jiang, Mingxuan Cui, Mingzhe Li, Lang Gao, Zeyu Zhang, Zixiang Xu, Yanbo Wang, Chenxi Wang, Guangxian Ouyang, Zhenhao Chen, Xiuying Chen

An open, comprehensive benchmark exposing how large audio-language models can be jailbroken through the audio channel.

AAAI 2026

ManipLVM-R1: Reinforcement Learning for Reasoning in Embodied Manipulation with Large Vision-Language Models

Zirui Song, Guangxian Ouyang, Mingzhe Li, Yuheng Ji, Chenxi Wang, Zixiang Xu, Zeyu Zhang, Xiaoqing Zhang, Qian Jiang, Zhenhao Chen, Zhongzhi Li, Rui Yan, Xiuying Chen

Reinforcement learning for reasoning in embodied manipulation, aligning large vision-language models with affordance and trajectory rewards.

EMNLP 2025 Findings 143 citations

Injecting Domain-Specific Knowledge into Large Language Models: A Comprehensive Survey

Zirui Song*, Bin Yan*, Yuhan Liu, Miao Fang, Mingzhe Li, Rui Yan, Xiuying Chen

A taxonomy of domain knowledge injection: dynamic injection, static embedding, modular adapters, and prompt optimization.

NAACL 2025 23 citations

Hazards in Daily Life? Enabling Robots to Proactively Detect and Resolve Anomalies

Zirui Song*, Guangxian Ouyang*, Meng Fang, Hongbin Na, Zijing Shi, Zhenhao Chen, Yujie Fu, Zeyu Zhang, Shiyu Jiang, Miao Fang, Ling Chen, Xiuying Chen

AnomalyGen builds virtual anomaly scenes without human annotation to make household robots more robust.

ECCV 2024 60 citations

BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models

Rizhao Cai*, Zirui Song*, Dayan Guan, Zhenhao Chen, Xing Luo, Chenyu Yi, Alex Kot

A comprehensive benchmark for the cross-style visual capability of LMMs. Toolkit & Code

* equal contribution

Multimodal AI

My current research goal is to integrate multimodal information to improve the performance of large language models, with applications in geolocation and embodied AI.

MLLMs · Geolocation · Embodied AI · World Understanding

Trustworthy AI

I am also interested in jailbreak and attack issues of multimodal language models, particularly in vision and audio modalities.

Jailbreak · Vision Safety · Audio Safety · Robustness

One question keeps pulling me back: given a photograph of a street it has never seen, can a model reason its way home?

2025.08 – 2029.05

PhD, Mohamed bin Zayed University of Artificial Intelligence

Expected · NLP Department · UAE Government Scholarship

2021.06 – 2025.05

B.E. (Honours), University of Technology Sydney

Software Engineering · First Class Honours · GPA 3.90/4.00 · Dean's List 2025 (Top 2%)

2026.03 – now

Alibaba Group

Algorithm Engineer (Research Intern) · Supervised by Xiang Wang

2024.09 – now

MBZUAI

Visiting student → PhD · Supervisor: Prof. Xiuying Chen · Trustworthy MLLMs

2023.10 – 2025.02

University of Technology Sydney

Research Intern · Prof. Ling Chen & Prof. Meng Fang · Multimodal Agents

2023.03 – 2024.01

Nanyang Technological University

Research Intern · Supervisor: Prof. Alex Kot · Multimodal

🎖️ Dean's List 2025 (Top 2% of students), University of Technology Sydney · certificate
🥈 Silver Medal, Kaggle, LLM Science Exam [51 / 2664], 2024
🥇 School Second Class Scholarship, 2022

Conference Reviewer

NeurIPS 2026 · ICML 2026 · ACL 2026 · ECCV 2026 · EACL 2026 · ICLR 2026 · AAAI 2026 · EMNLP 2025 · COLM 2025 · ACM MM 2025 · NeurIPS 2025 · ACL 2025 · NAACL 2025 · ICME 2025 · IJCAI 2025 · EMNLP 2024

Journal Reviewer

IEEE TPAMI IEEE TAI

05/24 · National Undergraduate Innovation Project Documentation (in Chinese)
03/24 · Negative Transfer (in Chinese)
03/24 · Mixture of Experts Explained (in Chinese)
01/24 · EMNLP 2020 Tutorial Notes: Explainable AI (in Chinese)

First year of the PhD. The desert keeps its own hours, and so do I.

Lately I have been circling one question: what does it mean for a model to understand the world it has been shown? Not the act of prediction (we have plenty of that), but the quieter thing underneath. Whether it knows where it stands. Whether it can tell when it is wrong. Whether, given a photograph of a street it has never seen, it can reason its way home.

Most of what I work on lives near this question. Multimodal reasoning. Geolocation. Embodied agents that must act in places they have never been. The trust we extend, or refuse, to what a model claims it sees. I keep walking into the same room through different doors.

I read more than I write. I rewrite more than I publish. Some of the papers listed under my name belong to a younger version of me, and I am still learning how to be honest about that.

I have never loved being alive this much. New cities. New languages overheard on the bus. New collaborators who became friends before they became coauthors. I owe the courage of this season to the UAE Government Scholarship, which let me walk through doors I had only read about. I do not take that lightly.

What I am holding this season

a draft I do not yet know how to finish
a suspicion that our benchmarks have been answering the wrong question
a quiet thank-you to the people who wrote back to me when I was an undergraduate and unsure of everything

The plan, if it can be called one, is to stay here long enough to plant something. The desert is not empty; it is patient. I would like to grow a small oasis on it, the slow kind, one paper, one student, one honest conversation at a time.

If you are working on something you cannot let go of, I would like to hear about it. My inbox is mostly quiet after midnight.

Last updated: May 2026

The fastest way to reach me is email. My inbox is mostly quiet after midnight — that is when I read carefully.

✉️ ziruisong2003@gmail.com
🎓 Google Scholar
🐙 github.com/ZiruiSongBest
📍 MBZUAI, Abu Dhabi, United Arab Emirates
📄 Curriculum Vitae (PDF)