The Speech and Music AI: Current Research Frontier workshop was held on 28 April 2025 at the NUS School of Computing as a side event welcoming distinguished researchers attending ICLR 2025 in Singapore. We were honoured to host professors, scientists, and students from institutions including CUHK-Shenzhen, QMUL, Belmont University, UPF, UCSD, MBZUAI, NYU, and Meta, who shared their cutting-edge research on speech and music AI technologies. The workshop was chaired by Prof. Ye Wang and was open to all faculty and students of the School of Computing.
[01] Recent Advances of Speech Generation and DeepFake Detection
Abstract: This talk will cover recent research activities by the Amphion team led by Prof. Zhizheng Wu, focusing on spoken language understanding, speech generation, and unified and foundation models for speech synthesis, voice conversion, and speech enhancement. It will be accompanied by three highlight talks by Prof. Wu’s PhD students, who presented their work at ICLR 2025 in Singapore.
[01.1] Recent Advances of Speech Generation and DeepFake Detection
Presented by Zhizheng Wu, Associate Professor, Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen).
[01.2] Controllable and Unified Speech and Singing Voice Generation
Presented by Xueyao Zhang, PhD student, CUHK-Shenzhen.
[01.3] MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
Presented by Yuancheng Wang, PhD student, CUHK-Shenzhen.
[01.4] AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement
Presented by Junan Zhang, PhD student, CUHK-Shenzhen.
[02] Extending LLMs for Acoustic Music Understanding: Models, Benchmarks, and Multimodal Instruction Tuning
Abstract: Large Language Models (LLMs) have transformed learning and generation in text and vision, yet acoustic music—an inherently multimodal and expressive domain—remains underexplored. In this talk, I present recent progress in leveraging large-scale pre-trained models and instruction tuning for music understanding. I introduce MERT, a self-supervised acoustic music model with over 50k downloads, and MARBLE, a universal benchmark for evaluating music audio representations. I also present MusiLingo, a system that aligns pre-trained models across modalities to support music captioning and question answering. To address the gap in evaluating instruction-following capabilities, I propose CMI-Bench, the first benchmark designed to test models' ability to understand and follow complex music-related instructions across audio, text, and symbolic domains. I conclude by discussing open challenges in the responsible deployment of generative music AI.
Presented by Yinghao Ma, PhD student in AI Music, Queen Mary University of London (QMUL).
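As a rough illustration of the kind of data involved in music instruction tuning, the snippet below sketches what a single music question-answering training example might look like. The field names, audio path, and text are invented for illustration and do not reflect the actual MusiLingo or CMI-Bench schemas.

```python
# A hypothetical instruction-tuning example for music question answering.
# Field names, the audio path, and the text are illustrative only; they do
# not reflect the actual MusiLingo or CMI-Bench data formats.
music_qa_example = {
    "audio": "clips/example_0001.wav",   # reference to an audio clip (hypothetical path)
    "instruction": "Describe the instrumentation and mood of this excerpt.",
    "response": "A solo acoustic guitar plays a slow arpeggiated figure, "
                "giving the excerpt a calm, reflective mood.",
}

def to_prompt(example: dict) -> str:
    """Flatten one example into a text prompt; the audio would be injected
    as encoder features (e.g. self-supervised music embeddings) by the
    multimodal model rather than as text."""
    return (
        "<audio>\n"                      # placeholder token for audio features
        f"Instruction: {example['instruction']}\n"
        "Answer:"
    )

if __name__ == "__main__":
    print(to_prompt(music_qa_example))
    print("Target:", music_qa_example["response"])
```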
[03] Flocoder: A Latent Flow-Matching Model for Symbolic Music Generation and Analysis
Abstract: We present work in progress on an open-source framework for latent flow matching that generates MIDI outputs for inpainting tasks such as continuation, melody generation, and accompaniment generation. This is a reframing of a prior diffusion-based system ("Pictures of MIDI", arXiv:2407.01499) for more efficient inference. A further goal is to constrain the quantized latent representations to correspond to repeated musical motifs, allowing the embeddings to be used for motif analysis. This is all presented in a new open-source framework called "flocoder", to which interested students are invited to contribute!
Presented by Scott H. Hawley, Professor, Belmont University.
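Latent flow matching itself is compact enough to sketch. The minimal PyTorch example below trains a toy velocity field on linear interpolation paths between Gaussian noise and random vectors standing in for quantized MIDI latents; it illustrates the general technique named in the abstract and is not the flocoder implementation, so every module, dimension, and hyperparameter here is assumed.

```python
# Minimal flow-matching training step on toy latent codes.
# Generic illustration only, not the flocoder code; the latent dimension,
# model, and data are placeholders.
import torch
import torch.nn as nn

latent_dim = 64

# Toy velocity-field network: predicts d x_t / d t from (x_t, t).
velocity_net = nn.Sequential(
    nn.Linear(latent_dim + 1, 256),
    nn.SiLU(),
    nn.Linear(256, latent_dim),
)
optimizer = torch.optim.Adam(velocity_net.parameters(), lr=1e-4)

def flow_matching_step(x1: torch.Tensor) -> float:
    """One training step: x1 is a batch of 'data' latents (here random
    stand-ins for encoded MIDI); x0 is Gaussian noise."""
    x0 = torch.randn_like(x1)                       # noise endpoint
    t = torch.rand(x1.size(0), 1)                   # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1                      # linear interpolation path
    target_velocity = x1 - x0                       # d x_t / d t along that path
    pred = velocity_net(torch.cat([xt, t], dim=-1))
    loss = ((pred - target_velocity) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

for step in range(3):                               # demo loop on random latents
    print(flow_matching_step(torch.randn(32, latent_dim)))
```

At inference time, samples would be drawn by integrating the learned velocity field from noise at t = 0 to data at t = 1, for example with a handful of Euler steps.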
[04] AI-Enhanced Music Learning
Abstract: Learning to play a musical instrument is a difficult task that requires sophisticated skills. This talk will describe approaches to designing and implementing new interaction paradigms for music learning and training based on state-of-the-art AI techniques applied to multimodal (audio, video, and motion) data.
Presented by Rafael Ramirez, Associate Professor, Universitat Pompeu Fabra (UPF).
[05] Presto! Distilling Steps and Layers for Accelerating Music Generation
Abstract: Despite advances in diffusion-based text-to-music (TTM) methods, efficient, high-quality generation remains a challenge. We introduce Presto!, an approach to inference acceleration for score-based diffusion transformers that reduces both the number of sampling steps and the cost per step. To reduce steps, we develop a new score-based distribution matching distillation (DMD) method for the EDM family of diffusion models, the first GAN-based distillation method for TTM. To reduce the cost per step, we develop a simple but powerful improvement to a recent layer distillation method that improves learning by preserving hidden-state variance. Finally, we combine our improved step and layer distillation methods into a dual-faceted approach. We evaluate our step and layer distillation methods independently and show that each yields best-in-class performance. Furthermore, we find that our combined distillation method can generate high-quality outputs with improved diversity, accelerating our base model by 10-18x (a 32-second output in 230 ms, 15x faster than the comparable SOTA model) -- the fastest high-quality TTM model to our knowledge.
Presented by Zachary Novack, PhD student, University of California San Diego (UCSD).
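For readers unfamiliar with step distillation, the toy sketch below shows the basic idea of regressing a one-step student onto the output of a multi-step teacher sampler. It deliberately uses plain output matching rather than Presto!'s GAN-based distribution matching distillation or its layer-distillation method, and every network, dimension, and step count is a placeholder.

```python
# Toy illustration of step distillation: a one-step student regresses the
# output of a multi-step teacher sampler. Generic output-matching sketch,
# NOT Presto!'s DMD or layer-distillation method; all components are toys.
import torch
import torch.nn as nn

dim, teacher_steps = 32, 8

teacher = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, dim))
student = nn.Sequential(nn.Linear(dim, 64), nn.SiLU(), nn.Linear(64, dim))
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

@torch.no_grad()
def teacher_sample(noise: torch.Tensor) -> torch.Tensor:
    """Stand-in for an expensive multi-step sampler (e.g. a diffusion ODE solver)."""
    x = noise
    for _ in range(teacher_steps):
        x = x + teacher(x) / teacher_steps   # pretend each call is one solver step
    return x

def distill_step(batch_size: int = 64) -> float:
    noise = torch.randn(batch_size, dim)
    target = teacher_sample(noise)           # many network evaluations
    pred = student(noise)                    # a single network evaluation
    loss = ((pred - target) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

for _ in range(3):                           # demo loop
    print(distill_step())
```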
[06] Current Research at the Music X Lab
Abstract: In this talk, Gus will provide a brief overview of research at the Music X Lab, highlighting key projects at the intersection of music and artificial intelligence. He will also share reflections on the evolving landscape of Music AI in the age of large language models (LLMs), with a focus on the unique role and value of academic research in shaping its future. The talk will be accompanied by three highlight talks by Prof. Xia’s students, who presented their work at ICLR 2025 in Singapore.
[06.1] Towards Meaningful (Music) AI Research
Presented by Gus Xia, Associate Professor, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI).
[06.2] EXPOTION: Facial Expression and Motion Control for Multimodal Music Generation
Presented by Xinyue Li, PhD student, MBZUAI.
[06.3] Emergent Content-Style Disentanglement via Variance-Invariance Constraints
Presented by Yuxuan Wu, PhD student, MBZUAI.
[06.4] Versatile Symbolic Music-for-Music Modeling via Function Alignment
Presented by Junyan Jiang, PhD student, New York University (NYU).
[07] Natural Language-Driven Audio Intelligence for Content Creation
Abstract: In the digital age, audio content creation has transcended traditional boundaries, becoming a dynamic field that fuses technology, creativity, and user experience. This talk will discuss recent advances in natural language-driven audio AI technologies that reshape human interaction with audio content creation. We first introduce our work on language-queried audio source separation (LASS), which aims to extract desired sounds from audio recordings using natural language queries. We’ll present AudioSep, a foundation model we proposed for separating speech, music, and sound events using natural language queries. We will then discuss our work on language-modelling and latent diffusion-based (AudioLDM) approaches for audio generation. Finally, we will introduce WavJourney, a Large Language Model (LLM)-based AI agent for compositional audio creation. WavJourney is capable of crafting vivid audio storytelling with personalized voices, music, and sound effects simply from text instructions. We will further discuss the potential of our proposed compositional approach for audio generation, presenting our experimental findings and the state-of-the-art results we’ve achieved on text-to-audio generation tasks.
Presented by Xubo Liu, Research Scientist, Meta.
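To make language-queried source separation concrete, the toy model below conditions a spectrogram mask on a mean-pooled text-query embedding. This is an illustrative sketch only and is unrelated to the actual AudioSep architecture; the vocabulary size, network sizes, and tensors are invented.

```python
# Toy text-queried source separation in the spirit of LASS:
# a text embedding conditions a mask over the mixture spectrogram.
# Illustrative sketch only; not the AudioSep architecture.
import torch
import torch.nn as nn

class ToyLASS(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=64, n_freq=257):
        super().__init__()
        self.text_encoder = nn.EmbeddingBag(vocab_size, embed_dim)   # mean-pooled word embeddings
        self.mask_net = nn.Sequential(
            nn.Linear(n_freq + embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_freq),
            nn.Sigmoid(),                                            # per-bin mask in [0, 1]
        )

    def forward(self, mixture_spec, query_tokens):
        # mixture_spec: (batch, time, n_freq) magnitude spectrogram
        # query_tokens: (batch, n_tokens) integer word ids for the text query
        q = self.text_encoder(query_tokens)                          # (batch, embed_dim)
        q = q.unsqueeze(1).expand(-1, mixture_spec.size(1), -1)      # broadcast over time
        mask = self.mask_net(torch.cat([mixture_spec, q], dim=-1))
        return mask * mixture_spec                                   # masked (separated) spectrogram

model = ToyLASS()
mixture = torch.rand(2, 100, 257)            # fake magnitude spectrograms
query = torch.randint(0, 1000, (2, 5))       # fake tokenized queries, e.g. "a dog barking"
separated = model(mixture, query)
print(separated.shape)                        # torch.Size([2, 100, 257])
```

In a real system the text encoder would be a pretrained language or CLAP-style model and the mask network a far larger separation backbone; the point here is only the interface: a mixture plus a text query in, a separated signal out.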