AudioEditor: A Training-Free Diffusion-Based
Audio Editing Framework

Yuhang Jia¹, Yang Chen¹, JingHua Zhao¹, Shiwan Zhao¹

Wenjia Zeng², Yong Chen², Yong Qin¹,*

¹College of Computer Science, Nankai University, Tianjin, China

²Lingxi (Beijing) Technology Co., Ltd., Beijing, China

*Corresponding author

[Paper on ArXiv] [Code on GitHub]

Abstract

Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging Latent Diffusion Models (LDMs) to produce high-quality, diverse, and instruction-relevant audio. Beyond generation, however, audio editing is equally important but has received comparatively little attention. Audio editing faces two primary challenges: executing precise edits and preserving the unedited sections. While LDM-based workflows have effectively addressed these challenges in image processing, similar approaches have rarely been applied to audio editing. In this paper, we introduce AudioEditor, a training-free audio editing framework built on a pretrained diffusion-based TTA model. AudioEditor incorporates Null-text Inversion and EOT-Suppression, enabling the model to preserve original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high-quality audio edits.

Figure 1: The overall workflow of AudioEditor. The workflow can be divided into five parts: A) audio space processing, B) spectrogram space processing, C) latent space processing, D) DDIM Inversion and Null-text Optimization, and E) EOT-Suppression and Attention Loss updating.
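To make the DDIM Inversion step in part D more concrete, the following is a minimal sketch of deterministic DDIM inversion (eta = 0) on a toy latent. It is not the AudioEditor implementation: `toy_eps_model`, the `alphas` schedule, and the three-element latent are hypothetical stand-ins for the pretrained TTA denoiser, its noise schedule, and the encoded spectrogram latent.

```python
import math


def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update (eta = 0) between two cumulative alpha values."""
    # Estimate the clean latent implied by the current latent and predicted noise.
    x0 = [(xi - math.sqrt(1 - alpha_from) * ei) / math.sqrt(alpha_from)
          for xi, ei in zip(x, eps)]
    # Re-noise that estimate to the target noise level with the same noise.
    return [math.sqrt(alpha_to) * x0i + math.sqrt(1 - alpha_to) * ei
            for x0i, ei in zip(x0, eps)]


def toy_eps_model(x, t):
    # Hypothetical noise predictor: a constant, used only so the round trip is exact.
    # A real pipeline would call the pretrained denoising network here.
    return [0.1 for _ in x]


def ddim_invert(latent, alphas, eps_model):
    """Walk from the clean latent toward noise, storing every intermediate latent."""
    x = list(latent)
    trajectory = [x]
    for t in range(len(alphas) - 1):
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alphas[t], alphas[t + 1])
        trajectory.append(x)
    return trajectory


def ddim_sample(noisy, alphas, eps_model):
    """Walk back from the noisy latent to the clean one."""
    x = list(noisy)
    for t in range(len(alphas) - 1, 0, -1):
        eps = eps_model(x, t)
        x = ddim_step(x, eps, alphas[t], alphas[t - 1])
    return x


# Assumed cumulative alpha schedule, ordered from nearly clean to nearly pure noise.
alphas = [0.99, 0.9, 0.7, 0.4, 0.1]
latent = [0.5, -1.2, 0.3]

trajectory = ddim_invert(latent, alphas, toy_eps_model)
reconstructed = ddim_sample(trajectory[-1], alphas, toy_eps_model)
```

With the constant toy predictor, sampling retraces the inversion trajectory exactly. With a real network the inversion is only approximate, because the noise predicted at one step stands in for the noise at the next, and the gap typically grows under classifier-free guidance. That drift is what null-text optimization is generally used to correct, by tuning the unconditional embedding at each step so that sampling follows the stored trajectory.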

Delete

Original prompt: Sound of the car horn was followed by laughter.

Target prompt: Sound of the car horn was followed by ~~laughter~~.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: After a gunshot, there was a burst of dog barking.

Target prompt: After a gunshot, there was a burst of ~~dog barking~~.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: Keys jingle as a car attempts to start.

Target prompt: ~~Keys jingle~~ as a car attempts to start.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: A cat is meowing in noise.

Target prompt: A cat is meowing in ~~noise~~.

Origin	Auffusion	Baseline	AudioEditor

Replace

Original prompt: After a gunshot, there was a burst of dog barking.

Target prompt: After a thunder, there was a burst of dog barking.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: A baby is crying.

Target prompt: A baby is laughing.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: An old man speaking.

Target prompt: An young man speaking.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: Playing joyful melodies on the piano.

Target prompt: Playing joyful melodies on the drums.

Origin	Auffusion	Baseline	AudioEditor

Add

Original prompt: A woman is giving a speech.

Target prompt: A woman is giving a speech amid cheers.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: A crowd applauds.

Target prompt: A crowd applauds, while music plays.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: A baby is crying.

Target prompt: A baby is crying as wind blowing.

Origin	Auffusion	Baseline	AudioEditor

Original prompt: Birds are chirping in the rain.

Target prompt: Birds are chirping in the rain and thunder rumbles.

Origin	Auffusion	Baseline	AudioEditor