AudioEditor: A Training-Free Diffusion-Based
Audio Editing Framework



Yuhang Jia1, Yang Chen1, JingHua Zhao1, Shiwan Zhao1

Wenjia Zeng2, Yong Chen2, Yong Qin1,*

1College of Computer Science, Nankai University, Tianjin, China

2Lingxi (Beijing) Technology Co., Ltd., Beijing, China

*Corresponding author


Abstract

Diffusion-based text-to-audio (TTA) generation has made substantial progress, leveraging latent diffusion models (LDMs) to produce high-quality, diverse, and instruction-relevant audio. Audio editing is equally important but has received comparatively little attention. It faces two primary challenges: executing precise edits and preserving the unedited sections. While LDM-based workflows have effectively addressed these challenges in image processing, similar approaches have rarely been applied to audio editing. In this paper, we introduce AudioEditor, a training-free audio editing framework built on a pretrained diffusion-based TTA model. AudioEditor incorporates Null-text Inversion and EOT-Suppression, enabling the model to preserve the original audio features while executing accurate edits. Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in delivering high-quality audio edits.

Figure 1: The overall workflow of AudioEditor. The workflow is divided into five parts: A) audio space processing, B) spectrogram space processing, C) latent space processing, D) DDIM Inversion and Null-text Optimization, and E) EOT-Suppression and Attention Loss updating.
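
To make part D of the workflow more concrete, the following is a minimal sketch of deterministic DDIM inversion on a toy latent. It is an illustration under stated assumptions, not the AudioEditor implementation: `toy_eps_model` is a hypothetical stand-in for the pretrained TTA noise predictor, and the noise schedule `alphas` and the latent shape are arbitrary choices. The sketch shows that a single DDIM step is exactly reversible when the same noise estimate is reused, while a full inversion followed by sampling drifts slightly because the noise is re-estimated at different points along the trajectory.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative alpha (alpha-bar) levels.

    Moving to a smaller alpha adds noise (inversion); moving to a larger
    alpha removes it (sampling).
    """
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

def toy_eps_model(x, alpha):
    """Hypothetical noise predictor; a real system would call the TTA U-Net."""
    return 0.1 * np.tanh(x) * np.sqrt(1.0 - alpha)

def ddim_invert(z0, alphas, eps_model):
    """Walk a clean latent toward noise and record every intermediate latent."""
    trajectory = [z0]
    for a_from, a_to in zip(alphas[:-1], alphas[1:]):
        z = trajectory[-1]
        trajectory.append(ddim_step(z, eps_model(z, a_from), a_from, a_to))
    return trajectory

def ddim_sample(z_noisy, alphas, eps_model):
    """Walk a noisy latent back toward the clean end of the schedule."""
    z = z_noisy
    reversed_alphas = alphas[::-1]
    for a_from, a_to in zip(reversed_alphas[:-1], reversed_alphas[1:]):
        z = ddim_step(z, eps_model(z, a_from), a_from, a_to)
    return z

rng = np.random.default_rng(0)
alphas = np.linspace(0.999, 0.05, 50)   # assumed schedule, clean -> noisy
z0 = rng.standard_normal((4, 8))        # assumed toy latent shape

trajectory = ddim_invert(z0, alphas, toy_eps_model)
z_rec = ddim_sample(trajectory[-1], alphas, toy_eps_model)
drift = float(np.abs(z_rec - z0).max())
print(f"max reconstruction drift: {drift:.4f}")
```

In this toy setting the drift stays small; the recorded `trajectory` is the kind of reference path that a null-text style optimization would try to stay close to when guidance is applied.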



Delete


Original prompt:  Sound of the car horn was followed by laughter.

   Target prompt:  Sound of the car horn was followed by laughter.

Origin Auffusion Baseline AudioEditor

Original prompt:  After a gunshot, there was a burst of dog barking.

   Target prompt:  After a gunshot, there was a burst of dog barking.


Original prompt:  Keys jingle as a car attempts to start.

   Target prompt:  Keys jingle as a car attempts to start.


Original prompt:  A cat is meowing in noise.

   Target prompt:  A cat is meowing in noise.




Replace


Original prompt:  After a gunshot, there was a burst of dog barking.

   Target prompt:  After a thunder, there was a burst of dog barking.

Origin Auffusion Baseline AudioEditor

Original prompt:  A baby is crying.

   Target prompt:  A baby is laughing.


Original prompt:  An old man speaking.

   Target prompt:  An young man speaking.


Original prompt:  Playing joyful melodies on the piano.

   Target prompt:  Playing joyful melodies on the drums.




Add


Original prompt:  A woman is giving a speech.

   Target prompt:  A woman is giving a speech amid cheers.

Origin Auffusion Baseline AudioEditor

Original prompt:  A crowd applauds.

   Target prompt:  A crowd applauds, while music plays.


Original prompt:  A baby is crying.

   Target prompt:  A baby is crying as wind blowing.


Original prompt:  Birds are chirping in the rain.

   Target prompt:  Birds are chirping in the rain and thunder rumbles.
