Diffusion-based text-to-audio (TTA) generation has made substantial progress,
leveraging Latent Diffusion Models (LDMs) to produce high-quality, diverse, and instruction-relevant
audio. However,
beyond generation, the task of audio editing remains equally important but has received comparatively
little attention.
Audio editing tasks face two primary challenges: executing precise edits and preserving the unedited
sections.
While workflows based on LDMs have effectively addressed these challenges in the field of image
processing,
similar approaches have rarely been applied to audio editing. In this paper, we introduce AudioEditor,
a training-free audio editing framework built on a pretrained diffusion-based TTA model.
AudioEditor incorporates Null-text Inversion and EOT-Suppression methods,
enabling the model to preserve original audio features while executing accurate edits.
Comprehensive objective and subjective experiments validate the effectiveness of AudioEditor in
delivering high-quality audio edits.
Delete
Original prompt: Sound of the car horn was followed by laughter.
Target prompt: Sound of the car horn was followed by laughter.
Audio samples (for each example): Origin, Auffusion, Baseline, AudioEditor
Original prompt: After a gunshot, there was a burst of dog barking.
Target prompt: After a gunshot, there was a burst of dog barking.
Original prompt: Keys jingle as a car attempts to start.
Target prompt: Keys jingle as a car attempts to start.
Original prompt: A cat is meowing in noise.
Target prompt: A cat is meowing in noise.
Replace
Original prompt: After a gunshot, there was a burst of dog barking.
Target prompt: After a thunder, there was a burst of dog barking.
Original prompt: A baby is crying.
Target prompt: A baby is laughing.
Original prompt: An old man speaking.
Target prompt: An young man speaking.
Original prompt: Playing joyful melodies on the piano.
Target prompt: Playing joyful melodies on the drums.
Add
Original prompt: A woman is giving a speech.
Target prompt: A woman is giving a speech amid cheers.
Original prompt: A crowd applauds.
Target prompt: A crowd applauds, while music plays.
Original prompt: A baby is crying.
Target prompt: A baby is crying as wind blowing.
Original prompt: Birds are chirping in the rain.
Target prompt: Birds are chirping in the rain and thunder rumbles.