
Riffusion: A Tool That Generates Music From Text Using Stable Diffusion


DALL-E 2, Imagen, Midjourney, and, in particular, Stable Diffusion have become a true phenomenon, shaking the world of art and design and allowing anyone to generate amazing images thanks to artificial intelligence.

In case you are not familiar with it, Stable Diffusion is a deep learning model released in 2022 that made a big splash by generating high-quality images from text. Best of all, the model is open source, so you can integrate it into your own applications. It has already been adopted by apps such as Lensa AI, and developers keep proposing new ways to put it to use.

Now two researchers, Seth Forsgren and Hayk Martiros, have shown another side of Stable Diffusion's potential. They used Stable Diffusion to generate spectrograms, which are graphic representations of sound intensity as a function of time and frequency. The result is Riffusion, a tool that combines instruments to generate musical excerpts and produces interesting results.

A spectrogram is a visual depiction of a sound clip’s frequency content. The X-axis is for time, while the Y-axis is for frequency. The colour of each pixel represents the audio amplitude at the frequency and time given in that row and column.

A spectrogram can be computed from audio using the short-time Fourier transform (STFT). The STFT splits the audio into short overlapping frames and approximates each frame as a combination of sine waves with different amplitudes and phases.
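To make the STFT step concrete, here is a minimal sketch in pure Python that computes a magnitude spectrogram with a naive DFT. It is only an illustration, not Riffusion's code: the function names, the frame size of 64, the hop of 16, and the 1000 Hz test tone are all assumptions chosen for the example.

```python
import cmath
import math

def hann(n):
    # Periodic Hann window, which tapers each frame to reduce spectral leakage.
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

def stft_magnitude(signal, frame_size=64, hop=16):
    """Return spec[t][k]: amplitude at time frame t and frequency bin k."""
    window = hann(frame_size)
    n_bins = frame_size // 2 + 1  # a real signal only needs the non-negative bins
    spec = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_size)]
        row = []
        for k in range(n_bins):
            # Naive DFT: correlate the frame with a complex sinusoid at bin k.
            acc = sum(
                chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            row.append(abs(acc))  # keep the amplitude, discard the phase
        spec.append(row)
    return spec

# Illustrative input: a pure 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
tone_hz = 1000.0
signal = [math.sin(2 * math.pi * tone_hz * n / sample_rate) for n in range(512)]

spec = stft_magnitude(signal)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64  # convert bin index to frequency
print(len(spec), "frames,", len(spec[0]), "bins, peak at", peak_hz, "Hz")
```

Each row of `spec` corresponds to one column of the spectrogram image, and each value to the brightness of one pixel. For the test tone, the strongest bin lands at 1000 Hz. Real audio code would use an FFT rather than this quadratic-time DFT.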

Because the STFT is invertible, Forsgren and Martiros can turn a spectrogram produced by Stable Diffusion back into audio. The spectrogram image, however, contains only the amplitudes of the sine waves and not their phases, so the Griffin-Lim algorithm is used to approximate the phase and reconstruct the audio clip. The audio processing is done with the Torchaudio library, which allows it to run efficiently on a GPU.
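The phase-recovery idea can be sketched in a few lines of pure Python. This is a toy version of Griffin-Lim under stated assumptions, not the Torchaudio implementation that Riffusion relies on: the helper names, the frame size of 32, the hop of 8, the 30 iterations, and the two-tone test signal are all hypothetical choices for illustration. It starts from random phases, then alternates between imposing the known amplitudes and re-estimating the phases from the resulting audio.

```python
import cmath
import math
import random

FRAME = 32  # samples per frame (illustrative value)
HOP = 8     # samples between frame starts (illustrative value)
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / FRAME) for n in range(FRAME)]
# Precomputed DFT factors: TWIDDLE[(k * n) % FRAME] == exp(-2j*pi*k*n/FRAME)
TWIDDLE = [cmath.exp(-2j * math.pi * i / FRAME) for i in range(FRAME)]
INV_TWIDDLE = [t.conjugate() for t in TWIDDLE]

def stft(signal):
    """Windowed DFT of every frame; returns frames[t][k] as complex values."""
    frames = []
    for start in range(0, len(signal) - FRAME + 1, HOP):
        chunk = [signal[start + n] * WINDOW[n] for n in range(FRAME)]
        frames.append([
            sum(chunk[n] * TWIDDLE[(k * n) % FRAME] for n in range(FRAME))
            for k in range(FRAME)
        ])
    return frames

def istft(frames):
    """Least-squares inverse: windowed overlap-add over summed squared windows."""
    length = (len(frames) - 1) * HOP + FRAME
    out = [0.0] * length
    norm = [0.0] * length
    for t, spectrum in enumerate(frames):
        for n in range(FRAME):
            # Inverse DFT for one sample, keeping only the real part.
            sample = sum(
                spectrum[k] * INV_TWIDDLE[(k * n) % FRAME] for k in range(FRAME)
            ).real / FRAME
            out[t * HOP + n] += WINDOW[n] * sample
            norm[t * HOP + n] += WINDOW[n] ** 2
    return [o / w if w > 1e-12 else 0.0 for o, w in zip(out, norm)]

def griffin_lim(magnitudes, n_iter=30, seed=0):
    """Estimate audio from amplitudes alone; returns (signal, error per iteration)."""
    rng = random.Random(seed)
    phases = [[rng.uniform(-math.pi, math.pi) for _ in row] for row in magnitudes]
    errors = []
    for _ in range(n_iter):
        # Combine the known amplitudes with the current phase guess.
        guess = [[m * cmath.exp(1j * p) for m, p in zip(mrow, prow)]
                 for mrow, prow in zip(magnitudes, phases)]
        signal = istft(guess)
        spectrum = stft(signal)
        # Squared mismatch between the achieved and the target amplitudes.
        errors.append(sum((abs(s) - m) ** 2
                          for srow, mrow in zip(spectrum, magnitudes)
                          for s, m in zip(srow, mrow)))
        # Keep the phases of the re-analysed audio for the next round.
        phases = [[cmath.phase(s) for s in row] for row in spectrum]
    return signal, errors

# Illustrative input: two sine waves; only their amplitudes are kept.
original = [math.sin(2 * math.pi * 4 * n / FRAME)
            + 0.5 * math.sin(2 * math.pi * 9 * n / FRAME) for n in range(256)]
target = [[abs(c) for c in row] for row in stft(original)]
reconstructed, errors = griffin_lim(target)
print("amplitude error, first vs last iteration:", errors[0], errors[-1])
```

The amplitude mismatch does not grow from one iteration to the next, which is the property that makes this simple loop usable. The reconstructed waveform matches the target spectrogram in amplitude but need not be sample-for-sample identical to the original, because the true phases were never known.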

You can listen to the two researchers' work on the Riffusion project website. Because Riffusion is built on Stable Diffusion, a text-to-image model, you can create music simply by describing the kind of music you want. Detailed technical information on Riffusion is available on the dedicated page.

A link to the GitHub repository is published at the bottom of the page. If you are interested, you can download the code and run Riffusion on your own system.