Generally AI Episode 2: AI-Generated Speech and Music
12 Feb 2024

AI-Generated Voices
- Stephen Hawking used a voice synthesizer called the CallText 5010, whose voice was modeled on the speech of researcher Dennis Klatt.
- Apple's "Personal Voice" feature in iOS 17 allows users to create a synthetic version of their own voice.
- Artificially generated voices can be used for various purposes, including assisting individuals with speech disabilities, impersonating others for malicious intent, and editing audio content.
- Meta's Voicebox model enables users to create synthetic voices, but Meta has not released the model publicly, so access is currently limited.
- Reputable AI voice generation tools require explicit consent from the voice's owner before creating an artificial model of it.
- Malicious use of AI-generated voices includes impersonating celebrities or individuals for financial gain or spreading misinformation.
- Celebrities offer services to record personalized voice messages for a fee, raising ethical concerns about consent and authenticity.
- Protecting yourself from voice theft involves limiting publicly available recordings of your voice, being wary of unusual requests (e.g., demands for gift cards), and verifying a caller's identity with questions only the real person could answer.
 
Ethical Considerations
- The ethical use of AI-generated voices should prioritize beneficial purposes, such as accessibility and entertainment, while guarding against potential malicious uses.
- Deepfake technology, including AI-generated voices, poses legal challenges regarding copyright, ownership, and impersonation.
 
Music Generation
- In the 1980s, hip-hop acts like Afrika Bambaataa used synthesized sounds to replace real instruments, made possible by the development of MIDI (Musical Instrument Digital Interface).
- Generative AI models like OpenAI's MuseNet and Google's Music Transformer can generate sequences of MIDI notes, allowing for the creation of new music (see the MIDI sketch after this list).
- Diffusion models, commonly used for image generation, have also been applied to music generation.
- Google's Noise2Music model takes audio noise and progressively denoises it, guided by a text prompt.
- Spectrograms, which represent sound as images, can be generated and modified using fine-tuned diffusion models (see the spectrogram round-trip sketch after this list).
- Recent techniques for music generation at the audio level include Meta's MusicGen and Google's MusicLM, which output audio tokens instead of text tokens (see the MusicGen sketch after this list).
- Meta's MusicGen can generate 12-second audio clips (roughly one bar per second), while Google's MusicLM is not publicly available for generating audio.
- MusicGen generated a blues riff that the hosts judged better than the first two clips generated by the other models.
- Riffusion generated a continuous stream of music that the hosts did not find appealing.
- Diffusion models like the one behind Riffusion have no built-in grammar rules or music theory; they generate music by progressively denoising pure noise.
- There is a potential market for AI-generated music, especially for street performers who could use it as a backing band.
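
To make the note-sequence idea concrete, here is a minimal sketch of building a short MIDI clip by hand, using the third-party mido library (the note pattern is invented for illustration); models like MuseNet and Music Transformer generate event sequences of this kind one token at a time:

```python
import mido
from mido import Message, MidiFile, MidiTrack

# A MIDI file is a sequence of timed note events, not audio samples.
mid = MidiFile()  # default resolution: 480 ticks per beat
track = MidiTrack()
mid.tracks.append(track)

track.append(Message('program_change', program=38, time=0))  # synth bass patch

# Hypothetical four-note bass line; 'time' is the delta in ticks
# since the previous event.
for note in [36, 36, 43, 36]:
    track.append(Message('note_on', note=note, velocity=100, time=0))
    track.append(Message('note_off', note=note, velocity=100, time=240))

mid.save('riff.mid')
```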
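
Since the Riffusion approach hinges on treating spectrograms as images, the following sketch (using librosa and soundfile; file names are placeholders) shows the round trip from audio to a spectrogram array and back:

```python
import numpy as np
import librosa
import soundfile as sf

# Load a short audio clip (placeholder path).
y, sr = librosa.load('clip.wav', sr=22050)

# A magnitude spectrogram is a 2-D array of frequency bins x time frames,
# which Riffusion-style pipelines treat as an image for a fine-tuned
# diffusion model to generate or modify.
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# ... an image diffusion model would generate or edit S here ...

# Griffin-Lim estimates the phase information the magnitude "image"
# discarded, converting the spectrogram back into a waveform.
y_out = librosa.griffinlim(S, hop_length=512)
sf.write('reconstructed.wav', y_out, sr)
```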
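
For the audio-token approach, here is a short sketch of generating a clip with Meta's MusicGen through the audiocraft library, adapted from the library's published usage pattern (the prompt is invented, and the API may have changed since release):

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=12)  # 12-second clips, as in the episode

# Internally, the text prompt conditions a transformer that emits discrete
# audio tokens, which a neural codec then decodes into a waveform.
wav = model.generate(['a slow blues riff on electric guitar'])

for i, one_wav in enumerate(wav):
    # Saves clip_0.wav with loudness normalization.
    audio_write(f'clip_{i}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```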
 
Moog Synthesizer
- The speaker owns a record player and found a record showcasing the sounds of the Moog synthesizer when it was new.
- Moog Music is a synthesizer company based in Asheville, North Carolina.
- Moog holds an annual festival, Moogfest, in Durham, North Carolina.
- The festival is expensive to attend.
- Attendees do not receive a free synthesizer for attending the festival.