Generally AI Episode 2: AI-Generated Speech and Music
12 Feb 2024

AI-Generated Voices
- Stephen Hawking used a voice synthesizer called the CallText 5010, whose voice was modeled on the speech of researcher Dennis Klatt.
- Apple's "Personal Voice" feature in iOS 17 allows users to create a synthetic version of their own voice.
- Artificially generated voices can be used for various purposes, including assisting individuals with speech disabilities, impersonating others for malicious intent, and editing audio content.
- Meta's Voicebox model enables users to create synthetic voices, but Meta has not released the model publicly, so access is currently limited.
- Reputable AI voice generation tools require explicit consent from the voice's owner before creating an artificial model of it.
- Malicious use of AI-generated voices includes impersonating celebrities or individuals for financial gain or spreading misinformation.
- Celebrities offer services to record personalized voice messages for a fee, raising ethical concerns about consent and authenticity.
- Protecting yourself from voice theft involves limiting publicly available recordings of your voice, being wary of unusual requests (e.g., demands for gift cards), and verifying a caller's identity with questions only the real person could answer.
 
Ethical Considerations
- The ethical use of AI-generated voices should prioritize beneficial purposes, such as accessibility and entertainment, while guarding against potential malicious uses.
- Deepfake technology, including AI-generated voices, poses legal challenges regarding copyright, ownership, and impersonation.
 
Music Generation
- In the 1980s, hip-hop acts like Afrika Bambaataa used synthesized sounds to replace real instruments, made possible by the development of MIDI (Musical Instrument Digital Interface).
- Generative AI models like OpenAI's MuseNet and Google's Music Transformer can generate sequences of MIDI notes, allowing for the creation of new music (see the MIDI sketch after this list).
- Diffusion models, commonly used for image generation, have also been applied to music generation.
- Google's Noise2Music model takes audio noise and progressively denoises it, guided by a text prompt.
- Spectrograms, which represent sound as images, can be generated and modified using fine-tuned diffusion models (see the spectrogram round-trip sketch after this list).
- Recent techniques for music generation at the audio level include Meta's MusicGen and Google's MusicLM, which output audio tokens instead of text tokens (see the MusicGen sketch after this list).
- Meta's MusicGen can generate 12-second audio clips (roughly one bar per second), while Google's MusicLM is not publicly available for generating audio.
- MusicGen generated a blues riff that the hosts judged better than the first two clips generated by the other models.
- Riffusion generated a continuous stream of music that the hosts did not find appealing.
- Diffusion models like the one behind Riffusion have no built-in grammar rules or music theory; they generate music by progressively denoising pure noise.
- There is a potential market for AI-generated music, especially for street performers who could use it as a backing band.
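
To make the note-sequence idea concrete, here is a minimal sketch of building a short MIDI clip by hand, using the third-party mido library (the note pattern is invented for illustration); models like MuseNet and Music Transformer generate event sequences of this kind one token at a time:

```python
import mido
from mido import Message, MidiFile, MidiTrack

# A MIDI file is a sequence of timed note events, not audio samples.
mid = MidiFile()  # default resolution: 480 ticks per beat
track = MidiTrack()
mid.tracks.append(track)

track.append(Message('program_change', program=38, time=0))  # synth bass patch

# Hypothetical four-note bass line; 'time' is the delta in ticks
# since the previous event.
for note in [36, 36, 43, 36]:
    track.append(Message('note_on', note=note, velocity=100, time=0))
    track.append(Message('note_off', note=note, velocity=100, time=240))

mid.save('riff.mid')
```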
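
Since the Riffusion approach hinges on treating spectrograms as images, the following sketch (using librosa and soundfile; file names are placeholders) shows the round trip from audio to a spectrogram array and back:

```python
import numpy as np
import librosa
import soundfile as sf

# Load a short audio clip (placeholder path).
y, sr = librosa.load('clip.wav', sr=22050)

# A magnitude spectrogram is a 2-D array of frequency bins x time frames,
# which Riffusion-style pipelines treat as an image for a fine-tuned
# diffusion model to generate or modify.
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# ... an image diffusion model would generate or edit S here ...

# Griffin-Lim estimates the phase information the magnitude "image"
# discarded, converting the spectrogram back into a waveform.
y_out = librosa.griffinlim(S, hop_length=512)
sf.write('reconstructed.wav', y_out, sr)
```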
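
For the audio-token approach, here is a short sketch of generating a clip with Meta's MusicGen through the audiocraft library, adapted from the library's published usage pattern (the prompt is invented, and the API may have changed since release):

```python
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained('facebook/musicgen-small')
model.set_generation_params(duration=12)  # 12-second clips, as in the episode

# Internally, the text prompt conditions a transformer that emits discrete
# audio tokens, which a neural codec then decodes into a waveform.
wav = model.generate(['a slow blues riff on electric guitar'])

for i, one_wav in enumerate(wav):
    # Saves clip_0.wav with loudness normalization.
    audio_write(f'clip_{i}', one_wav.cpu(), model.sample_rate, strategy='loudness')
```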
 
Moog Synthesizer
- The speaker owns a record player and found a record showcasing the sounds of the Moog synthesizer when it was new.
- Moog Music is a synthesizer company based in Asheville, North Carolina.
- Moog holds an annual festival, Moogfest, in Durham, North Carolina.
- The festival is expensive to attend.
- Attendees do not receive a free synthesizer for attending the festival.