Building AI Models Faster And Cheaper Than You Think
Coming Up
- Recent advancements in AI have made many science fiction concepts a reality.
- Generative AI models like GPT-4, Midjourney, and now Sora are pushing the boundaries of what's possible.
- YC companies are building foundation models during the batch with just $500,000.
- These models are being developed by young college graduates in a relatively short time frame.
- This demonstrates that it's possible to be on the cutting edge of AI research without significant resources.
Sora Videos
- Sora's video showcases a humanoid robot walking a golden retriever on a suburban street.
- The video shows a marked improvement in rendering text inside the frame: the model spells out "help" correctly, and the footage is high-definition.
- The physics of the robot's and dog's movements are mostly accurate, capturing the lifelike gait of a golden retriever.
- The prompt was followed precisely, although minor imperfections were noted, such as a floating dog and inconsistencies in the street and structures.
- Sora's videos exhibit long-term visual consistency, maintaining a consistent architectural style and environment throughout the minute-long clip.
- The drone camera circles the Golden Gate Bridge, showcasing stunning views of the cliffs, ocean waves, and San Francisco in the background.
- The high definition of the video is impressive, capturing intricate details of the bridge and city.
- Geographical accuracy is not perfect, with terrain and city layout differing from the real world.
- Minor imperfections include disjointed bridge columns at certain angles and cars driving on the wrong side of the road.
- Simulating fluid motion remains a challenge, resulting in slightly static waves.
How does Sora work under the hood?
- Sora combines a transformer model, typically used for text, with a diffusion model, used in image generation like DALL-E and Midjourney.
- It adds a temporal component to ensure consistency between frames and time.
- Sora is trained on videos split into "SpaceTime patches," which are small 3D blocks of pixels that span several frames and so carry both spatial and temporal information.
- The size of these patches can vary, and they are trained in a large architecture.
- SpaceTime patches are the video equivalent of tokens, building on prior work in transformer models for images and robotics.
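The patch idea above can be made concrete with a toy example. This is a minimal sketch, not Sora's actual pipeline: the function name `spacetime_patches`, the patch sizes, and the tiny grayscale "video" are all assumptions chosen for illustration. It cuts a video into blocks that cover a few frames and a small pixel region, then flattens each block into one list, which is the role a token plays for text.

```python
def spacetime_patches(video, pt, ph, pw):
    """Split a video (frames x height x width nested lists) into spacetime patches.

    Each patch covers pt frames, ph rows, and pw columns, flattened into one
    list so it can play the role that a token plays for text.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, T, pt):
        for y0 in range(0, H, ph):
            for x0 in range(0, W, pw):
                patch = [
                    video[t][y][x]
                    for t in range(t0, t0 + pt)
                    for y in range(y0, y0 + ph)
                    for x in range(x0, x0 + pw)
                ]
                patches.append(patch)
    return patches


# Hypothetical toy video: 4 frames of 4x4 pixels, each pixel a unique integer.
video = [[[t * 16 + y * 4 + x for x in range(4)] for y in range(4)] for t in range(4)]

patches = spacetime_patches(video, pt=2, ph=2, pw=2)
print(len(patches))     # 8 patches in total
print(len(patches[0]))  # 8 values per patch (2 frames x 2 rows x 2 columns)
print(patches[0])       # [0, 1, 4, 5, 16, 17, 20, 21]
```

The first patch mixes pixels from frame 0 (values 0, 1, 4, 5) and frame 1 (values 16, 17, 20, 21), which is what lets a single patch describe motion as well as appearance.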
How expensive is it to generate videos vs. texts?
- Generating videos is more computationally expensive than generating text due to the additional dimension of time.
- GPT-4 is estimated to have around a trillion parameters and operates in two dimensions; video adds a further dimension, so a comparable video model likely needs an order of magnitude more parameters, around 10 trillion.
- Training such a model likely requires about 10 times the number of GPUs used for GPT-4, which was around 20,000-30,000.
- Some YC companies have achieved similar functionality with fewer resources by optimizing data, compute, and expertise.
Infinity AI
- Makes deepfake videos of a particular person.
- Trained their model on the first three episodes of the Lightcone podcast.
- Only needed an hour or so of YouTube video to get an accurate representation.
Sync Labs
- API for creating real-time lip-syncing.
- Trained the models on a single A100 GPU.
- Compressed a lot of the data and used low-resolution video to reduce the amount of data needed.
- Partnered with Azure to get access to a dedicated GPU cluster, allowing them to iterate 100 times faster.
- YC companies get over half a million in credits and instant access to a GPU cluster within 24 hours.
- The companies in the YC batch didn't have to use any of the YC money to train their models.
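To see why low-resolution video cuts data requirements so sharply, as noted for Sync Labs above, it helps to run the arithmetic on raw frame sizes. The sketch below is illustrative only: the resolutions, frame rate, and clip length are hypothetical values, not figures from Sync Labs, and the helper `raw_video_bytes` is a name made up for this example.

```python
def raw_video_bytes(width, height, fps, seconds, bytes_per_pixel=3):
    """Uncompressed size in bytes of an RGB clip (3 bytes per pixel by default)."""
    return width * height * bytes_per_pixel * fps * seconds


# Hypothetical comparison: one minute of 1080p video vs. 256x256 video at 30 fps.
full = raw_video_bytes(1920, 1080, fps=30, seconds=60)
small = raw_video_bytes(256, 256, fps=30, seconds=60)

print(full)                    # 11197440000 bytes
print(small)                   # 353894400 bytes
print(round(full / small, 1))  # 31.6 times less raw data per clip
```

Because pixel count scales with width times height, shrinking both dimensions compounds, so even a modest drop in resolution removes most of the raw data before any compression is applied.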
Sonauto
- Sonauto is a company that has built a text-to-song model.
- The model can generate songs based on given lyrics and the specified singer.
- The founders of Sonauto are 21 years old and built the model in months by teaching themselves.
- The generated songs have understandable lyrics and sound like they are sung by a person.
Metalware
- Metalware is a company that is building a co-pilot for hardware design.
- The founders of Metalware had a background in hardware engineering but not in AI.
- They trained a foundation model for hardware design during the batch without much AI expertise.
- Metalware used high-quality data from textbooks and a smaller model (GPT-2.5) to reduce computational resources.
- By constraining tasks, using high-quality data, and choosing a smaller model, Metalware was able to build a foundation model for various applications beyond just generating video or text.
Guide Labs
- Building an explainable foundation model to understand how the model makes predictions.
- The team is training a model to determine when it's better to invest in building a custom model or fine-tuning an open-source model.
- Expertise in AI might be overrated, as smart individuals who are willing to read research papers can achieve similar results.
- YC can provide credits to offset some of the compute costs.
- The key differentiator lies in finding high-quality data, even if it's not a giant dataset.
Phind
- Phind is a company that created a co-pilot for software.
- They used synthetic data from programming competitions to train their model.
- Synthetic data was initially controversial because it seemed like a model couldn't generate its own data and learn from it.
- However, it works because LLMs are capable of reasoning, which allows them to generate data and improve their own models.
- Other AI models, such as those for self-driving cars, are also trained on massive amounts of simulation data.
- Sora is thought to follow the same pattern for video generation.
- It is speculated to have been trained partly on footage rendered in game engines like Unreal Engine or Unity, which have full physics simulators.
- This would help explain how Sora can generate videos from multiple camera angles and simulate the real world.
- The implications of this technology go beyond entertainment, as it can be used for weather prediction, scientific simulations, and more.
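One common way to make synthetic code data trustworthy is to keep only generated solutions that pass known tests, which is why programming-competition problems suit the approach described for Phind above. The sketch below illustrates that filtering idea only; the notes do not say this is Phind's method. The function `generate_candidates` is a hypothetical stand-in for a model and returns hard-coded strings, and the "add two numbers" problem is invented for the example.

```python
def generate_candidates():
    """Hypothetical stand-in for a model proposing solutions as source strings."""
    return [
        "def solve(a, b):\n    return a + b\n",
        "def solve(a, b):\n    return a - b\n",  # deliberately wrong
        "def solve(a, b):\n    return b + a\n",
    ]


def passes_tests(source, test_cases):
    """Execute a candidate and check it against known input/output pairs."""
    namespace = {}
    try:
        exec(source, namespace)
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in test_cases)
    except Exception:
        # Any crash or missing function counts as a failed candidate.
        return False


# Known-good input/output pairs for the invented problem "add two numbers".
test_cases = [((1, 2), 3), ((5, 7), 12), ((0, 0), 0)]

# Keep only candidates that pass every test; these become training examples.
verified = [src for src in generate_candidates() if passes_tests(src, test_cases)]
print(len(verified))  # 2
```

The tests act as an external check, so the training set contains only solutions that are verifiably correct rather than everything the generator produced. A real pipeline would also sandbox the execution step, since `exec` runs arbitrary code.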
Diffuse Bio
- Diffuse Bio applies foundation models to biology to create new molecules for drugs and gene therapies.
- The founder has expertise in biology and published papers in Nature.
- Custom kernels were built to speed up the model training process, reducing resource requirements.
Piramidal
- Piramidal builds a foundation model for the human brain to predict EEG signals.
- EEG signals are similar to videos, representing electrical impulses over time.
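- Chunking the data into spacetime chunks shortens the sequence the model must process, and because attention cost grows with the square of sequence length, the runtime drops quadratically.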
- The model can be trained with just 800 hours of GPU compute.
- EEG data is an unexpected application area for foundation models.
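The quadratic saving from chunking mentioned above follows from how self-attention works: every token is compared with every other token, so the cost grows with the square of the sequence length. The sketch below shows that arithmetic under stated assumptions. The sample count and chunk size are hypothetical, not Piramidal's numbers, and the helper names are invented for illustration.

```python
def attention_pairs(num_tokens):
    """Number of token-to-token comparisons in full self-attention."""
    return num_tokens * num_tokens


def tokens_after_chunking(num_samples, chunk_size):
    """Tokens left when each chunk of raw samples becomes one token."""
    # Ceiling division so a partial final chunk still counts as a token.
    return -(-num_samples // chunk_size)


num_samples = 10_000  # hypothetical number of raw EEG samples in one window
chunk_size = 50       # hypothetical number of samples grouped into one chunk

raw_cost = attention_pairs(num_samples)
chunked_cost = attention_pairs(tokens_after_chunking(num_samples, chunk_size))

print(raw_cost)                  # 100000000
print(chunked_cost)              # 40000
print(raw_cost // chunked_cost)  # 2500, which is chunk_size squared
```

Grouping 50 samples into one token shortens the sequence 50-fold but cuts the attention cost 2,500-fold, which is the kind of saving that makes training on a small GPU budget plausible.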
K-Scale Labs
- K-Scale Labs is developing consumer humanoid robots.
- The founder previously built the foundation robotics model for Tesla and integrated it into the Optimus robot.
- Advances in foundation models, such as the physics simulator for the world, are enabling breakthroughs in robotics.
DraftAid
- DraftAid is building AI models for CAD design.
- Traditional CAD software relies on old geometry kernels written in Fortran, which are expensive to use.
- DraftAid is using AI models to replace some of these kernels, making the process faster and cheaper.
Playground
- Playground is a YC company that has developed an AI model that can generate images.
- The model is open-source and outperforms Stable Diffusion in many cases.
- Playground was able to achieve this on far less money than Stability AI and other teams in the space.
- Suhail Doshi, the founder of Playground, taught himself AI in a month by reading papers and meeting with experts in the field.
- This highlights the fact that the AI field is still new and that it is possible to become an expert in a relatively short amount of time.
- Companies can compete with OpenAI and other large AI companies by training their own models for specific verticals and use cases.
Outro
- There are many incredible things being done in AI by people who are likely not that different from the viewers.
- Many notable figures in AI, such as Sam Altman and Dario Amodei, started somewhere, and YC could be the starting point for aspiring individuals.