OpenAI Secretly Trained GPT-4 With More Than a Million Hours of Transcribed YouTube Videos

Last month, the Wall Street Journal's Joanna Stern sat down with OpenAI CTO Mira Murati to discuss Sora, the company's latest text-to-video generator.

During the brief conversation, Stern asked Murati if Sora was trained on videos from YouTube, Instagram, and Facebook — resulting in a long and awkward pause.

"We used publicly available data and licensed data," Murati said.

"So, videos on YouTube?" Stern shot back.

"I'm actually not sure about that," Murati replied, following what can only be described as a grimace.

And as it turns out, there's a good reason why the CTO may have been uncomfortable with that question. As the New York Times reports, OpenAI secretly trained its GPT-4 large language model (LLM) with over a million hours of transcribed YouTube videos.

Sources with knowledge of internal conversations about ripping audio and transcripts from YouTube videos told the newspaper that the transcripts were fed into GPT-4.

And it's not just OpenAI — YouTube owner Google also harvested transcripts, per the NYT's sources, to train its own AI models.

It's yet another data point illustrating how AI companies are relying on massive amounts of murky and possibly copyright-infringing data to train their models — all without ever fairly compensating the rights holders, let alone asking for their consent.

The practice has already led to a number of lawsuits, with rights holders accusing companies including OpenAI and Microsoft of wrongly invoking "fair use," a doctrine of US copyright law that allows limited use of copyrighted material without acquiring permission.

Even the NYT itself has filed a lawsuit against OpenAI and Microsoft, accusing them of copyright infringement.

Last week, days before the NYT published its piece, YouTube CEO Neal Mohan sent a clear message, telling Bloomberg that if OpenAI had in fact trained Sora on YouTube videos, that would be a "clear violation" of the video platform's terms of use.

Google spokesperson Matt Bryant told the NYT that YouTube prohibits any "unauthorized scraping or downloading of YouTube content."

Bryant also told The Verge that the company had already "seen unconfirmed reports" of OpenAI's activity.

To be clear, we still don't fully know the extent to which Sora and GPT-4 are connected. We do know that OpenAI isn't reinventing the wheel for its upcoming text-to-video generator, relying on a translational layer that's powered by its LLM to interpret text prompts.

Maybe the real question is whether ripping a million hours of YouTube videos without permission amounts to stealing. How US copyright law applies to AI training remains a legal gray area, especially when it comes to fair use.

Experts told the NYT that as AI companies churn through the entirety of the internet, licensing all of the content would likely be impossible.

"The data needed is so massive that even collective licensing really can't work," Sy Damle, a lawyer who represents the venture capital firm Andreessen Horowitz, told the newspaper.

Even without securing all of the rights, AI companies could soon be facing an even stranger challenge: running out of training data entirely.

Researchers have estimated a 90 percent chance that AI companies will run out of high-quality data to feed their insatiable models by 2026. In other words, the likes of OpenAI could eventually have to resort to training their AI models on synthetic, AI-generated output, a dangerous race to the bottom that could have far more disastrous consequences than copyright-related lawsuits.
