Investigation Shows Tech Companies Trained AI on YouTube Transcripts

Artificial intelligence isn’t magical – it’s in the name: “artificial.” We know the content is originating from somewhere. An investigation showed that some of the big names in tech, including Apple, trained their AI technology on transcripts from YouTube videos – all without permission.

Investigation Shows YouTube Transcripts Used

Proof News conducted the investigation, which included a search tool for finding YouTube content in the dataset. It determined that tech companies used the subtitles from nearly 175,000 YouTube videos, drawn from more than 48,000 channels.

The videos that were used included late-night TV episodes from The Late Show with Stephen Colbert and Jimmy Kimmel Live. Also showing up in the investigation were videos by MrBeast, PewDiePie, and Marques Brownlee.

The dataset came from “the Pile,” which was described in 2020 as a mix of 22 datasets from EleutherAI, a nonprofit.

A Google spokesperson said in an email to CNET that the company stands by what it has said previously, going back to a comment from April. YouTube CEO Neal Mohan said at that time that he didn’t know whether OpenAI used YouTube videos, but he recognized that if it did, it would be a violation of YouTube’s terms of service.

Where Else Does the AI Content Come From?

Nearly every tech company has announced recently that it is developing or has developed an AI system. As stated initially, we know it’s not magical and that the content comes from somewhere. It just wasn’t expected that the training data was coming from YouTube transcripts.

OpenAI, the creator of ChatGPT, has said previously that it was getting more difficult to find datasets to train AI, and that led it to make deals with Reddit and News Corp. for their content. Google has said it has an agreement with content creators that allows it to use YouTube content in its AI training. AI Overview was recently added to Google Search. Learn how to turn AI Overview off if it isn’t your cup of tea.

Yet an Anthropic spokesperson acknowledged to Proof News that the company used the Pile to train Claude, its AI assistant. The spokesperson also acknowledged that there are some YouTube subtitles in the Pile.

Whether you use Claude, ChatGPT, or another AI technology, it was trained on a dataset. The question is whether it was trained on content from willing providers, like Reddit, or whether the search for training data expanded to content that was used without the creators’ knowledge. It’s definitely something to consider the next time you use an AI chatbot.

Image credit: Unsplash


Laura Tucker
Contributor

Laura has spent more than 20 years writing news, reviews, and op-eds, with the majority of those years as an editor as well. She has exclusively used Apple products for the past 35 years. In addition to writing and editing at MTE, she also runs the site’s sponsored review program.
