A new nonprofit is seeking to spotlight "fairly trained" AI models. (Credit: Getty Images / lOvE lOvE)
Ed Newton-Rex publicly resigned from his executive job at a prominent AI company last year, following a disagreement with his bosses over their approach to copyright.
Stability AI, the makers of the popular AI image generation model Stable Diffusion, had trained the model by feeding it with millions of images that had been “scraped” from the internet, without the consent of their creators. In common with many other leading AI companies, Stability had argued that this technique didn’t violate copyright law because it constituted a form of “fair use” of copyrighted work.
Newton-Rex, the head of Stability’s audio team, disagreed. “Companies worth billions of dollars are, without permission, training generative AI models on creators’ works, which are then being used to create new content that in many cases can compete with the original works. I don’t see how this can be acceptable in a society that has set up the economics of the creative arts such that creators rely on copyright,” he wrote in a post in November announcing his resignation on X, the platform formerly known as Twitter.
It was one of the first salvos in a battle that is now raging over the use of copyrighted work to train AI systems. In December, the New York Times sued OpenAI in a Manhattan court, alleging that the creator of ChatGPT had illegally used millions of the newspaper’s articles to train AI systems that are intended to compete with the Times as a reliable source of information. Meanwhile, in July 2023, comedian Sarah Silverman and other writers sued OpenAI and Meta, accusing the companies of using their writing to train AI models without their permission. Earlier that year, artists Kelly McKernan, Sarah Andersen, and Karla Ortiz sued Midjourney, Stability AI, and DeviantArt, which develop image-generating AI models, claiming the companies trained their AI models on the artists' work. Some visual artists are also fighting back by using new tools that offer to “poison” AI models trained on their work without consent, causing the models to break in unpredictable ways or resist attempts to copy the artists' style.
OpenAI has said it believes the New York Times lawsuit against it is “without merit,” adding that while it believes training on data scraped from the internet is fair use, it provides publishers with an opt-out “because it’s the right thing to do.” Stability AI did not immediately respond to a request for comment.
On Jan. 17, Newton-Rex announced a new type of effort to incentivize AI companies to respect creators. He launched a nonprofit, called ‘Fairly Trained,’ which offers a certification to AI companies that train their models only on data whose creators have consented. Elevating companies with better practices around sourcing their training data, he hopes, will incentivize the whole ecosystem to treat creators more fairly. “There is a really ethical side to this industry, and the point of this certification is to highlight that,” Newton-Rex tells TIME.
Nine models had been certified by Fairly Trained to coincide with its launch—many of them made by AI companies in the music-generation space. They include models by Endel, a “sound wellness” company that has collaborated with artists including Grimes and James Blake. The certification denotes that the companies have legally licensed the data on which their models were trained, rather than simply claiming fair use.
Alongside his work on AI, Newton-Rex is also a classical composer who writes choral music. He says his artistic practice motivated him to stand up for creators. “This has always been an issue that has been very close to my heart, and I’m sure that comes in large part from being a musician myself,” he says. “It’s hard to know what it really feels like to be a creator until you have actually gone through the process of pouring your work into something and seeing it go out into the world.” The resentment of seeing only meager royalty checks roll in for his art, while AI companies turn over billions of dollars, he believes, is a common feeling among artists of all stripes. “I’ve poured huge amounts of work into this and here’s what I’m getting back. Do I want [my work] to be used without any further payment by a company to build their own models that they are profiting from?”
He continues: “A lot of creators, myself included, would say no to that. [But] if there’s a chance to consent and a chance to talk about the terms, and a chance to ultimately make some money, that could be a really good thing.”
Fairly Trained does not ask companies seeking certification to share their datasets for auditing. Instead, it asks companies to fill out written submissions detailing what their datasets include and where the data comes from, what due diligence processes they have in place, and whether they are keeping good records, according to Newton-Rex. “There’s clearly an element of trust there,” he says. “There’s a conversation to be had around that, and whether more is needed. But my feeling is… at least at the start, actually a trust-based system works. And people will be disincentivized from giving inaccurate information, especially as that could lead to being decertified down the line.” Most companies that claim fair use exemptions, he adds, are “pretty forthright” in their views that they are legally entitled to follow that strategy.
Still, taking the word of companies about the contents and provenance of their datasets is an approach with an obvious loophole. “We have to actually see these datasets themselves to verify whether they still contain problematic content,” says Abeba Birhane, a scholar who studies the contents of large datasets used to train AI systems. “It’s really difficult to say whether it’s enough or not, without seeing the datasets themselves.”
Most of the largest AI companies, including OpenAI, Google DeepMind, Meta, and Anthropic, do not disclose the contents or even many details about the datasets used to train their largest models. This has often been an obstacle to creators seeking to learn whether their data has been used to train models without their consent.
OpenAI has inked agreements with several newsrooms, including the Associated Press and Axel Springer, to license news articles for use as training data. It is reportedly in further discussions with several others, including CNN, Fox, and TIME.
Write to Billy Perrigo at firstname.lastname@example.org.