The unsettling reality of AI copyright is that nobody knows what happens next.

AI models that produce art, music, and code by studying the work of others have surged in popularity over the past year. But as these tools gain traction, unresolved legal questions could shape the industry's future.

The issue stems from how generative AI systems are trained. Like most machine learning software, they work by identifying and recreating patterns in data. But because these programs produce code, text, music, and art, the data itself is made by humans, scraped from the web, and in some cases copyright-protected.

This wasn't a major concern for AI researchers in the distant past (that is, the 2010s). At the time, state-of-the-art models could only generate blurry, fingernail-sized black-and-white images of faces; there was no immediate threat to anyone. But in 2022, when a lone amateur can use software like Stable Diffusion to copy an artist's style in a matter of hours, and when companies are selling AI-generated prints and social media filters that are overt rip-offs of living designers, questions of legality and ethics have become far more pressing.

Is it legal for generative AI models to be trained on data that is copyright protected?

Consider Hollie Mengert, a Disney illustrator who discovered that a Canadian mechanical engineering student had copied her style as part of an AI experiment. The student downloaded 32 of her works and spent many hours training a machine learning model that could mimic her style. "For me, personally, it feels like someone's taking work that I've done, you know, things that I've learned — I've been a working artist since I graduated art school in 2011 — and is using them to create art that [sic] I didn't consent to and didn't give permission for," Mengert told technologist Andy Baio, who reported the case.

But is that fair? And what can Mengert do about it?

To answer these questions and get a sense of the legal landscape around generative AI, The Verge spoke to a range of specialists, including lawyers, analysts, and employees at AI startups. Some asserted with confidence that these systems infringe copyright and could face serious legal challenges in the near future. Others asserted, with equal assurance, the opposite: that everything currently happening in generative AI is legal and that any lawsuits are doomed to fail.

The truth is that nobody knows, says Baio, who has been following the generative AI scene closely. "I see individuals on both sides of this incredibly certain in their opinions," he told The Verge, adding that anyone who claims to know with certainty how this will play out in court is mistaken.

Andres Guadamuz, a professor at the University of Sussex in the UK who specializes in AI and intellectual property law, says that while many questions remain unanswered, they can be distilled into a few fundamental ones. First, can the output of a generative AI model be copyrighted, and if so, who owns it? Second, if you own the copyright to data used to train an AI model, does that give you any legal claim over the model or the content it creates? Once those questions are resolved, an even larger one emerges: how do you deal with the fallout of this technology? What kinds of legal restrictions could, or should, be placed on data collection? And can the people building these systems coexist with the people whose data is needed to build them?

Let’s take these questions one at a time.

The output question: can you copyright what an AI model produces?

For the first question, at least, the answer is not too difficult. In the United States, there is no copyright protection for works generated entirely by machines. But copyright appears to be attainable in cases where the creator can demonstrate there was substantial human involvement.

In September, the US Copyright Office granted a first-of-its-kind registration for a comic book created with the help of the text-to-image AI Midjourney. The comic is a finished work: an 18-page story with characters, dialogue, and a standard comic book layout. And although it's since been reported that the USCO is reviewing its decision, the comic's registration hasn't yet been revoked. The degree of human involvement in the comic's creation appears to be a factor in the review. Kristina Kashtanova, the artist who created the work, told IPWatchdog that the USCO had requested "details of my process to show that there was substantial human involvement in the process of creation of this graphic novel." (The USCO itself does not comment on specific cases.)

Guadamuz says granting copyright for works created with the help of AI will continue to be a challenge. "In the US, I don't think it's enough to get copyright if you just enter 'cat by van Gogh,'" he says. But "if you start experimenting with prompts and make numerous images and start fine-tuning your images, start using seeds, and start engineering a little more, I can completely see that being covered by copyright."

The level of human involvement will probably determine how much of an AI model’s output is protected by copyright.

Given this framework, the vast bulk of the output of generative AI models most likely cannot be copyrighted: such works are typically churned out en masse from a prompt of just a few words. But more involved processes would make for stronger cases. These might include contested works, such as the AI-generated print that took first place in a state art competition. Its creator said he spent weeks refining his prompts and manually editing the finished piece, which suggests a substantial degree of intellectual input.

Giorgio Franceschelli, a computer scientist who has written about AI copyright issues, says weighing human input will be "especially true" for cases decided in the EU. And the law is different again in the UK, another major jurisdiction for Western AI companies. The UK is one of only a handful of countries to offer copyright for works generated solely by a computer, but it defines the author as "the person by whom the arrangements necessary for the creation of the work are undertaken," wording that is open to multiple interpretations (would this "person" be the model's developer or its operator?). Still, it sets the stage for copyright protection to be granted.

Guadamuz cautions, though, that registering copyright is only the first step. "The US Copyright Office is not a court," he says. "You need to register if you're going to sue for copyright infringement, but it will be the courts that decide whether that's legally enforceable."

The input question: can you use copyright-protected data to train AI models?

For most experts, the biggest questions about AI and copyright revolve around the data used to train these models. Most systems are trained on huge volumes of text, code, or images scraped from the web. The training dataset of Stable Diffusion, for example, one of the largest and most influential text-to-image AI systems, contains billions of images gathered from hundreds of different websites, from personal blogs hosted on WordPress and Blogspot to art communities like DeviantArt and stock photo sites like Shutterstock and Getty Images. There's a good chance you're already in one of these enormous training datasets; there's even a website where you can check by uploading a photo or running a text search.

AI researchers, startups, and multibillion-dollar tech companies all point (at least in the US) to the fair use doctrine, which permits certain uses of copyright-protected work in order to promote freedom of expression, as their justification for using these images.

Daniel Gervais, a professor at Vanderbilt Law School who specializes in intellectual property law and has written extensively on its overlap with AI, notes that many factors go into determining whether something is fair use, but two, he says, are "much, much more prominent": does the use transform the nature of the material (a "transformative" use), and does it threaten the original creator's livelihood by competing with their works?

Training generative AI on copyright-protected material is probably legal, but the same model could be used in ways that are not.

Given the weight placed on those factors, Gervais says "it is considerably more likely than not" that training systems on copyrighted material is covered by fair use. But the same cannot necessarily be said of generating content. In other words: you can train an AI model on other people's data, but what you do with that model might be infringing. Think of the difference between printing fake money as a movie prop and trying to buy a car with it.

Consider the same text-to-image AI model deployed in different scenarios. If the model is trained on many millions of images and used to generate novel pictures, copyright infringement is extremely unlikely: the training data has been transformed in the process, and the output poses no threat to the market for the original art. But if you fine-tune that model on 100 pictures by a particular artist and generate images that match their style, a disgruntled artist would have a much stronger case against you.

"If you give an AI 10 Stephen King novels and say, 'Produce a Stephen King novel,' then you're directly competing with Stephen King. Is that fair use? Probably not," says Gervais.

Crucially, though, between these two poles of fair and unfair use lie countless scenarios in which input, purpose, and output are all weighted differently and could tilt a judicial ruling one way or the other.

Ryan Khurana, chief of staff at the generative AI company Wombo, says most companies selling these services are aware of these distinctions. "Intentionally using prompts that draw on copyrighted works to generate an output [...] violates the terms of service of every major player," he told The Verge over email. But, he adds, "enforcement is tough," and companies are more interested in "finding ways to prohibit utilizing models in ways that violate copyright [...] than limiting training data." That's especially true of free and open-source text-to-image models like Stable Diffusion, which can be trained and used without any restrictions or filters. The company behind such a model may have covered its own legal tracks, but it could also be enabling uses that violate copyright.

Another factor in determining fair use is whether the training data and model were created by academic researchers and nonprofits, something that generally strengthens fair use defenses, and startups know it. Stability AI, for example, the company that distributes Stable Diffusion, didn't directly collect the model's training data or train the models behind the software. Instead, it funded and coordinated that work, and the Stable Diffusion model is licensed by a German university. This lets Stability AI turn the model into a commercial service (DreamStudio) while keeping a legal distance from its creation.

Baio has dubbed this approach "AI data laundering." He notes that the technique has been used before to build facial recognition AI software, citing MegaFace, a dataset compiled by University of Washington researchers by scraping photos from Flickr. The university researchers took the data, he says, cleaned it up, and it was then exploited by commercial companies. That data, which includes millions of personal photos, is now in the hands of the Chinese government, law enforcement agencies, and the facial recognition firm Clearview AI, he claims. Such a tried-and-tested laundering process will likely help shield the creators of generative AI models from liability, too.

There's one more wrinkle here, though: Gervais notes that the current interpretation of fair use may soon change, thanks to a pending Supreme Court case involving Andy Warhol and Prince. In that case, Warhol used photographs of Prince to create artwork. Was that fair use, or copyright infringement?

"The Supreme Court doesn't deal with fair use very often, so when it does, it usually makes a significant decision, and I think they'll do the same here," says Gervais. "And it's risky to call anything settled law while you're waiting on the Supreme Court to decide."

How can AI companies and artists coexist peacefully?

Even if the training of generative AI models is found to be covered by fair use, that won't resolve the field's problems. It won't necessarily apply to other generative AI domains, such as code and music, and it won't placate artists angry that their work has been used to train commercial models. With that in mind, the question becomes: what solutions, technical or otherwise, could let generative AI flourish while crediting or compensating the creators whose work made the field possible?

The most obvious suggestion is to license the data and pay its creators. For some, though, that would kill the industry. Bryan Casey and Mark Lemley, authors of "Fair Learning," a legal paper that has become the cornerstone of fair use defenses in generative AI, argue that training datasets are so large that "there is no plausible option simply to license all of the underlying photographs, videos, audio files, or texts for the new use." Allowing any copyright claim, they contend, is "tantamount to saying, not that copyright owners will get paid, but that the use won't be permitted at all."

Others, though, point out that we've navigated copyright problems of comparable scale and complexity before and can do so again. Several experts The Verge spoke to compared the moment to the era of music piracy, when file-sharing programs were built on the back of massive copyright infringement and thrived only until legal challenges forced new arrangements that respected copyright.

"So, in the early 2000s, you had Napster, which was popular but entirely illegal. And today, we have services like Spotify and iTunes," Matthew Butterick, a lawyer currently suing companies for training AI models on scraped data, said earlier this month. "How did these systems arise? By companies making licensing deals and bringing in content legitimately. All the stakeholders came to the table and made it work, and the idea that a similar thing can't happen for AI is, for me, a little catastrophic."

Researchers and businesses are already experimenting with different ways to pay creators.

Ryan Khurana of Wombo predicts a similar outcome. "Music has by far the most complex copyright rules because of the many licensing models, the diversity of rights holders, and the number of intermediaries involved," he says. Given the complexities of the legal issues surrounding AI, he believes the licensing regime for the entire generative field will eventually resemble that of music.

Other approaches are also being tested. Shutterstock, for example, says it plans to set up a fund to compensate individuals whose work it's sold to AI companies to train their models, while DeviantArt has created a metadata tag for images shared on the web that warns AI researchers not to scrape their content. (At least one small social network, Cohost, has already adopted the tag across its site and says that if it finds researchers are scraping its images regardless, it "won't rule out legal action.") These approaches, though, have met with a mixed reception from artistic communities. Can one-off license fees ever compensate for a lost livelihood? And how does a no-scraping tag deployed now help artists whose work has already been used to train commercial AI systems?
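Mechanically, a no-scraping tag of this kind amounts to an extra directive in a page's robots metadata that a well-behaved crawler checks before collecting content. The sketch below is illustrative only: the directive name ("noai") and the parsing details are assumptions about how such a scheme could work, not a description of DeviantArt's actual implementation.

```python
from html.parser import HTMLParser

class RobotsMetaChecker(HTMLParser):
    """Collects directives from a page's <meta name="robots"> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            # Directives are comma-separated, e.g. "noai, noimageai"
            self.directives.update(d.strip().lower() for d in content.split(","))

def allows_ai_training(html: str) -> bool:
    """Returns False if the page opts out via a (hypothetical) 'noai' directive."""
    checker = RobotsMetaChecker()
    checker.feed(html)
    return "noai" not in checker.directives

page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
print(allows_ai_training(page))  # False: this page opts out of AI training
```

Note that, as with robots.txt, nothing in this scheme technically prevents scraping; it only signals intent, which is exactly why enforcement against non-compliant scrapers remains an open question.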

For many creators, the damage seems already done. But AI companies are at least proposing new approaches for the future. One obvious step forward is for AI researchers to build datasets that carry no risk of copyright infringement, either because the material has been properly licensed or because it was created for AI training in the first place. One such example is "The Stack," a dataset for training AI built explicitly to avoid accusations of copyright infringement. It includes only code with the most permissive open-source licenses and offers developers an easy way to have their data removed on request. Its creators say their model could be adopted across the industry.
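The two mechanisms described above, a license allowlist and an opt-out list honored at build time, can be sketched in a few lines. This is a hypothetical illustration of the general idea, not The Stack's actual pipeline; the license names, record fields, and opt-out mechanism are all assumptions for the example.

```python
# Hypothetical allowlist of permissive licenses (SPDX-style identifiers)
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}

def build_corpus(files, opted_out_repos):
    """Build a training corpus from code files, keeping only permissively
    licensed code and honoring opt-out (removal) requests.

    files: iterable of dicts with 'repo', 'license', and 'content' keys.
    opted_out_repos: set of repo names whose owners requested removal.
    """
    corpus = []
    for f in files:
        if f["repo"] in opted_out_repos:
            continue  # honor the removal request
        if f["license"].lower() not in PERMISSIVE_LICENSES:
            continue  # skip code under non-permissive licenses
        corpus.append(f["content"])
    return corpus

files = [
    {"repo": "a/lib",  "license": "MIT",        "content": "def f(): pass"},
    {"repo": "b/app",  "license": "GPL-3.0",    "content": "def g(): pass"},
    {"repo": "c/tool", "license": "Apache-2.0", "content": "def h(): pass"},
]
print(build_corpus(files, opted_out_repos={"c/tool"}))  # ['def f(): pass']
```

The design choice worth noting is that filtering happens when the dataset is built, not when the model is used: excluded or removed code simply never enters the training corpus, which is what makes the copyright posture easier to defend.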

The Stack was developed by Hugging Face in partnership with ServiceNow, and "The Stack's approach can absolutely be adapted to other media," says Yacine Jernite, Machine Learning & Society lead at Hugging Face. He calls it a crucial first step in exploring the wide range of consent mechanisms available, which work best when they account for the rules of the platform the training data comes from. Jernite says Hugging Face wants to help bring about a "fundamental shift" in how AI researchers treat creators, but for now, the company's approach remains unusual.

What happens next?

Whatever the answers to these legal questions, the players in the generative AI field already seem to be gearing up for... something. Tech companies are entrenching, repeatedly asserting that what they do is legal (while presumably hoping no one actually tests the claim). On the other side of the line, copyright holders are tentatively staking out positions of their own without quite committing to action. Getty Images recently banned AI-generated content over the potential legal risk to customers ("I don't think it's responsible. I think it could be illegal," CEO Craig Peters told The Verge last month), and the music industry trade body RIAA declared that AI-powered music mixers and extractors infringe its members' copyright.

The battle over AI copyright is already underway, though, with last week's announcement of a proposed class action lawsuit against Microsoft, GitHub, and OpenAI. The suit accuses all three companies of knowingly reproducing open-source code through Copilot, an AI coding assistant, without the required licenses. In an interview with The Verge last week, the lawyers behind the suit said it could set a precedent for the entire generative AI field.

Guadamuz and Baio, meanwhile, both say they’re surprised there haven’t been more legal challenges yet. “Honestly, I am flabbergasted,” says Guadamuz. “But I think that’s in part because these industries are afraid of being the first one [to sue] and losing a decision. Once someone breaks cover, though, I think the lawsuits are going to start flying left and right.”

Baio suggested one difficulty is that many people most affected by this technology — artists and the like — are simply not in a good position to launch legal challenges. “They don’t have the resources,” he says. “This sort of litigation is very expensive and time-consuming, and you’re only going to do it if you know you’re going to win. This is why I’ve thought for some time that the first lawsuits around AI art will be from stock image sites. They seem poised to lose the most from this technology, they can clearly prove that a large amount of their corpus was used to train these models, and they have the funding to take it to court.”

Guadamuz agrees. “Everyone knows how expensive it’s going to be,” he says. “Whoever sues will get a decision in the lower courts, then they will appeal, then they will appeal again, and eventually, it could go all the way to the Supreme Court.”

