The arms race between companies building AI models and creators defending their intellectual property by polluting training data could have serious consequences for the machine learning ecosystem, experts warn. Computer scientists at the University of Chicago have developed techniques to combat the wholesale scraping of content, particularly artwork, and prevent that data from being used to train AI models. But this deliberate pollution of data, coinciding with the accelerating adoption of AI by businesses and consumers, could result in “model collapse,” where AI systems become disconnected from reality.
Gary McGraw, co-founder of the Berryville Institute of Machine Learning (BIML), warns that this degeneration of data is already occurring and could pose problems for future AI applications, particularly large language models (LLMs). He emphasizes that foundational models must consume only reliable data, because the consequences of AI systems ingesting their own mistakes and compounding them into ever greater errors could be disastrous.
The issue of data poisoning is an active area of research, with different implications depending on the context. The Open Worldwide Application Security Project (OWASP), for instance, ranks the poisoning of training data as the third most significant threat to LLMs. In response to unauthorized scraping, researchers at the University of Chicago have created “style cloaks,” an adversarial AI technique that subtly alters artwork so that models trained on it learn a distorted version of the artist’s style. Their application, Glaze, has been downloaded more than 740,000 times.
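This is not Glaze’s actual algorithm, but style cloaks belong to the broader family of adversarial perturbations, and the core idea can be sketched in a few lines. The toy example below uses a hypothetical four-pixel “image” and a linear stand-in for a style classifier: tiny, targeted changes to each pixel, steered by the sign of the model’s weights, flip the model’s decision while leaving the data almost unchanged.

```python
def classify(weights, pixels):
    """A stand-in 'style classifier': positive score means style A, negative style B."""
    return sum(w * p for w, p in zip(weights, pixels))

def cloak(weights, pixels, epsilon=0.1):
    """FGSM-style perturbation: nudge each pixel against the sign of the
    model's weight. Each change is at most epsilon, but the score shifts
    by epsilon times the sum of |weights| -- enough to flip the decision."""
    sign = lambda w: 1 if w > 0 else (-1 if w < 0 else 0)
    return [p - epsilon * sign(w) for w, p in zip(weights, pixels)]

weights = [0.5, -0.3, 0.8, 0.1]     # hypothetical trained classifier
image = [0.2, 0.7, 0.1, 0.9]        # four "pixels" of a toy artwork

original_score = classify(weights, image)            # positive: style A
cloaked_score = classify(weights, cloak(weights, image))  # negative: style B
```

A real cloak works in a far higher-dimensional space, where even smaller per-pixel changes suffice, which is why the perturbation can remain imperceptible to humans while still misleading a model.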
While some hope the two sides will reach a stable equilibrium, Steve Wilson, chief product officer at Contrast Security, cautions that current efforts may create more problems than they solve. The use of “perturbations” or “style cloaks” could inadvertently degrade the performance of beneficial AI services and raise legal and ethical concerns.
The battle between AI companies and human content creators highlights the importance of bringing creators on board in the creation of AI models. AI models rely heavily on content created by humans, but the unauthorized use of that content has created a rift: creators are seeking ways to protect their data, while AI companies aim to consume it for training. This conflict, along with the shift from human-created to machine-created content, could have long-lasting effects.
Model collapse is a serious concern for the sustainability of training data obtained from the web. This degenerative process affects generations of learned generative models, with generated data polluting the training set of the next generation of models. To counter this issue, researchers suggest that data collected from genuine human interactions with systems will become increasingly valuable.
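The dynamic can be illustrated with a deliberately simple simulation (the Gaussian model and the parameters below are illustrative stand-ins, not taken from the cited research): each “generation” fits a generative model to the previous generation’s output and samples its training data from that model. Information in the tails is progressively lost, and the learned distribution narrows until it bears little resemblance to the original data.

```python
import random
import statistics

random.seed(0)

def train_and_sample(data, n_samples=20):
    """'Train' a trivial generative model (fit mean and spread to the data),
    then generate a new, smaller training set from it."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)
    return [random.gauss(mu, sigma) for _ in range(n_samples)]

# Generation 0: "human-made" data from the true distribution N(0, 1).
data = [random.gauss(0.0, 1.0) for _ in range(1000)]
spreads = [statistics.pstdev(data)]

# Every later generation trains only on the previous generation's output.
for _ in range(200):
    data = train_and_sample(data)
    spreads.append(statistics.pstdev(data))

# spreads[0] is close to 1.0; spreads[-1] has shrunk toward zero --
# the later "models" have forgotten most of the variety in the original data.
```

Each fitting step loses a little information about the tails of the distribution, and because no fresh human data ever enters the loop, those losses compound across generations rather than averaging out.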
Assuming AI companies prevail in their legal battles, their large models are likely to find ways around defenses implemented to protect content creators’ intellectual property. As AI and machine learning techniques continue to evolve, they will become better at detecting data poisoning, making defensive approaches less effective. Collaborative approaches such as Adobe’s Firefly, which uses digital “nutrition labels” to provide information about the source of an image and the tools used to create it, could offer some defense without excessively polluting the ecosystem. However, they may not be a long-term answer to AI-generated mimicry or theft.
Gary McGraw suggests that large companies working on LLMs should invest in preventing data pollution on the internet and collaborate with human creators. By marking content as “do not use for training,” they can help solve the problem themselves. However, it remains uncertain if these companies have fully absorbed this message.
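One concrete version of such marking already exists: robots.txt directives aimed at AI training crawlers. OpenAI’s GPTBot and Google’s Google-Extended user-agent tokens, for example, let site owners opt their content out of model training, though compliance is voluntary on the crawler’s side:

```
# robots.txt — opt this site out of AI training crawls
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Directives like these depend entirely on crawlers honoring them, which is why McGraw’s point about large AI companies taking responsibility matters: the mechanism only works if the model builders choose to respect it.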
The arms race between AI companies and content creators has significant implications for the machine learning ecosystem. Balancing the protection of intellectual property with the advancement of AI technology will be crucial to sustain the benefits of training from large-scale data. Developing robust and ethical AI systems, coupled with strong legal frameworks, will be essential moving forward.

