When ChatGPT launched in late 2022, many people were taken aback by how much knowledge artificial intelligence (AI) had. Even more alarming was where that knowledge had come from: AI companies had been training their models for years on existing content, much of which is covered by copyright law.
How Are AI Models Trained?
AI large language models are trained on billions of documents, including articles, books, blogs, newspapers, and much more. AI companies have built software that automatically crawls websites and other digital content that is publicly accessible.
Google has been doing this for years to populate its search engine, but now AI companies are doing this to train their algorithms.
Various companies and online communities have protested the use of their content for training AI models. Thousands of Reddit forums went dark for several days in 2023, and many of them remain private.
How Are Writers Impacted?
First and foremost, AI companies are using copyrighted content without permission and without compensating writers.
Next, this impacts the future careers of writers whom AI may replace. This is a serious issue and one of the main reasons the Writers Guild of America stayed on strike for so long in 2023.
Not only could this result in thousands of jobs being lost, but we also lose all the creativity writers bring to the table. AI models are only trained on what content has been created, not what content could be created.
How Can Writers Protect Themselves From AI?
There are no foolproof ways to guard your published content from the prying eyes of artificial intelligence bots since AI companies can choose to ignore any temporary barriers you put in place.
Send Takedown Notices
If you find unauthorized copies of your works online where they should not be freely accessible, send a takedown notice. The Authors Guild has a helpful tutorial on how to do this on its website.
One of the simplest ways to file a complaint about a website stealing content is by filing a Digital Millennium Copyright Act (DMCA) request with Google. You can also file DMCA complaints with domain or web hosting companies by looking up the website’s information on a WhoIs lookup service, such as this one by ICANN.
Include a Notice About No AI Training
The Authors Guild, one of the premier organizations for professional writers, recommends adding the following warning to your copyright page, book cover, or article:
NO AI TRAINING: Without in any way limiting the author’s [and publisher’s] exclusive rights under copyright, any use of this publication to “train” generative artificial intelligence (AI) technologies to generate text is expressly prohibited. The author reserves all rights to license uses of this work for generative AI training and development of machine learning language models.
Similarly, if you own a website, you can add the following text to a page called “No AI” so that AI web crawling bots will avoid your site:
The owner of this website does not consent to the content on this website being used or downloaded by any third parties for the purposes of developing, training or operating artificial intelligence or other machine learning systems (“Artificial Intelligence Purposes”), except as authorized by the owner in writing (including written electronic communication). Absent such consent, users of this website, including any third parties accessing the website through automated systems, are prohibited from using any of the content on the website for Artificial Intelligence Purposes.
Block the GPTBot From Scanning Your Website
Another way to prevent OpenAI, the company behind ChatGPT, from crawling your website is to update your site's robots.txt file to block its GPTBot crawler.
The robots.txt file tells web crawlers, such as Google's, what content they are and are not allowed to access on your website. “Good” web robots will obey this file and not crawl anything they are not supposed to.
OpenAI has documented the user agent its crawler identifies itself with, GPTBot, which means you can add a rule telling it not to crawl your website.
Here’s how to block the GPTBot:
- Create a file named “robots.txt” in your website’s home directory. If you don’t know how to do this, ask a web developer.
- Enter the following text. If some text already exists, add this to the bottom of the file:
User-agent: GPTBot
Disallow: /
- Save the file.
Your file should look like this:
User-agent: GPTBot
Disallow: /
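If you want to confirm that the rule is live and being read the way you expect, you (or your web developer) can test it with Python's built-in robots.txt parser. This is just a quick sketch; example.com is a placeholder for your own domain:

from urllib.robotparser import RobotFileParser

# Download and parse the robots.txt file you just published.
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Should print False once the GPTBot rule is in place,
# meaning GPTBot is not allowed to fetch your pages.
print(parser.can_fetch("GPTBot", "https://example.com/some-article"))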
Use the No AI Meta Tag
If you have a website, you can also add a meta tag to your pages' HTML head that signals to AI companies that they should not use your content. While AI companies haven't technically agreed to respect this method at this time, it's best to have it anyway.
Add the following to the head section of your website's pages. If you need help, ask a web developer.
<meta name="robots" content="noai, noimageai">
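If you want to verify that the tag actually appears on your live pages, a short script can fetch a page and list any robots meta tags it finds. This is a rough sketch using Python's standard library; example.com is again a placeholder for your own site:

from html.parser import HTMLParser
from urllib.request import urlopen

class RobotsMetaFinder(HTMLParser):
    # Collects the content of every <meta name="robots"> tag on the page.
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            self.directives.append(attrs.get("content", ""))

page = urlopen("https://example.com/").read().decode("utf-8", errors="ignore")
finder = RobotsMetaFinder()
finder.feed(page)
print(finder.directives)  # expect something like ['noai, noimageai']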
Put Your Content Behind a Paywall
A much more reliable way to protect your content is to use a paywall. If you own a blog or website, you can add a paywall to prevent people from accessing your content without paying, or, if you publish on a platform such as Medium, you can restrict your content to paying readers. Even requiring readers to log in to your website before reading free content can prevent AI companies from scraping it.
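How you set up that login gate depends entirely on your platform, and most blogging and membership services handle it for you. For a site you run yourself, here is a minimal sketch of the idea using the Flask framework for Python; the route names and session check are illustrative assumptions, not a complete membership system:

from functools import wraps
from flask import Flask, session, redirect, url_for

app = Flask(__name__)
app.secret_key = "replace-with-a-real-secret"

def login_required(view):
    # Send anonymous visitors (including scraping bots) to the login page.
    @wraps(view)
    def wrapped(*args, **kwargs):
        if not session.get("user_id"):
            return redirect(url_for("login"))
        return view(*args, **kwargs)
    return wrapped

@app.route("/login")
def login():
    return "Log in to keep reading."

@app.route("/articles/<slug>")
@login_required
def article(slug):
    # Full text is only returned to logged-in readers.
    return f"Full text of {slug} goes here."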
Use a CAPTCHA
If you don’t want to put content behind a paywall or other login screen, you could add a simple CAPTCHA to your content so that users have to solve a puzzle or check a checkbox to verify that they are human.
CAPTCHAs used to be annoying, asking visitors to decipher obscured letters and numbers or guess which images contained a bridge. However, many companies now offer far less intrusive options, including Google (reCAPTCHA), Cloudflare (Turnstile), and hCaptcha.
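Whichever provider you pick, the pattern is roughly the same: the CAPTCHA widget runs in the reader's browser and produces a token, and your server asks the provider to verify that token before showing the content. As an illustration, here is a sketch of that verification step against Google's reCAPTCHA endpoint using the requests library; the secret key and token are placeholders you would get from your own reCAPTCHA setup and form submission:

import requests

def is_human(captcha_token: str, secret_key: str) -> bool:
    # Ask reCAPTCHA whether the token from the browser widget
    # corresponds to a successfully completed challenge.
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": secret_key, "response": captcha_token},
        timeout=10,
    )
    return resp.json().get("success", False)

# Only serve the article if the check passes:
# if is_human(token_from_form, MY_SECRET_KEY):
#     show_article()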