AI Content
How do I get into training datasets used by ChatGPT or Claude?
In 2023, the New York Times sued OpenAI, claiming its articles had been used to train ChatGPT without permission. Around the same time, Reddit struck licensing deals with Google (reportedly worth about $60 million a year) and with OpenAI to make user-generated content available for training. These two headlines capture two sides of the same issue: some publishers want to block their data, while others see value in being included. For brands, nonprofits, and media outlets, the question isn’t just whether their content can end up in large language model (LLM) training datasets. It’s whether they should try to get there on purpose.
How training datasets are built
Companies like OpenAI and Anthropic build models such as ChatGPT and Claude by combining licensed data, public web crawls, and human feedback. OpenAI has acknowledged that it sources content from a “mixture of licensed data, publicly available data, and data created by human trainers” (OpenAI FAQ).
Some major providers now license structured content directly:
- Reddit licenses comment data to OpenAI and Google for both training and search integration (Reuters).
- Stack Overflow signed a deal with OpenAI in 2024 to provide technical Q&A data.
- News Corp entered a multi-year licensing agreement with OpenAI covering titles like the Wall Street Journal and The Times of London.
These deals are meant to ensure high-quality, reliable data feeds into training while compensating publishers.
Can smaller sites get included?
For smaller publishers or brands, there isn’t currently a “submit your site for training” portal. However, there are three practical routes:
- Allowing crawler access
The established control here is robots.txt: OpenAI, Google, and Anthropic honor robots.txt directives for their AI crawlers and training tokens (GPTBot, Google-Extended, and ClaudeBot, respectively), which lets site owners opt in or out of training crawls. The llms.txt standard, proposed in 2024, complements this by pointing language models to a site's most important content. Keeping these crawlers unblocked and publishing clear guidance increases the likelihood of your data being collected.
- Partnering through aggregators
Many LLM providers license from aggregators such as LexisNexis and news syndication services, and draw on open web datasets like Common Crawl. Getting your content distributed through these channels can indirectly place it in training datasets.
- Joining direct licensing programs
While most direct deals have gone to large publishers, some AI companies are expanding pilots with niche and specialized content providers, particularly in health, finance, and education. For instance, OpenAI has discussed creating custom GPTs built on licensed datasets for enterprise use.
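As a sketch of the crawler-access route, here is what a robots.txt policy for AI training crawlers might look like. The GPTBot and Google-Extended user-agent tokens are documented by OpenAI and Google; verify the exact Anthropic token against their current documentation, and treat the blocked path as a placeholder.

```text
# robots.txt — example policy for AI training crawlers (illustrative)

# OpenAI's training crawler (documented token: GPTBot)
User-agent: GPTBot
Allow: /
Disallow: /members/   # keep gated content out of training crawls

# Google's AI training control (documented token: Google-Extended)
User-agent: Google-Extended
Allow: /

# Anthropic's crawler (token shown as commonly reported; confirm in their docs)
User-agent: ClaudeBot
Allow: /
```

Serving this file at your domain root (e.g., example.com/robots.txt) signals a per-crawler policy; each provider applies only the group addressed to its own token.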
Benefits of being included
- Brand familiarity: When LLMs train on your content, your brand language and expertise become part of how the model “speaks” about your industry.
- Downstream visibility: Training data can indirectly increase the chance of being referenced in answers, since content a model has seen during training shapes what it treats as familiar and authoritative.
- Revenue potential: As shown by Reddit, News Corp, and Stack Overflow, publishers can negotiate licensing deals that bring financial upside.
Risks to weigh
- Loss of control: Once data is used for training, it’s baked into the model. You can’t ask to have it “unlearned.”
- Attribution gaps: Training doesn’t guarantee that your site will be cited. Inclusion increases familiarity but doesn’t equal direct credit.
- Legal exposure: Ongoing lawsuits, such as the New York Times vs. OpenAI case, show that copyright and fair use questions are far from settled (NYT).
What you can do now
- Publish clear crawler permissions: a robots.txt policy stating whether AI crawlers may use your content for training, plus an llms.txt file that points models to your key pages.
- Consider syndicating your content to aggregators that license to AI providers.
- Explore direct outreach to companies like OpenAI and Anthropic if you have niche or high-value datasets.
- Invest in structured, high-authority content so your material is not only crawlable but also valuable enough to be considered for licensing.
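To make the llms.txt step above concrete, here is a minimal example following the proposed format (a markdown file served at your site root: an H1 title, a blockquote summary, and sections of annotated links). All names and URLs below are placeholders.

```markdown
# Example Brand
> Hypothetical publisher of industry research and practical how-to guides.

## Key content
- [State of the Industry Report](https://example.com/report): annual benchmark data
- [Beginner's Guide](https://example.com/guide): step-by-step tutorial for newcomers

## Optional
- [Archive](https://example.com/archive): older posts, lower priority
```

The format is human-readable markdown by design, so the same file doubles as a curated index for both crawlers and people.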
How Contently helps
Technical signals are important, but content quality remains the deciding factor for inclusion in high-value datasets. This is where Contently helps brands stand out.
With Contently, companies can:
- Produce expert-level content that’s both machine-readable and audience-friendly.
- Apply schema and metadata best practices to maximize discoverability.
- Refresh older material so it remains relevant to AI crawlers and licensing programs.
- Build a consistent editorial voice that increases the odds of being considered trustworthy training data.
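The schema-and-metadata point can be sketched with a JSON-LD snippet using schema.org's Article type, embedded in a page's `<script type="application/ld+json">` tag. Every value below is a placeholder.

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How Widgets Work",
  "author": { "@type": "Organization", "name": "Example Brand" },
  "datePublished": "2024-06-01",
  "dateModified": "2025-01-15",
  "about": "widget maintenance"
}
```

Structured markup like this makes authorship, freshness, and topic machine-readable, which supports both search discoverability and evaluation for licensing programs.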
By combining technical readiness with editorial excellence, Contently helps brands position themselves not just for today’s search but for tomorrow’s AI training pipelines.
Conclusion
Getting into training datasets for ChatGPT or Claude isn’t as simple as flipping a switch. It involves a mix of technical permissions, distribution strategies, and in some cases, direct licensing. While there are risks to weigh, there are also real opportunities for visibility and revenue.
Brands that prepare now—through llms.txt, partnerships, and authoritative content—set themselves up to play a bigger role in how AI systems learn and communicate. With Contently as a partner, that preparation doesn’t have to come at the expense of quality or brand integrity.
Sources
- Reuters – Reddit data licensing deal
- Stack Overflow – Partnership with OpenAI
- New York Times – News Corp and OpenAI licensing deal
- Wired – llms.txt standard
- OpenAI – Data usage and training policy
- OpenAI – Custom GPTs with licensed datasets
- NYT – Lawsuit against OpenAI