Datasets and Data Sources — Where LLMs Get Their Knowledge About Brands

LLMs like ChatGPT, Claude, and Gemini aren't trained on your website alone — they're trained on trillions of words from hundreds of different sources. If you want to understand why AI mentions or ignores your brand, you need to understand which datasets shape these systems. This guide gives you an overview of where LLMs get their knowledge, how you can influence these sources, and which sources you should prioritize.

Published on November 14, 2025

Author: Jakob Langemark

How LLMs Learn About Brands

LLMs pick up knowledge in several phases. The first three happen during training; the fourth happens at query time:

1. Pre-training (the foundation)

The model is trained on massive text corpora to learn language structure and general knowledge.

Primary datasets:

Common Crawl: A large public scrape of the web (billions of pages per crawl)
Wikipedia: Structured, authoritative knowledge
Books corpus: Millions of digital books
Reddit: Community-driven discussions
News archives: Historical news articles
Scientific papers: Academic knowledge

What this means for your brand: If your brand isn't mentioned in these sources during pre-training, the model knows nothing about you from the start.

2. Fine-tuning (refinement)

After pre-training, the model is fine-tuned on specifically selected, high-quality data.

Sources:

Curated text from trustworthy sources
Expert-written content
Structured databases

3. RLHF (Reinforcement Learning from Human Feedback)

The model learns to give useful, safe answers based on human feedback.

This doesn't directly add brand knowledge, but it shapes how the model uses what it knows. The model learns to:

Cite sources
Acknowledge uncertainty
Prioritize authoritative information

4. Real-time retrieval (for some systems)

Some AI systems (like Perplexity and ChatGPT's web browsing) fetch live web pages at query time to supplement what the model learned during training.

Sources:

Your website (if crawlable)
News sites
Social media

The Most Important Data Sources for Brand Knowledge

1. Common Crawl

What it is: A nonprofit project that crawls billions of web pages roughly every month and makes the data freely available.

Why it matters:

Many AI models (including OpenAI's GPT-3) were trained partly on filtered Common Crawl data
If your site isn't here, it is likely missing from the pre-training data of many LLMs

Is your site in Common Crawl?

Check here: https://index.commoncrawl.org/

# Check if your domain is in one Common Crawl snapshot (here CC-MAIN-2024-10)
curl -s "https://index.commoncrawl.org/CC-MAIN-2024-10-index?url=yourwebsite.com&output=json" | head -5

How to get in:

Allow CCBot in robots.txt
Ensure your site is crawlable
Wait: Common Crawl runs new crawls periodically and picks up crawlable sites automatically
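
As a minimal sketch of the first step, the shell function below reads a robots.txt file on standard input and reports whether a site-wide block applies to CCBot, the user agent Common Crawl uses. The function name ccbot_status is hypothetical, and the parsing is deliberately simplified: it only detects a full "Disallow: /" in the group that applies to CCBot and does not evaluate path-level rules or Allow overrides.

```shell
# Report whether robots.txt (read from stdin) blocks CCBot site-wide.
# Simplified sketch: only "Disallow: /" is detected; path rules and
# Allow overrides are ignored.
ccbot_status() {
  awk '
    {
      line = $0
      sub(/#.*$/, "", line)                  # drop comments
      gsub(/^[ \t]+|[ \t\r]+$/, "", line)    # trim whitespace and CR
      colon = index(line, ":")
      if (colon == 0) next
      key = tolower(substr(line, 1, colon - 1))
      val = substr(line, colon + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      if (key == "user-agent") {
        # A User-agent line that follows rules starts a new group.
        if (in_rules) { is_cc = 0; is_star = 0; in_rules = 0 }
        if (tolower(val) == "ccbot") { is_cc = 1; seen_cc = 1 }
        if (val == "*") is_star = 1
      } else if (key == "allow" || key == "disallow") {
        in_rules = 1
        if (key == "disallow" && val == "/") {
          if (is_cc) cc_block = 1
          if (is_star) star_block = 1
        }
      }
    }
    END {
      # A CCBot-specific group takes precedence over the wildcard group.
      blocked = seen_cc ? cc_block : star_block
      print (blocked ? "blocked" : "allowed")
    }
  '
}

# Offline example with an inline robots.txt; prints: allowed
printf 'User-agent: *\nDisallow: /private/\n\nUser-agent: CCBot\nAllow: /\n' | ccbot_status

# Against a live site (needs network access):
# curl -s https://yourwebsite.com/robots.txt | ccbot_status
```

A result of "blocked" suggests the robots.txt needs a CCBot group with "Allow: /" before the site can be picked up.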

2. Wikipedia

Why it's gold: Wikipedia is one of the sources LLMs treat as most authoritative. If your brand has a Wikipedia page, LLMs are likely to repeat its contents as fact.

How to influence:

Create a Wikipedia page (if you meet the notability criteria and follow Wikipedia's conflict-of-interest guidelines)
Ensure existing pages are accurate and updated
Add sources and references

Wikipedia notability criteria:

Significant media coverage in trustworthy sources
Independent sources (not PR or own marketing)
Multiple in-depth articles

If you can't get a Wikipedia page: Focus on being mentioned in existing relevant Wikipedia articles (e.g., industry articles, geographic pages, or competitor pages with "See also" sections).

3. Crunchbase

What it is: Database of companies, funding, and tech ecosystems.

Why it matters: LLMs use Crunchbase to understand:

What your company does
Who your competitors are
Funding stage and size

Optimize your Crunchbase profile:

Claim your Crunchbase profile
Fill out all fields:
- Description (specifically what you do)
- Categories (choose relevant tags)
- Funding info
- Website link
Update it after funding rounds and product launches

4. LinkedIn

Why it's important: LLMs use LinkedIn to understand:

Company size and employees
Industry and focus areas
Brand positioning

Optimize your LinkedIn Company Page:

About section: Clear description of what you do
Specialties: Tag relevant keywords
Updates: Regular posts about product news, hiring, thought leadership
Employee profiles: Make sure employees' profiles link to the company page

5. News and Media Coverage

Why it matters: LLMs tend to weight independent media coverage of a brand more heavily than the brand's self-published content.

Prioritize:

Tier 1 media: TechCrunch, Wired, Wall Street Journal, Financial Times
Industry publications: Relevant trade journals
Regional news: Local business media

How to get coverage:

Press releases at product launches
Thought leadership articles (guest posts)
Comment on trending topics
Awards and recognitions

6. GitHub (for tech brands)

Why it matters: Many LLMs are trained on open-source code from GitHub.

If you have open-source projects:

Update README with clear description
Add "About" section to repo
Link to your company website
Include use cases and examples

7. Social Media

Twitter/X:

Public tweets end up in the training data of some LLMs
Thought leadership and branding

Reddit:

Community discussions around brands
Authentic user experiences

YouTube:

Transcripts can be crawled
Product demos and tutorials

8. Academic Papers and Research

If your brand is tech or research-driven:

Publish whitepapers
Sponsor academic research
Contribute to conferences

Upload to:

arXiv.org (pre-prints)
ResearchGate
University repositories

How to Prioritize Your Efforts

Tier 1: Must-have (highest impact)

Your own website — With correct robots.txt, JSON-LD, sitemap
Wikipedia — If eligible
Crunchbase — For tech brands
LinkedIn — Optimized company page
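
The JSON-LD mentioned for your own website can also tie the site to the profiles in this tier. As a minimal sketch, the shell function below prints a schema.org Organization block whose sameAs property lists the LinkedIn and Crunchbase pages. The function name org_jsonld and all example values are placeholders, and the values are inserted as-is without JSON escaping, so names containing double quotes or backslashes would need extra handling.

```shell
# Print a schema.org Organization JSON-LD block for a page's <head>.
# Arguments: brand name, website URL, LinkedIn URL, Crunchbase URL.
# Assumption: values contain no characters that need JSON escaping.
org_jsonld() {
  name=$1
  url=$2
  linkedin=$3
  crunchbase=$4
  cat <<EOF
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "$name",
  "url": "$url",
  "sameAs": [
    "$linkedin",
    "$crunchbase"
  ]
}
</script>
EOF
}

# Example with placeholder values.
org_jsonld "Your Brand" \
  "https://yourwebsite.com" \
  "https://www.linkedin.com/company/your-company" \
  "https://www.crunchbase.com/organization/your-company"
```

Pasting the printed block into the site's HTML gives crawlers one machine-readable statement that the website and the external profiles describe the same organization.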

Tier 2: Strong ROI

News coverage — Tier 1 and industry publications
Common Crawl — Ensure your site is crawlable
GitHub — For open-source brands
Industry directories — Relevant niche directories

Tier 3: Long-tail value

Reddit — Authentic community engagement
YouTube — Video content with transcriptions
Podcasts — Guest appearances (with transcriptions)
Forums — Stack Overflow, Hacker News, niche communities

How to Verify You're in the Datasets

Check Common Crawl

curl -s "https://index.commoncrawl.org/CC-MAIN-2024-10-index?url=yourwebsite.com&output=json"
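
The query above looks up one exact URL in a single crawl snapshot. As a rough sketch, the helper below builds an index query that appends a /* wildcard to match every captured page under the domain and adds a limit parameter to cap the number of results. The function name cc_index_url is hypothetical, and the crawl ID must be supplied by the caller; the list of available IDs is published at https://index.commoncrawl.org/collinfo.json.

```shell
# Build a Common Crawl index query for all captured pages under a domain.
# Arguments: crawl ID (e.g. CC-MAIN-2024-10) and domain. Both come from
# the caller; nothing is looked up here.
cc_index_url() {
  crawl_id=$1
  domain=$2
  printf 'https://index.commoncrawl.org/%s-index?url=%s/*&output=json&limit=5\n' \
    "$crawl_id" "$domain"
}

# Print the query URL for one snapshot.
cc_index_url CC-MAIN-2024-10 yourwebsite.com

# To run the query (needs network access), pass the URL to curl:
# curl -s "$(cc_index_url CC-MAIN-2024-10 yourwebsite.com)"
```

Repeating the call with a few recent crawl IDs shows whether the domain appears consistently or only in older snapshots.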

Check Wikipedia

Search at https://en.wikipedia.org/wiki/Special:Search for your brand name.

Check Crunchbase

Visit https://www.crunchbase.com/organization/YOUR-COMPANY

Check news coverage

# Google News search
https://news.google.com/search?q="Your Brand Name"

Check GitHub

Search at https://github.com/search?q=YOUR-BRAND

Implementation Checklist

Use this checklist to ensure presence in important datasets:

Website optimized — robots.txt, sitemap, JSON-LD
Common Crawl — Allow CCBot
Wikipedia presence — Create or update page (if eligible)
Crunchbase profile — Claimed and updated
LinkedIn company page — Complete profile
News coverage — Minimum 3-5 mentions in relevant media
GitHub repos — If tech brand, clear README
Social presence — Active on at least 2 platforms
Industry directories — Listed in relevant directories
Content distribution — Syndicate content to Medium, LinkedIn articles

Conclusion

LLMs learn about your brand through hundreds of sources — not just your website. For maximum AI visibility, you need to ensure presence in the datasets that matter most: Common Crawl (via your website), Wikipedia (if eligible), Crunchbase, LinkedIn, and news coverage.

Prioritize Wikipedia and news coverage — they have disproportionate weight in how LLMs assess authority. Next, focus on making your own website perfectly crawlable and structured with JSON-LD.

Remember: Datasets aren't updated in real-time. It can take months from when you publish content until it's reflected in LLM responses. Start now, and systematically build presence in the sources AI systems trust.
