Datasets and Data Sources — Where LLMs Get Their Knowledge About Brands

LLMs like ChatGPT, Claude, and Gemini aren't trained on your website alone — they're trained on trillions of words from hundreds of different sources. If you want to understand why AI mentions or ignores your brand, you need to understand which datasets shape these systems. This guide gives you an overview of where LLMs get their knowledge, how you can influence these sources, and which sources you should prioritize.

Published on: November 14, 2025

Author: Jakob Langemark

How LLMs Learn About Brands

LLMs pick up brand knowledge in several phases: three during training, plus one at answer time for systems that can search the web.

1. Pre-training (the foundation)

The model is trained on massive text corpora to learn language structure and general knowledge.

Primary datasets:

  • Common Crawl: a large sample of the public web (billions of pages per crawl), not the entire web

  • Wikipedia — Structured, authoritative knowledge

  • Books corpus — Millions of digital books

  • Reddit — Community-driven discussions

  • News archives — Historical news articles

  • Scientific papers — Academic knowledge

What this means for your brand: If your brand isn't mentioned in these sources during pre-training, the model knows nothing about you from the start.

2. Fine-tuning (refinement)

After pre-training, the model is fine-tuned on specifically selected, high-quality data.

Sources:

  • Curated text from trustworthy sources

  • Expert-written content

  • Structured databases

3. RLHF (Reinforcement Learning from Human Feedback)

The model learns to give useful, safe answers based on human feedback.

This doesn't directly add brand knowledge, but it shapes how the model uses what it already knows:

  • It learns to cite sources

  • It learns to acknowledge uncertainty

  • It learns to prioritize authoritative information

4. Real-time retrieval (for some systems)

Some AI systems (like Perplexity and ChatGPT's web browsing) fetch live web pages at answer time to supplement what the model learned during training.

Sources:

  • Your website (if crawlable)

  • News sites

  • Social media

The Most Important Data Sources for Brand Knowledge

1. Common Crawl

What it is: A nonprofit project that crawls billions of web pages on a regular schedule (roughly monthly) and makes the data freely available.

Why it matters:

  • Many AI models (including GPT-3) were trained on filtered versions of Common Crawl

  • If your site isn't here, it is much less likely to appear in the training data of many LLMs

Is your site in Common Crawl?

Check here: https://index.commoncrawl.org/

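# Check whether pages from your domain appear in one Common Crawl index.
# CC-MAIN-2024-10 is an example crawl ID; current IDs are listed at https://index.commoncrawl.org/
curl -s "https://index.commoncrawl.org/CC-MAIN-2024-10-index?url=yourwebsite.com/*&output=json" | head -5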

How to get in:

  • Allow CCBot in robots.txt

  • Ensure your site is crawlable

  • Wait: Common Crawl runs new crawls periodically and automatically
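The first step above, allowing CCBot in robots.txt, can be checked locally before you wait for the next crawl. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `crawler_allowed` and the two sample robots.txt files are assumptions made for illustration, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser


def crawler_allowed(robots_txt: str, url: str, user_agent: str = "CCBot") -> bool:
    """Return True if the robots.txt text lets the given user agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that explicitly allows CCBot everywhere.
OPEN_ROBOTS = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /admin/
"""

# Hypothetical robots.txt that blocks CCBot from the whole site.
BLOCKING_ROBOTS = """\
User-agent: CCBot
Disallow: /
"""

print(crawler_allowed(OPEN_ROBOTS, "https://yourwebsite.com/pricing"))      # True
print(crawler_allowed(BLOCKING_ROBOTS, "https://yourwebsite.com/pricing"))  # False
```

To test a live site, you would pass in the text of your own `/robots.txt` instead of the sample strings.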

2. Wikipedia

Why it's gold: Wikipedia is one of the most authoritative sources in LLM training data. If your brand has a Wikipedia page, LLMs are likely to treat what it says as fact.

How to influence:

  • Create a Wikipedia page (if you meet notability criteria)

  • Ensure existing pages are accurate and updated

  • Add sources and references

Wikipedia notability criteria:

  • Significant media coverage in trustworthy sources

  • Independent sources (not PR or own marketing)

  • Multiple in-depth articles

If you can't get a Wikipedia page: Focus on being mentioned in existing relevant Wikipedia articles (e.g., industry articles, geographic pages, or competitor pages with "See also" sections).

3. Crunchbase

What it is: Database of companies, funding, and tech ecosystems.

Why it matters: LLMs can draw on Crunchbase data to understand:

  • What your company does

  • Who your competitors are

  • Funding stage and size

Optimize your Crunchbase profile:

  1. Claim your Crunchbase profile

  2. Fill out all fields:

    • Description (specifically what you do)

    • Categories (choose relevant tags)

    • Funding info

    • Website link

  3. Update it after every funding round or product launch

4. LinkedIn

Why it's important: LLMs can draw on LinkedIn data to understand:

  • Company size and employees

  • Industry and focus areas

  • Brand positioning

Optimize your LinkedIn Company Page:

  • About section: Clear description of what you do

  • Specialties: Tag relevant keywords

  • Updates: Regular posts about product news, hiring, thought leadership

  • Employee profiles: Make sure employees' profiles link to the company page

5. News and Media Coverage

Why it matters: LLMs tend to weight independent media coverage of a brand more heavily than the brand's own self-published content.

Prioritize:

  • Tier 1 media: TechCrunch, Wired, Wall Street Journal, Financial Times

  • Industry publications: Relevant trade journals

  • Regional news: Local business media

How to get coverage:

  • Press releases at product launches

  • Thought leadership articles (guest posts)

  • Comment on trending topics

  • Awards and recognitions

6. GitHub (for tech brands)

Why it matters: Many LLMs are trained on open-source code from GitHub.

If you have open-source projects:

  • Update README with clear description

  • Add "About" section to repo

  • Link to your company website

  • Include use cases and examples

7. Social Media

Twitter/X:

  • Public posts end up in the training or retrieval data of some LLMs

  • Thought leadership and branding

Reddit:

  • Community discussions around brands

  • Authentic user experiences

YouTube:

  • Video transcripts can be crawled

  • Product demos and tutorials

8. Academic Papers and Research

If your brand is tech or research-driven:

  • Publish whitepapers

  • Sponsor academic research

  • Contribute to conferences

Upload to:

How to Prioritize Your Efforts

Tier 1: Must-have (highest impact)

  1. Your own website — With correct robots.txt, JSON-LD, sitemap

  2. Wikipedia — If eligible

  3. Crunchbase — For tech brands

  4. LinkedIn — Optimized company page
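For the JSON-LD part of item 1, a common pattern is a schema.org `Organization` object whose `sameAs` property links to the brand's other profiles. The sketch below is one possible way to generate such a snippet. The brand name, description, and profile URLs are hypothetical placeholders, and the helper names `organization_jsonld` and `to_script_tag` are assumptions made for this example.

```python
import json


def organization_jsonld(name, url, description, same_as):
    """Build a schema.org Organization object as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "description": description,
        # sameAs points at other profiles that describe the same entity.
        "sameAs": list(same_as),
    }


def to_script_tag(data):
    """Serialize the dict into a JSON-LD <script> tag for the page <head>."""
    # Escape "</" so a stray "</script>" inside a string cannot close the tag early.
    payload = json.dumps(data, indent=2, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical brand and profile URLs, for illustration only.
example = organization_jsonld(
    name="Example Brand",
    url="https://www.example.com",
    description="Example Brand builds invoicing software for freelancers.",
    same_as=[
        "https://www.linkedin.com/company/example-brand",
        "https://www.crunchbase.com/organization/example-brand",
    ],
)
print(to_script_tag(example))
```

Listing the LinkedIn and Crunchbase profiles under `sameAs` ties the Tier 1 sources together, so a crawler that reads the page can connect them to the same company.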

Tier 2: Strong ROI

  1. News coverage — Tier 1 and industry publications

  2. Common Crawl — Ensure your site is crawlable

  3. GitHub — For open-source brands

  4. Industry directories — Relevant niche directories

Tier 3: Long-tail value

  1. Reddit — Authentic community engagement

  2. YouTube — Video content with transcriptions

  3. Podcasts — Guest appearances (with transcriptions)

  4. Forums — Stack Overflow, Hacker News, niche communities

How to Verify You're in the Datasets

Check Common Crawl

curl -s "https://index.commoncrawl.org/CC-MAIN-2024-10-index?url=yourwebsite.com/*&output=json"
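The index server answers with one JSON object per line, which is awkward to read by eye. As a rough sketch, the same check can be scripted in Python. The function names are assumptions for this example, `CC-MAIN-2024-10` is only the example crawl ID used above, and the sample records in the usage note are made up to show the expected shape (including the assumption that `status` arrives as a string).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(domain: str, crawl_id: str = "CC-MAIN-2024-10") -> str:
    """Build the index query URL for every page under a domain."""
    params = {"url": f"{domain}/*", "output": "json"}
    return f"{INDEX_HOST}/{crawl_id}-index?{urlencode(params)}"


def parse_index_response(body: str) -> list[dict]:
    """Parse the line-delimited JSON body into a list of capture records."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def captured_urls(records: list[dict]) -> list[str]:
    """Return the distinct URLs that were captured with HTTP status 200."""
    return sorted({r["url"] for r in records if r.get("status") == "200"})


def fetch_captures(domain: str, crawl_id: str = "CC-MAIN-2024-10") -> list[dict]:
    """Query the live index. urlopen raises HTTPError if the server answers with an error status."""
    with urlopen(build_index_query(domain, crawl_id), timeout=30) as response:
        return parse_index_response(response.read().decode("utf-8"))
```

For example, `captured_urls(parse_index_response(body))` on a body containing captures of `https://yourwebsite.com/` (status `"200"`) and `https://yourwebsite.com/old` (status `"404"`) would return only the first URL.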

Check Wikipedia

Search at https://en.wikipedia.org/wiki/Special:Search for your brand name.
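If you track several brand names, the same search can be scripted against the MediaWiki search API (`action=query&list=search`). This is a minimal sketch: the helper names are assumptions, and the sample response in the usage note is invented to illustrate the response shape.

```python
import json
from urllib.parse import urlencode

WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def build_search_url(brand: str, limit: int = 5) -> str:
    """Build a MediaWiki search URL for an exact-phrase brand query."""
    params = {
        "action": "query",
        "list": "search",
        "srsearch": f'"{brand}"',  # quotes request an exact-phrase match
        "srlimit": limit,
        "format": "json",
    }
    return f"{WIKIPEDIA_API}?{urlencode(params)}"


def matching_titles(response_body: str) -> list[str]:
    """Extract article titles from a search response body."""
    data = json.loads(response_body)
    return [hit["title"] for hit in data.get("query", {}).get("search", [])]
```

Given a response body such as `{"query": {"search": [{"title": "Example Brand"}]}}`, `matching_titles` returns `["Example Brand"]`. An empty list means no article mentions the phrase.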

Check Crunchbase

Visit https://www.crunchbase.com/organization/YOUR-COMPANY

Check news coverage

# Google News search (exact-phrase query, URL-encoded)
https://news.google.com/search?q=%22Your%20Brand%20Name%22

Check GitHub

Search at https://github.com/search?q=YOUR-BRAND

Implementation Checklist

Use this checklist to ensure presence in important datasets:

  1. Website optimized — robots.txt, sitemap, JSON-LD

  2. Common Crawl — Allow CCBot

  3. Wikipedia presence — Create or update page (if eligible)

  4. Crunchbase profile — Claimed and updated

  5. LinkedIn company page — Complete profile

  6. News coverage — Minimum 3-5 mentions in relevant media

  7. GitHub repos — If tech brand, clear README

  8. Social presence — Active on at least 2 platforms

  9. Industry directories — Listed in relevant directories

  10. Content distribution — Syndicate content to Medium, LinkedIn articles

Conclusion

LLMs learn about your brand through hundreds of sources — not just your website. For maximum AI visibility, you need to ensure presence in the datasets that matter most: Common Crawl (via your website), Wikipedia (if eligible), Crunchbase, LinkedIn, and news coverage.

Wikipedia and news coverage carry disproportionate weight in how LLMs assess authority, so pursue them alongside the basics: a website that is fully crawlable and structured with JSON-LD.

Remember: Training datasets aren't updated in real time. It can take months from the moment you publish content until it shows up in LLM responses. Start now, and systematically build a presence in the sources AI systems trust.