Datasets and Data Sources — Where LLMs Get Their Knowledge About Brands
LLMs like ChatGPT, Claude, and Gemini aren't trained on your website alone — they're trained on trillions of words from hundreds of different sources. If you want to understand why AI mentions or ignores your brand, you need to understand which datasets shape these systems. This guide gives you an overview of where LLMs get their knowledge, how you can influence these sources, and which sources you should prioritize.

How LLMs Learn About Brands
LLM training happens in several phases:
1. Pre-training (the foundation)
The model is trained on massive text corpora to learn language structure and general knowledge.
Primary datasets:
Common Crawl — Recurring crawls of a large share of the public web (billions of pages per crawl)
Wikipedia — Structured, authoritative knowledge
Books corpus — Millions of digital books
Reddit — Community-driven discussions
News archives — Historical news articles
Scientific papers — Academic knowledge
What this means for your brand: If your brand isn't mentioned in these sources when the pre-training data is collected, the base model starts out knowing nothing about you.
2. Fine-tuning (refinement)
After pre-training, the model is fine-tuned on specifically selected, high-quality data.
Sources:
Curated text from trustworthy sources
Expert-written content
Structured databases
3. RLHF (Reinforcement Learning from Human Feedback)
The model learns to give useful, safe answers based on human feedback.
This doesn't directly add brand knowledge, but it shapes how the model uses what it already knows:
It learns to cite sources
It learns to acknowledge uncertainty
It learns to prioritize authoritative information
4. Real-time retrieval (for some systems)
Some AI systems (like Perplexity and ChatGPT with web browsing) fetch live web pages at query time to supplement what the model learned during training.
Sources:
Your website (if crawlable)
News sites
Social media
The Most Important Data Sources for Brand Knowledge
1. Common Crawl
What it is: A nonprofit project that crawls billions of web pages on a recurring schedule and makes the data freely available.
Why it matters:
Many AI models (including GPT-3) were trained in part on filtered Common Crawl data
If your site isn't here, many LLMs are unlikely to have seen it during pre-training
Is your site in Common Crawl?
Check here: https://index.commoncrawl.org/
How to get in:
Allow CCBot in robots.txt
Ensure your site is crawlable
Wait: Common Crawl crawls periodically and automatically
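One way to confirm that a robots.txt file actually admits CCBot is to parse it with Python's standard-library urllib.robotparser. The sketch below is illustrative only: the robots.txt content, the example.com URLs, and the is_allowed helper name are assumptions, not part of any Common Crawl tooling.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; substitute the real file served by your site.
ROBOTS_TXT = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "OtherBot"):
        for url in ("https://example.com/products", "https://example.com/admin/page"):
            print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

Running the same function against a file that contains "User-agent: CCBot" followed by "Disallow: /" shows the opposite case, which is a common reason a site is missing from the crawl.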
2. Wikipedia
Why it's gold: Wikipedia is among the most heavily weighted sources in LLM training data. If your brand has a Wikipedia page, LLMs are likely to treat its contents as established fact.
How to influence:
Create a Wikipedia page (if you meet notability criteria)
Ensure existing pages are accurate and updated
Add sources and references
Wikipedia notability criteria:
Significant media coverage in trustworthy sources
Independent sources (not PR or own marketing)
Multiple in-depth articles
If you can't get a Wikipedia page: Focus on being mentioned in existing relevant Wikipedia articles (e.g., industry articles, geographic pages, or competitor pages with "See also" sections).
3. Crunchbase
What it is: Database of companies, funding, and tech ecosystems.
Why it matters: LLMs use Crunchbase to understand:
What your company does
Who your competitors are
Funding stage and size
Optimize your Crunchbase profile:
Claim your Crunchbase profile
Fill out all fields:
Description (specifically what you do)
Categories (choose relevant tags)
Funding info
Website link
Update it after each funding round or product launch
4. LinkedIn
Why it's important: LLMs use LinkedIn to understand:
Company size and employees
Industry and focus areas
Brand positioning
Optimize your LinkedIn Company Page:
About section: Clear description of what you do
Specialties: Tag relevant keywords
Updates: Regular posts about product news, hiring, thought leadership
Employee profiles: Make sure employees link their profiles to the company page
5. News and Media Coverage
Why it matters: LLMs tend to weight independent media coverage of a brand more heavily than the brand's own self-published content.
Prioritize:
Tier 1 media: TechCrunch, Wired, Wall Street Journal, Financial Times
Industry publications: Relevant trade journals
Regional news: Local business media
How to get coverage:
Press releases at product launches
Thought leadership articles (guest posts)
Comment on trending topics
Awards and recognitions
6. GitHub (for tech brands)
Why it matters: Many LLMs are trained on open-source code from GitHub.
If you have open-source projects:
Update README with clear description
Add "About" section to repo
Link to your company website
Include use cases and examples
7. Social Media
Twitter/X:
Public posts end up in the training or retrieval data of some LLMs
Useful for thought leadership and branding
Reddit:
Community discussions around brands
Authentic user experiences
YouTube:
Transcripts can be crawled
Product demos and tutorials
8. Academic Papers and Research
If your brand is tech or research-driven:
Publish whitepapers
Sponsor academic research
Contribute to conferences
Upload to:
arXiv.org (pre-prints)
University repositories
How to Prioritize Your Efforts
Tier 1: Must-have (highest impact)
Your own website — With correct robots.txt, JSON-LD, sitemap
Wikipedia — If eligible
Crunchbase — For tech brands
LinkedIn — Optimized company page
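For the JSON-LD part of the website item, a minimal sketch of generating schema.org Organization markup might look like the following. The brand name, URLs, and description are placeholders, and the helper names organization_jsonld and as_script_tag are hypothetical; the sameAs list is one way to point at the same profiles this tier recommends (Wikipedia, Crunchbase, LinkedIn).

```python
import json
from typing import Iterable


def organization_jsonld(name: str, url: str, description: str,
                        same_as: Iterable[str]) -> str:
    """Build schema.org Organization markup as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "description": description,
        # sameAs lists other pages that describe the same organization.
        "sameAs": list(same_as),
    }
    return json.dumps(data, indent=2)


def as_script_tag(jsonld: str) -> str:
    """Wrap the JSON-LD in the script element that goes in the page head."""
    return '<script type="application/ld+json">\n' + jsonld + "\n</script>"


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    markup = organization_jsonld(
        name="Example Brand",
        url="https://www.example.com",
        description="Example Brand makes scheduling software for clinics.",
        same_as=[
            "https://www.linkedin.com/company/example-brand",
            "https://www.crunchbase.com/organization/example-brand",
        ],
    )
    print(as_script_tag(markup))
```

Generating the markup from one function keeps the description consistent with what you publish on the external profiles.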
Tier 2: Strong ROI
News coverage — Tier 1 and industry publications
Common Crawl — Ensure your site is crawlable
GitHub — For open-source brands
Industry directories — Relevant niche directories
Tier 3: Long-tail value
Reddit — Authentic community engagement
YouTube — Video content with transcriptions
Podcasts — Guest appearances (with transcriptions)
Forums — Stack Overflow, Hacker News, niche communities
How to Verify You're in the Datasets
Check Common Crawl
Search the URL index at https://index.commoncrawl.org/ for your domain.
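A Common Crawl lookup can also be scripted against its URL index. The sketch below only builds the query URL and parses a response body; it makes no network call. It assumes the index exposes a per-crawl endpoint of the form <crawl-id>-index that accepts url and output=json parameters and returns one JSON object per line. The crawl ID CC-MAIN-2024-10, the sample response, and both helper names are illustrative.

```python
import json
from urllib.parse import urlencode

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, domain: str) -> str:
    """Build an index query URL for every captured page under a domain."""
    # "domain/*" asks for all URLs below the domain (assumed wildcard syntax).
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_response(body: str) -> list[dict]:
    """Parse a line-delimited JSON response into a list of capture records."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


if __name__ == "__main__":
    print(build_index_query("CC-MAIN-2024-10", "example.com"))

    # Made-up sample of what a response body could look like.
    sample = (
        '{"url": "https://example.com/", "timestamp": "20240301000000", "status": "200"}\n'
        '{"url": "https://example.com/about", "timestamp": "20240301000500", "status": "200"}\n'
    )
    for record in parse_index_response(sample):
        print(record["url"], record["status"])
```

An empty result for a given crawl does not prove the site is blocked; it may simply not have been sampled in that crawl, so it is worth checking several crawl IDs.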
Check Wikipedia
Search at https://en.wikipedia.org/wiki/Special:Search for your brand name.
Check Crunchbase
Visit https://www.crunchbase.com/organization/YOUR-COMPANY
Check news coverage
Check GitHub
Search at https://github.com/search?q=YOUR-BRAND
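If you run these checks for several brands or product names, the lookup URLs can be generated rather than typed by hand. This is a small sketch under stated assumptions: the verification_urls helper is hypothetical, the Crunchbase slug has to be supplied by you because it is not derivable from the brand name, and the query parameter names (search for Wikipedia, q for GitHub) follow the URL patterns used above.

```python
from urllib.parse import quote, urlencode


def verification_urls(brand: str, crunchbase_slug: str) -> dict[str, str]:
    """Return the manual lookup URL for each source, keyed by source name."""
    return {
        "wikipedia": "https://en.wikipedia.org/wiki/Special:Search?"
        + urlencode({"search": brand}),
        # The slug is the last path segment of your Crunchbase profile URL.
        "crunchbase": "https://www.crunchbase.com/organization/"
        + quote(crunchbase_slug),
        "github": "https://github.com/search?" + urlencode({"q": brand}),
    }


if __name__ == "__main__":
    # "Example Brand" and "example-brand" are placeholders.
    for source, url in verification_urls("Example Brand", "example-brand").items():
        print(f"{source}: {url}")
```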
Implementation Checklist
Use this checklist to ensure presence in important datasets:
Website optimized — robots.txt, sitemap, JSON-LD
Common Crawl — Allow CCBot
Wikipedia presence — Create or update page (if eligible)
Crunchbase profile — Claimed and updated
LinkedIn company page — Complete profile
News coverage — At least 3 to 5 mentions in relevant media
GitHub repos — If tech brand, clear README
Social presence — Active on at least 2 platforms
Industry directories — Listed in relevant directories
Content distribution — Syndicate content to Medium, LinkedIn articles
Conclusion
LLMs learn about your brand through hundreds of sources — not just your website. For maximum AI visibility, you need to ensure presence in the datasets that matter most: Common Crawl (via your website), Wikipedia (if eligible), Crunchbase, LinkedIn, and news coverage.
Prioritize Wikipedia and news coverage — they have disproportionate weight in how LLMs assess authority. Next, focus on making your own website perfectly crawlable and structured with JSON-LD.
Remember: Training datasets aren't updated in real time. It can take months from when you publish content until it's reflected in LLM responses. Start now, and systematically build presence in the sources AI systems trust.