While publishers fight to protect their content, a deeper question remains: Common Crawl, the web archive that fuels the training of most large language models, uses authority metrics to prioritize its crawls. Could these scores, called Harmonic Centrality and PageRank, influence how often AIs cite certain sources? An analysis of 607 million domains reveals troubling correlations.
Key takeaways:
- 64% of the language models analyzed between 2019 and 2023 were trained with filtered Common Crawl data, including over 80% of GPT-3's tokens.
- Common Crawl uses Harmonic Centrality to prioritize which domains to crawl and how many pages to capture, creating an overrepresentation of high-authority sites in the training data.
- The domains most cited by AIs (Wikipedia, Reddit, YouTube) also rank among the highest in Common Crawl's WebGraph, raising the question of an indirect influence.
- An investigation by The Atlantic reveals that Common Crawl bypassed paywalls and ignored content removal requests, with no archive file modified since 2016, fueling a major controversy over copyright.
The Common Crawl controversy explodes in 2025
In November 2025, journalist Alex Reisner published an explosive investigation in The Atlantic into Common Crawl. This nonprofit organization, founded by a former Google employee and sponsored by Amazon Web Services (AWS), has been archiving the entire publicly accessible web for years.
The investigation reveals that Common Crawl provided millions of paywalled articles to AI companies, bypassing technical protections by not loading the JavaScript that verifies subscriptions. More troublingly, despite takedown requests from major publications like The New York Times (July 2023) and the Danish Rights Alliance (July 2024), no archive file has been modified since 2016.
Common Crawl's executive director, Rich Skrenta, defends a position as radical as it is surprising: "Robots are people too" and "If you didn't want your content on the Internet, you shouldn't have published it." Despite these statements, the organization published a denial the same day the investigation appeared, claiming it never circumvents access restrictions.
The financial influence of AI giants
The financial links between Common Crawl and the AI industry raise questions about independence. In 2023, OpenAI and Anthropic each paid $250,000 to the organization. NVIDIA also appears as a collaborator on Common Crawl's website.
These donations come as Common Crawl hosts over 9.5 petabytes of data and is cited in more than 10,000 academic publications. The Washington Post analyzed Google's C4 dataset (a filtered version of Common Crawl) and discovered 15 million websites, including sources like patents.google.com, nytimes.com (4th position), as well as controversial sites like RT.com and Breitbart.
Harmonic Centrality: the overlooked authority signal
Beyond the copyright controversy, a technical dimension remains largely overlooked. Common Crawl does more than archive: it also publishes WebGraph data containing authority metrics for 607 million domains.
Since 2017, Common Crawl has used Harmonic Centrality to determine which domains to crawl first. This metric actually measures a domain's "closeness" to all others in the web link graph. The higher the score, the more frequently the domain is crawled and the more pages are captured.
Common Crawl's lead engineer explains that this approach is preferred over Google's PageRank because it is more resistant to spam. The Harmonic Centrality score is not only used to decide which domains to crawl but also how many URLs to include.
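To make the metric concrete, here is a minimal pure-Python sketch of harmonic centrality on a toy link graph (the domain names are invented for illustration): a domain's score is the sum of the reciprocals of the shortest-path distances from every other domain that can reach it, with unreachable domains contributing zero.

```python
from collections import deque

def harmonic_centrality(graph, node):
    """Harmonic centrality of `node`: sum of 1/d(u, node) over all other
    nodes u, where d is the shortest-path distance along incoming links
    (unreachable nodes contribute 0)."""
    # Build the reversed graph so a BFS from `node` measures the
    # distance from every other node TO `node`.
    reversed_graph = {n: [] for n in graph}
    for src, targets in graph.items():
        for dst in targets:
            reversed_graph[dst].append(src)
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in reversed_graph[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return sum(1.0 / d for n, d in dist.items() if n != node)

# Toy link graph: edges point from the linking domain to the linked domain.
web = {
    "hub.com":  ["wiki.org", "news.com"],
    "news.com": ["wiki.org"],
    "blog.net": ["hub.com"],
    "wiki.org": [],
}
scores = {n: harmonic_centrality(web, n) for n in web}
# wiki.org gets the top score: two domains at distance 1, one at distance 2.
```

In a crawl-prioritization setting, domains would then simply be sorted by this score, with higher-scoring hosts crawled more often and more deeply.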
The dominant domains of the WebGraph
The top 15 domains in Common Crawl's WebGraph (October–December 2025) reveal a dominance of social platforms and Google infrastructures:
| Rank | Domain | HC Rank | PageRank |
|---|---|---|---|
| 1 | facebook.com | #1 | #3 |
| 2 | Google APIs | #2 | #2 |
| 3 | | #3 | #1 |
| 4 | | #4 | #5 |
| 5 | Google Tag Manager | #5 | #4 |
| 6 | YouTube | #6 | #8 |
| 7 | | #7 | #10 |
| 8 | GStatic | #8 | #7 |
| 9 | | #9 | #12 |
| 10 | gmpg.org | #10 | #9 |
| 11 | cloudflare.com | #11 | #6 |
| 12 | gravatar.com | #12 | #14 |
| 13 | wordpress.org | #13 | #13 |
| 14 | wikipedia.org | #14 | #37 |
| 15 | apple.com | #15 | #19 |
An interesting observation: Wikipedia ranks 14th in Harmonic Centrality but only 37th in PageRank, while representing about 22% of the training data for major language models and remaining the most cited source by ChatGPT with 7.8% of citations.
The citation patterns of language models
Several recent studies have analyzed the sources cited by AIs. Semrush, after analyzing 150,000+ citations, finds that Reddit dominates with 40.1% of citations, followed by Wikipedia (26.3%) and Google (23%). This Reddit dominance is partly explained by the $60 million API licensing deal struck with Google in early 2024.
Profound, which analyzed 680 million citations between August 2024 and June 2025, reveals differences between platforms: Wikipedia accounts for 7.8% of ChatGPT citations, while Reddit reaches 6.6% on Perplexity and 2.2% in Google's AI Overviews. Domains in .com represent 80.41% of all citations, while .org domains account for only 11.29%.
Search Atlas, after analyzing 5.17 million citations covering 907,003 unique domains, confirms that commercial domains dominate across all platforms, while academic and government sources remain underrepresented.
Traditional authority does not predict AI visibility
A major discovery by Search Atlas in 2025 contradicts intuitions: traditional SEO authority metrics (Domain Rating, Domain Authority) show weak or negative correlations with visibility in language model responses.
Analysis of 21,767 domains reveals that Perplexity shows a correlation of -0.18 with Domain Power, while Gemini shows -0.09. High-authority domains occasionally underperform, whereas mid-tier sites maintain more stable visibility.
Research confirms that AIs reward contextual relevance and diversity rather than authority, restructuring information discovery around content quality rather than reputation derived from backlinks. Only 11% of domains are cited by both ChatGPT AND Perplexity.
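The figures quoted above (for example, -0.18 for Perplexity) are ordinary correlation coefficients. As a methodological aside, the sketch below computes a Pearson r between a hypothetical, made-up set of Domain Rating values and AI citation counts, showing how such a weak negative value can arise:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical sample: Domain Rating vs. AI citation count per domain.
# Note the mid-tier domains (ratings 52 and 75) collecting the most citations.
domain_rating = [91, 88, 75, 60, 52, 40, 33, 20]
ai_citations  = [ 2,  5, 14,  9, 18,  7, 12,  4]
r = pearson(domain_rating, ai_citations)
# r comes out weakly negative: high authority does not predict citations here.
```

A correlation near zero on thousands of domains, as in the Search Atlas analysis, is exactly what this kind of scatter looks like at scale.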
Brand search volume as a primary predictor
Unlike link metrics, branded search volume appears as the number one predictor of AI citations, with a correlation of 0.334. Sites present on four or more platforms are 2.8 times more likely to appear in ChatGPT responses.
Another study shows that targeted optimization can increase AI visibility by 30 to 40%. Adding statistics increases visibility by 22%, while including direct quotes increases it by 37%.
The long tail and marginalized communities
A Mozilla Foundation report from February 2024 raises another concern: Common Crawl's use of Harmonic Centrality to prioritize crawls means that digitally marginalized communities are less likely to be included in training data.
Of the 607 million domains indexed by Common Crawl, over 100 million fall into the long tail with a rank higher than 1 million. Common Crawl's senior engineer also acknowledges that Common Crawl does not contain the entirety of the web, contrary to popular belief.
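For readers who want to check where their own domain sits, Common Crawl publishes its host-level ranks as plain text files. The sketch below assumes a tab-separated layout (harmonic-centrality rank and value, PageRank rank and value, then the host in reversed-domain notation) with invented sample rows; verify the column order against the actual release before relying on it:

```python
import io

# Assumed layout of a Common Crawl host-level ranks file (check the real
# release): HC rank, HC value, PageRank rank, PageRank value, reversed host.
SAMPLE = """\
1\t4.2e7\t3\t0.021\tcom.facebook
14\t3.1e7\t37\t0.002\torg.wikipedia
2500000\t8.0e4\t3100000\t1.0e-8\tcom.tiny-blog
"""

def long_tail_hosts(lines, threshold=1_000_000):
    """Yield hosts whose harmonic-centrality rank is beyond `threshold`."""
    for line in lines:
        hc_rank, _hc_val, _pr_rank, _pr_val, rev_host = line.rstrip("\n").split("\t")
        if int(hc_rank) > threshold:
            # Convert reversed notation (com.example) back to example.com.
            yield ".".join(reversed(rev_host.split(".")))

tail = list(long_tail_hosts(io.StringIO(SAMPLE)))
# Only the made-up long-tail host survives the filter.
```

On the real files, the same filter run over 607 million rows is what yields the 100+ million long-tail domains mentioned above.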
Structured data as an optimization lever
Structured data and schema.org markup appear to be determining factors. An experiment by Search Engine Land shows that a well-structured site with schema markup reaches rank 3 and appears in AI Overviews, while a site without schema is not indexed at all.
Comparison tables in proper HTML show citation rates 47% higher. The FAQPage schema directly feeds AI question-answer extraction. Wikidata, the number-one source of Google's Knowledge Graph (500 billion facts, 5 billion entities), strengthens entity recognition via the sameAs property.
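Since the FAQPage schema is singled out as a direct feed for question-answer extraction, here is a small sketch that builds such a JSON-LD block from question/answer pairs (the FAQ content is invented for illustration):

```python
import json

def faq_jsonld(pairs):
    """Build a schema.org FAQPage JSON-LD structure from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }

# Hypothetical FAQ content; in a page, embed the serialized output inside a
# <script type="application/ld+json"> tag in the document head or body.
block = faq_jsonld([
    ("What is Harmonic Centrality?",
     "A graph metric measuring a domain's closeness to all others."),
])
print(json.dumps(block, indent=2))
```

The point of the structure is that each question/answer pair is an explicit, machine-readable unit, which is exactly the shape an answer engine needs to lift.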
A multifactor equation
The selection of citations by language models remains a complex phenomenon. Confirmed factors include content quality and relevance, freshness and recency (significant impact — 40–60% of cited sources change monthly), structured formatting, real-time retrieval performance, and platform-specific preferences.
Possibly contributing factors include historical presence in training data, embedded authority associations, and signals derived from the WebGraph (direct or indirect).
Practical implications for optimization
The research suggests several courses of action.
- Don't ignore authority: although content and freshness matter significantly, domain-level signals probably play a role in the overall equation.
- Track multiple metrics: the CC Rank is one data point among others, not a silver bullet, but potentially useful for benchmarking.
- Understand the differences between platforms: Wikipedia dominates ChatGPT citations, while Reddit dominates Perplexity and Google AI Overviews.
- Refocus link-building priorities: Search Atlas' analysis recommends focusing on contextual and thematic connections rather than inflating authority; its results show that a high Domain Rating or Domain Authority alone does not increase the likelihood of being cited by AI models.
The question of the long tail
If your domain sits in Common Crawl’s long tail (ranked worse than 1 million), it’s worth investigating whether this correlates with citation difficulties. Mozilla points out that Common Crawl’s mission as an organization does not easily align with the needs of building trustworthy AI.
The organization deliberately does not remove hate speech, wanting its data to remain useful to researchers studying these phenomena. However, this data is undesirable for training language models because it can lead to harmful outputs.
Towards more transparency
Mozilla recommends that Common Crawl better highlight the limitations and biases of its data, and be more transparent about its governance. The organization should also require greater transparency around generative AI by asking AI builders to disclose their use of Common Crawl.
For Mozilla, the companies behind AIs should create or support dedicated intermediaries responsible for filtering Common Crawl in transparent and responsible ways. In the long term, Mozilla advocates reducing reliance on sources like Common Crawl and placing greater emphasis on training generative AI with datasets created and curated by humans in a fair and transparent manner.
The relationship between Common Crawl’s authority metrics and language model citations merits rigorous empirical study. The CC Rank Checker tool is a small contribution to making these data accessible, but deeper questions require more research, more data, and more transparency from AI companies about the composition of their training datasets.
The article “Common Crawl: the hidden metric that could influence your visibility in AIs” was published on Abondance.