Colour Contrast on the Web

A WCAG 2.1 Level AA Compliance Audit of Common Crawl’s Top 500 Domains

Thom Vaughan & Pedro Ortiz Suarez

February 2026

Abstract

We present a large-scale automated audit of WCAG 2.1/2.2 Level AA colour contrast compliance across the 500 most frequently crawled registered domains in Common Crawl’s CC-MAIN-2026-08 February 2026 crawl archive. Rather than conducting a live crawl, all page content was sourced from Common Crawl’s open WARC archives, ensuring reproducibility and eliminating any load on target web servers. Our static CSS analysis of 240 homepages identified 4,327 unique foreground/background colour pairings, of which 1,771 (40.9%) failed to meet the 4.5:1 contrast ratio threshold for normal text. The median per-site pass rate was 62.7%, with 20.4% of sites achieving full compliance across all detected colour pairings. These findings suggest that colour contrast remains a widespread accessibility barrier on the most prominent websites, with significant variation across domain categories.

1. Introduction

Web accessibility is a fundamental aspect of an inclusive internet. The Web Content Accessibility Guidelines (WCAG), maintained by the World Wide Web Consortium (W3C), define success criteria intended to make web content accessible to people with disabilities, including those with low vision, colour vision deficiencies, and other visual impairments.

Among the most measurable of these criteria is colour contrast. WCAG 2.1, under Success Criterion 1.4.3 (Contrast, Minimum), requires that the visual presentation of text and images of text have a contrast ratio of at least 4.5:1 for normal text and 3:1 for large text (defined as text at 18 points or larger, or 14 points or larger if bold). This criterion is classified as Level AA, the conformance level most commonly targeted by accessibility regulations worldwide, including the European Accessibility Act, Section 508 of the US Rehabilitation Act, and the UK Equality Act.

Previous studies of web accessibility have typically relied on live crawling of websites, which raises concerns about reproducibility (websites change over time), server load (automated audits can strain web infrastructure), and ethical considerations around crawling without explicit consent. In this study, we take a different approach: we use the open, freely available web archives maintained by Common Crawl, a non-profit organisation that performs broad web crawls on a monthly basis and makes the resulting data freely available to the public.

By sourcing all page content from Common Crawl’s WARC files, our analysis is fully reproducible. Any researcher can re-run our pipeline against the same CC-MAIN-2026-08 archive and obtain identical results. This approach also eliminates any load on the target websites themselves.

This work investigates whether meaningful, deterministic accessibility analysis can be performed directly from web crawl archives without rendering live pages. Using archived HTML from Common Crawl, we evaluate declared foreground and background colour pairings against WCAG contrast thresholds at web scale.

1.1 Disclosure

This project was developed with the assistance of Claude Opus 4.6 Extended (Anthropic). The analysis pipeline, dashboard, article, and supporting code were produced through an iterative collaborative process between the authors and an AI assistant.

2. Methodology

2.1 Domain Selection

We selected the 500 most frequently crawled registered domains from Common Crawl’s CC-MAIN-2026-08 February 2026 crawl archive, as ranked by page captures in Common Crawl’s crawl statistics. These statistics are derived from Common Crawl’s URL index data and are publicly available. The domain list spans a diverse cross-section of the web, including blogging platforms (blogspot.com, wordpress.org), reference sites (wikipedia.org, wiktionary.org), technology companies (google.com, microsoft.com, apple.com), educational institutions, government agencies, and e-commerce platforms.

2.2 Data Retrieval

For each domain, we queried Common Crawl’s Columnar Index to locate an archived capture of the domain’s homepage from the CC-MAIN-2026-08 crawl. The Columnar Index is a Parquet-based representation of the crawl index stored on S3, queryable via Amazon Athena. A single SQL query across all 500 domains returns the WARC filename, byte offset, and record length for each homepage capture, identifying the exact location of the page content within Common Crawl’s distributed WARC archive.
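As an illustration of the kind of lookup described above, the following sketch builds a single Athena SQL statement for a list of domains. It is not the authors' query. The table name ccindex.ccindex, the use of url_path = '/' to identify a homepage, and the helper name build_homepage_query are assumptions made for this example, and the column names should be checked against the published Columnar Index schema before use.

```python
def build_homepage_query(domains, crawl="CC-MAIN-2026-08"):
    """Return an Athena SQL string locating homepage captures for `domains`.

    Hypothetical helper: table name and homepage definition are assumptions.
    """
    # Quote each domain as a SQL string literal, doubling any single quotes.
    quoted = ", ".join("'" + d.replace("'", "''") + "'" for d in domains)
    return f"""
SELECT url, url_host_name, url_host_registered_domain, url_protocol,
       warc_filename, warc_record_offset, warc_record_length
FROM ccindex.ccindex
WHERE crawl = '{crawl}'
  AND subset = 'warc'
  AND url_host_registered_domain IN ({quoted})
  AND url_path = '/'
  AND fetch_status = 200
  AND content_mime_detected = 'text/html'
""".strip()


if __name__ == "__main__":
    print(build_homepage_query(["wikipedia.org", "google.com"]))
```

Filtering on the crawl and subset partition columns keeps the scan limited to one crawl, which matters for cost when querying through Athena.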

We then fetched the actual HTML content using HTTP byte-range requests to data.commoncrawl.org, extracting just the relevant WARC record from the larger archive file. This is the standard method for accessing individual pages within Common Crawl’s petabyte-scale archive. Each WARC record contains the original HTTP response headers and body, from which we extracted the HTML document.
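A minimal sketch of such a byte-range fetch is shown below, using only the Python standard library. The function names are hypothetical and the code is illustrative rather than the pipeline's own. It assumes that each WARC record is stored as an independent gzip member, so the requested byte range can be decompressed on its own.

```python
import gzip
import urllib.request

BASE_URL = "https://data.commoncrawl.org/"


def range_header(offset, length):
    """HTTP Range header value covering exactly one WARC record."""
    return f"bytes={offset}-{offset + length - 1}"


def split_warc_record(raw_gzip):
    """Decompress one gzipped WARC record and split it into
    (WARC headers, HTTP headers, body), all as bytes.

    The body may end with the record's trailing CRLF pair.
    """
    data = gzip.decompress(raw_gzip)
    warc_headers, _, rest = data.partition(b"\r\n\r\n")
    http_headers, _, body = rest.partition(b"\r\n\r\n")
    return warc_headers, http_headers, body


def fetch_record(warc_filename, offset, length):
    """Fetch a single record over HTTP (performs a network request)."""
    req = urllib.request.Request(
        BASE_URL + warc_filename,
        headers={"Range": range_header(offset, length)},
    )
    with urllib.request.urlopen(req) as resp:
        return split_warc_record(resp.read())
```

The offset and length arguments correspond to the byte offset and record length returned by the index query, so only the bytes of one record are transferred.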

The index query was configured to select only successful responses (HTTP status 200) with a detected MIME type of text/html, preferring the www subdomain or bare domain over deeper subdomains, and HTTPS over HTTP where available.
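The selection preferences can be expressed as a sort key, as in the sketch below. The source does not state which preference takes precedence, so this example assumes the host preference is applied first and the protocol preference second. The function names and the tuple layout are hypothetical.

```python
def capture_rank(host, protocol, registered_domain):
    """Sort key for a candidate capture; lower values are preferred.

    Assumption: host preference outranks protocol preference.
    """
    preferred_hosts = (registered_domain, "www." + registered_domain)
    host_rank = 0 if host in preferred_hosts else 1
    protocol_rank = 0 if protocol == "https" else 1
    return (host_rank, protocol_rank)


def pick_capture(captures, registered_domain):
    """Choose the best capture from (host, protocol, payload) tuples."""
    return min(
        captures,
        key=lambda c: capture_rank(c[0], c[1], registered_domain),
    )


if __name__ == "__main__":
    candidates = [
        ("blog.example.com", "https", "capture-a"),
        ("www.example.com", "http", "capture-b"),
        ("example.com", "https", "capture-c"),
    ]
    print(pick_capture(candidates, "example.com"))
```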

2.3 Colour Extraction

From each HTML document, we extracted CSS colour declarations using two complementary methods:

  1. Embedded stylesheets: All <style> block contents were parsed as CSS, extracting rule selectors and their color and background-color declarations. The CSS background shorthand property was also parsed to extract colour components.
  2. Inline styles: All style attributes on HTML elements were parsed for color and background-color declarations.

This approach captures the static CSS present in the archived HTML. It does not capture styles applied by JavaScript at runtime, styles loaded from external CSS files (which would require additional index lookups and WARC fetches), or styles applied through CSS custom properties (var(--name)). This is a known limitation that we discuss in Section 5.
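The two extraction methods might be sketched as follows with the standard library's html.parser and regular expressions. This is a simplified illustration under stated assumptions, not the pipeline's parser: the regular expressions ignore CSS comments and at-rule nesting subtleties, the raw value of the background shorthand is returned without isolating its colour component, and all class and function names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Match color, background-color, or background, but not e.g. border-color.
DECL_RE = re.compile(
    r"(?<![-\w])(color|background-color|background)\s*:\s*([^;}]+)", re.I
)
# Match "selector { declarations }" for the innermost rule blocks.
RULE_RE = re.compile(r"([^{}]+)\{([^{}]*)\}")


class StyleCollector(HTMLParser):
    """Collects <style> block contents and inline style attributes."""

    def __init__(self):
        super().__init__()
        self.in_style = False
        self.sheets = []
        self.inline = []

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self.in_style = True
        style = dict(attrs).get("style")
        if style:
            self.inline.append((tag, style))

    def handle_endtag(self, tag):
        if tag == "style":
            self.in_style = False

    def handle_data(self, data):
        if self.in_style:
            self.sheets.append(data)


def declarations(css_body):
    """Map each colour-related property in a declaration block to its raw value."""
    return {m.group(1).lower(): m.group(2).strip()
            for m in DECL_RE.finditer(css_body)}


def extract(html):
    """Return a list of (selector or tag label, declarations) pairs."""
    collector = StyleCollector()
    collector.feed(html)
    found = []
    for sheet in collector.sheets:
        for selector, body in RULE_RE.findall(sheet):
            decls = declarations(body)
            if decls:
                found.append((selector.strip(), decls))
    for tag, style in collector.inline:
        decls = declarations(style)
        if decls:
            found.append((f"<{tag} style>", decls))
    return found
```

External stylesheets, script-injected styles, and var(--name) references are outside what this sketch sees, which matches the limitation described above.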

Colour values were parsed from all standard CSS colour formats: hexadecimal notation (#RGB, #RRGGBB, #RGBA, #RRGGBBAA), the rgb() and rgba() functions, the hsl() and hsla() functions, and all 148 CSS named colours. Non-colour values such as transparent, inherit, currentColor, and initial were excluded from analysis.
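A compact parser for these formats could look like the sketch below. Several simplifications are assumptions of this example rather than statements about the pipeline: the alpha channel is discarded, only the comma-separated functional syntax is recognised, and the named-colour table is a five-entry excerpt standing in for the full list.

```python
import colorsys
import re

# Excerpt only; a full implementation would list every CSS named colour.
NAMED = {
    "black": (0, 0, 0), "white": (255, 255, 255), "red": (255, 0, 0),
    "navy": (0, 0, 128), "rebeccapurple": (102, 51, 153),
}
NON_COLOURS = {"transparent", "inherit", "currentcolor", "initial"}

HEX_RE = re.compile(r"#([0-9a-f]{3,4}|[0-9a-f]{6}|[0-9a-f]{8})")
RGB_RE = re.compile(
    r"rgba?\(\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*(?:,[^)]*)?\)"
)
HSL_RE = re.compile(
    r"hsla?\(\s*([\d.]+)(?:deg)?\s*,\s*([\d.]+)%\s*,\s*([\d.]+)%\s*(?:,[^)]*)?\)"
)


def parse_colour(value):
    """Return an (R, G, B) tuple of 0-255 ints, or None if not a usable colour.

    Assumption: any alpha component is ignored.
    """
    v = value.strip().lower()
    if v in NON_COLOURS:
        return None
    if v in NAMED:
        return NAMED[v]
    m = HEX_RE.fullmatch(v)
    if m:
        digits = m.group(1)
        if len(digits) in (3, 4):
            # Expand shorthand: each hex digit is doubled (#abc -> #aabbcc).
            digits = "".join(ch * 2 for ch in digits)
        return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))
    m = RGB_RE.fullmatch(v)
    if m:
        return tuple(min(255, int(g)) for g in m.groups())
    m = HSL_RE.fullmatch(v)
    if m:
        hue, sat, light = (float(g) for g in m.groups())
        # colorsys expects hue, lightness, saturation, each in [0, 1].
        rgb = colorsys.hls_to_rgb((hue % 360) / 360, light / 100, sat / 100)
        return tuple(round(c * 255) for c in rgb)
    return None
```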

2.4 Colour Pairing and Contrast Calculation

This analysis measures declared colour contrast properties rather than rendered visual presentation.

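Extracted colour declarations were paired into foreground/background combinations. The pairing rules include the defaults described in Section 5: where only a foreground colour was specified, a white background (#FFFFFF) was assumed, and where only a background colour was specified, black text (#000000) was assumed.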

Identical pairings (same foreground and background RGB values) were deduplicated to avoid counting the same colour combination multiple times.
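The deduplication step, together with the assumed defaults noted in Section 5, can be sketched as below. The full pairing rules are not reproduced in this text, so the example covers only what is stated: a missing background defaults to white, a missing foreground defaults to black, and identical RGB pairs are counted once. The function name and input shape are hypothetical.

```python
WHITE = (255, 255, 255)
BLACK = (0, 0, 0)


def build_pairings(rules):
    """Build unique (foreground, background) RGB pairs.

    rules: iterable of (foreground, background) tuples where either
    element may be None if the rule did not declare it.
    Returns pairs in first-seen order.
    """
    seen = set()
    pairs = []
    for fg, bg in rules:
        if fg is None and bg is None:
            continue  # nothing colour-related declared
        pair = (
            fg if fg is not None else BLACK,  # assumed default text colour
            bg if bg is not None else WHITE,  # assumed default background
        )
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


if __name__ == "__main__":
    demo = [((51, 51, 51), None), (None, (0, 0, 128)),
            ((51, 51, 51), (255, 255, 255))]
    print(build_pairings(demo))
```

In the demo, the first and third rules collapse into one pairing because the assumed white background makes them identical.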

For each pairing, we calculated the contrast ratio using the formula defined in WCAG 2.1:

Contrast ratio = (L1 + 0.05) / (L2 + 0.05)

where L1 is the relative luminance of the lighter colour and L2 is the relative luminance of the darker colour. Relative luminance is calculated from linearised sRGB values:

L = 0.2126 × Rlin + 0.7152 × Glin + 0.0722 × Blin

where each component is linearised from its 8-bit sRGB value using the standard piecewise function.
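The two formulas above translate directly into code. The sketch below uses the piecewise linearisation as written in WCAG 2.1, with a breakpoint of 0.03928. The sRGB specification gives 0.04045, but both values fall between the 8-bit levels 10/255 and 11/255, so the results are identical for 8-bit input. The function names are illustrative.

```python
def linearise(channel_8bit):
    """Convert one 8-bit sRGB channel to its linear-light value in [0, 1]."""
    c = channel_8bit / 255
    if c <= 0.03928:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance L of an (R, G, B) tuple of 0-255 ints."""
    r, g, b = (linearise(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """WCAG contrast ratio, independent of argument order."""
    lum_a = relative_luminance(rgb_a)
    lum_b = relative_luminance(rgb_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


if __name__ == "__main__":
    print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 2))
    print(round(contrast_ratio((0x76, 0x76, 0x76), (255, 255, 255)), 2))
```

Black on white gives the maximum ratio of 21:1, and identical colours give the minimum of 1:1, which is the "worst ratio" value that appears for several domains in Section 3.4.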

2.5 Compliance Assessment

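Each colour pairing was evaluated against two WCAG 2.1 Level AA thresholds:

  1. Normal text: a contrast ratio of at least 4.5:1.
  2. Large text (18 points or larger, or 14 points or larger if bold): a contrast ratio of at least 3.0:1.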

Since our static analysis does not determine the rendered font size of text elements, we report pass rates against both thresholds. The normal text threshold (4.5:1) is the more stringent of the two and therefore gives the conservative assessment; the large text threshold (3.0:1) shows how results differ when the more lenient standard applies.
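Reporting against both thresholds amounts to two comparisons per pairing and a per-site percentage. The sketch below shows one way to compute these; the function names are hypothetical, and the pass rate is defined here as the share of a site's pairings that meet the chosen threshold.

```python
AA_NORMAL = 4.5  # SC 1.4.3 threshold for normal text
AA_LARGE = 3.0   # SC 1.4.3 threshold for large text


def assess(ratio):
    """Report whether a contrast ratio passes each Level AA threshold."""
    return {
        "normal_text": ratio >= AA_NORMAL,
        "large_text": ratio >= AA_LARGE,
    }


def pass_rate(ratios, threshold=AA_NORMAL):
    """Percentage of ratios meeting the threshold, or None if there are none."""
    if not ratios:
        return None
    passing = sum(1 for r in ratios if r >= threshold)
    return 100 * passing / len(ratios)


if __name__ == "__main__":
    site_ratios = [21.0, 4.0, 2.0, 5.0]
    print(pass_rate(site_ratios), pass_rate(site_ratios, AA_LARGE))
```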

3. Results

3.1 Coverage

Of the 500 domains in our sample, 428 yielded analysable homepage HTML from the CC-MAIN-2026-08 archive. Of these, 240 contained at least one parseable CSS colour declaration. The remaining 188 homepages were retrieved successfully but contained no CSS colour data in their embedded or inline styles (these sites likely rely entirely on external stylesheets or JavaScript-injected styles).

3.2 Overall Compliance

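Across the 240 domains with colour data, we identified 4,327 unique foreground/background colour pairings. Of these, 1,771 (40.9%) failed the 4.5:1 threshold for normal text. The median per-site pass rate was 62.7%, and 49 sites (20.4%) passed on every detected pairing.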

The distribution of per-site pass rates (normal text threshold) was:

Distribution of per-site pass rates for normal text (4.5:1 threshold).

Pass rate range          Domains   Percentage
100% (fully compliant)   49        20.4%
90–99%                   1         0.4%
75–89%                   28        11.7%
50–74%                   99        41.2%
25–49%                   33        13.8%
0–24%                    30        12.5%
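A distribution of this kind can be produced by assigning each site's pass rate to a band. The sketch below assumes that band boundaries are inclusive at the lower edge and that anything below exactly 100% falls outside the "fully compliant" band. The source does not state how boundary values were treated, and the names used are illustrative.

```python
# (label, inclusive lower bound), ordered from highest band to lowest.
BUCKETS = [
    ("100% (fully compliant)", 100.0),
    ("90–99%", 90.0),
    ("75–89%", 75.0),
    ("50–74%", 50.0),
    ("25–49%", 25.0),
    ("0–24%", 0.0),
]


def bucket_for(rate):
    """Return the label of the first band whose lower bound the rate meets."""
    for label, lower in BUCKETS:
        if rate >= lower:
            return label
    raise ValueError(f"pass rate out of range: {rate}")


def distribution(rates):
    """Map each band label to (domain count, percentage of all domains)."""
    counts = {label: 0 for label, _ in BUCKETS}
    for rate in rates:
        counts[bucket_for(rate)] += 1
    total = len(rates)
    return {label: (n, round(100 * n / total, 1))
            for label, n in counts.items()}


if __name__ == "__main__":
    sample = [100.0, 100.0, 95.0, 60.0, 62.7, 10.0, 0.0, 80.0]
    for label, (n, pct) in distribution(sample).items():
        print(f"{label:<24}{n:>3}{pct:>7}%")
```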

3.3 Analysis by Domain Category

We categorised each domain by its primary function (Education, Government, EU Institutions, Technology, News/Media, E-commerce, Hosting/Platform, Open Knowledge, Research, and Other).

Pass rate statistics by domain category, sorted by average pass rate descending.

Category           Domains   Avg pass rate   Median pass rate   Fully compliant
Research           1         72.2%           72.2%              0
EU Institutions    1         66.7%           66.7%              0
E-commerce         5         64.1%           66.7%              0
Education          47        63.7%           64.7%              14
News/Media         10        61.7%           75.0%              1
Other              134       61.5%           64.7%              30
Open Knowledge     7         52.4%           50.0%              0
Government         14        49.7%           54.5%              1
Technology         10        47.7%           50.0%              1
Hosting/Platform   11        44.6%           43.3%              2

3.4 Notable Findings

Worst Offenders

The following domains had the lowest pass rates for normal text contrast:

Domain            Pass rate   Failing pairings   Worst ratio
adelaide.edu.au   0.0%        1                  1.0:1
alberta.ca        0.0%        1                  1.0:1
af.mil            0.0%        1                  1.0:1
copernicus.org    0.0%        1                  4.13:1
github.io         0.0%        1                  4.02:1
gamer.com.tw      0.0%        1                  1.0:1
kit.edu           0.0%        1                  1.0:1
iol.pt            0.0%        1                  3.77:1
mts.ru            0.0%        1                  1.0:1
ncl.ac.uk         0.0%        2                  1.0:1

Fully Compliant Sites

49 domains achieved a 100% pass rate across all detected colour pairings at the 4.5:1 threshold. Among those with the most pairings checked:

Domain             Pairings checked   Mean ratio
desktopnexus.com   6                  8.44:1
hatenablog.com     6                  13.14:1
baidu.com          4                  17.89:1
craigslist.org     4                  12.33:1
fu-berlin.de       4                  11.21:1
tokyo.lg.jp        4                  14.15:1
unt.edu            4                  8.54:1
prnewswire.com     3                  13.57:1
anu.edu.au         2                  19.52:1
google.cn          2                  12.77:1

4. Discussion

4.1 The State of Colour Contrast Compliance

Our findings indicate that colour contrast compliance varies substantially across the web’s most prominent domains. With a median per-site pass rate of 62.7% for normal text, a significant proportion of CSS-declared colour pairings fail to meet the WCAG AA threshold of 4.5:1. However, the interpretation of these numbers requires nuance.

First, not all colour pairings carry equal weight in a user’s experience. A site might have several low-contrast pairings defined in CSS that are rarely or never applied to visible text elements. Our static analysis counts all declared pairings equally, without regard to their prominence or frequency of use on the page.

Second, sites that declare fewer colours in embedded CSS tend to show more extreme pass rates (either very high or very low), whilst sites with many colour declarations tend to cluster around the mean. This is largely an artefact of small denominators: a site with a single detected pairing can only score 0% or 100%, as most of the lowest-scoring domains in Section 3.4 illustrate, whereas sites with rich embedded stylesheets offer more opportunities for both passing and failing pairings.

4.2 Category Differences

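The highest-scoring category was Research, with a mean pass rate of 72.2%, whilst Hosting/Platform had the lowest at 44.6%. The Research and EU Institutions figures each rest on a single domain and should be read with caution. These differences may reflect varying levels of institutional attention to accessibility standards, but the data do not show a simple regulatory effect: Education (63.7%) ranked among the higher categories, whereas Government (49.7%) ranked among the lowest, even though both sectors are commonly subject to accessibility requirements.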

4.3 Implications

These findings highlight that even among the web’s most popular domains, colour contrast barriers remain common. For users with low vision or colour vision deficiencies, these barriers can make content difficult or impossible to read. The prevalence of contrast failures across diverse categories of websites underscores the need for continued advocacy, tooling, and potentially regulation to improve web accessibility.

5. Limitations

This study has several important limitations:

  1. Static analysis only: We parsed CSS declarations present in the archived HTML without executing JavaScript. Modern web applications often inject styles dynamically, use CSS-in-JS libraries, or load styles asynchronously. Our counts are therefore a lower bound on the colour declarations actually in effect on each page.
  2. No external stylesheets: We did not fetch external CSS files (referenced via <link> elements). Many sites define the majority of their styles in external files. Incorporating external stylesheet analysis would require additional index lookups and WARC fetches for each domain, significantly increasing pipeline complexity.
  3. No rendering context: We cannot determine which CSS selectors apply to which rendered elements, what font sizes are in effect, or whether low-contrast elements are actually visible to users. A pairing that fails the contrast test in our analysis might apply only to decorative elements or hidden content.
  4. Homepage bias: We analysed only each domain’s homepage. Internal pages may have different colour schemes, templates, or accessibility characteristics.
  5. Snapshot in time: The CC-MAIN-2026-08 crawl captures a single point in time for each page. Websites may have been redesigned before or after the crawl date.
  6. Assumed defaults: When only a foreground colour was specified without a background, we assumed a white background (#FFFFFF). When only a background colour was specified, we assumed black text (#000000). In practice, inherited styles or user agent defaults may produce different pairings.

6. Reproducibility

The complete analysis pipeline is available as open-source Python code requiring only Python 3.9+ and no external dependencies beyond the standard library. The pipeline consists of four steps:

  1. Columnar Index queries via Amazon Athena to locate homepage captures in CC-MAIN-2026-08
  2. WARC byte-range fetches to retrieve archived HTML
  3. CSS colour extraction and WCAG contrast analysis
  4. Aggregate statistics and report generation

All data is sourced from Common Crawl’s publicly available archives. Researchers can reproduce this analysis by running the pipeline against the same CC-MAIN-2026-08 crawl data, or adapt it to analyse different crawls or domain sets.

7. Conclusion

We conducted a large-scale WCAG 2.1 Level AA colour contrast audit of the 500 most frequently crawled domains in Common Crawl’s CC-MAIN-2026-08 archive, analysing 4,327 unique colour pairings across 240 homepages. Our findings reveal that 79.6% of sites contain at least one colour pairing that fails the 4.5:1 normal text contrast threshold, with a median per-site pass rate of 62.7%.

Whilst static CSS analysis provides only a partial picture of a page’s true accessibility, these results establish a reproducible baseline for tracking colour contrast compliance over time. Future work could extend this analysis to incorporate external stylesheets, JavaScript-rendered styles, and per-element rendering context, building toward a comprehensive picture of colour accessibility on the open web.

Many failing pairings cluster near the WCAG threshold, which suggests that minor colour adjustments could substantially improve compliance without major redesign.
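To illustrate how small such an adjustment can be, the sketch below moves a foreground colour one 8-bit step at a time towards black or white until the pairing reaches 4.5:1. This is a hypothetical remediation helper written for this example, not part of the audit pipeline, and stepping all three channels equally is only one of many possible adjustment strategies.

```python
def _luminance(rgb):
    """WCAG relative luminance of an (R, G, B) tuple of 0-255 ints."""
    def lin(channel):
        c = channel / 255
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (lin(c) for c in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast(rgb_a, rgb_b):
    """WCAG contrast ratio between two colours."""
    la, lb = _luminance(rgb_a), _luminance(rgb_b)
    return (max(la, lb) + 0.05) / (min(la, lb) + 0.05)


def nudge_foreground(fg, bg, target=4.5):
    """Step fg away from bg in luminance until the target ratio is met.

    Returns the adjusted colour, or None if the target cannot be
    reached in that direction.
    """
    darken = _luminance(fg) <= _luminance(bg)
    current = fg
    while contrast(current, bg) < target:
        if darken:
            stepped = tuple(max(0, c - 1) for c in current)
        else:
            stepped = tuple(min(255, c + 1) for c in current)
        if stepped == current:
            return None  # already at pure black or white
        current = stepped
    return current


if __name__ == "__main__":
    # #777777 on white narrowly fails; one step darker passes.
    print(nudge_foreground((0x77, 0x77, 0x77), (255, 255, 255)))
```

In the example, #777777 on white falls just short of 4.5:1, and a single step to #767676 is enough to pass, which is the kind of near-threshold case the paragraph above describes.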

This analysis indicates that meaningful accessibility research can be performed directly on crawl archives, without re-crawling or executing live web content.

Future work includes longitudinal comparison across crawl releases and integration of rendered-style analysis to compare declared and computed accessibility properties.

References

1. W3C. (2018). Web Content Accessibility Guidelines (WCAG) 2.1. https://www.w3.org/TR/WCAG21/

2. W3C. (2023). Web Content Accessibility Guidelines (WCAG) 2.2. https://www.w3.org/TR/WCAG22/

3. Common Crawl Foundation. Common Crawl. https://commoncrawl.org

4. Common Crawl. CC-MAIN-2026-08 Crawl Statistics. https://commoncrawl.github.io/cc-crawl-statistics/

5. Common Crawl. Columnar Index documentation. https://commoncrawl.org/columnar-index

6. W3C. (2018). Understanding SC 1.4.3: Contrast (Minimum). https://www.w3.org/WAI/WCAG21/Understanding/contrast-minimum.html