Beyond E2E Testing Part 1

Series: Beyond E2E Testing
Post: 1 of 4

This post is adapted from a talk I gave at the 2026 Open edX Conference in Salt Lake City, hosted by WGU. If you'd prefer the slide format, the presentation is here.

TL;DR: Your test suite can be green and your UI can still be broken. Visual regression testing catches the gap: layout shifts, color changes, clipping, font regressions. It compares screenshots pixel by pixel against a known-good baseline. The tooling is lightweight: pixelmatch and pngjs on top of Playwright. Most of the work is in stabilization, not comparison. For larger projects, store baselines in S3 or another object storage bucket instead of git. Do it right and reviewers get baseline, current, and diff images attached to every failing CI run without pulling the branch.

Introduction

We shipped a design system update. A shared CSS variable — --font-size-base — moved from 16px to 14px. The intent was to tighten up the UI on a specific new component. The variable was more widely used than anyone realized.

Everything passed. The test suite was green, the PR was approved, the deploy went out. Typography shrank across the entire application. Not catastrophically — two pixels doesn't sound like much — but enough that the visual hierarchy of every page was off. Headings that were supposed to dominate a section now competed with body text. Sidebar labels that were intentionally small became illegible.

No functional test caught it because nothing stopped working. No accessibility scan caught it because WCAG doesn't mandate a specific base font size — it mandates that text can be resized, which it still could be. The only way to catch it was to look at the page and compare it to what it looked like before.

That's the class of bug visual regression testing exists to catch: automated comparison of screenshots against a known-good baseline.

What Is Visual Regression Testing?

Take a screenshot when the UI looks the way it's supposed to. Store it. On every subsequent run, take another screenshot and compare the two pixel by pixel. If any pixels differ by more than a configurable threshold, fail the test and write a diff image showing exactly where. That's it.

Visual regression testing works alongside functional tests, not instead of them. Functional tests tell you whether a component rendered without throwing. Visual regression tests tell you it also rendered in the right place, with the right colors, at the right size.

What it catches that functional tests won't:

Layout shifts (an element moved because a sibling changed size)
Color changes from a CSS variable update that affected more than intended
Component overlap or clipping from a z-index or overflow change
Font sizing regressions from a global style rule
Stacking context issues that only appear at certain viewport widths

What it doesn't catch, and shouldn't try to, is logic bugs, data correctness, or user flow breakage. If a form submits to the wrong endpoint, a screenshot won't tell you. Know what it's for.

Why Visual Regression Testing?

Visual bugs reach production because none of the human-in-the-loop steps catch them reliably. Code review is fast and textual; reviewers read diffs, not rendered pages. Manual QA is inconsistent; the same person might catch a four-pixel layout shift on Monday and miss it on Friday. Screenshots in PRs are better than nothing, but they're optional, they're not compared against anything, and they only show what the author thought to screenshot.

Automation doesn't get tired. It compares the same pixels in the same order every time. When something changes, it fails loudly and shows you exactly what changed.

The ROI is highest where a single change can cascade visually across many surfaces: shared CSS variables, design tokens, component libraries. One line touching --color-primary can affect dozens of components. A visual test suite catches the unintended ones immediately instead of at the next audit.

How Pixel Comparison Works

Capture a screenshot in a known-good state. Store it. On every subsequent run, capture the current state and run pixelmatch. It walks both images pixel by pixel, measures the perceptual color distance at each position using YIQ color space (so an equal-brightness hue swap registers as a difference, not just a luminance shift), and counts how many pixels exceeded your threshold.¹ If that count is above zero, fail the test and write a diff image. Unchanged pixels appear gray; changed pixels appear red.

pixelmatch does the comparison work. It's fast, has no native dependencies, and returns a count of differing pixels. You control the output: which color means "changed," how sensitive the threshold is.

pngjs handles reading and writing PNG buffers in Node.² Together they're the only dependencies you need beyond Playwright itself.
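To make the comparison step concrete, here is a simplified sketch of a YIQ-weighted pixel comparison on raw RGBA buffers. It is not pixelmatch's implementation: it skips alpha blending and anti-aliasing detection, and the channel weights and the normalization constant are assumptions chosen for illustration. The names colorDelta and countDiffPixels are hypothetical.

```typescript
// Simplified sketch of a YIQ-weighted comparison. Not pixelmatch's code:
// no alpha blending, no anti-aliasing detection.
type Rgba = Uint8Array; // 4 bytes per pixel: R, G, B, A

// Standard NTSC RGB -> YIQ coefficients, rounded.
const toY = (r: number, g: number, b: number): number => 0.299 * r + 0.587 * g + 0.114 * b;
const toI = (r: number, g: number, b: number): number => 0.596 * r - 0.274 * g - 0.322 * b;
const toQ = (r: number, g: number, b: number): number => 0.211 * r - 0.523 * g + 0.312 * b;

// Weighted squared distance between the pixels at byte offset `pos`.
// The three weights are an assumption for this sketch.
function colorDelta(a: Rgba, b: Rgba, pos: number): number {
  const dy = toY(a[pos], a[pos + 1], a[pos + 2]) - toY(b[pos], b[pos + 1], b[pos + 2]);
  const di = toI(a[pos], a[pos + 1], a[pos + 2]) - toI(b[pos], b[pos + 1], b[pos + 2]);
  const dq = toQ(a[pos], a[pos + 1], a[pos + 2]) - toQ(b[pos], b[pos + 1], b[pos + 2]);
  return 0.5053 * dy * dy + 0.299 * di * di + 0.1957 * dq * dq;
}

// Assumed rough upper bound of colorDelta (about pure red against cyan),
// used to scale a 0..1 threshold into the metric's range.
const MAX_DELTA = 35215;

// Counts pixels whose color distance exceeds the threshold.
function countDiffPixels(
  baseline: Rgba,
  current: Rgba,
  width: number,
  height: number,
  threshold = 0.1,
): number {
  const byteLength = width * height * 4;
  if (baseline.length !== byteLength || current.length !== byteLength) {
    throw new Error('Both buffers must be width * height * 4 bytes');
  }
  const limit = MAX_DELTA * threshold * threshold;
  let count = 0;
  for (let pos = 0; pos < byteLength; pos += 4) {
    if (colorDelta(baseline, current, pos) > limit) count++;
  }
  return count;
}
```

Because the I and Q channels carry hue, swapping red for a green of roughly the same brightness still counts as a difference, while a one-step gray shift stays under the limit.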

Doing It Yourself with Playwright

What Playwright gives you out of the box

Playwright ships with expect(page).toHaveScreenshot().³ The first run writes the reference snapshot (and reports a failure, since there was nothing to compare against yet); every run after that compares against it. For a lot of teams that's enough, and you should start there.

The limitation is control. Playwright generates the diff for you, so you don't decide how differences are rendered or where the images are stored, and the comparison options are limited to what the matcher exposes. If you want to choose how reviewers see baseline, current, and diff side by side without pulling the branch, or if you want to tune exactly how differences are rendered, you need to wire it up yourself.

The pipeline

The diff image uses pixelmatch's color scheme: unchanged pixels are gray, changed pixels are colored by diffColor or diffColorAlt. When diffColorAlt is set, pixelmatch uses it for dark-on-light differences (a dark pixel where the baseline was light — content appeared) and diffColor for the reverse, giving the diff a rough sense of what was added versus removed at a glance.¹

Note: pixelmatch is an ES module. In a CommonJS Playwright config you need a dynamic import: const pixelmatch = (await import('pixelmatch')).default.
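A minimal sketch of the comparison step of such a pipeline. To keep it standalone, the diff function is injected: PixelDiffFn mirrors the pixelmatch(img1, img2, output, width, height[, options]) signature from the footnotes, and RawImage mirrors the width, height, and data fields of a decoded pngjs image. The names compareScreenshots, RawImage, and CompareResult, and the default colors, are assumptions for illustration.

```typescript
type Color = [number, number, number];

// Subset of pixelmatch's options used in this sketch.
interface DiffOptions {
  threshold?: number;
  diffColor?: Color;
  diffColorAlt?: Color;
}

// Mirrors pixelmatch(img1, img2, output, width, height[, options]).
type PixelDiffFn = (
  img1: Uint8Array,
  img2: Uint8Array,
  output: Uint8Array | null,
  width: number,
  height: number,
  options?: DiffOptions,
) => number;

// Same shape as a decoded PNG: raw RGBA bytes plus dimensions.
interface RawImage {
  width: number;
  height: number;
  data: Uint8Array;
}

interface CompareResult {
  passed: boolean;
  diffPixels: number;
  diff: RawImage;
}

// Illustrative defaults: red for one direction of change, green for the other.
const DEFAULT_OPTIONS: DiffOptions = {
  threshold: 0.1,
  diffColor: [255, 0, 0],
  diffColorAlt: [0, 200, 0],
};

function compareScreenshots(
  baseline: RawImage,
  current: RawImage,
  diffFn: PixelDiffFn,
  options: DiffOptions = DEFAULT_OPTIONS,
): CompareResult {
  if (baseline.width !== current.width || baseline.height !== current.height) {
    throw new Error('Images must have identical dimensions before comparing');
  }
  const { width, height } = baseline;
  // The diff function paints into this buffer and returns the mismatch count.
  const diff: RawImage = { width, height, data: new Uint8Array(width * height * 4) };
  const diffPixels = diffFn(baseline.data, current.data, diff.data, width, height, options);
  return { passed: diffPixels === 0, diffPixels, diff };
}
```

With the real libraries, the two inputs would come from PNG.sync.read, pixelmatch would be passed as diffFn (possibly with a type cast), and the diff would be encoded with PNG.sync.write.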

Handling Dimension Mismatches

When the page grows or shrinks (a new section was added, a component changed size), the baseline and current screenshots have different dimensions. Rather than throwing on a size mismatch, pad both images to the same dimensions before comparing.

With padding in place, a layout that grew shows as a large red area on the diff rather than a test crash with an unhelpful error.
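A minimal sketch of the padding step on raw RGBA buffers. With pngjs you would more likely copy pixels with the PNG.bitblt helper mentioned in the footnotes; plain typed arrays are used here so the example stands alone. The names RawImage, padTo, and padToCommonSize are hypothetical, and the transparent-black fill is an assumption.

```typescript
interface RawImage {
  width: number;
  height: number;
  data: Uint8Array; // RGBA, 4 bytes per pixel
}

// Copies `img` into the top-left corner of a larger canvas.
function padTo(img: RawImage, width: number, height: number): RawImage {
  if (width < img.width || height < img.height) {
    throw new Error('Target size must not be smaller than the image');
  }
  // New canvas is zero-filled, i.e. transparent black.
  const data = new Uint8Array(width * height * 4);
  for (let y = 0; y < img.height; y++) {
    const srcStart = y * img.width * 4;
    const row = img.data.subarray(srcStart, srcStart + img.width * 4);
    data.set(row, y * width * 4); // each source row lands at the start of the wider row
  }
  return { width, height, data };
}

// Pads both images to the larger width and the larger height of the pair.
function padToCommonSize(a: RawImage, b: RawImage): [RawImage, RawImage] {
  const width = Math.max(a.width, b.width);
  const height = Math.max(a.height, b.height);
  return [padTo(a, width, height), padTo(b, width, height)];
}
```

Anchoring both images at the top-left keeps unchanged content aligned, so only the region that actually grew or shrank lights up in the diff.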

Making Tests Stable

This is where most of the real work is. An unstable visual regression test is worse than no test because it trains people to ignore failures.

The stabilization sequence

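Before every screenshot:

1. Wait for the network to go idle with waitForLoadState('networkidle').
2. Wait until every image in document.images has finished loading.
3. Wait for document.fonts.ready.
4. Kill animations and transitions with an injected style tag.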

Steps 2 and 3 are easy to miss. waitForLoadState('networkidle') doesn't guarantee images are decoded and painted.⁴ document.fonts.ready prevents font flash from causing diffs on text-heavy pages.⁵

Killing animations via addStyleTag is more reliable than Playwright's animations: 'disabled' screenshot option alone.⁶ It also catches transitions triggered by JavaScript after the screenshot call begins.
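The sequence described here can be sketched as follows. To keep the example standalone, PageLike is a hypothetical structural slice of Playwright's Page (waitForLoadState, evaluate, and addStyleTag are real Page methods), and the in-page checks are passed as string expressions so the sketch needs no DOM typings. The helper name stabilize and the exact CSS are assumptions.

```typescript
// Hypothetical minimal slice of Playwright's Page used by this sketch.
interface PageLike {
  waitForLoadState(state: 'load' | 'domcontentloaded' | 'networkidle'): Promise<void>;
  evaluate(expression: string): Promise<unknown>;
  addStyleTag(options: { content: string }): Promise<unknown>;
}

// Resolves once every image that was still loading has fired load or error.
const WAIT_FOR_IMAGES = `Promise.all(
  Array.from(document.images)
    .filter((img) => !img.complete)
    .map((img) => new Promise((resolve) => {
      img.addEventListener('load', resolve, { once: true });
      img.addEventListener('error', resolve, { once: true });
    }))
).then(() => true)`;

// Resolves once font loading has settled.
const WAIT_FOR_FONTS = 'document.fonts.ready.then(() => true)';

// Turns off CSS animations and transitions everywhere on the page.
const FREEZE_MOTION_CSS = `*, *::before, *::after {
  animation: none !important;
  transition: none !important;
}`;

async function stabilize(page: PageLike): Promise<void> {
  await page.waitForLoadState('networkidle'); // network quiet
  await page.evaluate(WAIT_FOR_IMAGES); // images finished loading
  await page.evaluate(WAIT_FOR_FONTS); // fonts ready
  await page.addStyleTag({ content: FREEZE_MOTION_CSS }); // no motion
}
```

Call it immediately before capturing the screenshot, so nothing has a chance to start moving again in between.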

Masking dynamic content

Two approaches, depending on what you want to test:

Hide (opacity zero): the element disappears visually but still occupies layout space. Use this when the content changes but you still want to verify the surrounding layout.

Mask (gray fill on both images): the region is filled with a solid neutral color in both the baseline and current buffers before pixelmatch runs. The area is excluded from comparison entirely: both content and layout.

By applying the mask to both images, pixelmatch sees identical gray in both and reports zero difference for that region.
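A minimal sketch of the mask approach on raw RGBA buffers. The names RawImage, Rect, maskRegion, and maskBoth are hypothetical, and the gray value 128 is an arbitrary choice.

```typescript
interface RawImage {
  width: number;
  height: number;
  data: Uint8Array; // RGBA, 4 bytes per pixel
}

interface Rect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Fills one rectangle with opaque gray, clamped to the image bounds.
function maskRegion(img: RawImage, rect: Rect, gray = 128): void {
  const x0 = Math.max(0, Math.floor(rect.x));
  const y0 = Math.max(0, Math.floor(rect.y));
  const x1 = Math.min(img.width, Math.ceil(rect.x + rect.width));
  const y1 = Math.min(img.height, Math.ceil(rect.y + rect.height));
  for (let y = y0; y < y1; y++) {
    for (let x = x0; x < x1; x++) {
      const pos = (y * img.width + x) * 4;
      img.data[pos] = gray;
      img.data[pos + 1] = gray;
      img.data[pos + 2] = gray;
      img.data[pos + 3] = 255;
    }
  }
}

// Applies the same masks to both images so the regions compare as equal.
function maskBoth(baseline: RawImage, current: RawImage, rects: Rect[]): void {
  for (const rect of rects) {
    maskRegion(baseline, rect);
    maskRegion(current, rect);
  }
}
```

The rectangles could come from a Playwright locator's boundingBox(), which reports CSS pixels, so scale them first if the screenshot uses a device scale factor other than 1. The hide approach needs no buffer work at all: inject a style rule that sets opacity to zero on the element before capturing.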

Other sources of flakiness

Problem	Solution
Fonts rendering differently across OS	Pin to a single browser/OS in CI
Viewport inconsistency	Set explicit `viewport` in playwright config
Images that haven't loaded	The `document.images` check above

Baseline Management

First run always passes. There's nothing to compare against, so it saves the current state as the baseline. Every run after that is a comparison.

Treat baselines as the source of truth, the same way you treat test fixtures. They belong in version control. When you make an intentional UI change: delete the old baseline, run the test to regenerate it, and commit the new one in the same PR as the code change. This makes the visual change reviewable: a reviewer sees "baseline updated" in the file list, checks the image diff, and confirms it was intentional. The history is tied to the commit that caused it. No archaeology required.
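The lifecycle above (first run saves, later runs compare, deleting a baseline regenerates it) can be sketched as a small decision function. BaselineStore, Outcome, and checkAgainstBaseline are hypothetical names, and the comparison itself is injected so the sketch stays independent of any image library.

```typescript
// Hypothetical storage abstraction: a directory on disk, a bucket, or a Map.
interface BaselineStore {
  read(name: string): Uint8Array | undefined;
  write(name: string, png: Uint8Array): void;
}

type Outcome = 'baseline-created' | 'match' | 'mismatch';

function checkAgainstBaseline(
  name: string,
  current: Uint8Array,
  store: BaselineStore,
  isSame: (baseline: Uint8Array, current: Uint8Array) => boolean,
): Outcome {
  const baseline = store.read(name);
  if (baseline === undefined) {
    // Nothing to compare against: the current state becomes the baseline.
    store.write(name, current);
    return 'baseline-created';
  }
  // An existing baseline is never overwritten here. Updating it is a
  // deliberate act: delete it, rerun, commit the regenerated file.
  return isSame(baseline, current) ? 'match' : 'mismatch';
}
```

Keeping the overwrite out of the comparison path is the design choice that matters: a failing run can never quietly replace the image it failed against.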

Where to Store Baselines

Git is the obvious first answer. It's already there, it's versioned, and diffs show up in PRs. For small projects with a handful of pages, it works fine.

The problem is scale. PNG screenshots are binary files. Git does delta-compress objects in packfiles, but already-compressed PNGs delta poorly — in practice each version costs nearly its full size. A project with 50 tested pages, 3 viewports, and a year of active development will accumulate hundreds of megabytes in .git/objects. That's before CI starts cloning the repo hundreds of times a week. Git LFS helps with checkout performance but doesn't reduce storage costs much, and it introduces another system to manage.
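A back-of-the-envelope version of that estimate, assuming each revision costs its full size as argued above. The average PNG size (200 KB) and the number of times each baseline is regenerated per year (10) are made-up inputs for illustration, not measurements.

```typescript
// Rough repository growth from baseline churn, in megabytes (1 MB = 1000 KB).
// Assumes every regenerated baseline is stored at full size.
function estimateBaselineHistoryMB(
  pages: number,
  viewports: number,
  avgPngKB: number,
  revisionsPerYear: number,
): number {
  return (pages * viewports * avgPngKB * revisionsPerYear) / 1000;
}

// 50 pages x 3 viewports x 200 KB x 10 revisions = 300 MB per year.
const yearlyGrowthMB = estimateBaselineHistoryMB(50, 3, 200, 10);
```

Plug in your own screenshot sizes and churn rate; full-page captures of long pages can be far larger than the figure assumed here.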

A better model for larger projects: keep only the current baseline in git (or skip git entirely) and store all baseline images, current and historical, in an object storage bucket.

s3://your-bucket/
  visual-baselines/
    main/                     ← current baselines, keyed by branch
      account-page.png
      dashboard.png
    history/
      2024-03-15/             ← date-stamped snapshots
        account-page.png
        dashboard.png
      2024-06-01/
        account-page.png
        dashboard.png

The CI pipeline pulls the baseline for the current branch at test time and pushes a new one when it passes.
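A minimal sketch of the pull and push steps against the bucket layout shown above. ObjectStore is a hypothetical abstraction so the example stands alone; a real implementation would wrap S3Client with GetObjectCommand and PutObjectCommand from @aws-sdk/client-s3. The helper names and the choice to archive a dated copy on every push are assumptions.

```typescript
// Hypothetical wrapper around whichever object storage client you use.
interface ObjectStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, body: Uint8Array): Promise<void>;
}

// Current baselines are keyed by branch.
const currentKey = (branch: string, name: string): string =>
  `visual-baselines/${branch}/${name}.png`;

// Historical snapshots are keyed by UTC date (YYYY-MM-DD).
const historyKey = (date: Date, name: string): string =>
  `visual-baselines/history/${date.toISOString().slice(0, 10)}/${name}.png`;

// Returns undefined when the branch has no baseline yet.
async function pullBaseline(
  store: ObjectStore,
  branch: string,
  name: string,
): Promise<Uint8Array | undefined> {
  return store.get(currentKey(branch, name));
}

// Call only after the visual test has passed.
async function pushBaseline(
  store: ObjectStore,
  branch: string,
  name: string,
  png: Uint8Array,
  now: Date = new Date(),
): Promise<void> {
  await store.put(currentKey(branch, name), png);
  await store.put(historyKey(now, name), png);
}
```

Branch names containing slashes simply become nested prefixes in the bucket, which most S3-compatible stores handle without extra work.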

AWS S3 is the most common choice, but the same pattern works with Google Cloud Storage, Azure Blob Storage, Cloudflare R2, or any S3-compatible provider.⁷ R2 in particular has no egress fees, which matters when CI pulls baselines on every run.⁸

Historical Baselines: A Timelapse of Your UI

This is outside the normal SDLC use case, but it's genuinely interesting.

If you archive a dated snapshot of every baseline on each passing CI run, you end up with a complete visual history of your application. One screenshot per page per day, or per merge, or per release. String those together and you have a timelapse of how your UI evolved over time.

You can then generate a GIF or video from the historical frames.
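The encoding itself is a job for an external tool such as ffmpeg and is not shown here. What the pipeline has to supply is the frames in order, which can be sketched like this, assuming the history/ key layout from the storage section. The function name framesFor is hypothetical.

```typescript
// Picks the archived snapshots of one page and orders them oldest first.
// Assumes keys shaped like visual-baselines/history/YYYY-MM-DD/<page>.png.
function framesFor(page: string, keys: string[]): string[] {
  const pattern = /^visual-baselines\/history\/(\d{4}-\d{2}-\d{2})\/(.+)\.png$/;
  return keys
    .filter((key) => pattern.exec(key)?.[2] === page)
    .sort(); // ISO dates sort chronologically as plain strings
}
```

Download the returned keys in that order, number the files sequentially, and hand the folder to the encoder.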

Practically speaking, this is more useful for documentation and demos than for catching bugs. You already know when your UI changed because you changed it. But there are real scenarios where it pays off: auditing a long-running product to understand how a design system migration progressed, demonstrating the scope of a UI overhaul to stakeholders, or identifying when a visual regression was silently introduced by reviewing the archive rather than bisecting git history.

CI Integration

When a test fails, the reviewer shouldn't need to pull the branch to understand what changed. Attach all three images (baseline, current, and diff) directly to the Playwright test report.
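A minimal sketch of the attachment step. TestInfoLike is a hypothetical structural slice of Playwright's TestInfo, whose attach(name, { body, contentType }) method is real; the helper name and the file naming scheme are assumptions.

```typescript
// Hypothetical minimal slice of Playwright's TestInfo used by this sketch.
interface TestInfoLike {
  attach(name: string, options: { body: Uint8Array; contentType: string }): Promise<void>;
}

interface VisualEvidence {
  baseline: Uint8Array;
  current: Uint8Array;
  diff: Uint8Array;
}

// Attaches the three encoded PNGs to the report in a fixed order.
// With the real TestInfo, pass Buffers (for example from PNG.sync.write).
async function attachVisualEvidence(
  testInfo: TestInfoLike,
  name: string,
  images: VisualEvidence,
): Promise<void> {
  const order = ['baseline', 'current', 'diff'] as const;
  for (const label of order) {
    await testInfo.attach(`${name}-${label}.png`, {
      body: images[label],
      contentType: 'image/png',
    });
  }
}
```

Call it just before failing the test, so the HTML report shows all three images next to the failure.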

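On the GitHub Actions step that uploads the report as an artifact (actions/upload-artifact), the if: always() condition is not optional. Without it, the upload step is skipped when the test step fails, and the evidence disappears with the job.⁹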

Tradeoffs

Not every page is worth testing this way. Feeds, dashboards with real-time data, anything that reflects live user state: you'll spend more time suppressing false positives than catching real regressions. Don't add visual tests to pages like that and then wonder why the suite is flaky.

The best ROI is on surfaces that are supposed to be stable but break visibly when they're not: settings pages, auth flows, onboarding, anything downstream of shared CSS or a component library. One visual test on a shared <Button> is worth more than ten tests on a page full of dynamic data.

Threshold tuning is ongoing. pixelmatch's threshold option runs from 0 to 1, and 0.1 is a reasonable starting point that catches real changes without failing on sub-pixel anti-aliasing noise. Go lower and you'll start failing on text edge rendering differences between machines. You'll probably end up tuning per-surface as you learn which pages are noisy.

Baseline storage grows. Organize by browser and file path (baselines/chromium/auth/login.png) from day one, not after you have 200 files named screenshot.png in a flat directory.

Conclusion

Your test suite can be green and your UI can still be broken. Functional tests verify behavior, not appearance. Screenshots in PRs don't get compared against anything. Visual regression testing closes that gap.

The comparison itself is easy. The actual work is stabilization: getting the page into a quiet, deterministic state before the shutter fires. An unstable visual test is worse than no test because it trains people to ignore failures.

Get the stabilization right and the rest is straightforward. Baselines under version control (or in object storage once the project outgrows git), artifacts attached to CI runs, reviewers with everything they need to evaluate a failure without pulling the branch. When a change is intentional, the updated baseline goes in the same PR as the code, visible and reviewable, tied to the commit that caused it.

Next up: accessibility testing with Axe Core.

Footnotes

1. pixelmatch (mapbox): the comparison engine, its pixelmatch(img1, img2, output, width, height[, options]) signature, and options (threshold, diffColor, diffColorAlt, aaColor, includeAA, alpha). Uses a YIQ perceptual color-difference metric and detects anti-aliased pixels. https://github.com/mapbox/pixelmatch
2. pngjs: pure-JavaScript PNG decoder/encoder; PNG.sync.read / PNG.sync.write and the PNG.bitblt(src, dst, srcX, srcY, width, height, deltaX, deltaY) helper. https://github.com/pngjs/pngjs
3. Playwright, "Visual comparisons": expect(page).toHaveScreenshot() and snapshot management. https://playwright.dev/docs/test-snapshots
4. Playwright, page.waitForLoadState(): load-state values including networkidle. https://playwright.dev/docs/api/class-page#page-wait-for-load-state
5. MDN, FontFaceSet.ready (document.fonts.ready): resolves once font loading and layout settle. https://developer.mozilla.org/en-US/docs/Web/API/FontFaceSet/ready
6. Playwright, "Screenshots" / page.screenshot(): including the animations: 'disabled' option. https://playwright.dev/docs/screenshots
7. AWS SDK for JavaScript v3, @aws-sdk/client-s3: S3Client, GetObjectCommand, PutObjectCommand, and the streamed response body's transformToByteArray(). https://www.npmjs.com/package/@aws-sdk/client-s3
8. Cloudflare R2 pricing: zero egress (data-transfer-out) fees. https://developers.cloudflare.com/r2/pricing/
9. actions/upload-artifact (v4): uploading workflow artifacts; v3 was deprecated in 2025. https://github.com/actions/upload-artifact