The Web That Used to Be Simple
Until the early 2010s, most websites worked in a straightforward way. A visitor typed a URL into their browser. The browser sent an HTTP request to a web server. The server – running PHP, ASP.NET, Ruby on Rails, or a similar technology – assembled a complete HTML page and sent it back. The browser received a fully formed document with all the text, images, and layout instructions already present in the HTML source code.
Archiving these websites was relatively simple. A web crawler could request each URL, receive the complete HTML response, download the associated stylesheets and images, and store everything in a WARC file. The archived page looked essentially identical to the live page because the content existed entirely in the server’s response. The challenge was primarily one of scale and completeness – making sure you followed every link, captured every resource, and stored everything correctly.
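The crawl loop described above can be sketched in a few lines. This is a simplified illustration, not any particular crawler's implementation: the function names and the queue-based structure are my own, and the fetch step is injected so the sketch stays network-free (a real crawler would fetch over HTTP and write WARC records).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in a server-rendered page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl: `fetch(url)` returns the HTML for a URL.
    Works only because every link and all content are present in the
    server's response -- the assumption that SPAs later broke."""
    seen, queue, pages = set(), [start_url], {}
    while queue and len(pages) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        pages[url] = html            # a real crawler would store a WARC record
        for link in extract_links(html, url):
            if link not in seen:
                queue.append(link)
    return pages
```

Because the server's response contains the full document, following `<a href>` links really does discover the whole site.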
This era of server-rendered HTML was the foundation on which early web archiving technology was built. Tools like Heritrix (developed by the Internet Archive), HTTrack, and wget could traverse a website’s link structure, download pages, and produce usable archives. The technology worked because the web was built on a simple model: the server does the work, the browser displays the result.
Then JavaScript frameworks arrived, and everything changed.
The Client-Side Revolution
The transformation began in the early 2010s and accelerated rapidly. AngularJS appeared in 2010. React was released by Facebook in 2013. Vue.js followed in 2014. Within just a few years, these frameworks and their successors fundamentally changed how enterprise websites were built.
The new model inverted the traditional approach. Instead of the server generating complete HTML pages, the server now sends a minimal HTML document – often containing little more than a single empty <div> element and a reference to a JavaScript bundle. The JavaScript code executes in the visitor’s browser, makes API calls to fetch data, and dynamically constructs the page content in real time.
Here is what the HTML source of a modern React application often looks like:
<!DOCTYPE html>
<html>
  <head>
    <title>Our Company</title>
    <script src="/static/js/bundle.js"></script>
  </head>
  <body>
    <div id="root"></div>
  </body>
</html>
That is it. The entire visible content of the website – the navigation, the text, the images, the interactive elements – is generated by JavaScript after the page loads. If you view the source code, you see an empty page. The content exists only in the moment, assembled by code running in the browser.
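The emptiness is easy to demonstrate: parse the shell above and extract the text a visitor would see inside the body, and you get nothing at all. The small helper below is my own illustration, not part of any archiving tool.

```python
from html.parser import HTMLParser

SHELL = """<!DOCTYPE html>
<html>
<head>
<title>Our Company</title>
<script src="/static/js/bundle.js"></script>
</head>
<body>
<div id="root"></div>
</body>
</html>"""

class BodyText(HTMLParser):
    """Accumulate the visible text inside <body>."""
    def __init__(self):
        super().__init__()
        self.in_body = False
        self.text = []
    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
    def handle_data(self, data):
        if self.in_body:
            self.text.append(data)

def visible_text(html):
    parser = BodyText()
    parser.feed(html)
    return "".join(parser.text).strip()

# The SPA shell contains no visible content whatsoever.
print(repr(visible_text(SHELL)))  # → ''
```

This is exactly what a traditional crawler stores: a document whose body text is empty.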
Why Traditional Crawlers See Nothing
For web archiving, the implications were profound. A traditional HTTP crawler does exactly what its name suggests: it sends HTTP requests and stores the responses. When it requests a URL from a modern single-page application, it receives the minimal HTML shell shown above. It stores that shell faithfully in a WARC file. And when someone later tries to view the archive, they see an empty page.
The crawler did its job correctly. It captured exactly what the server returned. The problem is that what the server returned is not the website. The website only comes into existence when JavaScript executes in a browser environment.
This was not an edge case affecting a handful of experimental websites. By the mid-2010s, major enterprises were rebuilding their web presences with these frameworks. Corporate sites, e-commerce platforms, financial service portals, news organisations, and government agencies all adopted client-side rendering. The entire web was moving toward a model that traditional archiving tools could not handle.
Single-Page Applications: An Even Deeper Challenge
JavaScript frameworks did not just change how individual pages render. They changed the concept of a “page” itself.
In a traditional website, every page has a unique URL, and navigating to that URL triggers a full page load from the server. In a single-page application (SPA), the initial page load is the only full load. After that, navigation happens entirely within the browser. When a user clicks a link, JavaScript intercepts the click, fetches new data via API calls, and updates the visible content – all without the browser requesting a new page from the server.
The URL in the browser’s address bar may change (using the History API), giving the appearance of traditional navigation. But from a network perspective, no new page was requested. For a web crawler that discovers pages by following links and requesting URLs, an SPA can appear to be a single page with a single URL, even if it contains hundreds of distinct content sections.
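To make the mismatch concrete: a recorded browsing session of an SPA might contain a single full page load followed by many History API route changes, none of which request a new page. The log format below is an assumption made for illustration, not the output of any real tool.

```python
def route_summary(navigation_log):
    """Count full network page loads vs. client-side route changes.

    `navigation_log` is an assumed format: a list of (event, url) pairs,
    where event is 'load' (a full HTTP navigation the server sees) or
    'pushstate' (a History API change with no new page request).
    """
    loads = {url for event, url in navigation_log if event == "load"}
    routes = {url for event, url in navigation_log if event == "pushstate"}
    return len(loads), len(routes)

session = [
    ("load", "https://app.example/"),                # the only real page load
    ("pushstate", "https://app.example/products"),
    ("pushstate", "https://app.example/products/42"),
    ("pushstate", "https://app.example/about"),
]
print(route_summary(session))  # → (1, 3)
```

A crawler watching only network traffic sees one page; a browser-based system watching the History API sees four distinct pieces of content.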
This meant that archiving a modern website required not just downloading pages, but simulating user interaction: clicking navigation elements, waiting for content to load, scrolling to trigger lazy-loaded resources, and tracking client-side route changes.
The Rise of Browser-Based Archiving
The web archiving industry responded to this challenge by fundamentally rethinking how websites are captured. The answer was browser-based archiving: instead of using lightweight HTTP crawlers, archiving systems began using full browser engines to render websites exactly as a human visitor would experience them.
Browser-based archiving works by controlling an actual web browser – typically a headless (no visible window) instance of Chromium or Firefox. The archiving system navigates to each URL, allows the JavaScript to execute, waits for the page to fully render, and then captures the complete state of the page including all dynamically generated content.
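In outline, browser-based capture is a render-then-serialise step. The sketch below hides the browser behind a minimal driver interface (the goto/wait/content method names are my own); in practice the driver would wrap a real headless engine through a tool such as Playwright or Puppeteer.

```python
def capture_rendered(browser, url, settle_ms=2000):
    """Render `url` in a real browser engine and return the post-JavaScript DOM.

    `browser` is any driver exposing goto(), wait(ms) and content()
    (an assumed interface for this sketch). The crucial difference from
    an HTTP crawler: content() serialises the live DOM after scripts have
    executed, not the raw bytes the server sent.
    """
    browser.goto(url)         # full page load; JavaScript starts executing
    browser.wait(settle_ms)   # let API calls return and components render
    return browser.content()  # the HTML a human visitor would actually see
```

The fixed settle delay is a simplification; production systems watch network activity and DOM mutations to decide when a page has finished rendering.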
This approach solves the fundamental problem of client-side rendering. Because the archiving system runs the same browser engine that real visitors use, it sees the same content. React components render. Vue applications initialise. Angular modules load and display data. The archive captures what was actually visible, not just the raw HTML shell.
Ongoing Complexity
If browser-based archiving solved the JavaScript problem completely, this article could end here. But the web continues to evolve, and each new technique that makes websites more dynamic also makes them harder to archive.
Lazy Loading
Modern websites defer the loading of images and content until the user scrolls to them. An image below the fold is typically not even requested from the network until the viewport approaches it. An archiving system must simulate scrolling behaviour – automatically scrolling through the entire page, pausing to let content load at each position, and capturing resources as they appear.
Infinite Scroll
Some pages have no defined end. Social feeds, product listings, and news streams load additional content indefinitely as the user scrolls. An archiving system must determine how much content to capture and establish reasonable boundaries for what constitutes a complete capture.
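Both behaviours can be handled with the same pattern: scroll to the bottom, wait, re-measure, and stop when the page stops growing (lazy loading exhausted) or when a hard cap is reached (the boundary for an infinite feed). A sketch against the same kind of abstract driver; the method names are assumptions.

```python
def scroll_to_bottom(page, max_rounds=50):
    """Scroll until page height stabilises or `max_rounds` is hit.

    `page` exposes scroll_height(), scroll_to(y) and wait() (an assumed
    driver interface). Returns the number of scroll rounds performed.
    """
    last_height = page.scroll_height()
    for round_no in range(1, max_rounds + 1):
        page.scroll_to(last_height)   # jump to the current bottom
        page.wait()                   # give lazy content time to load
        new_height = page.scroll_height()
        if new_height == last_height:
            return round_no           # nothing new appeared: page complete
        last_height = new_height
    return max_rounds                 # capped: infinite feed cut off here
```

The `max_rounds` cap is precisely the "reasonable boundary" decision: without it, an infinite feed would make the capture run forever.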
WebSocket Connections
Real-time features like live chat, stock tickers, and notification systems use WebSocket connections that maintain persistent, bidirectional communication between the browser and server. Content delivered through WebSockets never passes through a traditional HTTP request-response cycle, so it does not fit neatly into the request/response records a capture system normally stores and must be recorded through dedicated handling.
Service Workers
Service workers are scripts that run in the background, intercepting network requests and serving cached responses. They can fundamentally alter how a website loads and behaves, serving different content from what the server would normally return. An archiving system must account for service worker behaviour to ensure the captured content accurately reflects the live site.
Client-Side Routing with Dynamic Data
Modern frameworks combine client-side routing with dynamic data fetching in complex patterns. A single page component might make multiple API calls based on URL parameters, user state, and browser capabilities, assembling content from various sources. Capturing the complete state requires the archiving system to understand and reproduce these data-fetching patterns.
Framework-Specific Rendering Patterns
Next.js, Nuxt.js, and similar meta-frameworks have introduced hybrid rendering strategies: some pages are server-rendered, some are statically generated at build time, some are rendered on the client, and some use a combination of all three. An archiving system must handle each rendering strategy correctly, which requires deep understanding of how modern web frameworks deliver content.
How Aleph Archives Handles the JavaScript Challenge
Aleph Archives has been archiving websites since 2010. We watched the JavaScript revolution unfold from the beginning, and we have continuously evolved our technology to keep pace.
Our archiving platform uses full browser-based rendering to capture every website as it actually appears. When we archive a React application, our system executes the JavaScript, waits for all components to render, handles asynchronous data loading, and captures the complete visual and interactive state of every page.
But rendering JavaScript is only the first step. Our system also handles lazy loading by simulating scroll behaviour across entire pages. It manages client-side routing by discovering and navigating all internal routes. It captures resources loaded through dynamic imports, WebSocket connections, and service worker caches. And it does all of this while producing ISO 28500-compliant WARC files with SHA-512 and RIPEMD-160 cryptographic verification.
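For illustration, a payload digest of the kind recorded alongside an archived record can be computed with a standard library. The sketch below is a simplification: WARC's WARC-Payload-Digest header uses a labelled "algorithm:digest" form, but real tools often encode the digest in base32 rather than the hex used here, and this is not Aleph Archives' actual implementation.

```python
import hashlib

def payload_digest(payload: bytes, algorithm: str = "sha512") -> str:
    """Labelled digest for an archived payload, in an
    'algorithm:hexdigest' style (simplified for illustration)."""
    h = hashlib.new(algorithm)
    h.update(payload)
    return f"{algorithm}:{h.hexdigest()}"
```

Verification later is the reverse operation: recompute the digest over the stored payload and compare. Any alteration to even one byte of the archive produces a different digest, which is what makes the capture defensible as evidence.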
After fifteen years of continuous development, our technology handles the full spectrum of modern web complexity. Every new JavaScript framework, every new rendering pattern, and every new browser feature is a challenge we actively track and address. This is the advantage of a company that does nothing but web archiving: when the web changes, adapting our capture technology is not one priority among many. It is the only priority.
The Lesson for Organisations
If your website is built with React, Angular, Vue, Next.js, Nuxt, or any modern JavaScript framework, your choice of archiving provider matters more than ever. A provider using traditional crawling technology may be capturing empty shells rather than complete pages. A provider that has invested in browser-based archiving will capture your website as it actually appears.
The difference is not visible until you need your archive – in a compliance audit, a legal dispute, or an internal review. At that moment, the difference between an empty HTML shell and a complete, faithful reproduction of your website is the difference between having evidence and having nothing.
If you want to verify how your website is being archived, contact us for a complimentary assessment. We will show you exactly what your current archive captures – and what it might be missing.