“Can We Capture This?”: An Interview on Website Archivability

Earlier this year, the Library of Congress and our web harvest vendor MirrorWeb presented research on website archivability at the 2024 Web Archiving Conference at the Bibliothèque nationale de France in Paris. The presentation was part of a panel called “‘Can We Capture This?’: Assessing Website Archivability Beyond Trial and Error,” which was moderated by Martin Klein (Pacific Northwest National Lab) and included presenters Meghan Lyon (Library of Congress), Calum Wrench (MirrorWeb), and Tom Storrar (National Archives UK).

The Library of Congress is an active member of the International Internet Preservation Consortium (IIPC), which organizes the annual conference, and this panel was an example of how the Library’s Web Archiving Section collaborates and shares knowledge with web archivists from institutions around the world. A video of the panel is available on IIPC’s YouTube channel, and in this blog post, panel participants Meghan Lyon, Calum Wrench, and Tom Storrar discuss the panel and what they learned from their research on website archivability.

What does “website archivability” mean?

Meghan: Website archivability is the readiness of a website and its underlying technology to be discovered and accessed via web crawler, archived to preservation standards by aggregation into WARC files, and then rendered later, using only the archived data, via web archive replay software. The higher the fidelity at replay, the better the usability for researchers of the future.
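
To make that crawl-to-WARC-to-replay cycle concrete, here is a minimal sketch at toy scale using the open-source warcio Python library. It only illustrates the data flow Meghan describes, not the Library’s production crawling tooling, and the URL is a placeholder.

```python
# Minimal sketch of the capture-and-replay data flow using warcio.
# Illustrative only: example.com stands in for any site being archived.
from warcio.capture_http import capture_http
import requests  # imported after capture_http so its traffic is recorded

# 1. Crawl: fetch a page and aggregate the HTTP traffic into a WARC file.
with capture_http('example.warc.gz'):
    requests.get('https://example.com/')

# 2. Replay (simplified): read the archived records back,
#    without touching the live web.
from warcio.archiveiterator import ArchiveIterator

with open('example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':
            print(record.rec_headers.get_header('WARC-Target-URI'))
```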

Tom: To me it means the ability to archive, replay and authentically render content with tools currently available to us.

Calum: Website archivability, in a nutshell, is how easy a website is to archive given the technology currently available with which to archive it. The more complex a site is, the more difficult it will be to archive and preserve access to for the future.

How does website archivability relate to the work that we do at the Library of Congress?

Meghan: During the talk I described the scale of our program in terms of relationships: The Web Archiving Section collaborates with internal and external parties to run the program, leading communication between curatorial staff, contractors, IT specialists, policy-makers, and the public. Even as a selective harvesting program, we are collecting around 700 TB annually, with thousands of websites in our ongoing crawls at any given time. Our team and others at the Library have developed a suite of quality assurance workflows that take place after resources have been crawled. When we notice issues in the web archives, like missing components in an archived page or content that looks or behaves differently compared to the original site, we make adjustments to subsequent crawls to create better, more complete captures. We are always looking for ways to improve our workflows and hope to develop (or at least start a conversation about) proactive strategies to improve the quality of web archive collections. We also want to use the exploration of archivability to expand the Library’s guidance on difficult-to-capture behaviors on our Creating Preservable Websites page.

What would you most like to emphasize about archivability?

Tom: That it really is an interplay of issues at various levels, from in-page scripting right up to network-level security measures. It relates to capture, but also to the ability to replay and render the content in a reasonably authentic manner. It’s a rapidly evolving area that requires buy-in from a wide range of parties, including platforms, designers, developers, managers, and web archivists, as well as from the technologies we use. Good practice, particularly on accessibility, is certainly conducive to successful web archiving, but it’s not the whole story. There’s possibly a Venn diagram somewhere illustrating a sweet spot between attractive, interactive web content that is also archivable!

Calum: I think the most important thing to emphasize about archivability in the context of the web is that it’s dynamic. The web is not a static medium, and technology constantly forces us to change how we approach archiving. A long-term, sustainable approach to web archiving requires input from both sides of the conversation. Currently, the conversation around solutions to archivability focuses on improving tools or techniques in the field of web archiving. Reframing this conversation to challenge web creators to think more actively about archivability, in the same vein that they do with accessibility standards such as the W3C’s Web Content Accessibility Guidelines, would help bridge the current gulf in dialogue between web creators and web archivists.

Meghan: The web is complex, interconnected, and rapidly developing. Those features can make the task of archiving with the outcome of high-fidelity replay in mind particularly challenging. Websites are selected for inclusion in the permanent collections at the Library of Congress based on collection development policy. The technology used to build those nominated websites varies dramatically and generally is not part of the nomination criteria. We want to explore the concept of archivability so we can help inform our recommending officers about how well a site will archive and whether it will be collectable within the scope of our program, and to help site owners discover how they might make their websites more archivable. Part of the work we asked MirrorWeb to do over the last year was to conduct a web archiving community survey about archivability and the challenges faced when it comes to quality of capture and replay. We’re planning to share the results in a future blog post, but I want to emphasize that challenges in this area affect collecting institutions large and small. If anyone out there is wondering whether it’s ‘just them’: nope, it is not!

This panel was designed as a discussion; what idea or question from the audience stood out to you?

Calum: The discussion section of the panel was great; we had lots of engagement from the audience on various parts of the three presentations. To pick out one highlight, one audience member brought up an online tool with a sole creator/maintainer that checks the preservation-readiness of a page, and voiced a concern about what would happen if that tool were ever to disappear. While this sparked discussion about the longevity of community tools, Ilya Kreymer of Webrecorder suggested the idea of a “successor” to Archive Ready in the form of a browser extension. Following the example set by accessibility assessment tools, it would be the next logical step in providing website assessments in a manner that is accessible to both site owners/developers and web archivists.
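
For readers curious about what a preservation-readiness check actually looks at, here is a hypothetical, much-simplified sketch of the kinds of signals such a tool might inspect. It is not the Archive Ready tool discussed on the panel, nor the proposed browser extension; the checks and the example URL are illustrative assumptions only.

```python
"""Hypothetical sketch of a page-level archivability check (not a real tool)."""
from urllib.parse import urljoin, urlparse

import requests


def archivability_report(url: str) -> dict:
    """Collect a few crawler-friendliness signals for a single URL."""
    report = {}
    resp = requests.get(url, timeout=10)
    report["status_code"] = resp.status_code
    report["content_type"] = resp.headers.get("Content-Type", "")

    root = f"{urlparse(url).scheme}://{urlparse(url).netloc}"
    # Crawlers commonly consult robots.txt and sitemap.xml for discovery.
    report["has_robots_txt"] = requests.get(urljoin(root, "/robots.txt"), timeout=10).ok
    report["has_sitemap"] = requests.get(urljoin(root, "/sitemap.xml"), timeout=10).ok

    # Heavy reliance on client-side JavaScript is a rough warning sign
    # that a plain HTTP crawler may miss content.
    report["script_tag_count"] = resp.text.lower().count("<script")
    return report


if __name__ == "__main__":
    print(archivability_report("https://example.com/"))
```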

Tom: I think the point about the existing tool and what it means to content creators and web archivists alike is important. Something that supports proactive checking of “archivability” is a great thing, as it addresses issues at the source and raises awareness of our needs. It helps to move us away from “trial and error.” However, the discussion also highlighted that this tool and others like it need to be adequately supported.

Meghan: We were really lucky to have an engaged audience of professionals who had plenty of thought-provoking questions and comments. One that stood out was how web archivists, and in particular archivists from member organizations of the IIPC, could persuade creators that archivability is important. Ideas bounced around; one I particularly liked was interviewing developers who make sites that archive particularly well, maybe even some sort of recognition like a “Top 10 Most Archivable Websites” list or a “Certified Archivable” badge, similar to the way you might see a climate-friendly badge on an eco-friendly site. No one individual leaped up to be the arbiter of that, but ‘food for thought’ was our modus operandi for the panel.

What would you have included in your presentation with more time?

Calum: Yes! Ten minutes is a short time in which to cram a comprehensive take on web archivability, so we chose to focus our presentation on the results of the community survey. While we were able to present some interesting visualizations of the responses in areas of particular interest, it would have been nice to delve more into why some of the specific technical challenges raised by respondents occur. In our presentation of our web archivability report to Library of Congress staff members at the end of March, my developer colleagues Mason Banks (MirrorWeb Engineering Manager) and George Hams (MirrorWeb Senior Developer) covered quite a few topics in extensive detail, complete with live demonstrations that would have made for fantastic presentation material at the panel, had we more time to work with. In particular, Mason’s insights into the challenges of non-deterministic interactions, such as the capture of content delivered via HTTP POST requests or dynamic image resizing in popular web CMS platforms, would be of interest to the community at IIPC. Similarly, George’s spectrum of static webpages to single-page applications (pictured) would make for an interesting conversation piece, particularly when placed in conjunction with more web developer-driven archivability initiatives, such as those suggested by Rich Harris of the Svelte JavaScript framework.
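
As a small illustration of why POST requests are awkward to capture and replay: many replay indexes are keyed by URL, so two POST responses to the same endpoint can collide unless something derived from the request body is folded into the key. The sketch below is a simplified, hypothetical model of that problem, not a description of any particular tool’s implementation.

```python
"""Simplified, hypothetical model of why POST requests complicate replay."""
import hashlib


def naive_key(method: str, url: str) -> str:
    # URL-only key: both POSTs below collapse to the same index entry.
    return f"{method} {url}"


def body_aware_key(method: str, url: str, body: bytes = b"") -> str:
    # Folding a digest of the request body into the key keeps
    # distinct responses distinguishable at replay time.
    digest = hashlib.sha256(body).hexdigest()[:12]
    return f"{method} {url} {digest}"


req_a = ("POST", "https://example.com/api/search", b'{"q": "maps"}')
req_b = ("POST", "https://example.com/api/search", b'{"q": "dashboards"}')

print(naive_key(req_a[0], req_a[1]) == naive_key(req_b[0], req_b[1]))  # True: collision
print(body_aware_key(*req_a) == body_aware_key(*req_b))                # False: distinct
```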

Tom: Maybe some more examples and anecdotes about how website owners respond to our requirements!

Meghan: Ditto to Tom! During the panel, Tom discussed several challenging use cases that required special intervention to archive, such as interactive dashboards and a COVID-19 artwork site with thousands of tile images, and discussed the levels of interactivity and intervention that make certain sites tough to preserve. When content on a page loads only after a user interacts with it in a browser, it can require special scripting to enable capture; asynchronous JavaScript can be challenging to capture for similar crawler-discoverability and reproducibility reasons. Maps can be a challenge due to the layering of images and the reliance on requests and responses from a third-party service to populate data in a frame on the page. The National Archives UK has a collecting remit to archive UK central government websites and provides more granular guidance (including a developer checklist!) for making websites technically compliant. I’d love to borrow from this to improve our own guidance. If we had more time during the session, I would have loved for the conversation with the audience to continue.
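
As a rough illustration of the “special scripting” Meghan mentions, browser-based capture tools can drive a real browser and trigger interactions (scrolling, clicking “load more” controls) so that lazily loaded content is actually requested before the page is preserved. The sketch below uses Playwright for Python; the URL and the button selector are hypothetical placeholders, and real crawl behaviors are considerably more robust.

```python
"""Minimal sketch of browser-driven capture scripting (hypothetical page and selector)."""
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com/gallery")  # hypothetical page

    # Scroll down so lazily loaded images are actually requested.
    page.mouse.wheel(0, 5000)
    page.wait_for_load_state("networkidle")

    # Click a hypothetical "Load more" control a bounded number of times.
    for _ in range(10):
        load_more = page.locator("button.load-more")
        if load_more.count() == 0:
            break
        load_more.first.click()
        page.wait_for_load_state("networkidle")

    html = page.content()  # the fully expanded DOM a crawler would want to preserve
    browser.close()
```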
