Bound to Browsable: Unlocking the Historical Media Publications Collection

Today’s guest post is from Genevieve Havemeyer-King of the Digital Collections Management & Services Division at the Library of Congress.

Even for those who aren’t movie buffs, the vibrant covers of Cine-Mundial never fail to snag one’s attention, and who could deny the pleasure of watching the evolution of the celebrity magazine unfold over 25 years of Modern Screen? The covers of Radio Age helped listeners transcend the purely auditory experience of early radio, and the advertisements that grace the covers of Broadcasting trace the evolution of television as the public’s broadcast medium of choice. These are just a few examples from nearly one hundred titles now available in the newly released Historical Media Publications Collection, a treasure trove of film, television, radio, and recorded sound history published from the late 1800s through the mid-1900s.

Two cover issues of the Modern Screen magazine side by side. The image on the left shows a close portrait of a dark-haired woman's face surrounded by a large collar. The image on the right shows a portrait of a woman and man. The woman leans towards the man, with both looking off to the right. — In Image 1 (left), some understated text sits flush along the bottom cover line on April 1933’s issue of Modern Screen magazine leaving a pastel portrait of Claudette Colbert unobstructed; a stark contrast to the tabloid sensationalism the magazine touted by March 1958 in Image 2 (right).

Collection Background and Presentation Challenges

The path to today’s presentation began over a decade ago with the establishment in 2009 of the Media History Digital Library (MHDL). This initiative, founded by film historian David Pierce, aimed to provide high-quality, text-searchable scans of trade papers and fan magazines relating to the film, media, and broadcast industries. Prior to this project, media scholars were forced to rely on low-resolution and often incomplete microfilm facsimiles of the original publications. In 2010, Pierce submitted a digitization plan for serials held by the Library of Congress’s National Audio-Visual Conservation Center, prioritizing publications in the public domain. The Internet Archive was selected as the digitization vendor, and in 2012 the scans were made public on the Internet Archive as “multi-part” objects: bound volumes containing multiple issues, presented without individual issue-level navigation. Soon after, the same content became available through UW-Madison’s Lantern search platform.

Fast forward to 2020, when the Library of Congress began development of our next-generation Digital Collections Repository (DCR). To test its end-to-end workflow — from ingest to public access — the Library ingested all of its Internet Archive-scanned content into the DCR, including the Historical Media Publications. This was also part of an effort to consolidate and co-locate the digital content produced through this digitization partnership.

For this collection (as well as other serialized content) the bound-volume format posed major challenges for display on loc.gov, including:

- Limited user navigation to specific issues.
- Large volume and file sizes, which necessitated infrastructure improvements.
- Incomplete descriptive date metadata for titles spanning several years.
- Inconsistent or non-standardized entry of date ranges, which limited the arrangement and display of issues in chronological order (see Images 3 and 4).

Two images showing metadata text over a black background. The image on the left shows the text for the date field as "Jan 1932- Summer 1935." The image on the right shows the text for the date field as "Apr-May 11 1918." — Image 3 (left) and Image 4 (right) show how inconsistency in the ‘date’ field values can be seen in XML data stored alongside resources for individual bound volumes in the DCR, posing a challenge for any fully automated solution to breaking up issues in their loc.gov presentations.

Creating a Workflow to Enable Access

The development of the DCR has streamlined a number of digital access initiatives since its outset, and the challenges posed by this collection presented an opportunity to work with Library software and IT infrastructure teams on feature development that enabled ‘multi-resource display.’ Our solution was ambitious: reprocess every title to present it at the issue level on loc.gov, with each issue accessible as its own digital “resource” within a title.

After brainstorming presentation solutions with colleagues at the National Audio-Visual Conservation Center (NAVCC), who manage the physical collection, Digital Collections Management and Services (DCMS) Division staff developed a detailed workflow to normalize and enhance the metadata, identify individual issues, and map their page ranges, so that each issue could be presented as a separate, navigable resource. Here’s what that entailed:

Data Extraction and Analysis
- For this collection, each publication title was given a Library of Congress Control Number (LCCN), which was used as a tracking identifier in our project management system. One LCCN / publication title can have multiple bound volumes that span varying date ranges.
- A Python script was developed to extract and export data from the XML files stored alongside the digital resources. Each file represents one bound volume, tied to an LCCN in this collection, that has a digital surrogate stored in the DCR. The XML data includes inventory and descriptive metadata specific to each resource related to a given LCCN, including date metadata previously entered by digitization staff at the time of scanning.
- Analysis of the XML dataset confirmed that the date metadata was too inconsistent to automate creation of new resources.
- Another script was produced to query the DCR API and output one CSV for each LCCN, containing rows for each bound volume resource, including values for their unique DCR identifiers, which served as our initial data for mapping resources.
- DCMS staff used the LCCN and unique identifiers included in the CSV to locate each resource in the DCR and begin updating the CSV with new date metadata.
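
The extraction and analysis steps above might look roughly like the following. This is a minimal sketch, not the Library's actual script: the XML layout, the element names (`identifier`, `date`), the file names, and the CSV columns are all assumptions made for illustration, since the post does not describe the DCR's sidecar schema or API. The two irregular date strings are the ones shown in Images 3 and 4.

```python
import csv
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical sidecar XML, one document per bound volume.
# The real schema and identifiers are not described in the post.
SAMPLE_XML = {
    "vol-a.xml": "<metadata><identifier>vol-a</identifier>"
                 "<date>Jan 1932- Summer 1935</date></metadata>",
    "vol-b.xml": "<metadata><identifier>vol-b</identifier>"
                 "<date>Apr-May 11 1918</date></metadata>",
    "vol-c.xml": "<metadata><identifier>vol-c</identifier>"
                 "<date>1931-01</date></metadata>",
}

# Matches YYYY, YYYY-MM, or YYYY-MM-DD; anything else is flagged for review.
ISO_LIKE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def extract_row(filename, xml_text):
    """Pull the identifier and raw date out of one XML document."""
    root = ET.fromstring(xml_text)
    date_value = (root.findtext("date") or "").strip()
    return {
        "file": filename,
        "identifier": (root.findtext("identifier") or "").strip(),
        "date_raw": date_value,
        "date_is_normalized": bool(ISO_LIKE.match(date_value)),
    }


def build_report(xml_by_name):
    """One row per bound volume, ordered by file name."""
    return [extract_row(name, text) for name, text in sorted(xml_by_name.items())]


def to_csv(rows):
    """Serialize the rows to CSV text for manual review and data entry."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["file", "identifier", "date_raw", "date_is_normalized"],
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = build_report(SAMPLE_XML)
needs_review = [row["identifier"] for row in rows if not row["date_is_normalized"]]
print(to_csv(rows))
print("Needs manual review:", needs_review)
```

A report like this makes the scale of the inconsistency visible at a glance, which is the kind of evidence that would justify manual review over a fully automated split.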
Breaking Down Bound Volumes Into Issues
- For each bound volume resource, staff examined the covers to identify where each issue began and ended, recorded the exact filenames for the first and last scanned page of each issue, and noted where pages or issues were missing or out of order.
- For each issue identified, a new row would be created in the CSV, essentially defining a “slice” of the bound volume for the DCR to designate as a new resource.
- Resource label values were entered for each issue and normalized to a consistent style (e.g. Jan. 1931, Fall 1931), while machine-readable dates were recorded in YYYY-MM-DD format when available. The labels display alongside the issue title in the loc.gov presentation, aiding discoverability. The machine-readable dates are used by internal systems to order resources and, eventually, to display issues in a calendar view.
  
  Image 5: A cover page from the bound volume of Screenland magazine is shown via the Digital Collection Repository (DCR) user interface, buried amidst ads for mascara and mouthwash.
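
As an illustration of the issue "slices" described above, the sketch below models each issue as a page range with a display label and an optional machine-readable date, then orders the issues chronologically. The class and function names, the file names, the month abbreviations other than "Jan.", and the choice to default a monthly issue to the first day of the month are assumptions for this example and are not taken from the post.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Assumed label abbreviations; the post only shows "Jan. 1931" and "Fall 1931".
MONTH_ABBR = ["Jan.", "Feb.", "Mar.", "Apr.", "May", "June",
              "July", "Aug.", "Sept.", "Oct.", "Nov.", "Dec."]


@dataclass
class IssueSlice:
    """One issue carved out of a bound volume (hypothetical structure)."""
    first_file: str          # filename of the first scanned page of the issue
    last_file: str           # filename of the last scanned page of the issue
    label: str               # human-readable label shown next to the title
    iso_date: Optional[str]  # YYYY-MM-DD when available, otherwise None


def month_issue(first_file, last_file, year, month, day=1):
    """Monthly issue; the day defaults to 1 (an assumption of this sketch)."""
    issued = date(year, month, day)  # raises ValueError on impossible dates
    label = f"{MONTH_ABBR[month - 1]} {year}"
    return IssueSlice(first_file, last_file, label, issued.isoformat())


def seasonal_issue(first_file, last_file, year, season):
    """Seasonal issue; no exact date is invented, so iso_date stays None."""
    return IssueSlice(first_file, last_file, f"{season} {year}", None)


def chronological(slices: List[IssueSlice]) -> List[IssueSlice]:
    """Dated issues first, in date order; undated issues keep their order at the end."""
    return sorted(slices, key=lambda s: (s.iso_date is None, s.iso_date or ""))


issues = [
    seasonal_issue("0201.jp2", "0260.jp2", 1931, "Fall"),
    month_issue("0101.jp2", "0150.jp2", 1931, 3),
    month_issue("0001.jp2", "0100.jp2", 1931, 1),
]
ordered = chronological(issues)
labels = [issue.label for issue in ordered]
print(labels)
```

Keeping the label and the machine-readable date as separate fields mirrors the split the post describes: one value for readers, one for sorting.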
Ingest of Resources and Quality Assurance
- Once data entry was complete, staff fed the enhanced CSV to a Python script that combined the new metadata with system-specific data pulled from the DCR’s API to produce an ingest spreadsheet. This spreadsheet was then used to produce JSON manifest files, one for each new resource described in a row of the ingest spreadsheet.
- The JSON data was ingested to create new resources from an existing resource in the DCR, maintaining the association to the original bound volume files and enabling groupings of the image files from the original sequence to be managed as separate digital assets.
- New resources were reviewed to check for mis-assigned pages, and ensure that the best page available was presented as the cover image.
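
The manifest step and one simple quality check could be sketched as follows. The CSV columns, the manifest keys, and the identifiers are hypothetical, because the post does not document the DCR manifest format. The overlap check also assumes zero-padded page filenames that sort correctly as text.

```python
import csv
import io
import json

# Hypothetical ingest spreadsheet: one row per new issue-level resource.
INGEST_CSV = """parent_id,label,iso_date,first_file,last_file
vol-a,Jan. 1931,1931-01-01,0001.jp2,0100.jp2
vol-a,Feb. 1931,1931-02-01,0101.jp2,0196.jp2
vol-a,Fall 1931,,0197.jp2,0260.jp2
"""


def build_manifests(csv_text):
    """Turn each spreadsheet row into one manifest dictionary."""
    manifests = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        manifests.append({
            "parent": row["parent_id"],        # original bound-volume resource
            "label": row["label"],
            "date": row["iso_date"] or None,   # empty cell means no exact date
            "pages": {"first": row["first_file"], "last": row["last_file"]},
        })
    return manifests


def find_overlaps(manifests):
    """Return pairs of labels whose page ranges overlap within one parent.

    Relies on zero-padded filenames so that string order equals page order.
    """
    by_parent = {}
    for manifest in manifests:
        by_parent.setdefault(manifest["parent"], []).append(manifest)

    problems = []
    for items in by_parent.values():
        items = sorted(items, key=lambda m: m["pages"]["first"])
        for earlier, later in zip(items, items[1:]):
            if later["pages"]["first"] <= earlier["pages"]["last"]:
                problems.append((earlier["label"], later["label"]))
    return problems


manifests = build_manifests(INGEST_CSV)
documents = [json.dumps(m, indent=2, sort_keys=True) for m in manifests]
overlaps = find_overlaps(manifests)

print(documents[0])
print("Overlapping page ranges:", overlaps)
```

A check like this only catches page ranges that collide. Confirming that the best available page is shown as the cover still needs a human reviewer, as the post describes.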
Presentation of Issues
- The new resources were published on loc.gov and associated with the bibliographic records related to their LCCNs.

Screenshot of the collection display on loc.gov, showing four cover images from Hollywood magazine. All four covers feature brightly colored portraits of women. — *Image 6: The new resource items display in loc.gov, showing the gallery-view of Hollywood magazine issues listed in chronological order.*

Results

This reprocessing transforms the researcher experience. Instead of scrolling through hundreds of bound pages to find a single issue, users can now browse each serial’s title page on loc.gov, see a clear list of issues by date, and jump straight to the one they need. Upcoming loc.gov display features, like calendar views for multi-resource items, will also make it easier to explore a title chronologically.

While the process was meticulous, it yielded big returns:

- Accessibility: Issue-level navigation improves usability for all researchers, from casual browsers to academic scholars.
- Accuracy and Discoverability: Normalized, item-level dates support precise searching and citation.
- Foundational Development: Our colleagues’ work on loc.gov and the DCR has laid the groundwork for digital processing and presentation of many other serialized collections, expanding access to historical content and making research less cumbersome for patrons.

With nearly a hundred titles completed, there are over 7,000 issues now available on loc.gov, and the Historical Media Publications Collection is more discoverable and enjoyable to explore – a vibrant, visual history of the entertainment industry, page by page, issue by issue.
