
Enhancing access to streaming archival media with transcripts and captions

This post was collectively authored by Andrew Berger, Dinah Handel, and Geoff Willard


Project goals

Digitization of audiovisual resources is only the first step in ensuring their contents are seen by many for years to come. In order for our audiovisual heritage to be truly accessible to all, it needs corresponding captions in a standardized format. There is broad recognition in the audiovisual preservation field that providing resources with corresponding captions is imperative to equitable access, and in some cases legally required (Fox 2020 and Weber 2017). Currently at Stanford Libraries, we do not have a method for creating and delivering transcriptions of audiovisual resources that can be displayed in the media player using the WebVTT standard.
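For context, a WebVTT caption file is a plain-text sidecar file: a `WEBVTT` header followed by timed cues. A minimal illustration (timings and text invented):

```
WEBVTT

00:00:01.000 --> 00:00:04.500
Welcome to the Stanford Libraries oral history series.

00:00:05.000 --> 00:00:09.000
Today we're speaking with our first interviewee.
```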

A few weeks into the pandemic last year, we began researching captioning and transcription as part of our work-from-home efforts. Preliminary work with an AI-based vendor was encouraging for certain types of content, namely talking-head style recordings, but our sample set was too small to draw any firm conclusions. We applied for the Library’s David C. Weber Fund and received funding to expand our sample set and vendors. Our project has two phases, with five goals.

Phase one:

  1. Develop and document the decision-making process for selecting the appropriate vendor and service type based on the content. 

  2. Determine the most efficient and cost-effective way of transcript creation for different types of recordings. 

    1. There are three methods of transcription: 100% human, AI with human correction, and AI. 

    2. Stanford will test and compare all three methods.

  3. Develop an end-to-end workflow for transcription generation, delivery, and accessioning into the Stanford Digital Repository. 

  4. Develop a better understanding of the policy and budget considerations for incorporating transcription services into our production workflows.

Phase two:

  1. Gather requirements for technical implementation in the Stanford Digital Repository and media player. 

We also decided to project the total cost of transcribing all digitized or born-digital media objects in the Stanford Digital Repository, broken down by the rights status of the materials. This projection excludes media that has not been digitized. The media stored in the SDR comes from many units in the libraries, primarily Special Collections and University Archives and the Archive of Recorded Sound.


Testing and assessing service providers

For this project, we selected 19 resources from University Archives representing a range of content. Source material was primarily spoken word (English), with little music or singing, and runtimes ranged from 5-minute interviews to 90-minute academic panels. All 19 resources were transcribed by AssemblyAI and OtterAI, 8 were transcribed by 3Play, and 6 were transcribed by Rev. We then developed assessment criteria and analyzed the vendors’ results for accuracy, along with qualitative observations about the functionality of each service provider’s tools. Specifically, we looked at speaker recognition, incorrect spelling, proper noun recognition, timing with the visuals, punctuation, and word accuracy. These criteria were scored on a 1-5 scale and then averaged.

Of the two AI vendors, OtterAI was consistently superior on every metric. Their pricing was also quite favorable, assuming we maxed out the amount of material we pushed through the service every month (6,000 minutes). However, that cost quickly went up once we factored in the human labor it would take to do correction. We were seeing correction times at roughly 4x the runtime of the resource. While a 60-minute file may only cost $0.06 to caption using OtterAI, it could take 4 hours or more to correct. Using $17.10 as our labor cost per hour, it was clear OtterAI was only warranted for “easy” jobs – i.e. those with a single speaker, few proper nouns, good cadence, etc.
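The arithmetic behind that conclusion can be sketched in a few lines. The figures come from our testing (a roughly $0.06 OtterAI fee for a 60-minute file, correction at about 4x runtime, labor at $17.10/hour); the function itself is just an illustration, not a vendor calculator:

```python
# Back-of-the-envelope cost of AI captioning plus full human correction.
LABOR_RATE = 17.10  # dollars per hour of staff time

def correction_cost(runtime_hours, correction_multiplier, service_fee):
    """Total cost = vendor fee + staff time spent on correction."""
    return service_fee + runtime_hours * correction_multiplier * LABOR_RATE

# A 60-minute file through OtterAI, corrected at ~4x runtime:
otter = correction_cost(1.0, 4.0, 0.06)
print(f"OtterAI + full correction: ${otter:.2f}")  # $68.46
```

The labor term dominates the vendor fee, which is why the AI route only pays off when the correction multiplier is low.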

Making corrections to the delivered caption files also proved to be a challenge. One of the reasons we settled on University Archives material was rights concerns. World-accessible content could be pulled into Amara, a tool we had used the previous spring, which allowed for very quick and intuitive correction work. As much as we liked Amara, though, it really complicated our workflow, and it was effectively unusable for non-world-accessible content. To correct a file, we would have to do the following:

  • In our Amara account, add the streaming link from our repository for the resource we want to work on

  • Pair the .vtt file from OtterAI with this streaming link

  • Make corrections to the captions while playing the video

  • Export the new .vtt from Amara

Staying within one interface seemed like a much better way to go, and that’s how we ended up with Rev. 3Play, our other human-based vendor, produced great results, but because their interface was not suited to editing the way Amara’s was, we were forced back into the captioning-service-to-Amara round trip. Rev has a very Amara-esque editing interface, and we were not bound by the rights status of the media since we were uploading files directly to their platform. Their costs were actually lower than 3Play’s, while their output was generally of higher quality as far as speaker identification, punctuation, off-screen descriptions, and visual placement were concerned (for instance, Rev was the only vendor that correctly positioned captions in the top third when they would otherwise obscure an on-screen graphic displaying someone’s name). Correcting a Rev-generated caption file took, on average, 1.5x–2.5x the runtime. It’s worth noting, though, that the decision to do full corrections might not be universal for all resources. Resources destined for an exhibit would be candidates for such a thorough second pass, but the initial caption file from Rev is good enough in most cases that leaving it as is would be sufficient for accessibility.
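That top-third placement maps directly onto WebVTT’s cue settings: the `line` setting moves a cue away from its default position at the bottom of the frame. A minimal illustration (timings and text invented):

```
WEBVTT

00:00:12.000 --> 00:00:15.000 line:0
This cue renders at the top of the frame,
leaving the on-screen name graphic visible.
```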

There are two caveats with Rev that are worth mentioning: 

  1. Difficult audio is not something they like to deal with. We tested some cassette-based, audio-only resources with them, and they were rejected due to audio quality. 3Play will tackle these types of files, but they will also increase their price per minute accordingly because of the difficulty.

  2. We have ethical concerns around the low compensation for human-based captioning services, as well as around subjecting transcribers to potentially graphic or disturbing content without warning (thankfully we have very little of this content, but it does exist in a few of our collections).

Media in the Stanford Digital Repository

The Stanford Digital Repository is a general purpose repository, serving both the library and the wider Stanford community. As such, it contains a variety of content types and file formats, ranging from images and documents to audiovisual materials and research data. Audiovisual files may be found in deposited data sets as well as in the library’s digital collections. However, media files deposited as research data often do not use formats that are suitable for online streaming, and often do not contain any spoken content. Therefore, our first task was to determine the scope of the media content in the repository that was likely to need captioning.

We settled on the following criteria for analysis:

  • We chose only materials in the SDR that use the content type of “media,” indicating that this content is predominantly audiovisual material that has been prepared for online access.

  • Within this set, we analyzed only files that have been formatted for online access. Most media content is deposited in at least two formats: one intended for long-term preservation and one for online access. Analyzing only the access files prevents us from double-counting this data. The access copies are also the files that will, ultimately, be paired with the caption files.

Andrew Berger, Repository Manager, used the following process to analyze these files:

  • Made a list of every media object in SDR

  • Used a command-line script to run the metadata extraction tool mediainfo on every file in every object, saving the output in JSON format. This generated metadata for over half a million files.

  • Filtered the list of files to only those matching the filename pattern that the Stanford Media Preservation Lab uses for access copies, resulting in a list of just under 200,000 files.

  • Cross-referenced this list with the current access rights for each item. Files may be world available, limited to the Stanford community, or not currently available online.

  • Finally, produced the total duration of all files for each rights status, as seen below.

The same mediainfo data could serve as a basis for future analysis, such as a breakdown of durations by collection or by type of content, such as audio or video.
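The roll-up step above can be sketched in a few lines of Python. The `mediainfo --Output=JSON` invocation is the real CLI flag; the `*_access.*` filename pattern, the rights values, and the sample records below are hypothetical stand-ins for SMPL's actual conventions:

```python
# Sketch of the duration roll-up, assuming per-file metadata has already
# been collected (e.g. via `mediainfo --Output=JSON <file>`).
import fnmatch
from collections import defaultdict

files = [
    # (filename, duration in seconds, rights status) -- invented examples
    ("lecture_1970_access.mp4", 5400, "world"),
    ("lecture_1970_pres.mov", 5400, "world"),  # preservation copy, skipped
    ("oralhistory_02_access.mp4", 300, "stanford"),
]

totals = defaultdict(float)
for name, duration, rights in files:
    # Count only access copies, to avoid double-counting preservation files.
    if fnmatch.fnmatch(name, "*_access.*"):
        totals[rights] += duration

for rights, seconds in sorted(totals.items()):
    print(f"{rights}: {seconds / 3600:.2f} hours")
```

Because the filter runs on filenames and the sum is keyed by rights status, the same loop extends naturally to the per-collection or per-content-type breakdowns mentioned above.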


We found that the most efficient method for accurate and high-quality transcript creation was to send files to the service provider Rev (human), which required the least time to QA and correct the results. A secondary option would be to send files to OtterAI (AI), which requires somewhat more time to QA and correct. With Andrew’s work, we also found that the Stanford Digital Repository contains approximately 54,058 hours of audio and audiovisual media that may need captions. Some percentage of these materials, especially those from the Archive of Recorded Sound, would not require transcription and captioning, as they are music without voice. The total cost of transcription, QA and correction, and accessioning for these materials would be approximately $1,454,157.
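Dividing that projected total by the total runtime gives the implied blended rate we used, a quick sanity check on the figures above:

```python
# Implied blended rate: transcription + QA/correction + accessioning,
# using the project totals quoted above.
total_cost_usd = 1_454_157
total_hours = 54_058
rate = total_cost_usd / total_hours
print(f"Implied rate: ${rate:.2f} per hour of media")  # ≈ $26.90
```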

See below for a table of the runtime for each rights status and the total cost to transcribe:

[Table: costs to send media to a vendor and correct the transcript, broken down by rights status]

Next steps 

The next phase of the project will be to work with the Infrastructure and Access teams in DLSS to understand the changes needed to preserve transcript files in the repository and display captions alongside our streaming media and audio.

We have the following recommendations on how to proceed with AV transcription work: 

  • Identify a yearly budget for AV transcription creation, QA, and correction. 

  • Begin with sending out all world-accessible (download) media files for transcription.

  • Identify additional tools for transcript correction, aside from Amara and Rev.

  • Consider hiring student workers to correct transcripts, either in Rev’s editor or on OtterAI output using Amara.

  • Develop a workflow and throughput rate that the Media Production Coordinator can accommodate alongside SMPL’s other operations.

We look forward to continuing to work on this project. If you’re interested in hearing more, please read our findings report or reach out to us with questions! 
