
Fascinating Discoveries are Waiting Just Below the Surface: A Journey into Geocoding with Sabrina Templeton

Today’s guest post is from Sabrina Templeton, a 2025 Junior Fellow at the Library of Congress. Sabrina is pursuing her MS in Information Studies at the University of Texas at Austin. Prior to starting her degree, she worked as a software engineer and she is passionate about the intersection of library and technology spaces.  


As this summer’s Junior Fellow with the Library’s Digital Collections Workflow Section focused on digital scholarship, I worked on a project that introduced me to both geocoding and the Library’s Historic American Buildings Survey / Historic American Engineering Record / Historic American Landscape Survey (HABS/HAER/HALS) collection. The purpose of the project was to conduct a proof of concept for computational interaction with the Library’s vast amount of data. While this project was exploratory in nature, if you’re looking to try something similar, you’re in luck! This work is also intended as an introduction and a starting point for anyone considering how to leverage the Library’s data for scholarly purposes. The project in its entirety can be viewed here. To get a better sense of the depth of my experience and a behind-the-scenes look at working on this project, read on!

If you’re not familiar with the HABS/HAER/HALS collections, they reside in the Library’s Prints and Photographs Division and comprise over 46,000 surveys of historic structures and sites. They include photographs, reports, architectural drawings, and more. Full digitization of these collections began in 1996, and today, in terms of data, they are a veritable treasure trove. As part of my project, I had the opportunity to meet with colleagues in the Prints and Photographs Division to learn more about the collections and gain helpful context for working with the data.

Being new to geocoding drove my initial approach, which was to take the titles of items in HABS/HAER/HALS and pass them directly to geocoding software to attempt to generate coordinates. Geocoding typically entails working with structured addresses, either in a broken-down form or as a single string. As such, my method is a bit of an alternative one: it tests the bounds of the software, which can handle plain text but still expects only addresses.
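The shape of that approach can be sketched as follows. This is an illustrative sketch, not the project’s actual code: a real run would call a geocoding service (for example, via a library such as geopy), while here a hypothetical stub geocoder with a placeholder coordinate stands in so the example runs offline.

```python
def stub_geocode(query: str):
    """Hypothetical stand-in for a real geocoding service.

    Returns an approximate (lat, lon) pair for a known query, or None.
    The coordinate below is an illustrative placeholder, not project data.
    """
    known = {
        "Bayonne Bridge, Spanning Kill Van Kull between Bayonne & "
        "Staten Island, Bayonne, Hudson County, NJ": (40.6395, -74.1418),
    }
    return known.get(query)


def geocode_titles(titles):
    """Pass each item title directly to the geocoder, keeping any hits."""
    results = {}
    for title in titles:
        coords = stub_geocode(title)  # a real run would query a geocoding service here
        if coords is not None:
            results[title] = coords
    return results


titles = [
    "Bayonne Bridge, Spanning Kill Van Kull between Bayonne & "
    "Staten Island, Bayonne, Hudson County, NJ",
]
print(geocode_titles(titles))
```

The key point of the sketch is that the full human-readable title, extra phrases and all, is what gets handed to the geocoder; there is no address parsing in between.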

While the item titles in the collections contain addresses, they are intended to be human-readable; they often include additional information, such as mile markers, that would help a person locate an item. In many cases, this extraneous information does not hinder the geocoder’s ability to interpret the title, as with the item shown at the top of this blog post: “Bayonne Bridge, Spanning Kill Van Kull between Bayonne & Staten Island, Bayonne, Hudson County, NJ.” The phrase beginning with “Spanning” is unnecessary to the geocoder, but it does not prevent it from correctly locating the bridge, as shown in Image 1 (below).

Image 1: Geocoding Result (shown in orange circle) for “Bayonne Bridge, Spanning Kill Van Kull between Bayonne & Staten Island, Bayonne, Hudson County, NJ,” as Compared to Known Location (shown in blue circle).

However, in some cases, the extra words can be picked up by the geocoder in a way that obfuscates the actual address, as with the item titled “Falls Bridge, Spanning Schuylkill River, connecting East & West River Drives, Philadelphia, Philadelphia County, PA.” The geocoder latches onto “East & West” and “Philadelphia,” but does not appear to properly weigh “PA,” as it returns an address of “East Ave & West Ave, Philadelphia, Mississippi,” as shown in Image 2 (below).

Image 2: Geocoding Result (shown in orange circle) for “Falls Bridge, Spanning Schuylkill River, connecting East & West River Drives, Philadelphia, Philadelphia County, PA,” as Compared to Known Location (shown in blue circle).

I had suspected this method would not perform perfectly, especially at scale, so I set up the project so that it would be easy to test how the geocoding method performs. This testing was made possible by the existence of HABS/HAER/HALS surveys (“items” for this project) for which we already have known coordinates: a subset of around 12,000 items that became the focus of the project. Most of these coordinates can be accessed through the loc.gov API, and the rest were fetched from Wikidata. By geocoding the titles of these items, I was able to calculate a distance for each item, revealing how far the geocoded result was from the known coordinate. Then, with those distances, I was able to analyze the output of the geocoder.
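A common way to compute such a distance between two latitude/longitude points is the haversine (great-circle) formula, and the evaluation idea can be sketched with it. This is a minimal sketch under my own assumptions, not the project’s code, and the sample coordinate pairs below are rough placeholders for the two bridge examples discussed above.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # mean Earth radius, ~6371 km


def share_within(pairs, threshold_km=1.0):
    """Fraction of (known, geocoded) coordinate pairs within threshold_km of each other."""
    distances = [haversine_km(k[0], k[1], g[0], g[1]) for k, g in pairs]
    return sum(d <= threshold_km for d in distances) / len(distances)


# Placeholder pairs: one near-exact match, one badly off (wrong state).
pairs = [
    ((40.6395, -74.1418), (40.6397, -74.1415)),  # tens of meters apart
    ((40.0096, -75.1920), (32.7715, -89.1167)),  # Philadelphia, PA vs. Philadelphia, MS
]
print(share_within(pairs))  # → 0.5
```

Running something like this over all ~12,000 items with known coordinates is what makes it possible to summarize the geocoder’s accuracy as a single “share within 1 km” figure.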

Using the method described above, the geocoder located almost 60 percent of the 12,000 or so items within 1 km of their known location. Image 3 (below) is a visualization showing those items. The blue dots represent the known locations and the orange dots represent the geocoder-generated coordinates; in this image, the blue dots are hard to see because of how close the corresponding orange dots are to them!

Image 3: Visualization of Geocoded and Original Coordinate Pairs in HABS/HAER/HALS.

In this project, visualizations like the one above represented a turning point for me: working with locations and items at scale felt just like working with any other data until I started to put things on a map. Once I zoom in, distances are suddenly put in the context of roads, rivers, and borders, and the items are not just items but real historic buildings, landscapes, and structures that can be placed on a map. Some of my favorite moments over the entire course of this project came when I dove into individual examples to try to figure out why the geocoder performed poorly on them. Attempting to follow the “logic” of the geocoder was also a neat exercise, especially since the geocoding process is a bit of a black box. Along the way, I made fun discoveries about what is in the collections as well.

To keep the final version of the project clean and concise, I left out some other avenues of inquiry that I explored. One example was trying out a few different geocoding methods. While I was not able to find a better geocoding method for this collection than the one I used, that does not mean one does not exist; it may even be trivially easy for someone else to find. This project is a success not due to any one outcome or result, but because of what the project itself demonstrates: that exploration is always valuable, and that in data as substantial as the collections of the Library of Congress, there are always fascinating discoveries waiting just below the surface.
