National Monument Audit: Methodology and Technical Documentation

Introduction

This document provides in-depth documentation of the data process of the 2021 National Monument Audit by Monument Lab. It was originally authored by the Audit's Lead Data Artist and members of the research team who architected the data workflow. It is made available for members of the public who are interested in understanding how the Audit's "study set" was created from a technical point of view.

Definitions

  1. Data record - metadata about a specific cultural or historical object from a specific data source
  2. Data field - a single piece of information in a data record, e.g. the name of the object or the date an object was created.
  3. Data source - a verified organization, institution, or website that made available a set of digital data records representing cultural and historical objects
  4. Study set - the Audit’s final and official dataset generated from the Audit's data sources with an attempt to include only records about monuments. This is not complete and has known gaps. It also inherits the various issues of its data sources. It attempts to exclude non-monument objects like buildings, bridges, streets, and place names.
  5. Pre-study set - the complete set of data from data sources before excluding non-monument data records.
  6. Data model - how a specific data record is structured. This is made up of multiple data fields such as date constructed, name, object type, subjects, location, etc.
  7. Data ingest - the process of retrieving, filtering, and transforming raw data from data sources into the study set

Scope

  1. This document briefly describes how data sources were discovered, selected, or vetted. It mostly focuses on how the data was processed after it was selected.
  2. This document outlines what data sources were used and which data fields were used for the study set.
  3. This document describes the data ingest process, i.e. how raw data from data sources were retrieved, filtered, and transformed into the study set.
  4. This document describes how a specific data record was determined to be a monument and thus part of the study set.
  5. This document describes the data model of an individual record in the study set.
  6. This document describes how geographical and temporal data was handled.
  7. This document describes how People were identified as honorees of their respective monuments.
  8. This document describes how duplicate records across different data sources were identified and handled.
  9. This document does NOT describe the codebase or give instructions for how to reproduce the data process with the existing or new data. You can visit the project's Github code repository for that information.

Data sources

For this audit, {{dataRecordTotal}} data records were retrieved and analyzed from {{dataSourceTotal}} data sources to generate a study set of {{totalMonuments}} ({{percentMonuments}}%) data records that are believed to represent monuments.

Source selection

The process of identifying and selecting these 42 sources began with an investigation of over a thousand (1,310) potential data sources retrieved through exploration of federal resources, secondary research on each state and territory, email inquiries sent to every State Historic Preservation Office (SHPO) and Tribal Historic Preservation Office (THPO), professional networks, and extensive internet searching. Over five hundred sources (539) were further considered for potential inclusion.

The 42 identified for incorporation into the study set were selected on the basis of a combination of the following factors:

Data ingest process

Each data source provided publicly accessible digital records about cultural objects in a variety of formats. A large part of the work of the Audit was accessing, converting, parsing, and mapping that data into a single, normalized dataset. Here is the rough step-by-step data ingest process after a data source was identified.

  1. The data was downloaded manually. In some cases, the data was not readily available as a single downloadable file, so custom scripts were written to programmatically access and download it (e.g. data embedded in webpages or online maps).
  2. If applicable, the data was converted into a standard, usable format such as CSV, Shapefiles, etc.
    1. This was sometimes done manually, e.g. for geospatial files (GeoJSON, KML, Shapefiles) that needed to be re-projected into a standard map projection.
    2. This was sometimes done programmatically, e.g. for HTML or PDF files that needed to be parsed in a customized fashion.
  3. Fields were manually mapped to the fields in the Audit's data model when relevant (see the sketch after this list).
  4. When applicable, the data was "enriched" using methods outlined later in this document, including:
    1. Geocoding locations when addresses were provided but lat/lon coordinates were not
    2. Extracting names of people that the object is possibly honoring
    3. Placing the object into specific object groups (monument, building, structure, etc.)
    4. Identifying whether the object record is a duplicate of another object record
    5. If the object honors a person, linking that person to a name authority and thus retrieving more information about that person (e.g. gender, ethnic group, birth date)

For more technical details about this process you can visit the Audit's Github code repository.
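
To make the field-mapping step (step 3) concrete, here is a hypothetical sketch of what a per-source mapping might look like in Python. The source key and column names are invented for illustration; the Audit's actual mapping configuration lives in its code repository.

```python
# Hypothetical per-source field mappings: a source's native column
# names on the left, the Audit's data model fields on the right.
FIELD_MAPPINGS = {
    "example_state_inventory": {
        "TITLE": "Name",
        "ALT_TITLE": "Alternate Name",
        "DESC": "Description",
        "LAT": "Latitude",
        "LON": "Longitude",
        "ERECTED": "Year Constructed",
    },
}

def map_record(source_id, raw_record):
    """Translate one raw record into the Audit's data model fields."""
    mapping = FIELD_MAPPINGS[source_id]
    return {target: raw_record.get(native)
            for native, target in mapping.items()}
```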

Full list of data sources

Click "show details" button to view which fields were used, how the fields were mapped, and any other notes about how the data was processed


By the numbers

Share of records, pre-study set (before excluding non-monument data records) and study set (after excluding non-monuments)

Note: from here on, all numbers and charts are based on the study set (monument data records only) and not non-monument records unless otherwise noted

Data source quirks

While many of the data sources contain their own quirks, this section will highlight quirks in the larger, more visible data sources in the study set.

  1. The Smithsonian Institution's Save Outdoor Sculpture! (SOS) dataset is very comprehensive and well described. For the purposes of this audit, the major flaw in this dataset is that latitude and longitude were not originally collected when the dataset was created, a time before many of the simple mapping tools we rely on today were available. The dataset was subsequently geocoded by the Smithsonian, resulting in variably inaccurate geolocations that can be off slightly or significantly. These are often represented by clusters of points on the map (likely the center of a city or town). To get a sense of the extent of the problem, approximately half of the SOS data records fall within a cluster of 10 or more records, which indicates that their coordinates are likely inaccurate. For those records, we attempt our own geocoding (based on their location descriptions), which provides potentially better coordinates for about 40% of them. All in all, if you see a record from SOS, use its location data with caution; it is likely best to look at its address or location description to find its real location. One other quirk of this dataset is that it was created between 1990 and 1995, which means it is missing monuments that have been erected, altered, or removed since 1995.
  2. HMdb (the Historical Marker Database) is a crowd-sourced website and a major part of the study set. The major quirk of this data source is that it focuses on historical markers, even when a marker describes an adjacent monument or memorial. There are many records that are technically about a marker, but the marker exists to support and describe an adjacent monument. For example, this entry about the Washington Monument is one of a few HMdb entries that represent markers that surround and describe the Washington Monument; the monument itself does not have an entry on HMdb. In this case, we attempt to merge all the HMdb markers about the Washington Monument into a single record about the Washington Monument itself. We do this through the de-duplication process as well as by analyzing the captions under the images on HMdb.
  3. OpenStreetMap provides information about historic monuments and memorials with highly accurate geographical precision. However, they use a very specific definition of a monument that can be interpreted inconsistently by OpenStreetMap contributors, resulting in objects that may not fit our definition of monument. Additionally, OSM records often do not contain much more metadata beyond the monument's name and location, so it is difficult to further categorize or validate. Usually we rely on the deduplication process to match OSM records to records from other data sources to get more metadata about them.
  4. National Register of Historic Places is unique in that many data sources refer to or build upon NRHP data records. This creates many potential duplicates across data sources that build upon this data. When possible, we use the NRHP reference number to identify duplication of an NRHP record across data sources before it is processed. Otherwise, we rely on the deduplication process to identify NRHP duplicates across data sources.

Metadata

Data sources differed in the availability of metadata. The quality and consistency of the metadata also varied across the different data sources. This section will describe the inconsistencies and quirks of the source data and how data quality was addressed for the purposes of creating the study set.

Geospatial metadata

Most data records have latitude and longitude available. In many cases the data source itself was in a geospatial format (e.g. GeoJSON, Shapefile) but needed to be re-projected into the WGS 84 coordinate system using QGIS.

However, many records appear to have automatically generated lat/lon coordinates. In some cases this results in incorrect or inaccurate coordinates (e.g. placing a record in the center of a state or city). We attempt to filter these out by identifying clusters of data records that share the same lat/lon coordinate.
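
As a sketch of how such clusters can be flagged, the following counts records sharing identical coordinates; the cluster threshold of 10 mirrors the SOS discussion above but is otherwise illustrative.

```python
# A minimal sketch: flag records whose exact lat/lon is shared by
# many other records, a likely sign of auto-generated coordinates.
from collections import Counter

def flag_clustered_records(records, threshold=10):
    counts = Counter((r["lat"], r["lon"]) for r in records)
    return [r for r in records
            if counts[(r["lat"], r["lon"])] >= threshold]

# Example: 12 records sharing a city-center coordinate all get flagged
records = [{"name": f"Object {i}", "lat": 39.9526, "lon": -75.1652}
           for i in range(12)]
print(len(flag_clustered_records(records)))  # 12
```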

In cases where there is no lat/lon coordinate but there is a full street address, city, and state, we attempt to geocode the record using OpenStreetMap's Nominatim geocoding service. Note that we do this only when all three (street address, city, and state) are present, and we only accept the result if the coordinates fall within the expected state.
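
A minimal sketch of that rule, using geopy's wrapper around Nominatim (pip install geopy); the user agent string is illustrative, and the naive full-state-name comparison is a simplification of the validation described above.

```python
# Geocode a street address with Nominatim and accept the result only
# if it falls within the expected state. Simplified illustration.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="monument-audit-example")

def geocode(street, city, state):
    # Only attempt geocoding when all three parts are present
    if not (street and city and state):
        return None
    location = geolocator.geocode(
        f"{street}, {city}, {state}, USA", addressdetails=True)
    if location is None:
        return None
    # Reject results whose resolved state differs from the expected one
    found_state = location.raw.get("address", {}).get("state", "")
    if found_state.lower() != state.lower():
        return None
    return (location.latitude, location.longitude)
```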

How the geospatial information was obtained is reflected in a new data field that is generated called "Geo Type." This field has one of six values:

  1. Exact coordinates provided - lat/lon coordinates are provided by data source and they are likely valid
  2. Approximate coordinates provided - lat/lon coordinates are provided by data source but they are likely inaccurate; when possible (if street address is present), we attempt to geocode these ourselves. Otherwise, a warning appears in the interface for these likely inaccurate records.
  3. Geocoded based on street address provided - lat/lon coordinates obtained by geocoding street address with Nominatim geocoding service
  4. No valid geographic data provided - street address was provided, but no valid lat/lon coordinate could be found
  5. No geographic data provided - neither lat/lon nor street address were provided by data source
  6. Coordinates manually corrected from original - in rare cases, we manually correct lat/lon coordinates for records that we have confirmed to be inaccurate.

Dates

Dates are present in about two-thirds of all the monument data records. Dates can be grouped into the following categories:

  1. Date constructed
  2. Date dedicated
  3. Date designated or listed as a landmark (local, state, national)
  4. Date removed (rare)
  5. Date commissioned (rare)

Some things to note about dates:

  1. A data source may have none, one, or many of these dates present in their records. And sometimes dates may only be present in a subset of records within one data source.
  2. Dates come in many formats (e.g. 01-01-1900, Jan 1, 1900, 1900, 1900-1901). For the purposes of the Audit, all dates are normalized to a single year (e.g. "1900"). In the case of multiple dates (e.g. 1900-1901), the first one is taken.
  3. For the purposes of the Audit, we use only date constructed and date dedicated, and only for visualization purposes. For showing timelines, we combine date constructed and date dedicated into a single field called "Year Dedicated Or Constructed"; the individual fields ("Year Dedicated", "Year Constructed") are still available. In the case where both dates are available, "Year Dedicated Or Constructed" takes the "Year Dedicated" value.
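
A sketch of the year-normalization rule described in note 2: reduce any date string to a single year, taking the first year that appears. The regex range (years 1000-2099) is an assumption for illustration.

```python
import re

def normalize_year(date_string):
    """Return the first four-digit year found in a date string."""
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", date_string or "")
    return int(match.group(1)) if match else None

for d in ("01-01-1900", "Jan 1, 1900", "1900", "1900-1901"):
    print(normalize_year(d))  # 1900 in every case
```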

Object Types

Some data sources provide information about the physical form of the object. This can include very broad groups such as "Monument", "Building", or "Marker", or very specific forms such as "Bench", "Fountain", or "Relief". These values are not consistent across data sources, which track them with different levels of strictness and specificity. When available, we rely heavily on this field when determining if an object is a monument (see section What is a monument?).

Note: the next two charts (object types and object groups) are based on the pre-study set, before non-monument records were filtered out

We attempt to disambiguate and normalize object types into major groups, in particular, monument and non-monument groups. See What is a monument? section.

Honorees

Very few data sources specify who or what is being honored.

For records that do not have honoree information (most records), in order to understand who or what is being honored, we try to automatically determine this through a process of entity extraction and entity linking. This process is roughly as follows:

  1. The text of the following fields is analyzed using a tool called spaCy: "Name", "Alternate Name", "Honorees", "Description"
  2. spaCy does many things like tokenization and part-of-speech tagging, but most importantly and relevantly, it does named-entity recognition. This essentially tells us whether people, events, and organizations are mentioned in the data fields we provide.
    • For example, in the following text: "Andrew Dickson White, 1832-1918, friend and counselor of Ezra Cornell, and with him associated in the founding of the Cornell University", the following named entities are expected:
      1. Andrew Dickson White (PERSON)
      2. Ezra Cornell (PERSON)
      3. Cornell University (ORGANIZATION)
    • spaCy has a number of entity types available (see page 21 in this document), but we only use these:
      1. PERSON - People, including fictional
      2. NORP - Nationalities or religious or political groups
      3. ORGANIZATION - Companies, agencies, institutions, etc.
      4. EVENT - Named hurricanes, battles, wars, sports events, etc.
      We currently only use "PERSON" in the search interface.
    • In order to do this, spaCy uses a model trained on a large annotated corpus called OntoNotes, which draws from a variety of sources such as news, conversational telephone speech, weblogs, usenet newsgroups, broadcast, and talk shows. It is important to note that any bias present in this corpus will be inherited by the entities recognized using spaCy.
  3. After we receive named entities, we then attempt to link each entity to the specific "real-world" person, event, or organization. We do this by matching named entities against Wikidata to find their corresponding entries. The advantage of this is that we can map multiple named entities ("Martin Luther King", "Martin Luther King, Jr.", "MLK") to a single specific person and retrieve additional structured information about that person (e.g. date of birth, occupation, gender).
    • An important note: just like spaCy, data about specific people on Wikidata inherits whatever biases the Wikidata contributors have.
    • We are considering using the "ethnic group" field (e.g. "African American"), when present in a Wikidata entry, as a proxy for estimating the diversity (or lack thereof) of monumental representation. However, this is a very complex and fraught topic, so it requires explicit contextualization if used. Furthermore, if the person being represented is white, "ethnic group" is usually not present, i.e. white is treated as the default, which is itself another topic of discussion. Please see the Note on Demographics for a further discussion of demographic information in the audit.
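
As an illustration of steps 1 and 2, here is a minimal sketch using spaCy's off-the-shelf small English model (python -m spacy download en_core_web_sm); the Audit's exact model and configuration live in its code repository. Note that spaCy's label for organizations is "ORG" rather than "ORGANIZATION".

```python
import spacy

nlp = spacy.load("en_core_web_sm")
USED_LABELS = {"PERSON", "NORP", "ORG", "EVENT"}

text = ("Andrew Dickson White, 1832-1918, friend and counselor of Ezra "
        "Cornell, and with him associated in the founding of the "
        "Cornell University")

# Keep only the entity types the Audit uses
for ent in nlp(text).ents:
    if ent.label_ in USED_LABELS:
        print(ent.text, ent.label_)
# Typically yields:
#   Andrew Dickson White PERSON
#   Ezra Cornell PERSON
#   Cornell University ORG
```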

Note: We originally also analyzed plaque and marker text; even though it helped identify more people, it resulted in too many false positives. For example, a Civil War monument may quote Abraham Lincoln on its plaque without being a monument to Lincoln himself. Therefore, in the case of Lincoln, we only consider a monument to honor him if his name appears in the name of the monument, is listed explicitly as an honoree, or is contained within the monument description.
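
For step 3, here is a sketch of the entity-linking idea using Wikidata's public wbsearchentities API; the Audit's actual matching logic is more involved, so treat this as illustrative.

```python
import requests

def link_entity(name):
    """Return the Wikidata item id of the top search result, if any."""
    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "type": "item",
        "format": "json",
    })
    results = resp.json().get("search", [])
    return results[0]["id"] if results else None

# Name variants typically resolve to the same item (Q8027 for
# Martin Luther King Jr.)
for name in ("Martin Luther King", "Martin Luther King, Jr.", "MLK"):
    print(name, "->", link_entity(name))
```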

Plaque or marker text

About a third of the records contained the inscribed text that accompanies the object, such as the text on an embedded plaque or an adjacent marker. This text is often very rich, with a variety of formats (e.g. quotations, lists of names, narratives).

Subjects

Some sources categorized their objects in a variety of structured and unstructured ways, in fields like "Topic", "Tag", "Type", "Category", "Subject", etc. These are rarely consistent across data sources. Sometimes they contain broad topics (e.g. "Settlements & Settlers"), events ("U.S. Civil War"), groups of people ("Native Americans"), use types ("Cemeteries & Burial Sites"), or physical properties ("Abstract"). These fields are used for keyword searching, for determining if an object is a monument (see section What is a monument?), and for determining if it falls under certain themes (e.g. War & Weaponry).

Other fields of interest

There were other fields that may be useful for further research and visualization purposes, including creators (architects, sculptors, artists, foundries, etc.) and sponsors.

What is a monument?

One of the central challenges of this Audit was identifying which data records represented "monuments" in the conventional sense of the word. This section will walk through the practical process and application of the Audit's definition of "monument" as it relates to generating the final study set. This generally means describing the “rules” of what is counted as a monument based on its metadata so that a computer script can categorize a data record as a monument or something else. To start, some high-level points:

  1. Most data sources provided data records that were a mix of different types of objects such as markers, buildings, monuments, and structures. In other words, most sources did not simply provide a set of "monuments."
  2. Data records that represented monuments were often not marked as such, i.e. there was no field that called the object a "Monument"; the field may have instead been called "Object", or the physical form may not have been described at all.
  3. Sources that did call an object a "monument" either had their own definition of "monument" or provided no definition at all. The definitions of "monument" across data sources differed.

Given that, here is the high-level process for determining if something is or is not a monument:

  1. If a data source explicitly categorized something as a monument, we assume that it is a monument.
    • This is because we tried to defer to the sources' data when possible rather than apply our own logic.
    • Only a few of our data sources had fields that categorized their records in this way (e.g. OpenStreetMap, Contemporary Monuments to the Slave Past, Whose Heritage?, Pioneer Monuments, National Park Service: Points of Interest).
    • This is done with the full awareness that the underlying objects may be mis-categorized as "monuments" or may not exactly match our interpretation of "monuments."
    • Known issue: OpenStreetMap data seems to contain "monuments" that we may not consider to be monuments based on our own definition; this may be because its definition differs from ours and it is crowd-sourced.
  2. If a data source does not explicitly tell us if something is a monument, we attempt to determine if something is a monument by looking for keywords in certain data fields. This process is roughly as follows:
    1. A large number of records are filtered out based on keywords that indicate that they are part of different object groups such as markers ("Washington Slept Here historical marker"), buildings ("Washington School Compound"), streets ("Washington Street"), structures ("Washington's Canal"), sites ("George Washington Oak Tree Site"), or places ("Washington County"). This relies heavily on a record's name and object type fields.
    2. For an object or structure that calls itself a memorial (e.g. "Vietnam Veterans Memorial"), we consider this a monument, as long as it is not already in a different object group (e.g. memorial highways, memorial high schools are not monuments.)
    3. For the records that remain, we then look for keywords that indicate their physical form (e.g. bust, statue, cannon) and what or whom the object represents. Generally speaking, we are looking for large, solid forms (monoliths, statues, obelisks, pyramids) with some indication that they honor a person, group, or event. The criteria are roughly:
      1. The object is part of certain physical forms that we consider monumental (obelisks, pyramids, statues, pillars, etc)
      2. The object honors a person, group, or event, e.g. the subject may contain words like "human figure", "heroic", or "honoring", or a historical name or event may be included in the name (e.g. "Harriet Tubman Statue")
      3. Conversely, we check for indications that the object is in a group we roughly consider artistic/abstract sculptures that do not honor a person, group, or event. In this case, the subject may contain "Abstract--Geometric" or "Plant--Flowers" with no mention or indication of an honoree or human figure.
    4. Any records that remain are considered "Uncategorized" objects. These are likely records with little or no metadata other than a name and location, where the name is ambiguous (e.g. "World War II Roll of Honor" fits our notion of "honoring" but gives no indication of physical form).
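
To make this concrete, here is a highly simplified sketch of the keyword-based grouping; the keyword lists are invented stand-ins for the Audit's much longer lists, and real records go through more checks than shown here.

```python
# Simplified keyword-based object grouping (illustrative keywords only)
NON_MONUMENT_GROUPS = {
    "Marker": ("marker", "plaque"),
    "Building": ("school", "church", "house"),
    "Street": ("street", "avenue", "highway"),
    "Place": ("county", "township"),
}
MONUMENT_FORMS = ("statue", "obelisk", "bust", "monolith", "pyramid")
HONORING_HINTS = ("honoring", "memory of", "heroic", "human figure")

def assign_object_group(name, object_type="", subjects=""):
    text = " ".join((name, object_type, subjects)).lower()
    # 1. Filter records into non-monument groups first
    for group, keywords in NON_MONUMENT_GROUPS.items():
        if any(k in text for k in keywords):
            return group
    # 2. Self-described memorials count as monuments
    if "memorial" in text:
        return "Monument"
    # 3. A monumental physical form plus an honoring indication
    if (any(k in text for k in MONUMENT_FORMS)
            and any(k in text for k in HONORING_HINTS)):
        return "Monument"
    # 4. Everything else is left uncategorized
    return "Uncategorized"

print(assign_object_group("Washington Street"))              # Street
print(assign_object_group("Harriet Tubman Statue",
                          subjects="human figure"))          # Monument
```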

Handling duplicates

Handling duplicate records across different data sources was a challenging area of work that is not completely solved. This refers to different records from different data sources that represent the same real-world object. We used a combination of object identifiers, geographic coordinates, and name fields to determine if two or more records were representing the same object. Here is the rough process:

  1. We look for records that have nearly the same latitude and longitude (within some radius)
  2. Within a group of such records, we analyze the names of the objects. If the names are similar, we consider the records duplicates. We use a library called FuzzyWuzzy (which uses Levenshtein distance) to compare two names after they are "normalized", i.e. lowercased with punctuation and parenthesized content removed. This allows similar names (e.g. "Martin Luther King Civil Rights Memorial" and "Doctor Martin Luther King Junior Memorial") to be considered a match.
  3. Groups of duplicate records are then "merged" into a single, combined record. In the case of list values, such as "subjects" and "honorees", the lists are unioned with duplicates removed. In the case of single-value fields like name or construction date, more "official" sources (e.g. National Park Service, Smithsonian) take preference. For latitude and longitude, certain sources that we believe have good geolocation (e.g. OpenStreetMap) take preference.
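
A sketch of the name-matching step using fuzzywuzzy (pip install fuzzywuzzy); the normalization follows the description above, but the threshold of 80 is illustrative, not the Audit's tuned value.

```python
import re
from fuzzywuzzy import fuzz

def normalize_name(name):
    name = name.lower()
    name = re.sub(r"\(.*?\)", " ", name)   # drop parenthesized content
    name = re.sub(r"[^\w\s]", " ", name)   # drop punctuation
    return re.sub(r"\s+", " ", name).strip()

def names_match(a, b, threshold=80):
    score = fuzz.token_set_ratio(normalize_name(a), normalize_name(b))
    return score >= threshold

print(names_match("Martin Luther King Civil Rights Memorial",
                  "Doctor Martin Luther King Junior Memorial"))  # True
```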

There is one special case for duplication, which concerns records in the National Register of Historic Places (NRHP). A number of state and local data sources contain NRHP records, and crowd-sourced sources such as OpenStreetMap often build upon NRHP data. When possible, we use the NRHP reference number to identify duplication of an NRHP record across data sources. This happens as a pre-processing step, so redundant NRHP records are ignored before they are processed and entered into the study set, unless they contain additional metadata (as in OpenStreetMap).

The study set

As mentioned before, the Audit's final and official dataset generated from the Audit's data sources is called the "study set", which attempts to include only records about monuments. It is not complete and has known gaps. It attempts to exclude non-monument objects like buildings, bridges, streets, and place names. It also inherits the various issues of its data sources, including outdated and inaccurate information. Records in the audit were augmented from their original sources in some cases, but were not corrected, so that records in the audit match records in the original dataset. Where there are inaccuracies in the data, users are encouraged to contact the original data creators.

Data model

This dataset is composed of records that mix fields taken verbatim from the data source with fields that Monument Lab generated using those fields. Both unedited fields and generated fields are available within the study set.

The full list of unedited fields taken directly from the data source is:

Name - The name of the monument
Alternate Name - The alternate name or subtitle for this monument
Vendor Entry Id - The identifier for this monument supplied by the data source
Image - A URL of an image of the monument. In the case of multiple images, this is the first visible image
Description - A long description of the monument
Text - The plaque or marker text with formatting removed
Source - The data source
URL - The URL of the data source's monument record
Street Address - The street address where the monument is located
City - The city where the monument is located
County - The county where the monument is located
State - The state where the monument is located
Latitude - The latitude of the location of the monument in degrees
Longitude - The longitude of the location of the monument in degrees
Location Description - An unstructured description of the location of the monument
Year Dedicated - The year that the monument was dedicated
Year Constructed - The year that the monument was constructed
Year Dedicated Or Constructed - The year that the monument was dedicated or constructed, whichever is available first
Object Types - List of the physical type(s) of the object (see Object Types)
Use Types - List of what this object is used for (e.g. military, religious purposes)
Subjects - List of open-ended categories or topics that this monument falls under (see Subjects)
Honorees - List of who or what the monument is honoring (see Honorees)
Creators - List of individuals involved in the construction/creation of this monument (architects, sculptors, foundries, etc.)
Sponsors - List of entities that sponsored this monument
Dimensions - The physical dimensions of the monument (in a non-standard format)
Material - A description of material(s) that make up this monument
Year Removed - The year that the monument was removed
Wikipedia - Wikipedia identifier

The full list of generated fields (not provided by the data source, but created by the Audit to enrich the provided data) is:

Id - The unique identifier of this monument (a combination of the Vendor Entry Id and the Source Id)
Duplicate Of - The identifier of this record's "parent" record if it is a duplicate record
Duplicates - A list of identifiers of child records if this record was merged from duplicate records
Object Groups - The object group(s) that this object belongs to (e.g. Marker, Monument, Building). Merged records sometimes have multiple object groups since their child records may disagree. For merged records, if any one of the child records is a monument, the merged record is considered a monument.
Object Group Reason - A list of reasons why an object was assigned a particular object group
Monument Types - The physical type(s) of the monument (e.g. obelisk, bust, pyramid) if it is in the Monument object group
Entities People - A list of people that we believe this monument is honoring. See the Honorees section
Ethnicity Represented - The ethnic group of the person being honored. See the Honorees section
Gender Represented - The gender of the person being honored. See the Honorees section
Themes - The thematic category this monument falls into
Geo Type - The type of latitude/longitude this record has. See the Geospatial metadata section
County Geoid - The identifier of the county where the monument is located

Accessing the data

There are two main ways to access the data:

  1. Through the public online interface accessible through a desktop browser connected to the internet. This will be the way most people access the data since it provides a host of features that make it easy to search, browse, and filter through the study set. You can use the interface at this URL: monumentlab.github.io/national-monument-audit/app/map.html
  2. We also make the data available as data downloads in .csv format, which can be read by most spreadsheet software such as Excel or Google Sheets. This is for users who are comfortable working directly with large datasets or who want to make their own interfaces or visualizations. You can download these files directly via the links below:
    1. Complete study set (monuments only) in .csv format (59MB)
    2. Complete pre-study set (includes non-monuments) in .csv format (415MB)
    Note that if a cell is a list, it will be delimited by a | (pipe) character. If a cell has no value, it will simply be an empty cell.
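
For those working with the downloads programmatically, here is a sketch of loading a CSV with pandas and splitting pipe-delimited list cells; "study_set.csv" is a stand-in for the downloaded file name, and "Subjects" is one of the list-valued fields described in the data model above.

```python
import pandas as pd

df = pd.read_csv("study_set.csv")

def split_list_cell(value):
    """Turn a pipe-delimited cell into a Python list (empty if blank)."""
    if pd.isna(value) or value == "":
        return []
    return str(value).split("|")

df["Subjects"] = df["Subjects"].apply(split_list_cell)
print(df[["Name", "Subjects"]].head())
```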

The search interface

The Study Set Data Interface allows users to browse records by location, to use filters to limit results, and to do keyword searches across the records. Here are the fields being searched with a keyword search (the fields are weighted so the "weight" or "boost" is in parentheses):

It is important to note that the results of a keyword search may not exactly match the results of a filter for the same term. For example, a monument that includes a quote attributed to Helen Keller will be shown in a search for "Helen Keller" but may not be included by selecting "Helen Keller" through the People filter, because the monument quotes her rather than honors her.

Note on demographics

Introduction

Recognizing that the demographics of the monument landscape are of special interest, and that providing them comprehensively for the 48,178 records in the National Monument Audit dataset is not possible, we would like to offer some insights and trends based on the data sources that track gender, race, and ethnicity amongst the forty-two data sources from which our dataset draws.

We can, for example, state that of those 48,178 records, 4,802 have a result for gender, of which 4,528 are male and 274 are female. This is possible because entity recognition for this selection was approached by cross-referencing names stated across multiple categories in each of the 42 data sources with the Wikidata category "human." Where Wikidata had gender, race, ethnicity, and place of birth, this data as provided was incorporated into the record for that individual for the purposes of analysis. However, with such a low percentage of records (less than ten percent) for which our algorithm and matching approach gives us a gender assignment, this ratio tells us very little on its own.
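
As an illustration of the Wikidata cross-referencing described above, here is a sketch of retrieving demographic properties for a linked item via Wikidata's public wbgetclaims API. The property ids are real (P21 is "sex or gender", P172 is "ethnic group"), but the Audit's actual retrieval pipeline is more involved.

```python
import requests

def get_item_claims(entity_id, property_id):
    """Return the item ids an entity's claims point to for one property."""
    resp = requests.get("https://www.wikidata.org/w/api.php", params={
        "action": "wbgetclaims",
        "entity": entity_id,
        "property": property_id,
        "format": "json",
    })
    claims = resp.json().get("claims", {}).get(property_id, [])
    return [c["mainsnak"]["datavalue"]["value"]["id"]
            for c in claims if "datavalue" in c["mainsnak"]]

# Q8027 is Martin Luther King Jr.; P21 should yield Q6581097 ("male")
print(get_item_claims("Q8027", "P21"))
```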

By breaking out the demographic information available in those data sources which actually track gender, race, and ethnicity, we can offer an additional check on the validity of the ratio, as well as highlight some of the nuances at play within and across those data sources. In addition, the Monument Lab National Monument Audit Research Team can provide demographic data with a higher level of reliability for the fifty people for whom there are the most public monuments. This is possible because, in addition to the cross-referencing with Wikidata, the public monument records for the top fifty were systematically fact-checked to remove misidentifications, minor mentions, and redundant records across and within data sources, which provided a measure of strict investigation of the named individuals that could not be executed for the dataset as a whole.

From looking at each of these data sources individually and in comparison to one another as well as our top fifty names, we can offer insights into the demographic composition of the American monument landscape.

Gender

Of the forty-two data sources compiled by Monument Lab for the National Monument Audit, only four track gender as an attribute. Each of these does so in a different fashion, applying gender to different roles in relationship to the monument (i.e. sculptor, sponsor, subject) and applying gender categories inconsistently to records within the data source as a whole.

The four data sources are:

  1. Georgia Historical Markers tracks gender through 'marker subject': 9 of 23 records with results appearing under 'gender' (8 male, 1 female)
  2. Smithsonian SOS tracks gender through a 'figure female' tag: 2,780 of 24,191 records with results appearing under 'gender' (2,617 male, 163 female)
  3. Pioneer Monuments tracks gender through a column called 'Design,' which includes the options 'man', 'woman', 'man and woman', 'family', 'group', and 'other': 14 of 141 records with results appearing under 'gender' (12 male, 2 female)
  4. HMdb tracks gender through 'Topics and series,' which sometimes includes 'women,' but is not consistent: 1,768 of 15,263 records with results appearing under 'gender' (1,674 male, 94 female)

In each of the data sources that track gender data, the category female is wildly underrepresented. It is worth noting that the much larger SOS and HMdb data sources show a greater disparity in gender representation and, between them, represent a significant portion of the record keeping on the American monument landscape. These proportions are in keeping with our findings among the fifty most represented people in our study set, of which only three are women: St. Joan of Arc (no. 18), Harriet Tubman (no. 24), and Sacagawea (no. 28). Two of the three are women of color.

Race and Ethnicity

Of the forty-two data sources compiled by Monument Lab for the National Monument Audit, only six track race and/or ethnicity as an attribute. Each of these does so in a different fashion, with different categories, applied to different roles in relationship to the monument (i.e. sculptor, sponsor, subject), and with inconsistent application of racial or ethnic categories to records within the data source as a whole.

The six data sources are:

  1. Georgia Historical Markers doesn't have race or ethnicity of subject, but does have a 'marker subject' filter which includes African American History and Native American History, among other historical topics, events, and occupations. 0 of 23 records with results appearing under 'ethnicity.'
  2. Smithsonian SOS has a "culture" field that provides some demographic categorization, and often additional information in the "topic" field, which includes an "ethnic" categorization and other hints at subject demographics. 884 of 24,191 records with results appearing under 'ethnicity': English Americans (202), African Americans (119), Italians (95), Americans (75), Germans (42), English people (39), Irish Americans (35), Scotch-Irish Americans (26), Criollo people (21), Jewish People (20), Litvin (14), Copts (13), Shoshone People (13), Britons (12), Poles (12), Shakya (12), Swedish American (11), Scottish American (10), German Americans (9), Greeks (9), White Americans (9), French (8), Mohawk people (8), Norsemen (8), Indigenous People of the United States (8), Berbers (7), Indigenous peoples of America (7), Scottish People (5), Cherokee (4), Gujarati People (4), American Jews (3), Armenian American (3), Irish People (3), Ukrainians (3), White people (3), Arabs (2), Norwegians (2), Oglala Lakota (2), African Jamaican (1), Ashkenazi Jews (1), German Texan (1), Kiowa people (1), Thembu tribe (1), Yanktonai (1).*
    *SOS data in particular is difficult to disaggregate from the Wikidata sourcing on race and ethnicity because it is a large set, and because both sources offer a greater range of variability in potential responses.
  3. HMdb has 'topic and series' filters for different races/ethnicities. 307 of 15,263 records with results appearing under 'ethnicity': African Americans (69), English American (69), Americans (26), Scotch-Irish Americans (25), English people (12), Irish Americans (12), Litvin (10), Shoshone people (8), German Americans (7), Britons (6).
  4. Jefferson County KY Historic Markers contains a 'SUBJECT1' tag which includes 'African Americans', 'Indians', 'Germans', etc. 'SUBJECT1' is also where occupation is logged. 0 of 1 record with results appearing under 'ethnicity.'
  5. NPS National Register tracks race/ethnicity through the "Cultural Affiliation" section on its paper form. Not all forms in the set are consistent; some lack this section. 17 of 729 records with results appearing under 'ethnicity': American (3), English American (3), Italians (2), indigenous peoples of the United States (2), English people (1), French (1), Germans (1), Litvin (1), Scotch-Irish Americans (1), Welsh People (1), indigenous peoples of America (1).
  6. Pennsylvania Historical Markers has the following options in its 'Category' dropdown: Native American, African American, Ethnic & Immigration, women. 3 of 112 records with results appearing under 'ethnicity': African-American (1), Americans (1), Litvin (1).

In each of the data sources that track race/ethnicity data, few records actually record race or ethnicity. Where we do have records, the majority fall under categories that we would most likely term "white" colloquially, but which are noted with a specificity that gestures to a moment in which that category had not yet consolidated or expanded to include groups now within it. By extension, the notation of race or ethnicity seems to function as a means of marking an "other," making it likely that monuments featuring non-white individuals are tagged as such with higher frequency. The low number of non-white records is in keeping with what we find among our fifty most represented people in public monuments: of the fifty most represented people in our study set, only five are people of color. Three are African-American/Black; two are Native American/Indigenous. They are Martin Luther King Jr. (no. 4), Harriet Tubman (no. 24), Tecumseh (no. 25), Sacagawea (no. 28), and Frederick Douglass (no. 29).

There are no United States-born Latinx people and no Asian Americans or Pacific Islanders among the fifty most represented people in our study set.

Conclusion

Relative to the monument landscape and to their proportions within modern or historical populations, women and non-white individuals are underrepresented. While we believe that the data presented here offers compelling evidence for that conclusion, despite its inconsistencies, we do not believe that any data we might present would be more compelling than the evidence presented through a brief stroll around the civic centers of most American towns and cities.

As is clear from the relatively small number of sources that track gender, race, and ethnicity, this is an undertracked aspect of the record keeping associated with the American monument landscape. This should be addressed systematically as part of an effort to improve monumental record keeping more broadly. It is, however, worth noting that our research, as well as that of our friends who created the New York City Public Art Inventory dataset, indicates that retroactively applying gender, race, ethnicity, sexual orientation or other categories of identity we currently deem important can come with additional challenges.

  1. As we found with Wikidata, the retroactive assessment of an individual’s identities can apply twenty-first century categories to historical figures inappropriately, for example providing national identities that predate the creation of a nation state.
  2. The external assessment of an individual's identities can also carry the biases of the assessor's standpoint, for example the USian erasure of the casta system and its identity forms, or the inconsistent assignment of two-spirit people.
  3. Identities are not (a) consistent or (b) consistently public. A major concern of both the Monument Lab research team and the NYC Public Design Commission was the risk that presenting individuals with queer identities dating only from the period in which those identities could be openly claimed by some within USian society not only erases millennia of others who shared those sexual orientations, but also provides ammunition for those who would claim that that legacy of presence does not exist.

The inconsistency within data sources indicates a few potential factors:

  1. That data collected over a period of time, or under the guidance of a variety of coordinators, may include different demographic information or differing amounts of demographic data based on interest, changing norms, or shifting legal or political priorities.
  2. That data collection which includes or relies upon the written text of the monument, marker, or other associated commemorative language may apply ethnic or racial categories that are no longer extant (i.e. Criollo people), whose state/political meaning has shifted (i.e. Litvin), which incorporate terms now considered ethnic slurs, or which have broadly been collapsed under the racial category 'white' (i.e. German-American, Scotch-Irish American).

The technical process

Please visit the project's Github code repository for technical details of the data process and interface.