Art Tracks: Visualizing the stories and lifespan of an artwork
Abstract
The Carnegie Museum of Art is attempting to structure provenance data so curators, scholars, and software developers can create visualizations that answer questions that would be difficult or impossible to answer without computer assistance. Provenance, the written description of the history of ownership and custody of art, is typically written as a list of the periods, places, and owners of an artwork. It captures the current best understanding of this history in a succinct and precise manner, and illustrates the gaps and uncertainties that still remain. Provenance is typically written as semi-structured text, following an institution-defined format. It would be useful to have a structured, computer-readable format for this data, allowing for search, visualization, and aggregated research. The American Alliance of Museums’ suggested standard, widely used across museums, is not defined with enough specificity to allow automated extraction of the structured data contained within provenance texts. Also, the provenance record model in collection management systems (CMS) is often not designed for structured data or does not provide a way to verify that the provenance text matches the structured data. A comprehensive text-based provenance standard, paired with a software library that can parse records written using this standard and convert them into structured data, would allow existing workflows to remain in place while allowing structured data to be automatically extracted from provenance records. The records could continue to be stored within existing CMS databases but contain machine-readable data for use in research and visualization. Beyond the data itself, the stories these objects hold are often moving and sometimes astonishing.
This ability to ask impossible questions and receive answers previously inaccessible across a museum’s collection and (eventually) across many museums’ collections is a resource art historians and scholars will find extremely valuable.
Keywords: Collections, Linked Open Data, Data Visualization, Provenance
The understanding and research of provenance has always been an essential part of the museum’s role. Provenance is one of the most comprehensive ways in which museums determine the authenticity of the works within their collection, and as such provenance has often been recorded as part of sales and auctions of artwork. More recently, the American Alliance of Museums (AAM) has included guidelines for provenance research in their Code of Ethics (http://aam-us.org/resources/ethics-standards-and-best-practices/code-of-ethics) and their standards for collections stewardship (http://aam-us.org/resources/ethics-standards-and-best-practices/collections-stewardship) to encourage museums to identify and publish their artworks that may have been looted during the Nazi era.
For years, the Carnegie Museum of Art (CMOA) has been undertaking provenance research as part of its commitment to this standard, as well as conducting research in response to specific exhibitions and as part of normal stewardship of the permanent collection.
2. Initial concepts
In early 2014, CMOA began formal work on a project known as Art Tracks to bring its provenance research into the digital world. This is part of an Institute of Museum and Library Services (IMLS)–funded three-year project to develop a technology-based interactive framework employing standard museum catalog data and best practices to allow users to visually chart the life cycles of art objects over time and across distances. The goal was to transform dry, unengaging museum provenance and exhibition records into lively historical narratives about art, museums, and history, and thus enhance visitors’ experiences of art both in the museum and on the Web.
Additionally, our goal has been to develop this as a tool not only for CMOA, but for use across the industry. Information about the history of collecting is not institution-specific, and being able to aggregate this data across multiple museums will provide significantly greater value to the research community. When multiple institutions integrate their data and link narratives, a more holistic picture of history told through the lens of art collecting becomes apparent. One can see, for instance, the ways in which regional and global events (from wars and economic depressions to migrations and technological developments) impact artistic production, presentation, and innovation. As such, we have worked to make sure that the technological and content decisions we make are based on existing standards and best practices, and that any tools developed are broadly useful to many museums, not just CMOA.
Over the past year, we have worked to lay the technological foundation for this ambitious goal. Phase One of this project has been focused on quantifying our understanding of provenance and preparing the tools we will need as we begin to understand what it means to use provenance to help tell the stories of artwork. Coming out of Phase One, we have developed four specific projects, which together allow us to demonstrate the potential for structured provenance information for museums.
The first of these projects is a recodification of the AAM standard for writing provenance, to allow automated structuring from provenance texts. The second is a software parser that performs this structuring, converting semi-structured text into structured data. The third is a user interface that allows museum researchers to quickly read, modify, and verify the automated conversion, and to assist in the conversion of current records to the new standard. The fourth is a prototype gallery interactive using the structured data.
3. Current provenance standards
Traditionally, a provenance record for a specific artwork was a summary of the known information about the transfers and previous owners of the artwork, with references to original documentation. The goal was to communicate to potential buyers the authenticity of this particular artwork, and it was rare outside of the academic world to view multiple provenance records at the same time. As such, there was not a strong need for a standard for writing provenance.
As museum collections grew, it was often useful to have an internal standard for recording this information, to allow the research to be understood by multiple readers. There was not a need for standardization across institutions, or for guidelines to be publicly discussed: the main consumers of the information were internal, and when information was to be communicated to people outside of the institution, it was being shared with those with an understanding of the source material who were aware of both the context and history of the information and could infer information using their expertise.
The Association of Art Museum Directors issued guidance to its members on identifying works with Nazi Era provenance (AAMD, 1998) and followed that in 2007 with a document outlining a series of questions for members to use to evaluate their institutional practices in identifying and restituting these works (AAMD, 2007). In 2000, the AAM “Standards Regarding the Unlawful Appropriation of Objects During the Nazi Era” (http://aam-us.org/resources/ethics-standards-and-best-practices/collections-stewardship/objects-during-the-nazi-era) mandated the publication of records for works that may have been unlawfully appropriated. This mandate included requirements for the publication of the provenance of artworks and for the formalization and publication of the standards by which these records would be communicated. As part of this mandate, the AAM provided an optional, suggested standard for publishing provenance (Yeide, 2001).
CMOA and many other museums have adopted the AAM suggested format for the publication of their provenance data. As part of the initial design work done for the Art Tracks project, an informal survey of other institutions’ published standards was conducted. Of the twenty institutions surveyed, half published their provenance standards on their websites. Of those ten, eight standards were closely based on the AAM suggested standard, one was a variation of the standard, and only one institution defined a standard that was not based at all on the AAM suggestion. Additionally, of the ten institutions that do not publish their standard, eight publish provenance research online, and those eight all appear to be writing their provenance based on the AAM suggested standard.
Through this standardization of provenance text, researchers are able to more quickly understand the information communicated. Standardization also ensures that essential pieces of information about a work’s history are communicated. There are also benefits in workflows and in the education of provenance researchers, through the codification of style for this information.
However, there is room for improvement within this standard. It was designed to allow for the information about a single work of art to be tersely expressed and for research on that work to be done effectively, and for those uses it performs admirably. The current practice does not, however, allow for research to be performed over aggregations of work: we can ask if this painting was in England during the 1920s, but we cannot ask which paintings were in England during the 1920s. In order to ask that sort of question, a researcher would have to identify a subset of potential works and examine each record to determine if that work qualified. For specific, important queries, this is a workable solution, as shown by the work done by museums in compiling their lists of problematic artwork. It is not a solution that scales easily, however, and it limits such queries to those deemed of sufficient importance to merit dedicated work.
To allow these questions to be answered quickly, an automated solution is required. Due to the publication requirements in the AAM mandate, most institutions have moved to digital storage for their provenance records. Often this involves the paragraph that describes the provenance of the work being stored within an institution’s collection management system (CMS). Once the text is digital, it is possible to perform full-text searches on the records. Full-text search allows for an initial level of automation: it becomes trivial to collect all records that reference a specific name or location. While this is a dramatic improvement, it is not sufficient to allow querying of the data at the level we’d like. Full-text search does not capture implicit dates or take into consideration the semantic meanings of text: a search for “paris” will return information for both “Paris, France” and “Paris Hilton,” and searching for “the 1920s” will not return records that refer to October 1, 1921.
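The limits of full-text search described above can be illustrated with a minimal Python sketch. The two records and the `full_text_search` helper are hypothetical, invented only for this illustration; they are not taken from the CMOA collection or from any CMS.

```python
from datetime import date

# Hypothetical provenance texts, invented for illustration only.
records = {
    "record-1": "Jane Doe, Paris, France, October 1, 1921; John Doe, London, England, 1935.",
    "record-2": "Paris Hilton, Los Angeles, CA, 2005.",
}

def full_text_search(query: str) -> list[str]:
    """Return the ids of records whose text contains the query, ignoring case."""
    needle = query.lower()
    return sorted(rid for rid, text in records.items() if needle in text.lower())

# The city and the person are indistinguishable to a substring match.
paris_hits = full_text_search("paris")

# "the 1920s" never appears literally, so the 1921 record is missed.
decade_hits = full_text_search("the 1920s")

# With a structured date, the decade question becomes a range comparison.
acquired = date(1921, 10, 1)
in_1920s = date(1920, 1, 1) <= acquired <= date(1929, 12, 31)
```

Here `paris_hits` contains both records, `decade_hits` is empty, and `in_1920s` is true, which is the kind of answer that only a structured date can provide.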
4. A new standard for provenance
In order to allow for research employing both semantic faceting and calculated date ranges, we need a structured representation of provenance data. This structured representation should contain specific fields for each unique fact that is contained within a provenance record and could also be useful as part of a structured query. As with all data modeling problems, there is the need to balance sufficient specificity for accurate scoping with sufficient simplicity to allow both reasoning and data entry. Additionally, it’s important to restrict the scope of the structure to the problem at hand. At its core, the model of provenance needs to represent the periods, parties, and locations of ownership for works of art, and the transfers between these owners.
The other issue with structured data is that it needs to be digitized and stored. Some CMS implementations allow for the recording of structured provenance data within the database, while others only provide a place to store text, and none of the existing implementations currently provide granularity at the level expressed within the AAM standard. The need to store this structured data while maintaining the existing paragraph of provenance text presents a structural problem: how do you maintain information in two formats without allowing those two formats to diverge?
Given the underlying structure of the provenance information, the Art Tracks project considered several possibilities for a canonical reference for this information. We developed the following goals:
- Provenance must be represented as unambiguous structured data.
- There should only be a single source of truth for provenance data.
- The provenance information should follow industry standards when available.
- Provenance should balance simplicity, capturing nuance, and unambiguity.
- Information should not be discarded.
- The system should be designed for usability.
- The system should be as simple as possible.
- The system should be technologically feasible.
These design goals were contrasted with the following observations:
- Not all information can be unambiguous.
- Not all information can be structured.
- The current industry standard for provenance records is a paragraph of standardized text.
The duplication of information between the traditional, text-based provenance record and any comprehensive structured data model presents problems, both for usability and for our goal of a single source of truth for provenance.
One option was to ignore the requirement for a single source of truth and have both structured and unstructured representations of provenance data. While technically this is the simplest option, it is easy for the two sources of data to diverge, and there would be no automated method to detect such divergences. This would require the enforcement of consistency to be done via workflows and policy. Constraining data via policy tends to be ineffective without a detection mechanism, and once inconsistencies exist, experience has shown that one of the two systems will become untrusted and unused.
Given that maintaining two versions of the data is non-optimal, the second option is to convert all provenance to structured data. In such a model, the history of a work would be stored as structured data, and the provenance text could be constructed using automated textual generation techniques. This would allow the structured data to be the single source of truth. This is a technically appealing solution, since it allows the use of existing technological tools, such as relational databases, and simple data modeling constraints.
The major issue with treating structured data as the primary source revolves around usability and feasibility. Given that most museums do not have structured provenance data, in order for the structured data to be the single source of truth, all existing records would have to be converted immediately, or a transition period would have to be enforced during which either no new information would be added or all information would be added to both places. Extensive training would have to be provided on how to read and write structured provenance information, since the existing workflows for this data are designed around editing and maintaining a paragraph of text. Another constraint is the flexibility of the existing CMS. Either the CMS has to be extended to handle the new data, or there has to be a new tool added and maintained to store and handle the data. Our feeling is that these issues would present an insurmountable barrier to adoption of the structured-data-only solution to provenance.
Given these issues, the Art Tracks team decided that the primary representation of the structured data should remain the paragraph of provenance text. This has the benefit of working with existing tooling and of allowing a gradual conversion of records without problems. This decision is not without its own challenges, the largest being that free text is rarely unambiguous. Neither the structure of the English language nor the AAM guidelines is sufficiently constrained to guarantee automated parsability of a provenance text.
5. Designing the standard
Our first attempts to extract the semantic content within the provenance text used existing Natural Language Processing (NLP) software, specifically the Stanford CoreNLP Natural Language Processing Toolkit (Manning, 2014). This software allows for entity extraction and the automated tagging of text. However, our experiments showed that such tools are not currently designed to work well with cultural heritage data: they require training on data sets, and most of the training to date has been done using vernacular data sets like the New York Times. The extremely terse syntax of provenance records, as well as the historic and somewhat unusual names that appear throughout provenance, means that the existing training corpora are insufficient to accurately tag and parse provenance records. There is room for additional research in this area, and once a sufficient volume of tagged data exists it should be possible for researchers to create a cultural heritage corpus for use in training and automated extraction. For the present, however, automated NLP solutions were deemed inadequate for the project.
Instead, the Art Tracks team created a strict superset of the AAM provenance standard, designed to resolve ambiguities and provide structure and machine-readability. Paragraphs written in this format conform to the AAM suggested standard and, as such, are adequate for all current tasks using provenance data, but they have the additional benefit of allowing a software tool to automatically extract structured data from them without human intervention.
In order to verify that this new standard was adequate to represent the structured data contained within provenance, we began by constructing a data model. Our reference for modeling was the CIDOC-CRM (http://www.cidoc-crm.org). This linked open data (LOD) semantic representation of cultural heritage data has explicit modeling tools for representing much of the information contained within traditional provenance records. Unfortunately, the model does not always map well to the current divisions of information within the museum. CIDOC-CRM represents periods of ownership, location, and custody as three related but discrete concepts, with events modifying one or more of these variables.
Traditionally, provenance is concerned primarily with ownership: it tracks location as a function of ownership changes, but it does not record changes of location during a single period of ownership. Provenance may also record sales and auctions, and these could be seen as custody changes, not ownership changes, but the distinction is often unclear in the records as they are written, or even in the historical record. A strict modeling of the information contained within provenance using the CIDOC-CRM would involve three interrelated timelines and would require a fundamental change in the museum’s understanding of history. Communicating this new model would be a significant challenge, both as a user interface and as a conceptual model, and instigating such a fundamental change to the museum’s understanding of their data seemed unrealistic for this project.
Instead, we have chosen to maintain the representation of data given in the AAM suggested standard, while maintaining as much compatibility with the CIDOC-CRM model as possible. We have avoided representations of data that would preclude a modeling of the data using the CIDOC-CRM.
6. Defining the standard
This model of provenance is best understood as a single timeline capturing the history of ownership of a single work. This timeline is represented as a linked series of periods, one for each period of ownership of the artwork. Each period comes before the following period in time, and they are listed chronologically, starting with the artist and ending with the current owner. These periods are represented in the current provenance text as sentence clauses, separated by either periods or semicolons.
Each period in our model is a representation of a specific interval of time representing the period of ownership or custody of a work. Due to the inherent uncertainties of historical data, often periods are missing data, or the data contained within the record is uncertain or imprecise. Each period may contain the following information:
- An acquisition method, which is the form or method of transfer between two parties. Examples include purchases, gifts, and inheritances. It is also important to record whether the transfer is known to be a direct transfer of ownership between the current party and the following party.
- A location, which is the physical location of the artwork at the time of acquisition within this period.
- A party, which is the individual, gallery, dealer, company, museum, estate, or other entity that had custody or ownership of the artwork during this period. It is important to also make a distinction between dealers, auctioneers, and private owners, and it is also useful to have birth and death dates of individuals to allow for disambiguation of individuals with similar or identical names.
- The timespan of the period. Periods are made up of two events: the acquisition of the work and the deacquisition of the work. It is also important to realize that each deacquisition may also be the following party’s acquisition.
- Additional unstructured data that clarifies the information or provides a bibliographic reference is captured as footnotes.
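The fields listed above might be sketched as a pair of Python data classes. This is only an illustrative shape, not the Art Tracks data model: every class name, field name, and example value below is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Party:
    """An individual, gallery, dealer, museum, estate, or other entity (hypothetical shape)."""
    name: str
    role: str = "private owner"        # assumed values: "dealer", "auctioneer", ...
    birth_year: Optional[int] = None   # used to disambiguate similar names
    death_year: Optional[int] = None

@dataclass
class Period:
    """One period of ownership or custody; any field may be unknown."""
    party: Party
    acquisition_method: Optional[str] = None  # a term from a controlled vocabulary
    direct_transfer: Optional[bool] = None    # transfer to the following party; None if unknown
    location: Optional[str] = None            # location at the time of acquisition
    acquired: Optional[str] = None            # date text as written, e.g. "1850"
    deacquired: Optional[str] = None
    footnotes: list[str] = field(default_factory=list)

# A hypothetical two-period provenance, listed chronologically.
provenance = [
    Period(Party("John Doe"), location="Boston, MA", acquired="1850", direct_transfer=True),
    Period(Party("Jane Doe"), acquisition_method="purchase", acquired="1880"),
]
```

Making every field except the party optional reflects the observation that periods are often missing data.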
Each of these concepts presents specific modeling challenges.
The simplest concept to model is acquisition method. Within the AAM suggested standard, the only mention of acquisition methods is the need to indicate direct or indirect transfers: whether a work passed directly between two owners or if they are merely known to precede the next owner in history. In our survey of existing records, however, the means of transfer is consistently recorded when known, and as such appears to be important and relevant.
We have chosen to model the methods of acquisition as a controlled vocabulary, which we intend to publish as part of the documentation for this project. We have not yet modeled an ontological hierarchy of these terms, but it appears that such a hierarchy exists and could be modeled with sufficient domain knowledge. Additionally, this hierarchy could be used in constructing the CIDOC-CRM modeling; some of these methods indicate ownership and custody changes, some merely of custody.
We have kept separate the concepts of direct/indirect transfer and acquisition method. This is for ease of modeling: the acquisition method relates to a specific period of ownership, whereas the transfer relates to the relationship between adjacent periods. The separation minimizes difficulties when we know the method by which a work was deacquired by a party, but not the party to whom it was transferred. Currently, our representation of this would be “John Doe, 1850; sold to unknown party, 1880.” We saw a placeholder entity as less conceptually problematic than modeling both acquisition and deacquisition methods for each party. The placeholder prevents duplicated information, since the acquisition by one party is the deacquisition by another. In a theoretical record such as “John Doe, 1850, sold 1880; purchased by Jane Doe, 1880.” the transfer information appears in two places that could be updated independently, creating uncertainty. A record such as “John Doe, 1850, sold 1880; gift to Jane Doe, 1880.” is logically inconsistent, but would be easy to create.
As such, we follow the AAM standard of representing direct transfers with a semicolon, and indirect transfers with a period.
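The punctuation rule can be sketched in a few lines of Python. This is a deliberately naive illustration and not the Art Tracks parser: the `split_periods` function, the regular expression, and the sample text are all assumptions. In particular, an abbreviation followed by a space (such as “Mr. ”) would be split incorrectly, which a real parser would need to handle.

```python
import re
from typing import Optional

# A clause ends at ";" or "." when that mark is followed by whitespace or the end of the text.
CLAUSE = re.compile(r"\s*(.+?)([;.])(?=\s|$)", re.DOTALL)

def split_periods(provenance_text: str) -> list[tuple[str, Optional[bool]]]:
    """Split text into (clause, direct_transfer_to_next_party) pairs.

    ";" marks a direct transfer and "." an indirect one. The final clause
    has no following party, so its flag is None.
    """
    matches = list(CLAUSE.finditer(provenance_text))
    periods = []
    for index, match in enumerate(matches):
        is_last = index == len(matches) - 1
        direct = None if is_last else match.group(2) == ";"
        periods.append((match.group(1).strip(), direct))
    return periods

# Hypothetical provenance text, invented for illustration.
periods = split_periods(
    "John Doe, Boston, MA, 1850; Jane Doe, 1880. Example Museum, Pittsburgh, PA, 1950."
)
```

In this example the first clause is flagged as a direct transfer to Jane Doe, while the transfer from Jane Doe to the museum is flagged as indirect.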
Recording locations presents a series of constraints. One has to do with spatial precision: locations are known to a varying degree of specificity. This ranges from the continent (“John Doe, Europe, 1500;”) to the street (“Jane Doe, 15 Main St., Boston, MA”). Also implicit are the inherent hierarchies of location; 15 Main St. is contained within Boston, which is contained within Massachusetts, which is contained within the United States, which is contained within North America, which is on Earth. Time matters here, too: Saint Petersburg is contained within Russia, but at previous dates was part of the Soviet Union, and has also been known as Leningrad and Petrograd. Often these hierarchies are not explicitly statable; as of 2015, Feodosia is either part of Russia or Ukraine, but specifying which requires choosing between the sovereign claims of two nations, which is out of scope for this project.
Currently, our practice is to specify the city and country in which the party was located at the time of acquisition, as it would have been known to that party. If the building is both named and notable or historic, we record the building (e.g., “Queen Elizabeth II, Buckingham Palace, London, England”). If the state or province is required for disambiguation, we record it (e.g., “Pittsburgh, PA”). If the location has changed names over time, we use the name of the location at the time of the acquisition: Leningrad, not St. Petersburg, if the acquisition took place between 1924 and 1991.
This standardization allows for improved searching, but neither explicitly captures the hierarchical information contained within the location nor easily allows geolocation. Our anticipated next step is to use linked data to unambiguously represent the location using an external authority such as Geonames (http://www.geonames.org/) or the Getty TGN (http://www.getty.edu/research/tools/vocabularies/tgn/), which would allow both location hierarchies and latitude/longitude information to be inferred without having to be explicitly stated. It would also allow for automated replacement of identifying strings for differing locations: while CMOA may leave off “United States” from “Pittsburgh, PA, United States,” considering it implied in the record, for an international audience it might be useful to state it explicitly. Additionally, it would allow locations to be translated into other languages without human intervention, provided that the authority has translations of place names.
Parties are simpler than locations, though only because we have chosen to ignore the complexities of human relationships. The hierarchical representations of people are complex enough that it would be foolish to attempt to model them as part of provenance. We do, however, have to address the problem of disambiguation of names. Often there are multiple people with the same name—this tends to occur within noble families, but can occur anywhere. The most common disambiguators are the years of birth and death, which we model explicitly within provenance. An additional benefit of including birth and death years is that it allows us to put outer bounds on the period of ownership. Since we assume that people cannot own art before birth or after death, this often highlights inconsistencies within our data.
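The lifetime check described above could be sketched as follows. The function name, its year-level granularity, and the message strings are assumptions made for this illustration; it encodes only the stated assumption that people cannot own art before birth or after death.

```python
from typing import Optional

def lifetime_conflicts(
    birth_year: Optional[int],
    death_year: Optional[int],
    acquired_year: Optional[int],
    deacquired_year: Optional[int],
) -> list[str]:
    """List the ways a period of ownership falls outside a person's lifetime.

    Unknown values (None) are skipped rather than treated as conflicts.
    """
    problems = []
    if birth_year is not None and acquired_year is not None and acquired_year < birth_year:
        problems.append("acquired before birth")
    if death_year is not None and acquired_year is not None and acquired_year > death_year:
        problems.append("acquired after death")
    if death_year is not None and deacquired_year is not None and deacquired_year > death_year:
        problems.append("deacquired after death")
    return problems

# A hypothetical John Doe (1800-1870) recorded as owning a work from 1850 to 1880.
conflicts = lifetime_conflicts(1800, 1870, 1850, 1880)
```

A non-empty result does not say which value is wrong; it only flags the record for a researcher to review.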
An additional, untapped source of data is relationship clauses or titles. Often provenance records are written with “Matilda Wormwood, daughter of previous” or “James Henry Trotter, the sitter”; these secondary clauses allow for a deeper understanding of the relationships between people, but are not currently converted into a structured system. Additionally, they may present issues when performing automated disambiguation, even though they can be extremely helpful in determining which specific person a name is referring to. We are using a white list of suffixes and secondary clauses to determine when a particular clause is part of a name. This has been reasonably effective, but additional work should be done to determine if this will present barriers to entry in the future.
Additionally, there are complications that extend from the concept of joint ownership in marriage. For works owned before the 1950s, the relationship between marriage and ownership is reasonably straightforward: unless explicitly owned by a woman, the man was assumed to own everything. This is represented in provenance texts reasonably well; occasionally, one will see a record mentioning ownership via descent through a daughter, but most records exclude women entirely. This changes as laws involving co-ownership of property begin to appear—more regularly, you see records referring to “Mr. and Mrs.” This presents a semantic difficulty for us: we are used to thinking of the owner as being either a person or a legal entity. It is least problematic to consider marriages as a legal entity, formed at the moment of marriage and dissolved via divorce, annulment, or death. However, this means that “Mr. Marshall Field III” and “Mr. & Mrs. Marshall Field III” have no obvious connection, and a search for one might not bring up the other. Within our system, this remains an unsolved problem. Our hope is that there will be a way to leverage the work of others to bring clarity to this problem.
There has been significant internal debate at CMOA over the importance of signifying dealers and auctioneers as a different class of entity than private owners. Our understanding of this requirement is based on tradition, and at this point we are unsure what semantic importance it conveys—or even if it is possible to effectively flag these situations. Often a dealer or auctioneer “buys in” a work of art, at which point they are both a dealer and a private owner. The flag is then not based on their intrinsic characteristics, but their role in the specific acquisition. We have included this flag in an attempt to maintain a parity with the AAM standard, but would welcome a discussion on the semantic meaning encoded within.
Much like with location, the inclusion of LOD will allow for a much richer experience of provenance. Being able to use external sources to maintain the hierarchical relationships between people, as well as their dates of birth and death, will allow for a more nuanced research environment. This remains a high priority for future research.
Periods of ownership
The period of ownership is the most conceptually dense concept within provenance. It involves dates and their relationship to the period to which we are referring. The other entities within a provenance record are either discrete or hierarchical, but dates are implicitly calculable. As such, we need to represent them both as mathematical constructs and as human-readable information. Our understanding of current best practice for cultural heritage involves modeling periods using the CIDOC-CRM, so we have used that to allow interoperability with other projects.
We assume that the work is always owned, even if we don’t know by whom. We also assume that it is owned by a single party; if it is co-owned, the entity is the co-owners. This occurs regularly when considering marriage, for instance. We also assume that the transfers of ownership are discrete: when one party ceases ownership, another assumes it.
Additionally, when representing our knowledge of a date, it is useful to capture both the precision to which we know a date and our certainty about that date. Sometimes we know that a transfer happened on January 15, 1995. Sometimes we know that a transfer happened in the sixteenth century. Sometimes we merely believe that a transfer happened in 1995, but are unwilling to specify it with certainty.
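One way to hold precision and certainty side by side is sketched below in Python. The `DateClaim` class and its fields are assumptions made for this illustration, not the representation used by Art Tracks. Precision is turned into a pair of calendar bounds, while certainty is kept as a separate flag that does not change those bounds.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class DateClaim:
    """A date as stated in a record (hypothetical structure)."""
    year: int                    # for century precision, any year inside that century
    month: Optional[int] = None
    day: Optional[int] = None
    century: bool = False        # True when only the century is known
    certain: bool = True         # False for "we believe, but are not sure"

    def bounds(self) -> tuple[date, date]:
        """Return the earliest and latest calendar days consistent with the claim."""
        if self.century:
            first_year = (self.year - 1) // 100 * 100 + 1   # e.g. 1550 -> 1501
            return date(first_year, 1, 1), date(first_year + 99, 12, 31)
        if self.month is None:
            return date(self.year, 1, 1), date(self.year, 12, 31)
        if self.day is None:
            last_day = calendar.monthrange(self.year, self.month)[1]
            return date(self.year, self.month, 1), date(self.year, self.month, last_day)
        exact = date(self.year, self.month, self.day)
        return exact, exact

exact_day = DateClaim(1995, 1, 15).bounds()
sixteenth_century = DateClaim(1550, century=True).bounds()
believed_1995 = DateClaim(1995, certain=False)
```

The three values correspond to the three cases in the text: a transfer known to the day, a transfer known only to the sixteenth century (1501 through 1600), and a transfer believed, but not known, to have happened in 1995.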
Further, there are often gaps in our knowledge of dates. We may know that a work was purchased after 1995, or before January 2000, or just that a party had it in 1880. Representing this information unambiguously is the most important function of provenance.
There are two conceptual ways to represent the dates relevant to a period. One is as a series of ownerships, the other as a series of transfers. For example, “George, 1950; Mary, 1965.” describes two periods of ownership: George’s ownership between 1950 and 1965, and Mary’s ownership from 1965 onward. It also encodes two transfers: George’s acquisition from an unknown party (likely the artist) in 1950, and Mary’s acquisition from George in 1965.
Both ways of thinking about the period are correct and contain the same information. It is also important to realize that in both of these models there is still uncertainty. When we state that George acquired the painting in 1950, what that means is that we know that he acquired the painting between January 1, 1950, and December 31, 1950. Similarly, from Mary’s record, we know that George deacquired the painting between January 1, 1965, and December 31, 1965. Phrased as periods of ownership, we know that George definitely owned the work between December 31, 1950, and January 1, 1965, and that he possibly owned the work between January 1, 1950, and December 31, 1965.
This distinction is important when the uncertainty in a period comes from date imprecision; it is even more important when the uncertainty comes from gaps in knowledge.
Consider “George, by 1960 until sometime after 1962; Mary, in 1970.” This record could refer to the same events as the previous example, but it captures the fuzziness in our knowledge accurately. We know that George acquired the painting no later than 1960, we know that he deacquired the painting between 1962 and 1970, and we don’t have an earliest date he could have owned it; worst case, it is the earliest possible date of creation of the work. This also means that George definitely owned the work between December 31, 1960, and January 1, 1962, and that he possibly owned the work between its creation and December 31, 1970.
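The relationship between the two models can be illustrated with a minimal Python sketch. It is not the data model of the parser described in this paper; the names `Bounds`, `Ownership`, `year`, `by_year`, and `after_year` are hypothetical, and `None` stands for an unknown bound (in the worst case, the creation date of the work). Each transfer is stored as an earliest and a latest possible day, and the “definitely/possibly” spans are derived from those four dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class Bounds:
    """Earliest and latest possible day of a transfer; None means unknown."""
    earliest: Optional[date]
    latest: Optional[date]


def year(y: int) -> Bounds:
    """'1950': the transfer happened at some point during that year."""
    return Bounds(date(y, 1, 1), date(y, 12, 31))


def by_year(y: int) -> Bounds:
    """'by 1960': the transfer happened no later than the end of that year."""
    return Bounds(None, date(y, 12, 31))


def after_year(y: int) -> Bounds:
    """'sometime after 1962': no earlier than the start of that year."""
    return Bounds(date(y, 1, 1), None)


@dataclass
class Ownership:
    acquisition: Bounds
    deacquisition: Bounds

    def definite(self) -> Tuple[Optional[date], Optional[date]]:
        """Span of certain ownership: latest acquisition to earliest deacquisition."""
        return (self.acquisition.latest, self.deacquisition.earliest)

    def possible(self) -> Tuple[Optional[date], Optional[date]]:
        """Span of possible ownership: earliest acquisition to latest deacquisition."""
        return (self.acquisition.earliest, self.deacquisition.latest)


# "George, 1950; Mary 1965."
george = Ownership(year(1950), year(1965))
# definite: December 31, 1950 to January 1, 1965
# possible: January 1, 1950 to December 31, 1965

# "George, by 1960 until sometime after 1962; Mary, in 1970."
# The latest deacquisition is taken from Mary's record.
george_gap = Ownership(
    by_year(1960),
    Bounds(after_year(1962).earliest, year(1970).latest),
)
# definite: December 31, 1960 to January 1, 1962
# possible: unknown start (None) to December 31, 1970
```

Storing the four dates and deriving the spans on demand keeps the acquisition/deacquisition reading and the definitely/possibly reading in agreement, since both are views of the same values.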
It is also possible to use the additional data encoded in a series of periods to infer data that is not explicit within a particular period. For example, given “George, 1960; Thomas; Martha, 1980.”, we can infer that Thomas owned the work sometime between 1960 and 1980, that George possibly owned the work until 1980, and that Martha could not have owned it before 1960. This implicit information is extremely useful in date-range search.
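One way to sketch this inference, assuming periods are listed in chronological order and transfers are discrete, is a forward pass that carries the earliest possible acquisition date down the list and a backward pass that carries the latest possible date up it. The function name `infer_acquisitions` and the tuple layout are hypothetical and are not taken from the parser described in this paper.

```python
from datetime import date
from typing import List, Optional, Tuple

# (earliest, latest) possible acquisition date; None means not stated.
Span = Tuple[Optional[date], Optional[date]]


def infer_acquisitions(periods: List[Tuple[str, Span]]) -> List[Tuple[str, Span]]:
    """Fill unknown acquisition bounds from the neighbouring periods."""
    earliest = [span[0] for _, span in periods]
    latest = [span[1] for _, span in periods]

    # Forward pass: nobody can acquire the work before the previous
    # owner could possibly have acquired it.
    for i in range(1, len(periods)):
        prev = earliest[i - 1]
        if prev is not None and (earliest[i] is None or earliest[i] < prev):
            earliest[i] = prev

    # Backward pass: nobody can acquire the work after the next
    # owner must already have acquired it.
    for i in range(len(periods) - 2, -1, -1):
        nxt = latest[i + 1]
        if nxt is not None and (latest[i] is None or latest[i] > nxt):
            latest[i] = nxt

    return [(name, (e, l)) for (name, _), e, l in zip(periods, earliest, latest)]


# "George, 1960; Thomas; Martha, 1980."
record = [
    ("George", (date(1960, 1, 1), date(1960, 12, 31))),
    ("Thomas", (None, None)),
    ("Martha", (date(1980, 1, 1), date(1980, 12, 31))),
]
inferred = infer_acquisitions(record)
# Thomas's acquisition is now bounded by January 1, 1960 and December 31, 1980.
```

Because George's deacquisition is the same event as Thomas's acquisition, the inferred latest date for Thomas is also the latest date on which George could still have owned the work.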
Knowing that these two models represent the same information, but in two different forms, is essential, because the CIDOC-CRM represents time spans of ownership using the “definitely/possibly” conceptual model, but the AAM suggested standard uses the “acquisition/deacquisition” model. Treating them as identical allows for automated conversions between these models.
The level of complexity involved in accurately describing a specific time period has the potential to make data entry onerous. There are four dates involved, each with a level of specificity and certainty, and the conceptual model for understanding their relationships is not intuitive. As such, we have put significant effort into building a software tool to allow for automatic extraction of this model from human language constructs.
It’s important to make a distinction between quantifiable structured data and relevant data. General information that should be either faceted or calculable needs to be represented as a specific field within our model, but information should not be discarded just because it is not general. Often some of the most interesting stories are not contained within the quantifiable data. That the work was commissioned in response to a specific request, that the first owner was also the subject of the work, or that the work had been misattributed in a specific sale would be essential to the historian, but this information is not general enough to be structured. Rather than ignore this information, we use footnotes to store information that is outside the scope of the model. This retains the information, links it to specific periods, and allows it to be found using a full-text search strategy.
Another important component of provenance is the certainty of the information. Hedging words like “possibly,” “probably,” or “likely,” or even a question mark, indicate that the researcher does not fully trust the information. There is a gradient of uncertainty communicated via these terms; unfortunately, there does not appear to be any cross-institution consistency in what is communicated through this gradient. As such, we treat all indications of uncertainty as being equivalent; they function as a binary flag, not a gradient. That being said, we allow uncertainty to have significant granularity. An entire ownership period may be uncertain, as may any specific date, party, or location.
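The binary treatment of uncertainty can be sketched as follows. This is an illustration only, not the logic of the parser described in this paper: the word list contains just the three hedges named above, and the function name `split_certainty` is hypothetical.

```python
import re
from typing import Tuple

# Hedging terms named in the text; a real vocabulary would likely be longer.
HEDGES = ("possibly", "probably", "likely")
_HEDGE_RE = re.compile(r"\b(?:%s)\b" % "|".join(HEDGES), re.IGNORECASE)


def split_certainty(text: str) -> Tuple[str, bool]:
    """Return the text without hedging markers, plus a binary certainty flag."""
    hedged = bool(_HEDGE_RE.search(text)) or "?" in text
    cleaned = _HEDGE_RE.sub("", text).replace("?", "")
    cleaned = " ".join(cleaned.split())  # collapse leftover whitespace
    return cleaned, not hedged


# Certainty is tracked per field, so a single period can mix
# certain and uncertain values.
period = {
    "party": split_certainty("possibly George"),   # ("George", False)
    "location": split_certainty("Paris?"),         # ("Paris", False)
    "date": split_certainty("1995"),               # ("1995", True)
}
```

Every hedge collapses to the same `False` flag, which reflects the decision not to interpret the gradient between terms such as “possibly” and “probably.”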
We have developed an initial reference implementation of a parser for this provenance standard, designed to take unstructured provenance text and convert it into structured data. This open-source software toolkit, published on the CMOA GitHub account (https://github.com/cmoa/museum_provenance), comes with an extensive test suite, designed to prevent design regressions from the standard. Over the past six months, we have used not only our own data, but data from outside institutions that follow the AAM standard to verify that the tool is capable of parsing a wide variety of provenance records and to automatically convert them.
The software is capable of converting a paragraph of semi-structured text, written using this strict standard, into a JSON structure containing the same information. Additionally, it can take the produced JSON structure and convert it back into semi-structured text. Being able to convert both ways allows us to compare the text that we provided with the text that the tool generates; if the two texts are identical, we assume that the text parses accurately. Note that this does not mean that the structured data is guaranteed to be semantically correct—only that there has been no information lost in the conversion to and from structured data.
There are two benefits to this: first, it provides a list of records that need to be manually reviewed, due to inconsistencies in parsing; and second, it means that reversions or changes that might break the structured parsing can be automatically flagged. This helps prevent some of the brittleness inherent in the parsing.
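The round-trip check can be sketched generically. In the example below, `needs_review` accepts any pair of parse and generate functions; `toy_parse` and `toy_generate` are deliberately simplistic stand-ins written for this illustration and do not reflect the API or the grammar of the museum_provenance library.

```python
from typing import Callable, Dict, List


def needs_review(records: Dict[str, str],
                 parse: Callable[[str], object],
                 generate: Callable[[object], str]) -> List[str]:
    """Return the ids of records whose regenerated text differs from the input."""
    return [rid for rid, text in records.items()
            if generate(parse(text)) != text]


def toy_parse(text: str) -> List[dict]:
    """Toy parser: split 'Party, date; Party, date.' into a list of periods."""
    periods = []
    for chunk in text.rstrip(".").split(";"):
        party, _, when = chunk.strip().partition(",")
        periods.append({"party": party.strip(), "date": when.strip() or None})
    return periods


def toy_generate(periods: List[dict]) -> str:
    """Toy generator: write the periods back out in one canonical form."""
    parts = [p["party"] + (", " + p["date"] if p["date"] else "")
             for p in periods]
    return "; ".join(parts) + "."


records = {
    "a": "George, 1950; Mary, 1965.",  # already in canonical form
    "b": "George,1950; Mary, 1965.",   # missing space after the comma
}
flagged = needs_review(records, toy_parse, toy_generate)
# flagged contains only "b": its regenerated text differs from the input
```

Run as part of a batch job, a check of this kind produces the list of records for manual review and can flag later edits that stop a record from round-tripping.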
Future enhancements of this tool involve entity recognition and linking of the data to authority files. Currently, we have no way to know if what the tool has parsed as a party or location has the correct semantic connotation or if it has misinterpreted the data. Additionally, there is no way to disambiguate parties or locations that appear in multiple records, or to connect them to each other or to external authorities. As part of the future phases of Art Tracks, we hope to improve this tool to take into account both our authority records and those of other institutions to perform this disambiguation, which will improve our ability to recognize errors and increase the value of the structured data created.
This tool has been designed to not be dependent on any institution-specific requirements; any institution that can generate a provenance text can create structured data with it. It is not designed for data persistence: the intention is to use it against the CMS record and produce either a real-time or temporarily cached representation of the data. Running the tool against the approximately thirty thousand records in the CMOA collection database takes approximately five minutes on a 1.7 GHz Intel Core i5, which is sufficiently fast to allow regular batch processing. Additional performance could almost certainly be obtained—no work has been put into optimizing the tool for speed at this time.
As an automated tool, this is useful within the context of a development process or research project. However, if there is a problem with parsing a record, there is no easy way to correct the problem. To address this, we have developed an experimental user interface (UI) for editing and viewing structured provenance data. This tool, built on top of the parser, is designed to allow a non-technical researcher or museum staff to evaluate the results of the automated conversion process, make changes, and update the record.
Additionally, the UI has been designed to allow for visualizations of the provenance information as a timeline. This has proven essential for verification of data: inconsistencies or contradictions in the data that are difficult to understand without close reading are extremely obvious when presented as a visualization.
Why are we doing this project? The origins and history of Carnegie Museum of Art’s varied collections are key to its unique identity as an art museum and a source of pride to its regional audience and stakeholders. Art Tracks will allow museum staff, visitors, and website users to see this history unfolding as it traces the movement of collection objects through time and space, with the museum as the final destination. Museum curators and educators are excited about the potential to communicate complicated stories with precision and ease to a wide public. Thus, every museum involved in this project will benefit at a local level. When provenance information from many museums is aggregated, the ability to answer much broader questions may transform our understanding of art and cultural history.
Even more so, creating a standard for provenance that reflects the best practices of the museum world as well as the new practices of the digital humanities will open up opportunities both within and beyond the museum. Standardized, structured data will also allow easy interchange between institutions, and as linked open data becomes less theoretical and more practical, this project positions provenance as another rich, linked data source.
Since museums and the IMLS are deeply concerned about the public impact of such projects, our hope is to involve user testing and evaluation early and often in our development process. Additional input and use of these tools by museum professionals will also help validate the standard and increase its potential usefulness.
This project is funded by an IMLS “Museums for America” grant to support learning experiences in museums. This project would not have been possible without the entire Art Tracks team at the Carnegie Museum of Art. Specifically, we would like to thank Lulu Lippincott and Costas Karakatsanis for their countless hours of research and invaluable advice. We would also like to thank Jeff Inscho, without whom this project would never have begun. Finally, we thank our colleagues at the Yale Center for British Art and The National Gallery for providing us with advice, assistance, and data from their collections.
American Alliance of Museums (AAM). (1991). Code of Ethics for Museums. Adopted 1991, amended 2000. Consulted January 15, 2015. Available http://aam-us.org/resources/ethics-standards-and-best-practices/code-of-ethics
American Alliance of Museums (AAM). (n.d.). “Collections Stewardship.” Consulted January 15, 2015. Available http://aam-us.org/resources/ethics-standards-and-best-practices/collections-stewardship
American Alliance of Museums (AAM). (n.d.). “Standards Regarding the Unlawful Appropriation of Objects During the Nazi Era.” Consulted January 15, 2015. Available http://aam-us.org/resources/ethics-standards-and-best-practices/collections-stewardship/objects-during-the-nazi-era
Association of Art Museum Directors. (1998). Report of the AAMD Task Force on the Spoliation of Art during the Nazi/World War II Era (1933–1945). Consulted February 16, 2015. Available http://obs-traffic.museum/sites/default/files/ressources/files/AAMD_report_spoliation.pdf
Association of Art Museum Directors. (2007). Art Museums and the Identification and Restitution of Works Stolen by the Nazis. Consulted February 16, 2015. Available https://aamd.org/sites/default/files/document/Nazi-looted%20art_clean_06_2007.pdf
Carnegie Museum of Art. (2015). “cmoa/museum_provenance.” Consulted January 15, 2015. Available https://github.com/cmoa/museum_provenance
CIDOC Documentation Standards Working Group. (2014). “The CIDOC Conceptual Reference Model.” Last updated May 12, 2014. Consulted January 15, 2015. Available http://www.cidoc-crm.org
Manning, Christopher D., Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, & David McClosky. (2014). “The Stanford CoreNLP Natural Language Processing Toolkit.” In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 55–60.
Yeide, Nancy H. (2001). The AAM Guide to Provenance Research. Washington DC: American Association of Museums.
. "Art Tracks: Visualizing the stories and lifespan of an artwork." MW2015: Museums and the Web 2015. Published January 15, 2015. Consulted .