A new DOR opens: How the J. Paul Getty Museum is reimagining digital collection information management
Daniel Sissman, The J. Paul Getty Museum, USA
Abstract
The J. Paul Getty Museum recently undertook a project to research, design, and develop an entirely new open-source collection information infrastructure for managing, authoring, and delivering collection information and related data to its downstream applications. This paper details the evolution of the project from development to launch, some challenges we faced, and significant benefits we gained. Beyond sharing our experiences with the wider cultural heritage community, we would also like to explore opportunities to share the open-source tools we have built, in the hopes that they will be useful to other institutions for their own collection information infrastructure projects.

Keywords: Infrastructure Upgrades, Collection Information Management, Digital Object Repository (DOR), Application Programming Interface (API), Authoring/Workflow Tools, Open Source
1. Introduction
Over the last twenty years, the J. Paul Getty Museum, like many other large cultural heritage institutions, has invested tremendous resources into developing ambitious and innovative digital collection information projects: websites, interactive exhibition presentations, multimedia guides, and applications. These cutting-edge presentations broadened access to collection information and educational resources in ways that had not been seen before.
Prior to this project, most of our collection information projects were distinct and independent entities within our larger systems infrastructure. Most maintained their own data silos, used their own methods of obtaining information from other systems, and were efficient and effective at supporting the needs for which they had been designed. Eventually, however, we realized that we needed a system that could support our growing digital ambitions and anticipate many future needs as well.
Over time, some of our legacy projects began to succumb to issues of scalability, maintenance, and reusability. Faced with aging content management applications, redundant (and occasionally contradictory) presentations of collection information, and a pending upgrade of our collection management system—upon which many of our legacy systems were highly dependent—we found ourselves at a crossroads. If we were to prolong the status quo, we would continue to experience the same challenges with our infrastructure, while only increasing the scale and complexity of our maintenance responsibilities.
These circumstances compelled us to reimagine our information infrastructure, our longer-term digital strategy, and how we might develop new methods of providing and managing data for all our downstream applications, regardless of the platform—websites, mobile, publications, desktop, or kiosks—and how we could provide greater flexibility and simplify maintenance for the future. Thus we embarked upon the creation of an entirely new architecture and platform, the Digital Object Repository (DOR). Designed to act as an abstraction and service layer that sits between our numerous collection information databases and our applications, the DOR has enabled us to centralize efforts, reduce the overall complexity and redundancy in our systems, and simplify ongoing maintenance.
The new platform is already providing many advantages over our legacy systems. It has eased and shortened development efforts for new projects, provides transformative functionality, and is enabling the creation of a suite of collaborative authoring workflow tools. These tools are starting to offer staff the functionality needed to create, manage, and control content for all our distribution channels, greatly reducing the amount of manual (and often repetitive) data entry compared with our legacy tools and workflows.
Ultimately, we believe that broader adoption of platforms like the DOR will enable institutions to more easily author, utilize, and share their collection information and interpretive content in a dynamic and integrated fashion.
We are excited for the opportunity to share our work with the wider museum community—including coverage of the challenges we faced, the benefits we gained, and the process undertaken as we developed, built, and refined the DOR—as well as to participate in a dialogue exploring the future potential of DOR-driven/compatible environments.
2. Collective history
It became apparent several years ago that the J. Paul Getty Museum’s collection information infrastructure, which had been in place for almost a decade and had served the institution and its visitors so well, was starting to become difficult to maintain or modify for new uses.
Our legacy systems and processes had largely been designed to address specific issues and support specific needs, and they had been built over many years by different teams of staff from different departments or by external contractors. These systems and processes performed many functions, including extracting data from our collections information management system (TMS) for presentation in our curated GettyGuide online collection pages, mobile multimedia tour app, in-gallery kiosks, and for publication use; providing data for our multi-collection Getty Search Gateway, our overall website search, and our collection management system’s website interface, TMS Go; as well as supporting our Imaging Services department’s photography workflow. In each case, specific requirements for collections data, interpretive content, or media had been identified, and processes had been developed and instituted, but the solutions were largely inflexible, so it was difficult to add new features or support new presentations. The most challenging issues related to maintaining our workflows and the increasingly complex dependencies between our systems, processes, databases, media repositories, search engines, and presentation layers (figure 1). These mounting concerns made the redevelopment of our information architecture an ever more compelling, timely, and appropriate endeavor.
Furthermore, a compounding factor was looming on the horizon: the need to upgrade our collections management database, TMS, which would bring with it one of the most significant database schema changes the application had seen in years. As such we were facing a significant undertaking either way: we could choose to maintain our legacy infrastructure, for which we would need to go through the challenging process of updating all the dependencies each legacy process had on TMS, but be little better off for the effort; or we could take the opportunity to redesign and rebuild our infrastructure to better support our current needs and offer the flexibility we anticipated needing in the future.
3. Legacy infrastructures: Problems and opportunities
A complex web of dependencies among systems characterized our legacy infrastructure. Even seemingly minor changes to components within our information architecture could affect everything downstream. Performing upgrades or applying patches was becoming a difficult task, requiring planning and testing beyond the norm. Making major changes to our collection information applications was an even more complicated assignment, and sometimes it became a completely impractical one.
Performing necessary upgrades had to be postponed in some cases until all affected applications could be updated; consequently we were stuck with the software we had in place and the challenges of keeping it operational until we were able to update our entire infrastructure. Staff had to manage with aging or outdated systems, in some cases for a number of years. This would affect curatorial, registrarial, editorial, and other content generation and authoring tasks, which could become more time consuming to complete with the available tools.
To give some sense of the challenges we were dealing with, figure 1 offers a simplified, high-level overview of our legacy collections information systems architecture. It highlights many (but for the sake of legibility, not all) dependencies among the systems, processes, and presentation layers built over the last ten years. As can be seen from the illustration, even in its simplified form, our infrastructure had grown to become a complicated web of interdependencies, where systems and processes were deeply integrated with each other, and where change was becoming increasingly difficult to manage or even consider.
Figure 1: a simplified overview of our legacy collection information systems architecture
Many of the specific issues that emerged from our legacy infrastructure are outlined below. These include issues affecting ongoing management and maintenance; staff knowledge, training, and the availability of skills; and, to some extent, information security:
3.1 Management and maintenance concerns
- Our authoring and publishing workflows were becoming increasingly labor and time intensive.
- Each use of collections data required direct access to TMS, leading to a proliferation of views, tables, and stored procedures being added to its database.
- Myriad dependencies on TMS meant that updates or changes to TMS could affect many downstream processes, potentially breaking them entirely.
- Almost all our legacy applications had their own ad-hoc data models, data replication/synchronization logic, and implementations of business logic, creating further layers of architectural complexity and functional duplication.
- A largely redundant replication of collections data was present across our applications, providing more opportunities for data replication errors and formatting issues.
- The lack of a consistent or apparent method for determining the last modification date of most TMS records meant that all extracted data had to be regularly refreshed in each application database, imposing additional resource demands upon TMS.
- Most downstream applications had their own unique media requirements. Our kiosks required images in the FlashPix format, for instance. We were therefore generating and maintaining multiple sets of media. These unique needs contributed further complexity to our workflows, sometimes prompting data-storage and data-consistency concerns.
- Our legacy tools had not evolved or been maintained with surrounding technology, making them susceptible to failure when operating systems and other dependencies were patched or upgraded. In some cases, we actually lost the use of workflow tools due to required infrastructure upgrades. The only interim option was to maintain workflows through the use of difficult and time-consuming manual updates.
- Although documentation of varying quality and completeness was available for some of our legacy systems, it often lacked sufficient detail about data dependencies, and we were often left wondering why certain architectural decisions had been made. As such, this information had to be gleaned by carefully studying source code and database schemas of the legacy applications.
- There had been an inconsistent use of version control for code and database schemas, making it difficult to determine what had changed over time or why.
- There were many cases of doing practically the same thing in numerous different ways, creating further opportunities for data duplication or inconsistency.
- Our systems could not easily be adapted to support growing demands on our data.
3.2 Knowledge and skills concerns
- Our legacy infrastructure required too many developers with highly specialized knowledge to support and maintain it. When key staff departed the institution, finding replacements was complicated and delayed by the need to find individuals with the required niche skills and knowledge.
- The lack of comprehensive documentation for what had already been built often led to new projects simply starting from scratch, with developers sometimes duplicating existing functionality without realizing, creating additional complexity and maintenance issues.
3.3 Information security concerns
- Most applications needed privileged database access to business-critical systems such as TMS.
- Each application’s need for direct access to TMS meant that too many processes, and potentially too many staff, could have unwittingly had access to restricted or confidential information stored within TMS.
- Many staff required access to other internal systems to access or manage data and media assets, presenting complex security-management issues. Had any of these privileges been misused, whether accidentally or intentionally, systems or operational issues could have ensued.
The above areas were ripe for improvement. These issues, combined with the increasing challenges we were experiencing with our legacy systems and the pending upgrade of TMS, led us to the considered decision that our current and future needs would best be served by redeveloping our collection information architecture, replacing it with a more compartmentalized, scalable, and flexible platform.
4. Reimagining our infrastructure, reimagining our digital future
As we reflected upon the issues we had faced with our legacy infrastructure, we considered our current needs and our anticipated future needs and began the process of planning how we would build a new infrastructure to support our applications, while building upon the lessons of the past.
Before embarking on our own development effort, we researched alternate options, including the use of various CMS and DAM systems; we also looked into the feasibility of pushing data into various “schemaless” data stores that offered a number of the features required for our new architecture. While we considered a few possibilities, our research and testing indicated that the available options were unable to fully support our needs or offer the extensibility we desired without extensive modification.
As our research continued, we encountered digital object repositories as a mechanism for storing and providing controlled access to data. These repositories were influenced by the work of Kahn and Wilensky (1995), who, in a publication for the Corporation for National Research Initiatives (CNRI), proposed a framework for a distributed digital object service that influenced the development of one of the first DORs. Further work on these systems at CNRI helped inspire the development of other digital object repositories, such as Fedora, and helped us envisage the use of a DOR for our collection information needs.
The concept that differentiates a digital object repository from other data stores, such as a relational database management system (RDBMS), is that a DOR stores digital objects rather than data in other formats (such as the tabular data stored in an RDBMS), and that a DOR makes these objects accessible primarily via universally unique identifiers, which can be specified at retrieval time. In terms of the data itself, a digital object is a container that maintains a set of attributes and associated values. This representation is distinct from binary data, such as image files or text documents, or from simple values such as text strings, integers, or dates, although a digital object can hold attributes that store or reference any of these data types. Digital object repositories are also especially good at representing objects at differing levels of abstraction and at representing varied types of relationships (such as parent, child, or sibling relationships) between objects within the repository. These are particularly appropriate features for building a repository to store collection information, with its many different types of data entity and the complex relationships that can exist between them.
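To make the distinction concrete, the sketch below models a digital object as a simple container of attributes and typed relationships, addressable by a universally unique identifier. This is an illustrative Python sketch with hypothetical names, not the DOR’s actual data model:

```python
import uuid

class DigitalObject:
    """A digital object: a container of attributes and associated values,
    retrievable by a universally unique identifier."""

    def __init__(self, entity_type, attributes=None):
        self.id = str(uuid.uuid4())      # the identifier used at retrieval time
        self.entity_type = entity_type   # e.g. "object", "maker", "exhibition"
        self.attributes = dict(attributes or {})
        self.relationships = []          # typed links to other digital objects

    def relate(self, relation, other):
        """Record a typed relationship (e.g. parent, child, sibling, maker)."""
        self.relationships.append((relation, other.id))

# An artwork record linked to its maker:
maker = DigitalObject("maker", {"name": "Vincent van Gogh"})
artwork = DigitalObject("object", {"title": "Irises", "date": "1889"})
artwork.relate("maker", maker)
```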
5. Modernizing a legacy: New technologies, new techniques, new features
After considering our options, we decided upon building a DOR as the central repository for our current and future collection information needs. We wanted to move away from an architecture model where systems had so many direct dependencies upon one another, and toward a model where the dependencies that needed to remain were lightweight and the inner workings of systems became invisible to each other. These desires led to the decision to build our new DOR upon a Web application framework that would allow us to expose a lightweight interface to any downstream application via a standards-based Hypertext Transfer Protocol Representational State Transfer Application Programming Interface, or more succinctly, an HTTP REST API. Such REST APIs are extensively supported in all modern programming languages and can be used by any system that can “speak” HTTP and has a connection to the network. Furthermore, we decided that the REST API would be the only method that other systems could use to interact with the DOR. This meant that we could define and expose a protocol for the DOR’s REST API that other applications could utilize, knowing that if an application specified an API request in a particular way, it was guaranteed to receive a response that complied with the protocol specification. Beyond this, applications wouldn’t need to know how the DOR operated, only that the DOR could understand their API requests and would reply with the appropriate response in the desired data format.
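In practice, this means a downstream application interacts with the repository using nothing more than an HTTP client. The following Python sketch shows what such an exchange might look like; the endpoint paths, parameters, and response fields here are hypothetical illustrations, not the DOR’s actual protocol:

```python
import requests

# Hypothetical base URL for a DOR-style REST API (versioned, authenticated).
BASE = "https://dor.example.org/api/v1"

response = requests.get(
    f"{BASE}/objects/826",                        # fetch one object by identifier
    params={"format": "json"},                    # request a serialization format
    headers={"Authorization": "Bearer <token>"},  # the API is authenticated
)
response.raise_for_status()  # any compliant client can rely on the protocol
record = response.json()
print(record["title"])
```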
Building upon this foundation, our new centralized DOR platform offers a range of advantages over our decentralized legacy infrastructure. The DOR has been designed to serve several distinct but complementary roles within our new information architecture:
- Acts as the singular aggregator of factual collection information and will become the singular source of interpretive content.
- Offers a highly flexible but structured data model that can represent current collection data entities, such as objects, makers, exhibitions, locations, media, and vocabularies, as well as providing the flexibility to support the addition of any other entity types, attributes and model relationships over time.
- Supports complex queries of the underlying data as well as offering faceted and free-text search across all content, and provides real-time delivery of consistent search results to all applications.
- Provides standardized access to structured and consistently formatted collection information and interpretive content.
- Enables a collaborative authoring platform for a variety of interpretive content types, which will grow to support authoring content for exhibitions, tours, essays, wall labels, lesson plans, and related media, and will provide unified control over the publication of this data to our presentation platforms.
- Will offer tools to easily author (and where possible, automatically establish) relationships between data entity types, including:
- Linking events to exhibitions, makers, and related objects.
- Linking entities to controlled vocabulary terms to improve data and support more comprehensive search, and to better enable the ongoing contribution of the museum’s data to controlled vocabularies, such as the Getty Vocabularies (AAT, CONA, TGN, and ULAN).
- Linking objects to educational materials, related media and content, bookshop items, and more.
- Supports effective control of access to collection data and the enforcement of business rules and legal considerations, such as image and reproduction rights. This ensures that our applications and users of collection data only have access to the information they need, and that rights issues are accounted for before any data is exposed via the DOR’s APIs.
- Provides access to data through a standardized API in a number of common serialization formats (JSON, XML, YAML) and will soon support other data formats such as LOD (Linked Open Data), as well as several domain-specific schemas and protocols including LIDO (Lightweight Information Describing Objects) and IIIF (International Image Interoperability Framework).
- Will later support an open public API for external access to collection data.
Once we had selected a digital object repository as the key concept and mechanism with which we would store and expose our data, we decided upon an iterative, phased approach for the project’s development. This consisted of three initial areas of work: the first phase would be dedicated to research and planning, software architecture development, rebuilding our existing data-delivery systems, and presentation/search interfaces; the second phase would be focused upon the development of a suite of collaborative authoring tools, as well as the refinement and documentation of the work conducted during the first phase; and the third phase would focus on continuing to improve the quality and depth of our collection information, as well as exploring avenues and uses for collection data and our new platform, both across the Getty’s programs and in concert with the wider cultural-heritage community.
As we redesigned our information architecture (figure 2), we selected several well-supported open-source frameworks and tools (detailed below) as the basis of our DOR; a complementary Multimedia Derivative Manager (MDM) platform, responsible for generating and managing our new unified set of media derivatives; and over time, a comprehensive suite of authoring, workflow, and publishing tools that could be more easily maintained and extended for current and future needs.
Figure 2: overview of our redesigned collection information systems architecture
Compared with our legacy architecture (figure 1), our new platform, as illustrated above, is much simpler in terms of structure and complexity. There are far fewer interdependencies between systems, and it offers much improved and new functionality, ease of maintenance, and greater flexibility.
6. Desired outcomes, development obstacles, and delivered results
The desired outcome from the first phase of the project was the redevelopment of the underlying software architecture that managed the flow of our collection information, interpretive content, and digital media. Our collection information system (TMS) and digital asset management system (OpenText DAM) were to remain unchanged, as they are essential components of our infrastructure. The only changes we anticipated to TMS and our DAM were the planned upgrades to their respective latest versions. The overall development process comprised building our new architecture and transitioning legacy systems to source their data from the new platform, improving the presentation and access of our collection information, and establishing the architecture upon which later phases would be built.
Specific phase-one project tasks included:
- Building a Digital Object Repository (DOR) that would:
- Source factual collection information from TMS, automatically synchronizing changes in data on a nightly basis or as needed (a sketch of one possible change-detection approach follows this list)
- Source interpretive content from our legacy ART database and other repositories as necessary, until the DOR itself becomes the source for this data
- Reference media master assets from our DAM and instruct the MDM to automatically generate, update or rescind image derivatives for downstream use based on current data
- Provide a consistent, versioned, and authenticated HTTP REST API as the only means of accessing data in the DOR for downstream applications
- Transitioning all legacy dependencies from TMS to the DOR
- Consolidating our online object pages down from four independent legacy versions—Collection Online, mobile.getty.edu, Search Gateway, and Provenance Portal—into a single page, employing a responsive-design template for maximum accessibility
- Beginning to create a suite of collaborative online tools for authoring and publishing interpretive collection content
- Writing detailed documentation to ensure that future developers can easily understand how to maintain and build upon the new infrastructure
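As noted in section 3.1, most TMS records offered no reliable last-modification date, so a synchronization process cannot simply ask the source “what changed?” One common technique, sketched minimally below with hypothetical function names and record shapes, is to fingerprint each extracted record and compare fingerprints between runs:

```python
import hashlib
import json

def fingerprint(record):
    """Hash a record's content so changes can be detected without relying
    on a last-modified timestamp in the source system."""
    canonical = json.dumps(record, sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def changed_records(source_records, previous_fingerprints):
    """Return only the records that are new or changed since the last run.
    `previous_fingerprints` maps record ids to the hashes seen previously."""
    changed = []
    for record in source_records:
        digest = fingerprint(record)
        if previous_fingerprints.get(record["id"]) != digest:
            changed.append(record)
            previous_fingerprints[record["id"]] = digest  # remember for next run
    return changed
```

Only the records returned by such a comparison need to be written into the repository, keeping the nightly load on both TMS and the DOR to a minimum.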
The main obstacles we encountered during early development (further explained below) included:
- Performance and scalability issues with our initial implementation
- Some last-minute revisions to scope, design, and feature requirements
- Issues utilizing and modifying the old search interface to handle more than one hundred thousand additional objects
6.1 Performance and scalability
In accordance with the project’s architectural requirements (see section 5 above), the original development prototype for the DOR had been built using a Web application framework, in particular the Django framework—a popular framework built using the Python programming language. We had no reason to believe that the tools used to build the prototype would not effectively scale up to our production requirements. However, as we revised and improved the data model, added missing functionality to ready the prototype for production, and began conducting full data synchronizations with TMS, we started to encounter performance issues that had not been apparent in earlier testing.
To be fair to the tools and frameworks, many of the issues stemmed from the changing requirements of our project as it went from ideas to reality. The functionality we required from the framework and the data-modeling capabilities needed to represent the complex hierarchies and relationships within our collections data were far more elaborate than anything these tools had originally been designed for.
Django was initially developed as an application framework at a newspaper publishing company before being released to the developer community. It had originally been selected for the project because it offered many of the core features we required: a model-view-controller (MVC) structure, which separates an application into its main functional parts of modeling data, controlling the flow of that data, and exposing that data for viewing; an object relational mapper (ORM), which makes working with records from the database as conceptually easy as working with objects in code (see section 7 below for more information on ORMs); and a large community that offered numerous modules to extend the framework’s functionality. Although Django had grown to become one of the most highly regarded Web application frameworks, it became apparent that certain requirements of the project were very complicated to implement within the constraints of the framework. This was particularly true of our data-modeling needs.
Despite these constraints, we were able to successfully implement and test the DOR’s full data model within Django, and to build an API backed by both a PostgreSQL relational database and a faceted search index utilizing the Apache Software Foundation’s SOLR faceted search engine. Once we transitioned from testing with a sample data set of a few tens of thousands of records to the full data set—numbering in the hundreds of thousands of records in some tables and millions in others—the performance of the system degraded significantly, with response times for API calls increasing to the point of being untenable for a production system. This led to several rounds of further code optimization, database tuning, and more extensive caching, and while we were able to improve performance somewhat, it still fell short of where we needed it to be.
As such, we had to seriously reconsider our options. By this time, we had already invested several months of intensive development time into the project and had a platform, data model, and feature set that was proving to be incredibly easy to work with. Our Web Group department, for example, had already ported the Museum’s collection pages from the legacy infrastructure to a new responsive-design template that integrated with the API to source the collection data they needed, and our earlier testing (with the smaller data set in place) had been very positive. Thus, we didn’t feel that abandoning the project or completely redesigning our new infrastructure (again) was the right thing to do, and we had faith in the new approach and all the research we had done.
We then began researching other suitable frameworks we could transition to and considered numerous options created for a dozen different programming languages. In doing so we found a relatively new but well-supported and actively developed open-source PHP framework known as PhalconPHP, which had been created by its developers with performance as their top priority from the very start. Unlike most other frameworks built for scripting languages like Python, PHP or Ruby, PhalconPHP was created as a precompiled module, written in the C programming language. The developers had spent a significant amount of time optimizing the framework to speed up common operations like requesting information from databases and routing HTTP requests, which happened to be some of the most important and common operations for the DOR.
Within three weeks, we had replicated the core functionality of the DOR using PhalconPHP, and after loading in the same data set used under Django, we started running timing tests. The results spoke for themselves. PhalconPHP was much faster than Django when responding to the same API calls and performing the same operations that we needed from a digital object repository. Requests for data that could take our Django-based DOR several seconds or more to respond to were taking on average half a second or less with PhalconPHP. The overheads were significantly lower with this PHP framework, and the performance was significantly higher, even though we still had a reasonable amount of scripted code that needed to be interpreted on each request. From that point onward, the PhalconPHP framework became our platform of choice for continuing the development of the DOR, and shortly thereafter it was also used to build our new MDM media derivative generation platform.
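For reference, a timing comparison of this kind can be as simple as repeatedly issuing the same API call against each candidate back end and averaging the wall-clock time. The sketch below (with hypothetical host names) illustrates the general approach rather than reproducing our actual test harness:

```python
import time
import requests

def mean_response_time(url, runs=20):
    """Average wall-clock time for repeated GET requests to one endpoint."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        requests.get(url).raise_for_status()
        total += time.perf_counter() - start
    return total / runs

# The same API call, served by each candidate framework:
for name, url in [("Django", "http://dor-django.local/api/v1/objects/826"),
                  ("PhalconPHP", "http://dor-phalcon.local/api/v1/objects/826")]:
    print(f"{name}: {mean_response_time(url):.3f}s average")
```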
While we realize there will be additional opportunities to improve performance in the future, and that there are countless programming languages and frameworks to choose from when developing most projects, we feel that we struck an appropriate balance between our ideal performance needs, overall project development time, and the availability of developer skills. When we discovered the PhalconPHP framework, we felt comfortable transitioning the DOR to this platform for several reasons: it supported all of our functional needs; it is well supported by an active and growing developer community; and PHP is one of the most popular programming languages, used by over 80 percent of websites (W3Techs, 2015; Ide, 2013), meaning we would have a larger pool of skilled developers to recruit from in the future and that the project would be more accessible to a wider development community.
Interestingly, we gained some unexpected benefits from the experience of transitioning the DOR from one framework to another even before it went into production use: we confirmed that we could make significant changes to our new architecture (as we did in transitioning the DOR from Django to PhalconPHP) and, so long as we maintained the API, nothing downstream (or upstream) would be affected. Having been through this experience, we would aim to set aside more time for additional testing at the prototyping stage, and to conduct deeper research into available tools and likely performance bottlenecks, before finding ourselves so far into a future project. Furthermore, if in the years to come we determined that we could better support the needs of the DOR and our users by transitioning to another platform or by replacing certain components of the DOR, we now know that we could do so without having to reengineer our entire infrastructure.
6.2 Last-minute revisions
The project was delayed a few times by changing requirements and design revisions. We started out with the stated aim of redeveloping our back-end architecture without making significant changes to visitor-facing applications. As development progressed, however, we realized that we had to make changes to the user experience to accommodate the significantly larger data set. Stakeholders also requested some changes during the development process, which we wanted to attend to as well.
Through these experiences and changing needs, we developed a more iterative way of working and started using new collaboration and project-management tools (including Atlassian’s Jira bug-tracking/project-status application) to ease the development process and improve communications among our team, which was distributed across several departments in different buildings. We reached out more to stakeholders and other interested parties within the institution to gather their feedback and refine our development effort. In hindsight, the changes and improvements made throughout the first phase of the project were necessary and highly beneficial, and although the project wasn’t launched as early as originally hoped, we were able to deliver a far better user experience and resolve a number of additional architectural issues along the way.
6.3 Old interfaces and new data
We hoped that phase one would have minimal impact upon end users and that we would only introduce new behavior or functionality where replicating old methods would have been cumbersome or inappropriate. As we started work on the redevelopment of our online collection pages, however, we soon realized that we could not simply modify or rebuild the pages to obtain their data from the DOR. Previously, our online collection pages had presented approximately 6,000 hand-curated objects deemed to be representative of the collection. With the transition to our new infrastructure, the rules for which objects would be displayed online changed, and the new platform made exposing these records much easier, with the set growing to almost 108,000 objects by the time the project launched.
The biggest impact of this change was on the navigability of the collection. We found the old search functionality lacking with the larger data set; the old model of presenting objects in hand-curated classification groups (vases, coins, paintings, etc.) with simpler search features became unsuitable. As such, we needed to redesign the user experience, enhancing the faceted and full-text search capabilities. We launched the new collection pages after several revisions of design and functionality, and advancements will continue to be made based on further user feedback and improvements in our infrastructure.
7. Architectural technologies
With technology having advanced greatly since the mid-1990s, when the concepts behind digital object repositories emerged, many more methods for building a DOR were available to us. We opted to store our normalized digital object data in an RDBMS, accessing and manipulating that data through a fast Object Relational Mapper (ORM), which maps relational database fields to their digital object attributes. This is in effect a hybrid model that allows us to take advantage of tried-and-tested and highly scalable technologies like an RDBMS, while working with data in an object-oriented way. Beyond the ORM layer, the code only handles digital objects, making it easy to transition to other storage mechanisms in the future if doing so would prove beneficial. Furthermore, the only access other applications have to the DOR is via its HTTP REST API. As such, the internal workings of the system are not exposed to any downstream application; this intentional design decision helps ensure that ongoing maintenance will be simpler, as we will no longer be dealing with deep dependencies between systems, as was common with our legacy infrastructure.
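The production DOR implements this pattern with PhalconPHP’s ORM on top of MySQL (see sections 6.1 and below); the short Python/SQLAlchemy sketch that follows, with hypothetical table and field names, simply illustrates the general idea of an ORM: code works purely with objects while rows and columns are handled behind the scenes.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()

class ArtObject(Base):
    """One entity type mapped onto a relational table: the ORM translates
    between table columns and object attributes in both directions."""
    __tablename__ = "art_objects"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    maker = Column(String)

engine = create_engine("sqlite:///:memory:")  # stand-in for the production RDBMS
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

with Session() as session:
    session.add(ArtObject(title="Irises", maker="Vincent van Gogh"))
    session.commit()
    # Code beyond the ORM layer sees only objects, never SQL:
    obj = session.query(ArtObject).filter_by(title="Irises").one()
    print(obj.maker)
```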
We believe that our new information architecture is more sustainable, maintainable, and easier to use and understand. It offers the performance we currently need and is scalable for the future. Should we wish to transition to another data-storage platform or swap out other components in the future, the black-box nature of the architecture (figure 3) will allow us to do so without affecting any upstream or downstream applications, so long as we maintain compatibility with the API. The API is also versioned, so we can offer new features in the future via a new version of the API, and upgrade applications when convenient, as those applications could continue to use an earlier version of the API until we were ready to carry out the upgrade. This architectural change alleviates maintenance issues, as it allows our development teams to work in a more strategic way, planning upgrades to take place when convenient for their overall project schedules and project needs, rather than having to work in a more reactionary manner to address incompatibility issues with other systems as was more common with our legacy infrastructure.
As detailed in section 6.1 above, the DOR project began as a Django framework-based prototype, which we transitioned to a much faster PHP-based framework after experiencing the unresolvable performance issues discussed earlier. There were some other adjustments to the DOR’s software architecture during the first stages of development, including transitioning our RDBMS from PostgreSQL to MySQL and our faceted/free-text search engine from SOLR to ElasticSearch. The major software components selected for the production version of the DOR are PhalconPHP, MySQL, ElasticSearch, and Memcached (depicted in figure 3 below).
We selected the MySQL database platform for the DOR’s ORM data storage due to its proven reliability and its ability to grow and scale with the needs of the repository. It is also one of the primary RDBMS platforms already supported for production use by our Information Technology Services department. We selected Memcached as our caching layer because it can cache commonly requested data in RAM, making retrievals of that data extremely fast, and because multiple instances can be installed across the network to provide load-balancing capabilities. When it came to evaluating our options for supporting faceted and free-text search in our new infrastructure, we realized that search should be a core feature of the DOR, not a feature that each individual application attempts to implement itself. By making search an integral part of the DOR, we would gain the advantage of ensuring consistent search results across all of our applications. In making our selection, we evaluated several potential faceted search engines for this role, including SOLR, ElasticSearch, and Sphinx. After reviewing our requirements, such as the ability to index multiple different data types with different schemas into the same search index and the desire to perform complex aggregations and analytics on the data, it became clear that ElasticSearch offered the most comprehensive feature set and the performance we needed.
Figure 3: the “black-box” DOR architecture
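The caching layer follows the common cache-aside pattern: check RAM first, fall back to the database on a miss, and prime the cache for the next request. Below is a minimal Python sketch of that pattern using the pymemcache client; the DOR itself is PHP-based, and the key names and expiry here are hypothetical:

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))  # one Memcached instance on the network

def get_object_json(object_id, fetch_from_db):
    """Cache-aside lookup: serve hot records from RAM and only hit the
    relational database when the cache misses."""
    key = f"dor:object:{object_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached.decode("utf-8")          # cache hit: no database work
    body = fetch_from_db(object_id)            # cache miss: expensive query
    cache.set(key, body, expire=3600)          # keep the result for an hour
    return body
```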
8. Benefits
Some of the many benefits that we gained from the creation of the DOR include:
- The authoring and workflow tools we are building will enable collection information and interpretive content to be assembled and published to the Web, our mobile multimedia tour-guide devices (and soon apps on visitors’ own devices), and other outlets.
- The Media Derivative Manager (MDM) platform has streamlined our media derivative generation and management processes, reducing inefficiencies and redundancies in processing and managing media derivatives for multiple platforms.
- Fewer staff are required to publish content; curators, editors, and other content creators can now more easily control the release of information without technical staff assistance.
- Many tedious, error-prone manual tasks related to maintaining the previous content publishing landscape were eliminated and replaced by automated, synchronized processes.
- The DOR enables the management of access rights in a standardized and centralized way.
- As we gradually phase out legacy infrastructure, the DOR is becoming the platform from which all applications will access collection information. This helps ensure the consistency and accuracy of our information as presented through all our distribution channels.
- Our new infrastructure allows us to respond more easily and quickly to increasing demands for sharing collection information with stakeholders, partners, and the wider community.
9. Epilogue
When we started this project, we were struggling to maintain our legacy infrastructure and support increasing demands on collection information. Our architecture had grown to become a maze of interdependencies, which sometimes made the simplest of updates challenging. When a number of factors conspired, including the pending upgrade of our collections management system, we knew we either had to find a more sustainable way to maintain the infrastructure we had or develop a replacement. After much research and consideration, we decided to centralize our collection information architecture so that we could regain control and ease future maintenance and updates. Through this process, we believe we have been able to develop a sustainable and flexible information architecture that not only is helping us solve our own collection information architecture issues, but is a solution that could be particularly useful to many other cultural heritage organizations.
Now that our new collection information architecture is operational, we would like to start exploring potential opportunities to share the open-source tools we have built and to collaborate with other institutions. While we realize that collaboration opportunities will likely be better suited to larger institutions initially, we hope that as the project continues to develop and as the documentation becomes more comprehensive, it will become useful to smaller institutions too. Beyond institutions implementing DORs of their own, we hope that what we have shared here will be of benefit to others as they consider the next steps in the evolution of their own collection information architectures.
Acknowledgements
This project would not have been possible without the support and significant contributions of so many Getty colleagues, past and present: Stanley Smith, Brenda Podemski, Timothy Potts, Thomas Kren, Nik Honeysett, Roger Howard, John Giurini, Jack Ludden, Molly Callender, Ted Dancescu, Will Lanni, JP Pan, Joe Shubitowski, Mike Clardy, Joan Cobb, Steve Gemmel, Cherie Chen, Petrus Williams, Calvin Chan, David Lacey, Philman Wu, Shane Greene, Donovan Williams, Gregg Garcia, Robin Weissberger, Maria Gilbert, Ahree Lee, Erik Bertellotti, Jason Patt, Michael Smith, Brenda Smith, Krystal Boehlert, Kara Kirk, Greg Albers, Autumn Harrison, Peter Bjorn Kerber, Anne-Lise Desmas, Anne Woollett, Elizabeth Morrison, Yvonne Szafran, Lee Hendrix, Christine Sciacca, Alicia Houtrouw, Kristen Collins, Quincy Houghton, Kevin Murphy, Betsy Severance, Carole Campbell, Debby Lepp, Jennifer Alcoset, Irene Lotspeich-Phillips, Elsa Balliet, Heather MacMillan, Ryan Chute, and everyone else who helped make this project successful by offering their time, expertise, and feedback.
This paper is dedicated to my family, friends, and colleagues for their support throughout this project, especially to my wonderful wife Amanda and our darling daughter Madaline.
References
Ide, A. (2013). PHP Just Grows & Grows. Netcraft. January 31. Available http://news.netcraft.com/archives/2013/01/31/php-just-grows-grows.html
Kahn, R., & R. Wilensky. (1995). A Framework for Distributed Digital Object Services. Corporation for National Research Initiatives, cnri.dlib/tn95-01. May 13. Available http://www.cnri.reston.va.us/k-w.html
W3Techs. (2015). Usage Statistics and Market Share of Server-Side Programming Languages for Websites. January. Available http://w3techs.com/technologies/overview/programming_language/all
Cite as:
Sissman, Daniel. “A new DOR opens: How the J. Paul Getty Museum is reimagining digital collection information management.” MW2015: Museums and the Web 2015. Published February 15, 2015.
https://mw2015.museumsandtheweb.com/paper/a-new-dor-opens-how-the-j-paul-getty-museum-is-reimagining-digital-collection-information-management/