Digital Collections and Aggregations - Digital Humanities Data Curation

Katrina Fenlon, University of Illinois at Urbana-Champaign
Jacob Jett, University of Illinois at Urbana-Champaign
Carole L. Palmer, University of Illinois at Urbana-Champaign

§ 1 Introduction

Libraries, museums, and archives have been producing digital collections for decades, providing scholars with broad access to countless special collections. Researchers engaged in digital scholarship have also created many digital collections tailored to the interests of their particular research communities. Both kinds of collections are curated, in that they have been carefully selected and assembled for a specific purpose or audience. In the networked information environment, curated collections will become increasingly important as organizational units for the scattered and diverse mass of available digital information and for providing coherent contexts for meaningful engagement with that information. Aggregations, or collections of collections, are essential backbone resources in the evolving e-research platform that also need to be curated if they are to truly support and enhance discovery and innovation across the disciplines. Curatorial activities, such as archiving, preservation, and maintenance, are also important for managing the entire lifecycle of collections, but collection development and collection description are formative curation activities that add value for scholarly inquiry at both the collection and aggregation levels. This chapter is focused on key formative activities, particularly selection, representation, and related practices that support the integration of collections into large-scale digital aggregations. Without careful attention to interoperability and metadata for intellectual and navigational context, for example, aggregations can become incoherent and unwieldy masses of undifferentiated content. The resources presented below cover four key aspects for development of digital collections that are fit for purpose, function effectively in the networked information environment, and can contribute to the creation of rich, extensive, and diverse aggregations for scholarly use.

§ 2 Scholarly Use of Digital Collections

There is a large body of literature on scholarly practices, including how researchers use libraries, archives, and digital collections. It provides a substantial base of knowledge on the information resources of value to scholars and how they search, explore, and use collections. The resources listed in this section address foundational issues related to the role of collections in the research process and the implications for development of digital collections and services, ranging from accounts of the scholarly primitives, or basic activities involved in conducting scholarship in the humanities and sciences, to the kind of sources that are of value in different disciplines and related aspects of organization and description. The following resources identify and analyze the information practices and needs of scholars that well-curated collections must support.

Resources: Information needs of scholars

Lee, H. L. “The concept of collection from the user’s perspective.” Library Quarterly, 75 (1), 67-85. 2005.

Concepts surrounding collections and their functions are examined from the user’s perspective, based on interviews with humanities and natural sciences scholars. Collections are shown to have multiple functions, including selection and collocation of related materials, narrowing of search scope, and clarification of information needs. The author demonstrates the need for user-centered and flexible collection structures for digital library systems. Curated digital collections facilitate information seeking through well-designed collection structure (in other words, the organization among collection components), which should clarify relationships between collections and sub-collections, and at the same time accommodate expectations of both users and system managers. Advanced technology integrated with collections should allow users to customize collection structures to meet their individual needs.

Unsworth, J. “Scholarly primitives: What methods do humanities researchers have in common, and how might our tools reflect this?” Symposium on Humanities Computing: Formal methods, experimental practice. King’s College, London. 2000.

The intellectual and physical work practices of humanities scholars are promoted as core constructs for guiding the development of digital resources and tools. Scholarly primitives, defined as common, discrete activities integral to how researchers create new scholarly works, include discovering, annotating, comparing, referring, sampling, illustrating, and representing. While the discussion applies primarily to tool functionality, those functions are dependent on access to rich collections of digital source material targeted for specific scholarly purposes. See also Palmer et al. on scholarly primitives, below.

Palmer, C. L., Teffeau, L. C., Pirmann, C. M. “Scholarly information practices in the online environment: Themes from the literature and implications for library service development.” Dublin, OH: OCLC Research and Programs. 2009.

As an integrative review of the broad base of literature on scholarly practices and information use, this report provides a framework of information activities involved in a research process. In particular, it extends the scholarly primitives concepts introduced by Unsworth (2000). Primitives of scholarly behavior, such as browsing, direct searching, gathering, rereading, co-authoring, networking, and notetaking, are explicated and grouped within five broader activities: searching, collecting, reading, writing, and collaborating. Taking a comparative approach that draws out differences in how primitives and information practices manifest across disciplines, the analysis establishes an empirical basis for prioritizing the development of digital information resources and services to support scholarship in different disciplines.

Brockman, W. S., Neumann, L., Palmer, C. L., Tidline, T. “Scholarly work in the humanities and the evolving information environment.” Washington, DC: Digital Library Federation/Council on Library and Information Resources. 2001.

Based on a study of the information practices of humanities scholars across several disciplines, applying a combination of semi-structured interviews and case studies, this report identifies tasks that digital collections need to facilitate, based on how scholars engage with information resources. Library collections are highlighted as essential capital for the production of new research in the humanities. Scholars depend on both personal collections and curated library collections, which can inform the development of one another. Features that make a collection valuable include variety and complementarity, in addition to the value-added tools and services that are needed to support interaction and use.

Duff, W. M., Johnson, C. A. “Accidentally found on purpose: Information-seeking behaviors of historians in archives.” Library Quarterly, 72 (4), 472-496. 2002.

Important insights into the use of archival collections are offered in a holistic description of the information seeking process of historians. Approaches shown to be critical to the research process include collecting names of individuals and organizations and conducting provenance searches. Contextual information, such as knowledge of relationships among documents or the way the records are organized, is necessary for scholarly interpretation; context-preservation and representation, through collection organization and careful description, are essential curatorial acts.

Buckland, M. K. “Collections.” In Library services in theory and context (2nd ed.). New York, NY: Pergamon Press. 1999.

This book advances a conceptual framework for library services in general, while Chapter 7 (“Collections”) focuses on issues of particular relevance to collection developers. It may be considered a conceptual starting point for practice. The curatorial role of collections is established; library collections (both digital and physical) are conceptualized as resources that provide evidence for inquiry, serving archival and logistic roles. Thus, an important criterion for evaluating collections is their ability to serve as evidence for learning. In particular, this chapter provides a foundation for practitioners interested in user-centered approaches to collection development that emphasize careful selection and facilitation of retrieval.

§ 3 Collections Development Principles

Collection curation includes careful selection and organization in anticipation of user needs. Proceeding from the user perspectives established above, this section is concerned with guiding principles for building valuable digital collections. In the following resources, conceptual approaches to collection definition, development, and curation feed into specific and pragmatic guidelines for collections planning, implementation, and sustainability.

Many of the aspects covered affect how well collections can be aggregated into larger, shared resources, in part because collection administrators often neglect the possibility of aggregation or eventual repurposing during collection-making, even though both are common in cultural heritage institutions. The possibility of aggregation – along with other forms of repurposing that might entail loss of context for collections or items in collections – must be accounted for during collection curation: interoperability is essential to sustainability in the networked digital environment. Unfortunately, there are currently no sources that provide comprehensive coverage of collection development principles or practices for building large-scale aggregations.

Resources: Collection development

Lee, H. L. “What is a collection?” Journal of the American Society for Information Science, 51 (12), 1106-1113. 2000.

Collections function as information seeking environments, both traditionally and in a digital environment. Tangibility, ownership, a user community, and an integrated retrieval mechanism – traditional presumptions about collections – are shown to require reassessment in light of technological advance. This article is largely conceptual, but its reexamination of traditional notions of collection has implications for practitioners, especially regarding how to align collections with user needs in a digital environment. The final section proposes an expanded conceptual framework for collections.

Lee, C. A. “A framework for contextual information in digital collections.” Journal of Documentation, 67(1), 95-143. 2011.

Context is a prominent and fundamental concept in curation. This article provides a deep analysis of contextual information, in particular its relationship to collections. Nine classes of contextual entities (object, agent, occurrence, purpose, time, place, form of expression, concept / abstraction, and relationship) derive from the relationships between items in a collection. Curators need to be concerned with how these contextual entities are represented in the description and organization of collections to support preservation and use.

Palmer, C. L. “Thematic Research Collections.” S. Schreibman, R. Siemens, and J. Unsworth, eds. A Companion to Digital Humanities (pp. 348-365). Oxford: Blackwell. 2004.

While a number of the chapters may be of interest in this broadly scoped volume of expert articles on the history, principles, and applications of the digital humanities, this specific chapter is particularly relevant since it suggests that collections of primary and secondary materials developed by scholars are important models for the development of specialized thematic research collections. Moreover, it discusses how scholarly collections fit with and can inform library collection development. The characteristics of scholars’ personal collections, such as being extensive and interdisciplinary, yet still thematically coherent, are important considerations for library practitioners working to pull together digital research materials into collections customized for intensive study and analysis in a specific research area. The principle of contextual mass is introduced, which prioritizes the values and work practices of scholarly communities rather than collection size. Instead of striving for a critical mass of content, thematic collections should systematically integrate sources and tools “that work together to provide a supportive context for the research process,” thereby producing a critical mass of context.

NISO Framework Advisory Group. A framework of guidance for building good digital collections. 3rd ed. Bethesda, MD: National Information Standards Organization. 2007.

This NISO recommended practice offers guidance on the four major components of building good digital collections: collections (groupings of digital objects); objects (the digitized or born-digital information objects in collections); metadata (information pertaining to digital objects and collections); and initiatives (project planning and management). The framework is intended to assist with planning and implementation of digital collections in cultural heritage institutions and to inform funding agencies that want to encourage the development of high quality digital collections. Among recommended features of, or practices for, curated collections are the availability of explicit collection development policies; thorough and user-aware descriptions of collections, including information that would help a user determine collection authenticity, representativeness, and interpretation; curation through active management of a collection’s resources throughout their lifecycles; implementing measures for accessibility; facilitating interoperability; and preparing collections for easy integration into users’ workflows.

§ 4 Descriptive Metadata for Collections and Items

The previous sections covered high-level considerations that should guide collection building into alignment with user needs. This section begins to elaborate collection curation, especially as it relates to description, with greater specificity: how does one set about making a useful, rich, and shareable collection description that will facilitate discovery?

A recurrent theme in the following resources is the curatorial necessity of keeping long-term considerations in mind during collection description, with awareness of how the collection may be shared, aggregated, or repurposed outside of its original context. To preserve for the long term the contextual value that a collection provides, collection descriptions must answer certain questions:

On collection characteristics: What is the title of the collection? Where can the collection be accessed (persistent URL)? How many items are in the collection? What media types and formats are represented in the collection? What is the topical, temporal, and geographic coverage of the collection? What languages are represented?
On the provenance, administration, rights or access restrictions, and institutional affiliations of the collection: Where do the collection and items within the collection come from? How were they brought together and by whom? What is the custodial history of the collection? Are items readily accessible online, and if so, with what restrictions? What copyrights apply to this collection and items in it? Were resources in the collection born digital or do they exist in physical form, and if so, how can they be accessed? For aggregators: What metadata formats are in use? Are there alternative access points (e.g., an OAI-PMH data provider, an API, or an RSS/Atom feed for items in the collection)? Does this collection have associated projects or other collections?
On the relationship between items and the collection: Why have these items been brought together as a collection? What is in the collection, or what is the shape of the whole? What is the collection’s target audience? How does the gathering of these items into a single collection create a new information resource with value added beyond the value of individual items?
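
The questions above can be read as a checklist for a structured collection-level record. The following is a minimal sketch in Python (3.9 or later), not an implementation of any published schema: the class name, the field names, and the example values are all hypothetical and chosen only to mirror the three groups of questions.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class CollectionDescription:
    """Hypothetical collection-level record; field names are illustrative only."""
    # Collection characteristics
    title: str
    access_url: str
    item_count: int
    formats: list[str] = field(default_factory=list)
    subjects: list[str] = field(default_factory=list)
    temporal_coverage: str = ""
    spatial_coverage: str = ""
    languages: list[str] = field(default_factory=list)
    # Provenance, administration, and rights
    custodial_history: str = ""
    rights: str = ""
    metadata_formats: list[str] = field(default_factory=list)
    alternative_access: dict[str, str] = field(default_factory=dict)
    # Relationship between items and the collection (free text)
    purpose: str = ""
    audience: str = ""

    def unanswered(self) -> list[str]:
        """Return the names of fields that are still empty."""
        return [name for name, value in asdict(self).items() if not value]


# Example values are invented for illustration.
record = CollectionDescription(
    title="Example Letters Collection",
    access_url="https://example.org/collections/letters",
    item_count=1200,
    formats=["image/tiff", "text/xml"],
    languages=["en"],
    purpose="Correspondence gathered to document one family's migration.",
)
print(record.unanswered())
```

Calling `unanswered()` gives a collection administrator a quick view of which questions the description does not yet address. In practice the fields would be mapped to a standard element set rather than to ad hoc names like these.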

Collection description

Collection description should rely on relevant element set and vocabulary standards whenever possible, maintaining two kinds of balance:

The collection description should be expressive enough to capture local context and domain-specific information, both for users and for collection administrators, and at the same time shareable and fully expressive outside of the local context.
The collection description should rely both on structured data (for indexing and automated retrieval) and free text description; there is interplay between the two.

Resources: Collection description

Johnston, P., and Robinson, B. “Collections and Collection Description.” Collection Description Focus Briefing Paper, 1. Bath: UKOLN. 2002.

This brief details the differences between archival, library, museum and digital collections, along with the power of collection-level descriptions in resource discovery and management. While the brief does contain some information specific to the UK’s Research Support Libraries Programme (RSLP) and JISC’s Distributed National Electronic Resource (DNER), a more generally applicable aspect is its exploration of features – including topical, temporal, and spatial coverage – around which a collection can best be described to facilitate discovery and to articulate its comprehensiveness and uniqueness, two aspects of collections that scholarly users in particular seek, and on which they base interpretation.

Chapman, A. “Collection-level description: Joining up the domains.” Journal of the Society of Archivists, 25 (2), 149-155. 2004.

Collection-level descriptions are important tools that bridge domain gaps within heterogeneous collections. Gathering information resources into groups of related resources is a long-established practice for managing the organization of information resources so that they better facilitate resource retrieval by users, and for helping collection developers strategically plan future acquisitions based on the strengths and weaknesses of the collections. Further, the author finds that archival collection description and item-level description in museums and libraries, respectively, are not only similar, but are also complementary approaches to resource description. Practitioners can leverage these similarities in the construction of hierarchies of collection description, further enhancing the overall characterization of their collections.

Dublin Core Metadata Initiative. Dublin Core Collections Application Profile. 2007.

This document describes the application profile for collection-level description developed by the Dublin Core Collection Description Task Group. It is provided as an example of a standard, perhaps the most commonly used one, for collection description.

Item description for aggregation and interoperability

Certain lessons about creating useful collection descriptions translate to describing and sharing metadata for items within collections:

Metadata should be created with long-term considerations in mind, including the potential for repurposing, relocation, or loss of collection- and institutional context.
Balance between expressiveness and interoperability is critical. Metadata should rely on standards suited to the nature of the items and expressive enough to capture relevant information and context, and yet should prioritize interoperability. Collection administrators should ask themselves the following sorts of questions when creating item-level metadata: Can the data be shared without information loss? Is the data structured and made accessible such that sharing and repurposing are technically feasible? What information would be lost by moving item data to another context?

Metadata standards for describing all types of items from all domains abound; we do not broach specific metadata or sharing standards or protocols. Instead, resources in this section cover best practices for shareable metadata (regardless of format, though in practice the most common example is Dublin Core) in the context of collections and aggregations of collections.

Resources: Item description

Shreeves, S. L., Knutson, E. M., Stvilia, B., Palmer, C. L., Twidale, M. B., and Cole, T. W. “Is ‘quality’ metadata ‘shareable’ metadata? The implications of local metadata practice on federated collections.” Proceedings of the 12th National Conference of the Association of College and Research Libraries (pp. 223-237). 2005.

Using Gasser and Stvilia’s information quality framework (Gasser and Stvilia, 2001), the authors identify three major quality dimension groupings (Intrinsic Information Quality, Relational/Context Quality, and Reputation Quality) by which metadata records can be analyzed to assure minimum levels of quality. While the analysis in this article focuses on Dublin Core records, the authors’ findings may be generalized to the broad categories represented by various Dublin Core elements. Of particular interest to practitioners will be the eight Dublin Core elements that the authors identify as key to record completeness, and most useful in facilitating search and discovery: title, creator, subject, description, date, format, identifier, and rights. The authors note that both structural and semantic consistency are vital to metadata record interoperability.
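
As one illustration of how the eight elements named above might be used in practice, the sketch below flags records that lack any of them. It assumes a flat dictionary per record, mapping an element name to a string or a list of strings; that shape, the function name, and the sample record are hypothetical, not a Dublin Core serialization or the authors' own method.

```python
# The eight elements named by Shreeves et al. as key to record completeness.
KEY_ELEMENTS = ("title", "creator", "subject", "description",
                "date", "format", "identifier", "rights")


def missing_elements(record: dict) -> list[str]:
    """List key elements that are absent or blank in a flat record.

    `record` maps element names to a string or a list of strings; this
    flat shape is an assumption made for the example.
    """
    missing = []
    for element in KEY_ELEMENTS:
        value = record.get(element)
        values = [value] if isinstance(value, str) else (value or [])
        if not any(v.strip() for v in values):
            missing.append(element)
    return missing


# Invented sample record; note that a whitespace-only value counts as blank.
sample = {
    "title": "Letter to a sister",
    "creator": ["Unknown"],
    "date": "1862",
    "identifier": "https://example.org/items/42",
    "rights": "  ",
}
print(missing_elements(sample))
# → ['subject', 'description', 'format', 'rights']
```

A check like this addresses only completeness. It says nothing about the structural and semantic consistency that the authors also identify as vital to interoperability.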

Hutt, A., and Riley, J. “Semantics and syntax of Dublin Core usage in Open Archives Initiative data providers of cultural heritage materials.” Proceedings of the Fifth ACM/IEEE-CS Joint Conference on Digital Libraries (pp. 262-270). New York, NY: ACM Press. 2005.

This article discusses the expressiveness of metadata standards, especially with regard to best practices versus granularity issues (e.g., dates with respect to Dublin Core best practices versus AACR2 standards, which are much more expressive). Ultimately, describing digitized and born-digital resources needs as much care and planning as the description of more traditional library resources. Awareness of expressivity and interoperability issues is important when it comes time to decide which metadata standard should be used to describe the individual items within a collection. While adopting a richly expressive standard such as MODS or METS may be cost-prohibitive for some institutions due to the time investment necessary for describing each item in the collection, the use of a generic standard, such as Dublin Core, may lead to the omission of information pertinent to the user.

§ 7 Aggregation Issues and Problems

Digital aggregations can provide essential metastructures for unifying distributed collections and content, and therefore have a role to play in collection curation. Aggregation at various levels, including but not limited to the following levels, is increasingly common:

Large-scale, national and international digital libraries aggregate highly diverse collections from all kinds of institutions.
Thematic aggregations pull together topically related collections or items from a variety of institutions, usually for a particular audience with a particular topical interest.
Local, institution-specific collections of collections often integrate technically and topically diverse data into a single content management system behind a single point of access.

However, the act of bringing together and providing access to a large number of collections does not guarantee that the resulting aggregation will be a useful resource for researchers. Each kind of aggregation entails challenges for both aggregation administrators and individual collection administrators. Because aggregation is common and can strip away context, collection curation should anticipate how aggregation, and other kinds of data repurposing, may affect the informative value of collection- and item-level data. For example, aggregations that provide item-level search across hundreds or thousands of collections often collapse or obscure original collection organization, thereby weakening or obfuscating – from both users’ and administrators’ perspectives – the relationship between an item and its original collection. Aggregators should develop the aggregation in a principled way, both to create an aggregate resource of greater value and to preserve institutional and collection contexts within the aggregation. As collection administrators develop and describe collections for use in a local context, the following resources on aggregation-development principles and strategies may inform how curation activities can anticipate aggregation.
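
One simple way to keep the link between an item and its original collection when records are pooled for item-level search is to tag each item with its source collection as it is aggregated. The sketch below illustrates that idea only; the function names, the `source_collection` key, and the sample data are hypothetical and are not drawn from any of the resources listed here.

```python
from collections import defaultdict


def aggregate(collections: dict[str, list[dict]]) -> list[dict]:
    """Flatten several collections into one item list without dropping context.

    Each item is copied and tagged with the identifier of its source collection.
    """
    items = []
    for collection_id, records in collections.items():
        for record in records:
            items.append({**record, "source_collection": collection_id})
    return items


def group_by_collection(items: list[dict]) -> dict[str, list[str]]:
    """Recover the original grouping of item titles from the aggregated list."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["source_collection"]].append(item["title"])
    return dict(grouped)


# Invented sample data: two small collections from different sources.
sources = {
    "letters": [{"title": "Letter to a sister"}],
    "maps": [{"title": "County survey"}, {"title": "Harbor chart"}],
}
merged = aggregate(sources)
print(group_by_collection(merged))
# → {'letters': ['Letter to a sister'], 'maps': ['County survey', 'Harbor chart']}
```

Because every aggregated item carries its collection identifier, an interface built on the merged list can still show where an item came from or link back to the collection-level description.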

Resources: Aggregation development

Bishoff, L., and Allen, N. Business planning for cultural heritage institutions. Washington, DC: Council on Library and Information Resources. 2004.

Cultural heritage institutions often receive short-term funding to digitize collections. Collection curation and preservation depend on an institution’s ability to assemble plans and resources for sustainability, and to prepare collections for eventual integration into aggregate organizational structures. Libraries and museums have produced thousands of resources that have made substantial contributions at the local level, but most have not planned their digital programs for participation in large national aggregations. This report surveys current business planning practices in cultural heritage institutions, aiming to provide a template to help cultural heritage organizations apply business planning to the sustainability of digital asset management initiatives.

JISC. Clustering and sustaining digital resources. JISC eContent Programme 2009-11. Bristol: JISC, 2011.

This report provides case studies of eleven projects that manage digital resources for higher education, with particular emphases on (1) building and sustaining collections and (2) integrating them to reduce the silo effect. While the focus is largely on managing digitization projects, rather than managing digital collections, these case studies reveal strategies for reconciling short-burst patterns of digitization funding with long-term sustainability: linking digital projects more closely with existing institutional structures, developing skills and capacity, and sharing services across institutions for digitization projects. Case studies also examine challenges to and strategies for integrating resources from various institutions for increased value and sustainability, looking at how best to accommodate the rearrangement, enrichment, harvesting, sharing, and combining of metadata that is essential to long-term digital collection viability, and therefore to curation.

Palmer, C.L., Zavalina, O.L., and Fenlon, K. “Beyond size and search: Building contextual mass in digital aggregations for scholarly use.” Proceedings of the American Society for Information Science and Technology, 47 (1), 1-10. 2010.

Applying the concept of contextual mass discussed in Palmer (2004), above, the collection development strategy presented here relies on a conspectus-style analysis of collection metadata to drive principled development of large-scale digital aggregations. While the article’s focus is on the aggregation level, the outlined approach models how scholars build their own personal research collections, as they follow leads from collection to collection across institutions near and far, and considers how collection and aggregation can add value that cannot be achieved through conventional retrieval and browsing at the item level.

Cole, T.W., Han, M.J., Moncur, D., and Green, H. “Describing collections and collection services for the BTP.” Proceedings of the International Conference on Dublin Core and Metadata Applications, 2011.

The concept of aggregation must be extended as library collections enter the world of the semantic web: current projects are creating new kinds of aggregation and leading to new uses for collections. Collection descriptions of the future will need to anticipate computer-mediated collection interoperability and computer-agent collection use. This article is heavily technical and forward-looking. However, it may widen the view of practitioners seeking to anticipate the ways and extent to which collection descriptions may be repurposed, and how best to ensure their sustained value in the next generation of digital aggregate resources.