D-Lib Magazine, September/October 2012
Identifying Threats to Successful Digital Preservation: the SPOT Model for Risk Assessment
Sally Vermaaten
Abstract
Developing a successful digital preservation strategy amounts to accounting for, and mitigating, the impact of various threats to the accessibility and usability of digital materials over time. Typologies of threats are practical tools that can aid in the development of preservation strategies. This paper proposes a new outcome-based model, the Simple Property-Oriented Threat (SPOT) Model for Risk Assessment, which defines six essential properties of successful digital preservation and identifies a limited set of threats which, if manifested, would seriously diminish the ability of a repository to achieve these properties. We demonstrate that the SPOT Model possesses the attributes of conceptual clarity, balanced granularity, comprehensiveness and simplicity, and provide examples of practical uses of the model and suggestions for future work.
1. Introduction
Digital preservation strategies, as well as the processes and tools that implement those strategies, are designed to secure the long-term future of digital materials. A successful digital preservation strategy must account for and mitigate the impact of various threats to the accessibility and usability of digital materials over time. Digital preservation strategies must address the threats relevant to the specific repository context in which they are expected to operate; this in turn requires an understanding of the full range of potential threats so repository staff can evaluate the likelihood and impact of each in the context of local circumstances, and take appropriate steps to address those threats representing significant risk. Threat models for digital preservation are practical tools that help repositories carry out several kinds of risk assessments.
Repositories may use a threat model to ensure that they sufficiently guard against key risks during: development or re-development of a high-level digital preservation strategy; validation of a digital preservation strategy or demonstration of the overall trustworthiness of the repository that employs that strategy; development or re-development of a component process, tool, policy, or procedure in the system that implements a high-level preservation strategy. Threat models can also help organizations formulate preservation requirements for larger business systems and models. The processes, standards, and tools that are used to enact high-level preservation strategies regularly evolve to keep pace with technological and other environmental changes. There is a need for a threat model that is designed to aid users who are developing or reviewing particular preservation system components as well as developing or validating high-level preservation strategies or complete preservation programs. In order to support risk assessments with a smaller scope and duration as well as wider reviews, a digital preservation threat model should be light-weight and directly focused on key preservation outcomes. A threat model of this kind could be used by itself or in conjunction with a more detailed threat model. In reviewing the digital preservation literature, several threat models were identified, reviewed, and compared. In order to conduct a consistent comparison, we posited four essential qualities that a threat model should possess in order to be useful for risk assessment: conceptual clarity, appropriate detail and consistent granularity, comprehensiveness, and simplicity. 
Use of models which lack one or more of these qualities can lead to a number of problems in the course of risk assessment: for example, application of the model may be overly complex and resource-intensive; one group of threats could be over-emphasized, while other, equally important threats are de-emphasized or ignored; or threats may not be described in sufficient detail to create clear mappings into real-world digital archiving systems. None of the models reviewed was found to possess all four qualities. In short, there is still a need for a light-weight, outcome-focused threat model to support a wide range of repository risk assessment activities. In this paper, we present the SPOT (Simple Property-Oriented Threat) Model, which is designed to be applicable across a variety of repository contexts and is flexible enough to be used in support of both smaller internal risk assessment exercises as well as more complex internal or external assessments. Use of the SPOT model can help repositories identify previously unaddressed threats, perform ongoing monitoring of key threats, and demonstrate that a repository complies with accepted standards by appropriately managing risks. Broadly speaking, digital preservation threats can be divided into two categories: threats to archived digital content, and threats to the custodial organization itself. For example, technical issues relating to the ingest, storage, maintenance, and dissemination of archived content tend to fall into the first category; on the other hand, economic issues relating to the ongoing availability of sufficient resources for the repository to meet its long-term goals, and legal issues which may limit the actions a repository can take to preserve digital content, tend to fall into the second category. The SPOT Model focuses on the first category. Threats relating to the second category fall outside the scope of the model. 
This is not to diminish the importance of the second category of threats, or to suggest that threats to the repository itself could not indirectly impact archived objects. However, the purpose of our model is to develop a lightweight framework for assessing threats arising from the technical operations associated with preserving digital objects. This framework should be used in conjunction with other threat assessment models that address the second category of threats.
The rest of the paper is organized as follows. Section 2 reviews several existing threat models in the context of four desired attributes: conceptual clarity, appropriate detail and consistent granularity, comprehensiveness, and simplicity. Section 3 describes the properties of well-preserved digital objects as the basis for a new approach to risk assessment. Section 4 presents the SPOT Model for risk assessment. Section 5 offers some commentary on the model, while Section 6 provides illustrations of its practical application. Section 7 concludes with some suggestions for further work.
2. Prior Work
The field of digital preservation has drawn on the vocabulary and principles of risk management for over a decade. Early works such as Conway's 1996 "Preservation in the Digital World" characterized digital preservation activities as risk management processes (Conway, 1996). Since then, a large number of works have defined types of threats to the preservation of digital content.1 In this section, we review prior work in the development of threat models for digital preservation. We restrict our attention to the most visible or widely-used models, so we cannot claim that our review is exhaustive. However, it is our belief that we have addressed the models most familiar to digital preservation practitioners today. Please see the Appendix for a list of threat models included in our review.
A review of the digital preservation literature was conducted to identify a strong, widely-applicable preservation threat typology that supports risk assessment on a variety of scales. This review suggests existing work on digital preservation threats falls into three groups:
The first group paints a detailed picture of threats associated with a specific aspect of digital preservation, such as threats to file formats (Arms and Fleischhauer, 2003; Lawrence, Kehoe, Rieger, Walters, and Kenney, 2000; Rog and van Wijk, 2007; Stanescu, 2005); storage (Wright, Addis, and Miller, 2009); and threats associated with particular content types such as web resources (McGovern, Kenney, Entlich, Kehoe, Buckley, 2004). These models are well suited to risk assessments that center around a particular type of content or system component. These models may complement, but do not substitute for, a more general model of threats to successful digital preservation.
The second group consists of case studies that describe the use of existing digital preservation threat typologies, risk assessment methods, or risk management principles at specific institutions. These case studies include examinations of actualized digital preservation threats at the Library of Congress (Littman, 2007) and the organizational effects of conducting a risk assessment for hand-held media at the British Library (McLeod, 2008). Rather than presenting new threat models, the contribution of these case studies is the commentary on and demonstration of some of the practical uses of existing threat models and risk assessment methods. Such case studies can help repositories better understand and apply a chosen threat model or risk assessment method.
The third group identified in the review presents general models of digital preservation threats. Structurally, these threat typologies range from hierarchical taxonomies (such as Barateiro, Antunes, and Borbinha, 2009) to lists of threat types (Clifton, 2005; PARSE.Insight consortium, 2010; Rosenthal, Robertson, Lipkis, Reich, Morabito, 2005; Thomaz, 2006), to narrative-based characterizations of different threats (Cornell University & ICPSR, 2003; The National Archives, 2009).
The conceptual model for preservation developed as part of the Planets project also includes several subclasses of risks and represents the relationship between risks, objects, and environments (Dappert, 2009). The Digital Repository Audit Method Based on Risk Assessment (DRAMBORA) guides repositories through a risk assessment methodology through which the repository develops repository-specific lists of risks. DRAMBORA also includes a hierarchical taxonomy of threats to spark the thinking of those trying to identify risks in their own context (Digital Curation Centre & DigitalPreservationEurope, 2007). The requirements of the Trusted Repositories Audit Checklist (OCLC/CRL, 2007) and the Audit and Certification of Trustworthy Digital Repositories, a draft ISO standard (Consultative Committee on Space Data Systems, 2009), implicitly describe threats by defining and categorizing measures to be taken by repositories in order to mitigate against major threats. The variety of shapes, sizes, and presentations of these models reflect their varied contexts and purposes. In order to benchmark a comparison across the general threat models (i.e., the third group), we identified four characteristics we felt a flexible and light-weight digital preservation threat model could be reasonably expected to possess:
Conceptual clarity A typology of digital preservation threats should maximize conceptual clarity and avoid ambiguity and redundancy by clearly defining and organizing threats in a simple and consistent manner. In the literature review, we found that digital preservation threat models often include some threats framed in terms of potential sources of failure and others framed in terms of results of failure. A focus on sources identifies threats by analyzing the individual components of a system (e.g. software, storage, personnel) and considering potential failures attached to each component (e.g. "software faults," "software obsolescence"). A focus on results, on the other hand, begins with the assumption that the failure has occurred, and identifies threats by the nature of the ensuing problem (e.g., "loss of confidentiality of information," "corrupted bit sequences," etc.) A third approach, less frequently adopted, can be called outcome-based analysis. This begins by identifying desired outcomes and works backward to identify threats that might prevent the achievement of these outcomes.2 There are use cases for which mixing threats framed from multiple perspectives may be appropriate or even helpful. For example, DRAMBORA's generic register mixes source-based threats (e.g. "R34. Media degradation or obsolescence") and outcome-based threats (e.g. "R54. Loss of authenticity of information") to good purpose because in this case the intent is to stimulate thought about threats from many angles to help repository staff craft a customized list tailored to their own environment. In many cases, however, framing threats in multiple ways in a single model creates confusion and complication due to category overlap and conflation. For example, structuring a preservation plan around a threat model with a mixture of source-focused, results-focused, and outcome-based threats could result in unclear boundaries between sections of the plan. 
Threats framed in different ways may in fact cover the same or nearly the same issues, causing duplication of effort, at least in the analysis stage (e.g. "Loss of authenticity of information" or "Hardware failure or incompatibility" could both cover an inability to verify provenance due to loss of a server that holds key metadata). Mixing threats framed in different ways is also problematic if a taxonomy is used as a basis for quantitative analysis of threats or is used in conceptual modelling or mapping exercises. Conceptual clarity is a feature of several models including TRAC, which takes the form of a list of requirements (mitigations) aimed at guarding against key threats. Though threats are implicit in TRAC's criteria, all requirements focus consistently on sources: that is, they are largely framed in terms of failure or loss of system components. The threat models presented by Barateiro, Antunes, and Borbinha; Rosenthal, Robertson, Lipkis, Reich, & Morabito; Thomaz; Dappert; and Clifton also possess a high degree of conceptual clarity. Appropriate detail and consistent granularity Threats should be described in sufficiently detailed terms to support easy application to real-world digital archiving contexts. Threats which are too general in their definition, or are only vaguely described, will require significant effort in order to adapt them to the purposes of a risk assessment exercise. Threats should also be described at a consistent level of granularity throughout the areas the model purports to address. For example, the model should not include a broadly-defined threat category such as "Economic Threats" alongside a long list of detailed technical threats like "Insufficient Metadata Collected at Time of Ingest". In hierarchical models, no threat should have a parent/child relationship with another threat at the same level. Threats at the same level should have comparable degrees of complexity and importance. 
Determining if a model's threat categories are of consistent granularity can be difficult, since there is no agreed-upon measure of the complexity or importance of various areas of generic preservation threats. It is possible, however, to observe large imbalances in typologies, especially when comparing across several models. If a repository uses a threat typology with imbalanced granularity in developing or testing a preservation strategy, tool, or system, the product of that development or testing may reflect the skew of the threat typology: it may have insufficient guards against important threats not well represented in the model and an overabundance of guards against minor threats that are explored in great detail. Our literature review found some models were scoped to include both threats to digital content and threats to the repository organization itself, but tended to explore technical threats in comparatively greater detail than other significant areas of threats including legal, organizational, or economic threats. Our review did, however, identify several models that were appropriately and consistently granular across their defined scope. These models include DRAMBORA, TRAC, the digital preservation tutorial developed by Cornell University Library and ICPSR and the threat models developed by the PARSE.Insight consortium, The National Archives UK, and Clifton. Comprehensiveness Major categories of threats should not be omitted from a model unless the model explicitly declares these out of scope. By using a model that is not comprehensive within its defined scope, a preservation activity may miss a key threat type. For example, if a threat model that focuses on sources of threats to digital content makes no mention of threats posed by data carriers, a preservation activity may not identify and manage threats associated with storage of information on CDs or floppy disks. 
Several threat models were found to be good examples of models that are comprehensive within their scope, including DRAMBORA's generic risk register. Since it is a synthesis of several threat typologies, it is not surprising that DRAMBORA covers a reasonably full spectrum of threats. TRAC, the draft ISO digital repository audit checklist (Audit and Certification of Trustworthy Digital Repositories), and models by Barateiro, Antunes, & Borbinha; Rosenthal, Robertson, Lipkis, Reich, & Morabito; Cornell University Library & ICPSR; PARSE.Insight; and Dappert, also presented comprehensive threat typologies.
Simplicity
A threat typology should be easy to understand and use. If a typology is complex, it should be modular or convertible to a more lightweight and/or straightforward form. We recognize there are uses for both simple and complex models, and that some use cases may require a richness, level of detail, or contextualization that can only be accommodated by a complex model. For example, complex digital preservation threat typologies can be useful for large and thorough repository audits. The key is that the model be fit for the purpose of its use. For some uses, we suspect there is a point of diminishing returns where additional complexity does not yield added insight, or not enough to justify the investment in time and resources needed to implement the model. This can be the case when the use case is the design or testing of a smaller tool or preservation service, or during the ongoing risk monitoring recommended by ISO 31000 (ISO 31000, 2009). The models developed by Barateiro, Antunes, & Borbinha; Rosenthal, Robertson, Lipkis, Reich, & Morabito; Thomaz; PARSE.Insight; Clifton; the National Archives; and Dappert appear to be simple and easy to use, and would support quick or focused risk assessments. See the Appendix for a summary comparison of all the models discussed above.
In sum, our literature review suggests that none of the threat models we analyzed possesses all of the attributes defined above: conceptual clarity, appropriate detail and consistent granularity, comprehensiveness, and simplicity. We therefore propose a new typology of preservation threats to digital content that: 1) possesses these four key attributes, 2) focuses on key preservation goals, and 3) is flexible enough to support a range of assessment use cases.
3. Background to the Model
An assumption underpinning our analysis is that any threat model used in the evaluation or design of long-term digital preservation repositories should focus specifically on risks affecting the ability of the repository to carry out its core responsibilities. Consequently, it is necessary to identify a set of core responsibilities applicable across a wide range of repository settings. We approached this by examining key elements of the literature underpinning the development of current digital preservation standards. Preserving digital information: Report of the task force on archiving of digital information (Waters, 1996) has been called "the seminal report in digital preservation" and noted as a landmark report by many (Digital Library Federation, 2003; Marcum, 2002; Li and Banach, 2011). In this influential document, preservation archives are defined as repositories responsible for ensuring the integrity and long-term accessibility of our digital heritage. "Integrity" in turn is defined as the maintenance of the content, fixity, reference, provenance and context of information objects. This terminology made its way into the information model of the OAIS Reference Model (Consultative Committee on Space Data Systems, 2012), which bundles content with Preservation Description Information in order to document fixity, reference, provenance and context.
The responsibilities of an OAIS include maintaining content and Preservation Description Information, ensuring that content is independently understandable to a designated community, and making it available to that community. Archival responsibilities in an OAIS can therefore be summarized as maintaining fixity, reference, provenance, context, understandability and availability of content. A Metadata Framework to Support the Preservation of Digital Objects, based on the OAIS information model, described preservation metadata as the information needed to support the digital preservation process. The digital preservation process included maintaining the viability, renderability and understandability of digital objects over time. (OCLC/RLG Working Group on Preservation Metadata, 2002) "Viability" ensured "that the archived digital object's bit stream is intact and readable from the digital media upon which it is stored," thus combining the idea of fixity (intactness) with the property of being readable from media. The concept of renderability was an important addition to the canon, as it encapsulated to some extent the discussion of content in the Waters report as the "knowledge or ideas the object contains," recognizing that it might be necessary to transform the original bits of an object in order to ensure that its content can be rendered (delivered) with current technologies. In a follow-on effort to the Metadata Framework report, the PREMIS Working Group detailed the functions of a preservation repository as "maintaining viability, renderability, understandability, authenticity, and identity in a preservation context." (PREMIS Working Group, 2005) Here viability and renderability came directly from the Metadata Framework report, while understandability and identity (reference) come from OAIS. The new requirement, authenticity, combines the concepts of provenance and context in the Waters/OAIS models. 
Subsequent works have cited all of these functions as specific goals of digital preservation (Bradley, 2006; Caplan, 2008; Dappert, 2009; Fojtu, Hutař & Pavlásková, n.d.). Bringing this full circle, the 2009 revision of the OAIS Reference model (Consultative Committee on Space Data Systems, 2009) specifically included authenticity as a goal of long-term preservation.
For the purpose of this analysis, we posit that the function of a preservation repository is to ensure the availability, identity, persistence, renderability, understandability, and authenticity of digital objects over time. The term "persistence" is used in place of "viability" as a more natural way to denote the quality of being a good, readable bitstream that has not been altered over a period of time. While we acknowledge that other properties could be added to this list, we feel this is a reasonable set and that it reflects the key outcomes of digital preservation highlighted in the literature.
Given these essential properties of successful digital preservation, it follows that a threat to a repository is a circumstance that negatively impacts the availability, identity, persistence, renderability, understandability, and/or authenticity of a preserved digital object. If a threat materializes, one or more of these properties can be lost or impaired. Not all threats will materialize, however, and not all would result in the same degree of harm if they did. Risk is the likelihood that a particular threat will materialize weighted by the magnitude of the damage that it would cause. Risk management, therefore, is collectively the actions taken to minimize the likelihood that a threat will materialize, and/or to minimize the damage done if the threat does materialize.
4. The SPOT model for risk assessment
The Simple Property-Oriented Threat (SPOT) Model for Risk Assessment defines six essential properties of successful digital preservation: availability, identity, persistence, renderability, understandability, and authenticity. For each of these properties, a set of threats is identified which, if manifested, would seriously diminish the ability of the repository to achieve the property in question. The threats are described at a high level, and focus on outcome: that is, the aspect of the threat that impacts the preservation property with which it is associated. As such, the SPOT Model is an outcome-based typology of threats that individual custodial institutions can use in evaluating their own situational risk and risk mitigation strategies.
It is important to keep in mind that the model does not focus on the specific causes of a threat. A given threat can arise from a variety of sources, depending on local circumstances. For example, one threat mentioned in the model is that sufficient metadata is not captured or maintained to support long-term preservation. This threat could be actuated by many agents and many circumstances, such as the creator of the digital object, the object's chain of custody prior to archival deposit, the repository submission agreement, the repository ingest process, and so on. The SPOT Model only outlines the nature of the threat in terms of its impact on its associated preservation property; this can then serve as a starting point for repository staff to consider which causations are most likely and which aspects of their local digital archiving system are most susceptible to this threat. To aid this discussion, each section of the model ends with a brief suggestion regarding the parts of the digital preservation process most relevant to a particular property-threats combination.
Availability
Availability is the property that a digital object is available for long-term use.
In order to ensure availability, the digital object must be ingested into, and subsequently maintained by, a preservation repository. While there could be physical barriers to this, more often it is a question of the priority that decision-makers attach to its long-term value or the permissions granted by those who control the intellectual property rights associated with the object. Threats:
In summary, the major threats to availability reside in pre-repository care, selection policy, and rights management. Identity Identity is the property of being referencable. Identity distinguishes an object from other objects in a group and allows an object to be discovered and retrieved. A limited amount of metadata (e.g. name, unique identification number, date, version number, creator) is often all that is required for the purposes of identification and disambiguation, as opposed to more extensive information that may be necessary to support understandability (see below). In the OAIS information model, this is called reference information. Identity is contextual: some objects are associated with information that allows identification only within a limited context (e.g., an object may be uniquely identified only within the context of objects residing on the same server), while others have enough information to make them globally identifiable (e.g., a global identifier such as a GUID or ISBN). Creation of sufficient metadata to support identity may be the responsibility of the repository or of other agents, such as the creator of the object. Note that identity in this sense does not refer to maintenance of characteristics such as content, context, appearance, structure or behaviors. Ensuring that an object retains its significant characteristics3 is part of renderability. Threats:
The major threats to identity come in pre-repository care and generation and maintenance of descriptive and structural metadata. Persistence Persistence is the property that the bit sequences comprising a digital object continue to exist in a usable/processable state, and are retrievable/processable from the medium on which they are stored. All digital objects exist as a series of bits stored on some form of physical medium, such as magnetic tapes, optical discs such as CDs and DVDs, or hard drives on servers or personal computers. In order for the digital object to remain useful over time, it is essential that the bit sequences are not corrupted in any way, and that they can be read in their entirety from the physical media on which they are stored. Persistence is achieved when these two conditions are met. Threats:
In summary, the major threats to persistence reside in physical media management, media refreshment policy, hardware migration policy, and data security policy.
Renderability
Renderability is the property that a digital object is able to be used in a way that retains the object's significant characteristics. Human and machine use of a digital object depends upon interpretation of the bitstream by an appropriate combination of hardware and software. An appropriate hardware and software environment allows users to interact with (view, listen, query, etc.) the object in a way that retains characteristics of the original object that are deemed important by stakeholders (Dappert and Farquhar, 2009). For example, if the only software available to open a particular image file is a text editor, the object cannot be considered renderable, as a text editor rendering of the object would not preserve a characteristic likely to be important for nearly all users of an image: its appearance. Content, context, appearance, and behavior are common categories of stakeholder requirements for digital objects (Wilson, 2007). It is important to note that while the bits of an object might change due to preservation actions such as migration, the object is still renderable if it can be used in a way that preserves its significant characteristics. Threats:
In summary, the major threats to renderability reside in format management workflows and policies, including preservation strategies and repository knowledge of its stakeholder community. Understandability Understandability requires associating enough supplementary information with archived digital content such that the content can be appropriately interpreted and understood by its intended users. Good metadata is one way to ensure understandability, although it is important to keep in mind that the metadata needed to establish understandability often goes well beyond what is required to establish identity. For example, a data file of survey results from a Roper Poll may be adequately identified by its title, dates conducted, date of publication, sponsoring agency, and key number. However, for the data set to be understandable to future users, the entire associated codebook or equivalent information is required. The codebook, in turn, may require additional metadata and/or supplementary material to be understandable itself. The entire network of materials required to interpret the preserved object is known in OAIS parlance as the "representation network". Understandability is closely tied to the OAIS concept of "designated community" (intended users), because it is generally infeasible and usually unnecessary to make everything understandable to everyone. Generally speaking, the goal should be to archive enough metadata and other materials to make archived content understandable to members of its designated community. Threats:
In summary, major threats to understandability lie in the repository's knowledge of the characteristics of the community of current and future users, and metadata capture and retention policies. Authenticity Authenticity is the property that a digital object, either as a bitstream or in its rendered form, is what it purports to be. It is important to assure current and future users that the digital object managed and disseminated by the repository is a faithful replica of the digital object that was originally ingested into the repository; or alternatively, that any modifications to the original digital object that have occurred since ingest have been carefully documented. Information pertaining to a digital object's authenticity is often contained in the metadata bundled with the object, and might include documentation of: the contents of the digital object; the digital object's provenance, including the object's original source/creator and the chain of custody prior to ingest; and any alterations that have been made to the digital object during the period of archival retention. Threats:
In summary, the major threats to authenticity reside in metadata collection and management practices, security procedures, and workflow documentation procedures and policies.

5. Comments on the model

The SPOT Model is intended to provide a simple framework within which to organize and carry out a risk assessment in a wide range of repository contexts. In designing the model, the goal was to produce a risk assessment tool that embodied the characteristics of conceptual clarity, balanced granularity, comprehensiveness, and simplicity. It is useful to revisit these attributes and consider how the SPOT Model addresses them.

The first desired attribute is conceptual clarity, which is achieved by including only outcome-based threats organized around key objectives of successful digital preservation. The model thereby avoids the overlap and confusion that can arise when threats are framed from different perspectives (e.g., by source, function, objective) in the same model. A conceptually clear model should also draw plain distinctions between categories. The boundaries between abstract and interrelated preservation properties such as "authenticity" and "understandability" can be difficult to define. However, the SPOT Model attempts to provide clear definitions drawn largely from concepts that have been debated and refined in the digital preservation literature over the past 15 years.

The second desired attribute is balanced granularity. The proposed model achieves balanced granularity by describing properties and threats at a comparable level of detail. No threat combines activities or ideas that are embodied in another threat and no property has a parent/child relationship with another property. The model also maintains a consistent and comparable level of detail by focusing exclusively on first-order threats to digital content (i.e. ultimate causes that directly impact a property, causing a loss of authenticity, understandability, etc.)
rather than second-order threats (i.e. proximate causes that start a chain of actions that culminates in one of the primary threats). For example, an improperly trained employee may bring a can of soda into a server room, be startled by the opening of a nearby door, and spill the soda on a hard drive, causing the hard drive to stop functioning, resulting in a loss of persistence for digital objects residing on the drive. Though it is implausible to plan for every step in a chain of causation such as this, it is important to plan for the primary, general threat of "improper/negligent handling or storage." The third desired attribute is comprehensiveness. The SPOT Model approaches comprehensiveness from the perspective of the attributes of successful digital preservation. Based on a review of the literature, the model is organized around a set of properties that represent a widely-accepted view of the general characteristics of well-preserved digital objects. The threats enumerated in the model are derived from these properties. The fourth and final desired attribute is simplicity. Other threat assessment models designed primarily for complete repository assessments can be quite complex and require significant time and resources to understand and implement. For example, the DRAMBORA approach takes as a starting point an organization's own mission and goals, as opposed to a general list of properties of successful digital preservation. To conduct a DRAMBORA analysis, the repository must first characterize and document its mission, then list its constraints and objectives, and finally itemize activities and assets. A risk register is then created in which risks related to each activity are identified and assessed. The DRAMBORA toolkit provides a list of 80 "off-the-shelf" risks to help implementers get started. 
In contrast, the SPOT Model is based on a widely-accepted list of properties characterizing well-preserved digital objects; each of these properties is then associated with a small set of high-level threats. Threat assessment consists of identifying factors in the local environment that could potentially manifest these threats. Risk management can then focus on reducing the likelihood or impact of these factors. This process reduces the complexity and resources required to undertake a risk assessment procedure. The next section illustrates this point in more detail with several examples.

Finally, it should be emphasized that the SPOT Model does not explicitly account for general threats residing in the repository's operating environment that, while not directly embedded in the digital preservation process, nevertheless cut across all of the properties in the model. For example, economic threats, such as budget cuts and other interruptions of funding, might severely reduce a repository's ability to take actions to mitigate any of the threats in the model. Similarly, a diminishment in an organization's commitment to long-term preservation also potentially impacts a repository's incentive to address the threats specified in the model. Although these general threats are not associated with any specific preservation property, and are therefore not explicitly stated in the model, repository staff should nevertheless be aware of them and consider their impact on risk management strategies. The authors would encourage the development of an outcomes-based model or models addressing threats operating on custodial organizations to supplement the SPOT Model's focus on threats operating on content.

6. Practical application

The SPOT Model is intended to be a practical tool for repository staff. The following examples illustrate how the model can be used to support risk assessment workflows in several repository contexts.
The examples were derived from the personal experiences of two of the co-authors of this paper, which helped inform the design of the model. Both examples describe how application of the SPOT Model would simplify and improve existing risk assessment procedures.

Statistics New Zealand

The SPOT Model offers a simple yet comprehensive list of primary threats that a trusted digital repository should guard against. The model's concise and focused nature makes it particularly useful in brief risk assessment and management activities such as those conducted during tool or workflow design/redesign projects. A recent process review at the Statistics New Zealand Data Archive provides an example of how the model might be used to support small-scale risk assessment.

The Statistics New Zealand Data Archive is part of New Zealand's national statistical office. The Data Archive preserves valuable statistical data, including economic, labor and industry, and population data produced by Statistics New Zealand and other public agencies. Archive staff recently conducted a review of the Data Archive's ingest workflow which included an assessment of risks associated with the current process. The review was expected to result in small process adjustments rather than major changes, and was conducted by several staff members over the course of a couple of weeks alongside other duties. Staff did not use formal risk analysis tools during the review; instead, the process relied on staff knowledge and was loosely structured around the current ingest workflow. Staff briefly explored use of DRAMBORA Interactive to help organize the review, but determined that performing a thorough DRAMBORA repository assessment would require more time than could be allotted during this smaller workflow review. Use of a simple tool such as the SPOT Model, however, would have provided a systematic way of finding and assessing gaps in the current workflow.
Two examples of how the model might have supported the process review are noted below.

1. Systematically evaluate, threat by threat, whether any ingest workflow steps might cause or increase the likelihood of a threat and, if so, whether appropriate mitigations are in place. For example, one threat to Identity is that "linkages between metadata and the objects that the metadata describes are not captured or maintained." In the Statistics New Zealand workflow, this could be caused by a number of errors during the workflow step of transferring files from an intermediary ingest processing area to a final ingest area. A staff member might fail to copy a complete set of files to the new location, in which case the associated descriptive and structural metadata would reference a file that was not in the expected final location. Alternatively, all files may be successfully transferred, but the associated metadata might fail to reference one of the files. A brief analysis would indicate that the threat is mitigated most economically at the end of the process, by adding a quality assurance step which includes both a visual check and a checksum comparison to validate that the files in the intermediary ingest processing area match the files in the final ingest location. This file transfer step has also been identified for automation in a new ingest workflow tool being developed, which would further diminish the likelihood of this threat occurring.

2. Identify gaps in the archiving workflow. Staff would list primary mitigations against each threat in the model in the context of ingest or any other stage. This would reveal whether the repository had insufficiently guarded against any major threats. If there was little or no mitigation against a threat, additional steps aimed at reducing the likelihood or impact of that threat could be added to the workflow.
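The gap analysis described in step 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the SPOT Model or of any Statistics New Zealand tooling: the `register` structure and the `find_gaps` helper are assumed names, the mitigation entries are shorthand for measures mentioned in this section, and the final entry is a hypothetical placeholder included only to show what an unmitigated threat would look like.

```python
from typing import Dict, List, Tuple

# Keys are (property, threat) pairs; values list the mitigations staff have recorded.
# Threat wording is abbreviated from the paper; the last entry is a hypothetical
# placeholder and does not correspond to any threat in the SPOT Model.
Register = Dict[Tuple[str, str], List[str]]

register: Register = {
    ("Identity", "linkages between metadata and objects are not captured or maintained"): [
        "QA step after file transfer: visual check and checksum comparison",
    ],
    ("Authenticity", "object is altered during archival retention and the change goes unrecorded"): [
        "monthly fixity check against ingest checksums",
        "read-only archival storage with restricted staff access",
    ],
    ("Example property", "hypothetical threat with no recorded mitigation"): [],
}


def find_gaps(reg: Register) -> List[Tuple[str, str]]:
    """Return the (property, threat) pairs that have no recorded mitigation."""
    return [key for key, mitigations in reg.items() if not mitigations]


for prop, threat in find_gaps(register):
    print(f"Gap: {prop}: {threat}")
```

Any pair reported by `find_gaps` would be a candidate for an additional workflow step, or for an explicit decision that the risk is acceptable.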
This point can be illustrated by considering one of the threats to authenticity: "A digital object is altered during the period of archival retention (either legitimately, maliciously or erroneously), and this change goes unrecorded". Staff could have quickly listed several primary mitigations against this threat. In order to make a legitimate change to an object, such as might occur during a migration, Data Archive procedures mandate that the object be disseminated from the archive and altered, that preservation metadata be recorded, and that the object be re-ingested into the archive after undergoing a quality assurance process that verifies that the metadata describing the changes have been recorded. A monthly fixity check, which compares checksums of all files in the archive against their ingest checksums, will reveal whether an archived object has changed, prompting an investigation into the change. Accidental alteration is guarded against by limiting the number of staff who have access to the objects in archival storage and by setting the objects in storage to "read only" status. External malicious alteration is guarded against by robust organizational IT security measures (e.g. firewalls, passwords, etc.). Internal malicious damage is addressed by all of these measures, but no further steps are taken to mitigate this threat because it has been determined to be highly unlikely.

These are brief examples aimed at illustrating how the SPOT Model's lightweight structure and comprehensive, property-based list of threats make it useful for simple and quick threat assessment processes. The results of such processes can then be used to inform the design or redesign of tools and workflows within the digital preservation system.

Florida Digital Archive

The Florida Digital Archive (FDA) is a long-term digital repository run by the Florida Center for Library Automation (now part of the Florida Virtual Campus) for the use of the eleven publicly funded universities in Florida.
The FDA has been in operation since 2006 and stores roughly 300,000 Archival Information Packages. It uses format transformation (normalization and forward format migration) as the primary preservation strategy against format obsolescence. The FDA has a relatively stable infrastructure, but like all preservation repositories it is subject to circumstances beyond its control, including natural, economic, and technology events. Inventorying and categorizing all FDA activities and assets, and then identifying and assessing all associated threats and risks would be a Herculean task, as past experience has proven. For example, the process of getting a package of content into the archive has three major phases: transmittal, submission, and ingestion. Even the simplest phase, transmittal, has four methods, each of which has a different sequence of steps carried out by the producer, archive operators, and/or software. Analysing the threats and risks involved in each of these steps would require more than eighteen separate assessments. If the steps required for submission and ingestion are added to the total, as well as those associated with the many other functions of the FDA, the result would be hundreds of assessments. The property-based approach of the SPOT Model reduces the number of necessary assessments considerably, because rather than beginning with an inventory of activities and assets, the model bases its assessment on a small number of crucial preservation properties. This makes the analysis much more manageable, because it is limited to the finite number of threats which impact these properties in the context of the FDA environment. To take persistence as an example, the property-oriented threats are:
The FDA outsources facilities management to two regional computing centers, both of which have hardware and software to run the system and a complete, redundant data store. Therefore, the first threat, improper handling or storage conditions, is guarded against by negotiating appropriately designed service level agreements with the computing centers, and monitoring adherence through periodic reviews. Improper storage conditions could also be caused by dramatic physical damage to computing center facilities, which in the context of the FDA would most likely occur from fire, deliberate sabotage (e.g. a bomb threat) or a category 5 hurricane (facilities are hurricane-proof up to category 4). Other types of natural disasters are unlikely in this area. The second threat is addressed by tracking storage media use, especially computer tape, against conservative estimates of mean time to failure. The Tivoli system used for tape storage will automatically replace primary tapes when an error is detected and refresh the contents from backup. Ongoing fixity checking in which the entire tape is read ensures that latent errors do not go unnoticed for long (even as it exacerbates the risk of media failure by increasing wear and tear on the tape). The third threat, technological obsolescence, is more relevant to end-user media than to professionally maintained enterprise storage systems, and is not really a factor for the FDA. All storage devices and storage management applications are well-supported, enterprise level components from major vendors, so that any intent to discontinue a product is made known well in advance, leaving sufficient time to design and implement a migration path. The fourth threat, malicious damage, is partially guarded against by strict access controls at the computing center facilities including limited access, self-locking doors, frequently changed passwords, and the standard set of internal security procedures.
Damage by a renegade employee is always a remote possibility, as is deliberate attack from the outside, such as a bomb threat. To some extent this is ameliorated by the same steps taken to ensure proper environmental conditions.

The final threat, inadvertent damage via hardware, software, or human error, has a relatively high risk of occurrence and is difficult to guard against, as such errors introduced into data can be propagated to backup copies. For example, a bug in DAITSS, the repository software underlying the FDA, could theoretically alter files when a package is refreshed. This is guarded against by a test suite that must be executed before moving application changes to production, controlled change procedures, and a delay in writing the backup copy of each master. Heterogeneity is a good defense against propagation of storage system hardware and software errors, so it is wise to keep copies of data on different devices, preferably by different manufacturers. The FDA uses different technologies for tape and disk, but all disk is IBM SATA and all tape is controlled by Tivoli, which introduces a vulnerability. A hardware review revealed that the robotic tape library is the most error-prone component, as tapes get jammed in the unit periodically and can be physically damaged.

Since the probability of these threats cannot be brought to zero, steps are in place to minimize the damage that can occur to the persistence of stored files. Viability (readability from media) and integrity are monitored regularly; in general, no file exceeds 21 days without a fixity check. A faulty file can be repaired from any of three copies in two different geographical locations. Loss of data via threats to persistence could occur only in the event that all three additional copies were damaged before the faulty file could be restored. There are algorithms that can calculate this probability when all devices are independent and only the second threat is considered.
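As a rough illustration of the kind of calculation mentioned above, the sketch below estimates the probability that every copy of a file suffers a media failure within a single fixity-check window. It assumes independent copies with exponentially distributed lifetimes, which is a simplifying modelling assumption rather than the method used by the FDA. The mean time to failure is a hypothetical value chosen only for illustration, while the 21-day window and the three copies come from the description above. Under these assumptions the probability is (1 - exp(-t / MTTF)) raised to the number of copies.

```python
import math


def p_copy_fails(window_days: float, mttf_days: float) -> float:
    """Probability that one copy fails within the window (exponential lifetime assumed)."""
    return 1.0 - math.exp(-window_days / mttf_days)


def p_all_copies_fail(window_days: float, mttf_days: float, copies: int) -> float:
    """Probability that all copies fail within the same window, assuming independence."""
    return p_copy_fails(window_days, mttf_days) ** copies


WINDOW_DAYS = 21       # from the text: no file exceeds 21 days without a fixity check
COPIES = 3             # from the text: a faulty file can be repaired from any of three copies
MTTF_DAYS = 5 * 365    # hypothetical mean time to failure, for illustration only

print(f"P(one copy fails in window)  = {p_copy_fails(WINDOW_DAYS, MTTF_DAYS):.6f}")
print(f"P(all copies fail in window) = {p_all_copies_fail(WINDOW_DAYS, MTTF_DAYS, COPIES):.3e}")
```

The independence assumption is what makes the result so small; correlated events of the kind discussed in this section, such as a shared software fault or a regional disaster, are exactly the cases such a calculation does not cover.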
The possibility of renegade employees deliberately coordinating attacks on FDA data in both computer centers can be assumed to be negligible. However, the probability of category 5 hurricanes striking both Gainesville and Jacksonville is real, as is the chance of a natural disaster occurring at one site while a tape robot accident simultaneously occurs at another. This argues for establishing a third site elsewhere in the country.

This review of threats to persistence clearly requires fleshing out. However, a risk assessment of FDA operations was conducted in June and July 2011 according to the SPOT Model at this general level of detail, and it exposed a number of weaknesses in FDA procedures which will be addressed. It seems reasonable to conclude that an inventory of all FDA activities and assets is not necessary to arrive at a useful assessment of threats to objects stored in the repository.

Conclusions

The SPOT Model described in this paper is intended to be a practical tool for repository managers as they identify the sources of risk to the digital materials in their custody and develop strategies to mitigate these risks over time. In particular, the model could serve as a checklist for identifying threats relevant to a particular repository implementation, and prioritizing resources for mitigating them. In addition to this, the model can also serve as the foundation for several additional applications.

One potentially useful area of application would be a mapping of the threats specified in the model to the OAIS functional and information models. Many digital preservation repositories are based on the concepts, high-level workflows, and data architectures described in OAIS; therefore, it would be useful to map the threats specified in the model to the points within the OAIS framework with which they are most likely associated.
This would help repository managers identify the components of their repository and data architectures which are most vulnerable to the threats defined in the model. Another application of the model would be to use it as the foundation for a formal repository certification process, performed by an external auditing agency, to assess the robustness of a repository's response to the primary threats it faces. It is important that a repository demonstrate to its clients that it has taken steps to diminish the likelihood of certain threats occurring to the digital content in its custody. The comprehensive nature of the SPOT threat taxonomy makes it applicable across a wide range of repository contexts, and therefore suitable as the basis for a generalized certification process for identifying threats and assessing mitigation strategies. One potential application of the SPOT Model in this regard could be the fulfilment of the TRAC requirement that a "repository has ongoing commitment to analyze and report on risk", and in particular, to provide evidence of this analysis through implementation of "risk management documents that identify perceived and potential threats and planned or implemented responses (a risk register)" (OCLC/CRL, 2007). The SPOT threat taxonomy proposed in this paper could serve as the basis for a TRAC-related risk register. The previous suggestions for future work are of a practical orientation, in the sense of being oriented toward repository staff responsible for managing the day-to-day operation of a repository service. The SPOT Model would also be useful as a basis for more research-oriented applications. One such application is a study of how various organizational forms might serve to either mitigate or amplify the threats specified in the SPOT Model. 
An organizational form is the structure that brings together digital preservation responsibilities, strategies, funding, governance, and infrastructure in particular ways to meet long-term preservation goals. For example, a LOCKSS network is an organizational form; a third-party digital preservation service would be another; a public/government agency would be another; an internal archive would be another. Each organizational form might have particular strengths and weaknesses in terms of dealing with threats: for example, a PLN (Private LOCKSS Network) might be strong on technical threats to files in storage because of the inherent redundancy built into the system; it might be weak in dealing with threats arising from lack of assured long-term commitment (each node in the network can drop out at any time; they are not bound by a legally enforceable SLA).

Finally, the risk taxonomy might serve as the basis for an empirical study aimed at identifying how important particular threats are in practice. For example, a survey could be conducted across a sample of repository managers, asking them to rank the threats specified in the taxonomy in terms of practical relevance to their repository implementation. From this, an assessment can be made regarding the importance of various threats in real-world implementations (accounting, of course, for contextual differences across implementations). The importance of digital preservation is often conveyed by appealing to the threats that could potentially occur in regard to digital materials, but some of these threats may be less relevant from a practical standpoint. An empirical assessment of the threats most relevant and immediate to long-term digital preservation would aid repository managers in allocating their resources appropriately in developing strategies to mitigate them.

Acknowledgements

The authors would like to thank Hamish James of Statistics New Zealand for his thoughtful feedback on this paper.
Notes

1 Inconsistent terminology is used in digital preservation literature to describe the ideas of a potential problem or system weakness. Some works use words such as threat, risk, and vulnerability interchangeably; others make careful distinctions between them.

2 Katia Thomaz (Thomaz, 2006) has observed a similar breakdown in the way digital preservation issues or threats are described in the literature. Thomaz identifies three different views on digital preservation issues: "functions" (e.g. "ingest"); "challenges" (e.g. "high technological obsolescence"); and preservation requirements (e.g. "preserving context").

3 Dappert and Farquhar define significant characteristics as "Requirements in a specific context, represented as constraints, expressing a combination of characteristics of preservation objects or environments that must be preserved or attained in order to ensure the continued accessibility, usability, and meaning of preservation objects, and their capacity to be accepted as evidence of what they purport to record." (Dappert and Farquhar, 2009)

References

[1] Arms, C., & Fleischhauer, C. (2005). Digital formats: Factors for sustainability, functionality, and quality. Archiving 2005. Washington, DC: The Society for Imaging Science and Technology.
[2] Barateiro, J., Antunes, G., & Borbinha, J. (2009). Addressing Digital Preservation: Proposals for New Perspectives. In DP'09 First International Workshop on Innovation in Digital Preservation.
[3] Bradley, K. (2006). Digital Sustainability and Digital Repositories. VALA 2006 conference. Melbourne, Australia.
[4] Caplan, P. (2008). The Preservation of Digital Materials. Library Technology Reports, 44(2).
[5] Clifton, G. (2005, September). Risk and the preservation management of digital collections. International Preservation News, 36.
[6] Consultative Committee for Space Data Systems. (2009). Audit and Certification of Trustworthy Digital Repositories.
[7] Consultative Committee for Space Data Systems. (2009).
Reference Model for an Open Archival Information System (OAIS).
[8] Consultative Committee for Space Data Systems. (2012). Reference Model for an Open Archival Information System (OAIS).
[9] Conway, P. (1996). Preservation in the digital world.
[10] Cornell University Library & ICPSR. (2003). Digital Preservation Management: Implementing Short-Term Strategies for Long-Term Solutions. (Online tutorial developed for the Digital Preservation Management workshop, developed and maintained by Cornell University Library, 2003-2006; extended and maintained by ICPSR, 2007-on.)
[11] Dappert, A. & Farquhar, A. (2009). Significance is in the Eye of the Stakeholder. ECDL'09 Proceedings of the 13th European conference on Research and advanced technology for digital libraries. Berlin: Springer.
[12] Dappert, A. (2009). Report on the Conceptual Aspects of Preservation, Based on Policy and Strategy Models for Libraries, Archives and Data Centres.
[13] Digital Curation Centre & DigitalPreservationEurope. (2007). DCC and DPE Digital Repository Audit Method Based on Risk Assessment, v1.0.
[14] Digital Library Federation. (2003). Archiving Electronic Journals, edited with an introduction by Linda Cantara.
[15] Fojtu, A., Hutař, J. & Pavlásková, E. (n.d.) Tools for long-term preservation of digital documents.
[16] ISO 31000. (2009). Risk management - Principles and guidelines.
[17] Lawrence, G.W., Kehoe, W.R., Rieger, O.Y., Walters, W.H., & Kenney, A.R. (2000). Risk Management of Digital Information: A File Format Investigation.
[18] Li, Y., & Banach, M. (2011). Institutional Repositories and Digital Preservation: Assessing Current Practices at Research Libraries. D-Lib Magazine, 17(5/6). http://dx.doi.org/10.1045/may2011-yuanli
[19] Littman, J. (2007, July). Actualized Preservation Threats. D-Lib Magazine, 13(7/8). http://dx.doi.org/10.1045/july2007-littman
[20] Marcum, D. (2002). Introduction: the changing preservation landscape.
The State of Digital Preservation: An International Perspective.
[21] McLeod, R. (2008). Risk Assessment; Using a Risk-Based Approach to Prioritise Handheld Digital Information. iPRES 2008: The Fifth International Conference on Preservation of Digital Objects. London, England.
[22] McGovern, N. Y., Kenney, A. R., Entlich, R., Kehoe, W. R., & Buckley, E. (2004, April). Virtual Remote Control. D-Lib Magazine, 10(4). http://dx.doi.org/10.1045/april2004-mcgovern
[23] OCLC/CRL. (2007). Trustworthy repositories audit & certification (TRAC): Criteria and checklist.
[24] OCLC/RLG Working Group on Preservation Metadata. (2002). Preservation Metadata and the OAIS Information Model: A Metadata Framework to Support the Preservation of Digital Objects.
[25] PARSE.Insight consortium (2010). Science Data Infrastructure Roadmap.
[26] Preservation Metadata Implementation Strategies Working Group. (2005). Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group.
[27] Rog, J. & van Wijk, C. (2007). Evaluating File Formats for Long-Term Preservation.
[28] Rosenthal, D. S. H., Robertson, T., Lipkis, T., Reich, V., & Morabito, S. (2005, November). Requirements for Digital Preservation Systems. D-Lib Magazine, 11(11). http://dx.doi.org/10.1045/november2005-rosenthal
[29] Stanescu, A. (2005). Assessing the durability of formats in a digital preservation environment: The INFORM methodology. OCLC Systems & Services, 21(1), 61-81. http://dx.doi.org/10.1108/10650750510578163
[30] The National Archives (2009). Digital Continuity: An Introduction to the Wider Context.
[31] Thomaz, K.P. (2006). Critical Factors for Digital Records Preservation. Journal of Information, Information Technology, and Organizations, (1) 21-39.
[32] Waters, D., & Garrett, J. (Eds.). (1996). Preserving digital information: Report of the task force on archiving of digital information. Washington, D.C.
and Mountain View, CA: The Commission on Preservation and Access and the Research Libraries Group.
[33] Wilson, A. (2007). Significant Properties Report (v.2). Arts and Humanities Data Service.
[34] Wright, R., Miller A., & Addis, M. (2009). The Significance of Storage in the "Cost of Risk" of Digital Preservation. International Journal of Digital Curation, 4(3). http://dx.doi.org/10.2218/ijdc.v4i3.125

Appendix: Comparison of digital preservation threat taxonomies

As noted above, the models reviewed were developed for a variety of purposes and are designed for use in a variety of contexts. Our assessments are based on the four characteristics we sought in a threat model and have described in this paper.
*Because of the strong similarities between the criteria in TRAC and the Audit and Certification of Trustworthy Digital Repositories currently in development, these two models are combined here.