Digital Preservation Strategies

Many digital preservation strategies have been proposed, but no one strategy is appropriate for all data types, situations, or institutions. Below is a brief tour of the range of current options.

Bitstream Copying—is more commonly known as "backing up your data," and refers to the process of making an exact duplicate of a digital object. Though a necessary component of all digital preservation strategies, bitstream copying is not in itself a long-term maintenance technique, since it addresses only data loss due to hardware and media failure, whether resulting from normal malfunction and decay, malicious destruction, or natural disaster. Bitstream copying is often combined with remote storage so that the original and the copy are not subject to the same disastrous event. It should be considered the minimum maintenance strategy for even the most lightly valued, ephemeral data.
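
A minimal sketch of the idea in Python: make an exact duplicate, then confirm with a checksum that the copy is bit-for-bit identical to the original. The file paths are hypothetical.

    import hashlib
    import shutil

    def sha256(path, chunk_size=1 << 20):
        """Compute the SHA-256 digest of a file, reading it in chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def bitstream_copy(source, destination):
        """Duplicate a file and confirm the copy matches the original."""
        shutil.copyfile(source, destination)
        if sha256(source) != sha256(destination):
            raise IOError("copy does not match original: " + destination)

    # Hypothetical paths; a real program would also copy to remote storage.
    bitstream_copy("master.tif", "backup/master.tif")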

Refreshing—is the copying of digital information from one long-term storage medium to another of the same type, with no change whatsoever in the bitstream (e.g., from a decaying 4mm DAT tape to a new 4mm DAT tape, or from an older CD-RW to a new CD-RW). "Modified refreshing" is copying to a medium of a similar enough type that no change of concern to the application and operating system using the data is made in the bit-pattern (e.g., from a QIC tape to a 4mm tape, or from a 100 MB Zip disk to a 750 MB Zip disk). Refreshing is a necessary component of any successful digital preservation program, but is not a complete program in itself. It potentially addresses both the decay and the obsolescence of the storage media.
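
The essential check after any refresh is that the bitstream is unchanged. A sketch of that verification, assuming the old and new media are mounted at hypothetical paths:

    def verify_refresh(original, refreshed, chunk_size=1 << 20):
        """Confirm a refreshed copy is identical to the original by
        comparing the two bitstreams chunk by chunk."""
        with open(original, "rb") as a, open(refreshed, "rb") as b:
            while True:
                chunk_a = a.read(chunk_size)
                chunk_b = b.read(chunk_size)
                if chunk_a != chunk_b:
                    return False
                if not chunk_a:        # both streams ended together
                    return True

    # Hypothetical mount points for the old and new tapes.
    assert verify_refresh("/mnt/old_tape/archive.tar", "/mnt/new_tape/archive.tar")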

Durable/Persistent Media (e.g., Gold CDs)—may reduce the need for refreshing and help diminish losses from media deterioration, as do careful handling, controlled temperature and humidity, and proper storage. However, durable media has no impact on any other potential source of loss, including catastrophic physical loss, media obsolescence, and obsolescence of encoding and formatting schemes. Durable media can even endanger content by providing a false sense of security.

Technology Preservation—is based on preserving the technical environment that runs the system, including operating systems, original application software, media drives, and the like. It is sometimes called the "computer museum" solution. Technology preservation is more of a disaster recovery strategy for digital objects that have not been subject to a proper digital preservation strategy. It offers a way of coping with media obsolescence, assuming the media has not decayed beyond readability, and can extend the window of access for obsolete media and file formats. It is ultimately a dead end, however, since no obsolete technology can be kept functional indefinitely, and maintaining obsolete technology in usable form requires a considerable investment in equipment and personnel beyond the reach of any individual institution.

Digital Archaeology—includes methods and procedures to rescue content from damaged media or from obsolete or damaged hardware and software environments. Digital archaeology is explicitly an emergency recovery strategy and usually involves specialized techniques to recover bitstreams from media that has been rendered unreadable, whether through physical damage or hardware failure such as head crashes or magnetic tape crinkling. It is generally carried out by for-profit data recovery companies that maintain a variety of storage hardware (including obsolete types) plus special facilities such as clean rooms for dismantling hard disk drives. Given enough money, readable bitstreams can often be recovered even from heavily damaged media (especially magnetic media), but if the content is old enough, it may not be possible to make it renderable or understandable.
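
The block-level rescue underlying such recovery can be sketched simply; production tools such as GNU ddrescue are far more sophisticated, but the principle is to salvage every readable block and log the rest. The device and image paths are hypothetical:

    def rescue_image(device, image_path, block_size=4096):
        """Copy a failing device block by block; unreadable blocks are
        zero-filled and logged so that later passes can retry them."""
        bad_blocks = []
        with open(device, "rb") as src, open(image_path, "wb") as dst:
            offset = 0
            while True:
                try:
                    block = src.read(block_size)
                except IOError:
                    bad_blocks.append(offset)
                    block = b"\x00" * block_size
                    src.seek(offset + block_size)   # skip past the bad region
                if not block:
                    break
                dst.write(block)
                offset += block_size
        return bad_blocks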

Analog Backups—combine the conversion of digital objects into analog form with the use of durable analog media, e.g., HD Rosetta or the creation of silver halide microfilm from digital images. An analog copy of a digital object can, in some respects, preserve its content and protect it from obsolescence, while sacrificing any digital qualities, including shareability and lossless transferability. Text and monochromatic still images are the most amenable to this kind of transfer. Given the cost and limitations of analog backups, and their relevance to only certain classes of documents, the technique makes sense only for documents whose contents merit the highest level of redundancy and protection from loss.

Migration—to copy data, or convert data, from one technology to another, whether hardware or software, preserving the essential characteristics of the data. This simple definition, by Peter Graham, captures the essence of migration, as well as its ambiguity. Some use migration interchangeably with refreshing, but as the authors of Preserving Digital Information put it:

Migration is a broader and richer concept than "refreshing" for identifying the range of options for digital preservation. Migration is a set of organized tasks designed to achieve the periodic transfer of digital materials from one hardware/software configuration to another, or from one generation of computer technology to a subsequent generation. The purpose of migration is to preserve the integrity of digital objects and to retain the ability for clients to retrieve, display, and otherwise use them in the face of constantly changing technology. Migration includes refreshing as a means of digital preservation but differs from it in the sense that it is not always possible to make an exact digital copy or replica of a data base or other information object as hardware and software change and still maintain the compatibility of the object with the new generation of technology.

Migration theoretically goes beyond addressing viability by converting data to avoid obsolescence not only of the physical storage medium, but also of the encoding and format of the data. However, the impact of migrating complex file formats has not been widely tested. One of the most comprehensive studies to date is "Risk Management of Digital Information: A File Format Investigation." Some have criticized migration on the basis that it can ensure neither the authenticity nor the integrity of a digital document.
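
As an illustration, the sketch below migrates an image to a newer, lossless format and then verifies that essential characteristics survived the conversion. It assumes the Pillow imaging library and hypothetical filenames:

    from PIL import Image   # Pillow, an assumed dependency

    def migrate_image(source, target):
        """Convert an image to another (lossless) format, then verify
        that essential characteristics survived the migration."""
        with Image.open(source) as original:
            original.save(target)      # format inferred from the extension
            with Image.open(target) as migrated:
                assert migrated.size == original.size
                assert migrated.mode == original.mode
                assert migrated.tobytes() == original.tobytes()

    # Hypothetical filenames: migrate a TIFF master to PNG.
    migrate_image("master.tif", "master.png")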

Replication—is a term used to mean multiple things. Bitstream copying is a form of replication. OAIS considers replication to be a form of migration. LOCKSS (Lots of Copies Keep Stuff Safe) is a consortial form of replication, while peer-to-peer data trading is an open, free-market form of replication. In each case, the intention is to enhance the longevity of digital documents while maintaining their authenticity and integrity through copying and the use of multiple storage locations.
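
A sketch of replication with an integrity audit, loosely echoing the polling idea behind LOCKSS; the storage locations are hypothetical:

    import hashlib
    import shutil
    from collections import Counter

    REPLICA_DIRS = ["/mnt/siteA", "/mnt/siteB", "/mnt/siteC"]   # hypothetical sites

    def replicate(path):
        """Copy a document into every storage location."""
        for site in REPLICA_DIRS:
            shutil.copy(path, site)

    def audit(name):
        """Hash every replica; copies disagreeing with the majority digest
        have been damaged and should be repaired from the others."""
        digests = {}
        for site in REPLICA_DIRS:
            with open(f"{site}/{name}", "rb") as f:
                digests[site] = hashlib.sha256(f.read()).hexdigest()
        majority, _ = Counter(digests.values()).most_common(1)[0]
        return [site for site, d in digests.items() if d != majority]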

Reliance on Standards—is to software what durable media is to hardware. It seeks a way to "harden" the encoding and formatting of digital objects by adhering to well-recognized standards and favoring such standards over more esoteric and less well-supported ones. It assumes in part that such standards will endure and that problems of compatibility resulting from the evolution of the computing environment (applications software, operating systems) will be handled by the continuing need to accommodate the standard within the new environment. For example, if JPEG2000 becomes a widely adopted standard, the sheer volume of users will guarantee that software to encode, decode, and render JPEG2000 images will be upgraded to meet the demands of new operating systems, CPUs, etc. Like many of the strategies described here, reliance on standards may lessen the immediate threat to a digital document from obsolescence, but it is no more a permanent preservation solution than the use of gold CDs or stone tablets.
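
In practice, reliance on standards begins with confirming that holdings really are in well-supported formats. A sketch that checks a file's opening bytes against the published signatures of a few widely recognized standards:

    # Magic-number signatures of some widely supported standard formats.
    SIGNATURES = {
        "PNG":       b"\x89PNG\r\n\x1a\n",
        "JPEG 2000": b"\x00\x00\x00\x0cjP  \r\n\x87\n",
        "PDF":       b"%PDF",
        "TIFF (LE)": b"II*\x00",
        "TIFF (BE)": b"MM\x00*",
    }

    def identify(path):
        """Return the name of the recognized standard format, or None."""
        with open(path, "rb") as f:
            header = f.read(16)
        for name, magic in SIGNATURES.items():
            if header.startswith(magic):
                return name
        return None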

Normalization—is a formalized implementation of reliance on standards. Within an archival repository, all digital objects of a particular type (e.g., color images, structured text) are converted into a single chosen file format that is thought to embody the best overall compromise amongst characteristics such as functionality, longevity, and preservability. The advantages and disadvantages of reliance on standards also apply to normalization.
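
A sketch of a normalization step at ingest, again assuming Pillow, with TIFF as the repository's single chosen image format (an assumption made for illustration):

    from pathlib import Path
    from PIL import Image   # Pillow, an assumed dependency

    NORMAL_FORM = ".tif"    # the chosen archival format (an assumption)

    def normalize(incoming_dir, archive_dir):
        """Convert every ingested image to the repository's single chosen
        format, copying already-conformant files unchanged."""
        for src in Path(incoming_dir).iterdir():
            dst = Path(archive_dir) / (src.stem + NORMAL_FORM)
            if src.suffix.lower() == NORMAL_FORM:
                dst.write_bytes(src.read_bytes())   # already normal
            else:
                with Image.open(src) as img:
                    img.save(dst)                   # re-encode to the normal form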

Canonicalization—is a technique designed to allow determination of whether the essential characteristics of a document have remained intact through a conversion from one format to another. Canonicalization relies on the creation of a representation of a type of digital object that conveys all its key aspects in a highly deterministic manner. Once created, this form could be used to algorithmically verify that a converted file has not lost any of its essence. Canonicalization has been postulated as an aid to integrity testing of file migration, but it has not been implemented.
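
Though unimplemented, the idea is easy to sketch for plain text: reduce the document to a deterministic canonical form, hash it, and compare the digests computed before and after a conversion. The filenames and encodings below are hypothetical:

    import hashlib
    import unicodedata

    def canonical_form(text):
        """Reduce a text document to a deterministic representation:
        Unicode NFC, uniform line endings, no trailing whitespace."""
        text = unicodedata.normalize("NFC", text)
        lines = [line.rstrip() for line in text.splitlines()]
        return "\n".join(lines).encode("utf-8")

    def essence_digest(text):
        return hashlib.sha256(canonical_form(text)).hexdigest()

    # Matching digests mean the Latin-1 to UTF-8 migration kept the essence.
    with open("report_old.txt", encoding="latin-1") as f:
        before = essence_digest(f.read())
    with open("report_new.txt", encoding="utf-8") as f:
        after = essence_digest(f.read())
    assert before == after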

Emulation—combines software and hardware to reproduce in all essential characteristics the performance of another computer of a different design, allowing programs or media designed for a particular environment to operate in a different, usually newer environment. Emulation requires the creation of emulators, programs that translate code and instructions from one computing environment so that they can be properly executed in another.
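
At its core, an emulator is a fetch-decode-execute loop. The sketch below runs a made-up toy instruction set; a real emulator does the same for a historical machine's actual opcodes and peripheral hardware:

    def emulate(program, memory_size=256):
        """Fetch-decode-execute loop for a toy instruction set in which
        each instruction is an (opcode, operand) pair."""
        memory = [0] * memory_size
        acc, pc = 0, 0                    # accumulator and program counter
        while pc < len(program):
            op, arg = program[pc]
            pc += 1
            if op == "LOAD":
                acc = memory[arg]
            elif op == "STORE":
                memory[arg] = acc
            elif op == "ADDI":
                acc += arg
            elif op == "JUMP":
                pc = arg
            elif op == "HALT":
                break
        return memory

    # Compute 2 + 3 on the emulated machine and store it at address 0.
    result = emulate([("ADDI", 2), ("ADDI", 3), ("STORE", 0), ("HALT", 0)])
    assert result[0] == 5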

A widely known, general-purpose emulator is the one built into recent versions of the Apple Macintosh operating system that allows the continued use of programs written for an earlier series of CPUs no longer used in Apple computers. Most emulators available today, however, were written to allow computer games designed for obsolete hardware to run on modern computers.

The emulation concept has been tested in several projects, with generally promising results. However, widespread use of emulation as a long-term digital preservation strategy will require consortia to perform the technical work of creating functioning emulators, as well as the administrative work of assembling specifications and documentation for the systems to be emulated and securing the intellectual property rights to the relevant hardware and software.

Encapsulation—may be seen as a technique of grouping together a digital object and metadata necessary to provide access to that object. Ostensibly, the grouping process lessens the likelihood that any critical component necessary to decode and render a digital object will be lost. Appropriate types of metadata to encapsulate with a digital object include reference, representation, provenance, fixity and context information. Encapsulation is considered a key element of emulation.
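
A sketch of encapsulation as a packaging step: the object and its metadata travel together in a single container, with fixity information computed at packaging time. The object, identifier, and metadata values are hypothetical:

    import hashlib
    import json
    import os
    import zipfile

    def encapsulate(object_path, package_path, metadata):
        """Bundle a digital object with the metadata needed to decode and
        render it, adding fixity information at packaging time."""
        with open(object_path, "rb") as f:
            data = f.read()
        metadata["fixity"] = {"algorithm": "SHA-256",
                              "digest": hashlib.sha256(data).hexdigest()}
        with zipfile.ZipFile(package_path, "w") as pkg:
            pkg.writestr("object/" + os.path.basename(object_path), data)
            pkg.writestr("metadata.json", json.dumps(metadata, indent=2))

    # Hypothetical object and metadata, following the categories named above.
    encapsulate("letter.tif", "letter_package.zip", {
        "reference":      {"identifier": "doc-0001"},
        "representation": {"format": "TIFF", "version": "6.0"},
        "provenance":     {"creator": "Scanning Lab", "date": "2002-10-01"},
        "context":        {"collection": "Correspondence"},
    })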

Universal Virtual Computer—is a form of emulation. It requires the development of "a computer program independent of any existing hardware or software that could simulate the basic architecture of every computer since the beginning, including memory, a sequence of registers, and rules for how to move information among them. Users could create and save digital files using the application software of their choice, but all files would also be backed up in a way that could be read by the universal computer. To read the file in the future would require only a single emulation layer—between the universal virtual computer and the computer of that time."
(excerpted from MIT Technology Review, "Data Extinction," by Claire Tristram, October 2002, p.42)