Computer files, the objects normally thought of as the main target of digital preservation, are presented according to pre-defined structural and organizational principles. Those principles, usually referred to as a file format, are typically laid out in a document called a format specification. A format specification provides the details necessary to construct a valid file of a particular type and to develop software applications that can decode and render such files. The actual specifications may vary considerably in length, from well under 100 pages to well over 1000, depending on the complexity of the format.
The table above shows five different versions of Adobe's Portable Document Format (PDF). Each was released with its own software package; however, Adobe has relinquished control of PDF 1.7 and any future versions to ISO, which approved PDF 1.7 in 2008 as an international standard (ISO 32000-1).
Although some file format specifications are largely independent of specific software (for example, encoding schemes such as ASCII and Unicode), most are tied to individual or related groups of software. The software and its related file format specification usually evolve together and have fates that are tightly bound. Therefore it makes sense to discuss software obsolescence and file format obsolescence together.
What's in a File Format Specification?
Without a format specification, a file is just a meaningless string of ones and zeros. The specification indicates the proper subdivision, encoding, sequence, arrangement, size, and internal relationships that uniquely identify the particular format and allow it to be properly interpreted and rendered. For example, a format specification should indicate the location of meaningful boundaries within the bitstream and whether a particular subunit should be interpreted as an ASCII character, a numerical value, a machine instruction, a color selection, or something else.
Though it's not necessary to pore over the details of particular format specifications, a quick look at one can provide a sense of why file formats are vulnerable to obsolescence. Take, for example, the TIFF 6.0 (Tagged Image File Format) format specification, which describes the popular raster image format. Page 13 of the specification defines the basic building block of a TIFF file and its maximum length, and then goes on, byte by byte, to lay out the internal structure of a valid TIFF file. A file that fails to conform exactly to these requirements would be either unrecognizable or improperly rendered if fed to a TIFF reader.
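To make the byte-by-byte idea concrete, here is a minimal sketch in Python that decodes only the 8-byte header that opens a TIFF 6.0 file: a two-byte byte-order mark ("II" for little-endian, "MM" for big-endian), the number 42, and the offset of the first image file directory. The function name parse_tiff_header and the returned dictionary keys are illustrative assumptions, not part of the specification, and a real reader would go on to decode the directory entries as well.

```python
import struct


def parse_tiff_header(data: bytes) -> dict:
    """Decode the 8-byte header at the start of a TIFF file.

    Hypothetical helper for illustration; it checks the header only,
    not the image file directories that follow it.
    """
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")

    # Bytes 0-1: byte-order mark that governs every later number.
    order = data[:2]
    if order == b"II":
        endian = "<"  # little-endian
    elif order == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("unrecognized byte-order mark")

    # Bytes 2-3: the constant 42. Bytes 4-7: offset of the first directory.
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: expected 42 in bytes 2-3")

    return {
        "byte_order": "little" if endian == "<" else "big",
        "first_ifd_offset": ifd_offset,
    }


if __name__ == "__main__":
    # A little-endian header whose first directory starts at byte 8.
    print(parse_tiff_header(b"II*\x00\x08\x00\x00\x00"))
```

Note that the same four bytes mean different numbers depending on the byte-order mark, which is exactly the kind of detail that cannot be recovered from the bitstream without the specification.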
What Factors Contribute to File Format Obsolescence?
File formats can become obsolete for a number of reasons:
- Software upgrades fail to support legacy files.
- The format itself is superseded by another or evolves in complexity.
- The format "take up" is low or industry fails to create compatible software.
- The format fails, stagnates, or is no longer compatible with the current environment.
- Software supporting the format fails in the marketplace or is bought by a competitor and withdrawn.
Why are File Formats a Challenge to Digital Preservation?
A number of factors have contributed to the challenge presented by digital file formats. During the early decades of computing, the threat of file format obsolescence to the long-term maintenance of digital objects was not widely recognized. No systematic efforts were made to collect software documentation or file format specifications. Without proper documentation, the task of trying to interpret an old file, or even determine what format it was written in, becomes daunting. Thousands of file formats and their variants have been created. Only recently has an effort been made to catalog them, document them, and understand their relationships and variations. Tools are beginning to emerge to automate the process of identifying and characterizing files by their formats.
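Many identification tools work, at least in part, by looking for a known byte signature at the start of a file. The following Python sketch shows the idea with a handful of well-known signatures; the table SIGNATURES and the function identify_format are assumptions made for illustration and do not reproduce the signature set or logic of any particular tool.

```python
# Leading byte signatures for a few common formats.
# Illustrative subset only; production tools rely on far larger registries.
SIGNATURES = [
    (b"%PDF-", "PDF"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),  # little-endian TIFF
    (b"MM\x00*", "TIFF"),  # big-endian TIFF
]


def identify_format(data: bytes) -> str:
    """Return a format name based on the first bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if data.startswith(magic):
            return name
    return "unknown"


if __name__ == "__main__":
    print(identify_format(b"%PDF-1.7\n..."))
    print(identify_format(b"plain text with no signature"))
```

A match on the leading bytes suggests a format but does not prove that the rest of the file is valid, which is why identification is usually followed by a separate validation or characterization step.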
Most software is upgraded on a regular basis. Although most applications can read files created with the previous version and perhaps the one before that, the ability to read older versions is often dropped. Files that have not been migrated may not be readable by the latest version of the software, and the older version of the software may no longer be available, or may not run on a current computer or under a current version of the operating system.
Figure 6. Postcard: front side. The whimsical design was intended to attract attention and to put clients at ease.
Credit: Carla DeMello, Department of Communications, Cornell University Library.
Also, due to the complexity and dynamic nature of many file formats, it can be extremely difficult to determine whether a file moved from one format to another (or to a newer version of the same format) has retained all of its characteristics and functionality.
While we may not have realized the threat of obsolescence when we first started purchasing personal computers over twenty years ago, we certainly experience the frustration of it now. Trying to read an old 3.5-inch floppy from ten years ago can be frustrating if you don't know what software or hardware was involved in its creation. Say you find a ten-year-old PC to test an old floppy on, and it is unable to read it. You may believe the floppy is damaged, but it could just as easily be an old Mac floppy, which your PC would be unable to read because the disk was formatted for a different operating system. Most people would probably throw that floppy in the bin, unaware that the files on it were just fine. An example of this sort of puzzle in media testing can be found in the RLG article Digging Up Bits of the Past: Hands-on with Obsolescence.
Are Some File Formats Less Vulnerable to Obsolescence than Others?
Since all software is subject to obsolescence, all file formats used by that software are also vulnerable. On the surface, it may seem that the files used by software that is more stable (i.e., not undergoing a lot of change) would be less subject to obsolescence, and that is true in the short term. But software that stands still inevitably also becomes obsolete, because it fails to adapt to the changing computing environment (e.g., CPU architectures, operating systems, encoding schemes, and data transfer protocols) that it must operate in. So users must be watchful of files that either rapidly evolve or stagnate, since both are prone to obsolescence.
To decode an old file format, the format specification must be available. Therefore, the degree of control the creator of a format specification exerts over its publication has a significant impact on the format's vulnerability to obsolescence. Specifications tend to fall into one of three categories.
—Proprietary and closed specifications represent some of the most enduring and successful software in use. However, these also tend to evolve quickly and exist in many different versions for different platforms, with only limited backward compatibility provided. In fact, there is substantial commercial incentive to avoid good backward compatibility, since the need to share files ultimately forces all users, including those who'd prefer to keep using older versions, to upgrade to newer versions. Commercial vendors must regularly release new versions of their software with added features and functionality in order to entice users to upgrade and provide a continued revenue stream.
Unfortunately, experience has shown that even very old specifications for versions of commercial file formats long ago pulled from the marketplace may never be released. Also, as one might expect, proprietary and closed file formats are interpreted with the highest accuracy by the manufacturer's own software. Therefore, such formats are the most vulnerable to obsolescence since they face the dual risk of rapid specification change and being tied to a single product or company.
Furthermore, today's wildly successful software can be tomorrow's also-ran or distant memory. There has been tremendous consolidation in the commercial software industry and many products have disappeared following mergers and acquisitions. Others have succumbed to competition from superior or more cleverly marketed products.
—Some proprietary formats have a lessened risk because the specification has been publicly released, allowing other companies (and non-commercial entities) to produce software that can read them. However, commercial entities can and sometimes do change their minds about leaving specifications open. For example, the DjVu image format was an open specification for a while before its owner decided to make changes and not release them to the public.
Most proprietary but open specifications are still vulnerable to the whims of market forces. In addition to being subject to arbitrary withdrawal, they can be abandoned for commercial reasons.
Adobe acquired the TIFF specification in 1994 when it purchased Aldus. Since then, it has done minimal work on the TIFF specification, which remains at version 6.0, released in 1992. Despite the fact that "TIFF is designed to be extensible—to evolve gracefully as new needs arise" (from p.5 of the TIFF 6.0 specification), it has not been modernized for the current computing environment, other than a few tweaks designed specifically to address issues with Adobe's own software, and the maintenance of extensions to its header tags, most of which are not widely supported. Though TIFF remains well-supported and viable today, it will undoubtedly be eclipsed by more modern standards that are seeing ongoing development.
—In terms of guaranteed long-term availability, published specifications produced by international standards bodies are the safest. Generally, representatives from many different constituencies are involved in creating the standard, helping to ensure that it balances the needs of a wide variety of users and that it isn't beholden to any particular commercial interest. Broad participation also helps provide incentive for wide support once the standard is completed. Backward compatibility with older, related standards is usually a priority and there are no commercial pressures for rapid obsolescence. One of the most recent examples of this is the standardization of the OpenDocument Format (ODF) as an open standard. ODF stemmed from the open XML-based OpenOffice.org specification and was approved by ISO as a standard in 2006.
On the other hand, not all standard formats should be assumed to be the best choices. A standard must be widely adopted by both user and developer communities before it becomes less vulnerable to obsolescence, and that doesn't always happen.
PNG (Portable Network Graphics), a color still-image format, emerged after the GIF (Graphics Interchange Format) format became mired in patent and royalty issues surrounding its use of the LZW compression scheme. Despite being demonstrably superior to GIF by nearly every technical measure and free of commercial encumbrances, PNG is only now reaching a critical mass of acceptance because of the overwhelming number of GIF images already in use.
Choosing File Formats for Reduced Vulnerability to Obsolescence
The following factors should be considered in assessing a file format's long-term stability:
- wide adoption
- history of backward compatibility
- good metadata support (in an open format such as XML)
- good range of functionality, but not overly complex
- available interchange format with usable target
- built-in error checking
- reasonable upgrade cycle
Determine the file format status of your digital holdings. What formats and versions are represented and in what quantities? Such an inventory is an important step toward managing file format risk. The range of formats in use should be consolidated to minimize duplication and eliminate problem formats. This process is known as normalization. Those formats most at risk, such as those created by obsolete software or by obsolete versions of current software, should be targeted first.
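A first-pass inventory can be as simple as walking a directory tree and tallying files by extension. The Python sketch below does that; the function name inventory_by_extension and the "(none)" label for files without an extension are assumptions for illustration. Extensions are only a rough proxy for format and say nothing about version, so a tally like this is a starting point rather than a reliable census.

```python
from collections import Counter
from pathlib import Path


def inventory_by_extension(root: str) -> Counter:
    """Count the files under root, grouped by lower-cased extension.

    Hypothetical helper: extensions can be missing or wrong, so the
    result should be confirmed with a format identification tool.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower() or "(none)"
            counts[ext] += 1
    return counts


if __name__ == "__main__":
    # List the most common extensions first for the current directory.
    for ext, count in inventory_by_extension(".").most_common():
        print(f"{ext}\t{count}")
```

Sorting the tally by quantity makes it easier to see which formats dominate the holdings and which rare ones may deserve early attention.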
Not all formats, especially those that are obsolete, can be migrated to newer, less risky formats without some loss of fidelity. If the original software is unavailable, it may be impossible to determine the degree of loss.
Resources for assessing the potential for migration are starting to appear. The PRONOM database can be helpful in determining whether a migration path exists for an old file format using a newer version or a specialized conversion tool. However, it does not yet provide much detail about invariance, i.e., the degree to which the migrated file will reproduce the appearance and functionality of the original. “Risk Management of Digital Information: A File Format Investigation” by Lawrence et al. is a study of the impact of migration on file integrity and can provide some guidance in assessing the migration process. The INFORM Methodology is an approach for measuring the preservation durability of digital formats.
Only by careful examination of what went in vs. what came out can an assessment of risk and loss be made. This proactive and aware form of risk management is likely to be less risky than a passive “see what happens” approach that could lead to catastrophic loss.
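One way to examine what went in against what came out is to record a set of properties for the file before and after migration and then compare the two records. The Python sketch below assumes those properties have already been extracted by some characterization step and are held in plain dictionaries; the function compare_properties and the property names in the example are hypothetical.

```python
def compare_properties(before: dict, after: dict) -> dict:
    """Report properties that were lost, added, or changed by a migration."""
    lost = sorted(key for key in before if key not in after)
    added = sorted(key for key in after if key not in before)
    changed = {
        key: (before[key], after[key])
        for key in before
        if key in after and before[key] != after[key]
    }
    return {"lost": lost, "added": added, "changed": changed}


if __name__ == "__main__":
    # Hypothetical property records for one image, before and after migration.
    original = {"width": 1200, "height": 800, "bit_depth": 16, "color_profile": "embedded"}
    migrated = {"width": 1200, "height": 800, "bit_depth": 8}
    print(compare_properties(original, migrated))
```

A report like this only covers the properties that were measured, so the choice of which characteristics to record matters as much as the comparison itself.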
In the absence of a software migration path, it may be possible to retrieve an old file using emulation, provided the original software is available but no longer runs on modern hardware. Emulators run on modern hardware but mimic an obsolete computing environment, allowing old software to run. The file can at least be viewed, and may be converted to an interchange format from which it can be migrated forward.