Digital Awareness Month at State Records – halfway through and we’re still thinking about all things digital
December 10, 2012
After two wonderful weeks of digital awareness raising, we began the third week of DAM2012 with an update on State Records’ Digital Archive. Cassie Findlay, the Project Manager for the digital archives project, talked about the progress that the team has made so far as well as some of the pilot migration projects currently underway.
Following on from this update about State Records’ own digital archiving project, the third newsletter for DAM2012 looked at some of the digital preservation strategies being tested and implemented by other organisations around the world. To find out what State Records employees were reading about during DAM2012, read on!
Digital preservation strategies
Ensuring long-term access to digital records presents many challenges, mainly due to the technological nature of digital records. In its Digital Continuity Strategy, Queensland State Archives identifies some of these challenges:
- The physical media used to store digital records (CDs, DVDs, hard drives etc.) are inherently fragile and unstable, and even when stored under ideal conditions will deteriorate more rapidly than paper.
- When data becomes corrupted, the error correcting processes built into ICT equipment sometimes do not reveal the amount of corruption until it is too late to recover the data.
- Digital records are created using ICT systems and all require specific hardware and software in order to be accessed and used.
- Developments in computer science and technology, and commercial imperatives, mean that hardware, software and operating systems change very rapidly.
- The relative ease with which digital objects can be changed, and the need to change them to keep them useable and meaningful, leads to significant issues in ensuring their continued authenticity, reliability and integrity.
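One common way to notice the kind of silent corruption described above is to record a checksum when a record is first stored and recompute it later. The sketch below is only an illustration of that idea: the function names and the choice of SHA-256 are our assumptions, and nothing here is taken from the Queensland State Archives strategy.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of a record's bitstream as a hex string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_checksum: str) -> bool:
    """Compare the current bitstream with the checksum recorded at ingest."""
    return checksum(data) == recorded_checksum


original = b"Minutes of the meeting held on 10 December 2012"
recorded = checksum(original)  # stored alongside the record at ingest

corrupted = bytearray(original)
corrupted[5] ^= 0x01  # flip a single bit to simulate media decay

print(is_intact(original, recorded))          # True
print(is_intact(bytes(corrupted), recorded))  # False
```

A check like this can only say that something has changed, not what changed or how to repair it, so it is usually paired with keeping more than one copy.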
The Atlas of Digital Damages is a Flickr space where people can post visual examples of digital preservation challenges, failed renderings, encoding damage, corrupt data etc.
This example shows the corruption that occurred when a Betacam SP analogue video tape was converted to an MPEG-2 digital video.
Organisations across the world are testing and implementing different strategies to combat these challenges and preserve digital records over time.
Emulation
Emulation involves keeping digital objects in their original format, and then using software to recreate the objects’ original operating environments on current computer systems.
Software is created which runs on one computer (the ‘host system’) and makes it behave exactly like another, usually older, computer (the ‘target system’). Software which ran in the target system can be run inside the emulator in the host system.
The Nationaal Archief of the Netherlands uses the following example to explain how emulation works:
Twenty years ago, WordPerfect 5.1 was a popular word processing package. To view an original WordPerfect file, an emulator for a 286 PC can be used and DOS and WordPerfect can be installed on the emulated computer. This allows a user to open a WordPerfect file in the original application and run it in the emulated environment, and to work with the file in the same way the creator did 20 years ago.
Proponents of emulation identify a number of advantages to this strategy:
- original digital objects can be left untouched
- periodic migration to newer formats is unnecessary
- the functionality and appearance (i.e. the ‘look and feel’) of the objects are preserved by using their authentic software environment.
Emulation in Europe
Archives and libraries in Europe seem to be very keen on emulation!
The Keeping Emulation Environments Portable (KEEP) Project is a research project in the European Union. Part of the KEEP Project involves developing an Emulation Framework into which existing and future emulators can be plugged. The Emulation Framework will provide a largely automated workflow for emulation: when a user requests a digital file or application, the Emulation Framework will analyse the file or application using external registries and then select the most suitable emulator to mimic the original operating environment.
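The selection step can be pictured as a lookup from an identified file format to an emulated environment. The sketch below is purely illustrative: the registry contents, the format labels and the function names are all hypothetical, and it is not the KEEP Emulation Framework’s actual interface. It also guesses the format from the file extension for brevity, which is a much cruder test than a real format registry would apply.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional


@dataclass(frozen=True)
class Environment:
    """The original operating environment an emulator needs to mimic."""
    hardware: str
    operating_system: str
    application: str


# Hypothetical registry entries, invented for this example.
FORMAT_BY_EXTENSION = {".wp5": "WordPerfect 5.1"}
ENVIRONMENT_BY_FORMAT = {
    "WordPerfect 5.1": Environment("286 PC", "DOS", "WordPerfect 5.1"),
}


def identify_format(filename: str) -> Optional[str]:
    """Guess the format from the file extension (an assumption made for brevity)."""
    return FORMAT_BY_EXTENSION.get(PurePath(filename).suffix.lower())


def select_environment(filename: str) -> Optional[Environment]:
    """Return the environment to emulate, or None if the format is unknown."""
    file_format = identify_format(filename)
    if file_format is None:
        return None
    return ENVIRONMENT_BY_FORMAT.get(file_format)


print(select_environment("LETTER.WP5"))
print(select_environment("mystery.xyz"))
```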
The Nationaal Archief of the Netherlands envisages implementing two emulation approaches as part of its digital preservation strategy:
- The Nationaal Archief is considering how emulator access to certain types of records (e.g. business systems used to aid government decision-making) can be provided, either for visitors to the reading room or remotely over the web.
- The Nationaal Archief also sees a role for emulation as an intermediate step in migration. The Archief’s research shows that migration outcomes improve when the original bitstream is saved in the latest available version of the software before being migrated to another file format (e.g. a WordPerfect 4.2 file should be opened and saved in WordPerfect 10 before being migrated to Microsoft Word). Later versions of the original software often run only in the original hardware environment, so emulation can be used to re-create these environments.
The Nationaal Archief may also use emulation if a digital object has gone through a long ‘unmanaged’ period since its creation (e.g. a collection of records in obsolete formats that have not benefited from a managed process and are inaccessible with present day software). Emulation can be used to run the original software application before the files are converted to a more accessible format.
Encapsulation
Encapsulation involves bundling digital objects with anything else that may be necessary to provide access to the objects, and aims to overcome the problem of technological obsolescence of file formats by encapsulating information about how to interpret the digital bits in an object with the object itself.
Encapsulation can be achieved by using physical or logical structures called ‘containers’ or ‘wrappers’ to provide a relationship between the digital object and other supporting information. This information can be used in the future to develop viewers for displaying the objects, or to develop emulators or converters.
Encapsulation is a similar approach to emulation but without the need to include specifications to exactly rebuild the original hardware and software to render the object. Instead, the metadata provides a hardware and software independent method for understanding the record over time.
The Victorian Electronic Records Strategy (VERS) uses encapsulation to preserve digital records. Record content is accepted in formats including text files, PDF, PDF-A, JPEG, TIFF and MPEG, encapsulated using an XML ‘wrapper’ containing a standard set of metadata elements, and authenticated using a digital signature.
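To give a feel for what an XML ‘wrapper’ looks like, here is a minimal sketch in Python. It is not the VERS format: the element names are invented for this example, and a plain SHA-256 digest stands in for the digital signature (a real signature would also involve a private key and a certificate).

```python
import base64
import hashlib
import xml.etree.ElementTree as ET


def encapsulate(content: bytes, title: str, file_format: str) -> str:
    """Bundle record content with descriptive metadata in one XML wrapper."""
    record = ET.Element("Record")  # hypothetical element names throughout
    metadata = ET.SubElement(record, "Metadata")
    ET.SubElement(metadata, "Title").text = title
    ET.SubElement(metadata, "Format").text = file_format
    encoded = base64.b64encode(content).decode("ascii")
    ET.SubElement(record, "Content", encoding="base64").text = encoded
    digest = hashlib.sha256(content).hexdigest()
    ET.SubElement(record, "Digest", algorithm="SHA-256").text = digest
    return ET.tostring(record, encoding="unicode")


def unwrap(wrapper: str) -> bytes:
    """Recover the content and check it against the digest in the wrapper."""
    record = ET.fromstring(wrapper)
    content = base64.b64decode(record.findtext("Content"))
    if hashlib.sha256(content).hexdigest() != record.findtext("Digest"):
        raise ValueError("content does not match the recorded digest")
    return content


wrapper = encapsulate(b"Dear Sir or Madam", "Letter to the Minister", "text file")
print(wrapper)
print(unwrap(wrapper))
```

Because the metadata travels inside the same wrapper as the content, someone opening the file in the future can read the title and format without needing the original software.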
Migration
Migration involves transferring digital objects from older or obsolete hardware and software configurations or generations to current configurations or generations in order to maintain accessibility.
Normalisation is a particular type of migration and involves converting a digital object to a format that is not specific to a particular technology or application and is deemed to be more preservable, i.e. an archival data format. This process reduces the need for repeated migrations.
Generally, archival data formats are open source and/or non-proprietary formats that provide greater potential longevity and fewer preservation constraints. However, a key issue is the determination of which file formats to recommend as archival data formats.
The National Archives of Australia (NAA) uses normalisation as part of its approach to digital preservation. The XML Electronic Normalising for Archives (XENA) tool allows the NAA to convert digital records from their original creation formats into open, well-documented and accessible formats.
Metadata
No digital preservation strategy can be put in place without considering metadata!
- Recordkeeping metadata, which describes the context, content and structure of records and their management over time, is vital for the management and ongoing useability of digital records.
- Preservation metadata supports and documents the digital preservation process (e.g. metadata documenting custody/ownership, preservation processes, technical dependencies and rights management).
State Records’ Standard on digital recordkeeping requires certain recordkeeping metadata to be captured with all digital records (e.g. title, date of creation and record type, such as letter, memo, report or contract). Much of this metadata can be automatically generated by recordkeeping systems. Some metadata, however, is entered manually by users and may be of lesser quality.
A 2001 rant against a ‘meta-utopia’ lists seven problems with metadata, all of which stem from the apparent inability and/or disinclination of users to create useful, accurate metadata.* For example, users sometimes save records with meaningless titles (e.g. ‘Untitled.doc’ or ‘RE: your email’).
* It’s worth a read if only for its quirky references to 2001-era websites and services (AltaVista? Geocities? Napster??) The world of the web has changed considerably since 2001!
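A recordkeeping system can catch some of these problems at the moment of capture. The sketch below checks for the three example elements mentioned above (title, date of creation and record type) and flags titles that say nothing about the record. The field names and the short list of ‘vague’ titles are our own assumptions for illustration, not the element set defined in the Standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

# Hypothetical list of titles that tell a future reader nothing.
VAGUE_TITLES = {"untitled", "re: your email"}


@dataclass
class RecordMetadata:
    title: str = ""
    created: Optional[date] = None
    record_type: str = ""


def metadata_problems(meta: RecordMetadata) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    title = meta.title.strip()
    if not title:
        problems.append("title is missing")
    elif title.rsplit(".", 1)[0].lower() in VAGUE_TITLES:
        # rsplit drops a trailing file extension such as ".doc" before comparing
        problems.append("title is not meaningful")
    if meta.created is None:
        problems.append("date of creation is missing")
    if not meta.record_type.strip():
        problems.append("record type is missing")
    return problems


print(metadata_problems(RecordMetadata("Untitled.doc", date(2012, 12, 10), "letter")))
print(metadata_problems(RecordMetadata("Lease agreement for 12 Smith Street")))
```

A check like this cannot tell whether a title is accurate, only whether it is obviously empty or generic, so it helps users rather than replaces them.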
Tea room discussion points
- A record might start out as a large, image-rich website with embedded sound. For preservation purposes, however, we might decide just to keep a flat text file of the main written content on the website. What are we losing here and what also are we gaining? How would you make these decisions? How do you prioritise what is important to keep if it is just too complex and expensive to keep it all?
- A record might become corrupted while still in the custody and control of an agency. To what extent is a corrupted record a record? Is it better to have a partial record than none at all? If we decide to accept corrupted records (like the video of George Bush above), how can we describe and explain them for people 100 years in the future?
- Most of us still struggle to put decent metadata on an email message but we will go to a lot of effort to tag all our friends in a Facebook photo. Why do we like some forms of metadata and not others?
Web archiving
Organisations across the world harvest websites for archiving. Some select websites for archiving according to pre-defined selection criteria, some archive all websites in a particular domain, and some are indiscriminate in what is collected:
- PANDORA is an archive of websites relating to Australia and Australians established by the National Library of Australia and built in collaboration with other Australian libraries and cultural collecting organisations. Each of the PANDORA participating agencies selects titles according to its own selection guidelines.
- Other national libraries, such as those of Sweden, Norway, Finland, Iceland and Austria, take periodic snapshots of their countries’ entire web domains.
- The National Archives (UK) archives UK central government websites.
- The Internet Archive has archived over 150 billion web pages since 1996. Users can search for websites using the Wayback Machine, and see what websites looked like at certain points in time.
- The Internet Archive also provides the Archive-It service, which allows organisations to build and preserve their own web archive of digital content through a web application, without requiring any technical expertise or hosting facilities. Subscribers to Archive-It can harvest, catalogue and archive their collections, and then search and browse the collections when complete. Collections are hosted at the Internet Archive data centre and made accessible to the public with full text search.
What about blogs?
How many of you have your own blog? Blogs are increasingly common, covering topics as diverse as cooking, interior design and recordkeeping, and can be said to have social and cultural value as evidence of the ways in which people communicate and interact. But blogs are also ephemeral: studies suggest that the average lifespan of a webpage is less than 100 days.
Most existing web preservation methods copy entire websites from URLs, replicating the folder structure. This approach tends to treat each URL as a single entity, and follows the object-based method of digital preservation by which all digital objects in a website (images, attachments, media, stylesheets etc.) are copied and stored in sophisticated wrapper formats. This approach is not appropriate for the dynamic content of blogs.
The BlogForever Project is building a new system to harvest, preserve and manage blog content. BlogForever aims to preserve each post on a blog, including its content, layout, comments, metadata and any linked external resources (e.g. embedded images).
Speaking of blogs, there is a fascinating post on the National Library of Australia’s web archives blog about the lengths to which a web archive curator will go to provide access to a particular website. The post details the efforts that Paul Koerbin, Manager of Web Archiving, went to in order to harvest Paul Keating’s speeches (the primary content of his website).
The post raises interesting questions about the value of such efforts, and the appropriateness (or otherwise) of making minor changes to Keating’s site to enable the speeches to be harvested.
What would you do?
Servers are critical to the operation of a digital archive. Accumulations of dirt and dust in server rooms can increase long-term maintenance costs and the chances of equipment malfunction, and decrease the lifespan of hardware. Maintaining the cleanliness of server rooms is therefore vital to ensuring that this delicate and expensive equipment can function properly.
If you were responsible for maintaining a server room, which of the following methods would you use to help keep the room clean?
A. I would install a shoe rack outside the server room and require people to remove their shoes before entering the room.
B. I would install sticky mats in the server room to remove dirt from people’s shoes when entering the room.
C. I would require anyone entering the server room to don a biohazard suit.
For a chance to win much honour and glory among your fellow Future Proof readers, post your answer as a comment below.