Unlocking the box that holds the secret to digital preservation

A box recently sealed in a vault in the Swiss Alps contains tools for future users to reconstruct defunct file formats and retrieve otherwise lost information. But how does it work? We spoke with the project's leaders to find out.

When we first came across this Reuters article from May 19, we were intrigued. The article talks about what it calls a "digital genome", a box supposedly containing all the necessary information to read data formats that might be so obsolete fifty, even twenty-five years from now that future computers could not otherwise access them. Stored at a secret bunker called the Swiss Fort Knox (believe it or not, that's its actual name) that can protect it from any outside disturbance up to a nuclear blast, the box is the culmination of the four-year "Planets" project, which brought together experts from sixteen European libraries, archives, and research institutions to figure out the best way to preserve data long-term.

The problem was that the article didn't mention a rather crucial detail: what those working on the project had actually placed in the box. "Digital genome" is an evocative term, but it doesn't really reveal anything about the inner workings of this preservation. So, we got in touch with the leaders of the Planets project and asked them to explain what they were up to, something they were only too happy to do. Here's our interview with Jacob Lant, science and technology press officer for the British Library, one of the key institutions behind the project.

What's the background for this project? What were the goals you set out to achieve with Planets, and what challenges did you face in getting there? (Also, why is this particular project called Planets?)

In the past few years the volume of data produced globally has risen from 281 exabytes to over 700 exabytes, more than 1 trillion CDs' worth of data. Much of this is now considered to be at risk from the repeated discontinuation of storage formats. Hieroglyphics carved on stone or ink on parchment can last thousands of years; by contrast, current studies suggest that common digital information storage formats such as CDs and DVDs have an average life expectancy of less than 20 years, and that is only if the hardware still exists to read the format in question. Anyone using a relatively modern PC who has ever gone back and tried to read material stored on a floppy disc will instantly recognise the frustration of trying to access obsolete formats. Yet the obsolescence of storage formats such as the floppy disc and the decay of individual storage media is just the tip of the iceberg.

A less widely considered but more immediate risk is the obsolescence of the proprietary file formats used to read content stored on the various media. Estimates based on the current rate of development suggest existing software and supporting systems may be supported for as little as five to seven years, and desktop hardware for even less. Backing up data is a start, but without the information and software tools to access and read historical digital material, it is clear huge gaps will open up in our digital heritage.

To meet this threat, in 2006 the European Commission established the Planets project – Preservation and Long-term Access through Networked Services – bringing together a coalition of 16 European libraries, archives, research organisations, and technology institutions including the Austrian National Library, the University of Technology of Vienna, and the British Library to develop the software solutions to guarantee long-term access.

The project set out to:
- Assess the digital preservation challenges and provide a consistent and coherent evidence-base for the objective evaluation of different protocols, tools, and services to ensure the preservation of our digital heritage
- Develop the methodologies, tools and innovative services which will enable the transformation and emulation of obsolete digital formats, providing long-term access
- Raise the profile of the digital preservation challenge and disseminate information about the work of the Planets project
- Enable organisations to improve decision-making about long term preservation needs – helping them define, evaluate, and execute preservation planning

What exactly is contained in the vault? "Digital Genome" is evocative but doesn't necessarily help in understanding just how you have gone about preserving these file formats. So what is the digital genome, and how does it work?

Although the Planets project was set up to explore the preservation of many different historical and contemporary digital formats, the time capsule is symbolic of the wider challenge - containing the keys to 5 of the most common file formats in use today.

Marking the end of the first phase of the Planets project, the deposit of the ‘Digital Genome’ in Swiss Fort Knox, one of the most secure facilities in the world, will help to highlight the fragility of modern data and also protect the tools for unlocking our digital heritage from a whole range of human, environmental and technological risks.

Inside the Digital Time Capsule:
- Five major at-risk formats - JPEGs, Java source code, .mov files, websites using HTML, and PDF documents
- Versions of these files stored in archival standard formats – JPEG 2000, PDF/A, TIFF and MPEG-4 – to prolong lifespan for as long as possible
- 2500 additional pieces of data – mapping the genetic code necessary to describe how to access these file formats in future
- Translations of the required code into multiple languages to improve chances of being able to interpret in the future
- Copies of all information stored on a complete range of storage media – from CD, DVD, USB, Blu-Ray, Floppy Disc, and Solid State Hard Drives to audio tape, microfilm and even paper print outs

Essentially the digital genome contains, in numerous formats across a range of storage media, descriptive information on how to recreate every level of hardware and software to read a file – everything from info on the reading software and the supporting operating system, to how to build a floppy disc drive.

As well as protecting this vital information inside Swiss Fort Knox, the contents of the Time Capsule will be made freely available to organisations and individuals alike via the Planets website, to help them both access historical files and take appropriate measures now to ensure the preservation of their data.

Whilst the hard copy Time Capsule will remain protected from physical threats, the virtual version will be constantly updated and refreshed to ensure the information is kept up to date and easily accessible using cutting edge hardware and software. This virtual version will itself be protected from physical harm on the servers held inside the vaults of Swiss Fort Knox to provide dual protection of the digital genome.

Although the digital genome inside the time capsule covers just 5 file formats it raises the profile of the need to preserve the digital genome of all existing and new file formats – it is this information that will be crucial to providing future access.

How did you choose those particular formats? Why jpeg instead of bmp, for instance? Or would it be possible to reconstruct other picture formats from jpeg? And why were no audio formats chosen, like mp3?

The file formats chosen were selected because they are incredibly common, making it easier for the general public to relate to the importance of the digital genome and the wider issues around digital preservation, i.e. it is highly likely they have used these formats at some point.

For example, jpeg was chosen as the picture format because it is the most widely used compression format for photographic images. Photographs are easy to grasp as important historical documents, so using the jpeg format to highlight the preservation risk to digital images was a direct way of getting a complex message across to a broad audience.

As I understand it, it is not possible to reconstruct other pic formats from the jpeg - the archiving of the jpeg digital genome is representative. The Planets project has worked to create tools to preserve many other formats including things like bmp but these were not selected for inclusion in the digital genome - it would have been simply too much info to store in what is actually quite a compact time capsule.

Also, just to clarify, MP3 was included, so audio has been covered by the project. Again, MP3 is a very widely used file format, so it was vital to include.

Is the focus solely here on preserving file formats themselves? What, if any, steps are being taken to preserve the massive amounts of data itself that is being generated constantly? Is any form of digital preservation of data achievable, or is it only possible to preserve the formats necessary to read whatever data manages to survive?

The focus of the Planets project has been on creating the tools and information to preserve individual file formats and make them accessible in the future; the project is not itself trying to preserve the mass amounts of data being created every year. However, Planets is a Europe-wide collaborative project involving numerous institutions that are all engaged in collecting mass amounts of digital data.

Looking just at the British Library, the UK Web Archive and the Digital Library Program are two examples of mass digital acquisition projects. These two projects have created the infrastructure to ingest and archive mass amounts of digitally born published material as part of the nation's cultural heritage, in preparation for when e-legal deposit legislation is brought in, entrusting the library with the challenge of preserving the nation's digital memory. The projects already store mass amounts of digital data, but this is all acquired on a voluntary basis. If and when the legislation is brought in, the amount of data being captured by the Library will be immense, and we will have a statutory duty to preserve and provide access to this material for research. The work of the Planets project therefore feeds directly back into projects such as these.

As I have already stated, one of the primary aims of the project was also to raise awareness of the challenges of digital preservation, to help both public and private sector organisations assess the situation, and create the tools they will need to manage the vast amounts of data they are creating. It is more about helping them plan and prepare than archiving material on their behalf.

With regard to the actual preservation of material, it is possible to preserve anything if the correct plans are put in place. This involves a combination of storing data properly in high quality archival formats, regularly migrating data to new storage media, and ensuring that either updated software supports older formats or emulation software is in place. Backing up data is also essential, and is something most organisations and many individuals are now doing regularly. However, with so much data being created, and much of it not necessarily being considered valuable at the moment, it is highly likely huge volumes of data will be backed up and stored but the tools to read that data will not be kept up to date until it is too late. This is the problem the Planets project was established to address, both by informing people about the dangers of not archiving data properly and by trying to create some failsafe tools which may be able to help extract information in the future.

How often do you think the genome will need to be updated? The Reuters article talks about a time-frame for this project measured in centuries, but won't the digital genome require more frequent maintenance to account for the rapid changes in formats? When will groups be able to access the information in the genome?

The contents of the Time Capsule (the digital genome of the 5 formats) will not need updating. The time capsule contains all the information that will be required to understand and access these formats in the future. Even if the various data carriers inside the time capsule have corrupted, the information is also stored on paper inside the time capsule. The web version of the Time Capsule will continue to be updated to be in line with current formats to ensure it is easily accessible.

The key message to take away from the first phase of the Planets project and the Time Capsule deposit is that the preservation of the Digital Genome (information on all the hardware and software levels required to access, read and understand the particular format of a digital file) is essential to future access. The Time Capsule doesn't contain all the answers, but it gives us a model to work from when planning the preservation of future formats and accessing the data we create.

Now at the end of its four-year run, the Planets project has created the foundations for the preservation of our digital heritage. Out of this project a self-funding organisation, the Open Planets Foundation, will take this work forward. With a 25-fold increase in the amount of data expected over the next decade, this organisation will continue to play a key role in helping organisations face the increasing financial, legal and social demands to ensure long-term access to digital files.