Migrating Business Systems to the Digital Archives (a post from the Digital Archives team) June 17, 2013
One of the challenges that we (State Records NSW’s digital archives team) have been tackling is how we will approach the migration of database-backed business systems to the digital archives.
There are a number of difficulties in preserving databases:
- many database management systems (particularly those used in government) are proprietary and require licenses to use, so transformation to non-proprietary formats is often necessary
- most databases are uniquely configured to support a particular application, which makes it hard to find an appropriate ‘one size fits all’ solution, even for a single database type
- databases often fulfil a number of roles (which makes associating them with particular disposal classes or series/item descriptions difficult) and are associated with a variety of views (which makes it hard to identify singular, canonical records)
- most databases don’t stand alone, but are part of larger business systems. Database-backed business systems also comprise custom application code that defines the way data is manipulated and viewed.
But the fact that preserving database-backed business systems is hard doesn’t mean that we can avoid the challenge. NSW government agencies expect us to have the capacity to accept migrations of business systems to the digital archives. One of the lessons we have learned from the series of pilot migrations that we are currently conducting is that, while agencies are often largely content to manage their legacy unstructured data (emails, documents, spreadsheets, images, etc.) for long periods of time, it is much harder for them to maintain redundant business systems, and it is these types of digital records that agencies most want to transfer.
While this post will focus on technical strategies, we should be mindful that such considerations are only a part of the picture. The successful migration of business systems involves several other processes including project planning, metadata analysis and systems documentation (to integrate business systems with other systems in the digital archives and other paper holdings in the State Archives collection), decommissioning strategies, and access determination and planning.
Proponents of emulation as a preservation solution for database-backed business systems may be disappointed in this post’s focus on transformation options (such as conversion of databases to XML). State Records isn’t ruling emulation out as a preservation strategy and will consider it when conducting individual migration projects. During our pilot migrations we’ve found that emulation is a practical solution that agencies themselves are using to manage legacy business systems, for example by keeping old software (such as older versions of Lotus Notes) running in virtual machines. But emulation isn’t our preferred approach at this stage, primarily because it keeps systems siloed from each other, and we would like to build an integrated digital archives over which, for example, a user could construct a structured query that accesses data across the whole collection.
So what is State Records’ preferred approach? In a research paper published on this blog last year (System Migrations to Archives), we indicated that we’d be pragmatic and that, rather than adopting universal solutions for digital preservation, we would approach each migration of digital records to the digital archives as a separate project and plan custom migration strategies (with agency stakeholder involvement) to best suit the needs of particular systems. Over the past year we have been working on a migration methodology to frame this decision making process. We will be releasing this methodology for comment within the next few months.
In creating migration plans for database-backed business systems we will consider the nature of those systems, how they are used, what views or reports they generate, and how users of the digital archives (both from agencies and from the public) will access them. A migration plan for a simple registry-type system (such as a database of hairdresser licenses), which only ever generated a single type of view, might be as simple as transforming it to a CSV (comma-separated values) file. A transactional business system with hundreds of database tables and a complex user interface would, on the other hand, require a more sophisticated migration plan. As an illustration of some of the transformation options we’d evaluate for such a system, consider this scenario…
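Before we get to the scenario, here is a minimal sketch of the simple CSV case just mentioned. It assumes a hypothetical single “license” table held in a working copy of the source database; the table, database and file names are invented purely for illustration, and the real source would of course be a proprietary database rather than SQLite.

```python
import csv
import sqlite3

# Minimal sketch only: dump a single (hypothetical) "license" table from a
# working copy of the registry database to a CSV file, header row included.
conn = sqlite3.connect("hairdresser_registry_copy.db")
cur = conn.execute("SELECT * FROM license")

with open("hairdresser_licenses.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow([col[0] for col in cur.description])  # column names
    writer.writerows(cur)                                 # data rows

conn.close()
```

The point is how little structure needs to survive for a registry of this kind: one table, one view, one flat file.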
An example transformation scenario
The Office of Teacher Services has a legacy Teacher Operating System (TOS). The system has two parts:
- the Registry of Teachers is a system for registering information about teachers (e.g. name, DOB, qualifications, and schools taught at). It is a proprietary database with both a custom frontend (for data entry by Office of Teacher Services staff) and a web frontend for teachers and schools to view summary information. The two key tables in the database are the “teacher” and “school” tables
- and Teacher Docs, a system that stores digitized versions of documents relating to teachers’ careers (e.g. job applications, annual performance reviews, working with children checks, etc.). It comprises a proprietary database with metadata about the documents and 4.2 million images on SAN storage.
The Office of Teacher Services has recently upgraded to a cloud-hosted MS SharePoint implementation, Teachers Online, and has indicated it would like to transfer TOS as digital State archives.
N.B. As this scenario illustrates, the problem isn’t just about “preserving databases”. In this example each database is only one component of a larger subsystem, and the two subsystems together make up a single business system.
Introducing the users
Agency staffer
I have just had a call from a law firm. They are requesting the complete file for a teacher comprising data in the register and all associated documents (application forms, yearly reviews, and working with children checks). I need it today.
Agency IT support
We seem to be getting many more requests than we expected for that archived data! I’d like to embed search functionality for the legacy data in our new Teachers Online system, what kind of API do you guys offer? Also, our Executive Officer is asking for some data for the Annual Report. Can we get a breakdown of number of teachers per school district? Cheers!
Public user
I’m researching a family member that I believe worked as a teacher in New South Wales. I would like to see her “file” if possible. I typed “Sally Adams” into your search engine. Is there a result I can just click on? I’m OK with computers… I know how to use Word, email, and Google… but I’m not an expert.
Digital humanities scholar/developer
Search for individual records? What is this, the twentieth century? I want access to RAW DATA that I can use to visualize funding levels for the State school system through the 20C.
Strategies for migrating the Teacher Operating System
1. Static reports
State Records NSW’s digital archives team analysed TOS and determined that it had a single main role: to manage the registration of teachers in NSW. On that basis, they decided that static reports could adequately represent the system and exported a series of PDF reports to summarise each teacher’s career. Related imaged documents from Teacher Docs were stored alongside those PDF reports in ZIP archives with XML-encoded metadata files indicating relationships.
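As a rough sketch of how such a package might be assembled (the file names and manifest elements below are invented for illustration; the actual packaging and metadata scheme would be decided in the migration plan):

```python
import zipfile
import xml.etree.ElementTree as ET

# Hypothetical inputs: a summary PDF for one teacher and the related
# imaged documents exported from Teacher Docs.
report = "teacher_12345_summary.pdf"
images = ["job_application_1978.tif", "annual_review_1985.tif"]

# Build a simple XML manifest recording the relationship between the
# report and its associated documents (element names are illustrative only).
manifest = ET.Element("teacherFile", id="12345")
ET.SubElement(manifest, "report", href=report)
for img in images:
    ET.SubElement(manifest, "document", href=img)

# Package the report, the images and the manifest into a single ZIP archive.
with zipfile.ZipFile("teacher_12345.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(report)
    for img in images:
        zf.write(img)
    zf.writestr("manifest.xml", ET.tostring(manifest, encoding="unicode"))
```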
What do our users think?
Agency staffer
Thank you for emailing me that ZIP file. It has everything I need and I can send it straight on to our legal officer. We should transfer everything to the Digital State Archives!
Agency IT support
I like the fact that your catalogue has an API which I can integrate with our system. I’ve been able to create a cool little search widget in Teachers Online that searches for teacher files and retrieves those ZIP files. I’m not so happy with what you’ve done to our old databases. I’m not sure how I can generate those reports I need. I suppose I could write a script to extract information back out of the PDFs but it will take me weeks.
Public user
Perfect, thank you! I had no trouble opening the ZIP file, Sally’s PDF has a lot of useful information and it is wonderful to see those original documents.
Digital humanities scholar/developer
There’s not much I can do with PDFs. I’ll see what I can find elsewhere. :(
2. XML database normalisation
State Records NSW’s digital migration team analysed TOS and determined that it comprised two databases and a large set of files. They used specialist database preservation software developed by the Swiss Federal Archives (SIARD) to translate the two databases into XML files. An important feature of SIARD is that it supports round-tripping (turning the SIARD XML file back into a database). The team then ran Xena over the imaged documents to create individual normalised XML-encapsulated files for those.
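SIARD itself defines a much richer schema (and, as noted, supports round-tripping back to a database), but as a rough illustration of the underlying idea of normalising relational tables to XML, a sketch along these lines shows the shape of the output. It assumes a hypothetical working copy of the registry; it is not the SIARD format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Illustration only (not the SIARD format): serialise one table from a
# hypothetical working copy of the registry into a simple XML document.
conn = sqlite3.connect("registry_copy.db")
cur = conn.execute("SELECT * FROM teacher")
columns = [d[0] for d in cur.description]

root = ET.Element("table", name="teacher")
for row in cur:
    row_el = ET.SubElement(root, "row")
    for col, value in zip(columns, row):
        ET.SubElement(row_el, col).text = "" if value is None else str(value)

ET.ElementTree(root).write("teacher.xml", encoding="utf-8", xml_declaration=True)
conn.close()
```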
What do our users think?
Agency staffer
Thank you for sending me those files with the .xena extension. Can you please tell me what to do with them? My IT department haven’t heard of this format. Is the registry information in these files too?
Agency IT support
I’ve managed to create that search widget, but my staff have been complaining about the format of the results (some sort of XML?). Thank you for pointing me to that XML version of the database. I read the SIARD documentation and was able to reconstitute the database and generate the report I needed.
Public user
I typed “Sally Adams” into your search engine. I got some files in a .xena format and, after a lot of fussing around and Googling, I worked out how I could extract some original “files”. I guess that is it then?
Digital humanities scholar/developer
Those XML files were huge and took hours to download but I’ve managed to convert them back into a database, now I can get to work….
3. RDF/XML DB normalisation
After watching an inspirational TED lecture by Tim Berners-Lee, the digital migration team decided to migrate the entire TOS system as linked data. The team created R2RML (relational DB to RDF mapping language) mappings to convert the two proprietary databases into RDF, which they stored as RDF/XML text files. RDF was also used for metadata describing links between the registry information and the imaged documents.
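As a sketch of the kind of triples such a mapping might produce (the namespace, property names and identifiers below are invented for the example; in practice an R2RML processor, not hand-written code, would generate the output):

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Illustration only: hand-build a few triples of the sort an R2RML mapping of
# the "teacher" table might emit, then serialise them as RDF/XML.
TOS = Namespace("http://example.org/tos/")
g = Graph()
g.bind("tos", TOS)

teacher = URIRef("http://example.org/tos/teacher/12345")
g.add((teacher, RDF.type, TOS.Teacher))
g.add((teacher, TOS.name, Literal("Sally Adams")))
g.add((teacher, TOS.taughtAt, URIRef("http://example.org/tos/school/42")))
g.add((teacher, TOS.hasDocument, URIRef("http://example.org/tos/document/987654")))

g.serialize(destination="teacher_12345.rdf", format="xml")  # RDF/XML output
```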
What do our users think?
Agency staffer
Thank you for sending the imaged documents. I understand that those RDF/XML files contain information from the registry but I’m having trouble viewing them. Can you provide any assistance?
Agency IT support
Created search widget, check! I’m not very familiar with linked data but I did some research and worked out how to use your SPARQL query service to create the queries I need, nice job!
Public user
I found some documents that have been very helpful. Didn’t know what to do with that RD..? stuff… so I just ignored it.
Digital humanities scholar/developer
Linked Data, RDF, SPARQL – ticks all my boxes!
4. SQLite DB migration
After reviewing the TOS system, the digital migration team decided that the best way to maintain the integrity of the system would be to store the relational data in a relational database and preserve the links between that data and the imaged documents. Since the original proprietary databases could not be supported by the digital archives team, they decided to migrate the data in those DBs into an SQLite database (public-domain software that stores an entire database in a single file, which makes it easy to store). The imaged documents were stored separately on disk with metadata in associated RDF/XML files.
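A cut-down sketch of what the target might look like, including the sort of query Agency IT could largely reuse. The table and column names are invented for the example; a real migration would mirror the source schemas and load the existing data.

```python
import sqlite3

# Illustration only: a simplified SQLite schema keeping the registry data
# relational and linking imaged documents by file path.
conn = sqlite3.connect("tos_archive.sqlite")
conn.executescript("""
    CREATE TABLE school  (school_id INTEGER PRIMARY KEY, name TEXT, district TEXT);
    CREATE TABLE teacher (teacher_id INTEGER PRIMARY KEY, name TEXT, dob TEXT,
                          school_id INTEGER REFERENCES school(school_id));
    CREATE TABLE document (doc_id INTEGER PRIMARY KEY,
                           teacher_id INTEGER REFERENCES teacher(teacher_id),
                           doc_type TEXT, image_path TEXT);
""")

# A report query of the kind Agency IT mentions: number of teachers per district.
query = """
    SELECT school.district, COUNT(*) AS teachers
    FROM teacher JOIN school ON teacher.school_id = school.school_id
    GROUP BY school.district
"""
for district, teachers in conn.execute(query):
    print(district, teachers)

conn.close()
```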
What do our users think?
Agency staffer
Thank you for the imaged documents. I note that you’ve retained the registry information in a relational database and I’ll get my IT support to query it, thanks!
Agency IT support
Great all round thanks! I was able to mostly reuse our old SQL queries to generate the reports I needed for our annual report.
Public user
I found some documents that have been very helpful. Didn’t know what to do with that database, so I just ignored it.
Digital humanities scholar/developer
SQL: kind of traditional :), but I can work with it.
Grasping the nettle
As this scenario illustrates, finding practical solutions that suit all the potential use cases for database-backed business systems is difficult. We will have to make choices, and in some cases those choices may disappoint users. We will seek to mitigate that risk by retaining, whenever possible, original source files alongside transformed systems, but, because of software obsolescence, we won’t be able to guarantee the continued accessibility of those original source files. As Chris Hurley suggests, in a ‘digital world’, selectivity cannot be avoided:
In the pre-digital world, we were accustomed to finding a singularity and preserving that as untouched evidence of past events. In the digital world, we do not impose a multiplicity, but that is what we are going to find and a multiplication of views (not the duplication of data) may well continue long after the point where in the pre-digital world the data would have been consigned immutably to archival storage and preservation. This multiplication is the result of data-sharing and the diversity of uses to which data can be put (re-use). Under classic theory, an immutable singularity was delivered to us by transfer for preservation and explanation and we selected how much of it to preserve. In a re-imagined view the same thing happens but it arrives in the form of possibilities from among which we must now select ones that both document and preserve (giving full weight to the actual uses to which the data has been put) and support data re-use from which new additional structuration may derive that was not there at the outset.
Chris Hurley, The Hunting of the Snark (Looking for Digital “Series”), 2011, p. 13.
The bright side of this picture, as Chris’s quote indicates, is that if we do our work well then, when transforming digital recordkeeping systems, we will have the opportunity not just to maintain migrated business systems as authentic records, but also to open up those systems to new forms of use that were not even possible in their original manifestations.
Hi Richard
Great post and a great explanation of what SRNSW is doing with business systems. Really thought-provoking and an approach that we will need to consider ourselves if we have to deal with a whole lot of legacy business systems. I have one question (at the moment): you say that you will be “…retaining, whenever possible, original source files…”. In the TOS example you discuss, what do you consider to be the “original source files”?
cheers
Andrew
Hey Andrew,
thanks for commenting. You’ll have noted the ‘whenever possible’. I could have probably added an extra qualification there: ‘and to whatever extent is possible’. In some cases it may be that the size of systems or licensing restrictions prevent us from retaining any ‘original’ files at all. I also think in practice that we’ll probably not be taking in much custom application code (and if we did I imagine it would be for systems where we’d decided they had such significance in their original look and feel that we were going to use emulation to keep them running). So, in the context of business systems, ‘original files’ will probably mostly mean native database backups.
In the TOS example, I’d imagine the way we’d interpret this requirement (State Records NSW’s ‘Policy on digital records preservation’ mandates that we consider retaining originals wherever possible) would be to take native backups of the two databases along with the imaged documents.
In the pilot projects we are running we have already accepted transfers of native back-ups of MS SQL Server and Lotus Notes databases. We intend to transform those databases (we are currently still planning how) but will retain the original databases in the digital archive too.
Thanks for the additional info, Richard!
Hi Richard,
Very good read! You’ve hit the nail on the head – while we are worrying about how to accession databases, the real challenge is how to present them in a usable way to our stakeholders.
In the long term, I believe we need to see a conceptual separation between the archival format and the delivery format (as in the OAIS distinction between an AIP and a DIP). The accession/storage format should be whatever is convenient to the archive/agency. Transformations should then be applied to present a range of user experiences as required. Some of these transformations might map to predefined common views (e.g. the common cases of the public user and the legal discovery request, where both want a ‘file’). Other transformations might map to a particular representation (e.g. an SQL database or RDF). These might be done asynchronously – the user makes a request and some time later a representation is delivered.
These transformations might be created on accession based on an archival appraisal of the function of the system. They might be done post hoc based on need. They would be particularly powerful where a common transformation (e.g. conversion to RDF) could be applied to all databases in the collection.
andrew
Really nice case study Richard! Line-of-business DB systems are an important but sadly neglected marketplace for digital continuity solutions in Australia. Active and inactive DB archiving (including via the cloud) is an established market in the US but seems not to have taken off locally.
Would really like some exemplar AIPs and DIPs for the students.
Nice one.
Mark.
Great post Richard and I think it’s a great approach also. I am impressed with SRNSW’s determination to tackle this difficult and complex problem, as I’ve been fascinated for a while with how archives in general are going to tackle these problems.
Back in 2010 I presented at Metadata 2010 about the benefits and practicality of a single metadata standard for all structured data. Something like that would be useful for this situation, so that queries could be semantically standardised across data from multiple diverse business systems. In that vein, I’d be interested to hear how you are going to standardise the semantic layer across the various data sources to enable “a user [to] construct a structured query that accesses data across the whole collection”.
For example, how are you going to ensure that queries for “vehicle” include results from the right columns/rows in the various sources when “vehicle” can mean different things in the different databases?
It would also be wonderful if we lived in a world where you could provide access to both the emulated version of the business systems and the migrated data from the business systems in the future, but given the costs involved with both, your strategy seems very practical. Having said that, have you considered accessioning virtualised systems as disk images, along with the exported data, when they are already available that way from the agency?
Thanks Euan. To answer your last question first: yes, we would definitely consider taking virtualised systems from agencies as part of a migration process. In the two pilot projects we are currently dealing with where agencies are using virtualisation, this probably isn’t an option, as in both cases those agencies are having to pay fairly steep licensing and support costs to keep those systems running (those costs are what is motivating them to transfer).
Your first question is harder to answer in a short comment! This area is still work in progress. One of the tools we’ve developed as part of our project is a metadata registry. This is based on DC Application Profiles (http://dublincore.org/documents/profile-guidelines/). What we’re hoping it will do is allow us to map structured data and metadata to terms in the registry and, if there is no suitable term, allow us to progressively add new terms to the registry over time. So it is controlled but isn’t a fixed standard; it can evolve over time. Terms in the registry will all be expressed as linked data. Users wanting to perform structured searches can consult the registry to see what terms are queryable and then create structured queries using that knowledge (probably against a SPARQL endpoint). We’ll publish more information about this in the future.
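For illustration, the sort of structured query we imagine a user building from the registry might look something like this. The endpoint URL and the terms used here are entirely hypothetical; nothing like this is published yet.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical sketch: query an (as yet non-existent) digital archives SPARQL
# endpoint using terms a user has looked up in the metadata registry.
sparql = SPARQLWrapper("http://example.org/digitalarchives/sparql")
sparql.setQuery("""
    PREFIX tos: <http://example.org/tos/>
    SELECT ?teacher ?name
    WHERE {
        ?teacher a tos:Teacher ;
                 tos:name ?name .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["teacher"]["value"], binding["name"]["value"])
```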
Very interesting post Richard. As Andrew W said, your example users and their responses to the presentation formats show that the database content has to be presented in a variety of meaningful ways to a variety of future users with different skill sets, information requirements and motivations. This may vary over time. There may come a time in the not-too-distant future when SQL skills are rare, so the issue of accessibility will become even more problematic.
You’d think that some of the ways users might want to access the data in a complex database would be similar to the ways users of the business system accessed it while it was still functioning: screens giving details of business entities and the transactions against them, summary reports and so on, and search functions to look at or report on filtered subsets of particular entities.
Because of relational database design methodology, the data for a business object or entity (e.g. a customer, a document, a teacher) can be scattered over a number of tables in the database. You need to understand the database structure to bring these all together. If you are lucky, there may be SQL queries in view objects in the database layer, used to construct screens or reports, that will pull all these details together for you. But most likely some or all of that view construction would happen in the code in the business system’s application layer.
So I think that, in the absence of a working emulation of the business system, some of the views of the data that a business system constructs to present screens and reports might need to be migrated from the application-layer code (or from the system documentation, the database entity-relationship diagram, or the database programmer’s head) and inserted into the database layer as ‘view’ database objects before archiving.
Maybe we could use the SQL queries in the ‘views’ as make-table queries to create new tables in the database, gathering all the entity lookup values and transactions. We could export the view-augmented database to a SIARD XML archive to become our AIP in OAIS terms. Then XSL stylesheets could be used to render presentation formats similar to the original screens or reports without having to know the database structure or have SQL skills. And the rendering XSL could be part of the whole archive package (along with example screenshots and report samples from the original business system, documentation of the derivation of the views, and any other useful documentation).
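As a very rough sketch of that rendering step (the stylesheet and sample record below are invented, just to show the idea of turning one entity’s XML into a readable page without needing SQL):

```python
from lxml import etree

# Sketch only: apply a tiny XSL stylesheet to one teacher's XML record to
# produce an HTML "report". Real stylesheets would mirror the original screens.
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/teacher">
    <html><body>
      <h1><xsl:value-of select="name"/></h1>
      <p>School: <xsl:value-of select="school"/></p>
    </body></html>
  </xsl:template>
</xsl:stylesheet>
""")
transform = etree.XSLT(stylesheet)

record = etree.XML(
    "<teacher><name>Sally Adams</name><school>Bathurst Public School</school></teacher>"
)
print(str(transform(record)))  # HTML rendering of the record
```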
Some views might create massive tables if a lot of rows in the main entity table have a lot of transactions, so only partial reconstructions may be possible in these cases.
XSL could be used to render your linked RDF or PDF document use cases. SIARD tools can of course turn SIARD XML back into relational database form, to be accessed via SQL.
We could also decompose the view tables into a bunch of hierarchical XML files, one per entity (customer, document or teacher). We could render these in some way to make them more readable (PDF/A?) and put them into an eDRMS. Or convert the XML files to HTML, load them onto a website and run them through an indexing engine. Then we could simulate database search functions with Google-like searches, results lists and viewing of individual objects and associated digital images or documents and, dare I say, records (in the archival sense of the word).
Neal