Machine Learning and Records Management September 14, 2017

Executive Summary

This paper details preliminary research into the application of machine learning technologies to records management in a digital environment.

Machine learning has the potential to automate processes such as classification and disposal. Indeed, commercial records management products that use machine learning are already available, for example to auto-classify content and improve search.

Blockers to the further uptake of this technology include a lack of compelling case studies, the cost and time needed to configure machine learning solutions, and the difficulty of integrating the technology with complex tools such as retention and disposal authorities.

State Archives and Records NSW is exploring options to advance the use of machine learning for the automation of records classification and disposal. We will run pilots to assess the technology’s capabilities in sentencing unstructured data. We will be seeking partnerships for an agency pilot and will also run an internal pilot using in-house data. The results of these exercises will be shared as case studies and will inform further work, potentially including changes to our own processes and instruments. We will collaborate with other Australasian jurisdictions through ADRI on such changes, including the development of smarter retention and disposal authorities.


Image credit: James Lappin, “Machine Learning for Records Management”, 2 Oct 2013


The application of machine learning to records management offers possibilities for the seamless classification and disposal of records. It sounds ideal … just let business get on with their work and let a machine look after the recordkeeping. We might not be there yet, but there are some great tools available now and innovation is happening in the application of machine learning to records management. That said, there will be a continued need for human intervention, for a while at least!

This post explores the ‘state of the art’ (what are the latest developments in machine learning technology that will impact classification and disposal) and the ‘state of play’ (which organisations in NSW, Australia, and around the world, are experimenting with machine learning for records management and what tools are already available). It identifies blockers to the adoption of machine learning for records management and proposes some steps that State Archives and Records NSW can take to move things forward.

To approach this subject, I’ve delved into a slice of the vast literature, ranging from IT and data magazine articles and blogs to academic literature, theory and even philosophy on the subject of realism. Fei-Fei Li (one of the leading minds in the field of AI) argues that AI and machine learning should be democratised and made available to all, hence her sabbatical from Stanford University to Google. Things certainly seem to be heading that way, as there are many free online courses available to get you started. In the records management space, a number of other jurisdictions including NAA, TNA and NARA have explored the topic and published research reports. I put a call out on our blog FutureProof for any local recordkeepers using the technology, and this went out on various Twitter feeds. The Twitter call drew definitely more response and interest than the blog post. I contacted as many commercial providers in the market as I could identify to set up face-to-face meetings. I interviewed a wide range of sources that included vendors, data scientists, an international recordkeeping blogger and fellow archivists.

Types of machine learning

Machine learning is a branch of artificial intelligence (AI) research and represents the capacity of computers to learn without being explicitly programmed. Machine learning involves computers taking data and algorithms as inputs to develop models or algorithms that apply this learning to novel data.

There are many types of machine learning but the two main categories of machine learning algorithms are supervised learning and unsupervised learning.

Supervised learning is when you have a dataset that has the required output or result (i.e. labelled/classified data). This is called training data (or seed). A model built from the seed is then checked against a testing set of data. It is from here that tweaking or weighting occurs, then more testing, resulting in the formulation of the function which can then be used on the whole dataset. Once exposed to similar data, the algorithm is able to predictively output the appropriate results.
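To make the seed, test and predict cycle a little more concrete, here is a minimal sketch in Python using only the standard library. It trains a small naive Bayes text classifier on a handful of labelled record titles, checks it against a held-out testing set, and then predicts a class for a new title. The record titles, the class names ("finance", "personnel") and the function names are all invented for illustration; a real pilot would need far more data and would probably use an established library.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(labelled):
    """Count words per class from (text, label) pairs: the 'seed'."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in labelled:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the highest naive Bayes log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labelled seed data and a small held-out testing set.
seed = [
    ("invoice payment supplier march", "finance"),
    ("purchase order invoice approved", "finance"),
    ("budget forecast quarterly payment", "finance"),
    ("leave application annual staff", "personnel"),
    ("staff recruitment interview panel", "personnel"),
    ("performance review staff member", "personnel"),
]
test_set = [
    ("supplier invoice overdue", "finance"),
    ("staff leave request", "personnel"),
]

model = train(seed)
correct = sum(predict(model, text) == label for text, label in test_set)
accuracy = correct / len(test_set)
prediction = predict(model, "invoice for catering payment")
print(accuracy, prediction)
```

If accuracy on the testing set is poor, this is the point where the tweaking happens: more or better seed data, different features, or a different algorithm, followed by another round of testing.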

Unsupervised learning is when you don’t have the required output (seed), so the algorithm clumps like data together. This also means there is no labelled evaluation set against which to measure accuracy.
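As a rough illustration of "clumping like data together", the sketch below groups record titles by how many words they share, with no labels involved. It is a deliberately simple greedy grouping based on Jaccard similarity rather than a production clustering algorithm, and the titles, the function names and the 0.2 threshold are assumptions made up for the example.

```python
def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def cluster(texts, threshold=0.2):
    """Greedily group texts: each text joins the first group whose
    first member is similar enough, otherwise it starts a new group."""
    clusters = []  # each group is a list of (text, token_set)
    for text in texts:
        tokens = set(text.lower().split())
        for group in clusters:
            if jaccard(tokens, group[0][1]) >= threshold:
                group.append((text, tokens))
                break
        else:
            clusters.append([(text, tokens)])
    return [[text for text, _ in group] for group in clusters]

# Hypothetical unlabelled record titles.
texts = [
    "invoice payment supplier march",
    "staff leave application annual",
    "supplier invoice overdue payment",
    "annual leave request staff",
    "board meeting minutes",
]

groups = cluster(texts)
for group in groups:
    print(group)
```

Note that the output is just groups of similar titles. Nothing tells us what each group means or whether the grouping is right, which is why a human still has to interpret the result.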

Neither of these approaches entirely removes the need for human input:

Machine learning still requires a human to identify a prediction task, choose a competent ML algorithm, specify relevant inputs and outputs, and procure adequate training and an experimental data set. (Giles Hooker, Cliff Hooker)

Pedro Domingos highlights the learning in machine learning, arguing that “the need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. What it does is get more from less… Learning is like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.” [1] In a similar vein, machine learning has recently been called ‘Computer Kaizen’, likening the technology to the Japanese discipline of iterative continuous improvement. Might we see what ‘Kaizen’ did to manufacturing and mass production happen to digital business workflows and processes, which, once automated, will keep making continuous improvements? [2]

The Technology

Cloud computing and the accumulation of big data sets have underpinned the recent acceleration in the fields of AI and machine learning. Advances in areas such as search, discoverability and product recommendations are being led by the major players: Google, Amazon (AWS), and Microsoft (Azure). The chief consumers of this technology have been in the areas where large volumes of data are available to apply the machine learning algorithms. These include medical images/data, language and speech translation, banking corporations looking at fraud and international money laundering, litigation software, and governments looking for suspected terrorist movements internationally on the dark web.

These rapid advances pose regulatory challenges to governments around the world. This passage from the September 2016 “Artificial Intelligence and life in 2030” report describes some of the implications for government: “Given the speed with which AI technologies are being realized, and concomitant concerns about their implications, the Study Panel recommends that all layers of government acquire technical expertise in AI. Further, research on the fairness, security, privacy, and societal implications of AI systems should be encouraged by removing impediments and increasing private and public spending to support it. And though AI algorithms may be capable of making less biased decisions than a typical person, it remains a deep technical challenge to ensure that the data that inform AI-based decisions can be kept free from biases that could lead to discrimination based on race, sexual orientation, or other factors. In AI, too, regulators can strengthen a virtuous cycle of activity involving internal and external accountability, transparency, and professionalization, rather than narrow compliance.”[3]

Applying machine learning to records management

Machine learning technology has a number of potential applications in the field of records management.

At its simplest level, machine learning might just provide incremental improvements to existing technologies, such as improving the accuracy of search and discovery by better classification of content. The TNA recently produced a research report that explored technology that could assist with born-digital records collections in the areas of appraisal, selection and sensitivity review processes when transferring records to the TNA. This trial focused on eDiscovery tools (software designed for the discovery or disclosure of digital information for the purposes of lawsuits).

The report concludes that eDiscovery software can assist and support government departments during appraisal, selection and sensitivity review as part of digital transfers. However, it also notes that there is no completely automated solution (i.e. ‘eDiscovery tools and processes are not a silver bullet that will provide an immediate out-of-the-box solution’) and human input is still required at all stages.

In the future, machine learning might assist in ‘cleaning up’ unstructured stores of records (such as network drives and email systems) by automating classification and sentencing. In the best scenario, such a clean-up tool might be fully automated and trusted. Or it may be fuzzy, handling around 80% of records reasonably well but requiring human intervention for the more complex cases.
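The "fuzzy" scenario could be sketched as a simple confidence threshold: records the classifier is sure about are handled automatically, and the rest are queued for a person. The sketch below assumes some classifier has already produced a predicted class and a confidence score for each record; the record identifiers, scores, class names and the 0.8 threshold are all hypothetical.

```python
def triage(predictions, threshold=0.8):
    """Split (record_id, predicted_class, confidence) tuples into an
    automated queue and a human review queue."""
    automated, needs_review = [], []
    for record_id, predicted_class, confidence in predictions:
        if confidence >= threshold:
            automated.append((record_id, predicted_class))
        else:
            needs_review.append((record_id, predicted_class))
    return automated, needs_review

# Hypothetical classifier output for five records.
predictions = [
    ("doc-001", "finance", 0.97),
    ("doc-002", "personnel", 0.91),
    ("doc-003", "finance", 0.55),
    ("doc-004", "personnel", 0.88),
    ("doc-005", "finance", 0.83),
]

automated, needs_review = triage(predictions)
automation_rate = len(automated) / len(predictions)
print(automation_rate, needs_review)
```

Where to set the threshold is a risk decision rather than a technical one: a higher value sends more records to human review, a lower value automates more but trusts the machine further.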

Finally, machine learning might also be applied in greenfield systems, prompting users who are creating new digital records on how to classify and sentence those records – think Siri for records management!

Commercial Solutions

Machine learning is very topical and there is a lot of research and development underway in the commercial sector. In the meantime, the working environment is in flux, with many vendors still supporting businesses to move from paper environments to digital, from unstructured systems such as shared drives to more controlled records management systems, and most recently to cloud storage and applications.

Several vendors in the ECM/EDRMS space already offer machine learning products. On the market today are powerful commercial tools that facilitate analytic search and discovery of content. Many of these use machine learning to identify patterns and classify data. Business-specific workflows for task-oriented recordkeeping are also well developed within EDRMS products. However, auto-suggestive filing still relies on recently used common filing areas or the filing areas of the local working group. The next step of supporting auto-classification for records retention remains on the horizon.

According to Gartner’s Hype Cycle for Emerging Technologies, machine learning has reached the peak of inflated expectations and is 2 to 5 years from mainstream adoption. Over the next few years we should expect the application of machine learning in records management to mature, particularly in tackling the “low hanging fruit”: around 80–90% of records creation is a repetitive activity, strongly linked to roles and/or business functions, and these records could conceivably be fully managed by automation, leaving 10–20% of records needing human intervention for capture, classification and disposal.

The Blockers

During the course of my research into machine learning I attended a vendor-sponsored information conference along with a couple of hundred other information professionals, and the question was asked, “Who is using machine learning?” No hands went up. A second question was asked, “Who is including machine learning in their strategic planning?” One hand went up.

So what is preventing uptake of this technology?

The technology remains in the “vendor space” rather than the “practitioner’s space”: it is available for purchase but no one is currently buying. Proven use cases that illustrate how well the tools work and define the risks involved are required before we will see any major uptake of the technology. It is the early adopter problem: no one wants to go first, as the technology needs to be proven before it can be trusted. Published use cases that show a failsafe implementation are what is required before mainstream adoption is seen.

Another obstacle to adoption is that machine learning products are not press-button solutions. The expertise and the time required to curate seed data sets and to train the machine learning systems are a barrier. The need for large data sets and sufficient compute resources could also be a barrier, especially for smaller agencies.

We must also reflect on whether the tools we currently use for records management are still fit for purpose. Can they be readily integrated with machine learning technologies? Our retention and disposal authorities in their current form are primarily written for a human mind. When applying these authorities to sentence records, familiarity with the content and knowledge of relevant context are essential. Some classes require deciphering what is a ‘significant’ or ‘major’ record from a class of records. Common sense also plays a part in sentencing records. We have claimed that our retention and disposal authorities are format neutral. But are they really? We are now definitely leaving the paper world behind and heading straight into the digital sphere. Is it time to look at appraisal in a purely digital context?

Where do we go from here?

The following recommendations set out the steps we at State Archives and Records NSW will take to help move things forward and support agencies to utilise machine learning technologies as they develop in the area of records management:

A research group within the Australasian Digital Recordkeeping Initiative (ADRI) is being established to look at the automation of disposal, including the implications of machine learning for the development of retention and disposal authorities. We’ll collaborate with this group to share knowledge, case studies, code and scripts, and to explore new forms of disposal authorisation.

We will be exploring partnering opportunities with public offices, universities, and commercial vendors to work on proof-of-concept projects where machine learning can provide tangible records classification/disposal solutions.

We will undertake an internal proof-of-concept project that will seek to use machine learning to apply GA28 to a corpus of digital records which have already been sentenced manually.
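One way a proof of concept like this might be scored is to compare the machine's sentencing with the existing manual sentencing, both overall and class by class. The sketch below is only an illustration of that idea: the function name is my own, and the class identifiers ("class-A" and so on) are placeholders rather than real GA28 entries.

```python
from collections import Counter

def agreement_report(manual, predicted):
    """Return overall agreement and per-class agreement between
    manual sentencing and predicted sentencing."""
    if len(manual) != len(predicted):
        raise ValueError("manual and predicted must be the same length")
    if not manual:
        raise ValueError("no records to compare")
    totals, matches = Counter(), Counter()
    for manual_class, predicted_class in zip(manual, predicted):
        totals[manual_class] += 1
        if manual_class == predicted_class:
            matches[manual_class] += 1
    overall = sum(matches.values()) / len(manual)
    per_class = {c: matches[c] / totals[c] for c in totals}
    return overall, per_class

# Hypothetical results for six records that were sentenced by hand.
manual = ["class-A", "class-A", "class-B", "class-B", "class-B", "class-C"]
predicted = ["class-A", "class-B", "class-B", "class-B", "class-B", "class-A"]

overall, per_class = agreement_report(manual, predicted)
print(round(overall, 2), per_class)
```

The per-class figures matter as much as the overall one: a model can agree with most manual decisions while still getting a small but important class, such as records required as State archives, consistently wrong.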

“Nearly there, but not quite.”

At the moment machine learning is a capability and not a “silver bullet”, but it has arrived and it will produce many changes in our working environments in the years to come. Our goal is good recordkeeping, and to achieve it we must move with, and learn, new technologies, utilising them to achieve efficiencies where they are available and ensuring that the accountability of our systems and processes is not compromised. The discussion has just begun and we will be engaged in it.


I would like to thank and acknowledge the following people for their time and willingness to share ideas and views that have influenced this blog:

Sonya Sherman (Data and Information Strategist, Objective Corporation)

Jon Palin (Chief Technology Officer, Objective Corporation)

Andre Rootes (General Manager, Ephesoft)

David Gaffy (Sales Manager, Informotion)

James Lappin (International recordkeeping blogger – thank you for the skype meeting and the cartoon images)

Peter Chiu, Khimji Vaghjiani Prakash Soumya (NSW Data Analytics Centre)

Tatiana Antsoupova and members from the National Archives of Australia appraisal and information governance teams

Richard Lehane (Manager, Digital Archives, State Archives and Records NSW)

Peri Stewart, Angela McGing (Government Recordkeeping, State Archives and Records NSW)


[1] Pedro Domingos, “A few useful things to know about machine learning”

[2] Hal R. Varian, “Beyond big data,” Business Economics, 2014, Volume 49, Number 1, pp. 27–31.



Cameron Stewart June 22nd, 2018

Is it possible to publish the code to GitHub?

Irene Chymyn June 22nd, 2018

We will be publishing the code in the next post which will be coming out soon. Watch this space!
