Archive for September, 2008

This was my most challenging (in a thought-provoking way) visit so far. The Washington State (upper left hand corner of the US, for those whose geography is as hazy as mine was!!) Digital Archives doesn’t seem to be terribly well known in the UK, and I’d certainly recommend colleagues have a look at their website, particularly some of the background documents in the About Us section. The Center for Technology in Government’s case study on the public value of returns to government resulting from the Digital Archives investment (available at http://www.ctg.albany.edu/publications/reports/proi_case_washington?chapter=1) is also well worth reading.

Why was the visit challenging? Well, essentially because this digital repository has been largely conceived and operated as an IT development project, and more recently as a business service and disaster recovery facility for creating agencies and departments within state and local government in Washington (and in this sense has certain parallels with TNA’s Digital Continuity Project). Microsoft, being in Washington’s backyard in Seattle, also have a not inconsiderable influence, and the Digital Archives staff have a strong working relationship with, and level of support from, Microsoft.

Quite a contrast, then, from the Australian operational repositories, whose workflows are firmly rooted in archival and recordkeeping paradigms, often with a strong commitment to the use of open source software and XML open standards.

Initially, I found this approach very difficult to grasp, and indeed the staff at the State Archives freely admit that future developments of the Digital Archives will require a greater degree of partnership between archival staff and technologists. As I learnt more about the detail of the Digital Archives operation, however, I began to see both parallels with other digital archive operations (for instance, in maintaining authenticity and safe transfer of custody of files by means of sealed hard drives and secure FTP transfer) and ways in which the greater level of IT input into this Digital Archives has enabled extremely high levels of automation and efficiency in processing and searching.

The current run rate for ingest of single page TIFF images is over a million a day; use of the website (boosted by the decision to concentrate initially on the ingest of digitised birth, marriage and death records) runs at a level of around fifty to sixty thousand uses a day.

I still struggle, from a conceptual archival point of view, with the way in which different record series are merged together for access, and would hope to see a greater degree of contextual information in series descriptions in the future, although I can understand the processing efficiencies gained through only having to manage the one large database. The approach really makes you think about which of your archival assumptions are vital theoretical foundations for facilitating secondary use of archival resources, and which are merely legacies of a paper world.

Washington will also be of interest to those colleagues who would like to see regional partnerships of digital archives develop in the UK. Washington is leading one of the current round of NDIIPP projects to develop a centralized multi-state digital preservation consortium so that other States in the US can benefit from the expertise and workflows developed in Washington. Further details are available from the project website.

Probably the best posting I can make on the Internet Archive, based in San Francisco, is to encourage colleagues to have a look at their Archive-It subscription service, and perhaps attend a free webinar about the tool (details on the site) or at least have a look through some of the collections from partner institutions in US State Archives.

Although not yet listed on the site, some UK colleagues are already experimenting with the tool. The Internet Archive offers full hosting and storage, or can also ship the results of the web crawl back to the partner institution – as they will be doing for the major full Australian domain web crawl for the National Library of Australia, which had just completed at the time of my visit. The IA is also working with LOCKSS for storage of harvested websites, and hoping to work with the digital repository software platforms DSpace and Fedora. Tools to enable more sophisticated pre-crawl scoping and to bookmark potential sites of interest before harvesting are also due for release soon.

I hope many archivists working in local authority services in the UK have completed the survey on digital preservation which is currently running. The results of the survey will be fed into an open consultation event to be held at The National Archives on 12 November 2008. Those of us who have been working on the survey and event planning are hoping that this will provide a first step towards a new alliance of interested organisations in the UK to co-ordinate action on digital preservation.

Throughout my Fellowship, I will be encountering examples of successful partnerships which have attempted to address the challenges of digital preservation.  In Australia, I was particularly interested in two partnerships – the Australasian Digital Recordkeeping Initiative (ADRI) and the Australian Partnership for Sustainable Repositories (APSR).

ADRI is the partnership most immediately applicable to the local authority sector in the UK, as it is an initiative formed solely of public record keeping authorities across Australia and New Zealand.  Both initiatives, however, identified similar strengths and aims:

  • enabling information sharing on best practice
  • offering encouragement, support and reassurance to practitioners (archivists, librarians) and external stakeholders (eg record creators, users, government) alike
  • identifying areas of joint interest
  • providing a framework for recognition of partners’ work on new models and paradigms for digital preservation (eg testbed software solutions, model business cases, proposed standards for digital preservation)

Both projects also rely heavily on practical contributions from their member organisations, yet emphasise that these are generally projects which the members would be committed to doing anyway. The benefit to the community comes from pooling these resources towards a common Australasian approach to digital preservation and access within their respective communities (public records bodies in the case of ADRI; University Libraries in the case of APSR).

One noteworthy factor about several of the digital preservation initiatives I’m visiting during my Churchill Fellowship is how each approach is underpinned by a certain philosophical world view.

For NLA, a key challenge for the digital preservation community is sustainability:

  • The community needs to know as much about routes which haven’t worked as those which have.
  • How do the parts of the preservation puzzle fit together?  Which parts of the puzzle have still to be solved?  How do we co-ordinate the game?
  • Could we make better use of informal knowledge from enthusiasts?  We should recognise that we can’t be experts in everything (and that we can’t preserve everything – a principle most archivists should be happy enough with).
  • Perhaps we are better at digital preservation than we think we are, but merely lack confidence in presenting this to management.

In my previous post, I recommended working with depositors to explain the issues of digital preservation and to suggest simple steps for creating and curating digital records with a view to their long-term preservability.

I guess it would be correct to say though that many local archives staff do not feel confident in giving such advice. Although many UK local archives have been involved in digitisation programmes, much of this work has been outsourced and the funding has rarely extended to longer-term preservation of the digital assets created. In all likelihood, the most pressing digital preservation issue facing most UK local archive services is in fact an ever-growing output of CDs and DVDs from their own digitisation initiatives.

The NLA’s Digitisation of Heritage Materials training course is designed to help organisations with very limited resources design and run an in-house digitisation programme, using free or inexpensive software and hardware. It will be of interest to many UK local authority archive services for just these reasons. However, because it also covers sustainability issues – image file formats suitable for long-term preservation, data storage and backup, legal issues etc. – the course might equally well serve as an introduction to the digital preservation of images. It even includes some free software.

Well worth a look.

National Library of Australia

The National Library of Australia (NLA) began their web archiving project, PANDORA, in 1996, and the current team consists of four members of staff. The NLA’s web archiving programme is selective, contrasting with approaches in the Scandinavian countries in particular where the aim has been to harvest the entirety of the country’s web domain. The decision to make selective harvests only was resource driven, since there was no extra funding available, although the Library are now doing periodic .au domain harvests in conjunction with the Internet Archive. These domain harvests have been commissioned since 2005, and the 4th harvest is due this year. It will run over 4 weeks initially, and capture an anticipated 1 billion files, comprising around 40 TB of data. The Internet Archive manage the harvest and carry out a full text index; the results will be shipped to NLA and maintained on NLA servers, although copies will also be available via the Wayback Machine (without the full text indexing).
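As a quick sense of scale, the anticipated harvest figures quoted above imply a fairly small average file size. The sketch below simply divides one figure by the other; it assumes decimal terabytes (10^12 bytes), and the function name is made up for illustration.

```python
# Anticipated figures quoted in the post (estimates, not measured results).
ANTICIPATED_FILES = 1_000_000_000
ANTICIPATED_BYTES = 40 * 10**12  # assumption: "40 TB" means decimal terabytes


def average_file_size_kb(total_bytes: int, file_count: int) -> float:
    """Return the mean file size in kilobytes (1 KB = 1000 bytes)."""
    return total_bytes / file_count / 1000


avg_kb = average_file_size_kb(ANTICIPATED_BYTES, ANTICIPATED_FILES)
print(f"Average file size: about {avg_kb:.0f} KB")
```

In other words, roughly 40 KB per file on average, which is plausible for a crawl dominated by HTML pages and small images.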

The terminology ‘PANDORA Archive’ is acknowledged to cause some confusion, particularly within the Australian government, and the Library acknowledge that they are not in fact carrying out an archive role in the traditional sense. Rather, PANDORA is a web collection, a snapshot of a point in time, a representation of what the NLA feels is important in the Australian web domain. PANDORA doesn’t meet recordkeeping needs for recording business transactions; the websites are harvested purely for their content and there is some leeway in the accuracy of dates of collection – for example, a site will be timestamped when it is harvested, but the NLA then perform further quality controls on the harvested site which may take up to a week to complete.

That said, the websites are harvested with the intention that the NLA will attempt to keep them in perpetuity, and permission is sought from website publishers for collection, preservation and public access, with this in mind. Legal deposit legislation does not (yet) cover electronic information in Australia, and considerable effort is therefore required in obtaining the relevant permissions from the website publisher (but not from every contributor). Access restrictions are applied in certain circumstances – for instance, where there is a commercial interest involved. Access can be restricted in several different ways – (a) for a set period of time following archiving, (b) for specific dates, (c) by use of authenticated logins, (d) access restricted to one PC in the NLA’s reading room. One of the problems NLA identify with their current harvesting software is that the restriction mechanism is not sufficiently finely tuned to file level – currently access restrictions can only be specified on a whole website.
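The four restriction types listed above could be modelled roughly as follows. This is only an illustrative sketch, not how PANDORA's software actually works: every class, field and function name here is hypothetical, and option (b) is interpreted (as an assumption) as a date range during which access is closed. Note that the check takes no file path, mirroring the limitation that restrictions currently apply to a whole website.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class RestrictionKind(Enum):
    PERIOD_AFTER_ARCHIVING = "a"  # closed for a set period following archiving
    SPECIFIC_DATES = "b"          # closed between two dates (assumed meaning)
    AUTHENTICATED_LOGIN = "c"     # open only to logged-in users
    READING_ROOM_PC = "d"         # open only from the reading room PC


@dataclass
class Restriction:
    kind: RestrictionKind
    embargo_days: int = 0
    closed_from: Optional[date] = None
    closed_until: Optional[date] = None


@dataclass
class AccessRequest:
    on: date
    logged_in: bool = False
    from_reading_room_pc: bool = False


def is_accessible(restriction: Restriction, archived_on: date,
                  request: AccessRequest) -> bool:
    """Decide access for a whole harvested website (no file-level rules)."""
    kind = restriction.kind
    if kind is RestrictionKind.PERIOD_AFTER_ARCHIVING:
        opens_on = archived_on + timedelta(days=restriction.embargo_days)
        return request.on >= opens_on
    if kind is RestrictionKind.SPECIFIC_DATES:
        if restriction.closed_from is None or restriction.closed_until is None:
            raise ValueError("SPECIFIC_DATES needs closed_from and closed_until")
        return not (restriction.closed_from <= request.on <= restriction.closed_until)
    if kind is RestrictionKind.AUTHENTICATED_LOGIN:
        return request.logged_in
    return request.from_reading_room_pc


embargo = Restriction(RestrictionKind.PERIOD_AFTER_ARCHIVING, embargo_days=365)
print(is_accessible(embargo, date(2008, 1, 1), AccessRequest(on=date(2008, 6, 1))))  # False
```

File-level restriction, which the NLA would like to have, would need an extra argument identifying the file within the harvested site and a way of storing rules per file rather than per title.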

The selection guidelines used are under review at the moment. The current priorities include major events (for example, coverage of Australian elections) but can basically cover any original, high quality content not available in print. The websites harvested range from academic e-journals to blogs. PANDORA is just moving into Web 2.0 harvesting, although they have already captured many blogs, some MySpace pages and some online video.

A PANDORA ‘title’ might be anything from a single PDF document to a whole or part website. A particular website might also be harvested at scheduled intervals, with the time between captures depending on how regularly the site is updated, whether content is periodically removed as well as new content added, and the general stability of the organisation publishing the website. The harvest interval is re-assessed at each harvest. Currently the most common harvest intervals are between 6 months and 1 year. Organisationally, it is more efficient to carry out captures less frequently.

The PANDORA archive currently holds around 2TB of data, consisting of around 20,000 titles and 40,000 harvested instances.

Of most interest vis-a-vis local archive services in the UK, PANDORA has nine partners in State Libraries and other cultural organisations, who can define what they require to be collected via a web browser interface to PANDORA’s in-house harvesting tool, PANDAS. Librarians in partner institutions can also log in to fix minor problems with harvests or log more significant issues for the team at NLA to resolve. Most of the actual capture work, however, is carried out by the team at NLA.

Whilst the PANDORA team has a library background, it is noted that a certain level of technical skill is required. That said, other than the quality control work carried out on each harvested title, little post-processing is currently carried out specifically to promote the longevity of the stored files. Three copies are created – a preservation master (the original files as harvested), a display master (which includes any quality control changes), and a metadata master. A display copy is then generated from the display master.

Operating the Digital Archive

As previously posted, the operation of the PROV Digital Archive is well integrated into the wider organisation, with the same team responsible for transfers of both paper and digital records. This team also creates the disposal authorities (more commonly known as ‘retention schedules’ in the UK – is the different terminology significant??!) for all Agencies within the State of Victoria.

Digital records are only accepted into the Archive if they are VERS compliant, and the Agency’s recordkeeping system can produce VEOs according to the standard mandated under the Victorian (as in ‘State of…’) Public Records Act.

This is obviously a strong advantage for PROV, and not a requirement which can easily be translated into the UK local authority archives context. However, it is worth noting that despite the relative strength of their archival legislation, PROV staff still put considerable effort into consulting with Agencies and carrying out pilot transfers. The team at PROV have noticed that it is harder to encourage deposit in a digital world, whereas historically a lack of physical space for keeping records often triggered transfers to the archives. Whereas traditionally the transfer process was client driven, commencing with an Agency request, PROV are now trying to move towards a programmed transfer timetable for both paper and digital records. PROV are trying to sell this to the Agencies as being cheaper and easier than ad hoc clear-outs of records.

There are in any case many similarities in dealing with transfers of records to the archives whatever the format of the records. PROV needs to maintain intellectual control over the records series, and descriptive lists need to be produced. Background information on provenance and access arrangements or restrictions is gathered prior to transfer by PROV staff through site visits or, increasingly, formalised documentation. The Agency staff are responsible for producing a ‘manifest’ listing the records being transferred. PROV provides advice and training on the process of preparing digital records for transfer, and transfer guidelines are published on the PROV website. Digital archives may be transferred on CD, hard drive or copied remotely into the Digital Archive inbox (though few Agencies have yet taken advantage of this method of transfer, preferring to follow the paper paradigm and copy records onto CD much as they would package paper records into boxes).

The system of intellectual control (assigning of unique identifiers etc.) for digital archives follows much the same pattern as for paper records. My feeling is that Australian practice in the use of consignments and the series system makes this simpler to implement than with the UK practice using accession numbers and hierarchical cataloguing, although clearly we in the UK need to take some time, as did PROV with the revision of their Archival Control Model, to consider how to integrate digital archives into key archival processes.

Where do PROV themselves hope to see improvements? Dealing with digital has highlighted an internal need for improved written procedures for dealing with transfers, whether in paper or digital formats. New staff need to be trained to operate the Digital Archives interface (a heavily customised version of Documentum). Improved guidelines are also needed to help Agencies, and in particular Agency IT staff who are most likely not familiar with archival practices and terminology. One of the technical support staff at PROV pointed out that ‘file’ in IT terms has potentially a completely different meaning to the archival ‘file’. Language needs to be translated into terms with which Agency staff are familiar.

Once the digital records arrive at PROV, the manifest is loaded into the Digital Archive system and checked against the records actually received. The records are checked to ensure that they are valid VEOs and that they are virus-free. Various errors can be picked up at this stage – duplicate records, extra records received or too few, problems with the digital signature etc. Simple errors can be fixed by PROV staff, but in general it has been found best to request the Agency to resubmit the whole transfer. The records remain in ‘quarantine’ for seven days, before the checking process is re-run. If successful, the transfer can be finalised and the records become viewable through the PROV online catalogue.
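The manifest check described above can be sketched roughly as below. This is an illustration under stated assumptions, not PROV's actual system: the names `ManifestReport` and `check_transfer` are hypothetical, records are reduced to plain identifier strings, and the sketch covers only the counting errors mentioned (missing, extra and duplicate records). VEO validation, digital signature checks and virus scanning are out of scope here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ManifestReport:
    missing: List[str]     # listed on the manifest but not received
    extra: List[str]       # received but not listed on the manifest
    duplicates: List[str]  # received more than once

    @property
    def ok(self) -> bool:
        """True only when the transfer matches the manifest exactly."""
        return not (self.missing or self.extra or self.duplicates)


def check_transfer(manifest_ids: Iterable[str],
                   received_ids: Iterable[str]) -> ManifestReport:
    """Compare the Agency's manifest against the records actually received."""
    listed = set(manifest_ids)
    counts = Counter(received_ids)
    received = set(counts)
    return ManifestReport(
        missing=sorted(listed - received),
        extra=sorted(received - listed),
        duplicates=sorted(rid for rid, n in counts.items() if n > 1),
    )


report = check_transfer(["rec-1", "rec-2", "rec-3"], ["rec-1", "rec-1", "rec-4"])
print(report.ok, report.missing, report.extra, report.duplicates)
```

Because the same function can simply be called again, it also fits the quarantine step: the identical check would be re-run after the seven days before the transfer is finalised.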

The first pilot transfers to the Digital Archive took place in 2005. The largest accession so far has in fact been digital surrogates from PROV’s own digitisation programme, although another major and ongoing project is the archives of the Melbourne 2006 Commonwealth Games. This has brought its own unique challenges in working with a project organisation in the process of being wound down (for example, password protected records which cannot be processed into VEOs have had to be ignored).

Some interesting comments from senior figures at IBM and PGP reported today following the announcement of a US$100,000 donation to the UK National Museum of Computing at Bletchley Park.

As I have looked at how computing history and computing museums might stimulate an interest in digital preservation issues, so this money has been donated in the hope that the museum will help “engage new generations in the next stage of technological evolution by encouraging them not to take computers for granted”.

The full story can be found at http://news.bbc.co.uk/1/hi/technology/7604762.stm.
