Archive for the ‘Operational Digital Archives’ Category

National Library of Australia

The National Library of Australia (NLA) began their web archiving project, PANDORA, in 1996, and the current team consists of four members of staff. The NLA’s web archiving programme is selective, contrasting with approaches in the Scandinavian countries in particular, where the aim has been to harvest the entirety of the country’s web domain. The decision to make selective harvests only was resource driven, since there was no extra funding available, although the Library are now doing periodic .au domain harvests in conjunction with the Internet Archive. These .au domain harvests have been commissioned since 2005, and the fourth harvest is due this year. It will run over four weeks initially, and capture an anticipated 1 billion files, comprising around 40 TB of data. The Internet Archive manage the harvest and carry out a full text index; the results will be shipped to the NLA and maintained on NLA servers, although copies will also be available via the Wayback Machine (without the full text indexing).

The terminology ‘PANDORA Archive’ is acknowledged to cause some confusion, particularly within the Australian government, and the Library acknowledge that they are not in fact carrying out an archive role in the traditional sense. Rather, PANDORA is a web collection, a snapshot of a point in time, a representation of what the NLA feels is important in the Australian web domain. PANDORA doesn’t meet recordkeeping needs for recording business transactions; the websites are harvested purely for their content and there is some leeway in the accuracy of dates of collection – for example, a site will be timestamped when it is harvested, but the NLA then perform further quality controls on the harvested site which may take up to a week to complete.

That said, the websites are harvested with the intention that the NLA will attempt to keep them in perpetuity, and permission is sought from website publishers for collection, preservation and public access, with this in mind. Legal deposit legislation does not (yet) cover electronic information in Australia, and considerable effort is therefore required in obtaining the relevant permissions from the website publisher (but not from every contributor). Access restrictions are applied in certain circumstances – for instance, where there is a commercial interest involved. Access can be restricted in several different ways – (a) for a set period of time following archiving, (b) for specific dates, (c) by use of authenticated logins, (d) access restricted to one PC in the NLA’s reading room. One of the problems NLA identify with their current harvesting software is that the restriction mechanism is not sufficiently finely tuned to file level – currently access restrictions can only be specified on a whole website.
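To make the four restriction types above concrete, here is a minimal sketch of how they might be modelled. It is an illustration only, not how PANDAS actually works: the `Restriction` class, its field names and the `can_access` function are all hypothetical, and I have assumed that restriction "for specific dates" means access is blocked on those dates. Restrictions here attach to a whole title, mirroring the whole-website limitation described above.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Tuple


@dataclass
class Restriction:
    """Hypothetical title-level restriction record (not the PANDAS data model)."""
    embargo_days: int = 0                       # (a) closed for a set period after archiving
    blocked_ranges: List[Tuple[date, date]] = field(default_factory=list)  # (b) closed on specific dates (assumed meaning)
    login_required: bool = False                # (c) authenticated login needed
    reading_room_only: bool = False             # (d) only from the reading room PC


def can_access(r: Restriction, archived_on: date, today: date,
               logged_in: bool = False, in_reading_room: bool = False) -> bool:
    """Return True only if none of the configured restrictions blocks access."""
    if today < archived_on + timedelta(days=r.embargo_days):
        return False
    if any(start <= today <= end for start, end in r.blocked_ranges):
        return False
    if r.login_required and not logged_in:
        return False
    if r.reading_room_only and not in_reading_room:
        return False
    return True


# Example: a commercially sensitive title closed for 30 days, then login only.
title_rule = Restriction(embargo_days=30, login_required=True)
print(can_access(title_rule, date(2008, 1, 1), date(2008, 3, 1), logged_in=True))
```

Applying a rule like this per file rather than per title would need each file to carry its own `Restriction`, which is the finer granularity the NLA say their current software lacks.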

The selection guidelines used are under review at the moment. The current priorities include major events (for example, coverage of Australian elections) but can basically cover any original, high quality content not available in print. The websites harvested range from academic e-journals to blogs. PANDORA is just moving into Web 2.0 harvesting, although they have already captured many blogs, some MySpace pages and some online video.

A PANDORA ‘title’ might be anything from a single PDF document to a whole or part website. A particular website might also be harvested at scheduled intervals, with the time between captures depending on how regularly the site is updated, whether content is periodically removed as well as new content added, and the general stability of the organisation publishing the website. The harvest interval is re-assessed at each harvest. Currently the most common harvest intervals are between 6 months and 1 year. Organisationally, it is more efficient to carry out captures less frequently.

The PANDORA archive currently holds around 2TB of data, consisting of around 20,000 titles and 40,000 harvested instances.

Of most interest vis-a-vis local archive services in the UK, PANDORA has nine partners in State Libraries and other cultural organisations, who can define what they require to be collected via a web browser interface to PANDORA’s in-house harvesting tool, PANDAS. Librarians in partner institutions can also log in to fix minor problems with harvests or log more significant issues for the team at NLA to resolve. Most of the actual capture work, however, is carried out by the team at NLA.

Whilst the PANDORA team has a library background, it is noted that a certain level of technical skill is required. That said, other than the quality control work carried out on each harvested title, little post-processing is currently carried out specifically to promote the longevity of the stored files. Three copies are created – a preservation master (the original files as harvested), a display master (which includes any quality control changes), and a metadata master. A display copy is then generated from the display master.

Operating the Digital Archive

As previously posted, the operation of the PROV Digital Archive is well integrated into the wider organisation, with the same team responsible for transfers of both paper and digital records. This team also creates the disposal authorities (more commonly known as ‘retention schedules’ in the UK – is the different terminology significant??!) for all Agencies within the State of Victoria.

Digital records are only accepted into the Archive if they are VERS compliant, and the Agency’s recordkeeping system can produce VEOs according to the standard mandated under the Victorian (as in ‘State of…’) Public Records Act.

This is obviously a strong advantage for PROV, and not a requirement which can easily be translated into the UK local authority archives context. However, it is worth noting that despite the relative strength of their archival legislation, PROV staff still put considerable effort into consulting with Agencies and carrying out pilot transfers. The team at PROV have noticed that it is harder to encourage deposit in a digital world, whereas historically a lack of physical space for keeping records often triggered transfers to the archives. Whereas traditionally the transfer process was client driven, commencing with an Agency request, PROV are now trying to move towards a programmed transfer timetable for both paper and digital records. PROV are trying to sell this to the Agencies as being cheaper and easier than ad hoc clear-outs of records.

There are in any case many similarities in dealing with transfers of records to the archives whatever the format of the records. PROV needs to maintain intellectual control over the records series, and descriptive lists need to be produced. Background information on provenance and access arrangements or restrictions is gathered prior to transfer by PROV staff through site visits or, increasingly, formalised documentation. The Agency staff are responsible for producing a ‘manifest’ listing the records being transferred. PROV provides advice and training on the process of preparing digital records for transfer, and transfer guidelines are published on the PROV website. Digital archives may be transferred on CD, hard drive or copied remotely into the Digital Archive inbox (though few Agencies have yet taken advantage of this method of transfer, preferring to follow the paper paradigm and copy records onto CD much as they would package paper records into boxes).

The system of intellectual control (assigning of unique identifiers etc.) for digital archives follows much the same pattern as for paper records. My feeling is that Australian practice in the use of consignments and the series system makes this simpler to implement than with the UK practice using accession numbers and hierarchical cataloguing, although clearly we in the UK need to take some time, as did PROV with the revision of their Archival Control Model, to consider how to integrate digital archives into key archival processes.

Where do PROV themselves hope to see improvements? Dealing with digital has highlighted an internal need for improved written procedures for dealing with transfers, whether in paper or digital formats. New staff need to be trained to operate the Digital Archives interface (a heavily customised version of Documentum). Improved guidelines are also needed to help Agencies, and in particular Agency IT staff, who are most likely not familiar with archival practices and terminology. One of the technical support staff at PROV pointed out that ‘file’ in IT terms has potentially a completely different meaning to the archival ‘file’. Language needs to be translated into terms with which Agency staff are familiar.

Once the digital records arrive at PROV, the manifest is loaded into the Digital Archive system and checked against the records actually received. The records are checked to ensure that they are valid VEOs and that they are virus-free. Various errors can be picked up at this stage – duplicate records, extra records received or too few, problems with the digital signature etc. Simple errors can be fixed by PROV staff, but in general it has been found best to request the Agency to resubmit the whole transfer. The records remain in ‘quarantine’ for seven days, before the checking process is re-run. If successful, the transfer can be finalised and the records become viewable through the PROV online catalogue.
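The manifest check described above is essentially a reconciliation of two lists. As a rough sketch only (this is not PROV's Documentum-based system, and the `reconcile` function and the identifiers are made up for illustration), the duplicate, extra and missing cases could be detected like this. It does not attempt VEO validation, virus scanning or digital signature checks.

```python
from collections import Counter


def reconcile(manifest_ids, received_ids):
    """Compare identifiers listed on a manifest with those actually received."""
    received_counts = Counter(received_ids)
    manifest_set = set(manifest_ids)
    received_set = set(received_counts)
    return {
        # Listed on the manifest but never received ("too few").
        "missing": sorted(manifest_set - received_set),
        # Received but not listed on the manifest ("extra records").
        "extra": sorted(received_set - manifest_set),
        # Received more than once ("duplicate records").
        "duplicates": sorted(i for i, n in received_counts.items() if n > 1),
    }


# Hypothetical identifiers, purely for illustration.
report = reconcile(
    ["VEO-001", "VEO-002", "VEO-003"],
    ["VEO-001", "VEO-001", "VEO-003", "VEO-004"],
)
print(report)
```

A transfer would only move on to the quarantine stage if all three lists came back empty; anything else is the kind of error that, in PROV's experience, is usually best handled by asking the Agency to resubmit.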

The first pilot transfers to the Digital Archive took place in 2005. The largest accession so far has in fact been digital surrogates from PROV’s own digitisation programme, although another major and ongoing project is the archives of the Melbourne 2006 Commonwealth Games. This has brought its own unique challenges in working with a project organisation in the process of being wound down (for example, password protected records which cannot be processed into VEOs have had to be ignored).

Archival Support Programme

There are an estimated six to seven hundred places in the State of Victoria which hold archive collections, about 120 of which are recognised as Places of Deposit by PROV. PODs in this Australian context are “community facilities that meet the storage standards required by PROV to preserve records of significance to local communities”.

The Archival Support Programme started around ten years ago, originally as a small grants programme for archival supplies, and is run in collaboration with the Australian Society of Archivists and the National Archives of Australia. The programme takes the form of a travelling roadshow, with around four seminar topics presented each year.

This year’s programme included a roadshow seminar on “Computers and Small Archives”. This covered the basics of digitisation, designing online exhibitions, and using a computer to catalogue archival records, all focused on the kinds of practical situations likely to arise in a community archive setting. The seminar also included a session on digital preservation issues. This outlined the preservation issues of obsolescence and poor management, and encouraged communities to adopt good practice: selecting appropriate long-term preservation formats, copying media regularly, taking care with storage conditions and handling, taking periodic backups, and ensuring that documentation about the archives themselves, if maintained on a computer, can itself be exported and preserved over time. The central message is the need actively to manage digital information to ensure its continued accessibility.

The messages conveyed in this digital preservation talk are similar to those I incorporate in a WYAS presentation aimed at local Family History Societies. However, the emphasis in the PROV session on the various simple, yet effective, solutions which might be employed is striking, and is something which I will incorporate in future versions of the WYAS presentation.
