Posts Tagged ‘Web Archiving’

Finally getting around to posting a little something about the web archiving conference held at the British Library a couple of weeks ago.

From a local archives perspective, it was particularly interesting to hear a number of presenters acknowledge the complexity and cost of implementing and using currently available web archiving tools.  Richard Davis, talking about the ArchivePress blog archiving project, went so far as to argue that applying these tools to blogs was using a ‘hammer to crack a nut’, and we’ll certainly be keeping an eye out at West Yorkshire Archive Service for potential new use cases for ArchivePress’s feed-focused methodology and tools.  ArchivePress should really appeal to my fellow local authority archivist colleague Alan, who is always on the look-out for self-sufficiency in digital preservation solutions.
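
For readers unfamiliar with the feed-based approach, the idea is roughly this: instead of crawling a whole site, you harvest the content a blog already syndicates through its RSS or Atom feed.  A minimal sketch in Python – the sample feed and field names are invented for illustration, and this is not ArchivePress’s actual code:

```python
# Feed-based capture: parse a blog's RSS feed and keep each post's
# title, link, date and content, rather than crawling the whole site.
import xml.etree.ElementTree as ET

SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Archive Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/2009/01/first-post</link>
      <pubDate>Mon, 05 Jan 2009 09:00:00 GMT</pubDate>
      <description>Full text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2009/02/second-post</link>
      <pubDate>Tue, 03 Feb 2009 09:00:00 GMT</pubDate>
      <description>Full text of the second post.</description>
    </item>
  </channel>
</rss>"""

def harvest_feed(feed_xml):
    """Extract each post's title, link, date and content from an RSS feed."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        posts.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "date": item.findtext("pubDate"),
            "content": item.findtext("description"),
        })
    return posts

posts = harvest_feed(SAMPLE_FEED)
print(len(posts))          # 2
print(posts[0]["title"])   # First post
```

In a real harvester the feed would be fetched over HTTP on a schedule, but the appeal for small services is exactly this simplicity: well-structured XML in, discrete posts out.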

I also noted Jeffrey van der Hoeven’s suggestion that smaller archives might in future be able to benefit from the online GRATE (Global Remote Access to Emulation Services) tool developed as part of the Planets project, offering emulation over the internet through a browser without the need to install any software locally.

Permission to harvest websites, particularly in the absence of updated legal deposit legislation in the UK, was another theme which kept cropping up throughout the day.  Here, then, is a good immediate opportunity for local archivists to get involved by suggesting sites for the UK Web Archive, making the most of our local networks of contacts.  I still think, though, that there is a gap in the European web archiving community for an Archive-It-type service that would enable local archivists to scope and run their own crawls to capture at-risk sites at sometimes very short notice, as we had to at West Yorkshire Archive Service with the MLA Yorkshire website.

Archivists do not (or should not) see websites in isolation – they are usually one part of a much wider organisational archival legacy.  To my mind, the ‘web archiving’ community is at present too heavily influenced by a library model and mindset, which concentrates on thematic content and pays too little attention to more archival concerns, such as provenance and context.  So I was pleased to see this picked up in the posting and comments on Jonathan Clark’s blog about the Enduring Links event.

Lastly in my round-up, Cathy Smith from TNA had some interesting points to make from a user perspective.  She suggested that although users might prefer a single view of a national web collection, this did not necessarily imply a single repository – although collecting institutions still need to work together to eliminate overlap and to coordinate presentation.  This – and the following paper on TNA’s Digital Continuity project – set me thinking, not for the first time, about some potential problems with the geographically defined collecting remits of UK local authority archive services in a digital world.  After all, to the user, local and central government websites are indistinguishable at the .gov.uk domain level, not to mention that much central government policy succeeds or fails depending on how it is delivered at local level.  Follow almost any route through DirectGov and you will end up at a search page for local services.  Websites, unlike paper filing series, do not have distinct, defined limits.  One of the problems with the digital preservation self-sufficiency argument is that the very nature of the digital world – and increasingly so in an era of mash-ups and personalised content – is the exact opposite: highly interdependent and complex.  So TNA’s harvesting of central government websites may be of limited value over the long term unless it is accompanied by an equally enthusiastic campaign to capture content across local government in the UK.

Slides from all the presentations are available on the DPC website.


Spotted in TNA’s web archive as I was preparing a presentation earlier this week.  What happens if you are still viewing the archived site at 5.01pm*, I wonder?

TNA website

* hint: look at the restrictions on use!


Some exciting news today – the West Yorkshire Archive Service (WYAS) submission to the InterPARES 3 Research Project for a case study of the MLA Yorkshire archives has been accepted.  MLA Yorkshire, the lead strategic agency for museums, libraries and archives in the region, closes this week as part of a national restructuring of the wider organisation (so that live website might not be available for too much longer! – in fact, I’ve been experimenting with the Internet Archive’s Archive-It package as part of the MLA Yorkshire archives work), and I’ve spent much of the past few days arranging the transfer of both paper and digital archives from the local office in Leeds.

InterPARES 3 focuses on implementing the theory of digital preservation in small and medium-sized archives, and should provide an excellent chance for WYAS to build up in-house digital preservation expertise as we feel our way with this, our first large-scale digital deposit.  I’m really excited about this opportunity, and I hope to document how we get on with the project on this blog.


Presentations from the successful open consultation day held at TNA on 12 November on digital preservation for local authority archivists are now available on the DPC website – including a report on my Churchill Fellowship research in the US and Australia.  Also featured were colleagues from other local authority services already active in practical digital preservation initiatives – Heather Needham on ingest work at Hampshire, Viv Cothey reporting on his GAIP tool developed for Gloucestershire Archives, and Kevin Bolton on web archiving work at Manchester City.

Heather and I also reported back on the results of the digital preservation survey of local authorities, and a copy of the interim report is now available on the DPC site.  A paper incorporating the discussion arising from the survey during the afternoon sessions of the consultation event will be published in Ariadne in January 2009.


Probably the best posting I can make on the Internet Archive, based in San Francisco, is to encourage colleagues to have a look at its Archive-It subscription service, perhaps attend one of the free webinars about the tool (details on the site), or at least browse some of the collections from partner institutions, including several US state archives.

Although not yet listed on the site, some UK colleagues are already experimenting with the tool.  The Internet Archive offers full hosting and storage, or can ship the results of the web crawl back to the partner institution – as they will be doing for the major full Australian domain web crawl for the National Library of Australia, which had just completed at the time of my visit.  The IA is also working with LOCKSS for storage of harvested websites, and hoping to work with the digital repository software platforms DSpace and Fedora.  Tools to enable more sophisticated pre-crawl scoping and to bookmark potential sites of interest before harvesting are also due for release soon.


National Library of Australia

The National Library of Australia (NLA) began their web archiving project, PANDORA, in 1996, and the current team consists of four members of staff. The NLA’s web archiving programme is selective, contrasting with approaches in the Scandinavian countries in particular, where the aim has been to harvest the entirety of the country’s web domain. The decision to make selective harvests only was resource-driven, since there was no extra funding available, although the Library are now doing periodic .au domain harvests in conjunction with the Internet Archive. These domain harvests have been commissioned since 2005, and the fourth is due this year. It will run over four weeks initially and capture an anticipated 1 billion files, comprising around 40 TB of data. The Internet Archive manage the harvest and carry out a full-text index; the results will be shipped to the NLA and maintained on NLA servers, although copies will also be available via the Wayback Machine (without the full-text indexing).

The terminology ‘PANDORA Archive’ is acknowledged to cause some confusion, particularly within the Australian government, and the Library accept that they are not in fact performing an archival role in the traditional sense. Rather, PANDORA is a web collection, a snapshot of a point in time, a representation of what the NLA feels is important in the Australian web domain. PANDORA doesn’t meet recordkeeping needs for recording business transactions; the websites are harvested purely for their content and there is some leeway in the accuracy of collection dates – for example, a site will be timestamped when it is harvested, but the NLA then perform further quality control on the harvested site, which may take up to a week to complete.

That said, the websites are harvested with the intention that the NLA will attempt to keep them in perpetuity, and permission is sought from website publishers for collection, preservation and public access, with this in mind. Legal deposit legislation does not (yet) cover electronic information in Australia, and considerable effort is therefore required in obtaining the relevant permissions from the website publisher (but not from every contributor). Access restrictions are applied in certain circumstances – for instance, where there is a commercial interest involved. Access can be restricted in several different ways – (a) for a set period of time following archiving, (b) for specific dates, (c) by use of authenticated logins, (d) access restricted to one PC in the NLA’s reading room. One of the problems NLA identify with their current harvesting software is that the restriction mechanism is not sufficiently finely tuned to file level – currently access restrictions can only be specified on a whole website.
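
To make the four restriction types concrete, here is a hedged sketch in Python of how such access rules might be evaluated – the function and parameter names are my own invention, not the NLA’s actual system:

```python
# Evaluate the four restriction types described above: (a) an embargo
# for a set period after archiving, (b) closure on specific dates,
# (c) authenticated login required, (d) reading-room-only access.
from datetime import date, timedelta

def is_accessible(harvest_date, today, embargo_days=0,
                  closed_ranges=(), requires_login=False,
                  reading_room_only=False, user_logged_in=False,
                  in_reading_room=False):
    """Return True if an archived site may be viewed under its restrictions."""
    # (a) embargo for a set period of time following archiving
    if today < harvest_date + timedelta(days=embargo_days):
        return False
    # (b) closed during specific date ranges
    for start, end in closed_ranges:
        if start <= today <= end:
            return False
    # (c) authenticated login required
    if requires_login and not user_logged_in:
        return False
    # (d) restricted to one PC in the reading room
    if reading_room_only and not in_reading_room:
        return False
    return True

# A title harvested 1 June 2009 with a 90-day embargo:
print(is_accessible(date(2009, 6, 1), date(2009, 7, 1), embargo_days=90))   # False
print(is_accessible(date(2009, 6, 1), date(2009, 10, 1), embargo_days=90))  # True
```

Note that in this sketch, as in the NLA’s current software, the rules apply to a whole website at once; restricting an individual file within a harvested site would need the check to run per file rather than per title.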

The selection guidelines are under review at the moment. The current priorities include major events (for example, coverage of Australian elections) but can basically cover any original, high-quality content not available in print. The websites harvested range from academic e-journals to blogs. PANDORA is just moving into Web 2.0 harvesting, although the team have already captured many blogs, some MySpace pages and some online video.

A PANDORA ‘title’ might be anything from a single PDF document to a whole or part website. A particular website might also be harvested at scheduled intervals, with the gap between captures depending on how regularly the site is updated, whether content is periodically removed as well as added, and the general stability of the organisation publishing the website. The harvest interval is re-assessed at each harvest. Currently the most common intervals are between six months and one year; organisationally, it is more efficient to carry out captures less frequently.
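
As an illustration only – these thresholds are invented, not PANDORA’s actual rules – the re-assessment logic after each capture might look something like:

```python
# Re-assess the harvest interval for a title after each capture, based on
# the factors mentioned above: how much the site changed, whether content
# was removed, and the stability of the publishing organisation.
def next_interval_days(changed_since_last, content_removed, publisher_stable):
    """Pick the gap in days before the next capture; 6-12 months is typical."""
    if content_removed or not publisher_stable:
        return 183   # at-risk content: capture roughly every six months
    if changed_since_last:
        return 270   # active but stable site: roughly every nine months
    return 365       # little change observed: an annual capture is enough

print(next_interval_days(True, False, True))   # 270
```

The point of re-assessing at every harvest, rather than fixing a schedule up front, is that a site’s behaviour (and its publisher’s stability) can only really be judged from one capture to the next.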

The PANDORA archive currently holds around 2TB of data, consisting of around 20,000 titles and 40,000 harvested instances.

Of most interest vis-à-vis local archive services in the UK, PANDORA has nine partners in State Libraries and other cultural organisations, who can define what they require to be collected via a web browser interface to PANDORA’s in-house harvesting tool, PANDAS. Librarians in partner institutions can also log in to fix minor problems with harvests or log more significant issues for the team at NLA to resolve. Most of the actual capture work, however, is carried out by the team at NLA.

Whilst the PANDORA team has a library background, a certain level of technical skill is clearly required. That said, other than the quality control work carried out on each harvested title, little post-processing is currently carried out specifically to promote the longevity of the stored files. Three copies are created – a preservation master (the original files as harvested), a display master (which includes any quality control changes), and a metadata master. A display copy is then generated from the display master.
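
The three-copy model can be sketched as a simple storage layout – the paths and metadata fields below are illustrative assumptions on my part, not the NLA’s actual structure:

```python
# Lay out one harvested 'title' as three copies plus a generated record:
# preservation master, display master, and a metadata record alongside.
import json
import tempfile
from pathlib import Path

def store_title(base, title_id, harvested_files, qc_fixes, metadata):
    """Write the three masters for one harvested title under base/title_id."""
    root = Path(base) / title_id
    for subdir in ("preservation-master", "display-master", "metadata"):
        (root / subdir).mkdir(parents=True, exist_ok=True)
    for name, data in harvested_files.items():
        # Preservation master: the original bytes exactly as harvested.
        (root / "preservation-master" / name).write_bytes(data)
        # Display master: the same file with any quality-control fixes
        # applied; the public display copy is generated from this version.
        (root / "display-master" / name).write_bytes(qc_fixes.get(name, data))
    (root / "metadata" / "record.json").write_text(json.dumps(metadata))
    return root

# Demo with a throwaway directory:
with tempfile.TemporaryDirectory() as d:
    store_title(d, "example-0001",
                {"index.html": b"<html>raw</html>"},
                {"index.html": b"<html>qc-fixed</html>"},
                {"harvested": "2009-06-01"})
```

Keeping the untouched harvest separate from the quality-controlled copy means any QC change can later be audited or reversed against the original bytes.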
