Posts Tagged ‘WYAS’

This post is a thank you to my followers on Twitter, for pointing me towards many of the examples given below.  The thoughts on automated description and transcription are a preliminary sketching out of ideas (which, I suppose, is a way of excusing myself if I am not coherent!), on which I would particularly welcome comments or further suggestions:

A week or so before Easter, I was reading a paper about the classification of galaxies on the astronomical crowdsourcing website, Galaxy Zoo.  The authors use a statistical (Bayesian) analysis to distil an accurate sample of data, and then compare the reliability of this crowdsourced sample to classifications produced by expert astronomers.  The article also refers to the use of sample data in training artificial neural networks in order to automate the galaxy classification process.

This set me thinking about archivists’ approaches to online user participation and the harnessing of computing power to solve problems in archival description.  On the whole, I would say that archivists (and our partners on ‘digital archives’ kinds of projects) have been rather hamstrung by a restrictive ‘human-scale’, qualitatively-evaluated, vision of what might be achievable through the application of computing technology to such issues.

True, the notion of an Archival Commons evokes a network-oriented archival environment.  But although the proponents of this concept recognise “that the volume of records simply does not allow for extensive contextualization by archivists to the extent that has been practiced in the past”, the types of ‘functionalities’ envisaged to comprise this interactive descriptive framework still mirror conventional techniques of description in that they rely upon the human ability to interpret context and content in order to make contributions imbued with “cultural meaning”.  There are occasional hints of the potential for more extensible (?web scale) methods of description, in the contexts of tagging and of information visualization, but these seem to be conceived more as opportunities for “mining the communal provenance” of aggregated metadata – so creating additional folksonomic structures alongside traditional finding aids.  Which is not to say that the Archival Commons is not still justified from a cultural or societal perspective, but that the “volume of records” cataloguing backlog issue will require a solution which moves beyond merely adding to the pool of potential participants enabled to contribute narrative descriptive content and establish contextual linkages.

Meanwhile, double-keying, checking and data standardisation procedures in family history indexing have come a long way since the debacle over the 1901 census transcription. But double-keying for a commercial partner also signals a doubling of transcription costs, possibly without a corresponding increase in transcription accuracy.  Or, as the Galaxy Zoo article puts it, “the overall agreement between users does not necessarily mean improvement as people can agree on a wrong classification”.  Nevertheless, these norms from the commercial world have somehow transferred themselves as the ‘gold standard’ into archival crowdsourcing transcription projects, in spite of the proofreading overhead (bounded by the capacity of the individual, again).  As far as I am aware, Old Weather (which is, of course, a Zooniverse cousin of Galaxy Zoo) is the only project working with archival content which has implemented a quantitative approach to assess transcription accuracy – improving the project’s completion rate in the process, since the decision could be taken to reduce the number of independent transcriptions required from five to three.
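To make the point about agreement versus accuracy a little more concrete, here is a minimal sketch in Python.  It is my own illustration, not the method used by Old Weather or Galaxy Zoo: it assumes a simple majority vote across independent transcriptions, and the entry identifiers, readings and 'gold' expert sample are all invented for the purpose.  It shows two things: that transcribers can agree unanimously on a wrong reading, and how one might check against an expert-verified sample whether three transcriptions do as well as five.

```python
from collections import Counter


def consensus(transcriptions):
    """Return (most common reading, share of transcribers who gave it)."""
    # Normalise case and whitespace so trivial differences do not split the vote.
    counts = Counter(t.strip().lower() for t in transcriptions)
    reading, votes = counts.most_common(1)[0]
    return reading, votes / len(transcriptions)


def accuracy(entries, gold, k):
    """Share of entries whose consensus over the first k transcriptions matches the gold reading."""
    hits = sum(
        consensus(keyed[:k])[0] == gold[entry_id]
        for entry_id, keyed in entries.items()
    )
    return hits / len(entries)


# Hypothetical data: five independent transcriptions per entry.
entries = {
    "log-1": ["29.92", "29.92", "29.92", "29.92", "29.97"],
    "log-2": ["nw", "NW", "nw", "ne", "nw"],
    "log-3": ["57", "57", "57", "57", "57"],  # everyone agrees...
}
# Hypothetical expert-checked readings for the same entries.
gold = {"log-1": "29.92", "log-2": "nw", "log-3": "51"}  # ...but on the wrong value

for k in (5, 3):
    print(f"{k} transcriptions: accuracy {accuracy(entries, gold, k):.2f}")
print(consensus(entries["log-3"]))
```

On this toy sample, cutting from five transcriptions to three makes no difference to measured accuracy, and the unanimous entry is still wrong.  That is the kind of evidence a project would need from a properly sized sample before deciding how many independent transcriptions (or how much proofreading) is really worth paying for.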

Pondering these and other such tangles, I began to wonder whether there have indeed been any genuine attempts to harness large-scale processing power for archival description or transcription.  Commercial tools designed to decipher modern handwriting are now available (two examples: MyScript for LiveScribe; Evernote’s text recognition tool), so why not an automated palaeographical tool?  Vaguely remembering that The National Archives had once been experimenting with text mining for both cataloguing and sensitivity classification [I do not know what happened to this project – can anyone shed some light on this?], and recollecting the determination of one customer at West Yorkshire Archive Service who tried valiantly (and failed) to teach his Optical Character Recognition (OCR) software to recognise nearly four centuries of clerks’ handwriting in the West Riding Registry of Deeds indexes, I put out a tentative plea on Twitter for further examples of archival automation.  The following examples are the pick of the amazing set of responses I received:

  • The Muninn Project aims to extract and classify written data about the First World War from digitized documents using raw computing power alone.  The project appears to be at an early stage, and is beginning with structured documents (those written onto pre-printed forms) but hopes to move into more challenging territory with semi-structured formats at a later stage.
  • The Dutch Monk Project (not to be confused with the American project of the same name, which facilitates text mining in full-text digital library collections!) seeks to make use of the qualitative interventions of participants playing an online transcription correction game in order to train OCR software for improved handwriting recognition rates in future.  The project tries to stimulate user participation through competition and rewards, following the example of Google Image Labeler.  If your Dutch is good, Christian van der Ven’s blog has an interesting critique of this project (Google’s attempt at translation into English is a bit iffy, but you can still get the gist).
  • Impact is a European funded project which takes a similar approach to the Monk project, but has focused upon improving automated text recognition with early printed books.  The project has produced numerous tools to improve both OCR image recognition and lexical information retrieval, and a web-based collaborative correction platform for accuracy verification by volunteers.  The input from these volunteers can then in turn be used to further refine the automated character recognition (see the videos on the project’s YouTube channel for some useful introductory materials).  Presumably these techniques could be further adapted to help with handwriting recognition, perhaps beginning with the more stylised court hands, such as Chancery hand.  The division of the quality control checks into separate character, word, and page length tasks (as illustrated in this video) is especially interesting, although I think I’d want to take this further and partition the labour on each of the different tasks as well, rather than expecting one individual to work sequentially through each step.  Thinking of myself as a potential volunteer checker, I think I’d be likely to get bored and give up at the letter-checking stage.  Perhaps this rather more mundane task would be more effectively offered in return for peppercorn payment as a ‘human intelligence task’ on a platform such as Amazon Mechanical Turk, whilst the volunteer time could be more effectively utilised on the more interesting word and page level checking.
  • Genealogists are always ahead of the game!  The Family History Technology Workshop held annually at Brigham Young University usually includes at least one session on handwriting recognition and/or data extraction from digitized documents.  I’ve yet to explore these papers in detail, but there looks to be masses to read up on here.
  • Wot no catalogue? Google-style text search within historic manuscripts?  The Center for Intelligent Information Retrieval (University of Massachusetts Amherst) handwriting retrieval demonstration systems – manuscript document retrieval on the fly.
  • Several other tools and projects which might be of interest are listed in this handy Google Doc on Transcribing Handwritten Documents put together by attendees at the DHapi workshop held at the Maryland Institute for Technology in the Humanities earlier this year.  Where I’ve not mentioned specific examples directly here, it’s mostly because these are examples of online user transcription interfaces (which for the purposes of this post I’m classing as technology-enhanced projects, as opposed to technology-driven, which is my main focus here – if that makes sense? Monk and Impact creep in above because they combine both approaches).

If you know of other examples, please leave a comment…


A round-up and some brief reflections on a number of different events and presentations I’ve attended recently:

Many of this term’s Archives and Society seminars at the Institute of Historical Research have been on particularly pertinent subjects for me, and rather gratifyingly have attracted bumper audiences (we ran out of chairs at the last one I attended).  I’ve already blogged here about the talk on the John Latham Archive.  Presentations by Adrian Autton and Judith Bottomley from Westminster Archives, and Nora Daly and Helen Broderick from the British Library revealed an increasing awareness and interest in the use of social media in archives, qualified by a growing realisation that such initiatives are not self-sustaining, and in fact require a substantial commitment from archive staff, in time if not necessarily in financial terms, if they are to be successful.  Nora and Helen’s talk also prompted an intriguing audience debate about the ‘usefulness’ of user contributions.  To me, this translates as ‘why don’t users behave like archivists’ (or possibly like academic historians)?  But if the aim of promoting archives through social media is to attract new audiences, as is often claimed, surely we have to expect and celebrate the different perspectives these users bring to our collections.  Our professional training perhaps gives us tunnel vision when it comes to assessing the impact of users’ tagging and commenting.  Just because users’ terminology cannot be easily matched to the standardised metadata elements of ISAD(G) doesn’t mean it lacks relevance or usefulness outside of archival contexts.  Similar observations have been made in research in the museums and art galleries world, where large proportions of the tags contributed to the steve.museum prototype tagger represented terms not found in museum documentation (in one case, greater than 90% of tags were ‘new’ terms).  These new terms are viewed as an unparalleled opportunity to enhance the accessibility of museum objects beyond traditional audiences, augmenting professional descriptions, not replacing them.

Releasing archival description from the artificial restraints imposed by the canon of professional practice was also a theme of my UCL colleague Jenny Bunn’s presentation of her PhD research, ‘The Autonomy Paradox’.  I find that each time I hear her speak my understanding of her research increases, balanced by simultaneously greater confusion the deeper she gets into second order cybernetics!  Anyway, suffice it to say that I cannot possibly do justice to her research here, but anyone in North America might like to catch her at the Association of Canadian Archivists’ Conference in June.  I’m interested in the implications of her research for a move away from hierarchical or even series-system description, and whether this might facilitate a more object-oriented view of archival description.

Last term’s Archives and Society series included a talk by Nicole Schutz of Aberystwyth University about her development of a cloud computing toolkit for records management.  This was repeated at the recent meeting of the Data Standards Section of the Archives and Records Association, who had sponsored the research.  At the same meeting, I was pleased to discover that I know more than I thought I did about linked data and RDF, although I am still relieved that Jane Stevenson and the technical team behind the LOCAH Project are pioneering this approach in the UK archives sector and not me!  But I am fascinated by the potential for linked open data to draw in a radical new user community to archives, and will be watching the response to the LOCAH Project with interest.

The Linked Data theme was continued at the UKAD (UK Archives Discovery Network) Forum held at The National Archives on 2 March.  There was a real buzz to the day – so nice to attend an archives event that was full of positive energy about the future, not just ‘tough talk for tough times’.  There were three parallel tracks for most of the day, plus a busking space for short presentations and demos.  Obviously, I couldn’t get to everything, but highlights for me included:

  • the discovery of a second archives Linked Data project – the SALDA project at the University of Sussex, which is extracting archival descriptions from CALM using EAD, and then transforming them into Linked Data
  • Victoria Peters’ overview of the open source archival description software, ICA-AtoM – feedback welcomed, I think, on the University of Strathclyde’s new online catalogue which uses ICA-AtoM
  • chatting about Manchester Archive + (Manchester archival images on Flickr)
  • getting an insider’s view of HistoryPin and Ancestry’s World Archives Project – the latter particularly fascinating to me in the context of motivating and supporting contributors in online archival contexts

Slides from the day, including mine on Crowds and Communities in the Archives, are being gathered together on SlideShare at http://www.slideshare.net/tag/ukad.  Initial feedback from the day was good, and several people have blogged about the event (including Bethan Ruddock from the Archives Hub, a taxonomist’s viewpoint at VocabControl, Karen Watson from the SALDA Project, and The Questing Archivist).

Edit to add Kathryn Hannan’s Archives and Auteurs blog post.


Since it seems a few people read my post about day one of ECDL2010, I guess I’d better continue with day two!

Liina Munari’s keynote about digital libraries from the European Commission’s perspective provided delegates with an early morning shower of acronyms.  Amongst the funder-speak, however, there were a number of proposals from the forthcoming FP7 Call 6 funding round which are interesting from an archives and records perspective, including projects investigating cloud storage and the preservation of context, and on appraisal and selection using the ‘wisdom of crowds’. Also, the ‘Digital Single Market’ will include work on copyright, specifically the orphan works problem, which promises to be useful to the archives sector – Liina pointed out that the total size of the European Public Domain is smaller than the US equivalent because of the extended period of copyright protection available to works whose current copyright owners are unknown. But I do wish people would not use the ‘black hole’ description; it’s alarmist and inaccurate.  If we combine this twentieth century black hole (digitised orphan works) with the oft-quoted born-digital black hole, it seems a wonder we have any cultural heritage left in Europe at all.

After the opening keynote, I attended the stream on the Social Web/Web 2.0, where we were treated to three excellent papers on privacy-aware folksonomies, seamless web editing, and the automatic classification of social tags. The seamless web editor, Seaweed, is of interest to me in a personal capacity, because of its WordPress plugin, which would essentially enable the user to add new posts or edit existing ones directly in a web browser, without recourse to the cumbersome WordPress dashboard and without absent-mindedly adding new pages instead of new posts (which is what I generally manage to do by mistake). I’m sure there are archives applications too, for instance in the user interface design for encouraging participation in archival description.  Privacy-aware folksonomies, a system to enable greater user control over tagging (with three visibility levels: user only, friends, and tag provider), might have application in respect of some of the more sensitive archive content, such as mental health records perhaps.  The paper on the automatic classification of social tags will be of particular interest to records managers interested in the searchability and re-usability of folksonomies in record-keeping systems, as well as to archivists implementing tagging systems in online catalogue or digital archives interfaces.

After lunch we had a poster and demo session.  Those which particularly caught my attention included a poster from the University of Oregon entitled ‘Creating a Flexible Preservation Infrastructure for Electronic Records’ and described as the ‘do-it’ solution to digital preservation in a small repository without any money.  Sounded familiar!  The authors, digital library expert Karen Estlund and University Archivist Heather Briston, described how they have made best use of existing infrastructure, such as share drives (for deposit) and the software package Archivists’ Toolkit for description.  Their approach is similar to the workflow I put in place for West Yorkshire Archive Service, except that the University are fortunate to be in a position to train staff to carry out some self-appraisal before deposit, which simplifies the process.  I was also interested (as someone who is never really sure why tagging is useful) in a poster ‘Exploring the Influence of Tagging Motivation on Tagging Behaviour’ which classified taggers into two groups, describers and categorisers, and in the demonstration of the OCRopodium project at King’s College London, exploring the use of optical character recognition (OCR) with typescript texts.

In the final session of the day, I was assigned to the stream on search in digital libraries, where papers explored the impact of the search interface on search tasks, relevance judgements, and search interface design.

Then there was the conference dinner…


Some exciting news today – the West Yorkshire Archive Service [WYAS] submission to the InterPARES 3 Research Project for a case study of the MLA Yorkshire archives has been accepted.  MLA Yorkshire, the lead strategic agency for museums, libraries and archives in the region, closes this week as part of a national restructuring of the wider organisation (so that live website might not be available for too much longer! – in fact, I’ve been experimenting with the Internet Archive’s Archive-It package as part of the MLA Yorkshire archives work), and I’ve spent much of the past few days arranging the transfer of both paper and digital archives from the local office in Leeds.

InterPARES 3 focuses on implementing the theory of digital preservation in small and medium-sized archives, and should provide an excellent chance for WYAS to build up in-house digital preservation expertise as we feel our way with this, our first large-scale digital deposit.  I’m really excited about this opportunity, and I hope to document how we get on with the project on this blog.
