
Posts Tagged ‘transcription’

It’s been a busy summer for me – lots of stimulating conferences and events.  Here’s my (eclectic) roundup of highlights:

No.1 spot has to go to the fabulous VeleHanden project, a collaborative digitisation and crowdsourcing project initiated by Amsterdam City Archives, with numerous archival partners from all over the Netherlands. I was lucky enough to be invited to the inaugural meeting of the user test panel for the pilot project, militieregisters (militia registers), in Amsterdam at the end of June.

Why do I never get to work in buildings like this?

The testing phase of the project is now well underway, and the project is due to go live in October.  VeleHanden interests me for a number of reasons.  Firstly, it has an interesting and innovative public-private partnership funding model and project structure.  Participating archives have to pay to have their registers scanned by a commercial digitisation company, but the sheer size of the consortium has enabled the negotiation of a very low price per page digitised.  Research users of the militieregisters site will pay a small fee to download a digitised image (similar to Ancestry), thus providing an ongoing revenue stream for the project.  The crowdsourcing interface is being developed by a private company; in future the consortium (or individual members of it) will hire the platform for new projects, and the developers will be free to sell their product to other crowdsourcing markets.  Secondly, I’m interested in the project’s (still evolving) approach to opening up archival metadata.  Thirdly, I’m interested in the way the project is going about recruiting and motivating volunteers to undertake the indexing of the registers – targeting the popular family history community, offering extrinsic quasi-financial rewards to participants in the shape of discounted access to digitised content, and promoting and celebrating competition between participants.

In fact, I think one of VeleHanden’s great strengths is the project’s user-focused approach to design and testing, the importance of which was highlighted by Claire Warwick in a ‘How To’ session on Studying Users at Interface 2011, “a new international forum to learn, share and network between the fields of Humanities and Technology”.  Slides from the keynote and workshop sessions at this event are available on the Interface 2011 website; all are worth a look.  I particularly enjoyed the workshop on Thinking Through Networks, and the practical tips on How to Get Funded should resonate with a much wider audience than just the academic community.  All the delegates had to give a lightning talk about their research.  Here is mine:

View more presentations from 80gb
I also spoke at the Bloomsbury Conference on e-Publishing and e-Publications, and attended a couple of conferences I also went to last year – Research2, the Loughborough University student-organised conference on data analysis for information science, and AERI, the Archival Education and Research Institute, this year at Simmons College in Boston, MA.  It was interesting to note an increased interest in online participation and Internet-based methods at both events.  Podcasts of the AERI plenary sessions are available at the link above.

Read Full Post »

This post is a thank you to my followers on Twitter, for pointing me towards many of the examples given below.  The thoughts on automated description and transcription are a preliminary sketching out of ideas (which, I suppose, is a way of excusing myself if I am not coherent!), on which I would particularly welcome comments or further suggestions:

A week or so before Easter, I was reading a paper about the classification of galaxies on the astronomical crowdsourcing website, Galaxy Zoo.  The authors use a statistical (Bayesian) analysis to distil an accurate sample of data, and then compare the reliability of this crowdsourced sample to classifications produced by expert astronomers.  The article also refers to the use of sample data in training artificial neural networks in order to automate the galaxy classification process.
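(As an aside, the general shape of that consensus-and-check approach is easy to sketch. The snippet below is a toy illustration only – a simple majority vote with an agreement threshold, checked against a handful of expert labels – not the authors’ actual Bayesian analysis, and the object IDs, labels and the 0.75 threshold are all invented for the example.)

```python
from collections import Counter

# Toy illustration only (not the Galaxy Zoo authors' actual Bayesian method):
# combine several volunteers' classifications per object into a consensus
# label, keep only the objects where agreement is high, then check the
# retained sample against a small set of expert classifications.

volunteer_votes = {  # hypothetical object IDs and volunteer labels
    "GZ-001": ["spiral", "spiral", "spiral", "elliptical"],
    "GZ-002": ["elliptical", "elliptical", "spiral", "elliptical"],
    "GZ-003": ["spiral", "elliptical", "merger", "spiral"],
}
expert_labels = {"GZ-001": "spiral", "GZ-002": "elliptical"}  # invented expert sample

AGREEMENT_THRESHOLD = 0.75  # keep an object only if 75% of its votes agree

consensus = {}
for obj_id, votes in volunteer_votes.items():
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= AGREEMENT_THRESHOLD:
        consensus[obj_id] = label  # confident crowd classification

# Reliability check: how often does the retained crowd sample match the experts?
checked = [obj for obj in consensus if obj in expert_labels]
matches = sum(consensus[obj] == expert_labels[obj] for obj in checked)
print(f"Retained {len(consensus)} of {len(volunteer_votes)} objects")
print(f"Agreement with experts: {matches} of {len(checked)} checked")
```

The real analysis is of course far more sophisticated (weighting users by their track record, for instance), but even this crude filter shows how a ‘clean’ subset can be distilled from noisy contributions.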

This set me thinking about archivists’ approaches to online user participation and the harnessing of computing power to solve problems in archival description.  On the whole, I would say that archivists (and our partners on ‘digital archives’ kinds of projects) have been rather hamstrung by a restrictive, ‘human-scale’, qualitatively-evaluated vision of what might be achievable through the application of computing technology to such issues.

True, the notion of an Archival Commons evokes a network-oriented archival environment.  But although the proponents of this concept recognise “that the volume of records simply does not allow for extensive contextualization by archivists to the extent that has been practiced in the past”, the types of ‘functionalities’ envisaged to comprise this interactive descriptive framework still mirror conventional techniques of description in that they rely upon the human ability to interpret context and content in order to make contributions imbued with “cultural meaning”.  There are occasional hints of the potential for more extensible (?web scale) methods of description, in the contexts of tagging and of information visualization, but these seem to be conceived more as opportunities for “mining the communal provenance” of aggregated metadata – so creating additional folksonomic structures alongside traditional finding aids.  Which is not to say that the Archival Commons is not still justified from a cultural or societal perspective, but that the “volume of records” cataloguing backlog issue will require a solution which moves beyond merely adding to the pool of potential participants enabled to contribute narrative descriptive content and establish contextual linkages.

Meanwhile, double-keying, checking and data standardisation procedures in family history indexing have come a long way since the debacle over the 1901 census transcription. But double-keying for a commercial partner also signals a doubling of transcription costs, possibly without a corresponding increase in transcription accuracy.  Or, as the Galaxy Zoo article puts it, “the overall agreement between users does not necessarily mean improvement as people can agree on a wrong classification”.  Nevertheless, these norms from the commercial world have somehow transferred themselves as the ‘gold standard’ into archival crowdsourcing transcription projects, in spite of the proofreading overhead (bounded by the capacity of the individual, again).  As far as I am aware, Old Weather (which is, of course, a Zooniverse cousin of Galaxy Zoo) is the only project working with archival content which has implemented a quantitative approach to assess transcription accuracy – improving the project’s completion rate in the process, since the decision could be taken to reduce the number of independent transcriptions required from five to three.
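(To make that concrete, here is a toy sketch of the kind of reconciliation step such projects need – not Old Weather’s actual logic, and the field names and values are invented. Accepting a value once two of three transcribers agree cuts the keying cost, but, as the Galaxy Zoo quote warns, two people can still agree on a wrong reading.)

```python
from collections import Counter

def reconcile(transcriptions, min_agreement=2):
    """Accept a field when at least `min_agreement` independent transcribers
    give the same (normalised) value; otherwise flag it for manual review.
    A toy sketch, not any particular project's reconciliation logic."""
    accepted, flagged = {}, []
    fields = {field for t in transcriptions for field in t}
    for field in fields:
        values = [t[field].strip().lower() for t in transcriptions if field in t]
        value, count = Counter(values).most_common(1)[0]
        if count >= min_agreement:
            accepted[field] = value
        else:
            flagged.append(field)
    return accepted, flagged

# Three hypothetical independent keyings of the same log-book entry.
keyings = [
    {"ship": "HMS Example", "date": "2 March 1915", "air_temp": "48"},
    {"ship": "HMS Example", "date": "2 March 1915", "air_temp": "43"},
    {"ship": "hms example", "date": "2 March 1915", "air_temp": "48"},
]
accepted, flagged = reconcile(keyings, min_agreement=2)
print(accepted)  # all three fields accepted; air_temp resolves to '48' (2 of 3)
print(flagged)   # nothing left for manual review in this example
```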

Pondering these and other such tangles, I began to wonder whether there have indeed been any genuine attempts to harness large-scale processing power for archival description or transcription.  Commercial tools designed to decipher modern handwriting are now available (two examples: MyScript for LiveScribe; Evernote’s text recognition tool), so why not an automated palaeographical tool?  Vaguely remembering that The National Archives had once been experimenting with text mining for both cataloguing and sensitivity classification [I do not know what happened to this project – can anyone shed some light on this?], and recollecting the determination of one customer at West Yorkshire Archive Service who valiantly tried (and failed) to teach his Optical Character Recognition (OCR) software to recognise nearly four centuries of clerks’ handwriting in the West Riding Registry of Deeds indexes, I put out a tentative plea on Twitter for further examples of archival automation.  The following examples are the pick of the amazing set of responses I received:

  • The Muninn Project aims to extract and classify written data about the First World War from digitized documents using raw computing power alone.  The project appears to be at an early stage, and is beginning with structured documents (those written onto pre-printed forms) but hopes to move into more challenging territory with semi-structured formats at a later stage.
  • The Dutch Monk Project (not to be confused with the American project of the same name, which facilitates text mining in full-text digital library collections!) seeks to make use of the qualitative interventions of participants playing an online transcription correction game in order to train OCR software for improved handwriting recognition rates in future.  The project tries to stimulate user participation through competition and rewards, following the example of Google Image Labeller.  If your Dutch is good, Christian van der Ven’s blog has an interesting critique of this project (Google’s attempt at translation into English is a bit iffy, but you can still get the gist).
  • Impact is a European-funded project which takes a similar approach to the Monk project, but has focused upon improving automated text recognition with early printed books.  The project has produced numerous tools to improve both OCR image recognition and lexical information retrieval, and a web-based collaborative correction platform for accuracy verification by volunteers.  The input from these volunteers can then in turn be used to further refine the automated character recognition (see the videos on the project’s YouTube channel for some useful introductory materials).  Presumably these techniques could be further adapted to help with handwriting recognition, perhaps beginning with the more stylised court hands, such as Chancery hand.  The division of the quality control checks into separate character-, word- and page-level tasks (as illustrated in this video) is especially interesting, although I think I’d want to take this further and partition the labour on each of the different tasks as well, rather than expecting one individual to work sequentially through each step (there’s a rough sketch of this kind of task routing after this list).  Thinking of myself as a potential volunteer checker, I think I’d be likely to get bored and give up at the letter-checking stage.  Perhaps this rather more mundane task would be more effectively offered in return for peppercorn payment as a ‘human intelligence task’ on a platform such as Amazon Mechanical Turk, whilst the volunteer time could be more effectively utilised on the more interesting word and page level checking.
  • Genealogists are always ahead of the game!  The Family History Technology Workshop held annually at Brigham Young University usually includes at least one session on handwriting recognition and/or data extraction from digitized documents.  I’ve yet to explore these papers in detail, but there looks to be masses to read up on here.
  • Wot no catalogue? Google-style text search within historic manuscripts?  The Center for Intelligent Information Retrieval (University of Massachusetts Amherst) handwriting retrieval demonstration systems – manuscript document retrieval on the fly.
  • Several other tools and projects which might be of interest are listed in this handy Google Doc on Transcribing Handwritten Documents put together by attendees at the DHapi workshop held at the Maryland Institute for Technology in the Humanities earlier this year.  Where I’ve not mentioned specific examples directly here, it’s mostly because these are examples of online user transcription interfaces (which for the purposes of this post I’m classing as technology-enhanced projects, as opposed to technology-driven, which is my main focus here – if that makes sense?  Monk and Impact creep in above because they combine both approaches).
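As promised above, a rough sketch of what partitioning the verification labour might look like: route character-level checks to one pool of workers (paid micro-tasks, say) and word- and page-level checks to another (volunteers). This is my own toy illustration, not Impact’s actual workflow, and everything in it – the pools, the routing policy, the sample items – is invented.

```python
from collections import defaultdict

# Toy sketch of routing OCR verification tasks to different pools of checkers,
# so no single person has to work sequentially through character, word and
# page checks. The routing policy and the sample items are invented.

TASK_ROUTING = {
    "character": "paid_microtask_pool",   # e.g. peppercorn-payment HITs
    "word": "volunteer_pool",             # the more interesting checks
    "page": "volunteer_pool",
}

def build_queues(ocr_items):
    """Group low-confidence OCR output into per-pool verification queues."""
    queues = defaultdict(list)
    for item in ocr_items:
        queues[TASK_ROUTING[item["level"]]].append(item)
    return queues

ocr_items = [
    {"level": "character", "page": 12, "glyph_id": 431, "best_guess": "e"},
    {"level": "word", "page": 12, "word_id": 87, "best_guess": "tenement"},
    {"level": "page", "page": 12, "best_guess_text": "...full page transcript..."},
]

for pool, tasks in build_queues(ocr_items).items():
    print(f"{pool}: {len(tasks)} task(s) waiting")
```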

If you know of other examples, please leave a comment…
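And for anyone curious what a naive, off-the-shelf starting point looks like (as opposed to the specialised projects above), the few lines below run the open-source Tesseract engine over a scanned page via the pytesseract wrapper. This is just a baseline sketch – the file name is hypothetical, it is not the software my West Yorkshire customer was using, and unmodified OCR of this sort will almost certainly fail on centuries of clerks’ handwriting; the point is simply that this is the bar an ‘automated palaeography’ tool has to clear.

```python
# Minimal OCR baseline sketch: requires the Tesseract engine to be installed,
# plus `pip install pytesseract pillow`. The file name is hypothetical.
from PIL import Image
import pytesseract

page = Image.open("deeds_index_page.tif")             # a scanned register page
text = pytesseract.image_to_string(page, lang="eng")  # plain OCR, no training
print(text)
```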

Read Full Post »

A round-up of a few pieces of digital goodness to cheer up a damp and dark start to October:

What looks like a bumper new issue of the Journal of the Society of Archivists (shouldn’t it be getting a new name?) is published today.  It has an oral history theme, but actually it was the two articles that don’t fit the theme which caught my eye for this blog.  Firstly, Viv Cothey’s final report on the Digital Curation project, GAip and SCAT, at Gloucestershire Archives, with which I had a minor involvement as part of the steering group for the Society of Archivists-funded part of the work.  The demonstration software developed by the project is now available for download via the project website.  Secondly, Candida Fenton’s dissertation research on the Use of Controlled Vocabulary and Thesauri in UK Online Finding Aids will be of interest to my colleagues in the UKAD network.  The issue also carries a review, by Alan Bell, of Philip Bantin’s book Understanding Data and Information Systems for Recordkeeping, which I’ve also found a helpful way in to some of the more technical electronic records issues.  If you do not have access via the authentication delights of Shibboleth, no doubt the paper copies will be plopping through ARA members’ letterboxes shortly.

Last night, by way of supporting the UCL home team (read: total failure to achieve self-imposed writing targets), I had my first go at transcribing a page of Jeremy Bentham’s scrawled notes on Transcribe Bentham.  I found it surprisingly difficult, even on the ‘easy’ pages!  Admittedly, my palaeographical skills are probably a bit rusty, and Bentham’s handwriting and neatness leave a little to be desired – he seems to have been a man in a hurry – but what I found most tricky was not being able to glance at the page as a whole and get the gist of the sentence ahead at the same time as attempting to decipher particular words.  In particular, not being able to search down the whole page looking for similar letter shapes.  The navigation tools do allow you to pan and scroll, and zoom in and out, but when you’ve got the editing page up on the screen as well as the document, you’re a bit squished for space.  Perhaps it would be easier if I had a larger monitor.  Anyway, it struck me that this type of transcription task is definitely a challenge, for people who want to get their teeth into something, not the type of thing you might dip in and out of in a spare moment (like indicommons on iPhone and iPad, for instance).

I’m interested in reward and recognition systems at the moment, and in how crowdsourcing projects seek to motivate participants to contribute.  Actually, it’s surprising how many projects seem not to think about this at all – the ‘build it and wait for them to come’ attitude.  Quite often, it seems, the result is that ‘they’ don’t come, so it’s interesting to see Transcribe Bentham experiment with a number of tricks for monitoring progress and encouraging people to keep on transcribing.  So there’s the Benthamometer for checking on overall progress; you can set up a watchlist to keep an eye on pages you’ve contributed to; individual registered contributors can set up a user profile to state their credentials and chat to fellow transcribers on the discussion forum; there’s a points system, depending on how active you are on the site; and there’s a leader board of top transcribers.  The leader board seems to be fuelling a bit of healthy transatlantic competition right at the moment, but given the ‘expert’, wanting-to-crack-a-puzzle nature of the task here, I wonder whether the more social / community-building facilities might prove more effective over the longer term than the quantitative approaches.  One to watch.
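(For what it’s worth, the quantitative side of this is trivial to build – which is perhaps why so many projects reach for it first. A toy sketch below: the activity types, usernames and point weights are invented, not Transcribe Bentham’s actual scoring scheme.)

```python
from collections import Counter

# Toy points-and-leaderboard sketch. The activity types, usernames and weights
# are invented for illustration; not Transcribe Bentham's actual scheme.

POINTS = {"page_transcribed": 10, "page_edited": 3, "forum_post": 1}

activity_log = [  # hypothetical contributor activity
    ("kew_kate", "page_transcribed"), ("kew_kate", "forum_post"),
    ("boston_bill", "page_transcribed"), ("boston_bill", "page_transcribed"),
    ("lurker_lou", "page_edited"), ("lurker_lou", "forum_post"),
]

scores = Counter()
for user, action in activity_log:
    scores[user] += POINTS[action]

# Leader board: highest scores first.
for rank, (user, score) in enumerate(scores.most_common(), start=1):
    print(f"{rank}. {user}: {score} points")
```

The harder design question is the one raised above: whether points like these keep people transcribing once the novelty wears off.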

Finally, anyone with the techie skills to mash up data ought to be welcoming The National Archives’ work on designing the Open Government Licence (OGL) for public sector information in the UK.  I haven’t (got the technical skills), but I’m welcoming it anyway in case anyone who has them hasn’t yet seen the publicity about it, and because I am keen to be associated with angels.

Read Full Post »