
Summer Summary

It’s been a busy summer for me – lots of stimulating conferences and events.  Here’s my (eclectic) roundup of highlights:

No.1 spot has to go to the fabulous VeleHanden project, a collaborative digitisation and crowdsourcing project initiated by Amsterdam City Archives, with numerous archival partners from all over the Netherlands. I was lucky enough to be invited to the inaugural meeting of the user test panel for the pilot project, militieregisters (militia registers), in Amsterdam at the end of June.

Why do I never get to work in buildings like this?

The testing phase of the project is now well underway, and the project is due to go live in October.  VeleHanden interests me for a number of reasons:

  • Firstly, it has an interesting and innovative private-public partnership funding model and project structure.  Participating archives have to pay to have their registers scanned by a commercial digitisation company, but the sheer size of the consortium has enabled the negotiation of a very low price per page digitised.  Research users of the militieregisters site will pay a small fee to download a digitised image (similar to Ancestry), thus providing an ongoing revenue stream for the project.  The crowdsourcing interface is being developed by a private company; in future the consortium (or individual members of the consortium) will hire the platform for new projects, and the developers will be free to sell their product to other crowdsourcing markets.
  • Secondly, I’m interested in the project’s (still evolving) approach to opening up archival metadata.
  • Thirdly, I’m interested in the way the project is going about recruiting and motivating volunteers to undertake the indexing of the registers – targeting the popular family history community; offering extrinsic quasi-financial rewards for participants in the shape of discounted access to digitised content; and promoting and celebrating competition between participants.

In fact, I think one of VeleHanden’s great strengths is the project’s user-focused approach to design and testing, the importance of which was highlighted by Claire Warwick in a ‘How To’ session on Studying Users at Interface 2011, “a new international forum to learn, share and network between the fields of Humanities and Technology”.  Slides from the keynote and workshop sessions at this event are available on the Interface 2011 website; all are worth a look.  I particularly enjoyed the workshop on Thinking Through Networks, and the practical tips on How to Get Funded should resonate with a much wider audience than just the academic community.  All the delegates had to give a lightning talk about their research.  Here is mine:

[Slides embedded from SlideShare, user 80gb]
I also spoke at the Bloomsbury Conference on e-Publishing and e-Publications, and returned to a couple of conferences I went to last year – Research2, the Loughborough University student-organised conference on data analysis for information science, and AERI, the Archival Education and Research Institute, held this year at Simmons College in Boston, MA.  It was interesting to note an increased interest in online participation and Internet-based methods at both events.  Podcasts of the AERI plenary sessions are available at the link above.
  • Digital Impacts: How to Measure and Understand the Usage and Impact of Digital Content, Oxford Internet Institute/JISC, Oxford, 20th May 2011 (#oiiimpacts)
  • Beyond Collections: Crowdsourcing for public engagement, RunCoCo Conference, Oxford, 26th May 2011 (#beyond2011)
  • Professor Sherry Turkle, Alone Together RSA Lecture, RSA, London, 1st June 2011 (#rsaonline)

I’m getting a bit behind with blog postings (again), so here, in the interests of ticking another thing off my to-do list, are a few highlights from various events I’ve attended recently…

It was good to see a couple of fellow archivists at the showcase conference for JISC’s Impact and Embedding of Digitised Resources programme. As searchroom visitor figures continue to fall, it is more important than ever that archivists understand how to measure and demonstrate the usage and impact of their online resources. The number of unique visitors to the archive service’s website (currently the only metric available in the CIPFA questionnaire for Archive Services, for instance) is no longer (if it ever was) adequate as a measure of online usage.  As Dr Eric Meyer pointed out in his introduction, one of the central lessons arising from the development of the Toolkit for the Impact of Digitised Scholarly Resources has been that no single metric will ever tell the whole story – a range of qualitative and quantitative methods is necessary to provide a full picture.

The word ‘scholarly’ in the toolkit’s name may be rather off-putting to some archivists working in local government repositories. That would be a shame, because this free online resource is full of very practical and useful advice and guidance. Like the historians caricatured by Sharon Howard of the Old Bailey Online project, archivists are not good at “studying people who can answer back” – the professional archival literature is full of laments about how poor we are at user studies. The synthesis report from the Impact programme, Splashes and Ripples: Synthesizing the Evidence on the Impacts of Digital Resources, is recommended reading; detailed evaluation reports from each of the projects which took part in the programme are also available (at http://www.jisc.ac.uk/whatwedo/programmes/digitisation/impactembedding.aspx).

Many of the recommendations made by the report would be relatively straightforward to implement, yet could potentially transform archive services’ online presence – and the TIDSR toolkit contains the resources to help evaluate the change.  Simple suggestions include picking non-word acronyms to improve project visibility online (like TIDSR – at last I have understood the Internet’s curious aversion to vowels: flickr, lanyrd, tumblr and so on!) and providing simple, automatic citations that are easy to copy or download (although I rather fear that archives are missing the boat on this one). Jane Winters was also excellent on the subject of sustaining digital impact, an important subject for archives, whose online resources are perhaps more likely than most to have a long shelf-life. Twitter coverage of the event is available on Summarizr (another one!).

One gap in the existing digital measurement landscape which occurred to me during the Impacts event was the need for metrics which take account, not just of the passive audience of digital resources, but of those who contribute to them and participate in a more active way.  The problem is easily illustrated by the difficulties encountered when using standard quantitative measurement tools with Web 2.0-type sites.  Attempting to collate statistics on sites such as Your Archives or Transcribe Bentham through the likes of Google Scholar or Yahoo’s Site Explorer is handicapped by the very flexibility of a wiki site structure, compounded again, I suspect, by the want of a uniquely traceable identity.  Google Scholar, in particular, seems averse to searches on URLs (curiously, I discovered that while a search for yourarchives.nationalarchives.gov.uk produces 0 hits, yourarchives.nationalarchives.gov.* comes back with 26), whilst sites which invite user contributions are perhaps particularly susceptible to false-positive site inlink hits where they are highlighted as a general resource in blogrolls and the like.
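
To make the gap concrete, here is a toy sketch (with invented event data, not a real analytics feed) of the kind of participation metric I have in mind – one that separates passive page views from active contributions:

```python
# Toy illustration with invented event data: standard analytics count only the
# viewers, but a participation metric also needs contributors and the volume
# of contributions they make.
from collections import Counter

events = [  # (user, action) pairs, e.g. parsed from a site activity log
    ("anon-1", "view"), ("anon-2", "view"), ("kate", "view"),
    ("kate", "edit"), ("kate", "comment"), ("sam", "edit"),
]

viewers = {user for user, action in events if action == "view"}
contributors = {user for user, action in events if action in {"edit", "comment"}}
actions = Counter(action for _, action in events)

print(f"unique visitors: {len(viewers)}")
print(f"active contributors: {len(contributors)}")
print(f"contributions per contributor: "
      f"{(actions['edit'] + actions['comment']) / len(contributors):.1f}")
```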

This need to be clearer about what we mean by user engagement and how to measure when we’ve successfully achieved it was also my main take-away from the following week’s RunCoCo Conference – Beyond Collections: Crowdsourcing for Public Engagement.  Like Arfon Smith of the Zooniverse team, I am not very comfortable with the term ‘crowdsourcing’, and indeed many of the projects showcased at the Beyond conference seemed to me to be more technologically-enhanced outreach events or volunteer projects than true attempts to engage the ‘crowd’ (not that there is anything wrong with traditional approaches, but I just don’t think they’re crowdsourcing).  Even where large numbers of people are involved, are they truly ‘engaged’ by receiving a rubber stamp (in the case of the Erster Weltkrieg in Alltagsdokumenten project) to mark their attendance at an open day type event?  Understanding the social dynamics behind even large scale online collaborations is important – the Zooniverse ethical contract bears repeating:

  1. Contributors are collaborators, not users
  2. Contributors are motivated and engaged by real research
  3. Don’t waste people’s time

Podcasts of all the Beyond presentations and a series of short, reflective blog posts on the day’s proceedings are available.

Finally, Professor Sherry Turkle’s RSA lecture to celebrate the launch of her new book, Alone Together, about the social impact of the Internet, was rather too brief to give more than a glimpse of her current thinking on our technology-saturated society, but nevertheless there were some intriguing ideas which have potentially wide-ranging implications for the future of archives. One was the sense that the Internet is not currently serving our human needs.  She also spoke about the tension between the willingness to share and the need for privacy: what, Turkle asked, is democracy, and what is intimacy, without privacy?  In response to questions from the audience, Turkle also claimed that people don’t like to say negative things online because it leaves a trace of things that went wrong. If that is true, it might have important implications for what we can expect people to contribute in archival contexts, and for the nature of the debate which might take place in contested spaces of memory. Audio of the event is available from the RSA website.

This post is a thank you to my followers on Twitter, for pointing me towards many of the examples given below.  The thoughts on automated description and transcription are a preliminary sketching out of ideas (which, I suppose, is a way of excusing myself if I am not coherent!), on which I would particularly welcome comments or further suggestions:

A week or so before Easter, I was reading a paper about the classification of galaxies on the astronomical crowdsourcing website, Galaxy Zoo.  The authors use a statistical (Bayesian) analysis to distil an accurate sample of data, and then compare the reliability of this crowdsourced sample to classifications produced by expert astronomers.  The article also refers to the use of sample data in training artificial neural networks in order to automate the galaxy classification process.
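
For illustration only – this is my own minimal sketch with made-up identifiers and vote counts, not the authors’ actual analysis – the consensus calculation at the heart of this kind of approach can be as simple as a Beta-Binomial update over volunteer votes, keeping only the objects on which the crowd is confidently agreed:

```python
# Minimal sketch (not the Galaxy Zoo team's actual pipeline): estimate the
# probability that a galaxy is a spiral from volunteer votes, starting from
# a Beta prior and updating with the classifications received.
from dataclasses import dataclass

@dataclass
class GalaxyVotes:
    galaxy_id: str   # made-up identifiers below, purely for illustration
    spiral: int      # volunteers who classified the object as spiral
    elliptical: int  # volunteers who classified the object as elliptical

def posterior_spiral_probability(votes: GalaxyVotes,
                                 prior_a: float = 1.0,
                                 prior_b: float = 1.0) -> float:
    """Posterior mean of a Beta-Binomial model for 'is this a spiral?'."""
    a = prior_a + votes.spiral
    b = prior_b + votes.elliptical
    return a / (a + b)

# Keep only objects where the crowd is confidently agreed, mimicking the idea
# of distilling a clean sample before comparing it with expert classifications.
sample = [
    GalaxyVotes("object-001", spiral=34, elliptical=3),
    GalaxyVotes("object-002", spiral=12, elliptical=11),
]
clean = [g for g in sample
         if not 0.2 < posterior_spiral_probability(g) < 0.8]
for g in clean:
    print(g.galaxy_id, round(posterior_spiral_probability(g), 2))
```

The interesting part, of course, is what happens next: comparing the confident sample against expert classifications, and using it as training data for automated classifiers, as the paper describes.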

This set me thinking about archivists’ approaches to online user participation and the harnessing of computing power to solve problems in archival description.  On the whole, I would say that archivists (and our partners on ‘digital archives’ kinds of projects) have been rather hamstrung by a restrictive ‘human-scale’, qualitatively-evaluated, vision of what might be achievable through the application of computing technology to such issues.

True, the notion of an Archival Commons evokes a network-oriented archival environment.  But although the proponents of this concept recognise “that the volume of records simply does not allow for extensive contextualization by archivists to the extent that has been practiced in the past”, the types of ‘functionalities’ envisaged to comprise this interactive descriptive framework still mirror conventional techniques of description in that they rely upon the human ability to interpret context and content in order to make contributions imbued with “cultural meaning”.  There are occasional hints of the potential for more extensible (?web scale) methods of description, in the contexts of tagging and of information visualization, but these seem to be conceived more as opportunities for “mining the communal provenance” of aggregated metadata – so creating additional folksonomic structures alongside traditional finding aids.  Which is not to say that the Archival Commons is not still justified from a cultural or societal perspective, but that the “volume of records” cataloguing backlog issue will require a solution which moves beyond merely adding to the pool of potential participants enabled to contribute narrative descriptive content and establish contextual linkages.

Meanwhile, double-keying, checking and data standardisation procedures in family history indexing have come a long way since the debacle over the 1901 census transcription. But double-keying for a commercial partner also signals a doubling of transcription costs, possibly without a corresponding increase in transcription accuracy.  Or, as the Galaxy Zoo article puts it, “the overall agreement between users does not necessarily mean improvement as people can agree on a wrong classification”.  Nevertheless, these norms from the commercial world have somehow transferred themselves as the ‘gold standard’ into archival crowdsourcing transcription projects, in spite of the proofreading overhead (bounded by the capacity of the individual, again).  As far as I am aware, Old Weather (which is, of course, a Zooniverse cousin of Galaxy Zoo) is the only project working with archival content which has implemented a quantitative approach to assess transcription accuracy – improving the project’s completion rate in the process, since the decision could be taken to reduce the number of independent transcriptions required from five to three.
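
As a back-of-the-envelope illustration (mine, not Old Weather’s or anyone else’s published method), here is roughly why reducing the number of independent transcriptions need not hurt accuracy much, assuming individual transcribers are right most of the time and make independent errors:

```python
# Rough sketch: if each independent transcriber gets a field right with
# probability p, how often does a simple majority vote across n independent
# transcriptions give the right answer?
from math import comb

def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that more than half of n independent transcribers agree on
    the correct reading, assuming errors are independent."""
    needed = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(needed, n + 1))

for n in (3, 5):
    print(n, round(majority_vote_accuracy(0.9, n), 3))
# With 90% individual accuracy, a majority of three keyings yields the correct
# reading ~97% of the time, and five keyings ~99% - one hint at why dropping
# from five to three transcriptions can be a reasonable trade-off (though, as
# the Galaxy Zoo quote warns, correlated errors would break this assumption).
```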

Pondering these and other such tangles, I began to wonder whether there have indeed been any genuine attempts to harness large-scale processing power for archival description or transcription.  Commercial tools designed to decipher modern handwriting are now available (two examples: MyScript for LiveScribe; Evernote’s text recognition tool), so why not an automated palaeographical tool?  Vaguely remembering that The National Archives had once been experimenting with text mining for both cataloguing and sensitivity classification [I do not know what happened to this project - can anyone shed some light on this?], and recollecting the determination of one customer at West Yorkshire Archive Service who tried valiantly (and failed) to teach his Optical Character Recognition (OCR) software to recognise nearly four centuries of clerks’ handwriting in the West Riding Registry of Deeds indexes, I put out a tentative plea on Twitter for further examples of archival automation.  The following examples are the pick of the amazing set of responses I received:

  • The Muninn Project aims to extract and classify written data about the First World War from digitized documents using raw computing power alone.  The project appears to be at an early stage, and is beginning with structured documents (those written onto pre-printed forms) but hopes to move into more challenging territory with semi-structured formats at a later stage.
  • The Dutch Monk Project (not to be confused with the American project of the same name, which facilitates text mining in full-text digital library collections!) seeks to make use of the qualitative interventions of participants playing an online transcription correction game in order to train OCR software for improved handwriting recognition rates in future.  The project tries to stimulate user participation through competition and rewards, following the example of Google Image Labeller.  If your Dutch is good, Christian van der Ven’s blog has an interesting critique of this project (Google’s attempt at translation into English is a bit iffy, but you can still get the gist).
  • Impact is a European-funded project which takes a similar approach to the Monk project, but has focused upon improving automated text recognition for early printed books.  The project has produced numerous tools to improve both OCR image recognition and lexical information retrieval, and a web-based collaborative correction platform for accuracy verification by volunteers.  The input from these volunteers can then in turn be used to further refine the automated character recognition (see the videos on the project’s YouTube channel for some useful introductory materials).  Presumably these techniques could be further adapted to help with handwriting recognition, perhaps beginning with the more stylised court hands, such as Chancery hand.  The division of the quality control checks into separate character-, word- and page-level tasks (as illustrated in this video) is especially interesting, although I think I’d want to take this further and partition the labour on each of the different tasks as well, rather than expecting one individual to work sequentially through each step.  Thinking of myself as a potential volunteer checker, I think I’d be likely to get bored and give up at the letter-checking stage.  Perhaps this rather more mundane task would be more effectively offered in return for peppercorn payment as a ‘human intelligence task’ on a platform such as Amazon Mechanical Turk, whilst the volunteer time could be better utilised on the more interesting word and page level checking.  (A rough sketch of this kind of confidence-based routing between machine recognition and human checking follows after this list.)
  • Genealogists are always ahead of the game!  The Family History Technology Workshop held annually at Brigham Young University usually includes at least one session on handwriting recognition and/or data extraction from digitized documents.  I’ve yet to explore these papers in detail, but there looks to be masses to read up on here.
  • Wot no catalogue? Google-style text search within historic manuscripts?  The Center for Intelligent Information Retrieval (University of Massachusetts Amherst) handwriting retrieval demonstration systems – manuscript document retrieval on the fly.
  • Several other tools and projects which might be of interest are listed in this handy Google Doc on Transcribing Handwritten Documents put together by attendees at the DHapi workshop held at the Maryland Institute for Technology in the Humanities earlier this year.  Where I’ve not mentioned specific examples directly here, it’s mostly because they are online user transcription interfaces (which for the purposes of this post I’m classing as technology-enhanced projects, as opposed to technology-driven, which is my main focus here – if that makes sense? Monk and Impact creep in above because they combine both approaches).
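
The confidence-based routing mentioned under Impact above might look something like the following sketch – my own assumption of how it could work with off-the-shelf OCR (Tesseract via the pytesseract and Pillow libraries, which would need to be installed), not a description of any of the projects listed:

```python
# Rough sketch of confidence-based routing: run OCR over a page image, then
# queue only the words the engine is unsure about for human verification.
import pytesseract
from PIL import Image
from pytesseract import Output

def words_needing_review(image_path: str, threshold: float = 60.0):
    """Return (word, confidence) pairs the OCR engine is unsure about."""
    data = pytesseract.image_to_data(Image.open(image_path),
                                     output_type=Output.DICT)
    review = []
    for text, conf in zip(data["text"], data["conf"]):
        # Tesseract reports -1 for non-word boxes; skip those and empty text.
        if text.strip() and 0 <= float(conf) < threshold:
            review.append((text, float(conf)))
    return review

# Hypothetical page image.  Low-confidence words would be offered to
# volunteers (or, for the dullest character-level checks, perhaps to paid
# micro-task workers) rather than asking one person to check everything.
for word, conf in words_needing_review("deeds_register_page_042.png"):
    print(f"{word!r} (confidence {conf:.0f})")
```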

If you know of other examples, please leave a comment…

Digital Connections

Digital Connections: new methodologies for British history, 1500-1900

I spent an enjoyable afternoon yesterday (a distinct contrast, I might add, to the rest of my day, but that is another story) at the Digital Connections workshop at the Institute of Historical Research in London, which introduced two new resources for historical research: the federated search facility, Connected Histories, and the Mapping Crime project to link crime-related documents in the John Johnson collection of ephemera at the Bodleian Library in Oxford to related external resources.

After a welcome from Jane Winters, Tim Hitchcock kicked off proceedings with an enthusiastic endorsement of Connected Histories and generally of all things digital and history-related in Towards a history lab for the digital past. I guess I fundamentally disagree with the suggestion that concepts of intellectual property might survive unchallenged in some quarters (in fact I think the idea is contradicted by Tim’s comments on the Enlightenment inheritance and the ‘authorship’ silo). But then again, we won’t challenge the paywall by shunning it altogether, and in that sense, Connected Histories’ ‘bridges’ to the commercial digitisation providers are an important step forward.  It will be interesting to see how business models evolve in response – there were indications yesterday that some providers may be considering moves towards offering short-term access passes, like the British Newspapers 1800-1900 at the British Library, where you can purchase a 24-hour or 7-day pass if you do not have an institutional affiliation.  Given the number of North American accents in evidence yesterday afternoon, too, there will be some pressure on online publishers to open up access to their resources to overseas users and beyond UK Higher Education institutions.

For me, the most exciting parts of the talk, and of the ensuing demonstration-workshop led by Bob Shoemaker, related to the Connected Histories API (which seems to be a little bit of a work-in-progress) and the eponymous ‘Connections’, a facility for saving, annotating and (if desired) publicly sharing Connected Histories search results; the API in particular prompted an interesting discussion about the technical skills required for contemporary historical research. The reception in the room was overwhelmingly positive – I’ll be fascinated to see whether Connected Histories can succeed where other tools have failed in getting academic historians to become more sociable about their research and expertise.  Connected Histories is not, in fact, truly a federated search platform, in that indexes for each participating resource have been re-created by the Connected Histories team and then link back to the original source.  With the API, this will really open up access to many resources which were designed for human interrogation only, and I am particularly pleased that several commercial providers have been persuaded to sign up to this model.  It does, though, add to the complexity of keeping Connected Histories itself up to date: there are plans to crawl contributing websites every six months to detect required changes.  This seems to me quite labour intensive, and I wonder how sustainable it will prove to be, particularly as the project team plan to add yet more resources to the site in the coming months and welcome enquiries from potential content providers (with an interesting charging model to cover the costs of including new material).  This September’s updates are planned to include DocumentsOnline from The National Archives, and there were calls from the audience yesterday to include catalogue data from local archives and museums.
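
Out of curiosity about how labour intensive the re-crawling really needs to be, here is a minimal sketch of one way changed pages could be detected automatically between six-monthly visits – purely my own speculation, with a hypothetical URL and local cache file, not anything the Connected Histories team have described:

```python
# Minimal sketch of spotting changed pages by comparing content hashes
# between crawl runs (illustration only; requires the requests library).
import hashlib
import json
import requests

STATE_FILE = "crawl_state.json"  # hypothetical local cache of last-seen hashes

def page_fingerprint(url: str) -> str:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return hashlib.sha256(response.content).hexdigest()

def detect_changes(urls: list[str]) -> list[str]:
    try:
        with open(STATE_FILE) as f:
            previous = json.load(f)
    except FileNotFoundError:
        previous = {}
    changed = []
    for url in urls:
        fingerprint = page_fingerprint(url)
        if previous.get(url) != fingerprint:
            changed.append(url)
        previous[url] = fingerprint
    with open(STATE_FILE, "w") as f:
        json.dump(previous, f, indent=2)
    return changed

# Hypothetical resource page; in practice the URL list would presumably come
# from each contributing provider's sitemap or index dump.
print(detect_changes(["https://example.org/sessions/t17740706-1"]))
```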

Without wishing to come across as dismissive as this possibly sounds, David Tomkins’ talk about the Mapping Crime project was a pretty good illustration of what can be done when you have a generous JISC grant and a very small collection.  Coming from (well, in my working background at least) a world of extremely large, poorly documented collections, where no JISC-equivalent funder is available, I was more interested in the generic tools provided for users of the John Johnson collection: permanent URIs for each item, citation download facilities, a personal, hosted user space within the resource, and even a scalable measuring tool for digitised documents.  I wonder why it is taking archival management software developers so long to get round to providing these kinds of tools for users of online archive catalogues? There was also a fascinating exposé of broadsheet plagiarism revealed by the digitisation and linking of two sensationalist crime reports which were identical in all details – apart from the dates of publication and the names of those involved. A wonderful case study in archival authenticity.

David Thomas’ keynote address was an entertaining journey through 13 years of online digitisation effort, via the rather more serious issues of sustainability and democratization of our digital heritage.  His conclusions, that the future of history is about machine-to-machine communication, GIS and spatial data especially, might have come as a surprise to the customary occupants of the IHR’s Common Room, but did come with a warning of the problems attached to the digital revolution from the point of view of ordinary citizens and users: the ‘google issue’ of search results presented out of context; the maze of often complex and difficult-to-interpret online resources; and the question of whether researchers have the technical skills to fully exploit this data in new ways.

A round-up and some brief reflections on a number of different events and presentations I’ve attended recently:

Many of this term’s Archives and Society seminars at the Institute of Historical Research have been on particularly pertinent subjects for me, and rather gratifyingly have attracted bumper audiences (we ran out of chairs at the last one I attended).  I’ve already blogged here about the talk on the John Latham Archive.  Presentations by Adrian Autton and Judith Bottomley from Westminster Archives, and Nora Daly and Helen Broderick from the British Library revealed an increasing awareness and interest in the use of social media in archives, qualified by a growing realisation that such initiatives are not self-sustaining, and in fact require a substantial commitment from archive staff, in time if not necessarily in financial terms, if they are to be successful.  Nora and Helen’s talk also prompted an intriguing audience debate about the ‘usefulness’ of user contributions.  To me, this translates as ‘why don’t users behave like archivists’ (or possibly like academic historians)?  But if the aim of promoting archives through social media is to attract new audiences, as is often claimed, surely we have to expect and celebrate the different perspectives these users bring to our collections.  Our professional training perhaps gives us tunnel vision when it comes to assessing the impact of users’ tagging and commenting.  Just because users’ terminology cannot be easily matched to the standardised metadata elements of ISAD(G) doesn’t mean it lacks relevance or usefulness outside of archival contexts.  Similar observations have been made in research in the museums and art galleries world, where large proportions of the tags contributed to the steve.museum prototype tagger represented terms not found in museum documentation (in one case, greater than 90% of tags were ‘new’ terms).  These new terms are viewed as an unparalleled opportunity to enhance the accessibility of museum objects beyond traditional audiences, augmenting professional descriptions, not replacing them.
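
The underlying measurement is simple enough to sketch in a few lines – the tags and catalogue terms below are invented for illustration, not steve.museum data:

```python
# Toy illustration of measuring how many user-contributed tags are 'new'
# terms absent from the professional description of the same item.
def normalise(terms: set[str]) -> set[str]:
    return {t.strip().lower() for t in terms}

def new_term_proportion(user_tags: set[str], catalogue_terms: set[str]) -> float:
    tags, existing = normalise(user_tags), normalise(catalogue_terms)
    return len(tags - existing) / len(tags) if tags else 0.0

# Invented example data for a hypothetical catalogue entry.
catalogue_terms = {"deed", "conveyance", "West Riding", "registry"}
user_tags = {"old handwriting", "property", "Leeds", "family history", "deed"}
print(f"{new_term_proportion(user_tags, catalogue_terms):.0%} of tags are new terms")
# -> 80% of tags are new terms
```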

Releasing archival description from the artificial restraints imposed by the canon of professional practice was also a theme of my UCL colleague Jenny Bunn’s presentation of her PhD research, ‘The Autonomy Paradox’.  Each time I hear her speak, I find my growing understanding of her research balanced by a simultaneously greater confusion the deeper she gets into second-order cybernetics!  Anyway, suffice it to say that I cannot possibly do justice to her research here, but anyone in North America might like to catch her at the Association of Canadian Archivists’ Conference in June.  I’m interested in the implications of her research for a move away from hierarchical or even series-system description, and whether this might facilitate a more object-oriented view of archival description.

Last term’s Archives and Society series included a talk by Nicole Schutz of Aberystwyth University about her development of a cloud computing toolkit for records management.  This was repeated at the recent meeting of the Data Standards Section of the Archives and Records Association, who had sponsored the research.  At the same meeting, I was pleased to discover that I know more than I thought I did about linked data and RDF, although I am still relieved that Jane Stevenson and the technical team behind the LOCAH Project are pioneering this approach in the UK archives sector and not me!  But I am fascinated by the potential for linked open data to draw in a radical new user community to archives, and will be watching the response to the LOCAH Project with interest.
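
For anyone as relieved as I am not to be building it themselves: a single archival description expressed as linked data can be sketched in a few lines with the rdflib library – the URIs and the Dublin Core terms here are my own toy choices, not LOCAH’s actual data model:

```python
# Minimal sketch of expressing one archival description as RDF triples.
# The identifier URI and the vocabulary choices are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

ARCH = Namespace("http://example.org/archives/")  # hypothetical vocabulary

g = Graph()
fonds = URIRef("http://example.org/archives/wyl639")  # hypothetical identifier
g.add((fonds, RDF.type, ARCH.Fonds))
g.add((fonds, DC.title, Literal("West Riding Registry of Deeds indexes")))
g.add((fonds, DC.date, Literal("1704-1970")))

print(g.serialize(format="turtle"))
```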

The Linked Data theme was continued at the UKAD (UK Archives Discovery Network) Forum held at The National Archives on 2 March.  There was a real buzz to the day – so nice to attend an archives event that was full of positive energy about the future, not just ‘tough talk for tough times’.  There were three parallel tracks for most of the day, plus a busking space for short presentations and demos.  Obviously, I couldn’t get to everything, but highlights for me included:

  • the discovery of a second archives Linked Data project – the SALDA project at the University of Sussex, which is extracting archival descriptions from CALM using EAD and then transforming them into Linked Data
  • Victoria Peters’ overview of the open source archival description software, ICA-AtoM – feedback welcomed, I think, on the University of Strathclyde’s new online catalogue, which uses ICA-AtoM.
  • chatting about Manchester Archive + (Manchester archival images on flickr)
  • getting an insider’s view of HistoryPin and Ancestry’s World Archives Project – the latter particularly fascinating to me in the context of motivating and supporting contributors in online archival contexts

Slides from the day, including mine on Crowds and Communities in the Archives, are being gathered together on slideshare at http://www.slideshare.net/tag/ukad.  Initial feedback from the day was good, and several people have blogged about the event (including Bethan Ruddock from the ArchivesHub, a taxonomist’s viewpoint at VocabControl, Karen Watson from the SALDA Project, and The Questing Archivist).

Edit to add Kathryn Hannan’s Archives and Auteurs blog post.

Today I have a guest post about my research on UKOLN’s Cultural Heritage Blog.

I’ve noticed before that in all the excitement over Web2.0 tools for user participation, archivists tend to make one big assumption: that textual descriptions of archives will remain the primary access channel to archival material in the electronic age.  This is despite all the evidence (when we bother to look for it, which isn’t really often enough) that users find archival finding aids difficult to navigate, and an ongoing blurring of boundaries between previously separate descriptive products (catalogues, indexes, calendars, transcripts etc.) in online contexts.

In the context of user participation, this assumption is particularly significant, since adding considerable quantities of user-contributed metadata – comments, tags, and word-for-word transcripts of documents – can surely only amplify the existing difficulties of user interface design and add to the complexities of using archival descriptive systems.  And that’s not to mention the possibilities suggested by ongoing improvements in optical character recognition and data mining technologies.  Even given the assistance of sophisticated search algorithms, that’s a hell of a lot of text for the poor researcher to have to wade through.

Then of course, many archivists – and many of our users – would subscribe to the view that there is something extra special about the touch and feel of original archive documents, and consider the digitised surrogate an inevitably impoverished medium as a result.  Despite advances in digital tactility devices, though, this one is probably quite a hard problem to crack for remote access to archives!

One really promising alternative is visual representation, which is particularly effective for very large datasets – see, for example, Mitchell Whitelaw’s ‘visual archive’ research project for the National Archives of Australia.

And on Tuesday, I was introduced to another – searching by sound, soon to be implemented as one of the three access routes into the archive of the artist John Latham (it’ll be under the ‘AA’ link shortly; in the meantime you can browse a slideshow of the archive or interrogate its contents in more traditional, textual fashion by clicking on either of the other two letter codes from the homepage).  Fascinating stuff, although one potential problem is that you would need to know quite a substantial amount about John Latham and his ‘flat-time’ theories before you can make sense of the soundtrack and the finding aid itself – so in that sense, the sound search tool might be as much a barrier as a facilitator of access.  But then again, the same can be said of certain textual finding aids: fonds, anybody?

Anyone fancy devising an olfactory finding aid?
