Lonely Hearts at #UKAD

A quick post in support of my lonely hearts ads for this year’s UKAD Forum.  I’ve submitted two – slightly concerned this makes me look rather archivally-geekily dissolute… Anyway, these were inspired by a chance conversation on twitter a few weeks back with a couple of archivists who had signed up in January for the Code Year lessons, but had found it hard going and fallen behind.

So firstly:

  • Digital professional, likes history, cake, structure & logic, hates dust, WLTM archivists interested in learning programming, for fun and comradeship.
I’ve posted here previously that I have occasionally mulled over the possibility/feasibility of some kind of online basic programming tutorial for archivists, and this even gathered a very welcome offer of assistance.  But I hadn’t taken the idea any further for a couple of reasons: (a) I wasn’t sure of demand, and (b) I think it’s really important that any tutorial should be based around real, practical archival scenarios.  I know from experience that it can be difficult to learn tech stuff (well, perhaps any stuff) if you can’t see how you might apply it in personally relevant contexts.  So, if you’re an archivist, what I’d like to find out in the comments to this post is why you’re interested in learning how to program – specifically, to which archives-related tasks you hope such skills could usefully be applied.
And secondly:
  • Tech-loving archivist seeks passionate, patient devs with GSOH to help teach archivists to code.
Because I know I couldn’t put together what I have in mind on my own, and because I’d be embarrassed to show any of my code to anyone.  Those two things are linked actually: on a good day, with a following wind, and plenty of coffee and swearing, I can cobble together some lines of code which do something useful (for my purposes).  I am all too aware I am using perhaps 5-10% of the power of any given language, but then again if it works (eventually, usually!) for my purposes, perhaps 5% is all the function I require (plus the confidence to explore and experiment).  I need any real programmers interested in helping out to understand all of that.  The aim here would be to put together a simple tutorial for beginners based around day-to-day archival tasks.  From programmers, I’d be interested in ideas of how to put together this tutorial, including what language(s) you might recommend and why.

I have absolutely no clue whether or how this might come off.  Maybe the only UK archivists interested are the three of us who talked on twitter.  Maybe we’ll decide it’s too much effort to tailor a resource specifically for archivists (and I do have the small matter of a PhD thesis to write over the next few months).  Maybe we’ll find there’s already something out there that’s perfect.  Maybe the consensus will be that archivists’ time would be better spent brushing up their markup skills, or learning about database design, or practising palaeography or something.  I just don’t know, but UKAD is all about networking and getting people together from different fields but with common interests in archives.  Or, as one archivist tweeted: “Wanted to be able to have halfway-sensible conversation with techies” – now there’s a challenge!

It’s been a while since I’ve posted here purely on digital preservation issues: my work has moved in other directions, although I did attend a number of the digital preservation sessions at the Society of American Archivists’ conference this summer.  I retain a keen interest in digital preservation, however, particularly in developments which might be useful for smaller archives.  Recently, I’ve been engaged in a little work for a project called DiSARM (Digital Scenarios for Archives and Records Management), preparing some teaching materials for the Masters students at UCL to work from next term, and in revising the contents of a guest lecture I present to the University of Liverpool MARM students on ‘Digital Preservation for the Small Repository’.  Consequently, I’ve been trying to catch up on the last couple of years (since I left West Yorkshire Archive Service at the end of 2009) of new digital preservation projects and research.

So what’s new?  Well, from a small archives perspective, I think the key development has been the emergence of several digital curation workflow management systems – Archivematica, Curator’s Workbench, the National Archives of Australia’s Digital Preservation Software Platform (others…?) – which package together a number of different tools to guide the archivist through a sequenced set of stages for the processing of digital content.  The currently available systems vary in their approaches to preservation, comprehensiveness, and levels of maturity, but represent a major step forward from the situation just a couple of years ago.  In 2008, if, like me when WYAS took in the MLA Yorkshire archive as a testbed, you didn’t have much (or any) money available, your only option was – as one of the former Liverpool students memorably pointed out to me – to cobble together a set of tools as best you could from old socks and a bit of string.  Now we have several offerings approaching an integrated software solution; moreover, these packages are generally open source and freely available, so would-be adopters are able to download each one and play about with it before deciding which one might suit them best.

Having said that, I still think it is important that students (and practitioners, of course) understand the preservation strategies and assumptions underlying each software suite.  When we learn how to catalogue archives, we are not trained merely to use a particular software tool.  Rather, we are taught the principles of archival description, and then we move on to see how these concepts are implemented in practice in EAD or by using specific database applications, such as (in the U.K.) CALM or Adlib.  For DiSARM, students will design a workflow and attempt to process a small sample set of digital documents using their choice of one or more of the currently available preservation tools, which they will be expected to download and install themselves.  This Do-It-Yourself approach will mirror the practical reality in many small archives, where the (frequently lone) archivist often has little access to professional IT support. Similarly, students at UCL are not permitted to install software onto the university network.  Rather than see this as a barrier, again I prefer to treat this situation as a reflection of organisational reality.  There are a number of very good reasons why you would not want to process digital archives directly onto your organisation’s internal network, and recycling and re-purposing old computer equipment of varying technical specifications and capabilities to serve as workstations for ingest is a fact of life even, it seems, for Mellon-funded projects!

In preparation for writing this DiSARM task, I began to put together for my own reference a spreadsheet listing all the applications I could think of, or have heard referenced recently, which might be useful for preservation processing tasks in small archives.  I set out to record the following (a rough machine-readable sketch of the inventory follows the list):

  • the version number of the latest (stable) release
  • the licence arrangements for each tool
  • the URL from which the software can be downloaded
  • basic system requirements (essentially the platform(s) on which the software can be run – we have surveyed the class and know there is a broad range of operating systems in use, including several flavours of both Linux and Windows, and Mac OS X)
  • location of further documentation for each application
  • end-user support availability (forums or mailing lists etc)
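For anyone who would rather keep this sort of inventory in machine-readable form than in a spreadsheet, here is a minimal sketch in Python; the field names mirror the list above, but the sample row is an entirely hypothetical placeholder rather than a real entry from my spreadsheet.

```python
# Minimal sketch of the tool inventory as a CSV file rather than a spreadsheet.
# Field names mirror the list above; the sample row is purely hypothetical.
import csv

FIELDS = ["tool", "latest_stable_version", "licence", "download_url",
          "system_requirements", "documentation", "support"]

inventory = [
    {
        "tool": "ExampleTool",                          # hypothetical entry
        "latest_stable_version": "1.0",
        "licence": "open source (licence unstated)",
        "download_url": "http://example.org/download",
        "system_requirements": "Java 6+ (cross-platform)",
        "documentation": "http://example.org/docs",
        "support": "mailing list",
    },
]

with open("dp_tools.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(inventory)
```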
Compiling all this proved surprisingly difficult.  I was half expecting that user-friendly documentation and (especially) support might often be lacking in the smaller projects, but several websites also lack clear statements about system requirements or the legal conditions under which the software may be installed and used.  Does ‘educational use and research’ cover a local authority archives providing research services to the general public (including academics)?  Probably not, but it would presumably allow for use in a university archives.  Thanks to the wonders of portable languages and runtimes (mostly Java, but Python also puts in an occasional appearance), many tools are effectively cross-platform, but it is astonishing how many projects fail to say so clearly.  This is self-evident to a developer, of course, but not at all obvious to an archivist, who will probably be worried about bringing coffee into the repository, let alone a reptile.  Oh, and if you expect your software to be compiled from source, or require sundry other faffing around at a command line before use, I’m sorry, but your application is not “easy to implement” for ordinary mortals, as more than one site claimed.  Is it really so hard to generate binary executables for common operating systems (or, if you have a good excuse – such as Archivematica, which is still in alpha development – at least to provide detailed step-by-step instructions)?  Many projects of course make use of SourceForge to host code, but use another website for documentation and updates – it can be quite confusing finding your way around.  The venerable ClamAV seems to have undergone some kind of Windows conversion, and although I’m sure that Unix packages must be there somewhere, I’m damned if I could find them easily…

All of which plays into a wider debate about just how far the modern archivist’s digital skills ought to reach (there are many other versions of this debate; the one linked – from 2006, so now quite old – just happens to be one of the most comprehensive attempts to define a required digital skill set for information practitioners).  No doubt there will be readers of this post who believe that archivists shouldn’t be dabbling in this sort of stuff at all, especially if they also work for an organisation which lacks the resources to establish a reliable infrastructure for a trusted digital repository.  And certainly I’ve been wondering lately whether some kind of archivists’ equivalent of The Programming Historian would be welcome or useful, teaching basic coding tailored to common tasks that an archivist might need to carry out.  But essentially, I don’t subscribe to the view that all archivists need to re-train as computer scientists or IT professionals.  Of course, these skills are still needed (obviously!) within the digital preservation community, but to drive a car I don’t need to be a mechanic or have a deep understanding of transport infrastructure.  Digital preservation needs to open up spaces around the periphery of the community where newcomers can experiment and learn, otherwise it will become an increasingly closed and ultimately moribund endeavour.

8am on Saturday morning, and those hardy souls who had not yet fled to beat Hurricane Irene home, or who were stranded in Chicago, plus other assorted insomniacs, were presented with a veritable smörgåsbord of digital preservation goodness.  The programme has many of the digital sessions scheduled at the same time, and today I decided not to session-hop but to stick it out in one session in each of the morning’s two hour-long slots.

My first choice was session 502, Born-Digital archives in Collecting Repositories: Turning Challenges into Byte-Size Opportunities, primarily an end-of-project report on the AIMS Project.  It’s been great to see many such practical digital preservation sessions at this conference, although I do slightly wonder what it will take before working with born-digital truly becomes part of the professional mainstream.  Despite the efforts of all the speakers at sessions like this (and in the UK, colleagues at the Digital Preservation Roadshows with which I was involved, and more recent similar events), there still appears to be a significant mental barrier which stops many archivists from giving it a go.  As the session chair began her opening remarks this morning, a woman behind me remarked “I’m lost already”.

There may be some clues in the content of this morning’s presentations: in amongst my other work (as would be the case for most archivists, I guess) I try to keep reasonably up to date with recent developments in practical digital preservation.  For instance, I was already well aware of the AIMS Project, although I’d not had a previous opportunity to hear about their work in any detail, but here were yet more new suggested tools for digital preservation.  I happen to know of FTK Imager, having used it with the MLA Yorkshire archive accession, although what wasn’t stated was that the full FTK forensics package is damn expensive and the free FTK Imager Lite (scroll down the page for links) is an adequate and more realistic proposition for many cash-strapped archives.  BagIt is familiar too, but Bagger, a graphical user interface to the BagIt Library, is new since I last looked (I’ll add links later – the Library of Congress site is down for maintenance).  Sleuthkit was mentioned at the research forum earlier this week, but fiwalk (“a program that processes a disk image using the SleuthKit library and outputs its results in Digital Forensics XML”) was another new one on me, and there was even talk in this session of hardware write-blockers.  All this variety is hugely confusing for anybody who has to fit digital preservation around another day job, not to mention potentially expensive when it comes to buying hardware and software and acquiring the skills necessary to install and maintain such a jigsaw-puzzle system.  As the project team outlined their wish list for yet another application, Hypatia, I couldn’t help wondering whether we can’t promote a little more convergence between all these different tools, both digital-preservation-specific and more general.  For instance, the requirement for a graphical drag ‘n’ drop interface to help archivists create the intellectual arrangement of a digital collection and add metadata reminded me very much of recent work at Simmons College on a graphical tool to help teach archival arrangement and description (whose name I forget, but will add when it comes back to me!*).  I was particularly interested in the ‘access’ part of this session: the idea that FTK’s bookmark and label functions could be transformed into user-generated content tools, to enable researchers to annotate and tag records, and the use of network graphs as a visual finding aid for email collections.

The rabbit-caught-in-headlights issue seems less of a problem for archivists jumping on the Web2.0 bandwagon, which was the theme of session 605, Acquiring Organizational Records in a Social Media World: Documentation Strategies in the Facebook Era, where we heard about the use of social media, primarily Facebook, to contact and document student activities and student societies in a number of university settings, and from a university archivist just beginning to dip her toe into Twitter.  As a strategy of working directly with student organisations and providing training to ‘student archivists’ was outlined – a method of enabling the capture of social media content both at the point of upload and by web-crawling sites afterwards – I was reminded of my own presentation at this conference: surely here is another example of real-life community development? The archivist is deliberately ‘going out to where the community is’ and adapting to the community norms and schedules of the students themselves, rather than expecting the students to comply with archival rules and expectations.

This afternoon I went to learn about SNAC: the social networks and archival context project (session 710), something I’ve been hearing other people mention for a long time now but knew little about.  SNAC is extracting names (corporate, personal, family) from Encoded Archival Description (EAD) finding aids as EAC-CPF and then matching these together and with pre-existing authority records to create a single archival authorities prototype.  The hope is to then extend this authorities cooperative both nationally and potentially internationally.
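I haven’t seen SNAC’s extraction code itself, but to make that first step a little more concrete, here is a minimal sketch (my own, and certainly not the project’s implementation) of pulling personal, corporate and family names out of a namespaced EAD 2002 finding aid using Python and lxml; the input filename is hypothetical, and the real work of generating EAC-CPF and matching names against existing authority records is of course far more involved.

```python
# Minimal sketch (not SNAC's code): list personal, corporate and family names
# found in an EAD 2002 finding aid, as a first step towards authority records.
from lxml import etree

EAD_NS = {"e": "urn:isbn:1-931666-22-9"}  # default EAD 2002 namespace


def extract_names(ead_path):
    """Return (element name, text) pairs for persname, corpname and famname."""
    tree = etree.parse(ead_path)
    names = []
    for tag in ("persname", "corpname", "famname"):
        for el in tree.findall(f".//e:{tag}", namespaces=EAD_NS):
            text = " ".join(el.itertext()).strip()
            if text:
                names.append((tag, text))
    return names


if __name__ == "__main__":
    for kind, name in extract_names("finding_aid.xml"):  # hypothetical file
        print(kind, name)
```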

My sincere thanks to the Society of American Archivists for their hospitality during the conference, and once again to those who generously funded my trip – the Archives and Records Association, University College London Graduate Conference Fund, UCL Faculty of Arts and UCL Department of Information Studies.

* UPDATE: the name of the Simmons’ archival arrangement platform is Archivopteryx (not to be confused with the Internet mail server Archiveopteryx which has an additional ‘e’ in the name)

#SAA11, Friday 26th August

Friday had a bit of a digital theme for me, beginning with a packed, standing-room-only session 302, Practical Approaches to Born-Digital Records: What Works Today. After a witty introduction by Chris Prom about his Fulbright research in Dundee, a series of speakers introduced their digital preservation work, with a real emphasis on ‘you too can do this’.  I learnt about a few new tools: firefly, a tool used to scan for American social security numbers and other sensitive information (not much use in a British context, I imagine, but an interesting approach all the same), and TreeSize Professional, a graphical hard disk analyser; several projects were also making use of the Duke Data Accessioner, a tool with which I was already familiar but have never used.  During the morning session, I also popped in and out of ‘team-Brit’ session 304, Archives in the Web of Data, which discussed developments in the UK and US in opening up and linking together archival descriptive data, and session 301, Archives on the Go: Using Mobile Technologies for Your Collections, where I caught a presentation on the use of FourSquare at Stanford University.

In the afternoon, I mostly concentrated on session 401, Re-arranging Arrangement and Description, with a brief foray into session 407, Faces of Diversity: Diasporic Archives and Archivists in the New Millennium.  Unless I missed this whilst I was out at the other session, nobody in session 401 mentioned the series system as a possible alternative or resolution to some of the tensions identified in a strict application of hierarchically-interpreted original order, which surprised me.  There were some hints towards a need for a more object-oriented view of description in a digital environment, and of methods of addressing the complexity of having multiple representations (physical, digital etc.), but I have been reading my UCL colleague Jenny Bunn’s recently completed PhD thesis, Multiple Narratives, Multiple Views: Observing Archival Description, on flights for this trip, and it would have added another layer to the discussion in this session.

And continuing the digital theme, I was handed a flyer for an event coming later this year (on 6th October): Day of Digital Archives, which might interest some UK colleagues.  This is

…an initiative to raise awareness of digital archives among both users and managers. On this day, archivists, digital humanists, programmers, or anyone else creating, using, or managing digital archives are asked to devote some of their social media output (i.e. tweets, blog posts, youtube videos etc.) to describing their work with digital archives.  By collectively documenting what we do, we will be answering questions like: What are digital archives? Who uses them? How are they created and managed? Why are they important?

 

Day 1 Proper of the conference began with acknowledgements to the organisers, some kind of raffle draw and then a plenary address by an American radio journalist.  Altogether this conference has a celebratory feel to it – fitting since this is SAA’s 75th Anniversary year, but very different in tone from the UK conferences where the opening keynote speaker tends to be some archival luminary.  More on the American archival cultural experience later.

My session with Kate Theimer (of ArchivesNext fame) and Dr Elizabeth Yakel from the University of Michigan (probably best known amongst tech-savvy UK practitioners for her work on the Polar Bear Expedition Finding Aid) followed immediately afterwards, and seemed to go well.  The session title was: “What Happens After ‘Here Comes Everybody’: An Examination of Participatory Archives”.  Kate proposed a new definition for Participatory Archives, distinguishing between participation and engagement (outreach); Beth spoke about credibility and trust, and my contribution was primarily concerned with contributors’ motivations to participate.  A couple of people, Lori Satter and Mimi Dionne, have already blogged about the session (did I really say that?!), and here are my slides:

After lunch, I indulged in a little session-hopping, beginning in session 204, hearing about Jean Dryden’s copyright survey of American institutions, which asked whether copyright limits access to archives by restricting digitisation activity.  Dryden found that American archivists tended to take a very conservative approach to copyright expiry terms and obtaining third party permission for use, even though many interviewees felt that it would be good to take a bolder line.  Also, some archivists’ knowledge of American copyright law was shaky – sounds familiar!  It would be interesting to see how UK attitudes would compare; I suspect results would be similar. However, I also wonder how easy it is in practical terms to suddenly start taking more of a risk-management approach to copyright after many years of insisting upon strict copyright compliance.

Next I switched to session 207, The Future is Now: New Tools to Address Archival Challenges, hearing Maria Esteva speak about some interesting collaborative work between the Texas Advanced Computing Center and NARA on visual finding aids, similar to the Australian Visible Archive research project. At the Exhibit Hall later, I picked up some leaflets about other NARA Applied Research projects and tools for file format conversion, data mining and record type identification which were discussed by other speakers in this session.

Continuing the digitization theme, although with a much more philosophical focus, Joan Schwartz in session 210, Genuine Encounter, Authentic Relationships: Archival Covenant & Professional Self-Understanding, discussed the loss of materiality and context resulting from the digitisation of photographs (for example, a thumbnail image presented out of its album context).  She commented that archivists are often too caught up with the ‘how’ of digitisation rather than the ‘why’.  I wouldn’t disagree with that.

Back to the American archival cultural experience: I was invited to the University of Michigan ‘alumni mixer’ in the evening – a drinks reception with some short speeches updating alumni on staff news and recent developments in the archival education courses at the university.  All in all, archives students are much in evidence here: there are special student ‘ribbons’ to attach to name badges, many students are presenting posters on their work, and there is a special careers area where face-to-face advice is available from more senior members of SAA, current job postings are advertised, and new members of the profession can even pin up their curriculum vitae.  Some of this (the public posting of CVs in particular) might seem a bit pushy for UK tastes, and the one-year length of UK Masters programmes (and the timing of Conference) of course precludes the presentation of student dissertation work.  But the general atmosphere seems very supportive of new entrants to the profession, and I feel there are ideas here that ARA’s New Professionals section might like to consider for future ARA Conferences.

For me, a meeting connected to my research, followed by a little sight-seeing, as I was not involved in any of the day’s events.

It’s interesting to compare how SAA organises its annual meeting with the much smaller ARA event.  In the days running up to the main conference, SAA arranges a series of training workshops and group committee meetings, and some poor archivists are even taking their Certified Archivists examination.  This means that delegates arrive at different times over the first few days, and it is only tomorrow (Thursday) that the full size of the conference is revealed with the first plenary session.  Some of the pre-conference events are an extra charge to attendees.  I understand these are an important source of income for SAA, but I imagine they work out as a cost-effective way for delegates to attend training, since they are already travelling to get to the main Annual Meeting itself.  I guess many ARA sections also hold committee meetings at Conference, but that is more of an informal arrangement.  I wondered if formalising it might simultaneously help save some costs for ARA and boost attendance, but I think I’d switch the order so that such add-on events followed the main conference rather than preceded it, as happens here – as a first-time attendee, I find the gradual ramping up towards the main event quite disorientating.

Oh, and in the evening, a mass outing to watch the Chicago Cubs baseball game – and the home team won!  Fortunately, I had a very patient explainer…

This should be the first of several posts from this year’s Society of American Archivists Annual Meeting in Chicago, for which I have received generous funding to attend from UCL’s Graduate Conference Fund, and from the Archives and Records Association who asked me to blog the conference.  First impressions of a Brit: this conference is huge.  I could (and probably will) get lost inside the conference hotel, and the main programme involves parallel tracks of ten sessions at once.  And proceedings start at 8am.  This is all a bit of a shock to the system; not sure anybody would turn up if you started before 9am at the earliest back home! Anyway, the twitter tag to watch is #saa11, although with no wifi in the session rooms, live coverage of sessions will be limited to those who can get a mobile phone signal, which is a bit of a shame.

The conference proper starts on Thursday; the beginning of the week is mostly taken up with meetings, but on Tuesday I attended an impressive range of presentations at the SAA Research Forum.  Abstracts and bios for each speaker are already online (and are linked where relevant below), and I understand that slides will follow in the next week or so.  Here are some personal highlights and things which I think may be of interest to archivists back home in the UK:

It was interesting to see several presentations on digital preservation, many reflecting similar issues and themes to those which inspired my Churchill Fellowship research and the beginning of this blog back in 2008.  Whilst I don’t think I’d recommend anyone set out to learn about digital preservation techniques the hard way with seriously obsolete media, if you do find yourself in the position of having to deal with 5.25 inch floppy disks or the like, Karen Ballingher’s presentation on students’ work at the University of Texas at Austin had some handy links, including the UT-iSchool Digital Archaeology Lab Manual and related documentation and an open source forensics package called Sleuth Kit.  Her conclusions were more generally applicable, and familiar: document everything you do, including failures; plan out trials; and just do it – learn by doing a real digital preservation project.  Cal Lee was excellent (as ever) on Levels of Representation in Digital Collections, outlining a framework of digital information constructed of eight layers of representation, from the bit- (or byte-)stream up to aggregations of digital objects, and noting that archival description already supports description at multiple levels but has not yet evolved to address these multiple representation layers.

Eugenia Kim’s paper on her ChoreoSave project, to determine the metadata elements required for digital dance preservation, reminded me of several UK and European initiatives: Siobhan Davies Replay, which Eugenia herself referenced and talked about at some length; the University of the Arts London’s John Latham Archive, which I’ve blogged about previously – Eugenia commented that choreographers had found the task of entering data into the numerous metadata fields onerous, and once again it seems to me there is a tension between the (dance, in this case) event and the assumption that text offers the only or best means of describing and accessing that event; and the CASPAR research on the preservation of interactive multimedia performances at the University of Leeds.

For my current research work on user participation in archives, the following papers were particularly relevant: Helice Koffler‘s report on the RLG Social Metadata Working Group‘s project on evaluating the impact of social media on museums, libraries and archives.  A three-part report is to be issued; part one is due for publication in September 2011.  I understand that this will include some useful and much-needed definitions of ‘user interaction’ terminology.  Part one has moderation as its theme – Helice commented that a strict moderation policy can act as a barrier to participation (a view I agree with up to a point, and will explore further in my own paper on Thursday).  Part two will be an analysis of the survey of social media use undertaken by the Working Group (4 U.K. organisations were involved in this, although none were archives).  As my interviews with archivists would also suggest, the survey found little evidence of serious problems with spam or abusive behaviour on MLA contributory platforms.  Ixchel Faniel reported on University of Michigan research on whether trust matters for re-use decisions.

With my UKAD hat on, the blue-sky (sorry, I hate that term, but I think it’s appropriate in this instance) thinking on archival description methods which emerged from the Radcliffe Workshop on Technology and Archival Processing was particularly inspiring.  The workshop was a two-day event which brought together invited technologists (many of whom had not previously encountered archives at all) and archivists to brainstorm new thinking on ways to tackle cataloguing backlogs, streamline cataloguing workflows and improve access to archives.  A collections exhibition was used to spark discussion, together with specially written use cases and scenarios to guide each day’s discussion.  Suggestions included the use of foot-pedal-operated overhead cameras to enable archival material to be digitised either at the point of accessioning or during arrangement and description, and experimenting with ‘trusted crowdsourcing’ – asking archivists to check documents for sensitivity – as a first step towards automating the redaction of confidential information.  These two suggestions reminded me of two recent projects at The National Archives in the U.K. – John Sheridan’s work to promote expert input into legislation.gov.uk (does anyone have a better link?) and the proposal to use text mining on closed record series which was presented to DSG in 2009.  Adam Kreisberg presented on the development of a toolkit for running focus groups by the Archival Metrics Project.  The toolkit will be tested with a sample session based upon archives’ use of social media, which I think could be very valuable for U.K. archivists.

Finally, and only because I couldn’t fit this one into any of the categories above, I found Heather Soyka and Eliot Wilczek‘s questions on how modern counter-insurgency warfare can be documented intriguing and thought-provoking.

Summer Summary

It’s been a busy summer for me – lots of stimulating conferences and events.  Here’s my (eclectic) roundup of highlights:

No.1 spot has to go to the fabulous VeleHanden project, a collaborative digitisation and crowdsourcing project initiated by Amsterdam City Archives, with numerous archival partners from all over the Netherlands. I was lucky enough to be invited to the inaugural meeting of the user test panel for the pilot project, militieregisters (militia registers), in Amsterdam at the end of June.

Why do I never get to work in buildings like this?

The testing phase of the project is now well underway, and the project is due to go live in October.  VeleHanden interests me for a number of reasons: Firstly, it has an interesting and innovative private-public partnership funding model and project structure.  Participating archives have to pay to have their registers scanned by a commercial digitisation company, but the sheer size of the consortium has enabled the negotiation of a very low price per page digitised.  Research users of the militieregisters site will pay a small fee to download a digitised image (similar to Ancestry), thus providing an ongoing revenue stream for the project.  The crowdsourcing interface is being developed by a private company; in future the consortium (or individual members of the consortium) will hire the platform for new projects, and the developers will be free to sell their product to other crowdsourcing markets.  Secondly, I’m interested in the project’s (still evolving) approach to opening up archival metadata.  Thirdly, I’m interested in the way the project is going about recruiting and motivating volunteers to undertake the indexing of the registers – targeting the popular family history community; offering extrinsic quasi-financial rewards for participants in the shape of discounted access to digitised content; and promoting and celebrating competition between participants.

In fact, I think one of VeleHanden‘s great strengths is the project’s user-focused approach to design and testing, the importance of which was highlighted by Claire Warwick in a ‘How To’ session on Studying Users at Interface 2011, “a new international forum to learn, share and network between the fields of Humanities and Technology”.  Slides from the keynote and workshop sessions at this event are available on the Interface 2011 website; all are worth a look.  I particularly enjoyed the workshop on Thinking Through Networks, and the practical tips on How to Get Funded should resonate with a much wider audience than just the academic community.  All the delegates had to give a lightning talk about their research.  Here is mine:

I also spoke at the Bloomsbury Conference on e-Publishing and e-Publications, and attended a couple of conferences I also went to last year – Research2, the Loughborough University student-organised conference on data analysis for information science, and AERI, the Archival Education and Research Institute, this year at Simmons College in Boston, MA.  It was interesting to note an increased interest in online participation and Internet-based methods at both events.  Podcasts of the AERI plenary sessions are available at the link above.
  • Digital Impacts: How to Measure and Understand the Usage and Impact of Digital Content, Oxford Internet Institute/JISC, Oxford, 20th May 2011 (#oiiimpacts)
  • Beyond Collections: Crowdsourcing for public engagement, RunCoCo Conference, Oxford, 26th May 2011 (#beyond2011)
  • Professor Sherry Turkle, Alone Together RSA Lecture, RSA, London, 1st June 2011 (#rsaonline)

I’m getting a bit behind with blog postings (again), so here, in the interests of ticking another thing off my to-do list, are a few highlights from various events I’ve attended recently…

It was good to see a couple of fellow archivists at the showcase conference for JISC’s Impact and Embedding of Digitised Resources programme. As searchroom visitor figures continue to fall, it is more important than ever that archivists understand how to measure and demonstrate the usage and impact of their online resources. The number of unique visitors to the archive service’s website (currently the only metric available in the CIPFA questionnaire for Archive Services, for instance) is no longer (if it ever was) adequate as a measure of online usage.  As Dr Eric Meyer pointed out in his introduction, one of the central lessons arising from the development of the Toolkit for the Impact of Digitised Scholarly Resources has been that no single metric will ever tell the whole story – a range of qualitative and quantitative methods is necessary to provide a full picture.  The word ‘scholarly’ in the toolkit’s name may be rather off-putting to some archivists working in local government repositories. That would be a shame, because this free online resource is full of very practical and useful advice and guidance. Like the historians caricatured by Sharon Howard of the Old Bailey Online project, archivists are not good at “studying people who can answer back” – the professional archival literature is full of laments about how poor we are at user studies.

The synthesis report from the Impact programme, Splashes and Ripples: Synthesizing the Evidence on the Impacts of Digital Resources, is recommended reading; detailed evaluation reports from each of the projects which took part in the programme are also available (at http://www.jisc.ac.uk/whatwedo/programmes/digitisation/impactembedding.aspx).  Many of the recommendations made by the report would be relatively straightforward to implement, yet could potentially transform archive services’ online presence – and the TIDSR toolkit contains the resources to help evaluate the change.  Simple suggestions include picking non-word acronyms to improve project visibility online (like TIDSR – at last I understand the Internet’s curious aversion to vowels: flickr, lanyrd, tumblr and so on!) and providing simple, automatic citations that are easy to copy or download (although I rather fear that archives are missing the boat on this one). Jane Winters was also excellent on sustaining digital impact, an important subject for archives, whose online resources are perhaps more likely than most to have a long shelf-life. Twitter coverage of the event is available on Summarizr (another one!).

One gap in the existing digital measurement landscape which occurred to me during the Impacts event was the need for metrics which take account not just of the passive audience of digital resources, but also of those who contribute to them and participate in a more active way.  The problem is easily illustrated by the difficulties encountered when using standard quantitative measurement tools with Web2.0-type sites.  Attempting to collate statistics on sites such as Your Archives or Transcribe Bentham through the likes of Google Scholar or Yahoo’s Site Explorer is handicapped by the very flexibility of a wiki site structure, compounded again, I suspect, by the want of a uniquely traceable identity.  Google Scholar, in particular, seems averse to searches on URLs (curiously, I discovered that although a search for yourarchives.nationalarchives.gov.uk produces 0 hits, yourarchives.nationalarchives.gov.* comes back with 26), whilst sites which invite user contributions are perhaps particularly susceptible to false-positive site inlink hits where they are highlighted as a general resource in blogrolls and the like.

This need to be clearer about what we mean by user engagement and how to measure when we’ve successfully achieved it was also my main take-away from the following week’s RunCoCo Conference – Beyond Collections: Crowdsourcing for Public Engagement.  Like Arfon Smith of the Zooniverse team, I am not very comfortable with the term ‘crowdsourcing’, and indeed many of the projects showcased at the Beyond conference seemed to me to be more technologically-enhanced outreach events or volunteer projects than true attempts to engage the ‘crowd’ (not that there is anything wrong with traditional approaches, but I just don’t think they’re crowdsourcing).  Even where large numbers of people are involved, are they truly ‘engaged’ by receiving a rubber stamp (in the case of the Erster Weltkrieg in Alltagsdokumenten project) to mark their attendance at an open day type event?  Understanding the social dynamics behind even large scale online collaborations is important – the Zooniverse ethical contract bears repeating:

  1. Contributors are collaborators, not users
  2. Contributors are motivated and engaged by real research
  3. Don’t waste people’s time

Podcasts of all the Beyond presentations and a series of short, reflective blog posts on the day’s proceedings are available.

Finally, Professor Sherry Turkle‘s RSA lecture to celebrate the launch of her new book, Alone Together, about the social impact of the Internet, was rather too brief to give more than a glimpse of her current thinking on our technology-saturated society, but nevertheless there were some intriguing ideas which have potentially wide-ranging implications for the future of archives. One was the sense that the Internet is not currently serving our human needs.  She also spoke about the tensions between the willingness to share and privacy: what, Turkle asked, is democracy, and what is intimacy, without privacy? In response to questions from the audience, Turkle also claimed that people don’t like to say negative things online because it leaves a trace of things that went wrong. If that is true, it might have important implications for what we can expect people to contribute in archival contexts, and the nature of the debate which might take place in contested spaces of memory. Audio of the event is available from the RSA website.

This post is a thank you to my followers on Twitter, for pointing me towards many of the examples given below.  The thoughts on automated description and transcription are a preliminary sketching out of ideas (which, I suppose, is a way of excusing myself if I am not coherent!), on which I would particularly welcome comments or further suggestions:

A week or so before Easter, I was reading a paper about the classification of galaxies on the astronomical crowdsourcing website, Galaxy Zoo.  The authors use a statistical (Bayesian) analysis to distil an accurate sample of data, and then compare the reliability of this crowdsourced sample to classifications produced by expert astronomers.  The article also refers to the use of sample data in training artificial neural networks in order to automate the galaxy classification process.
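The paper’s statistical machinery is well beyond anything I could reproduce here, but the basic idea – weight each volunteer’s classifications by their agreement with an expert-classified sample, then take a weighted consensus per galaxy – can be sketched in a few lines of Python. This is a toy illustration only: the volunteers, galaxies, labels and fallback weight below are all made up.

```python
# Toy sketch of weighting crowdsourced classifications against an expert
# "gold" sample (an illustration of the idea, not the Galaxy Zoo pipeline).
from collections import defaultdict


def volunteer_weights(votes, gold, default=0.5):
    """votes: {(volunteer, galaxy): label}; gold: {galaxy: expert label}."""
    hits, seen = defaultdict(float), defaultdict(int)
    for (volunteer, galaxy), label in votes.items():
        if galaxy in gold:
            seen[volunteer] += 1
            hits[volunteer] += (label == gold[galaxy])
    volunteers = {volunteer for volunteer, _ in votes}
    # Volunteers never tested against the gold sample get a neutral default.
    return {v: hits[v] / seen[v] if seen[v] else default for v in volunteers}


def consensus(votes, weights):
    """Weighted majority label for each galaxy."""
    tallies = defaultdict(lambda: defaultdict(float))
    for (volunteer, galaxy), label in votes.items():
        tallies[galaxy][label] += weights[volunteer]
    return {galaxy: max(labels, key=labels.get)
            for galaxy, labels in tallies.items()}


if __name__ == "__main__":
    votes = {("ann", "g1"): "spiral", ("bob", "g1"): "elliptical",
             ("ann", "g2"): "spiral", ("bob", "g2"): "elliptical"}
    gold = {"g2": "spiral"}  # one expert-classified galaxy
    print(consensus(votes, volunteer_weights(votes, gold)))
```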

The Galaxy Zoo paper set me thinking about archivists’ approaches to online user participation and the harnessing of computing power to solve problems in archival description.  On the whole, I would say that archivists (and our partners on ‘digital archives’ kinds of projects) have been rather hamstrung by a restrictive ‘human-scale’, qualitatively-evaluated vision of what might be achievable through the application of computing technology to such issues.

True, the notion of an Archival Commons evokes a network-oriented archival environment.  But although the proponents of this concept recognise “that the volume of records simply does not allow for extensive contextualization by archivists to the extent that has been practiced in the past”, the types of ‘functionalities’ envisaged to comprise this interactive descriptive framework still mirror conventional techniques of description in that they rely upon the human ability to interpret context and content in order to make contributions imbued with “cultural meaning”.  There are occasional hints of the potential for more extensible (?web scale) methods of description, in the contexts of tagging and of information visualization, but these seem to be conceived more as opportunities for “mining the communal provenance” of aggregated metadata – so creating additional folksonomic structures alongside traditional finding aids.  Which is not to say that the Archival Commons is not still justified from a cultural or societal perspective, but that the “volume of records” cataloguing backlog issue will require a solution which moves beyond merely adding to the pool of potential participants enabled to contribute narrative descriptive content and establish contextual linkages.

Meanwhile, double-keying, checking and data standardisation procedures in family history indexing have come a long way since the debacle over the 1901 census transcription. But double-keying for a commercial partner also signals a doubling of transcription costs, possibly without a corresponding increase in transcription accuracy.  Or, as the Galaxy Zoo article puts it, “the overall agreement between users does not necessarily mean improvement as people can agree on a wrong classification”.  Nevertheless, these norms from the commercial world have somehow transferred themselves as the ‘gold standard’ into archival crowdsourcing transcription projects, in spite of the proofreading overhead (bounded by the capacity of the individual, again).  As far as I am aware, Old Weather (which is, of course, a Zooniverse cousin of Galaxy Zoo) is the only project working with archival content which has implemented a quantitative approach to assess transcription accuracy – improving the project’s completion rate in the process, since the decision could be taken to reduce the number of independent transcriptions required from five to three.
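I don’t know the detail of how Old Weather assesses agreement, but the general principle – accept a reading once enough independent transcriptions of the same field concur, rather than paying for blanket double-keying and proofreading – might look something like the toy sketch below; the light normalisation and the two-out-of-three threshold are my own assumptions, not the project’s actual rules.

```python
# Rough sketch of consensus between independent transcriptions of one field.
# The normalisation and threshold are illustrative assumptions only.
from collections import Counter


def normalise(text):
    """Collapse trivial differences (case, runs of whitespace) before comparing."""
    return " ".join(text.split()).lower()


def accepted_reading(transcriptions, min_agreement=2):
    """Return the agreed reading, or None if no reading reaches the threshold."""
    counts = Counter(normalise(t) for t in transcriptions)
    reading, freq = counts.most_common(1)[0]
    return reading if freq >= min_agreement else None


# Example: three independent keyings of the same logbook entry.
print(accepted_reading(["HMS Defiance", "hms defiance ", "HMS Defence"]))
```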

Pondering these and other such tangles, I began to wonder whether there have indeed been any genuine attempts to harness large-scale processing power for archival description or transcription.  Commercial tools designed to decipher modern handwriting are now available (two examples: MyScript for LiveScribe and Evernote‘s text recognition tool), so why not an automated palaeographical tool?  Vaguely remembering that The National Archives had once been experimenting with text mining for both cataloguing and sensitivity classification [I do not know what happened to this project – can anyone shed some light on this?], and recollecting the determination of one customer at West Yorkshire Archive Service who tried valiantly (and failed) to teach his Optical Character Recognition (OCR) software to recognise nearly four centuries of clerks’ handwriting in the West Riding Registry of Deeds indexes, I put out a tentative plea on Twitter for further examples of archival automation.  The following examples are the pick of the amazing set of responses I received:

  • The Muninn Project aims to extract and classify written data about the First World War from digitized documents using raw computing power alone.  The project appears to be at an early stage, and is beginning with structured documents (those written onto pre-printed forms) but hopes to move into more challenging territory with semi-structured formats at a later stage.
  • The Dutch Monk Project (not to be confused with the American project of the same name, which facilitates text mining in full-text digital library collections!) seeks to make use of the qualitative interventions of participants playing an online transcription correction game in order to train OCR software for improved handwriting recognition rates in future.  The project tries to stimulate user participation through competition and rewards, following the example of Google Image Labeller.  If your Dutch is good, Christian van der Ven’s blog has an interesting critique of this project (Google’s attempt at translation into English is a bit iffy, but you can still get the gist).
  • Impact is a European-funded project which takes a similar approach to the Monk project, but has focused upon improving automated text recognition for early printed books.  The project has produced numerous tools to improve both OCR image recognition and lexical information retrieval, and a web-based collaborative correction platform for accuracy verification by volunteers.  The input from these volunteers can then in turn be used to further refine the automated character recognition (see the videos on the project’s YouTube channel for some useful introductory materials).  Presumably these techniques could be further adapted to help with handwriting recognition, perhaps beginning with the more stylised court hands, such as Chancery hand.  The division of the quality-control checks into separate character-, word-, and page-length tasks (as illustrated in this video) is especially interesting, although I think I’d want to take this further and partition the labour on each of the different tasks as well, rather than expecting one individual to work sequentially through each step.  Thinking of myself as a potential volunteer checker, I think I’d be likely to get bored and give up at the letter-checking stage.  Perhaps this rather more mundane task would be more effectively offered in return for a peppercorn payment as a ‘human intelligence task’ on a platform such as Amazon Mechanical Turk, whilst volunteer time could be better utilised on the more interesting word- and page-level checking.
  • Genealogists are always ahead of the game!  The Family History Technology Workshop held annually at Brigham Young University usually includes at least one session on handwriting recognition and/or data extraction from digitized documents.  I’ve yet to explore these papers in detail, but there looks to be masses to read up on here.
  • Wot no catalogue? Google-style text search within historic manuscripts?  The Center for Intelligent Information Retrieval (University of Massachusetts Amherst) handwriting retrieval demonstration systems – manuscript document retrieval on the fly.
  • Several other tools and projects which might be of interest are listed in this handy google doc on Transcribing Handwritten Documents put together by attendees at the DHapi workshop held at the Maryland Institute for Technology in the Humanities earlier this year.  Where I’ve not mentioned specific examples directly here, it’s mostly because these are examples of online user transcription interfaces (which for the purposes of this post I’m classing as technology-enhanced projects, as opposed to technology-driven, which is my main focus here – if that makes sense? Monk and Impact creep in above because they combine both approaches).

If you know of other examples, please leave a comment…