Posts Tagged ‘description’

Friday had a bit of a digital theme for me, beginning with a packed, standing-room-only session 302, Practical Approaches to Born-Digital Records: What Works Today. After a witty introduction by Chris Prom about his Fulbright research in Dundee, a series of speakers introduced their digital preservation work, with a real emphasis on ‘you too can do this’.  I learnt about a few new tools: firefly, a tool which is used to scan for American social security numbers and other sensitive information – not much use in a British context, I imagine, but an interesting approach all the same; TreeSize Professional, a graphical hard disk analyser; and several projects were making use of the Duke Data Accessioner, a tool with which I was already familiar but have never used.  During the morning session, I also popped in and out of ‘team-Brit’ session 304 Archives in the Web of Data which discussed developments in the UK and US in opening up and linking together archival descriptive data, and session 301 Archives on the Go: Using Mobile Technologies for Your Collections, where I caught a presentation on the use of FourSquare at Stanford University.

In the afternoon, I mostly concentrated on session 401, Re-arranging Arrangement and Description, with a brief foray into session 407, Faces of Diversity: Diasporic Archives and Archivists in the New Millennium.  Unless I missed this whilst I was out at the other session, nobody in session 401 mentioned the series system as a possible alternative or resolution to some of the tensions identified in a strict application of hierarchically-interpreted original order, which surprised me.  There were some hints towards a need for a more object-oriented view of description in a digital environment, and of methods of addressing the complexity of having multiple representations (physical, digital etc.).  On the flights for this trip I have been reading my UCL colleague Jenny Bunn’s recently completed PhD thesis, Multiple Narratives, Multiple Views: Observing Archival Description, which would have added another layer to the discussion in this session.

And continuing the digital theme, I was handed a flyer for an event coming later this year (on 6th October): Day of Digital Archives which might interest some UK colleagues.  This is

…an initiative to raise awareness of digital archives among both users and managers. On this day, archivists, digital humanists, programmers, or anyone else creating, using, or managing digital archives are asked to devote some of their social media output (i.e. tweets, blog posts, youtube videos etc.) to describing their work with digital archives.  By collectively documenting what we do, we will be answering questions like: What are digital archives? Who uses them? How are they created and managed? Why are they important?



This should be the first of several posts from this year’s Society of American Archivists Annual Meeting in Chicago, which I am attending thanks to generous funding from UCL’s Graduate Conference Fund and from the Archives and Records Association, who asked me to blog the conference.  First impressions of a Brit: this conference is huge.  I could (and probably will) get lost inside the conference hotel, and the main programme involves parallel tracks of ten sessions at once.  And proceedings start at 8am.  This is all a bit of a shock to the system; not sure anybody would turn up if you started before 9am at the earliest back home! Anyway, the Twitter tag to watch is #saa11, although with no wifi in the session rooms, live coverage of sessions will be limited to those who can get a mobile phone signal, which is a bit of a shame.

The conference proper starts on Thursday; the beginning of the week is mostly taken up with meetings, but on Tuesday I attended an impressive range of presentations at the SAA Research Forum.  Abstracts and bios for each speaker are already online (and are linked where relevant below), and I understand that slides will follow in the next week or so.  Here are some personal highlights and things which I think may be of interest to archivists back home in the UK:

It was interesting to see several presentations on digital preservation, many reflecting similar issues and themes to those which inspired my Churchill Fellowship research and the beginning of this blog back in 2008.  Whilst I don’t think I’d recommend anyone set out to learn about digital preservation techniques the hard way with seriously obsolete media, if you do find yourself in the position of having to deal with 5.25 inch floppy disks or the like, Karen Ballingher’s presentation on students’ work at the University of Texas – Austin had some handy links, including the UT-iSchool Digital Archaeology Lab Manual and related documentation and an open source forensics package called Sleuth Kit.  Her conclusions were more generally applicable, and familiar: the importance of documenting everything you do, including failures; planning out trials; and just do it – learn by doing a real digital preservation project.  Cal Lee was excellent (as ever) on Levels of Representation in Digital Collections, outlining a framework of digital information constructed of eight layers of representation from the bit(byte-)stream to aggregations of digital objects, and noting that archival description already supports description at multiple levels but has not yet evolved to address these multiple representation layers.  
Eugenia Kim’s paper on her ChoreoSave project to determine the metadata elements required for digital dance preservation reminded me of several UK and European initiatives: Siobhan Davies Replay, which Eugenia herself referenced and talked about at some length; the University of the Arts London’s John Latham Archive, which I’ve blogged about previously, because Eugenia commented that choreographers had found the task of entering data into the numerous metadata fields onerous – once again it seems to me there is a tension between the event (dance, in this case) and the assumption that text offers the only or best means of describing and accessing that event; and the CASPAR research on the preservation of interactive multimedia performances at the University of Leeds.

For my current research work on user participation in archives, the following papers were particularly relevant: Helice Koffler’s report on the RLG Social Metadata Working Group’s project on evaluating the impact of social media on museums, libraries and archives.  A three-part report is to be issued; part one is due for publication in September 2011.  I understand that this will include some useful and much-needed definitions of ‘user interaction’ terminology.  Part 1 has moderation as its theme – Helice commented that a strict moderation policy can act as a barrier to participation (I agree with this up to a point – and will explore it further in my own paper on Thursday).  Part 2 will be an analysis of the survey of social media use undertaken by the Working Group (4 U.K. organisations were involved in this, although none were archives).  As my interviews with archivists would also suggest, the survey found little evidence of serious problems with spam or abusive behaviour on MLA contributory platforms.  Ixchel Faniel reported on University of Michigan research on whether trust matters for re-use decisions.

With my UKAD hat on, the blue sky (sorry, I hate that term, but I think it’s appropriate in this instance) thinking on archival description methods which emerged from the Radcliffe Workshop on Technology and Archival Processing was particularly inspiring.  The workshop was a two-day event which brought together invited technologists (many of whom had not previously encountered archives at all) and archivists to brainstorm new thinking on ways to tackle cataloguing backlogs, streamline cataloguing workflows and improve access to archives.  A collections exhibition was used to spark discussion, together with specially written use cases and scenarios to guide each day’s discussion.  Suggestions included the use of foot-pedal operated overhead cameras to enable archival material to be digitised either at the point of accessioning, or during arrangement and description; and experimenting with ‘trusted crowdsourcing’ – asking archivists to check documents for sensitivity – as a first step towards automating the redaction process of confidential information.  These last two suggestions reminded me of two recent projects at The National Archives in the U.K. – John Sheridan’s work to promote expert input into legislation.gov.uk (does anyone have a better link?) and the proposal to use text mining on closed record series which was presented to DSG in 2009.  Adam Kreisberg presented on the development of a toolkit for running focus groups by the Archival Metrics Project.  The toolkit will be tested with a sample session based upon archives’ use of social media, which I think could be very valuable for U.K. archivists.

Finally, and only because I couldn’t fit this one into any of the categories above, I found Heather Soyka and Eliot Wilczek’s questions on how modern counter-insurgency warfare can be documented intriguing and thought-provoking.


This post is a thank you to my followers on Twitter, for pointing me towards many of the examples given below.  The thoughts on automated description and transcription are a preliminary sketching out of ideas (which, I suppose, is a way of excusing myself if I am not coherent!), on which I would particularly welcome comments or further suggestions:

A week or so before Easter, I was reading a paper about the classification of galaxies on the astronomical crowdsourcing website, Galaxy Zoo.  The authors use a statistical (Bayesian) analysis to distil an accurate sample of data, and then compare the reliability of this crowdsourced sample to classifications produced by expert astronomers.  The article also refers to the use of sample data in training artificial neural networks in order to automate the galaxy classification process.

This set me thinking about archivists’ approaches to online user participation and the harnessing of computing power to solve problems in archival description.  On the whole, I would say that archivists (and our partners on ‘digital archives’ kinds of projects) have been rather hamstrung by a restrictive ‘human-scale’, qualitatively-evaluated, vision of what might be achievable through the application of computing technology to such issues.

True, the notion of an Archival Commons evokes a network-oriented archival environment.  But although the proponents of this concept recognise “that the volume of records simply does not allow for extensive contextualization by archivists to the extent that has been practiced in the past”, the types of ‘functionalities’ envisaged to comprise this interactive descriptive framework still mirror conventional techniques of description in that they rely upon the human ability to interpret context and content in order to make contributions imbued with “cultural meaning”.  There are occasional hints of the potential for more extensible (?web scale) methods of description, in the contexts of tagging and of information visualization, but these seem to be conceived more as opportunities for “mining the communal provenance” of aggregated metadata – so creating additional folksonomic structures alongside traditional finding aids.  Which is not to say that the Archival Commons is not still justified from a cultural or societal perspective, but that the “volume of records” cataloguing backlog issue will require a solution which moves beyond merely adding to the pool of potential participants enabled to contribute narrative descriptive content and establish contextual linkages.

Meanwhile, double-keying, checking and data standardisation procedures in family history indexing have come a long way since the debacle over the 1901 census transcription. But double-keying for a commercial partner also signals a doubling of transcription costs, possibly without a corresponding increase in transcription accuracy.  Or, as the Galaxy Zoo article puts it, “the overall agreement between users does not necessarily mean improvement as people can agree on a wrong classification”.  Nevertheless, these norms from the commercial world have somehow transferred themselves as the ‘gold standard’ into archival crowdsourcing transcription projects, in spite of the proofreading overhead (bounded by the capacity of the individual, again).  As far as I am aware, Old Weather (which is, of course, a Zooniverse cousin of Galaxy Zoo) is the only project working with archival content which has implemented a quantitative approach to assess transcription accuracy – improving the project’s completion rate in the process, since the decision could be taken to reduce the number of independent transcriptions required from five to three.
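
To make the trade-off concrete, here is a minimal sketch of how agreement between independent transcriptions might be measured.  It is an illustration only and not Old Weather's actual method: the function name `consensus`, the `min_agree` threshold and the sample log entries are all hypothetical.  It also shows the Galaxy Zoo caveat in miniature, since a high agreement score says nothing about whether the agreed reading is correct.

```python
from collections import Counter


def consensus(transcriptions, min_agree=2):
    """Return (value, agreement) for the most common transcription.

    value is None when fewer than min_agree transcribers agree.
    agreement is the share of transcribers who gave the winning reading;
    it measures consistency between volunteers, not correctness.
    """
    # Hypothetical normalisation: ignore case and surrounding whitespace.
    normalised = [t.strip().lower() for t in transcriptions]
    value, votes = Counter(normalised).most_common(1)[0]
    agreement = votes / len(normalised)
    return (value if votes >= min_agree else None), agreement


# Invented sample data: five independent readings of one log entry.
five = ["Wind NW 4", "wind NW 4", "Wind NW 4", "Wind NE 4", "Wind NW 4 "]
three = five[:3]

print(consensus(five))   # four of the five readings agree
print(consensus(three))  # all three readings agree
```

If the reading chosen from three transcriptions matches the one chosen from five across a large enough sample, that is the kind of quantitative evidence which could justify asking for fewer transcriptions per page.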

Pondering these and other such tangles, I began to wonder whether there have indeed been any genuine attempts to harness large-scale processing power for archival description or transcription.  Commercial tools designed to decipher modern handwriting are now available (two examples: MyScript for LiveScribe; Evernote’s text recognition tool), so why not an automated palaeographical tool?  Vaguely remembering that The National Archives had once been experimenting with text mining for both cataloguing and sensitivity classification [I do not know what happened to this project – can anyone shed some light on this?], and recollecting the determination of one customer at West Yorkshire Archive Service who tried valiantly (and failed) to teach his Optical Character Recognition (OCR) software to recognise nearly four centuries of clerks’ handwriting in the West Riding Registry of Deeds indexes, I put out a tentative plea on Twitter for further examples of archival automation.  The following examples are the pick of the amazing set of responses I received:

  • The Muninn Project aims to extract and classify written data about the First World War from digitized documents using raw computing power alone.  The project appears to be at an early stage, and is beginning with structured documents (those written onto pre-printed forms) but hopes to move into more challenging territory with semi-structured formats at a later stage.
  • The Dutch Monk Project (not to be confused with the American project of the same name, which facilitates text mining in full-text digital library collections!) seeks to make use of the qualitative interventions of participants playing an online transcription correction game in order to train OCR software for improved handwriting recognition rates in future.  The project tries to stimulate user participation through competition and rewards, following the example of Google Image Labeller.  If your Dutch is good, Christian van der Ven’s blog has an interesting critique of this project (Google’s attempt at translation into English is a bit iffy, but you can still get the gist).
  • Impact is a European funded project which takes a similar approach to the Monk project, but has focused upon improving automated text recognition with early printed books.  The project has produced numerous tools to improve both OCR image recognition and lexical information retrieval, and a web-based collaborative correction platform for accuracy verification by volunteers.  The input from these volunteers can then in turn be used to further refine the automated character recognition (see the videos on the project’s YouTube channel for some useful introductory materials).  Presumably these techniques could be further adapted to help with handwriting recognition, perhaps beginning with the more stylised court hands, such as Chancery hand.  The division of the quality control checks into separate character, word, and page length tasks (as illustrated in this video) is especially interesting, although I think I’d want to take this further and partition the labour on each of the different tasks as well, rather than expecting one individual to work sequentially through each step.  Thinking of myself as a potential volunteer checker, I think I’d be likely to get bored and give up at the letter-checking stage.  Perhaps this rather more mundane task would be more effectively offered in return for peppercorn payment as a ‘human intelligence task’ on a platform such as Amazon Mechanical Turk, whilst the volunteer time could be more effectively utilised on the more interesting word and page level checking.
  • Genealogists are always ahead of the game!  The Family History Technology Workshop held annually at Brigham Young University usually includes at least one session on handwriting recognition and/or data extraction from digitized documents.  I’ve yet to explore these papers in detail, but there looks to be masses to read up on here.
  • Wot no catalogue? Google-style text search within historic manuscripts?  The Center for Intelligent Information Retrieval (University of Massachusetts Amherst) handwriting retrieval demonstration systems – manuscript document retrieval on the fly.
  • Several other tools and projects which might be of interest are listed in this handy Google Doc on Transcribing Handwritten Documents put together by attendees at the DHapi workshop held at the Maryland Institute for Technology in the Humanities earlier this year.  Where I’ve not mentioned specific examples directly here, it’s mostly because these are examples of online user transcription interfaces (which for the purposes of this post I’m classing as technology-enhanced projects, as opposed to technology-driven, which is my main focus here – if that makes sense? Monk and Impact creep in above because they combine both approaches).

If you know of other examples, please leave a comment…


A round-up and some brief reflections on a number of different events and presentations I’ve attended recently:

Many of this term’s Archives and Society seminars at the Institute of Historical Research have been on particularly pertinent subjects for me, and rather gratifyingly have attracted bumper audiences (we ran out of chairs at the last one I attended).  I’ve already blogged here about the talk on the John Latham Archive.  Presentations by Adrian Autton and Judith Bottomley from Westminster Archives, and Nora Daly and Helen Broderick from the British Library revealed an increasing awareness and interest in the use of social media in archives, qualified by a growing realisation that such initiatives are not self-sustaining, and in fact require a substantial commitment from archive staff, in time if not necessarily in financial terms, if they are to be successful.  Nora and Helen’s talk also prompted an intriguing audience debate about the ‘usefulness’ of user contributions.  To me, this translates as ‘why don’t users behave like archivists’ (or possibly like academic historians)?  But if the aim of promoting archives through social media is to attract new audiences, as is often claimed, surely we have to expect and celebrate the different perspectives these users bring to our collections.  Our professional training perhaps gives us tunnel vision when it comes to assessing the impact of users’ tagging and commenting.  Just because users’ terminology cannot be easily matched to the standardised metadata elements of ISAD(G) doesn’t mean it lacks relevance or usefulness outside of archival contexts.  Similar observations have been made in research in the museums and art galleries world, where large proportions of the tags contributed to the steve.museum prototype tagger represented terms not found in museum documentation (in one case, greater than 90% of tags were ‘new’ terms).  These new terms are viewed as an unparalleled opportunity to enhance the accessibility of museum objects beyond traditional audiences, augmenting professional descriptions, not replacing them.

Releasing archival description from the artificial restraints imposed by the canon of professional practice was also a theme of my UCL colleague, Jenny Bunn’s, presentation of her PhD research, ‘The Autonomy Paradox’.  I find I can balance increased understanding about her research each time I hear her speak, with simultaneously greater confusion the deeper she gets into second order cybernetics!  Anyway, suffice it to say that I cannot possibly do justice to her research here, but anyone in North America might like to catch her at the Association of Canadian Archivists’ Conference in June.  I’m interested in the implications of her research for a move away from hierarchical or even series-system description, and whether this might facilitate a more object-oriented view of archival description.

Last term’s Archives and Society series included a talk by Nicole Schutz of Aberystwyth University about her development of a cloud computing toolkit for records management.  This was repeated at the recent meeting of the Data Standards Section of the Archives and Records Association, who had sponsored the research.  At the same meeting, I was pleased to discover that I know more than I thought I did about linked data and RDF, although I am still relieved that Jane Stevenson and the technical team behind the LOCAH Project are pioneering this approach in the UK archives sector and not me!  But I am fascinated by the potential for linked open data to draw in a radical new user community to archives, and will be watching the response to the LOCAH Project with interest.

The Linked Data theme was continued at the UKAD (UK Archives Discovery Network) Forum held at The National Archives on 2 March.  There was a real buzz to the day – so nice to attend an archives event that was full of positive energy about the future, not just ‘tough talk for tough times’.  There were three parallel tracks for most of the day, plus a busking space for short presentations and demos.  Obviously, I couldn’t get to everything, but highlights for me included:

  • the discovery of a second archives Linked Data project – the SALDA project at the University of Sussex, which is extracting archival descriptions from CALM using EAD, and then transforming them into Linked Data
  • Victoria Peters’ overview of the open source archival description software, ICA-AtoM – feedback welcomed, I think, on the University of Strathclyde’s new online catalogue which uses ICA-AtoM
  • chatting about Manchester Archive + (Manchester archival images on flickr)
  • getting an insider’s view of HistoryPin and Ancestry’s World Archives Project – the latter particularly fascinating to me in the context of motivating and supporting contributors in online archival contexts

Slides from the day, including mine on Crowds and Communities in the Archives, are being gathered together on slideshare at http://www.slideshare.net/tag/ukad.  Initial feedback from the day was good, and several people have blogged about the event (including Bethan Ruddock from the ArchivesHub, a taxonomist’s viewpoint at VocabControl, Karen Watson from the SALDA Project, and The Questing Archivist).

Edit to add Kathryn Hannan’s Archives and Auteurs blog post.


I’ve noticed before that in all the excitement over Web2.0 tools for user participation, archivists tend to make one big assumption: that textual descriptions of archives will remain the primary access channel to archival material in the electronic age.  This is despite all the evidence (when we bother to look for it, which isn’t really often enough) that users find archival finding aids difficult to navigate, and an ongoing blurring of boundaries between previously separate descriptive products (catalogues, indexes, calendars, transcripts etc.) in online contexts.

In the context of user participation, this assumption is particularly significant, since adding considerable quantities of user-contributed metadata – comments, tags, and word-for-word transcripts of documents – can surely only amplify the existing difficulties of user interface design and add to the complexities of using archival descriptive systems.  And that’s not to mention the possibilities suggested by ongoing improvements in optical character recognition and data mining technologies.  Even given the assistance of sophisticated search algorithms, that’s a hell of a lot of text for the poor researcher to have to wade through.

Then of course, many archivists – and many of our users – would subscribe to the view that there is something extra special about the touch and feel of original archive documents, and consider the digitised surrogate to be an inevitably impoverished medium because of it.  Despite advances in digital tactility devices this one’s probably quite hard to crack for remote access to archives, however!

One really promising alternative is visual representation, which is particularly effective for very large datasets – see, for example, Mitchell Whitelaw’s ‘visual archive’ research project for the National Archives of Australia.

And on Tuesday, I was introduced to another – searching by sound, soon to be implemented as one of the three access routes into the archive of the artist John Latham (it’ll be under the ‘AA’ link shortly; in the meantime you can browse a slideshow of the archive or interrogate its contents in more traditional, textual fashion by clicking on either of the other two letter codes from the homepage).  Fascinating stuff, although one potential problem is that you would need to know quite a substantial amount about John Latham and his ‘flat-time’ theories before you can make sense of the soundtrack and the finding aid itself – so in that sense, the sound search tool might be as much a barrier as a facilitator of access.  But then again, the same can be said of certain textual finding aids: fonds, anybody?

Anyone fancy devising an olfactory finding aid?


A bit late with this, but I’ve just noticed that fellow National Archives / UCL PhD student Ann Fenech has posted her 3-minute presentation from the recent PhD day held at The National Archives on her blog, and it’s occurred to me that mine is probably quite a good short introduction to what I’m working on too:


Day 3 of ECDL started for me with the Query Log Analysis session.  I thought perhaps that, now the papers were getting heavily into IR technicalities, I might not understand what was being presented or that it would be less relevant to archives.  How wrong can you be!  Well, ok, IR metrics are complex, especially for someone new to the field, but when the first presentation was based upon a usability study of the EAD finding aids at the Nationaal Archief (the National Archives of the Netherlands), it wasn’t too difficult to spot the relevance.  In fact, it was interesting to see how you notice things when the test data is presented in a foreign language, that you wouldn’t necessarily observe if they were in your mother tongue.  In the case of the Nationaal Archief, I was horrified at how many clicks were required to reach an item description.  Most archives have this problem with web-based finding aids (unless they merely replicate a traditional format, for instance, a PDF copy of a paper list), but somehow it was so much more obvious when I wasn’t quite sure exactly what was being presented to me at each stage of the results.  This is what it must be like to be an archival novice.  No wonder they give up.

The second paper of the morning, Determining Time of Queries for Re-ranking Search Results, was also very pertinent to searching in an archival context.  It discussed ‘temporal documents’ where either the terminology itself has changed over time or time is highly relevant to the query.  This temporal intent may be either implicit or explicit in the query.  For example, ‘tsunami + Thailand’ is likely to refer to the 2004 tsunami.  These kinds of issues are obviously very important for historians, and for archivists making temporal collections available in a web environment, such as web archives and online archival finding aids.
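
The general idea of using a query's time to re-rank results can be sketched in a few lines.  This is my own illustration under stated assumptions, not the method from the paper: the `Hit` class, the `rerank` function, the `weight` and `window` parameters and the sample records are all hypothetical, and the query year is assumed to have been determined already (which is the difficult part the paper actually addresses).

```python
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    year: int     # date of the document or record described
    score: float  # text-relevance score from the underlying search engine


def rerank(hits, query_year, weight=0.5, window=5):
    """Re-order hits so that those dated near query_year are boosted.

    closeness falls linearly from 1.0 (same year) to 0.0 (window or more
    years away); weight sets how much it counts against the text score.
    """
    def combined(hit):
        distance = abs(hit.year - query_year)
        closeness = max(0.0, 1.0 - distance / window)
        return (1 - weight) * hit.score + weight * closeness

    return sorted(hits, key=combined, reverse=True)


# Invented sample results for a query like 'tsunami + Thailand'.
hits = [
    Hit("Tsunami preparedness review", 2010, 0.9),
    Hit("Indian Ocean tsunami relief report", 2005, 0.8),
    Hit("Coastal survey", 1998, 0.7),
]
ranked = rerank(hits, query_year=2004)
print([h.title for h in ranked])
```

With these invented figures the 2005 relief report moves above the 2010 review, despite its lower text score, because it is dated closest to the assumed query year.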

Later in the morning, I was down to attend the stream on Domain-specific Digital Libraries.  One of these specific domains turned out to be archives, with an (appropriately) very philosophical paper presented by Pierre-Edouard Portier about DINAH [in French].  This is “a philological platform for the construction of multi-structured documents”, created to enable the transcription and annotation of the papers of the French philosopher, Jean-Toussaint Desanti, and to facilitate the visualization of the trace of user activities.  My tweeting of this paper (limited on account of both the presentation’s intellectual and technical complexity and the fact that I’d got to bed at around 3am that morning!) seemed to catch the attention of both the archival profession and the Linked Data community; it certainly deserves some further coverage in the English-speaking archival professional literature.

In the same session, I was also interested in the visualization techniques presented for time-oriented scientific data by Jürgen Bernard, which reminded me of The Visible Archive research project funded by the National Archives of Australia.  The principle – that visual presentations are a useful, possibly preferable, alternative to text-based descriptions of huge series of data – is the same in both cases.  Similarly, the PROBADO project has investigated the development of tools to store and retrieve complex, non-textual data and objects, such as 3D CAD drawings and music.  There were important implications from all of these papers for the future development of archival finding aids.

In the afternoon, I found myself helping out at the Networked Knowledge Organization Systems/Services (NKOS) workshop.  I wasn’t really sure what this entailed, but it turned out to involve things like thesauri construction and semantic mapping between systems, all of which is very relevant to the UK Archives Discovery (UKAD) Network objectives.  I was particularly sorry I was unable to make the Friday session of the workshop, which was to be all about user-centred knowledge system design and Linked Data; however, the slides are all available with the programme for the workshop.

Once again, my sincere thanks to the conference organisers for my opportunity to participate in ECDL2010.  The conference proceedings are available from Springer, for those who want to follow up further, and presentation slides are gradually appearing on the conference website.

