I arrived late at the JISC Collections AGM, but was pleased to find that I was by no means the only representative of the further education sector. Peter Murray-Rust was in full spate, and left us in no doubt as to his views on PowerPoint presentations and PDF files: he delivered a web-based presentation and inveighed against PDFs as useless for the purposes of the semantic chemistry he and others are developing. He amused us with some illustrations of the difficulties of data mining, for example teaching the OSCAR3 program to distinguish between the pronoun "he" and He, the chemical symbol for helium. He also described (there's some detail on his blog) how a student, perfectly legitimately, bookmarked some American Chemical Society articles and opened them all at once in a browser, instantly causing the ACS to cut off the whole university from access to all its publications. He told the story of Shelley Batts, too, who in 2007 got into trouble for referring to a graph in some research on the effects of antioxidants taken with alcohol; see BoingBoing's comment on the story. He alluded to the reports in that morning's Guardian of the difficulties local councils were having using OS data. He wondered whether something similar to Freedom of Information might help scientists to access data, and concluded with a plug for the Open Knowledge Foundation.
Richard Kidd of the Royal Society of Chemistry alluded to Stephen Arnold's Three Curves of Despair. Taxonomies abound in chemistry, and syntax and text mining are problems he and others have been wrestling with, using the same OSCAR3 tool from the SciBorg project. He also cited Borges' classification of animals, from his essay The Analytical Language of John Wilkins, which describes an encyclopaedia, the Celestial Emporium of Benevolent Knowledge, in which it is written that animals are divided into:
- those that belong to the Emperor,
- embalmed ones,
- those that are trained,
- suckling pigs,
- mermaids,
- fabulous ones,
- stray dogs,
- those included in the present classification,
- those that tremble as if they were mad,
- innumerable ones,
- those drawn with a very fine camelhair brush,
- others,
- those that have just broken a flower vase,
- those that from a long way off look like flies.
We need standards for repositories, he argued.
Finally, Alastair Dunning discussed how the Lancaster Newsbooks project is a good example of enriched digitisation, with the place names in the newsbooks mapped to a gazetteer and then superimposed on Google Earth. The project has also developed a thesaurus and uses semantic tagging to make the text machine-readable. He spoke about the Million Books Challenge, an international multi-agency competition to explore innovative uses of large historical corpora, due to start in January 2009.
Update: the presentations are now available as podcasts and in slide format.