Language change and IR

I’m excited to say that the kind folks at Google have funded my proposal to the Google Digital Humanities Awards Program.

The title of the project is “Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques,” and as the name implies, the work will tackle language drift in IR, but language drift on a nettlesome order.

Specifically, the work will focus on Google Books data, where a great deal of text may indeed be written in a given language (I’m focusing on English) but may also have been written in the 18th Century, the 17th Century, etc. Anyone who has studied Shakespeare or Chaucer (or, less poetically, Robert Burton or Thomas Hobbes) can remind you that a wide gulf separates our vernacular from the English of the medieval period, the Renaissance, the 17th Century, etc.

The project aims to support search across historically diverse corpora.  But more generally, the work will address the question: how can we model language (empirically and statistically) in an environment where texts treating similar topics often use dissimilar lexical and even syntactic features?

As a sample use case, we may consider a scholar whose research treats historical issues related to literacy and the rise of the merchant class in Europe.  During his research, this scholar might find the following passage.  In the introduction to his 1623 English Dictionarie, Henry Cockeram addresses his intended audience:

Ladies and Gentlewomen, young Schollers, Clark Merchantz, and also Strangers of any Nation.

Finding this passage of interest, the researcher submits it to a retrieval system such as Google Books in hopes of finding similar texts (i.e., this is a type of query by example). To meet this challenge, the search system must overcome the idiosyncrasies of 17th Century written English. To find results in more recent English, the search system must induce a model of the information need that accounts for the probability that “schollers” is a translation of “scholars,” that “clark merchantz” is a translation of “statute merchants,” and so on. Similar processes need to happen for retrieval over earlier English texts.

My angle on this problem is that we can approach it as a type of machine translation, where the English of period X serves as the source language and the English of period Y as the target language. The nice thing about the IR setting is that I think we don’t need to find a single “correct” translation of a query. Instead, the work will generate an expanded model of the information need in the target language.
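To make the idea of an “expanded model” a bit more concrete, here is a minimal sketch of translation-based query expansion. It is only an illustration under stated assumptions, not the project’s actual method: the translation table and all of its probabilities are invented, the function name `expand_query` is hypothetical, and the query is assumed to be already segmented into units such as “clark merchantz.”

```python
from collections import defaultdict

# Hypothetical translation table: P(target-period unit | source-period unit).
# All probabilities are invented for illustration, not estimated from data.
TRANSLATION = {
    "schollers": {"scholars": 0.9, "schoolers": 0.1},
    "clark merchantz": {"statute merchants": 0.7, "merchants": 0.3},
}


def expand_query(query_units, translation=TRANSLATION):
    """Map a source-period query to a distribution over target-period units."""
    # Maximum-likelihood query model: each unit gets weight 1 / |query|.
    weight = 1.0 / len(query_units)
    expanded = defaultdict(float)
    for source in query_units:
        # Units missing from the table are assumed to translate to themselves.
        candidates = translation.get(source, {source: 1.0})
        for target, prob in candidates.items():
            # P(target | query) = sum over sources of P(target | source) * P(source | query)
            expanded[target] += weight * prob
    return dict(expanded)


model = expand_query(["young", "schollers", "clark merchantz"])
for unit, prob in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{unit}\t{prob:.3f}")
```

Notice that no single translation is ever chosen. Every candidate keeps a share of the probability mass, and the resulting weighted bag of terms is what would be handed to the retrieval model.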

This problem has native merits. But I’m especially excited to take it on because it comprises part of an interesting body of work on temporal issues in IR (see my own contribution here or here, as well as excellent work by Susan Dumais with Jon Elsas here, and with Jaime Teevan, Eytan Adar, Dan Liebling, and Richard Hughes here, here and here; this list isn’t exhaustive, omitting obvious links to the topic detection and tracking literature).

More to come as the project matures.

Meaningful text analysis

Last night I had dinner with a group of visiting scholars from Germany who are part of the TextGrid project. TextGrid is an effort to bring grid computing to bear on digital humanities research. We spent the evening talking not so much about grid technologies as about humanities computing in general. The conversation also focused on the MONK project, with which our gracious host John Unsworth is closely involved.

The thrust of our discussion lay in what computing does, can, should, and cannot offer to the study of humanistic data.

The interesting question is, what should humanities computing be?

Kirsten Uszkalo was especially keen on the application of sentiment analysis to the work she does on early modern English literature.  But I wonder whether the already-difficult problem of identifying, say, positive and negative product reviews isn’t qualitatively different in the context of 16th Century popular lit.

Consider one example that we discussed: reports of demonic possession. It struck me that a humanist is unlikely to be compelled by a classifier that achieves n% accuracy in separating medical and theological treatments of possession. Instead, the interesting question in this case lies in identifying the textual features that would enable such a classifier. That is, what aspects of a text (vocabulary, physical dimensions, print type, etc.) speak to a meaningful difference in discourses?
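As a rough illustration of what “identifying the features” could mean for vocabulary alone, here is a small sketch that ranks words by a smoothed log-ratio of their relative frequencies in two classes of text. Everything in it is an assumption made for the example: the snippets are invented stand-ins rather than real historical texts, and `log_ratio` is a hypothetical helper, not part of any toolkit discussed at the dinner.

```python
import math
from collections import Counter

# Invented toy snippets standing in for the two discourses (not real sources).
medical = [
    "the patient suffered fits and a fever of the brain",
    "physicians judged the fits a disease of the humours",
]
theological = [
    "the devil entered her and prayer cast the spirit out",
    "ministers judged the fits a work of the devil",
]


def term_counts(docs):
    """Count whitespace-separated tokens across a list of documents."""
    return Counter(word for doc in docs for word in doc.split())


def log_ratio(counts_a, counts_b, alpha=1.0):
    """Add-alpha smoothed log of P(word | A) / P(word | B) for every word."""
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    return {
        w: math.log((counts_a[w] + alpha) / total_a)
        - math.log((counts_b[w] + alpha) / total_b)
        for w in vocab
    }


scores = log_ratio(term_counts(medical), term_counts(theological))
ranked = sorted(scores, key=scores.get)
print("leaning theological:", ranked[:3])
print("leaning medical:", ranked[-3:])
```

The ranked list, rather than any accuracy figure, is the part a humanist might actually want to read and argue with.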

I came away from the dinner wondering where the problem of feature creation, selection, reduction, etc. fits into humanities computing.  To what extent is feature selection a computing problem at all?  Maybe the features that would inform a classifier are the aim of the humanist in the first place.