Language change and IR
Posted: July 14, 2010
I’m excited to say that the kind folks at Google have funded my proposal to the Google Digital Humanities Awards Program.
The title of the project is "Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques," and as the name implies, the work will tackle language drift in IR, but language drift on a nettlesome order.
Specifically, the work will focus on Google Books data where a great deal of text may indeed be written in a given language (I’m focusing on English), but it may also have been written in the 18th Century, the 17th Century, etc. Anyone who has studied Shakespeare or Chaucer (or less poetically, Robert Burton or Thomas Hobbes) can remind you that a wide gulf separates our vernacular from the English of the medieval period, the Renaissance, the 17th Century, etc.
The project aims to support search across historically diverse corpora. But more generally, the work will address the question: how can we model language (empirically and statistically) in an environment where texts treating similar topics often use dissimilar lexical and even syntactic features?
As a sample use case, we may consider a scholar whose research treats historical issues related to literacy and the rise of the merchant class in Europe. During his research, this scholar might find the following passage. In the introduction to his 1623 English Dictionarie, Henry Cockeram addresses his intended audience:
Ladies and Gentlewomen, young Schollers, Clark Merchantz, and also Strangers of any Nation.
Finding this passage of interest, the researcher submits it to a retrieval system such as Google Books in hopes of finding similar texts (i.e. this is a type of query by example). To meet this challenge, the search system must overcome the idiosyncrasies of 17th Century written English. To find results in more recent English, the search system must induce a model of the information need that accounts for the probability that schollers is a translation of scholars and that clark merchantz is a translation of statute merchants, etc. Similar processes need to happen for retrieval over earlier English texts.
My angle on this problem is that we can approach it as a type of machine translation, where the English of period X comprises a source language and period Y’s English is a target language. The nice thing about the IR setting is that I think we don’t need to find a single “correct” translation of a query. Instead, the work will generate an expanded model of the information need in the target language.
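To make the idea of an "expanded model of the information need" concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual method: the translation table, its probabilities, and the function name `expand_query` are all hypothetical. It assumes we already have estimates of P(target term | source term) and mixes them into one distribution over target-period terms, rather than picking a single best translation.

```python
from collections import Counter, defaultdict

# Hypothetical translation table: P(modern term | historical term).
# These probabilities are invented for illustration; a real system
# would have to estimate them from data.
TRANSLATION_TABLE = {
    "schollers": {"scholars": 0.8, "students": 0.2},
    "merchantz": {"merchants": 0.9, "traders": 0.1},
}


def expand_query(query_terms, table):
    """Return an expanded query model P(w | Q) over target-period terms.

    P(w | Q) = sum over q in Q of P(w | q) * P(q | Q), where P(q | Q)
    is the relative frequency of q in the query. A term missing from
    the table is assumed to translate to itself with probability 1.
    """
    counts = Counter(query_terms)
    total = sum(counts.values())
    model = defaultdict(float)
    for term, count in counts.items():
        p_term = count / total
        translations = table.get(term, {term: 1.0})
        for target, p_translation in translations.items():
            model[target] += p_term * p_translation
    return dict(model)


query = ["ladies", "schollers", "merchantz", "strangers"]
expanded = expand_query(query, TRANSLATION_TABLE)
for word, prob in sorted(expanded.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{prob:.3f}")
```

The output is a probability distribution over modern terms, so a retrieval system could score documents against all plausible translations at once instead of committing to one.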
This problem has native merits. But I’m especially excited to take it on because it comprises part of an interesting body of work on temporal issues in IR (see my own contribution here or here, as well as excellent work by Susan Dumais with Jon Elsas here, and with Jaime Teevan, Eytan Adar, Dan Liebling, and Richard Hughes here, here and here; this list isn’t exhaustive, omitting obvious links to the topic detection and tracking literature).
More to come as the project matures.