Language change and IR

I’m excited to say that the kind folks at Google have funded my proposal to the Google Digital Humanities Awards Program.

The title of the project is “Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques,” and as the name implies, the work will tackle language drift in IR, but language drift on a nettlesome order.

Specifically, the work will focus on Google Books data, where a great deal of text may indeed be written in a given language (I’m focusing on English), but it may also have been written in the 18th Century, the 17th Century, etc. Anyone who has studied Shakespeare or Chaucer (or less poetically, Robert Burton or Thomas Hobbes) can remind you that a wide gulf separates our vernacular from the English of the medieval period, the Renaissance, the 17th Century, etc.

The project aims to support search across historically diverse corpora.  But more generally, the work will address the question: how can we model language (empirically and statistically) in an environment where texts treating similar topics often use dissimilar lexical and even syntactic features?

As a sample use case, we may consider a scholar whose research treats historical issues related to literacy and the rise of the merchant class in Europe.  During his research, this scholar might find the following passage.  In the introduction to his 1623 English Dictionarie, Henry Cockeram addresses his intended audience:

Ladies and Gentlewomen, young Schollers, Clark Merchantz, and also Strangers of any Nation.

Finding this passage of interest, the researcher submits it to a retrieval system such as Google Books in hopes of finding similar texts (i.e., this is a type of query by example). To meet this challenge, the search system must overcome the idiosyncrasies of 17th Century written English. To find results in more recent English, the search system must induce a model of the information need that accounts for the probability that schollers is a translation of scholars, that clark merchantz is a translation of statute merchants, etc. Similar processes need to happen for retrieval over earlier English texts.

My angle on this problem is that we can approach it as a type of machine translation, where the English of period X serves as the source language and the English of period Y as the target language. The nice thing about the IR setting is that I think we don’t need to find a single “correct” translation of a query. Instead, the work will generate an expanded model of the information need in the target language.
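To make the idea of an expanded model concrete, here is a minimal sketch in Python. It is not the project's method, just one simple way to read the paragraph above: each query term spreads its weight over candidate target-period terms according to a translation probability. The table `TRANSLATION`, its probabilities, and the function `expand_query` are all hypothetical and invented for illustration; nothing here is estimated from real data.

```python
from collections import Counter, defaultdict

# Hypothetical translation table: P(target-period term | source-period term).
# The numbers are made up for illustration, not estimated from any corpus.
TRANSLATION = {
    "schollers": {"scholars": 0.9, "schools": 0.1},
    "merchantz": {"merchants": 1.0},
    "gentlewomen": {"gentlewomen": 0.6, "ladies": 0.4},
}


def expand_query(query_terms, translation=TRANSLATION):
    """Return a distribution over target-period terms for a source-period query.

    Computes P(w | Q) = sum over q of P(w | q) * P(q | Q), where P(q | Q) is
    the relative frequency of q in the query. Terms missing from the table
    are assumed (for this sketch) to translate to themselves.
    """
    counts = Counter(query_terms)
    total = sum(counts.values())
    model = defaultdict(float)
    for term, n in counts.items():
        p_term = n / total
        for target, p in translation.get(term, {term: 1.0}).items():
            model[target] += p_term * p
    return dict(model)


query = "ladies and gentlewomen young schollers".split()
model = expand_query(query)

# Show the expanded model, highest-weight terms first.
for word, weight in sorted(model.items(), key=lambda kv: -kv[1]):
    print(f"{word}\t{weight:.3f}")
```

The point of the sketch is that no single translation is chosen: schollers contributes weight to both scholars and schools, and a retrieval system could score documents against the whole distribution.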

This problem has merits of its own. But I’m especially excited to take it on because it comprises part of an interesting body of work on temporal issues in IR (see my own contribution here or here, as well as excellent work by Susan Dumais with Jon Elsas here, and with Jaime Teevan, Eytan Adar, Dan Liebling, and Richard Hughes here, here and here; this list isn’t exhaustive, omitting obvious links to the topic detection and tracking literature).

More to come as the project matures.


7 Comments on “Language change and IR”

  1. I noticed you’d been awarded a grant on the Google blog the other day. Congratulations! Sounds a very interesting project; please keep us up to date with your progress.

  2. […] Efron wrote about a research project he is starting on statistical processing of 17th and 18th century English texts with the goal of […]

  3. […] the exciting news that Google funded my application to their digital humanities program, I found out this week that they will also fund another project […]

  4. […] recent XKCD cartoon made me think of Miles Efron’s project on language change and information retrieval. Perhaps the tools he develops will help scholars in the future parse what we declaim to our […]

  5. Interesting to see that more and more research groups are also developing methods for retrieval of historic documents!

  6. interesting project for sure
