When it rains it pours.
After the exciting news that Google funded my application to their digital humanities program, I found out this week that they will also fund another project of mine (full list): Defining and Solving Key Challenges in Microblog Search. The research will focus largely on helping people find and make sense of information that comes across Twitter.
Over the next year the project will support me and two Ph.D. students as we address (and propose some responses to) questions such as:
- What are meaningful units of retrieval for IR over microblog data?
- What types of information needs do people bring to microblogging environments, and how can we support them? Is there a place for ad hoc IR in this space? If not (or even if so), what might constitute a ‘query’ in microblog IR?
- What criteria should we use to help people find useful information in microblog collections? Time surely plays a role here. Topical relevance is a likely suspect, as are various types of reputation factors such as TunkRank (and here).
- How does microblog IR relate to more established IR problems such as blog search, expert finding, and other entity search issues?
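As a concrete aside on those reputation factors: TunkRank scores a user's influence recursively through his or her followers. A minimal sketch, assuming a toy follower graph and treating the retweet probability p and the iteration count as free parameters (the graph and all values here are purely illustrative):

```python
# Illustrative TunkRank-style influence: a user's influence is the sum,
# over each follower F, of (1 + p * influence(F)) / (number of accounts
# F follows). Computed here by simple fixed-point iteration.
def tunkrank(followers, friends_count, p=0.5, iterations=50):
    influence = {u: 1.0 for u in followers}  # uniform starting scores
    for _ in range(iterations):
        influence = {
            u: sum((1 + p * influence[f]) / friends_count[f] for f in fs)
            for u, fs in followers.items()
        }
    return influence

# Toy graph: b and c each follow only a.
followers = {"a": ["b", "c"], "b": [], "c": []}  # user -> list of followers
friends_count = {"a": 0, "b": 1, "c": 1}         # user -> accounts they follow
scores = tunkrank(followers, friends_count)
```

The appeal of a measure like this over raw follower counts is that it discounts followers who follow everyone and rewards attention from selective, influential accounts.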
For me, one of the most interesting issues at work in microblog IR is: how can we aggregate (and then retrieve) data in order to create information that is useful once collected but that might be uninteresting on its own?
Is it useful to retrieve an individual tweet that shares keywords with an ad hoc query? Maybe. But it seems more likely that people might seek debates, consensus, emerging sub-topics, or communities of experts with respect to a given topic. These are just a few of the aggregates that leap to mind. I’m sure readers can think of others. And I’m sure readers can think of other tasks that can help move microblog IR forward.
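To make the aggregation idea concrete, here is a deliberately crude sketch, assuming tweets are plain strings and using shared hashtags as a stand-in for the richer aggregates mentioned above (debates, sub-topics, expert communities); the grouping and scoring choices are illustrative, not a proposal:

```python
# Hypothetical sketch: instead of ranking individual tweets against a
# query, group tweets that share a hashtag and rank the groups by their
# aggregate overlap with the query terms.
from collections import defaultdict

def rank_hashtag_groups(tweets, query_terms):
    groups = defaultdict(list)
    for text in tweets:
        for tok in text.lower().split():
            if tok.startswith("#"):
                groups[tok].append(text)
    # score a group by total query-term occurrences across its tweets
    def score(group):
        return sum(term in text.lower() for text in group for term in query_terms)
    return sorted(groups.items(), key=lambda kv: score(kv[1]), reverse=True)

tweets = [
    "great debate on #ai policy",
    "new results in #ai search",
    "lunch was good",
]
ranked = rank_hashtag_groups(tweets, ["debate", "search"])
```

A single tweet here matches the query only weakly, but the group as a whole surfaces a conversation — which is the point of retrieving aggregates.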
In case anyone wonders how this project relates to the other work of mine that Google funded (which treats retrieval over historically diverse texts in Google Books data), the short answer is that both projects concern IR in situations where change over time is a critical factor, a topic similar to what I addressed in a recent JASIST paper.
I’m excited to say that the kind folks at Google have funded my proposal to the Google Digital Humanities Awards Program.
The title of the project is Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques, and as the name implies, the work will tackle language drift in IR, but language drift of a particularly nettlesome order.
Specifically, the work will focus on Google Books data, where a great deal of text may indeed be written in a given language (I’m focusing on English) but may also have been written in the 18th Century, the 17th Century, etc. Anyone who has studied Shakespeare or Chaucer (or, less poetically, Robert Burton or Thomas Hobbes) can remind you that a wide gulf separates our vernacular from the English of the medieval period, the Renaissance, the 17th Century, and so on.
The project aims to support search across historically diverse corpora. But more generally, the work will address the question: how can we model language (empirically and statistically) in an environment where texts treating similar topics often use dissimilar lexical and even syntactic features?
As a sample use case, we may consider a scholar whose research treats historical issues related to literacy and the rise of the merchant class in Europe. During his research, this scholar might find the following passage. In the introduction to his 1623 English Dictionarie, Henry Cockeram addresses his intended audience:
Ladies and Gentlewomen, young Schollers, Clark Merchantz, and also Strangers of any Nation.
Finding this passage of interest, the researcher submits it to a retrieval system such as Google Books in hopes of finding similar texts (i.e. this is a type of query by example). To meet this challenge, the search system must overcome the idiosyncrasies of 17th Century written English. To find results in more recent English, the system must induce a model of the information need that accounts for the probability that schollers is a translation of scholars, that clark merchantz is a translation of statute merchants, etc. Similar processes need to happen for retrieval over earlier English texts.
My angle on this problem is that we can approach it as a type of machine translation, where the English of period X comprises a source language and period Y’s English is a target language. The nice thing about the IR setting is that I think we don’t need to find a single “correct” translation of a query. Instead, the work will generate an expanded model of the information need in the target language.
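A minimal sketch of that expanded-model idea, assuming we already have a toy translation table t(modern word | historical word); in practice such a table would have to be learned, e.g. from comparable corpora across periods, and the probabilities below are purely illustrative:

```python
from collections import defaultdict

def expand_query(query_terms, translation_table):
    """p(w | Q) = sum over q in Q of t(w | q) * p(q | Q), with uniform p(q | Q)."""
    p_q = 1.0 / len(query_terms)
    expanded = defaultdict(float)
    for q in query_terms:
        # historical terms missing from the table fall back to translating
        # as themselves with probability 1
        for w, t in translation_table.get(q, {q: 1.0}).items():
            expanded[w] += t * p_q
    return dict(expanded)

# Hypothetical table entry: spread schollers's mass over modern candidates.
table = {"schollers": {"scholars": 0.9, "students": 0.1}}
model = expand_query(["schollers", "merchantz"], table)
```

Because the output is a distribution over target-period vocabulary rather than a single rewritten query, no one translation has to be "correct" — the retrieval model can simply weight documents by all the plausible renderings at once.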
This problem is interesting in its own right. But I’m especially excited to take it on because it joins a growing body of work on temporal issues in IR (see my own contributions here or here, as well as excellent work by Susan Dumais with Jon Elsas here, and with Jamie Teevan, Eytan Adar, Dan Leibling, and Richard Hughes here, here and here; this list isn’t exhaustive, and omits obvious links to the topic detection and tracking literature).
More to come as the project matures.