Ranking recent information

I was happy to get the news that our paper, Estimation methods for ranking recent information, (co-authored with Gene Golovchinsky) was accepted for presentation at SIGIR 2011.  I’ll let the paper speak for itself.  But to prod people to read it, here are some of the motivations and findings.

Often a query expresses an information need where recency is a a crucial dimension of relevance.  The goal of the paper was to formulate approaches to incorporating time into ad hoc IR when we have evidence that this is the case.  For example, a web query like  champaign tornado had a strong temporal dimension during our crazy weather a few nights back. This is in contrast to a query such as steampunk photo.  Though so-called recency queries show up in many IR domains, they are especially important in the context of microblog search as discussed nicely here.

Of course handling recency queries (and other time-sensitive queries) is a well-studied problem.  Articles in this area are too numerous to name here.  But one of the canonical approaches was formulated by Li and Croft.  In their work, Li and Croft use time to inform a document prior in the standard query likelihood model:

Pr(d | q) \propto Pr(q | d) Pr(d | t_d)

where for a time-stamped document d, Pr(d | t_d) follows an exponential distribution with rate parameter λ–newer documents have a higher prior probability.  This approach is elegant, and it has been shown to work well for recency queries.

The problem, however, is how well such an approach works if we apply it to queries that aren’t concerned with recency.  We found that using a time-based prior as shown above leads to decreased effectiveness on non-recency queries (which isn’t surprising).  Of course we could mitigate this by classifying queries with respect to their temporal concerns.  However, this strikes me as a hard problem.

Instead, we propose several methods of incorporating recency into retrieval  that allow time to influence ranking, but that degrade gracefully if the query shows little evidence of temporal concern.  Additionally, the approaches we outline show less sensitivity to parameterization than we see in previous work

The paper introduces a number of strategies.  But the one that I find most interesting uses time to guide smoothing in the language modeling framework.  To keep things simple, we limit analysis to Jelinek-Mercer smoothing, such that the smoothed estimate of the probability of a word w given a document d and the collection C is given by

\hat{P}r(w|d) = (1-lambda_t)\hat{P}r_{ML}(w|d) + \lambda_t \hat{P}r(w|C)

Where λ_t is a smoothing parameter that is estimated based on the “age” of the document.  The intuition is that we might plausibly trust the word frequencies that drive a document model’s maximum likelihood estimator less for older documents than we do for recent documents insofar as an author might choose to phrase things differently were he to re-write an old text today.

The main work of the paper lies in establishing methods of parameterizing models for promoting recency in retrieved documents.  Whether we’re looking at the rate parameter of an exponential distribution, the parameter for JM smoothing, or the mixture parameter for combining query terms with expansion terms, we take the view that we’re dealing with an estimation problem, and we propose treating the problem by finding the maximum a posteriori estimate based on temporal characteristics.

Dealing with recency queries comprises only a slice of the more general matter of time-sensitive retrieval.  A great deal of recent work has shown (and continues to show) the complexity of dealing with time in IR, as well as ingenuity in the face of this complexity.  It’s exciting to have a seat at this table.