Ranking recent information

I was happy to get the news that our paper, "Estimation Methods for Ranking Recent Information" (co-authored with Gene Golovchinsky), was accepted for presentation at SIGIR 2011.  I'll let the paper speak for itself.  But to prod people to read it, here are some of the motivations and findings.

Often a query expresses an information need where recency is a crucial dimension of relevance.  The goal of the paper was to formulate approaches to incorporating time into ad hoc IR when we have evidence that this is the case.  For example, a web query like champaign tornado had a strong temporal dimension during our crazy weather a few nights back.  This is in contrast to a query such as steampunk photo.  Though so-called recency queries show up in many IR domains, they are especially important in the context of microblog search as discussed nicely here.

Of course handling recency queries (and other time-sensitive queries) is a well-studied problem.  Articles in this area are too numerous to name here.  But one of the canonical approaches was formulated by Li and Croft.  In their work, Li and Croft use time to inform a document prior in the standard query likelihood model:

\Pr(d \mid q) \propto \Pr(q \mid d) \, \Pr(d \mid t_d)

where for a time-stamped document d, Pr(d | t_d) follows an exponential distribution with rate parameter λ; newer documents have a higher prior probability.  This approach is elegant, and it has been shown to work well for recency queries.
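As a concrete sketch (not the authors' code), the prior can be folded into log-space ranking like this; the rate value and the toy likelihood scores are assumptions for illustration:

```python
import math

def recency_log_prior(age_days, rate=0.01):
    """Log of an exponential recency prior: log Pr(d | t_d) = log(rate) - rate * age.

    Newer documents (smaller age) receive a higher prior probability."""
    return math.log(rate) - rate * age_days

def recency_score(log_query_likelihood, age_days, rate=0.01):
    """Rank by log Pr(q | d) + log Pr(d | t_d), i.e. the model above in log space."""
    return log_query_likelihood + recency_log_prior(age_days, rate)

# Two documents with identical query likelihoods: the fresher one ranks higher.
fresh = recency_score(-5.0, age_days=2)
stale = recency_score(-5.0, age_days=200)
```

Because the prior is monotone in age, it only reorders documents by freshness when their query likelihoods are close; a sufficiently better textual match can still outrank a newer document.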

The problem, however, is how well such an approach works if we apply it to queries that aren’t concerned with recency.  We found that using a time-based prior as shown above leads to decreased effectiveness on non-recency queries (which isn’t surprising).  Of course we could mitigate this by classifying queries with respect to their temporal concerns.  However, this strikes me as a hard problem.

Instead, we propose several methods of incorporating recency into retrieval that allow time to influence ranking, but that degrade gracefully if the query shows little evidence of temporal concern.  Additionally, the approaches we outline show less sensitivity to parameterization than we see in previous work.

The paper introduces a number of strategies.  But the one that I find most interesting uses time to guide smoothing in the language modeling framework.  To keep things simple, we limit analysis to Jelinek-Mercer smoothing, such that the smoothed estimate of the probability of a word w given a document d and the collection C is given by

\hat{\Pr}(w \mid d) = (1 - \lambda_t) \, \hat{\Pr}_{ML}(w \mid d) + \lambda_t \, \hat{\Pr}(w \mid C)

where λ_t is a smoothing parameter estimated from the "age" of the document.  The intuition is that the word frequencies driving a document model's maximum likelihood estimator are plausibly less trustworthy for older documents than for recent ones, since an author re-writing an old text today might choose to phrase things differently.
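A minimal sketch of this idea follows; note that the mapping from age to λ_t here is a made-up linear ramp for illustration, not the MAP estimate the paper actually proposes:

```python
def lambda_t(age_days, lam_min=0.1, growth=0.002, lam_max=0.9):
    """Map document age to a Jelinek-Mercer smoothing weight: older documents
    lean harder on the collection model. The linear ramp is illustrative only."""
    return min(lam_max, lam_min + growth * age_days)

def smoothed_prob(w, doc_tokens, coll_tokens, age_days):
    """Time-dependent JM estimate of Pr(w | d)."""
    lam = lambda_t(age_days)
    p_ml = doc_tokens.count(w) / len(doc_tokens)      # maximum-likelihood document model
    p_coll = coll_tokens.count(w) / len(coll_tokens)  # collection background model
    return (1 - lam) * p_ml + lam * p_coll

# Toy corpus: the same word in the same document is discounted toward the
# collection model more heavily as the document ages.
doc = ["tornado", "warning", "tornado"]
coll = ["tornado", "weather", "rain", "rain", "storm", "tornado"]
p_new = smoothed_prob("tornado", doc, coll, age_days=1)
p_old = smoothed_prob("tornado", doc, coll, age_days=300)
```

For a term that is more frequent in the document than in the collection, the older copy gets a smaller smoothed probability, which is exactly the graceful discounting of stale evidence described above.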

The main work of the paper lies in establishing methods of parameterizing models for promoting recency in retrieved documents.  Whether we’re looking at the rate parameter of an exponential distribution, the parameter for JM smoothing, or the mixture parameter for combining query terms with expansion terms, we take the view that we’re dealing with an estimation problem, and we propose treating the problem by finding the maximum a posteriori estimate based on temporal characteristics.

Dealing with recency queries comprises only a slice of the more general matter of time-sensitive retrieval.  A great deal of recent work has shown (and continues to show) the complexity of dealing with time in IR, as well as ingenuity in the face of this complexity.  It’s exciting to have a seat at this table.


3 Comments on “Ranking recent information”

  1. […] As Miles wrote yesterday, our paper was accepted to SIGIR 2011. The idea that time has an impact in ranking […]

  2. Bob Carpenter says:

    This is an interesting idea.

    Has anyone explored a related idea of using a time series of language model priors and then estimating a document posterior by using the appropriate model from the series? For instance, we could associate a number of discrete time points with Dirichlet distributions to use as priors for our documents. Then we could fit these given the time-stamped corpus by smoothing over time (technically, taking the course of the Dirichlet parameters to be a random walk). This is related to what Blei et al. did with topic models over time.

    This’d almost certainly give you a better language model around the document, which I suspect would lead to better search for not just recency queries, but for any query that contained temporally varying terminology.

    • Miles Efron says:

      That’s a nice idea, Bob. I don’t know of anything along those lines. I’ve been meaning to look into Blei’s work on time in the context of topic models. But I hadn’t thought of this work as related to that interest.

