Ranking recent information

I was happy to get the news that our paper, "Estimation methods for ranking recent information" (co-authored with Gene Golovchinsky), was accepted for presentation at SIGIR 2011.  I'll let the paper speak for itself.  But to prod people to read it, here are some of its motivations and findings.

Often a query expresses an information need where recency is a crucial dimension of relevance.  The goal of the paper was to formulate approaches to incorporating time into ad hoc IR when we have evidence that this is the case.  For example, a web query like champaign tornado had a strong temporal dimension during our crazy weather a few nights back.  This is in contrast to a query such as steampunk photo.  Though so-called recency queries show up in many IR domains, they are especially important in the context of microblog search, as discussed nicely here.

Of course handling recency queries (and other time-sensitive queries) is a well-studied problem.  Articles in this area are too numerous to name here.  But one of the canonical approaches was formulated by Li and Croft.  In their work, Li and Croft use time to inform a document prior in the standard query likelihood model:

Pr(d | q) \propto Pr(q | d) Pr(d | t_d)

where for a time-stamped document d, Pr(d | t_d) follows an exponential distribution with rate parameter λ, so that newer documents have a higher prior probability.  This approach is elegant, and it has been shown to work well for recency queries.
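As a minimal sketch of this kind of prior, the exponential density over document age can be folded into a rank-equivalent log score.  The function names and the rate value below are illustrative choices, not taken from Li and Croft's paper:

```python
import math

def recency_log_prior(doc_age_days, rate=0.01):
    """Log of an exponential density over document age.
    Newer documents (smaller age) get a higher log prior.
    The rate value here is purely illustrative."""
    return math.log(rate) - rate * doc_age_days

def score(query_log_likelihood, doc_age_days, rate=0.01):
    """Rank-equivalent log score: log Pr(q|d) + log Pr(d|t_d)."""
    return query_log_likelihood + recency_log_prior(doc_age_days, rate)

# With equal query likelihood, the newer document outranks the older one.
newer = score(-12.0, doc_age_days=2)
older = score(-12.0, doc_age_days=200)
```

Because the prior is monotone in age, two documents with identical query likelihood are ordered strictly by recency.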

The question, however, is what happens if we apply such an approach to queries that aren't concerned with recency.  We found that using a time-based prior as shown above decreases effectiveness on non-recency queries (which isn't surprising).  Of course we could mitigate this by classifying queries with respect to their temporal concerns.  However, this strikes me as a hard problem.

Instead, we propose several methods of incorporating recency into retrieval that allow time to influence ranking, but that degrade gracefully if the query shows little evidence of temporal concern.  Additionally, the approaches we outline are less sensitive to parameterization than those in previous work.

The paper introduces a number of strategies.  But the one that I find most interesting uses time to guide smoothing in the language modeling framework.  To keep things simple, we limit analysis to Jelinek-Mercer smoothing, such that the smoothed estimate of the probability of a word w given a document d and the collection C is given by

\hat{Pr}(w|d) = (1-\lambda_t)\hat{Pr}_{ML}(w|d) + \lambda_t \hat{Pr}(w|C)

where λ_t is a smoothing parameter estimated from the “age” of the document.  The intuition is that we might plausibly trust the word frequencies that drive a document model’s maximum likelihood estimator less for older documents than for recent ones, insofar as an author might choose to phrase things differently were he to re-write an old text today.
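A small sketch of this idea in Python.  The particular mapping from document age to λ_t below (a saturating exponential with made-up constants) is my own illustration, not the estimator from the paper:

```python
import math
from collections import Counter

def age_based_lambda(doc_age_days, lam_min=0.1, lam_max=0.9, scale=365.0):
    """Map document age to a JM smoothing weight: older documents lean
    more heavily on the collection model. The functional form and the
    constants are assumptions for illustration only."""
    return lam_min + (lam_max - lam_min) * (1.0 - math.exp(-doc_age_days / scale))

def jm_prob(word, doc_tokens, coll_tokens, doc_age_days):
    """Time-aware Jelinek-Mercer estimate:
    (1 - lambda_t) * P_ML(w|d) + lambda_t * P(w|C)."""
    lam = age_based_lambda(doc_age_days)
    p_ml = Counter(doc_tokens)[word] / len(doc_tokens)
    p_coll = Counter(coll_tokens)[word] / len(coll_tokens)
    return (1.0 - lam) * p_ml + lam * p_coll
```

For a fixed word, an older document's estimate is pulled further toward the collection probability, which is exactly the "trust old word frequencies less" intuition.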

The main work of the paper lies in establishing methods of parameterizing models for promoting recency in retrieved documents.  Whether we’re looking at the rate parameter of an exponential distribution, the parameter for JM smoothing, or the mixture parameter for combining query terms with expansion terms, we take the view that we’re dealing with an estimation problem, and we propose treating the problem by finding the maximum a posteriori estimate based on temporal characteristics.
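To make the MAP framing concrete with a generic, self-contained example (not necessarily the paper's exact construction): if observed document ages are modeled with an exponential likelihood and we place a conjugate Gamma(α, β) prior on the rate, the MAP estimate has a simple closed form.

```python
def map_exponential_rate(ages, alpha=2.0, beta=1.0):
    """MAP estimate of an exponential rate parameter under a
    Gamma(alpha, beta) prior:
        lambda_hat = (alpha + n - 1) / (beta + sum(ages)),
    valid when alpha + n > 1. The hyperparameter values here are
    illustrative, not from the paper."""
    n = len(ages)
    assert alpha + n > 1, "MAP undefined for alpha + n <= 1"
    return (alpha + n - 1) / (beta + sum(ages))
```

Under this scheme, a query whose (pseudo-)relevant documents are mostly recent yields a larger rate, and hence a sharper recency preference, than one whose documents are spread over time.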

Dealing with recency queries comprises only a slice of the more general matter of time-sensitive retrieval.  A great deal of recent work has shown (and continues to show) the complexity of dealing with time in IR, as well as ingenuity in the face of this complexity.  It’s exciting to have a seat at this table.

Snowball sampling for Twitter Research

By way of shameless promotion, I am currently encouraging people to help me evaluate an experimental IR system that searches microblog (Twitter) data.  To participate, please see:


Please consider giving it a moment…even a brief moment.

Now, onto a more substantive matter: I’ve been wrestling with the validity of testing an IR system (particularly a microblog IR system) using a so-called snowball sampling technique.  For the uninitiated, snowball sampling involves recruiting a small number of people to participate in a study with the explicit aim that they will, in turn, encourage others to participate.  The hope is that participation in the study will extend beyond the narrow initial sample as subjects recruit new participants.

Snowball sampling has clear drawbacks.  Most obviously, it is ripe for introducing bias into one’s analysis.  The initial “seed” participants will drive the demographics of subsequent recruits, and this effect could amplify any initial bias.  The non-random (assuming it is non-random) selection of initial participants, and their non-random selection of recruits, calls into question the application of standard inferential statistics at the end of the study.  What status does a confidence interval on, say, user satisfaction derived from a snowball sample have with respect to the level of user satisfaction in the population?

However, snowball sampling has its merits, too.  Among these is the possibility of obtaining a reasonable number of participants in the absence of a tractable method for random sampling.

In my case, I have decided that a snowball sample for this study is worth the risks it entails.  In order to avoid poisoning my results, I’ll keep description of the project to a minimum.

But I feel comfortable saying that my method of recruiting includes dissemination of a call for participation in several venues:

  • Via a Twitter post with a call for readers to retweet it.
  • Via this blog post!
  • By email to two mailing lists (one a student list, and the other a list of Twitter researchers).

In this case, the value of a snowball sample extends beyond simply acquiring a large N.  Because Twitter users are connected by Twitter’s native subscription model, the fact that my sample will draw many users who are “close” to my social network is not, I think, a liability.  Instead it will, I hope, lend a level of realism to the picture of how a particular sub-community functions.

One problem with these rose-colored lenses is that I have no way to characterize this sub-community formally.  Inferences drawn from this sample may generalize to some group.  But what group is that?

Obviously some of the validity of this sample will have to do with the nature of the data collected and the research questions to be posed against it, neither of which I’m comfortable discussing yet.  But I would be interested to hear what readers think: does snowball sampling have particular merits (or liabilities) for research on systems that inherently rely on social connections, merits that would not apply to situations lacking explicit social linkage?

Research award for microblog search

When it rains it pours.

After the exciting news that Google funded my application to their digital humanities program, I found out this week that they will also fund another project of mine (full list): Defining and Solving Key Challenges in Microblog Search.  The research will focus largely on helping people find and make sense of information that comes across Twitter.

Over the next year the project will support me and two Ph.D. students as we address (and propose some responses to) questions such as:

  • What are meaningful units of retrieval for IR over microblog data?
  • What types of information needs do people bring to microblogging environments and how can we support them?  Is there a place for ad hoc IR in this space?  If not (or even if so) what might constitute a ‘query’ in microblog IR?
  • What criteria should we use to help people find useful information in microblog collections?  Surely time plays a role here.  Topical relevance is a likely suspect, as are various types of reputation factors such as TunkRank.
  • How does microblog IR relate to more established IR problems such as blog search, expert finding, and other entity search issues?

This project builds on earlier work that I did with Gene Golovchinsky, as well as research I presented at SIGIR this week.

For me, one of the most interesting issues at work in microblog IR is: how can we aggregate (and then retrieve) data in order to create information that is useful once collected but that might be uninteresting on its own?

Is it useful to retrieve an individual tweet that shares keywords with an ad hoc query?  Maybe.  But it seems more likely that people might seek debates, consensus, emerging sub-topics, or communities of experts with respect to a given topic.  These are just a few of the aggregates that leap to mind.  I’m sure readers can think of others.  And I’m sure readers can think of other tasks that can help move microblog IR forward.

In case anyone wonders how this project relates to the other work of mine that Google funded (which treats retrieval over historically diverse texts in Google Books data), the short answer is that both projects concern IR in situations where change over time is a critical factor, a topic similar to what I addressed in a recent JASIST paper.