I was happy to get the news that our paper, Estimation methods for ranking recent information (co-authored with Gene Golovchinsky), was accepted for presentation at SIGIR 2011. I’ll let the paper speak for itself. But to prod people to read it, here are some of the motivations and findings.
Often a query expresses an information need where recency is a crucial dimension of relevance. The goal of the paper was to formulate approaches to incorporating time into ad hoc IR when we have evidence that this is the case. For example, a web query like champaign tornado had a strong temporal dimension during our crazy weather a few nights back. This is in contrast to a query such as steampunk photo. Though so-called recency queries show up in many IR domains, they are especially important in the context of microblog search as discussed nicely here.
Of course handling recency queries (and other time-sensitive queries) is a well-studied problem. Articles in this area are too numerous to name here. But one of the canonical approaches was formulated by Li and Croft. In their work, Li and Croft use time to inform a document prior in the standard query likelihood model:
where for a time-stamped document d, Pr(d | t_d) follows an exponential distribution with rate parameter λ–newer documents have a higher prior probability. This approach is elegant, and it has been shown to work well for recency queries.
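As a rough sketch of how such a prior enters ranking, assume the rank-equivalent form Pr(d | q) ∝ Pr(q | d) · Pr(d | t_d), with an exponential density over document age. The function names, the use of days as the unit of age, and the rate value are illustrative assumptions, not details taken from Li and Croft.

```python
import math

def recency_prior(age_days: float, rate: float) -> float:
    """Exponential density over document age: rate * exp(-rate * age)."""
    return rate * math.exp(-rate * age_days)

def log_score(log_query_likelihood: float, age_days: float, rate: float) -> float:
    """log Pr(q | d) + log Pr(d | t_d); rank-equivalent to Pr(d | q) under the assumed form."""
    return log_query_likelihood + math.log(rate) - rate * age_days

# Toy documents: (id, log query likelihood, age in days). Values are made up.
docs = [("old", -4.0, 300.0), ("new", -4.2, 2.0)]
ranked = sorted(docs, key=lambda d: log_score(d[1], d[2], rate=0.01), reverse=True)
print([doc_id for doc_id, _, _ in ranked])
```

In this toy case the newer document overtakes the older one even though its query likelihood is slightly lower, which is exactly the behavior that helps on recency queries and hurts on queries with no temporal concern.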
The problem, however, is how well such an approach works if we apply it to queries that aren’t concerned with recency. We found that using a time-based prior as shown above leads to decreased effectiveness on non-recency queries (which isn’t surprising). Of course we could mitigate this by classifying queries with respect to their temporal concerns. However, this strikes me as a hard problem.
Instead, we propose several methods of incorporating recency into retrieval that allow time to influence ranking, but that degrade gracefully if the query shows little evidence of temporal concern. Additionally, the approaches we outline show less sensitivity to parameterization than we see in previous work.
The paper introduces a number of strategies. But the one that I find most interesting uses time to guide smoothing in the language modeling framework. To keep things simple, we limit analysis to Jelinek-Mercer smoothing, such that the smoothed estimate of the probability of a word w given a document d and the collection C is given by
where λ_t is a smoothing parameter that is estimated based on the “age” of the document. The intuition is that we might plausibly trust the word frequencies that drive a document model’s maximum likelihood estimator less for older documents than we do for recent documents, insofar as an author might choose to phrase things differently were he to re-write an old text today.
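To make the intuition concrete, here is a minimal sketch that assumes the common Jelinek-Mercer form Pr(w | d, C) = (1 − λ_t) · Pr_ML(w | d) + λ_t · Pr(w | C), so that a larger λ_t means leaning harder on the collection. The mapping from age to λ_t (`age_to_lambda`, with its bounds and rate) is a hypothetical stand-in for illustration only; it is not the estimator developed in the paper.

```python
import math
from collections import Counter

def age_to_lambda(age_days, lam_min=0.1, lam_max=0.9, rate=0.01):
    """Hypothetical mapping: smoothing weight rises from lam_min toward lam_max with age."""
    return lam_min + (lam_max - lam_min) * (1.0 - math.exp(-rate * age_days))

def jm_prob(word, doc_tokens, collection_counts, collection_size, lam):
    """Assumed JM form: (1 - lam) * Pr_ML(w | d) + lam * Pr(w | C)."""
    p_doc = Counter(doc_tokens)[word] / len(doc_tokens)
    p_col = collection_counts[word] / collection_size
    return (1.0 - lam) * p_doc + lam * p_col

# Toy collection statistics and a toy document; all counts are made up.
collection = Counter({"tornado": 5, "champaign": 5, "photo": 90})
doc = ["champaign", "tornado", "tornado", "photo"]

p_recent = jm_prob("tornado", doc, collection, 100, age_to_lambda(1))
p_old = jm_prob("tornado", doc, collection, 100, age_to_lambda(1000))
print(p_recent, p_old)
```

The same document text yields a lower probability for tornado when the document is old, because more of the estimate is drawn from the collection model.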
The main work of the paper lies in establishing methods of parameterizing models for promoting recency in retrieved documents. Whether we’re looking at the rate parameter of an exponential distribution, the parameter for JM smoothing, or the mixture parameter for combining query terms with expansion terms, we take the view that we’re dealing with an estimation problem, and we propose treating the problem by finding the maximum a posteriori estimate based on temporal characteristics.
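As one illustration of what a maximum a posteriori estimate can look like in this setting, consider the rate parameter of the exponential distribution. If we assume a Gamma(α, β) prior on the rate and treat the ages of some set of documents (say, the top-ranked results for a query) as observations, the posterior is Gamma(α + n, β + Σ ages), and its mode is (α + n − 1) / (β + Σ ages). The prior, its hyperparameters, and the choice of evidence here are my assumptions for the sake of a sketch; the paper’s own estimators are described there.

```python
def map_rate(ages, alpha=2.0, beta=100.0):
    """MAP estimate of an exponential rate under an assumed Gamma(alpha, beta) prior.

    Posterior is Gamma(alpha + n, beta + sum(ages)); its mode is
    (alpha + n - 1) / (beta + sum(ages)), defined when alpha + n >= 1.
    """
    shape = alpha + len(ages)
    inverse_scale = beta + sum(ages)
    if shape < 1.0:
        raise ValueError("posterior mode is undefined for shape < 1")
    return (shape - 1.0) / inverse_scale

# With no evidence the estimate is the prior mode; recent documents pull the rate up.
print(map_rate([]))
print(map_rate([1.0, 2.0, 3.0, 4.0]))
```

With little or no temporal evidence the estimate stays at the prior mode, which is one simple way to get the graceful degradation described above.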
Dealing with recency queries comprises only a slice of the more general matter of time-sensitive retrieval. A great deal of recent work has shown (and continues to show) the complexity of dealing with time in IR, as well as ingenuity in the face of this complexity. It’s exciting to have a seat at this table.
By way of shameless promotion, I am currently encouraging people to help me evaluate an experimental IR system that searches microblog (Twitter) data. To participate, please see:
Please consider giving it a moment…even a brief moment.
Now, onto a more substantive matter: I’ve been wrestling with the validity of testing an IR system (particularly a microblog IR system) using a so-called snowball sampling technique. For the uninitiated, snowball sampling involves recruiting a small number of people to participate in a study with the explicit aim that they will, in turn, encourage others to participate. The hope is that participation in the study will extend beyond the narrow initial sample as subjects recruit new participants.
Snowball sampling has clear drawbacks. Most obviously, it is ripe for introducing bias into one’s analysis. The initial “seed” participants will drive the demographics of subsequent recruits. This effect could amplify any initial bias. The non-random (assuming it is non-random) selection of initial participants, and their non-random selection of recruits, call into question the application of standard inferential statistics at the end of the study. What status does a confidence interval on, say, user satisfaction derived from a snowball sample have with respect to the level of user satisfaction in the population?
However, snowball sampling has its merits, too. Among these is the possibility of obtaining a reasonable number of participants in the absence of a tractable method for random sampling.
In my case, I have decided that a snowball sample for this study is worth the risks it entails. In order to avoid poisoning my results, I’ll keep description of the project to a minimum.
But I feel comfortable saying that my method of recruiting includes dissemination of a call for participation in several venues:
- Via a twitter post with a call for readers to retweet it.
- Via this blog post!
- By email to two mailing lists (one a student list, and the other a list of Twitter researchers).
In this case, the value of a snowball sample extends beyond simply acquiring a large N. Because Twitter users are connected by Twitter’s native subscription model, I don’t think it is a liability that my sample will draw many users who are “close” to my social network. Instead it will, I hope, lend a level of realism to how a particular sub-community functions.
One problem with these rose-colored lenses is that I have no way to characterize this sub-community formally. Inferences drawn from this sample may generalize to some group. But what group is that?
Obviously some of the validity of this sample will have to do with the nature of the data collected and the research questions to be posed against it, neither of which I’m comfortable discussing yet. But I would be interested to hear what readers think: for research on the use of systems that inherently rely on social connections, does snowball sampling have merits or liabilities that do not pertain to situations lacking explicit social linkage?
This morning the New York Times ran an article describing a newly minted approach by researchers at U. of Washington for analyzing the process of protein folding. A more thorough discussion appears in Nature.
To avoid the burdensome statistical modeling that is the norm in the field (and about which I confess to nearly no knowledge), the researchers developed Foldit, a game inviting amateurs to help in this work. Quoting the Times on Foldit:
The game, which was competitive and offered the puzzle-solving qualities of a game like Rubik’s Cube, quickly attracted a dedicated following of thousands of players.
In other words, the researchers crowdsourced the problem by posing the work as a game. Aside from sidestepping heavy computation, the researchers found that Foldit led to a level of accuracy on a par with established methods.
With a lot of current interest in crowdsourcing for IR (see Omar Alonso’s slides from an ECIR tutorial and proceedings from the SIGIR2010 crowdsourcing workshop), the article raises the question (for me): what retrieval work could be approached in this way? Of course this question isn’t new; cf. Google’s image labeler. But it’s still resonant.
Gathering relevance judgments via Mechanical Turk is an obvious place for crowdsourcing to enter IR research. But this type of crowdsourcing is qualitatively different from what we see in Foldit: participants are paid to do work that presumably they otherwise wouldn’t do. Not only is this model limiting–it’s also ripe for people to fudge the process in order to get paid without completing the task appropriately.
Opening research problems to crowds in the form of games strikes me as a way to mitigate the problem of people gaming (sorry for the pun) the system. The approach might also help us expand the scope of problems that can be aided by crowdsourcing. In Foldit, user interaction is abstracted in a way that makes it difficult for people to cheat. And without a work/payment model, there’s little incentive to do the job poorly. Most striking, though: Foldit has broken a tremendously complex problem into sub-problems whose solutions make plausible entertainment.
What IR problems lend themselves to this kind of crowdsourcing? The image labeler is certainly one example, though I personally found it about as fun as waiting in an airport terminal for a delayed flight.
What else could we do in this space?
To start what I hope might become a discussion, I’ll offer a few criteria that I think a compelling crowdsourcing game should meet:
- Instant feedback: the game should give information to the player at all times. A real-time display of a performance-based score might do the trick.
- Abandonment & restarting: players should be able to quit or start a game at any time while still making their participation useful.
- Level of difficulty: obviously the game should be neither too hard nor too easy to be enjoyable. Better yet, let the player choose his or her preferred level of challenge (e.g. work on a larger or smaller part of the problem).
- Manageable chunks of work: Foldit operates by presenting the player with ‘puzzles.’ These are scenarios that involve solving a well-defined problem such as freeing atoms or moving a chain from an unsuitable location to a better spot. Each of these problems is solvable and discrete.
Of course this list is only the sketchiest effort. I’m curious if others have more and better ideas. And of course the real question is how all this can be made to work in IR settings. What problems in IR lend themselves to this kind of solution? If we identify such problems, how do we transform the work into a viable ‘game’ that people would undertake voluntarily and to good effect?
When it rains it pours.
After the exciting news that Google funded my application to their digital humanities program, I found out this week that they will also fund another project of mine (full list): Defining and Solving Key Challenges in Microblog Search. The research will focus largely on helping people find and make sense of information that comes across Twitter.
Over the next year the project will support me and two Ph.D. students as we address (and propose some responses to) questions such as:
- What are meaningful units of retrieval for IR over microblog data?
- What types of information needs do people bring to microblogging environments and how can we support them? Is there a place for ad hoc IR in this space? If not (or even if so) what might constitute a ‘query’ in microblog IR?
- What criteria should we pursue to help people find useful information in microblog collections? Surely time plays a role here. Topical relevance is a likely suspect, as are various types of reputation factors such as TunkRank (and here).
- How does microblog IR relate to more established IR problems such as blog search, expert finding, and other entity search issues?
For me, one of the most interesting issues at work in microblog IR is: how can we aggregate (and then retrieve) data in order to create information that is useful once collected but that might be uninteresting on its own?
Is it useful to retrieve an individual tweet that shares keywords with an ad hoc query? Maybe. But it seems more likely that people might seek debates, consensus, emerging sub-topics, or communities of experts with respect to a given topic. These are just a few of the aggregates that leap to mind. I’m sure readers can think of others. And I’m sure readers can think of other tasks that can help move microblog IR forward.
In case anyone wonders how this project relates to the other work of mine that Google funded (which treats retrieval over historically diverse texts in Google Books data), the short answer is that both projects concern IR in situations where change over time is a critical factor, a topic similar to what I addressed in a recent JASIST paper.
I’m excited to say that the kind folks at Google have funded my proposal to the Google Digital Humanities Awards Program.
The title of the project is Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques, and as the name implies the work will tackle language drift in IR, but language drift on a nettlesome order.
Specifically, the work will focus on Google Books data where a great deal of text may indeed be written in a given language (I’m focusing on English), but it may also have been written in the 18th Century, the 17th Century, etc. Anyone who has studied Shakespeare or Chaucer (or less poetically, Robert Burton or Thomas Hobbes) can remind you that a wide gulf separates our vernacular from English of the medieval period, the Renaissance, the 17th Century, etc.
The project aims to support search across historically diverse corpora. But more generally, the work will address the question: how can we model language (empirically and statistically) in an environment where texts treating similar topics often use dissimilar lexical and even syntactic features?
As a sample use case, we may consider a scholar whose research treats historical issues related to literacy and the rise of the merchant class in Europe. During his research, this scholar might find the following passage. In the introduction to his 1623 English Dictionarie, Henry Cockeram addresses his intended audience:
Ladies and Gentlewomen, young Schollers, Clark Merchantz, and also Strangers of any Nation.
Finding this passage of interest, the researcher submits it to a retrieval system such as Google Books in hopes of finding similar texts (i.e. this is a type of query by example). To meet this challenge, the search system must overcome the idiosyncrasies of 17th Century written English. To find results in more recent English, the search system must induce a model of the information need that accounts for the probability that schollers is a translation of scholars and that clark merchantz is a translation of statute merchants, etc. Similar processes need to happen for retrieval over earlier English texts.
My angle on this problem is that we can approach it as a type of machine translation, where the English of period X comprises a source language and period Y’s English is a target language. The nice thing about the IR setting is that I think we don’t need to find a single “correct” translation of a query. Instead, the work will generate an expanded model of the information need in the target language.
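A minimal sketch of what an expanded model might look like, assuming a word-level translation table and the mixture Pr(w | q) = Σ_s Pr(w | s) · Pr(s | q), where s ranges over source-period query terms. The table, its probabilities, and the function name `expand_query` are invented for illustration, and working word by word is a simplification (phrases such as clark merchantz would need phrase-level entries).

```python
from collections import defaultdict

# Hypothetical translation table Pr(target | source); the numbers are made up.
TRANSLATION = {
    "schollers": {"scholars": 0.8, "students": 0.2},
    "merchantz": {"merchants": 0.9, "traders": 0.1},
}

def expand_query(query_tokens, table):
    """Build Pr(w | q) = sum_s Pr(w | s) * Pr(s | q).

    Each token occurrence contributes 1 / len(query), so Pr(s | q) is the
    relative frequency of s in the query. Terms missing from the table are
    assumed to translate to themselves with probability 1.
    """
    model = defaultdict(float)
    p_source = 1.0 / len(query_tokens)
    for source in query_tokens:
        for target, p_translate in table.get(source, {source: 1.0}).items():
            model[target] += p_translate * p_source
    return dict(model)

model = expand_query(["young", "schollers", "merchantz"], TRANSLATION)
for term, prob in sorted(model.items(), key=lambda item: -item[1]):
    print(f"{term}\t{prob:.3f}")
```

The result is a distribution over target-period terms rather than a single translated string, which is all a language-modeling retrieval system needs.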
This problem has native merits. But I’m especially excited to take it on because it comprises part of an interesting body of work on temporal issues in IR (see my own contribution here or here, as well as excellent work by Susan Dumais with Jon Elsas here, and with Jaime Teevan, Eytan Adar, Dan Liebling, and Richard Hughes here, here and here; this list isn’t exhaustive, omitting obvious links to the topic detection and tracking literature).
More to come as the project matures.
Checking out the recently opened infochimps API reminded me of an issue that has been on my research backburner, but also on my radar for a while. Here’s the question: given the proliferating number of web app APIs available to developers, would it be useful to build a search service that helps people find / explore available APIs? I think it would.
Maybe such a thing already exists; trying to find it with a web search along the lines of api search is unhelpful for obvious reasons. I’d be curious if any readers could offer pointers.
What I’m envisioning is a service that supports some hybrid of search and browsing with the aim of helping developers find APIs that provide access to data, functions, etc. that will help in completing a particular task.
For instance, I’m working on a system that helps people manage the searches they perform over time, across systems, on various devices, etc. My feeling is that people’s experience with IR systems creates information that is often lost. Some previous research of course addresses this issue… but that is for another post.
This is a pretty nebulous idea, and it’s not obvious to me what data and services are available to support its development.
I’d like to have access to a system that helps me explore:
- data sets exposed via publicly available APIs.
- APIs or API functions x, y, and z that are similar to some API function a.
- restrictions / terms of service for possibly useful APIs (e.g. rate limits, attribution, re-ranking of search results).
- API documentation.
- Libraries available for working with APIs.
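To give the wish list above some shape, here is a sketch of what a single catalog entry and a naive "similar APIs" lookup might look like. Everything in it is hypothetical: the `ApiRecord` fields, the API names, and the use of Jaccard overlap on description words as a similarity measure are placeholders, not a proposal for how such a service should actually work.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ApiRecord:
    """One hypothetical unit of retrieval in an API search service."""
    name: str
    description: str
    datasets: List[str] = field(default_factory=list)
    rate_limit_per_hour: Optional[int] = None  # None means unknown
    requires_attribution: bool = False
    docs_url: str = ""
    libraries: List[str] = field(default_factory=list)

def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap between two descriptions."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def similar_apis(target: ApiRecord, catalog: List[ApiRecord], k: int = 3) -> List[ApiRecord]:
    """Rank the other catalog entries by description overlap with the target."""
    others = [record for record in catalog if record.name != target.name]
    others.sort(key=lambda r: jaccard(target.description, r.description), reverse=True)
    return others[:k]

# Made-up catalog entries.
catalog = [
    ApiRecord("geo-a", "geocode street addresses to coordinates", rate_limit_per_hour=1000),
    ApiRecord("geo-b", "reverse geocode coordinates to addresses", requires_attribution=True),
    ApiRecord("news-c", "search news articles by keyword"),
]
print([record.name for record in similar_apis(catalog[0], catalog, k=2)])
```

Even this toy version surfaces the open questions: what the unit of retrieval is (an API, a function, a data set) and which features of that unit are worth indexing.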
Aside from the practical value of such a system, I think API retrieval raises interesting research questions and problem areas. For instance, what kinds of queries (or for that matter what kinds of information needs) can we expect to deal with in this arena? What factors make a particular unit of retrieval relevant? What features of these units offer traction in pursuit of retrieval? What kind of crawling / data acquisition steps do we need to address to move this agenda forward?
I suspect that addressing these problems is as much an HCI problem as it is a core IR issue. Presenting germane information about APIs in a consistent fashion that allows for comparison and coherent exploration strikes me as a tall challenge.
Having returned from CHI and the CHI2010 microblog research workshop, I’m jazzed–new problems to tackle, studies to run. In other words, the conference did just what it should; it gave me ideas for new research projects.
One of these projects is time-sensitive. (I can’t go into detail because doing so will bias the results. More on that later.) As they put it on the Twitter search page, it’s what’s happening right now. More seriously, the questions need to run within a few days of CHI’s end. But the study will involve asking real people a few questions. For a researcher at a university, this means that I must get human subjects approval from my local institutional review board (IRB).
It’s easy to kvetch about IRBs. See the Chronicle of Higher Ed’s piece, Do IRB’s Go Overboard? In fact, I’ve found the IRB officers at my institution to be extremely helpful, so I’m not going to kvetch (thinking of strategic ways of posing IRB applications recently led me to the very interesting IRB Review Blog that offers nuanced, substantive reflections on the subject).
As anyone who has sat through a university’s research ethics training knows, IRBs were created in the wake of several odious and damaging studies. This motivation is clear and impeccable.
But for those of us working in research related to information use, especially in domains such as IR, HCI, and social informatics broadly construed, the risk of damage or exploitation of subjects is often (though not always; privacy issues can be problematic) minimal.
But more interestingly, I think our work challenges the basic model that underpins contemporary research practice in the university.
My point in writing this post is not to argue that we should occupy a rarefied, unsupervised domain. But recently I’ve dealt with several particular matters that suggest that research on information behavior (mostly HCIR work) pushes some matters to the fore that I think will soon be more general. The following is a brief list. I invite elaborations or arguments.
- crowd-sourced studies. Services like Amazon’s Mechanical Turk offer a huge opening for IR research, as an upcoming SIGIR workshop makes clear. What is the status of turkers with respect to human subjects approval? In a future post I’ll describe in detail my own experience shepherding an MTurk-based study through university approval channels.
- search log analysis. This isn’t a new problem with respect to IRB, and it definitely does raise issues of privacy. But I wonder where more broadly informed studies of user behavior fit into this picture. As an example, I was recently given permission to use a set of query logs without human subjects approval. These logs were already in existence; I got them from a third party. However, in a new study I want to collect logs from my own system. Initial interaction with IRB led to the decision that this work must go through the application process. Likewise, clickthrough data raised red flags.
- real-time user studies. As I mentioned above, I’m in a situation where I need to collect information (essentially survey data) from Twitter users now. Until very recently the subject of this “survey” didn’t exist, and it won’t exist in any meaningful sense for long. I anticipate that this issue will be common for me, and perhaps for others.
Again, my point in writing this is not to say that I should have carte blanche to do research outside of normal channels. What I am saying is twofold:
- Research on information interactions is pushing the limits of the current human subjects/IRB model used by most universities. This is evidenced by unpredictable judgments on the status of projects.
- I think the community of researchers in “our” areas would do well to consider strategies for approaching IRB and other institutional hurdles. We don’t want to game the system. But I think the way we describe the work we do has an impact on the status of that work. If current models are going to change, it would be great if we could (by our interactions with relevant officers) influence those changes in a positive way.
Yesterday’s microblogging workshop at CHI2010 was great, as those of you following #CHImb on twitter already know. All of the participants brought interesting ideas–too many to list here. So I’m just going to focus on a few themes/results that relate most closely to IR. I highly recommend browsing the list of accepted papers to see for yourself the many, many interesting contributions.
First, I’ll mention that Gene Golovchinsky did a wonderful job presenting our paper on making sense of twitter search. Gene has posted his slides and some discussion of the workshop. The questions we posed in the paper and the presentation were:
- What information needs do people actually bring to microblog search?
- What should a test collection for conducting research on microblog search look like?
Instead of dwelling on our own contribution, though, I want to offer a recap of some of the work of other people…
I was especially interested in work by several researchers from Xerox PARC.
Michael Bernstein showed a system, eddi, that helps readers who follow many people manage their twitter experience, avoiding information overload via intelligent filtering on several levels. Ed Chi introduced FeedWinnower, another ambitious system for managing twitter information. I was especially interested in Bongwon Suh’s talk. He focused on the role that serendipity plays (or should play) in twitter search. He suggested that search over microblog data (I know, microblog is not equal to twitter) benefits from serendipity. Of course only certain types of serendipity are valuable in this context (he said something to the effect of courting previously unknown relevance).
Another really interesting paper (and an interesting conversation over lunch) came from Alice Oh. The paper focused on using people’s list memberships to induce models of their interests and expertise. I think Alice’s paper speaks to the challenge of finding sources of evidence for information management in microblog environments.
With respect to IR and microblogging, I came away from the workshop with new questions and with a keener edge on questions I already had. Here’s a very abbreviated list of some challenges that researchers in this area face.
information needs: What types of information needs are most germane in this space? Are users interested in known-item search, ad hoc retrieval, recommendations, browsing, something completely new?
unit of retrieval: Of course this goes back to the matter of information needs (as do all of the following points). Certainly the task at hand will sway exactly what it is that systems should show users. But my sense is that some sort of entity search is almost always likely to be of more value than treating an individual tweet as a ‘document.’ That is, search over people, conversations, communities, hashtags, etc. will, I think, lend more value than tweets taken out of context.
data acquisition and evaluation: It’s easy to get lots of twitter data; just latch onto the garden hose and go. In some cases, data from the hose may be perfectly useful for research and development. Do we need or want formal test collections of this type of data? If so, what should they look like? How does obsolescence figure into creating a test collection of de facto ephemeral data? And of course, there’s probably more to ground truth than Mechanical Turk.
objective functions: In the arena of microblog search, what criteria should we use to rank (if we ARE ranking) entities? Certainly twitter’s own search engine sees temporality as paramount. As always, relevance is dicey here–a murky mixture of topicality, usefulness, trustworthiness, timeliness, etc.
Recently I’ve been speaking with several folks (e.g. Megan Winget and Gene Golovchinsky) about how twitter is or might be important with respect to academic conferences. I’ve got some research coming up where I hope to look at this.
But in the meantime, I put this together:
A caveat: having slapped this together quickly, I’m not sure how the site will behave…I hope it is relatively solid.
I put the page up just because it seemed like a natural thing to do given the data that I’ve been collecting (relatively large amounts of twitter-generated info). I’m hoping that it might, even a little bit, encourage the conference attendees to think of twitter as they listen, chat, etc.
Jeff Dalton has a great post up about the New York Times’ recent announcement: the paper has launched data.nytimes.com. Currently the service offers 5k named (i.e. people) subject headings from the NYT news vocabulary. The headings are available as linked open data. More headings are on the way.
Handwringing (e.g. here, here and here), maybe deserved, has been in abundance recently in the arena of print journalism. Finding/maintaining viable business models for high-quality reporting in environments where free information is readily available is a challenge.
I’ve been rooting for NYT in this struggle. In this respect I’m glad to see their release of data. Rather than leaning on the obvious and dubious advertising model or walled gardens, this strikes me as a gambit for a novel approach to attacking the problem of the papers’ value.
Can we (i.e. hackers of textual data) repurpose and add value to the excellent information compiled by the Times’ editors? Is there a viable business model for the Times that could emerge from releasing data, as opposed to closing it? It’s a creative response to a problem that is full of caricature. I hope we’ll take up the challenge.