Microblogging workshop at CHI2010

Yesterday’s microblogging workshop at CHI2010 was great, as those of you following #CHImb on twitter already know.  All of the participants brought interesting ideas–too many to list here.  So I’m just going to focus on a few themes/results that relate most closely to IR.  I highly recommend browsing the list of accepted papers to see for yourself the many, many interesting contributions.

First, I’ll mention that Gene Golovchinsky did a wonderful job presenting our paper on making sense of twitter search.  Gene has posted his slides and some discussion of the workshop.  The questions we posed in the paper and the presentation were:

  • What information needs do people actually bring to microblog search?
  • What should a test collection for conducting research on microblog search look like?

Instead of dwelling on our own contribution, though, I want to offer a recap of some of the work of other people…

I was especially interested in work by several researchers from Xerox PARC.

Michael Bernstein showed a system, eddi, that helps readers who follow many people manage their twitter experience, avoiding information overload via intelligent filtering on several levels.  Ed Chi introduced FeedWinnower, another ambitious system for managing twitter information.  I was especially interested in Bongwon Suh’s talk.  He focused on the role that serendipity plays (or should play) in twitter search.  He suggested that search over microblog data (I know, microblog is not equal to twitter) benefits from serendipity.  Of course only certain types of serendipity are valuable in this context (he said something to the effect of courting previously unknown relevance).

Another really interesting paper (and an interesting conversation over lunch) came from Alice Oh.  The paper focused on using people’s list memberships to induce models of their interests and expertise. I think Alice’s paper speaks to the challenge of finding sources of evidence for information management in microblog environments.

With respect to IR and microblogging, I came away from the workshop with new questions and with a keener edge on questions I already had. Here’s a very abbreviated list of some challenges that researchers in this area face.

information needs: What types of information needs are most germane in this space?  Are users interested in known-item search, ad hoc retrieval, recommendations, browsing, something completely new?

unit of retrieval: Of course this goes back to the matter of information needs (as do all of the following points).  Certainly the task at hand will sway exactly what it is that systems should show users.  But my sense is that some sort of entity search is almost always likely to be of more value than treating an individual tweet as a ‘document.’  That is, search over people, conversations, communities, hashtags, etc. will, I think, lend more value than tweets taken out of context.

data acquisition and evaluation: It’s easy to get lots of twitter data; just latch onto the garden hose and go.  In some cases, data from the hose may be perfectly useful for research and development. Do we need or want formal test collections of this type of data?  If so, what should they look like?  How does obsolescence figure into creating a test collection of de facto ephemeral data?  And of course, there’s probably more to ground truth than Mechanical Turk.

objective functions: In the arena of microblog search, what criteria should we use to rank (if we ARE ranking) entities?  Certainly twitter’s own search engine sees temporality as paramount.  As always, relevance is dicey here–a murky mixture of topicality, usefulness, trustworthiness, timeliness, etc.

By way of a parting shot, I’d like to thank Julia Grace, Dejin Zhao, and danah boyd for organizing the workshop.

twitter at asist 2009

Recently I’ve been speaking with several folks (e.g. Megan Winget and Gene Golovchinsky) about how twitter is or might be important with respect to academic conferences.  I’ve got some research coming up where I hope to look at this.

But in the meantime, I put this together:


[Screenshot: ASIST-related words. Words appearing in recent tweets tagged with #asist, #asist09, #asist2009, #asist2010.]

People heading to ASIST 2009 might be interested in it.  The page just gives a snapshot (updated hourly) of words from tweets tagged with #asist, #asist09, #asist2009, and #asist2010.
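For the curious, here is a minimal sketch of the kind of counting that could sit behind a page like this. It is not the code running the site. The hashtags come from the post, but the stopword list, the sample tweets, and the function names are all made up for illustration.

```python
import re
from collections import Counter

# The four hashtags named in the post.
CONFERENCE_TAGS = {"#asist", "#asist09", "#asist2009", "#asist2010"}

# Assumption: a tiny hand-picked stopword list, just for this sketch.
STOPWORDS = {"the", "a", "an", "to", "of", "and", "in", "at", "is", "for", "on", "was"}

# Matches plain words and hashtags (an optional '#' followed by word characters).
TOKEN_RE = re.compile(r"#?\w+")


def tokenize(text):
    """Lowercase a tweet and split it into words and hashtags."""
    return TOKEN_RE.findall(text.lower())


def word_counts(tweets):
    """Count words in tweets that carry at least one conference hashtag."""
    counts = Counter()
    for text in tweets:
        tokens = tokenize(text)
        if not CONFERENCE_TAGS.intersection(tokens):
            continue  # tweet is not tagged for the conference
        counts.update(
            t for t in tokens
            if t not in CONFERENCE_TAGS and t not in STOPWORDS
        )
    return counts


# Hypothetical tweets, invented for this example.
sample_tweets = [
    "Heading to the conference #asist09",
    "Great panel on tagging at #asist2009",
    "Tagging panel was packed #asist09",
    "Great lunch today",  # no conference tag, so it is skipped
]

counts = word_counts(sample_tweets)
for word, n in counts.most_common(5):
    print(word, n)
```

An hourly snapshot would then just be a matter of rerunning the count over whatever tweets were collected most recently.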

A caveat: having slapped this together quickly, I’m not sure how the site will behave…I hope it is relatively solid.

I put the page up just because it seemed like a natural thing to do given the data that I’ve been collecting (relatively large amounts of twitter-generated info).  I’m hoping that it might, even a little bit, encourage the conference attendees to think of twitter as they listen, chat, etc.

New York Times: OpenData

Jeff Dalton has a great post up about the New York Times’ recent announcement: the paper has launched data.nytimes.com.  Currently the service offers 5k named (i.e. people) subject headings from the NYT news vocabulary.  The headings are available as linked open data.  More headings are on the way.

Handwringing (e.g. here, here and here), maybe deserved, has been in abundance recently in the arena of print journalism.  Finding/maintaining viable business models for high-quality reporting in environments where free information is readily available is a challenge.

I’ve been rooting for NYT in this struggle.  In this respect I’m glad to see their release of data.  Rather than leaning on the obvious and dubious advertising model or walled gardens, this strikes me as a gambit for a novel approach to attacking the problem of the papers’ value.

Can we (i.e. hackers of textual data) repurpose and add value to the excellent information compiled by the Times’ editors?  Is there a viable business model for the Times that could emerge from releasing data, as opposed to closing it?  It’s a creative response to a problem that is full of caricature.  I hope we’ll take up the challenge.

handling hashtags

Twitter hashtags are a great tool for improvised info organization–i.e. using software features to marshal information in ways that the feature designer didn’t think up (and made no pretense of thinking up).  In particular, I’ve been thinking of hashtags as a hack to support collaborative IR.  Need to research the size of Google Scholar’s index?  Mark relevant resources with, say, #search.GSsize .  Others interested in the same topic could add to the body of knowledge related to the search, and could follow the collected resources.

Of course this is what hashtags are for, so I’m not proposing anything very new here.

But this idea got me thinking of a few services that would support hashtag use for collaborative IR:

  1. Intelligent search for tags
  2. Hashtag disambiguation

Other services like recommendation also leap to mind.

By intelligent search, I’m thinking of a way to find tags that are relevant to a particular topic.  hashtags.org/tags already collects tags.  But as far as I know (please correct me if I’m wrong) existing hashtag search simply supports string matches.  It’s difficult to find semantically useful tags. This would frustrate any kind of real collaborative use of them.
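To make the idea concrete, here is a rough sketch of one way to search for tags by the words that accompany them rather than by the tag string itself. This is an assumption on my part about how such a service might start, not a description of any existing one: each tag gets a bag-of-words profile built from the tweets that carry it, and tags are ranked by cosine similarity to the query. The tweets and function names are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict

# Matches plain words and hashtags.
WORD_RE = re.compile(r"#?\w+")


def tag_profiles(tweets):
    """Map each hashtag to a Counter of the words that appear alongside it."""
    profiles = defaultdict(Counter)
    for text in tweets:
        tokens = WORD_RE.findall(text.lower())
        tags = [t for t in tokens if t.startswith("#")]
        words = [t for t in tokens if not t.startswith("#")]
        for tag in tags:
            profiles[tag].update(words)
    return profiles


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search_tags(query, profiles, k=3):
    """Return up to k tags whose accompanying text best matches the query."""
    q = Counter(WORD_RE.findall(query.lower()))
    scored = [(cosine(q, profile), tag) for tag, profile in profiles.items()]
    scored = [(score, tag) for score, tag in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [tag for _, tag in scored[:k]]


# Hypothetical tweets, invented for this example.
sample_tweets = [
    "new paper on ranking and retrieval evaluation #ir",
    "query expansion for retrieval #search",
    "election protests continue #ir",
    "oil paint drying time #medium",
]

profiles = tag_profiles(sample_tweets)
results = search_tags("retrieval evaluation", profiles)
print(results)
```

Note that the query shares no characters with the tags it finds, which is the point. A real service would want term weighting (e.g. tf-idf) and far more data than this toy uses.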

As for hashtag disambiguation, I simply mean trying to identify and separate different semantic uses of the same tag character string.  The admirably ungoverned nature of hashtags naturally leads to collisions.  For example #ir primarily yields tweets related to Iran; not what I had in mind.

Another example: I’m an amateur (VERY amateur) painter with a particular interest in painting mediums.  Too lazy to type #paintingMedium on my phone (I’m not alone in this, I see), I’m inclined to tag things with #medium, which tosses my lot in with scads of information on the TV show.  These collisions aren’t a problem as I organize my own posts, but they would be if people wanted to search broadly for useful tags, jumping onto a tag in medias res.

What I’m suggesting is that it would be useful and interesting to tackle the complexity of hashtags in efforts to extend their utility.  A first step here would be to analyze the text that accompanies them.  But I suspect this wouldn’t be enough.  Would we need to consider the social structure in which tags are embedded?  I sense an opportunity here.

Size of Google Scholar’s Index?

I’m writing a semi-detailed blog post to counter some recent arguments about the quality of data in Google Scholar.  I don’t have much stake in defending Google here, but I’ve seen some egregious straw man arguments and vacuous statistics bandied around.

To make this argument compelling, though, it would help to have a rough idea of how many documents the Google Scholar index contains.  I twittered about this yesterday, but thought this venue would have a wider reach.

Any comments on the matter would be most helpful.  Even suggestions on the order of magnitude would probably be sufficient.

If the size of the index isn’t obvious, maybe others have ideas about how to estimate it.

Bayes, Fisher and indirect evidence

I just came back from a talk in the stat department.  The speaker was Brad Efron (yes, he’s my dad).  The title of the talk was “The future of indirect evidence.”  A proto-paper version is available in PDF.

The talk concerned some very specific points of relationship and deviation between frequentist and Bayesian statistics.  It’s too reductive to say that the talk tried to marry them, though there was some flavor of that, especially in the context of empirical Bayes methods.  But I think it’s accurate to say that Brad argued that the kind of information that we usually think of in terms of Bayesian priors is not anathema to frequentist methods.  His umbrella term for this is ‘indirect evidence.’

As an example, he offered this graph:

Data from a nephrology study.  y-axis -> kidney function, line -> least-squares regression

This is a standard result from classical statistics: fitting a linear regression to a sample.  Brad argued, however, that despite its obvious frequentism, analysis of this kind does rely on indirect evidence.  That is, even here we’re bringing belief (though not strictly prior belief) to prediction.

In his example, we wish to predict the kidney function of a 55-year-old.  The red dot indicates the score of the lone 55-year-old in the study.  An analysis based on only direct evidence would thus use his score as the prediction.  But of course statisticians are more comfortable with the prediction that lies on the regression line.  Thus the canonical prediction for a 55-year-old relies on evidence only indirectly related to the kidney function of a 55-year-old.
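A toy version of the contrast, in code. The numbers below are invented and are not the nephrology data from the talk; the sketch only shows the two predictions side by side: the lone 55-year-old’s own score (direct evidence) versus the value on a least-squares line fit to everyone (indirect evidence).

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (age, kidney function score) pairs; assumption: one subject per age.
ages = [25, 35, 45, 55, 65, 75]
scores = [2.0, 1.0, 0.5, -3.0, -1.5, -2.0]

slope, intercept = fit_line(ages, scores)

# Direct evidence: use only the score of the single 55-year-old.
direct_prediction = scores[ages.index(55)]

# Indirect evidence: use the regression line, which borrows from every subject.
indirect_prediction = intercept + slope * 55

print("direct:", direct_prediction)
print("indirect:", round(indirect_prediction, 3))
```

In this toy data the 55-year-old happens to sit well below the line, so the two predictions differ noticeably, and the one on the line is informed by five people who are not 55.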

I’ve not done the topic justice.  But the reason I’ve labored over the point is that the thrust of the talk applied immediately to IR.  Brad argued that classical statistics was developed in the 19th and 20th centuries for data that was common in those eras.  Now data of high dimensionality and tremendous sample sizes is common–IR certainly falls into this camp.

The challenge, we were told, was that contemporary data sets make indirect evidence unignorable.  Bayesian approaches offer a response to this problem, but not the only response.  In particular, the matter of empirical Bayes strikes me as uniquely suited to IR.

In a future post I plan to consider how an empirical Bayes approach would apply to a common problem in IR: smoothing a language model.  I think that this simple task is a good starting point for this analysis.

Sergey Brin on Google Books, NYT Op-Ed

Sergey Brin has an op-ed piece in this morning’s New York Times.  In it he writes about the Google books project, evangelizing on behalf of Google’s work in this arena.  It’s a bland article, and I think this is the point of it.  Brin’s conclusion reads:

“I hope [wholesale book] destruction never happens again, but history would suggest otherwise. More important, even if our cultural heritage stays intact in the world’s foremost libraries, it is effectively lost if no one can access it easily. Many companies, libraries and organizations will play a role in saving and making available the works of the 20th century. Together, authors, publishers and Google are taking just one step toward this goal, but it’s an important step. Let’s not miss this opportunity.”

This sounds like something I’d read in a mediocre student essay.  

True, the article is intended to sway the uninitiated, and thus needs to speak generally.  But as I read it, I found myself wondering if the piece’s rhetorical doldrums don’t serve another purpose: appealing to the populist streak in our (American) zeitgeist.

At the risk of generalizing egregiously, many Americans distrust eggheads.  We like a hail-fellow-well-met.  Elitist nerds from the coasts don’t speak to the “real America.”  Is a strategy of pabulum effective in this context?  Assuming Google has editors on staff as well-trained as their engineers, I suspect they’ve banked on a ‘yes’ to that question.