Snowball sampling for Twitter Research

By way of shameless promotion, I am currently encouraging people to help me evaluate an experimental IR system that searches microblog (Twitter) data.  To participate, please see:

Please consider giving it a moment…even a brief moment.

Now, onto a more substantive matter: I’ve been wrestling with the validity of testing an IR system (particularly a microblog IR system) using a so-called snowball sampling technique.  For the uninitiated, snowball sampling involves recruiting a small number of people to participate in a study with the explicit aim that they will, in turn, encourage others to participate.  The hope is that participation in the study will extend beyond the narrow initial sample as subjects recruit new participants.

Snowball sampling has clear drawbacks.  Most obviously, it is ripe for introducing bias into one’s analysis.  The initial “seed” participants will drive the demographics of subsequent recruits.  This effect could amplify any initial bias.  The non-random selection of initial participants (assuming it is non-random), together with their non-random selection of recruits, calls into question the application of standard inferential statistics at the end of the study.  What status does a confidence interval on, say, user satisfaction derived from a snowball sample have with respect to the level of user satisfaction in the population?
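To make the amplification worry concrete, here is a toy simulation.  It is only a sketch under made-up assumptions, not a description of any real study: the two groups, their satisfaction scores, the 0.95 homophily value, and the function names snowball_sample and naive_ci are all hypothetical.  Each recruit is assumed to share the recruiter's group with a fixed probability, and the interval is the textbook normal approximation that treats the sample as if it were random.

```python
import random
import statistics

# Hypothetical numbers, chosen only for illustration.
SATISFACTION = {"A": 0.8, "B": 0.4}   # assumed satisfaction score per group
POPULATION_SHARE_A = 0.5              # assumed: half the population is in group A


def snowball_sample(seeds, waves, recruits_each, homophily, rng):
    """Grow a sample wave by wave. Each recruit shares the recruiter's group
    with probability `homophily`; otherwise the recruit is in the other group."""
    sample = list(seeds)
    current = list(seeds)
    for _ in range(waves):
        next_wave = []
        for recruiter in current:
            for _ in range(recruits_each):
                if rng.random() < homophily:
                    next_wave.append(recruiter)
                else:
                    next_wave.append("B" if recruiter == "A" else "A")
        sample.extend(next_wave)
        current = next_wave
    return sample


def naive_ci(scores):
    """Normal-approximation 95% interval that treats the scores as i.i.d.,
    which is exactly the assumption a snowball sample violates."""
    mean = statistics.mean(scores)
    half_width = 1.96 * statistics.stdev(scores) / len(scores) ** 0.5
    return mean - half_width, mean + half_width


rng = random.Random(42)  # fixed seed so the run is repeatable
sample = snowball_sample(["A"] * 10, waves=4, recruits_each=2,
                         homophily=0.95, rng=rng)
scores = [SATISFACTION[group] for group in sample]

true_mean = (POPULATION_SHARE_A * SATISFACTION["A"]
             + (1 - POPULATION_SHARE_A) * SATISFACTION["B"])
share_a = sample.count("A") / len(sample)
low, high = naive_ci(scores)

print(f"sample size: {len(sample)}, share of group A: {share_a:.2f}")
print(f"population mean: {true_mean:.2f}, naive 95% CI: ({low:.3f}, {high:.3f})")
```

Because every seed comes from group A and recruits mostly stay in the recruiter's group, the sample over-represents group A, and the naive interval is centered on the wrong value no matter how narrow it looks.  Lowering homophily toward 0.5 or adding waves lets the sample drift back toward the population mix.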

However, snowball sampling has its merits, too.  Among these is the possibility of obtaining a reasonable number of participants in the absence of a tractable method for random sampling.

In my case, I have decided that a snowball sample for this study is worth the risks it entails.  In order to avoid poisoning my results, I’ll keep description of the project to a minimum.

But I feel comfortable saying that my method of recruiting includes dissemination of a call for participation in several venues:

  • Via a Twitter post with a call for readers to retweet it.
  • Via this blog post!
  • By email to two mailing lists (one a student list, and the other a list of Twitter researchers).

In this case, the value of a snowball sample extends beyond simply acquiring a large N.  Because Twitter users are connected by Twitter’s native subscription model, I do not think it is a liability that my sample will draw many users who are “close” to my social network.  Instead it will, I hope, lend a level of realism to how a particular sub-community functions.

One problem with these rose-colored lenses is that I have no way to characterize this sub-community formally.  Inferences drawn from this sample may generalize to some group.  But what group is that?

Obviously some of the validity of this sample will have to do with the nature of the data collected and the research questions to be posed against it, neither of which I’m comfortable discussing yet.  But I would be interested to hear what readers think: for research on the use of systems that inherently rely on social connections, does snowball sampling have merits or liabilities that do not pertain to situations lacking explicit social linkage?

API search

Checking out the recently opened infochimps API reminded me of an issue that has been on my research backburner, but also on my radar for a while.  Here’s the question: given the proliferating number of web app APIs available to developers, would it be useful to build a search service that helps people find / explore available APIs?  I think it would.

Maybe such a thing already exists; trying to find it with a web search along the lines of “api search” is unhelpful for obvious reasons.  I’d be curious if any readers could offer pointers.

What I’m envisioning is a service that supports some hybrid of search and browsing with the aim of helping developers find APIs that provide access to data, functions, etc. that will help in completing a particular task.

For instance, I’m working on a system that helps people manage the searches they perform over time, across systems, on various devices, etc.  My feeling is that people’s experience with IR systems creates information that is often lost.  Some previous research of course addresses this issue… but that is for another post.

This is a pretty nebulous idea, and it’s not obvious to me what data and services are available to support its development.

I’d like to have access to a system that helps me explore:

  • data sets exposed via publicly available APIs.
  • APIs or API functions x, y, and z that are similar to some API function a.
  • restrictions / terms of service for possibly useful APIs (e.g. rate limits, attribution, re-ranking of search results).
  • API documentation.
  • Libraries available for working with APIs.
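As a rough illustration of what the wish list above might look like as data, here is a minimal sketch.  Everything in it is an assumption made for the example: the record fields, the three made-up APIs (ExampleMicroblog, ExampleNews, ExampleGeo), the example.com URLs, and the use of simple word overlap (Jaccard similarity) between function descriptions as a stand-in for "similar to function a".  A real service would presumably need something much richer.

```python
from dataclasses import dataclass, field


@dataclass
class ApiFunction:
    api: str          # name of the (made-up) API exposing this function
    name: str
    description: str


@dataclass
class ApiRecord:
    name: str
    datasets: list
    rate_limit_per_hour: int      # one example of a terms-of-service restriction
    requires_attribution: bool
    docs_url: str
    libraries: list = field(default_factory=list)
    functions: list = field(default_factory=list)


def tokens(text):
    """Lowercased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two descriptions, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def similar_functions(target, catalog, k=3):
    """Rank functions from *other* APIs by description overlap with `target`."""
    candidates = [f for rec in catalog for f in rec.functions
                  if f.api != target.api]
    candidates.sort(key=lambda f: jaccard(target.description, f.description),
                    reverse=True)
    return candidates[:k]


# A tiny hypothetical catalog.
catalog = [
    ApiRecord("ExampleMicroblog", ["public posts"], 150, True,
              "https://example.com/microblog/docs", ["examplemicroblog-py"],
              [ApiFunction("ExampleMicroblog", "search_posts",
                           "search recent public posts by keyword")]),
    ApiRecord("ExampleNews", ["news articles"], 5000, False,
              "https://example.com/news/docs", [],
              [ApiFunction("ExampleNews", "search_articles",
                           "search news articles by keyword"),
               ApiFunction("ExampleNews", "list_sections",
                           "list newspaper sections")]),
    ApiRecord("ExampleGeo", ["place names"], 1000, True,
              "https://example.com/geo/docs", [],
              [ApiFunction("ExampleGeo", "geocode",
                           "convert a street address to coordinates")]),
]

target = catalog[0].functions[0]
ranked = similar_functions(target, catalog)

# Browsing by restriction: APIs with a generous rate limit and no attribution rule.
permissive = [rec.name for rec in catalog
              if rec.rate_limit_per_hour >= 1000 and not rec.requires_attribution]

print([f"{f.api}.{f.name}" for f in ranked])
print(permissive)
```

Here the keyword search function of the made-up news API ranks first as the nearest neighbor of the microblog search function, and the restriction filter returns only ExampleNews.  Word overlap on descriptions is about the crudest similarity measure available; it is here only to show where a better one would plug in.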

Aside from the practical value of such a system, I think API retrieval raises interesting research questions and problem areas.  For instance, what kinds of queries (or for that matter what kinds of information needs) can we expect to deal with in this arena? What factors make a particular unit of retrieval relevant? What features of these units offer traction in pursuit of retrieval? What kind of crawling / data acquisition steps do we need to address to move this agenda forward?

I suspect that addressing these problems is as much an HCI problem as it is a core IR issue.  Presenting germane information about APIs in a consistent fashion that allows for comparison and coherent exploration strikes me as a tall challenge.

New York Times: OpenData

Jeff Dalton has a great post up about the New York Times’ recent announcement: the paper has launched a linked open data service.  Currently the service offers 5k named (i.e. people) subject headings from the NYT news vocabulary.  The headings are available as linked open data.  More headings are on the way.

Handwringing (e.g. here, here and here), maybe deserved, has been in abundance recently in the arena of print journalism.  Finding/maintaining viable business models for high-quality reporting in environments where free information is readily available is a challenge.

I’ve been rooting for NYT in this struggle.  In this respect I’m glad to see their release of data.  Rather than leaning on the obvious and dubious advertising model or walled gardens, this strikes me as a gambit for a novel approach to attacking the problem of the papers’ value.

Can we (i.e. hackers of textual data) repurpose and add value to the excellent information compiled by the Times’ editors?  Is there a viable business model for the Times that could emerge from releasing data, as opposed to closing it?  It’s a creative response to a problem that is full of caricature.  I hope we’ll take up the challenge.