twitter at asist 2009

November 6, 2009

Recently I’ve been speaking with several folks (e.g. Megan Winget and Gene Golovchinsky) about how twitter is or might be important with respect to academic conferences.  I’ve got some research coming up where I hope to look at this.

But in the meantime, I put this together:

http://tacoma.lis.illinois.edu/asist/

A screenshot:

 

[Screenshot: “ASIST-related words.” Words appearing in recent tweets tagged with #asist, #asist09, #asist2009, #asist2010]

 

 

People heading to ASIST 2009 might be interested in it.  The page just gives a snapshot (updated hourly) of words from tweets tagged with #asist, #asist09, #asist2009, and #asist2010.
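For the curious, the counting behind a page like this can be sketched in a few lines of Python. This is only an illustration, not the code running on the site: the function name `word_counts`, the tiny stopword list, and the sample tweets are all made up, and the sketch assumes the tweets have already been collected as plain strings.

```python
import re
from collections import Counter

# Hashtags named on the page; everything else below is an assumption.
TAGS = {"#asist", "#asist09", "#asist2009", "#asist2010"}
# A deliberately tiny, hypothetical stopword list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "at", "is", "for", "on"}

def word_counts(tweets):
    """Count words in tweets that carry at least one of the ASIST tags."""
    counts = Counter()
    for text in tweets:
        # Crude tokenizer: words, optionally prefixed with '#'.
        tokens = re.findall(r"#?\w+", text.lower())
        if not TAGS.intersection(tokens):
            continue  # skip tweets without a conference tag
        counts.update(t for t in tokens
                      if t not in TAGS and t not in STOPWORDS)
    return counts

# Made-up sample tweets, for illustration only.
tweets = [
    "Heading to Vancouver for #asist09",
    "Great panel on tagging at #asist2009",
    "Lunch in Vancouver",
]
counts = word_counts(tweets)
```

Re-running something like this every hour over the most recent tweets would give the kind of snapshot described above; sizing words by their counts is then a display matter.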

A caveat: having slapped this together quickly, I’m not sure how the site will behave…I hope it is relatively solid.

I put the page up just because it seemed like a natural thing to do given the data that I’ve been collecting (relatively large amounts of twitter-generated info).  I’m hoping that it might, even a little bit, encourage the conference attendees to think of twitter as they listen, chat, etc.


New York Times: OpenData

November 3, 2009

Jeff Dalton has a great post up about the New York Times’ recent announcement: the paper has launched data.nytimes.com.  Currently the service offers 5k named (i.e. people) subject headings from the NYT news vocabulary.  The headings are available as linked open data.  More headings are on the way.

Handwringing (e.g. here, here and here), maybe deserved, has been in abundance recently in the arena of print journalism.  Finding/maintaining viable business models for high-quality reporting in environments where free information is readily available is a challenge.

I’ve been rooting for NYT in this struggle.  In this respect I’m glad to see their release of data.  Rather than leaning on the obvious and dubious advertising model or walled gardens, this strikes me as a gambit for a novel approach to attacking the problem of the papers’ value.

Can we (i.e. hackers of textual data) repurpose and add value to the excellent information compiled by the Times’ editors?  Is there a viable business model for the Times that could emerge from releasing data, as opposed to closing it?  It’s a creative response to a problem that is full of caricature.  I hope we’ll take up the challenge.


handling hashtags

November 2, 2009

Twitter hashtags are a great tool for improvised info organization–i.e. using software features to marshal information in ways that the feature designer didn’t think up (and made no pretense of thinking up).  In particular, I’ve been thinking of hashtags as a hack to support collaborative IR.  Need to research the size of Google Scholar’s index?  Mark relevant resources with, say, #search.GSsize.  Others interested in the same topic could add to the body of knowledge related to the search, and could follow the collected resources.

Of course this is what hashtags are for, so I’m not proposing anything very new here.

But this idea got me thinking of a few services that would support hashtag use for collaborative IR:

  1. Intelligent search for tags
  2. Hashtag disambiguation

Other services like recommendation also leap to mind.

By intelligent search, I’m thinking of a way to find tags that are relevant to a particular topic.  hashtags.org/tags already collects tags.  But as far as I know (please correct me if I’m wrong) existing hashtag search simply supports string matches.  It’s difficult to find semantically useful tags. This would frustrate any kind of real collaborative use of them.

As for hashtag disambiguation, I simply mean trying to identify and separate different semantic uses of the same tag character string.  The admirably ungoverned nature of hashtags naturally leads to collisions.  For example #ir primarily yields tweets related to Iran; not what I had in mind.

Another example: I’m an amateur (VERY amateur) painter with a particular interest in painting mediums.  Too lazy to type #paintingMedium on my phone (I’m not alone in this, I see), I’m inclined to tag things with #medium, which tosses my lot in with scads of information on the TV show.  These collisions aren’t a problem as I organize my own posts, but they would be if people wanted to search broadly for useful tags, jumping onto a tag in medias res.

What I’m suggesting is that it would be useful and interesting to tackle the complexity of hashtags in efforts to extend their utility.  A first step here would be to analyze the text that accompanies them.  But I suspect this wouldn’t be enough.  Would we need to consider the social structure in which tags are embedded?  I sense an opportunity here.
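To make that first step concrete, here is a rough Python sketch of grouping tweets that share a tag by the words that accompany it. It is an assumption-laden toy, not a proposal for the real thing: the greedy grouping, the Jaccard overlap measure, the 0.2 threshold, the function names, and the sample tweets are all my own choices for illustration.

```python
import re

def context_words(text, tag):
    """Words accompanying the tag in one tweet (the tag itself removed)."""
    tokens = set(re.findall(r"#?\w+", text.lower()))
    tokens.discard(tag.lower())
    return tokens

def jaccard(a, b):
    """Overlap between two word sets, from 0 (disjoint) to 1 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def group_by_context(tweets, tag, threshold=0.2):
    """Greedily assign each tweet to the most similar existing group,
    or start a new group if nothing is similar enough."""
    groups = []  # each group holds its pooled words and its tweets
    for text in tweets:
        words = context_words(text, tag)
        best, best_sim = None, 0.0
        for group in groups:
            sim = jaccard(words, group["words"])
            if sim > best_sim:
                best, best_sim = group, sim
        if best is not None and best_sim >= threshold:
            best["tweets"].append(text)
            best["words"] |= words
        else:
            groups.append({"words": set(words), "tweets": [text]})
    return [group["tweets"] for group in groups]

# Made-up tweets using #medium in two senses.
tweets = [
    "linseed oil as a painting #medium",
    "new episode of the #medium tv show tonight",
    "mixing oil and wax as a painting #medium",
    "did you watch the new #medium episode tonight",
]
senses = group_by_context(tweets, "#medium")
```

On these four tweets the painting posts and the TV posts end up in separate groups. Real tweets are far messier, and function words like "as" and "a" do a suspicious amount of the work here, which is part of why text alone probably won't be enough.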


Size of Google Scholar’s Index?

October 22, 2009

I’m writing a semi-detailed blog post to counter some recent arguments about the quality of data in Google Scholar.  I don’t have much stake in defending Google here, but I’ve seen some egregious straw man arguments and vacuous statistics bandied around.

To make this argument compelling, though, it would help to have a rough idea of how many documents the Google Scholar index contains.  I twittered about this yesterday, but thought this venue would have a wider reach.

Any comments on the matter would be most helpful.  Even suggestions on the order of magnitude would probably be sufficient.

If the size of the index isn’t obvious, maybe others have ideas about how to estimate it.


Bayes, Fisher and indirect evidence

October 15, 2009

I just came back from a talk in the stat department.  The speaker was Brad Efron (yes, he’s my dad).  The title of the talk was “The future of indirect evidence.”  A proto-paper version is available in PDF.

The talk concerned some very specific points of relationship and deviation between frequentist and Bayesian statistics.  It’s too reductive to say that the talk tried to marry them, though there was some flavor of that, especially in the context of empirical Bayes methods.  But I think it’s accurate to say that Brad argued that the kind of information that we usually think of in terms of Bayesian priors is not anathema to frequentist methods.  His umbrella term for this is ‘indirect evidence.’

As an example, he offered this graph:

Data from a nephrology study.  y-axis -> kidney function, line -> least-squares regression


This is a standard result from classical statistics: fitting a linear regression to a sample.  Brad argued, however, that despite its obvious frequentism, analysis of this kind does rely on indirect evidence.  That is, even here we’re bringing belief (though not strictly prior belief) to prediction.

In his example, we wish to predict the kidney function of a 55-year-old.  The red dot indicates the score of the lone 55-year-old in the study.  An analysis based on only direct evidence would thus use his score as the prediction.  But of course statisticians are more comfortable with the prediction that lies on the regression line.  Thus the canonical prediction for a 55-year-old relies on evidence only indirectly related to the kidney function of a 55-year-old.
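A small Python sketch may make the contrast between the two predictions concrete. The ages and scores below are invented for illustration and are not the nephrology data from the talk; `least_squares` is just the textbook closed-form fit for a single predictor.

```python
def least_squares(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Invented data: one subject per age, kidney function as a score.
ages = [25, 35, 45, 55, 65, 75]
scores = [2.0, 1.0, 0.5, 1.5, -1.5, -2.0]

# Direct evidence: the score of the lone 55-year-old.
direct = scores[ages.index(55)]

# Indirect evidence: the fitted line, which draws on every other age too.
slope, intercept = least_squares(ages, scores)
indirect = intercept + slope * 55
```

With these made-up numbers the lone 55-year-old scores well above the line, so the two predictions disagree noticeably; the regression prediction leans on the five subjects who are not 55.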

I’ve not done the topic justice.  But the reason I’ve labored over the point is that the thrust of the talk applied immediately to IR.  Brad argued that classical statistics was developed in the 19th and 20th centuries for data that was common in those eras.  Now data of high dimensionality and tremendous sample sizes is common–IR certainly falls into this camp.

The challenge, we were told, was that contemporary data sets make indirect evidence unignorable.  Bayesian approaches offer a response to this problem, but not the only response.  In particular, the matter of empirical Bayes strikes me as uniquely suited to IR.

In a future post I plan to consider how an empirical Bayes approach would apply to a common problem in IR: smoothing a language model.  I think that this simple task is a good starting point for this analysis.
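As a placeholder until then, here is a sketch of one standard smoothing scheme, Dirichlet-prior smoothing, in which the word distribution of the whole collection plays the role of a prior estimated from the data. Whether the future post ends up taking this route is an open question; the function names, the value of `mu`, and the toy documents are assumptions for illustration.

```python
from collections import Counter

def collection_model(docs):
    """Maximum-likelihood word distribution over the whole collection."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def dirichlet_smoothed(doc, prior, mu=10.0):
    """p(w|d) = (c(w, d) + mu * p(w|C)) / (|d| + mu).

    mu sets how strongly the collection model pulls on the document's
    own counts; 10.0 is an arbitrary choice for this toy example.
    """
    counts = Counter(doc)
    length = len(doc)
    return {word: (counts[word] + mu * p) / (length + mu)
            for word, p in prior.items()}

# Toy, made-up documents as lists of tokens.
docs = [
    ["bayes", "prior", "bayes"],
    ["query", "model", "prior"],
]
prior = collection_model(docs)
model = dirichlet_smoothed(docs[0], prior)
```

The first document never mentions "query", yet its smoothed model gives that word a nonzero probability, borrowed from the rest of the collection. That borrowing is the indirect evidence at work.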


Sergey Brin on Google Books, NYT Op-Ed

October 9, 2009

Sergey Brin has an op-ed piece in this morning’s New York Times.  In it he writes about the Google books project, evangelizing on behalf of Google’s work in this arena.  It’s a bland article, and I think this is the point of it.  Brin’s conclusion reads:

“I hope [wholesale book] destruction never happens again, but history would suggest otherwise. More important, even if our cultural heritage stays intact in the world’s foremost libraries, it is effectively lost if no one can access it easily. Many companies, libraries and organizations will play a role in saving and making available the works of the 20th century. Together, authors, publishers and Google are taking just one step toward this goal, but it’s an important step. Let’s not miss this opportunity.”

This sounds like something I’d read in a mediocre student essay.  

True, the article is intended to sway the uninitiated, and thus needs to speak generally.  But as I read it, I found myself wondering if the piece’s rhetorical doldrums don’t serve another purpose: appealing to the populist streak in our (American) zeitgeist.

At the risk of generalizing egregiously, many Americans distrust eggheads.  We like a hale-fellow-well-met.  Elitist nerds from the coasts don’t speak to the “real America.”  Is a strategy of pabulum effective in this context?  Assuming Google has editors on staff as well trained as its engineers, I suspect they’ve banked on a ‘yes’ to that question.


Daniel Tunkelang on HCIR in ASIST Bulletin

October 6, 2009

I’ve been reading many blog posts recently on HCIR and cognate problems, due in no small part to the upcoming HCIR conference and the CFP for the 2nd Workshop on collaborative IR.  But a really clear, high-level articulation of the key factors in HCIR is laid out in Daniel Tunkelang’s new piece in the ASIST Bulletin, “Reconsidering Relevance and Embracing Interaction.”

Besides a compelling overview of HCIR’s motivations (especially wrt the problematic status of relevance in many IR settings), Daniel offers three hallmarks of HCIR, at least if HCIR is done well.  Systems, Tunkelang suggests, should strive for:

  • transparency:  Communicate why the retrieved documents were retrieved.
  • control: Allow the searcher to express (and revise) his or her information need in a way that bears directly on what’s communicated through the transparency mechanisms.
  • guidance: Shepherd searchers through the process of translating information needs into tractable queries.

Of course Daniel’s essay does a better job of describing these imperatives than I have done here.  Check it out.


meaningful text analysis

October 1, 2009

Last night I had dinner with a group of visiting scholars from Germany who are part of the TextGrid project.  TextGrid entails an effort to bring grid computing to bear on digital humanities research.  We spent the evening talking not so much about grid technologies but rather about humanities computing in general.  The conversation also focused on the Monk project, with which our gracious host John Unsworth is closely involved.

The thrust of our discussion lay in what computing does, can, should, and cannot offer to the study of humanistic data.

The interesting question is, what should humanities computing be?

Kirsten Uszkalo was especially keen on the application of sentiment analysis to the work she does on early modern English literature.  But I wonder whether the already-difficult problem of identifying, say, positive and negative product reviews isn’t qualitatively different in the context of 16th-century popular lit.

Consider one example that we discussed: reports of demonic possession.  It struck me that a humanist is unlikely to be compelled by a classifier that achieves n% accuracy in separating medical and theological treatments of possession.  Instead, the interesting question in this case lies in identifying the textual features that would enable such a classifier.  That is, what aspects of a text–vocabulary, physical dimensions, print type, etc.–speak to a meaningful difference in discourses?
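One crude way to put that question to a corpus is to rank vocabulary by how lopsidedly it appears in the two discourses. The Python sketch below does this with a smoothed log ratio of word probabilities. It is only an illustration under heavy assumptions: the function names are mine, the snippets are invented rather than drawn from any early modern text, and vocabulary is just one of the kinds of feature mentioned above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphabetic tokens; a deliberately naive tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

def log_ratio(docs_a, docs_b, alpha=1.0):
    """Smoothed log ratio of each word's probability in A versus B.

    Positive scores lean toward A, negative toward B. alpha is add-alpha
    smoothing so that unseen words do not produce division by zero.
    """
    counts_a = Counter(t for d in docs_a for t in tokenize(d))
    counts_b = Counter(t for d in docs_b for t in tokenize(d))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    scores = {}
    for word in vocab:
        p_a = (counts_a[word] + alpha) / total_a
        p_b = (counts_b[word] + alpha) / total_b
        scores[word] = math.log(p_a / p_b)
    return scores

# Invented snippets standing in for the two discourses.
medical = [
    "the humours and the melancholy of the patient",
    "a physician treats the melancholy",
]
theological = [
    "the devil torments the soul",
    "prayer drives the devil from the soul",
]
scores = log_ratio(medical, theological)
ranked = sorted(scores, key=scores.get, reverse=True)
```

The words at either end of `ranked` are the ones doing the separating, and a word like "the" lands near zero. Whether such a list tells a humanist anything she didn't already know is, of course, the real question.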

I came away from the dinner wondering where the problem of feature creation, selection, reduction, etc. fits into humanities computing.  To what extent is feature selection a computing problem at all?  Maybe the features that would inform a classifier are the aim of the humanist in the first place.


foodscanner–the social contract in micro-IR

September 25, 2009

An inaugural blog post should be monumental, a manifesto even.  But this high bar has been keeping me from getting IRepeat going.  As such, I’ve decided to start more modestly.

Recently I posted to probablyIrrelevant about “micro-IR” which may or may not be something new.  The idea was picked up by theNoisyChannel and FXPAL Blog.  In efforts to triangulate on what micro-IR is, this is the first installment of a series of entries that highlight specific micro-retrieval applications.

I should state clearly: posts about particular applications are not endorsements on my part.  They are simply examples.

foodscanner is part of the larger DailyBurn diet management application.  foodscanner is an iPhone app that scans bar codes on food items, shows you the item’s nutritional information, and allows you to store (and later analyze) the nutritional value of your diet.

In what sense does foodscanner constitute a micro-IR application?

  1. It is tightly linked to a particular task, a particular context: monitoring diet.
  2. The search space is narrowly constrained and highly structured: each “food” is represented as a single database record with agreed-upon nutritional information available.
  3. It uses innovative affordances to make articulating a complex information need simple.  Here the affordances entail both the iPhone’s technology (camera and a third-party barcode scanner) and the constrained information space (the system ‘knows’ a priori the thrust of the searcher’s problem).
  4. The information retrieved by foodscanner naturally feeds (no pun intended) into another application (DailyBurn) which supports more complex, secondary information interactions.

Points 1-3 suggest an important characteristic of micro-IR.  In a micro-IR setting, expectations are high.  This applies both to searchers and to the system.  That is, the system expects that users will express their information need in a way that is highly informative; however, the burden of this expectation is lightened for the user by the application’s design (however successful; here’s where innovation enters the fray).  In return for this highly informative ‘query’ the user expects information that is immediately actionable: I search and then I have the information I need to decide whether or not to eat this food.

I’m leaning towards calling the high bar of expectations in micro-IR an aspect of the social contract between users and systems.  This contract, I argue, is different in micro-IR systems than it is in other IR venues.  But exactly how to characterize the idea of IR’s social contract is up for grabs and with luck will inform a future IRepeat post.