Size of Google Scholar’s Index?

I’m writing a semi-detailed blog post to counter some recent arguments about the quality of data in Google Scholar.  I don’t have much stake in defending Google here, but I’ve seen some egregious straw man arguments and vacuous statistics bandied around.

To make this argument compelling, though, it would help to have a rough idea of how many documents the Google Scholar index contains.  I twittered about this yesterday, but thought this venue would have a wider reach.

Any comments on the matter would be most helpful.  Even suggestions on the order of magnitude would probably be sufficient.

If the size of the index isn’t obvious, maybe others have ideas about how to estimate it.


Bayes, Fisher and indirect evidence

I just came back from a talk in the stat department.  The speaker was Brad Efron (yes, he’s my dad).  The title of the talk was “The future of indirect evidence.”  A proto-paper version is available in PDF.

The talk concerned some very specific points of relationship and deviation between frequentist and Bayesian statistics.  It’s too reductive to say that the talk tried to marry them, though there was some flavor of that, especially in the context of empirical Bayes methods.  But I think it’s accurate to say that Brad argued that the kind of information that we usually think of in terms of Bayesian priors is not anathema to frequentist methods.  His umbrella term for this is ‘indirect evidence.’

As an example, he offered this graph:

Data from a nephrology study.  y-axis -> kidney function, line -> least-squares regression

Data from a nephrology study. y-axis -> kidney function, line -> least-squares regression

This is a standard result from classical statistics: fitting a linear regression to a sample.  Brad argued, however, that despite its obvious frequentism, analysis of this kind does rely on indirect evidence.  That is, even here we’re bringing belief (though not strictly prior belief) to prediction.

In his example, we wish to predict the kidney function of a 55 year-old.  The red dot indicates the score of the lone 55 year-old in the study.  An analysis based on only direct evidence would thus use his score as the prediction.  But of course statisticians are more comfortable with the prediction that lies on the regression line.  Thus the canonical prediction for a 55 year-old relies on evidence only indirectly related to the kidney function of a 55 year-old.

I’ve not done the topic justice.  But the reason I’ve labored over the point is that the thrust of the talk applied immediately to IR.  Brad argued that classical statistics was developed in the 19th and 20th centuries for data that was common in those eras.  Now data of high dimensionality and tremendous sample sizes is common–IR certainly falls into this camp.

The challenge, we were told, was that contemporary data sets make indirect evidence unignorable.  Bayesian approaches offer a response to this problem, but not the only response.  In particular, the matter of empirical Bayes strikes me as uniquely suited to IR.

In a future post I plan to consider how an empirical Bayes approach would apply to a common problem in IR: smoothing a language model.  I think that this simple task is a good starting point for this analysis.

Sergey Brin on Google Books, NYT Op-Ed

Sergey Brin has an op-ed piece in this morning’s New York Times.  In it he writes about the Google books project, evangelizing on behalf of Google’s work in this arena.  It’s a bland article, and I think this is the point of it.  Brin’s conclusion reads:

“I hope [wholesale book] destruction never happens again, but history would suggest otherwise. More important, even if our cultural heritage stays intact in the world’s foremost libraries, it is effectively lost if no one can access it easily. Many companies, libraries and organizations will play a role in saving and making available the works of the 20th century. Together, authors, publishers and Google are taking just one step toward this goal, but it’s an important step. Let’s not miss this opportunity.”

This sounds like something I’d read in a mediocre student essay.  

True, the article is intended to sway the uninitiated, and thus needs to speak generally.  But as I read it, I found myself wondering if the piece’s rhetorical doldrums don’t serve another purpose: appealing to the populist streak in our (American) zeitgeist.

At the risk of generalizing egregiously, many Americans distrust eggheads.  We like a hale-fellow-well-met.  Elitist nerds from the coasts don’t speak to the “real America.”  Is a strategy of pabulum effective in this context?  Assuming Google has editors on staff as well-trained as their engineering, I suspect they’ve banked on a ‘yes’ to that question.

Daniel Tunkelang on HCIR in ASIST Bulletin

I’ve been reading many blog posts recently on HCIR and cognate problems, due in no small part to the upcoming HCIR conference and the CFP for the 2nd Workshop on collaborative IR.  But a really clear, high-level articulation of the key factors in HCIR are laid out in Daniel Tunkelang‘s new piece in the ASIST Bulletin, “Reconsidering Relevance and Embracing Interaction.”

Besides a compelling overview of HCIR’s motivations (especially wrt the problematic status of relevance in many IR settings), Daniel offers three hallmarks of HCIR, at least if HCIR is done well.  Systems, Tunkelang suggests, should strive for:

  • transparency:  Communicate why the retrieved documents retrieved.
  • control: Allow the searcher to express (and revise) his or her information need in a way that bears directly on what’s communicated through the transparency mechanisms.
  • guidance: Shepherd searchers through the process of translating information needs into tractable queries.

Of course Daniel’s essay does a better job of describing these imperatives than I have done here.  Check it out.

meaningful text analysis

Last night I had dinner with a group of visiting scholars from Germany who are part of the textGrid project.  Textgrid entails an effort to bring grid computing to bear on digital humanities research.  We spent the evening talking not so much about grid technologies but rather about humanities computing in general.  The conversation also focused on the Monk project, with which our gracious host John Unsworth is closely involved.

The thrust of our discussion lay in what computing does, can, should, and cannot offer to the study of humanistic data.

The interesting question is, what should humanities computing be?

Kirsten Uszkalo was especially keen on the application of sentiment analysis to the work she does on early modern English literature.  But I wonder whether the already-difficult problem of identifying, say, positive and negative product reviews isn’t qualitatively different in the context of 16th Century popular lit.

Consider one example that we discussed: reports of demonic possession.  It struck me that a humanist is unlikely to be compelled by a classifier that achieves n% accuracy in separating medical and theological treatments of possession.   Instead, the interesting question in this case lies in identifying the textual features that would enable such a classifier.  That is what aspects of a text–vocabulary, physical dimensions, print type, etc.–speak to a meaningful difference in discourses?

I came away from the dinner wondering where the problem of feature creation, selection, reduction, etc. fits into humanities computing.  To what extent is feature selection a computing problem at all?  Maybe the features that would inform a classifier are the aim of the humanist in the first place.