meaningful text analysis

Last night I had dinner with a group of visiting scholars from Germany who are part of the TextGrid project. TextGrid is an effort to bring grid computing to bear on digital humanities research. We spent the evening talking not so much about grid technologies as about humanities computing in general. The conversation also turned to the MONK project, with which our gracious host John Unsworth is closely involved.

Our discussion centered on what computing does, can, should, and cannot offer to the study of humanistic data.

The interesting question is, what should humanities computing be?

Kirsten Uszkalo was especially keen on the application of sentiment analysis to the work she does on early modern English literature. But I wonder whether the already-difficult problem of identifying, say, positive and negative product reviews isn’t qualitatively different in the context of 16th-century popular literature.

Consider one example that we discussed: reports of demonic possession. It struck me that a humanist is unlikely to be compelled by a classifier that achieves n% accuracy in separating medical and theological treatments of possession. Instead, the interesting question in this case lies in identifying the textual features that would enable such a classifier. That is, what aspects of a text (vocabulary, physical dimensions, print type, etc.) speak to a meaningful difference in discourses?

I came away from the dinner wondering where the problem of feature creation, selection, reduction, etc. fits into humanities computing.  To what extent is feature selection a computing problem at all?  Maybe the features that would inform a classifier are the aim of the humanist in the first place.
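
To make the question concrete, here is a minimal sketch of one way to surface candidate vocabulary features. It is my own illustration, not something proposed at the dinner. It ranks words by a smoothed log-odds ratio between two classes of text. The snippets, the class labels, and the function names (`tokenize`, `log_odds`) are all invented for the example; real early modern texts would need far more careful handling of spelling and tokenization.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def log_odds(docs_a, docs_b):
    """Map each word to its add-one-smoothed log-odds of class A vs. class B.

    Positive scores lean toward class A, negative scores toward class B.
    """
    counts_a = Counter(t for d in docs_a for t in tokenize(d))
    counts_b = Counter(t for d in docs_b for t in tokenize(d))
    vocab = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + len(vocab)
    total_b = sum(counts_b.values()) + len(vocab)
    return {
        w: math.log((counts_a[w] + 1) / total_a)
        - math.log((counts_b[w] + 1) / total_b)
        for w in vocab
    }


# Invented toy snippets (assumptions), not quotations from any real source.
medical = [
    "the physician observed fits and a fever of the humours",
    "the physician prescribed a purge for the fits",
]
theological = [
    "the minister prayed that the devil depart",
    "the devil tormented her until the minister fasted and prayed",
]

scores = log_odds(medical, theological)
ranked = sorted(scores, key=scores.get)
print("lean theological:", ranked[:3])
print("lean medical:", ranked[-3:])
```

The ranking itself is the easy part. Deciding whether the words at either end of the list mark a meaningful difference in discourse is still the humanist's call.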


7 Comments on “meaningful text analysis”

  1. […] Efron’s latest blog post about humanities computing reminded me of a breakout discussion we had at the BooksOnline’08 […]

  2. lingpipe says:

    I believe the point about humanists’ interests is more general.

    Researchers in the humanities and social sciences (and in epidemiology/drug testing) are primarily interested in the causal effects of various predictors (features) for a problem. For instance, does education affect income? Does word or clause complexity affect sentiment? Does this drug help improve survival rates among pneumonia sufferers? Therefore, they evaluate significance (though too rarely importance) of various predictors.

    Most of the machine learning and engineering types (including physicians) are more interested in building predictive systems. Should I give this applicant a credit card? Is this review a positive review about the movie Shrek? Does this patient have pneumonia and should I administer this drug? Therefore, they evaluate predictive accuracy.

  3. I think we need to distinguish between the social sciences and the humanities. The social sciences are, at least by aspiration, scientific; they aim at objectively describable and predictable observations. The humanities are essentially subjective: they aim at the human interpretation of other human artifacts. So while computational methods can provide an end-point in the social sciences (that is, they can give you the answer), in the humanities they can only be tools, offering up additional data for interpretation (at least until computers can think like humans).

    • milesefron says:

      My example of literary analysis certainly speaks to your point. No amount of computation will solve a problem for which the idea of a solution isn’t valid. An interesting problem that I think is implicit in your comment is that in many humanities computing settings we aren’t trying (or should not be trying) to deliver answers. Offering tools to pursue a thesis seems like a valid but limited role for computing in this setting. But what I think is more exciting is the possibility that bringing computing into an arena like literary analysis might lead scholars to ask questions that would otherwise be left by the wayside. Perhaps these tools would also provide a novel basis for rhetoric.
      Would it be compelling (in the literary world) for a scholar to develop a classifier to separate sonnets that concern their own materiality explicitly versus implicitly? I think it would, because in order for it to work, the researcher would need to identify textual (or extra-textual) features indicative of this distinction. From there, he or she could argue the merits of these features by analyzing the successes and failures of the classifier.

  5. I mine search queries looking for informational searches that can then be turned into a piece of content for my company. I don’t want queries that are just names of things, so I look to a wide variety of POS- and text-based clues, among other things. Some broad groups of POS phrases or words will work in one domain while being worthless in another.

    It’s not 100%, but I can pretty safely use negative contractions (can’t, won’t, isn’t, etc.) in conjunction with something like a list of software names to assume it’s a web query about troubleshooting application-related problems (this works for web queries; I’m not making a general claim).

    Anyway, I’ve found that identifying, checking for, and weighting certain co-occurrences is a lightweight, yet effective, method of finding likely indications of these kinds of things.

    I think a lot of those back-to-the-basics strategies would quickly enable you to achieve good recall and precision with a little trial and error.

  6. Patrick Herron says:

    As you said in class, Miles, there’s no substitute for knowing your data.
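
The search-query heuristic described in the comments above (negative contractions co-occurring with software names) can be sketched roughly as follows. This is an illustration under stated assumptions, not the commenter's actual system: the word lists, the weights, the threshold, and the function names are all hypothetical.

```python
import re

# Hypothetical clue lists; a real system would need far larger ones.
NEGATIVE_CONTRACTIONS = {"can't", "won't", "isn't", "doesn't", "didn't"}
SOFTWARE_NAMES = {"excel", "photoshop", "firefox"}

# Hypothetical weights: each clue counts a little, their co-occurrence counts more.
WEIGHTS = {"negation": 1.0, "software": 1.0, "both": 2.0}


def troubleshooting_score(query):
    """Score a query by which clues it contains and whether they co-occur."""
    # Normalize curly apostrophes so "isn’t" matches "isn't".
    normalized = query.lower().replace("’", "'")
    tokens = set(re.findall(r"[a-z']+", normalized))
    has_negation = bool(tokens & NEGATIVE_CONTRACTIONS)
    has_software = bool(tokens & SOFTWARE_NAMES)
    score = 0.0
    if has_negation:
        score += WEIGHTS["negation"]
    if has_software:
        score += WEIGHTS["software"]
    if has_negation and has_software:
        score += WEIGHTS["both"]
    return score


def is_troubleshooting(query, threshold=3.0):
    """Flag a query as likely troubleshooting when its score meets the threshold."""
    return troubleshooting_score(query) >= threshold


for q in ["excel won't open my file", "photoshop", "why can't I sleep"]:
    print(q, "->", is_troubleshooting(q))
```

With these assumed weights, only queries containing both kinds of clue clear the threshold. Tuning the lists and weights per domain is where the trial and error mentioned above would come in.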

