Google N-Gram Studies

I’ve been playing around with the new Google n-gram edition. They’ve added some new tools: wild cards, part of speech tagging, mathematical operators. Ben Zimmer posted a useful summary at The Atlantic. They’ve also done a good job of cleaning up dating, the medial “s” problem and other OCR errors. Back in 2011 I gave a presentation where I used 3 slides that compared native n-gram plots (“comparison” vs. “comparifon” to highlight one of these errors) to my own plot of the Google 1-gram data, which cleaned up problems with the OCR and case sensitivity.

First, comparison vs. comparifon with the earlier data set:




And finally, my ca. 2011 plot of the adjusted counts:




Two years later, with the new revision to Google’s n-gram viewer dataset, here’s how the same search comes out natively:



They aren’t exactly the same (for one, my normalized plot pulls together all hits for the lemma “compar-”), but I don’t know whether to be more relieved that my graph ended up reflecting the current data or excited that the Google folks have done so much to improve the search.
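A minimal sketch of this sort of clean-up, not the script behind the 2011 plot: it case-folds each token, repairs the one known long-s misreading, and sums counts per year for everything sharing the stem. The function names, the (token, year, count) row layout, and the sample numbers are all assumptions made up for illustration.

```python
from collections import defaultdict


def normalize(token):
    """Case-fold a token and undo the long-s misreading (e.g. 'comparifon')."""
    token = token.lower()
    # Assumption: repair only the known OCR variant of this stem, so that
    # genuine words containing 'f' are left alone.
    if token.startswith("comparif"):
        token = "comparis" + token[len("comparif"):]
    return token


def stem_counts_by_year(rows, stem="compar"):
    """Sum counts per year for tokens whose normalized form starts with stem."""
    totals = defaultdict(int)
    for token, year, count in rows:
        if normalize(token).startswith(stem):
            totals[year] += count
    return dict(totals)


# Made-up rows in (token, year, count) form, for illustration only.
rows = [
    ("comparison", 1780, 40),
    ("Comparifon", 1780, 55),
    ("comparifon", 1780, 5),
    ("compare", 1780, 20),
    ("company", 1780, 99),  # different stem, ignored
    ("Comparison", 1850, 120),
]
totals = stem_counts_by_year(rows)
```

Restricting the repair to a single known variant is a deliberately cautious choice; a blanket f-to-s substitution would corrupt ordinary words.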

You can also apply arithmetic operators. For instance, you can track the transition between related terms or phrases. Here I’ve combined the more modern genre descriptors “historical fiction” and “historical novel” and divided by Walter Scott’s preferred term, “historical romance,” searched as ((historical fiction)+(historical novel))/(historical romance):




It helps demonstrate that the two more modern descriptors were roughly on par with Scott’s until the late nineteenth century, when “historical romance” began to fall into disuse. Note that this kind of search assumes that relative changes are meaningful (even if, in fact, historical trends are specific to only one set of search terms).
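The arithmetic behind a combined search like this can be sketched as a per-year ratio of relative frequencies. This is only an illustration of the calculation, not how the viewer itself is implemented; the function name and the sample values are made up.

```python
def combined_ratio(numerators, denominator):
    """Per-year (a + b + ...) / c for frequency series keyed by year."""
    ratio = {}
    for year, denom in denominator.items():
        if denom == 0:
            continue  # the ratio is undefined when the denominator term is absent
        if all(year in series for series in numerators):
            ratio[year] = sum(series[year] for series in numerators) / denom
    return ratio


# Made-up relative frequencies, for illustration only.
fiction = {1820: 1.0, 1900: 6.0}
novel = {1820: 1.0, 1900: 10.0}
romance = {1820: 2.0, 1900: 4.0}
ratio = combined_ratio([fiction, novel], romance)
```

A value near 1 means the combined modern terms are on par with the older one; values well above 1 mean the modern terms dominate.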

I’m particularly excited about the wildcard operator (*), and the window into contextual use that it offers. For instance, here are the top bigrams for “comparative *” for both the broad English corpus and English fiction:






What you can see is that, within works of fiction, “comparative” becomes primarily a way to evaluate sentiments and characterizations, whereas in the broader English corpus, it is largely a term of art. (N.B.: David Ricardo has had quite an afterlife.)

Obviously, it would be nice if you could combine operators, wild cards, primitives, and case sensitivity. But I assume you can’t because what you are really doing is various sorting operations over a BigTable and each of these methods reflects a different sort strategy (though I can’t find confirmation that this is how Google handles the n-gram data).


UnProTip: The “embed” code for Google doesn’t offer the customizability of most media players and the fonts get mushy in the post format of most blogs (it really needs about 1000 pixels of width to work well). I’m still shrinking the browser about 60% after rendering the n-gram and doing a selected screen grab to keep things looking (relatively) clean. The drawback is that you lose the ability to highlight specific plot lines (though you can use the embed code to create a hyperlink to that graph).

Surfing the Permanent Revolution: Digital Humanism at NAVSA 2013

This week I’m back from NAVSA. Well — not really back; it was just up the road in Pasadena. But I expect to spend some time nursing this (intellectual) hangover and thinking of the talks that I saw and the questions that were raised there.

Most immediately, it’s clear that digital work has hit the pavement in 19th century studies. Natalie Houston gave a fantastic talk about her “Visual Page” project, which uses Google’s Tesseract OCR engine to analyze formal elements in a print corpus of Victorian poetry. It was stunning how much a computer can learn about a poetry collection just from the blank spaces on the page. Maeve Adams gave an intriguing paper that read across key terms in Victorian periodicals as “epistemic communities” and used this to ground a far-reaching argument about formalism in the 19th century. And Rachel Buurma expanded on her work on Charles Reade, an eccentric even among archive rats, and his archives. As she put it, his wildly profuse collections of documents, indexes, and indexes on indexes add up to archives “on the way to becoming novels.” I’m almost convinced to read more Reade. It doesn’t sound like he would have appreciated YAHOO (I read the marginalia as: “In other words know the contents before you know anything about this”):

Neither @rbuurma nor @creade amused.

On Saturday I participated in a digital roundtable that Anne Helmreich of the Getty Foundation organized to field questions about research and pedagogy from conference attendees. The Prezi from my own talk, about some of the tools I’m using in class (Facebook as a social CMS and Google Drive for workshops), is posted here. My main point was that English seminars have always been “flipped”: focused on in-class workshopping and intellectual tinkering. Which makes it easy to fold in digital tools. (I take my inspiration here from Jentery Sayers and his Maker Lab.) But I was more interested in hearing what the other panelists and the attendees had to say about the state of the digital union with C19 studies.

Mapping the World of Oliphant’s Novels

About a year ago, at the previous MLA, I spoke on a panel about literary reactions to the Scottish Rising of 1745. I’d thought I’d written about it, but in the process of getting this server back up and running, I found this old draft post. My talk covered Victorian reactions to the ’45, focusing on the novels of Margaret Oliphant and Robert Louis Stevenson. Part of the question I wanted to raise was whether the rising is typically understood as a site of political and historical closure that cements the constitution of “Britain” as a cultural entity. One way to get at this, I thought, was to see whether literature written about the rising emphasized Britain over Scotland and England.

Hacking: WYSIWYG

Two weeks ago I noted that someone had recently tried to get into my WordPress server. My firewall traced the query back to an IP in China, though I don’t have the ability to figure out where it originally came from. I linked it to news of escalating activity from abroad; it seems that attempts to get into academic networks are sharply on the rise.

Then a week ago my server collapsed under what seemed to be a DDoS attack. I tried to restart it several times, but every time I got the server back up it was swamped with traffic. I’ve spent a good eight hours now launching a new server and migrating over content from a backup. Most of my posts are back, but I lost the last year’s worth of images. I’ve only been able to recreate or restore about half.

It’s all kind of creepy. And it may be beyond my capacity to try and stay on top of escalating security problems on a private blog. Apparently there’s a botnet that’s been hacking WordPress servers generally for the last several months. I like having my own site; I like the ability to post whatever content I want and try out different kinds of server technologies; my Omeka-based class last year depended on this capacity. But the bar is getting higher.

Machine Grading

Credit: Melanie Schultz. Used with permission.

A friend of mine drew my attention to the NYTimes’ recent article on advances in essay-grading software. It’s technology that will raise hackles at campuses around the country. The claim is that such programs are becoming sophisticated enough to grade college-level writing. Of course, their effectiveness is widely debated. The article helpfully includes a link to a study by Les Perelman which critiques the data being used to support such claims (he argues that sample size problems, confusion between distinct kinds of essays and grading systems, and loose assertions undermine the argument). The software is getting better, but it still doesn’t look like it can quite replicate the scores produced by human graders.

But such criticism is an argument at the margins. There is now clearly room for debate on both sides. Machines are comparable on standardized tests. The long-term trajectory is evident: if machines are roughly as effective as a force of part-time human graders, standardized tests will end up using the software to save money. They’ll keep some humans in the loop cross-checking and validating, but the key incentives all point in the direction of greater automation. The reductive structures and simplistic arguments which we train students to replicate for these tests have laid the groundwork. We’ve already whittled essay writing into an algorithm.

Arts & Humanities Degrees are Hard

The annual NSSE benchmark study of universities is out and it has a handy “Report Builder” that allows you to generate reports drawn from their broad survey of freshman and senior undergraduates at a huge range of institutions in the US and Canada. I decided to play around with it a bit, and generated these two models of student opinions about their major at competitive research universities in the US:

Freshman Responses by Major



Senior Responses by Major




This seems to confirm the counterintuitive reaction I get when I tell people outside the university that I’m an English professor. Nine times out of ten, they tell me how *hard* they found their English courses in college.

Peries Project Archive

While I was leaving last semester, the technical team at Penn and the college graciously agreed to host a static copy of the Peries Project archive that we developed as part of my Ben Franklin Scholars course. It’s currently hosted here. There’s a longer description of the project at the site and in a previous post. Unfortunately, it’s a static copy, so the crowdsourcing and dynamic features won’t work. But it still looks pretty good. When things get settled here at USC I’m planning to relaunch the site and expand it through additional class projects.

ICR2012: Zombies, Climate Change, and the End of the Two Cultures

Just got back from ICR 2012 in Tempe, AZ. Huge thanks to Ron Broglio and Mark Lussier for hosting (and to my friend Michael Gamer for organizing my panel). I made some new friends and had a hell of a time — too much fun, really. If you’re interested, I’ve put the talk I gave up here.

But I wanted to quickly jot down some take-aways. First, climate events had a much larger impact on the Romantic period than I’d understood, perhaps even helping catalyze the French Revolution. Second, in an era of climate worries and zombie-apocalypse obsessions, Mary Shelley’s The Last Man may end up having a larger influence than Frankenstein.

Finally, the marvelous Marilyn Gaull gave a talk on Romantic science that was an inspiring opener. Her main point: that the “two cultures” are never so far apart as they seem. But as I was mulling it over later, I realized that institutionally, it feels like the two-cultures divide is collapsing. At universities, the humanities and sciences are increasingly fighting a joint rear-guard action against the expansion of professional schools into their curriculum. After talking this through with some others at the conference, I’m pretty sure this trend isn’t particular to the schools I’ve worked at. I might expand on this more in a later post, but C. P. Snow’s formulation, which emphasized the impact of institutional divides on campus cocktail parties, just feels kind of quaint at this point.

Damning Study of For-Profit Colleges

It is probably a little self-serving for me to criticize for-profit colleges, but the report put out by the Senate’s Health, Education, Labor, and Pensions Committee is withering. Among the findings:

Well over 50% of the students who begin an associate’s or bachelor’s degree curriculum withdraw without receiving a degree.

For-profit colleges average a 19.2% profit on their revenue, spend 22.7% of their revenue on marketing, and only 17.2% on instruction. Notably, the figures for publicly traded schools were worse than those for privately held for-profit colleges.

A big driver: Federal aid.

Credit: Paso Robles Winery

In 2009-2010 alone, fully a quarter of Department of Education funding (which only discriminated between colleges on the basis of accreditation) went to for-profit colleges. For-profit schools, unlike community colleges, have substantial financial aid offices that are adept at matching eligible students with federal student aid. The inquiry has been running for two years now, yielding testimony to extensive fraud and an industry that consistently places profit above education and student outcomes.

For me there are two key take-aways. First, for-profit colleges are profitable because they are serving a huge portion of the underrepresented, economically disadvantaged, and minority students who are eligible for various federal aid programs. It would be a tremendously valuable public service if they did this while providing a solid education and finished degrees (and the committee singled out schools like Strayer, Walden, and National American University that are doing just that).

Second, the findings should amplify criticism of the trend to model higher-ed’s administrative structures on corporate governance. For-profit colleges should be seen as a test case for whether market forces and the profit motive can serve the public good more efficiently. The committee has found repeatedly that raises in tuition are considered with a view toward profit rather than the cost of instruction. Right now, the primary service of for-profit colleges is to siphon federal funding to shareholders while burying disadvantaged citizens under mountains of debt.


Student Debt Headlines and the Humanities Degree

Credit: Paso Robles Winery

The New York Times and the Chronicle have mounted a one-two punch this week to publicize the problem of student debt (something I’ve posted on previously here and here and here). In an extensive front-page story from Sunday, the Times lays out the problem from the perspective of undergraduates, and the Chronicle followed up today with an analysis of graduate degree recipients, particularly Ph.D.s working as adjunct faculty, who need welfare and other public-service benefits to make ends meet. Both provide extensive analysis; the Times piece in particular leverages an extraordinary amount of statistical support.

The implications of the problem are broad, now that more than half of graduating high-school seniors are going to college. If college is fast becoming a prerequisite of stable middle-class employment, it seems clear we need policies that provide education without burying the students under a lifetime of debt. I was fortunate enough to attend a state school on scholarship; many of my colleagues won’t finish paying off their educational loans for a decade or two.

