I’ve been playing around with the new Google n-gram edition. They’ve added some new tools: wildcards, part-of-speech tagging, mathematical operators. Ben Zimmer posted a useful summary at The Atlantic. They’ve also done a good job of cleaning up dating, the medial “s” problem and other OCR errors. Back in 2011 I gave a presentation with three slides comparing the native n-gram plots (“comparison” vs. “comparifon,” to highlight one of these errors) with my own plot of the Google 1-gram data, which corrected for the OCR problems and case sensitivity.
First, comparison vs. comparifon with the earlier data set:
And finally, my ca. 2011 plot of the adjusted counts:
Two years later, with the new revision to Google’s n-gram viewer dataset, here’s how the same search comes out natively:
They aren’t exactly the same (for one, my normalized plot pulls together all hits for the lemma “compar-”), but I don’t know whether to be more relieved that my graph ended up reflecting the current data or excited that the Google folks have done so much to improve the search.
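For anyone curious what that kind of adjustment involves, here is a minimal sketch of the general idea, not the code behind my 2011 plot. It assumes the 1-gram data has already been read into (token, year, count) tuples; the function names, the tiny vocabulary, and the sample counts are all invented for illustration.

```python
from collections import defaultdict


def normalize(token, vocabulary):
    """Lowercase a token and undo a single long-s misread ('f' for 's')
    when doing so produces a word in the vocabulary."""
    word = token.lower()
    if word in vocabulary:
        return word
    for i, ch in enumerate(word):
        if ch == "f":
            candidate = word[:i] + "s" + word[i + 1:]
            if candidate in vocabulary:
                return candidate
    return word  # nothing better found; keep the lowercased form


def adjusted_counts(rows, stem, vocabulary):
    """Sum counts per year over every normalized token that starts with stem."""
    totals = defaultdict(int)
    for token, year, count in rows:
        if normalize(token, vocabulary).startswith(stem):
            totals[year] += count
    return dict(totals)


# Invented sample rows, for illustration only.
SAMPLE_ROWS = [
    ("comparison", 1750, 10),
    ("comparifon", 1750, 40),
    ("Comparison", 1750, 5),
    ("compare", 1750, 20),
    ("company", 1750, 99),
]
VOCABULARY = {"comparison", "compare"}

print(adjusted_counts(SAMPLE_ROWS, "compar", VOCABULARY))
```

The point of folding everything into one stem is that the OCR variant and the capitalized form stop competing with the modern spelling and simply add to the same yearly total.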
You can also use various mathematical operators. For instance, you can track the transition between related terms or phrases. Here I’ve combined the more modern genre descriptors “historical fiction” and “historical novel” and divided by Walter Scott’s preferred term, “historical romance,” searched as ((historical fiction)+(historical novel))/(historical romance):
It helps demonstrate that the two more modern descriptors were roughly on par with Scott’s until the later nineteenth century, when “historical romance” began to fall into disuse. Note that this kind of search assumes that relative changes are meaningful (even if, in fact, historical trends are specific to only one set of search terms).
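The arithmetic behind a composed search like ((historical fiction)+(historical novel))/(historical romance) is simple enough to sketch. What follows is an illustration of the calculation, not a description of how Google evaluates it; it assumes each phrase is available as a year-to-relative-frequency mapping, and the function name and the numbers are made up.

```python
def combined_ratio(numerators, denominator):
    """For each year present in every series, return
    (sum of the numerator series) / (denominator series).
    Years where the denominator is zero are skipped."""
    shared_years = set(denominator).intersection(*(set(s) for s in numerators))
    return {
        year: sum(series[year] for series in numerators) / denominator[year]
        for year in sorted(shared_years)
        if denominator[year] > 0
    }


# Invented frequencies, for illustration only.
historical_fiction = {1820: 1.0, 1900: 6.0}
historical_novel = {1820: 1.0, 1900: 4.0}
historical_romance = {1820: 2.0, 1900: 2.0}

ratio = combined_ratio([historical_fiction, historical_novel], historical_romance)
print(ratio)
```

A value near 1 means the combined modern terms and “historical romance” are about equally common that year; values above 1 mean the modern terms have pulled ahead.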
I’m particularly excited about the wildcard operator (*), and the window into contextual use that it offers. For instance, here are the top bigrams for “comparative *” for both the broad English corpus and English fiction:
What you can see is that, within works of fiction, “comparative” becomes primarily a way to evaluate sentiments and characterizations, whereas in the broader English corpus, it is largely a term of art. (N.B.: David Ricardo has had quite an afterlife.)
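If you have raw bigram counts on hand, a wildcard search like “comparative *” amounts to summing counts per second word and ranking them. Here is a rough sketch under that assumption; the function name and sample counts are hypothetical, and it says nothing about how the viewer itself computes the ranking.

```python
from collections import Counter


def top_completions(bigram_rows, first_word, n=10):
    """Rank the words that follow first_word by total count.

    bigram_rows is an iterable of (bigram, count) pairs, where each
    bigram is two words separated by a single space."""
    totals = Counter()
    for bigram, count in bigram_rows:
        parts = bigram.split(" ")
        if len(parts) != 2:
            continue  # ignore anything that is not a two-word entry
        first, second = parts
        if first.lower() == first_word.lower():
            totals[second.lower()] += count
    return totals.most_common(n)


# Invented counts, for illustration only.
SAMPLE_BIGRAMS = [
    ("comparative advantage", 50),
    ("comparative anatomy", 30),
    ("comparative ease", 5),
    ("relative ease", 100),
    ("Comparative advantage", 10),
]

print(top_completions(SAMPLE_BIGRAMS, "comparative", n=2))
```

Running the same ranking over two different corpora (say, all English vs. English fiction) is what makes the contrast in usage visible.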
Obviously, it would be nice if you could combine operators, wildcards, primitives, and case sensitivity. But I assume you can’t because what you are really doing is various sorting operations over a BigTable, and each of these methods reflects a different sort strategy (though I can’t find confirmation that this is how Google handles the n-gram data).
UnProTip: The “embed” code for Google doesn’t offer the customizability of most media players, and the fonts get mushy in the post format of most blogs (it really needs about 1000 pixels of width to work well). I’m still shrinking the browser about 60% after rendering the n-gram and doing a selected screen grab to keep things looking (relatively) clean. The drawback is that you lose the ability to highlight specific plot lines (though you can use the embed code to create a hyperlink to that graph).