“Google Books has digitized over fifteen million books: over 11% of all the books ever published (about 129 million book editions). The Google Books collection contains over five billion pages and two trillion words, with books dating back to as early as 1473 and with text in 478 languages. “Culturomics” is the term describing the analysis of this corpus, enabling us to investigate cultural trends quantitatively and providing insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. “Culturomics” extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.” Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science Express, Published Online 16 December 2010.
“The digitization process. For publisher-provided books, Google removes the spines and scans the pages with industrial sheet-fed scanners. For library-provided books, Google uses custom-built scanning stations designed to impose only as much wear on the book as would result from someone reading the book. As the pages are turned, stereo cameras overhead photograph each page. With stereo scanning, the book is cradled at an angle that minimizes stress on the spine of the book, but the resulting page is curved relative to the plane of the camera. The curvature is measured by projecting a fixed infrared pattern onto each page of the book. Using this curvature information, the scanned image of each page is digitally resampled so that the results correspond as closely as possible to the results of sheet-fed scanning. Details of this approach can be found in U.S. Patents 7463772 and 7508978. Finally, blocks of text are identified and optical character recognition (OCR) is used to convert those images into digital characters and words. Google estimates that over 98% of words are correctly digitized for modern English books.”
John Bohannon, “Digital Data: Google Opens Books to New Cultural Studies,” News of the Week, Science 330: 1600, 17 December 2010.
“By analyzing the growth, change, and decline of published words over the centuries, it should be possible to rigorously study the evolution of culture on a grand scale. For example, the size of the English language has nearly doubled over the past century, to more than 1 million words, which means that roughly 500,000 English words are missed by all dictionaries; moreover, the vocabulary seems to be growing faster now than ever before.
“Both the available data and analytical tools will expand: ‘We’re going to make this as open-source as possible.’ With the study’s publication, Google is releasing the n-gram database for public use. The current version is available at www.culturomics.org.”
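The released n-gram database is, in essence, a table of counts: how often each word or phrase appears in the corpus in each year. A word’s cultural trajectory is then its yearly count divided by the total words printed that year. As a minimal sketch (assuming a simplified tab-separated layout of `ngram, year, match_count, volume_count` per line, which approximates but need not exactly match Google’s released file format):

```python
from collections import defaultdict

def yearly_counts(lines, target):
    """Sum the match counts per year for a target n-gram.

    Assumes each line is: ngram <TAB> year <TAB> match_count <TAB> volume_count
    (a simplified stand-in for the released n-gram file format).
    """
    counts = defaultdict(int)
    for line in lines:
        ngram, year, match_count, _volumes = line.rstrip("\n").split("\t")
        if ngram == target:
            counts[int(year)] += int(match_count)
    return dict(counts)

def relative_frequency(counts, yearly_totals):
    """Normalize per-year counts by the total words printed that year."""
    return {y: counts[y] / yearly_totals[y]
            for y in counts if y in yearly_totals}

# Toy data standing in for real n-gram file lines.
sample = [
    "culturomics\t2009\t40\t12",
    "culturomics\t2010\t120\t30",
    "genome\t2010\t5000\t800",
]
counts = yearly_counts(sample, "culturomics")
freq = relative_frequency(counts, {2009: 1_000_000, 2010: 2_000_000})
```

Normalizing by yearly totals matters because the corpus itself grows enormously over time; raw counts would conflate a word’s popularity with the sheer volume of printing.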
PS (Jan 13, 2011): “Culturomics” is the application of high-throughput data collection and analysis to the study of human culture. Books, newspapers, manuscripts, maps, artwork, and a myriad of other human creations are the evidence in the humanities; the challenge is the interpretation of this evidence. Jean-Baptiste Michel and Erez Lieberman Aiden, mathematicians at Harvard University, lead a research team focused on culturomics, a new field in the social sciences proposing a data-intensive approach to the humanities. The Google Books corpus of digitized texts contains over 15 million books [~12% of all books ever published]. Michel et al. have analyzed only 5,195,769 digitized books [~4% of all books ever printed]. Analysis of this corpus enables us to investigate cultural trends quantitatively. This approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology.

Moreover, the data mining of Google’s digitized books database can also be used to improve Wikipedia’s entries. Cooperative editing by millions of volunteers has a big problem: the lack of structured data. DBpedia tries to enforce structure onto Wikipedia entries, but that effort is only starting. Automatic culturomics tools may be used to structure and correct Wikipedia’s entries for Internet 3.0. John Bohannon, “Digital Data: Google Books, Wikipedia, and the Future of Culturomics,” Science 331: 135, 14 January 2011, summarizes the technical paper Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331: 176-182, 14 January 2011.
Authors have been writing for millennia; ~129 million book editions have been published since the advent of the printing press. The five million books chosen for computational analysis contain over 500 billion words: English (361 billion), French (45 billion), Spanish (45 billion), German (37 billion), Russian (35 billion), Chinese (13 billion), and Hebrew (2 billion). The oldest works were published in the 1500s. By 1800 the corpus grows to 98 million words per year; by 1900, 1.8 billion; and by 2000, 11 billion.
“In the future, everyone will be famous for 7.5 minutes” – Whatshisname. People, too, rise to prominence, only to be forgotten. Fame can be tracked by measuring the frequency of a person’s name. The analysis of all 42,358 people in the databases of the Encyclopaedia Britannica shows that people are getting more famous than ever before but are being forgotten more rapidly than ever. Science is a poor route to fame. Physicists and biologists eventually reached a similar level of fame as actors, but it took them far longer. Alas, even at their peak, mathematicians tend not to be appreciated by the public.
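The claim that people “are being forgotten more rapidly than ever” can be made operational with the same frequency series used for words: take the yearly frequency of a person’s name, find the peak, and measure how quickly the frequency decays afterward. A minimal sketch of that idea (the function names and the toy series are illustrative, not the paper’s actual method):

```python
def peak_year(series):
    """Return the year in which the name's frequency peaks.

    `series` maps year -> relative frequency of the name in the corpus.
    """
    return max(series, key=series.get)

def years_to_half_peak(series):
    """Years after the peak until the frequency first falls below half
    of its peak value; None if it never does within the data.

    A shorter span here corresponds to being 'forgotten more rapidly'.
    """
    peak = peak_year(series)
    half = series[peak] / 2
    for year in sorted(y for y in series if y > peak):
        if series[year] < half:
            return year - peak
    return None

# Toy series: a figure who peaks in 1920, then fades from print.
fame = {1900: 1e-7, 1910: 4e-7, 1920: 9e-7, 1930: 5e-7, 1940: 3e-7}
peak = peak_year(fame)
decay = years_to_half_peak(fame)
```

Comparing such decay spans across birth cohorts is what lets one say, quantitatively, that fame now arrives faster and fades faster than it did a century ago.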