Uncharted: Big Data as a Lens on Human Culture by Erez Aiden & Jean-Baptiste Michel

The New Dictionary

  • Blessed with such a huge data set, the authors had to figure out how to use it. One problem was that most books published after 1920 are still under copyright. Their challenge was to come up with a plan that would protect authors’ rights, be interesting, not offend Google, and be doable. Their n-gram system contains a record for every word and phrase. A single word is a 1-gram, a two-word phrase is a 2-gram, and so on. When they limited their data to words that appear at least once in every billion words, they ended up with about one million. This is more than twice the number of words in the Oxford English Dictionary, the world’s largest at about 446,000. The authors estimate that 52% of the English language is what they call lexical dark matter: words that appear in books but not in standard dictionaries. They also noticed that the number of words doubled between 1950 and 2000. This may be due to our shrinking world, the explosion of scientific words, and the fact that just about anyone can write a book nowadays.
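To make the n-gram idea concrete, here is a minimal sketch in Python. It is not the authors' actual pipeline: the function names `ngrams` and `frequent_words`, the whitespace tokenization, and the lowercase folding are all assumptions made for illustration. It shows how 1-grams and 2-grams can be pulled from a text and how a "once per billion words" cutoff could be applied to word counts.

```python
from collections import Counter

def ngrams(text, n):
    """Return the n-word phrases (n-grams) in text, splitting on whitespace.

    Hypothetical helper: real corpora need far more careful tokenization.
    """
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequent_words(text, min_per_billion=1.0):
    """Keep the 1-grams that occur at least min_per_billion times per billion words."""
    counts = Counter(ngrams(text, 1))
    total = sum(counts.values())
    return {w for w, c in counts.items() if c / total * 1e9 >= min_per_billion}

sample = "the cat sat on the mat"
print(ngrams(sample, 1))  # the individual words (1-grams)
print(ngrams(sample, 2))  # each adjacent pair of words (2-grams)

# The size comparison from the summary: about one million words in the
# data versus about 446,000 in the Oxford English Dictionary.
print(round(1_000_000 / 446_000, 2))
```

On a sample this small every word clears the default cutoff, so the filter only becomes meaningful on a corpus of billions of words.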

Cutting the Crap and Measuring Fame

  • This chapter starts with the authors’ efforts to clean the crap out of their data mountain. Like real libraries, they found that their card catalogs had many errors. In addition to the cleanup, they had to convince Dan Clancy, the head of Google Books, to give them access. What it took was a thirty-minute meeting with Clancy and the famous scientist Steven Pinker. Without Pinker’s fame, the meeting probably would not have happened.
  • Once they realized the importance of fame, they started to study it in their data set. They use the metaphor of the wind tunnel to explain how their approach is useful yet less than perfect. It was the wind tunnel the Wright Brothers built in their bicycle shop that allowed them to design efficient wings. Even though their wind tunnel data wasn’t perfect, it was still very useful. The authors acknowledge that their own efforts contain systematic error. For example, their system rates psychologist Carol Gilligan as more famous than the actor Robert Redford. This is because a psychologist like Gilligan is far more likely to be mentioned in books than an actor.
  • Just as cohorts can be used to solve medical problems, the authors believe that their study of fame has its useful aspects. Their discussion of the dynamics of fame is well worth reading. They found that as time goes by, people tend to get famous at an earlier age and are also forgotten faster. It is also obvious that their system cannot differentiate between fame and infamy. Unfortunately, the best way to get famous fast is to commit an extreme act of evil.
DrDougGreen.com     If you like the summary, buy the book