Google this: Number of books in the world

Google has the answer to a lot of the questions we ask, and it looks like it has the answer to this one too.

The company has been trying to digitize books through its Google Books initiative for a few years now, so it is surprising that they didn’t answer this question at the outset.

How did they do it?

There were four classification systems that Google looked at. Each had its own set of shortcomings.

(The following infographics are based on this post by Google)

Classification

Identifying various limitations with other classification systems, Google Books decided to use metadata and compile a list of only unique books. The metadata that Google used was provided by more than 150 providers.

Sorting the metadata

The weeding out of duplicates and exclusion of non-books left Google with a list of approximately 130 million books – a number I feel is definitely going to rise.

Google’s vision with its Google Books project is highly ambitious, to say the least. And considering it plans to digitize all possible books, it is a mammoth task! On the other hand, if there is one company that can do it, it is Google…

An interesting side note:

Last week I wrote a post on reCAPTCHA, a tool that prevents spam and digitizes books. While I am still in the process of getting someone from the reCAPTCHA team to talk to me, I found that Google has bought reCAPTCHA and is using it to enhance its digitizing process.

So, how has reCAPTCHA worked for Google?

Estimates by the reCAPTCHA team suggest that 200 million CAPTCHAs are solved around the world every day, and that solving them takes 150,000 man-hours. At that rate, a book with 100,000 words in it (the average size of a novel) would take a little under a minute to digitize, assuming each solved CAPTCHA yields one word.
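To make that arithmetic easy to check, here is a rough sketch in Python. The daily CAPTCHA count, the man-hours and the book length are the figures quoted above. The one-word-per-CAPTCHA rate is my own simplifying assumption, and the function names are just illustrative.

```python
# Figures quoted from the reCAPTCHA team's estimates
CAPTCHAS_PER_DAY = 200_000_000
MAN_HOURS_PER_DAY = 150_000
WORDS_PER_BOOK = 100_000  # average novel, as assumed in the post

# Assumption (not from the estimates): one solved CAPTCHA = one digitized word
WORDS_PER_CAPTCHA = 1


def seconds_per_captcha():
    """Average human effort spent on a single CAPTCHA, in seconds."""
    return MAN_HOURS_PER_DAY * 3600 / CAPTCHAS_PER_DAY


def recaptcha_minutes_per_book():
    """Minutes of worldwide CAPTCHA solving needed to cover one book."""
    words_per_minute = CAPTCHAS_PER_DAY * WORDS_PER_CAPTCHA / (24 * 60)
    return WORDS_PER_BOOK / words_per_minute


print(f"{seconds_per_captcha():.1f} seconds of effort per CAPTCHA")
print(f"{recaptcha_minutes_per_book():.2f} minutes per book")
```

Under these assumptions the whole world's CAPTCHA solving covers a novel in roughly a minute. If more than one solver is needed per word, the time scales up accordingly.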

If each page in a book had 250 words, Google would need to scan 400 pages of that average-sized novel. Considering the company is capable of scanning 15 pages per minute (900 pages/hour), it would take a little more than 25 minutes to scan the book. Add to it the minute or so required to digitize the content using reCAPTCHA, and Google can create a digital copy in a little less than half an hour. This does not factor in time for loading the book, manual intervention, or the fact that the 100,000 words entered through reCAPTCHA would not all be solved simultaneously. However, these are impressive numbers nonetheless.

So, if Google does manage to digitize a book in half an hour, how long would 130 million take? A little more than 7400 years…
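For anyone who wants to play with the numbers, here is a small Python sketch of the back-of-the-envelope math. The page size, scan speed, book count and half-hour figure are the ones used in this post. The assumption that books are processed one at a time, around the clock, is what the 7400-year figure implies, and the function names are only illustrative.

```python
# Figures used in the post
WORDS_PER_BOOK = 100_000
WORDS_PER_PAGE = 250
PAGES_PER_MINUTE = 15
TOTAL_BOOKS = 130_000_000
MINUTES_PER_BOOK = 30  # the rounded "half an hour" per book


def pages_per_book():
    """Pages in the average novel at 250 words per page."""
    return WORDS_PER_BOOK / WORDS_PER_PAGE


def scan_minutes_per_book():
    """Minutes needed to scan one book at 15 pages per minute."""
    return pages_per_book() / PAGES_PER_MINUTE


def total_years(minutes_per_book=MINUTES_PER_BOOK, books=TOTAL_BOOKS):
    """Years to process every book one after another, 24 hours a day."""
    total_hours = books * minutes_per_book / 60
    return total_hours / 24 / 365


print(f"{pages_per_book():.0f} pages per book")
print(f"{scan_minutes_per_book():.1f} minutes to scan one book")
print(f"{total_years():.0f} years for {TOTAL_BOOKS:,} books")
```

Passing a different `minutes_per_book` or `books` value to `total_years` shows how quickly the estimate shrinks once many books are scanned in parallel.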

I don’t expect Google to wait that long ;)
