ReCAPTCHA-ing old books

Digital is a place everyone seems to be going. Since Gutenberg’s modern press, printing presses across the world have churned out copious amounts of content (estimates suggest printing revenues in excess of $1 trillion). Digitizing all of it is a Herculean effort.

Publishers have proactively included digital formats when publishing new content and are actively digitizing their archives and backlists. However, this is only a small subset of the content that exists. Companies have already started addressing content held in library archives or as manuscripts, research papers, etc. Google, for example, has partnered with the Committee on Institutional Cooperation (CIC) to digitize collections. These collections will then be available on Google Books.

Another organization, the Internet Archive, recently digitized 23,000 books for the University of Illinois. Many such organizations offer digitization and conversion services. Printers, for example, offer their clients options to convert existing content to digital-ready formats. But of all these, I would like to highlight the contribution made by ‘reCAPTCHA’, an anti-spam tool. Developed by the School of Computer Science at Carnegie Mellon University, reCAPTCHA uses scanned words from old books, newspapers and radio shows instead of random words.

Wait, what is CAPTCHA?

CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) is a tool that protects websites and web pages from spam generated by bots. It creates a test that humans can pass but computer programs can’t.

[Image: Dont Type. Source: blog.recaptcha.net]

So, how is reCAPTCHA different?

reCAPTCHA helps digitize old books, newspapers, and radio shows. The pages are photographed and run through Optical Character Recognition (OCR). On material like this, OCR accuracy is low and human intervention is needed to get it right, and this is where reCAPTCHA is so innovative. It uses the scanned text as the CAPTCHA and builds a system that validates how the text has been converted. So basically, you’re just typing a few words to prove your humanity to the internet, and a wonderful by-product is that you’re helping convert old texts to digital format in the process!
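To make the idea of "validating how the text has been converted" a little more concrete, here is a minimal sketch in Python. It is not reCAPTCHA's actual code. It assumes each challenge pairs one word whose answer is already known (a control word) with one word OCR could not read, and that a reading is accepted once enough people agree on it. The function names, the `VOTES_NEEDED` threshold and the data layout are all made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical threshold: how many matching human answers we want
# before trusting a reading. The real system's rule is not given here.
VOTES_NEEDED = 3

# Maps an unknown word's id to a tally of the readings people typed.
votes = defaultdict(Counter)


def submit(control_answer, control_truth, unknown_id, unknown_answer):
    """Check one solved challenge; return True if the user passes."""
    # The control word has a known answer, so it decides human vs. bot.
    if control_answer.strip().lower() != control_truth.lower():
        return False
    # The user passed, so their reading of the unknown word counts as a vote.
    votes[unknown_id][unknown_answer.strip().lower()] += 1
    return True


def accepted_reading(unknown_id):
    """Return the agreed reading of an unknown word, or None if undecided."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= VOTES_NEEDED else None
```

With this sketch, a wrong answer on the control word is rejected and casts no vote, while three people typing the same reading for word `"w1"` would make `accepted_reading("w1")` return that reading.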

Estimates by the team suggest that 200 million CAPTCHAs are solved around the world every day, taking about 150,000 man-hours in total. If a book has 100,000 words in it (the average size of a novel) and each solved CAPTCHA contributed one word, that volume of solving would cover the whole book in under a minute… I wonder how many publishers can claim the same?!
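For anyone who wants to check the back-of-the-envelope numbers, here is a small Python sketch. The three figures come straight from the estimates above. The one-word-per-CAPTCHA constant is my own simplifying assumption; if several people have to agree on each word, the time scales up accordingly.

```python
CAPTCHAS_PER_DAY = 200_000_000   # from the team's estimate
HUMAN_HOURS_PER_DAY = 150_000    # from the team's estimate
WORDS_PER_BOOK = 100_000         # average novel, as above
WORDS_PER_CAPTCHA = 1            # assumption: one new word per solved CAPTCHA


def seconds_per_captcha():
    """Average human time spent on a single CAPTCHA, in seconds."""
    return HUMAN_HOURS_PER_DAY * 3600 / CAPTCHAS_PER_DAY  # 2.7 seconds


def minutes_to_digitize(words=WORDS_PER_BOOK):
    """Minutes of worldwide CAPTCHA solving needed to cover `words` words."""
    captchas_per_minute = CAPTCHAS_PER_DAY / (24 * 60)
    return words / (captchas_per_minute * WORDS_PER_CAPTCHA)  # 0.72 minutes


if __name__ == "__main__":
    print(f"{seconds_per_captcha():.1f} seconds per CAPTCHA")
    print(f"{minutes_to_digitize() * 60:.0f} seconds per book")
```

Under these assumptions each CAPTCHA costs a person about 2.7 seconds, and a 100,000-word book is covered in roughly 43 seconds of worldwide solving, so on the order of a minute.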

Do leave a comment if you know of any other such unique tool to digitize books.

Last 5 posts by Vivek