Michael Hart - Interview on the Future of Libraries

Michael Hart invented the eBook and founded Project Gutenberg, one of the world's largest online collections of free eBooks. Michael, what was your inspiration for starting Project Gutenberg?

Even I would have to say the whole thing was very serendipitous. I learned a bit about one of our local mainframes simply because a good friend was one of the operators, so I used to hang out in the computer room a lot, doing my homework in air conditioned comfort, and because it was closer than going back to my place.

One day I saw one of our favorite patrons come to the little stainless steel window, and because it was too busy a time for anyone to load and run it, he couldn't get his program run. I volunteered. Everyone, even my best friends, looked at me somewhat in shock. I said, "It is not that hard, is it?" So they asked me how I would do it, then they decided that this was not going to kill the computer, so they let me try. It all worked fine, and I was eventually kind of the first "hitchhiker" on the Internet.

Eventually, however, the big boss operator got worried that I would actually do damage, and insisted that I be given my own account and password; that way I wouldn't always be logged in with operators' privileges, able to delete all the files, he he. It just so happened that the day this account came through was July 4, 1971, complete with more money in it than I ever dreamed of, something like $100,000. It hit me that I should do something worthy of this much time and effort it took to give a computer account to me.

Instead of walking home after the fireworks, I camped out in the computer room. I pondered the situation, realizing that the hope of me writing program material that would still be around in a decade was slim to none, so I tried out a few ideas of what I could do that would be worth this kind of investment. I had just learned that we were on the Internet, and that we could send files to Berkeley and Harvard and many places in between, but that no one was doing an introductory message such as "What hath God wrought?" as via telegraph, so I decided to try to come up with something that would last like that.

Well, back in 1971 they were already doing things for the upcoming United States Bicentennial in 1976. Someone had handed me a faux parchment copy of The Declaration of Independence, and, literally, just like in the comics and cartoons, the light went on over my head! I knew that if I typed in "The Declaration of Independence" it would never, ever disappear from the Internet. I typed it in that very night, so technically it was July 5 by then, and the file was available for download after that. That was the beginning of the efforts of Project Gutenberg. Every year some new "History of Democracy" file was added for the rest of the 70s, and not always by me, either. The snowball had started down the mountain.

What are the major differences between Google Book Search and Project Gutenberg?

The major differences are that Project Gutenberg eBooks are yours to own, to edit as you see fit, to create new editions from, to read in any font you choose, in any color you choose, with any margins you choose and a host of other variables that are under your control. With Google's eBooks, it's more like reading over someone's shoulder - you pretty much have to leave most of the control to them. (I hear that Google has constantly been promising changes to some things of this nature, but I haven't actually seen the results.) Also, Google does not provide a catalog of their eBooks. They don't do copyright research on their eBooks. They don't make it trivial to download their books. They don't proofread their eBooks.

Why would proofreading be necessary if Google is just scanning print books into digital form?

Google eBooks do not have the integrity of a single work; they are in fact two independent works, one a graphical/pictorial representation and the other the actual kind of computer text we are all used to in email. These are not at all the same. You can't search the text in a picture. A picture of a book is NOT an eBook. I like to quote Magritte's famous painting here: Ceci n'est pas une pipe (This is not a pipe), under a depicted pipe.

What Google does is to make a quick and easy scan of the books, and then make a quick and dirty OCR pass [Optical Character Recognition]. But then, instead of cleaning up the OCR output as Project Gutenberg does with the army of Distributed Proofreaders, volunteers and others, Google's solution was to write a "fuzzy search engine" that would put up with the sloppy full text file created by their OCR programs.
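To illustrate the difference between cleaning up OCR output and searching around its errors, here is a minimal sketch using Python's standard difflib module. It is not Google's search engine or anything described in the interview; the garbled line (with "m" misread as "rn", a common OCR confusion) and the function names are made up for illustration.

```python
import difflib

# Hypothetical uncorrected OCR output: "Romeo" misread as "Rorneo".
ocr_text = "O Rorneo, Rorneo! wherefore art thou Rorneo?"

def exact_hits(text: str, query: str) -> int:
    """Count literal occurrences of the query, as a plain text search would."""
    return text.count(query)

def fuzzy_hits(text: str, query: str, cutoff: float = 0.6) -> list[str]:
    """Return the words in the text that are merely similar to the query."""
    words = [w.strip(".,!?;:") for w in text.split()]
    return difflib.get_close_matches(query, words, n=len(words), cutoff=cutoff)

print(exact_hits(ocr_text, "Romeo"))  # the exact search misses the garbled name
print(fuzzy_hits(ocr_text, "Romeo"))  # the fuzzy search tolerates the OCR error
```

The fuzzy search finds the passage, but the text itself is still wrong, which matters as soon as someone wants to quote it or cut and paste it.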

This is why they didn't want people to download their files; it would be all too easy to see where the discrepancies were, and to realize that the Google Print Library or Google Book Search wasn't up to snuff when compared to the eBooks that were already populating the Internet.

After Project Gutenberg initially set the standard of 99.9% accuracy for the first edition of an eBook, The Library of Congress raised it for their own later operations to 99.95%, and later Project Gutenberg raised it again to 99.975%, and we are currently working to get to 99.99%. As time goes on, we hope to keep moving to an increasing accuracy level. However, from Google's point of view it is much easier just to leave the quick and dirty OCR output.
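To put those percentages in perspective, the short sketch below converts each accuracy target into an expected number of wrong characters per book. The book length of 500,000 characters is an assumption chosen for illustration; the interview does not give one.

```python
# Assumption: a book of 500,000 characters (not a figure from the interview).
CHARS_PER_BOOK = 500_000

# Accuracy targets quoted in the interview, as percentages.
ACCURACY_LEVELS = [99.9, 99.95, 99.975, 99.99]

def expected_errors(accuracy_percent: float, chars: int = CHARS_PER_BOOK) -> int:
    """Return the expected number of wrong characters at a given accuracy."""
    error_rate = (100.0 - accuracy_percent) / 100.0
    return round(chars * error_rate)

for level in ACCURACY_LEVELS:
    print(f"{level}% accurate: about {expected_errors(level)} errors per book")
```

Under that assumption, moving from 99.9% to 99.99% cuts the expected errors in a book by a factor of ten.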

Project Gutenberg eBooks are designed to be completely searchable end to end with any search engine and usable with any word processor; cut and paste should work into emails, research papers, or what have you - and they would be easy source material for new paper editions or eBooks. Google has intentionally avoided all these valuable considerations for a more "Limited Distribution" philosophy that kept too many of their eBooks out of circulation, from being quoted, etc.

Remember, Google eBooks are TWO entities: one is an unproofread e-text output of their OCR transcription program, the other is a set of graphic files that usually number one for each page, and are much more difficult for the user to download, store, recall, etc. than a plain text eBook with a single book in a single file that can be read, quoted, edited, etc. by a vast majority of the hardware and software combinations out there.

Try downloading "The Balcony Scene" from Shakespeare's Romeo and Juliet - from both Project Gutenberg and Google Book Search, and you'll see.

It will take me only a minute. I got it on the first Google hit by searching "Project Gutenberg" "Romeo and Juliet" "wherefore art thou". Now, just think of how much time it would take to do the same thing from a graphical representation where you had to retype or OCR their files!!! This didn't even take me a minute to find, highlight, cut and paste!!! That is what eBooks should be all about.

You should own your own eBooks, correct any errors you find, cut & paste the entire play into a script for your own production, and so on.

Even with Google Book Search, Project Gutenberg is still necessary, more than ever, because Project Gutenberg wants you to own the library in the same sense you can now own your own computer, or "supercomputer," as most of the computers being sold today would have been considered supercomputers not that long ago. Today you can add a brand new terabyte drive to any computer for under $400 as an internal drive, or a few dollars more in an external box. This is enough to hold a million eBooks without using compression, two and a half million with the best compression.

A million books!
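The figures above imply an average book size, which the following back-of-the-envelope sketch works out. It assumes a decimal terabyte (10^12 bytes), as drive makers count it; the book counts come from the interview, and the constant names are just for illustration.

```python
# Assumption: a "terabyte" drive holds 10**12 bytes (decimal, as drive makers count).
TERABYTE = 10**12

# Book counts quoted in the interview for one terabyte drive.
BOOKS_UNCOMPRESSED = 1_000_000
BOOKS_COMPRESSED = 2_500_000

# Implied average size of one plain text eBook, in bytes.
bytes_per_book = TERABYTE // BOOKS_UNCOMPRESSED

# Implied average size of one eBook with "the best compression", in bytes.
bytes_per_compressed_book = TERABYTE // BOOKS_COMPRESSED

# Implied compression ratio.
compression_ratio = BOOKS_COMPRESSED / BOOKS_UNCOMPRESSED

print(f"Plain text: about {bytes_per_book:,} bytes per book")
print(f"Compressed: about {bytes_per_compressed_book:,} bytes per book")
print(f"Compression ratio: {compression_ratio}:1")
```

In other words, the estimate works out to roughly one megabyte per plain text book, or about 400 kilobytes once compressed.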

And there have been a million books freely available on the Internet, perhaps for the last two years or longer. Before The Gutenberg Press the average person could own zero books. Before Project Gutenberg the average person could own zero libraries, speaking only of the words, of course, not the physical entity or the library staff, etc.

Would you say Project Gutenberg inspired Google Book Search?

I got an email or phone call from most or all of the various eBook projects around the world, asking for advice. This goes all the way back to The World Library [the first eBook CD] which was my honor to present at the 1990 ALA Midwinter Conference in Chicago along with other aspects of eBook presentation, Project Gutenberg, etc. It also includes Voyager, the first model commercial eBook vendor with classic modern books such as Jurassic Park.

I gave them all friendly advice, and if Google had taken my advice the whole legal copyright suit issue probably would not have come up, nor the huge redefinitions of their projects as Google Print Library to Google Book Search, since their early philosophy was definitely not print and not library. I quote from their public response to these issues: "Google Book Search is a means for helping users discover books, not to read them online and/or download them."

It was obviously a commercial endeavor from the outset, and if they had made a serious attempt to show the publishers that they were trying to sell books for them to a wide public audience, perhaps that would have ameliorated situations that soon went out of control and into the lawsuit arena. The trouble is that Google wanted to pretend to be a public library without an equally sufficient effort to BE a public library, so it was all too obvious, sad to say, that their real goal was more that of a commercial library. If they had gone either way, much more public library or much more commercial, or even both, with two separate projects, I think they could have made it.

As it was, there were too many obvious shortcomings, and not just shortcomings in the traditional sense, but obvious attempts to sabotage efforts of readers, so the readers could neither read the average Google book, nor download a small portion to read.

The limitations placed on readers were just too great, and when combined with their huge media campaign of December 14, 2004, in which every medium I could find was saturated with what appeared to be a new public eLibrary from the likes of Oxford, Harvard, Michigan, Stanford, NYPL, etc., well, the seeds of great disappointment were sown.

With free eBooks online from Project Gutenberg, are libraries still important?

It is not that I think libraries are not important; I would like to think that they will change as much due to eBooks as they did via The Gutenberg Press (i.e., there will be more eBooks in the first 50 years after eBooks showed up than paper books, just as when Gutenberg books arrived). In addition, with the advent of RAMsticks, thumbdrives, etc. and terabyte hard drives for under $400, I think "Personal Computers" are rapidly evolving into "Personal Libraries."

Personally, I think libraries containing music, movies, etc., as they have for decades now, are no different than libraries containing eBooks. After all, the discs are the same, only the bits are different. I think that the library should preserve whatever media are currently in use, and this obviously has changed throughout history. I am quite certain the same kinds of conversations took place when it was the change from stone tablets to clay tablets, or clay to papyrus, or to linen, rag, or what we call "paper" today. We still use the phrase "written in stone" to emphasize a huge truth, but when is the last time you heard of anyone really going to stones when searching for the original text?

We have "books" from the Ancient Egyptians, Greeks, Romans, Chinese - not to mention that so many of these came to us through Arabic, not to mention the great libraries of the Moors in Spain that gave us so many of the classics in science and literature. However, the authors of those "books" would not recognize our books just as books in the future will not be as physical as before - just as our books today aren't as physical as ancient tablets, scrolls, etc.

How will libraries change in the next 50 years?

The changes will be so great, just as the changes from The Gutenberg Press were, that there will be forces at work beyond what perhaps anyone other than myself will even try to say today. However, my own predictions are measured in terms of 2021, which is the 50th anniversary of doing The Declaration of Independence. Let's just say that there will be 10 million public domain eBooks, if not more, by 2021.

Now here's the kicker: automated translation is a new factor that will be introduced by 2021, one that will convert those 10 million public domain eBooks into 100 languages, thus creating, in terms of sheer number of volumes, a library larger than any the world has yet considered, A Billion Book Library, and the eBooks in this library will all be free of charge, if I can manage it. Of course, by then I predict there will also be petabyte drives at a cost the average computer owner can afford, that will hold these billion books, should anyone want to own the entire library.
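The arithmetic behind the Billion Book Library can be sketched as follows. The book and language counts are the ones given above; the average size of one megabyte per book is an assumption derived from the earlier remark that a terabyte drive holds a million uncompressed eBooks, and a petabyte is taken as 10^15 bytes.

```python
# Figures from the interview.
PUBLIC_DOMAIN_EBOOKS = 10_000_000
LANGUAGES = 100

# Assumption: about 1,000,000 bytes per plain text book, implied by the
# interview's "a million eBooks on a terabyte drive" estimate.
BYTES_PER_BOOK = 1_000_000

# Assumption: a decimal petabyte.
PETABYTE = 10**15

# One volume per book per language.
total_volumes = PUBLIC_DOMAIN_EBOOKS * LANGUAGES

# Storage needed for the whole library, without compression.
total_bytes = total_volumes * BYTES_PER_BOOK
petabytes_needed = total_bytes / PETABYTE

print(f"Volumes: {total_volumes:,}")
print(f"Storage: {petabytes_needed} petabyte(s) uncompressed")
```

Under those assumptions the entire translated library fits on a single petabyte drive, which is why the two predictions go together.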

Of course, at this moment, there is more resistance to translation by machine than I would have liked to see, and even more to adding 100 languages to the list they are working on. But I have my hopes, and I intend to do all I can to encourage the machine translation industry to use Project Gutenberg eBooks for a test bed, and to encourage each project to add one more language.

And how will these changes affect librarians?

Well, I can remember when librarians thought film was the greatest and latest thing they had to offer, and when many of them refused, or were simply unable, to work the various projectors, library "audio-visual" assistants were added. I can also recall when this same sort of thing happened with the telephone, and certain members of the library staff who were more comfortable with the telephones ended up taking care of that end of things. Most people don't see that kind of turmoil, but it happens with almost every huge change in technology; it's just usually kept behind the scenes.

However, the computer revolution is going even faster than phones, movies, or anything other than music's switch from vinyl to CD and the result is that these changes are becoming more obvious.

No one I know has been talking about terabyte drives, and I say it loudly and clearly that they should have been, and should talk now about how long it will be to petabyte drives. Why? Because every word ever published could be stored on one petabyte. Now that is a library! Point is that individuals can download a million book files today, without undue expense to the average household budget.

Thank you very much, Michael, for taking the time to share your thoughts with us. You can learn more about downloading free eBooks online at Project Gutenberg.
