Search Engine Legality

October 20, 2005 · 2 comments

The complaint in the case of McGraw-Hill v. Google is available on FindLaw. The most interesting paragraph, in my opinion, is this one:

Google purports to justify its systematic copying of entire books on the ground that it is a necessary step to making them available for searching through, where excerpts from the books retrieved through the search will be presented to the user. Google analogizes the Google Library Project’s scanning of entire books to its reproduction of the content of websites for search purposes. This comparison fails. On the Internet, website owners have allowed their sites to be searchable via a Google (or other) search engine by not adopting one or more technological measures. That is not true of printed books found in library shelves. Moreover, books in libraries can be researched in a variety of ways without unauthorized copying. There is, therefore, no “need,” as Google would have it, to scan copyrighted books.

This is very confused. Let’s start with the business about “technological measures.” I assume that they’re talking about robots.txt, a file that webmasters use to tell search engines which content they are allowed to index. It’s worth noting that robots.txt is an opt-out convention. If a site doesn’t have a robots.txt file, search engines will index it.

So it seems like Google’s approach is entirely consistent with the web-search precedent. Just as robots.txt provides web site publishers with a mechanism for notifying search engines which pages not to index, Google is providing book publishers with a mechanism for indicating which books are not to be indexed. Publishers who fail to provide Google with a list, like webmasters who fail to put up a robots.txt file, can be said to have “allowed” their content to be indexed.
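To make the opt-out point concrete, here is a minimal sketch using the robots.txt parser in Python's standard library. The crawler name `ExampleBot`, the `may_index` helper, and the example.com URLs are made up for illustration, and real crawlers certainly do more than this. The point is only that a site with no rules at all is treated as indexable.

```python
from urllib.robotparser import RobotFileParser


def may_index(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if the given robots.txt text lets `agent` fetch `url`.

    `agent` is a hypothetical crawler name, not a real search engine's.
    """
    parser = RobotFileParser()
    # Parse the rules from a string instead of fetching them over the network.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# A site with no rules: everything is fair game by default (opt-out).
NO_RULES = ""

# A site that has opted one directory out for all crawlers.
OPTED_OUT = "User-agent: *\nDisallow: /private/\n"

print(may_index(NO_RULES, "http://example.com/anything.html"))
print(may_index(OPTED_OUT, "http://example.com/private/diary.html"))
print(may_index(OPTED_OUT, "http://example.com/public.html"))
```

The empty rule set is the interesting case: saying nothing is read as permission, which is the sense in which webmasters who do nothing have "allowed" indexing.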

This paragraph also shows an ignorance of search engine history. Websites, like books, can be “researched in a variety of ways without unauthorized copying.” Indeed, that’s where the search engine industry started. The first major Internet search engine, Yahoo!, was a keyword-based search engine analogous to a card catalog. Sites were added to the directory manually by a human being who would read the web site and write a summary for the directory.

Then in 1995, along came AltaVista, which offered the first full-text search of the web. The results were so obviously superior that Yahoo! licensed the technology in 1996. Soon every search engine had full-text functionality. So web sites, too, can be “researched in a variety of ways without unauthorized copying.”

So the reason that all search engines today make copies of websites isn’t that it’s impossible to index them without doing so. Rather, it’s that full-text searches are vastly superior to the alternatives, and full-text searching is impossible without making a full-text copy.
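A toy example may help show why. The sketch below builds a simple inverted index, which is one common way to do full-text search; I am not claiming it is how AltaVista or Google actually work. The page text, the summaries, and the function names are all invented for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map every word to the set of document ids that contain it.

    The indexer has to read the complete text of each document to do this.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical page text versus a hand-written directory summary.
full_text = {"bakery": "We bake sourdough bread and rye loaves every morning."}
summaries = {"bakery": "A neighborhood bakery."}

full_index = build_index(full_text)
summary_index = build_index(summaries)

print(search(full_index, "sourdough"))
print(search(summary_index, "sourdough"))
```

The index built from the full page finds the bakery for "sourdough"; the one built from the summary does not, because the word never made it into the summary. Getting the first result means the indexer had the whole page in hand.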

If there had been an Association of Web Site Publishers in 1995, they could have made precisely the same argument about AltaVista. Had they prevailed, it’s hard to predict how things would have evolved, but it seems unlikely they would have gone as well. Search engines would have spent a great deal of time contacting and negotiating with web-site owners for permission to include them in their indexes. Some web sites might have signed exclusive deals with a particular search engine, or demanded that search engines pay a fee to include them in their searches. The most comprehensive search engines might have required users to subscribe, as LexisNexis and Factiva do.

Or maybe enlightened webmasters would have realized that search engines were a win-win proposition and permitted them to index their sites. Maybe they would have developed a standard way to indicate permission to index, and things would have evolved about the same way. But regardless, the analogy the publishers are trying to draw is bogus. Full-text searches, whether of books or web sites, require the creation of copies. If Google Print is copyright infringement, then so is Google itself. I hope it’s obvious to everyone that declaring Google illegal would be a bad idea.
