In Challenge To Google, Yahoo Will Scan Books

Source: New York Times

An unusual alliance of corporations, nonprofit groups and universities plans to announce today an ambitious project to digitize hundreds of thousands of books over the next several years and put them on the Internet, with the full text accessible to anyone.

The effort is being led by Yahoo, which appears to be taking direct aim at a similar project announced by its archrival, Google, whose own program to create searchable digital copies of entire collections at leading research libraries has run into a series of challenges since it was announced nine months ago.

The new project, called the Open Content Alliance, has the wide-ranging goal of digitizing historical works of fiction along with specialized technical papers. In addition to Yahoo, its members include the Internet Archive, the University of California, and the University of Toronto, as well as the National Archive in England and others.

The digitization of print materials has been a continual effort on the part of various research libraries for the last several years. But the potential power of the new collaboration lies in the collective ability of many institutions to compare and cross-reference materials, said Daniel Greenstein, librarian for the California Digital Library at the University of California.

“This is the kind of platform we’ve been looking for for a long time,” said Dr. Greenstein. “Libraries digitize their stuff and put it up, but none of the libraries have comprehensive collections of everything. Now we can say: ‘We have this particular edition of Mark Twain, but it’s not as good as that one over there,’ and we add it to the collection.”

The Library of Congress, for instance, has one of the largest library collections in the world, but even that collection is incomplete. “It’s all about gap-filling and collection development,” said Dr. Greenstein.

Although the new project will not be a direct source of revenue for Yahoo, it could give the company’s search feature more visibility. The announcement also establishes a new round in the battle between Yahoo and Google over index size, the number of documents that can be found in a search engine’s database.

Yet the new project’s approach differs from Google’s in several ways. Once a book has been digitized, Yahoo will integrate the content into its index and provide an engine for the group’s Web site (opencontentalliance.org). “As soon as it’s made available on the O.C.A. Web site, we’ll get a feed letting us know, so it can be indexed by us immediately,” said David Mandelbrot, vice president of search content at Yahoo.

In a departure from Google’s approach, the Open Content Alliance will also make the books accessible to any search engine, including Google’s. (Under Google’s program, a digitized book would show up only through a Google search.) And by focusing at first on works that are in the public domain — such as thousands of volumes of early American fiction — the group is sidestepping the tricky question of copyright violation.

Last month, a group including the Authors Guild, which represents several thousand writers, filed a lawsuit against Google. The suit contends that the company’s program, Google Print, is engaged in copyright infringement because although only text fragments are displayed, a book must be digitized in its entirety to make it searchable.

In August, Google suspended until Nov. 1 its plan to scan copyrighted books, to give authors and other copyright holders the opportunity to opt out of the program; Google refused to give specifics, but said a number of copyright holders had opted out.

Google has defended its practice, arguing that although a copyrighted work is scanned in its entirety, only “snippets” of text are shown in a search result, falling within the fair use provision of copyright law, which allows limited use of such material.

“We believe what we’re doing is fully consistent with copyright law,” said Susan Wojcicki, a vice president at Google.

When it comes to copyrighted materials, the newly formed group appears to be taking a more cautious approach by seeking permission from copyright holders and by making works available through a Creative Commons license, whereby the copyright holder stipulates how a work can be used.

“Other projects talk about snippets,” said Brewster Kahle, the founder of the Internet Archive, a nonprofit organization in San Francisco that is building a vast digital library. “We don’t talk about snippets. We talk about books.”

Dr. Greenstein said that the University of California, which plans to contribute as much as $500,000 to the project in the first year, will scan 5,000 volumes of early American fiction at the outset, with the eventual goal of scanning another 5,000 to 15,000 volumes within the next year. The books will be drawn from the 33 million volumes contained in the university’s 10 libraries.

But Dr. Greenstein said he planned to be selective. “We aren’t planning to march through and scan everything we have,” he said. “Our approach is very collection-focused, to seed meaningful collections and get other libraries around the world to do the same.”

Yahoo did not disclose the overall budget for the project, although its own contribution has been estimated at between $300,000 and $500,000 for the first year. Hewlett-Packard and Adobe Systems are contributing equipment to the project, and the Internet Archive will do the actual digitizing and archiving of the books. The Internet Archive has set up shop at the University of Toronto and has scanned some 2,000 volumes at a cost of about 10 cents a page.

“What’s so interesting about all of this are the collections that can come forward that are relatively specialized,” said Carole Moore, chief librarian at the University of Toronto. “This will put it together on a global scale, which is really exciting.”

Richard Terdiman, a professor of literature at the University of California, Santa Cruz, said the Creative Commons approach appeared less likely to antagonize copyright holders. “Until the Supreme Court decides what is fair use and is the Google model acceptable, they won’t have to spend three-quarters of their time fighting lawsuits.”

In the meantime, there is no shortage of public domain materials. “Whole chunks of libraries are out of copyright,” said Dr. Greenstein, who estimated that some 15 percent of most university library collections no longer have such restrictions.

Peter Givler, the executive director of the Association of American University Presses, who has been an outspoken critic of the Google project, was also more sanguine about the Open Content Alliance. “They want to start working with publishers from the get-go,” said Mr. Givler. “And I certainly like the idea that their index will be searchable by other search engines.”

The new group is calling for others to join. And Mr. Kahle of the Internet Archive said he hoped to recruit Google.

“The thing I want to have happen out of all this is have Google join in,” he said. “I know we’re dealing with archcompetitors, but if there’s room for these guys to bend, by the time my kid goes to college, we could have a library system that is just astonishing.”
