Exploring the deep web

Internal and external federated systems lead users to treasures that regular search engines can’t find

By Drew Robb, Special to GCN

For the past decade, the Energy Department’s Office of Scientific and Technical Information in Oak Ridge, Tenn., has been using the Internet to speed research processes.

“When we first started posting information on the Web in 1997, we relied on search engines provided by the database vendors,” said OSTI Director Walt Warnick. “It soon occurred to us that it would be helpful to provide our patrons with the ability to search across multiple databases at one time.”

That led the agency to install federated search software — a search engine that simultaneously executes a query against a number of databases in real time, then aggregates and ranks those results. In April 1999, OSTI launched the EnergyFiles site (www.osti.gov/EnergyFiles/), providing access to over 500 DOE databases and sites. That was followed in 2002 by Science.gov, which allows a single query to pull data from 30 scientific research databases at 12 federal agencies. February 2007 saw the release of Science.gov 4.0 with greatly enhanced relevance ranking. OSTI is now working to expand the system to include government research sites worldwide.
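The mechanism described above (send one query to several databases at once, then merge and rank what comes back) can be sketched in a few lines. This is only an illustration under stated assumptions, not how OSTI's software works: the three in-memory "databases", the names `SOURCES`, `search_source` and `federated_search`, and the simple term-count scoring are all hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for separate remote databases.
SOURCES = {
    "energy_reports": ["Solar cell efficiency study", "Wind turbine noise survey"],
    "physics_abstracts": ["Solar neutrino flux measurement", "Quantum dot solar cell design"],
    "patents": ["Battery cooling apparatus"],
}


def search_source(name, documents, query):
    """Query one source and return (score, source, title) for each matching title."""
    terms = query.lower().split()
    hits = []
    for title in documents:
        words = title.lower().split()
        # Assumed scoring: count how often the query terms appear in the title.
        score = sum(words.count(term) for term in terms)
        if score > 0:
            hits.append((score, name, title))
    return hits


def federated_search(query, sources=SOURCES):
    """Send the same query to every source in parallel, then merge and rank."""
    with ThreadPoolExecutor(max_workers=max(1, len(sources))) as pool:
        futures = [
            pool.submit(search_source, name, docs, query)
            for name, docs in sources.items()
        ]
        merged = [hit for future in futures for hit in future.result()]
    # Highest score first; ties are broken by source name, then title.
    return sorted(merged, key=lambda hit: (-hit[0], hit[1], hit[2]))


if __name__ == "__main__":
    for score, source, title in federated_search("solar cell"):
        print(score, source, title)
```

A production system would replace `search_source` with network calls to each database's own search interface and would need a far more careful way to compare relevance scores across sources, which is the hard part of ranking federated results.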

“Our mission is to accelerate the spread of knowledge to accelerate the advance of science,” Warnick said. “Federated search is a very useful way for making that happen.”

The deep Web
Google may dominate the search market, but it has two major shortcomings. The first is that it barely accesses what is known as the deep Web, invisible Web or hidden Web: data that is available over the Internet but cannot be indexed by Web crawlers, at least not without Webmasters preparing a text file listing all the entries of that database. Material that resides in these databases can be retrieved only by dispatching a query or by filling out a form.
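A toy model shows why a link-following crawler misses this material. The sketch below is hypothetical throughout: `STATIC_PAGES`, `DATABASE`, `crawl` and `submit_query` are invented names, and the "site" is just two dictionaries. It assumes a crawler that discovers pages only by following links, while the database records exist only as answers to a submitted query.

```python
from collections import deque

# Hypothetical site: static pages that link to one another.
STATIC_PAGES = {
    "/": ["/about", "/search"],
    "/about": ["/"],
    "/search": [],  # the form page links to no individual records
}

# Hypothetical database behind the search form; no page links to these records.
DATABASE = {
    "rec-001": "Geothermal survey 1998",
    "rec-002": "Reactor coolant analysis",
}


def crawl(start="/"):
    """Breadth-first crawl that returns every URL reachable by links alone."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in STATIC_PAGES.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def submit_query(term):
    """Simulate filling out the form: return IDs of records whose title matches."""
    return sorted(
        record_id
        for record_id, title in DATABASE.items()
        if term.lower() in title.lower()
    )


if __name__ == "__main__":
    print(sorted(crawl()))           # only the three static pages
    print(submit_query("reactor"))   # the record appears only in response to a query
```

The crawler reaches the form page but never sees a record, because nothing links to one. A federated search tool takes the second route and submits the query itself.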

“In 2000/2001 we did some analysis and realized that the quantity of documents from these deep-Web databases was far bigger than what everyone was calling the Internet,” said Jerry Tardif, vice president at search firm BrightPlanet.


