Traditional search engines create their indices by spidering or crawling surface Web pages. To be discovered, a page must be static and linked to other pages. Traditional search engines cannot "see" or retrieve content in the deep Web; those pages do not exist until they are created dynamically as the result of a specific search. Because traditional search-engine crawlers cannot probe beneath the surface, the deep Web has heretofore been hidden.

The deep Web is qualitatively different from the surface Web.

Deep Web sources store their content in searchable databases that only produce results dynamically in response to a direct request. But a direct query is a "one at a time" laborious way to search. BrightPlanet's search technology automates the process of making dozens of direct queries simultaneously using multiple-thread technology and thus is the only search technology, so far, that is capable of identifying, retrieving, qualifying, classifying, and organizing both "deep" and "surface" content. If the most coveted commodity of the Information Age is indeed information, then the value of deep Web content is immeasurable.
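The multiple-thread approach described above can be sketched as follows. This is a minimal illustration, not BrightPlanet's actual implementation: the source names and the `direct_query` stub are hypothetical stand-ins for posting a real form request to each searchable database.

```python
from concurrent.futures import ThreadPoolExecutor

def direct_query(source, term):
    # Hypothetical stand-in: a real implementation would submit the
    # search form of one deep Web database over HTTP and parse the reply.
    return {"source": source, "term": term, "results": [f"{source}:{term}:hit"]}

def query_all(sources, term, max_threads=8):
    """Issue the same direct query to many sources simultaneously,
    rather than one laborious query at a time."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(lambda s: direct_query(s, term), sources))

hits = query_all(["patents", "sec-filings", "census"], "fiber optics")
```

The thread pool turns dozens of sequential form submissions into one parallel batch, which is the essence of the directed-query approach.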

With this in mind, BrightPlanet has quantified the size and relevancy of the deep Web in a study based on data collected between March 13 and 30, 2000.

To put these findings in perspective, a study at the NEC Research Institute [1], published in Nature, estimated that the search engines with the largest number of Web pages indexed, such as Google or Northern Light, each index no more than sixteen percent of the surface Web. Because they are missing the deep Web, Internet searchers using such engines are therefore searching only a small fraction of the Web's total content. Clearly, simultaneous searching of multiple surface and deep Web sources is necessary when comprehensive information retrieval is needed.

Internet content is considerably more diverse, and its volume certainly much larger, than commonly understood. First, much Internet content travels over non-Web protocols; this paper does not consider those non-Web protocols further. Second, even within the strict context of the Web, most users are aware only of the content presented to them via search engines such as Excite, Google, AltaVista, or Northern Light, or search directories such as Yahoo!. Eighty-five percent of Web users use search engines to find needed information, but nearly as high a percentage cite the inability to find desired information as one of their biggest frustrations.

The importance of information gathering on the Web and the central and unquestioned role of search engines — plus the frustrations expressed by users about the adequacy of these engines — make them an obvious focus of investigation. Until Van Leeuwenhoek first looked at a drop of water under a microscope in the late 1600s, people had no idea there was a whole world of "animalcules" beyond their vision.

Deep-sea exploration in the past thirty years has turned up hundreds of strange creatures that challenge old ideas about the origins of life and where it can exist. Discovery comes from looking at the world in new ways and with new tools. The genesis of the BrightPlanet study was to look afresh at the nature of information on the Web and how it is being identified and organized.

Search engines obtain their listings in two ways: authors may submit their own Web pages, or the search engines "crawl" or "spider" documents by following one hypertext link to another. The latter returns the bulk of the listings. Crawlers work by recording every hypertext link in every page they index while crawling. Like ripples propagating across a pond, search-engine crawlers are able to extend their indices further and further from their starting points.
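The link-following mechanism can be illustrated with a toy breadth-first crawler. The page names and HTML snippets below are invented for the sketch; a real crawler fetches pages over HTTP and uses a proper HTML parser rather than a regular expression.

```python
import re
from collections import deque

# Toy static "Web": page name -> HTML body (hypothetical data).
PAGES = {
    "home": '<a href="about">About</a> <a href="news">News</a>',
    "about": '<a href="home">Home</a>',
    "news": '<a href="about">About</a> <a href="archive">Archive</a>',
    "archive": "",
}

def crawl(start):
    """Breadth-first crawl: index each page, record its links, follow them."""
    index, frontier = {}, deque([start])
    while frontier:
        page = frontier.popleft()
        if page in index or page not in PAGES:
            continue
        links = re.findall(r'href="([^"]+)"', PAGES[page])
        index[page] = links      # record every hypertext link on the page
        frontier.extend(links)   # ripple outward from the starting point
    return index
```

Note what the sketch cannot do: a page reachable only through a search form never appears in `PAGES`'s link graph, so it is never indexed — precisely the deep Web limitation discussed here.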

The surface Web contains an estimated 2.5 billion documents. Legitimate criticism has been leveled against search engines for these indiscriminate crawls, mostly because they provide too many results (search on "Web," for example, with Northern Light, and you will get about 47 million hits). Also, because new documents are found from links within other documents, documents that are cited are more likely to be indexed than new documents — up to eight times as likely. To overcome these limitations, the most recent generation of search engines, notably Google, have replaced the random link-following approach with directed crawling and indexing based on the "popularity" of pages.

In this approach, documents more frequently cross-referenced than other documents are given priority both for crawling and in the presentation of results. This approach provides superior results when simple queries are issued, but exacerbates the tendency to overlook documents with few links.
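Popularity-directed crawling can be approximated by counting inbound links and visiting the most-cited pages first. The link graph below is hypothetical, and real engines use far more sophisticated measures (link weighting, iteration), but the sketch shows both the benefit and the drawback described above: heavily cross-referenced pages come first, and sparsely linked pages sink to the bottom.

```python
import heapq
from collections import Counter

# Hypothetical link graph: page -> pages it links to.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["b"], "d": ["c"]}

def crawl_order(graph):
    """Visit pages most-cited-first: a crude popularity-directed schedule."""
    inbound = Counter(dst for targets in graph.values() for dst in targets)
    # Max-heap via negated counts: frequently cross-referenced pages first.
    heap = [(-inbound[p], p) for p in graph]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]
```

Here "c" (three inbound links) is crawled first, while "a" and "d" (no inbound links) are deferred — exactly the tendency to overlook documents with few links.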

And, of course, once a search engine needs to update literally millions of existing Web pages, the freshness of its results suffers. Numerous commentators have noted the increased delay in posting and recording new information on conventional search engines. Moreover, consider again the premise of how a search engine obtains its listings in the first place, whether adjusted for popularity or not.

That is, without a linkage from another Web document, a page will never be discovered. This dependence on the Web's linkages to identify what is on the Web is the main failing of search engines. Figure 1 is a graphical representation of the limitations of the typical search engine. The content identified is only what appears on the surface, and the harvest is fairly indiscriminate. There is tremendous value that resides deeper than this surface content.

The information is there, but it is hiding beneath the surface of the Web. How does information appear and get presented on the Web? In the earliest days of the Web, there were relatively few documents and sites. It was a manageable task to post all documents as static pages. Because all pages were persistent and constantly available, they could be crawled easily by conventional search engines. In July 1994, the Lycos search engine went public with a catalog of 54,000 documents.

Sites that were required to manage tens to hundreds of documents could easily do so by posting fixed HTML pages within a static directory structure. However, beginning about 1996, three phenomena took place.

First, database technology was introduced to the Internet. Second, the Web became commercialized, initially via directories and search engines, but rapidly evolving to include e-commerce. Third, Web servers were adapted to serve pages dynamically from underlying databases. This confluence produced a true database orientation for the Web, particularly for larger sites. It is now accepted practice that large data producers such as the U.S. Census Bureau, Securities and Exchange Commission, and Patent and Trademark Office, not to mention whole new classes of Internet-based companies, choose the Web as their preferred medium for commerce and information transfer.

What has not been broadly appreciated, however, is that the means by which these entities provide their information is no longer through static pages but through database-driven designs. It has been said that what cannot be seen cannot be defined, and what is not defined cannot be understood.
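A database-driven design can be sketched in a few lines. The table and company names below are invented for illustration; the point is that the result "page" is assembled only in response to a specific query, so no static URL exists beforehand for a crawler to follow.

```python
import sqlite3

# Hypothetical searchable database behind a query form.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE filings (company TEXT, year INTEGER)")
db.executemany("INSERT INTO filings VALUES (?, ?)",
               [("Acme", 1999), ("Acme", 2000), ("Globex", 2000)])

def render_results(company):
    """Generate the result page dynamically from the database."""
    rows = db.execute(
        "SELECT year FROM filings WHERE company = ?", (company,)).fetchall()
    return "<ul>" + "".join(f"<li>{company} {y}</li>" for y, in rows) + "</ul>"

page = render_results("Acme")  # this HTML did not exist before the query
```

Because the HTML is created on demand, a link-following crawler that never submits the form never sees it — which is why database-driven content remains deep.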

Such has been the case with the importance of databases to the information content of the Web. And such has been the case with a lack of appreciation for how the older model of crawling static Web pages — today's paradigm for conventional search engines — no longer applies to the information content of the Internet. In 1994, Dr. Jill Ellsworth first coined the phrase "invisible Web" to refer to information content that was "invisible" to conventional search engines. For this study, we have avoided the term "invisible Web" because it is inaccurate. The only thing "invisible" about searchable databases is that they are neither indexable nor queryable by conventional search engines.

Using BrightPlanet technology, they are totally "visible" to those who need to access them. Figure 2 represents, in a non-scientific way, the improved results that can be obtained by BrightPlanet technology. By first identifying where the proper searchable databases reside, a directed query can then be placed to each of these sources simultaneously to harvest only the results desired — with pinpoint accuracy. Additional aspects of this representation will be discussed throughout this study.

For the moment, however, the key points are that content in the deep Web is massive — approximately 400 to 550 times greater than that visible to conventional search engines — with much higher quality throughout. BrightPlanet's technology is uniquely suited to tap the deep Web and bring its results to the surface.

The simplest way to describe our technology is as a "directed-query engine." Like any newly discovered phenomenon, the deep Web is just being defined and understood. Daily, as we have continued our investigations, we have been amazed at the massive scale and rich content of the deep Web. This white paper concludes with requests for additional insights and information that will enable us to continue to better understand the deep Web. This paper does not investigate non-Web sources of Internet content.