A BRIEF HISTORY OF INFORMATION RETRIEVAL
Almost as soon as textual information was stored on computers, researchers began to investigate how it could be easily retrieved. Significant progress was made in the 1960s, and operational systems were widely available by the 1970s.
The field was reasonably mature by the 1990s, with the primary users being professional librarians and researchers (see Lesk, 1995). By the early 1990s most of the low-hanging fruit had been harvested, and intensive users of information retrieval technology were worried that technological progress was grinding to a halt. This concern led to the creation in 1992 of TREC (the Text REtrieval Conference) by DARPA.
DARPA compiled training data consisting of many queries and many documents, along with a 0-1 indicator of whether or not each document was relevant to each query. Human judges assigned these relevance indicators. Research teams then trained their systems on the TREC data. Subsequently, TREC provided a second set of data for which the research teams tried to forecast relevance using their trained systems. TREC thus provided both a test collection and a forum for the exchange of ideas, and most groups working in information retrieval participated in it (see Voorhees, 1999). Having a standard base for comparing different algorithms was very helpful in evaluating different approaches to the task.
Though search engines use a variety of techniques, one that will be very familiar to economists is logistic regression. One chooses characteristics of the document and the query and then tries to predict the probability of relevance using simple logistic regression. As an example of this approach, Cooper et al. (1993, 1994) used the following variables:
• the number of terms in common between the document and the query;
• log of the absolute frequency of occurrence of a query term in the document averaged over all terms that co-occur in the query and document;
• square root of the query length;
• frequency of occurrence of a query term in the collection;
• square root of the collection size;
• the inverse collection frequency, which is a measure of how rare the term is in the collection.
Other systems use different variables and different forms for predicting relevance, but this list is representative for the time.
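To make the regression approach concrete, the following is a minimal Python sketch, not the specification used by Cooper et al. It computes three simplified features in the spirit of the list above (number of shared terms, mean log term frequency, square root of query length) and fits a logistic regression by plain gradient descent. The function names, the toy query-document pairs and the 0-1 relevance labels are all invented for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def features(query, document):
    """Return [intercept, shared terms, mean log term frequency, sqrt(query length)].

    A hypothetical simplification of the kind of variables listed in the text.
    """
    q_terms = query.lower().split()
    d_terms = document.lower().split()
    common = set(q_terms) & set(d_terms)
    if common:
        mean_log_tf = sum(math.log(d_terms.count(t)) for t in common) / len(common)
    else:
        mean_log_tf = 0.0
    return [1.0, float(len(common)), mean_log_tf, math.sqrt(len(q_terms))]


def fit_logistic(X, y, learning_rate=0.1, epochs=2000):
    """Fit logistic regression weights by batch gradient descent."""
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        gradient = [0.0] * len(weights)
        for x_i, y_i in zip(X, y):
            error = sigmoid(sum(w * x for w, x in zip(weights, x_i))) - y_i
            for j, x_ij in enumerate(x_i):
                gradient[j] += error * x_ij
        weights = [w - learning_rate * g / len(X) for w, g in zip(weights, gradient)]
    return weights


def predict_relevance(weights, x):
    """Predicted probability that the document is relevant to the query."""
    return sigmoid(sum(w * x_j for w, x_j in zip(weights, x)))


# Made-up relevance judgments: (query, document, 0-1 label from a "human judge").
TOY_JUDGMENTS = [
    ("cheap flights paris", "cheap flights to paris and cheap hotels in paris", 1),
    ("cheap flights paris", "history of the roman empire", 0),
    ("logistic regression", "logistic regression predicts a binary outcome with regression", 1),
    ("logistic regression", "recipes for chocolate cake", 0),
    ("web search engine", "a search engine indexes the web for search", 1),
    ("web search engine", "engine repair manual for diesel trucks", 0),
]

X = [features(q, d) for q, d, _ in TOY_JUDGMENTS]
y = [label for _, _, label in TOY_JUDGMENTS]
weights = fit_logistic(X, y)

p_relevant = predict_relevance(weights, X[0])
p_irrelevant = predict_relevance(weights, X[1])
print(round(p_relevant, 3), round(p_irrelevant, 3))
```

In a TREC-style evaluation the fitted weights would be scored on a second, held-out set of judged query-document pairs rather than on the training pairs, as is done here only to keep the sketch short.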
By the mid-1990s it was widely felt that search had become commoditized. There were several algorithms that had roughly similar performance and improvements tended to be incremental. When the web came along in 1995, the need for better Internet search engines became apparent and many of the algorithms developed by the TREC community were used to address this need. However, the challenge of indexing the web wasn’t as compelling to the information retrieval community as one might have thought. The problem was that the web was not TREC. TREC had become so successful in defining the information retrieval problem that most attention was focused on that particular research challenge, to the exclusion of other applications.
The computer scientists, on the other hand, saw the web as the problem du jour. The NSF Digital Library project and other similar initiatives provided funding for research on wide-scale information retrieval. The Stanford computer science department received one of these Digital Library grants, and two graduate students there, Larry Page and Sergey Brin, became interested in the web search problem. They developed the PageRank algorithm, an approach to information retrieval that used the link structure of the web. The basic idea (to oversimplify somewhat) was that sites that had a lot of links from important sites pointing to them were likely to contain relevant information. PageRank was a big improvement on existing algorithms, and Page and Brin dropped out of school in 1998 to build a commercial search engine: Google. The algorithm that Google now uses for search is proprietary, of course. It is also very complex. The basic design combines a PageRank score with an information retrieval score. The real secret to Google’s success is that it is constantly experimenting with the algorithm, adjusting, tuning and tweaking virtually continuously.
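The link-based idea behind PageRank can be sketched with the textbook power-iteration formulation. This is an illustration under stated assumptions, not Google's proprietary algorithm: the damping factor of 0.85, the iteration count, the function name and the four-page toy web are all chosen here for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Returns a dict mapping each page to its score; scores sum to 1.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank on in equal parts to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages all link to "home".
toy_web = {
    "home": ["news", "blog"],
    "news": ["home"],
    "blog": ["home"],
    "orphan": ["home"],
}

scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 4))
```

In this toy web, "home" receives links from every other page and ends up with the highest score, while "orphan", which nothing links to, keeps only the baseline share. A search engine of the kind described above would combine such a link score with a query-dependent information retrieval score.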
One of the tenets of the Japanese approach to quality control is kaizen, which is commonly translated as ‘continuous improvement’. One reason for the rapid pace of technological progress on the web is that it is very easy to experiment, for example by using a new search algorithm for one query out of 1000. If the new algorithm outperforms the old one, it can quickly be deployed. Using this sort of simple experimentation, Google has improved its search engine over the years to offer a highly refined product with many specialized features. Google is hardly the only online business that engages in kaizen; Amazon, eBay, Yahoo! and others are constantly refining their websites. Such refinements are typically based on systematic experimentation and statistical analysis, as in traditional quality control practice.
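The kind of experiment described here, sending one query in 1000 to a new algorithm and comparing outcomes, can be sketched as follows. This is one simple way to do it, not a description of any company's actual practice: the hash-based assignment, the use of click-through rate as the outcome, the two-proportion z-test and all of the counts are assumptions made for illustration.

```python
import hashlib
import math


def assign_arm(query_id, treatment_share=0.001):
    """Deterministically route a share of queries to the new algorithm."""
    digest = hashlib.sha256(query_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 1_000_000
    return "new" if bucket < treatment_share * 1_000_000 else "old"


def two_proportion_z_test(successes_old, n_old, successes_new, n_new):
    """Two-sided z-test for a difference between two proportions.

    Returns (z, p_value); z > 0 means the new arm has the higher rate.
    """
    rate_old = successes_old / n_old
    rate_new = successes_new / n_new
    pooled = (successes_old + successes_new) / (n_old + n_new)
    std_error = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_old + 1.0 / n_new))
    z = (rate_new - rate_old) / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Made-up counts: queries served and queries that led to a click, per arm.
z, p_value = two_proportion_z_test(
    successes_old=300_000, n_old=1_000_000,
    successes_new=340, n_new=1_000,
)
print(assign_arm("example-query-42"), round(z, 2), round(p_value, 4))
```

Hashing the query identifier keeps the assignment stable across repeated runs, and the test statistic indicates whether the observed difference in click-through rates is larger than sampling noise alone would plausibly produce.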