Low-cost and robust evaluation of information retrieval systems
Research in Information Retrieval has progressed against a background of rapidly increasing corpus size and heterogeneity, with every advance in technology quickly followed by a desire to organize and search larger, less structured, and more heterogeneous corpora. But as retrieval problems get larger and more complicated, evaluating the ranking performance of a retrieval engine gets harder: evaluation requires human judgments of the relevance of documents to queries, and for very large corpora the cost of acquiring these judgments may be prohibitive. This cost limits the types of problems researchers can study as well as the data on which they can study them.
We present methods for understanding performance differences between retrieval engines in the presence of missing and noisy relevance judgments. The work introduces a model of the cost of experimentation that incorporates both the cost of human judgments and the cost of drawing incorrect conclusions about differences between engines, in the training as well as the testing phases of engine development. By adopting a view of evaluation that is concerned with distributions over performance differences rather than with estimates of absolute performance, the expected cost can be minimized so as to reliably differentiate between engines with less than 1% of the human effort that has been used in past experiments.
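To illustrate what a distribution over performance differences can look like when judgments are missing, here is a minimal sketch. It is not the model developed in this work. It assumes, purely for illustration, that each unjudged document is independently relevant with a fixed prior probability `p_rel`, uses precision at rank `k` as the measure, and estimates by Monte Carlo sampling the probability that engine A outperforms engine B. The names `precision_at_k`, `prob_a_beats_b`, `p_rel`, and `n_samples` are hypothetical.

```python
import random


def precision_at_k(ranking, relevance, k):
    """Fraction of the top-k documents in `ranking` that are relevant."""
    return sum(relevance[d] for d in ranking[:k]) / k


def prob_a_beats_b(rank_a, rank_b, judged, k=5, p_rel=0.5,
                   n_samples=10000, seed=0):
    """Estimate P(precision@k of A > precision@k of B).

    `judged` maps document ids to 0 or 1. Documents in either top-k
    that are absent from `judged` are treated as unjudged.
    Assumption (illustrative only): each unjudged document is
    independently relevant with probability `p_rel`.
    """
    rng = random.Random(seed)
    top_docs = set(rank_a[:k]) | set(rank_b[:k])
    unjudged = sorted(d for d in top_docs if d not in judged)

    wins = 0
    for _ in range(n_samples):
        # Sample one possible completion of the missing judgments.
        relevance = dict(judged)
        for d in unjudged:
            relevance[d] = 1 if rng.random() < p_rel else 0
        diff = (precision_at_k(rank_a, relevance, k)
                - precision_at_k(rank_b, relevance, k))
        if diff > 0:
            wins += 1
    return wins / n_samples


if __name__ == "__main__":
    rank_a = ["d1", "d2", "d3"]
    rank_b = ["d1", "d4", "d5"]
    # d5 is unjudged, but A already has a perfect top 3 and B contains
    # a known nonrelevant document, so the outcome cannot change.
    judged = {"d1": 1, "d2": 1, "d3": 1, "d4": 0}
    print(prob_a_beats_b(rank_a, rank_b, judged, k=3))
```

In the usage example the sign of the difference is already determined by the four available judgments, so judging `d5` would add cost without changing the conclusion. Under the assumptions of this sketch, that is the sense in which attending to the difference rather than to absolute performance can reduce the number of judgments needed.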