  • A Technical Anatomy of Google: The Research Paper by Founders Sergey Brin and Lawrence Page

  • Author: aspxer  Source: internet  Date: 2007-5-22 0:31:15  Keywords: google
  • 3.1 Information Retrieval

    Work in information retrieval systems goes back many years and is well developed [Witten 94]. However, most of the research on information retrieval systems is on small, well-controlled, homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96], uses a fairly small, well-controlled collection for its benchmarks. The "Very Large Corpus" benchmark is only 20GB, compared to the 147GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the web. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words. For example, we have seen a major search engine return a page containing only "Bill Clinton Sucks" and a picture in response to a "Bill Clinton" query. Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like "Bill Clinton", they should get reasonable results, since there is an enormous amount of high-quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.
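
    The short-document effect described above can be reproduced with a small sketch. It assumes the simplest form of the vector space model: raw term counts with cosine similarity and no IDF weighting or length normalization tricks. Production engines differ in many details, and the two documents below are made up for illustration.

    ```python
    import math
    from collections import Counter


    def tf_vector(text):
        """Map a text to a raw term-frequency vector (word -> count)."""
        return Counter(text.lower().split())


    def cosine(a, b):
        """Cosine similarity between two sparse term-frequency vectors."""
        # Counter returns 0 for missing words, so only shared words contribute.
        dot = sum(count * b[word] for word, count in a.items())
        norm_a = math.sqrt(sum(c * c for c in a.values()))
        norm_b = math.sqrt(sum(c * c for c in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)


    # Hypothetical mini-collection: one near-empty page and one informative page.
    query = tf_vector("bill clinton")
    docs = {
        "short": "bill clinton sucks",
        "long": "bill clinton served as the 42nd president of the united states "
                "from 1993 to 2001",
    }

    scores = {name: cosine(query, tf_vector(text)) for name, text in docs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)

    for name in ranked:
        print(f"{name}: {scores[name]:.3f}")
    ```

    Under these assumptions the three-word page outranks the informative one, because every extra word in a document lengthens its vector and lowers its cosine with a two-word query. That is the behavior the passage argues against.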
