Search engines: information retrieval in practice
Scribed by Strohman, Trevor; Metzler, Donald; Croft, W. Bruce
- Publisher
- Pearson; Addison-Wesley
- Year
- 2009;2010
- Tongue
- English
- Leaves
- 547
- Category
- Library
✦ Synopsis
Search Engines: Information Retrieval in Practice is ideal for introductory information retrieval courses at the undergraduate and graduate level in computer science, information science and computer engineering departments. It is also a valuable tool for search engine and information retrieval professionals. Written by a leader in the field of information retrieval, Search Engines: Information Retrieval in Practice is designed to give undergraduate students the understanding and tools they need to evaluate, compare and modify search engines. Coverage of the underlying IR and mathematical models reinforces key concepts. The book's numerous programming exercises make extensive use of Galago, a Java-based open source search engine.
✦ Table of Contents
Cover......Page 1
Contents......Page 10
1.1 What Is Information Retrieval?......Page 28
1.2 The Big Issues......Page 31
1.3 Search Engines......Page 33
1.4 Search Engineers......Page 36
2.1 What Is an Architecture?......Page 40
2.2 Basic Building Blocks......Page 41
2.3.1 Text Acquisition......Page 44
2.3.2 Text Transformation......Page 46
2.3.3 Index Creation......Page 49
2.3.4 User Interaction......Page 50
2.3.5 Ranking......Page 52
2.3.6 Evaluation......Page 54
2.4 How Does It Really Work?......Page 55
3.1 Deciding What to Search......Page 58
3.2 Crawling the Web......Page 59
3.2.1 Retrieving Web Pages......Page 60
3.2.2 The Web Crawler......Page 62
3.2.3 Freshness......Page 64
3.2.5 Deep Web......Page 68
3.2.6 Sitemaps......Page 70
3.2.7 Distributed Crawling......Page 71
3.3 Crawling Documents and Email......Page 73
3.4 Document Feeds......Page 74
3.5 The Conversion Problem......Page 76
3.5.1 Character Encodings......Page 77
3.6 Storing the Documents......Page 79
3.6.2 Random Access......Page 80
3.6.3 Compression and Large Files......Page 81
3.6.4 Update......Page 83
3.6.5 BigTable......Page 84
3.7 Detecting Duplicates......Page 87
3.8 Removing Noise......Page 90
4.1 From Words to Terms......Page 100
4.2 Text Statistics......Page 102
4.2.1 Vocabulary Growth......Page 107
4.2.2 Estimating Collection and Result Set Sizes......Page 110
4.3.1 Overview......Page 113
4.3.2 Tokenizing......Page 114
4.3.3 Stopping......Page 117
4.3.4 Stemming......Page 118
4.3.5 Phrases and N-grams......Page 124
4.4 Document Structure and Markup......Page 128
4.5 Link Analysis......Page 131
4.5.2 PageRank......Page 132
4.5.3 Link Quality......Page 138
4.6 Information Extraction......Page 140
4.6.1 Hidden Markov Models for Extraction......Page 142
4.7 Internationalization......Page 145
5.1 Overview......Page 152
5.2 Abstract Model of Ranking......Page 153
5.3 Inverted Indexes......Page 156
5.3.1 Documents......Page 158
5.3.2 Counts......Page 160
5.3.3 Positions......Page 161
5.3.4 Fields and Extents......Page 163
5.3.5 Scores......Page 165
5.3.6 Ordering......Page 166
5.4 Compression......Page 167
5.4.1 Entropy and Ambiguity......Page 169
5.4.2 Delta Encoding......Page 171
5.4.3 Bit-Aligned Codes......Page 172
5.4.4 Byte-Aligned Codes......Page 175
5.4.5 Compression in Practice......Page 176
5.4.7 Skipping and Skip Pointers......Page 178
5.5 Auxiliary Structures......Page 181
5.6.1 Simple Construction......Page 183
5.6.2 Merging......Page 184
5.6.3 Parallelism and Distribution......Page 185
5.6.4 Update......Page 191
5.7 Query Processing......Page 192
5.7.1 Document-at-a-time Evaluation......Page 193
5.7.2 Term-at-a-time Evaluation......Page 195
5.7.3 Optimization Techniques......Page 197
5.7.4 Structured Queries......Page 205
5.7.5 Distributed Evaluation......Page 207
5.7.6 Caching......Page 208
6.1 Information Needs and Queries......Page 214
6.2.1 Stopping and Stemming Revisited......Page 217
6.2.2 Spell Checking and Suggestions......Page 220
6.2.3 Query Expansion......Page 226
6.2.4 Relevance Feedback......Page 235
6.2.5 Context and Personalization......Page 238
6.3.1 Result Pages and Snippets......Page 242
6.3.2 Advertising and Search......Page 245
6.3.3 Clustering the Results......Page 248
6.4 Cross-Language Search......Page 253
7.1 Overview of Retrieval Models......Page 260
7.1.1 Boolean Retrieval......Page 262
7.1.2 The Vector Space Model......Page 264
7.2 Probabilistic Models......Page 270
7.2.1 Information Retrieval as Classification......Page 271
7.2.2 The BM25 Ranking Algorithm......Page 277
7.3 Ranking Based on Language Models......Page 279
7.3.1 Query Likelihood Ranking......Page 281
7.3.2 Relevance Models and Pseudo-Relevance Feedback......Page 288
7.4 Complex Queries and Combining Evidence......Page 294
7.4.1 The Inference Network Model......Page 295
7.4.2 The Galago Query Language......Page 300
7.5 Web Search......Page 306
7.6 Machine Learning and Information Retrieval......Page 310
7.6.1 Learning to Rank......Page 311
7.6.2 Topic Models and Vocabulary Mismatch......Page 315
7.7 Application-Based Models......Page 318
8.1 Why Evaluate?......Page 324
8.2 The Evaluation Corpus......Page 326
8.3 Logging......Page 332
8.4.1 Recall and Precision......Page 335
8.4.2 Averaging and Interpolation......Page 340
8.4.3 Focusing on the Top Documents......Page 345
8.4.4 Using Preferences......Page 348
8.5 Efficiency Metrics......Page 349
8.6.1 Significance Tests......Page 352
8.6.2 Setting Parameter Values......Page 357
8.6.3 Online Testing......Page 359
8.7 The Bottom Line......Page 360
9 Classification and Clustering......Page 366
9.1 Classification and Categorization......Page 367
9.1.1 Naïve Bayes......Page 369
9.1.2 Support Vector Machines......Page 378
9.1.4 Classifier and Feature Selection......Page 386
9.1.5 Spam, Sentiment, and Online Advertising......Page 391
9.2 Clustering......Page 400
9.2.1 Hierarchical and K-Means Clustering......Page 402
9.2.2 K Nearest Neighbor Clustering......Page 411
9.2.3 Evaluation......Page 413
9.2.4 How to Choose K......Page 414
9.2.5 Clustering and Search......Page 416
10.1 What Is Social Search?......Page 424
10.2 User Tags and Manual Indexing......Page 427
10.2.1 Searching Tags......Page 429
10.2.2 Inferring Missing Tags......Page 431
10.2.3 Browsing and Tag Clouds......Page 433
10.3.1 What Is a Community?......Page 435
10.3.2 Finding Communities......Page 436
10.3.3 Community-Based Question Answering......Page 442
10.3.4 Collaborative Searching......Page 447
10.4.1 Document Filtering......Page 450
10.4.2 Collaborative Filtering......Page 459
10.5.1 Distributed Search......Page 465
10.5.2 P2P Networks......Page 469
11.1 Overview......Page 478
11.2 Feature-Based Retrieval Models......Page 479
11.3 Term Dependence Models......Page 481
11.4 Structure Revisited......Page 486
11.4.1 XML Retrieval......Page 488
11.4.2 Entity Search......Page 491
11.5 Longer Questions, Better Answers......Page 493
11.6 Words, Pictures, and Music......Page 497
11.7 One Search Fits All?......Page 506
References......Page 514
C......Page 540
D......Page 541
I......Page 542
N......Page 543
R......Page 544
S......Page 545
W......Page 546
Z......Page 547
✦ Subjects
Science;Computer Science;Nonfiction
SIMILAR VOLUMES
Information retrieval is the foundation for modern search engines. This textbook offers an introduction to the core topics underlying modern search technologies, including algorithms, data structures, indexing, retrieval, and evaluation. The emphasis is on implementation and experimentation; each ch
We are living in a multilingual world and the diversity in languages which are used to interact with information access systems has generated a wide variety of challenges to be addressed by computer and information scientists. The growing amount of non-English information accessible globally a