
Index Compression

Dr. Subrat Kumar Nayak
Associate Professor
Dept. of CSE, ITER, SOADU
What Will Be Discussed?

 Collection statistics in more detail (with RCV1)
 How big will the dictionary and postings be?
 Dictionary compression
 Postings compression
Why compression (in general)?
 Use less disk space
 Saves a little money
 Keep more stuff in memory
 Increases speed
 Increase speed of data transfer from disk to memory
 [read compressed data | decompress] is faster than [read uncompressed data]
 Premise: decompression algorithms are fast
 True of the decompression algorithms we use
Why compression for inverted indexes?
 Dictionary
 Make it small enough to keep in main memory
 Make it so small that you can keep some postings lists in main memory too

 Postings file(s)
 Reduce disk space needed
 Decrease time needed to read postings lists from disk
 Large search engines keep a significant part of the postings in memory.
 Compression lets you keep more in memory

 We will devise various IR-specific compression schemes
Recall Reuters RCV1

 symbol   statistic                                      value
 N        documents                                      800,000
 L        avg. # tokens per doc                          200
 M        terms (= word types)                           ~400,000
          avg. # bytes per token (incl. spaces/punct.)   6
          avg. # bytes per token (without spaces/punct.) 4.5
          avg. # bytes per term                          7.5
          non-positional postings                        100,000,000
Index parameters vs. what we index
 (∆% = reduction from the previous row; cumul % = cumulative reduction; sizes in K = 1,000)

                 word types (terms)       non-positional postings    positional postings
                 (dictionary)             (non-positional index)     (positional index)
                 Size (K)  ∆%   cumul %   Size (K)  ∆%   cumul %     Size (K)  ∆%   cumul %
 Unfiltered        484                     109,971                    197,879
 No numbers        474     -2     -2       100,680   -8     -8        179,158   -9     -9
 Case folding      392    -17    -19        96,969   -3    -12        179,158    0     -9
 30 stopwords      391     -0    -19        83,390  -14    -24        121,858  -31    -38
 150 stopwords     391     -0    -19        67,002  -30    -39         94,517  -47    -52
 stemming          322    -17    -33        63,812   -4    -42         94,517    0    -52
Lossless vs. lossy compression
 Lossless compression: all information is preserved.
 What we mostly do in IR.
 Lossy compression: discard some information.
 Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination (see the sketch below).
 Chap/Lecture 7: prune postings entries that are unlikely to turn up in the top-k list for any query.
 Almost no loss of quality for the top-k list.
Vocabulary vs. collection size
 How big is the term vocabulary?
 That is, how many distinct words are there?
 Can we assume an upper bound?
 Not really: at least 70^20 ≈ 10^37 different words of length 20
 In practice, the vocabulary will keep growing with the collection size
 Especially with Unicode ☺
Vocabulary vs. collection size
 Heaps' law: M = kT^b
 M is the size of the vocabulary, T is the number of tokens in the collection
 Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
 In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
 It is the simplest possible relationship between the two in log-log space
 An empirical finding ("empirical law")
Heaps' Law
 For RCV1, the dashed line
   log10 M = 0.49 log10 T + 1.64
 is the best least-squares fit.
 Thus M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
 Good empirical fit for Reuters RCV1! (A small sketch follows.)
 For the first 1,000,020 tokens:
 the law predicts 38,323 terms;
 actually, 38,365 terms

 [Figure: log-log plot of vocabulary size M vs. collection size T with the least-squares fit; Fig. 5.1, p. 81]
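A quick Python check of the fit above, using the rounded constants k ≈ 10^1.64 and b = 0.49 from the slide:

    # Heaps' law with the RCV1 fit: M = k * T^b, k = 10^1.64 ~ 44, b = 0.49
    k = 10 ** 1.64
    b = 0.49

    def predicted_vocab_size(T: int) -> float:
        """Predicted number of distinct terms M in a collection of T tokens."""
        return k * T ** b

    # With these rounded constants this prints ~38,000 (38,019); the
    # unrounded least-squares coefficients give the 38,323 quoted above,
    # vs. 38,365 terms actually observed in RCV1.
    print(round(predicted_vocab_size(1_000_020)))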


Zipf's law
 Heaps' law gives the vocabulary size in collections.
 We also study the relative frequencies of terms.
 In natural language, there are a few very frequent terms and very many very rare terms.
 Zipf's law: the ith most frequent term has frequency proportional to 1/i.
 cf_i ∝ 1/i, i.e. cf_i = K/i where K is a normalizing constant
 cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.
Zipf consequences
 If the most frequent term (the) occurs cf_1 times
 then the second most frequent term (of) occurs cf_1/2 times
 the third most frequent term (and) occurs cf_1/3 times …
 Equivalent: cf_i = c · i^k with k = −1, so
   log cf_i = log c + k log i = log c − log i
 Linear relationship between log cf_i and log i
 Another power-law relationship (see the sketch below)
Zipf's law for Reuters RCV1

 [Figure: log-log plot of term rank vs. collection frequency for RCV1]
Compression
 Now, we will consider compressing the space for the dictionary and postings
 Basic Boolean index only
 No study of positional indexes, etc.
 We will consider various compression schemes
Why compress the dictionary?
 Search begins with the dictionary
 We want to keep it in memory
 Memory footprint competition with other applications
 Embedded/mobile devices may have very little memory
 Even if the dictionary isn't in memory, we want it to be small for a fast search startup time
 So, compressing the dictionary is important
Dictionary storage - first cut
 Array of fixed-width entries
 ~400,000 terms; 28 bytes/term = 11.2 MB (size check below)

 Term (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
 a                 656,265           ->
 aachen            65                ->
 ....              ....              ....
 zulu              221               ->

 The dictionary search structure (e.g. a binary search tree) sits on top of this array.
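The 11.2 MB figure is simple arithmetic; a Python check:

    # Fixed-width dictionary entry for RCV1: 20-byte term + 4-byte freq
    # + 4-byte postings pointer = 28 bytes/term.
    TERMS = 400_000
    ENTRY_BYTES = 20 + 4 + 4

    print(TERMS * ENTRY_BYTES / 1e6, "MB")  # -> 11.2 MB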
Fixed-width terms are wasteful
 Most of the bytes in the Term column are wasted - we allot 20 bytes for 1-letter terms.
 And we still can't handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
 Written English averages ~4.5 characters/word.
 Exercise: Why is/isn't this the number to use for estimating the dictionary size?
 Avg. dictionary word in English: ~8 characters
 How do we use ~8 characters per dictionary term?
 Short words dominate token counts but not type average.
Compressing the term list: Dictionary-as-a-String
 Store the dictionary as a (long) string of characters:
 A pointer to the next word shows the end of the current word
 Hope to save up to 60% of dictionary space

 ....systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo....

 Freq.   Postings ptr.   Term ptr.
 33         ->              ->  (into the string above)
 29         ->              ->
 44         ->              ->
 126        ->              ->

 Total string length = 400K x 8B = 3.2 MB
 Term pointers must resolve 3.2M positions: log2(3.2M) ≈ 22 bits = 3 bytes (see the sketch below)
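A minimal Python sketch of the scheme. Names are illustrative, and a real implementation would store the offsets as packed 3-byte integers rather than a Python list:

    # Dictionary-as-a-string: concatenate all terms; each entry keeps only
    # an offset into the big string, and a term ends where the next begins.
    terms = ["systile", "syzygetic", "syzygial", "syzygy"]

    big_string = "".join(terms)
    offsets, pos = [], 0
    for t in terms:
        offsets.append(pos)   # the 3-byte "term pointer" in the real layout
        pos += len(t)

    def term_at(i: int) -> str:
        end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
        return big_string[offsets[i]:end]

    print(term_at(1))  # -> syzygetic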
Space for dictionary as a string
 4 bytes per term for Freq.
 4 bytes per term for pointer to Postings
 3 bytes per term pointer
 Avg. 8 bytes per term in the term string
 Now avg. 11 bytes/term for term storage, not 20
 400K terms x 19 bytes = 7.6 MB (against 11.2 MB for fixed width)
Blocking
 Store pointers to every kth term string.
 Example below: k = 4.
 Need to store term lengths (1 extra byte each)

 ....7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo....

 Freq.   Postings ptr.   Term ptr.
 33         ->              ->  (one term pointer per block of k = 4 terms)
 29         ->
 44         ->
 126        ->

 Per block: save 9 bytes on 3 term pointers, lose 4 bytes on term lengths (sketch below).
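A Python sketch of the blocked layout, assuming the length-prefixed format shown above (a length byte followed by the term's characters, with one surviving term pointer per block):

    # Blocking with k = 4: only every 4th entry keeps a term pointer;
    # within a block, terms are length-prefixed and scanned sequentially.
    k = 4
    terms = ["systile", "syzygetic", "syzygial", "syzygy",
             "szaibelyite", "szczecin", "szomo"]

    # One block string per group of k terms: length byte + characters.
    blocks = []
    for i in range(0, len(terms), k):
        blocks.append("".join(chr(len(t)) + t for t in terms[i:i + k]))

    def term_at(i: int) -> str:
        """Follow the block pointer, then skip i % k length-prefixed terms."""
        block, pos = blocks[i // k], 0
        for _ in range(i % k):
            pos += 1 + ord(block[pos])   # skip length byte + term
        length = ord(block[pos])
        return block[pos + 1 : pos + 1 + length]

    print(term_at(4))  # -> szaibelyite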
Net
 Example for block size k = 4
 Where we used 3 bytes/pointer without blocking (3 x 4 = 12 bytes per block), we now use 3 + 4 = 7 bytes per block.
 Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
 We can save more with larger k.
 Why not go with larger k?
Exercise
 Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.
Dictionary search without blocking
 Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons in the binary search tree over 8 terms (1 term at depth 1, 2 at depth 2, 4 at depth 3, 1 at depth 4) is (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6

 Exercise: if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
Dictionary search with blocking
 Binary search down to the 4-term block;
 then linear search through the terms in the block.
 Blocks of 4 (binary tree over blocks): avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares (sketch below)
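A sketch of the two-phase lookup in Python: binary search over the block-leading terms, then a linear scan of at most k terms inside the block. The term list is the running example, and bisect stands in for the search tree:

    from bisect import bisect_right

    k = 4
    terms = ["systile", "syzygetic", "syzygial", "syzygy",
             "szaibelyite", "szczecin", "szomo"]   # already sorted
    leaders = terms[::k]                           # first term of each block

    def lookup(query: str) -> int | None:
        b = bisect_right(leaders, query) - 1       # binary search: which block?
        if b < 0:
            return None
        for i in range(b * k, min((b + 1) * k, len(terms))):  # linear scan
            if terms[i] == query:
                return i
        return None

    print(lookup("szczecin"))  # -> 5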
Exercise
 Estimate the impact on search performance (and the slowdown compared to k = 1) with blocking, for block sizes of k = 4, 8 and 16.
Front coding
 Front coding: sorted words commonly have a long common prefix - store differences only (for the last k-1 terms in a block of k)

 8automata8automate9automatic10automation
   -> 8automat*a1⋄e2⋄ic3⋄ion

 '*' marks the end of the shared prefix (here: automat); each number is the length of the extra suffix beyond automat.
 Begins to resemble general string compression.
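A small Python encoder for one block in the format above. The '⋄' separator follows the textbook's notation; a real implementation would pack the lengths as bytes rather than decimal characters:

    import os

    def front_code(block: list[str]) -> str:
        """Front-code one sorted block: full first term with '*' marking the
        shared prefix, then suffix-length + separator + suffix for the rest."""
        prefix = os.path.commonprefix(block)
        out = f"{len(block[0])}{prefix}*{block[0][len(prefix):]}"
        for term in block[1:]:
            suffix = term[len(prefix):]
            out += f"{len(suffix)}⋄{suffix}"
        return out

    print(front_code(["automata", "automate", "automatic", "automation"]))
    # -> 8automat*a1⋄e2⋄ic3⋄ion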
RCV1 dictionary compression summary

 Technique                                            Size in MB
 Fixed width                                             11.2
 Dictionary-as-a-String with pointers to every term       7.6
 Also, blocking k = 4                                      7.1
 Also, blocking + front coding                             5.9
