
Index Compression

Dr. Subrat Kumar Nayak
Associate Professor
Dept. of CSE, ITER, SOADU
What Will Be Discussed?

 Collection statistics in more detail (with RCV1)
 How big will the dictionary and postings be?
 Dictionary compression
 Postings compression
Why compression (in general)?
 Use less disk space
 Saves a little money
 Keep more stuff in memory
 Increases speed
 Increase speed of data transfer from disk to memory
 [read compressed data | decompress] is faster than [read uncompressed data]
 Premise: decompression algorithms are fast
 True of the decompression algorithms we use
Why compression for inverted indexes?
 Dictionary
 Make it small enough to keep in main memory
 Make it so small that you can keep some postings lists in main memory too

 Postings file(s)
 Reduce disk space needed
 Decrease time needed to read postings lists from disk
 Large search engines keep a significant part of the postings in memory.
 Compression lets you keep more in memory

 We will devise various IR-specific compression schemes
Recall Reuters RCV1

 symbol   statistic                                      value
 N        documents                                      800,000
 L        avg. # tokens per doc                          200
 M        terms (= word types)                           ~400,000
          avg. # bytes per token (incl. spaces/punct.)   6
          avg. # bytes per token (without spaces/punct.) 4.5
          avg. # bytes per term                          7.5
          non-positional postings                        100,000,000
Index parameters vs. what we index
 (∆% = reduction from the previous row; cumul % = cumulative reduction; sizes in K = 1,000)

                 word types (terms)       non-positional postings    positional postings
                 (dictionary)             (non-positional index)     (positional index)
                 Size (K)  ∆%   cumul %   Size (K)  ∆%   cumul %     Size (K)  ∆%   cumul %
 Unfiltered        484                     109,971                    197,879
 No numbers        474     -2     -2       100,680   -8     -8        179,158   -9     -9
 Case folding      392    -17    -19        96,969   -3    -12        179,158    0     -9
 30 stopwords      391     -0    -19        83,390  -14    -24        121,858  -31    -38
 150 stopwords     391     -0    -19        67,002  -30    -39         94,517  -47    -52
 stemming          322    -17    -33        63,812   -4    -42         94,517    0    -52
Lossless vs. lossy compression
 Lossless compression: all information is preserved.
 What we mostly do in IR.
 Lossy compression: discard some information.
 Several of the preprocessing steps can be viewed as lossy compression: case folding, stop words, stemming, number elimination (see the sketch below).
 Chap/Lecture 7: prune postings entries that are unlikely to turn up in the top-k list for any query.
 Almost no loss of quality for the top-k list.
Vocabulary vs. collection size
 How big is the term vocabulary?
 That is, how many distinct words are there?
 Can we assume an upper bound?
 Not really: at least 70^20 ≈ 10^37 different words of length 20
 In practice, the vocabulary will keep growing with the collection size
 Especially with Unicode ☺
Vocabulary vs. collection size
 Heaps' law: M = kT^b
 M is the size of the vocabulary, T is the number of tokens in the collection
 Typical values: 30 ≤ k ≤ 100 and b ≈ 0.5
 In a log-log plot of vocabulary size M vs. T, Heaps' law predicts a line with slope about ½
 It is the simplest possible relationship between the two in log-log space
 An empirical finding ("empirical law")
Heaps' Law
 For RCV1, the dashed line
   log10 M = 0.49 log10 T + 1.64
 is the best least-squares fit.
 Thus M = 10^1.64 T^0.49, so k = 10^1.64 ≈ 44 and b = 0.49.
 Good empirical fit for Reuters RCV1! (A small sketch follows.)
 For the first 1,000,020 tokens:
 the law predicts 38,323 terms;
 actually, 38,365 terms

 [Figure: log-log plot of vocabulary size M vs. collection size T with the least-squares fit; Fig. 5.1, p. 81]
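A quick Python check of the fit above, using the rounded constants k ≈ 10^1.64 and b = 0.49 from the slide:

    # Heaps' law with the RCV1 fit: M = k * T^b, k = 10^1.64 ~ 44, b = 0.49
    k = 10 ** 1.64
    b = 0.49

    def predicted_vocab_size(T: int) -> float:
        """Predicted number of distinct terms M in a collection of T tokens."""
        return k * T ** b

    # With these rounded constants this prints ~38,000 (38,019); the
    # unrounded least-squares coefficients give the 38,323 quoted above,
    # vs. 38,365 terms actually observed in RCV1.
    print(round(predicted_vocab_size(1_000_020)))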


Zipf's law
 Heaps' law gives the vocabulary size in collections.
 We also study the relative frequencies of terms.
 In natural language, there are a few very frequent terms and very many very rare terms.
 Zipf's law: the ith most frequent term has frequency proportional to 1/i.
 cf_i ∝ 1/i, i.e. cf_i = K/i where K is a normalizing constant
 cf_i is the collection frequency: the number of occurrences of the term t_i in the collection.
Zipf consequences
 If the most frequent term (the) occurs cf_1 times
 then the second most frequent term (of) occurs cf_1/2 times
 the third most frequent term (and) occurs cf_1/3 times …
 Equivalent: cf_i = c · i^k with k = −1, so
   log cf_i = log c + k log i = log c − log i
 Linear relationship between log cf_i and log i
 Another power-law relationship (see the sketch below)
Zipf's law for Reuters RCV1

 [Figure: log-log plot of term rank vs. collection frequency for RCV1]
Compression
 Now, we will consider compressing the space for the dictionary and postings
 Basic Boolean index only
 No study of positional indexes, etc.
 We will consider various compression schemes
Why compress the dictionary?
 Search begins with the dictionary
 We want to keep it in memory
 Memory footprint competition with other applications
 Embedded/mobile devices may have very little memory
 Even if the dictionary isn't in memory, we want it to be small for a fast search startup time
 So, compressing the dictionary is important
Dictionary storage - first cut
 Array of fixed-width entries
 ~400,000 terms; 28 bytes/term = 11.2 MB (size check below)

 Term (20 bytes)   Freq. (4 bytes)   Postings ptr. (4 bytes)
 a                 656,265           ->
 aachen            65                ->
 ....              ....              ....
 zulu              221               ->

 The dictionary search structure (e.g. a binary search tree) sits on top of this array.
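The 11.2 MB figure is simple arithmetic; a Python check:

    # Fixed-width dictionary entry for RCV1: 20-byte term + 4-byte freq
    # + 4-byte postings pointer = 28 bytes/term.
    TERMS = 400_000
    ENTRY_BYTES = 20 + 4 + 4

    print(TERMS * ENTRY_BYTES / 1e6, "MB")  # -> 11.2 MB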
Fixed-width terms are wasteful
 Most of the bytes in the Term column are wasted - we allot 20 bytes for 1-letter terms.
 And we still can't handle supercalifragilisticexpialidocious or hydrochlorofluorocarbons.
 Written English averages ~4.5 characters/word.
 Exercise: Why is/isn't this the number to use for estimating the dictionary size?
 Avg. dictionary word in English: ~8 characters
 How do we use ~8 characters per dictionary term?
 Short words dominate token counts but not type average.
Compressing the term list: Dictionary-as-a-String
 Store the dictionary as a (long) string of characters:
 A pointer to the next word shows the end of the current word
 Hope to save up to 60% of dictionary space

 ....systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo....

 Freq.   Postings ptr.   Term ptr.
 33         ->              ->  (into the string above)
 29         ->              ->
 44         ->              ->
 126        ->              ->

 Total string length = 400K x 8B = 3.2 MB
 Term pointers must resolve 3.2M positions: log2(3.2M) ≈ 22 bits = 3 bytes (see the sketch below)
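A minimal Python sketch of the scheme. Names are illustrative, and a real implementation would store the offsets as packed 3-byte integers rather than a Python list:

    # Dictionary-as-a-string: concatenate all terms; each entry keeps only
    # an offset into the big string, and a term ends where the next begins.
    terms = ["systile", "syzygetic", "syzygial", "syzygy"]

    big_string = "".join(terms)
    offsets, pos = [], 0
    for t in terms:
        offsets.append(pos)   # the 3-byte "term pointer" in the real layout
        pos += len(t)

    def term_at(i: int) -> str:
        end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
        return big_string[offsets[i]:end]

    print(term_at(1))  # -> syzygetic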
Space for dictionary as a string
 4 bytes per term for Freq.
 4 bytes per term for pointer to Postings
 3 bytes per term pointer
 Avg. 8 bytes per term in the term string
 Now avg. 11 bytes/term for term storage, not 20
 400K terms x 19 bytes = 7.6 MB (against 11.2 MB for fixed width)
Blocking
 Store pointers to every kth term string.
 Example below: k = 4.
 Need to store term lengths (1 extra byte each)

 ....7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo....

 Freq.   Postings ptr.   Term ptr.
 33         ->              ->  (one term pointer per block of k = 4 terms)
 29         ->
 44         ->
 126        ->

 Per block: save 9 bytes on 3 term pointers, lose 4 bytes on term lengths (sketch below).
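A Python sketch of the blocked layout, assuming the length-prefixed format shown above (a length byte followed by the term's characters, with one surviving term pointer per block):

    # Blocking with k = 4: only every 4th entry keeps a term pointer;
    # within a block, terms are length-prefixed and scanned sequentially.
    k = 4
    terms = ["systile", "syzygetic", "syzygial", "syzygy",
             "szaibelyite", "szczecin", "szomo"]

    # One block string per group of k terms: length byte + characters.
    blocks = []
    for i in range(0, len(terms), k):
        blocks.append("".join(chr(len(t)) + t for t in terms[i:i + k]))

    def term_at(i: int) -> str:
        """Follow the block pointer, then skip i % k length-prefixed terms."""
        block, pos = blocks[i // k], 0
        for _ in range(i % k):
            pos += 1 + ord(block[pos])   # skip length byte + term
        length = ord(block[pos])
        return block[pos + 1 : pos + 1 + length]

    print(term_at(4))  # -> szaibelyite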
Net
 Example for block size k = 4
 Where we used 3 bytes/pointer without blocking (3 x 4 = 12 bytes per block), we now use 3 + 4 = 7 bytes per block.
 Shaved another ~0.5 MB. This reduces the size of the dictionary from 7.6 MB to 7.1 MB.
 We can save more with larger k.
 Why not go with larger k?
Exercise
 Estimate the space usage (and savings compared to 7.6 MB) with blocking, for block sizes of k = 4, 8 and 16.
Dictionary search without blocking
 Assuming each dictionary term is equally likely in a query (not really so in practice!), the average number of comparisons in the binary search tree over 8 terms (1 term at depth 1, 2 at depth 2, 4 at depth 3, 1 at depth 4) is (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6

 Exercise: if the frequencies of query terms were non-uniform but known, how would you structure the dictionary search tree?
Dictionary search with blocking
 Binary search down to the 4-term block;
 then linear search through the terms in the block.
 Blocks of 4 (binary tree over blocks): avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares (sketch below)
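A sketch of the two-phase lookup in Python: binary search over the block-leading terms, then a linear scan of at most k terms inside the block. The term list is the running example, and bisect stands in for the search tree:

    from bisect import bisect_right

    k = 4
    terms = ["systile", "syzygetic", "syzygial", "syzygy",
             "szaibelyite", "szczecin", "szomo"]   # already sorted
    leaders = terms[::k]                           # first term of each block

    def lookup(query: str) -> int | None:
        b = bisect_right(leaders, query) - 1       # binary search: which block?
        if b < 0:
            return None
        for i in range(b * k, min((b + 1) * k, len(terms))):  # linear scan
            if terms[i] == query:
                return i
        return None

    print(lookup("szczecin"))  # -> 5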
Exercise
 Estimate the impact on search performance (and the slowdown compared to k = 1) with blocking, for block sizes of k = 4, 8 and 16.
Front coding
 Front coding: sorted words commonly have a long common prefix - store differences only (for the last k-1 terms in a block of k)

 8automata8automate9automatic10automation
   -> 8automat*a1⋄e2⋄ic3⋄ion

 '*' marks the end of the shared prefix (here: automat); each number is the length of the extra suffix beyond automat.
 Begins to resemble general string compression.
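A small Python encoder for one block in the format above. The '⋄' separator follows the textbook's notation; a real implementation would pack the lengths as bytes rather than decimal characters:

    import os

    def front_code(block: list[str]) -> str:
        """Front-code one sorted block: full first term with '*' marking the
        shared prefix, then suffix-length + separator + suffix for the rest."""
        prefix = os.path.commonprefix(block)
        out = f"{len(block[0])}{prefix}*{block[0][len(prefix):]}"
        for term in block[1:]:
            suffix = term[len(prefix):]
            out += f"{len(suffix)}⋄{suffix}"
        return out

    print(front_code(["automata", "automate", "automatic", "automation"]))
    # -> 8automat*a1⋄e2⋄ic3⋄ion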
RCV1 dictionary compression summary

 Technique                                            Size in MB
 Fixed width                                             11.2
 Dictionary-as-a-String with pointers to every term       7.6
 Also, blocking k = 4                                      7.1
 Also, blocking + front coding                             5.9
