Assignment 4
COL106 Assignment 4
October 2024
1 Background
The IITD library, which houses thousands of rare books, journals and
magazines, is undergoing a major transformation. With the increasing demand for
digital access to their resources, the library is working on a project to digitize
its entire collection. This effort is not just about scanning books into digital
format, but also about making the content understandable and searchable, thereby
enhancing the user experience and accessibility of the library’s resources.
The library has a huge dictionary of all English words, but its size makes it
impractical to use. As part of the digitization effort, the library
wishes to send a compressed dictionary along with each book to the user,
containing only the words present in that book. It also wishes to support
keyword search, returning the books that contain a given keyword.
2 Modelling
To accomplish both of these tasks, the library hires four IITD students desperate
for a CV point: a 2nd-year rookie dev, Musk, and three (slightly) more
experienced 4th-year devs, Jobs, Gates and Bezos. Their job is to develop a program
that can analyse the text of each book and find and count the distinct words
used in it.
Each of them believes that a particular method is best. The 2nd-year Musk
believes a simple sorting algorithm should do the trick, while the 4th-year devs
believe hashing to be a superior method; however, they still disagree on the
best collision handling method.
2.2 4th Years (Hashing)
In this method, we will maintain a set of words in the form of a Hash Table.
The method to handle collisions will be different for each dev as follows:
• Jobs uses Chaining
3 Requirements
This assignment is divided into two parts:
• Part 1: We implement the HashTable methods and the
DigitalLibrary methods that provide the required functionalities of the
library.
• Part 2: We then try to optimize the space used by our Hash
Table using Dynamic Resizing (explained in 3.2).
3.1 Part 1: Static Sized Hash Tables
3.1.1 Hash Table Classes
You are expected to make two classes HashSet and HashMap, with support for
two operations - insert() and find() as described below.
• HashSet would be a Hash Table with entries as key.
• HashMap would be a Hash Table with entries as (key, value). Note that
only key is hashed to find the slot for the pair.
Due to the similarities in the two classes, they have been inherited from a
base class HashTable. You are free to decide whether you want to define some
logic in the base class or code all logic in the child classes themselves, however
you are encouraged to make a more modular program by writing common code
in the base class itself.
– collision_type:
"chaining": Use chaining for collision handling and add the new
entry at the end of list
"linear": Use linear probing for collision handling
"double": Use double hashing for collision handling
For double hashing, h1 will use z1 as the parameter for the hash and the table
size as the parameter for the compression function (mod table_size), while
h2 (the step size for probing) will use z2 as the parameter for the hash and c2
as the parameter for the compression function (mod c2).
Note that table_size here is just the initial size, and may change later
in the problem (as discussed in 3.2).
• insert(self, x): Insert a new value in the table, if it does not already
exist. x will be:
– x = key for HashSet
– x = (key, value) for HashMap
• find(self, key): Find entry corresponding to key in the hash table,
and return whether found:
– Return True or False for HashSet
– Return value or None for HashMap
• get_slot(self, key): Return the slot index for the given key.
• get_load(self): Return the load factor α:
α = (Number of elements in the table) / (Total number of slots in the table)
• __str__(self): Return the contents of the HashTable in a string format.
This function will be used to print the table, i.e., when you write
print(ht), it will print the string returned by this function, just like a
print(ht.__str__()) call.
The expected format is as follows:
– For a HashSet, an element will just be printed as key, while for a
HashMap, an entry will be printed as (key, value), e.g. (Hash, 4).
Notice the lack of "" in printing any string.
– Each table slot will be separated by a |
– For Chaining, entries in the same slot must be separated by a ;
– For an empty slot, print ⟨EMPTY⟩
Examples: Consider a HashMap and HashSet of COL106 assignments,
assuming Stack and AVL hash to the same slot
Chaining HashMap: (Stack, 1) ; (AVL, 2) | ⟨EMPTY⟩ | (Heap, 3) | (Hash, 4)
Chaining HashSet: Stack ; AVL | ⟨EMPTY⟩ | Heap | Hash
Probing HashMap: (Stack, 1) | (AVL, 2) | ⟨EMPTY⟩ | (Heap, 3) | (Hash, 4)
Probing HashSet: Stack | AVL | ⟨EMPTY⟩ | Heap | Hash
Please pay attention to the printing format, otherwise your code may not
be autograded correctly. You may ask on Piazza if there is any ambiguity.
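The chaining behaviour and the printing format described above can be sketched roughly as follows. This is only an illustration, not the starter-code class: the name ChainingSetSketch, the slot_of argument (a hypothetical function mapping a key to its slot) and demo_slot are all assumptions, and the empty-slot token and the spaces around | and ; are read off the examples, so copy the exact characters from the handout or starter code.

```python
EMPTY = "⟨EMPTY⟩"  # assumed spelling of the empty-slot token; verify against the handout


class ChainingSetSketch:
    """Illustrative chaining hash set (not the starter-code HashSet)."""

    def __init__(self, table_size, slot_of):
        self.table_size = table_size
        self.slot_of = slot_of            # hypothetical: maps a key to a slot index
        self.table = [None] * table_size  # None marks an empty slot
        self.count = 0

    def find(self, key):
        chain = self.table[self.slot_of(key)]
        if chain is None:
            return False
        for item in chain:
            if item == key:
                return True
        return False

    def insert(self, key):
        if self.find(key):
            return                        # already present, nothing to do
        slot = self.slot_of(key)
        if self.table[slot] is None:
            self.table[slot] = []
        self.table[slot].append(key)      # new entry goes at the end of the list
        self.count += 1

    def get_load(self):
        return self.count / self.table_size

    def __str__(self):
        parts = []
        for chain in self.table:
            if chain is None:
                parts.append(EMPTY)
            else:
                parts.append(" ; ".join(str(item) for item in chain))
        return " | ".join(parts)


def demo_slot(key):
    # Hand-picked slots that mimic the example above (Stack and AVL collide).
    if key in ("Stack", "AVL"):
        return 0
    if key == "Heap":
        return 2
    return 3


demo = ChainingSetSketch(4, demo_slot)
for word in ["Stack", "AVL", "Heap", "Hash", "AVL"]:
    demo.insert(word)
print(demo)  # Stack ; AVL | ⟨EMPTY⟩ | Heap | Hash
```

A HashMap version would store (key, value) pairs in each list, compare only the key in find, and format each entry as (key, value).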
Common Functions
• distinct_words(self, book_title): Return the list of distinct words
present in the book with the given title. For MuskLibrary the words
should be in lexicographically sorted order, while for JGBLibrary
they should be in the order in which they appear in your Hash Table.
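For the sorting-based approach, one possible way to obtain the distinct words in sorted order is to sort the word list and then drop adjacent duplicates. The sketch below uses merge sort, which is a choice made here for illustration and not something this section prescribes; the function names are hypothetical, and it relies on Python's default string comparison, which may or may not match the ordering the autograder expects for mixed-case words.

```python
def merge_sort(words):
    """Return a new sorted list; does not use the built-in sort."""
    if len(words) <= 1:
        return list(words)
    mid = len(words) // 2
    left = merge_sort(words[:mid])
    right = merge_sort(words[mid:])
    merged = []
    i = 0
    j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])   # at most one of these two slices is non-empty
    merged.extend(right[j:])
    return merged


def sorted_distinct(words):
    """Sort, then keep a word only if it differs from the previous one kept."""
    result = []
    for word in merge_sort(words):
        if not result or result[-1] != word:
            result.append(word)
    return result


print(sorted_distinct(["the", "cat", "the", "a", "cat"]))  # ['a', 'cat', 'the']
```

Because equal words end up next to each other after sorting, a single linear pass is enough to remove duplicates without any set or dictionary.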
3.2 Part 2: Dynamically Sized Hash Tables
One of the major reasons for using Hash Tables instead of simple buckets is space
efficiency. Therefore, in this part, we will attempt to keep the space used by
our Hash Tables as small as possible without sacrificing running time.
If we simply allocate a large table at the start and the number of
inserts turns out to be small in comparison, the extra memory is wasted.
Instead, we will start with a small table and dynamically resize
it whenever it becomes considerably full.
The load factor (α) of a hash table can be expressed as:
α = (Number of elements in the table) / (Total number of slots in the table)
If the load factor reaches or exceeds 50% (α ≥ 0.5), we allocate a new table whose
size is a prime just over double the old size and rehash all existing elements into
this new table (computing this new size is already implemented in the get_new_size
function). For rehashing all elements, we use the following scheme:
• Linear Probing and Double Hashing: Iterate over the old table, one
element at a time, and rehash each element into the new table
• Chaining: Iterate the old table, and for each slot (containing a list of
elements), iterate through the elements from start to end of the list, and
insert them one by one into the new table.
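The resizing rule and the chaining rehash order can be sketched as below. All names here are hypothetical: next_size_sketch is only a stand-in that returns the smallest prime above double the old size (the starter code's get_new_size may pick a different prime), and slot_of(key, size) is an assumed slot function supplied by the caller.

```python
def is_prime(n):
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def next_size_sketch(old_size):
    """Stand-in for get_new_size: smallest prime above 2 * old_size."""
    candidate = 2 * old_size + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


def needs_resize(count, table_size):
    """True when the load factor count / table_size is at least 50%."""
    return 2 * count >= table_size


def rehash_chaining(old_table, new_size, slot_of):
    """Walk slots in order, and each list from start to end, reinserting keys."""
    new_table = [None] * new_size
    for chain in old_table:
        if chain is None:
            continue
        for key in chain:
            slot = slot_of(key, new_size)
            if new_table[slot] is None:
                new_table[slot] = []
            new_table[slot].append(key)
    return new_table
```

For linear probing and double hashing the loop has the same shape, except that each non-empty slot holds a single element, which is reinserted through the usual probing insert on the new table.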
4 Clarifications
4.1 Regarding Hash Table
• For HashMap, you can assume every key will have a unique value.
• Initialization of a HashTable instance should take time O(table_size).
• insert method should run in amortized O(1) time.
• find method should run in O(1) time.
• get_slot should run in time O(|key|), the length of key.
• Functions will only be called with a HashSet or HashMap instance, and
not with a HashTable instance. HashTable class is only there for ease of
coding.
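To make the O(|key|) bound on get_slot and the double hashing parameters from 3.1.1 concrete, here is a minimal sketch. It rests on several assumptions that should be checked against the handout: it uses polynomial accumulation with ord() as the character code, and it takes the step as c2 - (hash mod c2), a common textbook form that keeps the step nonzero. The function names are hypothetical.

```python
def poly_hash(key, z, modulus):
    """Polynomial hash of key with parameter z, reduced mod modulus.

    Uses Horner's rule, so the work is one multiply-add per character.
    """
    h = 0
    for ch in reversed(key):
        h = (h * z + ord(ch)) % modulus  # ord() is an assumed character code
    return h


def probe_sequence(key, z1, z2, c2, table_size):
    """Yield the slots that double hashing would try for key, in order."""
    start = poly_hash(key, z1, table_size)  # h1: compressed mod table_size
    step = c2 - poly_hash(key, z2, c2)      # h2: assumed form, always in 1..c2
    slot = start
    for _ in range(table_size):
        yield slot
        slot = (slot + step) % table_size


slots = list(probe_sequence("Hash", 31, 37, 7, 11))
print(slots[0] == poly_hash("Hash", 31, 11))  # True
```

With a prime table_size and c2 smaller than table_size, the step is never a multiple of table_size, so the sequence visits every slot exactly once before repeating.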
4.3 General Clarifications
• As usual, you are free to define new methods in any of the classes, however
you must keep the signature of all methods mentioned in the previous
section unchanged.
• You are not allowed to use the inbuilt set or dictionary, and are expected
to use your own implementation instead.
• You must not import any Python library in any of the files, and must
implement the functionalities from scratch.
5 Submission
1. You are expected to complete this assignment in the next 2 weeks during
your lab hours. (Deadline : 29th October 11:59 PM)
2. The starter code for this assignment is available on Moodle New.
3. The submission on Moodle New should contain the following files: