Hashing
Course 2016
Data Structures: Reminder
Given a universe U, we maintain a dynamic set of records, where each record consists of a key k and satellite data.
• Array
• Linked List (and variations)
• Stack (LIFO): supports push and pop
• Queue (FIFO): supports enqueue and dequeue
• Deque: supports push, pop, enqueue and dequeue
• Heaps: support insertions, deletions, and finding the maximum or minimum
• Hashing
Dynamic Sets.
DICTIONARY
Data structure for maintaining S ⊂ U together with operations:
• Search (S, k): decide if k ∈ S
• Insert (S, k): S := S ∪ {k}
• Delete (S, k): S := S \ {k}
PRIORITY QUEUE
Data structure for maintaining S ⊂ U together with operations:
• Insert (S, k): S := S ∪ {k}
• Maximum (S): returns the element of S with the largest k
• Extract-Maximum (S): returns and erases from S the element of S with the largest k
Priority Queue
Linked lists (kept sorted by key):
• INSERT: O(n)
• EXTRACT-MAX: O(1)
Heaps:
• INSERT: O(lg n)
• EXTRACT-MAX: O(lg n)
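The trade-off above can be sketched in Python. This is only an illustration under assumptions: a sorted Python list stands in for the sorted linked list, the standard heapq module (a min-heap, so keys are negated) stands in for the max-heap, and all function names are hypothetical.

```python
import bisect
import heapq

def sorted_list_insert(lst, k):
    """O(n): keep the list sorted so that the maximum is always last."""
    bisect.insort(lst, k)

def sorted_list_extract_max(lst):
    """O(1): the maximum sits at the end of the sorted list."""
    return lst.pop()

def heap_insert(heap, k):
    """O(lg n): heapq is a min-heap, so store negated keys."""
    heapq.heappush(heap, -k)

def heap_extract_max(heap):
    """O(lg n): pop the smallest negated key, i.e. the largest key."""
    return -heapq.heappop(heap)

keys = [5, 1, 9, 3]
lst, heap = [], []
for k in keys:
    sorted_list_insert(lst, k)
    heap_insert(heap, k)

list_order = [sorted_list_extract_max(lst) for _ in keys]
heap_order = [heap_extract_max(heap) for _ in keys]
```

Both structures return the keys in decreasing order; they differ only in where the cost is paid.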
Finding similar documents in the WWW
• Proliferation of almost identical documents
• Approximately 30% of the pages on the web are (near) duplicates.
• Another way to find plagiarism
Hashing functions
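A data structure that supports dictionary operations on a universe of numerical keys.
[Figure: a hash function h maps the keys of S ⊂ U to table addresses; two keys mapped to the same address form a collision.]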
Simple uniform hashing function.
h(k) = k mod m.
For each table address, construct a linked list of the items whose
keys hash to that address.
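A minimal sketch of chaining with h(k) = k mod m follows. Python lists play the role of the linked lists, and the class name ChainedHashTable is a hypothetical choice, not something defined in these slides.

```python
class ChainedHashTable:
    """Dictionary over integer keys using h(k) = k mod m and one chain per address."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]  # one (initially empty) chain per address

    def _h(self, k):
        return k % self.m

    def insert(self, k):
        chain = self.table[self._h(k)]
        if k not in chain:          # keep S a set: no duplicates
            chain.append(k)

    def search(self, k):
        return k in self.table[self._h(k)]

    def delete(self, k):
        chain = self.table[self._h(k)]
        if k in chain:
            chain.remove(k)

t = ChainedHashTable(7)
for k in (10, 17, 3, 24):           # all four keys are congruent to 3 mod 7
    t.insert(k)
```

In this run every key collides at address 3, so the cost of a search is the length of that single chain.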
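A family H of hash functions is universal if, for every pair of distinct keys x ≠ y,
$$|\{h \in H \mid h(x) = h(y)\}| \le \frac{|H|}{m}.$$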
Theorem
If we pick h u.a.r. from a universal family H and use it to hash n keys into a table T of size m, then for any given key x, the random variable Z_x counting the collisions of x with the other keys y in T satisfies
$$E[Z_x] \le n/m.$$
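Proof: let Z_{xy} be the indicator variable of the event h(x) = h(y), so that E[Z_{xy}] ≤ 1/m by universality. Then
$$E[Z_x] = E\Big[\sum_{y \in T - \{x\}} Z_{xy}\Big] = \sum_{y \in T - \{x\}} E[Z_{xy}] \le \sum_{y \in T - \{x\}} \frac{1}{m} = \frac{n-1}{m}. \qquad \Box$$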
1. $h_{a,b} : Z_p \to Z_m$.
2. |H| = p(p − 1). (We can select a in p − 1 ways and b in p ways.)
3. Specifying an h ∈ H requires O(lg p) = O(lg N) bits.
4. To choose h ∈ H, select a and b independently and u.a.r. from $Z_p^+$ and $Z_p$, respectively.
5. Evaluating h(x) is fast.
Theorem
The family H is universal.
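The universality claim can be checked exhaustively for small parameters. The formula for h_{a,b} does not appear in the text above, so this sketch assumes the standard Carter-Wegman form h_{a,b}(x) = ((a x + b) mod p) mod m; the values p = 17 and m = 5 are arbitrary illustrative choices.

```python
from itertools import combinations

def h(a, b, x, p, m):
    """Assumed form of the family: ((a*x + b) mod p) mod m."""
    return ((a * x + b) % p) % m

p, m = 17, 5
# a ranges over Z_p^+ (p - 1 choices), b over Z_p (p choices).
family = [(a, b) for a in range(1, p) for b in range(p)]

# For every pair of distinct keys, count the functions that make them collide.
worst = max(
    sum(1 for a, b in family if h(a, b, x, p, m) == h(a, b, y, p, m))
    for x, y in combinations(range(p), 2)
)
bound = len(family) / m   # |H| / m
```

If the assumed family is universal, `worst` never exceeds `bound` for any pair of keys.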
Create a one-bit hash table T[0, . . . , m − 1] and a hash function h.
Initially all m bits are set to 0.
Given a set S = {x_1, . . . , x_n}, define a hash function h mapping S into the positions of T.
For every x_i ∈ S, compute j = h(x_i) and set T[j] := 1.
Given a set S, a function h() and a table T[m]:

Insert (x)
  i = h(x)
  if T[i] == 0 then
    T[i] = 1
  end if

inS(y)
  i = h(y)
  if T[i] == 1 then
    return Yes
  else
    return No
  end if
Notice: once we have hashed S into T we can erase S.
False positives
[Figure: the keys x, y, z, u, w hashed into the bit table T, illustrating a false positive.]
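With n keys inserted using k hash functions into a table of m bits, $\Pr[T[i] = 1] = 1 - (1 - 1/m)^{kn} \approx 1 - e^{-kn/m}$.
A false positive needs all k probed bits to be 1, so its probability is $p = (1 - e^{-kn/m})^k$.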
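A small Bloom filter with k hash functions can be sketched as follows. The slides do not say how the k functions are obtained; here they are simulated, as an assumption, by salting SHA-256 with the index of the function. The class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    """m-bit table probed by k hash functions (simulated with salted SHA-256)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: the i-th hash function is SHA-256 of "i:item", reduced mod m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def in_s(self, item):
        # "Yes" may be a false positive; "No" is always correct.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=1000, k=7)
for word in ("x", "y", "z", "u"):
    bf.insert(word)
```

Inserted keys are always reported as present; a key that was never inserted is reported as present only if all k of its positions happen to be set.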
Asymptotic estimations for k and m
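To minimize the probability of a false positive, set dp/dk = 0.
Let f(k) = ln p; then $f(k) = k \ln(1 - e^{-kn/m})$
$$\Rightarrow f'(k) = \ln(1 - e^{-kn/m}) + \frac{kn\,e^{-kn/m}}{m(1 - e^{-kn/m})}.$$
Setting f'(k) = 0 gives
$$k_{opt} = \frac{m}{n}\ln 2 \approx \frac{9}{13}\,\frac{m}{n},$$
and the corresponding false positive probability is
$$p_0 = \left(1 - e^{-\frac{9}{13}\frac{m}{n}\frac{n}{m}}\right)^{\frac{9}{13}\frac{m}{n}} \sim \left(\frac{1}{2}\right)^{\frac{9m}{13n}} \approx 0.619^{m/n}.$$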
Asymptotic estimations for k and m
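Taking logarithms and solving for m:
$$-\frac{9}{13}\,\frac{m}{n}\ln 2 = \ln p \;\Rightarrow\; m = -\frac{13}{9 \ln 2}\, n \ln p \approx -2.083\, n \ln p.$$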
Therefore, to maintain a fixed false positive probability, the length
of the Bloom table must grow linearly with n.
Optimal number of hash functions
$$m = -\frac{n \ln p}{(\ln 2)^2}$$
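The two formulas can be turned into a small sizing helper. This is a sketch, assuming we round m up and k to the nearest positive integer; the function name bloom_parameters and the sample values n = 1000, p = 0.01 are illustrative choices.

```python
import math

def bloom_parameters(n, p):
    """Table size m and number of hash functions k for n keys and target false positive rate p."""
    m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))   # m = -n ln p / (ln 2)^2
    k = max(1, round((m / n) * math.log(2)))                # k_opt = (m / n) ln 2
    return m, k

def false_positive_rate(n, m, k):
    """p = (1 - e^{-kn/m})^k for the chosen parameters."""
    return (1 - math.exp(-k * n / m)) ** k

m, k = bloom_parameters(1000, 0.01)
p_achieved = false_positive_rate(1000, m, k)
m_double, _ = bloom_parameters(2000, 0.01)
```

Doubling n roughly doubles m, which is the linear growth stated above.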
Bloom filters are useful when a set of keys is used and space is
important.
• Packet routing: Bloom filters provide a means to speed up or simplify packet routing protocols.
• IP traceback
• A useful tool for measurement infrastructures that create data summaries in routers or other network devices.
One complication is that the cuckoo insertion may loop forever. The
probability of such an event is small. To handle it, choose an
upper bound on the number of slot exchanges and, if it is exceeded, do
a rehash: choose new functions and start again.
Example: We have {y, x, w, z, u}
h1(x) = 2; h1(y) = 2; h1(w) = 4; h1(z) = 4
h2(x) = 1; h2(y) = 1; h2(w) = 2; h2(z) = 2
Next we hash u: h1(u) = 4 and h2(u) = 2
[Figure: a table with slots 0 to 5 showing where the keys x, y, w, z and u are placed.]
If insertion gets into a cycle, we perform a rehash: choose new
h1 , h2 and insert all elements back into the table.
Cuckoo Hashing: An example
We wish to hash the set of keys (20, 50, 53, 75, 100, 67, 105, 3, 36, 39, 6)
using h1(k) = k mod 11 and h2(k) = ⌊k/11⌋ mod 11.
key    h1   h2
20      9    1
50      6    4
53      9    4
75      9    6
100     1    9
67      1    6
105     6    9
3       3    0
36      3    3
39      6    3
6       6    0

After inserting 20, 50, 53, 75 and 100:
slot   T1    T2
0      -     -
1      100   20
2      -     -
3      -     -
4      -     53
5      -     -
6      50    -
7      -     -
8      -     -
9      75    -
10     -     -
Cuckoo Hashing: An example
After inserting 67 and 105:
slot   T1    T2
0      -     -
1      67    20
2      -     -
3      -     -
4      -     50
5      -     -
6      105   75
7      -     -
8      -     -
9      53    100
10     -     -
Cuckoo Hashing: An example
After inserting 3 and 36:
slot   T1    T2
0      -     3
1      67    20
2      -     -
3      36    -
4      -     50
5      -     -
6      105   75
7      -     -
8      -     -
9      53    100
10     -     -
Cuckoo Hashing: An example
After inserting 39 (it first lands in slot 6 of T1 and is later displaced by 50):
slot   T1    T2
0      -     3
1      100   20
2      -     -
3      36    39
4      -     53
5      -     -
6      50    67
7      -     -
8      -     -
9      75    105
10     -     -
With 6 the insertion enters a cycle, so we have to rehash!
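The example above can be replayed with a short Python sketch. It assumes the two-table variant in which a new key always goes to T1 and each evicted key moves to its slot in the other table; the function name cuckoo_insert and the max_kicks bound are hypothetical choices.

```python
def cuckoo_insert(t1, t2, key, h1, h2, max_kicks=50):
    """Insert key; return None on success, or the key left without a slot
    once max_kicks rounds are exhausted (the caller should then rehash)."""
    cur = key
    for _ in range(max_kicks):
        i = h1(cur)
        cur, t1[i] = t1[i], cur      # put cur in T1, pick up the old occupant
        if cur is None:
            return None
        j = h2(cur)
        cur, t2[j] = t2[j], cur      # put the evicted key in T2, pick up its occupant
        if cur is None:
            return None
    return cur

h1 = lambda k: k % 11
h2 = lambda k: (k // 11) % 11

T1, T2 = [None] * 11, [None] * 11
for k in (20, 50, 53, 75, 100, 67, 105, 3, 36, 39):
    assert cuckoo_insert(T1, T2, k, h1, h2) is None

snapshot = (list(T1), list(T2))      # state before the problematic key
leftover = cuckoo_insert(T1, T2, 6, h1, h2)
```

With key 6, ten of the keys compete for only nine reachable slots (four in T1, five in T2), so no sequence of exchanges can succeed and the bound on exchanges is what triggers the rehash.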
Complexity
Search (TX, P)
for i = 1 to n − ℓ + 1 do
  if P[1, . . . , ℓ] = TX[i, . . . , i + ℓ − 1] then
    print P occurs at i
  end if
end for
P: A A A G
TX: A A A A A G A G T C
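The brute-force search on this example can be written as a short Python sketch. Python strings are 0-indexed, so the reported position is one less than in the 1-indexed pseudocode; the function name naive_search is a hypothetical choice.

```python
def naive_search(tx, p):
    """Return every 0-based position i where p occurs in tx, comparing window by window."""
    n, l = len(tx), len(p)
    return [i for i in range(n - l + 1) if tx[i:i + l] == p]

matches = naive_search("AAAAAGAGTC", "AAAG")
```

Each of the n − ℓ + 1 windows may cost up to ℓ character comparisons, which is what the hashing approach tries to avoid.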
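TX = 8 6 1 7 9 3 5 7 3 4 2 (positions 0 to 10), P = 1 7 9 3 5 (positions 0 to 4).
Windows of length 5: S0 = 86179, S1 = 61793, S2 = 17935, ..., S6 = 57342.
[Figure: h maps P and each window S_i into the table T; P = 17935 coincides with S2.]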
Brute force implementation of the algorithm
Knowing s_i, to get s_{i+1} we only have to deal with the element leaving the window (S_i[i]) and the element entering it (S_{i+1}[i + ℓ]):
$$h(s_{i+1}) = \Big(\big(h(s_i) - S_i[i]\cdot(10^{\ell-1} \bmod m)\big)\cdot 10 + S_{i+1}[i+\ell]\Big) \bmod m,$$
where h(s_i) is already known and $10^{\ell-1} \bmod m$ is precomputed.
[Figure: the windows S0 and S1 sliding over the digits 8 6 1 7 9 3 5 7, each mapped by h.]
Example: TX = 861793, m = 73.
Preprocess: h(86179) = 39 and 10^4 mod 73 = 72.
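The next step of this example can be checked numerically. The sketch below assumes h(s) = s mod m with windows read as decimal numbers and uses 10^{ℓ−1} mod m as the precomputed factor, as in the preprocessing line above; the variable names are illustrative.

```python
m = 73
l = 5
tx = [8, 6, 1, 7, 9, 3]

q = pow(10, l - 1, m)            # precomputed 10^(l-1) mod m
h0 = 86179 % m                   # hash of the first window S0

# Drop the leaving digit tx[0], shift by 10, add the entering digit tx[l].
h1 = ((h0 - tx[0] * q) * 10 + tx[l]) % m

direct = 61793 % m               # hash of S1 computed from scratch, for comparison
```

The rolling update and the direct computation agree, while the update itself only touches two digits.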
Karp-Rabin (TX, P)
h(p) = 0; h(s_0) = 0; q = 10^{ℓ−1} mod m
for j = 0 to ℓ − 1 do
  h(p) = (10 h(p) + P[j]) mod m
  h(s_0) = (10 h(s_0) + TX[j]) mod m
end for
for i = 0 to n − ℓ do
  if h(p) == h(s_i) then
    if P[0 . . . ℓ − 1] == TX[i . . . i + ℓ − 1] then
      return Match at i
    end if
  end if
  if i < n − ℓ then
    h(s_{i+1}) = (10(h(s_i) − TX[i] q) + TX[i + ℓ]) mod m
  end if
end for
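A runnable version of the procedure is sketched below for strings of decimal digits. It differs from the pseudocode in one labeled way: it collects all matches instead of returning at the first one. The default m = 73 is taken from the earlier example; the function name is a hypothetical choice.

```python
def karp_rabin(tx, p, m=73):
    """Return all 0-based positions where the digit string p occurs in tx."""
    n, l = len(tx), len(p)
    if l == 0 or l > n:
        return []
    q = pow(10, l - 1, m)                     # weight of the leaving digit
    hp = hs = 0
    for j in range(l):                        # hash of P and of the first window
        hp = (10 * hp + int(p[j])) % m
        hs = (10 * hs + int(tx[j])) % m
    matches = []
    for i in range(n - l + 1):
        # Equal hashes only suggest a match; confirm by comparing the strings.
        if hp == hs and tx[i:i + l] == p:
            matches.append(i)
        if i < n - l:                         # roll the window one position right
            hs = (10 * (hs - int(tx[i]) * q) + int(tx[i + l])) % m
    return matches

found = karp_rabin("86179357342", "17935")
```

The explicit string comparison is what protects against two different windows sharing a hash value.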
Complexity
[Figure: A encrypts the plaintext M with key K to obtain the ciphertext C; B decrypts C with key K to recover M.]
Public-Key Systems:
Diffie-Hellman
S = F is private and secret, P = F^{-1} is public. Knowing P does not help in discovering S.
M: A → B. E is an eavesdropper.
Public keys: P_A, P_B.
Secret keys: S_A, S_B.
Secret and public keys must have the following property: for any
person A we must have M = S_A(P_A(M)) = P_A(S_A(M)).
To send M: A → B,
(1.-) A gets P_B,
(2.-) A computes the ciphertext C = P_B(M),
(3.-) A sends C to B.
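The three steps can be illustrated with a toy key pair. The slides do not fix a concrete cryptosystem here, so this sketch assumes textbook RSA with tiny primes purely for illustration (it is not secure); all numeric values are arbitrary choices.

```python
# Toy key pair for B (assumed textbook RSA; tiny primes, illustration only).
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)
e = 17                         # public exponent
d = pow(e, -1, phi)            # secret exponent (modular inverse, Python 3.8+)

def P_B(x):
    """Public function of B: anyone can compute it."""
    return pow(x, e, n)

def S_B(x):
    """Secret function of B: requires the secret exponent d."""
    return pow(x, d, n)

M = 65                         # message encoded as a number smaller than n
C = P_B(M)                     # steps (1.-) and (2.-): A gets P_B and encrypts
decrypted = S_B(C)             # after step (3.-), B recovers M with S_B
```

The same pair also satisfies the required property in the other order, P_B(S_B(M)) = M.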
Digital signature
A sends to B a pair (M, σ) such that B knows that only A could have sent M.
σ is called the signature.
(1.-) A computes σ = S_A(M) and sends C = P_B(σ) to B,
(2.-) B recovers M = P_A(S_B(C)) (only A knows S_A, so only A
could have computed σ).
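The signature exchange can be traced with two toy key pairs. As an assumption not made in the slides, the sketch uses textbook RSA with tiny primes (illustration only, not secure), and B's modulus is chosen larger than A's so that σ fits into B's message space; make_keys is a hypothetical helper.

```python
def make_keys(p, q, e):
    """Return (public, secret) functions of a toy RSA key pair."""
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)                      # secret exponent (Python 3.8+)
    public = lambda x: pow(x, e, n)
    secret = lambda x: pow(x, d, n)
    return public, secret

P_A, S_A = make_keys(61, 53, 17)             # modulus 3233
P_B, S_B = make_keys(89, 97, 5)              # modulus 8633, larger than A's

M = 1234                                     # message, smaller than A's modulus
sigma = S_A(M)                               # (1.-) A signs with its secret key
C = P_B(sigma)                               #       and encrypts the signature for B
recovered = P_A(S_B(C))                      # (2.-) B decrypts with S_B, then applies P_A
```

B first undoes its own public function with S_B to obtain σ, and only then applies P_A; the order matters.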
Applications: Cryptographic hash functions
[Figure: a cryptographic hash function h maps each input text to a fixed-length hexadecimal digest.
Input "Hola" → DFCD3454 BBEA788A 751A696C 24D97009 CA992D17
Input "Setze jutges d’un jutjat mengen el fetge d’un penjat, si el penjat es despenges es menjaria els setze fetges dels setze jutges que l’han jutjat" → 46042841 935C7F80 9158585A B94AE241 26EB3CEA]
Cryptographic hash
Cryptographic hash functions take a string of any length as input and
output a fixed-length hash value, usually written in hexadecimal.
Hexadecimal = radix 16:
$(4CF5)_{16} = 4 \times 16^3 + 12 \times 16^2 + 15 \times 16^1 + 5 \times 16^0 = 19701$
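Both points can be checked with Python's standard library. This is a small sketch using hashlib's SHA-1; the two input strings are taken from the figure purely as sample inputs of different lengths.

```python
import hashlib

# Radix-16 conversion of the example above.
value = int("4CF5", 16)

# Inputs of very different lengths produce digests of the same length.
short_digest = hashlib.sha1("Hola".encode("utf-8")).hexdigest()
long_digest = hashlib.sha1(
    "Setze jutges d'un jutjat mengen el fetge d'un penjat".encode("utf-8")
).hexdigest()
```

Each digest has 40 hexadecimal digits, i.e. 160 bits, regardless of the input length.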
https://www.sha1-online.com
Applications: Message Digest
Direct application of the Collision resistance property:
Alice wants to upload a very large document to a Dropbox-like
repository.
She wants to be sure that when she downloads the document it is
exactly the same document.
An adversary wants to substitute Alice’s document with a forged
one.
Applications: Message Digest
Alice appends a crypto hash h of the document to the stored
document and keeps a copy of the (very short) hash digest for herself.
The adversary has access to h, but as soon as he tampers with the
document, the digest of the document will differ from the
one appended to the original document.
When Alice retrieves the document, she just has to compare the
digest of the document with the copy that she kept.
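Alice's check can be sketched in a few lines. The text does not say which hash function she uses, so SHA-256 from Python's hashlib is assumed here; the document contents and the function name digest are made up for illustration.

```python
import hashlib

def digest(document: bytes) -> str:
    """Hexadecimal digest of a document (SHA-256 assumed for this sketch)."""
    return hashlib.sha256(document).hexdigest()

original = b"Alice's very large document"
kept = digest(original)                  # the short copy Alice keeps for herself

retrieved_ok = b"Alice's very large document"
retrieved_forged = b"Alice's very large d0cument"   # a single character changed

ok = digest(retrieved_ok) == kept
forged_detected = digest(retrieved_forged) != kept
```

Alice only needs to store the digest, not the document, to detect the substitution.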
[Figure: the crypto hash h applied to the documents in the repository, yielding hexadecimal digests.]
Applications: Password verification
[Figure: Alice and Bob each apply the crypto hash h and compare the resulting digests; the hash value is public while the text stays private.]
How to construct a secure h_c: Merkle’s scheme
Assume M is the message whose crypto hash we want to compute,
and that M is already in binary.
• The input message M is partitioned into L blocks, each of
exactly m bits.
• For extra security, the last block includes the total length
of the message whose hash is to be computed.
• The scheme has L sequential stages, one for each block.
• Stage i takes as input the m bits of the i-th block and
the n-bit output of the previous stage. Stage 1 is provided
with an n-bit vector, the Initialization Vector (IV).
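The staged construction can be sketched with a toy compression function. Everything concrete here is an assumption for illustration: blocks of m = 64 bits, a chaining value of n = 32 bits, an all-zero IV, zero padding followed by a length block, and a stand-in f built from truncated SHA-256. It is not a secure hash.

```python
import hashlib

BLOCK = 8                 # m = 64 bits per block (toy size)
STATE = 4                 # n = 32 bits of chaining value (toy size)
IV = bytes(STATE)         # all-zero Initialization Vector (assumed)

def f(state: bytes, block: bytes) -> bytes:
    """Toy compression function: (n bits, m bits) -> n bits. Stand-in only."""
    return hashlib.sha256(state + block).digest()[:STATE]

def pad(message: bytes) -> bytes:
    """Zero-pad to a multiple of BLOCK, then add a last block with the length in bits."""
    padded = message + bytes(-len(message) % BLOCK)
    return padded + (8 * len(message)).to_bytes(BLOCK, "big")

def toy_hash(message: bytes) -> str:
    state = IV
    data = pad(message)
    for i in range(0, len(data), BLOCK):      # one stage per block
        state = f(state, data[i:i + BLOCK])
    return state.hex()                        # output of the last stage

h_abc = toy_hash(b"abc")
h_abd = toy_hash(b"abd")
```

Including the length in the last block keeps messages that differ only in trailing zero bytes from being padded to the same block sequence.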
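[Figure: M is split into blocks B1, B2, B3, ..., BL of m bits each; starting from the n-bit IV, the function f is applied at each stage to the current block and the previous n-bit output; the last n-bit output is the hash.]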
Given a message of size < 2^64 bits, SHA-1 produces a message
digest (hash output) of exactly 160 bits.
Toy example: M = abc, so M = 01100001 01100010 01100011 (a, b, c: 24 bits).
The whole padding to form a block:
01100001 01100010 01100011 | 1 | 00···0 (423 zeros) | 00···011000 (64-bit length field)
24 + 1 + 423 + 64 = 512 bits.
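The padded block for M = abc can be rebuilt and inspected in Python. This is a sketch of the padding step only (not of the compression function); the function name sha1_pad is a hypothetical choice, and hashlib is used just to show the length of the final digest.

```python
import hashlib

def sha1_pad(message: bytes) -> bytes:
    """Append a 1 bit, zero bits, and the 64-bit message length (in bits)."""
    bit_len = 8 * len(message)
    padded = message + b"\x80"                   # the single 1 bit, then seven 0 bits
    padded += bytes(-(len(padded) + 8) % 64)     # zero bytes, leaving room for the length
    return padded + bit_len.to_bytes(8, "big")   # 64-bit big-endian length field

block = sha1_pad(b"abc")
bits = "".join(f"{byte:08b}" for byte in block)

message_bits = bits[:24]
one_bit = bits[24]
zero_run = bits[25:448]
length_field = bits[448:]

digest = hashlib.sha1(b"abc").hexdigest()        # 40 hex digits = 160 bits
```

The four slices correspond to the four parts of the block shown above: message, the 1 bit, the run of zeros, and the length field.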
[Figure: starting from the Initial Vector, the five words a, b, c, d, e go through rounds 0 to 79, transforming x(j−1) into xj.]
Given an input M, SHA-1 yields a 160-bit crypto hash of M by:
1. Padding M to a binary string whose length is a multiple of 512, and
partitioning it into L blocks of 512 bits.
2. Computing in cascade fashion the compression function for
each B_j: it takes as input the 5-word hash buffer coming from B_{j−1}
and the block B_j itself, and returns the new values a||b||c||d||e,
with total length 160 bits, which will be part of the input for
the computation on B_{j+1}.
3. The output for the last block is the message digest, i.e. the
crypto hash of M.
Security of the SHA family
Blockchain: a linked list data structure where the links are hash
pointers.
It is a convenient data structure for implementing decentralized consensus:
authority and trust are transferred to a decentralized virtual
network whose nodes sequentially record transactions
on a public block, creating a unique blockchain.
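A linked list with hash pointers can be sketched as follows. This is only an illustration of the data structure, with no consensus or networking; SHA-256, the dictionary layout of a block and the all-zero starting pointer are assumptions.

```python
import hashlib

GENESIS = "0" * 64        # assumed pointer value for the first block

def block_hash(data: str, prev_hash: str) -> str:
    """Hash of a block: covers its data and the hash pointer to its predecessor."""
    return hashlib.sha256((prev_hash + data).encode()).hexdigest()

def build_chain(documents):
    chain, prev = [], GENESIS
    for data in documents:
        h = block_hash(data, prev)
        chain.append({"data": data, "prev_hash": prev, "hash": h})
        prev = h
    return chain

def is_valid(chain) -> bool:
    """Recompute every hash pointer; any tampering breaks a link."""
    prev = GENESIS
    for block in chain:
        if block["prev_hash"] != prev or block["hash"] != block_hash(block["data"], prev):
            return False
        prev = block["hash"]
    return True

chain = build_chain(["D1", "D2", "D3", "D4"])
valid_before = is_valid(chain)
chain[1]["data"] = "forged"          # tamper with one block
valid_after = is_valid(chain)
```

Because each hash covers the previous hash pointer, changing one block invalidates the check unless every later block is recomputed as well.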
Blockchain
[Figure: documents D1, D2, D3, D4.]
Properties of Merkle’s trees