Hash Tables#

TL;DR

A hash table is a data structure that provides efficient insertion, deletion, and lookup operations on key-value pairs. It works by using a hash function to map each key to a position in an array, called the hash table, where the corresponding value is stored.

The hash function takes a key as input and generates a hash code, which is used to determine the position in the array where the value should be stored. Ideally, the hash function should distribute the keys uniformly across the hash table, to minimize collisions (i.e., when two or more keys map to the same position in the array).

To handle collisions, a hash table typically uses a collision resolution strategy, such as chaining or open addressing. Chaining involves storing all the values that hash to the same position in a linked list, while open addressing involves finding the next available position in the array to store the value.

One of the advantages of hash tables is their speed. In the average case, insertion, deletion, and lookup take constant time, O(1): the time per operation does not grow with the number of entries stored, provided the load factor stays bounded. This makes hash tables ideal for applications where fast lookup and insertion times are important.

Overall, hash tables are an important data structure in computer science, and are widely used in applications such as databases, compilers, and web servers. However, the efficiency of hash tables depends on the quality of the hash function used, and collisions can still occur, which can degrade performance.

Additional Resources

Storing data#

../../_images/18_00.png

Fig. 66 Summary Table#

Hash Tables#

  • implements an associative array or dictionary

  • an abstract data type that maps keys to values

  • uses a hash function to compute an index, also called a hashcode

  • at lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.

https://s3.ap-south-1.amazonaws.com/s3.studytonight.com/tutorials/uploads/pictures/1604593128-76844.png

Fig. 67 Hash Table#
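
As a concrete point of reference, Python's built-in dict is itself a hash table, so the associative-array operations listed above can be tried directly. This is only a minimal sketch; the names and values are made up for illustration.

```python
# Python's built-in dict is a hash table: keys are hashed to locate their values.
grades = {}                      # empty associative array (illustrative data)
grades["alice"] = 91             # insert
grades["bob"] = 78               # insert
grades["bob"] = 82               # inserting an existing key overwrites its value

print(grades["alice"])           # lookup by key
print("carol" in grades)         # membership test for a key that was never inserted

del grades["alice"]              # delete
print(len(grades))               # number of stored key-value pairs
```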

Why not…#

Sorted array
https://miro.medium.com/max/970/1*f2oDQ0cdY54olxCFOIMIdQ.png
  • Search O(log n)

  • Insert/Delete, much more costly

Balanced BST
https://media.geeksforgeeks.org/wp-content/cdn-uploads/BinaryTree3-300x188.png
  • Guaranteed O(log n)

Direct-address table
https://www.kindsonthegenius.com/wp-content/uploads/2020/09/Direct-Address-Table-1.jpg
  • Best-case O(1)

  • Practical limitations

    • Extra space

    • An integer type in a programming language may not be able to hold an n-digit key

    • Therefore, not always a viable option
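
The space limitation is easier to see in code. The sketch below is a hypothetical direct-address table (the class name `DirectAddressTable` and its methods are invented for illustration): the key is used directly as the array index, so every operation is O(1), but the array needs one slot per possible key.

```python
class DirectAddressTable:
    """Direct addressing: the key itself is the array index (no hash function)."""

    def __init__(self, universe_size):
        # One slot per possible key, whether or not that key is ever used.
        self.slots = [None] * universe_size

    def insert(self, key, value):
        self.slots[key] = value

    def search(self, key):
        return self.slots[key]

    def delete(self, key):
        self.slots[key] = None


table = DirectAddressTable(universe_size=100)   # keys limited to 0..99
table.insert(42, "answer")
print(table.search(42))
print(table.search(7))     # None: the slot exists but is empty

# For 9-digit keys the table would need a slot for every possible key:
print(10**9, "slots needed for 9-digit keys")
```

A hash table avoids this cost by mapping the large key universe onto a much smaller array.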

Hash Functions#

  • a function converting a piece of data into a smaller, more practical integer

  • the integer value is used as the index, between 0 and m − 1, for the data in the hash table

  • ideally, maps all keys to a unique slot index in the table

  • perfect hash functions may be difficult, but not impossible to create

https://www.vladimircicovic.com/content/images/20200502181417-hash_function.jpg
Properties of good hash functions
  • Efficiently computable

  • Should uniformly distribute the keys (each table position equally likely for each)

  • Should minimize collisions

  • Should have a low load factor (number of items in the table / table size)
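
To make these properties concrete, here is a small sketch of a string hash function and a load-factor helper. The polynomial scheme and the multiplier 31 are assumptions chosen for illustration, not something prescribed by these notes, and the names `string_hash` and `load_factor` are hypothetical.

```python
def string_hash(key, m):
    """Map a string to a slot index in 0..m-1 using a polynomial rolling hash."""
    h = 0
    for ch in key:
        # 31 is a commonly used multiplier; it is an assumption here.
        h = (h * 31 + ord(ch)) % m
    return h


def load_factor(num_items, table_size):
    """Load factor = number of items in the table / table size."""
    return num_items / table_size


m = 11
for word in ["apple", "banana", "cherry"]:
    print(word, "->", string_hash(word, m))

print(load_factor(3, m))
```

The function is cheap to compute and always returns an index in range; how uniformly it spreads keys depends on the data and on the choice of m.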

Modular Hashing#

To distribute hashes uniformly, hash functions may use heuristic techniques based on division or multiplication.
Legend

h = hash function
x = key
HT = hash table
m = table size
b = buckets
r = items per bucket

Rules
$0 \le h(x) < m$, or $0 \le h(x) \le b - 1$
Entry lookup: $HT[h(x)]$
Syntax
$h(x) = x \bmod m = x \% m$
Example

Suppose there are six students, $a_1, a_2, a_3, a_4, a_5, a_6$, in the Data Structures class, and their IDs are:

$a_1$: 197354863
$a_2$: 933185952
$a_3$: 132489973
$a_4$: 134152056
$a_5$: 216500306
$a_6$: 106500306
Hashing
$h : \{k_1, k_2, k_3, k_4, k_5, k_6\} \to \{0, 1, 2, \ldots, 12\}$ by $h(k_i) = k_i \% 13$

$h(k_1) = 197354863 \% 13 = 4$
$h(k_2) = 933185952 \% 13 = 10$
$h(k_3) = 132489973 \% 13 = 5$
$h(k_4) = 134152056 \% 13 = 12$
$h(k_5) = 216500306 \% 13 = 9$
$h(k_6) = 106500306 \% 13 = 3$
Outcome

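Suppose $HT[b] \leftarrow a$:

$HT[4] \leftarrow 197354863$
$HT[5] \leftarrow 132489973$
$HT[9] \leftarrow 216500306$
$HT[10] \leftarrow 933185952$
$HT[12] \leftarrow 134152056$
$HT[3] \leftarrow 106500306$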
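
The student-ID example can be reproduced with a few lines of Python. This is a sketch only: the dictionary `ids` and the list `HT` are illustrative names, and the table stores just the ID in each slot.

```python
# Student IDs from the example, keyed by student label.
ids = {
    "a1": 197354863,
    "a2": 933185952,
    "a3": 132489973,
    "a4": 134152056,
    "a5": 216500306,
    "a6": 106500306,
}

m = 13                      # table size


def h(x):
    """Modular hash: h(x) = x % m."""
    return x % m


HT = [None] * m             # empty hash table with m slots
for student, key in ids.items():
    HT[h(key)] = key        # no collisions occur for these six keys
    print(f"h({key}) = {h(key)}")

print(HT)
```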

Uniform Hashing#

Assumption#

Any key is equally likely (and independent of other keys) to hash to one of m possible indices

Bins and Balls#

Toss n balls uniformly at random into m bins

Bad News [birthday problem]#

In a random group of 23 people, it is more likely than not that two people share the same birthday.
Expect two balls in the same bin after $\sqrt{\pi m / 2}$ tosses ($\approx 23.9$ when $m = 365$).
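
Both birthday-problem figures can be checked numerically. The sketch below evaluates the $\sqrt{\pi m / 2}$ approximation and computes the exact probability of a shared bin; the function names are hypothetical.

```python
import math


def expected_tosses_until_collision(m):
    """Approximate number of tosses before two balls share a bin: sqrt(pi * m / 2)."""
    return math.sqrt(math.pi * m / 2)


def prob_shared_bin(n, m=365):
    """Exact probability that at least two of n balls land in the same one of m bins."""
    p_all_distinct = 1.0
    for i in range(n):
        # The (i+1)-th ball must avoid the i bins that are already occupied.
        p_all_distinct *= (m - i) / m
    return 1.0 - p_all_distinct


print(round(expected_tosses_until_collision(365), 1))   # about 23.9, as stated above
print(prob_shared_bin(22))                               # just under one half
print(prob_shared_bin(23))                               # just over one half
```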

Good News#

When $n \gg m$, expect most bins to have $n/m$ balls.
When $n = m$, expect the most loaded bin to have $\ln n / \ln \ln n$ balls.

../../_images/18_04.png
../../_images/18_05.png

Collisions#

A collision occurs when two distinct keys hash to the same index.

  • Birthday problem: can't avoid collisions

  • Load balancing: no index gets too many collisions, so it is OK to scan through all colliding keys

https://www.log2base2.com/images/algo/hash-collision.png

Fig. 68 collision#

Separate Chaining#

  • keeps a list of all elements that hash to the same value

Performance

m = Number of slots in hash table
n = Number of keys to be inserted in hash table

Load factor α=n/m
Expected time to search or delete = O(1+α)

Time to insert = O(1)
Time complexity of search, insert, and delete is O(1) if α is O(1)

Example
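Keys: 0, 81, 64, 25, 36, 49, 1, 4, 16, 9, with $h(k) = k \% 10$

$h(k_1) = 0 \% 10 = 0$, so $HT[0] \leftarrow 0$
$h(k_2) = 81 \% 10 = 1$, so $HT[1] \leftarrow 81$
$h(k_3) = 64 \% 10 = 4$, so $HT[4] \leftarrow 64$
$h(k_4) = 25 \% 10 = 5$, so $HT[5] \leftarrow 25$
$h(k_5) = 36 \% 10 = 6$, so $HT[6] \leftarrow 36$
$h(k_6) = 49 \% 10 = 9$, so $HT[9] \leftarrow 49$
$h(k_7) = 1 \% 10 = 1$, so $HT[1] \leftarrow 1$
$h(k_8) = 4 \% 10 = 4$, so $HT[4] \leftarrow 4$
$h(k_9) = 16 \% 10 = 6$, so $HT[6] \leftarrow 16$
$h(k_{10}) = 9 \% 10 = 9$, so $HT[9] \leftarrow 9$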

../../_images/18_06.png

Fig. 69 A separate chaining hash table#
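
A separate chaining table for the example keys might be sketched as follows. The class name `ChainedHashTable` is hypothetical, and each chain is a Python list rather than a linked list, which is a simplifying assumption. New keys are appended to the end of their chain; the figure may order chains differently.

```python
class ChainedHashTable:
    """Separate chaining with h(k) = k % m; each slot holds a list (chain) of keys."""

    def __init__(self, m):
        self.m = m
        self.buckets = [[] for _ in range(m)]
        self.n = 0                      # number of keys stored

    def _index(self, key):
        return key % self.m

    def insert(self, key):
        chain = self.buckets[self._index(key)]
        # The duplicate check scans the chain; without it, insert is O(1).
        if key not in chain:
            chain.append(key)
            self.n += 1

    def search(self, key):
        # Expected O(1 + alpha): hash once, then scan one chain.
        return key in self.buckets[self._index(key)]

    def delete(self, key):
        chain = self.buckets[self._index(key)]
        if key in chain:
            chain.remove(key)
            self.n -= 1

    def load_factor(self):
        return self.n / self.m          # alpha = n / m


table = ChainedHashTable(10)
for k in [0, 81, 64, 25, 36, 49, 1, 4, 16, 9]:
    table.insert(k)

for i, chain in enumerate(table.buckets):
    print(i, chain)
print("load factor:", table.load_factor())
```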

Advantages / Disadvantages??
Advantages
  • Simple to implement.

  • Hash table never fills up, we can always add more elements to the chain.

  • Less sensitive to the hash function or load factors.

  • It is mostly used when it is unknown how many and how frequently keys may be inserted or deleted.

Disadvantages
  • The cache performance of chaining is not good as keys are stored using a linked list. Open addressing provides better cache performance as everything is stored in the same table.

  • Wasted space (some slots of the hash table are never used)

  • If the chain becomes long, then search time can become O(n) in the worst case

  • Uses extra space for links

Open Addressing#

Linear Probing
  • on a collision, probes the following slots one at a time until an empty slot is found

Rule

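$h_i(x) = (\text{Hash}(x) + i) \% \text{HashTableSize}$

Try $h_0(x) = (\text{Hash}(x) + 0) \% \text{HashTableSize}$ first.
If that slot is full, try $h_1(x) = (\text{Hash}(x) + 1) \% \text{HashTableSize}$.
If that slot is also full, try $h_2(x) = (\text{Hash}(x) + 2) \% \text{HashTableSize}$.
… and so on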

https://media.geeksforgeeks.org/wp-content/cdn-uploads/gq/2015/08/openAddressing1.png
Example
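Keys: {50, 700, 76, 85, 92, 73, 101}, table size 7

$h_0(50) = 50 \% 7 = 1$
$h_0(700) = 700 \% 7 = 0$
$h_0(76) = 76 \% 7 = 6$
$h_0(85) = 85 \% 7 = 1$
$h_1(85) = (85 + 1) \% 7 = 2$
$h_0(92) = 92 \% 7 = 1$
$h_1(92) = (92 + 1) \% 7 = 2$
$h_2(92) = (92 + 2) \% 7 = 3$
$h_0(73) = 73 \% 7 = 3$
$h_1(73) = (73 + 1) \% 7 = 4$
$h_0(101) = 101 \% 7 = 3$
$h_1(101) = (101 + 1) \% 7 = 4$
$h_2(101) = (101 + 2) \% 7 = 5$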
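
The linear probing walk-through above can be sketched in code. The function name `linear_probe_insert` is hypothetical, and Hash(x) is taken to be x itself, as in the worked example.

```python
def linear_probe_insert(table, key):
    """Insert key using h_i(x) = (Hash(x) + i) % HashTableSize, with Hash(x) = x."""
    m = len(table)
    for i in range(m):
        slot = (key + i) % m
        if table[slot] is None:         # first empty slot in the probe sequence
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    slot = linear_probe_insert(table, key)
    print(f"{key} -> slot {slot}")

print(table)
```

Every slot ends up occupied, and the run of neighbouring slots 1 to 5 filled by colliding keys illustrates the clustering that linear probing is prone to.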

Quadratic Probing#

Rule
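$h_i(x) = (\text{Hash}(x) + i^2) \% \text{HashTableSize} = (\text{Hash}(x) + i \cdot i) \% \text{HashTableSize}$

Try $h_0(x) = (\text{Hash}(x) + 0 \cdot 0) \% \text{HashTableSize}$ first.
If that slot is full, try $h_1(x) = (\text{Hash}(x) + 1 \cdot 1) \% \text{HashTableSize}$.
If that slot is also full, try $h_2(x) = (\text{Hash}(x) + 2 \cdot 2) \% \text{HashTableSize}$.
… and so on while $h_i$ is already full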
Example
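Keys: {50, 700, 76, 85, 92, 73, 101}, table size 7

$h_0(50) = 50 \% 7 = 1$
$h_0(700) = 700 \% 7 = 0$
$h_0(76) = 76 \% 7 = 6$
$h_0(85) = 85 \% 7 = 1$
$h_1(85) = (85 + 1 \cdot 1) \% 7 = 2$
$h_0(92) = 92 \% 7 = 1$
$h_1(92) = (92 + 1 \cdot 1) \% 7 = 2$
$h_2(92) = (92 + 2 \cdot 2) \% 7 = 5$
$h_0(73) = 73 \% 7 = 3$
$h_0(101) = 101 \% 7 = 3$
$h_1(101) = (101 + 1 \cdot 1) \% 7 = 4$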
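
The same keys can be inserted with quadratic probing as a sketch. The function name `quadratic_probe_insert` is hypothetical, and Hash(x) is again taken to be x. The number of attempts is capped at the table size, since a quadratic probe sequence is not guaranteed to visit every slot.

```python
def quadratic_probe_insert(table, key):
    """Insert key using h_i(x) = (Hash(x) + i*i) % HashTableSize, with Hash(x) = x."""
    m = len(table)
    for i in range(m):                  # cap the number of probes at m
        slot = (key + i * i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no empty slot found in the probe sequence")


table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    slot = quadratic_probe_insert(table, key)
    print(f"{key} -> slot {slot}")

print(table)
```

Compared with linear probing on the same keys, 92 jumps to slot 5 instead of extending the run at slots 1 to 3.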

Double Hashing#

Rule

$h_a(x) = \text{Hash}_1(x) \% \text{HashTableSize}$
$h_b(x) = \text{Hash}_2(x) \% \text{HashTableSize}$
$h(k, i) = [h_a(k) + i \cdot h_b(k)] \% \text{HashTableSize}$

Example

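Keys: {50, 700, 76, 85, 92, 73, 101}, table size 7. In this example both $h_a(x)$ and $h_b(x)$ are computed as $x \% 7$.

$h_0(50) = 50 \% 7 = 1$
$h_0(700) = 700 \% 7 = 0$
$h_0(76) = 76 \% 7 = 6$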

Arr[0]  Arr[1]  Arr[2]  Arr[3]  Arr[4]  Arr[5]  Arr[6]
700     50      -       -       -       -       76

$h_0(85) = 85 \% 7 = 1$
$h_1(85) = [h_a(85) + i \cdot h_b(85)] \% 7 = [1 + 1 \cdot (85 \% 7)] \% 7 = [1 + 1 \cdot 1] \% 7 = 2$

Arr[0]  Arr[1]  Arr[2]  Arr[3]  Arr[4]  Arr[5]  Arr[6]
700     50      85      -       -       -       76

$h_0(92) = 92 \% 7 = 1$
$h_1(92) = [h_a(92) + i \cdot h_b(92)] \% 7 = [1 + 1 \cdot (92 \% 7)] \% 7 = [1 + 1 \cdot 1] \% 7 = 2$
$h_2(92) = [h_a(92) + i \cdot h_b(92)] \% 7 = [1 + 2 \cdot (92 \% 7)] \% 7 = [1 + 2 \cdot 1] \% 7 = 3$

Arr[0]  Arr[1]  Arr[2]  Arr[3]  Arr[4]  Arr[5]  Arr[6]
700     50      85      92      -       -       76

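$h_0(73) = 73 \% 7 = 3$
$h_1(73) = [h_a(73) + i \cdot h_b(73)] \% 7 = [3 + 1 \cdot (73 \% 7)] \% 7 = [3 + 1 \cdot 3] \% 7 = 6$
$h_2(73) = [h_a(73) + i \cdot h_b(73)] \% 7 = [3 + 2 \cdot (73 \% 7)] \% 7 = [3 + 2 \cdot 3] \% 7 = 2$
$h_3(73) = [h_a(73) + i \cdot h_b(73)] \% 7 = [3 + 3 \cdot (73 \% 7)] \% 7 = [3 + 3 \cdot 3] \% 7 = 5$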

Arr[0]  Arr[1]  Arr[2]  Arr[3]  Arr[4]  Arr[5]  Arr[6]
700     50      85      92      -       73      76

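$h_0(101) = 101 \% 7 = 3$
$h_1(101) = [h_a(101) + i \cdot h_b(101)] \% 7 = [3 + 1 \cdot (101 \% 7)] \% 7 = [3 + 1 \cdot 3] \% 7 = 6$
$h_2(101) = [h_a(101) + i \cdot h_b(101)] \% 7 = [3 + 2 \cdot (101 \% 7)] \% 7 = [3 + 2 \cdot 3] \% 7 = 2$
$h_3(101) = [h_a(101) + i \cdot h_b(101)] \% 7 = [3 + 3 \cdot (101 \% 7)] \% 7 = [3 + 3 \cdot 3] \% 7 = 5$
$h_4(101) = [h_a(101) + i \cdot h_b(101)] \% 7 = [3 + 4 \cdot (101 \% 7)] \% 7 = [3 + 4 \cdot 3] \% 7 = 1$
$h_5(101) = [h_a(101) + i \cdot h_b(101)] \% 7 = [3 + 5 \cdot (101 \% 7)] \% 7 = [3 + 5 \cdot 3] \% 7 = 4$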

Arr[0]  Arr[1]  Arr[2]  Arr[3]  Arr[4]  Arr[5]  Arr[6]
700     50      85      92      101     73      76
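
The double hashing trace can be replayed with a short sketch. The function name `double_hash_insert` is hypothetical. To match the worked example, both hashes are computed as x % 7; that choice is kept only for fidelity to the example, because a second hash that evaluates to 0 (as it does for 700) never advances the probe sequence. Practical implementations usually pick a second hash that is never 0.

```python
def double_hash_insert(table, key):
    """Insert key using h(k, i) = [h_a(k) + i * h_b(k)] % HashTableSize."""
    m = len(table)
    h_a = key % m       # first hash, as in the example
    h_b = key % m       # second hash; the example reuses x % 7 (kept for fidelity)
    for i in range(m):  # cap the number of probes at m
        slot = (h_a + i * h_b) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no empty slot found in the probe sequence")


table = [None] * 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    slot = double_hash_insert(table, key)
    print(f"{key} -> slot {slot}")

print(table)
```

The step size differs from key to key (1 for 85 and 92, 3 for 73 and 101), which is what keeps colliding keys from following the same probe sequence.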

Comparison#

Linear Probing
  • Easy to implement

  • Best cache performance

  • Suffers from clustering

Quadratic Probing
  • Average cache performance

  • Suffers less from clustering

Double Hashing
  • Poor cache performance

  • No clustering

  • Requires more computation time