Taking Hash Tables Off The Shelf

These are my reading notes on "Taking Hash Tables Off The Shelf" and "Hashing Out Hash Functions" from Vaidehi Joshi's baseCS series on Medium, a column that uses charming comics to vividly explain computer science fundamentals, data structures, algorithms, and theory as applied in practice.


Part 1. Intro

  • It’s hard to understand breadth first search if you don’t already know Big O notation. And Big O notation doesn’t mean anything to you unless you’ve seen it in action, in the context of an algorithm.
  • Metaphors can be a great way of learning through analogy, and they’re useful in illustrating otherwise complicated concepts.

Books on books on books

We can think of our book collection as a dataset

Imagine a library as a data structure

  • array or linked list: O(n) search time
  • sorted array using binary search: O(log n) search time
  • still not the most efficient way to find a book

Mappings to make our lives easier

What exactly is a hash table

  • Hash tables: an array + a hash function
  • what makes them so efficient for our purposes is that they create a mapping, which is a relationship between two sets of data.
  • Hash tables are made up of sets of pairs: a key and a value, often referred to as (k, v). The idea here is that if you have the key in one set of data, you can find the value that corresponds to it in another set of data.
  • In the case of a hash table, it is the hash function that creates the mapping. Since we’re working with arrays, the key is the index of the array, and the value is the data that lives at that index.
  • The hash function’s job is to take an item in as input, and calculate and return the key for where that item should live in the hash table.

Example 1

Hash buckets

Our hash table has a size of 12, so every single book needs to be stored somewhere within these twelve slots.

A hash table with a size of 12

Code in JS

function bookHashing(bookTitle, hashTableSize) {
  // Remove any spaces from the book title.
  var strippedBookTitle = bookTitle.replace(/\s/g, '');
  // Divide the length of the stripped title by the hash table size
  // and return the remainder.
  return strippedBookTitle.length % hashTableSize;
}

bookHashing("The Grapes of Wrath", 12);
// 4
bookHashing("The Sound and the Fury", 12);
// 6

Collision

A collision occurs when two elements are supposed to be inserted at the same place in an array.


  • For starters, we can be sure that no hash function will always return a unique hash bucket value for every item we give it.
  • In fact, a hash function will always map multiple elements to the same output (hash bucket), because the size of our dataset will usually be larger than the size of our hash table.
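We can see a collision concretely with the bookHashing function from above; the second title here is my own illustrative pick, not from the original article, chosen so its stripped length also leaves a remainder of 6 when divided by 12:

```javascript
// Same hash function as in the example above.
function bookHashing(bookTitle, hashTableSize) {
  var strippedBookTitle = bookTitle.replace(/\s/g, '');
  return strippedBookTitle.length % hashTableSize;
}

// "TheSoundandtheFury" and "ToKillaMockingbird" are both 18 characters
// long, so both titles land in hash bucket 18 % 12 = 6: a collision!
bookHashing("The Sound and the Fury", 12); // 6
bookHashing("To Kill a Mockingbird", 12);  // 6
```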

Super cool constant time

Practical context

Hash functions that lean on alphabetization always result in 26 hash buckets.


  • An equal distribution of data amongst all the possible hash buckets.

Efficiency

Hash tables leverage constant time search, which is what makes them so powerful.


  • It also only takes constant time to insert an item into or delete an item from a hash table, because we always lean on our hash function to determine where to add or remove something!
  • Libraries around the world use the Dewey Decimal Classification System, which is really just a hash table with 10 buckets (and, of course, a complex hash function that I still don’t quite understand).

Maggie Appleton


Part 2. Intro

  • Everything has its drawbacks.
  • Hash tables are really great options for storing and retrieving specific data quickly. But they’re not always the best tool for the job; for example, they’re not great for finding ordered data.
  • Hash tables also intrinsically have some issues of their own, and the usefulness of a hash table is tied directly to its hash function.

Hash table, are you any good?

a good hash table

  • hinges on how good a hash function the hash table has.

What makes a good hash table


Collisions occur whenever a hash function generates the same key for two elements


Collision resolution tactics

  • There are a handful of ways to handle collisions in a hash function, and the important thing to remember is that none of them is necessarily the “right” tactic to use.
  • It all depends on your dataset, the size of your hash table, and what operations you know you’ll want to perform on the table later on.

Linear Probing

  • looking for the next empty hash bucket nearby.
  • This is a kind of rehashing, and this technique is known as linear probing.

The interesting thing about linear probing is that if the next hash bucket is also filled by an element, the hash function will just continue to probe through the hash table until it finds an empty bucket, cycling back if necessary.
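The probing loop can be sketched roughly as follows; the hash function, table size, and names here are illustrative assumptions, not from the article:

```javascript
// Toy hash function: key length modulo table size.
function hash(key, tableSize) {
  return key.length % tableSize;
}

// Insert with linear probing: if the target bucket is taken, keep
// stepping to the next bucket, wrapping around to the start if needed.
// (This sketch assumes the table is not completely full.)
function insertWithProbing(table, key) {
  let index = hash(key, table.length);
  while (table[index] !== undefined) {
    index = (index + 1) % table.length;
  }
  table[index] = key;
  return index;
}

const table = new Array(5);
insertWithProbing(table, "ab"); // hashes to bucket 2
insertWithProbing(table, "cd"); // also hashes to 2, probes to bucket 3
```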

Linear probing

Clustering

  • simply moving over to the next available hash bucket and inserting an element at the “next free space” leads to something called clustering.
    • either our hash function isn’t using the entire range of the hash table,
    • or it’s not spreading data evenly across the hash buckets of our hash table.
  • well-spaced : data is spread across the table, across many different hash buckets, and spans the entire range of the table

Clustered vs. well-balanced tables


  • this again depends on our dataset.
  • If we have a lot of elements that are ending up at one hash bucket, and we use linear probing to solve the problem of collision, we’ll end up with a clustered table!
  • if we’ve designed a good hash function, and our table is big enough to accommodate our dataset, then linear probing can be a great solution for handling collisions.

Chaining

  • store multiple things in one hash bucket.

  • a handy dandy linked list!!

    Instead of storing a single item at each hash bucket, we can chain multiple elements together so that each key of the table has a pointer that references a linked list.

Chaining

  • O(n) time: searching through a linked list takes time proportional to the number of elements in it.

Chaining still averages out to a constant search time


  • With a good hash function, the linked lists should be approximately the same size at any hash bucket where there is a collision.
  • The great thing about this is that, with a good hash function, chaining still averages out to have a search time of O(1), or constant lookup time.
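A minimal sketch of chaining, using plain arrays to stand in for the linked lists (the hash function and names are illustrative assumptions, not from the article):

```javascript
// Toy hash function: key length modulo table size.
function hash(key, tableSize) {
  return key.length % tableSize;
}

// Each bucket holds a chain of [key, value] entries.
function chainInsert(table, key, value) {
  const index = hash(key, table.length);
  if (table[index] === undefined) table[index] = [];
  table[index].push([key, value]);
}

// Walk the chain at the key's bucket; with a good hash function the
// chains stay short, so this averages out to constant time.
function chainSearch(table, key) {
  const bucket = table[hash(key, table.length)];
  if (bucket === undefined) return undefined;
  for (const [k, v] of bucket) {
    if (k === key) return v;
  }
  return undefined;
}

const shelves = new Array(5);
chainInsert(shelves, "ab", "first");
chainInsert(shelves, "cd", "second"); // collides with "ab", chained in bucket 2
chainSearch(shelves, "cd"); // "second"
```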

The power is all in the function

  • Unix and Linux operating systems use an internal hash table to store the contents of all the directories referenced by the environment PATH variable.
  • If you’ve ever used a version of the rehash command, what you were really doing when you ran that command was recomputing and rehashing the system’s internal hash table!

Spell checker

Using a hash table to implement a spell checker


Unordered container support in C++

Associative containers are used to search through large amounts of data. With a hash table, the time complexity of lookup is O(1) on average.

unordered_multiset

#include <unordered_set>
void test_unordered_multiset(long& value)
{
    cout << "\ntest_unordered_multiset().......... \n";

    unordered_multiset<string> c; //note.1
    char buf[10];

    clock_t timeStart = clock();
    for (long i = 0; i < value; ++i)
    {
        try {
            snprintf(buf, 10, "%d", rand());
            c.insert(string(buf)); //note.2
        }
        catch (exception& p) {
            cout << "i=" << i << " " << p.what() << endl;
            abort();
        }
    }
    cout << "milli-seconds : " << (clock() - timeStart) << endl;
    cout << "unordered_multiset.size()= " << c.size() << endl;
    cout << "unordered_multiset.max_size()= " << c.max_size() << endl; //357913941
    cout << "unordered_multiset.bucket_count()= " << c.bucket_count() << endl; //note.3
    cout << "unordered_multiset.load_factor()= " << c.load_factor() << endl;
    cout << "unordered_multiset.max_load_factor()= " << c.max_load_factor() << endl; //note.4
    cout << "unordered_multiset.max_bucket_count()= " << c.max_bucket_count() << endl;
    for (unsigned i = 0; i < 20; ++i) {
        cout << "bucket #" << i << " has " << c.bucket_size(i) << " elements.\n";
    }

    string target = get_a_target_string();
    {
        timeStart = clock();
        auto pItem = find(c.begin(), c.end(), target); // much slower than c.find(...)
        cout << "std::find(), milli-seconds : " << (clock() - timeStart) << endl;
        if (pItem != c.end())
            cout << "found, " << *pItem << endl;
        else
            cout << "not found! " << endl;
    }

    {
        timeStart = clock();
        auto pItem = c.find(target); // much faster than std::find(...) note.5
        cout << "c.find(), milli-seconds : " << (clock() - timeStart) << endl;
        if (pItem != c.end())
            cout << "found, " << *pItem << endl; //note.6
        else
            cout << "not found! " << endl;
    }

    c.clear();
}


  1. When defining the set, only the value type needs to be declared, because the key equals the value: unordered_multiset<string> c;
  2. Insert the value directly into the set: c.insert(string(buf));
  3. Total bucket count of the container: c.bucket_count(). The bucket count is always greater than the total element count; when the element count reaches the bucket count, the bucket count doubles and all elements are rehashed into the new buckets. Not every bucket necessarily holds an element, so it’s perfectly normal that none of the first twenty buckets sampled contains one.
  4. Load factor: c.load_factor(), with a maximum of 1. The linked list in each bucket must not grow too long, or lookup efficiency suffers.
  5. The container’s own find is always faster than std::find, as covered in the previous post.
  6. When printing set elements, just dereference the iterator: *pItem.

unordered_multimap

#include <unordered_map>
void test_unordered_multimap(long& value)
{
    cout << "\ntest_unordered_multimap().......... \n";

    unordered_multimap<long, string> c; //note.1
    char buf[10];

    clock_t timeStart = clock();
    for (long i = 0; i < value; ++i)
    {
        try {
            snprintf(buf, 10, "%d", rand());
            // multimap cannot use [] for insertion
            c.insert(pair<long, string>(i, buf)); //note.2
        }
        catch (exception& p) {
            cout << "i=" << i << " " << p.what() << endl;
            abort();
        }
    }
    cout << "milli-seconds : " << (clock() - timeStart) << endl;
    cout << "unordered_multimap.size()= " << c.size() << endl;
    cout << "unordered_multimap.max_size()= " << c.max_size() << endl; //357913941

    long target = get_a_target_long();
    timeStart = clock();
    auto pItem = c.find(target); //note.3
    cout << "c.find(), milli-seconds : " << (clock() - timeStart) << endl;
    if (pItem != c.end())
        cout << "found, value=" << (*pItem).second << endl; //note.4
    else
        cout << "not found! " << endl;
}


  1. When defining the map, pay attention to the template parameters: unordered_multimap<long, string> c;
  2. Again, note that a multimap cannot use [] for insertion; use a pair to specify the key and value types, because keys may repeat: c.insert(pair<long, string>(i, buf));
  3. target is a key; find the item by its key: auto pItem = c.find(target);
  4. To get the value, take the second member of the pair: (*pItem).second.

unordered_set

#include <unordered_set>
void test_unordered_set(long& value)
{
    cout << "\ntest_unordered_set().......... \n";

    unordered_set<string> c;
    char buf[10];

    clock_t timeStart = clock();
    for (long i = 0; i < value; ++i)
    {
        try {
            snprintf(buf, 10, "%d", rand());
            c.insert(string(buf));
        }
        catch (exception& p) {
            cout << "i=" << i << " " << p.what() << endl;
            abort();
        }
    }
    cout << "milli-seconds : " << (clock() - timeStart) << endl;
    cout << "unordered_set.size()= " << c.size() << endl;
    cout << "unordered_set.max_size()= " << c.max_size() << endl; //357913941
    cout << "unordered_set.bucket_count()= " << c.bucket_count() << endl;
    cout << "unordered_set.load_factor()= " << c.load_factor() << endl;
    cout << "unordered_set.max_load_factor()= " << c.max_load_factor() << endl;
    cout << "unordered_set.max_bucket_count()= " << c.max_bucket_count() << endl;
    for (unsigned i = 0; i < 20; ++i) {
        cout << "bucket #" << i << " has " << c.bucket_size(i) << " elements.\n";
    }

    string target = get_a_target_string();
    {
        timeStart = clock();
        auto pItem = find(c.begin(), c.end(), target); // much slower than c.find(...)
        cout << "std::find(), milli-seconds : " << (clock() - timeStart) << endl;
        if (pItem != c.end())
            cout << "found, " << *pItem << endl;
        else
            cout << "not found! " << endl;
    }

    {
        timeStart = clock();
        auto pItem = c.find(target); // much faster than std::find(...)
        cout << "c.find(), milli-seconds : " << (clock() - timeStart) << endl;
        if (pItem != c.end())
            cout << "found, " << *pItem << endl;
        else
            cout << "not found! " << endl;
    }
}


unordered_map

#include <unordered_map>
void test_unordered_map(long& value)
{
    cout << "\ntest_unordered_map().......... \n";

    unordered_map<long, string> c;
    char buf[10];

    clock_t timeStart = clock();
    for (long i = 0; i < value; ++i)
    {
        try {
            snprintf(buf, 10, "%d", rand());
            c[i] = string(buf);
        }
        catch (exception& p) {
            cout << "i=" << i << " " << p.what() << endl;
            abort();
        }
    }
    cout << "milli-seconds : " << (clock() - timeStart) << endl;
    cout << "unordered_map.size()= " << c.size() << endl;
    cout << "unordered_map.max_size()= " << c.max_size() << endl; //357913941

    long target = get_a_target_long();
    timeStart = clock();
    //! auto pItem = find(c.begin(), c.end(), target); // std::find() cannot be used with map
    auto pItem = c.find(target); //note.1

    cout << "c.find(), milli-seconds : " << (clock() - timeStart) << endl;
    if (pItem != c.end())
        cout << "found, value=" << (*pItem).second << endl;
    else
        cout << "not found! " << endl;
}


  1. Note that std::find() cannot be used with a map; use the member function instead: auto pItem = c.find(target);

Resources

  1. Basics of Hash Tables, Prateek Garg
  2. Data Structure and Algorithms — Hash Table, Tutorialspoint
  3. Hash Tables, Professor John Morris
  4. CS210 Lab: Hash Table, University of Regina, Dept. of Computer Science
  5. Hash Tables, SparkNotes
  6. Introduction to Hash Tables, Professor Steve Cutchin
  7. Hash Function, Wolfram Mathworld
  8. Dictionaries and Hash Tables, Professor Sinisa Todorovic
  9. Linear Probing Hash Tables, Department of Computer Science, RMIT University
  10. Problem Solving with Algorithms and Data Structures: Hashing, Interactive Python
  11. Hash Tables, Professors Robert Sedgewick & Kevin Wayne
  12. Introduction to Hashing, Professor Ananda Gunawardena
Thanks for your support.