1 Introduction
sparsehash: An extremely memory-efficient hash_map implementation. 2 bits/entry overhead! The SparseHash library contains several hash-map implementations, including implementations that optimize for space or speed. The library is written in C++; my interest here is mainly in its design ideas. I read version 2.0.2. The project page has no wiki describing the design and implementation, so these are excerpts from the bundled documentation for easy reference. The source code and documentation ship in the same archive, which can be downloaded from https://code.google.com/p/sparsehash/.
2 sparsetable
The component that "optimizes for space" is sparsetable (it can serve as the array underlying a hash_set or hash_map). It is essentially a two-dimensional array in which each dimension is allocated on demand; since the layer above can also shrink each dimension, memory utilization is quite good. Compared with a flat one-dimensional array, growing, shrinking, inserting elements, and deleting elements all become cheaper. The sparsetable source contains this comment:
Terminology: the position of an item in the overall table (from 1 .. t) is called its "location." The logical position in a group (from 1 .. M ) is called its "position." The actual location in the array (from 1 .. # of non-empty buckets in the group) is called its "offset."
2.1 sparsetable documentation
For specificity, consider the declaration: sparsetable<Foo> t(100); // a sparse array with 100 elements
A sparsetable is a random container that implements a sparse array, that is, an array that uses very little memory to store unassigned indices (in this case, between 1-2 bits per unassigned index). For instance, if you allocate an array of size 5 and assign a[2] = [big struct], then a[2] will take up a lot of memory but a[0], a[1], a[3], and a[4] will not. Array elements that have a value are called "assigned". Array elements that have no value yet, or have had their value cleared using erase() or clear(), are called "unassigned". For assigned elements, lookups return the assigned value; for unassigned elements, they return the default value, which for t is Foo().
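As a quick illustration of this behavior, here is a minimal usage sketch (method names follow the sparsetable documentation; treat the header path and exact signatures as assumptions):

#include <cstdio>
#include <sparsehash/sparsetable>

using google::sparsetable;

int main() {
    sparsetable<int> t(100);           // 100 logical slots, roughly 1-2 bits each while unassigned
    t.set(2, 42);                      // slot 2 is now "assigned"
    printf("%d\n", int(t.get(2)));     // 42
    printf("%d\n", int(t.get(3)));     // 0 -- unassigned slots return the default value int()
    printf("%d\n", int(t.test(2)));    // 1 -- test() reports whether a slot is assigned
    t.erase(2);                        // slot 2 becomes unassigned again
    printf("%d\n", int(t.test(2)));    // 0
    return 0;
}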
sparsetable is implemented as an array of "groups". Each group is responsible for M array indices. The first group knows about t[0]..t[M-1], the second about t[M]..t[2M-1], and so forth. (M is 48 by default.) At construct time, t creates an array of (99/M + 1) groups. From this point on, all operations -- insert, delete, lookup -- are passed to the appropriate group. In particular, any operation on t[i] is actually performed on (t.group[i / M])[i % M]. (Here i refers to the location.)
Each group consists of a vector, which holds assigned values, and a bitmap of size M, which indicates which indices are assigned. A lookup works as follows: the group is asked to look up index i, where i < M. The group looks at bitmap[i]. If it's 0, the lookup fails. If it's 1, then the group has to find the appropriate value in the vector. (Here i refers to the position; position = location % M.)
find()
Finding the appropriate vector element is the most expensive part of the lookup. The code counts all bitmap entries <= i that are set to 1. (There's at least 1 of them, since bitmap[i] is 1.) Suppose there are 4 such entries. Then the right value to return is the 4th element of the vector: vector[3]. This takes time O(M), which is a constant since M is a constant.
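A minimal sketch of this lookup path, using an invented Group struct (the real sparsegroup stores its bitmap as a char array and works for arbitrary value types; this just models the idea for int values):

#include <stdint.h>
#include <vector>

// One "group": M logical positions, a bitmap saying which are assigned,
// and a dense vector holding only the assigned values, in position order.
struct Group {
    static const int M = 48;
    uint64_t bitmap;              // bit p is set <=> position p is assigned (M <= 64 here)
    std::vector<int> values;

    Group() : bitmap(0) {}

    // Number of assigned positions strictly below p == index into 'values'.
    int offset(int p) const {
        return __builtin_popcountll(bitmap & ((uint64_t(1) << p) - 1));  // GCC/Clang builtin
    }

    bool find(int p, int* out) const {
        if (!(bitmap & (uint64_t(1) << p))) return false;  // bitmap[p] == 0: lookup fails
        *out = values[offset(p)];                          // bitmap[p] == 1: fetch the offset-th value
        return true;
    }
};

// Table-level dispatch: any operation on location i goes to group i / M, position i % M.
bool table_find(const std::vector<Group>& groups, int location, int* out) {
    return groups[location / Group::M].find(location % Group::M, out);
}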
Here the offset is obtained by taking the popcount of the bitmap up to the position. sparsetable uses the precompute-8-bit approach: the bit counts for every possible byte value are computed ahead of time and stored in an array, which is consulted directly during lookups; that array is in effect a perfect hash table.
For more on popcount, see: Puzzle: Fast Bit Counting - Author: Gurmeet Manku - gurmeet.net
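A sketch of that precompute-8-bit technique (table name and layout are illustrative, not the library's exact code):

#include <stdint.h>

// bits_in[b] = number of set bits in the byte b, filled in once up front.
static unsigned char bits_in[256];

void init_bits_in() {
    for (int b = 0; b < 256; ++b)
        bits_in[b] = (unsigned char)((b & 1) + bits_in[b / 2]);
}

// Popcount of a 32-bit word: four table lookups instead of a bit-by-bit loop.
int popcount32(uint32_t x) {
    return bits_in[x & 0xff] + bits_in[(x >> 8) & 0xff] +
           bits_in[(x >> 16) & 0xff] + bits_in[(x >> 24) & 0xff];
}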
insert()
Insert starts with a lookup. If the lookup succeeds, the code merely replaces vector[3] with the new value. If the lookup fails, then the code must insert a new entry into the middle of the vector. Again, to insert at position i, the code must count all the bitmap entries <= i that are set to 1. This indicates the position to insert into the vector. All vector entries above that position must be moved to make room for the new entry. This takes time, but still constant time since the vector has size at most M.
(Inserts could be made faster by using a list instead of a vector to hold group values, but this would use much more memory, since each list element requires a full pointer of overhead.)
The only metadata that needs to be updated, after the actual value is inserted, is to set bitmap[i] to 1. No other counts must be maintained.
delete()
Deletes are similar to inserts. They start with a lookup. If it fails, the delete is a noop. Otherwise, the appropriate entry is removed from the vector, all the vector elements above it are moved down one, and bitmap[i] is set to 0.
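Continuing the Group sketch from the find() section above (same caveats: invented names, simplified to int values), insert and erase look roughly like this:

#include <stdint.h>
#include <vector>

struct Group {
    static const int M = 48;
    uint64_t bitmap;              // which of the M positions are assigned
    std::vector<int> values;      // only the assigned values, in position order

    Group() : bitmap(0) {}

    int offset(int p) const {     // assigned positions strictly below p
        return __builtin_popcountll(bitmap & ((uint64_t(1) << p) - 1));
    }

    void insert(int p, int v) {
        int off = offset(p);
        if (bitmap & (uint64_t(1) << p)) {
            values[off] = v;                        // already assigned: overwrite in place
        } else {
            values.insert(values.begin() + off, v); // shift later entries up by one
            bitmap |= uint64_t(1) << p;             // the only metadata update needed
        }
    }

    void erase(int p) {
        if (!(bitmap & (uint64_t(1) << p))) return; // unassigned: noop
        values.erase(values.begin() + offset(p));   // shift later entries down by one
        bitmap &= ~(uint64_t(1) << p);
    }
};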
3 sparse_hash_set<Key, HashFcn, EqualKey, Alloc>
For specificity, consider the declaration: sparse_hash_set<Foo> t;
sparse_hash_set is a hashtable. For more information on hashtables, see Knuth. Hashtables are basically arrays with complicated logic on top of them. sparse_hash_set uses a sparsetable to implement the underlying array.
In particular, sparse_hash_set stores its data in a sparsetable using quadratic internal probing (see Knuth). Many hashtable implementations use external probing, so each table element is actually a pointer chain, holding many hashtable values. sparse_hash_set, on the other hand, always stores at most one value in each table location. If the hashtable wants to store a second value at a given table location, it can't; it's forced to look somewhere else.
insert()
As a specific example, suppose t is a new sparse_hash_set. It then holds a sparsetable of size 32. The code for t.insert(foo) works as follows:
1) Call hash<Foo>(foo) to convert foo into an integer i. (hash<Foo> is the default hash function; you can specify a different one in the template arguments.)
2a) Look at t.sparsetable[i % 32]. If it's unassigned, assign it to foo. foo is now in the hashtable.
2b) If t.sparsetable[i % 32] is assigned, and its value is foo, then do nothing: foo was already in t and the insert is a noop.
2c) If t.sparsetable[i % 32] is assigned, but to a value other than foo, look at t.sparsetable[(i+1) % 32]. If that also fails, try t.sparsetable[(i+3) % 32], then t.sparsetable[(i+6) % 32]. In general, keep trying the next triangular number.
3) If the table is now "too full" -- say, 25 of the 32 table entries are now assigned -- grow the table by creating a new sparsetable that's twice as big, and rehashing every single element from the old table into the new one. This keeps the table from ever filling up.
4) If the table is now "too empty" -- say, only 3 of the 32 table entries are now assigned -- shrink the table by creating a new sparsetable that's half as big, and rehashing every element as in the growing case. This keeps the table overhead proportional to the number of elements in the table.
Instead of using triangular numbers as offsets, one could just use regular integers: try i, then i+1, then i+2, then i+3. This has bad 'clumping' behavior, as explored in Knuth. Quadratic probing, using the triangular numbers, avoids the clumping while keeping cache coherency in the common case. As long as the table size is a power of 2, the quadratic-probing method described above will explore every table element if necessary, to find a good place to insert.
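A small demonstration of that probe sequence, not taken from the library: starting from the hashed slot and adding successive triangular increments, a power-of-two table is covered completely.

#include <cstdio>

int main() {
    const unsigned table_size = 32;            // must be a power of two
    unsigned h = 13;                           // pretend hash<Foo>(foo) returned 13
    unsigned slot = h & (table_size - 1);      // i % 32, computed with a mask

    // Probe slots i, i+1, i+3, i+6, i+10, ... : the k-th step advances by k.
    for (unsigned k = 1; k <= table_size; ++k) {
        printf("probe %2u -> slot %2u\n", k, slot);
        slot = (slot + k) & (table_size - 1);
    }
    // The 32 probes above land on 32 distinct slots, so as long as the table
    // size stays a power of two an insert can always find a free slot.
    return 0;
}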
(As a side note, using a table size that's a power of two has several advantages, including the speed of calculating (i % table_size). On the other hand, power-of-two tables are not very forgiving of a poor hash function. Make sure your hash function is a good one! There are plenty of dos and don'ts on the web (and in Knuth), for writing hash functions.)
The "too full" value, also called the "maximum occupancy", determines a time-space tradeoff: in general, the higher it is, the less space is wasted but the more probes must be performed for each insert. sparse_hash_set uses a high maximum occupancy, since space is more important than speed for this data structure.
The "too empty" value is not necessary for performance but helps with space use. It's rare for hashtable implementations to check this value at insert() time -- after all, how will inserting cause a hashtable to get too small? However, the sparse_hash_set implementation never resizes on erase(); it's nice to have an erase() that does not invalidate iterators. Thus, the first insert() after a long string of erase()s could well trigger a hashtable shrink.
find()
find() works similarly to insert. The only difference is in step (2a): if the value is unassigned, then the lookup fails immediately.
delete()
delete() is tricky in an internal-probing scheme. The obvious implementation of just "unassigning" the relevant table entry doesn't work. Consider the following scenario:
t.insert(foo1); // foo1 hashes to 4, is put in table[4]
t.insert(foo2); // foo2 hashes to 4, is put in table[5]
t.erase(foo1); // table[4] is now 'unassigned'
t.lookup(foo2); // fails since table[hash(foo2)] is unassigned
To avoid these failure situations, delete(foo1) is actually implemented by replacing foo1 by a special 'delete' value in the hashtable. This 'delete' value causes the table entry to be considered unassigned for the purposes of insertion -- if foo3 hashes to 4 as well, it can go into table[4] no problem -- but assigned for the purposes of lookup.
What is this special 'delete' value? The delete value has to be an element of type Foo, since the table can't hold anything else. It obviously must be an element the client would never want to insert on its own, or else the code couldn't distinguish deleted entries from 'real' entries with the same value. There's no way to determine a good value automatically. The client has to specify it explicitly. This is what the set_deleted_key() method does.
Note that set_deleted_key() is only necessary if the client actually wants to call t.erase(). For insert-only hash-sets, set_deleted_key() is unnecessary.
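A minimal usage sketch (the header path and the default hash for std::string are assumptions based on the 2.x layout):

#include <cstdio>
#include <string>
#include <sparsehash/sparse_hash_set>

using google::sparse_hash_set;

int main() {
    sparse_hash_set<std::string> s;
    s.set_deleted_key("<deleted>");   // a key we promise never to insert for real

    s.insert("foo");
    s.erase("foo");                   // marks the slot 'deleted'; no resize, iterators stay valid
    printf("size=%lu\n", (unsigned long)s.size());          // size=0
    printf("found=%d\n", int(s.find("foo") != s.end()));    // found=0
    return 0;
}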
When copying the hashtable, either to grow it or shrink it, the special 'delete' values are not copied into the new table. The copy-time rehash makes them unnecessary.
4 sparse_hash_map<Key, Data, HashFcn, EqualKey, Alloc>
sparse_hash_map is implemented identically to sparse_hash_set. The only difference is that instead of storing just Foo in each table entry, the data structure stores pair<Foo, Value>.
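A usage sketch of the map variant (same assumptions about the header path and default string hash as in the sparse_hash_set example above):

#include <cstdio>
#include <string>
#include <sparsehash/sparse_hash_map>

using google::sparse_hash_map;

int main() {
    sparse_hash_map<std::string, int> counts;
    counts.set_deleted_key("<deleted>");       // only needed because we erase() below

    counts["apple"] += 1;                      // each entry is stored as pair<Key, Data>
    counts["apple"] += 1;
    counts["pear"]  += 1;
    printf("apple=%d pear=%d\n", counts["apple"], counts["pear"]);  // apple=2 pear=1

    counts.erase("pear");
    printf("size=%lu\n", (unsigned long)counts.size());             // size=1
    return 0;
}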
5 dense_hash_set<Key, HashFcn, EqualKey, Alloc>
Optimized for speed. The hashtable aspects of dense_hash_set are identical to sparse_hash_set: it uses quadratic internal probing and resizes hashtables in exactly the same way. The difference is in the underlying array: instead of using a sparsetable, dense_hash_set uses a C array. This means much more space is used, especially if Foo is big. However, it makes all operations faster, since sparsetable has memory-management overhead that C arrays do not.
The use of C arrays instead of sparsetables points to one immediate complication dense_hash_set has that sparse_hash_set does not: the need to distinguish assigned from unassigned entries. In a sparsetable, this is accomplished by a bitmap. dense_hash_set, on the other hand, uses a dedicated value to specify unassigned entries. Thus, dense_hash_set has two special values: one to indicate deleted table entries, and one to indicate unassigned table entries. At construct time, all table entries are initialized to 'unassigned'.
dense_hash_set provides the method set_empty_key() to indicate the value that should be used for unassigned entries. Like set_deleted_key(), set_empty_key() requires a value that will not be used by the client for any legitimate purpose. Unlike set_deleted_key(), set_empty_key() is always required, no matter what hashtable operations the client wishes to perform.
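For example (again a hedged sketch; the empty key and the deleted key must be distinct, and neither may ever be used as a real key):

#include <cstdio>
#include <string>
#include <sparsehash/dense_hash_set>

using google::dense_hash_set;

int main() {
    dense_hash_set<std::string> s;
    s.set_empty_key("");               // always required, before any other operation
    s.set_deleted_key("<deleted>");    // only required because we erase() below

    s.insert("foo");
    s.insert("bar");
    s.erase("bar");
    printf("size=%lu\n", (unsigned long)s.size());   // size=1
    return 0;
}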
6 dense_hash_map<Key, Data, HashFcn, EqualKey, Alloc>
Optimized for speed. dense_hash_map is identical to dense_hash_set except for the values stored in each table entry.
7 doc/performance.html#hashfn
A Note on Hash Functions
For good performance, the sparsehash hash routines depend on a good hash function: one that distributes data evenly. Many hashtable implementations come with sub-optimal hash functions that can degrade performance. For instance, the hash function given in Knuth's _Art of Computer Programming_, and the default string hash function in SGI's STL implementation, both distribute certain data sets unevenly, leading to poor performance.
As an example, in one test of the default SGI STL string hash function against the Hsieh hash function (see below), for a particular set of string keys, the Hsieh function resulted in hashtable lookups that were 20 times as fast as the STLPort hash function. The string keys were chosen to be "hard" to hash well, so these results may not be typical, but they are suggestive.
There has been much research over the years into good hash functions. Here are some hash functions of note.
Bob Jenkins: http://burtleburtle.net/bob/hash/
Paul Hsieh: http://www.azillionmonkeys.com/qed/hash.html
Fowler/Noll/Vo (FNV): http://www.isthe.com/chongo/tech/comp/fnv/
MurmurHash: http://murmurhash.googlepages.com/
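If the default hash distributes your keys poorly, the HashFcn template parameter lets you supply your own. A sketch plugging a simple FNV-1a hash into sparse_hash_set (the FNV constants come from the reference page above; the integration details are illustrative assumptions, not library code):

#include <cstdio>
#include <stdint.h>
#include <string>
#include <sparsehash/sparse_hash_set>

// FNV-1a over the bytes of a string: simple and reasonably well distributed.
struct Fnv1aHash {
    size_t operator()(const std::string& s) const {
        uint64_t h = 14695981039346656037ULL;   // FNV-1a 64-bit offset basis
        for (size_t i = 0; i < s.size(); ++i) {
            h ^= (unsigned char)s[i];
            h *= 1099511628211ULL;              // FNV-1a 64-bit prime
        }
        return (size_t)h;
    }
};

int main() {
    // HashFcn is the second template argument of sparse_hash_set.
    google::sparse_hash_set<std::string, Fnv1aHash> names;
    names.insert("alice");
    names.insert("bob");
    printf("size=%lu\n", (unsigned long)names.size());   // size=2
    return 0;
}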
=Article revisions=
2013-03-09, 2013-05-06