Hash Table lookup - with perfect hash, in C

https://stackoverflow.com/questions/7328622

27-10-2019
|

Question

I have a C-language app where I need to do table lookups.

The entries are strings, All are known at the start of runtime. The table is initialized once, and then looked up many times. The table can change, but it's basically as if the app starts over. I think this means I can use a perfect-hash? It's ok to consume some time for the hashtable initialization, as it happens just once.

There will be between 3 and 100,000 entries, each one unique, and I estimate that 80% of cases will have fewer than 100 entries. A simple naive lookup is "fast enough" in those cases. (== no one is complaining)

However in the cases where there are 10k+ entries, the lookup speed of a naive approach is unacceptable. What's a good approach for delivering good hashtable-based lookup performance for strings in C? Assume I do not have a 3rd-party commercial library like Boost/etc. What hash algorithm should I use? how do I decide?

Solution

Generating a perfect hash is not a simple problem. There's libraries devoted to the task. In this case the most popular one is probably CMPH. I haven't used it though so can't help beyond that. gperf is another tool, but it requires the strings to be known at compile time (you could work around it by compiling a .so and loading, but kind of overkill).

But frankly, I'd at least try to go with a binary search first. Simply sort the array using qsort, then search with bsearch (or roll your own). Both those are part of stdlib.h since C89.

OTHER TIPS

If a naive (I assume you mean linear) approach is ok for 100 entries (so 50 comparisons are done on average) then a binary search will be more than sufficient for 100,000 entries (it takes at most 17 comparisons).

So I wouldn't bother with hashes at all but just resort to sorting your string table on startup (e.g. using qsort) and later using a binary search (e.g. using bsearch) to look up entries.

If the (maximal) table size is known, a plain hashtable with chaining is very easy to implement. Size overhead is only two ints per item. With a reasonable hash function only 1.5 probes per lookup are needed on average, this for a 100% loaded table.

Constructing a perfect hash is only feasible if your data does not change. Once it changes, you'll have to recompute and rehash, which is way more expensive than doing a few extra compares.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow