Question

I'm comparing the performance of sort vs. a custom hash function for strings of varying length, and the results are a bit surprising. I expected the function prime_hash (and especially prime_hash2) in the following code to outperform sort_hash, but the opposite is true. Can anyone explain the performance difference? Can anyone offer an alternate hash? [The function should produce identical values for strings containing the same distribution of letters and different values for all other strings.]
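
In other words (a small illustration, using the sort-based hash defined below as a stand-in):

h = lambda s: ''.join(sorted(s))
assert h('listen') == h('silent')    # same letter distribution: must collide
assert h('listen') != h('listens')   # different distribution: must differ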

Here are the results I'm getting:

For strings of max length: 10
sort_hash: Time in seconds: 3.62555098534
prime_hash: Time in seconds: 5.5846118927
prime_hash2: Time in seconds: 4.14513611794
For strings of max length: 20
sort_hash: Time in seconds: 4.51260590553
prime_hash: Time in seconds: 8.87842392921
prime_hash2: Time in seconds: 5.74179887772
For strings of max length: 30
sort_hash: Time in seconds: 5.41446709633
prime_hash: Time in seconds: 11.4799649715
prime_hash2: Time in seconds: 7.58586287498
For strings of max length: 40
sort_hash: Time in seconds: 6.42467713356
prime_hash: Time in seconds: 14.397785902
prime_hash2: Time in seconds: 9.58741497993
For strings of max length: 50
sort_hash: Time in seconds: 7.25647807121
prime_hash: Time in seconds: 17.0482890606
prime_hash2: Time in seconds: 11.3653149605

And here is the code:

#!/usr/bin/env python
# Python 2: uses xrange, string.lowercase, the print statement, and the reduce builtin.

from time import time
import random
import string
from itertools import groupby

def prime(i, primes):
   # Trial division: i is prime iff no previously found prime
   # (other than i itself) divides it evenly.
   for p in primes:
      if not (i == p or i % p):
         return False
   primes.add(i)
   return i

def historic(n):
   # Return the set of the first n primes.
   primes = set([2])
   i, p = 2, 0
   while True:
      if prime(i, primes):
         p += 1
         if p == n:
            return primes
      i += 1

primes = sorted(historic(26))   # one prime per letter; sorted, since set iteration order is arbitrary

def generate_strings(num, max_len):
   # num random lowercase strings with lengths from 3 to max_len - 1.
   gen_string = lambda i: ''.join(random.choice(string.lowercase) for x in xrange(i))
   return [gen_string(random.randrange(3, max_len)) for i in xrange(num)]

def sort_hash(s):
   # Anagrams sort to the same string.
   return ''.join(sorted(s))

def prime_hash(s):
   # Product of one prime per letter: multiplication is commutative, so
   # anagrams get the same product, and unique factorization keeps all
   # other strings distinct.
   return reduce(lambda x, y: x * y, [primes[ord(c) - ord('a')] for c in s])

def prime_hash2(s):
   # Same hash as prime_hash, computed with a plain loop instead of
   # reduce over an intermediate list.
   res = 1
   for c in s:
      res = res * primes[ord(c) - ord('a')]
   return res

def dotime(func, inputs):
   start = time()
   # Hash every string, then sort the hashes; groupby is lazy and never
   # consumed, so the sort dominates the post-hashing work.
   groupby(sorted([func(s) for s in inputs]))
   print '%s: Time in seconds: %s' % (func.__name__, time() - start)

def dotimes(inputs):
   print 'For strings of max length: %s' % max([len(s) for s in inputs])
   dotime(sort_hash, inputs)
   dotime(prime_hash, inputs)
   dotime(prime_hash2, inputs)

if __name__ == '__main__':
   dotimes(generate_strings(1000000, 11))
   dotimes(generate_strings(1000000, 21))
   dotimes(generate_strings(1000000, 31))
   dotimes(generate_strings(1000000, 41))
   dotimes(generate_strings(1000000, 51))

Solution 2

Based on the input from BoppreH, I was able to get a version of the arithmetic-based hash which outperforms even the C-implemented 'sorted'-based version:

primes = sorted(historic(26))
# Map each letter directly to its prime, hoisting the ord() arithmetic
# out of the hash function entirely.
primes = {c: primes[ord(c) - ord('a')] for c in string.lowercase}

def prime_hash2(s):
   res = 1
   for c in s:
      # One dict lookup per character is all that remains in the loop.
      res = res * primes[c]
   return res
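
For example, assuming the dict-based primes and prime_hash2 above are defined, a quick check that the anagram property still holds:

# Anagrams multiply the same primes together, so they must collide;
# unique factorization keeps every other string distinct.
assert prime_hash2('listen') == prime_hash2('silent')
assert prime_hash2('listen') != prime_hash2('little')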

Other tips

I think you're asking why sort_hash (which is O(n log n)) is faster than the other functions, which are O(n).

The reason is that your n is too small for log(n) to be significant.

Python's sort() is coded in C. If you coded your arithmetic functions in C, you would see the n log n version lose out at much smaller values of n.

Aside: when you have lots of repeated items, timsort does better than n log n. Since there are only 256 possible characters, you will likely find that timsort approaches O(n) long before the strings are long enough for the arithmetic versions to gain an advantage.
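
To see those constant factors directly, here is a rough sketch (not from the original answer) that times both approaches on a single short string with timeit; it hardcodes the first 26 primes instead of reusing historic():

from timeit import timeit

setup = """
import string
primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
primes = {c: primes[ord(c) - ord('a')] for c in string.lowercase}
s = 'qwertyuiop'
"""

# C-coded sorted() versus a pure-Python multiplication loop over the same string.
print timeit("''.join(sorted(s))", setup=setup, number=1000000)
print timeit("reduce(lambda r, c: r * primes[c], s, 1)", setup=setup, number=1000000)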
