
I understand that insertion for hash tables is O(1) and sometimes O(n) depending on the load factor. This makes sense to me; however, I'm still confused. When talking about insertion, are we including the hash function in that measurement, or is it just placing some value at that index? For ints I could see how it would be O(1), but what about strings or other objects?

Edit: This seemed to answer my question, sorry about the confusion. Time complexity of creating hash value of a string in hashtable

  • The hash function should be O(1). It shouldn't scale with the input. Hence can be ignored for insertion. Commented Feb 22, 2022 at 18:27
  • That's what I thought, @VLAZ, but rici has a different opinion, and idk which one to believe. Commented Feb 22, 2022 at 18:44
  • A hashing function used for a hash table should be fast, because if it's not, it's not really worth using a hash table; storing and retrieving values would be linear time, the same as with a list. It's quite reasonable to assume O(1) hashing for a well-designed hash table, and it's quite possible for a hash to achieve that: Complexity of Hashing. The implementation should probably be tuned to the volume of data you expect, but that's also expressed as the load factor anyway. Commented Feb 22, 2022 at 18:50
  • @vlaz: Suppose you have a small hash table with huge keys, say resource URIs. How do you avoid the length of the keys affecting the hash computation? That's not related to the occupancy of the table; it's orthogonal. Commented Feb 22, 2022 at 18:55
  • @rici As I said, the hash function may need to be tuned. There is no need for a perfect hash function over strings of unbounded length; such a thing cannot exist. However, a function that is unlikely to produce collisions over a known type of input is not impossible. You still have to compute a hash that is distributed evenly enough across different hash buckets. If you know you have 100 items coming in, or 1000, and you know their general shape, you can endure, say, a 1% collision rate. If you mean to say that it's impossible to create such a hash function, I'd be disappointed with the CompSci field. Commented Feb 22, 2022 at 19:13

1 Answer


Yes, the hash function needs to be included in the cost of lookup in a hash table, just like the comparison function needs to be included in the cost of lookup in a sorted table. If the keys have unbounded size, then the key length must somehow be accounted for.
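To make that concrete, here is a minimal sketch (in Python; the function name and constants are illustrative, not from the question) of a typical string hash. It touches every character, so computing it costs time proportional to the key length before the table is even probed:

    def fnv1a_hash(key: str) -> int:
        """FNV-1a-style string hash: one xor/multiply per character, so O(len(key))."""
        h = 0xcbf29ce484222325                  # 64-bit FNV offset basis
        for ch in key:                          # this loop is where the key length shows up
            h ^= ord(ch)
            h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, keep 64 bits
        return h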

You could stop computing the hash at a certain fixed key length (Lua does that, for example). But there is a pathological case where every new key is a suffix of the previously inserted key: a hash that only examines a bounded portion of each key keeps colliding on such inputs, which would eventually reduce lookup to a linear search.
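As a rough sketch of that idea (loosely modeled on the Lua 5.1 string hash described in the comments below; the exact constants are assumptions for illustration), the stride grows with the key length so that only a bounded number of characters are ever read:

    def bounded_hash(key: str) -> int:
        """Read at most ~32 characters regardless of key length, so hashing is O(1)."""
        h = len(key)                     # seed with the length
        step = (len(key) >> 5) + 1       # stride chosen so roughly 32 characters are sampled
        for i in range(len(key) - 1, -1, -step):   # walk backwards through the key with that stride
            h = (h ^ ((h << 5) + (h >> 2) + ord(key[i]))) & 0xFFFFFFFF
        return h

Because only a bounded sample of each key contributes, adversarially chosen keys can still be made to collide, which is exactly the degenerate case described above.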

Regardless of the hash function, the hash table lookup must eventually compare the found key (if there is one) with the target key, to ensure that they are the same. So that must take time proportional to the size of the key.
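For example, a lookup in a simple chained table (a hypothetical sketch, reusing the fnv1a_hash above) has to finish with a full key comparison, which reads the whole key even if the hash itself were free:

    def lookup(buckets: list, key: str):
        """Look up `key` in a chained hash table: a list of buckets of (key, value) pairs."""
        index = fnv1a_hash(key) % len(buckets)       # hashing: O(len(key)) as sketched above
        for stored_key, value in buckets[index]:
            if stored_key == key:                    # final equality check: also O(len(key))
                return value
        return None                                  # key not present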

In short, constant average-time hash table lookup requires that keys have a bounded size, which is not necessarily the case. But since alternative lookup algorithms would also be affected by key size, this fact doesn't generally help in comparing different lookup algorithms.


10 Comments

@VLAZ has a different opinion, so idk which one to believe. Edit: I have always believed that we do not include the hash function.
True, you could stop at a certain length, but that doesn't mean it's O(1). How would this work for strings? It would have to be some O(n) where n is the length of the string in this case. You would need to go through each character and fold it into the hash.
@sam_i_am: Cormen (p. 227) says "We assume that the hash value h(k) can be computed in O(1) time." That's not the same as saying "we don't count the computation of h(k)". Similarly, the complexity of binary search is usually cited with the assumption that you can compare keys in O(1), even though that's clearly not the case if the keys have unbounded size. But it simplifies the analysis.
@sam_i_am: You compute the hash from a fixed number of characters. For example, you could just hash the first 64 characters of long strings. Or you could use a more sophisticated algorithm; IIRC, Lua divides the total length of the string by some constant like 32 to get a skip amount, which it uses to skip through the string so that it hashes the 32 characters at positions 0, k, 2k, ..., 31k. (That avoids the suffix problem I noted in my example.) Stochastic fixed-size hashing usually works, but it's open to DoS attacks if you care about those.
@sam_i_am: O(1) means constant time. 64 is a constant. In general constant factors don't count in Big-O notation; O(3n²+8) and O(n²) are the same set of functions.