2

I want to keep a list of words (let's say 20k). The only purpose of it is to check if the given word exists there. So, it's going be something known as a hash set.

-- dict here is just an example, it's going to be loadaed from a text file
dict = {["foo"] = true, ["bar"] = true, ...}


function dict.contains(self, word) do
    return self[word]
end

Is this option good enough? Or maybe this one would be better:

-- hash(string) is an exisiting external function

dict = {hash("foo") = true, hash("bar") = true, ... up to 20k}

function dict.contains(self, word) do
    return self[hash(word)]
end

I am inclining to vote for the 2nd option but thinking of internal implementation of lua hashes makes me a little bit confused. If it is just an address of an interned string, probably the 1st option is quite good one (in terms of memory consumption and better performance)? But from the other side, it should be good for statically defined tables (with interned strings). While I am going to load keys, maybe the 2nd options is still better?

3 Answers 3

4

Hash sets in Lua

Yes, a table with string keys and truthy values (usually true) is the idiomatic, and most likely most efficient, way to implement a hash set of strings (or numbers, or "pointers" (like functions or tables, which are compared by reference)) in Lua.

Method - data name collisions

This however:

function dict.contains(self, word) do
    return self[word]
end

is a bad idea, because suddenly dict:contains("contains") will be truthy (it will be the contains function).

Similarly, dict.word will suddenly be truthy as well, even though you probably want all accesses to go through contains.

I'd recommend just directly doing something like if words[word] then rather than unnecessarily complicating this by forcing it into an OOP style when it is not needed or useful. If you do want this, encapsulate the internal "set" of words, so store it in a field like dict._words to not have the methods conflict with the actual data you're storing.


Premature optimization concerns

Is this option good enough?

It most probably is - try it and see. "20k" is a very small number of words for Lua to handle. I see no reason why the first solution shouldn't be fully satisfying.

In asymptotical terms, accessing this table - both to insert new strings, or to get strings - has expected constant time complexity.

Using the optimized hash table implementation provided by your Lua implementation is most probably a better idea than rolling your own.

Or maybe this one would be better: [...] I am inclining to vote for the 2nd option

So far you have provided no reason why it should be better, and barring exceptional circumstances, I don't think it will be. In many aspects, it's very blatantly worse:

  • It's more complex. This complexity would need to be justified.
  • It adds overhead. You need your own hashing of strings; you'll be duplicating the work Lua is already doing when it interns strings itself.
  • A hashing algorithm written in Lua is very unlikely to outperform the optimized hashing algorithm present in your Lua implementation, especially if you're using a well-optimized one like LuaJIT.

If it is just an address of an interned string, probably the 1st option is quite good one (in terms of memory consumption and better performance)?

Yes, on Lua 5.2 and earlier, all strings are interned and there is no way to turn this off. What this practically means is that different strings have different addresses, same strings reuse the same memory. String comparison is a simple address comparison and thus constant time. This is what enables table access to always be O(1) instead of O(length of string).

So indeed, on Lua 5.2 and earlier, the first option is most certainly better, and by a lot.

On Lua 5.3 and later, things get a bit more complicated: Very "long" strings (think thousands of characters) are not interned anymore, but instead are hashed on-demand, meaning comparison for these long strings is O(n), as is table access.

Now there are two "exceptional circumstances" where the custom hashing could be justified, but I consider it very unlikely that you find yourself in either one:

  1. You're using an older Lua / LuaJIT implementation which does not protect against hash collision attacks, and need to protect against such attacks, which can make table access O(n) and thus lead to denial of service. In this case, custom hashing can help - as does just "salting" the keys with a salt unknown to the attacker.
  2. With very efficient custom hashing (think implementing a very good hashing algorithm in C, which somehow beats Lua's general-purpose string hashing for your use case), you could potentially beat Lua's built-in hashing. But at that point, you should probably be writing this performance-critical code in C or another language suitable for maximizing performance to begin with.

But from the other side, it should be good for statically defined tables (with interned strings).

If anything, this is another plus point for using Lua's implementation.

While I am going to load keys, maybe the 2nd options is still better?

No.


TL;DR: Don't optimize prematurely; you'd most likely end up with more complex, less efficient code. Simply use a table with string keys.

Sign up to request clarification or add additional context in comments.

Comments

0

(Actually you can skip the hash function and set key value as key hashedValue)

Code:

function createHashChecker(hashFunction)
    local hashTable = {}
    return function(value)
        local hashedValue = hashFunction(value)
        if hashTable[hashedValue] then
            return true -- hahsed value is already in hashTable
        else
            hashTable[hashedValue] = true
            return false -- hahsed value was not in hashTable
        end
    end
end

Example:

local function hashFunction (str)
    return str
end

local checkHash = createHashChecker(hashFunction)

print (checkHash('foo')) -- false
print (checkHash('bar')) -- false
print (checkHash('foo')) -- true

Comments

0

@Luatic

Nice note about method names collision. Although my code was a dummy example, I think it's worth to to be aware of such particularities. Thank you!

And my impediments were mostly of the following:

  • If I actually do not need a value/content of string, should I keep it? To keep a hash (4/8 bytes of hash value in memory looks better than let's say 5-15 chars per a word).

  • If I load those word dynamically, why "intern" them? and even more correctly - how can it be interned?

    local data = load()   
    if data then
        for word in data:gmatch("[^\r\n]+") do
            self._words[word] = true
        end 
    end
    

I totally agree with your point on premature optimisation. Just wanted to get comprehension - does lua keep in memory three entities for each table entry: hash_key, value and an "origin_key" (let's call it a display value)? My approach was to get rid of that redundant visual representation of a "pretty long" key. Now I see, in my 2nd option I still have a "display value" - it is just a display value of a hash of string.

And yes, pure strings as keys for 20k-words tables work perfectly. Will stick with them )

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.