
This function estimates the entropy of a time series. It is based on the Lempel-Ziv compression algorithm. For a time series of length n, the entropy is estimated as:

E = ( (1/n) * SUM_i L_i )^(-1) * ln(n)

where L_i is the length of the shortest substring starting at position i that does not appear anywhere in positions 1 to i-1. The estimated entropy converges to the real entropy of the time series as n approaches infinity.

There is already a MATLAB implementation here: https://cn.mathworks.com/matlabcentral/fileexchange/51042-entropy-estimator-based-on-the-lempel-ziv-algorithm?s_tid=prof_contriblnk

I would like to implement it in Python, and I did it like this:

import math

def contains(small, big):
    # naive search: is `small` a contiguous sub-list of `big`?
    for i in range(len(big) - len(small) + 1):
        if big[i:i + len(small)] == small:
            return True
    return False

def actual_entropy(l):
    n = len(l)
    sequence = [l[0]]
    sum_gamma = 0

    for i in range(1, n):
        for j in range(i + 1, n + 1):
            s = l[i:j]
            if not contains(s, sequence):  # shortest substring starting at i not seen before
                sum_gamma += len(s)
                sequence.append(l[i])
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae
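
As a small sanity check of the definition (with made-up data and 0-based indexing, as in the code): for l = [1, 2, 1, 2, 1, 3] and i = 2, both [1] and [1, 2] occur in the prefix [1, 2], but [1, 2, 1] does not, so L_2 = 3.

l = [1, 2, 1, 2, 1, 3]
i = 2
j = i + 1
while contains(l[i:j], l[:i]):  # grow the substring until it no longer occurs in the prefix
    j += 1
print(j - i)                    # -> 3, i.e. L_2 = 3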

However, I found that it becomes far too slow as the data size grows. For example, I used a list of 23832 elements as input and the time consumed is like this: (data can be found here)

0-1000: 1.7068431377410889 s
1000-2000: 18.561192989349365 s
2000-3000: 84.82257103919983 s
3000-4000: 243.5819959640503 s
...

I have thousands of lists like this to calculate, and such long running times are unbearable. How should I optimize this function and make it faster?

  • Instead of working with lists, you can try to work with numpy - this may speed up the process. Commented Sep 19, 2017 at 9:43
  • So you suggest using a numpy array instead of a list? Commented Sep 19, 2017 at 9:57
  • Besides being careful with data structures (many appends on an array; ouch; numpy won't help here), use a profiler to check what's slow (a minimal profiling sketch follows these comments). The two candidates here are contains and sequence.append. The former can probably be optimized by reusing the old position as a start, but you have to check your algorithm. Commented Sep 19, 2017 at 11:10
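
A minimal sketch of that profiling step with the standard library's cProfile (here "data" is just a placeholder name for the list loaded from the linked file):

import cProfile

# profile one slice of the series to see where the time is spent
cProfile.run("actual_entropy(data[:1000])", sort="cumtime")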

1 Answer


I played around a bit and tried a few different approaches from another StackOverflow thread, and this is the code I came up with:

import math

def contains(small, big):
    # byte-level search: both arguments are expected to be numpy arrays
    # (.tostring() returns the raw bytes; .tobytes() is the newer spelling)
    try:
        big.tostring().index(small.tostring()) // big.itemsize
        return True
    except ValueError:
        return False

def actual_entropy(l):
    n = len(l)
    sum_gamma = 0

    for i in range(1, n):
        sequence = l[:i]

        for j in range(i + 1, n + 1):
            s = l[i:j]
            if not contains(s, sequence):  # shortest substring starting at i not seen before
                sum_gamma += len(s)
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae
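
Note that this version assumes l is already a NumPy array (the slices need .tostring() and .itemsize), so a call would look roughly like this (the series here is made up):

import numpy as np

data = np.array([1, 2, 1, 2, 1, 3, 2, 1], dtype=np.int64)  # placeholder data
print(actual_entropy(data))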

Funnily enough casting the numpy arrays to strings is faster than working with strings directly. A very crude benchmark of my code on my machine with the data you provide is:

   N:  my code - your code
1000:   0.039s -    1.039s
2000:   0.266s -   18.490s
3000:   0.979s -   74.761s
4000:   2.891s -  285.488s

You might be able to make this even faster if you parallelize the outer loop.
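
A rough sketch of what that parallelization could look like with the standard library's multiprocessing module (the names actual_entropy_parallel, _init_worker and _shortest_new_substring are made up here, and .tobytes() is used as the non-deprecated equivalent of .tostring()):

import math
import multiprocessing as mp

import numpy as np

_series = None  # each worker process keeps its own copy of the series

def _init_worker(arr):
    global _series
    _series = arr

def _shortest_new_substring(i):
    # length of the shortest substring starting at i that does not occur in the
    # prefix _series[:i]; 0 if even the whole remaining suffix occurs there
    prefix = _series[:i].tobytes()
    for j in range(i + 1, len(_series) + 1):
        if _series[i:j].tobytes() not in prefix:
            return j - i
    return 0

def actual_entropy_parallel(l, processes=None):
    l = np.asarray(l)
    n = len(l)
    # on platforms that spawn workers (Windows, macOS), call this from under
    # an `if __name__ == "__main__":` guard
    with mp.Pool(processes, initializer=_init_worker, initargs=(l,)) as pool:
        sum_gamma = sum(pool.map(_shortest_new_substring, range(1, n), chunksize=256))
    return 1 / (sum_gamma / n) * math.log(n)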
