
This function estimates the entropy of a time series. It is based on the Lempel-Ziv compression algorithm. For a time series of length n, the entropy is estimated as:

E = ( (1/n) * SUM_i L_i )^(-1) * ln(n)

where L_i is the length of the shortest substring starting at position i that does not appear anywhere in positions 1 to i-1. The estimated entropy converges to the real entropy of the time series as n approaches infinity.

There is already a MATLAB implementation here: https://cn.mathworks.com/matlabcentral/fileexchange/51042-entropy-estimator-based-on-the-lempel-ziv-algorithm?s_tid=prof_contriblnk

I would like to implement it in Python, and I did it like this:

import math

def contains(small, big):
    # naive search: is `small` a contiguous sub-list of `big`?
    for i in range(len(big) - len(small) + 1):
        if big[i:i + len(small)] == small:
            return True
    return False

def actual_entropy(l):
    n = len(l)
    sequence = [l[0]]
    sum_gamma = 0

    for i in range(1, n):
        for j in range(i + 1, n + 1):
            s = l[i:j]
            if not contains(s, sequence):  # shortest substring starting at i not seen before
                sum_gamma += len(s)
                sequence.append(l[i])
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae
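
As a small sanity check of the definition (with made-up data and 0-based indexing, as in the code): for l = [1, 2, 1, 2, 1, 3] and i = 2, both [1] and [1, 2] occur in the prefix [1, 2], but [1, 2, 1] does not, so L_2 = 3.

l = [1, 2, 1, 2, 1, 3]
i = 2
j = i + 1
while contains(l[i:j], l[:i]):  # grow the substring until it no longer occurs in the prefix
    j += 1
print(j - i)                    # -> 3, i.e. L_2 = 3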

However, I found that it becomes far too slow as the data size grows. For example, I used a list of 23832 elements as input and the time consumed is like this: (data can be found here)

0-1000: 1.7068431377410889 s
1000-2000: 18.561192989349365 s
2000-3000: 84.82257103919983 s
3000-4000: 243.5819959640503 s
...

I have thousands of lists like this to calculate, and such long running times are unbearable. How should I optimize this function and make it faster?

  • Instead of working with lists, you can try to work with numpy - this may speed up the process. Commented Sep 19, 2017 at 9:43
  • So you suggest using a numpy array instead of a list? Commented Sep 19, 2017 at 9:57
  • Besides being careful with data structures (many appends on an array; ouch; numpy won't help here), use a profiler to check what's slow (a minimal profiling sketch follows these comments). The two candidates here are contains and sequence.append. The former can probably be optimized by reusing the old position as a start, but you have to check your algorithm. Commented Sep 19, 2017 at 11:10
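
A minimal sketch of that profiling step with the standard library's cProfile (here "data" is just a placeholder name for the list loaded from the linked file):

import cProfile

# profile one slice of the series to see where the time is spent
cProfile.run("actual_entropy(data[:1000])", sort="cumtime")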

1 Answer


I played around a bit and tried a few different approaches from another StackOverflow thread, and this is the code I came up with:

import math

def contains(small, big):
    # byte-level search: both arguments are expected to be numpy arrays
    # (.tostring() returns the raw bytes; .tobytes() is the newer spelling)
    try:
        big.tostring().index(small.tostring()) // big.itemsize
        return True
    except ValueError:
        return False

def actual_entropy(l):
    n = len(l)
    sum_gamma = 0

    for i in range(1, n):
        sequence = l[:i]

        for j in range(i + 1, n + 1):
            s = l[i:j]
            if not contains(s, sequence):  # shortest substring starting at i not seen before
                sum_gamma += len(s)
                break

    ae = 1 / (sum_gamma / n) * math.log(n)
    return ae
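
Note that this version assumes l is already a NumPy array (the slices need .tostring() and .itemsize), so a call would look roughly like this (the series here is made up):

import numpy as np

data = np.array([1, 2, 1, 2, 1, 3, 2, 1], dtype=np.int64)  # placeholder data
print(actual_entropy(data))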

Funnily enough casting the numpy arrays to strings is faster than working with strings directly. A very crude benchmark of my code on my machine with the data you provide is:

   N:  my code - your code
1000:   0.039s -    1.039s
2000:   0.266s -   18.490s
3000:   0.979s -   74.761s
4000:   2.891s -  285.488s

You might be able to make this even faster if you parallelize the outer loop.
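
A rough sketch of what that parallelization could look like with the standard library's multiprocessing module (the names actual_entropy_parallel, _init_worker and _shortest_new_substring are made up here, and .tobytes() is used as the non-deprecated equivalent of .tostring()):

import math
import multiprocessing as mp

import numpy as np

_series = None  # each worker process keeps its own copy of the series

def _init_worker(arr):
    global _series
    _series = arr

def _shortest_new_substring(i):
    # length of the shortest substring starting at i that does not occur in the
    # prefix _series[:i]; 0 if even the whole remaining suffix occurs there
    prefix = _series[:i].tobytes()
    for j in range(i + 1, len(_series) + 1):
        if _series[i:j].tobytes() not in prefix:
            return j - i
    return 0

def actual_entropy_parallel(l, processes=None):
    l = np.asarray(l)
    n = len(l)
    # on platforms that spawn workers (Windows, macOS), call this from under
    # an `if __name__ == "__main__":` guard
    with mp.Pool(processes, initializer=_init_worker, initargs=(l,)) as pool:
        sum_gamma = sum(pool.map(_shortest_new_substring, range(1, n), chunksize=256))
    return 1 / (sum_gamma / n) * math.log(n)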
