Efficient algorithm for counting unique elements in "suffixes" of an array

Question

I was doing 368B on CodeForces with Python 3, which basically asks you to print the numbers of unique elements in a series of "suffixes" of a given array. Here's my solution (with some additional redirection code for testing):

import sys

if __name__ == "__main__":
    f_in = open('b.in', 'r')
    original_stdin = sys.stdin
    sys.stdin = f_in

    n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    l = [None] * m
    for i in range(m):
        l[i] = int(sys.stdin.readline().rstrip())

    l_sorted = sorted(l)
    l_order = sorted(range(m), key=lambda k: l[k])

    # the ranks of elements in l
    l_rank = sorted(range(m), key=lambda k: l_order[k])

    # unique_elem[i] = non-duplicated elements between l_sorted[i] and l_sorted[i+1]
    unique_elem = [None] * m

    for i in range(m):
        unique_elem[i] = set(a[(l_sorted[i] - 1): (l_sorted[i + 1] - 1)]) if i < m - 1 else set(a[(l_sorted[i] - 1): n])

    # unique_elem_cumulative[i] = non-duplicated elements between l_sorted[i] and a's end
    unique_elem_cumulative = unique_elem[-1]

    # unique_elem_cumulative_count[i] = #unique_elem_cumulative[i]
    unique_elem_cumulative_count = [None] * m
    unique_elem_cumulative_count[-1] = len(unique_elem[-1])

    for i in range(m - 1):
        i_rev = m - i - 2
        unique_elem_cumulative = unique_elem[i_rev] | unique_elem_cumulative
        unique_elem_cumulative_count[i_rev] = len(unique_elem_cumulative)

    with open('b.out', 'w') as f_out:
        for i in range(m):
            idx = l_rank[i]
            f_out.write('%d\n' % unique_elem_cumulative_count[idx])

    sys.stdin = original_stdin
    f_in.close()

The code shows correct results except for the possibly last big test, with n = 81220 and m = 48576 (a simulated input file is here, and an expected output created by a naive solution is here). The time limit is 1 sec, within which I can't solve the problem. So is it possible to solve it within 1 sec with Python 3? Thank you.

UPDATE: an "expected" output file is added, which is created by the following code:

import sys

if __name__ == "__main__":
    f_in = open('b.in', 'r')
    original_stdin = sys.stdin
    sys.stdin = f_in

    n, m = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    a = [int(i) for i in sys.stdin.readline().rstrip().split(' ')]
    with open('b_naive.out', 'w') as f_out:
        for i in range(m):
            l_i = int(sys.stdin.readline().rstrip())
            f_out.write('%d\n' % len(set(a[l_i - 1:])))

    sys.stdin = original_stdin
    f_in.close()

@thefourtheye I've attached the link of the "expected" output created by a naive solution. — Ziyuan
– Ziyuan, Commented Nov 27, 2013 at 10:29

flup · Accepted Answer · 2013-11-27 18:54:14Z

You'll be cutting it close, I think. On my admittedly rather old machine, the I/O alone takes 0.9 seconds per run.

An efficient algorithm, I think, will be to iterate backwards through the array, keeping track of which distinct elements you've found. When you find a new element, add its index to a list. This will therefore be a descending sorted list.

Then for each l_i, the index of l_i in this list will be the answer.

For the small sample dataset

10 10
1 2 3 4 1 2 3 4 100000 99999
1
2
3
4
5
6
7
8
9
10

The list would contain [10, 9, 8, 7, 6, 5] since when reading from the right, the first distinct value occurs at index 10, the second at index 9, and so on.

So then if l_i = 5, it has index 6 in the generated list, so 6 distinct values are found at indices >= l_i. Answer is 6

If l_i = 8, it has index 3 in the generated list, so 3 distinct values are found at indices >= l_i. Answer is 3

It's a little fiddly that the excercise numbers 1-indexed and python counts 0-indexed. And to find this index quickly using existing library functions, I've reversed the list and then use bisect.

import timeit
from bisect import bisect_left

def doit():
    f_in = open('b.in', 'r')
    n, m = [int(i) for i in f_in.readline().rstrip().split(' ')]
    a = [int(i) for i in f_in.readline().rstrip().split(' ')]
    found = {}
    indices = []
    for i in range(n - 1, 0, -1):
        if not a[i] in found:
            indices.append(i+1)
            found[a[i]] = True

    indices.reverse()
    length = len(indices)
    for i in range(m):
        l = int(f_in.readline().rstrip())
        index = bisect_left(indices, l)
        print length - index

if __name__ == "__main__":
    print (timeit.timeit('doit()', setup="from bisect import bisect_left;from __main__ import doit", number=10))

On my machine outputs 12 seconds for 10 runs. Still too slow.

Although I made my code accepted by using |= in the set union operation, yours is more efficient according to my test.

Collectives™ on Stack Overflow

Efficient algorithm for counting unique elements in "suffixes" of an array

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related