2

I have a list of positive (random) integers with the following properties:

Number of elements: 78495

Maximum value of element: 999982

Length of list when converted to a string: 517115 (string looks like "6,79384,238956,...")

Size of list in text file on disk: 520 kb

I am trying to use this list as a precomputed list for an online judge problem because it takes a long time to actually generate this list. However, it is too large to be accepted if I paste it directly into the source code, which has a cap of 50 kb.

I looked into zlib as a way to compress the string but it only seemed to cut the size in half.

Is there a way to really shrink this down so I can unpack it / use it in the source code?

14
  • 2
    You say these are random integers. Why do you need these particular random integers, and why does it take you so long to just rerun your RNG? Commented Jul 16, 2016 at 0:47
  • @user2357112 Random for the sake of this discussion -- getting into the actual math would be outside the question. Commented Jul 16, 2016 at 0:51
  • I take it the order is important? If not then the mean difference between values is only 13, so you could try sorting them and compress the deltas. Commented Jul 16, 2016 at 0:56
  • 1
    Yes, the order is important (to be specific, it is a list of smallest-k values for which 10^k = 1 mod p for primes p > 5). The ith position of the list corresponds to the (i+3)rd prime Commented Jul 16, 2016 at 0:57
  • Does your program need the full list, or can it work on a stream of numbers as they come in? In other words, can you set up two threads: 1) a thread that generates smallest-k values; and, 2) a thread that consumes/uses those values as they are generated? Commented Jul 16, 2016 at 1:01

5 Answers 5

5

Given your definition ...

it is a list of smallest-k values for which 10^k = 1 mod p for primes p > 5

... am I wrong to believe that your values are of the form (p - 1) / x where x is an integer significantly smaller than p?

For instance, for p < 50, we have:

p = 7  : 10^6  = 1 (mod 7)  => k = 6  = (p - 1) / 1  => x = 1
p = 11 : 10^2  = 1 (mod 11) => k = 2  = (p - 1) / 5  => x = 5
p = 13 : 10^6  = 1 (mod 13) => k = 6  = (p - 1) / 2  => x = 2
p = 17 : 10^16 = 1 (mod 17) => k = 16 = (p - 1) / 1  => x = 1
p = 19 : 10^18 = 1 (mod 19) => k = 18 = (p - 1) / 1  => x = 1
p = 23 : 10^22 = 1 (mod 23) => k = 22 = (p - 1) / 1  => x = 1
p = 29 : 10^28 = 1 (mod 29) => k = 28 = (p - 1) / 1  => x = 1
p = 31 : 10^15 = 1 (mod 31) => k = 15 = (p - 1) / 2  => x = 2
p = 37 : 10^3  = 1 (mod 37) => k = 3  = (p - 1) / 12 => x = 12
p = 41 : 10^5  = 1 (mod 41) => k = 5  = (p - 1) / 8  => x = 8
p = 43 : 10^21 = 1 (mod 43) => k = 21 = (p - 1) / 2  => x = 2
p = 47 : 10^46 = 1 (mod 47) => k = 46 = (p - 1) / 1  => x = 1

The list of x values should compress much better than the list of k values. (For instance, I'd be willing to bet that the most frequent value of x will be '1'.)

And because it's rather easy and fast to compute primes up to 1 million (which I think is your upper bound), you may be able to quickly rebuild the list of k values based on the compressed list of x values and the real-time computed list of primes.

You probably should have explained from the beginning what exactly you were trying to compress to get more accurate answers.

Sign up to request clarification or add additional context in comments.

1 Comment

This is a very good point... I forgot that the orders must divide phi(p) = p-1
2

In short, no.

log(2, 999982) ~= 20

So the largest number would take 20 bits to store. Let's say that on average, each number takes 10 bits to store (evenly distributed between 0 and the max).

~80,000 numbers * 10 bits per number = 800,000 bits = 100,000 bytes

So these numbers, stored as efficiently as possible, would take ~100KB of space.

Compression will only work if there's some non-randomness to the numbers. If they're truly random, as you say, then a general compression algorithm won't be able to make this any smaller, so 100KB is about the best you can hope to do.

EDIT

Note that things are even worse, in that you want to paste these into source code, so you can't just use arbitrary binary data. You'll need something text-friendly, like base64 encoding, which will add another ~33% of overhead. Also, you can't really store numbers based on the average number of bits required, because you'd need some way to know how many bits were used by each individual number. There are possible encoding schemes, but all will carry some additional overhead.

SECOND EDIT

Based on the comments above, the data is not actually random as originally stated. A general compression algorithm therefore might work, and if not, there are presumably other solutions (e.g. just shipping the code that generated the numbers in the first place, which is likely smaller than 50KB).

5 Comments

What about converting the numbers to other bases, etc?
Your base-10 numbers are already stored in binary. Any other representation uses the same amount of information.
@user6596353 Your question doesn't really make sense. Computers store data in binary (base 2), and I was already assuming that representation. E.g. the number 255 (in decimal) is 11111111 in binary, which takes 7 bits to store. That's already the most efficient way to store the number in a computer.
I'm talking in terms of textual storage (the whole point of this question to begin with). If I have "12345" that's 5 characters. But if I convert it to (for example) hex, it becomes "3039" (one character fewer)
@user6596353 Right, but I was already talking about storing it directly in binary. I gave a theoretical lower-bound on how much storage it would take.
1

The best text compression available offers a (roughly) 12-17% compression ratio (62.4-90 kB) so you're not going to meet your threshold. Your data are random, as well, which generally makes compression algorithms perform worse.

Look at an alternative approach, such as making your RNG process faster, or if you don't need a full list (just some integers), create a separate "producer" thread to generate random integers (involving whatever actual math you are using) and a "consumer" thread that does work on those integers as they come in. That way, your program could perhaps still do work, even if it would take a long time to generate a full list.

4 Comments

The question is not about text compression, but about integer compression. For which 10x compression factor is doable, depending on the data. (Given that it's random, probably not, so in that case you're absolutely right).
Re-reading the question from 2016, it looks like the data are stored in a text file, as strings. Is that now not correct?
He writes: "I have a list of positive (random) integers". It is indeed stored as a text file! But the data is still a list of integers.
The data may be a list of integers when read in from text and converted to integers. When they are in a text file, they are effectively string data. I'm not sure what is still unclear about this. In any case, this answer is nearly eight years old.
1

Here I've tested easily availiable algorithms in python on two strings: one is generated randomly with non uniform distribution, another one has some structure. It seems, lzma does better

# check the compression ratio
import lzma
import zlib
import gzip
import bz2
import zipfile
import tarfile
compressors = ['lzma','zlib','gzip','bz2'] 
a = np.exp(np.random.rand(1024))
b = np.arange(1024)
b[32] = -10
b[96] = 20000
a = bytes(a)
b = bytes(b)

for i in range(len(compressors)):
    print("{} compression ratio: ".format(compressors[i]))
    a_lzma = eval(compressors[i]).compress(a)
    b_lzma = eval(compressors[i]).compress(b)
    print(float(len(a_lzma))/len(a),float(len(b_lzma))/len(b))
    print("\n")

The output:

lzma compression ratio: 0.93115234375 0.08984375

zlib compression ratio: 0.95068359375 0.1944580078125

gzip compression ratio: 0.9521484375 0.196533203125

bz2 compression ratio: 0.9925537109375 0.1268310546875

Comments

0

There are python bindings for Streamvbyte.

Streamvbyte is a fast, fairly good integer compression library, used by a number of well-known open source projects.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.