2
\$\begingroup\$

I want to encode a non-negative integer to ASCII letters and digits. I wrote below code for that, that works, but which I assume too immature to use in production.

#! /usr/bin/env python3

from string import digits, ascii_letters


def encode(number: int, pool: str) -> str:
    """Encode a non-negative integer as string."""
    
    result = ''
    pool_size = len(pool)
    
    while number:
        number, remainder = divmod(number, pool_size)
        result += pool[remainder]
    
    return result


def main():

    size = (1 << (8 * 2)) - 1
    codes = set()

    for i in range(size):
        print(i, code := encode(i, digits+ascii_letters))
        codes.add(code)
    
    print('Unique:', len(codes) == size)


if __name__ == '__main__':
    main()

I also tried the encoding functions in the base64 library to no avail. Most of them will create strings containing other than the desired characters at some point and b16encode() has a too small alphabet, resulting in too long codes (no lower-case).

If you can provide a better solution to convert an arbitrary non-negative integer into a string of characters exclusively from string.ascii_letters + string.digits, that would be great.

Update
Project on GitHub

\$\endgroup\$
4
  • \$\begingroup\$ What is the motivation for doing this? And why are the default base 64 characters unfavourable? \$\endgroup\$ Commented Feb 24, 2022 at 13:46
  • \$\begingroup\$ The motivation is to encode integer IDs into easy-to-remember and easy-to-type PINs for users within a web application for smartphones. \$\endgroup\$ Commented Feb 24, 2022 at 13:48
  • \$\begingroup\$ Is your maximum encodable ID actually sixteen bits? Serially generated IDs will quickly exceed this. \$\endgroup\$ Commented Feb 24, 2022 at 13:51
  • \$\begingroup\$ Yes, it's currently two bytes. But the en-/decoding code should be generalized to an arbitrary amount of bytes. \$\endgroup\$ Commented Feb 24, 2022 at 14:03

1 Answer 1

4
\$\begingroup\$

Your current encode doesn't have validation, and won't produce meaningful output for number below 1.

size and the way you use it in a range don't entirely make sense; I would sooner expect size = 1 << (8 * 2) so that 65535 is included. Also your range should start at 1 and not 0, since 0 produces an empty string. (Or modify your encode so that 0 does indeed produce a non-empty string.)

Given your goal of

easy-to-remember and easy-to-type PINs

I don't think that digits, ascii_letters constitute an appropriate pool. Disambiguation of zero and capital O, lower-case L and upper-case I, etc. are problematic. Best to write your own pool literal that avoids these.

Make a judgement call as to which is easier to remember: a shorter string that includes mixed case, or a longer string with one case only. With a pool of 54 upper-case, lower-case and numeric characters, and an ID of at most 16 bits, the maximum encoded length is

$$ \lceil \frac {16 \log2} {\log54} \rceil = 3 $$

In the other direction, if you want to limit the number of encoded characters to 3, the minimum pool size is

$$ \lceil 2^{\frac {16} 3} \rceil = 41 $$

Suggested

import math
from typing import Iterator

LEGIBLE = (
    'ABCDEFGHJKLMNPRTUVWXYZ'
    'abcdefghijkmnopqrstuvwxyz'
    '2346789'
)

# One above the maximum ID.
ID_LIMIT = 1 << (8 * 2)


def encode_each(number: int, pool: str = LEGIBLE) -> Iterator[str]:
    if number < 0:
        raise ValueError(f'{number} is non-encodable')

    pool_size = len(pool)

    while True:
        number, remainder = divmod(number, pool_size)
        yield pool[remainder]
        if not number:
            break


def encode(number: int, pool: str = LEGIBLE) -> str:
    return ''.join(encode_each(number, pool))


def all_codes(size: int = ID_LIMIT) -> Iterator[str]:
    for i in range(size):
        yield encode(i)


def test() -> None:
    codes = tuple(all_codes())
    assert len(codes) == 65536
    assert len(set(codes)) == 65536

    n_symbols = math.ceil(math.log(65536) / math.log(len(LEGIBLE)))
    for code in codes:
        assert 0 < len(code) <= n_symbols


def demo() -> None:
    ids = 0, 1, 100, 1000, 60000, 65535

    for i in ids:
        print(i, '->', encode(i))


if __name__ == '__main__':
    test()
    demo()

Output

0 -> A
1 -> B
100 -> zB
1000 -> gW
60000 -> GjY
65535 -> mda
\$\endgroup\$

You must log in to answer this question.