Database Compression in Python

Question

I have hourly logs like

user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined

I want to compress all the flat files down to one file. There are around 30 million users in the logs and I just want the latest user log for all the logs.

My end result is I want to have a log look like

user1:added comment
user2:log out
user3:joined

Now my first attempt on a small scale was to just do a dict like

log['user1'] = "added comment"

Will doing a dict of 30 million key/val pairs have a giant memory footprint.. Or should I use something like sqllite to store them.. then just put the contents of the sqllite table back into a file?

Are you doing this once for the logs already recorder, if not, do you want the logs to store one result for an id for a certain period, or for eternity? — Rosh Oxymoron
– Rosh Oxymoron, Commented Dec 24, 2010 at 21:41

Ignacio Vazquez-Abrams · Accepted Answer · 2010-12-24 20:33:42Z

1

If you intern() each log entry then you'll use only one string for each similar log entry regardless of the number of times it shows up, thereby lowering memory usage a lot.

>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = 'f' + ('oo',)[0]
>>> a is b
False
>>> a = intern('foo')
>>> b = intern('f' + ('oo',)[0])
>>> a is b
True

answered Dec 24, 2010 at 20:33

Ignacio Vazquez-Abrams

804k160 gold badges1.4k silver badges1.4k bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Ignacio Vazquez-Abrams Over a year ago

Sure, but the key will automatically be unique for the dictionary.

Matt Billenstein · Accepted Answer · 2010-12-24 21:35:42Z

1

You could also process the log lines in reverse -- then use a set to keep track of which users you've seen:

s = set()

# note, this piece is inefficient in that I'm reading all the lines
# into memory in order to reverse them...  There are recipes out there
# for reading a file in reverse.
lines = open('log').readlines()
lines.reverse()

for line in lines:
    line = line.strip()
    user, op = line.split(':')
    if not user in s:
         print line
         s.add(user)

answered Dec 24, 2010 at 21:35

Matt Billenstein

6781 gold badge4 silver badges7 bronze badges

Comments

Rosh Oxymoron · Accepted Answer · 2010-12-24 21:48:27Z

1

The various dbm modules (dbm in Python 3, or anydbm, gdbm, dbhash, etc. in Python 2) let you create simple databases of key to value mappings. They are stored on the disk so there is no huge memory impact. And you can store them as logs if you wish to.

answered Dec 24, 2010 at 21:48

Rosh Oxymoron

21.1k7 gold badges44 silver badges43 bronze badges

Comments

Brian Clapper · Accepted Answer · 2010-12-24 20:35:45Z

0

This sounds like the perfect kind of problem for a Map/Reduce solution. See:

for example.

answered Dec 24, 2010 at 20:35

Brian Clapper

26.4k7 gold badges67 silver badges66 bronze badges

Comments

nate c · Accepted Answer · 2010-12-24 21:43:12Z

Its pretty to easy to mock up the data structure to see how much memory it would take.

Something like this where you could change gen_string to generate data that would approximate the messages.

import random
from commands import getstatusoutput as gso

def gen_string():
     return str(random.random())

 d = {}
 for z in range(10**6):
     d[gen_string()] = gen_string()

print gso('ps -eo %mem,cmd |grep test.py')[1]

On a one gig netbook:

  0.4 vim test.py
  0.1 /bin/bash -c time python test.py
 11.7 /usr/bin/python2.6 test.py
  0.1 sh -c { ps -eo %mem,cmd |grep test.py; } 2>&1
  0.0 grep test.py

   real    0m26.325s
   user    0m25.945s
   sys     0m0.377s

... So its using about 10% of 1 gig for 100,000 records

But it would also depend on how much data redundancy you have ...

Hugh Bothwell · Accepted Answer · 2010-12-24 23:02:02Z

Thanks to @Ignacio for intern() -

def procLog(logName, userDict):
    inf = open(logName, 'r')
    for ln in inf.readlines():
        name,act = ln.split(':')
        userDict[name] = intern(act)
    inf.close()
    return userDict

def doLogs(logNameList):
    userDict = {}
    for logName in logNameList:
        userDict = procLog(logName, userDict)
    return userDict

def writeOrderedLog(logName, userDict):
    keylist = userDict.keys()
    keylist.sort()
    outf = open(logName,'w')
    for k in keylist:
        outf.write(k + ':' + userDict[k])
    outf.close()

def main():
    mylogs = ['log20101214', 'log20101215', 'log20101216']
    d = doLogs(mylogs)
    writeOrderedLog('cumulativeLog', d)

the question, then, is how much memory this will consume.

def makeUserName():
    ch = random.choice
    syl = ['ba','ma','ta','pre','re','cu','pro','do','tru','ho','cre','su','si','du','so','tri','be','hy','cy','ny','quo','po']
    # 22**5 is about 5.1 million potential names
    return ch(syl).title() + ch(syl) + ch(syl) + ch(syl) + ch(syl)

ch = random.choice
states = ['joined', 'added pic', 'added article', 'added comment', 'voted', 'logged out']
d = {}
t = []
for i in xrange(1000):
    for j in xrange(8000):
        d[makeUserName()] = ch(states)
    t.append( (len(d), sys.getsizeof(d)) )

which results in

alt text

(horizontal axis = number of user names, vertical axis = memory usage in bytes) which is... slightly weird. It looks like a dictionary preallocates quite a lot of memory, then doubles it every time it gets too full?

Anyway, 4 million users takes just under 100MB of RAM - but it actually reallocates around 3 million users, 50MB, so if the doubling holds, you will need about 800MB of RAM to process 24 to 48 million users.

Collectives™ on Stack Overflow

Database Compression in Python

6 Answers 6

1 Comment

Comments

Comments

Comments

Comments

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

6 Answers 6

1 Comment

Comments

Comments

Comments

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Related