merge sort in python

Question

Basically, I have a bunch of files containing domains. I've sorted each individual file by TLD using .sort(key=func_that_returns_tld).

Now that I've done that, I want to merge all the files and end up with one massive sorted file. I assume I need something like this:

open all files
read one line from each file into a list
sort list with .sort(key=func_that_returns_tld)
output that list to file
loop by reading next line

Am I thinking about this right? Any advice on how to accomplish this would be appreciated.

nothing, but I'm not on Linux for this project. ;/ I'm doing it for someone else and the file is just too big to migrate off their machine.
– d-c, Commented Aug 24, 2010 at 18:41
Remembering that Python exists on platforms other than Linux... Sort-Object is available in PowerShell
– Nick T, Commented Aug 24, 2010 at 18:43

unutbu · Accepted Answer · 2011-06-28 12:25:54Z

8

If your files are not very large, then simply read them all into memory (as S. Lott suggests). That would definitely be simplest.

However, you mention collation creates one "massive" file. If it's too massive to fit in memory, then perhaps use heapq.merge. It may be a little harder to set up, but it has the advantage of not requiring that all the iterables be pulled into memory at once.

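import heapq
import contextlib

class Domain(object):
    def __init__(self, domain):
        self.domain = domain
    @property
    def tld(self):
        # Put your function for calculating TLD here.
        # This placeholder returns the text before the first dot.
        return self.domain.split('.', 1)[0]
    def __lt__(self, other):
        # Strict less-than, so equal keys are not reported as smaller
        return self.tld < other.tld
    def __str__(self):
        return self.domain

# Python 2 only: the built-in file type and contextlib.nested do not exist in Python 3
class DomFile(file):
    def next(self):
        # Wrap each line in a Domain so heapq.merge compares by tld
        return Domain(file.next(self).strip())

filenames = ('data1.txt', 'data2.txt')
with contextlib.nested(*(DomFile(filename, 'r') for filename in filenames)) as fhs:
    for elt in heapq.merge(*fhs):
        print(elt)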

with data1.txt:

google.com
stackoverflow.com
yahoo.com

and data2.txt:

standards.freedesktop.org
www.imagemagick.org

yields:

google.com
stackoverflow.com
standards.freedesktop.org
www.imagemagick.org
yahoo.com
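
On Python 3 the same idea can be sketched without subclassing file. This is only a rough version under a few assumptions: it needs Python 3.5+ for the key argument of heapq.merge, and the names tld, stripped_lines and merge_sorted_files are made up for illustration. Here tld takes the text after the last dot, which may not match your own TLD function.

```python
import contextlib
import heapq


def tld(domain):
    # Hypothetical key function: the text after the last dot, e.g. "com".
    return domain.rsplit(".", 1)[-1]


def stripped_lines(fh):
    # Yield lines lazily, without the trailing newline.
    for line in fh:
        yield line.rstrip("\n")


def merge_sorted_files(filenames, out_path, key=tld):
    # Every input file must already be sorted by the same key.
    with contextlib.ExitStack() as stack:
        sources = [stripped_lines(stack.enter_context(open(name)))
                   for name in filenames]
        with open(out_path, "w") as out:
            # heapq.merge holds only one pending line per file in memory.
            for domain in heapq.merge(*sources, key=key):
                out.write(domain + "\n")
```

Calling merge_sorted_files(['data1.txt', 'data2.txt'], 'merged.txt') would write the merged result to merged.txt (a hypothetical output name). ExitStack takes over the job contextlib.nested did above: it closes every opened file when the block exits.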

edited Jun 28, 2011 at 12:25

answered Aug 24, 2010 at 18:48

unutbu


4 Comments

d-c Over a year ago

I wouldn't know how to make this work in my case. I need to use the 'key' function of .sort() because I'm sorting based on TLD rather than the first character in the line.

unutbu Over a year ago

I've edited my answer to show how you could sort things other than numbers.

Jochen Ritzel Over a year ago

@~unutbu: You can use sorted( lst, key=Domain) instead of explicitly mapping.

unutbu Over a year ago

@THC4k: Thanks for the suggestion, but I'm not sure I follow. sorted will return strings. I need dom1 to be an iterable of Domain objects. (Otherwise, heapq.merge will brainlessly merge them as strings instead of according to TLD.)

S.Lott · Answer · 2010-08-24 18:53:57Z

0

Unless your file is incomprehensibly huge, it will fit into memory.

Your pseudo-code is hard to read. Please indent your pseudo-code correctly. The final "loop by reading next line" makes no sense.

Basically, it's this.

all_data = []
for f in list_of_files:
    with open(f, 'r') as source:
        all_data.extend(source.readlines())
all_data.sort(... whatever your keys are... )

You're done. You can write all_data to a file, or process it further or whatever you want to do with it.
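
For a concrete picture, here is a minimal sketch of the same read-everything-then-sort approach with the key filled in. The tld helper and the sort_all name are assumptions for illustration, not the asker's actual function; swap in whatever key you used for the individual files.

```python
def tld(domain):
    # Hypothetical key function: the text after the last dot.
    return domain.rsplit(".", 1)[-1]


def sort_all(list_of_files, key=tld):
    all_data = []
    for name in list_of_files:
        with open(name, "r") as source:
            # Strip newlines so a file without a final newline
            # cannot glue two domains together on output.
            all_data.extend(line.rstrip("\n") for line in source)
    all_data.sort(key=key)
    return all_data
```

Since everything is sorted again in one go, the input files do not need to be pre-sorted for this version.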

answered Aug 24, 2010 at 18:53

S.Lott


1 Comment

S.Lott Over a year ago

@d-c: Nope. A few gigs is fine. You have to get to more gigs than you have virtual memory configured in your OS swap space before it begins to matter.

JonC · Answer · 2010-08-24 19:34:42Z

0

Another option (again, only if all your data won't fit into memory) is to create a SQLite3 database, do the sorting there, and write the result to a file afterwards.
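
A rough sketch of how that could look with the standard sqlite3 module. The table layout, the tld helper, the function name and the default domains.db path are all assumptions made up for this example.

```python
import sqlite3


def tld(domain):
    # Hypothetical key function: the text after the last dot.
    return domain.rsplit(".", 1)[-1]


def sort_with_sqlite(filenames, out_path, db_path="domains.db"):
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("CREATE TABLE IF NOT EXISTS domains (tld TEXT, domain TEXT)")
        conn.execute("DELETE FROM domains")  # start from an empty table
        for name in filenames:
            with open(name) as source:
                stripped = (line.strip() for line in source)
                # Store the sort key next to each non-empty line.
                rows = ((tld(d), d) for d in stripped if d)
                conn.executemany("INSERT INTO domains VALUES (?, ?)", rows)
        conn.commit()
        with open(out_path, "w") as out:
            # Let the database do the ordering and stream rows back out.
            query = "SELECT domain FROM domains ORDER BY tld"
            for (domain,) in conn.execute(query):
                out.write(domain + "\n")
    finally:
        conn.close()
```

The rows are fed in through a generator and read back through a cursor, so the Python side never holds the whole data set in a list.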

answered Aug 24, 2010 at 19:34

JonC

