I have a list of 2000 files, which I would like to combine:

01-0628-11A-01D-0356-01_hg19
01-0628-11A-01D-0356-01_hg20
01-0628-11A-01D-0356-01_hg21
01-1372-11A-01D-0356-01_hg16
01-1372-11A-01D-0356-01_hg17
...


I have already collected the files with a glob and used regular expressions to rename them to the common identifier (the six-digit code shown below); however, there is a varying number of original files for each identifier.

01-0628
01-0628
01-0628
01-1372
01-1372
...
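For reference, the glob-and-rename step described above can be sketched like this; the directory name and the exact pattern are assumptions based on the example filenames:

```python
import glob
import os
import re

# Hypothetical pattern: capture the leading "NN-NNNN" identifier from
# names like "01-0628-11A-01D-0356-01_hg19".
identifier = re.compile(r"^(\d{2}-\d{4})")

for path in glob.glob(os.path.join("csvs", "*")):
    name = os.path.basename(path)
    match = identifier.match(name)
    if match:
        # The captured group is the common identifier for this file.
        print("%s -> %s" % (name, match.group(1)))
```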

I was originally going to just use a reader and open each file with common names, but I was wondering if there is a more efficient way to do this.

The final output I would like is the following, each of the files with common identifiers combined into one:

01-0628
01-1372
...

All of the files contain similarly formatted data, so simply appending the existing files to a new file would not be an issue.
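If the rows really are interchangeable, the simple approach mentioned above needs no CSV parsing at all: append each file's raw contents to one output file per identifier. A minimal sketch, assuming the input and output directory names and a 7-character identifier such as "01-0628":

```python
import glob
import os

prefix_length = 7  # length of the common identifier, e.g. "01-0628"

for path in sorted(glob.glob(os.path.join("csvs", "*.csv"))):
    prefix = os.path.basename(path)[:prefix_length]
    out_path = os.path.join("combined_csvs", prefix + ".csv")
    # Open the output in append mode so each source file's rows
    # accumulate at the bottom of the per-identifier file.
    with open(path) as src, open(out_path, "a") as dst:
        dst.write(src.read())
```

Note this copies every header line too; it only works cleanly if the files are headerless or you strip repeated headers.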

  • It looks like you'll need to open each file anyway, right? So I don't see how you could be more efficient than opening them one by one. Commented Jun 30, 2013 at 22:16
  • Can't you just read through each file and open('newlargefile.csv', 'a') to append to one new large file (or to the first file)? It'll just keep appending new lines to the bottom of the written file. Example: stackoverflow.com/questions/4706499/… Commented Jun 30, 2013 at 22:31

1 Answer


Assuming these csvs have similar or identical fields, this code should work. It uses the DictReader and DictWriter classes of the csv module, which map csv rows to Python dictionaries.

1) It opens and reads in the globbed csv files, from in_csv_dir, into a (filename, rows) dictionary.

2) It groups the csv rows into a (prefix, rows) dictionary based on filenames and the prefix_length variable.

3) It combines the fields of each prefix grouping and creates a combined csv in the out_csv_dir.

4) Because the field set is built as an unordered union, you can specify your preferred column order in field_order. Fields listed there sort first, in that order; fields not defined in field_order won't cause a failure, they just sort after the listed ones.

import os
import glob
from csv import DictReader, DictWriter

in_csv_dir = os.path.join(".", "csvs")
out_csv_dir = os.path.join(".", "combined_csvs")
prefix_length = 7  # length of the common identifier, e.g. "01-0628"

# Preferred column order; fields not listed here sort after these.
field_order = ["NAME", "TITLE", "COMPANY", "LOCATION"]

def field_check(q):
    return field_order.index(q) if q in field_order else len(field_order)

# 1) Read every globbed csv into a (filename, rows) dictionary.
csvs = {}
for glob_filename in glob.glob(os.path.join(in_csv_dir, "*.csv")):
    print("%-11s%s" % ("Opening:", glob_filename))
    with open(glob_filename, newline="") as file_obj:
        cur_record = list(DictReader(file_obj))
    if cur_record:
        filename = os.path.splitext(os.path.basename(glob_filename))[0]
        csvs[filename] = cur_record

# 2) Group the rows into a (prefix, rows) dictionary keyed on the
#    first prefix_length characters of each filename.
csv_groups = {}
for key, rows in csvs.items():
    csv_groups.setdefault(key[:prefix_length], []).extend(rows)

# 3) Combine the fields of each prefix group and write one csv per prefix.
os.makedirs(out_csv_dir, exist_ok=True)
for key, sub_csvs in csv_groups.items():
    com_keys = set()
    for row in sub_csvs:
        com_keys |= set(row)
    com_keys = sorted(com_keys, key=field_check)

    filename = os.path.join(out_csv_dir, "%s.csv" % key)
    print("%-11s%s" % ("Combining:", filename))
    with open(filename, "w", newline="") as file_obj:
        temp_csv = DictWriter(file_obj, com_keys)
        temp_csv.writeheader()
        temp_csv.writerows(sub_csvs)