I need to extract 1,500,000 out of 25,000,000 records and group them.
The groups and the UUIDs of the records to extract are defined in a separate file (200 MB) with the following format:
>Cluster 0
0 70nt, >90ec66e4-c038-41f0-a553-c94864cf3958... at +/80.00%
1 88nt, >2d45d336-a0f4-4eca-8577-b950e11bb4cf... *
2 70nt, >6f6ad8f1-0cfb-4e57-8962-366cd749fa3f... at +/82.86%
>Cluster 1
0 74nt, >5f584468-a231-416d-9156-ff68e11ee096... *
>Cluster 2
0 74nt, >7f584468-a231-416d-9156-ff68e11ee096... *
1 79nt, >f7884902-51d4-48e1-88a3-9adc0bd0f2cd... at +/86.08%
Here's my function for parsing it:
def clstr_parse(filename):
    clstr = None
    with open(filename) as f:
        for line in f:
            if line.startswith('>'):
                if clstr:
                    yield clstr
                clstr = []
            else:
                uuid = line.split()[2][1:37]
                clstr.append(uuid)
    if clstr:
        yield clstr
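For reference, the [1:37] slice pulls the 36-character UUID out of the ">uuid..." token, and the generator yields one list per cluster. A quick sanity check (a sketch; it assumes file.clstr holds exactly the sample lines shown above):

for grp in clstr_parse('file.clstr'):
    print(grp)
# ['90ec66e4-c038-41f0-a553-c94864cf3958', '2d45d336-a0f4-4eca-8577-b950e11bb4cf', '6f6ad8f1-0cfb-4e57-8962-366cd749fa3f']
# ['5f584468-a231-416d-9156-ff68e11ee096']
# ['7f584468-a231-416d-9156-ff68e11ee096', 'f7884902-51d4-48e1-88a3-9adc0bd0f2cd']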
Then I use it to extract the "groups" (each a list of UUIDs) that contain more than one UUID:
groups = [grp for grp in clstr_parse('file.clstr') if len(grp) >= 2]
And I define a dict (with the UUIDs as keys) to store the records during extraction:
records = {uuid: None for grp in groups for uuid in grp}
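On the sample data this keeps clusters 0 and 2 and drops the singleton cluster 1. A quick check (same assumption about file.clstr as above):

print(len(groups))   # 2 groups on the sample data
print(len(records))  # 5 UUIDs; with the real files this should be around the 1,500,000 records to extract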
The file (30 GB) from which I need to extract the records is in the following format (the columns are TAB-delimited):
@something ...some_defs...
@...more_things...
92fa0cdf-9e1b-4f83-b6e0-ca35885bfdbd 16 ...more_fields...
2d45d336-a0f4-4eca-8577-b950e11bb4cf 16 ...more_fields...
2d45d336-a0f4-4eca-8577-b950e11bb4cf 2064 ...more_fields...
f7884902-51d4-48e1-88a3-9adc0bd0f2cd 0 ...more_fields...
90ec66e4-c038-41f0-a553-c94864cf3958 16 ...more_fields...
6f6ad8f1-0cfb-4e57-8962-366cd749fa3f 0 ...more_fields...
7f584468-a231-416d-9156-ff68e11ee096 16 ...more_fields...
I made a function for yielding each record:
def sam_parse(filename):
    with open(filename) as f:
        for line in f:
            if line.startswith('@'):
                pass
            else:
                yield line
                for line in f:
                    yield line
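For what it's worth, the nested loop works because both loops consume the same file iterator: once the first non-header line is reached, the inner loop yields the remainder of the file without re-checking startswith('@'). A behaviourally equivalent sketch using itertools (sam_parse_alt is my name, not from the question, and this is not a claim that it is faster):

from itertools import dropwhile

def sam_parse_alt(filename):
    # Drop the leading '@' header lines, then yield every remaining line as-is.
    with open(filename) as f:
        yield from dropwhile(lambda line: line.startswith('@'), f)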
Which I use in the extraction process:
for rec in sam_parse('file.sam'):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]

for grp in groups:
    for uuid in grp:
        print(records[uuid])
    print()
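For readers unfamiliar with SAM: the second column is the FLAG bitfield, and bit 0x800 (2048) marks a supplementary alignment, so int(flag) < 2048 skips supplementary records such as the 2064 = 2048 + 16 line in the sample. A sketch of the same condition written as an explicit bit test (equivalent here, since 0x800 is the highest defined SAM flag bit):

SUPPLEMENTARY = 0x800  # SAM flag bit for supplementary alignments

if uuid in records and not (int(flag) & SUPPLEMENTARY):
    records[uuid] = rec[0:-1]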
The problem is that I would expect this program to take less than 10 minutes to complete (I tested similar code in awk), but I launched it 8 hours ago and it's still running. Is there something wrong with the Python code?
Comments:

The else branch in your sam_parse function does not make sense to me. Why are you starting another "for line in f" loop inside the outer "for line in f" loop?

"I launched it 8 hours ago and it's still running": I would suggest adding progress bars to your loop. This way you can get a rough estimate of lines/second, and then know about how many hours you'll need to wait. Example: github.com/tqdm/tqdm#iterable-based
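A minimal sketch of the tqdm suggestion from the comment above (it assumes tqdm is installed; the total of 25,000,000 is the record count stated in the question, so the bar can roughly estimate a finish time):

from tqdm import tqdm

for rec in tqdm(sam_parse('file.sam'), total=25_000_000):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]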