I need to extract 1,500,000 out of 25,000,000 records and group them.

The groups and the UUIDs of the records to extract are defined in a separate file (200MB) with the following format:

>Cluster 0
0   70nt, >90ec66e4-c038-41f0-a553-c94864cf3958... at +/80.00%
1   88nt, >2d45d336-a0f4-4eca-8577-b950e11bb4cf... *
2   70nt, >6f6ad8f1-0cfb-4e57-8962-366cd749fa3f... at +/82.86%
>Cluster 1
0   74nt, >5f584468-a231-416d-9156-ff68e11ee096... *
>Cluster 2
0   74nt, >7f584468-a231-416d-9156-ff68e11ee096... *
1   79nt, >f7884902-51d4-48e1-88a3-9adc0bd0f2cd... at +/86.08%

Here's my function for parsing it:

def clstr_parse(filename):
    clstr = None
    with open(filename) as f:
        for line in f:
            if line.startswith('>'):
                if clstr:
                    yield clstr
                clstr = []
            else:
                uuid = line.split()[2][1:37]
                clstr.append(uuid)
    if clstr:
        yield clstr

Then I use it to extract the "groups" (list of UUIDs) that contain more than one UUID:

groups = [grp for grp in clstr_parse('file.clstr') if len(grp) >= 2]
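
For the sample cluster file above, this keeps Cluster 0 (three members) and Cluster 2 (two members) and drops the single-member Cluster 1, so groups would be:

groups = [
    ['90ec66e4-c038-41f0-a553-c94864cf3958',
     '2d45d336-a0f4-4eca-8577-b950e11bb4cf',
     '6f6ad8f1-0cfb-4e57-8962-366cd749fa3f'],
    ['7f584468-a231-416d-9156-ff68e11ee096',
     'f7884902-51d4-48e1-88a3-9adc0bd0f2cd'],
]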

And define a dict (with the UUIDs as keys) for storing the records during their extraction:

records = {uuid: None for grp in groups for uuid in grp}

The file (30GB) from which I need to extract the records is in the following format (the columns are TAB-delimited):

@something ...some_defs...
@...more_things...
92fa0cdf-9e1b-4f83-b6e0-ca35885bfdbd    16  ...more_fields...
2d45d336-a0f4-4eca-8577-b950e11bb4cf    16  ...more_fields...
2d45d336-a0f4-4eca-8577-b950e11bb4cf    2064    ...more_fields...
f7884902-51d4-48e1-88a3-9adc0bd0f2cd    0   ...more_fields...
90ec66e4-c038-41f0-a553-c94864cf3958    16  ...more_fields...
6f6ad8f1-0cfb-4e57-8962-366cd749fa3f    0   ...more_fields...
7f584468-a231-416d-9156-ff68e11ee096    16  ...more_fields...

I made a function for yielding each record:

def sam_parse(filename):
    with open(filename) as f:
        for line in f:
            if line.startswith('@'):
                pass
            else:
                yield line
                for line in f:
                    yield line

Which I use in the extraction process:

for rec in sam_parse('file.sam'):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]

for grp in groups:
    for uuid in grp:
        print(records[uuid])
    print()

The problem is that I would expect this program to take less than 10 minutes to complete (I tested similar code in awk), but it has now been running for 8 hours and still isn't finished. Is there something wrong with the Python code?

Comments

  • The else-branch in your sam_parse function does not make sense to me. Why are you starting another for line in f loop inside the outer for line in f loop? Commented Sep 25, 2023 at 20:32
  • Have you considered using a framework like pandas for this task? Commented Sep 25, 2023 at 20:37
  • "it's been 8 hours that I launched it and it's still running": I would suggest adding progress bars to your loop. This way you can get a rough estimate of lines/second, and then know about how many hours you'll need to wait. Example: github.com/tqdm/tqdm#iterable-based (see the sketch after these comments). Commented Sep 25, 2023 at 20:41
  • I just ran clstr_parse on 28 million entries in the file.clstr file and it finished in 29 seconds. The clstr file was 1.19 GB. Commented Sep 25, 2023 at 20:50
  • @JesseSealand I couldn't reproduce the TABs in SO, but there is one TAB between each non-whitespace column. Commented Sep 25, 2023 at 21:11
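
A minimal sketch of the tqdm suggestion from the comment above, assuming tqdm is installed; the total used here is only a rough assumed record count so that a percentage and ETA can be shown:

from tqdm import tqdm

# Wrap the generator so tqdm reports records/second and an estimated time remaining.
# total=25_000_000 is an assumed count for the 30 GB file; without it, tqdm still
# shows the rate, just without a percentage.
for rec in tqdm(sam_parse('file.sam'), total=25_000_000, unit='rec'):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]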

1 Answer

I got this to process ~2 million records from the sam file in 6 seconds. I did two things. First, I removed the second yield statement (the inner for line in f loop) from the definition, so that it becomes:

def sam_parse(filename):
    with open(filename) as f:
        for line in f:
            if line.startswith('@'):
                pass
            else:
                yield line

Second, I re-split the flag to make sure it's an integer value. I kept getting errors when I didn't split the flag string a second time.

for rec in sam_parse('file.sam'):
    # The split is on four spaces because the copy of the file used here has
    # four spaces in place of the TAB delimiters (see the comments below).
    (uuid, flag) = rec.split("    ")[0:2]
    flag = flag.split(" ")[0]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]

6 Comments

  • You mean removing the [0:2] after rec.split and then further splitting the flag?
  • I keep getting errors when trying to convert the flag to an int for some reason; that's why I suggest it.
  • Isn't flag = flag.split(" ")[0] a no-op? It only does something if there is whitespace in the flag. But the previous line removes whitespace from the first two components of the split.
  • Remember that my file has 4 spaces in place of a tab character because I copied from the post (a TAB-delimited version is sketched after these comments).
  • I don't really understand what the problem was, but it works with the sam_parse modification. Thank you ;-)
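
Putting the thread together: since the real file is TAB-delimited and the four spaces only appear in the copy pasted into the post, the original whitespace-based split from the question should work unchanged, and only sam_parse needs the modification. A rough sketch of that combined version (an inference from the comments above, not code posted by either party):

def sam_parse(filename):
    # Skip the '@'-prefixed header lines and yield every record line exactly once.
    with open(filename) as f:
        for line in f:
            if not line.startswith('@'):
                yield line

for rec in sam_parse('file.sam'):
    # split() with maxsplit=2 splits on any whitespace, including the real TABs,
    # so no second split of the flag field is needed here.
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]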