I need to extract 1,500,000 out of 25,000,000 records and group them.

The groups and the UUIDs of the records to extract are defined in a separate file (200MB) with the following format:

>Cluster 0
0   70nt, >90ec66e4-c038-41f0-a553-c94864cf3958... at +/80.00%
1   88nt, >2d45d336-a0f4-4eca-8577-b950e11bb4cf... *
2   70nt, >6f6ad8f1-0cfb-4e57-8962-366cd749fa3f... at +/82.86%
>Cluster 1
0   74nt, >5f584468-a231-416d-9156-ff68e11ee096... *
>Cluster 2
0   74nt, >7f584468-a231-416d-9156-ff68e11ee096... *
1   79nt, >f7884902-51d4-48e1-88a3-9adc0bd0f2cd... at +/86.08%

Here's my function for parsing it:

def clstr_parse(filename):
    clstr = None
    with open(filename) as f:
        for line in f:
            if line.startswith('>'):
                if clstr:
                    yield clstr
                clstr = []
            else:
                uuid = line.split()[2][1:37]
                clstr.append(uuid)
    if clstr:
        yield clstr

Then I use it to extract the "groups" (list of UUIDs) that contain more than one UUID:

groups = [grp for grp in clstr_parse('file.clstr') if len(grp) >= 2]
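
For the sample cluster file above, this keeps Cluster 0 (three members) and Cluster 2 (two members) and drops the single-member Cluster 1, so groups would be:

groups = [
    ['90ec66e4-c038-41f0-a553-c94864cf3958',
     '2d45d336-a0f4-4eca-8577-b950e11bb4cf',
     '6f6ad8f1-0cfb-4e57-8962-366cd749fa3f'],
    ['7f584468-a231-416d-9156-ff68e11ee096',
     'f7884902-51d4-48e1-88a3-9adc0bd0f2cd'],
]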

And define a dict (with the UUIDs as keys) for storing the records during their extraction:

records = {uuid: None for grp in groups for uuid in grp}

The file (30GB) from which I need to extract the records is in the following format (the columns are TAB-delimited):

@something ...some_defs...
@...more_things...
92fa0cdf-9e1b-4f83-b6e0-ca35885bfdbd    16  ...more_fields...
2d45d336-a0f4-4eca-8577-b950e11bb4cf    16  ...more_fields...
2d45d336-a0f4-4eca-8577-b950e11bb4cf    2064    ...more_fields...
f7884902-51d4-48e1-88a3-9adc0bd0f2cd    0   ...more_fields...
90ec66e4-c038-41f0-a553-c94864cf3958    16  ...more_fields...
6f6ad8f1-0cfb-4e57-8962-366cd749fa3f    0   ...more_fields...
7f584468-a231-416d-9156-ff68e11ee096    16  ...more_fields...

I made a function for yielding each record:

def sam_parse(filename):
    with open(filename) as f:
        for line in f:
            if line.startswith('@'):
                pass
            else:
                yield line
                for line in f:
                    yield line

Which I use in the extraction process:

for rec in sam_parse('file.sam'):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]

for grp in groups:
    for uuid in grp:
        print(records[uuid])
    print()

The problem is that I would expect this program to take less than 10 minutes to complete (I tested similar code in awk), but it has now been running for 8 hours and still isn't finished. Is there something wrong with the Python code?

Comments

  • The else-branch in your sam_parse function does not make sense to me. Why are you starting another for line in f loop inside the outer for line in f loop? Commented Sep 25, 2023 at 20:32
  • Have you considered using a framework like pandas for this task? Commented Sep 25, 2023 at 20:37
  • "it's been 8 hours that I launched it and it's still running": I would suggest adding progress bars to your loop. This way you can get a rough estimate of lines/second, and then know about how many hours you'll need to wait. Example: github.com/tqdm/tqdm#iterable-based (see the sketch after these comments). Commented Sep 25, 2023 at 20:41
  • I just ran clstr_parse on 28 million entries in the file.clstr file and it finished in 29 seconds. The clstr file was 1.19 GB. Commented Sep 25, 2023 at 20:50
  • @JesseSealand I couldn't reproduce the TABs in SO, but there is one TAB between each non-whitespace column. Commented Sep 25, 2023 at 21:11
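
A minimal sketch of the tqdm suggestion from the comment above, assuming tqdm is installed; the total used here is only a rough assumed record count so that a percentage and ETA can be shown:

from tqdm import tqdm

# Wrap the generator so tqdm reports records/second and an estimated time remaining.
# total=25_000_000 is an assumed count for the 30 GB file; without it, tqdm still
# shows the rate, just without a percentage.
for rec in tqdm(sam_parse('file.sam'), total=25_000_000, unit='rec'):
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]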

1 Answer

I got this to process ~2 million records from the sam file in 6 seconds. I did two things. First, I removed the second yield statement (the inner for line in f loop) from the definition, so that it becomes:

def sam_parse(filename):
    with open(filename) as f:
        for line in f:
            if line.startswith('@'):
                pass
            else:
                yield line

Second, I re-split the flag to make sure it's an integer value. I kept getting errors when I didn't split the flag string a second time.

for rec in sam_parse('file.sam'):
    # The split is on four spaces because the copy of the file used here has
    # four spaces in place of the TAB delimiters (see the comments below).
    (uuid, flag) = rec.split("    ")[0:2]
    flag = flag.split(" ")[0]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]

6 Comments

  • You mean removing the [0:2] after rec.split and then further splitting the flag?
  • I keep getting errors when trying to convert the flag to an int for some reason; that's why I suggest it.
  • Isn't flag = flag.split(" ")[0] a no-op? It only does something if there is whitespace in the flag. But the previous line removes whitespace from the first two components of the split.
  • Remember that my file has 4 spaces in place of a tab character because I copied from the post (a TAB-delimited version is sketched after these comments).
  • I don't really understand what the problem was, but it works with the sam_parse modification. Thank you ;-)
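
Putting the thread together: since the real file is TAB-delimited and the four spaces only appear in the copy pasted into the post, the original whitespace-based split from the question should work unchanged, and only sam_parse needs the modification. A rough sketch of that combined version (an inference from the comments above, not code posted by either party):

def sam_parse(filename):
    # Skip the '@'-prefixed header lines and yield every record line exactly once.
    with open(filename) as f:
        for line in f:
            if not line.startswith('@'):
                yield line

for rec in sam_parse('file.sam'):
    # split() with maxsplit=2 splits on any whitespace, including the real TABs,
    # so no second split of the flag field is needed here.
    (uuid, flag) = rec.split(maxsplit=2)[0:2]
    if uuid in records and int(flag) < 2048:
        records[uuid] = rec[0:-1]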