Below is a Python script that checks whether certain words are found in each of a list of files.

with open('potentiation.txt') as experiment:
    lines = experiment.read().splitlines()
receptors=['crystal_1.txt', 'modeller_1.txt', 'moe_1.txt',
           'nci5_modeller0000_1.txt', 'nci5_modeller0001_1.txt',
           'nci5_modeller0002_1.txt', 'nci5_modeller0003_1.txt',
           'nci5_modeller0004_1.txt', 'nci5_modeller0005_1.txt',
           'nci5_modeller0006_1.txt', 'nci5_modeller0007_1.txt',
           'nci5_modeller0008_1.txt', 'nci5_modeller0009_1.txt',
           'nci5_modeller0010_1.txt', 'nci5_modeller0011_1.txt',
           'nci5_moe0000_1.txt', 'nci5_moe0001_1.txt', 'nci5_moe0002_1.txt',
           'nci5_moe0003_1.txt', 'nci5_moe0004_1.txt', 'nci5_moe0005_1.txt',
           'nci5_moe0006_1.txt', 'nci5_moe0007_1.txt', 'nci5_moe0008_1.txt',
           'nci5_moe0009_1.txt', 'nci5_moe0010_1.txt', 'nci5_moe0011_1.txt',
           'nci5_moe0012_1.txt', 'nci5_moe0013_1.txt', 'nci5_moe0014_1.txt']

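for ligand in lines:
    for protein in receptors:
        # open each receptor file and check whether it contains the ligand
        with open(protein, "r") as file1:
            read1 = file1.read()
        find_hit = read1.find(ligand)
        if find_hit == -1:
            print('%s %s Not Found' % (ligand, protein))
        else:
            print('%s %s Found' % (ligand, protein))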

An example of the output of this code is below:

345647 nci5_moe0012_1.txt Not Found
345647 nci5_moe0013_1.txt Not Found
345647 nci5_moe0014_1.txt Found

My question is: how can I take this output and format it into a CSV file that looks like the example below?

Ligand  nci5_moe0012_1  nci5_moe0013_1  nci5_moe0014_1
345647  Not Found       Not Found       Found

2 Answers

I think something like this would do it (assuming your output file is tab-delimited):

import csv

receptors = ['crystal_1', 'modeller_1', 'moe_1',
             'nci5_modeller0000_1', 'nci5_modeller0001_1',
             'nci5_modeller0002_1', 'nci5_modeller0003_1',
             'nci5_modeller0004_1', 'nci5_modeller0005_1',
             'nci5_modeller0006_1', 'nci5_modeller0007_1',
             'nci5_modeller0008_1', 'nci5_modeller0009_1',
             'nci5_modeller0010_1', 'nci5_modeller0011_1',
             'nci5_moe0000_1', 'nci5_moe0001_1', 'nci5_moe0002_1',
             'nci5_moe0003_1', 'nci5_moe0004_1', 'nci5_moe0005_1',
             'nci5_moe0006_1', 'nci5_moe0007_1', 'nci5_moe0008_1',
             'nci5_moe0009_1', 'nci5_moe0010_1', 'nci5_moe0011_1',
             'nci5_moe0012_1', 'nci5_moe0013_1', 'nci5_moe0014_1']

with open('potentiation.txt', 'rt') as experiment, \
     open('output.csv', 'wb') as outfile:
    csv_writer = csv.writer(outfile, delimiter='\t')
    csv_writer.writerow(['Ligand'] + receptors)  # header row
    for ligand in (line.rstrip() for line in experiment):
        row = [ligand]
        for protein in receptors:
            with open(protein+'.txt', "rt") as file1:
                found = 'Found' if ligand in file1.read() else 'Not Found'
                row.append(found)
        csv_writer.writerow(row)

print('output.csv file written')

Update

As I said in a comment, this could be done much faster by reading each protein file only once. To do that and still format the output the way you want, the result of checking each ligand against each file has to be stored in a data structure that is built up incrementally as each file is read, and then written out all at once after every file has been processed. A simple list of lists is adequate for this purpose and is used in the implementation below.

The trade-off is using more memory versus rereading the protein files over and over. Since disk I/O is often one of the slowest operations on a computer, the potentially large performance gain for only a slight increase in code complexity is probably worthwhile.

Here's the code showing this alternative version:

import csv

receptors = ['crystal_1', 'modeller_1', 'moe_1',
             'nci5_modeller0000_1', 'nci5_modeller0001_1',
             'nci5_modeller0002_1', 'nci5_modeller0003_1',
             'nci5_modeller0004_1', 'nci5_modeller0005_1',
             'nci5_modeller0006_1', 'nci5_modeller0007_1',
             'nci5_modeller0008_1', 'nci5_modeller0009_1',
             'nci5_modeller0010_1', 'nci5_modeller0011_1',
             'nci5_moe0000_1', 'nci5_moe0001_1', 'nci5_moe0002_1',
             'nci5_moe0003_1', 'nci5_moe0004_1', 'nci5_moe0005_1',
             'nci5_moe0006_1', 'nci5_moe0007_1', 'nci5_moe0008_1',
             'nci5_moe0009_1', 'nci5_moe0010_1', 'nci5_moe0011_1',
             'nci5_moe0012_1', 'nci5_moe0013_1', 'nci5_moe0014_1']

# initialize list of lists holding each ligand and its presence in each receptor
with open('potentiation.txt') as experiment:
    ligands = [[ligand] for ligand in (line.rstrip() for line in experiment)]

for protein in receptors:
    with open(protein + '.txt') as protein_file:
        protein_file_data = protein_file.read()
        for row in ligands:
            # determine if this ligand (row[0]) appears in protein data
            row.append('Found' if row[0] in protein_file_data else 'Not Found')

with open('output.csv', 'wb') as outfile:
    csv_writer = csv.writer(outfile, delimiter='\t')
    csv_writer.writerow(['Ligand'] + receptors)  # header row
    csv_writer.writerows(ligands)

print('output.csv file written')
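
For Python 3, the same read-each-file-once idea might look like the sketch below. It is only an illustration under stated assumptions: it assumes Python 3, where the csv module expects the output file to be opened in text mode with newline='' rather than 'wb', and the function name write_presence_table and its parameters are hypothetical. Passing lineterminator='\n' replaces the csv module's default '\r\n' row ending, which is what some editors display as ^M. Note that this version uses the full file paths as column headers.

```python
import csv


def write_presence_table(ligand_path, receptor_paths, output_path):
    """Write one row per ligand and one Found/Not Found column per receptor file."""
    # Read the ligand identifiers once, skipping blank lines.
    with open(ligand_path) as experiment:
        ligands = [line.strip() for line in experiment if line.strip()]

    # One row per ligand; each row starts with the ligand itself.
    rows = [[ligand] for ligand in ligands]

    # Read each receptor file exactly once and test every ligand against it.
    for path in receptor_paths:
        with open(path) as receptor_file:
            data = receptor_file.read()
        for row in rows:
            row.append('Found' if row[0] in data else 'Not Found')

    # newline='' lets the csv module control line endings; lineterminator='\n'
    # avoids the default '\r\n', which shows up as ^M in some editors.
    with open(output_path, 'w', newline='') as outfile:
        writer = csv.writer(outfile, delimiter='\t', lineterminator='\n')
        writer.writerow(['Ligand'] + list(receptor_paths))  # header row
        writer.writerows(rows)
    return rows
```

It could be called as write_presence_table('potentiation.txt', [name + '.txt' for name in receptors], 'output.csv').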

9 Comments

Thanks! When I use this code, I get the following error message: csv_writer([ligand, protein, "Found" if found else "Not Found"]) TypeError: '_csv.writer' object is not callable. Any suggestions?
Thanks, this works! One more question: what does ^M mean? It appears in the output CSV after each protein_file. Is there a way to get rid of it?
That's a carriage return character. My last update may get rid of it. If it doesn't, it may be because you're using Python 3 but didn't specify that in your question (and should let me know).
Adam: After rereading your question I realized my answer only converted the loop output into CSV format but did not arrange it the way you wanted. My latest update should correct that.
Thanks for catching that. There is actually one more problem with the script. The script is used to find whether a certain ligand is found or not found within various protein files. However, the output of the script is currently showing "Not Found" for all ligands for each protein file. This is not correct as there should be some that were "Found" and some "Not Found". I think a simple conditional expression should work. How can it best be introduced into the script?

You can store your results in lists (one list for the ligand, one for the proteins), putting the label "Protein" and the value of the ligand at index 0 of the appropriate list. After that it is easy to save them to a text file.
To save, open a file for writing and turn each list into a string:

my_string = " ".join(map(str, lst))

and then write my_string to the file (do this for each list).
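
As a rough sketch of that idea, assuming the Found/Not Found results for one ligand have already been collected: the sample values come from the example output in the question, while the helper list_to_line and the file name results.txt are made up for illustration.

```python
# Hypothetical results for one ligand; index 0 holds the label or the ligand value.
proteins = ['Protein', 'nci5_moe0012_1', 'nci5_moe0013_1', 'nci5_moe0014_1']
ligand_row = ['345647', 'Not Found', 'Not Found', 'Found']


def list_to_line(lst, sep=' '):
    """Join the items of a list into a single string."""
    return sep.join(map(str, lst))


# Write both lists to the same file, one line per list.
with open('results.txt', 'w') as outfile:
    for lst in (proteins, ligand_row):
        outfile.write(list_to_line(lst) + '\n')
```

Because "Not Found" itself contains a space, a separator such as "," or a tab is probably a safer choice than a space if the file will be parsed later.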

5 Comments

Or you can use a dictionary (keys are ligands and values are tuples of (file, Found/Not Found)).
Thanks for the response. I am pretty new to Python. Could you explain more about how I can write two different lists to a single text file and include the output data (Found or Not Found)?
Is it clearer now? You can also use "," in the join method (to be closer to CSV).
Okay, so one more question: how can I save both lists as one text file?
Here these are not lists but strings!
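
The dictionary suggestion from the comments could be sketched roughly as follows. The sample data is taken from the example output in the question; the names results and dict_to_lines are hypothetical, and storing a list of (file, status) tuples per ligand is one possible reading of the suggestion.

```python
# Keys are ligands; values are lists of (file, status) tuples.
results = {
    '345647': [('nci5_moe0012_1.txt', 'Not Found'),
               ('nci5_moe0013_1.txt', 'Not Found'),
               ('nci5_moe0014_1.txt', 'Found')],
}


def dict_to_lines(results):
    """Turn each ligand's (file, status) tuples into one comma-separated line."""
    lines = []
    for ligand, hits in results.items():
        statuses = [status for _file, status in hits]
        lines.append(','.join([ligand] + statuses))
    return lines


print('\n'.join(dict_to_lines(results)))
```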
