
I am building a pandas dataframe Bg by taking samples as rows and r as columns. r is a list of genes that I want to split row-wise across the entire dataframe. My code below takes a long time to run and repeatedly crashes. Is there a more efficient way to achieve this?

import pandas as pd

Bg = pd.DataFrame()

for idx, r in pathway_genes.itertuples():
  for i, p in enumerate(M.index):
    if idx == p:
      for genes, samples in common_mrna.iterrows():
        b = pd.DataFrame({r:samples})        
        Bg = Bg.append(b).fillna(0)

M.index

M.index = ['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
       'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
       'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION', 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM']

pathway_genes

geneSymbols
KEGG_ABC_TRANSPORTERS ['ABCA1', 'ABCA10', 'ABCA12']
KEGG_ACUTE_MYELOID_LEUKEMIA ['AKT1', 'AKT2', 'AKT3', 'ARAF']
KEGG_ADHERENS_JUNCTION ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2']
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5']
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']

common_mrna

common_mrna = pd.DataFrame([[1.2, 1.3, 1.4, 1.5], [1.6,1.7,1.8,1.9], [2.0,2.1,2.2,2.3], [2.4,2.5,2.6,2.7], [2.8,2.9,3.0,3.1],[3.2,3.3,3.4,3.5],[3.6,3.7,3.8,3.9],[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ABCA1','ABCA10','ABCA12','AKT1','AKT2','AKT3','ARAF','ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])

Desired output:

Bg = pd.DataFrame([[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])
  • please provide a single code with constructors for all the inputs (so that one just needs to copy/paste and run to reproduce your data) Commented May 22, 2022 at 15:18
  • Hi, welcome to SO! If it's OK with you, could you please double-check that all your dataframes are aligned and correct? Commented May 22, 2022 at 15:56
  • @KevinChoonLiangYew The dataframes are correct. However, the geneSymbols in pathway_genes df are lists and I'm unable to put it in a single code. Commented May 22, 2022 at 16:44
  • Thanks for improving the format, so to clarify, you are trying to match the pathway_genes with the common_mrna dataframe based on the lists of index from the pathway_genes? And your pathway_genes is a dictionary? Commented May 22, 2022 at 16:51
  • @KevinChoonLiangYew yes, that is correct! Commented May 22, 2022 at 16:53

1 Answer


First of all, you can use a list comprehension to match M_index (the pathway names from M.index) against the keys of pathway_genes:

pathway_genes = {'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'], 'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'], 'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'], 'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'], 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']}

matched_index_symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in M_index]

After that, flatten the matched lists into a single list and use loc to select all of those symbols from common_mrna:

flatten_list = [j for sub in matched_index_symbols for j in sub]

Bg = common_mrna.loc[flatten_list]
Out[26]: 
        TCGA-02-0033-01  TCGA-02-2470-01  TCGA-02-2483-01  TCGA-06-0124-01
ABCA1               1.2              1.3              1.4              1.5
ABCA10              1.6              1.7              1.8              1.9
ABCA12              2.0              2.1              2.2              2.3
ACP1                4.0              4.1              4.2              4.3
ACTB                4.4              4.5              4.6              4.7
ACTG1               4.8              4.9              5.0              5.1
ACTN1               5.2              5.3              5.4              5.5
ACTN2               5.6              5.7              5.8              5.9
ABAT                6.0              6.1              6.2              6.3
ACY3                6.4              6.5              6.6              6.7
ADSL                6.8              6.9              7.0              7.1
ADSS1               7.2              7.3              7.4              7.5
ADSS2               7.6              7.7              7.8              7.9
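
For reference, the two steps can be combined into one self-contained sketch. It assumes that M_index is a plain list of pathway names taken from M.index, and it uses a shortened pathway dictionary with placeholder expression values rather than the question's data. The membership filter before loc is an addition that is not part of the answer above: it skips genes that are missing from common_mrna (such as the KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY genes in the sample), because loc raises a KeyError for labels it cannot find.

```python
import pandas as pd

# Pathway -> gene list, shortened to three pathways for illustration.
pathway_genes = {
    'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'],
    'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'],
    'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'],
}

# Assumption: M_index is a plain list of pathway names taken from M.index.
M_index = ['KEGG_PEROXISOME', 'KEGG_ADHERENS_JUNCTION',
           'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY']

# Placeholder expression values (not the question's data):
# genes in rows, samples in columns.
gene_names = ['ABCA1', 'ABCA10', 'ABCA12', 'ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2']
common_mrna = pd.DataFrame(
    [[float(i), float(i) + 0.5] for i in range(len(gene_names))],
    index=gene_names,
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01'],
)

# Step 1: gene lists of the pathways that are present in M_index.
matched_index_symbols = [genes for name, genes in pathway_genes.items()
                         if name in M_index]

# Step 2: flatten the list of lists into one list of gene symbols.
flatten_list = [gene for sub in matched_index_symbols for gene in sub]

# Added safeguard: keep only genes that exist in common_mrna, preserving order.
present = [gene for gene in flatten_list if gene in common_mrna.index]

# A single label-based selection replaces the nested loops and repeated appends.
Bg = common_mrna.loc[present]
print(Bg)
```

Selecting all rows in one loc call avoids rebuilding the dataframe on every iteration, which is the likely reason the looped version was slow.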

Update

It seems that your pathway_genes is not originally a dictionary but a dataframe. If that's the case, you can look up each pathway's gene list in the geneSymbols column by its index label.

pathway_genes
Out[46]: 
                                                                      geneSymbols
KEGG_ABC_TRANSPORTERS                                        [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA                                 [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION                             [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY             [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM     [ABAT, ACY3, ADSL, ADSS1, ADSS2]

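# Look up the gene list of every pathway that also appears in M_index
matched_index_symbols = [pathway_genes['geneSymbols'].loc[i] for i in pathway_genes.index if i in M_index]

# Flatten with a comprehension; np.array(...).ravel() only flattens when all lists have the same length
flatten_list = [j for sub in matched_index_symbols for j in sub]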
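
The dataframe case can also be written without an explicit Python loop. The following is a minimal sketch, assuming pathway_genes has a single geneSymbols column holding lists and M_index is a plain list of pathway names; the sample values are shortened from the question. Index.isin builds the row mask and Series.explode flattens lists of any length.

```python
import pandas as pd

# Assumption: pathway_genes is a dataframe with one 'geneSymbols' column of lists.
pathway_genes = pd.DataFrame(
    {'geneSymbols': [
        ['ABCA1', 'ABCA10', 'ABCA12'],
        ['AKT1', 'AKT2', 'AKT3', 'ARAF'],
        ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'],
    ]},
    index=['KEGG_ABC_TRANSPORTERS', 'KEGG_ACUTE_MYELOID_LEUKEMIA',
           'KEGG_ADHERENS_JUNCTION'],
)

# Assumption: M_index is a plain list of pathway names.
M_index = ['KEGG_ACUTE_MYELOID_LEUKEMIA', 'KEGG_ADHERENS_JUNCTION', 'KEGG_PEROXISOME']

# Boolean mask of the pathways that also occur in M_index.
mask = pathway_genes.index.isin(M_index)

# explode() gives every list element its own row, so lists of
# different lengths flatten without padding.
flatten_list = pathway_genes.loc[mask, 'geneSymbols'].explode().tolist()
print(flatten_list)
```

The resulting flatten_list can be passed to common_mrna.loc in the same way as before.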

12 Comments

Also, please avoid using enumerate if you are not using the indexes for every loop generated.
dictionary is pathway_genes?
The code fails when I call the index of M df as M.index using symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in list(M.index)]. It returns an empty symbols variable.
Is your M.index the same list as you have provided in your question?
Yes. It looks something like this Index(['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION', 'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME', 'KEGG_LONG_TERM_POTENTIATION', 'KEGG_CYSTEINE_AND_METHIONINE_METABOLISM', 'KEGG_AMINO_SUGAR_AND_NUCLEOTIDE_SUGAR_METABOLISM', 'KEGG_NOTCH_SIGNALING_PATHWAY', 'KEGG_FATTY_ACID_METABOLISM', 'KEGG_LONG_TERM_DEPRESSION', 'KEGG_CITRATE_CYCLE_TCA_CYCLE', dtype='object', name='NAME', length=134)