Efficient way to iterate over rows and columns in pandas

Question

I have a pandas dataframe Bg that is built with samples in rows and r in columns, where r is a list of genes that I want to split row-wise across the entire dataframe. My code below takes a long time to run and repeatedly crashes. Is there a more efficient way to achieve this?

import pandas as pd

Bg = pd.DataFrame()

for idx, r in pathway_genes.itertuples():
  for i, p in enumerate(M.index):
    if idx == p:
      for genes, samples in common_mrna.iterrows():
        b = pd.DataFrame({r:samples})        
        Bg = Bg.append(b).fillna(0)

M.index

M.index = ['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
       'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
       'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION', 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM']

pathway_genes

                                                 geneSymbols
KEGG_ABC_TRANSPORTERS                            ['ABCA1', 'ABCA10', 'ABCA12']
KEGG_ACUTE_MYELOID_LEUKEMIA                      ['AKT1', 'AKT2', 'AKT3', 'ARAF']
KEGG_ADHERENS_JUNCTION                           ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2']
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY             ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5']
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM  ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']

common_mrna

common_mrna = pd.DataFrame([[1.2, 1.3, 1.4, 1.5], [1.6,1.7,1.8,1.9], [2.0,2.1,2.2,2.3], [2.4,2.5,2.6,2.7], [2.8,2.9,3.0,3.1],[3.2,3.3,3.4,3.5],[3.6,3.7,3.8,3.9],[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ABCA1','ABCA10','ABCA12','AKT1','AKT2','AKT3','ARAF','ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])

Desired output:

Bg = pd.DataFrame([[4.0,4.1,4.2,4.3],[4.4,4.5,4.6,4.7],[4.8,4.9,5.0,5.1],[5.2,5.3,5.4,5.5],[5.6,5.7,5.8,5.9],[6.0,6.1,6.2,6.3],[6.4,6.5,6.6,6.7],[6.8,6.9,7.0,7.1],[7.2,7.3,7.4,7.5],[7.6,7.7,7.8,7.9]], columns=['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01'], index =['ACP1','ACTB','ACTG1','ACTN1','ACTN2','ABAT','ACY3','ADSL','ADSS1','ADSS2'])

Please provide a single code block with constructors for all the inputs (so that one just needs to copy/paste and run to reproduce your data).
– mozway, Commented May 22, 2022 at 15:18
Hi, welcome to SO! If it's OK with you, could you please double-check that all your dataframes are aligned and correct?
– user16836078, Commented May 22, 2022 at 15:56
@KevinChoonLiangYew The dataframes are correct. However, the geneSymbols in the pathway_genes df are lists, and I'm unable to put them in a single code block.
– melolilili, Commented May 22, 2022 at 16:44
Thanks for improving the format. So, to clarify, you are trying to match pathway_genes with the common_mrna dataframe based on the lists of index labels in pathway_genes? And is your pathway_genes a dictionary?
– user16836078, Commented May 22, 2022 at 16:51

score 0 · Accepted Answer · 2022-05-23 00:17:04Z

First of all, you can use a list comprehension to match M_index (the labels in M.index) with pathway_genes.

pathway_genes = {'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'], 'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'], 'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'], 'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'], 'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']}

matched_index_symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in M_index]

After that, flatten the matched lists and use loc to select all the symbols.

flatten_list = [j for sub in matched_index_symbols for j in sub]

Bg = common_mrna.loc[flatten_list]
Out[26]: 
        TCGA-02-0033-01  TCGA-02-2470-01  TCGA-02-2483-01  TCGA-06-0124-01
ABCA1               1.2              1.3              1.4              1.5
ABCA10              1.6              1.7              1.8              1.9
ABCA12              2.0              2.1              2.2              2.3
ACP1                4.0              4.1              4.2              4.3
ACTB                4.4              4.5              4.6              4.7
ACTG1               4.8              4.9              5.0              5.1
ACTN1               5.2              5.3              5.4              5.5
ACTN2               5.6              5.7              5.8              5.9
ABAT                6.0              6.1              6.2              6.3
ACY3                6.4              6.5              6.6              6.7
ADSL                6.8              6.9              7.0              7.1
ADSS1               7.2              7.3              7.4              7.5
ADSS2               7.6              7.7              7.8              7.9
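As a self-contained check, the two steps above can be sketched end to end. This is only an illustration under some assumptions: M_index is taken to be the six labels listed under M.index in the question, common_mrna is rebuilt from the same values with a comprehension, and the names wanted, matched and flat are made up for this sketch. With that M_index, only the two pathways that appear in both places contribute rows.

```python
import pandas as pd

# Assumption: M_index holds the labels shown under M.index in the question.
M_index = [
    'KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
    'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
    'KEGG_LONG_TERM_POTENTIATION', 'KEGG_ADHERENS_JUNCTION',
    'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM',
]

pathway_genes = {
    'KEGG_ABC_TRANSPORTERS': ['ABCA1', 'ABCA10', 'ABCA12'],
    'KEGG_ACUTE_MYELOID_LEUKEMIA': ['AKT1', 'AKT2', 'AKT3', 'ARAF'],
    'KEGG_ADHERENS_JUNCTION': ['ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2'],
    'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY': ['ACACB', 'ACSL1', 'ACSL3', 'ACSL4', 'ACSL5'],
    'KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM': ['ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2'],
}

genes = ['ABCA1', 'ABCA10', 'ABCA12', 'AKT1', 'AKT2', 'AKT3', 'ARAF',
         'ACP1', 'ACTB', 'ACTG1', 'ACTN1', 'ACTN2',
         'ABAT', 'ACY3', 'ADSL', 'ADSS1', 'ADSS2']
samples = ['TCGA-02-0033-01', 'TCGA-02-2470-01', 'TCGA-02-2483-01', 'TCGA-06-0124-01']

# Same values as common_mrna in the question: 1.2, 1.3, ... in steps of 0.1.
common_mrna = pd.DataFrame(
    [[round(1.2 + 0.1 * (4 * r + c), 1) for c in range(4)] for r in range(len(genes))],
    index=genes, columns=samples,
)

# A set makes each membership test cheap when M_index is long.
wanted = set(M_index)

# Keep the gene lists of pathways that also appear in M_index.
matched = [symbols for name, symbols in pathway_genes.items() if name in wanted]

# Flatten the list of lists into one list of gene symbols.
flat = [gene for sub in matched for gene in sub]

# One label-based selection instead of appending row by row.
Bg = common_mrna.loc[flat]
print(Bg)
```

Note that loc raises a KeyError if any symbol in flat is missing from common_mrna.index, so a matched pathway whose genes are absent from the expression table would need to be filtered first.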

Update

It seems that your pathway_genes is not originally a dictionary but a dataframe. If that's the case, you can look up the geneSymbols column by row label instead.

pathway_genes
Out[46]: 
                                                                      geneSymbols
KEGG_ABC_TRANSPORTERS                                        [ABCA1, ABCA10, ABCA12]
KEGG_ACUTE_MYELOID_LEUKEMIA                                 [AKT1, AKT2, AKT3, ARAF]
KEGG_ADHERENS_JUNCTION                             [ACP1, ACTB, ACTG1, ACTN1, ACTN2]
KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY             [ACACB, ACSL1, ACSL3, ACSL4, ACSL5]
KEGG_ALANINE_ASPARTATE_AND_GLUTAMATE_METABOLISM     [ABAT, ACY3, ADSL, ADSS1, ADSS2]

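# Collect the gene lists of the pathways that also appear in M_index.
matched_index_symbols = [pathway_genes.loc[i, 'geneSymbols'] for i in pathway_genes.index if i in M_index]

# Flatten with a comprehension; np.array(...).ravel() only works when every list has the same length.
flatten_list = [gene for sub in matched_index_symbols for gene in sub]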
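For the dataframe case, a variant without an explicit Python loop might look like the sketch below. It assumes each geneSymbols cell holds a real Python list (not a string), and it uses a small made-up subset of the data; the names mask, flat and present are illustrative only. It also skips symbols that do not occur in common_mrna, which loc would otherwise reject with a KeyError.

```python
import pandas as pd

# Small illustrative inputs (a subset of the question's data).
pathway_genes = pd.DataFrame(
    {'geneSymbols': [
        ['ABCA1', 'ABCA10', 'ABCA12'],
        ['ACP1', 'ACTB'],
        ['ACACB', 'ACSL1'],
    ]},
    index=['KEGG_ABC_TRANSPORTERS', 'KEGG_ADHERENS_JUNCTION',
           'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY'],
)
M_index = ['KEGG_ADHERENS_JUNCTION', 'KEGG_ADIPOCYTOKINE_SIGNALING_PATHWAY',
           'KEGG_PEROXISOME']
common_mrna = pd.DataFrame(
    [[1.2, 1.3], [4.0, 4.1], [4.4, 4.5]],
    index=['ABCA1', 'ACP1', 'ACTB'],
    columns=['TCGA-02-0033-01', 'TCGA-02-2470-01'],
)

# Boolean mask of the pathways that are also listed in M_index.
mask = pathway_genes.index.isin(M_index)

# explode() turns each list cell into one row per gene symbol.
flat = pathway_genes.loc[mask, 'geneSymbols'].explode()

# Drop symbols that have no row in common_mrna (here ACACB and ACSL1).
present = flat[flat.isin(common_mrna.index)]

Bg = common_mrna.loc[present.tolist()]
print(Bg)
```

If a gene belongs to more than one matched pathway it will appear more than once in Bg; calling present.drop_duplicates() before the selection is one way to avoid that.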

edited May 23, 2022 at 0:17

answered May 22, 2022 at 17:06

user16836078

12 Comments

user16836078 Over a year ago

Also, please avoid using enumerate if you are not using the indexes for every loop generated.

melolilili Over a year ago

Is the dictionary pathway_genes?

melolilili Over a year ago

The code fails when I call the index of M df as M.index using symbols = [pathway_genes[i] for i in pathway_genes.keys() if i in list(M.index)]. It returns an empty symbols variable.

user16836078 Over a year ago

Is your M.index the same list as you have provided in your question?

melolilili Over a year ago

Yes. It looks something like this

Index(['KEGG_VASOPRESSIN_REGULATED_WATER_REABSORPTION',
       'KEGG_DRUG_METABOLISM_OTHER_ENZYMES', 'KEGG_PEROXISOME',
       'KEGG_LONG_TERM_POTENTIATION',
       'KEGG_CYSTEINE_AND_METHIONINE_METABOLISM',
       'KEGG_AMINO_SUGAR_AND_NUCLEOTIDE_SUGAR_METABOLISM',
       'KEGG_NOTCH_SIGNALING_PATHWAY', 'KEGG_FATTY_ACID_METABOLISM',
       'KEGG_LONG_TERM_DEPRESSION', 'KEGG_CITRATE_CYCLE_TCA_CYCLE',
      dtype='object', name='NAME', length=134)
