How to run t-test on multiple pandas columns

Question

I want to write a few lines of code that run a t-test, grouped by Product, on Purchase_cost, Warranty_years, and service_cost at the same time.

# dataset 

import pandas as pd
from scipy.stats import ttest_ind

data = {'Product': ['laptop', 'printer','printer','printer','laptop','printer','laptop','laptop','printer','printer'],
        'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,175.89,124.12,113.12,143.33,375.65],
        'Warranty_years':[3,2,2,1,4,1,2,3,1,2],
        'service_cost': [5,5,10,4,7,10,4,6,12,3]
        }

df = pd.DataFrame(data)

print(df)

My code attempt for Product & Purchase_cost is below. I also want to run the t-test for Product & Warranty_years and Product & service_cost.


#define samples
group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']

#perform independent two sample t-test
ttest_ind(group1['Purchase_cost'], group2['Purchase_cost'])

What do you mean by "concurrently" and "efficiently"? Do you want to parallelize the computations? Is that really worth it (do you have very large data and actually run into efficiency issues?), or is it just premature optimization?
– mozway, Commented Nov 29, 2023 at 3:23
@mozway - I want simple code instead of repeating the steps in my code attempt for service_cost and Warranty_years. Ignore parallelization and optimization.
– nasa313, Commented Nov 29, 2023 at 3:38
I see. Then you can use vectorized code or loop over the columns; see the answer below.
– mozway, Commented Nov 29, 2023 at 3:45

mozway · Accepted Answer · 2023-11-29 04:03:17Z

ttest_ind can work on 2D (N-D) inputs. With the default axis=0, each column is tested independently:

cols = df.columns.difference(['Product'])
# or with an explicit list
# cols = ['Purchase_cost', 'Warranty_years', 'service_cost']

group1 = df[df['Product']=='laptop']
group2 = df[df['Product']=='printer']
out = pd.DataFrame(ttest_ind(group1[cols], group2[cols]),
                   columns=cols, index=['statistic', 'pvalue'])

If it couldn't, you could use a dictionary comprehension that loops over your columns:

out = pd.DataFrame({c: ttest_ind(group1[c], group2[c]) for c in cols},
                    index=['statistic', 'pvalue'])

Output:

           Purchase_cost  Warranty_years  service_cost
statistic      -1.861113        3.513240     -0.919464
pvalue          0.099760        0.007924      0.384738
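
As a sanity check, the two approaches above can be compared directly. The following is a minimal, self-contained sketch using the question's data; the names `vectorized` and `looped` are only illustrative, and it reads the `statistic` and `pvalue` attributes of the result explicitly rather than relying on how pandas unpacks the result object.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

df = pd.DataFrame({
    'Product': ['laptop', 'printer', 'printer', 'printer', 'laptop',
                'printer', 'laptop', 'laptop', 'printer', 'printer'],
    'Purchase_cost': [120.09, 150.45, 300.12, 450.11, 200.55,
                      175.89, 124.12, 113.12, 143.33, 375.65],
    'Warranty_years': [3, 2, 2, 1, 4, 1, 2, 3, 1, 2],
    'service_cost': [5, 5, 10, 4, 7, 10, 4, 6, 12, 3],
})

cols = ['Purchase_cost', 'Warranty_years', 'service_cost']
group1 = df.loc[df['Product'] == 'laptop', cols]
group2 = df.loc[df['Product'] == 'printer', cols]

# One call on the 2D inputs: axis=0 (the default) tests each column separately.
res = ttest_ind(group1.to_numpy(), group2.to_numpy(), axis=0)
vectorized = pd.DataFrame([res.statistic, res.pvalue],
                          columns=cols, index=['statistic', 'pvalue'])

# One call per column, for comparison.
looped = {}
for c in cols:
    r = ttest_ind(group1[c], group2[c])
    looped[c] = [r.statistic, r.pvalue]
looped = pd.DataFrame(looped, index=['statistic', 'pvalue'])

# Both routes should give the same numbers.
print(np.allclose(vectorized.to_numpy(), looped.to_numpy()))
print(vectorized)
```

Both calls use the default equal-variance (Student's) t-test, so the results should agree up to floating-point error.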

Generalization to more pairs

If you have more than just laptop/printer as products and want to compare all pairs, you could generalize with:

from itertools import combinations

cols = df.columns.difference(['Product'])

g = df.groupby('Product')[cols]

out = pd.concat({(a,b): pd.DataFrame(ttest_ind(g.get_group(a), g.get_group(b)),
                                     columns=cols, index=['statistic', 'pvalue'])
                 for a, b in combinations(df['Product'].unique(), 2)
                }, names=['product1', 'product2'])

Example output with an extra category (phone):

                             Purchase_cost  Warranty_years  service_cost
product1 product2                                                       
laptop   printer  statistic      -1.861113        3.513240     -0.919464
                  pvalue          0.099760        0.007924      0.384738
         phone    statistic      -1.945836        2.988072      2.766417
                  pvalue          0.109251        0.030515      0.039533
printer  phone    statistic      -1.286968        0.423659      1.893370
                  pvalue          0.239026        0.684528      0.100178

If you have many combinations, you should likely adjust the p-values to account for multiple testing (for example, with a Bonferroni or Holm correction).
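
One simple way to do that is a Bonferroni correction, which multiplies each p-value by the number of tests and caps the result at 1. The sketch below is only an illustration, not part of the original answer: it re-enters the p-values from the example output above by hand, and the names `pvalues`, `adjusted`, and `significant`, as well as the 0.05 threshold, are assumptions.

```python
import pandas as pd

# p-values copied from the example output (3 product pairs x 3 columns).
pvalues = pd.DataFrame(
    {'Purchase_cost': [0.099760, 0.109251, 0.239026],
     'Warranty_years': [0.007924, 0.030515, 0.684528],
     'service_cost': [0.384738, 0.039533, 0.100178]},
    index=pd.MultiIndex.from_tuples(
        [('laptop', 'printer'), ('laptop', 'phone'), ('printer', 'phone')],
        names=['product1', 'product2']),
)

# Bonferroni: scale every p-value by the total number of tests, cap at 1.
n_tests = pvalues.size
adjusted = (pvalues * n_tests).clip(upper=1.0)

# Flag the comparisons that remain below the (assumed) 0.05 threshold.
significant = adjusted < 0.05

print(adjusted)
print(significant)
```

Bonferroni is conservative. Less strict procedures such as Holm or Benjamini-Hochberg exist, and which one fits depends on how the results will be used.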
