
I have a database with a sample as shown below:

[screenshot of the sample data]

A DataFrame is generated when I load the data in Python with the code below:

import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)

Output:

[screenshot of the resulting DataFrame]

Is there any way to avoid reading the duplicate columns in pandas, or to remove the duplicate columns after reading? Please note: the column names change once the data is read into pandas, so a command like df = df.loc[:, ~df.columns.duplicated()] won't work. The actual database is very big and has many duplicate columns containing only dates.
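To see why columns.duplicated() finds nothing here: by default pandas renames repeated headers on read, so the in-memory names are already unique. A minimal sketch with made-up data mimicking the layout in the question:

```python
import io
import pandas as pd

# Hypothetical CSV with a repeated "Date" header, as in the question
data = "Date,Value1,Date,Value2\n2018-01-01,0,2018-01-01,1\n"

df = pd.read_csv(io.StringIO(data))
print(df.columns.tolist())            # ['Date', 'Value1', 'Date.1', 'Value2']
print(df.columns.duplicated().any())  # False -- the mangled names are unique
```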

Comments

  • @scott boston, tried that, but I'm not sure mangle_dupe_cols works the way it should. It gives the error "Setting mangle_dupe_cols=False is not supported yet", and there are several open threads reporting that this option does not work properly. Commented Apr 11, 2018 at 3:59

2 Answers


There are two ways you can do this.

Ignore columns when reading the data

pandas.read_csv has the argument usecols, which accepts a list of column indices (or names).

So you can try:

# first pass: read the file to work out which columns to keep
# (here the duplicates alternate with the value columns)
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))

# second pass: read only those columns
df = pd.read_csv('file.csv', usecols=cols)

Remove columns from dataframe

You can use similar logic with pd.DataFrame.iloc to remove unwanted columns from a DataFrame you have already read.

# cols as defined in previous example

df = df.iloc[:, cols]
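Putting the two-pass usecols approach together end to end, on made-up data in the same alternating Date/value layout as the question:

```python
import io
import pandas as pd

# Hypothetical data mimicking the question's layout
data = """Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

# First pass: read only the header to work out which columns to keep
header = pd.read_csv(io.StringIO(data), nrows=0)
cols = [0] + list(range(1, len(header.columns), 2))  # [0, 1, 3]

# Second pass: read only those columns
df = pd.read_csv(io.StringIO(data), usecols=cols)
print(df.columns.tolist())  # ['Date', 'Value1', 'Value2']
```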



One way to do it could be to read only the first row and create a mask using drop_duplicates(). We pass this to usecols without needing to work out the indices beforehand, so it should be failsafe.

m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)

Full example:

import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)

print(df)

#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1

Another way to do it would be to remove all columns whose name contains a dot, since pandas renames duplicate columns by appending .1, .2, and so on. This should work in most cases, as the dot is rarely used in column names:

df = df.loc[:,~df.columns.str.contains('.', regex=False)]

Full example:

import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

df = pd.read_csv(io.StringIO(data))
df = df.loc[:, ~df.columns.str.contains('.', regex=False)]
print(df)

#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1
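A variant of the same idea that avoids discarding legitimate names containing dots: strip the numeric suffix pandas appends, then drop the columns whose restored name is a repeat. This is a sketch that assumes the default Name.1, Name.2 mangling:

```python
import io
import pandas as pd

data = """Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

df = pd.read_csv(io.StringIO(data))

# Undo pandas' ".1", ".2" renaming, then keep only first occurrences
original = df.columns.str.replace(r'\.\d+$', '', regex=True)
df = df.loc[:, ~original.duplicated()]
print(df.columns.tolist())  # ['Date', 'Value1', 'Value2']
```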

