
I have a database with a sample as shown below:

[screenshot of the sample data]

A DataFrame is generated when I load the data in Python with the code below:

import os
import pandas as pd
data_dir="D:\\userdata\\adbharga\\Desktop\\AVA\\PythonCoding\\VF-Aus\\4G Cell Graphs"
os.chdir(data_dir)
df = pd.read_csv('CA Throughput(Kbit_s) .csv',index_col=None, header=0)

Output:

[screenshot of the resulting DataFrame]

Is there any way to avoid reading the duplicate columns in pandas, or to remove the duplicate columns after reading? Please note: the column names change once the data is read into pandas, so a command like df = df.loc[:, ~df.columns.duplicated()] won't work. The actual database is very big and has many duplicate columns containing only dates.
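To see why columns.duplicated() finds nothing here: by default pandas renames repeated headers on read, so the in-memory names are already unique. A minimal sketch with made-up data mimicking the layout in the question:

```python
import io
import pandas as pd

# Hypothetical CSV with a repeated "Date" header, as in the question
data = "Date,Value1,Date,Value2\n2018-01-01,0,2018-01-01,1\n"

df = pd.read_csv(io.StringIO(data))
print(df.columns.tolist())            # ['Date', 'Value1', 'Date.1', 'Value2']
print(df.columns.duplicated().any())  # False -- the mangled names are unique
```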

Comments

  • @scott boston, tried that, but I'm not sure mangle_dupe_cols works the way it should. It gives the error "Setting mangle_dupe_cols=False is not supported yet", and there are several open threads reporting that this option does not work properly. Commented Apr 11, 2018 at 3:59

2 Answers


There are two ways you can do this.

Ignore columns when reading the data

pandas.read_csv has the argument usecols, which accepts a list of column indices (or names).

So you can try:

# first pass: read the file to work out which columns to keep
# (here the duplicates alternate with the value columns)
df = pd.read_csv('file.csv', header=0)
cols = [0] + list(range(1, len(df.columns), 2))

# second pass: read only those columns
df = pd.read_csv('file.csv', usecols=cols)

Remove columns from dataframe

You can use similar logic with pd.DataFrame.iloc to remove unwanted columns from a DataFrame you have already read.

# cols as defined in previous example

df = df.iloc[:, cols]
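Putting the two-pass usecols approach together end to end, on made-up data in the same alternating Date/value layout as the question:

```python
import io
import pandas as pd

# Hypothetical data mimicking the question's layout
data = """Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

# First pass: read only the header to work out which columns to keep
header = pd.read_csv(io.StringIO(data), nrows=0)
cols = [0] + list(range(1, len(header.columns), 2))  # [0, 1, 3]

# Second pass: read only those columns
df = pd.read_csv(io.StringIO(data), usecols=cols)
print(df.columns.tolist())  # ['Date', 'Value1', 'Value2']
```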



One way to do it could be to read only the first row and create a mask using drop_duplicates(). We pass this to usecols without needing to work out the indices beforehand, so it should be failsafe.

m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)

Full example:

import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

m = pd.read_csv(io.StringIO(data), nrows=1, header=None).T.drop_duplicates().index
df = pd.read_csv(io.StringIO(data), usecols=m)

print(df)

#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1

Another way to do it would be to remove all columns whose name contains a dot, since pandas renames duplicate columns by appending .1, .2, and so on. This should work in most cases, as the dot is rarely used in column names:

df = df.loc[:,~df.columns.str.contains('.', regex=False)]

Full example:

import io
import pandas as pd

data = '''\
Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1'''

df = pd.read_csv(io.StringIO(data))
df = df.loc[:, ~df.columns.str.contains('.', regex=False)]
print(df)

#         Date  Value1  Value2
#0  2018-01-01       0       1
#1  2018-01-02       0       1
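A variant of the same idea that avoids discarding legitimate names containing dots: strip the numeric suffix pandas appends, then drop the columns whose restored name is a repeat. This is a sketch that assumes the default Name.1, Name.2 mangling:

```python
import io
import pandas as pd

data = """Date,Value1,Date,Value2
2018-01-01,0,2018-01-01,1
2018-01-02,0,2018-01-02,1"""

df = pd.read_csv(io.StringIO(data))

# Undo pandas' ".1", ".2" renaming, then keep only first occurrences
original = df.columns.str.replace(r'\.\d+$', '', regex=True)
df = df.loc[:, ~original.duplicated()]
print(df.columns.tolist())  # ['Date', 'Value1', 'Value2']
```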

