Currently, I read the CSV file, set the column names up front, and then drop the columns I don't want. Is there a way to do this directly when reading the file?

# Current Code
import pandas as pd

columns_name = ['station', 'date', 'observation', 'value',
                'other_1', 'other_2', 'other_3', 'other_4']
del_columns_name = ['other_1', 'other_2', 'other_3', 'other_4']
df = pd.read_csv('filename', names=columns_name)
# drop returns a new DataFrame, so assign the result back
df = df.drop(del_columns_name, axis=1)
  • I don't see anything wrong. You could possibly avoid reading those columns in the first place. Otherwise, use df.drop(del_columns_name, axis=1, inplace=True) or df = df.drop(del_columns_name, axis=1). Commented May 11, 2018 at 23:45
  • That's right. But I want to know whether there is a more direct way to do what my 4 lines of code do. Commented May 11, 2018 at 23:51
  • In that case you might as well pass the indexes right away. Commented May 11, 2018 at 23:52
  • Did one of the below solutions help? Feel free to accept one (tick on left), or ask for clarification. Commented May 16, 2018 at 11:43

2 Answers

One way is to use your two lists to resolve the indices and column names required.

Then pass the usecols and names arguments to pd.read_csv to specify the column indices and the column names respectively.

idx, cols = zip(*((i, x) for i, x in enumerate(columns_name)
                  if x not in del_columns_name))

df = pd.read_csv('filename', usecols=idx, names=cols, header=None)

As explained in the docs, you should also specify header=None explicitly when no header exists.

Explanation

  • Use a generator expression to iterate over columns_name and skip the items that appear in del_columns_name.
  • Use enumerate to extract indices.
  • Use zip to create separate tuples for indices and column names.
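To make the steps above concrete, here is a minimal sketch that runs them against in-memory data. The sample rows and the use of io.StringIO in place of a real file are assumptions made only for illustration; the two lists come from the question.

```python
import io

import pandas as pd

columns_name = ['station', 'date', 'observation', 'value',
                'other_1', 'other_2', 'other_3', 'other_4']
del_columns_name = ['other_1', 'other_2', 'other_3', 'other_4']

# Pair each kept column name with its position in the file.
pairs = [(i, x) for i, x in enumerate(columns_name)
         if x not in del_columns_name]

# zip(*pairs) transposes the pairs into one tuple of indices
# and one tuple of names.
idx, cols = zip(*pairs)

# Hypothetical sample data standing in for 'filename' (no header row).
data = "1,2018-01-01,1,1,9,9,9,9\n2,2018-01-02,2,2,9,9,9,9\n"

df = pd.read_csv(io.StringIO(data), usecols=list(idx),
                 names=list(cols), header=None)
print(df)
```

Only the four wanted columns are parsed, so no drop call is needed afterwards.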

2 Comments

I like the expression. It seems a bit overkill for the small example above, but it is almost necessary if you have a more complex system.
@AntonvBR, Yeah, I'm not really sure where the column names come from. They could come from a static config file, for example, in which case you may be forced into something like this.

I think you can even specify the indexes right away. In this case you are interested in [0, 1, 2, 3]. Consider this example, which also parses dates.

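import io

import pandas as pd

cols = ['station', 'date', 'observation', 'value']

data = '''\
1, 2018-01-01, 1, 1, 1, 1, 1, 1
2, 2018-01-02, 2, 2, 2, 2, 2, 2'''

# pd.compat.StringIO is not available in recent pandas versions;
# io.StringIO from the standard library does the same job
file = io.StringIO(data)
df = pd.read_csv(file, names=cols, usecols=[0,1,2,3], parse_dates=[1])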

print(df)

Returns:

   station       date  observation  value
0        1 2018-01-01            1      1
1        2 2018-01-02            2      2
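If you would rather not work with indices at all, a variation that should also work is to pass the full list of names and select the wanted columns by name through usecols. This is a sketch under assumptions: the sample rows and io.StringIO stand in for the real file, and the keep list is a helper name introduced here, not part of the question.

```python
import io

import pandas as pd

columns_name = ['station', 'date', 'observation', 'value',
                'other_1', 'other_2', 'other_3', 'other_4']
del_columns_name = ['other_1', 'other_2', 'other_3', 'other_4']

# Names of the columns to keep, in file order.
keep = [c for c in columns_name if c not in del_columns_name]

# Hypothetical sample data standing in for 'filename' (no header row).
data = "1,2018-01-01,1,1,9,9,9,9\n2,2018-01-02,2,2,9,9,9,9\n"

# names labels every column in the file; usecols then selects by label.
df = pd.read_csv(io.StringIO(data), names=columns_name,
                 usecols=keep, header=None)
print(df)
```

Here names must describe every column in the file, because usecols matches its strings against those labels.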
