3

I'm working with a CSV file in which the header row is repeated several times, as in this example:

1                     2     3   4
0            POSITION_T  PROB  ID  
1                 2.385   2.0   1  
2            POSITION_T  PROB  ID 
3                 3.074   6.0   3  
4                 6.731   8.0   4    
6            POSITION_T  PROB  ID  
7                12.508   2.0   1  
8                12.932   4.0   2  
9                12.985   4.0   2  

I want to remove the duplicated headers to get a final document like this:

0            POSITION_T  PROB  ID  
1                 2.385   2.0   1   
3                 3.074   6.0   3  
4                 6.731   8.0   4     
7                12.508   2.0   1  
8                12.932   4.0   2  
9                12.985   4.0   2  

I am trying to remove these headers with:

df1 = [df!='POSITION_T'][df!='PROB'][df!='ID']

But that produces the error TypeError: Could not compare ['TRACK_ID'] with block values. Any ideas? Thanks in advance!

1
  • What does the actual text file look like? Commented Sep 1, 2017 at 16:10

4 Answers

4

Filtering out by field value:

import pandas as pd

# skip the first line (the column-number row) and split fields on whitespace
df = pd.read_table('yourfile.csv', header=None, sep=r'\s+', skiprows=1)
df.columns = ['0','POSITION_T','PROB','ID']
del df['0']

# filtering out the rows with `POSITION_T` value in corresponding column
df = df[~df.POSITION_T.str.contains('POSITION_T')]

print(df)

The output:

  POSITION_T PROB ID
1      2.385  2.0  1
3      3.074  6.0  3
4      6.731  8.0  4
6     12.508  2.0  1
7     12.932  4.0  2
8     12.985  4.0  2
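
Note that after this kind of filtering the columns still hold strings, because the header text forced pandas to read them as object dtype. Below is a minimal, self-contained sketch of one way to finish the job. The inline `raw` sample is an assumption that mimics the layout shown in the question; it uses an exact comparison instead of `str.contains` and then converts the remaining values with `pd.to_numeric`.

```python
import io

import pandas as pd

# Assumed sample that mimics the layout shown in the question
raw = """\
1 2 3 4
0 POSITION_T PROB ID
1 2.385 2.0 1
2 POSITION_T PROB ID
3 3.074 6.0 3
4 6.731 8.0 4
6 POSITION_T PROB ID
7 12.508 2.0 1
8 12.932 4.0 2
9 12.985 4.0 2
"""

# Skip the column-number line and name the columns ourselves
df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    sep=r"\s+",
    skiprows=1,
    names=["0", "POSITION_T", "PROB", "ID"],
)
df = df.drop(columns="0")

# Keep only the rows that are not a repeated header
df = df[df["POSITION_T"] != "POSITION_T"]

# The surviving cells are numeric strings, so convert them column by column
df = df.apply(pd.to_numeric).reset_index(drop=True)

print(df.dtypes)
```

The exact comparison avoids accidentally dropping rows whose value merely contains the header text, which matters more for free-text columns than for numeric ones.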

2 Comments

I have a similar problem here stackoverflow.com/q/68705981/16421119
Would really appreciate your help!
3

To keep the bottom level column names only:

df.columns=[multicols[-1] for multicols in df.columns]
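
This one-liner applies when the headers were parsed into a multi-level (MultiIndex) column index rather than left as data rows. A small sketch of that case follows; the level names `track` and `meta` and the sample values are made up for illustration. It also shows `get_level_values`, which should give the same result.

```python
import pandas as pd

# Hypothetical two-level header; "track" and "meta" are invented top-level names
columns = pd.MultiIndex.from_tuples(
    [("track", "POSITION_T"), ("track", "PROB"), ("meta", "ID")]
)
df = pd.DataFrame([[2.385, 2.0, 1], [3.074, 6.0, 3]], columns=columns)

# Each column label is a tuple; keep only its last element
bottom = df.copy()
bottom.columns = [multicols[-1] for multicols in bottom.columns]

# Alternative: ask the MultiIndex for its last level directly
same = df.copy()
same.columns = df.columns.get_level_values(-1)

print(list(bottom.columns))
```

Using index `0` instead of `-1` keeps the top level instead.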

1 Comment

This worked for me (though in my use case, I was taking only the top level)
1

This is not ideal! The best way to deal with this would be to handle it in the file parsing.

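import pandas as pd

# assumes `df` was read without a header row, so the header lines appear as data
mask = df.iloc[:, 0] == 'POSITION_T'  # True on every repeated header row
d1 = df[~mask].copy()
d1.columns = df.loc[mask.idxmax()].values  # the first header row supplies the names

# every remaining value is a numeric string, so the conversion is safe
d1 = d1.apply(pd.to_numeric)
d1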

   POSITION_T  PROB  ID
1                      
1       2.385   2.0   1
3       3.074   6.0   3
4       6.731   8.0   4
7      12.508   2.0   1
8      12.932   4.0   2
9      12.985   4.0   2
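
As for handling it in the file parsing, one possible approach is to drop the repeated header lines from the text before pandas sees it. The sketch below is only an illustration under assumptions: the helper name `read_without_repeated_headers` is hypothetical, and the sample is simplified (no leading index column or column-number line).

```python
import io

import pandas as pd

# Simplified, assumed sample: the header line repeats between data blocks
raw = """\
POSITION_T  PROB  ID
2.385   2.0   1
POSITION_T  PROB  ID
3.074   6.0   3
6.731   8.0   4
POSITION_T  PROB  ID
12.508   2.0   1
12.932   4.0   2
12.985   4.0   2
"""


def read_without_repeated_headers(handle, **read_csv_kwargs):
    """Hypothetical helper: keep the first header line, drop later copies."""
    lines = handle.read().splitlines()
    header = lines[0]
    # Compare token lists so differences in spacing do not matter
    kept = [header] + [ln for ln in lines[1:] if ln.split() != header.split()]
    return pd.read_csv(io.StringIO("\n".join(kept)), **read_csv_kwargs)


clean = read_without_repeated_headers(io.StringIO(raw), sep=r"\s+")
print(clean.dtypes)
```

Because the header text never reaches the parser as data, the columns come out numeric without a separate conversion step. With a real file, an open file object can be passed in place of the `io.StringIO` wrapper.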

1 Comment

Hi @pirsquared, I have a similar issue here. stackoverflow.com/q/68705981/16421119 Would really appreciate your help!
0
import pandas as pd

past_data = pd.read_csv("book.csv")

# drop every row whose LAT cell holds the repeated header text 'LAT'
past_data = past_data[~past_data.LAT.astype(str).str.contains('LAT')]

print(past_data)

  1. Replace the CSV file name (here: book.csv) with your own.
  2. Replace the variable name (here: past_data) with your own.
  3. Replace every LAT with the name of any one of your columns.
  4. That's all; the repeated headers will be removed.
