convert list to dataframe in python

Question

I have a text file with a column header and data. I am trying to load this file's data into a pandas DataFrame.

File:

#Columns: TargetDoc|GRank|LRank|Priority|Loc ID
aaaaa|1|1|Slow|8gkahinka.01
aaaaa|1|0|Slow|7nlafnjbaflnbja.01

I wrote the code below. First I split each line into a list, and then I try to convert that list to a DataFrame:

import os
import pandas as pd

with open("DocID101_201604070523.txt") as raw_file:
    full_file_text = raw_file.readlines()

raw_file.close()

data_list = list()
for l in full_file_text:
    if l.startswith('#'):
        labels = l.strip().replace('#Columns: ','').split('|')
    else:
        data_list += l.strip().split('|')

df = pd.DataFrame.from_records(data_list,columns=labels)

But I got an error when creating df:

AssertionError: 5 columns passed, passed data had 10 columns.

What's wrong with my code, or is there a better way to convert this to a DataFrame?

Why not just pd.read_csv('file.txt', sep='|')?

– pbreach, Commented Dec 21, 2016 at 15:39
After getting rid of #Columns: .

– pbreach, Commented Dec 21, 2016 at 15:40

EdChum · Accepted Answer · 2016-12-21 15:45:51Z

3

You can just read in the file using read_csv with sep='|' and then fix the first column name as a post-processing step using rename:

In [228]:
import io
import pandas as pd    
t="""#Columns: TargetDoc|GRank|LRank|Priority|Loc ID
aaaaa|1|1|Slow|8gkahinka.01
aaaaa|1|0|Slow|7nlafnjbaflnbja.01"""
df = pd.read_csv(io.StringIO(t), sep='|')
df

Out[228]:
  #Columns: TargetDoc  GRank  LRank Priority              Loc ID
0               aaaaa      1      1     Slow        8gkahinka.01
1               aaaaa      1      0     Slow  7nlafnjbaflnbja.01

Now rename the first column: pass rename a dict whose key is the current first column name and whose value is the last whitespace-separated token of that name:

In [229]:
df.rename(columns={df.columns[0]:df.columns[0].split()[-1]}, inplace=True)
df

Out[229]:
  TargetDoc  GRank  LRank Priority              Loc ID
0     aaaaa      1      1     Slow        8gkahinka.01
1     aaaaa      1      0     Slow  7nlafnjbaflnbja.01

So in your case:

df = pd.read_csv("DocID101_201604070523.txt", sep='|')

and then rename the first column as shown above.
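
The two steps can be combined into one small helper. This is only a sketch: the name read_pipe_table is hypothetical, and it assumes the header line always begins with the literal prefix '#Columns: '. An in-memory buffer stands in for the real file here, but a file path can be passed the same way.

```python
import io
import pandas as pd

def read_pipe_table(source):
    """Read a '|'-separated table whose header line starts with '#Columns: '."""
    df = pd.read_csv(source, sep='|')
    first = df.columns[0]
    # Strip the '#Columns: ' prefix from the first column name only.
    return df.rename(columns={first: first.replace('#Columns: ', '', 1)})

# Sample data copied from the question; a path such as
# "DocID101_201604070523.txt" could be passed instead of the buffer.
sample = io.StringIO(
    "#Columns: TargetDoc|GRank|LRank|Priority|Loc ID\n"
    "aaaaa|1|1|Slow|8gkahinka.01\n"
    "aaaaa|1|0|Slow|7nlafnjbaflnbja.01\n"
)
df = read_pipe_table(sample)
```

Using replace on the prefix instead of split()[-1] keeps the helper working if the first column name itself contains a space, which is an assumption about what real files might look like.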

edited Dec 21, 2016 at 15:45

answered Dec 21, 2016 at 15:39

EdChum


3 Comments

brianpck Over a year ago

Exactly. The point is that this is structured data and does not need to be treated like a text file.

user2517214 Over a year ago

Hi, in practice the data is not structured, and I am using the code below:

EdChum Over a year ago

Well, nothing in your question stated this.

iFlo · Accepted Answer · 2016-12-21 15:42:17Z

1

That's because you are concatenating all rows into one flat list with:

data_list += l.strip().split('|')

What you want is:

data_list.append(l.strip().split('|'))

This way, you will get a list of lists, each with 5 elements.

Edit: However, the solution above, using read_csv with sep='|', is highly recommended.
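
For reference, here is a sketch of the question's loop with append in place of +=. The io.StringIO buffer is an assumption that stands in for the opened file so the example is self-contained; the sample rows are the ones from the question.

```python
import io
import pandas as pd

# Stand-in for: open("DocID101_201604070523.txt")
raw_file = io.StringIO(
    "#Columns: TargetDoc|GRank|LRank|Priority|Loc ID\n"
    "aaaaa|1|1|Slow|8gkahinka.01\n"
    "aaaaa|1|0|Slow|7nlafnjbaflnbja.01\n"
)

labels = None
data_list = []
for line in raw_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    if line.startswith('#'):
        labels = line.replace('#Columns: ', '').split('|')
    else:
        # append keeps each row as its own 5-element list;
        # += would flatten every field into one long list.
        data_list.append(line.split('|'))

df = pd.DataFrame.from_records(data_list, columns=labels)
```

Note that every value built this way is a string, so numeric columns such as GRank would still need converting, for example with pd.to_numeric.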

answered Dec 21, 2016 at 15:42

iFlo


2 Comments

user2517214 Over a year ago

In practice, the file may also contain a lot of junk data that I need to clean before I get this structure. So I can't use read_csv directly; I would need to create another temporary file in order to use it.

iFlo Over a year ago

So that's your solution ;)
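
A temporary file is not strictly needed for the cleaning step described in the comment above: the cleaned lines can be joined in memory and handed to read_csv through io.StringIO. The sketch below is hedged: the helper name load_clean is hypothetical, and "junk" is assumed to mean any line that does not have the same number of '|'-separated fields as the header, which may not match the real data.

```python
import io
import pandas as pd

def load_clean(lines, sep='|', header_prefix='#Columns: '):
    """Keep only well-formed rows, then parse them with read_csv in memory."""
    labels = None
    kept = []
    for line in lines:
        line = line.strip()
        if line.startswith(header_prefix):
            labels = line[len(header_prefix):].split(sep)
        elif labels is not None and line.count(sep) == len(labels) - 1:
            kept.append(line)  # row has the expected number of fields
    if labels is None:
        raise ValueError("no header line found")
    # The buffer behaves like a file, so no temp file is written to disk.
    buffer = io.StringIO("\n".join(kept))
    return pd.read_csv(buffer, sep=sep, names=labels, header=None)

# Hypothetical messy input: junk before the header, a blank line, a short row.
raw_lines = [
    "random junk",
    "#Columns: TargetDoc|GRank|LRank|Priority|Loc ID",
    "aaaaa|1|1|Slow|8gkahinka.01",
    "",
    "not|a|row",
    "aaaaa|1|0|Slow|7nlafnjbaflnbja.01",
]
df = load_clean(raw_lines)
```

Because read_csv does the final parsing, numeric columns get numeric dtypes without a separate conversion step.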
