I would like to convert 'bytes' data into a Pandas dataframe.

The data looks like this (first few lines):

    (b'#Settlement Date,Settlement Period,CCGT,OIL,COAL,NUCLEAR,WIND,PS,NPSHYD,OCGT'
     b',OTHER,INTFR,INTIRL,INTNED,INTEW,BIOMASS\n2017-01-01,1,7727,0,3815,7404,3'
     b'923,0,944,0,2123,948,296,856,238,\n2017-01-01,2,8338,0,3815,7403,3658,16,'
     b'909,0,2124,998,298,874,288,\n2017-01-01,3,7927,0,3801,7408,3925,0,864,0,2'
     b'122,998,298,816,286,\n2017-01-01,4,6996,0,3803,7407,4393,0,863,0,2122,998'

The column headers appear at the top. Each subsequent line is a timestamp followed by numbers.

Is there a straightforward way to do this?

Thank you very much

@Paula Livingstone:

This seems to work:

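import pandas as pd

s = str(bytes_data, 'utf-8')

# Use a context manager so the file is flushed and closed before it is read back
with open("data.txt", "w") as file:
    file.write(s)

df = pd.read_csv('data.txt')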

Maybe this can be done without an intermediate file.

3 Answers

64

You can also use BytesIO directly:

from io import BytesIO

import pandas as pd

df = pd.read_csv(BytesIO(bytes_data))

This saves you the step of converting bytes_data to a string.
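As a minimal, self-contained sketch of the BytesIO approach, the snippet below uses a two-row sample modeled on the data in the question. The column rename and the date conversion are optional extras that are assumptions on my part, not something the question asked for.

```python
from io import BytesIO

import pandas as pd

# Sample payload modeled on the question's data (two complete rows only).
bytes_data = (
    b"#Settlement Date,Settlement Period,CCGT,OIL,COAL,NUCLEAR,WIND,PS,NPSHYD,OCGT"
    b",OTHER,INTFR,INTIRL,INTNED,INTEW,BIOMASS\n"
    b"2017-01-01,1,7727,0,3815,7404,3923,0,944,0,2123,948,296,856,238,\n"
    b"2017-01-01,2,8338,0,3815,7403,3658,16,909,0,2124,998,298,874,288,\n"
)

# read_csv accepts any file-like object, so no intermediate file is needed.
df = pd.read_csv(BytesIO(bytes_data))

# Optional cleanup (assumption): drop the leading '#' from the first header
# and parse the dates.
df = df.rename(columns={"#Settlement Date": "Settlement Date"})
df["Settlement Date"] = pd.to_datetime(df["Settlement Date"])

print(df.head())
```

Because every data row ends with a trailing comma, the last column (BIOMASS) comes out empty in this sample.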

62

I had the same issue and found the StringIO class (https://docs.python.org/2/library/stringio.html; in Python 3 it lives in the io module) from the answer here: How to create a Pandas DataFrame from a string

Try something like:

from io import StringIO

import pandas as pd

s = str(bytes_data, 'utf-8')
data = StringIO(s)
df = pd.read_csv(data)

1 Comment

If you are getting the bytes from the subprocess module, this may work better:

s = subprocess.check_output(['docker', 'images'])
s1 = str(s, 'utf-8')
data = pd.read_fwf(StringIO(s1))
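To illustrate the read_fwf suggestion without needing Docker installed, here is a small sketch. The bytes below are a made-up, fixed-width sample standing in for command output (the column names and values are hypothetical), since subprocess.check_output returns bytes in the same way.

```python
from io import StringIO

import pandas as pd

# Hypothetical fixed-width output standing in for what a command-line tool
# might return via subprocess.check_output (which yields bytes).
raw = (
    b"REPOSITORY   TAG      SIZE\n"
    b"ubuntu       22.04    77MB\n"
    b"alpine       latest   7MB\n"
)

# Decode the bytes, then let read_fwf infer the column boundaries
# from the whitespace gaps.
text = raw.decode("utf-8")
images = pd.read_fwf(StringIO(text))

print(images)
```

read_fwf suits whitespace-aligned output like this, whereas read_csv is the better fit for comma-separated data such as the data in the question.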
1

OK, your input formatting is quite awkward, but the following works:

import pandas as pd

with open('file.txt', 'r') as myfile:
    data = myfile.read().replace('\n', '')  # read the file in as a single string

df = pd.Series(" ".join(data.strip(' b\'').strip('\'').split('\' b\'')).split('\\n')).str.split(',', expand=True)

print(df)

This produces the following:

                 0                  1     2    3     4        5      6   7   \
0  #Settlement Date  Settlement Period  CCGT  OIL  COAL  NUCLEAR   WIND  PS   
1        2017-01-01                  1  7727    0  3815     7404   3923   0   
2        2017-01-01                  2  8338    0  3815     7403   3658  16   
3        2017-01-01                  3  7927    0  3801     7408   3925   0   

       8      9      10     11      12      13     14       15  
0  NPSHYD  OCGT   OTHER  INTFR  INTIRL  INTNED  INTEW  BIOMASS  
1     944      0   2123    948     296     856    238           
2     909      0   2124    998     298     874    288           
3     864      0   2122    998     298     816    286     None 

In order for this to work you will need to ensure that your input file contains only a collection of complete rows. For this reason I removed the partial row for the purposes of the test.
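If the bytes arrive with a truncated final row, one possible way to keep only complete rows is to cut the payload at the last newline before parsing. This is only a sketch: drop_partial_last_row is a hypothetical helper name, and it assumes every complete row ends with a newline.

```python
from io import BytesIO

import pandas as pd


def drop_partial_last_row(raw: bytes) -> bytes:
    """Return raw cut off after the last newline (hypothetical helper).

    Assumes every complete row ends with a newline; if no newline is
    found, the input is returned unchanged.
    """
    end = raw.rfind(b"\n")
    return raw[: end + 1] if end != -1 else raw


# Small made-up payload whose last row is incomplete.
raw = b"a,b,c\n1,2,3\n4,5,6\n7,8"

trimmed = drop_partial_last_row(raw)
df = pd.read_csv(BytesIO(trimmed))

print(df)
```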

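As you have said that the data source is an HTTP GET request, the initial read could take place using pandas.read_html, although note that read_html parses HTML tables; for a plain CSV payload like the one shown, pandas.read_csv (which also accepts a URL or a file-like object) is the more direct fit.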

More detail on this can be found here. Note specifically the section on io (io : str or file-like).

2 Comments

Thank you. My input is not from a file though. I created the file as an intermediate step but I would like to avoid using a file at all.
It is queried via an API with an HTTP request, and I get it in the bytes format shown in the question.
