3

I have a very large pandas df that I am writing out to CSV. I need to add a second header row containing the data types. The code below works but produces an unexpected, empty third row in the CSV:

#! /usr/bin/env python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))

# get count of header columns, add REAL for each one
types_header_for_insert = list(df.columns.values)
for idx, val in enumerate(types_header_for_insert):
    types_header_for_insert[idx] = 'REAL'

# count number of index columns, then add STRING for each one
index_count = len(df.index.names)
for idx in range(0, index_count):
    df.reset_index(level=0, inplace=True)
    types_header_for_insert.insert(0, 'STRING')

# insert the new types column
df.columns = pd.MultiIndex.from_tuples(zip(df.columns, types_header_for_insert))

print df.columns.values

df.to_csv("./test.csv", index=False)

Output:

index,A,B
STRING,REAL,REAL
,,
0,1,2
1,3,4

How can I get rid of this extra blank row? Where does it come from?

3 Answers

3

In the end I used a workaround: (a) write the original headers to the CSV, then (b) replace the headers with the second header line and append the whole df to the same file:

# write the header to the file only
pd.DataFrame(data=[df.columns]).to_csv("outfile.csv", header=False, index=False)

# now replace header
types_header_for_insert = list(df.columns.values)
for idx, val in enumerate(df.columns.values):
    if df[val].dtype == 'float64':
        types_header_for_insert[idx] = 'REAL'

    elif df[val].dtype == 'int64':
        types_header_for_insert[idx] = 'INTEGER'

    else:
        types_header_for_insert[idx] = 'STRING'

df.columns = types_header_for_insert

# append the whole df with new header
df.to_csv("outfile.csv", mode="a", float_format='%.3f', index=False)
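
The same two-step idea can be sketched with a single open file handle, so the DataFrame's own columns are never renamed. This is only a sketch under assumptions: the helper names `dtype_label` and `write_with_type_row` are made up for illustration, the REAL/INTEGER/STRING mapping follows the snippet above, and it assumes a pandas version recent enough to provide `pd.api.types`.

```python
import csv
import io

import pandas as pd


def dtype_label(dtype):
    """Map a pandas dtype to the label used in the second header row."""
    if pd.api.types.is_float_dtype(dtype):
        return "REAL"
    if pd.api.types.is_integer_dtype(dtype):
        return "INTEGER"
    return "STRING"


def write_with_type_row(df, handle):
    """Write the column names, then the type labels, then the data rows."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(list(df.columns))
    writer.writerow([dtype_label(df[col].dtype) for col in df.columns])
    # header=False: both header rows were already written by hand above
    df.to_csv(handle, header=False, index=False)


df = pd.DataFrame([[1, 2.5], [3, 4.0]], columns=list("AB"))
buf = io.StringIO()  # stand-in for open("outfile.csv", "w", newline="")
write_with_type_row(df, buf)
print(buf.getvalue())
```

Because the data is written straight from the original df, nothing is copied or sorted, which may matter for a very large table.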

2

I think this is a bug; see the open issue 6618.

A small trick may help: add types_header_for_insert to the data as a row before the first row:

#! /usr/bin/env python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))

# get count of header columns, add REAL for each one
types_header_for_insert = list(df.columns.values)
for idx, val in enumerate(types_header_for_insert):
    types_header_for_insert[idx] = 'REAL'

# count number of index columns, then add STRING for each one
index_count = len(df.index.names)
for idx in range(0, index_count):
    df.reset_index(level=0, inplace=True)
    types_header_for_insert.insert(0, 'STRING')

# insert the new types column
#df.columns = pd.MultiIndex.from_tuples(zip(df.columns, types_header_for_insert))

#set new value to dataframe
df.loc[-1]  = types_header_for_insert

#sort index 
df = df.sort_index()
print df
#     index     A     B
#-1  STRING  REAL  REAL
# 0       0     1     2
# 1       1     3     4

print df.to_csv(index=False)
#index,A,B
#STRING,REAL,REAL
#0,1,2
#1,3,4

EDIT

For a large df you can use append instead:

#empty df with columns from df
df1 = pd.DataFrame(columns = df.columns)
#create series from types_header_for_insert
s = pd.Series(types_header_for_insert, index=df.columns)
print s
#index    STRING
#A          REAL
#B          REAL
#dtype: object

df1 = df1.append(s, ignore_index=True).append(df, ignore_index=True)
print df1
#    index     A     B
#0  STRING  REAL  REAL
#1       0     1     2
#2       1     3     4

print df1.to_csv(index=False)
#index,A,B
#STRING,REAL,REAL
#0,1,2
#1,3,4
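
Note that DataFrame.append was removed in pandas 2.0, so the snippet above only runs on older versions. A rough equivalent with pd.concat might look like the sketch below; it is written for Python 3, and the names `types_row` and `combined` are illustrative, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB")).reset_index()
types_header_for_insert = ["STRING", "REAL", "REAL"]

# one-row frame holding the type labels, with the same columns as df
types_row = pd.DataFrame([types_header_for_insert], columns=df.columns)

# cast the data to object so the label row and the numeric rows can share columns
combined = pd.concat([types_row, df.astype(object)], ignore_index=True)

csv_text = combined.to_csv(index=False)
print(csv_text)
```

This puts the label row first without any sort step, but it still copies the whole frame, so memory use on a very large df is something to check.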

1 Comment

Yes, this works, but the sort operation is not efficient on a large table with a more complex multikey index (it takes 30 minutes to sort my dataframe). In this case it may be more efficient to create a new dataframe with a single row and the same headers and then merge, instead of appending and sorting.
0

In Python 3, MultiIndex.from_tuples() fails with "object of type 'zip' has no len()". However, wrapping the zip in list() works and produces no blank row. Consider trying it in Python 2:

df.columns = pd.MultiIndex.from_tuples(list(zip(df.columns, types_header_for_insert)))

print df.columns.values

df.to_csv("./test.csv", index=False)

#   index    A    B
#  STRING REAL REAL
#       0    1    2
#       1    3    4

Alternatively, to avoid zip altogether, build the tuples with a list comprehension:

data = [df.columns, types_header_for_insert]
newcolumns = [tuple(i[j] for i in data) for j in range(min(len(l) for l in data))]
df.columns = pd.MultiIndex.from_tuples(newcolumns)

print df.columns.values

df.to_csv("./test.csv", index=False)

#   index    A    B
#  STRING REAL REAL
#       0    1    2
#       1    3    4

2 Comments

The first approach with list(zip()) still gives me the blank line in pandas 0.16.1; for various reasons, I'm not able to update at this point. @jezrael points to a known bug as the cause: issue 6618.
No luck with the second approach either: avoiding zip still gives the empty third line ",," as in my first code snippet. Which pandas version was this with?
