
I have a pandas.DataFrame that I wish to export to a CSV file. However, pandas seems to write some of the values as float instead of int. I couldn't find how to change this behavior.

Building a data frame:

import pandas
df = pandas.DataFrame(columns=['a','b','c','d'], index=['x','y','z'], dtype=int)
x = pandas.Series([10,10,10], index=['a','b','d'], dtype=int)
y = pandas.Series([1,5,2,3], index=['a','b','c','d'], dtype=int)
z = pandas.Series([1,2,3,4], index=['a','b','c','d'], dtype=int)
df.loc['x']=x; df.loc['y']=y; df.loc['z']=z

View it:

>>> df
    a   b    c   d
x  10  10  NaN  10
y   1   5    2   3
z   1   2    3   4

Export it:

>>> df.to_csv('test.csv', sep='\t', na_rep='0', dtype=int)
>>> for l in open('test.csv'): print l.strip('\n')
        a       b       c       d
x       10.0    10.0    0       10.0
y       1       5       2       3
z       1       2       3       4

Why do the tens have a dot zero?

Sure, I could just stick this function into my pipeline to reconvert the whole CSV file, but it seems unnecessary:

def lines_as_integer(path):
    handle = open(path)
    yield handle.next()
    for line in handle:
        line = line.split()
        label = line[0]
        values = map(float, line[1:])
        values = map(int, values)
        yield label + '\t' + '\t'.join(map(str,values)) + '\n'
handle = open(path_table_int, 'w')
handle.writelines(lines_as_integer(path_table_float))
handle.close()
Comments

  • you should import pandas as pd :) Commented Jun 13, 2013 at 16:52
  • @Andy Why should I do that? Namespaces are a great idea... until you abbreviate them all and it becomes unreadable. Commented Sep 22, 2015 at 15:08
  • @AndyHayden Longer to type, but definitely easier to read. To a novice stumbling on the code, pd signifies Police Department. Or worse if he speaks French. Commented Nov 18, 2015 at 16:31
  • It's just a convention - use it or don't - it depends on who your audience is likely to be. For many pandas users the convention is to use pd, just as in the UK the convention is to drive on the left. It's not a problem until you have to share the same stretch of road. Commented Nov 1, 2016 at 17:11
  • I don't think that analogy is adequate, because driving on the left is incompatible with driving on the right. However, using the full package name works fine for a veteran who knows about the abbreviation standard, while the opposite is not true (a novice is baffled by pd). Commented Feb 24, 2019 at 15:51

9 Answers


The answer I was looking for was a slight variation of what @Jeff proposed in his answer. The credit goes to him. This is what solved my problem in the end for reference:

import pandas
df = pandas.DataFrame(data, columns=['a','b','c','d'], index=['x','y','z'])
df = df.fillna(0)
df = df.astype(int)
df.to_csv('test.csv', sep='\t')

4 Comments

  • This gets around having any floats, but you lose the NaN info. Perhaps fill NA with -9999 or some value that you know is not 'real' in your data set (a sketch of this follows these comments).
  • You may refer to my answer below to preserve NaN.
  • How to do that only for one column? My df has mixed types, strings and numbers.
  • If your data are natural numbers (nonnegative integers), using df.fillna(-1) is an option.
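
A minimal sketch of the sentinel idea from the comments above; -9999 is an assumed placeholder, so pick any value you know cannot occur in your real data:

import numpy as np
import pandas as pd

# Toy frame with a missing value in column 'c', as in the question.
df = pd.DataFrame({'a': [10, 1], 'c': [np.nan, 2]}, index=['x', 'y'])

# Fill NaN with an out-of-range sentinel before the cast, so missing entries
# stay identifiable in the exported file.
df.fillna(-9999).astype(int).to_csv('test.csv', sep='\t')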

This is a "gotcha" in pandas (Support for integer NA), where integer columns with NaNs are converted to floats.

This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues to be “numeric”. One possibility is to use dtype=object arrays instead.
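
A short, hedged illustration of both the upcast and the dtype=object workaround (not part of the original answer):

import numpy as np
import pandas as pd

s = pd.Series([10, 3, 7], index=['x', 'y', 'z'], dtype=int)
print(s.dtype)                                  # int64

# Introducing a missing label forces NaN in, and the column is upcast to float.
print(s.reindex(['x', 'y', 'z', 'w']).dtype)    # float64

# With object dtype the existing values stay Python ints next to the NaN,
# at the cost of speed and memory.
print(s.astype(object).reindex(['x', 'y', 'z', 'w']).dtype)   # object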

2 Comments

  • So no way to get them as integers without reparsing the whole file? How about if I use df.fillna()?
  • Use dtype=object (rather than int) when creating x and df.

The problem is that you are assigning by rows, but dtypes are grouped by column, so everything gets cast to object dtype, which is not a good thing: you lose all efficiency. So one way is to convert, which will coerce to float/int dtype as needed.

As we answered in another question, if you construct the frame all at once (or construct it column by column), this step is not needed; see the sketch after the session below.

In [23]: def convert(x):
   ....:     try:
   ....:         return x.astype(int)
   ....:     except:
   ....:         return x
   ....:     

In [24]: df.apply(convert)
Out[24]: 
    a   b   c   d
x  10  10 NaN  10
y   1   5   2   3
z   1   2   3   4

In [25]: df.apply(convert).dtypes
Out[25]: 
a      int64
b      int64
c    float64
d      int64
dtype: object

In [26]: df.apply(convert).to_csv('test.csv')

In [27]: !cat test.csv
,a,b,c,d
x,10,10,,10
y,1,5,2.0,3
z,1,2,3.0,4
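
For reference, here is a sketch of the all-at-once construction mentioned above, using the values from the question; only 'c', which genuinely contains a NaN, comes out as float64:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'a': [10, 1, 1], 'b': [10, 5, 2], 'c': [np.nan, 2, 3], 'd': [10, 3, 4]},
    index=['x', 'y', 'z'],
)
print(df.dtypes)                              # a, b, d -> int64; c -> float64
df.to_csv('test.csv', sep='\t', na_rep='0')   # the int columns are written without '.0'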

11 Comments

  • But then there are .0s in the c column... :s
  • Because it's a float! No choice there (well, you CAN pass float_format='%.0f' to to_csv, but that could lead to loss of precision).
  • But, but... if you use dtype=object (e.g. in x and df via the OP's construction, which I agree is not the best way), then the 2, 3 and 10s are all ints... it's almost always not worth worrying about anyway. This seems just like the transpose of the OP's effort :s
  • Yep... keep stressing that having object dtype for numbers is bad... maybe we should put in a PerformanceWarning if that occurs (e.g. like in this case)...
  • If they have gone out of their way to choose dtype=object, though, surely they deserve what they get (if they don't, they'd get a float). A better solution would be for numpy to support NaNs in integer arrays... ;)

The simplest solution is to use float_format in to_csv():

df.to_csv('test.csv', sep='\t', na_rep=0, float_format='%.0f')

But this applies to all float columns. BTW: Using your code on pandas 1.1.5, all of my columns are float.

Output:

    a   b   c   d
x   10  10  0   10
y   1   5   2   3
z   1   2   3   4

Without float_format:

    a   b   c   d
x   10.0    10.0    0    10.0
y    1.0     5.0    2.0   3.0
z    1.0     2.0    3.0   4.0
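
A possible variant, not from the answer: '%g' also drops the trailing '.0' on whole numbers while leaving genuine fractional values intact (it formats to six significant digits by default and switches to scientific notation for very large values):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10.0, 1.0], 'c': [np.nan, 2.5]}, index=['x', 'y'])
print(df.to_csv(sep='\t', na_rep='0', float_format='%g'))   # 10 and 1, not 10.0 and 1.0; 2.5 stays 2.5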

1 Comment

  • This is by far the best and most precise answer; it does exactly what was asked for in the question. Should get more upvotes. Solved my (same) problem, thanks!

If you want to preserve the NaN info in the CSV you export, do the following. P.S.: I'm concentrating on column 'c' in this case.

df['c'] = df['c'].fillna('')       # fill NaN with an empty string
df['c'] = df['c'].astype(str)      # convert the column to string
>>> df
    a   b    c     d
x  10  10         10
y   1   5    2.0   3
z   1   2    3.0   4

df['c'] = df['c'].str.split('.')   # split the float string into a list on '.'
>>> df
    a   b    c          d
x  10  10   ['']       10
y   1   5   ['2','0']   3
z   1   2   ['3','0']   4

df['c'] = df['c'].str[0]           # select the 1st element from the list
>>> df
    a   b    c   d
x  10  10       10
y   1   5    2   3
z   1   2    3   4

Now, if you export the DataFrame to CSV, column 'c' will not have float values and the NaN info is preserved.
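
A different route, not used in this answer: pandas' nullable integer dtype (the capital-I 'Int64', available since pandas 0.24) keeps the column as integers and still round-trips the missing value:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 1, 1], 'b': [10, 5, 2],
                   'c': [np.nan, 2, 3], 'd': [10, 3, 4]},
                  index=['x', 'y', 'z'])

df['c'] = df['c'].astype('Int64')   # nullable extension dtype: values stay ints, NaN becomes <NA>
df.to_csv('test.csv', sep='\t')     # 'c' is written as an empty field, 2 and 3, with no '.0'

On read-back, the column can be restored with read_csv(..., dtype={'c': 'Int64'}).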

1 Comment

  • This solution is nice, but it assumes you know which column has missing data, which is rarely the case (a short sketch for finding such columns follows).
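
In response to the comment above, a quick hedged sketch for locating the columns that actually contain missing values before deciding where to apply the conversion:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 1], 'c': [np.nan, 2], 'd': [10, 3]}, index=['x', 'y'])
print(list(df.columns[df.isna().any()]))   # ['c'], so only 'c' needs special handling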

You can use astype() to specify the data type for each column.

For example:

import pandas
df = pandas.DataFrame(data, columns=['a','b','c','d'], index=['x','y','z'])

df = df.astype({"a": int, "b": complex, "c" : float, "d" : int})

Comments


Just write it out as string to csv:

df.to_csv('test.csv', sep='\t', na_rep='0', dtype=str)

2 Comments

  • It does not work at all: TypeError: to_csv() got an unexpected keyword argument 'dtype'.
  • If it doesn't work, use astype() to convert the data type first.

You can convert your DataFrame to a NumPy array as a workaround:

import numpy as np

np.savetxt(savepath, np.array(df).astype(int), fmt='%i', delimiter=',',
           header='PassengerId,Survived', comments='')

Comments


Here is yet another solution:

df['IntColumnWithNAValues'] = df['IntColumnWithNAValues'].fillna(0)   # fill with a value that is outside your real range
df['IntColumnWithNAValues'] = df['IntColumnWithNAValues'].astype(int)
df['IntColumnWithNAValues'] = df['IntColumnWithNAValues'].replace(0, '')

CSV files don't differentiate between NA and '' (an empty string), since CSV is a plain-text format, so you keep your missing fields while converting the non-null values to int.

You can do this for every column that you want; if you have lots of columns it might be tedious (see the loop sketch below).
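
A hedged sketch of applying the same three steps to several columns at once; int_cols is an assumed list of the integer-like columns in your frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [10, 1], 'c': [np.nan, 2], 'd': [10, 3]}, index=['x', 'y'])

int_cols = ['a', 'c', 'd']          # assumed: the columns that should be written as ints
for col in int_cols:
    # as in the answer, 0 is assumed to be outside the real value range
    df[col] = df[col].fillna(0).astype(int).replace(0, '')

df.to_csv('test.csv', sep='\t')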

Comments
