Convert columns to string in Pandas

Question

I have the following DataFrame from a SQL query:

(Pdb) pp total_rows
     ColumnID  RespondentCount
0          -1                2
1  3030096843                1
2  3030096845                1

and I pivot it like this:

total_data = total_rows.pivot_table(cols=['ColumnID'])

which produces

(Pdb) pp total_data
ColumnID         -1            3030096843   3030096845
RespondentCount            2            1            1

[1 rows x 3 columns]

When I convert this dataframe into a dictionary (using total_data.to_dict('records')[0]), I get

{3030096843: 1, 3030096845: 1, -1: 2}

but I want to make sure the 303 columns are cast as strings instead of integers so that I get this:

{'3030096843': 1, '3030096845': 1, -1: 2}

From pandas 1.0, the documentation recommends using astype("string") instead of astype(str) for some pretty good reasons, take a look.
– cs95, Commented Jul 19, 2020 at 10:19

Andy Hayden · Accepted Answer · 2014-02-25 06:38:55Z

629

One way to convert to string is to use astype:

total_rows['ColumnID'] = total_rows['ColumnID'].astype(str)
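
Applied to the frame in the question, a minimal sketch could look like the following. The data is retyped by hand from the question, and `columns=` is used in place of the older `cols=` keyword. Note that this turns every ColumnID into a string, including -1:

```python
import pandas as pd

# Data retyped from the question.
total_rows = pd.DataFrame({
    "ColumnID": [-1, 3030096843, 3030096845],
    "RespondentCount": [2, 1, 1],
})

# Cast the ID column to strings *before* pivoting, so the pivoted
# column labels (and hence the dictionary keys) are strings.
total_rows["ColumnID"] = total_rows["ColumnID"].astype(str)
total_data = total_rows.pivot_table(columns=["ColumnID"])

record = total_data.to_dict("records")[0]
print(record)
```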

However, perhaps you are looking for the to_json function, which will convert keys to valid json (and therefore your keys to strings):

In [11]: df = pd.DataFrame([['A', 2], ['A', 4], ['B', 6]])

In [12]: df.to_json()
Out[12]: '{"0":{"0":"A","1":"A","2":"B"},"1":{"0":2,"1":4,"2":6}}'

In [13]: df[0].to_json()
Out[13]: '{"0":"A","1":"A","2":"B"}'

Note: you can pass in a buffer/file to save this to, along with some other options...
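
As a rough illustration of the buffer option, the sketch below writes the JSON into an in-memory `io.StringIO` (a stand-in for a real file) and parses it back with the standard `json` module to confirm that the keys come out as strings:

```python
import io
import json

import pandas as pd

df = pd.DataFrame([["A", 2], ["A", 4], ["B", 6]])

# Write the JSON into an in-memory text buffer instead of returning a string.
buffer = io.StringIO()
df.to_json(buffer)

# Parsing it back with the standard library shows every key is a string.
parsed = json.loads(buffer.getvalue())
print(parsed)
```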

answered Feb 25, 2014 at 6:38


7 Comments

Keith Over a year ago

I think to_string() is preferable due to the preservation of NULLs stackoverflow.com/a/44008334/3647167

3pitt Over a year ago

@Keith null preservation is attractive. but the doc says its purpose is to 'Render a DataFrame to a console-friendly tabular output'. i'd like someone authoritative to weigh in

Sussch Over a year ago

to_json() probably does not call astype(str) as it leaves datetime64 and its subclasses as milliseconds since epoch.

Andy Hayden Over a year ago

@Sussch I suspect that's because json doesn't have an explicit datetime format, so you're kinda forced to use epoch. Which is to say, I think that's the standard.

rocksteady Over a year ago

@webNoob13: this is desired/intended behaviour - those are Pandas strings, essentially. See here: stackoverflow.com/questions/34881079/…


Mike · Accepted Answer · 2018-11-15 13:53:29Z

157

If you need to convert ALL columns to strings, you can simply use:

df = df.astype(str)

This is useful if you need everything except a few columns to be strings/objects, then go back and convert the other ones to whatever you need (integer in this case):

df[["D", "E"]] = df[["D", "E"]].astype(int)
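
A possible variation on the second step is to pass `astype` a per-column mapping. The sketch below uses a made-up frame (only the column names D and E come from the answer). Because `astype` returns a new frame that is rebound to `df`, nothing is assigned into a slice here:

```python
import pandas as pd

# Hypothetical frame: column names match the answer, values are made up.
df = pd.DataFrame({"A": [1.5, 2.5], "D": [1, 2], "E": [3, 4]})

# Step 1: everything becomes a string.
df = df.astype(str)

# Step 2: convert selected columns back, using a per-column mapping.
df = df.astype({"D": int, "E": int})

print(df.dtypes)
```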

answered Nov 15, 2018 at 13:53


2 Comments

abhijat_saxena Over a year ago

I would prefer your answer - because the OP asked for 'all' columns, not individual columns.

taiyodayo Over a year ago

Doesn't this throw SettingWithCopyWarning: in latest pandas?

wjandrea · Accepted Answer · 2023-08-04 00:15:30Z

103

pandas >= 1.0: It's time to stop using `astype(str)`!

Prior to pandas 1.0 (well, 0.25 actually) this was the de facto way of declaring a Series/column as a string:

# pandas <= 0.25
# Note to pedants: specifying the type is unnecessary since pandas will 
# automagically infer the type as object
s = pd.Series(['a', 'b', 'c'], dtype=str)
s.dtype
# dtype('O')

From pandas 1.0 onwards, consider using "string" type instead.

# pandas >= 1.0
s = pd.Series(['a', 'b', 'c'], dtype="string")
s.dtype
# StringDtype

Here's why, as quoted by the docs:

You can accidentally store a mixture of strings and non-strings in an object dtype array. It’s better to have a dedicated dtype.

object dtype breaks dtype-specific operations like DataFrame.select_dtypes(). There isn’t a clear way to select just text while excluding non-text but still object-dtype columns.

When reading code, the contents of an object dtype array is less clear than 'string'.

See also the section on Behavioral Differences between "string" and object.
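
One of those behavioral differences concerns missing values. In the small sketch below (made-up data), `astype("string")` keeps the missing entry missing. On pandas 1.x and 2.x, `astype(str)` instead turns it into the literal text 'nan'; treat that second point as version-dependent and check the printed result on your installation:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, np.nan])  # made-up data with one missing value

as_string = s.astype("string")  # dedicated string dtype
as_str = s.astype(str)          # result depends on the pandas version

print(as_string.isna().tolist())  # the missing value is still missing
print(as_str.tolist())            # inspect what your version produces
```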

Extension types (introduced in 0.24 and formalized in 1.0) are closer to pandas than to NumPy, which is good because NumPy types are not powerful enough. For example, NumPy has no way of representing missing values in integer data (since type(NaN) == float), but pandas can, using nullable integer columns.
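
To make the nullable-integer point concrete, here is a small sketch with made-up values. Without an explicit dtype the gap forces the column to float64; with the "Int64" extension dtype the column stays integer and the gap is stored as a missing value:

```python
import pandas as pd

values = [1, None, 3]  # made-up data with a gap

as_float = pd.Series(values)                        # falls back to float64 with NaN
as_nullable_int = pd.Series(values, dtype="Int64")  # stays integer, gap is missing

print(as_float.dtype)
print(as_nullable_int.dtype)
```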

Why should I stop using it?

Accidentally mixing dtypes

The first reason, as outlined in the docs, is that you can accidentally store non-text data in object columns.

# pandas <= 0.25
pd.Series(['a', 'b', 1.23])   # whoops, this should have been "1.23"

0       a
1       b
2    1.23
dtype: object

pd.Series(['a', 'b', 1.23]).tolist()
# ['a', 'b', 1.23]   # oops, pandas was storing this as float all the time.

# pandas >= 1.0
pd.Series(['a', 'b', 1.23], dtype="string")

0       a
1       b
2    1.23
dtype: string

pd.Series(['a', 'b', 1.23], dtype="string").tolist()
# ['a', 'b', '1.23']   # it's a string and we just averted some potentially nasty bugs.

Challenging to differentiate strings and other python objects

Another obvious example is that it's harder to distinguish between "strings" and "objects". Objects are essentially the blanket type for any type that does not support vectorizable operations.

Consider,

# Setup
df = pd.DataFrame({'A': ['a', 'b', 'c'], 'B': [{}, [1, 2, 3], 123]})
df
 
   A          B
0  a         {}
1  b  [1, 2, 3]
2  c        123

Up to pandas 0.25, there was virtually no way to distinguish that "A" and "B" do not have the same type of data.

# pandas <= 0.25  
df.dtypes

A    object
B    object
dtype: object

df.select_dtypes(object)

   A          B
0  a         {}
1  b  [1, 2, 3]
2  c        123

From pandas 1.0, this becomes a lot simpler:

# pandas >= 1.0
# Convenience function I call to help illustrate my point.
df = df.convert_dtypes()
df.dtypes

A    string
B    object
dtype: object

df.select_dtypes("string")

   A
0  a
1  b
2  c

Readability

This is self-explanatory ;-)

OK, so should I stop using it right now?

...No. As of writing this answer (version 1.1), there are no performance benefits but the docs expect future enhancements to significantly improve performance and reduce memory usage for "string" columns as opposed to objects. With that said, however, it's never too early to form good habits!

edited Aug 4, 2023 at 0:15 by wjandrea

answered Jul 19, 2020 at 10:10 by cs95

5 Comments

Nages Over a year ago

This works if source is a,b,c and fails if source is 1,2,3 etc.

cs95 Over a year ago

@Nages I hope so, it generally doesn't make sense to represent numeric data as text.

Nages Over a year ago

That is right. But sometimes it does happen, for example in the Kaggle Titanic competition, where Pclass is represented as 1, 2 and 3. There it should be categorical (a string-like format) instead of numeric. In that case str helped where "string" did not. Anyway, thanks, it works for characters, and thanks for sharing the documentation details.

wojciech Over a year ago

As of version 1.4.3, Pandas "string" dtype is still considered experimental.

w. Patrick Gale Over a year ago

df.astype(str) is useful if you are trying to fill missing data (empty cells) in your dataFrame. For instance, if you are trying to query data within a dataFrame column that contains empty cells, the queries might not act as expected since the column data will likely be treated as an object type. If you are trying to use string comparison operations on an object, then be prepared for the unexpected. See - pandas.pydata.org/docs/user_guide/missing_data.html

Surya Chhetri · Accepted Answer · 2017-08-23 21:32:24Z

33

Here's another approach, particularly useful for converting multiple columns to strings instead of just a single column:

In [76]: import numpy as np
In [77]: import pandas as pd
In [78]: df = pd.DataFrame({
    ...:     'A': [20, 30.0, np.nan],
    ...:     'B': ["a45a", "a3", "b1"],
    ...:     'C': [10, 5, np.nan]})
    ...: 

In [79]: df.dtypes ## Current datatype
Out[79]: 
A    float64
B     object
C    float64
dtype: object

## Multiple columns string conversion
In [80]: df[["A", "C"]] = df[["A", "C"]].astype(str) 

In [81]: df.dtypes ## Updated datatype after string conversion
Out[81]: 
A    object
B    object
C    object
dtype: object
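
If the columns to convert are not known by name, one possible variation (not part of the answer above) picks them by dtype with `select_dtypes` and converts them in one go. The sketch uses `astype("string")` so that the missing values stay missing; swapping in `astype(str)` should reproduce the object columns shown above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [20, 30.0, np.nan],
    "B": ["a45a", "a3", "b1"],
    "C": [10, 5, np.nan],
})

# Pick every numeric column by dtype rather than by name.
numeric_cols = df.select_dtypes("number").columns.tolist()

# Convert just those columns; "B" is left untouched.
df[numeric_cols] = df[numeric_cols].astype("string")

print(df.dtypes)
```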

answered Aug 23, 2017 at 21:32


Govinda · Accepted Answer · 2021-07-30 19:19:15Z

26

There are four ways to convert columns to string

1. astype(str)
df['column_name'] = df['column_name'].astype(str)

2. values.astype(str)
df['column_name'] = df['column_name'].values.astype(str)

3. map(str)
df['column_name'] = df['column_name'].map(str)

4. apply(str)
df['column_name'] = df['column_name'].apply(str)

Let's look at the performance of each method:

#importing libraries
import numpy as np
import pandas as pd
import time

#creating four sample dataframes using dummy data
df1 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df2 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df3 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])
df4 = pd.DataFrame(np.random.randint(1, 1000, size =(10000000, 1)), columns =['A'])

#applying astype(str)
time1 = time.time()
df1['A'] = df1['A'].astype(str)
print('time taken for astype(str) : ' + str(time.time()-time1) + ' seconds')

#applying values.astype(str)
time2 = time.time()
df2['A'] = df2['A'].values.astype(str)
print('time taken for values.astype(str) : ' + str(time.time()-time2) + ' seconds')

#applying map(str)
time3 = time.time()
df3['A'] = df3['A'].map(str)
print('time taken for map(str) : ' + str(time.time()-time3) + ' seconds')

#applying apply(str)
time4 = time.time()
df4['A'] = df4['A'].apply(str)
print('time taken for apply(str) : ' + str(time.time()-time4) + ' seconds')

Output

time taken for astype(str): 5.472359895706177 seconds
time taken for values.astype(str): 6.5844292640686035 seconds
time taken for map(str): 2.3686647415161133 seconds
time taken for apply(str): 2.39758563041687 seconds

map(str) and apply(str) take less time than the other two techniques.
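
Single time.time() measurements like the ones above are noisy, as the comments below point out. A hedged sketch of the same comparison using the standard-library `timeit` module follows. It uses a smaller, seeded input so that it runs quickly, and the absolute numbers will differ by machine and pandas version:

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the 10,000,000 rows above so the sketch finishes quickly.
s = pd.Series(np.random.default_rng(0).integers(1, 1000, size=100_000))

candidates = {
    "astype(str)": lambda: s.astype(str),
    "values.astype(str)": lambda: s.values.astype(str),
    "map(str)": lambda: s.map(str),
    "apply(str)": lambda: s.apply(str),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats, each repeat averaging 3 calls.
    timings[label] = min(timeit.repeat(func, number=3, repeat=3)) / 3
    print(f"{label}: {timings[label]:.4f} seconds per call")
```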

edited Jul 30, 2021 at 19:19

answered Jul 30, 2021 at 18:17


4 Comments

tdy Over a year ago

your results are suspicious. .astype(str) should definitely be fastest. use %timeit to get more reliable results (gives you the average over many trials). %timeit gives me 654ms for .astype(str), 1.4s for .values.astype(str), 2.11s for .map(str), and 1.74s for .apply(str).

cottontail Over a year ago

Even though these tests use wall time (time.time()), which isn't precise and shouldn't be used to test performance, it turns out timeit test agrees with these results.

tdy Over a year ago

Huh, I wonder if I had hastily timed a smaller Series. Anyway it's good that you took the time to time it rigorously, plus introducing the faster map(repr).

ChaimG Over a year ago

You can add df4['A'] = np.array2string(df4['A'].to_numpy()). This tested 8x as fast as the fastest answer!

Feng Bao · Accepted Answer · 2020-11-13 22:43:19Z

8

I usually use this one:

df['Column'].map(str)

answered Nov 13, 2020 at 22:43


Mike William Dopp · Accepted Answer · 2023-02-24 12:12:38Z

5

Currently I do it like this:

df_pg['store_id'] = df_pg['store_id'].astype('string')

answered Feb 24, 2023 at 12:12


cottontail · Accepted Answer · 2023-02-22 20:49:29Z

1. `.map(repr)` is very fast

If you want to convert values to strings in a column, consider .map(repr). For multiple columns, consider .applymap(str).

df['col_as_str'] = df['col'].map(repr)

# multiple columns
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(str)
# or
df[['col1', 'col2']] = df[['col1', 'col2']].apply(lambda col: col.map(repr))

In fact, a timeit test shows that map(repr) is 3 times faster than astype(str) (and is faster than any other method mentioned on this page). Even for multiple columns, this runtime difference still holds. The following is the runtime plot of various methods mentioned here.

astype(str) has very little overhead but for larger frames/columns, map/applymap outperforms it.
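
One caveat worth checking before swapping str for repr: the two agree for integers but not for values that are already strings, because repr adds quotes. A small sketch with made-up data:

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
texts = pd.Series(["a", "b"])

# For an integer column, repr and str produce the same text.
print(ints.map(repr).tolist())
print(ints.map(str).tolist())

# For a column that already holds strings, repr wraps each value in quotes.
print(texts.map(repr).tolist())
```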

2. Don't convert to strings in the first place

There's very little reason to convert a numeric column into strings given pandas string methods are not optimized and often get outperformed by vanilla Python string methods. If not numeric, there are dedicated methods for those dtypes. For example, datetime columns should be converted to strings using pd.Series.dt.strftime().
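
For the datetime case mentioned above, a minimal sketch with made-up dates and an assumed "%Y-%m-%d" format:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2020-07-19", "2023-08-04"]))  # made-up dates

# Format explicitly instead of relying on a generic string cast.
formatted = dates.dt.strftime("%Y-%m-%d")
print(formatted.tolist())
```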

One way numeric->string seems to be used is in a machine learning context where a numeric column needs to be treated as categorical. In that case, instead of converting to strings, consider other dedicated methods such as pd.get_dummies or sklearn.preprocessing.LabelEncoder or sklearn.preprocessing.OneHotEncoder to process your data instead.
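
As a rough sketch of that idea, take an integer class column like the Pclass example raised in the comments on another answer (values made up here). `astype("category")` is an addition of this sketch, not something named above; `pd.get_dummies` is the method the paragraph mentions:

```python
import pandas as pd

# Made-up passenger classes, stored as integers.
df = pd.DataFrame({"Pclass": [1, 3, 2, 3]})

# Option 1: keep one column but mark it as categorical.
as_category = df["Pclass"].astype("category")

# Option 2: one indicator column per class.
dummies = pd.get_dummies(df["Pclass"], prefix="Pclass")

print(as_category.cat.categories.tolist())
print(dummies.columns.tolist())
```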

3. Use `rename` to convert column names to specific types

The specific question in the OP is about converting column names to strings, which can be done with the rename method:

df = total_rows.pivot_table(columns=['ColumnID'])
df.rename(columns=str).to_dict('records')
# [{'-1': 2, '3030096843': 1, '3030096845': 1}]
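
rename also accepts an arbitrary function, so a sketch that reproduces the exact dictionary asked for in the question (string keys for the large IDs, -1 left as an integer) might look like this. The helper name `stringify_positive` is hypothetical and the data is retyped from the question:

```python
import pandas as pd

# Data retyped from the question.
total_rows = pd.DataFrame({
    "ColumnID": [-1, 3030096843, 3030096845],
    "RespondentCount": [2, 1, 1],
})


def stringify_positive(label):
    """Hypothetical helper: turn positive IDs into strings, keep the rest."""
    return str(label) if label > 0 else label


df = total_rows.pivot_table(columns=["ColumnID"])
record = df.rename(columns=stringify_positive).to_dict("records")[0]
print(record)
```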

The code used to produce the above plots:

import numpy as np
import pandas as pd
from perfplot import plot
plot(
    setup=lambda n: pd.Series(np.random.default_rng().integers(0, 100, size=n)),
    kernels=[lambda s: s.astype(str), lambda s: s.astype("string"), lambda s: s.apply(str), lambda s: s.map(str), lambda s: s.map(repr)],
    labels= ['col.astype(str)', 'col.astype("string")', 'col.apply(str)', 'col.map(str)', 'col.map(repr)'],
    n_range=[2**k for k in range(4, 22)],
    xlabel='Number of rows',
    title='Converting a single column into string dtype',
    equality_check=lambda x,y: np.all(x.eq(y)));
plot(
    setup=lambda n: pd.DataFrame(np.random.default_rng().integers(0, 100, size=(n, 100))),
    kernels=[lambda df: df.astype(str), lambda df: df.astype("string"), lambda df: df.applymap(str), lambda df: df.apply(lambda col: col.map(repr))],
    labels= ['df.astype(str)', 'df.astype("string")', 'df.applymap(str)', 'df.apply(lambda col: col.map(repr))'],
    n_range=[2**k for k in range(4, 18)],
    xlabel='Number of rows in dataframe',
    title='Converting every column of a 100-column dataframe to string dtype',
    equality_check=lambda x,y: np.all(x.eq(y)));

Interesting! I've never seen Series.map(repr), but the timing makes sense: Why is repr(int) faster than str(int)?

biraj silwal · Accepted Answer · 2022-11-03 02:44:45Z

2

pandas version: 1.3.5

Updated answer

df['colname'] = df['colname'].astype(str) => this should work by default. But if you have created a variable named str (for example, str = "myString") before calling astype(str), it won't work, because the name no longer refers to the built-in type. In that case, use the line below instead.

df['colname'] = df['colname'].astype('str')
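
A small sketch of that pitfall, with made-up data. The shadowing is confined to a hypothetical helper function so the built-in `str` stays usable elsewhere. The exact exception type is an assumption, so both TypeError and ValueError are caught:

```python
import pandas as pd

df = pd.DataFrame({"colname": [1, 2, 3]})  # made-up data


def convert_with_shadowed_name(frame):
    # Deliberately bad practice: this local variable hides the built-in str.
    str = "myString"
    try:
        frame["colname"].astype(str)  # pandas receives "myString", not a type
        failed = False
    except (TypeError, ValueError):
        failed = True
    # The string alias does not depend on the name `str` at all.
    return failed, frame["colname"].astype("str")


failed, converted = convert_with_shadowed_name(df)
print(failed, converted.tolist())
```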

===========

(Note: incorrect old explanation)

df['colname'] = df['colname'].astype('str') => converts dataframe column into a string type

df['colname'] = df['colname'].astype(str) => gives an error

edited Nov 3, 2022 at 2:44

answered Oct 29, 2022 at 20:07


3 Comments

DaveFar Over a year ago

I'm using pandas version 1.4.0 and do not get an error for astype(str)

biraj silwal Over a year ago

You're right. it worked for me as well. I tried to do astype(str) right after reading the file and it worked. I guess previously it gave me an error because I tried to use astype(str) after other operations.

biraj silwal Over a year ago

Okay, I found out why astype(str) didn't work for me before. it's because I created str = "value" variable before using astype(str).

dbouz · Accepted Answer · 2020-02-04 10:19:19Z

0

Using .apply() with a lambda conversion function also works in this case:

total_rows['ColumnID'] = total_rows['ColumnID'].apply(lambda x: str(x))

For entire dataframes you can use .applymap(). (but in any case probably .astype() is faster)
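
A sketch of the whole-frame variant with made-up data. One thing to flag: `DataFrame.applymap` was deprecated in pandas 2.1 in favour of `DataFrame.map`, so the code below picks whichever of the two exists on the installed version:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})  # made-up data

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
elementwise = df.map if hasattr(df, "map") else df.applymap
via_function = elementwise(lambda x: str(x))

# The astype equivalent, for comparison.
via_astype = df.astype(str)

print(via_function.values.tolist())
```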

answered Feb 4, 2020 at 10:19
