Difference between df.reindex() and df.set_index() methods in pandas

Question

I was confused by the difference between these two methods. It is very simple, but I didn't immediately find the answer on Stack Overflow:

df.set_index('xcol') makes the column 'xcol' become the index (when it is a column of df).
df.reindex(myList), however, takes indexes from outside the dataframe, for example, from a list named myList that we defined somewhere else.

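However, df.reindex(myList) aligns on the existing index labels, so every label in myList that is not already in df.index produces a row of NaN. If you only want to overwrite the labels, a simple alternative is: df.index = myList (myList must have exactly one entry per row).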
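To make the contrast concrete, here is a minimal sketch. The frame, the column names and the label list my_list are made up for illustration and are not from the question.

```python
import pandas as pd

# Hypothetical data; only the default index 0, 1, 2 matters here
df = pd.DataFrame({'xcol': [10, 20, 30], 'y': [1, 2, 3]})
my_list = [2, 0, 5]  # hypothetical labels; 5 is not in df.index

# reindex looks each label up in the existing index:
# labels 2 and 0 are found, label 5 has no match and becomes a NaN row
reindexed = df.reindex(my_list)
print(reindexed)

# Assigning to .index only relabels the rows; no values move or go missing
relabeled = df.copy()
relabeled.index = my_list
print(relabeled)
```

Note that the two results share the same index labels but not the same rows: reindex selected rows by label, while the assignment kept the rows in place and renamed them.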

I hope this post clarifies it! Additions to this post are also welcome!

Ben.T · Accepted Answer · 2018-06-07 14:13:06Z

25

You can see the difference on a simple example. Let's consider this dataframe:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df)
   a  b
0  1  3
1  2  4

The index labels are then 0 and 1.

If you use set_index with the column 'a' then the indexes are 1 and 2. If you do df.set_index('a').loc[1,'b'], you will get 3.

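Now if you use reindex with the same labels 1 and 2, as in df.reindex([1,2]), you will get 4.0 when you do df.reindex([1,2]).loc[1,'b']. Label 1 still refers to the second row of the original df, and the value is a float because label 2 does not exist in the original index, so its row is filled with NaN.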

What happened is that set_index replaced the previous index labels (0, 1) with (1, 2) (the values from column 'a') without touching the order of the values in column 'b':

df.set_index('a')
   b
a   
1  3
2  4

while reindex changes the index but keeps each value in column 'b' attached to the label it had in the original df:

df.reindex(df.a.values).drop(columns='a')  # equivalent to df.reindex([1, 2]).drop(columns='a')
     b
1  4.0
2  NaN
# drop(columns='a') only removes column 'a', which is not of interest in this example

In short, reindex changes the order of the index labels without changing the row values associated with each label (labels that did not exist get NaN), while set_index replaces the index with the values of a column without touching the order of the other values in the dataframe.
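The same comparison as a runnable sketch, using the dataframe from this answer. The last call, with labels [1, 0], is an extra case added here for illustration: when every requested label already exists, reindex only reorders the rows.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# set_index: the values of column 'a' become the labels; rows stay in place
by_a = df.set_index('a')
print(by_a.loc[1, 'b'])            # the row where a == 1

# reindex: label 1 is looked up in the old index, label 2 does not exist
picked = df.reindex([1, 2])
print(picked.loc[1, 'b'])          # the row that was originally labeled 1
print(picked.loc[2].isna().all())  # label 2 gives an all-NaN row

# When all labels exist, reindex simply reorders the rows
print(df.reindex([1, 0]))
```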


2 Comments

prosti Over a year ago

Great explanation!

ntjess Over a year ago

Just a brief usage comment: pandas recommends using at rather than loc for single-cell access, e.g. df.at[1, 'b']. loc is generally meant for selecting ranges or groups of rows and columns.

prosti · Accepted Answer · 2019-05-15 12:26:01Z

8

Just to add: the way to undo set_index is (more or less) the reset_index method:

df = pd.DataFrame({'a': [1, 2],'b': [3, 4]})
print (df)

df.set_index('a', inplace=True)
print(df)

df.reset_index(inplace=True, drop=False)
print(df)
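The "more or less" can be illustrated with a short sketch. It reuses the same dataframe; the variable names round_trip, moved and dropped are only for this example.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# Indexing by the first column and resetting gives back the same columns and values
round_trip = df.set_index('a').reset_index()
print(round_trip)

# Indexing by a later column: reset_index puts it back as the first column,
# so the column order differs from the original
moved = df.set_index('b').reset_index()
print(list(moved.columns))

# drop=True discards the index instead of restoring it as a column
dropped = df.set_index('a').reset_index(drop=True)
print(list(dropped.columns))
```

So the round trip restores the data, but not necessarily the original column order, and with drop=True the former index column is lost.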


Long · Accepted Answer · 2019-08-16 08:18:37Z

4

Besides the great answer from Ben.T, I would like to give one more example of how reindex and set_index differ when both are given the same shuffled copy of the index:

import pandas as pd
import numpy as np
testdf = pd.DataFrame({'a': [1, 3, 2],'b': [3, 5, 4],'c': [5, 7, 6]})

print(testdf)
print(testdf.set_index(np.random.permutation(testdf.index)))
print(testdf.reindex(np.random.permutation(testdf.index)))

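The output varies from run to run because the permutation is random, but the pattern is always the same:

With set_index, the index labels (the first column of the printout) are shuffled while the rows keep their original order, so each row ends up with a new label.
With reindex, the rows are reordered to follow the shuffled index, so each row keeps its original label.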
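For a reproducible version of the same comparison, here is a sketch with a fixed permutation instead of a random one. The list new_labels is a hypothetical stand-in for np.random.permutation(testdf.index); it is wrapped in pd.Index because set_index would treat a plain list as column names.

```python
import pandas as pd

testdf = pd.DataFrame({'a': [1, 3, 2], 'b': [3, 5, 4], 'c': [5, 7, 6]})
new_labels = [2, 0, 1]  # fixed, hypothetical "shuffle" of the index 0, 1, 2

# set_index relabels: row order is unchanged, only the labels differ
relabeled = testdf.set_index(pd.Index(new_labels))
print(relabeled)

# reindex reorders: rows follow the new label order, each row keeps its label
reordered = testdf.reindex(new_labels)
print(reordered)
```

Both results show the index 2, 0, 1, but label 2 points at the first original row in relabeled and at the last original row in reordered.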
