
Coming from an R background, I find the (very heavy) use of Index objects in pandas a little disconcerting. For example, if train is a pandas DataFrame, is there some special reason why train.columns should return an Index rather than a list? What additional purpose does it serve as an Index object? According to the definition of pandas.Index, it is the basic object storing axis labels for all pandas objects. While train.index.values does return the row labels (axis=0), how can I get the column labels or column names from a pandas.Index? Unlike in an earlier question, in this question I have a specific example in mind.

  • Possible duplicate of What is the point of indexing in pandas? Commented Sep 14, 2017 at 14:17
  • The link above has some good info about why all elements of the index being hashable matters. Commented Sep 14, 2017 at 14:18
  • Thanks. It does. I am going through it. Commented Sep 14, 2017 at 14:22

2 Answers

4

A pd.Index is an array-like container of the column names, so in some sense it doesn't make sense to ask how to get the labels from the index, because the index is the labels.

That said, you can always get the underlying numpy array with df.columns.values, or convert to a python list with tolist() as @Mitch showed.
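As a minimal sketch of those conversions (the three-column frame is made up for illustration, and to_numpy() is an alternative accessor that newer pandas versions provide alongside .values):

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame, used only for illustration.
df = pd.DataFrame(np.zeros((2, 3)), columns=['a', 'b', 'c'])

labels_array = df.columns.values      # underlying array of labels
labels_numpy = df.columns.to_numpy()  # labels as a NumPy ndarray
labels_list = df.columns.tolist()     # labels as a plain Python list

print(type(df.columns).__name__)
print(labels_list)
```

All three hold the same labels; which one to use mostly depends on whether the calling code expects an array or a list.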

As for why an Index is used instead of a bare array: an Index provides extra functionality and performance that pandas relies on throughout, the core of which is hash-table-based lookup.

By example, consider the following frame / columns.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 10),
                  columns=list('abcdefghkm'))

cols = df.columns

cols
Out[16]: Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'm'], dtype='object')

Now say you want to select column 'h' out of the frame. With a list or array version of the columns, you would have to loop over the columns to find the position of 'h', which is O(n) in the number of columns. Something like this:

for i, col in enumerate(cols):
    if col == 'h':   
        found_loc = i
        break

found_loc
Out[18]: 7

df.values[:, found_loc]
Out[19]: 
array([-0.62916208,  2.04403495,  0.29498066,  1.07939374, -1.49619915,
       -0.54592646, -1.04382192, -0.45934113, -1.02935858,  1.62439231])

df['h']
Out[20]: 
0   -0.629162
1    2.044035
2    0.294981
3    1.079394
4   -1.496199
5   -0.545926
6   -1.043822
7   -0.459341
8   -1.029359
9    1.624392
Name: h, dtype: float64

With the Index, pandas builds a hash table of the column labels, so finding the location of 'h' is an O(1) operation on average. That is generally much faster than a linear scan, especially when the frame has many columns.

df.columns.get_loc('h')
Out[21]: 7
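
The same label-to-position lookup is available for several labels at once. This is a small sketch using a standalone Index with the same labels as above; the missing label 'z' is an assumption added to show what happens when a label is absent:

```python
import pandas as pd

cols = pd.Index(list('abcdefghkm'))

# Single label: position of 'h' within the Index.
loc_h = cols.get_loc('h')

# Several labels at once; labels that are absent come back as -1.
locs = cols.get_indexer(['h', 'a', 'z'])

# Membership tests work on labels as well.
has_k = 'k' in cols
has_z = 'z' in cols

print(loc_h, locs.tolist(), has_k, has_z)
```

Note that get_indexer expects the Index to contain unique labels, which is the case here.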

This example was only selecting a single column, but as @ayhan notes in the comment, this same hash table structure also speeds up many other operations like merging, alignment, filtering, and grouping.
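
To illustrate the alignment case, here is a hedged sketch with two made-up Series whose labels only partly overlap and appear in a different order. The names s1, s2, total, and common are hypothetical:

```python
import pandas as pd

# Two hypothetical Series with partly overlapping labels in different order.
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['c', 'b', 'd'])

# Arithmetic aligns on labels, not positions; labels present in only
# one of the two Series produce NaN in the result.
total = s1 + s2

# Index objects also support set-like operations directly.
common = s1.index.intersection(s2.index)

print(total)
print(common.tolist())
```

Here 'b' and 'c' are matched by label even though they sit at different positions in the two Series, which is the kind of operation that benefits from fast label lookup.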


2 Comments

It all comes down to finding the location of 'h' but it might be worth mentioning that this speeds up many other operations like grouping, subsetting, merging etc.
An excellent and complete answer.
3

From the documentation for pandas.Index:

Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects

Having a regular list as an index for a DataFrame could cause issues with unorderable or unhashable objects, evidently. Since an Index is backed by a hash table, the same principles apply as for why lists can't be dictionary keys in regular Python.

At the same time, because the Index is an explicit object, it lets us use labels of different types (in contrast to the implicit integer positions that NumPy arrays have, for instance) while still performing fast lookups.
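
A short sketch of these points, with made-up labels and hypothetical variable names: a list is rejected as a dict key, an Index refuses in-place modification, and non-integer labels can drive a lookup.

```python
import pandas as pd

# A list is mutable and therefore unhashable, so it cannot be a dict key.
try:
    {['a', 'b']: 1}
    list_key_ok = True
except TypeError:
    list_key_ok = False

# An Index is immutable: assigning to an element raises TypeError.
idx = pd.Index(['a', 'b', 'c'])
try:
    idx[0] = 'z'
    index_mutable = True
except TypeError:
    index_mutable = False

# Labels need not be integers; here string labels drive the lookup.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
value_y = s.loc['y']

print(list_key_ok, index_mutable, value_y)
```

Immutability is what makes it safe to build a lookup table over the labels once and keep reusing it.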

If you want to retrieve a list of column names, the Index object has a tolist method.

>>> df.columns.tolist()
['a', 'b', 'c']

2 Comments

I would be grateful if you could expand on the statement 'Having a regular list as an index for a DataFrame could cause issues with unorderable or unhashable objects, evidently.' (Maybe there is an example.) Thanks.
@user3282777 An index is like a mapping to the DataFrame columns, sort of like a Python dict. So the same principles apply as for why you can't have mutable types as dict keys in regular Python, which the Python wiki has a useful bit on.
