I was just tinkering around and found this amusing:
>>> import pandas as pd
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> x = set(df)
>>> x
{'col2', 'col1'}
Why does pandas return column names as set values?
Because that's how __iter__ is defined in the source code for NDFrame, of which pd.DataFrame is a child:
def __iter__(self):
    """Iterate over info axis"""
    return iter(self._info_axis)
pd.DataFrame._info_axis is used internally to store column labels:
df = pd.DataFrame(columns=list('abcd'))
df._info_axis  # Index(['a', 'b', 'c', 'd'], dtype='object')
set iterates the pd.DataFrame instance via __iter__, hashes each element, and returns a set of values corresponding to unique column labels.
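To make the mechanism concrete without relying on pandas internals, here is a minimal sketch using a hypothetical TinyFrame class (an assumption made for illustration, not pandas code). It only mimics the idea of an __iter__ that hands back the column labels, which is what set() ends up consuming.

```python
class TinyFrame:
    """Hypothetical stand-in for a DataFrame; this is not pandas code."""

    def __init__(self, data):
        self._data = dict(data)
        # The column labels play the role of pandas' _info_axis here.
        self._info_axis = list(self._data)

    def __iter__(self):
        # Same idea as NDFrame.__iter__: iterate over the labels only.
        return iter(self._info_axis)


frame = TinyFrame({'col1': [1, 2], 'col2': [3, 4]})

# set() calls iter(frame), which goes through __iter__ and yields labels.
labels = set(frame)
print(labels == {'col1', 'col2'})  # True
```

The column values never appear in the result because __iter__ never exposes them; set() only sees whatever the iterator yields.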
You can find the implementation for __iter__ in DataFrame's parent class NDFrame:
def __iter__(self):
    """Iterate over info axis"""
    return iter(self._info_axis)
It's essentially the same as calling keys() on a DataFrame, which is defined in the same location. I'm including it here because its docstring is more helpful: it describes how _info_axis differs between Series, DataFrame and Panel:
def keys(self):
    """Get the 'info axis' (see Indexing for more)

    This is index for Series, columns for DataFrame and major_axis for
    Panel.
    """
    return self._info_axis
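A quick way to check the relationship described above is to compare keys() with plain iteration. This is a small sketch that assumes a reasonably recent pandas; the variable names are just for illustration, and Panel is left out because recent pandas releases no longer ship it.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# For a DataFrame, keys() returns the column Index...
frame_keys = list(df.keys())
# ...which matches what plain iteration over the frame yields.
iterated = list(df)

# For a Series, keys() returns the index instead.
s = pd.Series([10, 20], index=['a', 'b'])
series_keys = list(s.keys())
```

Here frame_keys and iterated should hold the same column labels, while series_keys holds the Series index labels.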
__iter__ method but couldn't find it. I am sorry if this is a stupid question. I am learning.