
How can I retrieve the index of a single row in a DataFrame? I can already retrieve the index of rows selected with df.loc:

idx = data.loc[data.name == "Smith"].index

I can even retrieve the row index through df.loc by filtering on data.index, like this:

idx = data.loc[data.index == 5].index

However, I cannot retrieve the index directly from the row itself (i.e., from row.index, instead of df.loc[].index). I tried this code:

idx = data.iloc[5].index

The result of this code is the column names.

To provide context, the reason I need the index of a specific row (rather than of rows selected with df.loc) is that I want to use df.apply on each row. I plan to use df.apply to run a function on each row that copies data from the row immediately above it.

def retrieve_gender(row):
    # This is panel data in which only the year-2000 rows are already filled in. Time-invariant values in later years are the same as in 2000.
    if row["Year"] == 2000:
        pass
    elif row["Year"] == 2001: # To avoid complexity, let's use only year 2001 as example.
        idx = row.index # This is wrong code.
        row["Gender"] = row.iloc[idx-1]["Gender"]
    return row["Gender"]


data["Gender"] = data.apply(retrieve_gender, axis=1)

2 Answers


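With pandas you can loop through your DataFrame like this (assuming the default integer index 0, 1, 2, ...):

# Start at 1 because row 0 has no previous row to copy from.
for index in range(1, len(df)):
    if df.loc[index, 'Year'] == 2001:
        df.loc[index, 'Gender'] = df.loc[index - 1, 'Gender']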

1 Comment

I actually wrote a retrieve_data(df) that uses iterrows(), instead of retrieve_data(row), and it worked. But I'm still curious: is there really no way to do this by applying df.apply to each individual row?

apply gives series indexed by column labels

The problem with idx = data.iloc[5].index is that data.iloc[5] returns the row as a pd.Series object indexed by column labels, so its .index is the list of column names.

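The same applies to pd.DataFrame.apply with axis=1: the series that feeds your retrieve_gender function is indexed by column labels, so row.index cannot give you the row's position. The row's own index label is available as row.name, but the series holds no reference to neighbouring rows, so fetching the previous row means reaching back into the DataFrame from inside the function.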
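To make the point about row series concrete, here is a minimal sketch. The toy DataFrame and its values are made up for illustration, and the apply-based retrieve_gender is only one possible way to look up the previous row; it assumes the index labels are unique and reads from the original, unmodified DataFrame.

```python
import pandas as pd

# Hypothetical toy panel data; names and values are invented for this example.
data = pd.DataFrame(
    {
        "name": ["Smith", "Smith", "Jones", "Jones"],
        "Year": [2000, 2001, 2000, 2001],
        "Gender": ["F", None, "M", None],
    }
)

row = data.iloc[1]
column_labels = list(row.index)  # the Series index holds the column names
row_label = row.name             # the row's own index label lives in .name


def retrieve_gender(row):
    """Return Gender from the previous row when Year is 2001 (sketch only)."""
    if row["Year"] == 2001:
        # Translate the row label into a position, then read the row above
        # from the enclosing DataFrame. Assumes unique index labels.
        pos = data.index.get_loc(row.name)
        if pos > 0:
            return data.iloc[pos - 1]["Gender"]
    return row["Gender"]


filled = data.apply(retrieve_gender, axis=1)
print(column_labels, row_label)
print(filled.tolist())
```

This still runs a Python-level function once per row, so it is mainly useful for understanding what apply passes in, not as a fast solution.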

Use vectorised logic instead

With pandas, row-wise logic is inefficient and not recommended because it involves a Python-level loop. Use column-wise logic instead. Taking a step back, it seems you wish to implement 2 rules:

  1. If Year is not 2001, leave Gender unchanged.
  2. If Year is 2001, use Gender from previous row.

np.where + shift

For the above logic, you can use np.where with pd.Series.shift:

data['Gender'] = np.where(data['Year'] == 2001, data['Gender'].shift(), data['Gender'])

mask + shift

Alternatively, you can use mask + shift:

data['Gender'] = data['Gender'].mask(data['Year'] == 2001, data['Gender'].shift())
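Both one-liners can be checked on a small example. The sketch below uses an invented four-row DataFrame (the values are assumptions for illustration) and adds the numpy import that np.where needs; the results are kept in separate variables so the two approaches can be compared.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: Gender is known in 2000 and missing in 2001.
data = pd.DataFrame(
    {
        "Year": [2000, 2001, 2000, 2001],
        "Gender": ["F", None, "M", None],
    }
)

is_2001 = data["Year"] == 2001     # boolean condition, one value per row
previous = data["Gender"].shift()  # Gender of the row immediately above

# np.where returns a plain array, so wrap it to keep the original index.
via_where = pd.Series(
    np.where(is_2001, previous, data["Gender"]), index=data.index
)

# mask replaces values where the condition is True and keeps the rest.
via_mask = data["Gender"].mask(is_2001, previous)

print(via_where.tolist())
print(via_mask.tolist())
```

Note that shift() looks only at the physical row above, so this relies on each 2001 row sitting directly below the matching 2000 row, as in the question's data.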
