2

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows of missing values, but considering the number of missing values is not that small, now I want to use the Random Sampling Imputation to replace them proportionally with the existing categorical variables.

My code is:

new_agent = hotel['agent'].dropna()

agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))

results

result is

The first 3 rows was nan but now replaced with <function at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?

UPDATE: Thanks ti7 helped me solved the problem:

new_agent = hotel['agent'].dropna() #get a series of just the available values

n_null = hotel['agent'].isnull().sum() #length of the missing entries

new_agent.sample(n_null,replace=True).values #sample it with repetition and get values

hotel.loc[hotel['agent'].isnull(),'agent']=new_agent.sample(n_null,replace=True).values #fill and replace

1
  • fillna takes in a value not a function Commented Mar 3, 2021 at 22:39

1 Answer 1

0

.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!

You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.

  • get a Series of just the available values (.dropna())
  • .sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
  • get the .values (this is a flat numpy array)
  • filter the column and assign

quick code

df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index

demo

>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
...     df["agent"].isna().sum(),
...     replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7
Sign up to request clarification or add additional context in comments.

3 Comments

I like the way you layout the concept and it works! Thanks! Just one final question. In your syntax: df.loc[df["agent"].isna(), "agent"], the second 'agent' represent label as an argument?
Cheers - the second arg filters the Series down to just that column (otherwise it attempts to make changes to every column). .loc will accept a list of columns like df.loc[df["agent"].isna(), ["agent", "b"]] if wanted too!
If this solved your Question, please mark it as the Answer with the check to its left!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.