Fill missing data with random values from categorical column - Python

Question

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My first instinct was to drop the rows with missing values, but since that is not a small share of the data, I'd now like to use random sampling imputation instead: fill each missing entry with a value drawn at random from the existing values, so the categories keep roughly their current proportions.

My code is:

new_agent = hotel['agent'].dropna()

agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))

The first 3 rows were NaN but are now replaced with <function <lambda> at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?

UPDATE: Thanks to ti7, who helped me solve the problem:

new_agent = hotel['agent'].dropna() #get a series of just the available values

n_null = hotel['agent'].isnull().sum() #number of missing entries

new_agent.sample(n_null,replace=True).values #sample it with repetition and get values

hotel.loc[hotel['agent'].isnull(),'agent']=new_agent.sample(n_null,replace=True).values #fill and replace

fillna takes in a value not a function

– user15270287, Commented Mar 3, 2021 at 22:39

ti7 · Accepted Answer · 2021-03-03 23:28:38Z

.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
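To see this in isolation, here is a minimal sketch. The Series `s` and the function `pick` are made up for illustration and are not from the question; it assumes pandas accepts a callable as the fill value, as it evidently did in the question. The callable is stored as-is rather than called:

```python
import pandas as pd

# Hypothetical toy data standing in for hotel['agent']
s = pd.Series([1.0, 2.0, None])

def pick(x):
    # Never actually invoked: fillna() treats its argument as a plain value
    return 0.0

filled = s.fillna(pick)

# The formerly missing slot now holds the function object itself,
# so the column can no longer stay a float column
print(filled.dtype)
print(filled.iloc[2] is pick)
```

This is why the output shows a function repr where the NaN used to be.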

You probably want some form of generating a new Series with random values from your current series (you know the shape from subtracting the lengths) and use that for the missing values.

get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign

quick code

df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
    df["agent"].isna().sum(),  # get the same number of values as are missing
    replace=True               # repeat values
).values                       # throw out the index
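If you need this more than once, or want reproducible draws, the same steps could be wrapped in a small helper. This is only a sketch: the name `fill_na_by_sampling` and its parameters are my own invention, and it returns a copy instead of modifying the frame in place. `random_state` is passed straight through to `.sample()`.

```python
import pandas as pd

def fill_na_by_sampling(df, column, random_state=None):
    """Return a copy of df where NaNs in `column` are replaced by random
    draws (with replacement) from that column's observed values."""
    out = df.copy()
    missing = out[column].isna()
    n_missing = int(missing.sum())
    if n_missing == 0:
        return out  # nothing to fill

    observed = out[column].dropna()
    if observed.empty:
        raise ValueError(f"column {column!r} has no observed values to sample")

    draws = observed.sample(n_missing, replace=True, random_state=random_state)
    # to_numpy() drops the sampled index so values are assigned by position
    out.loc[missing, column] = draws.to_numpy()
    return out

# Same toy frame as in the demo
df = pd.DataFrame({"agent": [1, 2, None, None, 10], "b": [3, 4, 5, 6, 7]})
filled = fill_na_by_sampling(df, "agent", random_state=0)
print(filled)
```

Passing the same `random_state` gives the same fill each time, which helps when you want to compare models trained on the imputed data.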

demo

>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
   agent  b
0    1.0  3
1    2.0  4
2    NaN  5
3    NaN  6
4   10.0  7

>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])

>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
...     df["agent"].isna().sum(),
...     replace=True
... ).values
>>> df
   agent  b
0    1.0  3
1    2.0  4
2   10.0  5
3    2.0  6
4   10.0  7
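Since the goal in the question was to fill "proportionally", a rough sanity check is to compare the category shares before and after filling. The sketch below uses synthetic data (the agent IDs, frequencies, and roughly 14% missing rate are assumptions chosen to resemble the question, not the real hotel dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in: three agent IDs with unequal frequencies
agent = pd.Series(rng.choice([1.0, 2.0, 10.0], size=10_000, p=[0.6, 0.3, 0.1]))
# Blank out roughly 14% of the entries at random
agent[rng.random(agent.size) < 0.14] = np.nan

# Shares among the observed values (value_counts skips NaN by default)
before = agent.value_counts(normalize=True)

missing = agent.isna()
agent.loc[missing] = (
    agent.dropna()
    .sample(int(missing.sum()), replace=True, random_state=0)
    .to_numpy()
)

after = agent.value_counts(normalize=True)
print(pd.DataFrame({"before": before, "after": after}))
```

The two columns should be close but not identical, since the fill values are random draws.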

edited Mar 3, 2021 at 23:28

answered Mar 3, 2021 at 23:22

ti7

3 Comments

Stephanie Jie Hu Over a year ago

I like the way you lay out the concept, and it works! Thanks! Just one final question. In your syntax df.loc[df["agent"].isna(), "agent"], is the second 'agent' a column label passed as an argument?

ti7 Over a year ago

Cheers - yes, the second arg is a column label; it restricts the selection to just that column (otherwise the assignment would apply to every column in the matching rows). .loc will accept a list of columns like df.loc[df["agent"].isna(), ["agent", "b"]] if wanted too!

ti7 Over a year ago

If this solved your Question, please mark it as the Answer with the check to its left!
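On the .loc question in the comments above, a small sketch of the difference the second argument makes. The frame is the toy one from the demo, and the fill value 0 is arbitrary, picked only to make the changed cells easy to spot:

```python
import pandas as pd

df = pd.DataFrame({"agent": [1, 2, None, None, 10], "b": [3, 4, 5, 6, 7]})
mask = df["agent"].isna()

# With a column label: only 'agent' is touched in the masked rows
one_col = df.copy()
one_col.loc[mask, "agent"] = 0

# Without it: every column is overwritten in the masked rows
all_cols = df.copy()
all_cols.loc[mask] = 0

print(one_col)
print(all_cols)
```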
