2

I have a column "Country" in a data frame, and I would like to collapse it into only two values: "Mainland China" and "Others". I have tried several approaches (filter, etc.), but none of them works. How should I do it?

Here is the dataset https://drive.google.com/file/d/17DY8f-Jxba0Ky5iOUQqEZehhoWNO3vzR/view?usp=sharing

FYI, I have already grouped different provinces in China as one country "Mainland China"

Thanks for your help!


5
  • can you post the data as code please, not a picture Commented Feb 9, 2020 at 21:03
  • one min, I will do it Commented Feb 9, 2020 at 21:04
  • interesting dataset Commented Feb 9, 2020 at 21:43
  • Do not share information as images unless absolutely necessary. See: meta.stackoverflow.com/questions/303812/…, minimal reproducible example. This seems rather basic, have you read the Pandas docs? Commented Feb 10, 2020 at 1:44
  • Oh, and this is essentially a duplicate of stackoverflow.com/questions/19913659/…. Commented Feb 10, 2020 at 2:07

4 Answers

3

I think the quickest way to change the values is .loc with a boolean mask rather than apply: the masked assignment is vectorized, whereas apply calls a Python function once per element.

df.loc[df.Country != 'Mainland China', 'Country'] = 'Others'
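
For anyone who wants to see the one-liner end to end, here is a minimal sketch on a made-up frame. The "Confirmed" column and the numbers in it are assumptions for illustration only and are not taken from the OP's dataset.

```python
import pandas as pd

# Toy data; the "Confirmed" column and its values are hypothetical.
df = pd.DataFrame({
    "Country": ["Mainland China", "Japan", "Mainland China", "Thailand"],
    "Confirmed": [100, 5, 200, 3],
})

# Overwrite every row whose Country is not "Mainland China".
df.loc[df["Country"] != "Mainland China", "Country"] = "Others"

# With only two labels left, a groupby gives one row per group.
totals = df.groupby("Country")["Confirmed"].sum()
print(totals)
```

Note that this modifies df in place, so take a df.copy() first if the original country names are still needed.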

1 Comment

Thanks aws_apprentice! I tried that way. But apparently, I made some mistakes
1

Try this (and then group by Country):

import numpy as np

df["Country"] = np.where(df["Country"].eq("Mainland China"), "Mainland China", "Others")

Edit

Timings with timeit (note that I left out .loc[], because a lambda cannot contain an assignment statement; feel free to suggest a way of adding it):

import pandas as pd
import numpy as np
from timeit import Timer

#proportion-wise that's the dataframe, as per OP's question

df=pd.DataFrame({"Country": ["Mainland China"]*398+["a", "b","c"]*124})

df["otherCol"]=2
df["otherCol2"]=3

#shuffle

df2=df.copy().sample(frac=1)
df3=df2.copy()
df4=df3.copy()

op2 = Timer(lambda: np.where(df2["Country"].eq("Mainland China"), "Mainland China", "Others"))
op3 = Timer(lambda: df3.Country.map(lambda x: x if x == "Mainland China" else "Others"))
op4 = Timer(lambda: df4["Country"].apply(lambda x: x if x == "Mainland China" else "Others"))

print(op2.timeit(number=1000))
print(op3.timeit(number=1000))
print(op4.timeit(number=1000))

Returns:

2.1856687490362674 #numpy
2.2388894270407036 #map
2.4437739049317315 #apply
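
One possible way to bring .loc[] into the comparison is to wrap the assignment in a small function that works on a copy, so the timed callable stays side-effect free. This is only a sketch: relabel_with_loc is a hypothetical helper name, and the measured time includes the cost of the copy, so it is not directly comparable to the figures above.

```python
import timeit
import pandas as pd

# Same proportions as the benchmark frame above.
df = pd.DataFrame({"Country": ["Mainland China"] * 398 + ["a", "b", "c"] * 124})

def relabel_with_loc(frame):
    """Hypothetical helper: copy the frame, then relabel via a boolean mask."""
    out = frame.copy()
    out.loc[out["Country"] != "Mainland China", "Country"] = "Others"
    return out

# The copy is part of the measurement, which penalises .loc slightly.
elapsed = timeit.timeit(lambda: relabel_with_loc(df), number=100)
print(f".loc (including copy), 100 runs: {elapsed:.4f} s")

result = relabel_with_loc(df)
```

Copying inside the function matters because after the first in-place assignment there would be nothing left to relabel, and later iterations would time a different workload.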

7 Comments

Thanks Grazegorz! Even though your solution came later than the other two, I now know one more way to solve this problem. Thank you :D
No worries. Time them, and you will have some criteria to compare ;) I would expect np.where to be a bit faster than .loc[...]; .apply(...) is out of the running here.
What is the advantage of using this over .loc[], aside from a tiny performance gain?
Looking at stackoverflow.com/a/31173785/5082048, performance might be lower than for .map(lambda x: ...) for small datasets. List comprehensions scored best in that benchmark.
I benchmarked all the methods except .loc[] - please see above.
-1

Try using apply:

dataframe["Country"] = dataframe["Country"].apply(lambda x: x if x == "Mainland China" else "Others")

4 Comments

Thanks, your solution works perfectly as well. I have already accepted another answer, but I sincerely appreciate your help!
@AMC Thanks for joining the discussion! I suppose that we should respect everybody's efforts. What do you think? :)
@almo I agree entirely, my statement was in no way related to the answerer’s person or character.
@AMC on the plus side, it is quite flexible if other categories need to be defined in the future.
-2

Assuming df is your pandas dataframe.

You could do:

df['Country'] = df.Country.map(lambda x: x if x == 'Mainland China' else 'Others')

6 Comments

Thanks, perfect:D. I have wasted nearly one hour.
@AMC on the plus side, it is quite flexible if other categories need to be defined in the future.
@ArcoBast You could just use .map() and a dictionary, which is likely the most flexible solution.
@AMC I thought about this, but mapping everything except 'Mainland China' to a single value is not straightforward with a dictionary. I could have suggested using a defaultdict, of course, but considered that to be overkill. From an analyst's point of view, when working with a small dataset like this one, flexibility beats speed in my experience.
@ArcoBast It's certainly a unique situation, yes. For this particular case, I like the solution using .loc[]. As soon as the number of values to map changes, I prefer .map() with a dict.
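
To illustrate the .map()-with-a-dict idea from the comments above, here is a minimal sketch. It assumes a catch-all is acceptable: .map() turns every value missing from the dictionary into NaN, and .fillna() then assigns the fallback label, so no defaultdict is needed. The "Greater China" category is made up purely to show a second mapping entry.

```python
import pandas as pd

countries = pd.Series(["Mainland China", "Japan", "Hong Kong", "Thailand"])

# Explicit categories; "Greater China" is a hypothetical extra group.
mapping = {
    "Mainland China": "Mainland China",
    "Hong Kong": "Greater China",
}

# Values not found in the dict become NaN, which fillna relabels.
# Caveat: values that were already NaN also end up as "Others".
grouped = countries.map(mapping).fillna("Others")
print(grouped.tolist())
```

Adding another category later only means adding a dictionary entry, which is the flexibility argued for above.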
