I have a pandas dataframe where cells in columns have multiple values and are separated by ';'. I'm trying to split the multiple values (in one cell) and create new rows for those that split off. Something like the example below:

> In: df
> Out:
| Year | State | Ingredient | Species |
| 1998 |  CA   | egg; pork  | sp1;sp2 |

The result I am trying to achieve looks like this:

> In: df
> Out:
| Year | State | Ingredient | Species |
| 1998 |  CA   | egg        | sp1     |
| 1998 |  CA   | egg        | sp1     |
| 1998 |  CA   | pork       | sp2     |
| 1998 |  CA   | pork       | sp2     |

I have found a method to split the dataframe like this, but it only works once. The code I used is shown below:

sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
df1['Species'] = sp.values

When I execute this on the 'Species' column first, using the original dataframe (df), it works.

However, when I execute this code again on df1 to split the 'Ingredient' column, it raises an error saying that the length of values does not match the length of the index. The code is shown below:

fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = fd.values

After many trials to find the cause, I realized that when I run this code again on df1 to create df2, the line df2 = df1.loc[j].copy() doubles the number of rows, giving me more rows than I need. However, if I substitute 'df' (the original dataframe) for 'df1', the error doesn't appear and it works.

Is there a solution to fix this? Or is there any other way of splitting it?

Thank you.

ps. This is my first time posting on Stack Overflow, and I'm also new to Python. Sorry if the formatting is bad.

2 Answers

I gave your problem a try. I wasn't able to fix the issue in your approach. I was able to come up with another approach since you provided the expected output. Hopefully this is concise and resolves your issue.

import pandas as pd

df = pd.DataFrame(columns=['Year', 'State', 'Ingredient', 'Species'])
df.loc[0] = [1998, 'CA', 'egg; pork', 'sp1;sp2']   # Same input df as problem
print(df)
sp = df['Species'][0].split(';') # Separating by species
df = pd.concat([df]*len(sp), ignore_index=True) # Repeat the rows len(sp) times
df['Species'] = sp
ing = df['Ingredient'][0].split(';')
df = pd.concat([df]*len(ing), ignore_index=True)
df['Ingredient'] = ing*len(sp)    # Replicate ingredient len(sp) number of times
print(df)
   Year State Ingredient  Species
0  1998    CA  egg; pork  sp1;sp2
   Year State Ingredient Species
0  1998    CA        egg     sp1
1  1998    CA       pork     sp2
2  1998    CA        egg     sp1
3  1998    CA       pork     sp2

PS: This is my first time answering ... please let me know if I should make any changes to this answer to add more detail or format. Thanks!

Edit: I was able to find out what was going wrong in your approach. You have to reset the index when you create the copy of the dataframe. Otherwise every row of df1 keeps the index label 0, so selecting by label 0 returns all of those rows each time and you end up with more rows than values. See below.

sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()  # without reset: both rows keep index label 0
print(df1)
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
print(j)

df1 = df.loc[i].copy().reset_index(drop=True)  # with reset: rows are labelled 0 and 1
print(df1)
fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
print(j)

Output:

   Year State Ingredient  Species
0  1998    CA  egg; pork  sp1;sp2
0  1998    CA  egg; pork  sp1;sp2
Int64Index([0, 0, 0, 0], dtype='int64')
   Year State Ingredient  Species
0  1998    CA  egg; pork  sp1;sp2
1  1998    CA  egg; pork  sp1;sp2
Int64Index([0, 0, 1, 1], dtype='int64')
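
To make the row counts behind that error concrete, here is a minimal sketch that reuses the same one-row dataframe. The names rows_without_reset and rows_with_reset are mine, not from the question. It only counts how many rows .loc returns in each case and compares that with the number of split values.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1998], "State": ["CA"],
                   "Ingredient": ["egg; pork"], "Species": ["sp1;sp2"]})

# First split: label 0 is selected twice, so both rows of df1 carry label 0.
sp = df["Species"].str.split(";", expand=True).stack().reset_index(level=1, drop=True)
df1 = df.loc[sp.index.get_level_values(0)].copy()
df1["Species"] = sp.values

# Second split on the duplicated index: every entry of fd is labelled 0.
fd = df1["Ingredient"].str.split(";", expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)

# .loc returns every row matching each requested label:
# 4 requested labels x 2 matching rows each.
rows_without_reset = len(df1.loc[j])

# With a fresh 0..n-1 index, each label matches exactly one row.
df1_reset = df1.reset_index(drop=True)
fd_reset = (df1_reset["Ingredient"].str.split(";", expand=True)
            .stack().reset_index(level=1, drop=True))
rows_with_reset = len(df1_reset.loc[fd_reset.index.get_level_values(0)])

print(len(fd), rows_without_reset, rows_with_reset)
```

Without the reset, .loc hands back more rows than there are values in fd, which is what triggers the length mismatch when assigning fd.values.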

Original code with fix:

import pandas as pd

df = pd.DataFrame(columns=['Year', 'State', 'Ingredient', 'Species'])
df.loc[0] = [1998, 'CA', 'egg; pork', 'sp1;sp2']
# print(df)

sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy().reset_index(drop=True, inplace=False)
df1['Species'] = sp.values


fd = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = fd.index.get_level_values(0)
df2 = df1.loc[j].copy().reset_index(drop=True, inplace=False)
df2['Ingredient'] = fd.values
print(df2)
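
If you need to do this for several columns, the same fix can be wrapped in a small function. This is only a sketch under my own assumptions: split_column_to_rows is a hypothetical helper name, and I added whitespace stripping and a dropna() for cells with fewer values, neither of which is in the code above. Note that splitting one column after another produces every Species/Ingredient combination for a row.

```python
import pandas as pd

def split_column_to_rows(frame, column, sep=";"):
    """Hypothetical helper: one output row per `sep`-separated value in `column`."""
    # Unique 0..n-1 labels, so .loc below matches exactly one row per label.
    frame = frame.reset_index(drop=True)
    # dropna() discards padding from cells that hold fewer values than others.
    parts = frame[column].str.split(sep, expand=True).stack().dropna().str.strip()
    rows = parts.index.get_level_values(0)
    out = frame.loc[rows].reset_index(drop=True)
    out[column] = parts.values
    return out

df = pd.DataFrame({"Year": [1998], "State": ["CA"],
                   "Ingredient": ["egg; pork"], "Species": ["sp1;sp2"]})

result = df
for col in ["Species", "Ingredient"]:
    result = split_column_to_rows(result, col)
print(result)
```

Because the index is reset on the way in and on the way out, the function can be applied repeatedly without running into duplicate labels.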

Hope that helps!


3 Comments

Thanks for responding! I tried out your code, and it didn't work so well. I think your method worked for you because the dataset was small. I have a large and complicated dataset, so that's why it didn't work for me. Thank you for the 'Edit' advice, that really helped my thought process, and I've learned a lot from your method. If I find the solution, I will let you know!
Does the fix for your original code work? I understand that earlier the issue was incorrect functionality; is the issue now about performance?
Yes! I figured it out now. It is exactly what you said in your "Original code with fix". I will post my answer down below. Thank you vk!

vk's "Original code with fix" shown above helped me solve the error "length of values does not match length of index". The solution: I needed to place reset_index() at the appropriate locations in the code.

Original code:

## Separate multiple entries in cells in 'Species' column to new rows:
sp = df['Species'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy()
df1['Species'] = sp.values

## Separate multiple entries in cells in 'Ingredient' column to new rows:
ing = df1['Ingredient'].str.split(';', expand=True).stack().reset_index(level=1, drop=True)
j = ing.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = ing.values

Fixed code:

## Separate multiple entries in 'Species' column cell into rows
sp = df['Species'].str.split(';', expand=True).stack()
i = sp.index.get_level_values(0)
df1 = df.loc[i].copy().reset_index()
df1['Species'] = sp.values

del df1['index'] ## reset_index() without drop=True keeps the old index as a column called "index", so remove it

## Separate multiple entries in 'Ingredient' column cell into rows:
ing = df1['Ingredient'].str.split(';', expand=True).stack()
j = ing.index.get_level_values(0)
df2 = df1.loc[j].copy()
df2['Ingredient'] = ing.values

And I got the output I wanted with the 'Fixed code'.
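
For anyone reading this later: newer pandas versions (0.25 and up) have DataFrame.explode, which turns list-valued cells into rows and could replace the stack/loc steps. This is a different technique from the code above, shown here as a minimal sketch on the same one-row example. The whitespace stripping is my own addition.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1998], "State": ["CA"],
                   "Ingredient": ["egg; pork"], "Species": ["sp1;sp2"]})

result = df.copy()
for col in ["Species", "Ingredient"]:
    # Turn "a;b" into ["a", "b"], then give each list element its own row.
    result[col] = result[col].str.split(";")
    # Reset the index each time so labels stay unique between steps.
    result = result.explode(col).reset_index(drop=True)
    result[col] = result[col].str.strip()
print(result)
```

As with the stack-based version, exploding the two columns one after the other gives every Species/Ingredient combination for the original row.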
