My problem: I have a pandas DataFrame, and one column I need to process contains values separated by ":". Some of the values between the colons take the form value=value, and these can appear at the start, middle, or end of the string. The length of the string differs from cell to cell, e.g.:

clickstream['events']  
1:3:5:7=23  
23=1:5:1:5:3  
9:0:8:6=5:65:3:44:56  
1:3:5:4

I have a file that contains the lookup values for these numbers, e.g.:

event_no,description,event
1,xxxxxx,login
3,ffffff,logout
5,eeeeee,button_click
7,tttttt,interaction
23,ferfef,click1

Output required:

clickstream['events']  
login:logout:button_click:interaction=23
click1=1:button_click:login:button_click:logout

Is there a Pythonic way to look up these individual values and replace each one with the event column value from the matching event_no row, as shown in the output? I have hundreds of events and am trying to work out a smart way of doing this. pd.merge would have done the trick if each cell held a single value, but I'm struggling to work out how to apply the lookup across all the values while ignoring the "=value" part of the string.

1 Answer

Edit to ignore missing keys in the dict:

import pandas as pd

# Sample data: index label -> colon-separated event numbers
EventsDict = {1:'1:3:5:7',2:'23:45:1:5:3',39:'0:8:46:65:3:44:56',4:'1:3:5:4'}
clickstream = pd.Series(EventsDict)
# Keep the lookup as a plain dictionary
EventsLookup = {1:'login',3:'logout',5:'button_click',7:'interaction'}

def EventLookup(x):
    # Map each number to its event name; unknown numbers become 'Missing'
    list1 = [EventsLookup.get(int(item),'Missing') for item in x.split(':')]
    return ":".join(list1)

clickstream.apply(EventLookup)

Since you are using a full DataFrame and not just a Series, use:

clickstream['events'].apply(EventLookup)
Output:
1                 login:logout:button_click:interaction
2             Missing:Missing:login:button_click:logout
4                     login:logout:button_click:Missing
39    Missing:Missing:Missing:Missing:logout:Missing...
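
The function above calls int() on every token, so a token such as 7=23 from the question's data would raise a ValueError. Below is a minimal sketch of one way to handle that, assuming (as in the question's required output) that only the number before "=" is looked up and the "=value" part is kept unchanged. The names translate_events and EVENT_NAMES are made up for this illustration, and the lookup uses string keys so no int() conversion is needed.

```python
# Hypothetical lookup with string keys, taken from the question's sample file.
EVENT_NAMES = {
    "1": "login",
    "3": "logout",
    "5": "button_click",
    "7": "interaction",
    "23": "click1",
}

def translate_events(cell, lookup=EVENT_NAMES, missing="Missing"):
    """Replace each event number with its name, leaving any '=value' part as-is."""
    translated = []
    for token in cell.split(":"):
        # partition returns (key, "=", value), or (token, "", "") when there is no "="
        key, sep, value = token.partition("=")
        translated.append(lookup.get(key, missing) + sep + value)
    return ":".join(translated)

samples = ["1:3:5:7=23", "23=1:5:1:5:3", "1:3:5:4"]
translated = [translate_events(s) for s in samples]
print(translated[0])  # login:logout:button_click:interaction=23
print(translated[1])  # click1=1:button_click:login:button_click:logout
print(translated[2])  # login:logout:button_click:Missing
```

On a DataFrame the same function can be passed to apply, e.g. clickstream['events'] = clickstream['events'].apply(translate_events).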

11 Comments

Hi @Liam Foley, thanks for answering. I tried to replicate the above but get the following error: AttributeError: ("'Series' object has no attribute 'split'", u'occurred at index 1'). My only change to your statements was replacing clickstream = pd.DataFrame(EventsDict) with clickstream = pd.DataFrame([EventsDict]) to avoid the error ValueError: If using all scalar values, you must pass an index. Any ideas? Thanks.
@Maruhk Sounds like the apply isn't working. Can you post the exact code you have? If you're using a full dataframe, you would have to do something like: DF['COL'] = DF['COL'].apply(lambda x: .......
It seems my lookup dictionary was not created properly. Following your method, I replicated the transformation and ran the function, but I get an error on the first value in the clickstream dataset. I have copied the code to the following location: ClickStream - Code & Output Error Link. Thanks.
You make the dict, but then turn the object straight back into a Series: eventlookup = eventlookup.set_index('no')['value'].to_dict() followed by eventlookup = pd.Series(eventlookup). Don't do the second part, eventlookup = pd.Series(eventlookup). What does the dict look like?
Great, it's a dictionary now, which it needs to be. Try the other part now: clickstream['events'] = clickstream['events'].apply(lambda x: ":".join([eventlookup[str(item)] for item in x.split(':')])) If that doesn't work, check whether your dictionary keys are strings or ints. The datatype of your dict keys needs to match the datatype of the values in clickstream['events']; they all need to be either ints or strs.
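
The key-type point in the last comment can be sketched as follows. This is only an illustration and swaps in the standard-library csv module instead of the set_index(...).to_dict() route discussed above; the in-memory lookup_csv stands in for the question's lookup file and the variable names are assumptions. csv.DictReader returns every field as a string, so the dictionary keys match the string tokens produced by split(':').

```python
import csv
import io

# Hypothetical in-memory copy of the lookup file shown in the question.
lookup_csv = io.StringIO(
    "event_no,description,event\n"
    "1,xxxxxx,login\n"
    "3,ffffff,logout\n"
    "5,eeeeee,button_click\n"
    "7,tttttt,interaction\n"
    "23,ferfef,click1\n"
)

# Every field comes back as str, so the keys are '1', '3', ... rather than ints.
event_lookup = {row["event_no"]: row["event"] for row in csv.DictReader(lookup_csv)}

# split(':') also yields strings, so the tokens can be used as keys directly.
tokens = "1:3:5:7".split(":")
names = ":".join(event_lookup.get(token, "Missing") for token in tokens)
print(names)  # login:logout:button_click:interaction
```

If the lookup is loaded with pandas instead, passing dtype=str to pd.read_csv should keep the event_no column as strings in the same way.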