I have a dataset with users, visit types (bookings or searches), and hotels. I need to populate a new column with the most booked hotel for each booking row, counting only that user's bookings in earlier rows.

For example,

   **user**   **visit_type**   **hotel_code**   **most_booked**
1    user1       search             1                NaN
2    user1       search             2                NaN
3    user1       booking            1                NaN
4    user1       search             8                NaN
5    user1       booking            8                1
6    user2       search             6                NaN
7    user2       booking            6                NaN
8    user2       search             4                NaN
9    user2       booking            4                6
10   user2       booking            6                4
11   user2       booking            4                6

So with this example:

For user1, row 3 would be NaN because no hotel has been booked before it, and row 5 would be hotel 1.

For user2, row 7 would be NaN and row 9 would be hotel 6. Row 10 would be hotel 4: the two hotels booked so far are tied, so the last one booked wins. Row 11 would be hotel 6, as it is the most booked up to that point (without taking row 11 itself into account).

1 Answer

This should achieve what you want:

import pandas as pd
from collections import defaultdict

d = {"user": ["user1","user1","user1","user1","user1","user2","user2","user2","user2","user2","user2"],
     "visit_type": ["search","search","booking","search","booking","search","booking","search","booking","booking","booking"],
     "hotel_code": [1,2,1,8,8,6,6,4,4,6,4]}

df = pd.DataFrame(data=d)
# Default for rows with no earlier booking (note: this is the string 'NaN', not a real missing value)
df['most_booked'] = 'NaN'

for user in df.user.unique():
    # Ignore searches; only this user's bookings matter
    df_bookings = df.loc[(df["visit_type"] == "booking") & (df["user"] == user)]
    last_booked = None
    booking_counts = defaultdict(int)

    for i, entry in df_bookings.iterrows():
        # Skip the user's first booking: there is no history yet
        if last_booked is not None:
            highest = max(booking_counts.values())
            # Prefer the last booked hotel if it ties for the highest count
            if booking_counts[last_booked] == highest:
                max_booked = last_booked
            # Otherwise take the hotel with the highest count
            else:
                max_booked = max(booking_counts, key=booking_counts.get)
            df.loc[i, 'most_booked'] = max_booked

        # Record the current booking so later rows can see it
        current_booking = entry["hotel_code"]
        booking_counts[current_booking] += 1
        last_booked = current_booking

print(df)

    hotel_code   user visit_type most_booked
0            1  user1     search         NaN
1            2  user1     search         NaN
2            1  user1    booking         NaN
3            8  user1     search         NaN
4            8  user1    booking           1
5            6  user2     search         NaN
6            6  user2    booking         NaN
7            4  user2     search         NaN
8            4  user2    booking           6
9            6  user2    booking           4
10           4  user2    booking           6
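
If you want real missing values instead of the string 'NaN', the same idea can be wrapped in a function. This is only a sketch, not a drop-in replacement: the name add_most_booked is made up for illustration, and the tie-break is an assumption on my part (among hotels tied for the highest count, the most recently booked one wins). That agrees with the expected output in the question, but it can differ from the loop above when several hotels other than the last booked one are tied.

```python
from collections import Counter

import numpy as np
import pandas as pd


def add_most_booked(df):
    """Return a copy of df with a float 'most_booked' column.

    Hypothetical helper. Assumed tie-break: among hotels tied for the
    highest count, the most recently booked one wins.
    """
    out = df.copy()
    out["most_booked"] = np.nan  # real missing value, so the column is float

    # Searches are ignored; only booking rows carry history
    bookings = out[out["visit_type"] == "booking"]

    for _, group in bookings.groupby("user", sort=False):
        counts = Counter()  # hotel -> number of earlier bookings
        last_seen = {}      # hotel -> position of its latest booking
        for position, (idx, hotel) in enumerate(group["hotel_code"].items()):
            if counts:  # the user's first booking has no history
                best = max(counts, key=lambda h: (counts[h], last_seen[h]))
                out.loc[idx, "most_booked"] = best
            # Record the current booking only after filling in its row
            counts[hotel] += 1
            last_seen[hotel] = position
    return out
```

Calling add_most_booked(df) on the frame from the question leaves df itself untouched and returns the new frame. Because the column holds NaN, the hotel codes in it come back as floats (1.0 rather than 1).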