I have a pandas DataFrame device_class as below:

Base    G   Pref    Sier    Val Other   latest_class    d_id
0       2   0       0       12  0       Val             38
12      0   0       0       0   0       Base            39
0       0   12      0       0   0       Pref            40
0       0   0       12      0   0       Sier            41
0       0   0       12      0   0       Sier            42
12      0   0       0       0   0       Base            43
0       0   0       0       0   12      Other           45
0       0   0       0       0   12      Other           46
0       12  0       0       0   0       G               47
0       0   12      0       0   0       Pref            48
0       0   0       0       0   12      Other           51
0       0   8       5       0   0       Sier            53
0       0   0       0       12  0       Val             54
0       0   0       0       12  0       Val             55

I want to select only the rows (devices) that meet all of the following conditions:

1. The device has been in its latest class for a minimum of 3 consecutive months.
2. latest_class is not 'Other'; those records need to be filtered out.
3. The device has been part of only one class. The data above covers one year, and some devices, such as d_id 38, have been part of two classes (G and Val). These devices need to be filtered out as well.

So the expected output will be:

Base    G   Pref    Sier    Val Other   latest_class    d_id
12      0   0       0       0   0       Base            39
0       0   12      0       0   0       Pref            40
0       0   0       12      0   0       Sier            41
0       0   0       12      0   0       Sier            42
12      0   0       0       0   0       Base            43
0       12  0       0       0   0       G               47
0       0   12      0       0   0       Pref            48
0       0   0       0       12  0       Val             54
0       0   0       0       12  0       Val             55

I have done the following to keep only the records whose value in the column named by latest_class is at least 3:

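import numpy as np
# For each row, find the position of the column whose name equals latest_class
i = np.arange(len(device_class))
j = (device_class.columns[:-1].values[:, None] == device_class.latest_class.values).argmax(0)
# Keep the rows whose value in that column is at least 3
device_class_latest = device_class.iloc[np.flatnonzero(device_class.values[i, j] >= 3)]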

Can someone please help me with this?

1 Answer

I'm not sure I'm understanding your data structure correctly. I'm assuming that the values in the first 6 columns are the number of months that the device has been in each class. If so, try the following solution:

import pandas as pd

data = {
    'Base': [0, 12, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 0, 0],
    'G': [2, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0],
    'Pref': [0, 0, 12, 0, 0, 0, 0, 0, 0, 12, 0, 8, 0, 0],
    'Sier': [0, 0, 0, 12, 12, 0, 0, 0, 0, 0, 0, 5, 0, 0],
    'Val': [12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 12],
    'Other': [0, 0, 0, 0, 0, 0, 12, 12, 0, 0, 12, 0, 0, 0],
    'latest_class': [
        'Val', 'Base', 'Pref', 'Sier', 'Sier', 'Base', 'Other', 'Other', 'G',
        'Pref', 'Other', 'Sier', 'Val', 'Val'
    ],
    'd_id': [38, 39, 40, 41, 42, 43, 45, 46, 47, 48, 51, 53, 54, 55]
}

# Load data into DataFrame
df = pd.DataFrame(data)

# Remove records where latest class is Other
df = df[df['latest_class'] != 'Other']

# Filter out records with > 1 class (more than one non-zero month column)
months_df = df.drop(['latest_class', 'd_id'], axis=1)
months_multiple = months_df[months_df > 0].count(axis=1)
months_1_only = months_multiple == 1
df = df.loc[months_1_only, :]

# Get records where months of latest_class >= 3
rows_to_keep = []
for index, row in df.iterrows():
    latest_class = row['latest_class']
    months_spent = row[latest_class]
    rows_to_keep.append(bool(months_spent >= 3))
df = df.iloc[rows_to_keep, :]

# Put the columns back in the original order (if needed)
df = df[['Base', 'G', 'Pref', 'Sier', 'Val', 'Other', 'latest_class', 'd_id']]
print(df)

The output is as you wanted:

    Base   G  Pref  Sier  Val  Other latest_class  d_id
1     12   0     0     0    0      0         Base    39
2      0   0    12     0    0      0         Pref    40
3      0   0     0    12    0      0         Sier    41
4      0   0     0    12    0      0         Sier    42
5     12   0     0     0    0      0         Base    43
8      0  12     0     0    0      0            G    47
9      0   0    12     0    0      0         Pref    48
12     0   0     0     0   12      0          Val    54
13     0   0     0     0   12      0          Val    55

Note that I've been overly verbose in order to clearly identify each step, but you could combine a lot of these lines together to create a more succinct script.
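
As a rough sketch of what a more succinct version might look like, the three filters can be written as boolean masks and combined in one step. The names filter_devices, class_cols and min_months are made up for this illustration, and the sample uses only a few rows from the question:

```python
import numpy as np
import pandas as pd

# Assumed list of the month-count columns, one per class
class_cols = ['Base', 'G', 'Pref', 'Sier', 'Val', 'Other']

def filter_devices(df, min_months=3):
    """Hypothetical helper: apply all three filters with boolean masks."""
    months = df[class_cols]
    not_other = df['latest_class'] != 'Other'
    single_class = (months > 0).sum(axis=1) == 1
    # Months spent in the class named by latest_class, looked up per row
    col_pos = months.columns.get_indexer(df['latest_class'])
    latest_months = months.to_numpy()[np.arange(len(df)), col_pos]
    return df[not_other & single_class & (latest_months >= min_months)]

# A few rows taken from the question's data
sample = pd.DataFrame({
    'Base':  [0, 12, 0, 0, 0],
    'G':     [2, 0, 0, 12, 0],
    'Pref':  [0, 0, 0, 0, 8],
    'Sier':  [0, 0, 0, 0, 5],
    'Val':   [12, 0, 0, 0, 0],
    'Other': [0, 0, 12, 0, 0],
    'latest_class': ['Val', 'Base', 'Other', 'G', 'Sier'],
    'd_id': [38, 39, 45, 47, 53],
})

result = filter_devices(sample)
print(result)
```

On this sample only d_id 39 and 47 should remain: 38 and 53 have two classes, and 45 is in Other.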

Additionally, the final filter could be defined as a function and applied with the DataFrame's apply method (with axis=1) instead of looping with iterrows.
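
A minimal sketch of that apply-based variant, assuming the same column layout as in the question. The function name months_in_latest_class is made up, and the row with d_id 99 is a hypothetical device added only so that the filter has something to drop:

```python
import pandas as pd

def months_in_latest_class(row):
    # Look up the month count in the column named by this row's latest_class
    return row[row['latest_class']]

demo = pd.DataFrame({
    'Base': [12, 0, 0],
    'G': [0, 12, 2],
    'latest_class': ['Base', 'G', 'G'],
    'd_id': [39, 47, 99],  # 99 is hypothetical, not from the question
})

# axis=1 passes each row to the function as a Series
keep = demo.apply(months_in_latest_class, axis=1) >= 3
filtered = demo[keep]
print(filtered)
```

This reads more cleanly than the loop, but apply with axis=1 still calls the function once per row, so it is mainly a readability change rather than a speed-up.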

1 Comment

Awesome!! Thanks a lot!!
