I am looking for a way to get the class labels from my dataframe, which contains rows of features.

For instance, in this example:

import pandas as pd

df = pd.DataFrame([
    ['1', 'a', 'bb', '0'],
    ['1', 'a', 'cc', '0'],
    ['2', 'a', 'dd', '1'],
    ['2', 'a', 'ee', '1'],
    ['3', 'a', 'ff', '2'],
    ['3', 'a', 'gg', '2'],
    ['3', 'a', 'hh', '2']], columns=['ID', 'name', 'type', 'class'])

df 
    ID  name    type class
0   1    a      bb      0
1   1    a      cc      0
2   2    a      dd      1
3   2    a      ee      1
4   3    a      ff      2
5   3    a      gg      2
6   3    a      hh      2

My class array should be (i.e. for each ID the class value should be picked once):

class
array([0., 1., 2.,])

EDIT

df['class'].values produces array(['0', '0', '1', '1', '2', '2', '2'], dtype=object)

Expected answer:

I want array([0, 1, 2])

  • Which part are you having trouble with? - pandas.pydata.org/docs/user_guide/index.html Commented Sep 28, 2020 at 22:47
  • df.drop_duplicates('ID')['class'] Commented Sep 28, 2020 at 22:52
  • As created, the dataframe contains strings in the column. That's what values is giving you. Commented Sep 28, 2020 at 22:52
  • @wwii exactly, thank you. Commented Sep 28, 2020 at 22:55

2 Answers

You can use groupby + unique() as follows:

>>> df.groupby('ID')['class'].unique().astype(int).to_numpy()
array([0, 1, 2])
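
A variant of the same idea, sketched under the assumption that every row of a given ID carries the same class: taking the first class per group yields plain strings rather than per-group arrays, so the cast to int is an ordinary element-wise conversion. The variable name labels is only illustrative.

```python
import pandas as pd

df = pd.DataFrame([
    ['1', 'a', 'bb', '0'],
    ['1', 'a', 'cc', '0'],
    ['2', 'a', 'dd', '1'],
    ['2', 'a', 'ee', '1'],
    ['3', 'a', 'ff', '2'],
    ['3', 'a', 'gg', '2'],
    ['3', 'a', 'hh', '2']], columns=['ID', 'name', 'type', 'class'])

# First class seen for each ID (assumes one class per ID), then str -> int.
labels = df.groupby('ID')['class'].first().astype(int).to_numpy()
print(labels)
```

Note that groupby sorts the group keys by default, so the result follows the sorted IDs rather than their order of appearance.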

For the given dataframe, you can also use the following methods:

Solution 1: Series.unique():

>>> df['class'].unique()
array(['0', '1', '2'], dtype=object)

# in case you want int outputs
>>> df['class'].unique().astype(int)
array([0, 1, 2])

Solution 2: value_counts():

>>> df['class'].value_counts(ascending=True).index.to_numpy().astype(int)
array([0, 1, 2])
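
One caveat with Solution 2: value_counts() orders its index by frequency, not by label, so the array comes out sorted only when the counts happen to line up, as they do in the example above. A minimal sketch with hypothetical data (the Series s below is not from the question), alongside an explicit sort for when a fixed label order is needed:

```python
import numpy as np
import pandas as pd

# Hypothetical column in which the most frequent class is the smallest label.
s = pd.Series(['2', '0', '0', '0', '1', '1'], name='class')

# Ordered by ascending frequency: '2' (1 row), '1' (2 rows), '0' (3 rows).
by_count = s.value_counts(ascending=True).index.to_numpy().astype(int)

# Ordered by label value, independent of how often each class occurs.
sorted_labels = np.sort(s.astype(int).unique())

print(by_count)
print(sorted_labels)
```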

1 Comment

The issue with this answer is that if another ID has a class that was already listed, that value won't be included (say ID 10 follows with class 0: it will not appear in the intended array, since the class already exists).

If multiple IDs can share the same class, select the 'ID' and 'class' columns, drop duplicates, and then take the class column. Otherwise, simply use unique as suggested in the other answer (you can of course convert this result to ints too):

df[['ID','class']].drop_duplicates()['class'].values
#['0' '1' '2']

or similar to @wii's suggestion in comments:

df.drop_duplicates('ID')['class'].values
#['0' '1' '2']
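
To make the difference concrete, here is a small sketch on hypothetical data in which ID 10 reuses class 0 (the extra ID is an assumption for illustration, not part of the question's dataframe). Dropping duplicates by ID keeps one class per ID, while unique() collapses the repeated class:

```python
import pandas as pd

# Hypothetical frame: IDs '1' and '10' both belong to class '0'.
df = pd.DataFrame(
    [['1', '0'], ['1', '0'], ['2', '1'], ['10', '0']],
    columns=['ID', 'class'],
)

# One entry per ID, in order of first appearance, converted to int.
per_id = df.drop_duplicates('ID')['class'].astype(int).to_numpy()

# Distinct class values only; the repeated class 0 appears once.
distinct = df['class'].astype(int).unique()

print(per_id)
print(distinct)
```

Which of the two is right depends on whether you want one label per ID or the set of distinct labels.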
