I'm building machine learning software to segment the pages of large packages of data. To analyze the model, I'm automating the process of verifying the predicted output against the labeled target output. To do this, I've created a pandas dataframe that looks like this:

page_num    file    predicted    label
--------------------------------------
1           file1       0          0
1           file1       0          0
2           file1       0          0
2           file1       0          0
2           file1       0          0
3           file1       1          1
3           file1       1          1
3           file1       1          1
1           file2       0          0
1           file2       0          0
1           file2       0          0
2           file2       2          2
2           file2       2          2
...
n           filen       0          0

There are other columns too that I left out for brevity (total of 13 columns, not including index). I'm relatively new to pandas, but I'm basically looking to get the dataframe to look like this:

page_num    file    predicted    label
--------------------------------------
1           file1       0          0
2           file1       0          0
3           file1       1          1
1           file2       0          0
2           file2       2          2
...
n           filen       0          0

So I can verify that the values in predicted == label for each page in each file.

I've tried a couple of things:

First, I tried df[df.groupby(['file', 'page_num'])], but that yielded the error 'ValueError: cannot copy sequence with size 489 to array axis with dimension 13'.

I checked df.groupby(['file', 'page_num']).groups and noted that the groups are what I want: files and their pages. But I can't use the DataFrame where function, and I don't think apply is what I want either.

I've also tried to just iterate through the groups and check the dataframe, but I get a lot of False outcomes. The Jupyter notebook output looks like:

for group in df.groupby(['file', 'page_num']).groups:
    temp_df = df[df.file == group[0], df.page_num == group[1]].reset_index(drop=True)
    print(temp_df.label)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-39-b34f0ce41321> in <module>
      1 for group in df.groupby(['file', 'page_num']).groups:
----> 2     temp_df = df[df.file == group[0], df.page_num == group[1]].reset_index(drop=True)
      3     print(temp_df.label)

~\AppData\Local\Continuum\anaconda3\envs\base\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2925             if self.columns.nlevels > 1:
   2926                 return self._getitem_multilevel(key)
-> 2927             indexer = self.columns.get_loc(key)
   2928             if is_integer(indexer):
   2929                 indexer = [indexer]

~\AppData\Local\Continuum\anaconda3\envs\base\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2655                                  'backfill or nearest lookups')
   2656             try:
-> 2657                 return self._engine.get_loc(key)
   2658             except KeyError:
   2659                 return self._engine.get_loc(self._maybe_cast_indexer(key))

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(0           True
1           True
2           True
...
2028663    False
2028664    False
Name: file, Length: 2028665, dtype: bool, 0           True
1           True
2           True
...
2028663    False
2028664    False
Name: page_num, Length: 2028665, dtype: bool)' is an invalid key

I don't really understand what is happening, because every time I change something I get a different ValueError or TypeError or something of the sort. I'd expect to be able to iterate through the groups yielded by df.groupby(['file', 'page_num']).groups and check that my main dataframe df has matching values in label and predicted where df['file'] == group[0] and df['page_num'] == group[1].

I'm very new to pandas, so I'm probably missing something minor. Any help is appreciated. Thank you!
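
For reference, the immediate TypeError comes from the comma in df[mask1, mask2]: pandas treats the tuple of boolean Series as a single column key. A minimal sketch with made-up stand-in data shows two working alternatives — combining the masks with &, or iterating the groups directly:

```python
import pandas as pd

# Made-up stand-in for the real dataframe (which has 13 columns).
df = pd.DataFrame({
    "page_num":  [1, 1, 2, 2, 3, 1],
    "file":      ["file1", "file1", "file1", "file1", "file1", "file2"],
    "predicted": [0, 0, 0, 0, 1, 0],
    "label":     [0, 0, 0, 0, 1, 0],
})

# Combine the two masks with & (each wrapped in parentheses), not a comma:
temp_df = df[(df.file == "file1") & (df.page_num == 2)].reset_index(drop=True)
print(temp_df.label.tolist())  # [0, 0]

# Or skip the manual masking and iterate the groupby object directly,
# which yields the group key and the matching sub-frame:
for (file, page), group in df.groupby(["file", "page_num"]):
    print(file, page, (group.predicted == group.label).all())
```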

  • so you want to sort by file and then by page? Commented Aug 12, 2019 at 14:48
  • Seems like you also want to drop_duplicates()? Commented Aug 12, 2019 at 14:49
  • @Yuca Yes, I want to sort by file and then page. Then compare two separate columns after the fact and do something with the resulting comparison. Commented Aug 12, 2019 at 14:51
  • it's unclear on how you handle the fact that you have 3 rows for file1 and page 2. That's why ALollz suggests the dups. What's the logic there? is removing duplicates enough? Commented Aug 12, 2019 at 14:53
  • this should give you a sense of how to start df1.assign(correct=df1.predicted==df1.label).groupby(['file', 'page_num']).correct.count() Commented Aug 12, 2019 at 15:04
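
The last comment's assign-plus-groupby idea can be expanded into a per-page accuracy check. A hedged sketch with made-up data (the real frame is much larger): taking the mean of the boolean flag per (file, page) group gives the fraction of matching rows on each page.

```python
import pandas as pd

# Made-up stand-in data.
df = pd.DataFrame({
    "page_num":  [1, 1, 2, 2, 3],
    "file":      ["file1"] * 5,
    "predicted": [0, 0, 0, 1, 1],
    "label":     [0, 0, 0, 0, 1],
})

# Flag each row as correct, then aggregate per (file, page):
# mean() of a boolean column is the fraction of matching rows.
per_page = (
    df.assign(correct=df.predicted == df.label)
      .groupby(["file", "page_num"])
      .correct.mean()
)
print(per_page)
```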

1 Answer

drop_duplicates deletes the duplicate rows, and sort_values sorts first by file name and then by page_num:

df.drop_duplicates().sort_values(['file', 'page_num'], ascending=True)

Out:

    page_num    file    predicted   label
0           1   file1           0       0
2           2   file1           0       0
5           3   file1           1       1
8           1   file2           0       0
11          2   file2           2       2

It is interesting to note that df.drop_duplicates().sort_values(['page_num', 'file'], ascending=True) would not produce the same result, since it orders first by page_num and then by file.
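
Putting this together with the original goal — and with the np.where check the asker mentions in the comment below — a runnable sketch on made-up data mirroring the question's sample:

```python
import numpy as np
import pandas as pd

# Made-up data mirroring the question's sample.
df = pd.DataFrame({
    "page_num":  [1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2],
    "file":      ["file1"] * 8 + ["file2"] * 5,
    "predicted": [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2],
    "label":     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2],
})

# One row per (file, page), ordered by file and then page.
dedup = (
    df.drop_duplicates()
      .sort_values(["file", "page_num"], ascending=True)
      .reset_index(drop=True)
)

# np.where flags each page: 1 where predicted matches label, else 0.
dedup["match"] = np.where(dedup.predicted == dedup.label, 1, 0)
print(dedup)
```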


1 Comment

Thanks! I dropped all of the extra columns and then stored the result of this into a new dataframe, and that worked wonderfully with np.where to do a lot of my analysis. Thank you again!
