I'm building machine learning software that segments the pages of large packages of data. I'm trying to analyze the model by automating the comparison of its predicted output against the labeled target output. To do this, I've created a pandas dataframe that looks like this:
page_num   file    predicted   label
--------------------------------------
1          file1   0           0
1          file1   0           0
2          file1   0           0
2          file1   0           0
2          file1   0           0
3          file1   1           1
3          file1   1           1
3          file1   1           1
1          file2   0           0
1          file2   0           0
1          file2   0           0
2          file2   2           2
2          file2   2           2
...
n          filen   0           0
There are other columns too that I left out for brevity (13 columns in total, not counting the index). I'm relatively new to pandas, but I'm essentially looking to collapse the dataframe so it looks like this:
page_num   file    predicted   label
--------------------------------------
1          file1   0           0
2          file1   0           0
3          file1   1           1
1          file2   0           0
2          file2   2           2
...
n          filen   0           0
That way I can verify that predicted == label for each page in each file.
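For reference, a small frame with the same layout (only the four columns shown here; the real one has 13 columns and about 2 million rows) can be built like this:

import pandas as pd

# Toy frame mirroring the layout above; the real frame is much larger
df = pd.DataFrame({
    'page_num':  [1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2],
    'file':      ['file1'] * 8 + ['file2'] * 5,
    'predicted': [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2],
    'label':     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 2, 2],
})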
I've tried a couple of things:
First, I tried df[df.groupby(['file', 'page_num'])], but that yielded the error 'ValueError: cannot copy sequence with size 489 to array axis with dimension 13'.
I checked df.groupby(['file', 'page_num']).groups and noted that the groups are exactly what I want: the files and their pages. But I can't figure out how to use the DataFrame where function with them, and I don't think apply is what I want either.
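On the small example frame above, .groups gives keys like ('file1', 1) and ('file1', 2) mapped to the row index labels in each group, which is why I thought I was close:

# .groups is a dict-like mapping of (file, page_num) tuples to index labels,
# e.g. ('file1', 1) -> [0, 1], ('file1', 2) -> [2, 3, 4], ('file2', 2) -> [11, 12]
groups = df.groupby(['file', 'page_num']).groups
print(list(groups.keys()))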
I've also tried simply iterating through the groups and filtering the dataframe against each one, but I get a lot of False outcomes and a TypeError. The Jupyter notebook output looks like this:
for group in df.groupby(['file', 'page_num']).groups:
    temp_df = df[df.file == group[0], df.page_num == group[1]].reset_index(drop=True)
    print(temp_df.label)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-39-b34f0ce41321> in <module>
1 for group in df.groupby(['file', 'page_num']).groups:
----> 2 temp_df = df[df.file == group[0], df.page_num == group[1]].reset_index(drop=True)
3 print(temp_df.label)
~\AppData\Local\Continuum\anaconda3\envs\base\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2925 if self.columns.nlevels > 1:
2926 return self._getitem_multilevel(key)
-> 2927 indexer = self.columns.get_loc(key)
2928 if is_integer(indexer):
2929 indexer = [indexer]
~\AppData\Local\Continuum\anaconda3\envs\base\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2655 'backfill or nearest lookups')
2656 try:
-> 2657 return self._engine.get_loc(key)
2658 except KeyError:
2659 return self._engine.get_loc(self._maybe_cast_indexer(key))
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
TypeError: '(0          True
1          True
2          True
...
2028663    False
2028664    False
Name: file, Length: 2028665, dtype: bool, 0          True
1          True
2          True
...
2028663    False
2028664    False
Name: page_num, Length: 2028665, dtype: bool)' is an invalid key
I don't really understand what is happening, because every time I change something I get a different ValueError or TypeError or something of the sort. I'd expect to be able to iterate through the groups yielded by df.groupby(['file', 'page_num']).groups and check that my main dataframe df has matching values in label and predicted wherever df['file'] == group[0] and df['page_num'] == group[1].
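Spelled out, the check I have in mind looks something like the sketch below; I suspect the comma in df[cond1, cond2] should really be two boolean masks combined with &, but I don't know whether looping over the groups like this is even the idiomatic pandas way, which is really my question:

# What I'm trying to express: for every (file, page_num) group, check predicted == label.
# I'm guessing the two conditions need to be combined with & rather than a comma.
for file_name, page in df.groupby(['file', 'page_num']).groups:
    temp_df = df[(df.file == file_name) & (df.page_num == page)].reset_index(drop=True)
    print(file_name, page, (temp_df.predicted == temp_df.label).all())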
I'm very new to pandas, so I'm probably missing something minor. Any help is appreciated. Thank you!