Pandas Dataframes

Question

I am still new to Python and need help setting up a dataframe. I have a data set with 5 competitors, 3 possible sentiments and a number called Adjusted Count. Here is a sample data file: https://www.sendspace.com/file/0xfh5s

I need to create a dataframe which has competitors by sentiment and sum of Adjusted Count.

It should look like this:

            Positive    Balanced    Negative    Grand Total
A           335         208         33          576
B           346         170         97          613
C           573         90          454         1117
D           1955        1456        1762        5173
E           253         0           189         442
Grand Total 3462        1924        2535        7921

But instead this is what I'm currently getting:

            Positive    Balanced    Negative    Total
A           0           0           0           0
B           0           0           0           0
C           0           0           0           0
D           0           0           0           0
E           0           0           0           0

Here is my current code:

PATH_DATA = '...filename.xlsx'

df1 = pd.read_excel(PATH_DATA, 'A')
df2 = pd.read_excel(PATH_DATA, 'B')
df3 = pd.read_excel(PATH_DATA, 'C')
df4 = pd.read_excel(PATH_DATA, 'D')
df5 = pd.read_excel(PATH_DATA, 'E')
df_ALL = pd.concat([df1, df2, df3, df4, df5]) 

SENTIMENT_DIMENTION = 'Sentiment simplified'
SENTIMENT_ORDER = ['Positive', 'Balanced', 'Negative']
COMPETITOR_DIMENTION = 'Competitor Name'
COMPETITOR_ORDER = ['A', 'B', 'C', 'D', 'E']
AC_DIMENTION = 'Adjusted Count'

class Report:
    def ACbysentiment(self, vdimention, hdimention, indexlist):
        data = self.data
        data = data.groupby([vdimention, hdimention]).size().unstack(1)
        #data = data.groupby('Adjusted Count')[sumdimention].agg('sum')
        data = data.reindex(columns=indexlist)
        data['Total'] = data.sum(axis=1)
        data.fillna(0, inplace = True)
        data = data.sort_values('Total', ascending=True)
        return data

df_ALL = df_ALL.replace(np.nan, NAN_VAL, regex=True)
S1 = Report(df_ALL)

S1_Competitor_Sentiment_AC = S1.ACbysentiment(COMPETITOR_DIMENTION, AC_DIMENTION, SENTIMENT_ORDER)
S1_Competitor_Sentiment_AC = pd.DataFrame(S1_Competitor_Sentiment_AC, index = COMPETITOR_ORDER)

Any help you could give would be greatly appreciated!

Thanks

Here is a sample of what df_ALL could look like:

Competitor  Sentiment simplified    Adjusted Count
A           Positive                50
A           Balanced                40
A           Negative                30
A           Positive                10
B           Balanced                50
B           Negative                40
B           Positive                30
B           Balanced                10
B           Negative                50
B           Positive                40
C           Balanced                30
C           Negative                10
C           Positive                50
C           Balanced                40
C           Negative                30
D           Positive                10
D           Balanced                50
D           Negative                40
E           Positive                30
E           Balanced                10
E           Negative                50
E           Positive                40
E           Balanced                30
E           Negative                10
E           Positive                50

Please correct your code indentation. In addition, show us an excerpt of self.data, since we need a minimal reproducible example.
– jpp, Commented Jun 13, 2018 at 9:19
Please could you include the code for your dataframe, that is, the code that reads in your data. You can refer to this link: stackoverflow.com/help/mcve
– Bryce Ramgovind, Commented Jun 13, 2018 at 9:23
Hi, I hope the edits help, although I am unsure how to post a sample of the data.
– Plota, Commented Jun 13, 2018 at 9:31
Please do let me know if there is anything else you need to help me with this problem! @BryceRamgovind Thanks
– Plota, Commented Jun 13, 2018 at 10:23
@Plota Thanks, that does help; however, please provide a sample layout of the Excel file so that I can see the data. It can be fictitious.
– Bryce Ramgovind, Commented Jun 13, 2018 at 11:38

iamjoebloggs · Accepted Answer · 2018-06-14 04:57:54Z

GroupBy is your friend here. I didn't find any links to the Excel file, so I am assuming that you have one sheet per competitor with the sentiment information (I guess 1's and 0's). The first thing you need to do is add the competitor name. You can do this in two ways: either add it in the Excel sheet itself, or add it after you have read the sheet, like this:

df1 = pd.read_excel(PATH_DATA, 'A')
df1['competitor'] = 'A'

This will add a column called competitor with value 'A'. In your case the competitor value is essential because you will aggregate on this value. Once you have done this for all the data frames only then you should concatenate them:

df_ALL = pd.concat([df1, df2, df3, df4, df5])
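
The tag-then-concatenate step can be sketched with small in-memory frames standing in for the Excel sheets. The sheet contents and the names `sheets` and `tagged` are made up for illustration; only the column names 'Sentiment simplified' and 'Adjusted Count' come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the per-competitor sheets that would
# normally come from pd.read_excel(PATH_DATA, sheet_name).
sheets = {
    "A": pd.DataFrame({"Sentiment simplified": ["Positive", "Negative"],
                       "Adjusted Count": [50, 30]}),
    "B": pd.DataFrame({"Sentiment simplified": ["Balanced"],
                       "Adjusted Count": [10]}),
}

# Tag every frame with its competitor before concatenating, so each
# row still records which sheet it came from.
tagged = [frame.assign(competitor=name) for name, frame in sheets.items()]
df_ALL = pd.concat(tagged, ignore_index=True)
```

Looping over a dict keeps the tagging in one place instead of repeating the two lines for every sheet.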

After this you can apply groupby on the dataframe:

df_grouped = df_ALL.groupby(by='competitor', as_index=False).sum()
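
As a minimal sketch of this aggregation on made-up numbers: selecting the 'Adjusted Count' column before calling sum() is a choice made here, not part of the line above, so that text columns such as the sentiment are left out of the sum.

```python
import pandas as pd

# Illustrative data only; the column names follow the question.
df_ALL = pd.DataFrame({
    "competitor": ["A", "A", "B", "B"],
    "Adjusted Count": [50, 10, 30, 40],
})

# as_index=False keeps 'competitor' as a regular column in the result.
df_grouped = df_ALL.groupby("competitor", as_index=False)["Adjusted Count"].sum()
```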

EDIT: Class based solution

class Report:
    def __init__(self, dataframe):
        self.data = dataframe

    def ACbysentiment(self, vdimention, hdimention):
        data = self.data
        # Sum the remaining columns for each (competitor, sentiment) pair
        data = data.groupby(by=[vdimention, hdimention], as_index=False).sum()
        data = data.pivot(index='Competitor', columns='Sentiment simplified', values='Adjusted Count')
        # Pairs that never occur in the data come out of the pivot as NaN
        data = data.fillna(0)
        data['Total'] = data.sum(axis=1)
        data = data.sort_values('Total', ascending=True)
        return data

Note: The __init__ method is required to pass the dataframe to the class. Without it, calling Report(df_ALL) raises a TypeError, so your current code throws an exception.
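
To illustrate the point about the constructor, here is a small sketch with two throwaway classes. Both class names are hypothetical and a plain list stands in for the dataframe.

```python
class ReportWithoutInit:
    """No __init__, so the class accepts no constructor arguments."""

    def describe(self):
        return self.data


class ReportWithInit:
    """Stores the object passed to the constructor on self.data."""

    def __init__(self, dataframe):
        self.data = dataframe


# Passing an argument to a class without __init__ raises TypeError.
try:
    ReportWithoutInit([1, 2, 3])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

report = ReportWithInit([1, 2, 3])
```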

Also note that I have used groupby with as_index=False. This gives the SQL-style group-by you are looking for: the grouping keys stay as ordinary columns instead of becoming the index. So you can replace:

data = data.groupby([vdimention, hdimention]).size().unstack(1)

with

data = data.groupby(by=[vdimention, hdimention],as_index=False).sum()
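
The difference between the two calls can be seen on the four rows for competitor A from the question's sample. This is only a sketch; selecting 'Adjusted Count' before summing is an addition made here to keep the result to a single column.

```python
import pandas as pd

# The four rows for competitor A from the sample in the question.
df = pd.DataFrame({
    "Competitor": ["A", "A", "A", "A"],
    "Sentiment simplified": ["Positive", "Balanced", "Negative", "Positive"],
    "Adjusted Count": [50, 40, 30, 10],
})
keys = ["Competitor", "Sentiment simplified"]

# size() counts how many rows fall into each group.
counts = df.groupby(keys).size()

# sum() adds up the values in each group.
totals = df.groupby(keys)["Adjusted Count"].sum()
```

For the (A, Positive) group, size() gives 2 rows while sum() gives 50 + 10 = 60, which is the figure the report needs.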

Finally, if I understand correctly, you want to (in Excel terms) pivot the data, so for that we use the DataFrame.pivot() method:

data = data.pivot(index='Competitor',columns = 'Sentiment simplified',values='Adjusted Count')
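
A small sketch of what pivot does with already-aggregated data, using invented values. It also shows that a competitor/sentiment pair missing from the data comes out as NaN, so calling fillna(0) afterwards may still be worthwhile.

```python
import pandas as pd

# Already-aggregated, illustrative data: E has no 'Balanced' row.
grouped = pd.DataFrame({
    "Competitor": ["D", "D", "E"],
    "Sentiment simplified": ["Positive", "Balanced", "Positive"],
    "Adjusted Count": [10, 50, 30],
})

# One row per competitor, one column per sentiment.
raw = grouped.pivot(index="Competitor",
                    columns="Sentiment simplified",
                    values="Adjusted Count")

# The missing (E, Balanced) cell is NaN until it is filled.
wide = raw.fillna(0)
```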

I also think that you don't need the various 'dimensions' that you have defined (unless they are used elsewhere in your code). You just need two:

COMPETITOR_DIMENTION = 'Competitor' # just the name of the column,case sensitive
AC_DIMENTION = 'Sentiment simplified'

and then:

S1_Competitor_Sentiment_AC = S1.ACbysentiment(COMPETITOR_DIMENTION, AC_DIMENTION)

This will return a pandas dataframe with your desired output.
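
The table in the question also has a Grand Total row and column. One possible way to get both, offered as an alternative sketch rather than as part of the method above, is pd.pivot_table with margins=True. The data below is the first six rows of the question's sample, and the final column reordering is an assumption about the desired layout.

```python
import pandas as pd

# First six rows of the sample shown in the question.
df = pd.DataFrame({
    "Competitor": ["A", "A", "A", "A", "B", "B"],
    "Sentiment simplified": ["Positive", "Balanced", "Negative",
                             "Positive", "Balanced", "Negative"],
    "Adjusted Count": [50, 40, 30, 10, 50, 40],
})

# Aggregate and pivot in one call; margins=True appends the totals.
table = pd.pivot_table(
    df,
    index="Competitor",
    columns="Sentiment simplified",
    values="Adjusted Count",
    aggfunc="sum",
    fill_value=0,
    margins=True,
    margins_name="Grand Total",
)

# Put the columns in the order used in the question.
table = table[["Positive", "Balanced", "Negative", "Grand Total"]]
```

Here fill_value=0 covers sentiments a competitor never received, such as B's missing 'Positive' rows in this excerpt.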

Hope that helps!

sendspace.com/file/0xfh5s Here is a link to a sample data file. You will see there is already a column for competitors. The issue is with ACbysentiment within the Report class. If you have any ideas on how to fix this, please do let me know!
