
I'm new to Python and I'm having a lot of trouble with this problem, which I have to solve for work.

Some background about the Excel file: there are 3 columns and about 100 rows. The first column (col1) contains either A or B. The second column (col2) contains an integer from 1 to 10. The third column (col3) contains an arbitrary decimal number.

I want the program to parse the data. Many rows share the same (col1, col2) combination. For example, (A, 1) can appear on rows 1, 5, 20, 98, etc., each time with a different number in col3. For each combination, I want the average of all its col3 values.

The output should look something like this:

A, 1 = avg 4.32
A, 2 = avg 7.23
A, 3 = avg -9.12
etc etc (until number 10)
B, 1 = avg 3.76
B, 2 = avg -8.12
B, 3 = avg 1.56
etc etc (until number 10)

The output doesn't have to be in alphabetical and numerical order; it can print the combos in whatever order it finds them. Here is my code so far. For some reason it doesn't print ALL the combos, only 3.

import xlrd #import package

#opening workbook and reading first sheet
book = xlrd.open_workbook('trend.xls')
sheet = book.sheet_by_index(0)

#dictionary to hold unique combos
unique_combinations = {}

#looping through data
for row_index in range(sheet.nrows):
    #declaring what group equals to what row
    col1 = sheet.cell(row_index, 0)
    col2 = sheet.cell(row_index, 1)
    col3 = sheet.cell(row_index, 2)

    unique_combo = (col1.value, col2.value)

    if unique_combinations.has_key(unique_combo):
        unique_combinations[unique_combo].append(col3.value)
    else:
        unique_combinations[unique_combo] = [col3.value]

for k in unique_combinations.keys():
    l = unique_combinations[k]
    average = sum(l) / len(l)
    print '%s: %s Mean = %s' % (k[0], k[1], average)

Essentially, there are 2 groups (A and B), each group contains up to 10 subgroups (1 to 10), and for each subgroup I want the average of the numbers that belong to it.

Please help! Thank you so much in advance.

SAMPLE OF EXCEL FILE:

col1 | col2 | col3
A    |   1  | 3.12
B    |   9  | 4.12
B    |   2  | 2.43
A    |   1  | 9.54
B    |   8  | 2.43
A    |   2  | 1.08

So the program will see that the first combo it comes across is A, 1 and store 3.12 in a list. It keeps going, storing the values for the other combos, until it comes across a duplicate, which is the fourth row, and stores that value as well. At the end, the output will show A, 1 = avg ((3.12 + 9.54) / 2). This example only shows the A, 1 combo. In the real file there are still only 2 groups (as in the example), but col2 can range from 1 to 10, and there will be many duplicates.

  • Does it need to be done in Python? Excel is quite capable of doing this all by itself... Commented Mar 12, 2013 at 22:15
  • Can you post a small sample in a tabular format and add the output you would like to get. Commented Mar 12, 2013 at 22:15
  • Honestly, I said the same thing. Excel could just do it all on its own. But my boss wants a program for it. I figured he was a noob in it.. But I've always used C and C++ and it seems more tedious to open an excel file via those languages. So I opted for python. I'll edit the post and place a sample of the excel. Commented Mar 12, 2013 at 22:20
  • @chakolatemilk: Fair play. I didn't realise Excel was open source. Commented Mar 12, 2013 at 22:42
  • Does the real code ignore the column headers? Because with your sample, when your final for loop hits the unique_combinations[(u'col1', u'col2')] entry, the value of which is [u'col3'], the average calculation will raise an exception. Commented Mar 12, 2013 at 22:50

2 Answers

This suggestion is more "how to work out what's going on" and will be easier to read in an answer than a comment.

I think it's worth adding debug prints and exception handling.

I tried the sample with OpenOffice and Python 2.7. I could reproduce your symptoms when an exception occurred during the final loop and stderr was being swallowed in my test run, e.g. python test.py 2>nul

So I suggest you try this:


    import xlrd
    book = xlrd.open_workbook('trend.xls')
    sheet = book.sheet_by_index(0)
    unique_combinations = {}
    for row_index in range(sheet.nrows):
        col1 = sheet.cell(row_index, 0)
        col2 = sheet.cell(row_index, 1)
        col3 = sheet.cell(row_index, 2)

        unique_combo = (col1.value, col2.value)
        if unique_combinations.has_key(unique_combo):
            print 'Update: %r = %r' % (unique_combo, col3.value)
            unique_combinations[unique_combo].append(col3.value)
        else:
            print 'Add: %r = %r' % (unique_combo, col3.value)
            unique_combinations[unique_combo] = [col3.value]

    for k in unique_combinations.keys():
        l = unique_combinations[k]
        try:
          average = sum(l) / len(l)
          print '%s: %s Mean = %s' % (k[0], k[1], average)
        except Exception, e:
          print 'Ignoring entry[%r]==%r due to exception %r' % (k, l, e)

That should help you figure out your 'weird behaviour'.
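
Once the debug output confirms that the header row is the culprit, the grouping loop can skip it. The following is only a minimal sketch in Python 3, not the original script: the `rows` list stands in for the values read through xlrd (it is copied from the sample in the question), and `group_means` is a hypothetical helper name. It assumes that any row whose third value is not a number can safely be ignored.

```python
from collections import defaultdict

# Stand-in for the spreadsheet contents (assumption: copied from the
# question's sample, including the header row that causes the exception).
rows = [
    ("col1", "col2", "col3"),
    ("A", 1, 3.12),
    ("B", 9, 4.12),
    ("B", 2, 2.43),
    ("A", 1, 9.54),
    ("B", 8, 2.43),
    ("A", 2, 1.08),
]


def group_means(rows):
    """Return {(col1, col2): mean of col3}, skipping non-numeric rows."""
    groups = defaultdict(list)
    for col1, col2, col3 in rows:
        # The header row holds text in col3, so it is skipped here.
        if not isinstance(col3, (int, float)):
            continue
        groups[(col1, col2)].append(col3)
    return {key: sum(values) / len(values) for key, values in groups.items()}


means = group_means(rows)
for (col1, col2), mean in sorted(means.items()):
    print("%s, %s = avg %.2f" % (col1, col2, mean))
```

With the real file, the same function could be fed tuples built from `sheet.cell(row_index, n).value`, as in the loop above.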


Give pandas a try:

In [1]: import pandas as pd

In [2]: xls = pd.ExcelFile('test.xls')
   ...: df = xls.parse('Sheet1', header=None)
   ...: 

In [3]: df
Out[3]: 
   0  1     2
0  A  1  3.12
1  B  9  4.12
2  B  2  2.43
3  A  1  9.54
4  B  8  2.43
5  A  2  1.08

In [4]: groups = df.groupby([0,1])

In [5]: for k, g in groups:
   ...:     print k, g[2].mean()
   ...:     
(u'A', 1.0) 6.33  # your example (3.12 + 9.54) / 2
(u'A', 2.0) 1.08
(u'B', 2.0) 2.43
(u'B', 8.0) 2.43
(u'B', 9.0) 4.12

If you want all your means as a list, the complete script would be:

import pandas as pd
df = pd.ExcelFile('test.xls').parse('Sheet1', header=None)
print [g[2].mean() for _, g in df.groupby([0,1])]
# out: [6.3300000000000001, 1.0800000000000001, 2.4300000000000002, 2.4300000000000002, 4.1200000000000001]
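
For readers on Python 3, here is a rough equivalent of the session above. It is a sketch under assumptions: the DataFrame is built in memory from the question's sample rather than read from 'test.xls', so the column labels 0, 1, 2 are the pandas defaults, and the grouped column is selected before taking the mean.

```python
import pandas as pd

# Assumption: sample rows from the question, without the header row.
df = pd.DataFrame(
    [
        ["A", 1, 3.12],
        ["B", 9, 4.12],
        ["B", 2, 2.43],
        ["A", 1, 9.54],
        ["B", 8, 2.43],
        ["A", 2, 1.08],
    ]
)

# Group on the first two columns and average the third one.
means = df.groupby([0, 1])[2].mean()

for (col1, col2), mean in means.items():
    print("%s, %s = avg %.2f" % (col1, col2, mean))
```

The result is a Series indexed by the (col1, col2) pairs, so a single average can be looked up with `means.loc[("A", 1)]`.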

3 Comments

I don't want to have to insert the cell's values 1 by 1 in the python script.. There are over 100 rows.
@chakolatemilk -- what do you mean? pandas lets you both read from and write to Excel files :S
@chakolatemilk -- I think you can form groups and calculate means in about 3 lines of code :) -- not too much to write...
