2

I am trying to read https://www.whatdotheyknow.com/request/193811/response/480664/attach/3/GCSE%20IGCSE%20results%20v3.xlsx using pandas.

Having saved it my script is

import sys
import pandas as pd
inputfile = sys.argv[1]
xl = pd.ExcelFile(inputfile)
#    print xl.sheet_names
df = xl.parse(xl.sheet_names[0])
print df.head()

However this does not seem to process the headers properly as it gives

  GCSE and IGCSE1 results2,3 in selected subjects4 of pupils at the end of key stage 4 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 Unnamed: 5 Unnamed: 6 Unnamed: 7 Unnamed: 8 Unnamed: 9 Unnamed: 10
0                              Year: 2010/11 (Final)                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
1                                  Coverage: England                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
2                                                NaN                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
3  1. Includes International GCSE, Cambridge Inte...                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN
4  2. Includes attempts and achievements by these...                                          NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN        NaN         NaN

All of this should be treated as comments.

If you load the spreadsheet into libreoffice, for example, you can see that the column headings are correctly parsed and appear in row 15 with drop down menus to let you select the items you want.

How can you get pandas to automatically detect where the column headers are just as libreoffice does?

1 Answer 1

3

pandas is (are?) processing the file correctly, and exactly the way you asked it (them?) to. You didn't specify a header value, which means that it defaults to picking up the column names from the 0th row. The first few rows of cells aren't comments in some fundamental way, they're just not cells you're interested in.

Simply tell parse you want to skip some rows:

>>> xl = pd.ExcelFile("GCSE IGCSE results v3.xlsx")
>>> df = xl.parse(xl.sheet_names[0], skiprows=14)
>>> df.columns
Index([u'Local Authority Number', u'Local Authority Name', u'Local Authority Establishment Number', u'Unique Reference Number', u'School Name', u'Town', u'Number of pupils at the end of key stage 4', u'Number of pupils attempting a GCSE or an IGCSE', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-G', u'Number of students achieving 8 or more GCSE or IGCSE passes at A*-A', u'Number of students achieving 5 A*-A grades or more at GCSE or IGCSE'], dtype='object')
>>> df.head()
   Local Authority Number Local Authority Name  \
0                     201       City of london   
1                     201       City of london   
2                     202               Camden   
3                     202               Camden   
4                     202               Camden   

   Local Authority Establishment Number  Unique Reference Number  \
0                               2016005                   100001   
1                               2016007                   100003   
2                               2024104                   100049   
3                               2024166                   100050   
4                               2024196                   100051   

                       School Name    Town  \
0  City of London School for Girls  London   
1            City of London School  London   
2                Haverstock School  London   
3           Parliament Hill School  London   
4               Regent High School  London   

  Number of pupils at the end of key stage 4  \
0                                        105   
1                                        140   
2                                        200   
3                                        172   
4                                        174   

  Number of pupils attempting a GCSE or an IGCSE  \
0                                            104   
1                                            140   
2                                            194   
3                                            169   
4                                            171   

  Number of students achieving 8 or more GCSE or IGCSE passes at A*-G  \
0                                                100                    
1                                                108                    
2                                               SUPP                    
3                                                 22                    
4                                                  0                    

  Number of students achieving 8 or more GCSE or IGCSE passes at A*-A  \
0                                                 87                    
1                                                 75                    
2                                                  0                    
3                                                  7                    
4                                                  0                    

  Number of students achieving 5 A*-A grades or more at GCSE or IGCSE  
0                                                100                   
1                                                123                   
2                                                  0                   
3                                                 34                   
4                                               SUPP                    

[5 rows x 11 columns]
Sign up to request clarification or add additional context in comments.

5 Comments

Thank you. How does libreoffice know to skip the first 14 rows automatically? That is what made me think that maybe there was more to this problem.
@felix: FWIW, when I open it in libreoffice I see rows 1-14. I guess in principle you could detect when a grouping table (or whatever they're called) has been defined and pick up those, but you could have more than one on a sheet anyway.
What I meant was you do see rows 1-14 but row 15 has clearly been identified as the column headings by libreoffice. In my version there is a drop down menu in every field for row 15. Do you not get the same thing? This is what I see postimg.org/image/fbgkgxelp .
I don't think that's libreoffice detecting it, I think it's that they manually defined that as a table. Reread my previous comment.
Oh I see. Does pandas have a way of detecting this?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.