1

I've got a page I'm scraping and most of the tables are in the format Heading --info. I can iterate through most of the tables and create separate dataframes for all the various information using pandas.read_html.

However, there are some where they've combined information into one table with subheadings that I want to be separate dataframes with the text of that row as the heading (appending a list).

Is there an easy way to split this dataframe - It will always be heading followed by associated rows, new heading followed by new associated rows.

eg.

   Col1   Col2
0  thing   
1  1       2
2  2       3
3  thing2  
4  1       2
5  2       3
6  3       4

Should be

thing
1   1   1
2   2   2

thing2
4   1   2
5   2   3
6   3   4

It'd be nice if people would just create web pages that made sense with the data but that's not the case here.

I've tried iterrows but cannot seem to come up with a good way to create what I want.

Help would be very much appreciated!

<div class="ranking">
    <h6><a href="javascript:;">Sprint</a></h6>
    <table>
    <tbody>
    
    
    </tbody>
    <tbody>
    <tr>
     <td class="title" colspan="8">Canneto - km 137</td>
    </tr>
    
    <tr>
    <td class="rank"><span title="Rank">1</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">21</span></td>
    <td class="name">
    <a class="10010085859" href="javascript:;">
    <abbr title="Young rider">*</abbr>
    BAGIOLI Nicola
    </a>
    
    </td>
    <td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">5</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">2</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">54</span></td>
    <td class="name">
    <a class="10008688453" href="javascript:;">
    ORSINI Umberto
    </a>
    
    </td>
    <td class="team"><img alt="BARDIANI CSF FAIZANE'" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BCF.png" title="BARDIANI CSF FAIZANE'"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">3</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">3</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">247</span></td>
    <td class="name">
    <a class="10005658114" href="javascript:;">
    ZARDINI Edoardo
    </a>
    
    </td>
    <td class="team"><img alt="VINI ZABU' KTM" src="/Content/images/event/2020/tirreno-adriatico/jerseys/THR.png" title="VINI ZABU' KTM"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">2</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">4</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">63</span></td>
    <td class="name">
    <a class="10003349312" href="javascript:;">
    BODNAR Maciej
    </a>
    
    </td>
    <td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
    
    <td class="noc"><img alt="POL" src="/Content/images/flags/POL.png" title="POL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">1</td>
    
    </tr>
    
    </tbody>
    <tbody>
    <tr>
     <td class="title" colspan="8">Follonica - km 190</td>
    </tr>
    
    <tr>
    <td class="rank"><span title="Rank">1</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">62</span></td>
    <td class="name">
    <a class="10007738055" href="javascript:;">
    ACKERMANN Pascal
    </a>
    
    </td>
    <td class="team"><img alt="BORA - HANSGROHE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/BOH.png" title="BORA - HANSGROHE"/></td>
    
    <td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">12</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">2</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">231</span></td>
    <td class="name">
    <a class="10008656828" href="javascript:;">
    GAVIRIA RENDON Fernando
    </a>
    
    </td>
    <td class="team"><img alt="UAE TEAM EMIRATES" src="/Content/images/event/2020/tirreno-adriatico/jerseys/UAD.png" title="UAE TEAM EMIRATES"/></td>
    
    <td class="noc"><img alt="COL" src="/Content/images/flags/COL.png" title="COL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">10</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">3</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">137</span></td>
    <td class="name">
    <a class="10007506366" href="javascript:;">
    ZABEL Rick
    </a>
    
    </td>
    <td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
    
    <td class="noc"><img alt="GER" src="/Content/images/flags/GER.png" title="GER"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">8</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">4</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">91</span></td>
    <td class="name">
    <a class="10008661777" href="javascript:;">
    BALLERINI Davide
    </a>
    
    </td>
    <td class="team"><img alt="DECEUNINCK  -  QUICK - STEP " src="/Content/images/event/2020/tirreno-adriatico/jerseys/DQT.png" title="DECEUNINCK  -  QUICK - STEP "/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">7</td>
    
    </tr>
    <tr>
    <td class="rank"><span title="Rank">5</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">12</span></td>
    <td class="name">
    <a class="10007096239" href="javascript:;">
    MERLIER Tim
    </a>
    
    </td>
    <td class="team"><img alt="ALPECIN - FENIX" src="/Content/images/event/2020/tirreno-adriatico/jerseys/AFC.png" title="ALPECIN - FENIX"/></td>
    
    <td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">6</td>
    
    </tr>
    <tr>
    <td class="more" colspan="8"><a href="javascript:;">More...</a></td>
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">6</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">133</span></td>
    <td class="name">
    <a class="10028417041" href="javascript:;">
    CIMOLAI Davide
    </a>
    
    </td>
    <td class="team"><img alt="ISRAEL START - UP NATION" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ISN.png" title="ISRAEL START - UP NATION"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">5</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">7</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">213</span></td>
    <td class="name">
    <a class="10007216275" href="javascript:;">
    MANZIN Lorrenzo
    </a>
    
    </td>
    <td class="team"><img alt="TOTAL DIRECT ENERGIE" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TDE.png" title="TOTAL DIRECT ENERGIE"/></td>
    
    <td class="noc"><img alt="FRA" src="/Content/images/flags/FRA.png" title="FRA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">4</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">8</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">23</span></td>
    <td class="name">
    <a class="10007744624" href="javascript:;">
    PACIONI Luca
    </a>
    
    </td>
    <td class="team"><img alt="ANDRONI GIOCATTOLI - SIDERMEC" src="/Content/images/event/2020/tirreno-adriatico/jerseys/ANS.png" title="ANDRONI GIOCATTOLI - SIDERMEC"/></td>
    
    <td class="noc"><img alt="ITA" src="/Content/images/flags/ITA.png" title="ITA"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">3</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">9</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">147</span></td>
    <td class="name">
    <a class="10010946028" href="javascript:;">
    <abbr title="Young rider">*</abbr>
    VERMEERSCH Florian
    </a>
    
    </td>
    <td class="team"><img alt="LOTTO SOUDAL" src="/Content/images/event/2020/tirreno-adriatico/jerseys/LTS.png" title="LOTTO SOUDAL"/></td>
    
    <td class="noc"><img alt="BEL" src="/Content/images/flags/BEL.png" title="BEL"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">2</td>
    
    </tr>
    <tr style="display: none;">
    <td class="rank"><span title="Rank">10</span></td>
    <td class="any-progression"></td>
    <td class="bib"><span title="Bib">195</span></td>
    <td class="name">
    <a class="10006631548" href="javascript:;">
    TEUNISSEN Mike
    </a>
    
    </td>
    <td class="team"><img alt="JUMBO - VISMA" src="/Content/images/event/2020/tirreno-adriatico/jerseys/TJV.png" title="JUMBO - VISMA"/></td>
    
    <td class="noc"><img alt="NED" src="/Content/images/flags/NED.png" title="NED"/></td>
    
    <td class="bonif">
    </td>
    <td class="points" title="Points">1</td>
    
    </tr>
    
    </tbody>
    </table>
</div>
11
  • please provide the code you used to generate the first dataframe Commented Sep 8, 2020 at 18:10
  • 1
    how to identify the start of new header? for this sample data it string what if the table data also string how to identify the header? Commented Sep 8, 2020 at 18:12
  • The code to generate the first dataframe just a beautiful soup scrape of a webpage, that's why it's such a mess. @deadshot - the header will be text but everything else in the table is an integer. But of course because there is text in there pandas things the column is all strings.. Also I made a mistake, the other columns are blank where the header is in column 1. Commented Sep 8, 2020 at 18:26
  • Pretty broad question. Which part are you having trouble with - identifing the rows with the headers or splitting after the rows have been identified? Commented Sep 8, 2020 at 18:30
  • You should probably include the html - a minimal example - as part of your minimal reproducible example. It might be worthwhile creating the separate DataFrames from the markup rather than trying to split a composite DataFrame. That said, does either Pandas Split DataFrame using row index or Python Pandas: Get index of rows which column matches certain value help or solve your problem? Commented Sep 8, 2020 at 18:43

2 Answers 2

1

You can use np.split()

import numpy as np


res = [x.reset_index(drop=True) for x in np.split(df, np.where(df.applymap(lambda x: x == ''))[0]) if not x.empty]
for x in res:
    x = x.rename(columns=x.iloc[0]).drop(x.index[0])
    print(x)

Output:

  thing   
1     1  2
2     2  3
  thing2   
1      1  2
2      2  3
3      3  4
Sign up to request clarification or add additional context in comments.

Comments

0

Identify the headers and use cumsum() to groupby then append each group to a list.

import pandas as pd
df = pd.DataFrame({'Col1': {0: 'thing', 1: '1', 2: '2', 3: 'thing2', 4: '1',5: '2', 6: '3'},
                   'Col2': {0:'' , 1: 2, 2: 3, 3:'' , 4: 2, 5: 3, 6: 4}})
gb = df.groupby((~df.Col2.astype(bool)).cumsum())


dfs = []
for k,g in gb:
    dfs.append(g.copy())

In [42]: dfs[0]
Out[42]: 
    Col1  Col2
0  thing     
1      1     2
2      2     3

In [43]: dfs[1]
Out[43]: 
     Col1  Col2
3  thing2     
4       1     2
5       2     3
6       3     4

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.