I would like to extract the content of the following table and save it to a CSV file via pandas, but I only want the dates (e.g. Thu, 11/02) and the values in the rows labelled €/MWh. Thank you all very much in advance.

Source code:

<table cellspacing="0" cellpadding="0" border="0" class="list hours responsive" width="100%">
<tbody>
    <tr>
        <th class="title"></th>
        <th class="units"></th>
        <th>Thu, 11/02</th>
        <th>Fri, 12/02</th>
        <th>Sat, 13/02</th>
        <th>Sun, 14/02</th>
        <th>Mon, 15/02</th>
        <th>Tue, 16/02</th>
        <th>Wed, 17/02</th>
    </tr>
    <tr class="no-border">
        <td class="title">
            00 - 01
        </td>
        <td>€/MWh</td>
        <td>23.82</td>
        <td>22.81</td>
        <td>22.23</td>
        <td>13.06</td>
        <td>16.57</td>
        <td>25.99</td>
        <td>32.45</td>
    </tr>
    <tr>
        <td>&nbsp;</td>
        <td>MWh</td>
        <td>10,266.0</td>
        <td>9,626.6</td>
        <td>12,255.9</td>
        <td>11,084.7</td>
        <td>11,039.5</td>
        <td>13,134.7</td>
        <td>9,958.1</td>
    </tr>
    <tr class="no-border">
        <td class="title">
            01 - 02
        </td>
        <td>€/MWh</td>
        <td>21.48</td>
        <td>21.59</td>
        <td>21.10</td>
        <td>12.17</td>
        <td>16.00</td>
        <td>23.65</td>
        <td>31.27</td>
    </tr>
    <tr>
        <td>&nbsp;</td>
        <td>MWh</td>
        <td>9,843.3</td>
        <td>9,494.4</td>
        <td>11,823.3</td>
        <td>10,531.9</td>
        <td>9,970.5</td>
        <td>12,875.6</td>
        <td>9,958.8</td>
    </tr>
    <tr class="no-border">
        <td class="title">
            02 - 03
        </td>
        <td>€/MWh</td>
        <td>21.00</td>
        <td>21.30</td>
        <td>20.21</td>
        <td>8.81</td>
        <td>14.55</td>
        <td>22.91</td>
        <td>29.72</td>
    </tr>
    <tr>
        <td>&nbsp;</td>
        <td>MWh</td>
        <td>9,857.0</td>
        <td>9,427.9</td>
        <td>11,755.2</td>
        <td>10,061.9</td>
        <td>9,881.7</td>
        <td>12,841.0</td>
        <td>9,896.9</td>
    </tr>
    <tr class="no-border">
        <td class="title">
            03 - 04
        </td>
        <td>€/MWh</td>
        <td>19.94</td>
        <td>19.86</td>
        <td>19.94</td>
        <td>6.74</td>
        <td>13.14</td>
        <td>22.04</td>
        <td>27.44</td>
    </tr>
    <tr>
        <td>&nbsp;</td>
        <td>MWh</td>
        <td>9,486.2</td>
        <td>10,492.7</td>
        <td>12,609.1</td>
        <td>11,216.6</td>
        <td>10,199.9</td>
        <td>11,209.7</td>
        <td>9,698.5</td>
    </tr>
</tbody>
</table>

4 Answers

The following code prints the data cells of your page row by row, one cell per line:

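from bs4 import BeautifulSoup
import urllib.request

# Load the saved test page (adjust the path to where test.html is stored)
response = urllib.request.urlopen('file:///F:/test.html')
html = response.read()
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('table', attrs={'class': 'list hours responsive'})
rows = table.find_all('tr')
for tr in rows:
    # The header row only has <th> cells, so it yields nothing here
    for td in tr.find_all('td'):
        node = td.find(string=True)
        # A cell without any text gets the placeholder "000"
        text = str(node) if node is not None else "000"
        print(text + ",")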

My test HTML page was saved as test.html on the F: drive:

<html>
<body>
<table cellspacing="0" cellpadding="0" border="0" class="list hours responsive" width="100%">
                <tbody>
                <tr>
                    <th class="title"></th>
                    <th class="units"></th>
                                                <th>Thu, 11/02</th>
                                                <th>Fri, 12/02</th>
                                                <th>Sat, 13/02</th>
                                                <th>Sun, 14/02</th>
                                                <th>Mon, 15/02</th>
                                                <th>Tue, 16/02</th>
                                                <th>Wed, 17/02</th>

                </tr>
                                        <tr class="no-border">
                        <td class="title">
                                                                00 - 01
                                                        </td>
                        <td>€/MWh</td>
                                                        <td>23.82</td>
                                                        <td>22.81</td>
                                                        <td>22.23</td>
                                                        <td>13.06</td>
                                                        <td>16.57</td>
                                                        <td>25.99</td>
                                                        <td>32.45</td>
                                                </tr>
                    <tr>
                        <td>&nbsp;</td>
                        <td>MWh</td>
                                                        <td>10,266.0</td>
                                                        <td>9,626.6</td>
                                                        <td>12,255.9</td>
                                                        <td>11,084.7</td>
                                                        <td>11,039.5</td>
                                                        <td>13,134.7</td>
                                                        <td>9,958.1</td>
                                                </tr>
                                        <tr class="no-border">
                        <td class="title">
                                                                01 - 02
                                                        </td>
                        <td>€/MWh</td>
                                                        <td>21.48</td>
                                                        <td>21.59</td>
                                                        <td>21.10</td>
                                                        <td>12.17</td>
                                                        <td>16.00</td>
                                                        <td>23.65</td>
                                                        <td>31.27</td>
                                                </tr>
                    <tr>
                        <td>&nbsp;</td>
                        <td>MWh</td>
                                                        <td>9,843.3</td>
                                                        <td>9,494.4</td>
                                                        <td>11,823.3</td>
                                                        <td>10,531.9</td>
                                                        <td>9,970.5</td>
                                                        <td>12,875.6</td>
                                                        <td>9,958.8</td>
                                                </tr>
                                        <tr class="no-border">
                        <td class="title">
                                                                02 - 03
                                                        </td>
                        <td>€/MWh</td>
                                                        <td>21.00</td>
                                                        <td>21.30</td>
                                                        <td>20.21</td>
                                                        <td>8.81</td>
                                                        <td>14.55</td>
                                                        <td>22.91</td>
                                                        <td>29.72</td>
                                                </tr>
                    <tr>
                        <td>&nbsp;</td>
                        <td>MWh</td>
                                                        <td>9,857.0</td>
                                                        <td>9,427.9</td>
                                                        <td>11,755.2</td>
                                                        <td>10,061.9</td>
                                                        <td>9,881.7</td>
                                                        <td>12,841.0</td>
                                                        <td>9,896.9</td>
                                                </tr>
                                        <tr class="no-border">
                        <td class="title">
                                                                03 - 04
                                                        </td>
                        <td>€/MWh</td>
                                                        <td>19.94</td>
                                                        <td>19.86</td>
                                                        <td>19.94</td>
                                                        <td>6.74</td>
                                                        <td>13.14</td>
                                                        <td>22.04</td>
                                                        <td>27.44</td>
                                                </tr>
                    <tr>
                        <td>&nbsp;</td>
                        <td>MWh</td>
                                                        <td>9,486.2</td>
                                                        <td>10,492.7</td>
                                                        <td>12,609.1</td>
                                                        <td>11,216.6</td>
                                                        <td>10,199.9</td>
                                                        <td>11,209.7</td>
                                                        <td>9,698.5</td>
                                                </tr>

                                    </tbody>
            </table>
            </body>
</html>

Output of the code is as follows:

00 - 01,
€/MWh,
23.82,
22.81,
22.23,
13.06,
16.57,
25.99,
32.45,
 ,
MWh,
10,266.0,
9,626.6,
12,255.9,
11,084.7,
11,039.5,
13,134.7,
9,958.1,

01 - 02,
€/MWh,
21.48,
21.59,
21.10,
12.17,
16.00,
23.65,
31.27,
 ,
MWh,
9,843.3,
9,494.4,
11,823.3,
10,531.9,
9,970.5,
12,875.6,
9,958.8,

02 - 03,
€/MWh,
21.00,
21.30,
20.21,
8.81,
14.55,
22.91,
29.72,
 ,
MWh,
9,857.0,
9,427.9,
11,755.2,
10,061.9,
9,881.7,
12,841.0,
9,896.9,

03 - 04,
€/MWh,
19.94,
19.86,
19.94,
6.74,
13.14,
22.04,
27.44,
 ,
MWh,
9,486.2,
10,492.7,
12,609.1,
11,216.6,
10,199.9,
11,209.7,
9,698.5,
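
To get from the printed cells to what the question asks for (the dates plus the €/MWh values in a CSV), one possible sketch follows. It is not part of the answer above. It assumes a shortened inline copy of the table instead of the file on F:, that pandas is installed, and `hour` as a made-up name for the index column.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened copy of the table from the question (two days, two hours).
html = """
<table class="list hours responsive">
<tbody>
    <tr>
        <th class="title"></th>
        <th class="units"></th>
        <th>Thu, 11/02</th>
        <th>Fri, 12/02</th>
    </tr>
    <tr class="no-border">
        <td class="title">
            00 - 01
        </td>
        <td>€/MWh</td>
        <td>23.82</td>
        <td>22.81</td>
    </tr>
    <tr>
        <td>&nbsp;</td>
        <td>MWh</td>
        <td>10,266.0</td>
        <td>9,626.6</td>
    </tr>
    <tr class="no-border">
        <td class="title">
            01 - 02
        </td>
        <td>€/MWh</td>
        <td>21.48</td>
        <td>21.59</td>
    </tr>
    <tr>
        <td>&nbsp;</td>
        <td>MWh</td>
        <td>9,843.3</td>
        <td>9,494.4</td>
    </tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="hours")

# The first two <th> cells are empty placeholders; the rest are the dates.
dates = [th.get_text(strip=True) for th in table.find_all("th")][2:]

prices = {}
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only rows whose unit cell is €/MWh; the header row has no <td>.
    if len(cells) > 2 and cells[1] == "€/MWh":
        prices[cells[0]] = [float(value) for value in cells[2:]]

# One row per hour, one column per date.
df = pd.DataFrame.from_dict(prices, orient="index", columns=dates)
df.index.name = "hour"

# Without a path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = df.to_csv()
print(csv_text)
```

Passing a file name such as `"prices.csv"` to `to_csv` would write the file instead of returning a string. The MWh rows are skipped entirely, so their thousands separators never need parsing.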

You can refer to this example code:

#!/usr/bin/env python
# -*- coding:utf-8 -*-

import requests
from bs4 import BeautifulSoup

url = 'http://news.sina.com.cn/'
res = requests.get(url)
res.encoding = 'utf-8'  # This is the key line: decode the response as UTF-8
soup = BeautifulSoup(res.text, 'html.parser')
tags = soup.select('a')

for tag in tags:
    try:
        link = str(tag['href'])
        if link.startswith('http'):
            print(link)
        else:
            print(False)
    except KeyError:
        # This <a> tag has no href attribute
        print('null')


There is an encoding problem: you should encode your response before printing it.
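
The answer does not say which encoding is meant. As a minimal sketch, assuming UTF-8 and using the `€/MWh` label from the table as sample text, encoding a string before output looks like this:

```python
price_label = "€/MWh"

# Encoding turns the text into bytes; UTF-8 can represent the euro sign.
encoded = price_label.encode("utf-8")
print(encoded)

# A codec that lacks the euro sign raises an error instead.
try:
    price_label.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as error:
    ascii_ok = False
    print("ascii cannot encode:", error.object[error.start:error.end])
```

Whether this is needed depends on the console or file the text is written to; many current Python 3 setups print the euro sign without an explicit encoding step.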


There is an easy/sneaky way to get around this.
I opened the HTML in an online HTML viewer and printed the result.
Then I copied it and pasted it into an Excel file.
Now you have two options:

  1. Edit the values in Excel and export the result as CSV;
  2. Save the Excel file, load it with pandas, manipulate it, and then export it as CSV.

For the second option, you would use the column with the units to look for the "€" symbol.
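
A rough sketch of that second option is below. The Excel file itself is not available here, so a small hand-built DataFrame stands in for what `pd.read_excel` would return; the column names `hour` and `units` are assumptions, not something the pasted sheet is guaranteed to have.

```python
import pandas as pd

# Stand-in for the sheet loaded from Excel (hypothetical column names).
df = pd.DataFrame(
    {
        "hour": ["00 - 01", None, "01 - 02", None],
        "units": ["€/MWh", "MWh", "€/MWh", "MWh"],
        "Thu, 11/02": ["23.82", "10,266.0", "21.48", "9,843.3"],
        "Fri, 12/02": ["22.81", "9,626.6", "21.59", "9,494.4"],
    }
)

# Keep only the rows whose units cell contains the euro sign,
# drop the units column, and convert the prices to numbers.
prices = (
    df[df["units"].str.contains("€", regex=False)]
    .drop(columns="units")
    .astype({"Thu, 11/02": float, "Fri, 12/02": float})
)

# Without a path, to_csv returns the CSV text; pass a file name to write a file.
csv_text = prices.to_csv(index=False)
print(csv_text)
```

Filtering on the units column before converting to numbers avoids having to parse the MWh values, which use a comma as thousands separator.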
