Grabbing items using selector within python script

Question

I've written some Python code to get company details and contact names from a webpage, using CSS selectors to collect the items. However, when I run it, I get only the first portion of the "Contact Details" and "Contact" text, i.e. the part before the first "br" tag, rather than the full string. How can I get the full text instead of just that first portion?

Script I'm trying with:

import requests
from lxml import html

tree = html.fromstring(requests.get("https://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG8000000314&folderid=1736").text)
for title in tree.cssselect("div.contact-details"):
    cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text
    cContact = title.cssselect("h4:contains('Contact')+p")[0].text
    print(cDetails, cContact)

HTML elements containing the data I'm after:

<div class="contact-details block dark">
                <h3>Contact Details</h3><p>Company Name: Distance Learning Australia Pty Ltd<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:[email protected]">[email protected]</a><br>Web: <a target="_blank" href="http://dla.edu.au">http://dla.edu.au</a></p><h4>Address</h4><p>Suite 108A, 49 Phillip Avenue<br>Watson<br>ACT<br>2602</p><h4>Contact</h4><p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:[email protected]">[email protected]</a></p>
            </div>

Results I'm getting:

Company Name: Distance Learning Australia Pty Ltd Name: Christine Jarrett

Results I'm after:

Company Name: Distance Learning Australia Pty Ltd
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: [email protected]

Name: Christine Jarrett
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: [email protected]

Btw, my intention is to do the aforesaid thing using selectors only, not xpath. Thanks in advance.

Andersson · Accepted Answer · 2017-08-23 10:58:38Z

1

Simply replace the text property with the text_content() method, as below, to get the required output:

cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text_content()
cContact = title.cssselect("h4:contains('Contact')+p")[0].text_content()


3 Comments

SIM Over a year ago

One unrelated question, Andersson: why can't I use "::after" or "::before" in my selector? If I try either, I get the error "Pseudo-elements are not supported." However, I found them in the documentation on CSS selectors. Is there a version-related conflict?

Andersson Over a year ago

You cannot locate pseudo-elements, as they are not part of the DOM. They can be used in CSS selectors to apply styles to a page, but not for web scraping.

SIM Over a year ago

Oh, I see. Thanks for the answer.

Žilvinas Rudžionis · Accepted Answer · 2017-08-23 11:01:28Z

1

The text property returns only the first text node, i.e. the text before the first child element. If you want to iterate over all child nodes, including the text nodes, use XPath like this:

company_details = title.cssselect("h3:contains('Contact Details')+p")[0]
for node in company_details.xpath("child::node()"):
    print(node)

result:

Company Name: Distance Learning Australia Pty Ltd
<Element br at 0x7f625419eaa0>
Phone: +61 2 6262 2964
<Element br at 0x7f625419ed08>
Fax: +61 2 6169 3168
<Element br at 0x7f625419e940>
Email: 
<Element a at 0x7f625419e8e8>
<Element br at 0x7f625419eba8>
Web: 
<Element a at 0x7f6254155af8>
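Since the question asks to avoid XPath, here is a hedged alternative sketch that relies on the element's text and tail attributes and on itertext(), all of which lxml provides. The shortened paragraph and the names paragraph, first_br and fragments are assumptions for illustration. In lxml, text that follows a child element is stored in that child's tail, which is why the text property alone stops at the first br.

```python
from lxml import html

# Shortened version of the paragraph from the question (assumption).
paragraph = html.fromstring(
    "<p>Company Name: Distance Learning Australia Pty Ltd"
    "<br>Phone: +61 2 6262 2964"
    '<br>Web: <a href="http://dla.edu.au">http://dla.edu.au</a></p>'
)

# The text after a <br> lives in that element's .tail, not in the parent's .text.
first_br = paragraph[0]
print(repr(paragraph.text))
print(repr(first_br.tail))

# itertext() walks every text and tail in document order, skipping the tags.
fragments = [text.strip() for text in paragraph.itertext() if text.strip()]
for fragment in fragments:
    print(fragment)
```

Unlike the child::node() loop, this yields only strings, and the link text arrives as its own fragment after "Web:".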
