I've written some Python code to get company details and contact names from a webpage, using CSS selectors to collect those items. However, when I run it, I only get the first portion of the "company details" and "contact" text, up to the first "br" tag, rather than the full string. How can I get the full text instead of only the first part?

Script I'm trying with:

import requests
from lxml import html

tree = html.fromstring(requests.get("https://www.austrade.gov.au/SupplierDetails.aspx?ORGID=ORG8000000314&folderid=1736").text)
for title in tree.cssselect("div.contact-details"):
    cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text
    cContact = title.cssselect("h4:contains('Contact')+p")[0].text
    print(cDetails, cContact)

HTML elements that contain the data I'm after:

<div class="contact-details block dark">
                <h3>Contact Details</h3><p>Company Name: Distance Learning Australia Pty Ltd<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:[email protected]">[email protected]</a><br>Web: <a target="_blank" href="http://dla.edu.au">http://dla.edu.au</a></p><h4>Address</h4><p>Suite 108A, 49 Phillip Avenue<br>Watson<br>ACT<br>2602</p><h4>Contact</h4><p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>Email: <a href="mailto:[email protected]">[email protected]</a></p>
            </div>

Results I'm getting:

Company Name: Distance Learning Australia Pty Ltd Name: Christine Jarrett

Results I'm after:

Company Name: Distance Learning Australia Pty Ltd
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: [email protected]

Name: Christine Jarrett
Phone: +61 2 6262 2964
Fax: +61 2 6169 3168
Email: [email protected]

By the way, I'd like to do this using CSS selectors only, not XPath. Thanks in advance.

2 Answers

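Simply replace the text property with the text_content() method, as below, to get the full text of each paragraph. Note that text_content() concatenates all the text nodes and does not insert line breaks where the br tags were: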

cDetails = title.cssselect("h3:contains('Contact Details')+p")[0].text_content()
cContact = title.cssselect("h4:contains('Contact')+p")[0].text_content()
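
To make the difference concrete, here is a minimal sketch that runs against an inline copy of the "Contact" paragraph instead of the live page. The SNIPPET string and the someone@example.com address are placeholders made up for illustration, and find(".//p") is used only so the sketch does not need the cssselect package; the selector from the question should locate the same element.

```python
from lxml import html

# Inline stand-in for the page markup; the e-mail address is a made-up placeholder.
SNIPPET = (
    '<div class="contact-details"><h4>Contact</h4>'
    '<p>Name: Christine Jarrett<br>Phone: +61 2 6262 2964<br>'
    'Email: <a href="mailto:someone@example.com">someone@example.com</a></p></div>'
)

tree = html.fromstring(SNIPPET)
p = tree.find(".//p")  # the question's CSS selector would return this same <p>

first_only = p.text              # only the text before the first child element
everything = p.text_content()    # all descendant text, joined with no separator
pieces = list(p.itertext())      # the same text, one string per text node

print(first_only)
print(everything)
print(pieces)
```

If one value per text node is enough, "\n".join(pieces) is a quick way to get line-separated output, although "Email: " and the address end up on separate lines because they are separate text nodes.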

3 Comments

One thing outside this context, sir Andersson: why can't I use "::after" or "::before" in my selector? If I try either, I get the error "Pseudo-elements are not supported." However, I found them in documentation on CSS selectors. Is there a version-related conflict?
You cannot locate pseudo-elements because they are not part of the DOM. They can be used in CSS selectors to apply styles, but not for web scraping.
Oh, I see. Thanks for the answer.

The text property returns only the first text node, i.e. the text before the first child element. If you want to iterate over all child nodes, including the text nodes, use XPath like this:

company_details = title.cssselect("h3:contains('Contact Details')+p")[0]
for node in company_details.xpath("child::node()"):
    print(node)

result:

Company Name: Distance Learning Australia Pty Ltd
<Element br at 0x7f625419eaa0>
Phone: +61 2 6262 2964
<Element br at 0x7f625419ed08>
Fax: +61 2 6169 3168
<Element br at 0x7f625419e940>
Email: 
<Element a at 0x7f625419e8e8>
<Element br at 0x7f625419eba8>
Web: 
<Element a at 0x7f6254155af8>
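
Building on that node listing, the following sketch groups the nodes into one line per br tag, which is closer to the output the question asks for. The helper name lines_between_breaks, the inline SNIPPET and the someone@example.com address are all made up for illustration; this is one possible approach, not the only one.

```python
from lxml import html

# Inline stand-in for the page markup; the e-mail address is a made-up placeholder.
SNIPPET = (
    '<div class="contact-details"><h3>Contact Details</h3>'
    '<p>Company Name: Distance Learning Australia Pty Ltd<br>'
    'Phone: +61 2 6262 2964<br>Fax: +61 2 6169 3168<br>'
    'Email: <a href="mailto:someone@example.com">someone@example.com</a><br>'
    'Web: <a target="_blank" href="http://dla.edu.au">http://dla.edu.au</a></p></div>'
)


def lines_between_breaks(element):
    """Hypothetical helper: return the element's text as one string per <br>-separated line."""
    lines = [""]
    for node in element.xpath("child::node()"):
        if isinstance(node, str):
            # Text node: append it to the line being built.
            lines[-1] += node
        elif node.tag == "br":
            # A <br> ends the current line and starts a new one.
            lines.append("")
        else:
            # Any other element (e.g. <a>): use its own text, without its tail.
            lines[-1] += node.text_content()
    return [line.strip() for line in lines if line.strip()]


tree = html.fromstring(SNIPPET)
company_details = tree.find(".//p")
result = lines_between_breaks(company_details)
print("\n".join(result))
```

The same helper can be applied to the paragraph after the "Contact" heading to get the second block of lines.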
