PHP Scraping using XPath - html5 issue?

Question

I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.

The page to be scraped looks something like:

<!DOCTYPE html> 
<html lang="en">
    <head></head>
    <body>
        <div><span>Blah</span></div>
        <div><span>Blah</span> Blah</div>
        <div>
            <form method="POST" action="blah">
                <input name="SomeName" id="SomeId" value="GET ME"/>
                <input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
            </form>
        </div>
    </body>
</html>

and I'm attempting to parse it like this:

$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));

NB: My actual code calls dump() where print_r() appears above; dump() just wraps print_r() but adds some stack trace info and formatting.

The output is as follows:

14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value

14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)

Which I'm assuming means it was unable to find anything in the document which matches my selector? I've tried a number of variations, just to see if I can get something back:

/input/@value
/input
//input
/div

The only selector which I've been able to get anything from is / which returns the entire document.

What am I doing wrong?

EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).

There's been a suggestion that this isn't possible because the parser chokes on HTML5 (which the internal page also uses). Does anyone have any experience of this?

You cannot dump any of the DOM instances; they don't expose their properties. Even if you had used the correct XPath //input[@id='SomeId']/@value, the dump() result would show the same empty object (despite items being in there).
– Gordon, Commented Feb 17, 2012 at 14:00
I've edited the Q to include a "working" example using the LinkedIn login page.
– Basic, Commented Feb 17, 2012 at 14:51
Hello! I have updated my answer with working code (at least for the LinkedIn example).
– Uku Loskit, Commented Feb 18, 2012 at 11:09

Uku Loskit · Accepted Answer · 2012-02-18 13:50:42Z

2

If your selector starts with a single slash (/), it is an absolute path from the document root, so /input only matches an input that is itself the root element. Use a double slash (//), which selects all matching elements regardless of their location.

print_r won't work for this. Everything in your code was fine except for actually reading the value. List classes in PHP, such as DOMNodeList, usually have a property called length; check that instead.

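$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);   // buffer parser warnings from real-world markup
$dom->loadHTML($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
// print_r($b) can look empty even when nodes matched, so check length instead
if ($b->length > 0) {
    echo $b->item(0)->value;        // item(0) is the matched value attribute node
}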
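
The same idea can be tried without fetching the live page. This is only a minimal sketch, assuming the sample markup from the question is pasted into a string; the names $html, $nodes, $count and $token are illustrative and not part of the original code. It checks length on the DOMNodeList and reads the attribute through nodeValue rather than relying on print_r:

```php
<?php
// Illustrative inline copy of the question's sample page (assumption: no network access needed).
$html = <<<'PAGE'
<!DOCTYPE html>
<html lang="en">
    <head></head>
    <body>
        <div>
            <form method="POST" action="blah">
                <input name="SomeName" id="SomeId" value="GET ME"/>
                <input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
            </form>
        </div>
    </body>
</html>
PAGE;

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // buffer parser messages instead of emitting warnings
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");

// Inspect the list through its length property and item() accessor.
$count = $nodes->length;
$token = $count > 0 ? $nodes->item(0)->nodeValue : null;

echo $count, "\n";
echo $token, "\n";
```

If $count is 0 here, the selector really did not match; if it is greater than 0 while print_r still shows an empty object, the problem is only in how the result is being displayed.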

edited Feb 18, 2012 at 13:50

answered Feb 17, 2012 at 13:23

Uku Loskit

2 Comments

Gordon Over a year ago

or provide the full path from the root

Basic Over a year ago

This doesn't seem to resolve my problem - have a look at the updated example

Gareth A. Lloyd · Accepted Answer · 2012-02-17 13:37:34Z

2

DOMXPath looks fine to me.

As for the XPath, use the descendant-or-self shortcut // to reach the input element:

//input[@id='SomeId']/@value
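
To see the difference between the absolute and descendant forms, here is a small sketch that counts matches for a few of the selectors from the question. It assumes a trimmed-down version of the sample page held in a string; the variable names and the /html/body/div/form/input path are illustrative, derived from that sample rather than from any real page:

```php
<?php
// Trimmed-down sample markup (assumption: based on the question's example page).
$html = '<!DOCTYPE html><html><head></head><body><div><form method="POST" action="blah">'
      . '<input name="SomeName" id="SomeId" value="GET ME"/></form></div></body></html>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // buffer parser messages instead of emitting warnings
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);
$xpath = new DOMXPath($dom);

// Count matches per selector: absolute paths must start at the root element,
// while // searches at any depth.
$counts = [];
foreach (['/input', '/div', '//input', '/html/body/div/form/input'] as $selector) {
    $counts[$selector] = $xpath->query($selector)->length;
    echo $selector, ' => ', $counts[$selector], "\n";
}
```

Here /input and /div match nothing because the root element is html, whereas //input and the full path from the root each find the one input element.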

edited Feb 17, 2012 at 13:37

answered Feb 17, 2012 at 13:25

Gareth A. Lloyd

1 Comment

Basic Over a year ago

Thanks but I don't think that's the only issue - I've updated the example to point at the LinkedIn login page where the same behavior is visible.

Zach Young · Accepted Answer · 2012-02-18 02:31:39Z

0

I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.

Your XPath is correct, by the way.

You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
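
Whether PHP's parser actually gives up on this markup can be checked before adding a clean-up step. The following is a sketch under assumptions, not a definitive test: it feeds the unclosed input from the pared-down example to DOMDocument::loadHTML(), lists whatever libxml_get_errors() reports, and then runs the XPath. The $html string and variable names are illustrative.

```php
<?php
// Illustrative input: the unclosed <input> from the question's pared-down page.
$html = '<!DOCTYPE html><html><body><form method="POST" action="blah">'
      . '<input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">'
      . '</form></body></html>';

libxml_use_internal_errors(true);   // buffer parser messages instead of emitting warnings
$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);
$errors = libxml_get_errors();      // array of LibXMLError objects (possibly empty)
libxml_clear_errors();
libxml_use_internal_errors(false);

// Show anything the parser complained about.
foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), "\n";
}

$xpath = new DOMXPath($dom);
$matches = $xpath->query("//input[@id='csrfToken-login']/@value")->length;
echo 'loaded: ', var_export($loaded, true), ', matches: ', $matches, "\n";
```

If the query still matches after loading, the HTML parser has recovered from the unclosed tag on its own, and a separate tag-soup pass may not be needed for that particular problem.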

I hope this helps,
Zachary

answered Feb 18, 2012 at 2:31

Zach Young
