I'm attempting to scrape the value of an input box from a URL. I seem to be having problems with my implementation of XPath.

The page to be scraped looks something like:

<!DOCTYPE html> 
<html lang="en">
    <head></head>
    <body>
        <div><span>Blah</span></div>
        <div><span>Blah</span> Blah</div>
        <div>
            <form method="POST" action="blah">
                <input name="SomeName" id="SomeId" value="GET ME"/>
                <input type="hidden" name="csrfToken" value="ajax:3575644127378754050" id="csrfToken-login">
            </form>
        </div>
    </body>
</html>

and I'm attempting to parse it like this:

$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
print_r($Selector);
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHtml($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
print_r($xpath->query($Selector));

NB: The output below comes from dump(), which just wraps print_r() but adds some stack trace info and formatting.

The output is as follows:

14:50:08 scraper.php 181: (Scraper->Test)
//input[@id='csrfToken-login']/@value

14:50:08 scraper.php 188: (Scraper->Test)
DOMNodeList Object
(
)

I'm assuming this means it was unable to find anything in the document that matches my selector. I've tried a number of variations, just to see if I can get something back:

/input/@value
/input
//input
/div

The only selector which I've been able to get anything from is / which returns the entire document.

What am I doing wrong?

EDIT: As some can't reproduce the problem with the old example, I've replaced it with an almost identical example which also demonstrates the problem but uses a public URL (LinkedIn login page).

There's been a suggestion that this isn't possible because the parser chokes on HTML5 (which the internal page also uses). Does anyone have any experience of this?

3 Comments
  • You cannot dump any of the DOM instances. They don't expose their properties. Even if you had used the correct XPath //input[@id='SomeId']/@value, the dump() result would show the same empty object (despite items being in there). Commented Feb 17, 2012 at 14:00
  • I've edited the Q to include a "working" example using the LinkedIn login page. Commented Feb 17, 2012 at 14:51
  • Hello! I have updated my answer with working code (at least for the LinkedIn example). Commented Feb 18, 2012 at 11:09

3 Answers


If your selector starts with a single slash (/), it is an absolute path from the document root, so /input only matches if input is the root element. You need a double slash (//), which selects all matching elements regardless of their location.
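To make the difference concrete, here is a minimal sketch that runs both forms against an inline string. The $html string is only a stand-in for the page in the question (an assumption, so the example needs no network access).

```php
<?php
// Inline stand-in for the page in the question (assumption: simplified markup).
$html = '<!DOCTYPE html><html><body><div><form>'
      . '<input name="SomeName" id="SomeId" value="GET ME"/>'
      . '</form></div></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Absolute path: matches only if <input> is the root element (it is not; <html> is).
$absolute = $xpath->query('/input');

// Descendant search: matches <input> at any depth.
$anywhere = $xpath->query('//input');

echo $absolute->length, "\n";   // prints 0
echo $anywhere->length, "\n";   // prints 1
```

The same reasoning explains why /div returned nothing in the question: the only element directly under the root is html.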

print_r won't work for this: the DOM objects don't expose their contents to it, so the list prints as empty even when it contains nodes. Everything in your code was fine except for actually reading the value. DOMNodeList has a length property; check that instead.

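$Contents = file_get_contents("https://www.linkedin.com/uas/login");
$Selector = "//input[@id='csrfToken-login']/@value";
$dom = new DOMDocument;
libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom->loadHTML($Contents);
$xpath = new DOMXPath($dom);
libxml_use_internal_errors(false);
$b = $xpath->query($Selector);
// query() returns a DOMNodeList; check length before reading the first match
if ($b->length > 0) {
    echo $b->item(0)->value;        // the match is a DOMAttr, so ->value holds the string
}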
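If you want to try this without hitting LinkedIn, here is a minimal offline sketch of the same idea. The $html fragment is an assumption that reuses the hidden input from the question, and nodeValue is used here as an alternative way to read the attribute's text.

```php
<?php
// Offline stand-in for the fetched page (assumption: only the relevant input is kept).
$html = '<form><input type="hidden" name="csrfToken" '
      . 'value="ajax:3575644127378754050" id="csrfToken-login"></form>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);
$nodes = $xpath->query("//input[@id='csrfToken-login']/@value");

// query() returns false for an invalid expression and an empty list when nothing matches.
if ($nodes !== false && $nodes->length > 0) {
    $token = $nodes->item(0)->nodeValue;
    echo $token, "\n";   // prints ajax:3575644127378754050
} else {
    $token = null;
    echo "No match\n";
}
```

Inspecting length and item(0) this way tells you whether the query matched, which print_r on the list does not.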

2 Comments

or provide the full path from the root
This doesn't seem to resolve my problem - have a look at the updated example

DOMXPath looks fine to me.

As for the XPath, use the descendant-or-self shortcut (//) to get to the input tag:

//input[@id='SomeId']/@value
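For reference, // is shorthand for /descendant-or-self::node()/, so the two expressions in this sketch should select the same attribute. The $html string is an assumed stand-in for the sample page in the question.

```php
<?php
// Inline stand-in for the sample page (assumption: simplified markup).
$html = '<html><body><div><form>'
      . '<input name="SomeName" id="SomeId" value="GET ME"/>'
      . '</form></div></body></html>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath = new DOMXPath($dom);

// Abbreviated form.
$short = $xpath->query("//input[@id='SomeId']/@value");

// Expanded form of the same expression.
$long = $xpath->query("/descendant-or-self::node()/input[@id='SomeId']/@value");

echo $short->item(0)->nodeValue, "\n";   // prints GET ME
echo $long->item(0)->nodeValue, "\n";    // prints GET ME
```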

1 Comment

Thanks but I don't think that's the only issue - I've updated the example to point at the LinkedIn login page where the same behavior is visible.

I've been to the LinkedIn login page that you specified and it is malformed; even your pared-down example has an unclosed input node. I know nothing about PHP's XPath implementation, but I'm guessing no straight XPath API is ever going to work with a malformed document.

Your XPath is correct, by the way.

You might need an intermediary step using TagSoup to "well form" the source before you start querying it, or Google "tag soup php" for any PHP-specific solutions/implementations.
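For what it's worth, PHP's DOMDocument::loadHTML() is built on libxml2's HTML parser, which is generally tolerant of this kind of markup, so a separate clean-up pass may not be needed for something like an unclosed input. Here is a minimal sketch to check that locally; the $html string and the abc123 value are made up for illustration.

```php
<?php
// Made-up markup with an unclosed <input> and an unclosed <div> (assumption, not the real page).
$html = '<html><body><div><form>'
      . '<input type="hidden" id="csrfToken-login" value="abc123">'
      . '</form></body></html>';

$dom = new DOMDocument;
$previous = libxml_use_internal_errors(true);   // buffer parser messages
$dom->loadHTML($html);
$errors = libxml_get_errors();                  // whatever the parser complained about
libxml_clear_errors();
libxml_use_internal_errors($previous);          // restore the earlier setting

$xpath = new DOMXPath($dom);

// evaluate() with string() returns the attribute text directly, or '' if nothing matches.
$value = $xpath->evaluate("string(//input[@id='csrfToken-login']/@value)");

echo $value, "\n";   // prints abc123
echo count($errors), " parser message(s) collected\n";
```

If the query still comes back empty on a real page, looking at the collected $errors is a reasonable first step before reaching for an external tidy tool.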

I hope this helps,
Zachary
