I'm trying to extract data from the anchor URLs of a webpage, e.g.:

require 'simple_html_dom.php';
$html = file_get_html('http://www.example.com');
foreach($html->find('a') as $element) 
{
    $href = $element->href;
    $name = $surname = $id = 0;     
    parse_str($href);
    echo $name;
}

Now, the problem with this is that it doesn't work for some reason. All urls are in the following form:

name=James&surname=Smith&id=2311245

Now, the strange thing is, if I execute

echo $href;

I get the desired output. However, that string won't parse for some reason, and strlen() reports its length as 43. If, however, I pass the literal 'name=James&surname=Smith&id=2311245' as the parse_str() argument, it works just fine. What could be the problem?

  • Do you have any example input HTML we can test on? In addition, can you elaborate on "I get the desired output"? Lastly, for safety's sake, never execute parse_str without the second parameter. It's risky to blindly overwrite global variables. Commented Jun 30, 2014 at 22:08
  • $name=$surname=$id=0; ... ? Commented Jun 30, 2014 at 22:08
  • parse_str($href, $out); var_dump($out); what do you see? Commented Jun 30, 2014 at 22:08
  • what the hell is this line doing? $name=$surname=$id=0; Commented Jun 30, 2014 at 22:10
  • @KyleK Assigning 0 to $name, $surname and $id, what do you think it's doing? Commented Jun 30, 2014 at 22:11

2 Answers

I'm gonna take a guess that your target page is actually one of the rare pages that correctly encodes & in its links. Example:

<a href="somepage.php?name=James&amp;surname=Smith&amp;id=3211245">

To parse this string, you first need to unescape the &amp;s. You can do this with a simple str_replace if you like.
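As a rough sketch of that idea (not necessarily what the OP ended up using): the snippet below uses html_entity_decode() rather than str_replace(), since it also handles entities other than &amp;, though str_replace('&amp;', '&', $href) should work equally well here. The $href value is a hypothetical stand-in for what simple_html_dom returns, assuming it hands back the attribute exactly as written in the HTML source. The names $decoded and $params are just for illustration.

```php
<?php
// Hypothetical href value as it appears in the raw HTML source (entities still encoded).
$href = 'name=James&amp;surname=Smith&amp;id=2311245';

echo strlen($href), "\n";      // 43: each &amp; is five characters, not one

// Decode HTML entities so that &amp; becomes a plain &.
$decoded = html_entity_decode($href);

echo strlen($decoded), "\n";   // 35

// Collect the values in an array instead of creating variables in the current scope.
parse_str($decoded, $params);

echo $params['name'], "\n";    // James
echo $params['surname'], "\n"; // Smith
echo $params['id'], "\n";      // 2311245
```

This also accounts for the length of 43 from the question: the plain string is 35 characters, and each of the two &amp; sequences adds 4. Passing the second argument to parse_str() keeps the parsed values out of the current scope, and it is required as of PHP 8.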

5 Comments

That wouldn't explain why echo $href; allegedly gives the correct output. (Unless OP was mistaken or was viewing the output in a web browser.)
@Mr.Llama Your parenthetical "unless" is exactly it ;) Notice the strlen output OP gave?
Hah! I totally missed that bit.
When you spend too long on StackOverflow, you start being able to see these "invisible" errors ;)
Oh, yes, that explains it xD Thanks a lot ;)

Presuming the links are absolute, you just need the query string. You can extract it with parse_url, then pass an output parameter to parse_str to get the values as an array:

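require 'simple_html_dom.php';

$html = file_get_html('http://www.example.com');
foreach ($html->find('a') as $element)
{
    $href = $element->href;

    // parse_url() splits the URL into components; 'query' is absent when the link has no query string
    $url_components = parse_url($href);
    if (!isset($url_components['query'])) {
        continue;
    }

    // Passing $out as the second argument collects the values in an array
    parse_str($url_components['query'], $out);

    var_dump($out);
}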
