Starting from this HTML page:

https://www.sports-reference.com/olympics/summer/1896/ATH/

I'm trying to get some information with the following script:

<?php
include_once ('C:\moduli\simple_html_dom.php');


    function getTextBetweenTags($url, $tagname) {
    $values = array();
    $html = file_get_html($url);
    foreach($html->find($tagname) as $tag) {

        //echo $tag;

        foreach($tag->find('a') as $a) {

            //echo $a;

            $values[] = $a->innertext. '<br>';
            //echo $values[0];

    }
    print_r ($values);
    unset($values);
    }

    //$result=explode("'s",$values[0]);
    //array_pop($result);
    //return $result;

}

$output = getTextBetweenTags('https://www.sports-reference.com/olympics/summer/1896/ATH/', 'tr  class=""');
//echo '<pre>';

?>

The print_r call inside the loop prints the following (first rows only):

Array ( ) Array ( [0] => Men's 100 metres
[1] => Tom Burke
[2] => Fritz Hofmann
[3] => Alajos Szokoly
[4] => Frank Lane
) Array ( [0] => Men's 400 metres
[1] => Tom Burke
[2] => Herbert Jamison
[3] => Charles Gmelin
) Array ( [0] => Men's 800 metres
[1] => Teddy Flack
[2] => Nándor Dáni
[3] => Dimitrios Golemis
) Array ( [0] => Men's 1,500 metres
[1] => Teddy Flack
[2] => Arthur C. Blake
[3] => Albin Lermusiaux

I'd like to store the following in separate variables (for example, for the 100 metres):

100 metres
Men
Tom Burke
USA --> (this one taken from "alt" attribute inside html)
Gold --> (static parameter for the first athlete)

then reset everything and, on the second iteration, get:

100 metres
Men
Fritz Hofmann
GER --> (this one taken from "alt" attribute inside html)
Silver --> (static parameter for the second athlete)

The last two athletes both won bronze, so I'd like to get:

100 metres
Men
Alajos Szokoly
HUN --> (this one taken from "alt" attribute inside html)
Bronze --> (static parameter for the third athlete)

and

100 metres
Men
Frank Lane
USA --> (this one taken from "alt" attribute inside html)
Bronze --> (static parameter for the fourth athlete)

The last two athletes are recognizable because, in the HTML, they sit inside the same td element (the one with the align="left" attribute).

How can I get that? Thank you.

  • What have you tried so far? Where is your PHP that extracts these values? Commented Aug 10, 2017 at 15:23
  • Just updated post a few seconds ago ;) Commented Aug 10, 2017 at 15:28

1 Answer

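This should work for you. It walks the td cells of each row and collects the a tags per cell, so athletes who share a cell (a tie) end up in a nested array: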

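function getTextBetweenTags($url, $tagname)
{
    $rows = array();
    $html = file_get_html($url);
    foreach ($html->find($tagname) as $tag)
    {
        // One array per table row: event name first, then one entry per medal cell.
        $row = array();
        foreach ($tag->find('td') as $td)
        {
            $a_tags = $td->find('a');
            if (count($a_tags) == 0)
            {
                // No link in this cell (for example, no medal awarded).
                $val = "";
            }
            elseif (count($a_tags) == 1)
            {
                $val = $a_tags[0]->innertext . '<br>';
            }
            else
            {
                // Several athletes share this cell (a tie).
                $val = array();
                foreach ($a_tags as $a)
                {
                    $val[] = $a->innertext . '<br>';
                }
            }
            $row[] = $val;
        }
        print_r($row);
        $rows[] = $row;
    }

    // Return the rows as well, so the caller can process them further.
    return $rows;
}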

This outputs the array in this format:

Array
(
    [0] => Men's 100 metres<br>
    [1] => Tom Burke<br>
    [2] => Fritz Hofmann<br>
    [3] => Array
        (
            [0] => Alajos Szokoly<br>
            [1] => Frank Lane<br>
        )

)
Array
(
    [0] => Men's 400 metres<br>
    [1] => Tom Burke<br>
    [2] => Herbert Jamison<br>
    [3] => Charles Gmelin<br>
)
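
To get from the row arrays printed above to one record per athlete with a medal attached, a minimal sketch could look like the following. It assumes the column order is always event, gold, silver, bronze, which is inferred from the output above rather than confirmed against the page. The function name rowToMedalRecords is hypothetical, and the country from the "alt" attribute is not covered here.

```php
<?php
// Hypothetical helper (not part of simple_html_dom): turns one row of the
// form [event, gold, silver, bronze] into one record per athlete.
// Assumption: the medal is determined purely by the column position.
function rowToMedalRecords(array $row): array
{
    $medals = [1 => 'Gold', 2 => 'Silver', 3 => 'Bronze'];

    // Strip the trailing "<br>" that the scraper appends to every value.
    $clean = function (string $s): string {
        return trim(str_replace('<br>', '', $s));
    };

    $event = $clean((string) ($row[0] ?? ''));
    $records = [];

    foreach ($medals as $index => $medal) {
        $cell = $row[$index] ?? '';
        // A cell is a string (one athlete), an array (a tie),
        // or "" (no medal awarded); the cast treats all three uniformly.
        foreach ((array) $cell as $name) {
            $name = $clean($name);
            if ($name !== '') {
                $records[] = [
                    'event'   => $event,
                    'athlete' => $name,
                    'medal'   => $medal,
                ];
            }
        }
    }

    return $records;
}
```

Called on the first row shown above, this gives four records, and the two athletes from the nested array both come out as Bronze. An empty cell simply produces no record.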
6 Comments

It's OK, but there is one particular case in which it does not work. When an athlete did not win a medal, I get the next event as the third element of the previous array, whereas it should start a new array.
Can you give example of that HTML?
The HTML is the same as above. I mean, if you look at the output, the Men's 110 metres Hurdles event has only two athletes who won gold and silver medal, no one won bronze. Well, the following event, Men's High Jump, starts from the third element of Men's 110 metres Hurdles and not just as a new array as the other ones. I hope my explanation was quite clear.
The code works just fine for the situation you described. Check out the array it creates here: prnt.sc/g72aox . You can see that a new array is created for each row, and if there is no entry for a medal, there is an empty array element for it. For example, for Men's 110 metres Hurdles, $array[3] is empty, and then the next array begins with Men's Pole Vault in its 0 position. Each array begins with the name of the sport.
Yes, you're right, I'm sorry. I had indented the output incorrectly. Another thing: since position 0 always holds the event name, such as Men's Pole Vault or Men's 110 metres Hurdles, would it be possible to explode it into a sub-array of two elements: the first being the event (Pole Vault or 110 metres Hurdles, for example) and the second being Men?
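
For the split asked about in the last comment, a minimal sketch is to explode the label on the first "'s " only. This assumes every label starts with a possessive such as "Men's" and that the apostrophe arrives as a plain character rather than an HTML entity. The function name splitEventLabel is hypothetical.

```php
<?php
// Hypothetical helper: splits a label such as "Men's 100 metres"
// into [event, category]. Assumes a plain "'s " separator.
function splitEventLabel(string $label): array
{
    // Drop the "<br>" suffix the scraper appends, if present.
    $label = trim(str_replace('<br>', '', $label));

    // Limit of 2, so only the first "'s " is used as the separator.
    $parts = explode("'s ", $label, 2);
    if (count($parts) < 2) {
        // No possessive found: return the label unchanged, category empty.
        return [$label, ''];
    }

    return [$parts[1], $parts[0]];
}
```

For example, splitEventLabel("Men's 110 metres Hurdles<br>") returns ['110 metres Hurdles', 'Men'].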