Scraping data from JavaScript-powered website

Question

I want to scrape data from http://www.ifanca.org/Pages/Certified-Products.aspx?search=22535. This is my PHP script:

<?php
 //get the html returned from the following url
$html = file_get_contents(
  'http://www.ifanca.org/Pages/Certified-Products.aspx?search=22535');

$pokemon_doc = new DOMDocument();

libxml_use_internal_errors(TRUE); //disable libxml errors

if(!empty($html)) { //if any html is actually returned

  $pokemon_doc->loadHTML($html);
  libxml_clear_errors(); //remove errors for yucky html

  $pokemon_xpath = new DOMXPath($pokemon_doc);

  $pokemon_row = $pokemon_xpath->query('//*[@id="example"]');

  if($pokemon_row->length > 0){
    foreach($pokemon_row as $row){
      echo $row->nodeValue . "<br/>";
    }
  }
}
?>

It gives me the result:

Product Name Company Name Sold In Marketing Type Product Type Product Code Logo Ifanca Code

which is fine. But when I try to get a product name, e.g. "4Life Transfer Factor Belle Vie", by querying //*[@id="example"]/tbody/tr/td[1], it gives me nothing.

I need help to get Product Name data.

Did one of the answers below solve the problem here, Kamran? If so, please consider accepting it.
– halfer, Commented Feb 20, 2015 at 21:09

Aleksei Matiushkin · Accepted Answer · 2015-02-14 19:02:24Z

2

If you wget the file and examine its contents, you'll find that the table is filled in entirely by JavaScript, while the initial HTML of the table is:

<table id="example" class="display"  
       width="100%" cellpadding="0" cellspacing="0" border="0">

  <thead>
    <tr><th width="23%" style="width:23% !important">Product Name</th>
        <th width="22%" style="width:22% !important">Company Name </th>
        <th width="13%" style="width:13% !important">Sold In</th>
        <th width="10%" style="width:10% !important">Marketing Type</th>
        <th width="10%" style="width:10% !important">Product Type</th>
        <th width="10%" style="width:10% !important">Product Code</th>
        <th width="5%" style="width:5% !important" >Logo</th>
        <th width="7%" style="width:7% !important">Ifanca Code</th>
  </thead>
  <tbody>
  </tbody>
</table>

Neither file_get_contents nor DOMDocument will parse and execute JavaScript for you. That's why you get an empty result set for

//*[@id="example"]/tbody/tr/td[1]

Those nodes simply do not exist in the document you fetched.
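
A minimal sketch of the effect, assuming a trimmed copy of the static markup shown above. The `$staticHtml` string and the `countMatches` helper are illustrative and not part of the original script. The header query finds cells, while the row query comes back empty because the `tbody` has no rows until JavaScript runs.

```php
<?php
// Trimmed copy of the markup the server sends before any JavaScript runs
// (assumption: only two header cells are kept for brevity).
$staticHtml = <<<'HTML'
<table id="example" class="display">
  <thead>
    <tr><th>Product Name</th><th>Company Name</th></tr>
  </thead>
  <tbody>
  </tbody>
</table>
HTML;

// Hypothetical helper: count the nodes an XPath expression matches in an HTML string.
function countMatches(string $html, string $expression): int
{
    libxml_use_internal_errors(true); // keep libxml warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    return $xpath->query($expression)->length;
}

$headerCells = countMatches($staticHtml, '//*[@id="example"]/thead/tr/th');
$dataCells   = countMatches($staticHtml, '//*[@id="example"]/tbody/tr/td[1]');

// The header cells are present in the static HTML; the data cells are not.
echo "header cells: $headerCells, data cells: $dataCells\n";
```

The same XPath only starts matching once something has executed the page's scripts and populated the `tbody`.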

answered Feb 14, 2015 at 19:02


2 Comments

Kamran Ullah Over a year ago

mudasobwa, how can I get the Product Name data? Is there any alternative?

Aleksei Matiushkin Over a year ago

@KamranUllah You could use Node.js and run the page's JavaScript manually (but there might be other glitches as well).

halfer · Accepted Answer · 2015-02-14 19:48:05Z

1

This site is dependent on JavaScript. If you open your Network developer tools (in Firefox and probably most other browsers) when the page loads, you'll see it generates four AJAX POST requests to the server. It is likely that each of these depends on the others, so it may not be trivial to scrape them.

Normally I recommend scraping AJAX GET requests, since there is (and should be) only one per data source, but this site is fetching content in a way that is wasteful of HTTP resources and in a way that is hard to scrape. Indeed, that may be the reason why the developers did it this way - they don't want other people to republish their information.

The body of one of the requests is this XML:

<?xml version="1.0" encoding="UTF-8"?>
<Request xmlns="http://schemas.microsoft.com/sharepoint/clientquery/2009" SchemaVersion="15.0.0.0" LibraryVersion="15.0.0.0" ApplicationName="Javascript Library">
   <Actions>
      <ObjectPath Id="1" ObjectPathId="0" />
      <ObjectPath Id="3" ObjectPathId="2" />
      <ObjectPath Id="5" ObjectPathId="4" />
      <ObjectPath Id="7" ObjectPathId="6" />
      <ObjectIdentityQuery Id="8" ObjectPathId="6" />
      <ObjectPath Id="10" ObjectPathId="9" />
      <ObjectPath Id="12" ObjectPathId="11" />
      <ObjectIdentityQuery Id="13" ObjectPathId="11" />
      <ObjectPath Id="15" ObjectPathId="14" />
      <Query Id="16" ObjectPathId="9">
         <Query SelectAllProperties="true">
            <Properties />
         </Query>
         <ChildItemQuery SelectAllProperties="true">
            <Properties />
         </ChildItemQuery>
      </Query>
   </Actions>
   <ObjectPaths>
      <StaticProperty Id="0" TypeId="{3747adcd-a3c3-41b9-bfab-4a64dd2f1e0a}" Name="Current" />
      <Property Id="2" ParentId="0" Name="Web" />
      <Property Id="4" ParentId="2" Name="Lists" />
      <Method Id="6" ParentId="4" Name="GetByTitle">
         <Parameters>
            <Parameter Type="String">HCM</Parameter>
         </Parameters>
      </Method>
      <Method Id="9" ParentId="6" Name="GetItems">
         <Parameters>
            <Parameter TypeId="{3d248d7b-fc86-40a3-aa97-02a75d69fb8a}">
               <Property Name="DatesInUtc" Type="Boolean">true</Property>
               <Property Name="FolderServerRelativeUrl" Type="Null" />
               <Property Name="ListItemCollectionPosition" Type="Null" />
               <Property Name="ViewXml" Type="String">&lt;View Scope="RecursiveAll"&gt;&lt;Query&gt;&lt;Where&gt;&lt;And&gt;&lt;IsNotNull&gt;&lt;FieldRef Name="Year"/&gt;&lt;/IsNotNull&gt;&lt;In&gt;&lt;FieldRef Name="FileType"/&gt;&lt;Values&gt;&lt;Value Type="Choice"&gt;Image&lt;/Value&gt;&lt;Value Type="Choice"&gt;Flipbook&lt;/Value&gt;&lt;Value Type="Choice"&gt;pdf&lt;/Value&gt;&lt;/Values&gt;&lt;/In&gt;&lt;/And&gt;&lt;/Where&gt;&lt;OrderBy&gt;&lt;FieldRef Name="IssueNo" Ascending="False" /&gt;&lt;/OrderBy&gt;&lt;/Query&gt;&lt;RowLimit&gt;10&lt;/RowLimit&gt;&lt;/View&gt;</Property>
            </Parameter>
         </Parameters>
      </Method>
      <Method Id="11" ParentId="4" Name="GetByTitle">
         <Parameters>
            <Parameter Type="String">HDNL</Parameter>
         </Parameters>
      </Method>
      <Method Id="14" ParentId="11" Name="GetItems">
         <Parameters>
            <Parameter TypeId="{3d248d7b-fc86-40a3-aa97-02a75d69fb8a}">
               <Property Name="DatesInUtc" Type="Boolean">true</Property>
               <Property Name="FolderServerRelativeUrl" Type="Null" />
               <Property Name="ListItemCollectionPosition" Type="Null" />
               <Property Name="ViewXml" Type="String">&lt;View Scope="RecursiveAll"&gt;&lt;Query&gt;&lt;Where&gt;&lt;IsNotNull&gt;&lt;FieldRef Name="YYYY"/&gt;&lt;/IsNotNull&gt;&lt;/Where&gt;&lt;OrderBy&gt;&lt;FieldRef Name="IssueNumber" Ascending="False" /&gt;&lt;/OrderBy&gt;&lt;/Query&gt;&lt;RowLimit&gt;3&lt;/RowLimit&gt;&lt;/View&gt;</Property>
            </Parameter>
         </Parameters>
      </Method>
   </ObjectPaths>
</Request>

Yikes! If you want to build requests that way and scrape by sending a similar document, then you'd have to work out the format. I suspect it would be much easier here to use a headless browser, such as PhantomJS. There are PHP drivers for this, such as Spiderling. That will run the JavaScript for you (on a modern WebKit browser) and you'll be able to retrieve your data using an XPath or CSS selector.
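
As a rough sketch of that last step, assuming you already have the rendered HTML as a string (how you capture it from the headless browser is outside this sketch), the XPath from the question then works unchanged. The `extractProductNames` function and the sample markup are illustrative; the sample row reuses the product name from the question and the company cell is a placeholder.

```php
<?php
// Hypothetical helper: return the text of the first cell of every body row in table#example.
function extractProductNames(string $renderedHtml): array
{
    libxml_use_internal_errors(true); // keep libxml warnings out of the output
    $doc = new DOMDocument();
    $doc->loadHTML($renderedHtml);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $names = [];
    foreach ($xpath->query('//*[@id="example"]/tbody/tr/td[1]') as $cell) {
        $names[] = trim($cell->textContent);
    }
    return $names;
}

// Stand-in for the HTML a headless browser would return after running the page's JavaScript.
$renderedHtml = <<<'HTML'
<table id="example">
  <thead><tr><th>Product Name</th><th>Company Name</th></tr></thead>
  <tbody>
    <tr><td>4Life Transfer Factor Belle Vie</td><td>(placeholder company)</td></tr>
  </tbody>
</table>
HTML;

$productNames = extractProductNames($renderedHtml);
echo implode("\n", $productNames), "\n";
```

In a real run, `$renderedHtml` would come from the headless browser rather than from file_get_contents, since only the browser executes the scripts that fill the table.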

(Remember that data on other sites may be subject to copyright. You could go to the trouble of setting up a scraper only to find that you are the target of an IP block, or worse still, legal action. The rights and wrongs of scraping are rather complicated, but my brief advice is if you can scrape from a range of targets, it makes your project less susceptible to failure).

edited Feb 14, 2015 at 19:48

answered Feb 14, 2015 at 19:34


7 Comments

Kamran Ullah Over a year ago

Thanks for the explanation. I am trying to use the Spiderling PHP driver. I am currently on Windows 8.1 with WAMP. I have downloaded github.com/OpenBuildings/spiderling and put it into the root directory of WAMP. How can I use/configure Spiderling? Please give me a step-by-step procedure or point me to a basic tutorial.

halfer Over a year ago

See "A quick example" in that link, @KamranUllah. Tutorials don't come more basic than that! (If you have used Composer it should get PhantomJS for you, if not you will have to install that yourself).

halfer Over a year ago

Ah, I think that might have been mistaken - you have to install PhantomJS yourself. Just set it up so you can call phantomjs at the console. This should be as easy as running the installer and setting up the PATH, as is usual for Windows console programs.

Kamran Ullah Over a year ago

@halfer I have installed PhantomJS and tested it; it works fine for me. But I am a bit confused about where to place the "A quick example" code. I have created a file containing that code. Where should I put the file? The example does not say.

halfer Over a year ago

@Kamran: put it in a file in whatever directory you like, and call it with php just as you would have done with your own. You can cd to that folder and then run php myscript.php.

Kamran Ullah · Accepted Answer · 2015-02-26 10:45:44Z

1

I solved my issue by using the Diffbot Article API: https://www.diffbot.com.

answered Feb 26, 2015 at 10:45

