
Is it possible to parse data from a web HTML page using Windows batch?

Let's say I have a web page, www.domain.com/data/page/1, whose source HTML contains:

...
<div><a href="/post/view/664654"> ....
....

In this case I would need to get /post/view/664654 from the web page.

My idea is to loop through www.domain.com/data/page/1 ... # (up to some given number) and extract all the /post/view links. Then I would have a list of links, and from each of those links I would extract the href values (pointing to either images or videos).

So far I have only been successful in downloading an image or video when I know the exact link, using wget. But I don't know how (if it is possible at all) to parse HTML data.
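The loop described above could be sketched in a Unix-style shell (assuming wget, grep and sed are available on Windows, e.g. through Git Bash or Cygwin). The base URL and page count are placeholders taken from the question, not working values:

```shell
#!/bin/sh
# Rough sketch of the page loop; the URL and page count are placeholders
# from the question -- adjust both before running.

# Pull every /post/view/... href out of one saved HTML page.
extract_post_links() {
    grep -o 'href="/post/view/[^"]*"' "$1" | sed 's/^href="//; s/"$//'
}

# Download pages 1..N and list the post links found on each.
crawl() {
    base="http://www.domain.com/data/page"   # placeholder base URL
    for n in $(seq 1 "$1"); do
        wget -q -O "page$n.html" "$base/$n" || continue  # skip failed fetches
        extract_post_links "page$n.html"
    done
}

# crawl 10   # uncomment once the base URL and page count are set
```

Each printed /post/view/... path could then be fed back to wget to fetch the post page and repeat the same extraction for the image or video href.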

Edit:

<body>
<nav>
    <section>links I don't need</section>
</nav>
<article>
    <section>links I need</section>
</article>

  • Got an XPath or a DOM hierarchy we could follow? Does the div's parent have an ID? Or is it maybe the first <a> tag of the document? Something like this might be a good starting point. Commented Apr 6, 2016 at 17:16
  • I would need to extract href content from any a tag in the document. So I could build a list. Is that possible? Commented Apr 6, 2016 at 17:21

2 Answers


It's better to parse structured markup as a hierarchical object rather than scraping it as flat text. That way you aren't so dependent on the formatting of the data you're parsing (whether it's minified, whether spacing has changed, and so on).

The batch language isn't terribly well-suited to parsing markup languages like HTML, XML, or JSON. In such cases, it can be extremely helpful to use a hybrid script and borrow JScript or PowerShell methods to scrape the data you need. Here's an example demonstrating a batch + JScript hybrid script. Save it with a .bat extension and give it a run.

@if (@CodeSection == @Batch) @then
@echo off & setlocal

set "url=http://www.domain.com/data/page/1"

for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%url%"') do (
    rem // do something useful with %%I
    echo Link found: %%I
)

goto :EOF
@end // end batch / begin JScript hybrid code

// returns a DOM root object
function fetch(url) {
    var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
        DOM = WSH.CreateObject('htmlfile');

    XHR.open("GET", url, true);
    XHR.setRequestHeader('User-Agent', 'XMLHTTP/1.0');
    XHR.send('');
    while (XHR.readyState != 4) { WSH.Sleep(25); } // poll until the async request completes
    DOM.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
    DOM.write(XHR.responseText);
    return DOM;
}

var DOM = fetch(WSH.Arguments(0)),
    links = DOM.getElementsByTagName('a');

for (var i in links)
    if (links[i].href && /\/post\/view\//i.test(links[i].href))
        WSH.Echo(links[i].href);

6 Comments

Unfortunately it's not working as expected. The web page has ~30 links, like href="/post/view/1234#search=SearchString". The script extracts only 6, and all of them are wrong, for example: /post/view/141143#c63445.
Maybe the content of the page is different based on whether you're logged in or not? I didn't code cookie management or login session handling.
No difference between being logged in and not.
What if you take out the regex test for \/post\/view and just do if (links[i].href) WSH.Echo(links[i].href)? Or maybe it's a user agent thing, and the web server is degrading to a mobile view because the user agent is unrecognized? Try changing the user agent to a Firefox user agent string.
It would seem it does not extract links from the HTML <article> tag. No idea why. See my edit; the script extracts links I don't need (outside <article>).

If you just need to get /post/view/664654, you can use the grep command, e.g.

grep -o '/post/view/[^"]\+' *.html

For parsing more complex HTML, you can use HTML-XML-utils or pup.
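The question's edit shows that the wanted links sit inside <article> while the unwanted ones sit in <nav>, so with plain text tools you could also cut the page down to that section before grepping. This is only a sketch: it assumes the <article> tags sit on lines of their own (minified HTML would defeat it), and the sample page below is made up to mirror the edit:

```shell
#!/bin/sh
# Hypothetical page mirroring the question's edit: wanted links inside
# <article>, unwanted ones inside <nav>.
cat > page.html <<'EOF'
<nav><a href="/post/view/111#nav">skip</a></nav>
<article>
<a href="/post/view/664654">keep</a>
</article>
EOF

# Print only the <article>...</article> section, then extract the paths.
sed -n '/<article>/,/<\/article>/p' page.html | grep -o '/post/view/[^"]*'
# prints /post/view/664654
```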

