
Is it possible to parse data from a web HTML page using Windows batch?

Let's say I have a web page, www.domain.com/data/page/1, whose source HTML contains:

...
<div><a href="/post/view/664654"> ....
....

In this case I would need to get /post/view/664654 from the web page.

My idea is to loop through www.domain.com/data/page/1 ... # (up to some given number) and extract all the /post/view links. Then I would have a list of links, and from each of those links I would extract the href values (pointing to either images or videos).

So far I have only been successful in downloading an image or video when I know the exact link, using wget. But I don't know how (if it is possible at all) to parse HTML data.
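The loop described above could be sketched in a Unix-style shell (assuming wget, grep and sed are available on Windows, e.g. through Git Bash or Cygwin). The base URL and page count are placeholders taken from the question, not working values:

```shell
#!/bin/sh
# Rough sketch of the page loop; the URL and page count are placeholders
# from the question -- adjust both before running.

# Pull every /post/view/... href out of one saved HTML page.
extract_post_links() {
    grep -o 'href="/post/view/[^"]*"' "$1" | sed 's/^href="//; s/"$//'
}

# Download pages 1..N and list the post links found on each.
crawl() {
    base="http://www.domain.com/data/page"   # placeholder base URL
    for n in $(seq 1 "$1"); do
        wget -q -O "page$n.html" "$base/$n" || continue  # skip failed fetches
        extract_post_links "page$n.html"
    done
}

# crawl 10   # uncomment once the base URL and page count are set
```

Each printed /post/view/... path could then be fed back to wget to fetch the post page and repeat the same extraction for the image or video href.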

Edit:

<body>
<nav>
    <section>links I don't need</section>
</nav>
<article>
    <section>links I need</section>
</article>

  • Got an XPath or a DOM hierarchy we could follow? Does the div's parent have an ID? Or is it maybe the first <a> tag of the document? Something like this might be a good starting point. Commented Apr 6, 2016 at 17:16
  • I would need to extract href content from any a tag in the document. So I could build a list. Is that possible? Commented Apr 6, 2016 at 17:21

2 Answers


It's better to parse structured markup as a hierarchical object rather than scraping it as flat text. That way you aren't so dependent on the formatting of the data you're parsing (whether it's minified, whether spacing has changed, and so on).

The batch language isn't terribly well-suited to parsing markup languages like HTML, XML, or JSON. In such cases, it can be extremely helpful to use a hybrid script and borrow JScript or PowerShell methods to scrape the data you need. Here's an example demonstrating a batch + JScript hybrid script. Save it with a .bat extension and give it a run.

@if (@CodeSection == @Batch) @then
@echo off & setlocal

set "url=http://www.domain.com/data/page/1"

for /f "delims=" %%I in ('cscript /nologo /e:JScript "%~f0" "%url%"') do (
    rem // do something useful with %%I
    echo Link found: %%I
)

goto :EOF
@end // end batch / begin JScript hybrid code

// returns a DOM root object
function fetch(url) {
    var XHR = WSH.CreateObject("Microsoft.XMLHTTP"),
        DOM = WSH.CreateObject('htmlfile');

    XHR.open("GET", url, true);
    XHR.setRequestHeader('User-Agent', 'XMLHTTP/1.0');
    XHR.send('');
    while (XHR.readyState != 4) { WSH.Sleep(25); } // poll until the async request completes
    DOM.write('<meta http-equiv="x-ua-compatible" content="IE=9" />');
    DOM.write(XHR.responseText);
    return DOM;
}

var DOM = fetch(WSH.Arguments(0)),
    links = DOM.getElementsByTagName('a');

for (var i in links)
    if (links[i].href && /\/post\/view\//i.test(links[i].href))
        WSH.Echo(links[i].href);

6 Comments

Unfortunately it's not working as expected. The web page has ~30 links, like href="/post/view/1234#search=SearchString". The script extracts only 6, and all of them are wrong, for example: /post/view/141143#c63445.
Maybe the content of the page is different based on whether you're logged in or not? I didn't code cookie management or login session handling.
No difference between being logged in and not.
What if you take out the regex test for \/post\/view and just do if (links[i].href) WSH.Echo(links[i].href)? Or maybe it's a user agent thing, and the web server is degrading to a mobile view because the user agent is unrecognized? Try changing the user agent to a Firefox user agent string.
It would seem it does not extract links from the HTML <article> tag. No idea why. See my edit; the script extracts links I don't need (outside <article>).

If you just need to get /post/view/664654, you can use the grep command, e.g.

grep -o '/post/view/[^"]\+' *.html

For parsing more complex HTML, you can use HTML-XML-utils or pup.
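The question's edit shows that the wanted links sit inside <article> while the unwanted ones sit in <nav>, so with plain text tools you could also cut the page down to that section before grepping. This is only a sketch: it assumes the <article> tags sit on lines of their own (minified HTML would defeat it), and the sample page below is made up to mirror the edit:

```shell
#!/bin/sh
# Hypothetical page mirroring the question's edit: wanted links inside
# <article>, unwanted ones inside <nav>.
cat > page.html <<'EOF'
<nav><a href="/post/view/111#nav">skip</a></nav>
<article>
<a href="/post/view/664654">keep</a>
</article>
EOF

# Print only the <article>...</article> section, then extract the paths.
sed -n '/<article>/,/<\/article>/p' page.html | grep -o '/post/view/[^"]*'
# prints /post/view/664654
```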

