I'd like to retrieve data from https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1. I wget the page to a file. The data I seek is in the form of (samples):

https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep1/244094163573

How can I use awk, run from bash, to pull a list of all the filenames in this format from the HTML? There are 3 seasons, so something like setting

https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-

as the start delimiter, followed by 2 more / and then " as the end delimiter, should work.

I've tried for an hour but I can't work it out myself.

  • why do you require the use of awk? Commented Aug 16 at 3:30
  • and please show what you've tried Commented Aug 16 at 3:33
  • For such problems you should not only describe what you want but give an example of both the input and the output. In addition, the question title is rather useless. How is the question about HTML in any way, given that you download the input from an https:// URL? Commented Aug 16 at 3:38
  • I was thinking awk would be the best to achieve this; I don't mind what is used, as long as it's available in the Debian Trixie command line. There's not much point showing what I've tried because I can't find how to tell awk to use a start delimiter, then look for another 2 / delimiters, and use " as a final delimiter. Example input is the webpage shown, wget to a file. The output I have shown already as the wanted filename. Commented Aug 16 at 3:54
  • I did try this code using grep, but it didn't work: grep -oP "(?<=start_string).*?(?=end_string)" i.e. grep -oP "(?<=https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-).*?(?=")" Commented Aug 16 at 4:05
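As an aside, the grep attempt in the last comment fails mostly because of shell quoting: the " inside the double-quoted pattern ends the string early. A minimal sketch of a working variant, assuming the URLs sit inside double quotes in the saved page (the sample line below is made up for illustration):

```shell
# made-up sample line mimicking the JSON embedded in the page
line='"url":"https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987","image":"x"'

# single-quote the pattern so the shell leaves the " alone;
# [^"]* matches up to (but not including) the closing quote
printf '%s\n' "$line" |
  grep -oE 'https://www\.sbs\.com\.au/ondemand/tv-series/la-unidad/season-[^"]*'
```

The original lookbehind idea would also work with grep -P, provided the whole pattern is single-quoted so the embedded " survives.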

4 Answers


We can convert the HTML document into an XML document:

xmlstarlet format --html --recover 2>/dev/null

From this, we can get the value of the script node under /html/head whose type attribute is application/ld+json using

xmlstarlet select --template --value-of '/html/head/script[@type="application/ld+json"]'

This gives us a JSON document. We extract the URLs for each episode of each season with a JSON processor:

jq -r '.mainEntity.containsSeason[].episode[].url'

To only get season 1 (as an example), you would do either

jq -r '.mainEntity.containsSeason[] | select(.seasonNumber == 1).episode[].url'

or, if you trust the first season to always be the first one in that containsSeason array,

jq -r '.mainEntity.containsSeason[0].episode[].url'

All together (for all seasons):

curl -s 'https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1' |
xmlstarlet fo -H -R /dev/stdin 2>/dev/null |
xmlstarlet sel -t -v '/html/head/script[@type="application/ld+json"]' |
jq -r '.mainEntity.containsSeason[].episode[].url'

Output:

https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep2/1839026755988
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep3/1839026755989
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep4/1839026755990
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep5/1839026755992
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep6/1839026755993
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep1/2440941635730
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep2/2440941635731
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep3/2440941635732
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep4/2440941635733
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep5/2440941635819
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep6/2440941635820
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep1/2440941635824
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep2/2440941635826
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep3/2440941635827
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep4/2440941635831
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep5/2440941635834
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep6/2440941635837

This demonstrates how to utilise the structure of the given data to select the necessary bits. In this particular case, two structured document formats were involved: XML (converted from the original HTML to make parsing easier) and JSON (embedded within the XML).

Each structured document format requires its own parser: xmlstarlet for the XML (to extract the embedded JSON) and jq for the JSON (to extract the URLs). Each tool also has its own expression language: XPath for xmlstarlet, and jq's own expression syntax. This not only makes the parsing more robust, but also ensures that the resulting strings are properly decoded if needed.
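The decoding point matters in practice: JSON strings may contain escapes such as \u0026 for &, which jq decodes on output but a raw regex over the page source would leave in place. A toy illustration (the example.com URL is made up):

```shell
# jq decodes the \u0026 escape to a literal & on output
# prints: https://example.com/a?x=1&y=2
printf '%s' '{"url":"https://example.com/a?x=1\u0026y=2"}' | jq -r '.url'
```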

  • I can't upvote due to low reputation, but thank you. This works well. I'll try and get my head around what's going on here. I'm a mechanic who dabbles with Linux and the terminal and would like to learn more. Thank you for helping me out today. Commented Aug 16 at 7:17

Using perl:

$ export BASE_URL='https://www.sbs.com.au'
$ URL="$BASE_URL/ondemand/tv-series/la-unidad/season-1"

$ curl -s "$URL" | perl -lne '
    BEGIN {
      $/ = q(") # set perl's record separator, $/, to "
    };
    print if m=https:.*/la-unidad-s=' 
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep2/1839026755988
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep3/1839026755989
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep4/1839026755990
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep5/1839026755992
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep6/1839026755993
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep1/2440941635730
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep2/2440941635731
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep3/2440941635732
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep4/2440941635733
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep5/2440941635819
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep6/2440941635820
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep1/2440941635824
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep2/2440941635826
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep3/2440941635827
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep4/2440941635831
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep5/2440941635834
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep6/2440941635837

NOTE: As with anything that doesn't use a proper parser for HTML, XML, JSON, etc., extracting data with a simple regexp is fragile and could break whenever SBS makes even minor changes to their web site.


Old answer based on 'lynx -dump' or 'html2':

(I haven't deleted this because it's generically useful for people trying to extract links from actual HTML rather than from JSON embedded in a JavaScript function.)

Don't try to parse HTML with regexes alone; that is doomed to failure unless you're an expert in all things HTML, the HTML is extremely simple, and the web site never changes its format. Even then it will be fragile and prone to breaking. In short: just don't.

See also: Parsing Html The Cthulhu Way and Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms

You should use a language that has an HTML parsing library. Perl has several. Python does too. As do many other languages.

Alternatively, if you just want to extract a list of links, you could use the -dump option of text-mode web browsers like lynx or links.

e.g. first set up some variables for the URL:

$ BASE_URL='https://www.sbs.com.au'
$ URL="$BASE_URL/ondemand/tv-series/la-unidad/season-1"

Then fetch the URL and pipe the output into grep:

$ lynx -dump -listonly -nonumbers "$URL" |
    grep '/la-unidad-s[0-9]'
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep2/1839026755988
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep3/1839026755989
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep4/1839026755990
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep5/1839026755992
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep6/1839026755993

Another option is to use html2 from the xml2 package (which doesn't seem to have a home page any more, but is packaged for Debian) to convert the html to a line-oriented format.

This is more complicated than using lynx, but you get full access to each individual HTML element, not just the links, in a line-oriented format suitable for processing with text processing tools like sed, grep, and awk. And perl and python too, without needing their HTML parser libs. For example:

$ curl -s "$URL" |
    html2 2>/dev/null |
    sed -ne '/@href=.*\/la-unidad-s[0-9]/ {s:^.*/a/@href=::;p}'
/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep2/1839026755988
/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep3/1839026755989
/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep4/1839026755990
/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep5/1839026755992
/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep6/1839026755993

Note that, unlike lynx -dump, it doesn't prepend the base URL (https://www.sbs.com.au) to relative URLs in the HTML source; the URLs are printed exactly as they appear in the HTML. You can add that yourself with the previously defined $BASE_URL variable.

Or, if you export the BASE_URL variable so that it's in the environment and available to child processes (i.e. programs you run from your shell or shell script), you could do something like this using perl:

$ export BASE_URL='https://www.sbs.com.au'
$ URL="$BASE_URL/ondemand/tv-series/la-unidad/season-1"
$ curl -s "$URL" |
    html2 2>/dev/null |
    perl -lne '
      if (m:/\@href=(.*/la-unidad-s[0-9].*):) {
        print $ENV{BASE_URL} . $1;
      }'
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep2/1839026755988
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep3/1839026755989
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep4/1839026755990
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep5/1839026755992
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep6/1839026755993
  • Thank you for taking the time to explain this to me cas. I will install lynx and use your code to apply it, learn how it works, and try some examples out to gain a better understanding of it. I'm new to this and appreciate that you have given your time to help me out. Commented Aug 16 at 4:55
  • Unfortunately I'm not able to do what I need with the above scripts. They do not yield anything from series 2 or 3, even after changing the $URL to "*/season-2". Commented Aug 16 at 5:36
  • They're not "scripts". They're examples of how to do things. I don't dabble in magic code that you can just run without understanding. You're supposed to adapt the ideas and techniques to suit your changing needs. Commented Aug 16 at 5:51
  • I did, and unfortunately the only @href references are for season 1. I'll have to go with the dirty regex route. I did not know how to pipe from curl, so many thanks for that tip, and again, for all your time today. Commented Aug 16 at 6:26
  • I did give examples asking for both season 1 and season 2 urls, and also wrote that there were 3 seasons. Granted, I asked for help with HTML. I'm new to this, so forgive me if I didn't get the terminology exactly right. I didn't know there was a difference. Not everyone is an elite coder. Commented Aug 16 at 10:52

I had to go the dirty regex route because the site does not provide links for the other seasons in @href attributes. I also discovered that you can pipe a list of URLs to yt-dlp from stdin with the -a - switch:

URL=insert_sbs_url_here
curl -s "$URL" | awk 'BEGIN{RS="\",\"image\":\""; FS=",\"url\":\""} NF>1{print $NF}' | yt-dlp -a - -o "%(title)s.%(ext)s"
  • Neither the availability of @href tags ("I had to go the dirty regex route due to the site not showing links for other seasons with a @href tag.") nor the use of yt-dlp ("I also discovered that you can pipe to yt-dlp from STDIN with the -a - switch") is mentioned in the question, so it's hard to see how this answer is needed to answer the question that was asked, or how it would help others trying to find an answer to it. Adding an explanation to the answer (and probably the question) of what @href tags and yt-dlp have to do with the question would help. Commented Aug 16 at 12:45
  • Btw you should mention that this requires GNU awk for multi-char RS; it won't work with other awk variants. Commented Aug 16 at 12:46
  • I'm using Debian's version of awk, which Google tells me is based on mawk and "adheres more closely to the POSIX standard". It works fine with multi-character RS. Commented Aug 17 at 0:36
  • Newer versions of mawk 2 have pulled in a lot of the gawk source code and so support several of the gawk extensions. I don't know what Google is comparing it to with that statement, but multi-char RS is not part of POSIX and is not supported by most awks. Maybe that statement was comparing mawk 2 to mawk 1, which was a minimal-featured awk designed strictly for speed of execution. Commented Aug 17 at 10:24

Though the advice about not parsing HTML or XML with regexps is valid, the actual input anyone cares about is often a tiny subset of the document, and then you can do something cheap and cheerful like this, which just prints the string within quotes following every "url": key. It works with any awk, in any shell, on any Unix box:

$ awk -v RS='"' 'p2 == "url"; {p2=p; p=$0}' file
https://www.sbs.com.au/ondemand/
https://www.sbs.com.au/ondemand/tv-series/la-unidad
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep2/1839026755988
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep3/1839026755989
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep4/1839026755990
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep5/1839026755992
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep6/1839026755993
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep1/2440941635730
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep2/2440941635731
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep3/2440941635732
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep4/2440941635733
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep5/2440941635819
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep6/2440941635820
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep1/2440941635824
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep2/2440941635826
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep3/2440941635827
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep4/2440941635831
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep5/2440941635834
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep6/2440941635837
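To see why the one-liner works, here is the same code restated with comments and a made-up inline sample (the example.com URL is not from the real page). Because RS is a single character, this stays portable to any POSIX awk:

```shell
# RS='"' splits the input at every double quote; p holds the previous
# record and p2 the one before that. In "url":"...", the records are
# `url`, then `:`, then the URL value - so when p2 is exactly `url`,
# the current record is the quoted URL, and the bare condition prints it.
printf '%s' '{"url":"https://example.com/ep1","name":"x"}' |
  awk -v RS='"' 'p2 == "url"; {p2=p; p=$0}'
# prints: https://example.com/ep1
```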

From there it'd be a trivial tweak to the awk script to select and/or modify any of those URLs however you like before output, e.g. this might be what you want:

$ awk -v RS='"' -F'/' '(p2 == "url") && (NF > 7); {p2=p; p=$0}' file
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep1/1839026755987
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep2/1839026755988
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep3/1839026755989
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep4/1839026755990
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep5/1839026755992
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-1/la-unidad-s1-ep6/1839026755993
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep1/2440941635730
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep2/2440941635731
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep3/2440941635732
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep4/2440941635733
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep5/2440941635819
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-2/la-unidad-s2-ep6/2440941635820
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep1/2440941635824
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep2/2440941635826
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep3/2440941635827
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep4/2440941635831
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep5/2440941635834
https://www.sbs.com.au/ondemand/tv-series/la-unidad/season-3/la-unidad-s3-ep6/2440941635837
