I'm having an issue with headless chromium-browser not creating the HTML files correctly. The only file that gets created is a single file named {}.html.

My domains.txt contains:

https://ibm.com/ 
https://www.linux.org/whats-new/

PS: I'm using Ubuntu 18.04 (64-bit).

The command I use is below:

cat domains.txt | xargs -I {} -P 4 sh -c timeout 25s chromium-browser --headless --no-sandbox --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537. 36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' --dump-dom https://{} 2> /dev/null > {}.html

This was taken from this link

1 Answer

The code:

cat domains.txt | xargs -I {} -P 4 sh -c timeout 25s chromium-browser --headless --no-sandbox --user-agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537. 36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36' --dump-dom https://{} 2> /dev/null > {}.html

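This lacks quotes around the argument to sh -c, so sh only receives the single word timeout as its script. Even with the quoting fixed, the command would still have xargs substitute each input line directly into the text of the sh -c script, which is a code injection vulnerability.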
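To see why the substitution matters, here is a minimal sketch with echo standing in for chromium-browser. The input line and the word INJECTED are made up for illustration; a real hostile line could carry any command.

```shell
# Harmless stand-in for a hostile line in domains.txt (hypothetical input).
line='example.com; echo INJECTED'

# Unsafe: xargs pastes the line into the script text, so sh parses
# "; echo INJECTED" as a second command and runs it.
unsafe=$(printf '%s\n' "$line" | xargs -I {} sh -c 'echo fetching {}')

# Safe: the line arrives as the positional parameter $1 and is only
# ever used as data. The extra "sh" argument fills $0.
safe=$(printf '%s\n' "$line" | xargs -I {} sh -c 'echo fetching "$1"' sh {})

printf 'unsafe:\n%s\n\nsafe:\n%s\n' "$unsafe" "$safe"
```

In the unsafe case the second echo runs as its own command; in the safe case the whole line is printed as plain text.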

The pipeline is better written as:

xargs -I {} -P 4 sh -c '
    timeout 25s chromium-browser \
        --headless --no-sandbox \
        --user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36" \
        --dump-dom \
        "https://$1" 2>/dev/null >"$1.html"' sh {} <domains.txt

... but note that this still tries to write to files with names like https://ibm.com/.html if domains.txt contains full URLs (that is, to files inside oddly named subdirectories), and it will try to fetch URLs like https://https://ibm.com/.

I think the intention is to keep only the actual domains, not full URLs, in the domains.txt file, i.e.

ibm.com
www.linux.org
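If the list already exists as full URLs, something like the following could reduce it to bare hostnames. This is only a sketch: to_hosts is a hypothetical helper name, and the sed expressions assume simple URLs without user info, and they leave any :port suffix in place.

```shell
# Hypothetical helper: reduce each input line to a bare hostname.
to_hosts() {
    sed -E \
        -e 's#^[[:space:]]+##' \
        -e 's#^[A-Za-z][A-Za-z0-9+.-]*://##' \
        -e 's#[/?[:space:]].*$##' \
        -e '/^$/d'
    # 1: drop leading whitespace
    # 2: drop a leading scheme such as https://
    # 3: drop everything from the first slash, question mark, or space
    # 4: drop lines that end up empty
}

# The two lines from the question, including the trailing space on the first.
printf '%s\n' 'https://ibm.com/ ' 'https://www.linux.org/whats-new/' | to_hosts
```

With a real file this would be run as to_hosts <domains.txt >hosts.txt, and the pipeline would then read hosts.txt instead.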

Personally, I would rather go for a simpler solution using curl.
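A curl-based version might look roughly like the sketch below. The function name fetch_all is hypothetical, and the input file is assumed to hold one bare hostname per line. Since curl is called directly rather than through sh -c, no script text is built from the input. Keep in mind that curl saves the HTML as served and does not execute JavaScript, so the result can differ from what --dump-dom produces.

```shell
# Hypothetical helper: fetch https://HOST for every HOST listed in the
# file given as $1, up to four at a time, saving each page as HOST.html.
fetch_all() {
    xargs -I {} -P 4 \
        curl --silent --location --max-time 25 \
             --output '{}.html' 'https://{}' <"$1"
}
```

It would be called as fetch_all domains.txt. The --max-time 25 option plays the role of timeout 25s, and --location follows redirects.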

  • Thanks. The issue is that curl doesn’t parse and render JavaScript. Commented Dec 10, 2019 at 0:24
