I'm trying to scrape a set of news articles using rvest and boilerpipeR. The code works fine most of the time, but it crashes for some specific URLs. I searched online and could not find anyone reporting anything similar.

require(rvest)
require(stringr)
require(boilerpipeR)
require(RCurl) # provides getURLContent()

# this is a problematic URL, its duplicates also generate fatal errors
url = "http://viagem.estadao.com.br/noticias/geral,museu-da-mafia-ganha-exposicao-permanente-da-serie-the-breaking-bad,10000018395"

content_html = getURLContent(url) # HTML source code in character type
article_text = ArticleExtractor(content_html) # returns 'NA' 

# the next line triggers the fatal error
encoded_exit = read_html(content_html, encoding = "UTF-8")

paragraph = html_nodes(encoded_exit,"p")
article_text = html_text(paragraph)
article_text = iconv(article_text,from="UTF-8", to="latin1")

This is not the only article for which ArticleExtractor() returns 'NA', and the code was built to treat that as a valid result. The whole snippet runs inside a tryCatch(), so regular errors should not be able to stop execution.

The main issue is that the entire R session just crashes and has to be reloaded, which prevents me from grabbing data and debugging it.

What could be causing this issue?
And how can I stop it from crashing the entire R session?

1 Answer

I had the same problem. Rscript crashes without any error message (session aborted), whether I use the 32-bit or the 64-bit build. What solved it for me was looking at the page I was scraping: if its HTML contains severe syntax errors, Rscript crashes, and the crash is reproducible. Check the page with https://validator.w3.org. In your case:

"Error: Start tag body seen but an element of the same type was already open."

From line 107, column 1; to line 107, column 25

crashed it. So your document had two opening <body> tags. A quick-and-dirty workaround for me was to first check whether read_html returned valid HTML content:

url = "http://www.blah.de"
page = read_html(url, encoding = "UTF-8")

# check HTML validity first to prevent a fatal crash
if (!grepl("<html.*<body.*</body>.*</html>", toString(page), ignore.case = TRUE)) {
  print("Skip this site")
}

# proceed with html_nodes(..) etc
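
The same kind of check can also be run on the raw HTML string before it is handed to read_html at all. The following is only a minimal sketch in base R, assuming the page source is already a single character string (as content_html is in the question). count_open_tags and has_single_body are hypothetical helper names, not part of rvest, and the idea that a duplicate <body> tag is what triggers the crash is taken from the validator output above, not from any documented behavior.

```r
# Hypothetical helper (not part of rvest): count the opening tags of one
# element in raw HTML text, ignoring case. Closing tags are not counted.
count_open_tags <- function(html, tag) {
  pattern <- paste0("<", tag, "(\\s[^>]*)?>")
  matches <- gregexpr(pattern, html, ignore.case = TRUE, perl = TRUE)[[1]]
  if (matches[1] == -1) 0L else length(matches)
}

# Hypothetical helper: TRUE only when the text has exactly one opening
# <html> tag and exactly one opening <body> tag.
has_single_body <- function(html) {
  count_open_tags(html, "html") == 1L && count_open_tags(html, "body") == 1L
}

# Made-up sample inputs for illustration
good <- "<html><head></head><body><p>ok</p></body></html>"
bad  <- "<html><body><body class='x'><p>broken</p></body></html>"

has_single_body(good)
has_single_body(bad)
```

In the question's code, this would mean calling read_html(content_html, encoding = "UTF-8") only when has_single_body(content_html) is TRUE and skipping the URL otherwise. It is a heuristic, not a validator: it only catches the duplicated-tag case described here.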

1 Comment

Thanks for the answer. I'm a year late and not even working with R anymore, but I was able to rebuild the environment, reproduce the error and check that your workaround actually works. Kudos to you.
