I'm working on a series of scripts that pull URLs from a database and use the textstat package to calculate a set of predefined readability scores for each page. The function below takes a URL (from a CouchDB document), calculates the readability scores, and then saves the scores back to the same CouchDB document.
The issue I'm having is with error handling. As an example, the Flesch Reading Ease score calculation requires a count of the total number of sentences on the page. If this returns as zero, an exception is thrown. Is there a way to catch this exception, save a note of the exception in the database, and move on to the next URL in the list? Can I do this in the function below (preferred), or will I need to edit the package itself?
I know variations of this question have been asked before. If you know of one that might answer my question, please point me in that direction. My search has been fruitless thus far. Thanks in advance.
def get_readability_data(db, url, doc_id, rank, index):
    readability_data = {}
    readability_data['url'] = url
    readability_data['rank'] = rank
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    headers = {'User-Agent': user_agent}
    try:
        req = urllib.request.Request(url, headers=headers)
        response = urllib.request.urlopen(req)
        content = response.read()
        readable_article = Document(content).summary()
        soup = BeautifulSoup(readable_article, "lxml")
        text = soup.body.get_text()
        try:
            readability_data['flesch_reading_ease'] = textstat.flesch_reading_ease(text)
            readability_data['smog_index'] = textstat.smog_index(text)
            readability_data['flesch_kincaid_grade'] = textstat.flesch_kincaid_grade(text)
            readability_data['coleman_liau'] = textstat.coleman_liau_index(text)
            readability_data['automated_readability_index'] = textstat.automated_readability_index(text)
            readability_data['dale_chall_score'] = textstat.dale_chall_readability_score(text)
            readability_data['linear_write_formula'] = textstat.linsear_write_formula(text)
            readability_data['gunning_fog'] = textstat.gunning_fog(text)
            readability_data['total_words'] = textstat.lexicon_count(text)
            readability_data['difficult_words'] = textstat.difficult_words(text)
            readability_data['syllables'] = textstat.syllable_count(text)
            readability_data['sentences'] = textstat.sentence_count(text)
            readability_data['readability_consensus'] = textstat.text_standard(text)
            readability_data['readability_scores_date'] = time.strftime("%a %b %d %H:%M:%S %Y")
            # use the doc_id to make sure we're saving this in the appropriate place
            readability = json.dumps(readability_data, sort_keys=True, indent=4 * ' ')
            doc = db.get(doc_id)
            data = json.loads(readability)
            doc['search_details']['search_details'][index]['readability'] = data
            # print(doc['search_details']['search_details'][index])
            db.save(doc)
            time.sleep(.5)
        except Exception as e:  # catch *all* exceptions
            write_to_page("<p>---ERROR---: %s</p>" % e)
    except urllib.error.HTTPError as err:
        print(err.code)
This is the error I receive:
Error(ASL): Sentence Count is Zero, Cannot Divide
Error(ASyPW): Number of words are zero, cannot divide
Traceback (most recent call last):
  File "new_get_readability.py", line 114, in get_readability_data
    readability_data['flesch_reading_ease'] = textstat.flesch_reading_ease(text)
  File "/Users/jrs/anaconda/lib/python3.5/site-packages/textstat/textstat.py", line 118, in flesch_reading_ease
    FRE = 206.835 - float(1.015 * ASL) - float(84.6 * ASW)
TypeError: unsupported operand type(s) for *: 'float' and 'NoneType'
This is the code that calls the function:
if __name__ == '__main__':
    db = connect_to_db(parse_args())
    print("~~~~~~~~~~" + " GETTING IDs " + "~~~~~~~~~~")
    ids = get_ids(db)
    for i in ids:
        details = get_urls(db, i)
        for d in details:
            get_readability_data(db, d['url'], d['id'], d['rank'], d['index'])
I'm already using try/except around the textstat calls, so I'm having a hard time understanding what the problem is.
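For reference, this is the kind of catch-and-record behaviour I'm after. It's a self-contained sketch, not my real code: `score_text`, `process`, and the in-memory `results` dict are stand-ins for the actual textstat calls and the CouchDB save.

```python
# Minimal sketch of the pattern: if scoring one page fails, record a note
# of the error for that document and continue with the next URL.

def score_text(text):
    # Stand-in for a textstat-style calculation that divides by the
    # sentence count and fails when it is zero.
    sentences = text.count('.')
    if sentences == 0:
        raise ValueError("Sentence Count is Zero, Cannot Divide")
    return len(text.split()) / sentences

def process(docs):
    # docs maps a document id to its page text; results stands in for
    # what would be written back to the database.
    results = {}
    for doc_id, text in docs.items():
        try:
            results[doc_id] = {'score': score_text(text)}
        except Exception as e:
            # save a note of the exception instead of crashing the run
            results[doc_id] = {'error': str(e)}
        # either way, the loop moves on to the next URL
    return results
```

The point is that the try/except sits inside the loop body (or inside the per-URL function), so one bad page only marks that one document and never stops the batch.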