
I am currently working on a personal project using the chessdotcom Public API Package. I can already store the daily puzzle's PGN (Portable Game Notation) in a variable, which is the required input for creating a chess GIF (https://www.chess.com/gifs).

I wanted to use requests and an HTML parser to essentially fill out the form on the GIFs site and create a GIF from my Python script. I made a request to the GIF website, and response.text returns a huge HTML string (thousands of lines), which I am parsing with html5lib. I am currently getting "html5lib.html5parser.ParseError: Unexpected character after attribute value." and can't figure out where in this giant response the issue is. What are some tips/tricks for debugging this? Where do I even begin looking for the unexpected character?

import requests as req
import html5lib
from datetime import datetime
from chessdotcom import Client, get_player_profile, get_player_game_archives,get_player_stats, get_current_daily_puzzle, get_player_games_by_month


Client.request_config['headers']['User-Agent'] = 'PyChess Program for Automated YouTube Creation'


class ChessData:
    def __init__(self, name):
        self.player = get_player_profile(name)
        self.archives = get_player_game_archives(name)
        self.stats = get_player_stats(name)
        self.games = get_player_games_by_month(name, datetime.now().year, datetime.now().month)
        self.puzzle = get_current_daily_puzzle()
        self.html_parser = html5lib.HTMLParser(strict=True, namespaceHTMLElements=True, debug=True)

    def organize_puzzles(self, puzzles):
        #dict_keys(['title', 'url', 'publish_time', 'fen', 'pgn', 'image'])
        portableGameNotation = puzzles['pgn']
        html_data = req.get('https://www.chess.com/gifs')
        print(html_data.text)
        self.html_parser.parse(html_data.text.replace('&', '&amp;'))

    def get_puzzles(self):
        self.organize_puzzles(self.puzzle.json['puzzle'])

I initially had issues with a "Name Entity Expected. Got None" error, which I temporarily bypassed by replacing all instances of & with the &amp; entity.


Traceback (most recent call last):
  File "C:/ChessProgram/ChessTop.py", line 17, in <module>
    main()
  File "C:/ChessProgram/ChessTop.py", line 14, in main
    ChessResults.get_puzzles()
  File "C:\ChessProgram\ChessData.py", line 32, in get_puzzles
    self.organize_puzzles(self.puzzle.json['puzzle'])
  File "C:\ChessProgram\ChessData.py", line 29, in organize_puzzles
    self.html_parser.parse(html_data.text.replace('&', '&amp;'))
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 284, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 133, in _parse
    self.mainLoop()
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 216, in mainLoop
    self.parseError(new_token["data"], new_token.get("datavars", {}))
  File "C:\ChessProgram\lib\site-packages\html5lib\html5parser.py", line 321, in parseError
    raise ParseError(E[errorcode] % datavars)
html5lib.html5parser.ParseError: Unexpected character after attribute value.

I tried replacing the & with &amp; to fix the entity-name issue, and I manually searched through the HTML response for the different attributes, looking for anything out of place.

  • HTML may use & in many entities - i.e. &gt;, &lt;, &copy;, etc. - so replacing every & with &amp; may create wrong values. Commented Dec 9, 2023 at 1:12
  • I don't know why you parse this page. Normally you only need requests.post() to send data to the page - to simulate the filled form. Later you may need to parse the result page, but maybe it only needs standard string functions to find the link to the GIF - without parsing all the HTML. Commented Dec 9, 2023 at 1:18
  • I see another problem. The page uses cookies and sends a unique token in the form. You may need to use requests.Session and get the form page first to obtain the cookies. But it may have a more complex system to block scripts/bots, and then it may need Selenium to control a real web browser. Commented Dec 9, 2023 at 1:42
  • I created working code, but I used BeautifulSoup to get the token, and later I used a normal text.find() to get the URL of the image. Commented Dec 9, 2023 at 1:52
  • Use html5lib.HTMLParser() without parameters (or with strict=False) and it will work without .replace('&', '&amp;'). Commented Dec 9, 2023 at 1:56

1 Answer


Normally, to debug HTML I would try to split it into smaller pieces and test them. But with html5lib that may be a problem, because it may need the full HTML to parse. So you may need to write your own functions in the parser to display more information during parsing.
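One way to surface that information without splitting the HTML (a sketch on a hypothetical malformed snippet; html5lib's non-strict parser collects each parse error, together with its position, in the parser's `errors` attribute):

```python
import html5lib

# hypothetical snippet with a stray character right after a quoted attribute
# value - the same class of error as "Unexpected character after attribute value"
bad_html = '<!DOCTYPE html><p class="x"y>text</p>'

parser = html5lib.HTMLParser(strict=False)  # lenient: records errors instead of raising
parser.parse(bad_html)

# parser.errors is a list of ((line, col), errorcode, datavars) tuples,
# so each entry tells you where in the input the parser complained
for (line, col), errorcode, datavars in parser.errors:
    print(f'line {line}, col {col}: {errorcode}')
```

Running this against your full response.text should print a line/column for every parse error, so you know exactly where to look in the giant string.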

But if you use html5lib.HTMLParser() without parameters (or with strict=False), then it runs correctly even without .replace('&', '&amp;').
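A minimal demonstration of the difference (again on a hypothetical snippet with a stray character after a quoted attribute value):

```python
import html5lib
from html5lib.html5parser import ParseError

# hypothetical malformed markup
bad_html = '<!DOCTYPE html><p class="x"y>text</p>'

# strict mode raises on the first parse error
strict_error = None
try:
    html5lib.HTMLParser(strict=True).parse(bad_html)
except ParseError as err:
    strict_error = err
print('strict parser raised:', strict_error)

# default (non-strict) mode recovers and still returns a document tree
document = html5lib.HTMLParser().parse(bad_html)
print('lenient parse succeeded:', document is not None)
```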

Still, I wouldn't use html5lib for this, because I don't see any functions in it for searching elements in the HTML. You would need to write your own.

It is much simpler to do with BeautifulSoup or lxml (or other modules).


Another problem: the page uses cookies, and it has a hidden input with a token that it probably compares against the cookies (to generate the image). This needs requests.Session().

So I do the following:

  • create requests.Session()
  • use the session to get() the page with the form
  • use BeautifulSoup to find the hidden input with the token
  • use the session to post() all the data like a real form
  • use a standard text.find() to find the URL of the animated GIF
    (it has a unique address - so it is easy to find without BeautifulSoup)
  • use the session to get() the animated GIF and write it to a local file
    (this needs .content instead of .text, to work with bytes instead of a string)
  • (optional) use webbrowser to display the animated GIF's URL in the default browser
    (in case the local image viewer has problems displaying animated GIFs)

Full working code:

#import requests 
from requests import Session
from bs4 import BeautifulSoup 

#headers = {
#    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0' 
#}    

s = Session()
#s.headers.update(headers)

url = 'https://www.chess.com/gifs'

# --- get token ---

response = s.get(url)
html = response.text

#soup = BeautifulSoup(html, 'html.parser')
soup = BeautifulSoup(html, 'html5lib')

item = soup.find('input', {'id': 'animated_gif__token'})
#print(item)

token = item['value']
print('token:', token)

# --- send form, get response and search image ---

game =  "https://www.chess.com/live/game/3048628857"

payload = {
    "animated_gif[data]": game,
    "animated_gif[board_texture]": "green", # "brown",
    "animated_gif[piece_theme]": "neo",
    "animated_gif[_token]": token
}

response = s.post(url, data=payload)
html = response.text

start = html.find('https://images.chesscomfiles.com/uploads/game-gifs/')
end   = html.find('"', start)

image_url = html[start:end]

print(image_url)

# --- download file ---

response = s.get(image_url)

# write using `bytes` instead of `text`
with open('animation.gif', 'wb') as f:
    f.write(response.content)

# --- show image_url in browser ---

import webbrowser

webbrowser.open(image_url)

3 Comments

Thank you for your in-depth answer. I had to inspect the HTML page to understand how you knew the payload should reference "animated_gif". I was able to get the auto-creation of the daily puzzle thanks to your code. I wasn't aware of this hidden token. Is this always the case when the page uses cookies?
I used the built-in DevTools (tab: Network) in Firefox or Chrome to see what the browser sends to the server. It shows all the needed fields without checking the HTML :) Some pages send the token in cookies without a hidden field, some pages use hidden fields, and some pages don't use a token at all. BTW: I have only one video showing work with DevTools in Firefox, but without sound and without text - you have to watch the mouse, and it doesn't show that you need to press F12 to activate DevTools: DevTools to find JSON data in EpicGames
And I have code from my other Stack Overflow answers for different pages (using requests, BeautifulSoup, Selenium, or Scrapy): GitHub: furas / python-examples / __scraping__
