How to replace HTML codes in an HTML file using Python?

Question

I'm trying to replace all HTML codes (such as &#245;) in my HTML file with a for loop (not sure if this is the easiest approach) without changing the formatting of the original file. When I run the code below, the codes are not replaced. Does anyone know what could be wrong?

import re
tex=open('ALICE.per-txt.txt', 'r')

tex=tex.read()




for i in tex:
  if i =='&#245;':
      i=='õ'
  elif i == '&#231;':
      i=='ç'



with open('Alice1.replaced.txt', "w") as f:
    f.write(tex)
    f.close()

With for i in tex you iterate over single characters, but '&#245;' has 6 characters, so the comparison will never be equal. And you never change tex. You change only i and overwrite the value of i in each loop.
– Matthias, Commented Feb 1, 2021 at 14:58
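
To illustrate the comment above, here is a minimal sketch. The sample string is made up for the example and is not taken from the file in the question. It shows that character-by-character iteration can never match a six-character code, and that a manual fix would need str.replace, which returns a new string.

```python
# Hypothetical sample text containing two numeric character references.
text = "cora&#231;&#245;es"

# Iterating over a string yields one character at a time,
# so no item can ever equal the six-character string '&#245;'.
assert all(len(ch) == 1 for ch in text)
assert "&#245;" not in list(text)

# Strings are immutable and `==` only compares; to change the text,
# build a new string and bind it to a name.
replaced = text.replace("&#245;", "õ").replace("&#231;", "ç")
print(replaced)  # corações
```

Chaining replace calls works for a handful of known codes but has to be extended by hand for every new one.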

Matthias · Accepted Answer · 2021-02-01 15:03:07Z

You can use html.unescape.

>>> import html
>>> html.unescape('&#245;')
'õ'
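
As a small sketch of what this covers: html.unescape is not limited to decimal codes. It also decodes hexadecimal and named character references. The sample strings below are assumptions chosen for illustration.

```python
import html

# Decimal, hexadecimal, and named references all decode to the same character.
decoded = [html.unescape(s) for s in ("&#245;", "&#xf5;", "&otilde;")]
print(decoded)  # ['õ', 'õ', 'õ']

# Text around the references is left unchanged.
sample = html.unescape("cora&#231;&#245;es e mais")
print(sample)  # corações e mais
```

Note that html.unescape also turns &lt;, &gt; and &amp; into <, > and &, which may matter if the file contains real HTML markup rather than plain text.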

With your code:

import html

with open('ALICE.per-txt.txt', 'r') as f:
    html_text = f.read()

html_text = html.unescape(html_text)

with open('ALICE.per-txt.txt', 'w') as f:
    f.write(html_text)

Please note that I opened the files with a with statement. This takes care of closing the file after the with block - something you forgot to do when reading the file.
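
If the original file should stay untouched, as in the question where the result goes to Alice1.replaced.txt, a variant along these lines might work. This is only a sketch: the helper name unescape_file is hypothetical, and UTF-8 is an assumed encoding that may need to be changed to match the actual file.

```python
import html

def unescape_file(src, dst, encoding="utf-8"):
    """Read src, decode HTML character references, and write the result to dst."""
    # newline="" disables newline translation, so line endings are kept as-is.
    with open(src, "r", encoding=encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=encoding, newline="") as f:
        f.write(html.unescape(text))

# Example call with the file names from the question (assumed to be UTF-8):
# unescape_file('ALICE.per-txt.txt', 'Alice1.replaced.txt')
```

Passing an explicit encoding avoids depending on the platform default, which matters once the output contains characters such as õ and ç.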

