I'm working on code to translate strings in an HTML file.
More specifically, my objective is to perform string replacement. The steps are: parsing the file, identifying the string in each line (if there is one), and finally replacing this string with its translated version, taken from a dictionary.
I got valuable help here on HTML parsing and on string replacement for each line.
To open the HTML file as text and sweep through it line by line, I took an example from here.
Using knowledge from both examples, I wrote the code below:
from bs4 import BeautifulSoup
from bs4 import SoupStrainer
from html_dictionary import port_eng

def str_replace_port_eng(file_name, tag_name):
    with open(file_name, 'rb') as src:
        doc = src.read()
        soup = BeautifulSoup(doc, 'html.parser')
    src.close()
    only_tag_name = soup.find_all(str(tag_name))
    with open("new_file.html", "w") as outf:
        for line in soup:
            for html_line in range(len(only_tag_name)):
                pt_word = str(only_tag_name[html_line].text).strip()
                pt_word = pt_word.strip('+')
                pt_word = pt_word.strip(' ')
                if pt_word != "":
                    en_word = port_eng[pt_word]
                    new_line = (str(only_tag_name[html_line]).replace(pt_word, en_word))
                    outf.writelines(new_line)
                else:
                    en_word = pt_word
                    new_line = (str(only_tag_name[html_line]).replace(pt_word, en_word))
                    outf.writelines(new_line)

newpg = str_replace_port_eng("input_test.html", "a")
Input file (example):
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--[if lt IE 7 ]> <html class="ie6" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 9 ]> <html class="ie9" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html xmlns="http://www.w3.org/1999/xhtml"> <!--<![endif]-->
<body>
<div style="padding-top:0px;height:100%;" id="wrap">
<div style="padding-bottom:0px;" id="header" class="ie-dropdown-fix">
<!-- /// HEADER //////////////////////////////////////////////////////////////////////////////////////////////////////////// -->
<div style="margin-left:10px;" class="row">
<div class="span3">
<!-- // Logo //
<a href="index.html" id="logo"><img src="_layout/images/logo.png" alt="" class="responsive-img" /></a>
-->
</div><!-- end .span3 -->
<div style="color:#00233C;width:1100px;background-color:#FFFFFF;margin-right:0px" class="span6">
<!-- // Dropdown Menu // -->
<ul style="color:#00233C;margin-left:10px;width:1100px;" id="dropdown-menu" class="fixed">
<li class="current"><a style="color:#00233C;" href="..."><i class="icon icon-home"></i> Início</a></li>
<li><a style="color:#00233C;margin-left:10px;" href="#"><i class="icon icon-question-sign"></i> Ajuda <small class="mute">+</small></a>
<ul class="sub-menu">
<li><a href="#">FAQ <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Classificação da Informação</a></li>
<li><a href="..." target="_blank">Reúso de Ativos Digitais</a></li>
<li><a href="..." target="_blank">Biblioteca</a></li>
</ul>
</li>
<li><a href="#">Alerta <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Criar Alerta</a></li>
<li><a href="..." target="_blank">Criar Alerta Múltiplo</a></li>
</ul>
</li>
<li><a href="..." target="_blank">Aviso ou Notícia</a></li>
<li><a href="#">Busca <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Busca Simples</a></li>
<li><a href="..." target="_blank">Busca Avançada</a></li>
</ul>
</li>
<li><a href="#">Documentos <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Carregar Novo Documento</a></li>
<li><a href="..." target="_blank">Editar Documento</a></li>
</ul>
</li>
</div>
</div>
</body>
</html>
Expected Output:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--[if lt IE 7 ]> <html class="ie6" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 9 ]> <html class="ie9" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html xmlns="http://www.w3.org/1999/xhtml"> <!--<![endif]-->
<body>
<div style="padding-top:0px;height:100%;" id="wrap">
<div style="padding-bottom:0px;" id="header" class="ie-dropdown-fix">
<!-- /// HEADER //////////////////////////////////////////////////////////////////////////////////////////////////////////// -->
<div style="margin-left:10px;" class="row">
<div class="span3">
<!-- // Logo //
<a href="index.html" id="logo"><img src="_layout/images/logo.png" alt="" class="responsive-img" /></a>
-->
</div><!-- end .span3 -->
<div style="color:#00233C;width:1100px;background-color:#FFFFFF;margin-right:0px" class="span6">
<!-- // Dropdown Menu // -->
<ul style="color:#00233C;margin-left:10px;width:1100px;" id="dropdown-menu" class="fixed">
<li class="current"><a style="color:#00233C;" href="..."><i class="icon icon-home"></i> Start</a></li>
<li><a style="color:#00233C;margin-left:10px;" href="#"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a>
<ul class="sub-menu">
<li><a href="#">FAQ <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Information Security</a></li>
<li><a href="..." target="_blank">Digital Asset Reuse</a></li>
<li><a href="..." target="_blank">Library</a></li>
</ul>
</li>
<li><a href="#">Alerta <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Create Alert</a></li>
<li><a href="..." target="_blank">Create Multiple Alert</a></li>
</ul>
</li>
<li><a href="..." target="_blank">News</a></li>
<li><a href="#">Busca <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Simple Search</a></li>
<li><a href="..." target="_blank">Advanced Search</a></li>
</ul>
</li>
<li><a href="#">Documents <small class="mute">+</small></a>
<ul>
<li><a href="..." target="_blank">Load New Document</a></li>
<li><a href="..." target="_blank">Edit Document</a></li>
</ul>
</li>
</div>
</div>
</body>
</html>
Actual Output:
<a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i> Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i> Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a>
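The actual output contains only the translated <a> tags, repeated several times, instead of the whole document. I'm trying to find the error in the code and how to fix it.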
Thanks in advance,
Tiago
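For reference, a minimal sketch of one possible direction, not a definitive fix. The posted function writes only the tags returned by find_all, and it does so once for every top-level node of soup, which appears to explain both the missing markup and the repetition. The sketch below instead edits the text nodes inside the parsed tree and serialises the whole tree once. The function name translate_tags and the small port_eng dictionary are assumptions made for illustration; the real dictionary lives in html_dictionary. Unlike the posted code, words missing from the dictionary are left unchanged rather than raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for html_dictionary.port_eng (assumption for this sketch).
port_eng = {"Início": "Home", "Ajuda": "Help", "Biblioteca": "Library"}


def translate_tags(html, tag_name, dictionary):
    """Return the full document with the text of each <tag_name> translated in place."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(tag_name):
        # Look only at the tag's direct text children, so nested elements
        # such as <small class="mute">+</small> are left untouched.
        for node in tag.find_all(string=True, recursive=False):
            word = node.strip()
            if word in dictionary:
                # Swap the text node for a translated copy, keeping its whitespace.
                node.replace_with(node.replace(word, dictionary[word]))
    # Serialise the whole tree, not just the matched tags.
    return str(soup)


sample = (
    '<ul><li><a href="#"><i class="icon"></i> Ajuda '
    '<small class="mute">+</small></a></li>'
    '<li><a href="..." target="_blank">Biblioteca</a></li></ul>'
)
print(translate_tags(sample, "a", port_eng))
```

To work on files, the HTML could be read with open(file_name, encoding="utf-8") and the returned string written to the output file in a single write call.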
with open(file_name, 'rb') as src:, iterates over all your files of interest. However, after that loop ends, you end up with only the soup of your last file. Was this intentional? Because, if it's not, then you might want to bring the rest of your code under that loop.
soup obj. I saw this error after removing the 1st for from the code; the result was the translated strings as a single line.
soup obj from the html and iterate over it. In the end I need the whole HTML, just with the strings replaced. The output file, as given by the code, begins with the 1st line that has a replaced string, leaving out the lines prior to it. Is there any way to iterate over the whole HTML and replace the strings? I mean, not iterate over soup but edit it, using the (let's say) line of the HTML as a reference?
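The line-by-line idea in the last comment could be sketched roughly as follows, using only the standard library instead of BeautifulSoup. This is an assumption-laden illustration: translate_line, translate_lines and the sample port_eng entries are hypothetical names, and a regular expression is a fragile way to read HTML. It only sees text that sits between a '>' and the next '<' on the same line, so text spanning several lines is missed and markup inside comments is treated like any other line.

```python
import re

# Hypothetical stand-in for html_dictionary.port_eng (assumption for this sketch).
port_eng = {"Início": "Home", "Busca Simples": "Simple Search"}

# Text that sits between a '>' and the next '<' on a single line.
TEXT_BETWEEN_TAGS = re.compile(r">([^<>]+)<")


def translate_line(line, dictionary):
    """Translate known strings in one line of HTML, leaving the markup as it is."""
    def swap(match):
        text = match.group(1)
        word = text.strip()
        if word in dictionary:
            # Replace only the word itself so surrounding whitespace survives.
            text = text.replace(word, dictionary[word])
        return ">" + text + "<"

    return TEXT_BETWEEN_TAGS.sub(swap, line)


def translate_lines(lines, dictionary):
    """Apply translate_line to every line, so untouched lines are kept too."""
    return [translate_line(line, dictionary) for line in lines]


sample = [
    "<body>",
    '<li><a href="..." target="_blank">Busca Simples</a></li>',
    "</body>",
]
print("\n".join(translate_lines(sample, port_eng)))
```

Because every input line is written back, the lines before the first translated string are preserved, which is the behaviour asked for above.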