I'm working on code to translate strings in an HTML file.

More specifically, my objective is to perform string replacement. The steps are: parse the file, identify the string in each line (if there is one), and replace that string with its translated version, taken from a dictionary.
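
The lookup step described above can be sketched in isolation. This is only an illustration: the three dictionary entries are a hypothetical excerpt (the real `port_eng` is not shown in the post), and `normalize` and `translate` are made-up helper names. Unlike a direct `port_eng[key]` lookup, this version leaves text unchanged when no entry exists instead of raising `KeyError`.

```python
# Hypothetical excerpt; the real port_eng lives in html_dictionary.py
port_eng = {
    "Início": "Home",
    "Ajuda": "Help",
    "Busca Simples": "Simple Search",
}


def normalize(text):
    """Strip surrounding whitespace and the '+' submenu marker from tag text."""
    return text.strip().strip("+").strip()


def translate(text, dictionary):
    """Return text with its label translated, or unchanged when no entry exists."""
    key = normalize(text)
    if key in dictionary:
        return text.replace(key, dictionary[key])
    return text
```

Calling `translate("  Ajuda +", port_eng)` keeps the surrounding whitespace and the `+` marker and swaps only the label.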

I got valuable help here, on html parsing and string replacement for each line.

To open the HTML file as text and go through it line by line, I took an example here.

Using both examples, I wrote the code below:

from bs4 import BeautifulSoup
from bs4 import SoupStrainer
from html_dictionary import port_eng

def str_replace_port_eng(file_name, tag_name):

    with open(file_name, 'rb') as src:
        doc = src.read()
        soup = BeautifulSoup(doc, 'html.parser')
        src.close()

    only_tag_name = soup.find_all(str(tag_name))

    with open("new_file.html", "w") as outf:
        for line in soup:
            for html_line in range(len(only_tag_name)):
                pt_word = str(only_tag_name[html_line].text).strip()
                pt_word = pt_word.strip('+')
                pt_word = pt_word.strip(' ')

                if pt_word != "":
                    en_word = port_eng[pt_word]
                    new_line = (str(only_tag_name[html_line]).replace(pt_word, en_word))
                    outf.writelines(new_line)
                else:
                    en_word = pt_word
                    new_line = (str(only_tag_name[html_line]).replace(pt_word, en_word))
                    outf.writelines(new_line)

newpg = str_replace_port_eng("input_test.html", "a")

Input file (example):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--[if lt IE 7 ]> <html class="ie6" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 7 ]>    <html class="ie7" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 8 ]>    <html class="ie8" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 9 ]>    <html class="ie9" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html xmlns="http://www.w3.org/1999/xhtml"> <!--<![endif]-->

<body>

	<div style="padding-top:0px;height:100%;" id="wrap">
	
		<div style="padding-bottom:0px;" id="header" class="ie-dropdown-fix">
		
		<!-- /// HEADER  //////////////////////////////////////////////////////////////////////////////////////////////////////////// -->

			<div style="margin-left:10px;" class="row">
				<div class="span3">
				
					<!-- // Logo // 
					<a href="index.html" id="logo"><img src="_layout/images/logo.png" alt="" class="responsive-img" /></a>
					-->
					
				</div><!-- end .span3 -->
				<div  style="color:#00233C;width:1100px;background-color:#FFFFFF;margin-right:0px" class="span6">
				
					<!-- // Dropdown Menu // -->
					<ul style="color:#00233C;margin-left:10px;width:1100px;" id="dropdown-menu" class="fixed">
						<li class="current"><a  style="color:#00233C;" href="..."><i class="icon icon-home"></i>  Início</a></li>
						<li><a  style="color:#00233C;margin-left:10px;" href="#"><i class="icon icon-question-sign"></i>  Ajuda <small class="mute">+</small></a>
							<ul class="sub-menu">
								<li><a href="#">FAQ <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Classificação da Informação</a></li>	
										<li><a href="..." target="_blank">Reúso de Ativos Digitais</a></li>												
										<li><a href="..." target="_blank">Biblioteca</a></li>
									</ul>
								</li>							
								<li><a href="#">Alerta <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Criar Alerta</a></li>	
										<li><a href="..." target="_blank">Criar Alerta Múltiplo</a></li>												
									</ul>
								</li>
								<li><a href="..." target="_blank">Aviso ou Notícia</a></li>
								<li><a href="#">Busca <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Busca Simples</a></li>
										<li><a href="..." target="_blank">Busca Avançada</a></li>									
									</ul>
								</li>
								<li><a href="#">Documentos <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Carregar Novo Documento</a></li>
										<li><a href="..." target="_blank">Editar Documento</a></li>													
									</ul>
								</li>
		</div>
	</div>
</body>
</html>

Expected Output:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<!--[if lt IE 7 ]> <html class="ie6" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 7 ]>    <html class="ie7" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 8 ]>    <html class="ie8" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if IE 9 ]>    <html class="ie9" xmlns="http://www.w3.org/1999/xhtml"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html xmlns="http://www.w3.org/1999/xhtml"> <!--<![endif]-->
    <body>

	<div style="padding-top:0px;height:100%;" id="wrap">
	
		<div style="padding-bottom:0px;" id="header" class="ie-dropdown-fix">
		
		<!-- /// HEADER  //////////////////////////////////////////////////////////////////////////////////////////////////////////// -->

			<div style="margin-left:10px;" class="row">
				<div class="span3">
				
					<!-- // Logo // 
					<a href="index.html" id="logo"><img src="_layout/images/logo.png" alt="" class="responsive-img" /></a>
					-->
					
				</div><!-- end .span3 -->
				<div  style="color:#00233C;width:1100px;background-color:#FFFFFF;margin-right:0px" class="span6">
				
					<!-- // Dropdown Menu // -->
					<ul style="color:#00233C;margin-left:10px;width:1100px;" id="dropdown-menu" class="fixed">
						<li class="current"><a  style="color:#00233C;" href="..."><i class="icon icon-home"></i>  Start</a></li>
						<li><a  style="color:#00233C;margin-left:10px;" href="#"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a>
							<ul class="sub-menu">
								<li><a href="#">FAQ <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Information Security</a></li>	
										<li><a href="..." target="_blank">Digital Asset Reuse</a></li>												
										<li><a href="..." target="_blank">Library</a></li>
									</ul>
								</li>							
								<li><a href="#">Alerta <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Create Alert</a></li>	
										<li><a href="..." target="_blank">Create Multiple Alert</a></li>												
									</ul>
								</li>
								<li><a href="..." target="_blank">News</a></li>
								<li><a href="#">Busca <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Simple Search</a></li>
										<li><a href="..." target="_blank">Advanced Search</a></li>									
									</ul>
								</li>
								<li><a href="#">Documents <small class="mute">+</small></a>
									<ul>
										<li><a href="..." target="_blank">Load New Document</a></li>
										<li><a href="..." target="_blank">Edit Document</a></li>													
									</ul>
								</li>
		</div>
	</div>
</body>
</html>

Actual Output:

<a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." 
style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a><a href="..." style="color:#00233C;"><i class="icon icon-home"></i>  Home</a><a href="#" style="color:#00233C;margin-left:10px;"><i class="icon icon-question-sign"></i>  Help <small class="mute">+</small></a><a href="#">FAQ <small class="mute">+</small></a><a href="..." target="_blank">Information Security</a><a href="..." target="_blank">Reuse of Digital Assets</a><a href="..." target="_blank">Library</a><a href="#">Alert <small class="mute">+</small></a><a href="..." target="_blank">Create Alert</a><a href="..." target="_blank">Create Multiple Alert</a><a href="..." target="_blank">News</a><a href="#">Search <small class="mute">+</small></a><a href="..." target="_blank">Simple Search</a><a href="..." target="_blank">Advanced Search</a><a href="#">Documents <small class="mute">+</small></a><a href="..." target="_blank">Load New Document</a><a href="..." target="_blank">Edit Document</a>

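Instead of the full document with the strings replaced, the output file contains only the translated <a> tags, repeated many times. I'm looking for the error in the code and how to fix it.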

Thanks in advance,

Tiago
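
A possible reading of the actual output above: iterating over a BeautifulSoup object walks its top-level nodes (doctype, comments, whitespace strings, the html tag), not text lines, so an inner loop over the matched tags runs once per top-level node. The sketch below uses a made-up three-line document to show the effect; the document and the variable names are illustrative and not taken from the post.

```python
from bs4 import BeautifulSoup

# Made-up miniature of the input file
html = """<!DOCTYPE html>
<!-- header comment -->
<html><body><a href="#">Ajuda</a><a href="#">Busca</a></body></html>"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# Iterating a soup yields its direct children, not lines of the file
top_level_nodes = list(soup)

written = []
for node in soup:           # same shape as "for line in soup" in the question
    for link in links:      # every matched tag is emitted again on each pass
        written.append(str(link))

print(len(top_level_nodes), len(links), len(written))
```

If this reading is right, dropping the outer loop removes the repetition, but the file would still contain only the matched tags rather than the whole document.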

  • The loop under with open(file_name, 'rb') as src: iterates over all your files of interest. However, after that loop ends, you end up with only the soup of your last file. Was this intentional? If not, you might want to bring the rest of your code under that loop. Commented Jun 27, 2019 at 0:13
  • @Argon Thanks for your comment. I didn't notice that the loop as shown in the post returns only the soup object. I saw this error after removing the first for from the code; the result was the translated strings as a single line. Commented Jun 27, 2019 at 11:16
  • To be able to search for the strings, I get a soup object from the HTML and iterate over it. In the end I need the whole HTML, just with the strings replaced. The output file, as given by the code, begins with the first line that has a replaced string, leaving out the lines before it. Is there any way to iterate over all the HTML and replace the strings? I mean, not iterate over the soup but edit it, using the (let's say) line of the HTML as a reference? Commented Jun 27, 2019 at 19:54
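
One way to approach the last comment, sketched under assumptions: edit the text nodes of the soup in place and then serialize the whole soup, so everything around the translated strings is kept. `translate_text_nodes` is a hypothetical helper name, and the two-entry dictionary is a stand-in for the real `port_eng`.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the real port_eng dictionary
port_eng = {"Ajuda": "Help", "Busca Simples": "Simple Search"}


def translate_text_nodes(soup, dictionary):
    """Replace translatable text nodes in place; return the labels not found."""
    missing = []
    for node in soup.find_all(string=True):
        # Exact type check skips comments and the doctype,
        # which are subclasses of NavigableString
        if type(node) is not NavigableString:
            continue
        text = str(node)
        key = text.strip().strip("+").strip()
        if not key:
            continue  # whitespace-only or '+'-only node
        if key in dictionary:
            node.replace_with(text.replace(key, dictionary[key]))
        else:
            missing.append(key)
    return missing


html = ('<ul><!-- Ajuda --><li><a href="#">  Ajuda <small class="mute">+</small></a></li>'
        '<li><a href="...">Busca Simples</a></li><li><a href="#">FAQ</a></li></ul>')
soup = BeautifulSoup(html, "html.parser")
missing = translate_text_nodes(soup, port_eng)
result = str(soup)  # the whole document, with only the text nodes changed
```

Writing `result` to a file would give the complete page rather than only the matched tags. BeautifulSoup may normalize some markup when serializing, so the output is not guaranteed to be byte-for-byte identical to the input outside the translated strings.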

1 Answer

The best solution I found was to copy the content from the .html file, paste it in a .py file, and start the editing work from there.

from bs4 import BeautifulSoup
from html_dictionary import port_eng    # Dictionary
from html_input_file import raw_text    # Input file: .py file with string
                                        # defined by triple quotes (""" """)


def translate_lines(raw_text, dictionary):
    ans_list = []                       # List of lines with replaced string
    off_list = []                       # Items not found in dictionary, with line index

    for idx, raw_line in enumerate(raw_text.split('\n')):
        soup = BeautifulSoup(raw_line, "lxml")
        # Tag content, without surrounding whitespace and the '+' menu marker
        tag_cont = soup.text.strip().strip('+').strip()

        if tag_cont in dictionary:
            new_item = raw_line.replace(tag_cont, dictionary[tag_cont])
        else:
            new_item = raw_line         # Nothing to translate: keep the line as is
            # Record each missing item once, with the line where it first appears
            if tag_cont and all(tag_cont != item for item, _ in off_list):
                off_list.append((tag_cont, idx))

        ans_list.append(new_item)

    return ans_list, off_list


ans_list, off_list = translate_lines(raw_text, port_eng)
print('\n'.join(ans_list))

I take the output directly from the screen via print() and copy it into a new .html file. It's not the most elegant solution, but it works.
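
The copy-and-paste steps could probably be avoided by reading the .html file directly and writing the result to a new file. The sketch below is one way to do that with the standard library; `translate_file` is a hypothetical name, and `translate_line` stands for whatever per-line replacement is used (for example, a function wrapping the dictionary lookup above). UTF-8 encoding is an assumption about the input file.

```python
from pathlib import Path


def translate_file(src_path, dst_path, translate_line):
    """Read src_path line by line, apply translate_line, and write dst_path."""
    # Assumes the HTML file is UTF-8 encoded
    lines = Path(src_path).read_text(encoding="utf-8").split("\n")
    translated = [translate_line(line) for line in lines]
    Path(dst_path).write_text("\n".join(translated), encoding="utf-8")
    return len(translated)  # number of lines written
```

A call such as `translate_file("input_test.html", "new_file.html", my_line_translator)` would then replace both the paste into a .py file and the copy from the screen, where `my_line_translator` is any function that takes one line and returns the translated line.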
