1

im trying to convert a file with an HTML table to CSV format. An excerpt from this file follows:

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

    <html xmlns="http://www.w3.org/1999/xhtml" >
    <head id="Head1"><link rel="shortcut icon" href="favicon.ico" /><title>
Untitled Page
    </title></head>
    <body>
        <form name="form1" method="post" action="mypricelist.aspx" id="form1">
    <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/somethingrandom" />

<div>
    <table id="price_list" border="0">
<tr>
    <td>ProdCode</td><td>Description</td><td>Your Price</td>
</tr><tr>
    <td>ab101</td><td>loruem</td><td>1.1</td>
</tr><tr>
    <td>ab102</td><td>ipsum</td><td>0.1</td>
</tr><tr>

i tried using

    xls2csv -x -c\; evprice.xls > evprice.csv

but that gives me an error saying

    evprice.xls is not OLE file or Error

I googled. it said that is was because the file wasn't proper xls but just html.

When i try

    file evprice.xls

its says its html so found a 'solution', using libreoffice.

    libreoffice --headless -convert-to csv ./evprice.xls 

well this does not give an error but the csv output file is all weird, like opening an exe file in notepad.

it contains a lot of strange characters like these

    —¬ž­þ9ü~ÆóXþK¢

anyone know why this is happening, and got a working solution?

8
  • Is the sample data you're working with publicly available? I don't know how anyone would be able to provide something which would work with a file in an uncertain format whose content we've never seen. Commented Feb 25, 2014 at 15:05
  • sorry, its not public. i can place a part of the file Commented Feb 25, 2014 at 15:07
  • 1
    I wouldn't really describe that as an "XLS" file at all -- it's an HTML table, nothing Excel or XLS-related to it at all. Commented Feb 25, 2014 at 15:13
  • ...so, given that, this looks like a duplicate of stackoverflow.com/questions/259091/… (though the accepted answer there isn't automated at all, there are others which are). Commented Feb 25, 2014 at 15:14
  • well, i don't know what kind of file it is. they say its a generated xls file but whatever. libreoffice can open it when i do it manually. why is it giving these weird characters when using the command line version? Commented Feb 25, 2014 at 15:17

1 Answer 1

1

I have built a Python utility which converts all the tables in an HTML file into separate CSV files.

You can find it here.

The crux of the script is this:

from BeautifulSoup import BeautifulSoup
import csv

filename = "MY_HTML_FILE"
fin      = open(filename,'r')

print "Opening file"
fin  = fin.read()

print "Parsing file"
soup = BeautifulSoup(fin,convertEntities=BeautifulSoup.HTML_ENTITIES)

print "Preemptively removing unnecessary tags"
[s.extract() for s in soup('script')]

print "CSVing file"
tablecount = -1
for table in soup.findAll("table"):
  tablecount += 1
  print "Processing Table #%d" % (tablecount)
  with open(sys.argv[1]+str(tablecount)+'.csv', 'wb') as csvfile:
    fout = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in table.findAll('tr'):
      cols = row.findAll(['td','th'])
      if cols:
        cols = [x.text for x in cols]
        fout.writerow(cols)
Sign up to request clarification or add additional context in comments.

1 Comment

Very useful indeed. I encountered some errors as in "UnicodeEncodeError: 'ascii' codec can't encode character at special name..."; however, I could fix it by adding these lines at the top of the file: import sys; reload sys; sys.setdefaultencoding('utf-8'). The accepted answer to this question is what I actually did and suggest in this comment.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.