I've been using Jsoup for HTML parsing, but I've run into a big problem: parsing takes far too long, about an hour.

Here's a sample of the markup that I am parsing.

<tr>
    <td class="class1">value1 </td>
    <td class="class1">value2</td>
    <td class="class1">value3</td>
    <td class="class1">value4</td>
    <td class="class1">value5 </td>
    <td class="class1">value6</td>
    <td class="class1">value7</td>
    <td class="class1">value8</td>
    <td class="class1">value9</td>
</tr>

The page contains thousands of tables like this, and I need to parse them all into a list. I only need value1 and value6, so I am using this code:

Document doc = Jsoup.connect(url).get();
ls = new LinkedList();
for (int i = 15; i < doc.text().length(); i++) { // 15 because the tables I want start at index 15
    Element element = doc.getElementsByTag("tr").get(i); // table index
    Elements row = element.getElementsByTag("td");
    value6 = row.get(5).text(); // getting value6
    value1 = row.get(0).text(); // getting value1
    node = new Node(value1, value6);
    ls.insert(node);
}

As I said, it takes too much time, so I need to make it faster. Any ideas on how to fix this problem?

  • Parsing a single file takes an hour? How many files are you parsing? How big are they? Are they all present locally before you begin? Or are you crawling a site at the same time? Seems unlikely that running the code for a single URL would take an hour on modern hardware. Commented Feb 23, 2016 at 15:51
  • Can you include anything you have left out, such as the declarations of the ls, value1, and value6 variables? Maybe from there I can help out more. Commented Feb 23, 2016 at 16:02
  • As I said, there are hundreds of tables on the site, and that's all I am parsing, just the tables. No, I am not crawling, and yes, it's just one URL consisting of tables. The values are just text, nothing more. For example, value1 is a name like Michael and value6 is just a door number like 5. Commented Feb 23, 2016 at 16:48
  • And in the second code section, value6 and value1 are just String values. Commented Feb 23, 2016 at 17:00

1 Answer

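I think your problem stems from the for loop for(int i = 15; i<doc.text().length(); i++). Its upper bound is the number of characters in the document's text, not the number of table rows, and doc.text() is rebuilt every time the condition is checked. On top of that, doc.getElementsByTag("tr") inside the loop body searches the whole document again on every iteration. I highly doubt that this is what you want to do. I think you want to iterate over the table rows instead, collecting them once before the loop. So something like this should work: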

Document doc = Jsoup.connect(url).get();
Elements trs = doc.select("tr"); // collect all rows once, outside the loop
for (int i = 15; i < trs.size(); i++) { // the wanted rows start at index 15
  Element tr = trs.get(i);
  Elements tds = tr.select("td");
  String value6 = tds.get(5).text(); // getting value6 (sixth cell)
  String value1 = tds.get(0).text(); // getting value1 (first cell)
  // do whatever you need to do with the values
}

4 Comments

Thank you for the comment. Yes, it does what I need, but the speed is the same.
Can you provide the URL you are parsing? How many tables and how many rows are there? Is it maybe so massive that your computer starts swapping?
I can't provide the URL for security reasons, but there are approx. 1400 tables in there.
Maybe you can split the input HTML into chunks prior to parsing, so that each chunk can be processed in memory. However, 1400 tables still doesn't sound like a lot. There may still be something strange in your code. Did you check whether just loading the HTML (without parsing) already takes that long?
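One way to check the last point is to time the download on its own, with no parsing involved. Below is a minimal sketch, assuming Java 11+ so that the JDK's java.net.http.HttpClient can do the download. The LoadTimer class and its time and fetch helpers are hypothetical names made up for this illustration; they are not part of Jsoup.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Supplier;

// Hypothetical helper for illustration only; not part of Jsoup.
final class LoadTimer {

    /** A task's result together with how long the task took. */
    public static final class Timed<T> {
        public final T value;
        public final long millis;

        Timed(T value, long millis) {
            this.value = value;
            this.millis = millis;
        }
    }

    /** Runs the task once and measures its wall-clock duration. */
    public static <T> Timed<T> time(Supplier<T> task) {
        long start = System.nanoTime();
        T value = task.get();
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        return new Timed<>(value, elapsedMillis);
    }

    /** Downloads the page body as a string; no HTML parsing happens here. */
    public static String fetch(String url) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while downloading", e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java LoadTimer.java <url>");
            return;
        }
        Timed<String> download = time(() -> fetch(args[0]));
        System.out.println("download: " + download.millis + " ms, "
                + download.value.length() + " chars");
        // Next step (not shown): wrap Jsoup.parse(download.value, args[0])
        // in time(...) as well and compare the two numbers.
    }
}
```

If the download alone already accounts for most of the hour, the bottleneck is the network or the server rather than Jsoup. If it finishes quickly, the time is going into the parsing or into the code that processes the rows.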
