HTML parsing in Java using Jsoup

Question

I've been using Jsoup for HTML parsing, but I've run into a big problem: it takes far too long, around 1 hour.

Here's a sample of the HTML that I am parsing.

<tr>
    <td class="class1">value1 </td>
    <td class="class1">value2</td>
    <td class="class1">value3</td>
    <td class="class1">value4</td>
    <td class="class1">value5 </td>
    <td class="class1">value6</td>
    <td class="class1">value7</td>
    <td class="class1">value8</td>
    <td class="class1">value9</td>
</tr>

The site contains thousands of tables like this, and I need to parse them all into a list. I only need value1 and value6, so I am using this code:

Document doc = Jsoup.connect(url).get();
ls = new LinkedList();
for (int i = 15; i < doc.text().length(); i++) { // 15 because the tables I want start from 15
    Element element = doc.getElementsByTag("tr").get(i); // table index
    Elements row = element.getElementsByTag("td");
    value6 = row.get(5).text(); // getting value6
    value1 = row.get(0).text(); // getting value1
    node = new Node(value1, value6);
    ls.insert(node);
}

As I said, it takes too much time, so I need to make it faster. Any ideas on how to fix this problem?

Parsing a single file takes an hour? How many files are you parsing? How big are they? Are they all present locally before you begin, or are you crawling a site at the same time? It seems unlikely that running the code for a single URL would take an hour on modern hardware.
– Geoffrey Wiseman, Commented Feb 23, 2016 at 15:51
Can you include anything you have left out, such as the ls, value1, and value6 variables? Maybe from there I can help out more.
– Tom C, Commented Feb 23, 2016 at 16:02
As I said, there are hundreds of tables on the site, and that's all I am parsing, just the tables. No, I am not crawling, and yes, it's just one URL consisting of tables. The values are just text, nothing much. For example, value1 is a name like Michael and value6 is just a door number like 5.
– K.Smith, Commented Feb 23, 2016 at 16:48
And in the second code section, value6 and value1 are just String values.
– K.Smith, Commented Feb 23, 2016 at 17:00

luksch · Accepted Answer · 2016-02-23 17:43:01Z

2

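I think your problem stems from the for loop for(int i = 15; i<doc.text().length(); i++). The bound here is the number of characters in the document's text, not the number of table rows, so the loop is driven by the character count. On top of that, doc.text() and doc.getElementsByTag("tr") are evaluated again on every iteration, and each call walks the whole document. I highly doubt that this is what you want to do. I think you want to iterate over the table rows instead, so something like this should work: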

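import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

Document doc = Jsoup.connect(url).get();
Elements trs = doc.select("tr"); // collect all rows once, outside the loop
for (int i = 15; i < trs.size(); i++) { // the rows of interest start at index 15
  Element tr = trs.get(i);
  Elements tds = tr.select("td");
  String value6 = tds.get(5).text(); // getting value6
  String value1 = tds.get(0).text(); // getting value1 (first cell, index 0)
  // do whatever you need to do with the values
}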
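
For anyone who wants to try the row-based loop without a live URL, here is a minimal self-contained sketch. The class name RowExtractor, the extract helper, and the inline sample HTML are made up for illustration and are not from the question. It also skips rows with fewer than six cells, which is an assumption about how header or malformed rows should be handled.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class RowExtractor {

    /** Returns {value1, value6} for every row from startRow on that has at least six cells. */
    static List<String[]> extract(Document doc, int startRow) {
        Elements trs = doc.select("tr"); // select all rows once, not once per iteration
        List<String[]> result = new ArrayList<>();
        for (int i = startRow; i < trs.size(); i++) {
            Elements tds = trs.get(i).select("td");
            if (tds.size() < 6) {
                continue; // assumption: rows with too few cells are simply skipped
            }
            result.add(new String[] { tds.get(0).text(), tds.get(5).text() });
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data; the rows must sit inside a <table>,
        // otherwise the HTML parser discards the <tr> and <td> tags.
        String html = "<table>"
                + "<tr><td>a1</td><td>a2</td><td>a3</td><td>a4</td><td>a5</td><td>a6</td></tr>"
                + "<tr><td>too short</td></tr>"
                + "<tr><td>b1</td><td>b2</td><td>b3</td><td>b4</td><td>b5</td><td>b6</td></tr>"
                + "</table>";
        for (String[] pair : extract(Jsoup.parse(html), 0)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

In the question's setting, the call would presumably be extract(doc, 15) on the document fetched with Jsoup.connect(url).get().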


4 Comments

K.Smith Over a year ago

Thank you for the comment. Yes, it does what I need, but the speed is the same.

luksch Over a year ago

Can you provide the URL you are parsing? How many tables and how many rows are there? Is it perhaps so massive that your computer starts swapping?

K.Smith Over a year ago

I can't provide the URL for security reasons, but there are approximately 1400 tables in there.

luksch Over a year ago

Maybe you can split the input HTML into chunks prior to parsing, so that each chunk can be processed in memory. However, 1400 tables still does not sound like a lot. There may still be something strange in your code. Did you check whether loading the HTML (without parsing) already takes that long?
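
One way to check this is to time the download and the parse separately. The following is only a sketch under assumptions: the class name PhaseTimer and the parseAndCountRows helper are hypothetical, and the URL is taken from the command line because the real one was not shared. It uses Connection.execute() and Response.body() to fetch the raw HTML first, then Jsoup.parse() on the resulting string.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class PhaseTimer {

    /** Parses already-downloaded HTML and returns the number of <tr> elements found. */
    static int parseAndCountRows(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("tr").size();
    }

    public static void main(String[] args) throws Exception {
        String url = args[0]; // hypothetical: pass the page URL as the first argument

        long t0 = System.nanoTime();
        Connection.Response response = Jsoup.connect(url)
                .maxBodySize(0) // 0 lifts jsoup's body size cap so a large page is not truncated
                .execute();
        String html = response.body(); // reading the body is part of the download phase
        long t1 = System.nanoTime();

        int rows = parseAndCountRows(html); // parse phase only, no network involved
        long t2 = System.nanoTime();

        System.out.printf("download: %d ms, %d characters%n", (t1 - t0) / 1_000_000, html.length());
        System.out.printf("parse:    %d ms, %d rows%n", (t2 - t1) / 1_000_000, rows);
    }
}
```

If the download figure dominates, the bottleneck is the network or the server rather than Jsoup; if the parse figure dominates, the size of the HTML or the available memory would be the next thing to look at.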
