I need to extract values from an HTML page.

The page contains this:

[image: screenshot of the page]

And I want to extract only the values from there.

I tried this code:

import java.io.*;
import java.net.*;
import javax.swing.text.html.*;
import javax.swing.text.html.parser.*;

public class Test extends HTMLEditorKit.ParserCallback {
  StringBuffer txt;
  Reader reader;

  // empty default constructor
  public Test() {}

  // more convenient constructor
  public Test(Reader r) {
    setReader(r);
  }

  public void setReader(Reader r) { reader = r; }

  public void parse() throws IOException {
    txt = new StringBuffer();
    ParserDelegator parserDelegator = new ParserDelegator();
    parserDelegator.parse(reader, this, true);
  }

  public void handleText(char[] text, int pos) {
    txt.append(text);
  }

  public String toString() {
    return txt.toString();
  }

  public static void main (String[] argv) {
    try {
      // the HTML to convert
      URL toRead;
      if(argv.length==1)
        toRead = new URL(argv[0]);
      else
        toRead = new URL("http://test.com/values.html");

      BufferedReader in = new BufferedReader(
        new InputStreamReader(toRead.openStream()));
      Test d = new Test(in);
      d.parse();
      in.close();
      System.out.println(d.toString());
    }
    catch (Exception e) {
      e.printStackTrace();
    }
  }
}

And what I got was this extract:

Measured valuestable{font-family:verdana,arial,helvetica,sans-serif;color:#000;font-size:10px;background-color:#fff;}Temperature:24.9°CRelative humidity:48.3%RHDew point:13.3°C

Is there any way to extract only the values?

25.0
51.0
14.1

Thank you all for your help and understanding.

Sincere greetings.


Thank you all for your help. As suggested I used JSoup as follows:

Document doc;
try {
    // the URL must include the protocol (http://)
    doc = Jsoup.connect("http://test.com/values.html").get();

    String text = doc.text();
    System.out.println("text : " + text);

    // td:eq(1) matches each td that is the second child of its row
    Element pending = doc.select("table td:eq(1)").get(0);
    Element nextDate = doc.select("table td:eq(1)").get(1);
    Element date = doc.select("table td:eq(1)").last();

    System.out.println(pending.text() + "\n" + nextDate.text() + "\n" + date.text());
} catch (IOException e) {
    e.printStackTrace();
}

The result was this:

23.9°C 
52.8%RH
13.7°C

Is it possible to extract only the values, without °C and %RH?

I apologize for the inconvenience.

  • You can use JSoup: parse the page and extract the data from a specific tag. Commented Jul 21, 2014 at 14:55
  • Thank you very much for the reply. Could you give me some example code, please? Commented Jul 21, 2014 at 14:58

3 Answers

1

Once you have the text from Jsoup, what you need is to reduce each string to its decimal number, because an Element only gives you text and knows nothing about numbers. The following code replaces every run of characters that is not a digit or a decimal point with a space:

public static void main(String[] args) {
    // [^0-9.]+ matches any run of characters other than digits and '.'
    String str="23.9°C";
    System.out.println(str.replaceAll("[^0-9.]+", " "));
    str="52.8%RH";
    System.out.println(str.replaceAll("[^0-9.]+", " "));
    str="13.7°C";
    System.out.println(str.replaceAll("[^0-9.]+", " "));
}

23.9 
52.8 
13.7 
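
If a numeric value is needed rather than a cleaned string, one option is to match the number explicitly and parse it. The following is only a minimal sketch and not part of the answer above: the class name ValueExtractor and the method extractNumber are hypothetical, and it assumes each string contains a decimal number written with a dot. Only the first number found in the string is returned.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ValueExtractor {
    // Optional minus sign, digits, optional fractional part (assumed number format)
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d+)?");

    // Returns the first number found in the text, or empty if there is none
    public static OptionalDouble extractNumber(String text) {
        Matcher m = NUMBER.matcher(text);
        return m.find()
                ? OptionalDouble.of(Double.parseDouble(m.group()))
                : OptionalDouble.empty();
    }

    public static void main(String[] args) {
        // \u00B0 is the degree sign, written as an escape to avoid encoding issues
        String[] samples = {"23.9\u00B0C", "52.8%RH", "13.7\u00B0C"};
        for (String s : samples) {
            OptionalDouble value = extractNumber(s);
            if (value.isPresent()) {
                System.out.println(value.getAsDouble());
            } else {
                System.out.println("no number in: " + s);
            }
        }
    }
}
```

Unlike the replaceAll approach, this leaves no trailing space and reports a missing number explicitly instead of returning an empty or blank string.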

1 Comment

rpirez, did this solve your problem, or do you need anything else?
1

rpirez,

Use the Jsoup library to parse the HTML page in Java. It provides a convenient way to work with an HTML page as a document and to navigate it by element, tag, and so on.

Example: Document doc = Jsoup.connect("http://en.wikipedia.org/").get();

Or get the elements by ID:

// For a single element (html is a String holding the page source)

Document doc = Jsoup.parse(html);

Element data1 = doc.getElementById("data1");

// For multiple elements
Elements inputElements = data1.getElementsByTag("input");
// Iterate over the elements to read each attribute
for (Element inputElement : inputElements) {
    String key = inputElement.attr("name");
    String value = inputElement.attr("value");
}

If you have any problems using this jar, please let us know.

Thanks and Regards, Harry

3 Comments

Thanks for your reply, it's really useful. I edited my question; could you help me extract only the values?
public static void main(String[] args) { String str="23.9°C"; System.out.println(str.replaceAll("[^0-9.]+", " ").toString()); str="52.8%RH"; System.out.println(str.replaceAll("[^0-9.]+", " ").toString()); str="13.7°C"; System.out.println(str.replaceAll("[^0-9.]+", " ").toString()); }
My code above will work for your conversions. Take the string returned by pending.text() and pass it through that code; it will give you the following result: 23.9 52.8 13.7
0

Google for Jericho. It is a very good framework for parsing HTML pages, better than the one from Apache HttpClient.
