How to find out the character encoding of a web page using Java
2 Answers
Open a connection to the URL (using URL.openConnection()), then parse the content type returned by the getContentType() method (which should contain the charset). If the charset is not present in this header, you might have to parse the HTML content and look for a tag such as
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
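The steps above can be sketched roughly as follows. This is a minimal illustration rather than a robust parser: the class name CharsetFromHeader and both method names are made up for this example, and the regular expression only covers the common charset=... form. Because a meta tag embeds the same syntax, the same helper can be applied to the tag's text when the header carries no charset.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
final class CharsetFromHeader {
    // Matches "charset=..." with optional quotes around the value.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the given text, or null if none is found. */
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null; // the server sent no Content-Type header at all
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Opens the URL and returns the charset from the HTTP header, or null. */
    static String charsetOf(String urlString) throws IOException {
        URLConnection conn = URI.create(urlString).toURL().openConnection();
        return parseCharset(conn.getContentType());
    }
}
```

A null result from charsetOf is the signal to fall back to scanning the page content itself.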
2 Comments
Lucas Zamboulis
I would change "might have to" to "will have to".
Paŭlo Ebermann
You should also look at the XML declaration, like
<?xml version="1.0" encoding="ISO-8859-1" ?>. (If existent, it should be right at the beginning of the document.)
I believe this does exactly what you need. Has both code and explanation. http://nadeausoftware.com/node/73
A quick summary is as follows:
Create a WebFile class where:
- Constructor public WebFile( String urlString ) opens a URLConnection, reads in the headers, including the character encoding. If the encoding is not present, then you'll have to read the encoding from the web page itself. If this is not present either, you could try your luck with Character Encoding Detection Algorithm
- Method private Object readStream(int length, java.io.InputStream stream) reads the page data from the stream and returns a String using the character encoding, i.e. return new String( bytes, charset ), or returns the byte array created by reading the stream if there is no encoding present or if there's an encoding exception.
- You have getters and setters for the page content (e.g. invokes readStream just once, returns the encoding)
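The decoding step of the summary might look something like the sketch below. It is not the code from the linked page: the class name PageReader, the decode helper, and the readStream signature used here (which takes the charset name instead of a length) are assumptions made for illustration. It also relies on InputStream.readAllBytes(), which needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical class, named for this example only.
final class PageReader {
    /**
     * Returns a String decoded with the named charset, or the raw bytes
     * if no charset is given or the JVM does not support it.
     */
    static Object decode(byte[] bytes, String charsetName) {
        if (charsetName == null) {
            return bytes; // no encoding known: hand back the raw bytes
        }
        try {
            return new String(bytes, charsetName);
        } catch (UnsupportedEncodingException e) {
            return bytes; // unknown or invalid charset name: fall back to bytes
        }
    }

    /** Reads the whole stream, then decodes it as described above. */
    static Object readStream(InputStream stream, String charsetName) throws IOException {
        byte[] bytes = stream.readAllBytes(); // Java 9+
        return decode(bytes, charsetName);
    }
}
```

Returning Object mirrors the description above; callers would check with instanceof whether they got a String or a byte array.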
2 Comments
Joachim Sauer
Providing only a link to an external resource is not a good answer. The link can go invalid and become useless. You should have at least a summary in your answer.
Lucas Zamboulis
@Joachim Sauer: didn't want to rewrite the perfectly good description of that page - but didn't think about the invalid link scenario. Fixed, thanks.