Character encoding in a web page using Java

Question

How can I find out the character encoding of a web page using Java?

JB Nizet · Accepted Answer · 2011-02-22 11:53:59Z

2

Open a connection to the URL (using URL.openConnection()), and then parse the content type returned by the getContentType() method (which should contain the charset). If the charset is not present in this header, you might have to parse the HTML content and look for a tag such as

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
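A minimal sketch of this approach follows. The class name CharsetSniffer and its method names are hypothetical, chosen only for illustration. It takes the charset parameter from the Content-Type header and, if that is missing, scans the first 4 KiB of the body for a meta tag that declares a charset. The regular expressions are a simplification and not a real HTML parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class CharsetSniffer {
    // Matches charset=... in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or null if absent. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset declared by the first <meta> tag that has one, or null. */
    static String charsetFromMeta(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String cs = charsetFromContentType(tags.group());
            if (cs != null) return cs;
        }
        return null;
    }

    /** Tries the HTTP header first, then a <meta> tag near the top of the body. */
    static String detect(String urlString) throws IOException {
        URLConnection conn = URI.create(urlString).toURL().openConnection();
        String cs = charsetFromContentType(conn.getContentType());
        if (cs != null) return cs;
        try (InputStream in = conn.getInputStream()) {
            byte[] head = new byte[4096];
            int n = 0, r;
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
            // ISO-8859-1 maps every byte to one char, so ASCII markup survives decoding.
            return charsetFromMeta(new String(head, 0, n, StandardCharsets.ISO_8859_1));
        }
    }
}
```

Calling CharsetSniffer.detect with a page URL returns the declared charset name, or null when neither the header nor the scanned part of the body declares one.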

answered Feb 22, 2011 at 11:53

JB Nizet

2 Comments

Lucas Zamboulis Over a year ago

I would change "might have to" to "will have to".

Paŭlo Ebermann Over a year ago

You should also look at the XML declaration, like <?xml version="1.0" encoding="ISO-8859-1" ?>. (If existent, it should be right at the beginning of the document.)
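As a rough illustration of that check, here is a small sketch that pulls the encoding out of an XML declaration at the very start of a document. The class and method names are hypothetical, and the regular expression is an approximation of the declaration syntax rather than a complete XML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of any library.
final class XmlDeclaration {
    // The declaration must open the document; an optional byte order mark is allowed.
    private static final Pattern ENCODING = Pattern.compile(
            "^\\uFEFF?<\\?xml\\b[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the declared encoding, or null if there is no declaration or no encoding. */
    static String encodingOf(String documentStart) {
        Matcher m = ENCODING.matcher(documentStart);
        return m.find() ? m.group(1) : null;
    }
}
```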

Community · Accepted Answer · 2017-05-23 10:33:08Z

1

I believe this does exactly what you need. It has both code and an explanation: http://nadeausoftware.com/node/73

A quick summary is as follows:

Create a WebFile class where:

The constructor public WebFile( String urlString ) opens a URLConnection and reads the headers, including the character encoding. If the encoding is not present in the headers, you'll have to read it from the web page itself. If it is not present there either, you could try your luck with a character encoding detection algorithm.
The method private Object readStream(int length, java.io.InputStream stream) reads the page data from the stream and returns a String built with the character encoding, i.e. return new String( bytes, charset ). If no encoding is present, or if there is an encoding exception, it returns the byte array read from the stream instead.
The class has getters and setters for the page content and related values (for example, a getter returns the encoding), and readStream is invoked just once.
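The decode-or-fall-back behaviour described for readStream can be sketched as follows. This is not the code from the linked page: the class name, the method name decodeOrRaw, and its signature are assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;

// Hypothetical helper class, not the WebFile class from the linked page.
final class StreamDecoder {
    /**
     * Reads the whole stream. Returns a String when the charset is known and
     * supported, otherwise the raw byte[].
     */
    static Object decodeOrRaw(InputStream stream, String charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        byte[] bytes = buffer.toByteArray();
        if (charset == null) {
            return bytes; // no encoding known: hand back the raw bytes
        }
        try {
            return new String(bytes, charset);
        } catch (UnsupportedEncodingException e) {
            return bytes; // unknown charset name: fall back to the raw bytes
        }
    }
}
```

Returning Object mirrors the description above. A caller has to check with instanceof whether it received a String or a byte[].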

edited May 23, 2017 at 10:33

answered Feb 22, 2011 at 11:50

Lucas Zamboulis

2 Comments

Joachim Sauer Over a year ago

Providing only a link to an external resource is not a good answer. The link can go invalid and become useless. You should have at least a summary in your answer.

Lucas Zamboulis Over a year ago

@Joachim Sauer: didn't want to rewrite the perfectly good description of that page - but didn't think about the invalid link scenario. Fixed, thanks.
