I'm scraping data from a website by downloading its HTML and then parsing it in Java. I'm currently using java.net.URL and java.net.URLConnection. This is the code I use to get the HTML from a given page (found on this website, slightly edited to fit my needs):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static String getURL(String name) throws Exception {
    URL url = new URL(name);
    URLConnection spoof = url.openConnection();
    // Spoof the connection so we look like a web browser
    spoof.setRequestProperty("User-Agent",
            "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
    StringBuilder s = new StringBuilder();
    // Read the response line by line, appending each line to the result
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(spoof.getInputStream()))) {
        String strLine;
        while ((strLine = in.readLine()) != null) {
            s.append(strLine).append("\n");
        }
    }
    return s.toString();
}
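For context, I call this in a loop over my list of pages, roughly like this (the pageUrls values below are placeholders; my real list has a few hundred entries):

import java.util.Arrays;
import java.util.List;

// Placeholder URLs; my actual list has a few hundred entries
List<String> pageUrls = Arrays.asList(
        "http://example.com/page1",
        "http://example.com/page2");
for (String pageUrl : pageUrls) {
    String html = getURL(pageUrl);
    // ... parse the HTML here ...
}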
When I run it, the HTML is retrieved correctly for about 100-200 webpages. Before I finish grabbing everything, however, I get a "java.io.IOException: Server returned HTTP response code: 503 for URL" exception. I've researched this topic, and other questions like this one don't cover the package I am using.
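My guess is that the server starts rate-limiting me after that many requests in a row. This is a rough sketch of the retry-with-backoff workaround I've been considering (the attempt count of 3 and the 5-second base delay are arbitrary values I picked, not anything the site documents), though I'd still like to know whether that's the right diagnosis:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: retry on HTTP 503, assuming the 503s come from rate limiting.
// The attempt count and delays are arbitrary guesses on my part.
public static String getURLWithRetry(String name) throws Exception {
    for (int attempt = 1; attempt <= 3; attempt++) {
        HttpURLConnection conn = (HttpURLConnection) new URL(name).openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
        // getResponseCode() lets me check for a 503 before getInputStream() throws
        if (conn.getResponseCode() == 503) {
            conn.disconnect();
            Thread.sleep(5000L * attempt); // wait longer after each failure
            continue;
        }
        StringBuilder s = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String strLine;
            while ((strLine = in.readLine()) != null) {
                s.append(strLine).append("\n");
            }
        }
        return s.toString();
    }
    throw new IOException("Still received 503 after 3 attempts: " + name);
}

The only real difference from my original method is the cast to HttpURLConnection, which exposes the status code instead of throwing straight from getInputStream().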
Thanks in advance for the help!