I'm scraping data from a website by getting the HTML code from the website then parsing it in Java.

I'm currently using java.net.URL and java.net.URLConnection. This is the code I use to get the HTML from a given website (found on this site, slightly edited to fit my needs):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public static String getURL(String name) throws Exception {

    //Open a connection to the URL
    URL url = new URL(name);
    URLConnection spoof = url.openConnection();

    //Spoof the connection so we look like a web browser
    spoof.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");

    //Read the response line by line, closing the reader when done
    StringBuilder s = new StringBuilder();
    try (BufferedReader in = new BufferedReader(new InputStreamReader(spoof.getInputStream()))) {
        String strLine;
        while ((strLine = in.readLine()) != null) {
            //Append each line of the source to the result
            s.append(strLine).append("\n");
        }
    }
    return s.toString();
}

When I run it, the HTML code is received correctly for about 100-200 webpages. However, before I am done grabbing HTML code, I get a "java.io.IOException: Server returned HTTP response code: 503 for URL" exception. I've researched this topic fully and other questions like this one do not cover the package I am using.
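One way to cope with intermittent 503s (a sketch, not the original code) is to cast to HttpURLConnection, check the response code before reading, and retry with exponential backoff when a 503 comes back. The method name, retry count, and delay values below are illustrative assumptions:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RetryFetcher {

    // Hypothetical helper: fetch a URL, retrying on HTTP 503 with
    // exponential backoff instead of letting getInputStream() throw.
    public static String getWithRetry(String name) throws Exception {
        int maxRetries = 5;
        long delayMs = 500;

        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(name).openConnection();
            conn.setRequestProperty("User-Agent",
                    "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)");

            // getResponseCode() does not throw on 503, unlike getInputStream()
            int code = conn.getResponseCode();
            if (code == 503 && attempt < maxRetries) {
                conn.disconnect();
                Thread.sleep(delayMs);   // back off before retrying
                delayMs *= 2;            // double the wait each time
                continue;
            }

            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append('\n');
                }
            }
            return sb.toString();
        }
        throw new Exception("Gave up after repeated 503 responses");
    }
}
```

The backoff gives an overloaded or rate-limiting server progressively more breathing room than a fixed delay does.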

Thanks in advance for the help!

  • A 503 is usually caused by a temporary overloading of the web server. It may be your process that's swamping it, or maybe there's something else accessing the web server. What happens if you try inserting a short sleep between each of your requests? Commented Jan 30, 2014 at 4:36
  • Running it now. With a 100-millisecond rest in between each access, there seem to be fewer long pauses in between each access, but they are still there. Waiting until it is done. Edit 1: At access 339 out of 358, it gives the same error. Adding the delay did not seem to help, so I'll run it with a 1000-millisecond delay. Commented Jan 30, 2014 at 4:40
  • Okay. Adding a full 1-second delay still puts it out at about 240 accesses. I'll try the answer below. Commented Jan 30, 2014 at 4:57
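The delay the comments describe can be sketched as a small driver loop. The `crawl` method, its fetch-function parameter, and the pause values are hypothetical; the fetch function would be the question's `getURL`:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class PoliteCrawler {

    // Hypothetical driver: fetch each page via the supplied function
    // (e.g. the question's getURL), pausing between requests so the
    // server is not hammered back-to-back.
    public static List<String> crawl(List<String> urls,
                                     Function<String, String> fetch,
                                     long pauseMs) throws InterruptedException {
        List<String> pages = new ArrayList<>();
        for (String u : urls) {
            pages.add(fetch.apply(u));
            Thread.sleep(pauseMs);   // e.g. 100 or 1000 ms, as tried above
        }
        return pages;
    }
}
```

As the comments found, a fixed pause alone may not be enough if the server counts total requests rather than request rate, which is why checking the response code and backing off (or stopping) is the more robust option.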

1 Answer


The server may be enforcing a request limit. In that case you can try a raw Socket with input/output streams instead of URLConnection.
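A minimal sketch of this suggestion, writing the HTTP request by hand over a plain Socket. It assumes plain HTTP (no TLS), and the host, port, and path parameters are illustrative; note the returned string includes the status line and headers, which URLConnection would normally strip:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

public class SocketFetch {

    // Sketch: issue a GET request manually over a Socket and read the
    // raw response (status line + headers + body) until the server
    // closes the connection.
    public static String get(String host, int port, String path) throws Exception {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(socket.getOutputStream());
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {

            // Minimal HTTP/1.1 request; "Connection: close" tells the
            // server to end the stream after the response.
            out.print("GET " + path + " HTTP/1.1\r\n");
            out.print("Host: " + host + "\r\n");
            out.print("User-Agent: Mozilla/4.0 (compatible)\r\n");
            out.print("Connection: close\r\n\r\n");
            out.flush();

            StringBuilder sb = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
            return sb.toString();
        }
    }
}
```

Whether this actually avoids the 503 depends on why the server is refusing requests; a raw socket only helps if URLConnection itself was the trigger (e.g. via default headers), not if the server is rate-limiting by IP.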
