Are your scraping operations grinding to a halt, and you have no idea why? By learning what error codes mean, you can automate your IP settings and keep your scraping running smoothly. Navigating the web should be easy; however, if you are not properly managing your proxies when crawling or scraping, many requests can fail. When a request fails, the response carries an HTTP status code that indicates why the request was unsuccessful.
Understanding the nature of error codes is the first step to overcoming them.
Let’s start with what some of the HTTP status codes mean.
A 200 status code is the response you want: it means everything is OK and the target site processed the request successfully.
A 3XX status code is a redirection rather than an error: the resource you requested is available at a different location. For instance, a 301 status code means a page has been permanently moved, so you were redirected to the new URL. If the redirect occurs because the request itself lacks information, then this can often be overcome by specifying a user-agent within your proxy settings. Choosing a specific user-agent provides more detailed information in the request, meaning there is less room for misinterpretation and less chance of the request being redirected.
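As a rough illustration of sending an explicit user-agent, the sketch below builds a request with Python's standard urllib. The target URL and the user-agent string are placeholder assumptions, not values from any particular proxy setup, and the request is only constructed here, not sent.

```python
import urllib.request

# Hypothetical values, for illustration only.
TARGET_URL = "https://example.com/products"
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Return a request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request(TARGET_URL, USER_AGENT)

# urllib normalizes header names, so the header is stored as "User-agent".
print(request.get_header("User-agent"))
```

Passing the same request to urllib.request.urlopen would send it; whether a given site stops redirecting depends on the site.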
A 4XX status code is a client-side error, received when the request you sent to the server was malformed, unauthorized or otherwise could not be fulfilled, so the page does not load. A 401 error code means the request lacks valid authentication credentials for the target site, and that is why the page will not load. An example of this is trying to access a specific profile on a social media site when you are not signed in.
A 403 error code, however, means your access to the site is forbidden: the request was understood, but the site does not want to grant admittance. In some cases the site will provide an explanation, but it may simply respond with a 403 error code and no reasoning whatsoever. The site can also respond with a 404 error code, which means ‘Not Found’. A 404 normally indicates that the page does not exist, but a server may also return it when it doesn’t want to divulge its reason for denying entry.
A 407 error code means proxy authentication is required, and it often shows up as a failed tunnel connection. When using a proxy, this means the credentials you provided are inaccurate, your request is missing authorization details, or the crawler being used has not been authenticated with the proxy provider. A 407 error can also be caused by your proxy settings, such as a necessary IP that has not been whitelisted or a zone you are trying to use being inactive. Simply update your proxy settings so that the whitelist includes all the IPs accessing the network. Make sure all proxy authentication credentials match those on your zone page and that the requests being sent (especially through the API) include all the necessary information.
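To make the "missing authorization details" case concrete, here is a minimal sketch of how proxy credentials are commonly attached, assuming the proxy accepts HTTP Basic authentication. The username, password, host and port are invented placeholders; the real values would come from your own zone page.

```python
import base64
from urllib.parse import quote


def proxy_url(username: str, password: str, host: str, port: int) -> str:
    """Build a proxy URL with percent-encoded credentials embedded in it."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{host}:{port}"


def proxy_authorization_header(username: str, password: str) -> str:
    """Build the value of a Proxy-Authorization header for Basic auth."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


# Hypothetical credentials and endpoint, for illustration only.
print(proxy_url("user-zone", "p@ss", "proxy.example.com", 8080))
print(proxy_authorization_header("user", "pass"))
```

Percent-encoding matters when a password contains characters such as "@" or ":", which would otherwise corrupt the proxy URL and can lead to a 407.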
If the site you are attempting to access has implemented a rate limit, you may come across a 429 error code, which means you sent too many requests too quickly from the same IP. Sites commonly implement these restrictions to protect themselves from attacks or to ensure that their servers are not overrun. When using a proxy, simply rotate the IPs more frequently or limit the number of requests sent per IP within a given time frame.
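One way to limit requests per IP per time frame is a sliding-window counter kept for each proxy IP. The sketch below is only an illustration under that assumption: the class name, the helper and the limits are hypothetical, and the IPs come from a documentation address range.

```python
import time
from collections import defaultdict, deque


class PerIpRateLimiter:
    """Allow at most max_requests per IP within a sliding time window."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._sent = defaultdict(deque)  # ip -> timestamps of recent requests

    def try_acquire(self, ip):
        """Record a request for ip and return True, or return False if over the limit."""
        now = self.clock()
        sent = self._sent[ip]
        # Forget timestamps that have fallen out of the window.
        while sent and now - sent[0] >= self.window_seconds:
            sent.popleft()
        if len(sent) >= self.max_requests:
            return False
        sent.append(now)
        return True


def pick_ip(limiter, ips):
    """Return the first IP that still has budget, or None if all are exhausted."""
    for ip in ips:
        if limiter.try_acquire(ip):
            return ip
    return None


# Hypothetical pool: at most 2 requests per IP every 10 seconds.
limiter = PerIpRateLimiter(max_requests=2, window_seconds=10.0)
pool = ["203.0.113.10", "203.0.113.11"]
print(pick_ip(limiter, pool))
```

When pick_ip returns None, every IP in the pool has used its budget, so the caller would wait or add more IPs rather than risk a 429.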
A 5XX status code is a server-side error: the site’s server ran into a problem and could not complete the request. When it comes to using a proxy provider, a 502 is the most commonly received status code and refers to a bad gateway error or a timeout, where one server received an invalid response, or no response, from another. This type of response can be returned due to a variety of issues, including the super proxies having refused the connection, no IPs being available for the settings chosen, or the requests being detected as coming from a bot.
To overcome a 502, it is suggested to rotate the IP; however, it may be necessary to change the IP type or proxy network you are using. For example, if you are using a data center IP and receive a 502 error, chances are the site you are attempting to access blocks data center IPs in general, which is a common blocking technique. In these cases, merely rotating the IP would not be sufficient.
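The status codes covered so far can be summarized with a small helper. This is a sketch using Python's standard http module; the function name and the output format are illustrative choices.

```python
from http import HTTPStatus

# Status code families, keyed by the first digit of the code.
FAMILIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe(code: int) -> str:
    """Return a short description such as '403 Forbidden (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the registered statuses known to Python.
        phrase = "Unknown"
    family = FAMILIES.get(code // 100, "other")
    return f"{code} {phrase} ({family})"


for code in (200, 301, 401, 403, 404, 407, 429, 502):
    print(describe(code))
```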
Now that we know what we are dealing with, let’s dive into how to solve common HTTP error codes.
The easiest way to avoid error codes is to use the proxy manager. The proxy manager, a free, open-source tool, automates the management of proxies to help overcome any error code received. When you choose a proxy port, the zone information and all associated credentials are applied automatically, which is the simplest way to ensure you will not receive a 407 error code.
Within the proxy manager is a rules section, well suited to overcoming error codes because it lets you define a trigger and the action to take when that trigger fires. A rule can be triggered by a specific URL, by a maximum or minimum request time, and/or by a specific status code being received. For example, you can set the trigger to be a site responding with a specific undesirable error code. When the rule is triggered, an action occurs automatically. The available actions are: retry the request, rotate the IP, retry with a new network, ban the IP, or save the IP to a reserve pool.
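The trigger-and-action idea can be sketched in a few lines of Python. This is a conceptual illustration only, not the proxy manager's actual configuration format: the Rule class, its field names and the action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Rule:
    """A hypothetical rule: every trigger that is set must match for the rule to fire."""
    action: str
    status_code: Optional[int] = None
    url_contains: Optional[str] = None
    max_request_seconds: Optional[float] = None

    def matches(self, url: str, status_code: int, elapsed_seconds: float) -> bool:
        if self.status_code is not None and status_code != self.status_code:
            return False
        if self.url_contains is not None and self.url_contains not in url:
            return False
        if (self.max_request_seconds is not None
                and elapsed_seconds <= self.max_request_seconds):
            return False
        return True


def choose_action(rules: Sequence[Rule], url: str, status_code: int,
                  elapsed_seconds: float) -> Optional[str]:
    """Return the action of the first matching rule, or None if no rule fires."""
    for rule in rules:
        if rule.matches(url, status_code, elapsed_seconds):
            return rule.action
    return None


# Illustrative rule set; the action names are made up for this sketch.
RULES = [
    Rule(action="retry_with_new_network", status_code=403),
    Rule(action="rotate_ip", status_code=429),
    Rule(action="retry", max_request_seconds=30.0),
]

print(choose_action(RULES, "https://example.com/item/1", 429, 0.8))
```

Rules are checked in order, so more specific rules would go first.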
For a 403 error code, it is recommended to rotate the IP or, even better, waterfall through to a stronger IP or network type. For example, you begin by sending a request to a target site using a data center IP and receive a 403 error. The rules section has been set up to send the exact same request through the residential network upon receiving a 403, so that same request is automatically retried with a new, residential IP.
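The waterfall described here can be sketched as a loop over network types, from weakest to strongest. The send callable, the network names and the fake responses below are assumptions for illustration; in practice the proxy manager performs this retry for you.

```python
from typing import Callable, FrozenSet, Sequence, Tuple


def waterfall(send: Callable[[str, str], int], url: str,
              networks: Sequence[str],
              escalate_on: FrozenSet[int] = frozenset({403})) -> Tuple[str, int]:
    """Try each network in order and stop at the first status not in escalate_on.

    send(url, network) is a caller-supplied function that performs the request
    through the given network and returns the HTTP status code.
    """
    if not networks:
        raise ValueError("at least one network is required")
    status = 0
    for network in networks:
        status = send(url, network)
        if status not in escalate_on:
            return network, status
    # Every network was blocked; report the last attempt.
    return networks[-1], status


def fake_send(url: str, network: str) -> int:
    """Stand-in for a real request: pretend the site blocks data center IPs."""
    return 403 if network == "datacenter" else 200


result = waterfall(fake_send, "https://example.com/", ["datacenter", "residential"])
print(result)
```

Starting with the cheaper network and escalating only on a block keeps most traffic on data center IPs while still completing requests that need a residential one.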
Understanding the error codes you are receiving and why you are getting them is the first step in overcoming them. The Luminati Proxy Manager comes equipped with in-depth success ratio metrics that provide specific details about requests, status codes, how long each request took and more. This data saves Luminati customers time and money, not only by providing solutions to these common scraping hurdles but also by including the means to automate the process of overcoming them. Reduce your bandwidth, eliminate time wasted on solving common coding issues, defeat site-blocking techniques and become a master of web data extraction with no coding required. Download the free, open-source Luminati Proxy Manager, which is compatible with any existing infrastructure and can even be integrated through the raw API.
If you are interested in learning more about error codes and how to overcome them, check out this webinar on ‘How To Troubleshoot Common Error Codes’ or sign up here to get connected to a proxy expert who can help you combat common error codes and find the best solution for your needs.