Luminati learning hub



Become a Luminati professional

How to crawl a website without getting blocked or misled (cloaked)?

Why should I care?
When a target website detects crawlers from a proxy (datacenter) IP, it typically blocks them or returns misleading (cloaked) information.

How does the target website identify my crawling activity?
Target websites log the IPs of whoever visits them and analyze the activity of these IPs. Assuming you are using a traditional datacenter proxy, the target website can:

  1. Identify that the activity from a single IP (the rate of requests) is much greater than what a human can accomplish in a given timeframe
  2. Identify that the IP address originated from a proxy server list, which these target websites have access to
  3. Identify that the IPs have the same subnet block range
How to prevent being detected?
  1. To avoid being detected by the number of requests per IP, you can reduce the number of requests per second. However, this will reduce your crawling speed
  2. To prevent the target from identifying your IP as coming from a proxy server, you must rotate your requests through residential IPs. You should be able to circulate through enough IPs that the target website cannot detect your activity
  3. Residential IPs are spread across many networks, so they do not share a single subnet block range
With a traditional proxy solution, it’s only a matter of time before the target website identifies your crawling activity and either blocks you or serves you wrong information.
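The first prevention step above (fewer requests per second per IP) can be sketched as a small throttle. This is a minimal illustration, not part of any Luminati tool: the class name RequestThrottle and its parameters are hypothetical, and the right rate limit depends on the target website.

```python
import time


class RequestThrottle:
    """Enforce a minimum delay between consecutive requests sent from one IP."""

    def __init__(self, max_requests_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_requests_per_second <= 0:
            raise ValueError("max_requests_per_second must be positive")
        self.min_interval = 1.0 / max_requests_per_second
        self._clock = clock    # injectable so the throttle can be tested without waiting
        self._sleep = sleep
        self._last_request = None

    def wait(self):
        """Block until the next request is allowed and return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_request is not None:
            elapsed = now - self._last_request
            delay = max(0.0, self.min_interval - elapsed)
            if delay > 0:
                self._sleep(delay)
        # Record the moment the request is actually released.
        self._last_request = now + delay
        return delay
```

Calling wait() before every request keeps a single IP under the chosen rate. As the list above notes, the trade-off is a slower crawl.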

How to get an IP in a specific city

Why should I care?
Example: If you are responsible for testing Yelp’s city-level service, you need to check the site from 10,000 different cities around the world.

How to get an IP in a specific city?
If you only use a traditional datacenter proxy solution for your data collection tasks, you are limited to the locations where those datacenters are. Large residential networks can get you IPs in any specific city in the world.

Cost effectiveness of residential IPs


How did we calculate this table?

Your company needs to collect information from the web by sending 1,000 HTTP requests per hour to a specific website. You write the scraper code and run it through a server. The target website allows 50 requests per minute from the same IP before blocking your scraper. Now, you have to purchase more proxies.

Assuming you choose datacenter proxies:
You don’t want to share IPs, so you buy 200 dedicated datacenter IPs. You spend 2 hours integrating the scraper with the new datacenter proxies and then run the new program. This time, it takes 3 days for your target website to detect your scraper. Once your proxies are detected, you have to purchase new proxies and repeat the process, checking each day to make sure the proxies haven’t been detected. Cost per month (all numbers are from real customers):

Your total cost per month will be at least $1,400 for these items alone, and a developer salary of $30 per hour is a very conservative estimate. Additionally, this doesn’t account for unreliable information if your target website sends misinformation before blocking you, or for an information flow that is cut every few days, either of which can be detrimental to your brand or your revenue stream.

Assuming you choose Luminati residential proxies:
You buy a basic package of 40GB with access to unlimited residential IPs. It takes 2 hours to integrate your scraper. Because an average of 3 million residential IPs are available each day, your target website can’t detect your scraper, allowing you to focus on other projects.

The bandwidth and unlimited IPs cost just $500 per month. Your information is always reliable because your requests are always successful and access is never cut in the middle of the month. When your business grows as a result of this scraping and your projects exceed 600MB each month, the difference in costs can be much higher than just ~$1000.

Luminati also allows you to suspend your account when not in use, so your cost can be lower than $500 per month. Start by using the $5 voucher for free datacenter traffic to test Luminati’s benefits and then ask for access to our residential network for cheap and reliable data collection.

How to accelerate your web scraping

Why should I care?
If the number of requests you send through a single IP is higher than what the target website allows, the website will identify your IP and block you or mislead you with false information. To stay under that limit with only a few IPs, you have to slow down, so your data collection can be much slower than what you’re used to.

How do I improve the speed of my data harvesting?
Assuming you're running 10 million requests at 1 request per second per IP, with 1,000 datacenter IPs your routine can take about 3 hours. With 10,000,000 residential IPs, your routine can potentially take 1 second.
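The arithmetic behind these figures can be checked with a short sketch. It assumes an idealized case where every IP sends requests in parallel at a fixed rate with no failures or retries; the function name is hypothetical.

```python
def crawl_duration_seconds(total_requests, ip_count, requests_per_second_per_ip=1.0):
    """Idealized crawl time: total work divided by combined throughput."""
    if ip_count <= 0 or requests_per_second_per_ip <= 0:
        raise ValueError("ip_count and requests_per_second_per_ip must be positive")
    throughput = ip_count * requests_per_second_per_ip  # requests per second overall
    return total_requests / throughput


# 10 million requests over 1,000 datacenter IPs at 1 request per second each
datacenter_seconds = crawl_duration_seconds(10_000_000, 1_000)       # 10000.0
# The same workload spread over 10,000,000 residential IPs
residential_seconds = crawl_duration_seconds(10_000_000, 10_000_000)  # 1.0
```

10,000 seconds is a little under 2.8 hours, which is where the "about 3 hours" estimate comes from.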

Guidelines to rotate multiple parallel sessions through Luminati’s residential network:

  1. Open Luminati Proxy Manager
  2. Go to the ‘proxies’ tab
  3. Check the port of your residential zone
  4. In the port settings, change ‘preset’ to ‘round-robin (ip) pool’
  5. Route your requests to 127.0.0.1:{portnum} where the {portnum} is the port of the residential zone
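Step 5 can be sketched with Python's standard library. This is only an illustration under stated assumptions: the Proxy Manager is running locally, the port number comes from your own ‘proxies’ tab (24000 in the usage note is a placeholder), and the helper names are hypothetical.

```python
import urllib.request


def build_local_proxy_opener(port, host="127.0.0.1"):
    """Return an opener that routes HTTP and HTTPS requests through host:port."""
    proxy_url = f"http://{host}:{port}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch(opener, url, timeout=30):
    """Fetch a URL through the given opener and return the raw response body."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Usage would look like opener = build_local_proxy_opener(24000) followed by fetch(opener, "http://example.com/"), where both the port and the URL are placeholders. Rotation is then handled by the preset chosen in step 4, not by this code.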

How to rotate your IP address

Why should I care?
When many requests are sent to a website from the same IP, the website can tag the IP used as a crawler and send misleading information or block you. Periodically changing, or rotating, your IP address helps prevent a target site from identifying your IP as a crawler. Rotating your IPs can drop your failure rate to below 1%.

How do I rotate my IP address?
With its easy-to-use proxy manager, Luminati allows you to control:


Guidelines for rotating your IP address with Luminati:
  1. Open Luminati Proxy Manager
  2. Go to the ‘proxies’ tab
  3. Click on the proxy you want to edit, then on the edit button
  4. Browse rotation options under 'Preset' or 'IP Policy'
  5. Route your requests to 127.0.0.1:{portnum} where the {portnum} is the port of the residential zone

Introducing: Luminati Chrome Extension

Why should I use it?
Use the Chrome extension to self-test your website, verify your ads, or simply browse a site as if from another country. It’s a powerful complement to the Luminati Proxy Manager and an easy-to-use tool for less technical users. You can also have people who don’t have access to the dashboard use the extension without knowing the account credentials.

What features are available in the extension?
Luminati Chrome Extension supports datacenter and residential IP browsing, allowing you to search from any country. You can adjust the user agent, customize configuration of the DNS to maximize discretion or speed, and set random IP rotation.


Guidelines to download Luminati Chrome Extension:

How to use SOCKS5 with Luminati

Why should I care?
A SOCKS server is a proxy server that works for any type of network protocol on any port and establishes a connection to a server on behalf of a user, then routes traffic between the user and the server.

Why use SOCKS5 with Luminati?
When you use SOCKS5 with Luminati, the proxy manager converts any requests to port 80 or port 443 into HTTP and HTTPS requests, so you don’t have to worry about which format is accepted by your target site. With any other port, the traffic is sent as-is between the user and the host.


Guidelines to use SOCKS5:

How do I avoid subnet block range?

What is a subnet block range?
Smaller sections of a network are called subnets, which are useful for grouping hosts together and managing them all at once. Subnets are based on IP address, making it easy for websites or malicious users to target or block an entire subnet. For example, blocking 223 would block all IP addresses that begin with 223, while blocking 223.1 would only block IP addresses that begin with 223.1, and so on - this is the subnet block range.
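To make the idea concrete, here is a minimal sketch of how a website could test an incoming IP against blocked subnets. It assumes the blocks are written in CIDR notation, so "blocking 223" corresponds to 223.0.0.0/8 and "blocking 223.1" to 223.1.0.0/16. The function name is hypothetical.

```python
import ipaddress


def is_blocked(ip, blocked_networks):
    """Return True if the IP falls inside any of the blocked CIDR ranges."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(net) for net in blocked_networks)


# Blocking the whole 223.x.x.x range catches every address that begins with 223.
print(is_blocked("223.7.8.9", ["223.0.0.0/8"]))   # True
# Blocking only 223.1.x.x leaves other 223.* addresses untouched.
print(is_blocked("223.7.8.9", ["223.1.0.0/16"]))  # False
```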

Why should I care?
Because a datacenter’s IPs are often all within the same subnet, they are easily blocked by websites, especially if they are known proxies.

Avoiding subnet block range:
Residential proxies are much harder to block this way, because their IPs are spread across many unrelated subnets rather than one range. Using Luminati’s residential IPs helps ensure you won’t be affected if a website uses the subnet block range method.


What is IPv4 format?

Why should I use it?
IPv4 is the format in which most IP addresses used to connect devices to the Internet are written. It consists of four numbers between 0 and 255, separated by periods. Proxy services with datacenter IPs often give out lists of IP addresses in this format to their customers.
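As a quick illustration of the format, the sketch below checks whether a string is a well-formed IPv4 address using Python's standard ipaddress module. The helper name is hypothetical.

```python
import ipaddress


def is_ipv4(text):
    """Return True if text is four dot-separated numbers, each in the range 0-255."""
    try:
        ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return False
    return True


print(is_ipv4("127.0.0.1"))  # True
print(is_ipv4("256.0.0.1"))  # False: 256 is outside the 0-255 range
print(is_ipv4("1.2.3"))      # False: only three numbers
```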

How do I use IPv4 format with Luminati proxies?
Because Luminati’s IP addresses are constantly changing or rotating, there is no list of IPs in IPv4 format like there might be when using datacenter IPs. To ensure that the IPs are in the correct form, make sure to change your proxy settings to the corresponding port number of the IP address you want. Your device’s IP address will now be in IPv4 format.

Guidelines to change proxies to IPv4 format:
  1. Open Luminati Proxy Manager
  2. Note the number of the port you want to use
  3. Open browser proxy settings
  4. Select “Manual Proxy Settings”
  5. Set the HTTP proxy address to 127.0.0.1 and the port to the port number you noted

How to switch from API to Proxy Manager

Why should I care?
Using the Luminati Proxy Manager offers advanced features that are not readily available in the API. Instead of having to manually code mechanisms for tasks like keeping an IP as long as possible or rotating your IP after each request, you can simply click a button in the proxy manager.

Guidelines to switch from API to Proxy Manager:

  1. Install the Luminati Proxy Manager here
  2. Change the code to send HTTP requests directly to the specified port (for example, 127.0.0.1:24000) instead of to zproxy.luminati.io:22225
  3. Configure the settings for your custom proxies and zones through the LPM dashboard. You do not need to send the user parameter (lum-customer-customer_name-zone-zone_name…) alongside your requests, as all the needed data is wrapped within the manual proxy configuration.
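The change described in steps 2 and 3 can be sketched as a before and after of the proxy URL your code builds. This is an illustration, not official client code: the helper names are hypothetical, the username is the placeholder form quoted above, and the password argument is an assumption about how the direct API authenticates.

```python
from urllib.parse import quote

API_HOST = "zproxy.luminati.io"
API_PORT = 22225


def api_proxy_url(username, password):
    """Before: talk to the API endpoint directly, with credentials in the proxy URL."""
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"http://{user}:{secret}@{API_HOST}:{API_PORT}"


def proxy_manager_url(port=24000, host="127.0.0.1"):
    """After: talk to the local Proxy Manager port; no user parameter is sent."""
    return f"http://{host}:{port}"


def as_proxies(proxy_url):
    """Build the proxy mapping most HTTP clients expect for both schemes."""
    return {"http": proxy_url, "https": proxy_url}
```

With this shape, switching means replacing as_proxies(api_proxy_url(...)) with as_proxies(proxy_manager_url(24000)) in the scraper, while zone and rotation settings move into the LPM dashboard.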

How do I know if I’m getting cloaked? (misleading information)

Why should I care?
Getting cloaked means that you’re getting misleading information from the website you are scraping.
Example: If you are collecting competitive pricing information to feed your automatic pricing algorithms, the target website can return artificially low prices to your requests in order to skew your pricing and profits.

How to know when you’re getting cloaked
When using traditional proxy networks (data center based IPs), your target websites may identify your activity quite easily and may cloak your requests. Therefore, the only way to ensure you’re not getting cloaked is to rotate your requests through residential IPs.

Guidelines for rotating requests through millions of residential IPs:

  1. Open Luminati Proxy Manager
  2. Go to the ‘proxies’ tab
  3. Check the port of your residential zone
  4. In the port settings, change ‘preset’ to ‘round-robin (ip) pool’
  5. Route your requests to 127.0.0.1:{portnum} where the {portnum} is the port of the residential zone
