How to check proxies in Python with aiohttp

To conduct data analysis, for example during market research, we first need to determine the scope and collect the necessary data. Some websites and companies provide an easy and convenient way to access their data via an API. However, many limit the number of requests from a single IP address. Therefore, in order to scrape data anonymously and prevent your IP from being blocked by the web server, we recommend checking your proxies in Python first.

What is a proxy server?

A proxy server is a remote server through which you connect in order to obfuscate your original address. Since the proxy hides your authentic IP address and overlays it with its own, the destination server only sees the IP of the proxy. Hence, if you rotate proxies with each request, the endpoint will treat the requests as separate ones, since they arrive from different IP addresses. Thus, you increase both the speed and the chance of obtaining data for research.
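
As a minimal sketch of this idea, rotation can be as simple as cycling through a pool on every request. The proxy addresses below are hypothetical placeholders, and http://httpbin.org/ip simply echoes the IP address it sees:

import itertools
import requests

# Hypothetical proxy pool; in practice, fill it with proxies you have verified
pool = itertools.cycle(['http://203.0.113.1:8080', 'http://203.0.113.2:3128'])

for _ in range(4):
    proxy = next(pool)  # each request leaves through the next proxy in the pool
    response = requests.get('http://httpbin.org/ip', proxies={'http': proxy}, timeout=5)
    print(response.text)  # the echoed 'origin' field shows the proxy's IP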

[Image: proxy server]

Where is my proxy?

In this article we demonstrate how to obtain and check free proxies using the Python libraries requests, selenium, BeautifulSoup, NumPy and aiohttp.

Check proxies in Python with requests

In a simple recommendation system we used the Python 3 requests library to collect the vacancies data. The requests HTTP library supports both proxies and multi-threading.
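
At its core this relies on the proxies argument of requests.get, which maps URL schemes to proxy addresses. A minimal sketch with a hypothetical placeholder address; the checker script below wraps exactly this kind of call in a test function:

import requests

# Hypothetical proxy address
proxy = 'http://203.0.113.5:8080'

# Requests to 'http' URLs are routed through the proxy;
# the target server only sees the proxy's IP
response = requests.get('http://api.hh.ru/vacancies',
                        proxies={'http': proxy},
                        timeout=5)
print(response.status_code)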

To extend that script to work with proxies, we need to check which proxy servers work with the target website. Hence, we create a script that scrapes the proxy server list from https://free-proxy-list.net/ and saves only those proxies that work with our target.

import requests
from bs4 import BeautifulSoup
import numpy
import concurrent.futures

# Get HTML response
html = requests.get('https://www.free-proxy-list.net/')

# Parse HTML response
content = BeautifulSoup(html.text, 'lxml')

# Extract proxies table
table = content.find('table')

# Extract table rows
rows = table.findAll('tr')

# Create proxies result list
results = []

# Loop over table rows
for row in rows:
    # Use only non-empty rows
    if len(row.findAll('td')):
        # Append rows containing proxies (IP:port) to the results list
        results.append(row.findAll('td')[0].text + ':' + row.findAll('td')[1].text)

# Create final list of working proxies
final = []

def test(proxy):
    # Test each proxy on whether it can access the hh.ru API
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
    try:
        params = {
            'text': 'NAME:C++',
            'area': 113,
            'page': 0,
            'per_page': 100
        }
        requests.get('https://api.hh.ru/vacancies', headers=headers,
                     proxies={'http': proxy}, timeout=1, params=params)
        final.append(proxy)
    except Exception:
        pass
    return proxy

# Test multiple proxies concurrently
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(test, results)

# To print the number of working proxies:
# print(len(final))

# Save the working proxies to a file
numpy.save('file.npy', final)

This script yielded 295 proxy servers that work with our target website: HeadHunter.
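
To double-check what the target actually sees, you can route a request through one of the saved proxies to http://httpbin.org/ip, which echoes the caller's IP address. A minimal sketch, assuming file.npy was produced by the script above:

import numpy
import requests

# Load the working proxies saved by the checker script
proxies = numpy.load('file.npy')

# The echoed 'origin' field should show the proxy's IP, not yours
response = requests.get('http://httpbin.org/ip',
                        proxies={'http': proxies[0]},
                        timeout=5)
print(response.text)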

Check proxies in Python with selenium

We also create a script that uses the selenium library to scrape and check the proxy server list from https://advanced.name/.

from selenium import webdriver
from bs4 import BeautifulSoup
import numpy 
import concurrent.futures
import requests

dr = webdriver.Chrome(executable_path=r"C:\YourPath")
dr.get("https://advanced.name/freeproxy?ddexp4attempt=1&page=1")
content = BeautifulSoup(dr.page_source, "lxml")
table = content.find('tbody')
rows = table.findAll('tr')
results = []
for row in rows:
    # Use only non-empty rows
    if len(row.findAll('td')):
        # Append rows containing proxies (IP:port) to the results list
        results.append(row.findAll('td')[1].text + ':' + row.findAll('td')[2].text)
dr.close()

final = []

def extract(proxy):
    # Test each proxy on whether it can access the hh.ru API;
    # to test against a target that never blocks, swap the URL for https://httpbin.org/ip
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:80.0) Gecko/20100101 Firefox/80.0'}
    try:
        params = {
            'text': 'NAME:C++',
            'area': 113,
            'page': 0,
            'per_page': 100
        }
        requests.get('https://api.hh.ru/vacancies', headers=headers,
                     proxies={'http': proxy}, timeout=1, params=params)
        final.append(proxy)
    except Exception:
        pass
    return proxy

with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(extract, results)

# print(len(final))
numpy.save('file.npy', final)


In turn, this script yielded 99 working proxies.

Check proxies with aiohttp, an asynchronous HTTP client/server library

We can also use the proxies with AIOHTTP, the asynchronous HTTP client/server library for asyncio and Python. In addition, the aiohttp-proxy library provides a proxy connector for AIOHTTP that also supports SOCKS proxies.

import aiohttp
from aiohttp_proxy import ProxyConnector, ProxyType
import asyncio
import sys
import numpy

# On Windows, Python 3.8+ defaults to the proactor event loop;
# switch to the selector loop to avoid known asyncio/aiohttp issues
if sys.version_info[0] == 3 and sys.version_info[1] >= 8 and sys.platform.startswith('win'):
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

async def fetch(url, proxy):
    # Split the proxy string into host and port
    host, port = proxy.split(':')[0], proxy.split(':')[1]
    connector = ProxyConnector(
        proxy_type=ProxyType.HTTP,
        host=host,
        port=int(port),
    )
    # Route the session through the proxy connector
    async with aiohttp.ClientSession(connector=connector, trust_env=True) as session:
        async with session.get(url) as response:
            return await response.text()

if __name__ == "__main__":
    data = numpy.load('file.npy')
    loop = asyncio.get_event_loop()
    # Fetch the page through the last proxy on the saved list
    l = loop.run_until_complete(fetch('http://api.hh.ru/', data[-1]))
    print(l)
    # Give the connector a moment to close cleanly before stopping the loop
    loop.run_until_complete(asyncio.sleep(0.1))
    loop.close()
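
Building on the fetch coroutine above, the whole saved list can also be re-checked concurrently with aiohttp and asyncio.gather. The sketch below rests on our own assumptions: check and check_all are hypothetical helper names and http://httpbin.org/ip serves as a neutral test URL; for SOCKS proxies, aiohttp-proxy also offers ProxyType.SOCKS4 and ProxyType.SOCKS5:

import asyncio

import aiohttp
import numpy
from aiohttp_proxy import ProxyConnector, ProxyType

async def check(proxy):
    # Return the proxy if a test request through it succeeds, otherwise None
    host, port = proxy.split(':')
    # Swap in ProxyType.SOCKS4 or ProxyType.SOCKS5 for SOCKS proxies
    connector = ProxyConnector(proxy_type=ProxyType.HTTP, host=host, port=int(port))
    try:
        async with aiohttp.ClientSession(connector=connector) as session:
            async with session.get('http://httpbin.org/ip',
                                   timeout=aiohttp.ClientTimeout(total=5)) as response:
                await response.text()
                return proxy
    except Exception:
        return None

async def check_all(proxies):
    # Launch all checks concurrently and keep the proxies that responded
    results = await asyncio.gather(*(check(p) for p in proxies))
    return [p for p in results if p is not None]

if __name__ == "__main__":
    # On Windows, set the selector event loop policy first (see above)
    data = numpy.load('file.npy')
    working = asyncio.run(check_all(list(data)))
    print(len(working), 'proxies passed the aiohttp check')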

Summary

In this article we demonstrated how to scrape and use free proxies. Nevertheless, free proxy servers quickly get flagged, since too many people use them.

For a more reliable service we recommend the paid residential proxy servers that Bright Data provides. Formerly known as Luminati, this third-party service offers well-written API documentation for Python that you can use to manage your proxies.

[Image: proxy server API]