I am trying to scrape device information from a specific website (gsmarena) based on each device's model number. I would like to extract the model name (and eventually the price). I'm using a headless browser and rotating proxies, but I need the information for ~2000 devices and can only extract roughly 10 before all the IPs are blocked.
The ~200 proxies are obtained from https://free-proxy-list.net/, and only a few of them seem to work.
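Since most of the free proxies are dead, one option is to filter the list once before scraping rather than discovering failures one browser launch at a time. The following is only a minimal sketch of that idea: `is_alive` and `filter_proxies` are hypothetical helper names, the test URL and timeout are assumptions, and each proxy is assumed to be a plain "host:port" string.

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request


def is_alive(proxy, test_url="https://www.gsmarena.com/", timeout=5):
    """Return True if a request through `proxy` ("host:port") succeeds.

    Hypothetical helper; the test URL and timeout are assumptions.
    """
    handler = urllib.request.ProxyHandler(
        {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    )
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(test_url, timeout=timeout) as response:
            return response.status == 200
    except Exception:
        # Timeouts, refused connections and HTTP errors all mean "not usable"
        return False


def filter_proxies(proxies, check=is_alive, max_workers=20):
    """Check proxies in parallel and keep the ones that pass, in input order."""
    proxies = list(proxies)
    if not proxies:
        return []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]
```

Passing the check as a parameter keeps the filtering logic testable without network access, and a proxy that passes a plain HTTP check may still be blocked by the site later.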
I've explored a number of different options without success. Below is the code I'm currently running; any help would be appreciated.
import time
from urllib.parse import quote_plus

from selenium import webdriver
from selenium.webdriver.common.by import By


def get_device_name(device_model, proxies_list):
    # Check if the device model is provided, return None if not
    if device_model is None:
        return None
    # Work on a copy so the loop is unaffected when failed proxies are removed from proxies_list
    working_proxies = list(proxies_list)
    attempts = 0
    while attempts < len(working_proxies):
        proxy = working_proxies[attempts]
        print(proxy)
        driver = None
        try:
            # Set options for the web driver
            options = webdriver.ChromeOptions()
            options.add_argument(f'--proxy-server={proxy}')
            options.add_argument('--headless')
            options.add_argument('--disable-gpu')
            options.add_argument('--no-sandbox')
            options.add_argument('--disable-dev-shm-usage')
            # Start a web driver instance that routes traffic through the proxy
            driver = webdriver.Chrome(options=options)
            # Make the request to the website (URL-encode the model number)
            url = 'https://www.gsmarena.com/res.php3?sSearch=' + quote_plus(str(device_model))
            driver.get(url)
            # Give the page 5 seconds to load before reading it
            time.sleep(5)
            # Find the device name on the page
            device_name = driver.find_element(By.TAG_NAME, 'strong').text.replace("\n", " ")
            print(device_name)
            return device_name
        except Exception:
            # Any failure (dead proxy, block page, missing element) counts against the proxy
            attempts += 1
            print("Attempt {} with proxy {} failed. Trying again with a different proxy...".format(attempts, proxy))
            if proxy in proxies_list:
                proxies_list.remove(proxy)
            print("Proxy {} removed.".format(proxy))
        finally:
            # Always shut the browser down, even after a failure
            if driver is not None:
                driver.quit()
    # Return None if all attempts failed
    print("All attempts failed. Unable to get device name.")
    return None
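Note that the function always starts from the first proxy in the list, so every device lookup goes through the same proxy until it fails, which may contribute to the IPs being blocked so quickly. A rough sketch of a round-robin alternative follows. `ProxyPool`, its `next`/`report` methods, and the `max_failures` threshold are all hypothetical names introduced for illustration, not part of Selenium or of the code above.

```python
from collections import deque


class ProxyPool:
    """Round-robin pool that drops a proxy after `max_failures` consecutive failures."""

    def __init__(self, proxies, max_failures=2):
        # dict.fromkeys removes duplicates while keeping the original order
        self._queue = deque(dict.fromkeys(proxies))
        self._failures = {proxy: 0 for proxy in self._queue}
        self.max_failures = max_failures

    def __len__(self):
        return len(self._queue)

    def next(self):
        """Return the next proxy and move it to the back of the queue."""
        if not self._queue:
            raise LookupError("no working proxies left")
        proxy = self._queue[0]
        self._queue.rotate(-1)
        return proxy

    def report(self, proxy, ok):
        """Record the outcome of a request made through `proxy`."""
        if proxy not in self._failures:
            return  # already dropped, or never part of the pool
        if ok:
            self._failures[proxy] = 0
            return
        self._failures[proxy] += 1
        if self._failures[proxy] >= self.max_failures:
            self._queue.remove(proxy)
            del self._failures[proxy]
```

With something like this, each lookup would call `pool.next()` for its proxy and `pool.report(proxy, ok)` afterwards, so consecutive requests are spread across the working proxies and a single transient error does not discard a proxy.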
# Apply the get_device_name function to the device_model column of user_agents_2
user_agents_3 = user_agents_2.copy()
user_agents_3['device_name'] = user_agents_2['device_model'].apply(get_device_name, proxies_list=proxies_list)
print(user_agents_3)
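Because `.apply` calls the function once per row, a model number that appears in several rows is fetched several times. If the column contains repeats, looking up each distinct model once would reduce the number of requests. A minimal sketch, assuming missing models are `None` and using a hypothetical helper named `lookup_unique`:

```python
def lookup_unique(models, fetch):
    """Call `fetch` once per distinct, non-missing model and return {model: name}."""
    names = {}
    for model in models:
        # Skip missing values and models that were already looked up
        if model is None or model in names:
            continue
        names[model] = fetch(model)
    return names
```

The result could then be joined back with `user_agents_2['device_model'].map(names)`, where `fetch` would be something like `lambda m: get_device_name(m, proxies_list)`.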
source https://stackoverflow.com/questions/75849946/extracting-device-name-from-model-using-python-scraping