Python's Secret Weapon: Generating Spoofable User Agents for Web Scraping

The Essential Guide to User Agent Rotation in Python

I needed to write some code to scrape content from a website, but I wanted to do it using different browser user agents, so that the web server might assume the requests were coming from different sources.

Even though the IP address might be the same, the user agent strings, and hence the request headers, would look different. This can help prevent the web server from blacklisting the scraping tool.

Web scraping is a powerful tool for extracting data from websites, but it often faces countermeasures from servers designed to block automated activity. To circumvent these blocks and maintain a low profile, a crucial technique is user agent rotation. By varying the browser agent string sent with each request, you can mimic diverse, genuine users and significantly reduce the likelihood of detection.

Obsolete Old Code

In the past, I had simply picked a random entry from a hard-coded list of user agent strings.

import requests
import random

# Hard-coded pool of User-Agent strings
user_agents = [
    "Mozilla/5.0 (X11; Linux aarch64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.188 Safari/537.36 CrKey/1.54.250320",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    # Add more User-Agent strings as needed
]

# Choose a random User-Agent from the list
random_user_agent = random.choice(user_agents)

headers = {
    "accept": "*/*",
    "User-Agent": random_user_agent,
}

url = "https://example.com"  # placeholder target URL
response = requests.post(url, headers=headers)

This was cumbersome and not very flexible. I could put a whole lot of user agent strings in a text file, one per line, and pick a line at random, but even that is not much more flexible.
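For reference, a minimal sketch of that text-file approach, assuming a file named user_agents.txt with one User-Agent string per line:

import random

# Assumed file: one User-Agent string per line
with open("user_agents.txt", encoding="utf-8") as f:
    user_agents = [line.strip() for line in f if line.strip()]

# Pick one entry at random for the next request
random_user_agent = random.choice(user_agents)
print(random_user_agent)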

Using pyuser_agent

Then I discovered the "pyuser_agent" package. It was released in Dec 2021 and has not been updated since, but it still works pretty well.

import pyuser_agent

# UA() exposes the package's pool of user agent strings
obj = pyuser_agent.UA()
for i in range(10):
    print(obj.random)

This will give results like the following:

Mozilla/5.0 (Android; U; Android; pl; rv:1.9.2.8) Gecko/20100202 Firefox/3.5.8
Mozilla/4.0 (compatible; MSIE 4.5; Mac_PowerPC)
Opera/9.80 (Windows NT 6.1; U; de) Presto/2.2.15 Version/10.10
Opera/7.53 (X11; Linux i686; U) [en_US]
Opera/9.80 (S60; SymbOS; Opera Tablet/9174; U; en) Presto/2.7.81 Version/10.5
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10 ChromePlus/1.5.2.0alpha1
Opera/9.30 (Nintendo Wii; U; ; 2047-7; de)
Mozilla/5.0 (Android; U; Android; pl; rv:1.9.2.8) Gecko/20100202 Firefox/3.5.8
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.9) Gecko/20070126 Ubuntu/dapper-security Firefox/1.5.0.9
Opera/9.25 (Windows NT 6.0; U; MEGAUPLOAD 1.0; ru)

Under the hood, it uses the same kind of logic, picking a random entry, but it reads the entries from a JSON file, "store_dump.json", rather than a plain text file.
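To use it in a scraper, plug the random value into the request headers. A minimal sketch (the URL below is a placeholder):

import pyuser_agent
import requests

ua = pyuser_agent.UA()

url = "https://example.com"  # placeholder; substitute the site you are scraping

# Send the request with a freshly randomized User-Agent
headers = {"accept": "*/*", "User-Agent": ua.random}
response = requests.get(url, headers=headers)
print(response.status_code)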

You can browse the source code for this package on its PyPI project page.

Using fake_headers

I have also used the "fake_headers" package. It was released in Feb 2019 and last updated in Jun 2019, but it still works for our purpose.

from fake_headers import Headers

if __name__ == "__main__":
    # headers=False skips the extra miscellaneous headers, keeping
    # the generated dict to Accept, Connection, and User-Agent
    header = Headers(
        headers=False
    )

    for i in range(10):
        print(header.generate())

This will give results like the following:

{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.170 Safari/537.36 OPR/52.0.2871.64'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 OPR/56.0.3051.116'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64; rv:59.0.1) Gecko/20100101 Firefox/59.0.1'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36 OPR/51.0.2830.55'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Windows NT 6.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36 OPR/55.0.2994.44'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2; rv:55.0.2) Gecko/20100101 Firefox/55.0.2'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729 Safari/537.36 OPR/57.0.3098.106'}
{'Accept': '*/*', 'Connection': 'keep-alive', 'User-Agent': 'Mozilla/5.0 (X11; Linux i686 on x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.162 Safari/537.36 OPR/51.0.2830.55'}

This package takes a different approach: rather than reading from a data file, it assembles each header by combining random values from lists defined directly in the code.
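Since generate() returns a complete headers dictionary, it can be passed straight into a requests call. A minimal sketch (the URL is a placeholder):

from fake_headers import Headers
import requests

header = Headers(headers=False)

url = "https://example.com"  # placeholder target URL

# generate() returns a dict with Accept, Connection, and User-Agent,
# so it drops directly into the headers argument
response = requests.get(url, headers=header.generate())
print(response.status_code)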

You can browse the source code for this package on its PyPI project page.

Using fake_useragent

The library that I use now is "fake_useragent". It was first released in Mar 2013, but unlike the others it is still regularly updated, making it the most current of the three.

from fake_useragent import UserAgent

# UserAgent() draws from the library's bundled browser data
ua = UserAgent()
for i in range(10):
    print(ua.random)

This will give results like the following:

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.76
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/117.0
Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/117.0
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36 Edg/116.0.1938.76

This library has many other features as well. It draws its random values from a bundled JSON file, "browsers.json". You can browse the source code for this package on its PyPI project page.
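Putting it all together, a simple rotation loop might look like the sketch below; the URL is a placeholder, and each request goes out with a freshly picked User-Agent:

from fake_useragent import UserAgent
import requests

ua = UserAgent()
url = "https://example.com"  # placeholder target URL

for _ in range(5):
    # Rotate the User-Agent on every request
    headers = {"accept": "*/*", "User-Agent": ua.random}
    response = requests.get(url, headers=headers)
    print(response.status_code, headers["User-Agent"][:40])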

Considerations and Best Practices

  • Package Maintenance: While pyuser_agent and fake-headers are effective, they haven't received recent updates. Consider exploring alternatives like fake-useragent for more active maintenance and features.

  • Up-to-Date User Agents: Ensure the user agents you're using are relevant and realistic to avoid suspicion.

  • Customization: Some libraries allow filtering agents by browser, operating system, or popularity to tailor your requests (see the sketch after this list).

  • Ethical Scraping: Always respect website terms of service and rate limits to avoid legal and ethical issues.
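
For example, fake_useragent exposes browser-specific properties such as ua.chrome and ua.firefox, and recent versions accept constructor-level filters. The parameter names below (browsers, os, min_percentage) are an assumption based on recent releases, so check the documentation of your installed version:

from fake_useragent import UserAgent

ua = UserAgent()

# Browser-specific properties return a matching User-Agent string
print(ua.chrome)
print(ua.firefox)

# Constructor-level filtering (assumed parameters; verify against
# your installed fake_useragent version's documentation)
ua_filtered = UserAgent(browsers=["chrome"], os=["windows"], min_percentage=1.3)
print(ua_filtered.random)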

Conclusion

User agent rotation is an essential tool for responsible and successful web scraping. By employing appropriate Python libraries and adhering to best practices, you can significantly enhance your scraping operation's effectiveness and stealth.