Web Scraper Firefox Extension

Your web browser will send what is known as a “User Agent” for every page you access. This is a string to tell the server what kind of device you are accessing the page with. Here are some common User Agent strings:

  • Firefox on Windows XP: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6
  • Chrome on Linux: Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.472.63 Safari/534.3
  • Internet Explorer on Windows XP: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
  • Opera on Windows XP: Opera/9.00 (Windows NT 5.1; U; en)
  • Android: Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522+ (KHTML, like Gecko) Safari/419.3
  • iPhone: Mozilla/5.0 (iPhone; U; CPU like Mac OS X; en) AppleWebKit/420+ (KHTML, like Gecko) Version/3.0 Mobile/1A543a Safari/419.3
  • BlackBerry: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, Like Gecko) Version/6.0.0.141 Mobile Safari/534.1+
  • Python urllib: Python-urllib/2.1
  • Old Google Bot: Googlebot/2.1 (+http://www.googlebot.com/bot.html)
  • New Google Bot: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
  • MSN Bot: msnbot/1.1 (+http://search.msn.com/msnbot.htm)
  • Yahoo Bot: Yahoo! Slurp/Site Explorer

You can find your own current User Agent here.
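
If you’re checking from a script, one quick way to see the User Agent it sends is to ask an echo service such as httpbin.org (shown here with urllib2, to match the example below; any HTTP client works):

    import urllib2

    # httpbin.org echoes back the User-Agent header it received,
    # e.g. {"user-agent": "Python-urllib/2.7"}
    print(urllib2.urlopen('http://httpbin.org/user-agent').read())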

Some webpages will use the User Agent to display content that is customized for your particular browser. For example, if your User Agent indicates you are using an old browser, the website may return the plain HTML version without any AJAX features, which may be easier to scrape.

Some websites will automatically block certain User Agents, for example if your User Agent indicates you are accessing their server with a script rather than a regular web browser.


Fortunately it is easy to set your User Agent to whatever you like:

  • For Firefox you can use the User Agent Switcher extension.
  • For Chrome there is currently no extension, but you can set the User Agent from the command line at startup: chromium-browser --user-agent="my custom user agent"
  • For Internet Explorer you can use the UAPick extension.
  • And for Python scripts you can set the User-Agent header with:

    opener = urllib2.build_opener()
    opener.addheaders = [('User-Agent', 'my custom user agent')]
    opener.open('http://www.google.com')

Using the default User Agent for your scraper is a common reason to be blocked, so don’t forget to change it.


Firefox is my personal favorite browser, due in part to all of the great extensions available for it. When you try running Firefox with Selenium, however, you’ll probably find that Firefox is missing the extensions you have installed and normally use when browsing. Luckily, there’s a quick and easy way to install all your favorite Firefox extensions when using Selenium.

For example, let’s say we’d like to do a little light web scraping. To keep things simple, let’s just grab what’s trending off of Yahoo’s home page.

You can see the top 10 trending subjects off to the right, starting with Beaufort County.

Selenium without Firefox Extensions

Here’s how we’d normally scrape that info:
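
The original listing used Selenium’s Python bindings; here’s a minimal sketch of that kind of script (Selenium 3-era API). The CSS selector for the trending list is a guess and will need adjusting against Yahoo’s live markup:

    from selenium import webdriver

    # Launch Firefox (assumes geckodriver is on your PATH)
    driver = webdriver.Firefox()
    driver.get('https://www.yahoo.com')

    # Hypothetical selector for the trending sidebar; inspect the
    # live page and adjust as needed.
    items = driver.find_elements_by_css_selector('ul.trending-list li a')
    for i, item in enumerate(items[:10], start=1):
        print('%d. %s' % (i, item.text))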

At the time, the output included trending entries like:

    2. Faith Hill
    4. Nicki Minaj
    6. Cox Cable
    8. Airbnb Vacation Rentals
    10. Ally Bank

Great, that seems to work. But let’s say we’d prefer Firefox to be running with a couple of our favorite extensions, namely:

  • HTTPS Everywhere: Automatically enables HTTPS encryption on sites that support it, making for more secure browsing.
  • uBlock Origin: An efficient blocker that can make bloated web pages load much faster.

How do we get these extensions installed on Selenium’s instance of Firefox?

Getting the Necessary Information


First, we’ll need to find where those extensions are stored locally. Note that this means you’ll need to already have them installed on your machine for regular use of Firefox.

To find them, open up Firefox and navigate to the main drop down menu. Go to “Help”, and then “Troubleshooting Information”. Alternatively, you can get to the same place by entering about:support in your Firefox navigation bar.

In the “Application Basics” section click the “Open Directory” button, and in the file browser that pops up open the “extensions” folder. These are the extension installation files we’ll need to reference in our script. There should be a different “.xpi” file for every Firefox extension you have installed, and the file path to this folder should look something like “C:\Users\Grayson\AppData\Roaming\Mozilla\Firefox\Profiles\3rqg4psi.default\extensions”.

It might be difficult to tell which files correspond to which extensions based on the file names, as the file names are sometimes unintelligible. To get around this, go back to your browser and on the same page as before scroll down to the “Extensions” section. Here you’ll find a table that pairs each extension name with its corresponding ID, and the ID should be almost the same as the installation file name, lacking just the “.xpi” suffix.
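
If you’d rather do the matching from a script, a short sketch like this lists the installed .xpi files (using the example profile path from above, which will differ on your machine):

    import glob
    import os

    # Example profile path from above; yours will have a different
    # user name and profile folder.
    profile = r'C:\Users\Grayson\AppData\Roaming\Mozilla\Firefox\Profiles\3rqg4psi.default'
    for xpi in sorted(glob.glob(os.path.join(profile, 'extensions', '*.xpi'))):
        print(os.path.basename(xpi))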

In our case though, the extension and file names aren’t too hard to match:

  • HTTPS Everywhere: https-everywhere@eff.org.xpi
  • uBlock Origin: uBlock0@raymondhill.net.xpi


Selenium with Firefox Extensions

Now we just need to add a few lines of code to our original script to install these extensions. We’ll perform the installations right after we initialize the browser.
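
Here’s a minimal sketch of those additions, assuming the install_addon() method that Selenium’s Firefox driver provides. The .xpi file names are the ones found above; the profile path is the earlier example, so adjust both for your machine:

    import os
    from selenium import webdriver

    # Folder containing the .xpi installation files found earlier.
    # This path is the example from above; yours will differ.
    ext_dir = r'C:\Users\Grayson\AppData\Roaming\Mozilla\Firefox\Profiles\3rqg4psi.default\extensions'

    driver = webdriver.Firefox()

    # Install each extension right after initializing the browser.
    # install_addon() expects an absolute path to an .xpi file.
    for xpi in ('https-everywhere@eff.org.xpi', 'uBlock0@raymondhill.net.xpi'):
        driver.install_addon(os.path.join(ext_dir, xpi))

    # ... the rest of the scraping code is unchanged ...

An alternative is to build a FirefoxProfile, call add_extension() on it for each .xpi, and pass the profile to webdriver.Firefox(); behind the scenes, Selenium copies the profile into a temporary directory and points the launched Firefox process at it.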

Run this script and see if you get the same results as last time. Look for the extension symbols near the top right of the browser. You should see the blue-and-white “S” symbol for HTTPS Everywhere and the reddish badge symbol for uBlock Origin.

    2. Faith Hill
    4. Nicki Minaj
    6. Cox Cable
    8. Airbnb Vacation Rentals
    10. Ally Bank

So there you have it. We performed the same operation, but got to take our two favorite Firefox extensions along for the ride.


In addition to the peace of mind of knowing that HTTPS was used whenever possible, you may have noticed that our second script took significantly less time to load the page. This is because uBlock Origin blocked a number of unnecessary, resource-intensive requests, a great feature to have when you’re dealing with the slow, bloated web pages that are all too common nowadays.
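
If you want to put a number on that speedup, a rough check is to time the page load in both versions of the script (a hypothetical addition, not part of the original listing):

    import time

    # driver.get() returns once the page load event fires, so
    # wrapping it gives a rough load-time measurement.
    start = time.time()
    driver.get('https://www.yahoo.com')
    print('Page loaded in %.1f seconds' % (time.time() - start))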


Anyways, I hope this gives you a few ideas as to how you can make your life a little more convenient. Let me know if you have any questions, and happy automating.