HTML Scraper with Ruby



Great Gem

Being a .NET developer by trade, I’ve become increasingly interested in Ruby. After years as a C# developer, one often finds oneself drifting to the “dark side” of open source to play, learn and experience.

And on one such night, I decided to test-drive Ruby with this little project. In South Africa we have an online business directory, and I thought writing an HTML scraper would be a great first project in a language that speaks of “Convention over Configuration” and “DRY”.

So where to begin?


Well, firstly, what are our requirements?

(a) A quick review of the business directory revealed a list of supermarkets available across Africa, and that is the data we will mine.

(b) Obviously the output should be in a form that we can use: CSV/plain text.

(c) A library (gem) that we can use to scrape the HTML (that is, traverse the HTML/DOM), and that is Nokogiri.

(d) An understanding of the layout of the pages and of the data structure (name, postal address, telephone etc.). As with all websites of any value, CSS is used throughout, and large amounts of data are presented in the same repeating structure (so we should be one up already).

Reviewing the Structure:

  1. Determine what the URL is going to look like. Because there are so many supermarkets/stores, the listing is paginated, so let’s see if we can find a way to reference each page directly (that is, get its pagination address). The base URL stays constant, and reviewing the “pagination” links shows that a number is added to the end in increments of 10. That is: /S0914E/10 or /S0914E/20.
  2. Secondly, break down the structure of a store/supermarket; we can almost guarantee each one will be wrapped in a <div>. I’m using Chrome and “Inspect Element” to narrow down the data and its structure. Each store is found in a div with the class “list-directory_entry_4”, and the title of the store is in a div with the class “list-entryTitlebar”, in the text of the <a href>. The class “list-details” gives extra details about the store (location). There is also a URL, found in the class “list-entryInfoBar”, that links to extra information which we will scrape as well.
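
To make the pagination pattern from step 1 concrete, here is a minimal sketch. The base URL is a made-up placeholder (the real directory address is not given here), and page_urls is just an illustrative helper name, so treat this as one possible way to build the list rather than the only one.

```ruby
# Hypothetical base URL, standing in for the real directory listing address
BASE_URL = 'http://example.com/S0914E'

# Return the first page plus `extra_pages` paginated pages, where each
# paginated page appends an offset that grows in steps of `step`.
def page_urls(base, extra_pages, step = 10)
  [base] + (1..extra_pages).map { |i| "#{base}/#{i * step}" }
end

urls = page_urls(BASE_URL, 396)
puts urls.length  # 397 (the first page plus 396 paginated pages)
puts urls[1]      # http://example.com/S0914E/10
puts urls.last    # http://example.com/S0914E/3960
```

Building the whole list up front keeps the scraping loop itself simple: it only has to walk an array of addresses.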

Code Time:

  • Some points worth noting: I use STDOUT in Ruby to create our CSV, redirecting it to a file. I set “sync” to true at the end of a page so that the output is written out to the file rather than building up in memory (not good when dealing with tons of pages).
  • Secondly, Nokogiri supports XPath, which is wonderful for accessing the HTML. Note that I am using XPath’s starts-with to cope with some inconsistencies in the store “divs”. Very handy! page.xpath("//div[starts-with(@class,'list-directory_entry_')]")
  • In the first section of code, I create the URLs I intend to work with, that is, the paginated pages with the content, so that I can apply the generic HTML scraping algorithm to each of them.
  • Then I iterate over each URL using Nokogiri, and use the XPath explained above to find the “div” for each store to scrape.
  • Thereafter I grab the specific data I need using the CSS selectors found in Nokogiri.
  • Lastly, I make use of .gsub(/\s+/, " ") to collapse any whitespace that is not necessary. That’s it really, nice and simple.



require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Create file for output: everything written with puts now goes to console.out
$stdout = File.new('console.out', 'w')

arrPages = []
pageLoc = 10

# First page (396 paginated pages follow)
arrPages.push ""

# Remaining pages: the same URL with the offset appended (10, 20, 30, ...)
for i in 1..396
  arrPages.push "" + pageLoc.to_s
  pageLoc = pageLoc + 10
end

arrPages.each do |pageToScrape|
  # URI.open comes from open-uri and fetches the page over HTTP
  page = Nokogiri::HTML(URI.open(pageToScrape))

  # One div per store; starts-with copes with the varying numeric suffix
  page.xpath("//div[starts-with(@class,'list-directory_entry_')]").each do |store|
    array = []

    # Get details: [Name] + [Location]
    array.push store.css("h2 a").text + "|"
    array.push store.css(".list-details").text + "|"

    # Get the "more info" page and pull the contact details from it
    iUrl = store.css(".list-entryInfoBar a")[0]['href']
    infoPage = Nokogiri::HTML(URI.open(iUrl))

    array.push infoPage.css(".phone").text + "|"
    array.push infoPage.css(".fax").text + "|"
    array.push infoPage.css(".email").text + "|"
    array.push infoPage.css(".web").text + "|"
    array.push infoPage.css(".address_1").text + "|"
    array.push infoPage.css(".address_2").text + "|"

    # One line per store, with runs of whitespace collapsed to a single space
    puts array.join(" ").gsub(/\s+/, " ")
  end

  # Unbuffered output, so rows are written to the file rather than held in memory
  $stdout.sync = true

  # Pause between pages to go easy on the site
  sleep 20
end
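
The formatting and output steps can also be tried on their own, without touching the network. This is only a sketch under a few assumptions: the rows are invented sample data, format_row and write_rows are hypothetical helper names, and it swaps in a File.open block with per-field cleaning rather than reassigning $stdout as the script does.

```ruby
require 'tempfile'

# Collapse runs of whitespace (spaces, tabs, newlines) inside each field,
# trim the ends, then join the fields with "|".
def format_row(fields)
  fields.map { |field| field.gsub(/\s+/, ' ').strip }.join('|')
end

# Write one formatted line per row. With sync set to true, each write is
# passed straight through instead of sitting in Ruby's output buffer.
def write_rows(path, rows)
  File.open(path, 'w') do |out|
    out.sync = true
    rows.each { |fields| out.puts format_row(fields) }
  end
end

# Invented sample data with the kind of stray whitespace scraped text has
SAMPLE_ROWS = [
  ["Sample  Store\n", 'Cape   Town'],
  ['Another Store', "Durban\t"]
]

tmp = Tempfile.new('stores')
write_rows(tmp.path, SAMPLE_ROWS)
puts File.read(tmp.path)
```

Cleaning each field before joining avoids leaving a stray space next to the "|" separator, which makes the file easier to split again later.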


Helpful Links

  1. Nokogiri Tutorial
  2. Ruby in 20 minutes
  3. Other Ruby Help
  4. Bastards Book of Ruby


