HTML Scraper with Ruby

Introduction:

 

Great Gem

Being a .NET Developer by trade, I’ve become increasingly interested in Ruby. Being a C# developer for years, one finds one self often drifting to the “dark side” of open-source to play, learn and experience.

And on one such night, I decide how about test driving Ruby – and this little project. In South Africa we have an online business Directory, and I thought writing a HTML Scraper would be a great first project in this language that speaks of “Convention over Configuration”, and “DRY”.

So where to begin?

Requirements:

Well, firstly, what are our requirements:

(a) A quick review of the business directory, revealed a list of Supermarkets available across Africa, and that will be the data we will mine for 

(b) Obviously the output should be in a “form” that we can use, CSV/PlainText.

( c) Perhaps a Library (Gem) that we can use to Scrape the HTML (that is manipulate the HTML/DOM), and that is NokoGiri.

(d) Understanding the layout of the pages and understanding the data structure (Name, Postal Address, Telephone etc). As with all websites of any value – CSS is inherent, and standards for presenting large amounts of data in the same structure (so we should be one up already)

Reviewing the Structure:

  1. Determine what the URL is going to look like and obviously because the information for the super-markets/stores are large they are paginated, so let’s see if we can find a way to reference them directly (that is get their pagination address).  As you can see below that seems to be the constant URL, and reviewing the “pagination” links, a number is added to the end in iterations of 10. That is; /S0914E/10 or /S0914E/20Url
  2. Secondly, breaking down the structure of a store/supermarket – we can almost guarantee they will be broken down by a <div>. I’m using Chrome and “Inspect Element” to narrow down the data and it’s structure. Each store is found in a class called “list-directory_entry_4”, the title for the store in a div class called “list-entryTitlebar” in the TEXT of the <a href>. The class “list-details” gives extra details about the store (location). Also there follows a URL that links to extra information which we will scrape as well found in class, “list-entryInfoBar”Structure of Store's Data

Code Time:

  • Some points worth noting, I use am STDOUT in Ruby to create our CSV. I call “sync” at the end to dump after I complete a page, else it will build in Memory (not good when dealing with tons of pages)
  • Secondly, NokoGiri supports XPath which is wonderful for accessing the HTML. Note I am using XPATH with starts with to match for some inconsistencies in store “divs”. Very handy! page.xpath("//div[starts-with(@class,'list-directory_entry_')]")
  •  In this section of code, I am creating the URL’s I intend to work with; that is the paginated pages with the content. So that I can apply the generic HTML scraping algorithm to it. 
  • URLs
 
  • Then iterate each URL, using NokoGiri, and use the XPath explained above to find the “div” for a store to scrape.XPath


  • Thereafter I grab the specific data I need using “css selectors” found in NokoGiri; Selector
 
 
 
             Lastly, I make use of .gsub(/s+/, “”) to remove any spaces that are not necessary. That’s it really, nice and simple.
  • Gsub



Code:

 

require ‘rubygems’
require ‘nokogiri’

require ‘open-uri’
#Create File for Output
$stdout = File.new(‘console.out’, ‘w’)
arrPages = Array.new
pageLoc = 10
#First Page (396 paginated pages)

arrPages.push “http://x.x.com/type/supermarkets/any/supermarkets/S0914E/”

 

for i in 1..396

arrPages.push “http://x.x.com/type/supermarkets/any/supermarkets/S0914E/” + pageLoc.to_s()

pageLoc = pageLoc + 10;

end
arrPages.each do |pageToScrape|
page = Nokogiri::HTML(open(pageToScrape))

page.xpath(“//div[starts-with(@class,’list-directory_entry_’)]”).each do |store|


result = String.new

iUrl = String.new

array = Array.new

 

# Get Details [Name] + [Location]

array.push store.css(“h2 a”).text + “|”

array.push store.css(“.list-details”).text + “|”
# Get About
# Get’s the More Info Page, filters that

iUrl = store.css(“.list-entryInfoBar a”)[0][‘href’]
infoPage = Nokogiri::HTML(open(iUrl))

array.push infoPage.css(“.phone”).text + “|”

array.push infoPage.css(“.fax”).text + “|”

array.push infoPage.css(“.email”).text + “|”

array.push infoPage.css(“.web”).text + “|”

array.push infoPage.css(“.address_1”).text + “|”

array.push infoPage.css(“.address_2”).text + “|”

 

puts array.join{” “}.gsub(/s+/, “”)

$stdout.sync = true
end

sleep 20

end


Helpful Links

  1. NokoGiri Tutorial
  2. Ruby in 20 minutes
  3. Other Ruby Help
  4. Bastards Book of Ruby

 

 

Don't be shellfish...Share on FacebookShare on Google+Tweet about this on TwitterShare on LinkedInShare on TumblrEmail this to someone

One thought on “HTML Scraper with Ruby”

Leave a Reply