HTML Scraper with Ruby

Introduction:

 


Being a .NET developer by trade, I’ve become increasingly interested in Ruby. After years of writing C#, one finds oneself drifting to the “dark side” of open source to play, learn and experience.

And on one such night, I decided to test drive Ruby with this little project. In South Africa we have an online business directory, and I thought writing an HTML scraper would be a great first project in a language that speaks of “Convention over Configuration” and “DRY”.

So where to begin?

Requirements:

Well, firstly, what are our requirements:

(a) A quick review of the business directory revealed a list of supermarkets available across Africa, and that will be the data we mine.

(b) Obviously the output should be in a form we can use: CSV/plain text.

(c) We need a library (gem) we can use to scrape the HTML (that is, manipulate the HTML/DOM), and that is Nokogiri.

(d) We need to understand the layout of the pages and the structure of the data (name, postal address, telephone, etc.). As with all websites of any value, CSS is inherent, and large amounts of data are presented in the same consistent structure (so we should be one up already).

Reviewing the Structure:

  1. Determine what the URL is going to look like. Because there are many supermarkets/stores, the results are paginated, so let’s see if we can find a way to reference the pages directly (that is, get their pagination addresses). The base URL stays constant, and reviewing the “pagination” links, a number is added to the end in increments of 10; that is, /S0914E/10 or /S0914E/20.
  2. Secondly, break down the structure of a store/supermarket – we can almost guarantee each will be wrapped in a <div>. I’m using Chrome and “Inspect Element” to narrow down the data and its structure. Each store is found in a class called “list-directory_entry_4”; the title for the store is in the TEXT of the <a href> inside a div with class “list-entryTitlebar”; the class “list-details” gives extra details about the store (location); and a link to an extra-information page, which we will scrape as well, is found in the class “list-entryInfoBar”. A small sketch of this structure follows below.
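To make that structure concrete, here is a minimal sketch of how Nokogiri picks up a store entry. The HTML fragment is an assumption pieced together from the class names above, not markup copied from the real site:

require 'rubygems'
require 'nokogiri'

# Hypothetical fragment mirroring the class names described above
html = <<-HTML
<div class="list-directory_entry_4">
  <div class="list-entryTitlebar"><h2><a href="/store/1">Shop Name</a></h2></div>
  <div class="list-details">Cape Town, Western Cape</div>
  <div class="list-entryInfoBar"><a href="/store/1/info">More info</a></div>
</div>
HTML

doc = Nokogiri::HTML(html)
doc.xpath("//div[starts-with(@class,'list-directory_entry_')]").each do |store|
  puts store.css("h2 a").text           # => "Shop Name"
  puts store.css(".list-details").text  # => "Cape Town, Western Cape"
end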

Code Time:

  • Some points worth noting: I use STDOUT in Ruby to create our CSV, and I set “sync” so output is flushed after each page completes, rather than building up in memory (not good when dealing with tons of pages).
  • Secondly, Nokogiri supports XPath, which is wonderful for accessing the HTML. Note I am using XPath’s starts-with to cope with some inconsistencies in the store “divs”. Very handy! page.xpath("//div[starts-with(@class,'list-directory_entry_')]")
  • In this section of the code, I create the URLs I intend to work with – that is, the paginated pages with the content – so that I can apply the generic HTML-scraping algorithm to each one.
  • Then I iterate over each URL with Nokogiri, using the XPath explained above to find the “div” for each store to scrape.
  • Thereafter I grab the specific data I need using Nokogiri’s CSS selectors.
  • Lastly, I make use of .gsub(/\s+/, " ") to collapse any spaces that are not necessary (see the short example after this list). That’s it really, nice and simple.
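For example, gsub with an escaped \s collapses runs of whitespace into single spaces (the store name here is made up):

puts "  Shop   Name,  Cape Town ".gsub(/\s+/, " ").strip
# => "Shop Name, Cape Town"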



Code:

 

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# Create file for output; sync so each record is flushed
# as we go, rather than building up in memory
$stdout = File.new('console.out', 'w')
$stdout.sync = true

arrPages = Array.new
pageLoc = 10

# First page (396 paginated pages follow)
arrPages.push "http://x.x.com/type/supermarkets/any/supermarkets/S0914E/"

for i in 1..396
  arrPages.push "http://x.x.com/type/supermarkets/any/supermarkets/S0914E/" + pageLoc.to_s
  pageLoc = pageLoc + 10
end

arrPages.each do |pageToScrape|
  page = Nokogiri::HTML(open(pageToScrape))

  # starts-with copes with inconsistencies in the store div class names
  page.xpath("//div[starts-with(@class,'list-directory_entry_')]").each do |store|
    array = Array.new

    # Get details [Name] + [Location]
    array.push store.css("h2 a").text + "|"
    array.push store.css(".list-details").text + "|"

    # Get about: follow the "more info" page and scrape the extra fields
    iUrl = store.css(".list-entryInfoBar a")[0]['href']
    infoPage = Nokogiri::HTML(open(iUrl))

    array.push infoPage.css(".phone").text + "|"
    array.push infoPage.css(".fax").text + "|"
    array.push infoPage.css(".email").text + "|"
    array.push infoPage.css(".web").text + "|"
    array.push infoPage.css(".address_1").text + "|"
    array.push infoPage.css(".address_2").text + "|"

    # Collapse runs of whitespace before writing the record
    puts array.join.gsub(/\s+/, " ").strip
  end

  # Be polite to the server between pages
  sleep 20
end
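Since requirement (b) mentioned CSV, one alternative to hand-appended pipe delimiters is Ruby’s standard csv library. A minimal sketch, assuming the scraped fields have been collected into plain arrays (the values below are placeholders, not real data):

require 'csv'

CSV.open('stores.csv', 'w') do |csv|
  csv << ['name', 'location', 'phone']               # header row
  csv << ['Shop Name', 'Cape Town', '021 000 0000']  # one scraped record (placeholder values)
end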


Helpful Links

  1. Nokogiri Tutorial
  2. Ruby in 20 minutes
  3. Other Ruby Help
  4. Bastards Book of Ruby

 

 


Preventing ‘Stop running this script’ in Browsers

 Introduction

On all major versions of IE you may run into this error – “This page contains a script which is taking an unusually long time to finish. To end this script now, click Cancel”. This is beyond irritating, especially when the code is live and affecting customers. Microsoft has a fix for it, but obviously we cannot ask customers to apply it themselves.

Background

What we are going to do is create a “for loop” that builds an array of 100,000 items – enough work to trigger the browser’s warning.

Using the code

I’ve included a sample project – it’s always so much easier when you can play with the code yourself. Have fun!

Before we get into the code, Kudos completely to Guido Tapia for his code and implementation.

I’ve modified the code and made it slightly more applicable to me, and also to external readers who do not have the context of the original. There were also some bugs, which I ironed out. But again, kudos to him.

// Make an object that wraps our work and periodically yields to the browser
var RepeatOperation = function (anonymousOperation, whenToYield) {
    var count = 0;
    return function () {
        if (++count >= whenToYield) {
            count = 0;
            // Yield via setTimeout so the browser sees the script pause
            setTimeout(function () { anonymousOperation(); }, 100);
        }
        else {
            anonymousOperation();
        }
    };
};

Above we create a simple object called “RepeatOperation”. It takes two arguments: the first is our (or your) anonymous operation – the work that needs to get done; the second is when to yield, or when to fire the timeout.

Inside the returned function is a basic count to determine where we are in the “process”. Every time our anonymous operation is called, it comes back in here and increments count. When it reaches the yield count, it sets the timeout, letting the browser know that this is not an infinite loop.

// Implementation
var i = 0;
var noInArray = 100000;  // number of items to build
var yieldAfter = 100;    // yield to the browser every 100 iterations
var myArray = new Array(noInArray); // declared after noInArray so the length is set correctly
var ro = new RepeatOperation(function () { // anonymous function: the work we need to do
    myArray[i] = i * 1;
    if (++i < noInArray) {
        ro();
    }
    else {
        // Finished with the operation
        $("#txtBox").val("Completed Operation and no Browser Warning!");
    }
}, yieldAfter);

// Let's begin
ro();

Now we get to the meat – remembering that state is maintained across the calls due to that little thing called “closure”. For a better understanding of this implementation, look at this; it’s a wonderful read.
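As a toy illustration of that closure behaviour (not part of the article’s code), the inner function below keeps access to count between calls, just as RepeatOperation’s returned function does:

var makeCounter = function () {
    var count = 0;               // captured by the closure
    return function () {
        return ++count;          // state survives across calls
    };
};

var next = makeCounter();
next(); // 1
next(); // 2 – count persisted between invocations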

Moving along, we create an instance of RepeatOperation, passing in our “operation” as well as when to yield (after x number of iterations). We then begin by calling it through ro(). In light of closures, this calls the “return” function found in RepeatOperation, where it checks where we are and sets the timeout if necessary.

The code will continuously call itself back and, while doing so, build the array that we need – or get whatever work done that we need. When finished, it adds the text to our “txtBox”.
