Chronicling my experiences with Ruby on Rails and web application development/management.

Monday, June 29, 2009

30 days of Rails - Hpricot parsing

Scenario - You need some information from a website, but there are around 100 rows. This information will populate the name column of your model, so you need to clean up the data. Basically, you need to parse information out of a paragraph or one big string. This is where Hpricot and some Ruby come in handy.


HTML string parser - Rake task


Here is the code I ended up using. I will try to explain it at the end.



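namespace :sponsor do
  desc "seed data"
  task :seed => :environment do
    require "open-uri"
    require "hpricot"

    # Fetch the page and parse it into an Hpricot document
    doc = open("http://www.nascar.com/guides/sponsors/") { |f| Hpricot(f) }

    # Grab the HTML inside the div with the id cnnContentArea
    @snag = (doc/"#cnnContentArea").inner_html

    # Strip tags, split on newline + tabs, trim, drop blanks, save each name
    @snag.gsub(/<.*>/, '').split(/\n\t+/).map { |t| t.strip }.reject { |t| t == '' }.each { |t| Sponsor.create!(:name => t) }
  end
end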


Code explained


I needed to pull information from that URL. All of it lived within the div tag with the id cnnContentArea. In the HTML source, each entry was preceded by \n\t\t\t, and the number of tabs varied.
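To make the delimiter concrete, here is a small sketch of how splitting on \n followed by tabs behaves. The sample string and the sponsor names in it are made up for illustration; they only imitate the shape described above and are not copied from the real page.

```ruby
# Hypothetical sample shaped like the markup described above (not the real page).
raw = "\n\t\t\t<p>\n\t\t\tAcme Tools\n\t\t<br>\n\t\t\t\tBolt Cola\n\t\t\t</p>\n"

# /\n\t+/ matches a newline followed by one or more tabs,
# so it does not matter whether a line is indented with 2, 3, or 4 tabs.
pieces = raw.split(/\n\t+/)

p pieces
```

The split alone still leaves tag lines and an empty leading string in the result, which is why the full one-liner also strips tags and rejects blanks.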


Using Hpricot, we open the URL and save the page as an Hpricot object (see the doc = line above).


Next, we take the saved page, pull out the content of the cnnContentArea div tag, and save it in the instance variable @snag.


The last chunk of code does the following:


  1. gsub returns a copy of the @snag string with the paragraph tags (and any other tags) removed. It replaces every match of the pattern, here with an empty string.
  2. Next, the string is split into mini strings based on a delimiter, in this case anything that matches \n followed by one or more tabs (\t). I was lucky that the page source was consistent enough to do it this way.
  3. We then go through each mini string (that's what map is doing) and strip all whitespace before and after the words in it (see the block with t.strip).
  4. As an added measure, any mini string that is empty is thrown out (courtesy of reject).
  5. Lastly, we take each clean mini string and add it to our model, which in this case is a simple Sponsor model, under the name column in the DB.
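The five steps can be tried outside of Rails with a short sketch like the one below. It assumes a made-up sample string in place of the real page content, and it collects plain hashes in a hypothetical created array where the rake task calls Sponsor.create!.

```ruby
# Hypothetical stand-in for the inner_html of the cnnContentArea div.
snag = "\n\t\t\t<p>\n\t\t\tAcme Tools\n\t\t<br>\n\t\t\t\tBolt Cola\n\t\t\t</p>\n"

no_tags  = snag.gsub(/<.*>/, '')            # 1. remove the tags
pieces   = no_tags.split(/\n\t+/)           # 2. split on newline + one or more tabs
stripped = pieces.map { |t| t.strip }       # 3. trim whitespace around each piece
names    = stripped.reject { |t| t == '' }  # 4. throw out the empty pieces

# 5. In the rake task this is Sponsor.create!(:name => t);
#    here we only collect the attributes so the sketch runs without Rails.
created = []
names.each { |t| created << { :name => t } }

p names
```

One thing to watch: /<.*>/ is greedy, so on a line like <b>Acme</b> it would remove the name along with the tags. It works here because each name sits on its own line. If a page puts tags and text on the same line, a narrower pattern such as /<[^>]*>/ may be a safer choice.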


I have to thank emitilin from IRC, as he spruced up my code, but I do take solace in the fact that I knew where to go, and now I have an even greater understanding of Ruby and Rails. I'm sure in the comments you can point me to a method that does this in one fell swoop, but when you are learning Ruby, you are a winner when you get to apply the knowledge to a problem you need fixed.
