scrAPI and redirects without complete URL
I’ve been using scrAPI to scrape web pages in Ruby and loving it. Unfortunately, I got stuck for a while trying to read a page that returns a 302 redirect to a URL not beginning with http (see careerbuilder.com for examples). It turns out to be a straightforward fix. I’ve sent in a bug report, but I’m also providing a patch file and instructions until that gets fixed.
*** reader.rb.orig 2006-10-06 10:32:43.000000000 -0400
--- reader.rb 2006-10-06 10:32:30.000000000 -0400
***************
*** 159,163 ****
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
! return read_page(response["location"],
:last_modified=>options[:last_modified],
:etag=>options[:etag],
--- 159,165 ----
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
! loc = response["location"]
! loc = url.to_s.split(/\//)[0..2].join('/') + loc if loc !~ /^https?:\/\//
! return read_page(loc,
:last_modified=>options[:last_modified],
:etag=>options[:etag],
Copy the above code into a file; let’s call it scrapi.patch (alternatively, you can download it here). Then, type the following command (you’ll need write access to the file):
patch -bd /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/ < scrapi.patch
That should be it! A backup copy of reader.rb will be saved, and the scheme and host of the original URL will be prepended to the redirect location whenever it does not start with http:// or https://.
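For comparison, Ruby’s standard library can do this resolution itself. The sketch below is not part of the patch: resolve_redirect is a hypothetical helper name and the example.com URLs are placeholders. It uses URI.join, which also copes with a location that is relative to the current directory rather than the site root, a case where plain string concatenation would glue the file name straight onto the host name.

```ruby
require 'uri'

# Hypothetical helper (not part of scrAPI): resolve a Location header
# against the URL that produced the redirect. Absolute locations come back
# unchanged; relative ones inherit the scheme and host of base_url.
def resolve_redirect(base_url, location)
  URI.join(base_url.to_s, location).to_s
end

base = 'http://www.example.com/jobs/search.aspx'  # placeholder URL

puts resolve_redirect(base, '/jobs/results.aspx')           # root-relative
puts resolve_redirect(base, 'results.aspx')                 # directory-relative
puts resolve_redirect(base, 'https://other.example.com/x')  # already absolute
```

If you wanted to try this inside reader.rb, the call would take the place of the string-splitting line in the patch, assuming url there holds the address of the page being read.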

September 7th, 2007 at 11:24 am
I am trying to scrape search results from careerbuilder.com. How did you manage to handle the odd/even rows and the company name in each row? I am new to this. I am trying to use the gathered information to assist students looking for jobs.
Thanks in advance
nick.bh
September 7th, 2007 at 9:39 pm
I’m not even sure this works anymore, but here is what I wrote to scrape CB for one company:
require 'scrapi'    # assumption: scrAPI is installed as a gem
require 'unescape'

# Scrapes a single row of the job results table.
class FiberlinkJob < Scraper::Base
  process "td>a.job_title", :title => :text, :url => "@href"
  process "td", :description => :text
  process "span.tip_11", :location => :text, :posted_on => :text,
          :jobcode => "", :guid => ""
  result :title, :description, :location, :posted_on, :url, :jobcode, :guid

  # Clean up the raw values once the row has been processed.
  def collect
    self.title = unescape(self.title)
    self.url = unescape(self.url)
    self.url = self.url.sub(/\&sc.*/, '')  # drop the trailing &sc... parameters
    # The span holds both fields, so pull each one out of the same text.
    self.location = location.sub(/.*Location: ([^&]+).*/, '\1')
    self.posted_on = posted_on.sub(/.*Posted: ([A-Za-z]+).(\d+).*/, '\1 \2')
    self.guid = self.url
  end
end

# Scrapes the company results page into an array of FiberlinkJob results.
class Fiberlink < Scraper::Base
  def Fiberlink.url
    "http://www.careerbuilder.com/JobSeeker/Companies/CompanyJobResults.aspx?Comp_DID=C250M6WSR21P12J2CN"
  end

  array :jobs
  process "table#snapshotOff1 tr", :jobs => FiberlinkJob
  result :jobs
end
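The two substitutions in collect are the fiddly part, so here is a small standalone sketch of what they are meant to do. The sample string is made up (I no longer have the real markup to hand), and it assumes the character class in the location pattern is meant to be [^&]+, that is, everything up to the first ampersand.

```ruby
# Hypothetical text of a "span.tip_11" cell; the real page layout may differ.
raw = 'Location: Blue Bell, PA&nbsp; Posted: Oct 6'

# Keep only what follows "Location: " up to the first ampersand.
location = raw.sub(/.*Location: ([^&]+).*/, '\1')

# Keep the month name and day number that follow "Posted: ".
posted_on = raw.sub(/.*Posted: ([A-Za-z]+).(\d+).*/, '\1 \2')

puts location
puts posted_on
```

Both fields start from the same span text, which is why the scraper assigns :text to :location and :posted_on alike and leaves collect to separate them.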
Good luck!
Rick