scrAPI and redirects without complete URL
I’ve been using scrAPI to scrape web pages in Ruby and loving it. Unfortunately, I got stuck for a while when trying to read a page that issues a 302 redirect to a URL not beginning with http (see careerbuilder.com for examples). It turns out to be a straightforward fix. I’ve sent in a bug report, but I’m also providing a patch file and instructions until it gets fixed.
*** reader.rb.orig 2006-10-06 10:32:43.000000000 -0400
--- reader.rb 2006-10-06 10:32:30.000000000 -0400
***************
*** 159,163 ****
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
! return read_page(response["location"],
:last_modified=>options[:last_modified],
:etag=>options[:etag],
--- 159,165 ----
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
! loc = response["location"]
! loc = url.to_s.split(/\//)[0..2].join('/') + loc if loc !~ /^https?:\/\//
! return read_page(loc,
:last_modified=>options[:last_modified],
:etag=>options[:etag],
Copy the above code into a file; let’s call it scrapi.patch (alternatively, you can download it here). Then, type the following command (you’ll need write access to the file):
patch -bd /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/ < scrapi.patch
That should be it! A backup copy of reader.rb will be saved, and any redirect whose Location header does not begin with http:// or https:// will have the scheme and host of the original URL prepended to it.
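If you want to try the redirect logic outside of scrAPI, here is a minimal sketch. The helper names (prepend_host, resolve_redirect) and the example.com URLs are made up for illustration and are not part of scrAPI. The first helper mirrors what the patch does; the second uses URI.join from Ruby’s standard library, which I’d expect to be a safer choice because it also handles locations that don’t start with a slash.

```ruby
require "uri"

# Hypothetical helper mirroring the patch: keep "scheme://host" from the
# requested URL and append the Location value when it is not absolute.
# This assumes the Location value starts with "/".
def prepend_host(request_url, location)
  return location if location =~ /^https?:\/\//
  request_url.to_s.split(/\//)[0..2].join("/") + location
end

# Hypothetical alternative: let the standard library resolve the
# Location value against the requested URL.
def resolve_redirect(request_url, location)
  URI.join(request_url.to_s, location).to_s
end

base = "http://www.example.com/jobs/search?q=ruby"

puts prepend_host(base, "/results/page2")
puts resolve_redirect(base, "/results/page2")
puts resolve_redirect(base, "page2")
puts resolve_redirect(base, "https://other.example.org/x")
```

Both helpers agree when the Location value is an absolute path such as /results/page2. They differ for a value like page2: the split-and-join approach glues it straight onto the host name, while URI.join resolves it relative to the directory of the requested URL.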

September 7th, 2007 at 11:24 am
I am trying to scrape search results from careerbuilder.com. How did you manage to handle the odd/even rows and the company name in each row? I am new to this. I am trying to use the gathered information to help students looking for jobs.
Thanks in advance,
nick.bh
September 7th, 2007 at 9:39 pm
I’m not even sure this works anymore, but here is what I wrote to scrape CB for one company:
Good luck!
Rick