scrAPI and redirects without complete URL

Rick Wargo

I’ve been using scrAPI to scrape web pages in Ruby and loving it. Unfortunately, I got stuck for a while when trying to read a page whose 302 redirect points to a URL that does not begin with http (see careerbuilder.com for examples). It turns out to be a straightforward fix. I’ve sent in a bug report, but I’m also providing a patch file and instructions until that gets done.

*** reader.rb.orig 2006-10-06 10:32:43.000000000 -0400
--- reader.rb 2006-10-06 10:32:30.000000000 -0400
***************
*** 159,163 ****
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
! return read_page(response["location"],
:last_modified=>options[:last_modified],
:etag=>options[:etag],
--- 159,165 ----
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
! loc = response["location"]
! loc = url.to_s.split(/\//)[0..2].join('/') + loc if loc !~ /^https?:\/\//
! return read_page(loc,
:last_modified=>options[:last_modified],
:etag=>options[:etag],

Copy the above code into a file; let’s call it scrapi.patch (alternatively, you can download it here). Then, type the following command (you’ll need write access to the file):

patch -bd /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/ < scrapi.patch

That should be it! A backup copy of reader.rb will be saved, and any redirect location that does not start with http:// or https:// will be prefixed with the scheme and host of the URL that was originally requested.
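To see what the patched lines do outside of scrAPI, here is a minimal standalone sketch. The first helper repeats the expression from the patch; the second is an alternative I am only suggesting, based on Ruby's standard URI.join, which also copes with locations that do not start with a slash. Both helper names and the example.com URL are made up for illustration and are not part of scrAPI.

```ruby
require 'uri'

# Same expression as in the patch: keep "scheme://host" from the requested
# URL and put it in front of a Location value that has no scheme.
# (prefix_with_host is a hypothetical name, not a scrAPI method.)
def prefix_with_host(url, loc)
  loc = url.to_s.split(/\//)[0..2].join('/') + loc if loc !~ /^https?:\/\//
  loc
end

# Alternative sketch (an assumption, not what the patch uses): let the
# standard library resolve the Location value against the requested URL.
def resolve_location(url, loc)
  URI.join(url.to_s, loc).to_s
end

requested = "http://www.example.com/jobs/search.aspx?q=ruby" # made-up URL

puts prefix_with_host(requested, "/jobs/results.aspx")
puts resolve_location(requested, "/jobs/results.aspx")
puts resolve_location(requested, "results.aspx")
puts resolve_location(requested, "https://other.example.org/page")
```

The patch's version assumes the Location value starts with a slash, which was the case for the redirects I ran into. The URI.join variant would be the safer choice if a site sends a bare relative path such as results.aspx.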

2 Responses to “scrAPI and redirects without complete URL”

  1. Nick Bhanji Says:

    I am trying to scrape search results from careerbuilder.com. How did you manage to handle the odd/even rows and the company name in each row? I am new to this. I am trying to use the gathered information to assist students looking for jobs.

    thanks in advance

    nick.bh

  2. Rick Wargo Says:

    I’m not even sure this works anymore, but here is what I wrote to scrape CB for one company:

    require 'scrapi'    # scrAPI itself (provides Scraper::Base)
    require 'unescape'

    # Scrapes a single row of the job listing table.
    class FiberlinkJob < Scraper::Base
      process "td>a.job_title", :title => :text, :url => "@href"
      process "td", :description => :text
      process "span.tip_11", :location => :text, :posted_on => :text,
                             :jobcode => "", :guid => ""

      result :title, :description, :location, :posted_on, :url, :jobcode, :guid

      # Clean up the raw values once a row has been processed.
      def collect
        self.title = unescape(self.title)
        self.url = unescape(self.url)

        # Drop everything from "&sc" onwards in the link.
        self.url = self.url.sub(/\&sc.*/, '')

        # Keep only the text after "Location: " up to the next "&".
        self.location = location.sub(/.*Location: ([^&]+).*/, '\1')
        self.posted_on = posted_on.sub(/.*Posted: ([A-Za-z]+).(\d+).*/, '\1 \2')
        self.guid = self.url
      end
    end

    # Scrapes the company results page into an array of FiberlinkJob results.
    class Fiberlink < Scraper::Base
      def Fiberlink.url
        "http://www.careerbuilder.com/JobSeeker/Companies/CompanyJobResults.aspx?Comp_DID=C250M6WSR21P12J2CN"
      end

      array :jobs
      process "table#snapshotOff1 tr", :jobs => FiberlinkJob
      result :jobs
    end

    Good luck!
    Rick
