scrAPI and redirects without complete URL

Rick Wargo

I’ve been using scrAPI to scrape web pages in Ruby and loving it. Unfortunately, I got stuck for a while trying to read a page that responds with a 302 redirect whose location does not begin with http (see careerbuilder.com for examples). It turns out to be a straightforward fix. I’ve sent in a bug report, but I’m also providing a patch file and instructions until that gets done.

*** reader.rb.orig      2006-10-06 10:32:43.000000000 -0400
--- reader.rb     2006-10-06 10:32:30.000000000 -0400
***************
*** 159,163 ****
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
!         return read_page(response["location"],
:last_modified=>options[:last_modified],
:etag=>options[:etag],
--- 159,165 ----
:redirect_limit=>redirect_limit-1)
when Net::HTTPRedirection
!           loc = response["location"]
!           loc = url.to_s.split(/\//)[0..2].join('/') + loc if loc !~ /^https?:\/\//
!         return read_page(loc,
:last_modified=>options[:last_modified],
:etag=>options[:etag],

Copy the above code into a file; let’s call it scrapi.patch (alternatively, you can download it here). Then, type the following command (you’ll need write access to the file):

patch -bd /usr/lib/ruby/gems/1.8/gems/scrapi-1.2.0/lib/scraper/ < scrapi.patch

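That should be it! A backup copy of reader.rb will be saved, and whenever a redirect location does not start with http:// or https://, it will be prefixed with the scheme and host of the URL that was requested. Note that the patch simply concatenates the two strings, so it assumes the location begins with a slash.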
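For redirect locations that are relative paths rather than absolute ones, Ruby's standard URI library can do the resolution. The following is only a sketch of that alternative, not what the patch above does and not part of scrAPI: the helper name resolve_redirect and the example.com URLs are made up for illustration.

```ruby
require 'uri'

# Hypothetical helper (not part of scrAPI): resolve a redirect's Location
# header against the URL that was originally requested.
def resolve_redirect(base_url, location)
  # URI.join handles absolute URLs, absolute paths, and relative paths.
  URI.join(base_url.to_s, location).to_s
end

# Absolute path: the scheme and host come from the original URL.
puts resolve_redirect("http://www.example.com/jobs/search?q=ruby", "/jobs/results")
# prints http://www.example.com/jobs/results

# Relative path: resolved against the directory of the original URL.
puts resolve_redirect("http://www.example.com/jobs/search?q=ruby", "results?page=2")
# prints http://www.example.com/jobs/results?page=2

# Complete URL: returned unchanged.
puts resolve_redirect("http://www.example.com/jobs/search", "https://other.example.com/x")
# prints https://other.example.com/x
```

The same call could replace the split/join line in the patch, but I have not tested that inside reader.rb.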

2 Responses to “scrAPI and redirects without complete URL”

  1. Nick Bhanji Says:

    I am trying to scrape search results from careerbuilder.com. How did you manage to take care of odd/even rows and the company name in the row? I am new to this. I am trying to use the gathered information to assist students looking for jobs.

    thanks in advance

    nick.bh

  2. Rick Wargo Says:

    I’m not even sure this works anymore, but here is what I wrote to scrape CB for one company:

    require 'unescape'
    
    class FiberlinkJob < Scraper::Base
      process "td>a.job_title", :title => :text, :url => "@href"
      process "td", :description => :text
      process "span.tip_11", :location => :text, :posted_on => :text,
        :jobcode => "", :guid => ""
      
      result :title, :description, :location, :posted_on, :url, :jobcode, :guid
      
      def collect
        self.title = unescape(self.title)
        self.url = unescape(self.url)
        
        self.url = self.url.sub(/\&sc.*/, '')
    
        self.location = location.sub(/.*Location: ([^&]+).*/, '\1')
        self.posted_on = posted_on.sub(/.*Posted: ([A-Za-z]+).(\d+).*/, '\1 \2')
        self.guid = self.url
      end
    end
    
    class Fiberlink < Scraper::Base
      def Fiberlink.url
        "http://www.careerbuilder.com/JobSeeker/Companies/CompanyJobResults.aspx?Comp_DID=C250M6WSR21P12J2CN"
      end
        
      array :jobs
      process "table#snapshotOff1 tr", :jobs => FiberlinkJob
      result :jobs
    end

    Good luck!
    Rick
