Automatically Scraping Jobs from LinkedIn Using Ruby
Many job postings on LinkedIn are of no interest to me, and I often have to go through page after page of results to find the few that are appropriate. To make this less tedious, I automated the search and filter out the jobs I do not want, using Ruby and shell scripting. The script runs every night through cron and mails the results to me.
This process has a few hurdles to jump:
- automation of the login and paging of the search results
- collecting the interesting job information
- HTML email output
- running an interactive browsing session in a cron process
I have tackled these issues with the following tools:
- Firefox on Linux (Fedora)
- Ruby & Gems: Nokogiri (or Hpricot), FireWatir
- sendmail
- Xvfb
This process depends on Firefox and Linux (Fedora) but can be adapted to other platforms without much difficulty. The scripts perform no error checking, which keeps them small for the purpose of this blog entry. Read on for the code.
Firefox Automation
I use FireWatir to automate Firefox on Fedora. For this process, I need to authenticate to LinkedIn, issue the search query and page through the results.
To authenticate, the login and password text fields must be set to your information and the form submitted. Unfortunately, the send key delay is long and the simulation of "keying" in the characters is very slow. To circumvent this, I have my login information and password saved in Firefox so when I visit the login page, I can authenticate just by submitting the form.
ff = Firefox.start("https://www.linkedin.com/secure/login?trk=hb_signin")
##ff.text_field(:id, "session_key-login").set "account@example.com"
##ff.text_field(:id, "session_password-login").set "MyP@55w0rD"
ff.form(:name, "login").submit
The first line starts Firefox with the -jssh switch and navigates to the LinkedIn login page. The following two lines key in the user name and password for the LinkedIn account used for login. I have these commented out to speed the process as I have Firefox cache the values. The last line submits the login form, authenticating the process to LinkedIn.
Firefox is the FireWatir class for starting a Firefox browser. The start method invokes a new instance of Firefox and navigates to the given page.
The text_field method finds the text field by its id attribute, and the set method sets its value to the given string.
Finally the form method searches for a form named "login" and submits it.
To navigate the search results, one request is issued per page, explicitly specifying the page number. This is accomplished with the following code.
ff.goto("#{url}&page_num=#{pg}")
To close the instance of Firefox started by the start method, execute:
ff.close
near the end of the routine.
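As a rough sketch of how the paging requests fit together, the loop below builds one URL per page and hands each to a block. The `each_results_page` helper, the `max_pages` limit, and the `:stop` convention are hypothetical names of my own, not part of FireWatir; in the real script the block would call `ff.goto` and scrape `ff.html`.

```ruby
# Hypothetical helper: yields the URL of each results page in turn and
# returns the list of URLs that were visited.
# Stops early if the block returns :stop (e.g. the scraper saw an older date).
def each_results_page(url, max_pages)
  visited = []
  (1..max_pages).each do |pg|
    page_url = "#{url}&page_num=#{pg}"
    visited << page_url
    break if yield(page_url) == :stop
  end
  visited
end

# Stand-in block for illustration; the real script would call ff.goto(page_url).
pages = each_results_page("http://www.linkedin.com/jsearch?sortCriteria=DD", 5) do |page_url|
  page_url.end_with?("page_num=2") ? :stop : :continue
end
```

Here the stand-in block asks to stop on the second page, so only two of the five possible requests are made.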
That is all that is required to automate Firefox using FireWatir. At the time of this writing, FireWatir does not run on x86_64 Linux platforms, at least on Fedora.
Scraping
Hpricot and Nokogiri are fantastic libraries for scraping content from web pages and XML documents through either CSS or XPath navigation. Their methods are fairly similar, so you can choose either library. Personally, I have found Nokogiri to be much faster than Hpricot.
Given the HTML document for a page of search results, the jobs are located within the li elements whose id attributes contain "vcard-". Within each list element, the job title is in the first h2 element, the URL for the job posting is in the anchor element, and the company name, date, and location are in the paragraph with the class "company-info". Given that information, jobs can be filtered easily and the remaining ones saved for a later email notification.
The following Ruby code accomplishes the scraping; note that cleansing of the information also occurs.
doc = Nokogiri::HTML(ff.html)
# Each job is an li element whose id contains "vcard-"
jobs = doc.xpath('//li[contains(@id,"vcard-")]')
jobs.each do |job|
  title = job.search('h2').first.content.strip
  job_url = "http://www.linkedin.com" + job.search('a').first['href'].to_s
  # Collapse runs of whitespace, then split the text into its " - " separated fields
  info = job.search('p[@class="company-info"]').first.content.gsub(/[ \f\t\r\n]+/, ' ').split(/ - /).map { |s| s.strip }
  company = info.shift
  date = info.pop
  location = info.join(' - ')
  # Stop as soon as the posting date differs from the first one seen
  processing_date = date if processing_date.nil?
  break if date != processing_date
  # Count the job as rejected until it passes every filter
  rejected += 1
  next unless companies.select { |c| company =~ c }.empty?
  next unless titles.select { |t| title =~ t }.empty?
  next unless locations.select { |l| location =~ l }.empty?
  rejected -= 1
  found += 1
  results += "<tr><td><a href='#{job_url}' target='_top'>#{title}</a></td><td>#{company}</td><td>#{location}</td></tr>\n"
end
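To make the cleansing step concrete, here is a small sketch that applies the same whitespace collapsing and splitting to a made-up company-info string. The sample text and the `parse_company_info` name are assumptions for illustration; the text LinkedIn actually serves may differ.

```ruby
# Hypothetical helper mirroring the cleansing done in the loop above.
def parse_company_info(text)
  parts = text.gsub(/[ \f\t\r\n]+/, ' ').split(/ - /).map(&:strip)
  company = parts.shift          # first field
  date = parts.pop               # last field
  location = parts.join(' - ')   # whatever remains, even if it held " - "
  { company: company, location: location, date: date }
end

# Made-up sample with the kind of stray whitespace scraped markup tends to have.
sample = "Acme Corp\n   -  Greater Philadelphia Area\t - Jan 5, 2010\n"
info = parse_company_info(sample)
```

Taking the company from the front and the date from the back means a location that itself contains " - " is put back together rather than truncated.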
To keep this as simple as possible, the job runs nightly, so only a single day's jobs need to be processed; a change in the posting date is the sentinel that ends the loop.
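The single-day cutoff can also be expressed on plain data. This is only a sketch, assuming the results arrive newest first and using made-up job hashes; `jobs_for_first_date` is a hypothetical name.

```ruby
# Keep only the leading run of jobs that share the first job's date.
def jobs_for_first_date(jobs)
  return [] if jobs.empty?
  processing_date = jobs.first[:date]
  jobs.take_while { |job| job[:date] == processing_date }
end

# Made-up sample data, newest first.
jobs = [
  { title: "Ruby Developer", date: "Jan 5, 2010" },
  { title: "Build Engineer", date: "Jan 5, 2010" },
  { title: "DBA",            date: "Jan 4, 2010" }
]
todays = jobs_for_first_date(jobs)
```

With this sample, `todays` holds the two postings dated Jan 5, 2010 and the older one is dropped.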
To limit the jobs, lists of regular expressions for companies, titles, and locations are checked to determine whether a job should be excluded.
To compose the email, a table row is appended for each interesting job as it is found.
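As an illustration of what such lists might look like, the sketch below uses made-up patterns and job data. The constant names and the `excluded?` helper are hypothetical; `any?` is used here as a shorter equivalent of the `select { ... }.empty?` test in the loop.

```ruby
# Hypothetical exclusion lists; a match on any pattern rejects the job.
COMPANIES = [/recruit/i, /staffing/i]
TITLES    = [/\bintern\b/i, /sales/i]
LOCATIONS = [/anywhere/i]

def excluded?(job)
  COMPANIES.any? { |c| job[:company] =~ c } ||
    TITLES.any? { |t| job[:title] =~ t } ||
    LOCATIONS.any? { |l| job[:location] =~ l }
end

# Made-up sample jobs.
jobs = [
  { company: "Acme Corp",         title: "Ruby Developer", location: "Philadelphia, PA" },
  { company: "Best Staffing Inc", title: "Ruby Developer", location: "Philadelphia, PA" },
  { company: "Acme Corp",         title: "Sales Engineer", location: "Philadelphia, PA" }
]
kept, rejected = jobs.partition { |job| !excluded?(job) }
```

Only the first sample job survives; the second is rejected by the company list and the third by the title list.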
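One refinement worth considering, which is not in my script, is escaping the scraped values before placing them in the table, since a title such as "R&D Engineer" would otherwise put a bare ampersand into the HTML. A minimal sketch, assuming the standard library's `ERB::Util.html_escape`; the `job_row` helper and the sample values are hypothetical.

```ruby
require 'erb'

# Hypothetical helper: builds one table row, HTML-escaping every value.
def job_row(job_url, title, company, location)
  esc = ->(s) { ERB::Util.html_escape(s) }
  link = "<a href='#{esc.call(job_url)}' target='_top'>#{esc.call(title)}</a>"
  "<tr><td>#{link}</td><td>#{esc.call(company)}</td><td>#{esc.call(location)}</td></tr>\n"
end

# Made-up sample values.
rows = [job_row("http://www.example.com/job?id=1&src=mail", "R&D Engineer", "Acme Corp", "Philadelphia, PA")]
results = rows.join
```

Collecting the rows in an array and joining them once at the end also avoids building a new string on every `+=`.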
E-Mail Results
The results are easily emailed by outputting HTML preceded by email headers. This output can be piped to sendmail for delivery.
print <<-_EOT1_
MIME-Version: 1.0
Content-Type: text/html
From: LinkedIn Job Reporter <account\@example.com>
To: account\@example.com
Subject: Today's #{subject}

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html>
<head>
<title>#{subject} for #{processing_date}</title>
<style type="text/css">
table { font-family: Verdana, sans-serif; font-style: normal; font-size: 10pt; }
body { font-family: Verdana, sans-serif; font-style: normal; font-size: 10pt; }
h1 { font-family: Verdana, sans-serif; font-size: 14pt; color: navy; }
.date { font-size: 100%; font-weight: bold; background: lightyellow; }
.header { font-size: 100%; font-weight: bold; background: lightblue; }
</style>
</head>
<body>
<h1><a href="#{url}&page_num=1" target="_top">#{subject} for #{processing_date}</a></h1>
<p/>
<table border=1 cellpadding=4 cellspacing=0>
<tr class=header>
<td>Title</td>
<td>Company</td>
<td>Location</td>
</tr>
#{results}
</table>
<p/>
Found #{found} and rejected #{rejected} out of #{found+rejected} jobs.
</body>
</html>
_EOT1_
Note that the email also includes a link back to the first page of the search results and a summary count of the jobs found and rejected.
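The overall shape of the message, headers first, then a blank line, then the HTML body, can be sketched as follows. The `build_message` and `deliver` helpers are hypothetical; my script simply prints the message and lets the shell pipe it to sendmail, so treat the `IO.popen` variant as one possible alternative rather than what the script does.

```ruby
# Hypothetical helper: joins the headers and an HTML body into one message.
# A blank line must separate the headers from the body.
def build_message(from:, to:, subject:, html:)
  headers = [
    "MIME-Version: 1.0",
    "Content-Type: text/html",
    "From: #{from}",
    "To: #{to}",
    "Subject: #{subject}"
  ]
  headers.join("\n") + "\n\n" + html
end

# Hypothetical helper: pipes the message to sendmail.
# With -t, sendmail takes the recipients from the message headers.
def deliver(message, sendmail = "/usr/lib/sendmail")
  IO.popen([sendmail, "-t"], "w") { |io| io.write(message) }
end

message = build_message(
  from: "LinkedIn Job Reporter <account@example.com>",
  to: "account@example.com",
  subject: "Today's Local LinkedIn Job Results",
  html: "<html><body><p>No jobs today.</p></body></html>"
)
```

Calling `deliver(message)` would hand the message to sendmail; it is left uncalled here so the sketch can be run without sending mail.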
Cron Job
The trick here is that the process is not connected to an X Window session, so it must run under the X virtual framebuffer server, Xvfb. This allows Firefox to run in the background without an active logged-in window session. Some of the parameters are passed on the command line so the script can be run multiple times with different searches, and the output from the Ruby script is consumed by sendmail.
For some reason, FireWatir could not start an instance of Firefox unless one was already running, so the script invokes Firefox with the -jssh flag; the process terminates at the end when the X server is killed.
The URL should be modified to correspond to the types of jobs desired.
The script in its entirety follows.
#!/bin/sh
export DISPLAY=:1
Xvfb $DISPLAY 2>/dev/null &
xvfb_pid=$!
firefox -jssh 2>/dev/null&
sleep 5
# Local jobs
subject='Local LinkedIn Job Results'
url='http://www.linkedin.com/jsearch?searchLocationType=I&pplSearchOrigin=MDYS&sortCriteria=DD&countryCode=us&postalCode=19422&distance=75'
ruby ~rick/Develop/linkedin.rb "$subject" $url 2>/dev/null >/tmp/jobs$$
/usr/lib/sendmail -t < /tmp/jobs$$
# Interesting jobs worldwide
subject='Interesting Jobs Worldwide'
url='http://www.linkedin.com/jobs?runSearch=&sortCriteria=1&jobFunction=it&experienceLevel=5&trk_info=jobview_similar_fc_ex'
ruby ~rick/Develop/linkedin.rb "$subject" $url 2>/dev/null >/tmp/jobs$$
/usr/lib/sendmail -t < /tmp/jobs$$
rm -f /tmp/jobs$$
kill $xvfb_pid
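On the Ruby side, the two positional arguments passed by the shell script (the subject, then the URL) could be picked up along these lines. This is a sketch only: the `parse_args` name is hypothetical, and the argument check is an addition that the script presented here does not perform.

```ruby
# Hypothetical argument handling for linkedin.rb: subject first, then URL.
def parse_args(argv)
  subject, url = argv
  raise ArgumentError, "usage: linkedin.rb SUBJECT URL" if subject.nil? || url.nil?
  { subject: subject, url: url }
end

# In the real script this would be parse_args(ARGV); a literal array is used
# here so the sketch runs on its own.
opts = parse_args(["Local LinkedIn Job Results",
                   "http://www.linkedin.com/jsearch?sortCriteria=DD&countryCode=us"])
```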
I've made the code available for download; it contains little more than what is presented here.









