Case study: contracts recommendation system
Executive summary
Work on the system was canceled after the stage of automatically gathering data from job boards. The R&D phase revealed ethical issues caused by a conflict between the system's intent and the job boards' robots.txt policies. Meanwhile, I gained basic skills in web scraping - automatically gathering tailor-made datasets from public sources - and now offer this as a service.
The idea
The idea is to automatically aggregate contracts from popular job
boards and suggest new contracts based on my previous feedback.
The idea for this project belongs to my former colleague and friend,
who did something similar for himself years ago. It had been in my
backlog for a while, and this week was the time to start working on it.
After gathering a weekly/monthly database of new contracts, all of them
should be processed to build a feature vector for each contract.
My feedback on each contract and the feedback on my application would
extend this feature vector by two more values. A rating score for each
contract would be produced by calculating the Euclidean distance between
vectors, which translates into a measure of similarity between new
contracts and the contracts I liked and got positive feedback on from a customer.
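To make the scoring idea concrete, here is a minimal sketch; the feature layout and values are made-up illustrations, not the project's actual schema.

import math

def euclidean_distance(a, b):
    # Standard Euclidean distance between two equally sized feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rate_contract(new_vector, liked_vectors):
    # Score a new contract by its distance to the closest contract that
    # previously got my like and positive customer feedback:
    # the smaller the distance, the more similar (and more relevant) it is.
    return min(euclidean_distance(new_vector, liked) for liked in liked_vectors)

# Hypothetical vectors, e.g. [rate, remote, stack_match, my_feedback, customer_feedback]
liked = [[80.0, 1.0, 1.0, 1.0, 1.0], [95.0, 0.0, 1.0, 1.0, 1.0]]
new_contract = [85.0, 1.0, 1.0, 0.0, 0.0]
print(rate_contract(new_contract, liked))  # lower score = closer to past successes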
R&D
Web Crawlers
R&D revealed several interesting solutions on the market:
- Octoparse: an almost no-code platform; it did not give good results on
infoline.ru and 2gis.ru
- BeautifulSoup: a quick and handy library for extracting data from HTML pages
(a minimal sketch follows this list)
- Scrapy: a solid, enterprise-level framework that combines crawler
and scraping capabilities, but so far I have only worked with it
from the console and do not know yet how it will behave when
integrated into a Python server
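As a quick illustration of the BeautifulSoup approach mentioned above, here is a minimal sketch; the URL and CSS classes are hypothetical, and requests plus beautifulsoup4 are the only dependencies.

import requests
from bs4 import BeautifulSoup

# Hypothetical listing page and CSS classes, for illustration only.
response = requests.get('https://example.com/jobs')
soup = BeautifulSoup(response.text, 'html.parser')

for card in soup.select('.vacancy-card'):
    title = card.select_one('.title')
    company = card.select_one('.company')
    print(title.get_text(strip=True) if title else None,
          company.get_text(strip=True) if company else None)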
The code of one of my Scrapy crawlers looks like this:
import scrapy


class JobBoardSpider(scrapy.Spider):
    name = 'one-popular-job-board-spider'
    start_urls = [
        'https:///search/vacancy?area=1&ored_clusters=true&professional_role=96&search_period=30&text=Android&order_by=publication_time'
    ]

    def __init__(self):
        super().__init__()
        self.BASE_URL = ' '
        self.JOB_SELECTOR = '.vacancy-serp-item-body'
        self.JOB_TITLE_SELECTOR = '.serp-item__title::text'
        self.JOB_COMPANY_SELECTOR = '.bloko-link_kind-tertiary::text'
        self.JOB_COMPANY_URL_SELECTOR = '.bloko-link_kind-tertiary::attr(href)'
        self.JOB_COMPENSATION_SELECTOR = '.bloko-header-section-2::text'
        self.NEXT_SELECTOR = '.bloko-button[data-qa="pager-next"]::attr(href)'

    def start_requests(self):
        # Identify the crawler explicitly so site owners can see who is requesting their pages.
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                headers={"User-Agent": "one-popular-job-board-crawler (+https://github.com/Gelassen/web-crawlers)"},
            )

    def parse(self, response):
        # One record per vacancy card on the search results page.
        for vacancy in response.css(self.JOB_SELECTOR):
            yield {
                'jobTitle': vacancy.css(self.JOB_TITLE_SELECTOR).get(),
                'compensation': ''.join(vacancy.css(self.JOB_COMPENSATION_SELECTOR).getall()),
                'company': ''.join(vacancy.css(self.JOB_COMPANY_SELECTOR).getall()),
                'companyUrl': self.BASE_URL + vacancy.css(self.JOB_COMPANY_URL_SELECTOR).get()
            }

        # Follow pagination until there is no "next" page link left.
        next_page = response.css(self.NEXT_SELECTOR).get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page))
It is run from the console with this command:
$ scrapy runspider scraper.py -o hh-scraped-results.json -s FEED_EXPORT_ENCODING=utf-8
However, if you tell your crawler to respect robots.txt, you will not get any data. Some site owners go even further and apply a variety of techniques to make automatic web scraping more difficult and time-consuming.
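For reference, this behaviour is controlled by Scrapy's ROBOTSTXT_OBEY setting, which can be set in settings.py or overridden per run from the command line:

# settings.py: when True, Scrapy fetches /robots.txt first and drops disallowed requests.
ROBOTSTXT_OBEY = True

$ scrapy runspider scraper.py -s ROBOTSTXT_OBEY=True -o hh-scraped-results.json -s FEED_EXPORT_ENCODING=utf-8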
Scrapy supports many options to overcome such techniques, but that is what brings us to the ethical issue - do we really need this data badly enough to undermine another person's will not to have it gathered automatically?
Other considerations
There was an idea to set up a more complex ML model to build such
recommendations, but after a chat with another former colleague
and friend, we agreed it might be over-engineering and that it
is better to start with the simplest approach that solves the
task.
The choice of the right database is another interesting task:
for full-text search the platform would need something like
Elasticsearch, while a non-normalized dataset from different
job boards might call for a NoSQL database like MongoDB;
however, we are also going to store the calculated feature vectors
and run math operations on them, which puts one more constraint
on the choice of database.
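For illustration only, a stored contract might look like the hypothetical document below - raw fields scraped from a job board plus the calculated feature vector saved next to them (all field names and values are assumptions, not the project's actual schema):

contract_document = {
    "jobTitle": "Android Developer",
    "company": "Some Company",
    "compensation": "from 3000 USD",
    "source": "one-popular-job-board",
    # Calculated feature vector stored alongside the raw fields,
    # extended by my feedback and the customer's feedback.
    "features": [85.0, 1.0, 1.0, 0.0, 0.0],
}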
If you do not want to violate robots.txt rules, but do not mind
overexploiting people from the poorest countries, you might consider
using Mechanical Turk to do this tedious, routine job for you.
Originally published on LinkedIn