Case study: contracts recommendation system
Executive summary
Work on the system was canceled after the stage of automatically gathering data from job boards. The R&D phase revealed ethical issues caused by a conflict between the system's intent and the job boards' robots.txt policies. Meanwhile, I gained basic skills in web scraping - automatically gathering tailor-made datasets from public sources - and now offer this as a service.
The idea
The idea is to automatically aggregate contracts from popular job
boards and suggest new contracts based on my previous feedback.
The idea for this project belongs to my former colleague and friend,
who did something similar for himself years ago. It had been in my
backlog for a while, and this week was the time to start working on it.
After gathering a weekly/monthly database of new contracts, all of them
should be processed to build a feature vector for each contract.
My feedback on each contract and the feedback on my application would
extend this feature vector by two more values. A rating score for each
contract would be produced by calculating the Euclidean distance between
vectors, which translates into a measure of similarity between new
contracts and the contracts I liked and got positive feedback on from a customer.
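To make the scoring idea concrete, here is a minimal sketch; the feature layout and values are made-up illustrations, not the project's actual schema.

import math

def euclidean_distance(a, b):
    # Standard Euclidean distance between two equally sized feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rate_contract(new_vector, liked_vectors):
    # Score a new contract by its distance to the closest contract that
    # previously got my like and positive customer feedback:
    # the smaller the distance, the more similar (and more relevant) it is.
    return min(euclidean_distance(new_vector, liked) for liked in liked_vectors)

# Hypothetical vectors, e.g. [rate, remote, stack_match, my_feedback, customer_feedback]
liked = [[80.0, 1.0, 1.0, 1.0, 1.0], [95.0, 0.0, 1.0, 1.0, 1.0]]
new_contract = [85.0, 1.0, 1.0, 0.0, 0.0]
print(rate_contract(new_contract, liked))  # lower score = closer to past successes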
R&D
Web Crawlers
R&D revealed several interesting solutions on the market:
- Octoparse: an almost no-code platform; it did not give good results on
infoline.ru and 2gis.ru
- BeautifulSoup: a quick and handy library for extracting data from HTML pages
(a minimal sketch follows this list)
- Scrapy: a solid, enterprise-level framework that combines crawler
and scraping capabilities, but so far I have only worked with it
from the console and do not know yet how it will behave when
integrated into a Python server
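As a quick illustration of the BeautifulSoup approach mentioned above, here is a minimal sketch; the URL and CSS classes are hypothetical, and requests plus beautifulsoup4 are the only dependencies.

import requests
from bs4 import BeautifulSoup

# Hypothetical listing page and CSS classes, for illustration only.
response = requests.get('https://example.com/jobs')
soup = BeautifulSoup(response.text, 'html.parser')

for card in soup.select('.vacancy-card'):
    title = card.select_one('.title')
    company = card.select_one('.company')
    print(title.get_text(strip=True) if title else None,
          company.get_text(strip=True) if company else None)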
The code of one of my Scrapy crawlers looks like this:
import scrapy


class JobBoardSpider(scrapy.Spider):
    name = 'one-popular-job-board-spider'
    start_urls = [
        'https:///search/vacancy?area=1&ored_clusters=true&professional_role=96&search_period=30&text=Android&order_by=publication_time'
    ]

    def __init__(self):
        super().__init__()
        self.BASE_URL = ' '
        self.JOB_SELECTOR = '.vacancy-serp-item-body'
        self.JOB_TITLE_SELECTOR = '.serp-item__title::text'
        self.JOB_COMPANY_SELECTOR = '.bloko-link_kind-tertiary::text'
        self.JOB_COMPANY_URL_SELECTOR = '.bloko-link_kind-tertiary::attr(href)'
        self.JOB_COMPENSATION_SELECTOR = '.bloko-header-section-2::text'
        self.NEXT_SELECTOR = '.bloko-button[data-qa="pager-next"]::attr(href)'

    def start_requests(self):
        # Identify the crawler explicitly so site owners can see who is requesting their pages.
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse,
                headers={"User-Agent": "one-popular-job-board-crawler (+https://github.com/Gelassen/web-crawlers)"},
            )

    def parse(self, response):
        # One record per vacancy card on the search results page.
        for vacancy in response.css(self.JOB_SELECTOR):
            yield {
                'jobTitle': vacancy.css(self.JOB_TITLE_SELECTOR).get(),
                'compensation': ''.join(vacancy.css(self.JOB_COMPENSATION_SELECTOR).getall()),
                'company': ''.join(vacancy.css(self.JOB_COMPANY_SELECTOR).getall()),
                'companyUrl': self.BASE_URL + vacancy.css(self.JOB_COMPANY_URL_SELECTOR).get()
            }

        # Follow pagination until there is no "next" page link left.
        next_page = response.css(self.NEXT_SELECTOR).get()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page))
It is run from the console with this command:
$ scrapy runspider scraper.py -o hh-scraped-results.json -s FEED_EXPORT_ENCODING=utf-8
However, if you tell your crawler to respect robots.txt, you will not get any data. Some site owners go even further and apply a variety of techniques to make automatic web scraping more difficult and time-consuming.
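For reference, this behaviour is controlled by Scrapy's ROBOTSTXT_OBEY setting, which can be set in settings.py or overridden per run from the command line:

# settings.py: when True, Scrapy fetches /robots.txt first and drops disallowed requests.
ROBOTSTXT_OBEY = True

$ scrapy runspider scraper.py -s ROBOTSTXT_OBEY=True -o hh-scraped-results.json -s FEED_EXPORT_ENCODING=utf-8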
Scrapy supports many options to overcome such techniques, but that is what brings us to the ethical issue - do we really need this data badly enough to undermine another person's will not to have it gathered automatically?
Other considerations
There was an idea to set up a more complex ML model to build such
recommendations, but after a chat with another former colleague
and friend, we agreed it might be over-engineering and that it
is better to start with the simplest approach that solves the
task.
The choice of the right database is another interesting task:
for full-text search the platform would need something like
Elasticsearch, while a non-normalized dataset from different
job boards might call for a NoSQL database like MongoDB;
however, we are also going to store the calculated feature vectors
and run math operations on them, which puts one more constraint
on the choice of database.
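For illustration only, a stored contract might look like the hypothetical document below - raw fields scraped from a job board plus the calculated feature vector saved next to them (all field names and values are assumptions, not the project's actual schema):

contract_document = {
    "jobTitle": "Android Developer",
    "company": "Some Company",
    "compensation": "from 3000 USD",
    "source": "one-popular-job-board",
    # Calculated feature vector stored alongside the raw fields,
    # extended by my feedback and the customer's feedback.
    "features": [85.0, 1.0, 1.0, 0.0, 0.0],
}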
If you do not want to violate robots.txt rules, but do not mind
overexploiting people from the poorest countries, you might consider
using Mechanical Turk to do this tedious, routine job for you.
Originally published on LinkedIn