WP1 - Web Scraping for Job Vacancy Statistics

Big Data ESSNet CG Meeting, Brussels, 26-27 October 2017
Nigel Swier

Rationale
Current Official Estimates (Survey) | Online data
Frequency: Quarterly | Real-time?
Industry Sector, Enterprise Size, Job type / skills, Geography, National Totals
- More frequent
- More timely
- More granular
- Less burden
- Cheaper???

Participants (BD ESSNet WP1)
Original partners: United Kingdom (lead), Germany, Sweden, Slovenia, Italy, Greece
Joined for SGA-2: France, Belgium, Denmark, Portugal

Summary of Issues
- Complex and dynamic landscape. Many possible routes for accessing OJV data
- Processing OJV data is highly resource intensive
- Fundamental differences between OJV data and JVS concepts
- Incomplete (and unrepresentative) coverage
- Technology Platforms
- Data Science Skills

Review of SGA-2 Objectives

Task 1: Data Access
- To explore the feasibility of web scraping job vacancies from enterprise websites using the approaches developed by WP2.
- To compile a list of URLs linked to enterprise units on the business register.
- The methods developed by WP2 will provide only limited information. This will not help us deliver experimental outputs by the end of the ESSNet, but it could be of benefit in the longer term.
- The UK has developed a framework for building website-specific mini-bots to obtain OJV counts from enterprise websites.
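
The mini-bot idea can be sketched as a per-site rule plus a small counter. This is an illustration only, not the UK framework itself: the `MiniBotConfig` fields, the `job-posting` class name, the enterprise identifier and the sample page are all assumptions, and a real bot would also need to fetch pages and handle pagination.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class MiniBotConfig:
    """Per-site rule describing which element marks one vacancy (hypothetical)."""
    enterprise_id: str  # assumed business-register identifier
    tag: str            # HTML tag wrapping a single vacancy
    css_class: str      # class value identifying vacancy elements


class VacancyCounter(HTMLParser):
    """Counts opening tags that match the site-specific rule."""

    def __init__(self, config: MiniBotConfig) -> None:
        super().__init__()
        self.config = config
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated values.
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.config.tag and self.config.css_class in classes:
            self.count += 1


def count_vacancies(html: str, config: MiniBotConfig) -> int:
    """Return the number of vacancy elements found in one careers page."""
    parser = VacancyCounter(config)
    parser.feed(html)
    parser.close()
    return parser.count


# Made-up careers page: two vacancies and one unrelated list item.
SAMPLE_PAGE = """
<ul>
  <li class="job-posting">Data analyst</li>
  <li class="job-posting featured">Statistician</li>
  <li class="news-item">Office move</li>
</ul>
"""

config = MiniBotConfig(enterprise_id="ENT-0001", tag="li", css_class="job-posting")
print(config.enterprise_id, count_vacancies(SAMPLE_PAGE, config))
```

Keeping the site-specific part in a small configuration object means a new enterprise website needs a new rule rather than new code.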

Task 2: Data Handling
- If useful information can be extracted from enterprise websites, to develop a method for integrating this with data from job portals.
- To investigate and develop text-mining and machine learning approaches to extract information from unstructured text (e.g. supplementary information for coding/validating occupation, deriving skills, qualifications).
- Several text mining / classification experiments underway or planned: Greece (ISCO-08), France (FAP/ROME), Belgium (NACE)
- Avoiding overlaps with CEDEFOP

Task 3: Methodology and Technology
- To refine methods to improve the quality of the experimental job vacancy estimates produced during SGA-1, including improvements to linking and handling of jobs advertised on the web.
- To consider which stages of the GSBPM for job vacancy statistics could incorporate these new data sources and methods (e.g. data collection, data integration, non-response adjustment, increased precision).
- Still a lot of work to do around quality and estimation. This will not be complete by the end of the ESSNet.

Task 4: Statistical Outputs
- To produce improved experimental estimates incorporating additional sources (more details later).
- To produce new experimental statistical products in the domain of job vacancies, e.g. estimates by geography and/or occupation group (more details later).
- To explore whether the findings of this pilot could be used for new applications. For example:
  - Comparing vacancies and associated skills requirements within an area to skills within the local labour market (explored as part of the Eurostat Hackathon)
  - Maintenance of occupation classification and coding frames (Greece, Sweden, France)
  - An input into flash estimates for economic statistics (UK focus; more details later)

Task 5: Future Perspectives
- Hold a two-day workshop for sharing experiences in the field of job vacancy data for official statistics purposes (September/October 2017). Complete!
- Develop and implement a strategy for ongoing engagement and development on the use of web scraped job vacancy data for statistical purposes within the ESS.
- A longer-term roadmap for moving experimental ESSnet research into statistical production.
- Basis for collaboration established with CEDEFOP
- Continue as part of next ESSNet?

WP1 Meeting: Thessaloniki, 21-22 September

CEDEFOP Collaboration
- CEDEFOP project to scrape and process OJV data for all member states
- Completion in 2020 (but with some data available from end-2018)
- Need to avoid duplicating processes / activities
- CEDEFOP have identified similar issues around data quality
- Question: How can the ESS(Net) add value?
- Answer: Expertise in quality and privileged access to JVS micro data

Meeting objectives:
- Fully explore what experimental outputs could be produced by the end of SGA-2
- Integrate new partners (and new people) into the WP
- Elaborate the collaboration with CEDEFOP
- Agree roles and develop a plan for activities and deliverables for SGA-2
- Think about the future

Key Outcomes:
- Arrangements for sharing code & collaborating (e.g. Slack)
- Concrete actions agreed with CEDEFOP
- Country-based pilot research plans for SGA-2 focused on producing concrete results
- Information shared through an expanded network

SGA-2 Research Plans

United Kingdom
Data: Burning Glass, 2 major job search engines plus web scraping framework, CEDEFOP pilot, JVS
Processing/Methodology:
- Match OJV counts (from various sources) to JVS reporting units.
- Use the JVS counts and machine learning to train a model with OJV counts, industry and size indicators as features.
- Use the model to predict current vacancy estimates.
Expected Output(s): Weighted estimates? Job vacancy nowcast estimates
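
The train-then-predict step above could look roughly like the following. Ridge regression is used here purely as a stand-in, since the slides do not say which learner the UK team uses; the matched units, industry names, size bands and penalty value are invented for illustration.

```python
def featurise(ojv_count, industry, size_band, industries, size_bands):
    """Intercept, raw OJV count, then one-hot industry and size indicators."""
    row = [1.0, float(ojv_count)]
    row += [1.0 if industry == name else 0.0 for name in industries]
    row += [1.0 if size_band == name else 0.0 for name in size_bands]
    return row


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(rows, targets, penalty=1.0):
    """Closed-form ridge fit: (X'X + penalty * I) w = X'y.

    The penalty is applied to every weight, intercept included, which also
    keeps the system solvable when the one-hot columns are collinear.
    """
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) + (penalty if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    return sum(w * v for w, v in zip(weights, row))


# Invented matched units: (OJV count, industry, size band, JVS count).
INDUSTRIES = ["retail", "ict", "health"]
SIZE_BANDS = ["small", "large"]
matched_units = [
    (12, "retail", "large", 15), (3, "retail", "small", 4),
    (20, "ict", "large", 18), (5, "ict", "small", 6),
    (8, "health", "large", 14), (1, "health", "small", 3),
]

rows = [featurise(o, i, s, INDUSTRIES, SIZE_BANDS) for o, i, s, _ in matched_units]
targets = [float(jvs) for *_, jvs in matched_units]
weights = fit_ridge(rows, targets)

# Predict vacancies for a unit with a current OJV count but no JVS return yet.
new_unit = featurise(10, "ict", "large", INDUSTRIES, SIZE_BANDS)
print(round(predict(weights, new_unit), 1))
```

Whatever learner is chosen, the JVS count is the training target and the OJV count is just one feature, so the model learns how online counts relate to survey counts rather than treating them as equivalent.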

Enterprise count comparison (by portal)

Greece
Data:
- Scraped job ads
- Already manually coded data to be used as training data
Methodology: Use of text mining and ML approaches to classify job titles (and job descriptions) to ISCO-08
Expected Output: ISCO-08 occupation codes assigned to job titles
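
As a rough illustration of classifying job titles to ISCO-08 from manually coded training data, the sketch below uses a small multinomial naive Bayes classifier. The slides do not name the Greek team's method, so the choice of naive Bayes is an assumption. The training titles are invented and in English for readability, and the code-to-title pairings are illustrative rather than an official coding frame.

```python
import math
import re
from collections import Counter, defaultdict


def tokenise(title):
    """Lower-case a job title and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", title.lower())


class TitleClassifier:
    """Multinomial naive Bayes with add-one smoothing over title tokens."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # code -> token frequencies
        self.class_counts = Counter()             # code -> number of titles
        self.vocabulary = set()

    def fit(self, examples):
        for title, code in examples:
            self.class_counts[code] += 1
            for token in tokenise(title):
                self.token_counts[code][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, title):
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in tokenise(title) if t in self.vocabulary]
        total_titles = sum(self.class_counts.values())
        best_code, best_score = None, -math.inf
        for code, n_titles in self.class_counts.items():
            counts = self.token_counts[code]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_titles / total_titles)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_code, best_score = code, score
        return best_code


# Invented training set: (job title, illustrative ISCO-08 unit group code).
TRAINING = [
    ("Senior software developer", "2512"),
    ("Java developer", "2512"),
    ("Staff nurse", "2221"),
    ("Registered nurse", "2221"),
    ("Shop sales assistant", "5223"),
    ("Retail sales assistant", "5223"),
]

classifier = TitleClassifier().fit(TRAINING)
for title in ("Junior software developer", "Night nurse"):
    print(title, "->", classifier.predict(title))
```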

Slovenia
Data: Data from two job portals, enterprise websites, administrative data from Employment Service
Processing/Methodology: Expand framework for collecting data by web scraping, administrative data and (some) enterprise websites
Expected Output:
- Estimates of available OJVs (reference day/month)
- Estimates of newly available OJVs (reference day/month)
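
The two expected outputs correspond to a stock (adverts live on a reference day) and a flow (adverts first appearing in a reference month). A minimal sketch follows, assuming each scraped advert carries first-seen and last-seen dates; the `Advert` fields and the sample records are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Advert:
    advert_id: str
    first_seen: date  # first day the scraper observed the advert
    last_seen: date   # last day the scraper observed the advert


def available_on(adverts, reference_day):
    """Stock: adverts observed as live on the reference day."""
    return sum(1 for a in adverts if a.first_seen <= reference_day <= a.last_seen)


def newly_available_in(adverts, year, month):
    """Flow: adverts first observed during the reference month."""
    return sum(1 for a in adverts
               if (a.first_seen.year, a.first_seen.month) == (year, month))


# Invented scrape history.
adverts = [
    Advert("A", date(2017, 9, 20), date(2017, 10, 5)),
    Advert("B", date(2017, 10, 1), date(2017, 10, 31)),
    Advert("C", date(2017, 10, 15), date(2017, 11, 10)),
    Advert("D", date(2017, 11, 2), date(2017, 11, 20)),
    Advert("E", date(2017, 10, 20), date(2017, 10, 25)),
]

print("available on 2017-10-31:", available_on(adverts, date(2017, 10, 31)))
print("new in October 2017:", newly_available_in(adverts, 2017, 10))
```

The stock measure is the one conceptually closest to a survey reference-day count, while the flow measure depends on how often each source is scraped.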

France
Data:
- Web scraped data
- Public Employment agency data (partnerships with 150 job portals)
Processing/Methodology:
- Cleaning, deduplication, harmonising site-specific nomenclatures
- Matching OJV with administrative and JVS data
Expected Output: Portal-specific vacancy counts (monitor trends over time and compare with JVS)
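
One simple way to deduplicate adverts collected from several portals is to compare a normalised key. The sketch below assumes each advert is a dict with `portal`, `employer`, `title` and `location` fields; these field names and the sample adverts are hypothetical. Exact-key matching is only a baseline, and real pipelines usually need fuzzier comparison.

```python
import re
import unicodedata


def normalise(text):
    """Strip accents, lower-case, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", text)
    unaccented = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", unaccented.lower()).strip()


def dedup_key(ad):
    """Adverts sharing employer, title and location are treated as one vacancy."""
    return (normalise(ad["employer"]), normalise(ad["title"]), normalise(ad["location"]))


def deduplicate(ads):
    """Keep the first advert per key and record every portal it appeared on."""
    unique = {}
    for ad in ads:
        key = dedup_key(ad)
        if key not in unique:
            unique[key] = {**ad, "portals": [ad["portal"]]}
        elif ad["portal"] not in unique[key]["portals"]:
            unique[key]["portals"].append(ad["portal"])
    return list(unique.values())


# Invented adverts: the first two are the same vacancy on different portals.
ads = [
    {"portal": "portal_a", "employer": "Acme SA",
     "title": "Ingénieur données", "location": "Lyon"},
    {"portal": "portal_b", "employer": "ACME SA",
     "title": "Ingenieur Donnees", "location": "LYON"},
    {"portal": "portal_a", "employer": "Acme SA",
     "title": "Comptable", "location": "Lyon"},
]

for vacancy in deduplicate(ads):
    print(vacancy["title"], vacancy["portals"])
```

Recording the portals for each retained vacancy keeps portal-specific counts recoverable after deduplication.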

Germany
Data: Federal Employment Agency (portals and JVS data), CEDEFOP pilot, Stepstone
Processing/Methodology:
- Matching of OJV and JVS data
- Explore feasibility of matching (particular challenges in Germany)
Expected Output: Reliable matching methodology
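
Where matching has to rely on employer names, a fuzzy string comparison is one possible starting point. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The legal-form list, the 0.85 threshold, the unit identifiers and the company names are all assumptions, and the slides do not describe which matching variables are actually available in Germany.

```python
import difflib
import re

# Illustrative legal-form tokens dropped before comparison (not exhaustive).
LEGAL_FORMS = {"gmbh", "ag", "kg", "se", "co"}


def clean_name(name):
    """Lower-case, tokenise and drop legal-form tokens from an employer name."""
    tokens = re.findall(r"[a-z0-9äöüß]+", name.lower())
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)


def best_match(ojv_name, jvs_units, threshold=0.85):
    """Return (unit_id, score) for the closest JVS unit, or (None, score)
    when no candidate reaches the similarity threshold."""
    target = clean_name(ojv_name)
    best_id, best_score = None, 0.0
    for unit_id, unit_name in jvs_units.items():
        score = difflib.SequenceMatcher(None, target, clean_name(unit_name)).ratio()
        if score > best_score:
            best_id, best_score = unit_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Invented JVS reporting units.
jvs_units = {
    "U1": "Muster Logistik GmbH",
    "U2": "Beispiel Software AG",
}

for name in ("Muster Logistik", "Beispiel Software GmbH & Co. KG", "Völlig Andere Firma"):
    print(name, "->", best_match(name, jvs_units)[0])
```

The threshold trades false matches against missed matches, so it would need tuning against a manually checked sample.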

Sweden
Data:
- Swedish Employment Agency
- 2 large job portals
Processing/Methodology:
- De-duplication and matching
- Further develop quality framework
- Time series modelling
Expected Output:
- Time series comparisons of OJV and JVS data
- Disaggregations (geography, NACE, ISCO)

Belgium
Data: Administrative data from regional employment agencies
Processing/Methodology:
- Machine learning model for predicting NACE code based on job description text
- Multi-lingual model (French, Dutch, German, English)
Expected Output: Model for predicting NACE

Looking Ahead

Looking Ahead
Short-term (until May 2018)
- Workshop planned to validate CEDEFOP web scraping system (March 2018, Milan)
- Strategy for ongoing engagement: March 2018
- Final SGA-2 technical report (including a roadmap for moving experimental research into production): May 2018 (review at beginning of May?)
Long-term
- An ESSNet network collaborating with CEDEFOP?
- Part of the next ESSNet? (but still might not completely deliver production-ready outputs)

Budget Issues
- Denmark may need to withdraw
- Portugal's role in WP1 still not defined