Big Data ESSNet - WP1 Research Plan for SGA-2 (version 2.0)

1. Introduction

This document is a summary of the work planned by Big Data ESSNet WP1 during SGA-2, which will run from August 2017 through to May 2018. The specific situation in each country is different in terms of the online job vacancy landscape, access to data, access to technology, business priorities, and skills and experience. The general planning approach is therefore bottom-up, with each country developing research plans that are feasible within its specific country setting, but with countries seeking to collaborate on specific topics where possible. SGA-2 work plans have been received from seven countries and these have now all been consolidated into this document. A summary of these plans is shown in Table 1. From this table several common overlapping topics of interest have been identified. The two most prevalent are:

- Data matching (UK, Slovenia, Germany, Sweden, France)
- Text mining (Greece, Slovenia, France, Belgium, Sweden, Portugal)

There are also some overlaps on the following topics:

- Time series methods (UK, Sweden)
- Deduplication (France, Sweden)
- Data quality (Sweden, Germany, UK)

Table 1: Summary of SGA-2 workplans

UK
- Data: Daily JV counts for selected enterprises from several job portals; daily JV counts from selected enterprise websites; Burning Glass (third-party data back to 2012); job vacancy survey
- Processing/IT: Deriving stock measures from flow data (Burning Glass); matching OJV and JVS data; handling outliers and missing data; time series methods; Python, MongoDB, Google Cloud
- Outputs: Nowcast estimates; methods for weighting OJV data

Greece
- Data: Web scraped job descriptions
- Processing/IT: Text mining; supervised machine learning; Content Grabber, Python, Orange, Gensim
- Outputs: Method to automatically classify jobs from Greek OJV data

Slovenia
- Data: 2 largest job portals; national employment agency; JVs from enterprise websites
- Processing/IT: Matching to business register, deduplication; text analysis, machine learning; Agenty, Scrapy, Orange
- Outputs: Available vacancies by reference day and reference month; available newly posted vacancies by reference day and reference month

Germany
- Data: National employment agency; CEDEFOP pilot; Stepstone; job vacancy survey
- Processing/IT: Matching OJV and business register; data quality
- Outputs: Sustainable methodologies for matching and identification

Sweden
- Data: National employment agency; Metrodata, Job Safari; web scraping framework
- Processing/IT: Matching OJV, JVS and business register; deduplication; time series methods; data quality; Scrapy, SAS, SQL Server, Python
- Outputs: Improved survey estimates from time series models; quality framework

France
- Data: National employment agency (includes data from private job portals)
- Processing/IT: Text mining of specific information in job descriptions; harmonisation of site-specific nomenclatures; deduplication; matching to admin and JVS data; Python, local computers
- Outputs: Figures on vacancies with trend indicators (e.g. occupation, geography), duplicates removed

Belgium
- Data: Regional employment agency data (2012-2016); web scraped job vacancy data (collected quarterly)
- Processing/IT: Machine learning to predict NACE from job description; R, R Shiny
- Outputs: NACE codes predicted from job descriptions

Portugal
- Data: Web scraped data
- Processing/IT: Structuring/parsing of text; classification of job titles; comparison with other sources
- Outputs: Report on findings

2. Coordinating Research

During SGA-1, a number of virtual sprints coordinated via Webex meetings were held on specific topics involving all the core SGA-1 partners. Although this achieved some useful results, it may not be the best approach going forward, for a number of reasons:

- Webex is not an accessible technology for some countries, and sound problems have often hindered communications. More recently there have also been issues with video not working (this has also been an issue for other ESSNet meetings). Since the number of countries has expanded during SGA-2, these issues may be compounded.
- The research plans show that a number of topics are of interest to only a few countries; these may be better progressed through more informal collaboration between the interested partners.

The intention is that WP1 members should now use Slack as the primary means of coordinating activity, exchanging information, and asking questions. This is better than relying on e-mail, as it helps ensure that all exchanges are in one place and can be seen by all. Our Slack workspace can be accessed here, and there are a number of channels that you will want to sign up to. Some of us have already started using Slack, but for this to work effectively it requires all of us to use it actively, preferably every day. Of course, Webex can still be used if there is a consensus among interested partners that this approach is useful. The ESSNet Wiki can also be used for sharing, although it is more suitable for formal, static content than for informal exchanges.

3. Key dates

- 19-23 March: WP1 final planning and CEDEFOP validation meeting (Milan)
- 30 March: Move from research into write-up / produce plan for continuing work after completion of the ESSNet
- 27 April 2018: Draft technical report complete for the Review Board
- 14-15 May 2018: ESSNet Dissemination meeting (Sofia)
- 31 May 2018: Submission of SGA-2 technical report and completion of the ESSNet

4. WP1 Country Research Plans for SGA-2

This section describes the research planned for each country in SGA-2.

4.1 United Kingdom

Research aim: To produce experimental real-time nowcast estimates that combine survey and online job vacancy data.

A framework for collecting daily job vacancy counts from enterprise websites has already been developed for 50 large companies. We aim to expand this to 500 enterprises (or as many as possible). We will also try to scrape counts for as many reporting units as possible for one or two industry sectors. We also plan to scrape job vacancy counts by enterprise for some more job portals (e.g. Totaljobs and others where scraping is not prohibited). Depending on the structure of the website, this could involve scraping specific company page URLs or scraping company directories (which will then be matched back to the reporting units in the survey). We are in the process of finalising agreements with two of the larger UK portals (Indeed and Adzuna), and we are exploring the possibility of getting access to Burning Glass. The aim is to obtain as much historical data as possible. We will also revisit whether data from the 2015 CEDEFOP pilot could be used.

Processing and IT infrastructure

The aim is to capture just the JV counts, so the amount of additional processing of the scraped data is minimal. The main processing challenge lies in reliably matching company names to survey reporting units. Further work will be needed on handling outliers, both within a time series to account for missing data (e.g. a spider failing) and when comparing different data sources (to remove low quality data). We will continue using our current IT infrastructure based on Python/Scrapy, MongoDB and Google Cloud. We visualise the data in a Python/Flask/Bokeh based dashboard application.
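As an illustration of the gap-filling and outlier handling mentioned above, a minimal sketch in pandas follows. It assumes daily counts are stored as (date, count) records per enterprise; the column names, the 28-day window and the 3.5-MAD threshold are illustrative assumptions, not the method actually adopted.

```python
# A sketch of cleaning a daily JV count series for one enterprise: reindex
# to daily frequency, interpolate short gaps (e.g. a spider failing), and
# flag outliers with a rolling median/MAD rule. Names and thresholds are
# illustrative assumptions.
import pandas as pd

def clean_daily_counts(df: pd.DataFrame) -> pd.DataFrame:
    """df holds one enterprise's scraped counts: columns 'date', 'count'."""
    s = (df.assign(date=pd.to_datetime(df["date"]))
           .set_index("date")["count"]
           .sort_index()
           .asfreq("D"))                    # missing days become NaN
    s = s.interpolate(limit=3)              # fill short gaps only

    med = s.rolling(28, min_periods=7, center=True).median()
    mad = (s - med).abs().rolling(28, min_periods=7, center=True).median()
    mad = mad.where(mad > 0)                # guard against a zero MAD
    outlier = (s - med).abs() > 3.5 * mad   # NaN comparisons evaluate False

    return pd.DataFrame({"count": s, "outlier": outlier})
```

Longer gaps are deliberately left as NaN rather than interpolated, so that a prolonged spider failure is visible downstream rather than silently smoothed over.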

Statistical outputs

Our main planned statistical outputs are near real-time nowcasts of the current estimate of job vacancies. The basic idea is that counts from the survey will be used as the training target for the data from the various online sources. These counts will be used as features along with other information about the enterprise (e.g. employees, SIC). It is envisaged that the nowcasting model will be heavily weighted towards the most recent survey results. This approach could be used to predict results at different levels, including total vacancies, vacancies by industry, or even for individual enterprises (which could then be aggregated or weighted up). Depending on progress, other possible experimental outputs could include experimental statistics by geography using data obtained through our new partnership arrangements with data suppliers.

4.2 Greece (ELSTAT)

Research aim: To focus on text mining approaches. The main idea is to explore the possibility of automatically classifying jobs by their descriptions from online job ads.

We aim to explore text mining approaches to automatically classify the jobs. Since we have already classified the job descriptions of the online scraped ads with 4-digit ISCO-08 codes, we are going to use them as a training dataset for supervised learning. Our intention is also to explore unsupervised learning methods such as latent semantic indexing (LSI¹).

Processing and IT infrastructure

We will continue using Content Grabber as a scraping tool. For text mining we are going to use free Python tools and libraries such as Orange and Gensim. We plan to cooperate with other WP1 partners working in the same field, such as France and Sweden, as well as Cedefop.

¹ LSI is an indexing method to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.
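To illustrate the LSI approach mentioned above, a minimal sketch with Gensim follows. The toy corpus, the TF-IDF weighting step and the number of topics are illustrative choices, and the similarity-based propagation of ISCO-08 codes at the end is one possible use, not necessarily the method ELSTAT will adopt.

```python
# Minimal sketch of LSI on job-ad descriptions with Gensim; the toy corpus
# and num_topics value are illustrative, not the actual settings.
from gensim import corpora, models, similarities

docs = [ad.lower().split() for ad in [
    "software engineer python linux",
    "nurse hospital night shifts",
    "python developer web services",
]]

dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]

tfidf = models.TfidfModel(bow)
lsi = models.LsiModel(tfidf[bow], id2word=dictionary, num_topics=2)

# Project a new ad into the LSI space and rank the known ads by similarity,
# e.g. to propagate their ISCO-08 codes to the new ad.
index = similarities.MatrixSimilarity(lsi[tfidf[bow]])
query = dictionary.doc2bow("python software developer".split())
print(sorted(enumerate(index[lsi[tfidf[query]]]), key=lambda x: -x[1]))
```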

Statistical outputs

Our main planned output is a description of an appropriate technique to automatically classify jobs from the Greek online job descriptions. Depending on progress, we will explore the possibility of extending the classification to ESCO.

4.3 Slovenia

Research aim: To estimate the number of actual job vacancies in the country by combining data from administrative sources with scraping of job portals and a number of enterprises' websites.

Data are collected by scraping job portals and enterprise websites and by receiving administrative data from the Employment Service of Slovenia (ESS). For now we scrape two job portals. We work on the assumption that one job advertisement means one vacancy, although in many cases this is not true. Delineating posts according to the number of people they seek is in development, as most of the time the actual number is not given (the advertisement is written in the singular while trying to employ multiple people, or is written in the plural with no specification of the number of posts). Scraping of enterprise websites is (for now) done every three months, but monthly or even weekly scraping is being considered. We check (by spider crawling) a sample of individual enterprises' websites; if the spider finds a job vacancy subpage according to a list of keywords and blacklisted words, the page is scraped and estimated with machine learning models (logistic regression for estimating the presence of job vacancies, and a combination of linear regression and AdaBoost models for the number of vacancies; a sketch of this two-stage approach is given below). These represent only 1-2 pp. of found job advertisements. Comparing the number of found advertisements to the number of found companies that advertise, we see that the results are quite similar, but with around a 5 pp. drop in the share of found companies against all companies that advertise vacancies. It must be noted that, by definition, job agencies do not give out the employer of job vacancies, which makes identifying companies harder.
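A minimal sketch of the two-stage estimation described above: a logistic regression decides whether a crawled subpage advertises vacancies, then a linear regression and an AdaBoost regressor are averaged to estimate the number of posts. The toy pages, labels and counts are illustrative, not SURS training data.

```python
# Two-stage sketch: presence classifier, then averaged count regressors.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import AdaBoostRegressor

pages = ["join our team developer vacancy", "about us company history",
         "open positions nurse driver cook", "contact address phone"]
has_ads = np.array([1, 0, 1, 0])
n_posts = np.array([1, 3])              # post counts for the positive pages

vec = CountVectorizer()                 # bag-of-words, as in the Orange workflow
X = vec.fit_transform(pages).toarray()

presence = LogisticRegression().fit(X, has_ads)
lin = LinearRegression().fit(X[has_ads == 1], n_posts)
ada = AdaBoostRegressor(random_state=0).fit(X[has_ads == 1], n_posts)

def estimate(texts):
    Xn = vec.transform(texts).toarray()
    keep = presence.predict(Xn) == 1
    return (lin.predict(Xn[keep]) + ada.predict(Xn[keep])) / 2  # average of both

print(estimate(["careers vacancy developer cook"]))
```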

Processing and IT infrastructure

The infrastructure for our scraping work is based on Python programs and freeware scraping agents. Scraping of job portals is done with Agenty (a freeware Chrome app), while crawling and scraping of enterprise websites is done in Python with the Scrapy module. The websites are also sorted into those in acceptable languages (in our case Slovene) and those in foreign languages. Afterwards the number of advertisements is computed with the data mining and machine learning toolkit Orange. The content of chosen enterprise subpages is split into a bag-of-words and then into subpages that do and do not have job advertisements using logistic regression. On subpages with advertisements we use a linear regression model and an AdaBoost model to estimate the number of posts. The final number is the average of both estimations for each subpage. All data are linked to the Slovenian Business Register and de-duplicated.

Statistical outputs

Experimental statistics SURS could provide by March 2018 are:
- The number of available online job vacancies on the reference day
- The number of available online job vacancies in a reference month
- The number of newly posted online job vacancies on the reference day
- The number of newly posted online job vacancies in a reference month

Conditionally (still in development):
- The number of online job vacancies following the JV regulation on the reference day

REMARKS

In the first period we are able to produce the number of job vacancies at the total level, with no breakdown by NACE Rev. 2. Later, when we find a solution regarding ads placed by employment agencies (whether an ad is advertised for the agency itself or for a client), experimental statistics could be shown by NACE. A company may publish one online ad for the same type of work (occupation) while actually looking for more than one person; by scraping data from the Web we assume that one ad is one job vacancy.

DEFINITIONS

An ad is defined as an online job advertisement if it was published on the job portal of a private company or on the job portal of the state employment agency.

An available online job vacancy on the reference day is defined as a post which was still available on the Web on the reference day. The ad was published online on the reference day or before.

An available online job vacancy in a reference month is defined as a post which was available on the Web in the reference month. The ad was published online in the reference month or before. Also counted are ads which were available on the Web at the beginning of the reference month but no longer at the end of the month.

A newly posted online job vacancy on the reference day is defined as a post which was published on the Web on the reference day. This indicator represents the number of daily posted ads.

A newly posted online job vacancy in a reference month is defined as a post which was published on the Web in the reference month. It includes all ads posted in the month, whether or not the ad is still available online at the end of the month.

An online job vacancy following the JV regulation is defined as a post which has been newly created, is unoccupied, for which the employer is actively seeking a suitable candidate outside the enterprise, and which will be filled immediately or in the near future (assessed from the time gap between the date of the published ad and the employment). Online job vacancies following the regulation are calculated as: online job vacancies on the reference day + job vacancies published in the past but no longer on the Web + the time gap between the published ad and the employment (assessment).
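To make the four main definitions concrete, a small sketch of how the indicators could be computed from scraped ad records, assuming each ad carries the dates it was first and last observed online; the column names are illustrative:

```python
# Sketch of the four indicators defined above; one row per ad, with the
# dates it was first and last seen online (illustrative column names).
import pandas as pd

ads = pd.DataFrame({
    "first_seen": pd.to_datetime(["2018-01-05", "2018-01-20", "2018-02-02"]),
    "last_seen":  pd.to_datetime(["2018-02-10", "2018-01-25", "2018-02-20"]),
})

ref_day = pd.Timestamp("2018-02-05")
month = pd.Period("2018-02")

available_on_day   = ((ads.first_seen <= ref_day) & (ads.last_seen >= ref_day)).sum()
newly_posted_day   = (ads.first_seen == ref_day).sum()
available_in_month = ((ads.first_seen.dt.to_period("M") <= month) &
                      (ads.last_seen.dt.to_period("M") >= month)).sum()
newly_posted_month = (ads.first_seen.dt.to_period("M") == month).sum()
```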

4.4 Germany

Research aim: To learn more about matching and identification possibilities, especially for job portal data matched with business register data; and to investigate and develop text mining and machine learning approaches to extract information from unstructured text (e.g. supplementary information for coding/validating occupation, deriving skills and qualifications).

Data sources:
- FEA data from the biggest online job portal
- FEA JVS data
- CEDEFOP scraped data from three German job portals
- STEPSTONE online job portal data

In Germany the Federal Employment Agency (FEA) hosts the largest and most important online job portal. Furthermore, the FEA is also responsible for the German Job Vacancy Survey (JVS). Destatis is in the process of finalising agreements with FEA to get access to both of these data sources.

Furthermore, Destatis is exploring the possibility of getting access to data from the job portal STEPSTONE. Due to time restrictions and limited resources, Destatis would also like to analyse whether the existing scraped data from CEDEFOP can be used.

Processing and IT infrastructure

To verify the quality of job portal data it is necessary to match job portal data with business register data and to obtain additional information on specific enterprise characteristics (e.g. economic activity). A main data source for the business register is the FEA's address stock, which is also the basis of the sampling of the job vacancy survey. As a consequence, statistical units in the job vacancy survey can more or less be seen as a kind of subsample of the business register, so matching JVS data to the business register is unproblematic. This is not the case for job portal data. How can statistical units be identified and matched in these cases? The URS ID, an ID number assigned to every legal unit in the business register, cannot be used. The question is therefore whether serious matching is possible on the basis of sparse structural information such as name, address, economic activity code, etc. It remains to be investigated which matching rates between job portal data and the business register can be achieved.

Statistical outputs

In Germany the Federal Employment Agency (FEA) is not only responsible for the job vacancy survey but is also the biggest job portal owner. The great importance of the FEA in the area of job vacancy statistics leads to a special situation in Germany compared to other countries. Nevertheless, Destatis will investigate further, especially the data quality of job portal results. For this purpose the possibilities of matching job portal data with business register data will be investigated in the coming months. The statistical output is therefore mainly a sustainable methodological one: solving matching and identification problems with new big data technologies.
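Since no common ID exists, matching will have to rely on approximate comparison of names and addresses. A minimal sketch of such record linkage follows, using only the Python standard library; the blocking on postcode, the legal-form suffix list and the 0.85 similarity threshold are illustrative assumptions, not Destatis's method.

```python
# Sketch of ID-free record linkage: block candidate register entries on
# postcode, then compare normalised employer names by string similarity.
from difflib import SequenceMatcher

def normalise(name: str) -> str:
    name = name.lower()
    for suffix in (" gmbh & co. kg", " gmbh", " ag", " kg", " e.k."):
        name = name.replace(suffix, "")     # strip common legal forms
    return " ".join(name.split())

def match(ad: dict, register: list, threshold: float = 0.85):
    candidates = [r for r in register if r["postcode"] == ad["postcode"]]
    scored = [(SequenceMatcher(None, normalise(ad["name"]),
                               normalise(r["name"])).ratio(), r)
              for r in candidates]
    score, best = max(scored, key=lambda t: t[0], default=(0.0, None))
    return best if score >= threshold else None

register = [{"name": "Musterfirma GmbH", "postcode": "50667", "urs_id": "X1"}]
print(match({"name": "Musterfirma", "postcode": "50667"}, register))
```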

4.5 Sweden

Research aim: To find relationships (models) between job vacancy survey and online job portal data in order to improve survey estimates of job openings; to continue work on error estimation in the processing of job vacancy data; and to apply the two frameworks in our work.

We have agreements with three online data sources:
- The Swedish Employment Agency, collected from January 2012 (received in batches)
- Metrodata, collected weekly by API since the beginning of March 2017
- Jobbsafari, collected daily by API since the beginning of April 2017

In addition, we use data from the Swedish Job Vacancy Survey from January 2012, and the Swedish Business Register. We recently received a batch of test data from a fourth company (Auranest) and, if time allows, we will include these data in the analysis too.

Processing and IT infrastructure

We need to improve deduplication procedures, both within and between data sources (a sketch of within-source deduplication follows at the end of this section). Matching between sources is a challenge, since the content and structure differ considerably between sources. The insufficient information from portals creates great challenges for portal integration. Matching to the business register needs more work. We aim to use the two frameworks tested in the project (UNECE and NZ) and investigate whether they can be combined in a useful way. We see a need for a structured description of data sources, separately as well as merged. We will continue using our current IT infrastructure based on Python/Scrapy, SAS and Microsoft SQL Server.

Statistical outputs

We will further investigate time series methods as a way of comparing online job portal data with survey data, and of building models for the relationship between the sources. These time series are not final outputs; the aim is to use the models to improve the survey estimates. We will investigate the possibility of presenting estimates with new breakdowns (e.g. geographic areas, NACE and ISCO).
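A minimal sketch of within-source deduplication, assuming each ad record carries employer, title and location fields (the names and normalisation rules are illustrative); near-duplicate detection between sources would need fuzzier keys:

```python
# Sketch: drop within-source duplicates on a normalised composite key.
import pandas as pd

ads = pd.DataFrame({
    "employer": ["Acme AB", "ACME AB ", "Beta AB"],
    "title":    ["Sjuksköterska", "sjuksköterska", "Kock"],
    "location": ["Stockholm", "stockholm", "Malmö"],
})

def norm(s: pd.Series) -> pd.Series:
    return s.str.lower().str.strip().str.replace(r"\s+", " ", regex=True)

key = norm(ads["employer"]) + "|" + norm(ads["title"]) + "|" + norm(ads["location"])
deduped = ads.loc[~key.duplicated()]    # keeps the first of each duplicate group
print(deduped)
```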

4.6 France

Research aim: To study the robustness of our existing statistical indicators on job vacancies and labour market tightness.

A framework for collecting online job offers from a selection of job boards has already been developed, and one-off data collections are currently being used to develop exploitation tools. The French public employment agency has developed partnerships with nearly 150 job portals and a few large companies over the past two years. We already have some excerpt datasets and look forward to collecting new datasets from them.

Processing and IT infrastructure

We have developed tools handling various tasks using job offer titles and descriptions:
- Cleaning of textual data, including lemmatisation.
- Fuzzy matching of job titles to a reference list of names so as to classify collected offers in an occupational nomenclature.
- Generation of document-term matrix representations of text corpora for supervised classification (using machine learning algorithms) and clustering (a sketch of this step follows below).

We have made these tools scalable to large volumes of data, and continue to work on improving classifier performance. Tasks we intend to tackle in the coming months are the following:
- Mining of specific information in job offer descriptions (type of contract, etc.).
- Harmonisation of site-specific nomenclatures regarding job offer details.
- De-duplication of collected job offers.
- Attempts at linking collected offers with administrative sources and survey data.

As of now, everything is done using Python and runs on local computers. Work is mainly conducted by an intern working full time on the topic since mid-June, supervised by two people who will take over the work after his departure in mid-December.
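A minimal sketch of the document-term-matrix step feeding a supervised classifier, with toy offers and made-up ROME-style occupation codes standing in for the real nomenclature:

```python
# Sketch: document-term matrix + supervised classification of job offers
# into occupation codes; the toy texts and labels are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

offers = ["développeur python backend", "infirmier de nuit hôpital",
          "développeuse web javascript", "aide-soignante en clinique"]
codes = ["M1805", "J1506", "M1805", "J1501"]   # made-up ROME-style codes

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(offers, codes)
print(model.predict(["développeur web python"]))
```

The same document-term matrix can be reused for clustering (e.g. KMeans over the TF-IDF vectors), which keeps the feature-extraction step shared between the supervised and unsupervised tasks.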

Statistical outputs

First, we aim to produce figures on the job vacancies advertised online by some of the major French job portals, without mixing up data from those sites. The details crafted through our processing would serve to break those vacancies down in various ways: occupation code in one of the major French nomenclatures, geographic area, etc. Regular production of these indicators could allow us to follow trends, including trends produced by the portals' evolutions, and should help us understand the specificities of the portals (which kinds of sectors publish their ads on them, how fresh their data are, etc.).

Then, we would like to compare the previous indicators with those already produced through surveys or using job offers collected by the national unemployment agency on its own (i.e. excluding data from job portals).² This will help us test the robustness of the latter indicators, and give us clues as to the specific interest there could be in integrating data from private job portals into our statistical production.

² With Pôle Emploi (the national employment agency), we at Dares produce a quarterly study of labour market tightness, which is currently interrupted for a few months so as to be improved. The latest issue to date can be found at http://dares.travail-emploi.gouv.fr/img/pdf/2017-056.pdf.

4.7 Belgium

Research aim: To estimate the NACE code of job vacancies collected by web scraping.

We already receive administrative datasets from regional employment agencies containing job vacancies between 2012 and 2016. Based on the unique official register number of each enterprise, we can be confident about linking a job description, the NACE code of the enterprise and the job location. This linking has already been completed for Brussels and is ongoing for Flanders and Wallonia. We also scrape job vacancies from 10 job portals. We aim to expand to more portals and to scrape all job vacancies each quarter. We also plan to conclude agreements with some portals in order to obtain the job ads more easily and to avoid technical problems during web scraping.

Processing and IT infrastructure

The aim is to build a machine learning process to test the link between the job description and the NACE code of the enterprise, based on the databases from the regional agencies.

According to our first test, done during 2017, we expect to arrive at a robust model for some NACE codes, but probably not for all of them. We have to configure a model for the languages used in Belgium (three official languages, Dutch, French and German, plus English), and to run several tests in order to obtain stable and comprehensive models. Consequently, for each NACE code predictable by our models, we will estimate the NACE code for the job ads collected by web scraping, and calculate the variation of these NACE codes between two quarters. All this work will be done using the R software. If it is considered useful for the dissemination of results, we will create a Shiny application, also based on R.

Statistical outputs

We will publish a report on the correlation between words in job ads and NACE codes, and a quality analysis of the machine learning process to predict NACE codes for the Job Vacancy Survey (JVS).

4.8 Portugal

Research aim: To provide insights on the construction of national lexicons from online job advertisements and to investigate the possibility of automatically classifying web scraped data.

Data: a web portal (a site aggregating job offers).

Processing/Methodology

Web scraping and data/text/web mining to obtain and identify attributes and descriptions of the job, the general tasks of the position and the specifications required, such as the qualifications or skills needed by the job applicant. Structuring of the input text and parsing, along with the addition of some derived linguistic features and the removal of others, and subsequent insertion into a database. The goal will be to classify Portuguese job descriptions automatically. Comparative analysis with other job classifications (from official or market sources).
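A minimal sketch of how a first national lexicon could be seeded from scraped descriptions: tokenise, drop a (hypothetical) Portuguese stopword list, and keep the most frequent terms. The stopword list and the frequency cut-off are illustrative assumptions:

```python
# Sketch: seeding a Portuguese job-ad lexicon by term frequency; the tiny
# stopword list and the cut-off of 2 are illustrative assumptions.
import re
from collections import Counter

stopwords = {"de", "e", "para", "com", "em", "a", "o"}

ads = ["enfermeiro para hospital em Lisboa",
       "programador python com experiência",
       "enfermeiro de noite, urgência"]

tokens = (t for ad in ads
            for t in re.findall(r"\w+", ad.lower())
            if t not in stopwords)

lexicon = [term for term, n in Counter(tokens).most_common() if n >= 2]
print(lexicon)   # ['enfermeiro']
```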

Work plan for SGA-2

1. List several potentially interesting job vacancy web portals
2. Select the best source
3. Determine the granularity at which data should be scraped
4. Describe in detail the relevant information found on the selected web portal (attributes, descriptions, etc.)
5. Select the attributes/characteristics to extract
6. Select the appropriate tool for data extraction and pre-processing
   6.1. Design and deploy the data extraction
   6.2. Pre-process the extracted data
7. Select the database for storing the extracted data
8. Handle the duplication of job offers
9. Incorporate a Portuguese lexicon
10. Automatically classify the scraped data
11. Compare job classifications

Statistical outputs

A report containing:
1. Lessons learned
2. A depiction of the strengths and weaknesses of using national languages such as Portuguese
3. A description of the classification methods adopted
4. The conditions for future use of these classification methods with big data sources or with other sources for official statistics, such as administrative and survey data.