Enhancing Sustainability: Building Modeling Through Text Analytics. Jessica N. Terman, George Mason University


Enhancing Sustainability: Building Modeling Through Text Analytics
Tony Kassekert, The George Washington University
Jessica N. Terman, George Mason University

Research Background
Recent work by Terman et al. (2015) finds that grant management significantly affects delays in sustainable policy implementation. Extant research on fiscal federalism indicates that goal congruence improves performance (Nicholson-Crotty 2004). Several theoretical inconsistencies with the previous literature arose when our team tried to combine both sets of hypotheses in a single model. The purpose of this presentation is to explain how we are using text mining to improve our estimation and achieve theoretical consistency.

Research Question and Hypothesis
Does economic development as a motivation for sustainable development affect implementation?
  Are local managers more technocratic?
  Do they work on implementation more consistently?
Economic development motivations in particular would indicate a preference for spending the grant funds in a timely fashion. We hypothesize that when local governments focus on sustainability as an economic development tool, they are more likely to complete projects on time, comparing among similar projects.

Data Sources and Methods
Data:
  Department of Energy (DOE) administrative data. All grant application text comes directly from DOE.
  National survey sent to the population of EECBG grantees. Over 50% response rate in 2009.
  Census Bureau
Methods:
  Bayesian clustering of textual data (tm and bclust R packages).
  A relative risk survival (log-binomial) model is used to estimate implementation delay times. The model uses robust standard errors based on the cluster.

Variables and Measurement
Dependent variable (days of delay):
  The delay between the jurisdiction-proposed EECBG start date (when they receive funds) and the actual date of funds disbursal/use.
  A positive coefficient indicates longer delays, while a negative coefficient indicates less delay.
Independent variable:
  The variable of interest is economic development motivation.
  This is a factor score created from several survey questions.
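As a minimal sketch of how the dependent variable could be computed, the snippet below takes the difference in days between a proposed and an actual start date. The function name and the two dates are hypothetical and are not taken from the DOE data.

```python
from datetime import date

def days_of_delay(proposed_start: date, actual_start: date) -> int:
    """Days between the proposed start date and the actual start date.

    A positive value means the project started later than proposed.
    """
    return (actual_start - proposed_start).days

# Hypothetical dates, for illustration only.
delay = days_of_delay(date(2010, 4, 1), date(2010, 7, 1))
print(delay)  # 91
```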

All Independent Variables
  Satisfaction w/ DOE Application Process
  Satisfaction w/ DOE Approval Process
  Satisfaction w/ DOE Tech. Support
  Administrative Capacity
  External application assistance
  Citizen application participation
  Copied policies from other governments
  Innovative (new) policies to implement
  Citizen advocacy level
  Number of prior sustainable policies
  Green practices count
  Green development in planning
  Economic development tool
  Budget (logged)
  Unemployment
  Manager Form of Government
These were all chosen to compare results to a previous paper from APPAM.

Research Problem
Most survey research classifies projects into groups that are not mutually exclusive. For example, the EECBG grant process had localities pick between:
  Energy Efficiency Strategy
  Technical Consultant Services
  Buildings Audits
  Financial Incentive Program
  Energy Efficiency Retrofits
  Buildings and Facilities
  Transportation
  Codes and Inspections
  Energy Distribution
  Material Conservation Program
  Reduction Greenhouse Gases
  Lighting
  Onsite Renewable Technology
  Other
For example, LED lighting could fall in the lighting category, or be part of a building retrofit or an energy efficiency strategy. This measurement error creates inefficiencies in the estimated standard errors.

Text Mining as a Solution
We propose text mining energy grant proposals to augment survey data and administrative records. Using text analytics, we can classify the grants by their text to determine which proposals are most similar. This allows us to cluster similar projects more accurately, without the unnecessary measurement error.

Short Review of Text Mining
Text mining is analogous to other exploratory statistical techniques.
  The primary method for both is cluster analysis.
  A second frequently used tool for text is singular value decomposition (SVD), which is similar to principal components analysis.
Text mining basically develops a numeric representation of the textual data and analyzes it with standard tools.
  The standard approach treats documents as rows and terms (e.g. words) as columns.
  This creates a very large, sparse matrix (lots of zeros).
Text mining is not the same as data mining, although the two are often used in concert.
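To illustrate the documents-as-rows, terms-as-columns representation, here is a minimal sketch using only the Python standard library. The presentation itself reports using the R tm package. The three sample texts and the function name are hypothetical.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocabulary, matrix): one row per document, one column per term."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        # A Counter returns 0 for terms that are absent from the document.
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Hypothetical grant descriptions, for illustration only.
docs = [
    "solar panels for city hall",
    "led lighting retrofit",
    "energy audit for city buildings",
]
vocab, dtm = document_term_matrix(docs)

# Share of cells that are zero: even this tiny matrix is mostly empty.
sparsity = sum(v == 0 for row in dtm for v in row) / (len(dtm) * len(vocab))
print(len(dtm), len(vocab), round(sparsity, 2))
```

With a real corpus the vocabulary runs to thousands of terms, so a sparse storage format would normally replace the dense list of lists used here.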

Transforming Text to Usable Data
There are several steps that are commonly applied:
  Normalize case (make everything lowercase)
  Remove punctuation
  Remove white space
  Remove numbers
  Remove stopwords (the, in, a)
  Stem words (chop off the ends of words: -ing, -es, -er)
    This means finding the core of a word (city = cities)
  Choose a weighting scheme
    Often we weight words to adjust for frequencies and/or document length.
    The Euclidean distance (used in both clustering and factor analysis) is often not the best choice for text.
  Reduce dimensionality
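The cleaning steps listed above can be sketched as follows. This is an illustration in plain Python, not the authors' pipeline, which used the R tm package. The stopword list is a tiny placeholder, and `crude_stem` is a hypothetical stand-in for a real stemmer such as Porter's.

```python
import re
import string

# Tiny illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {"the", "in", "a", "an", "of", "and", "to", "for"}

def crude_stem(word):
    """Very rough suffix stripping; a real analysis would use a proper stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"  # cities -> city
    for suffix in ("ing", "es", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                               # normalize case
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = re.sub(r"\d+", " ", text)                                  # remove numbers
    tokens = text.split()                                             # split() collapses white space
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]      # drop stopwords, then stem

tokens = preprocess("Installing LED lighting in the 12 cities; a city audit.")
print(tokens)  # ['install', 'led', 'light', 'city', 'city', 'audit']
```

Note that "cities" and "city" end up as the same token, which is the point of stemming.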

Reducing Dimensionality
Text mining suffers from too much information, so we want to reduce it to something manageable.
Words that appear in all the texts are useless at discriminating between them.
  For example, the word "energy" appears in every grant proposal, so it adds no value in distinguishing between them.
  Think of a regression variable that equals 1 for 98% of your cases and 0 otherwise. It would be unlikely to be predictive.
  Text mining usually begins by removing these too frequently occurring words.
Words that appear in too few texts also add little value.
  They are not frequent enough to compare between groups.
  Think of a regression variable that equals 1 for 3% of your cases and 0 otherwise. It would be unlikely to be predictive.
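A minimal sketch of this pruning step is shown below: terms are dropped when the share of documents containing them falls outside a chosen band. The function name, the `keep` exception list, the thresholds, and the sample tokens are all assumptions made for illustration.

```python
def prune_by_document_frequency(token_docs, min_share, max_share, keep=()):
    """Drop terms that appear in too few or too many documents.

    token_docs: list of token lists, one per document.
    keep: terms retained regardless of how often they occur.
    """
    n_docs = len(token_docs)
    doc_freq = {}
    for tokens in token_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    kept = {
        term
        for term, df in doc_freq.items()
        if term in keep or min_share <= df / n_docs <= max_share
    }
    return [[t for t in tokens if t in kept] for tokens in token_docs]

# Hypothetical tokenized proposals, for illustration only.
token_docs = [
    ["energy", "solar", "light"],
    ["energy", "solar", "light"],
    ["energy", "audit", "light"],
    ["energy", "retrofit", "light"],
]
pruned = prune_by_document_frequency(token_docs, 0.3, 0.85, keep=("light",))
print(pruned)  # [['solar', 'light'], ['solar', 'light'], ['light'], ['light']]
```

Here "energy" is removed because it appears everywhere, "audit" and "retrofit" because they appear in only one document each, and "light" survives only because it is on the exception list.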

Text Analysis
After all the data cleansing, the document-term matrix (DTM) is analyzed. For purposes of this presentation, we will cluster the texts.
  We use Bayesian clustering.
  Weighting is done by inverse document frequency (which lowers the impact of frequently occurring words).
  Words occurring in less than 15% of the documents are excluded.
  All words except "light" that occur in over 85% of the texts are excluded.
FYI, we have also used singular value decomposition and a Bayesian Dirichlet classifier, but they are more difficult to interpret and are not presented here.
The Bayesian clustering identifies 6 clusters.
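The inverse document frequency weighting mentioned above can be sketched as follows. The slides do not state the exact formula used, so this assumes one common form, idf = ln(N / df), where N is the number of documents and df is the number of documents containing the term. The sample tokens are hypothetical.

```python
import math

def idf_weights(token_docs):
    """Inverse document frequency per term, assuming idf = ln(N / df)."""
    n_docs = len(token_docs)
    doc_freq = {}
    for tokens in token_docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_rows(token_docs):
    """One dict per document mapping term -> (term count) * idf."""
    idf = idf_weights(token_docs)
    rows = []
    for tokens in token_docs:
        row = {}
        for term in tokens:
            row[term] = row.get(term, 0.0) + idf[term]
        rows.append(row)
    return rows

# Hypothetical tokenized proposals, for illustration only.
token_docs = [
    ["grant", "solar", "light"],
    ["grant", "solar", "audit"],
    ["grant", "light", "retrofit", "light"],
]
idf = idf_weights(token_docs)
rows = tfidf_rows(token_docs)
print(idf["grant"], round(idf["solar"], 3), round(idf["audit"], 3))
```

A term found in every document gets a weight of zero, and rarer terms get larger weights, which is the sense in which the weighting lowers the impact of frequently occurring words.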

Cluster Plot of EECBG Grant Text - 6 Clusters
[Figure: cluster plot of the grant texts, with Component 1 on the horizontal axis and Component 2 on the vertical axis.]
These two components explain 76.19% of the point variability.

Identifying Cluster Meaning
We use words particular to each cluster to determine meaning.
1. Residential and business power efficiency (efficiency, home, residential, commercial, contractor, power usage)
2. Audit (audit, reduce waste, inform, window)
3. Solar (solar, power, generate, house)
4. Retrofit (retrofit, conserve, heat, construct)
5. Economic Development (job, fuel, growth)
6. Management (budget, monitor, resource, no tribal)

Using Clustering to Investigate Economic Development
After retrieving the clusters, we can include them in our regression analysis.
If we use the respondent-defined categorization of the grant as clustering variables (fixed effects), we get:
  A large amount of unnecessary variation in the model from inconsistent application of the categories.
  An insignificant impact for economic development.
After we switch to the clusters developed through text mining, the economic development estimate becomes more precise.

Caveats
Before I demonstrate the preliminary regression results:
1. The clusters were calculated separately from the rest of the data set and then merged using fuzzy matching.
   There is no unique identifier to merge records (the grant number repeats and may have several different texts associated with it).
   There were software license difficulties that were not resolved early enough to correct this. I apologize.
2. The text analysis results are not publication ready.
   We should use a training set and a final set, which I plan on doing once I can properly merge the data.
   I plan on using the smaller grants as a training set, then the larger grants in the final analysis.
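The slides do not say which fuzzy matching tool or key was used for the merge. As one possible illustration, the sketch below matches grantee names by string similarity with Python's standard `difflib` module. The names, the 0.8 threshold, and the function name are all hypothetical.

```python
from difflib import SequenceMatcher

def best_fuzzy_match(name, candidates, threshold=0.8):
    """Return the candidate most similar to `name`, or None if none is close enough."""
    def similarity(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    best = max(candidates, key=lambda c: similarity(name, c))
    return best if similarity(name, best) >= threshold else None

# Hypothetical grantee names, for illustration only.
text_records = ["City of Springfield", "Town of Shelbyville"]
match = best_fuzzy_match("City of Springfeld", text_records)
no_match = best_fuzzy_match("Unknown County", text_records)
print(match, no_match)
```

A threshold guards against forcing a match where none exists, but any fuzzy merge of this kind can still pair the wrong records, which is why the caveat above matters for the regression results.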

Results

Variable                                  Original Measurement    Text Clusters
                                          Estimate          SE    Estimate          SE
Satisfaction w/ DOE Application Process     -0.079       0.024      -0.087       0.026
Satisfaction w/ DOE Approval Process         0.021       0.033       0.028       0.037
Satisfaction w/ DOE Tech. Support           -0.139       0.036      -0.127       0.040
Administrative Capacity                     -0.016       0.063      -0.011       0.059
External application assistance              0.034       0.033       0.034       0.034
Citizen application participation           -0.098       0.057      -0.105       0.075
Copied policies from other governments       0.096       0.046       0.091       0.067
Innovative (new) policies to implement       0.081       0.141       0.086       0.170
Citizen advocacy level                      -0.009       0.067      -0.011       0.085
Number of prior sustainable policies        -0.042       0.033      -0.033       0.037
Green practices count                        0.071       0.042       0.058       0.023
Green development in planning               -0.054       0.053      -0.069       0.089
Economic development tool                   -0.063       0.038      -0.071       0.029
Budget (logged)                              0.003       0.013       0.003       0.017
Unemployment                                 0.018       0.014       0.016       0.023
Manager Form of Government                  -0.236       0.235      -0.238       0.201

Fixed effects: DOE                              Estimate       SE
Clean energy policy                                0.463    0.180
Financial incentives for energy efficiency
  and other covered investments                    0.095    0.236
Government, school, institutional procurement     -0.364    0.241
Loans and grants                                  -0.010    0.148
Renewable energy market development               -0.137    0.233
Workshops, training, education                     0.191    0.183
Building energy audits                             0.345    0.221
Other                                              0.243    0.114
Technical Assistance                               0.459    0.145
Transportation                                    -0.178    0.122

Fixed effects: Cluster                          Estimate       SE
1. Residential and business power efficiency      -0.174    0.072
2. Audit                                           0.178    0.051
3. Solar                                           0.257    0.104
4. Retrofit                                        0.395    0.200
5. Economic Development                            0.318    0.112
6. Management                                     -0.256    0.293
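To make the change in precision for the variable of interest concrete, the sketch below divides each reported estimate for "Economic development tool" by its standard error. It assumes a Wald-type z statistic with a normal approximation and the conventional 1.96 cutoff. The slides report no test statistics or p-values, so this is a reader's back-of-the-envelope check rather than the authors' calculation.

```python
def wald_z(estimate, std_error):
    """Wald z statistic: coefficient divided by its standard error."""
    return estimate / std_error

# Estimates and standard errors as reported in the results table.
z_original = wald_z(-0.063, 0.038)  # original (respondent-defined) categories
z_text = wald_z(-0.071, 0.029)      # text-mining clusters

CRITICAL_95 = 1.96  # assumed two-sided 5% cutoff under a normal approximation
print(round(z_original, 2), abs(z_original) > CRITICAL_95)  # -1.66 False
print(round(z_text, 2), abs(z_text) > CRITICAL_95)          # -2.45 True
```

The point estimate moves only slightly, while the standard error shrinks from 0.038 to 0.029, and that accounts for most of the change.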

Discussion
In addition to statistical significance (which is meaningless!), the model fit statistics (AIC, SBIC) clearly show that the models with text-based clustering increase the amount of variation explained.
We believe these results demonstrate that text analysis could be used to better group projects and control for random variation in the model.
The clusters predicted by text analysis provide a better sense of which projects are similar, so that modeling can accurately account for heterogeneity.

Expanding the Methodology
The EECBG grants are fertile training grounds for expanding this methodology into other projects.
1. The publication we are working on using these methods will look at the interaction of economic development motivation with the different classes of grants.
2. Use key words as a training data set to analyze other sustainability grant text, likely at the state level.
3. Fuel for some qualitative analysis of outliers (Jessica).
4. Look at grant changes within the EECBG program. About 3% of the grants changed in this program. Look to see if they adopted language similar to the original submissions.