Using Secondary Datasets for Research
José J. Escarce
January 26, 2015

Learning Objectives
  Understand what secondary datasets are and why they are useful for health services research
  Become familiar with specific issues related to the research use of secondary data
    Planning the study
    Importance of a conceptual framework
    Evaluating the data
    Conducting the analyses and interpreting the results
    Reporting the research
  Gain familiarity with several frequently used secondary datasets

What Do We Mean By Secondary Data?
  Data collected for purposes other than the particular research project that you are planning
  Two types are most commonly used:
    General purpose health and health care surveys
    Administrative data: public health data, data from public programs, private sector data
Advantages and Disadvantages of Secondary Data
  Advantages
    Can address a wide range of research questions
    Relatively inexpensive
    Nationally representative data (often)
    Wide variation in contexts
    Large sample sizes
    Rich set of variables (often)
    Can be linked to other data sources
  Disadvantages
    Cross-sectional (usually)
    May not have variables you need/want
    Very limited clinical information

Using Secondary Data for Research
Planning the study
  Ideally, the research questions and study hypotheses come first, followed by looking for secondary datasets that can be used to address the questions
  In practice, there is almost always some iteration, and both questions and hypotheses are refined based on available data elements and measures
  Key implication: A sound conceptual framework is essential, i.e., you need to know how the phenomenon that interests you works before you start the research
  Both disciplinary and institutional knowledge matter
  Biggest risk is designing a study in a theory-free environment after you have run a bunch of associations

What Is A Conceptual Framework?
  A conceptual framework, or model, provides a relatively simple description of a phenomenon of interest, usually the relationships between particular outcomes of interest and their determinants
  Tries to get inside the black box
  A conceptual framework often breaks down a complex system into its component parts
  Conceptual frameworks are usually concerned with specific types of behavior in specific contexts
Are Conceptual Models Important?
  but as important as a good data infrastructure is having sophisticated conceptual models that direct data collection and analysis, and that represent the true complexity of health care provision.
  much of the value of health services research comes through the conceptual work that guides how researchers understand and pose research and policy issues.
  --David Mechanic, Milbank Q 2001; 79: 459-477.

Why Is A Conceptual Framework Essential to Good Research?
  A conceptual framework imposes discipline and rigor in your thinking about the phenomenon you are planning to study
  A good model acts like a map that gives coherence to empirical inquiry by:
    Forcing you to consider all the factors that might affect the outcomes of interest
    Providing an understanding of the causal linkages among these factors

Why Is A Conceptual Framework Essential to Good Research? (cont.)
  A conceptual model, therefore, is indispensable in:
    Generating plausible and justifiable research hypotheses
    Designing appropriate and defensible empirical analyses to test them
    Interpreting empirical findings and assessing their generalizability
Using Secondary Data for Research (cont.)
Evaluating the data: Surveys
  Read the documentation
    Who sponsored the survey and why was it done?
    What is the sample design? Who is in or out?
    Have the data been cleaned and edited? Were imputations done?
  Read the questionnaire
    Become familiar with items related to the dimensions and constructs that interest you
    Why were these items chosen? How do they compare with related items in other surveys?
    Is there information on their validity and other measurement properties?
    Are there items with high non-response?
    What are the skip patterns in the questionnaire?

Using Secondary Data for Research (cont.)
Evaluating the data: Administrative data
  Read the documentation
    Who collects the data and why?
    How are different data elements collected, i.e., what is their source?
    Who (what) is in the data and who (what) is out? Under what circumstances?
  Learn as much as you can about the data
    What types of studies have the data been used for?
    Is there information on the validity of different data elements?
    How have different data elements been used in research by others? Which data elements should not be used?
    Are there established and well-accepted methods for measuring constructs of interest using the data? Do these work in your application? How well do the measures capture what you want?

Using Secondary Data for Research (cont.)
Evaluating the data: General principles
  Run frequencies and distributions on all variables of interest
  Check every number and ask yourself whether it makes sense in light of what you know about:
    The dataset
    The phenomena and institutions you are studying
    The population group, area, state, or country
  When necessary, benchmark your data against other sources
  Assume that you've made a mistake and that your job is to find it
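The "run frequencies and check every number" advice can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the variable names (age, insured, visits), and the valid codes and ranges are all hypothetical stand-ins for what a real codebook would document.

```python
from collections import Counter

# Hypothetical survey extract; variable names and codes are made up for illustration.
records = [
    {"age": 34, "insured": 1, "visits": 2},
    {"age": 51, "insured": 2, "visits": 0},
    {"age": 199, "insured": 1, "visits": 4},    # implausible age
    {"age": 27, "insured": 9, "visits": None},  # undocumented code, missing visits
]

# Valid codes and plausible ranges, as they might appear in a codebook (assumed here).
VALID_CODES = {"insured": {1, 2}}
VALID_RANGES = {"age": (0, 120), "visits": (0, 365)}


def frequencies(rows, var):
    """Count each distinct value of `var`, keeping missing values (None) visible."""
    return Counter(row[var] for row in rows)


def flag_problems(rows):
    """Return (row index, variable, value) for every value outside the codebook."""
    problems = []
    for i, row in enumerate(rows):
        for var, codes in VALID_CODES.items():
            if row[var] not in codes:
                problems.append((i, var, row[var]))
        for var, (low, high) in VALID_RANGES.items():
            value = row[var]
            if value is None or not low <= value <= high:
                problems.append((i, var, value))
    return problems


if __name__ == "__main__":
    for var in ("age", "insured", "visits"):
        print(var, dict(frequencies(records, var)))
    for problem in flag_problems(records):
        print("check:", problem)
```

Flagged values are prompts to go back to the documentation, not automatic deletions: an unexpected code may be a legitimate "don't know" category or the result of a skip pattern.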
Using Secondary Data for Research (cont.)
Conducting the analyses and interpreting the results
  Understand potential sources of bias in your analyses and consider whether there are ways to mitigate them or at least to assess whether bias could overturn your findings
  Plan and conduct meaningful sensitivity analyses
    The best sensitivity analyses assess whether potential sources of bias could overturn your results
    Feel free to be creative in your sensitivity analyses
  Distinguish statistical from clinical or policy significance
  Stay humble: Don't over-interpret or over-conclude

Using Secondary Data for Research (cont.)
Reporting the research
  Describe the dataset
    Level of detail should be inversely proportional to how familiar your audience will be with the dataset and how much information about the dataset is readily accessible
    Include response rate and comparison of respondents and nonrespondents; mention weighting scheme
  Describe how you selected the study sample
  Describe key measures and provide references
    Note advantages/disadvantages relative to alternatives
    Describe proxy measures and why you chose them
  Describe approach to missing data
  Describe data aggregation, scale construction, and data linkages
  Be clear about unit of analysis

General Purpose Health and Health Care Surveys
  National Health Interview Survey
  California Health Interview Survey
  Medical Expenditure Panel Survey
  National Ambulatory Medical Care Survey
  National Health and Nutrition Examination Survey
  National Longitudinal Study of Adolescent Health
  Medicare Current Beneficiary Survey
  Health and Retirement Study
National Health Interview Survey (NHIS)
  Principal source of information on health of noninstitutionalized population; face-to-face interviews; administered annually since 1957
  Nationally representative; oversamples blacks and Hispanics; response rate > 90%
  One child and one adult from each sampled household; sample size can range up to 100,000, but often less
  Core questions include household composition, socio-demographics, insurance, basic health status indicators, health behaviors, access and utilization (limited), preventive care
  Supplementary questions of interest to co-sponsors or to respond to new public health data needs

California Health Interview Survey (CHIS)
  Information on health of California population
  Biennial survey 2001-2011; starting in 2012, continuous survey model
  Random digit dial (RDD) telephone survey; one child, one adolescent, and one adult from each sampled household; sample size around 50,000; response rate about 35%
  Administered in English, Spanish, Mandarin, Cantonese, Korean, and Vietnamese
  Household composition, socio-demographics, insurance, health status, health conditions, health behaviors, access to and use of services, health and development of children
  Many questions taken from NHIS

Medical Expenditure Panel Survey (MEPS)
  Household component is principal source of information on health care utilization and expenditures for noninstitutionalized population; overlapping panel design (2 years of data for each respondent); launched in 1996
  Nationally representative; oversamples blacks and Hispanics
  Face-to-face interviews; all members of each sampled household; sample size 15,000-20,000; response rate 65-70%
  Household data supplemented by Medical Provider Component (hospitals, physicians, home health agencies, pharmacies)
  Health care use and expenditures (office and ED visits, hospitalizations, prescription drugs, other), with diagnoses
  Household composition, socio-demographics, insurance, health status, health conditions, health behaviors, access barriers, satisfaction, health care ratings
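Because several of these surveys oversample particular groups, unweighted sample statistics generally do not describe the population. The sketch below shows the basic arithmetic of a weighted proportion. The data and weights are invented for illustration; real analyses should use the weight variables documented for each survey, and valid standard errors also require the design information (strata and sampling units), which this sketch does not cover.

```python
def unweighted_mean(values):
    """Simple sample mean, ignoring the survey design."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Mean in which each respondent counts for the number of people they represent."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical data: 1 = has a usual source of care, 0 = does not.
# The first four respondents come from an oversampled group (each represents
# 100 people); the last two each represent 800 people.
has_usual_source = [1, 1, 1, 0, 0, 1]
weights = [100, 100, 100, 100, 800, 800]

if __name__ == "__main__":
    print("unweighted:", round(unweighted_mean(has_usual_source), 3))
    print("weighted:  ", round(weighted_mean(has_usual_source, weights), 3))
```

In this made-up example the two estimates differ noticeably, which is why the reporting guidance asks authors to state the weighting scheme they used.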
National Health and Nutrition Examination Survey (NHANES)
  NHANES III (1988-1994); 1999-2012 NHANES
  Designed to assess health and nutritional status of population
  Combines interviews, physical examinations, and laboratory tests
  Nationally representative; oversamples blacks and Hispanics
  Face-to-face interview; one adult and one child in each sampled household
  Examinations and lab tests in Mobile Examination Centers
  Wave sample size 5,000; response rate > 80%
  Socio-demographics, insurance, health status, health conditions, health behaviors; detailed dietary history
  Medical and dental examinations, anthropometrics, vision and hearing, fitness tests, bone mineral density
  Blood and urine tests

National Ambulatory Medical Care Survey (NAMCS)
  Designed to provide information about the provision and use of ambulatory medical care services; annual survey since 1973
  Nationally representative
  Sample of visits to nonfederal office-based physicians (all office-based specialties); physicians report data on special forms
  Each physician reports on sample of visits during randomly chosen one-week reporting period
  Patient socio-demographics
  Patients' symptoms, physicians' diagnoses, medications
  Services provided, diagnostic procedures, planned treatment
  Selected lab values (new in 2011)

National Longitudinal Study of Adolescent Health (Add Health)
  Designed to provide information on the health habits and behaviors of adolescents as they make the transition to adulthood, as well as on their outcomes in young adulthood
  20,000 subjects in grades 7-12 at the start of data collection
  Five waves: 1994, 1995, 1996, 2001-02, 2008 (age 24-32)
  Designed to allow analyses of influence of social contexts (families, friends, schools, neighborhoods)
  Nationally representative
  In-school questionnaires in Wave I; home interviews in subsequent waves; questionnaires for parents, siblings, friends, school administrators
  Diet, physical activity, health service use, injury, violence, sexual behavior, contraception, sexually transmitted infections, pregnancy, suicidal intentions/thoughts, substance use/abuse, runaway behavior, height and weight, chronic conditions, mental health
Medicare Current Beneficiary Survey (MCBS)
  Designed to assist CMS in administering, monitoring, and evaluating the Medicare program; began in 1991; linked to Medicare administrative data
  Nationally representative
  Uses rotating panel design in which subjects are interviewed every four months for up to four years
  Samples obtained from Medicare enrollment files and data collected through personal interviews; oversamples oldest old and disabled
  Design permits both cross-sectional and longitudinal analyses
  Socio-demographics, insurance, health services utilization and expenditures, sources of payment including out-of-pocket costs, health status and functioning, health behaviors

Frequently Used Administrative Data
  State hospital discharge data
  Healthcare Cost and Utilization Project (HCUP)
  Medicare administrative data
    Enrollment files
    Hospital discharge files
    Physician services files
    Inpatient rehabilitation and skilled nursing facility files
    Hospital outpatient department files
  Medicaid administrative data

Healthcare Cost and Utilization Project
  HCUP is a family of databases and related software tools and products developed through a Federal-State-Industry partnership and sponsored by AHRQ
  HCUP includes the largest collection of hospital care data in the U.S., with all-payer, encounter-level information beginning in 1988
Healthcare Cost and Utilization Project (cont.)
Database components of HCUP:
  Nationwide Inpatient Sample (NIS): Inpatient data from a national sample of over 1,000 hospitals (starting in 1988)
  Kids' Inpatient Database (KID): Nationwide sample of pediatric (age < 21) inpatient discharges
  State Inpatient Databases (SID): Universe of inpatient discharge abstracts from participating states (starting in 1995)
  State Ambulatory Surgery Databases (SASD): Data from ambulatory care encounters from hospital-affiliated and sometimes freestanding ambulatory surgery sites (starting in 1997)
  State Emergency Department Databases (SEDD): Data from hospital-affiliated emergency departments for visits that do not result in hospitalizations (starting in 1999)

Healthcare Cost and Utilization Project (cont.)
Software components of HCUP:
  AHRQ Quality Indicators (QIs): Measures of health care quality that make use of hospital inpatient administrative data
    Consist of three modules measuring various aspects of quality: ACSCs, inpatient QIs, patient safety indicators
    Software and user guides for modules are available to assist users in applying the Quality Indicators to their own data
  Clinical Classifications Software (CCS): Provides a method for classifying diagnoses or procedures into clinically meaningful categories

Other Secondary Datasets
  Behavioral Risk Factor Surveillance System (BRFSS)
  Surveillance, Epidemiology and End Results (SEER) Program
  Community Tracking Study Surveys
  Claims data from private health plans or employers
  American Hospital Association Annual Survey of Hospitals
  AND MANY, MANY OTHERS!
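To give a feel for what "classifying diagnoses into clinically meaningful categories" involves, here is a minimal sketch of collapsing diagnosis codes into broad groups and tabulating discharges. The three-entry lookup table and the category labels are purely illustrative and are NOT the actual CCS crosswalk; the discharge codes are invented. In practice the published CCS mapping files should be used.

```python
from collections import Counter

# Illustrative mapping only: NOT the actual CCS crosswalk. Keys are the first
# three characters of an ICD-9-CM diagnosis code; labels are assumed for the demo.
CATEGORY_BY_PREFIX = {
    "410": "Acute myocardial infarction",
    "250": "Diabetes mellitus",
    "486": "Pneumonia",
}


def classify(code):
    """Map a diagnosis code (with or without a decimal point) to a broad category."""
    cleaned = code.replace(".", "").strip()
    return CATEGORY_BY_PREFIX.get(cleaned[:3], "Unclassified")


# Hypothetical principal diagnoses from a handful of discharge records.
discharges = ["410.71", "250.00", "41091", "486", "V30.00"]
counts = Counter(classify(code) for code in discharges)

if __name__ == "__main__":
    for category, n in counts.most_common():
        print(f"{category}: {n}")
```

Codes that fall outside the lookup are kept under "Unclassified" rather than dropped, so the category totals can still be reconciled against the number of records.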