MERMAID SERIES: SECONDARY DATA ANALYSIS: TIPS AND TRICKS Sonya Borrero Natasha Parekh (Adapted from slides by Amber Barnato)
Objectives Discuss benefits and downsides of using secondary data Describe publicly available datasets Identify methodological techniques and considerations related to secondary data analysis Apply didactics to real-life problems that have arisen in secondary data analysis
What is secondary data analysis? Analysis of data collected by someone else In contrast to primary data analysis in which the same team of researchers designs, collects, and analyzes the data Health services and epidemiological research often rely on secondary data
Use of secondary data Types of secondary data: Interview/ Survey Administrative Some datasets are free Some charge fees
Use of secondary data Advantages Large populations represented Collecting primary data is costly Detailed information available Can study rare conditions Don't need individual informed consent
Use of secondary data Disadvantages No choice in variables available Data use agreements are necessary Time frame might not be desired Population is predetermined Complex sampling frame on some
National Center for Health Statistics (NCHS) datasets Population Surveys: National Health and Nutrition Examination Survey (NHANES) National Health Interview Survey (NHIS) National Survey of Family Growth (NSFG) Vital Records: National Vital Statistics System National Death Index
National Center for Health Statistics (NCHS) datasets Provider surveys: National Ambulatory Medical Care Survey National Hospital Ambulatory Medical Care Survey National Hospital Care Survey National Study of Long-Term Care Providers
Other CDC datasets Behavioral Risk Factor Surveillance System (BRFSS): State-level, telephone surveys assessing health-related risk behaviors, chronic health conditions, and use of preventive services Pregnancy Risk Assessment Monitoring System (PRAMS): State-level surveillance on maternal attitudes and experiences around pregnancy Conducted in conjunction with state health departments Covers about 83% of all US births
Centers for Medicare and Medicaid Services (CMS) datasets Medicare data Administrative claims data (i.e., data generated by billing) for people whose health care is covered by Medicare (age 65 and older; kidney failure; some disabilities) Medicaid data Claims data on all patients enrolled in the Medicaid program Summary files and MSIS Data Mart are easier to use
Veterans Affairs (VA) databases VA is largest integrated health care system in the US Many databases: most administrative, few survey Very comprehensive Requires prior VA approval to access
SGIM Dataset Compendium Great resource to assist investigators conducting secondary data analysis User's guide and comprehensive list of public and proprietary datasets http://www.sgim.org/communities/research/datasetcompendium
Methodological techniques: Intro to analyses used Selection of analysis methods is based on two elements: Research Experiment (Study Design; Type of variables; Research Questions) and Data Collection Methods (Administrative Data, population based; Survey / Interview Data, sample based)
Intro to analyses used Study Design Cross-sectional, longitudinal (repeated measures) Multilevel (hierarchical) Cohort / case-control; prospective / retrospective; clinical trials Type of variables Time-to-event (survival) Binary, categorical, ordered, continuous outcome Instrumental variables
Intro to analyses used Research questions Descriptive Associations (regressions; linear, non-linear, logistic, Cox, mixed; structural equation modeling; etc.) Estimation / Inference (measures of effect: OR/RR; test for trend; etc.) Others (e.g.; cost-effectiveness, propensity score, etc.)
Intro to analyses used Data Collection Methods Administrative Data Population Record duplication Missing data Survey / Interview Data (Sampled) Sampling design Analysis adjusted for the sampling design (weights)
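Adjusting for the sampling design can be illustrated with a weighted proportion. This is a minimal sketch only: the respondents and weights are made up, the function name is hypothetical, and it gives a point estimate only. Standard errors for a complex sampling design need design-aware survey software.

```python
def weighted_proportion(responses, weights):
    """Weighted share of respondents with a positive (True) response."""
    if len(responses) != len(weights):
        raise ValueError("responses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    positive_weight = sum(w for r, w in zip(responses, weights) if r)
    return positive_weight / total_weight


# Hypothetical respondents: True = screened. Weights are illustrative only.
screened = [True, False, True, False]
sample_weights = [100.0, 300.0, 100.0, 500.0]

unweighted = sum(screened) / len(screened)                 # ignores the design
weighted = weighted_proportion(screened, sample_weights)   # uses the weights
print(unweighted, weighted)
```

In this toy example the two estimates differ because the screened respondents carry small weights, which is the reason unweighted analyses of sampled surveys can mislead.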
Case-Based Problems Let's apply what we learned to a case!
Case-Based Study Background Study objectives Assess cervical cancer screening guideline adherence after guideline changes Assess covariates associated with appropriate cervical cancer screening, under-screening, and over-screening
Problem 1: What level of analysis?
Patient-level? Columns: ID #, Age, Race, # of paps, Pap 1 date, Pap 1 provider, Pap 2 date, Pap 2 provider
Provider-level? Columns: Physician #, Specialty, Gender, # of paps performed, Pap 1 date, Pap 2 date
Pap-level? Columns: Pap #, Patient ID #, Age, Race, Date, Provider
Problem 1: Our Approach Pap-level Women could have multiple outcomes (appropriate screening, under-screening, over-screening); outcome based on pap Some variables changed with pap (patient age, date, provider) Some variables did not change with pap (patient race, ethnicity, comorbidities)
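A pap-level dataset can be built by expanding each patient record into one row per pap. The sketch below assumes a hypothetical record layout (field names such as `patient_id` and `paps` are not from the study data) and made-up values. Fixed patient variables are copied onto every row, while pap-specific variables come from each pap.

```python
def to_pap_level(patients):
    """Turn one-record-per-patient data into one-row-per-pap data."""
    pap_rows = []
    for patient in patients:
        for pap in patient["paps"]:
            pap_rows.append({
                "patient_id": patient["patient_id"],
                "race": patient["race"],      # does not change with pap
                "age_at_pap": pap["age"],     # changes with pap
                "date": pap["date"],          # changes with pap
                "provider": pap["provider"],  # changes with pap
            })
    return pap_rows


# Hypothetical patients for illustration only.
patients = [
    {"patient_id": 1, "race": "A", "paps": [
        {"age": 28, "date": "2007-02-01", "provider": "P1"},
        {"age": 29, "date": "2008-03-15", "provider": "P2"},
    ]},
    {"patient_id": 2, "race": "B", "paps": [
        {"age": 41, "date": "2007-05-20", "provider": "P1"},
    ]},
]
pap_level = to_pap_level(patients)
print(len(pap_level))
```

Because one woman can contribute several rows, analyses on data shaped this way usually need to account for clustering within patient.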
Problem 2: How do I define my outcome (rates of adherence)? Can look at # of paps during time periods Pros: Easier Consistent with other studies Cons Hard to assess >1 guideline in >1 time period What about paps right before and right after? Does not take into account actual interval between paps
Problem 2: How do I define my outcome? Can look at time between paps Pros Is more accurate since guidelines are based on intervals Cons Can be confusing to define Do you look at exact amount of time between paps? If I look at all paps for all women, how do I look at different guideline periods?
Problem 2: Our approach Solution: Look at time between paps only for women who have an index pap in set time periods Decreases sample size but is much cleaner Divide groups by age and time based on guideline differences
Group   Age        Index Period
A1      18-29 yo   1/1/07-6/30/07
A2      18-29 yo   11/1/09-4/30/10
B1      30-65 yo   1/1/07-6/30/07
B2      30-65 yo   11/1/09-4/30/10
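The grouping above can be sketched as a lookup on age and index pap date, followed by the interval to the next pap. This is an illustration under assumptions, not the study code: function names are hypothetical, age is assumed to be in whole years at the index pap, and period boundaries are treated as inclusive.

```python
from datetime import date

# Index periods and age bands taken from the group table.
INDEX_PERIODS = {
    "1": (date(2007, 1, 1), date(2007, 6, 30)),
    "2": (date(2009, 11, 1), date(2010, 4, 30)),
}
AGE_GROUPS = {"A": (18, 29), "B": (30, 65)}


def assign_group(age, pap_date):
    """Return 'A1', 'A2', 'B1', 'B2', or None if this is not an index pap."""
    for letter, (low, high) in AGE_GROUPS.items():
        if low <= age <= high:
            for number, (start, end) in INDEX_PERIODS.items():
                if start <= pap_date <= end:
                    return letter + number
    return None


def days_to_next_pap(index_date, pap_dates):
    """Days from the index pap to the next later pap, or None if there is none."""
    later = [d for d in pap_dates if d > index_date]
    return (min(later) - index_date).days if later else None


# Hypothetical woman aged 25 with paps one year apart.
group = assign_group(25, date(2007, 3, 1))
interval = days_to_next_pap(date(2007, 3, 1),
                            [date(2007, 3, 1), date(2008, 3, 1)])
print(group, interval)
```

Women without an index pap in either period get `None` and drop out, which is the sample-size cost noted on the slide.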
Problem 3: A covariate does not make sense! One covariate of interest was # of annual visits Hypothesis: women with more annual visits had more over-screening? Descriptive stats for visit #
Mean   SD   Min   Max     Median
10.7   12   1     296!!   7.5
Ideas for what is going on?
Problem 3: A covariate does not make sense! What questions do you have? How did we define visits? We defined visits by outpatient location codes for visits from billing data What is going on with people with 296 visits/year? Person with max 296 had a visit every other day for procedure code H0020 (Alcohol and/or drug services; methadone administration and/or service; provision of the drug by a licensed program)
Problem 3: A covariate does not make sense! Options for how to handle this? Find out what proportion of people with >X visits per year have this procedure code? Determine how "visit count" summary statistics change if we exclude this procedure code? Hesitant to do this because some women get methadone or alcohol treatment at same offices as paps Talk to our statistician
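One way to see how the visit count changes when a procedure code is excluded is to count visits per patient twice and compare. The claims below are made up, `OTHER` is a placeholder code, and the function name is hypothetical. Only H0020 comes from the slides.

```python
from collections import Counter
from statistics import mean, median


def visits_per_patient(claims, excluded_codes=()):
    """Count claims per patient, skipping any excluded procedure codes.

    Patients whose claims are all excluded do not appear in the result.
    """
    counts = Counter()
    for claim in claims:
        if claim["procedure_code"] not in excluded_codes:
            counts[claim["patient_id"]] += 1
    return counts


# Hypothetical claims: patient 1 has many H0020 claims.
claims = (
    [{"patient_id": 1, "procedure_code": "H0020"}] * 8
    + [{"patient_id": 1, "procedure_code": "OTHER"}] * 2
    + [{"patient_id": 2, "procedure_code": "OTHER"}] * 4
)

all_visits = visits_per_patient(claims)
without_h0020 = visits_per_patient(claims, excluded_codes={"H0020"})

print(mean(all_visits.values()), median(all_visits.values()))
print(mean(without_h0020.values()), median(without_h0020.values()))
```

Comparing the two sets of summary statistics shows how much of the skew is driven by the excluded code before deciding whether exclusion is appropriate.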
Problem 3: Our approach 1st: we excluded this procedure code Mean decreased from 10.7 to 7.5, median decreased from 7.5 to 6. Still very high! Discussed with statistician: we could keep exploring, or we could winsorize! Winsorizing: transformation of statistics by limiting extreme values in the statistical data to reduce the effect of possibly spurious outliers. We set all visit counts >95th percentile to the 95th percentile We used it as a confounder/control variable rather than a covariate of interest. Regressions with and without this variable were similar
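Upper-tail winsorizing at the 95th percentile can be sketched as follows. The percentile here uses the nearest-rank definition, which is an assumption: statistical packages use several percentile definitions and may give a slightly different cap. The visit counts are made up and the function names are hypothetical.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not (isinstance(pct, int) and 1 <= pct <= 100):
        raise ValueError("pct must be an integer from 1 to 100")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def winsorize_upper(values, pct=95):
    """Set every value above the pct-th percentile to that percentile."""
    cap = percentile_nearest_rank(values, pct)
    return [min(v, cap) for v in values]


# Hypothetical visit counts with one extreme value.
visit_counts = list(range(1, 20)) + [296]
capped = winsorize_upper(visit_counts)
print(max(visit_counts), max(capped))
```

Unlike trimming, winsorizing keeps every observation in the sample and only changes the extreme values.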
Problem 4: A main covariate of interest does not make sense! Patient Factors Cervical Cancer Screening Guideline Adherence Provider Factors
Problem 4: A main covariate of interest does not make sense! We assessed provider factors through linking NPI number of provider in billing data with AAMC data on specialty, race, gender, and years in practice
Descriptive Statistics for Providers
                                  n
Total                         14812
Provider Type
  Clinic                       2142
  Lab                          8494
  Physician                    2476
Provider Specialty
  Family Planning Clinic       1709
  Independent Laboratory       8494
  Family Practice               230
  General Practitioner         1437
  Internal Medicine               1
  Clinical Medical Laboratory  8235
  Family Medicine               249
  Internal Medicine             253
  OBGYN                        1076
  Pathology                    1047
  Specialist                    207
Descriptive Statistics for Providers
Billing Provider Type
  Clinic                       2200
  Laboratory                   8506
  Physician                    1796
Billing Provider Spec
  Family Planning Clinic       1714
  Family Practice               143
  General Practitioner          877
  Internal Medicine               3
  OBGYN                         193
We needed to do some more exploring
Problem 4: Our approach 1st: Went back to billing codes to assess whether I chose the wrong billing codes for paps Codes verified as correct for paps Most of the lab-based provider coded paps did not have other providers associated with the bill So, if we excluded these, we would exclude a LOT of paps
Problem 4: Our approach Contacted coders from our clinic to understand whether some procedure codes were more specific for performing a pap (more likely to be physicians) vs. interpreting pap results (more likely to be laboratories) Dawn: the majority of pap smear procedure codes are for interpreting paps; only 1 Medicare code exists for performing paps Medicare code used very infrequently What would you do next?
Problem 4: Possible Solutions Excluding these paps Determine usual source of primary care Use physician visits done within 3 days of the pap claim as proxies for the performing provider Dicey since a) a patient's usual source of care may not be the one performing the inappropriate paps, and b) sometimes patients see many doctors in a 3-day period, so it will be hard to tease out who performed the pap Attribution of paps was tricky
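The 3-day proxy idea, and why it gets dicey, can be sketched by listing every provider a patient saw near the pap claim and attributing the pap only when exactly one candidate exists. This is an illustration of the proposed option, not something the study carried out. The function names, record layout, and visits are hypothetical, and the window is assumed to cover 3 days on either side of the claim.

```python
from datetime import date


def candidate_providers(pap_date, visits, window_days=3):
    """Distinct providers one patient saw within window_days of the pap claim."""
    return sorted({
        visit["provider"]
        for visit in visits
        if abs((visit["date"] - pap_date).days) <= window_days
    })


def attribute_pap(pap_date, visits, window_days=3):
    """Return the single candidate provider, or None if there are zero or several."""
    candidates = candidate_providers(pap_date, visits, window_days)
    return candidates[0] if len(candidates) == 1 else None


# Hypothetical physician visits for one patient.
pap_date = date(2007, 3, 2)
visits = [
    {"provider": "Dr A", "date": date(2007, 3, 1)},
    {"provider": "Dr B", "date": date(2007, 3, 20)},
]
clear_case = attribute_pap(pap_date, visits)

# Adding a second nearby provider makes attribution ambiguous.
visits_many = visits + [{"provider": "Dr C", "date": date(2007, 3, 4)}]
ambiguous_case = attribute_pap(pap_date, visits_many)
print(clear_case, ambiguous_case)
```

Counting how often the result is `None` would show how many paps cannot be attributed with this rule.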
Problem 4: Our solution Accepted that we could not assess provider factors with our data Looked into other sources, but not feasible within our means Referred to this as a limitation and moved on
Problem 5: Moving on We finished the cervical cancer screening project, and needed to decide where to go next Wanted to look at mammogram guideline adherence patterns Issue: Medicaid population is ~30 years old Thoughts?
Problem 5: Moving on Medicaid enrollees older than 50 are a special population and less generalizable (disabled; long-term residents) Solution: Decided not to study mammogram screening patterns in Medicaid Studied STI screening instead
Take home points Secondary data has benefits and downsides Many publicly available data sets Special considerations are needed for analysis ALWAYS check descriptive statistics for every variable and outcome Explore why when these don't make sense Once you understand why, decide if it is reparable There are tips and tricks for troubleshooting Work closely with your statistician Sometimes changing your original plan is the best option
Take home points If using billing codes, check with coders about which variables make the most sense Hypotheses can lead to more hypotheses Think creatively about troubleshooting problems and next steps of your research
Thank you! Questions?