Better Quasi-Experimental Design
Thomas D. Cook
Northwestern University and Mathematica, Inc.
Stockholm, 2015

Introduction
• The theory of cause we discuss today – manipulability theory
• RCT as the best reflection of it because of
  • Primacy of manipulation in each theory
  • Statistical theory of comparability on all variables, in expectation at least
  • Assumptions clear and testable
  • Social consensus about when met or not

RCTs not always possible; we need Alternatives
• Closest approximations are quasi-experiments
• QEs CAN work in theory – Rubin Causal Model
• BUT conditions where they work are opaque
• Need a method to learn which QEs are usually effective or not in producing results like RCTs
• Today will present a method for identifying QE design and analysis alternatives that “often” produce results similar to those of RCT

What is at Stake? Evidence-Based Practice
• Rhetoric of evidence-based practice
• What is acceptable as “evidence”
• In causal realm, RCTs are sometimes sole (Evidence-based Coalition), always preferred and often more heavily weighted than other causal designs (WWC)
• Entails slow rate of progress and requires
  • Accepting all the limitations of RCTs – external validity

At Stake: Future of Bigger Datasets
• More variables, hence more constructs
• More versions of same construct – reliability
• More frequent assessments – more time series
• More cases
• More local sampling – better comparisons
• More linkage potential to other data sets
• Are better QE designs those with data attributes like those that are becoming more available?

We will explore today
• Which alternatives to the RCT are worth trusting and not trusting so that we can accumulate results faster over greater range of UTOS
• One concept of trustworthy is theory; another is that the causal estimates are routinely close to those of an RCT.
• Today is about the Empirical Criterion of Correspondence between RCT and QE results

Overview of Day
• First, present and discuss the method called Within Study Comparison (WSC) or design experiment. Then break
• Then report results of WSCs for two versions of RD and for CITS. Then break
• Then report results of WSCs for simpler designs with neither a pretest time series nor a fully known selection mechanism. Then break
• Finally, Conclusions and Open discussion

WSC Design: Three-Arm Study
[Diagram: the overall population is sampled/selected into a randomized experiment, whose members are randomly assigned to a treatment group or a control group (ITT of RCT), and into a comparison group; treatment group vs comparison group gives the ITT of the OS]

WSC Design: Four-Arm Study
[Diagram: the population is randomly assigned to a randomized experiment or an observational study, each with treatment and control arms; the question is whether the two ATEs are equal (ATE ?= ATE)]

Within-Study Comparison aka Design Experiment
• Have a benchmark – RCT in 64 of the 72 examples to date. Compute causal estimate and SE
• Quasi-experiment of many different “types” – ITS, RD, NECGD with pretest and local match, NECGD with fully known selection process, with partially known, with scarcely known etc.
• Attach same treatment group to RCT and QE, adjust QE, and then compute QE estimate
• Compare QE and RCT estimates, and conclude

Evolution of WSC Purposes
• Began: CAN we get similar results? Existence proof; job training; limitations of approach
• Does widely used QE practice work? E.g., use of pretest, of local comparisons, combining the 2; of PS analysis
• Bias-reduction potential of alternatives in separate studies – e.g., covariate choice vs analysis
• Direct comparison of alternatives – pretest when correlated with selection process or not
• Novel alternatives – e.g., Stuart & Rubin (2008)
• Later = more sophisticated questions better linked to statistical theory

Why do WSCs?
• All QE results can be disputed in terms of possible, but not necessarily plausible, alternative interpretations
• Single WSCs of limited value – if similar results by design, existence proof of what stat theory tells us; if not similar, would they have been by adding X, Y or Z?
• Goal is not to identify QEs that will always replicate the RCT finding; it is to identify designs that often do so
• With multiple studies, (a) we could have external warrant for claims that a given QE practice often works, given no internal warrant from theory, as with RCT; (b) we will escape from QE and OS language to be more specific in language re designs

Conditions for a good WSC
• A well implemented RCT, with minimal sampling error
• No third variable confounds – like from measurement
• Comparable estimands – RD and RCT
• Blinding to the RCT or adjusted QE results
• Defensible criterion for correspondence of RCT and adjusted QE results

Limitations of WSCs - I
• Blinding is rare and so are protocols – is there reason to assume that folks most likely to do WSCs want RCT or QE to “win out”?
• No perfect criterion for correspondence, given sampling error in RCT and QE – of similar signs, stat test results by design, of null and equivalence tests of design differences
• Can only be done on topics with a benchmark, and 80%-90% of WSCs have an RCT benchmark. Don’t we want to know how QEs do when no RCT is possible?

Discussion of WSCs
• Break.

WSC Results for Two Variants of the Regression Discontinuity Design (RD): Simple and Comparative RD

RDD Visual Depiction
[Figure: comparison and treatment regions]

RDD Visual Depiction
[Figure: comparison and treatment regions, with the discontinuity (treatment effect) and the counterfactual regression line marked]

Stress for simple RD that
• Process of assignment into treatment completely known and perfectly measured
• We should be surprised, therefore, if we do not get same results as RCT estimated at same point in sharp RD.
• Trivial theory test BUT non-trivial test of implementability of RD in real world

Why same Causal Estimate in Simple RD and RCT?
• Rationale 1: process of selection into treatment is completely known – “sharp” RD
• Rationale 2 for “sharp RD” – like RCT around the cutoff and only there.
• Rationale 3: If there is contamination – T cases in untreated area and/or C cases in treated area – the way of dealing with this is the same in RD and RCT – instrumental variables (IV)

Why Larger Standard Errors, less Power, in Simple RD?
• RD analysis requires measure of the treatment assignment (1/0) as treatment indicator, and of the assignment variable as selection control
• The two are correlated, one a binary measure totally located within the assignment variable
• Hence if any slope in the relationship of assignment variable and outcome,
• The treatment cutoff score and the assignment variable will be CO-LINEAR

13 WSCs of RD vs RCT at the Cutoff
• Almost all causal results at cutoff are similar
• More so as N increases, and across parametric and non-parametric analyses
• Results in many different substantive fields
• No meta-analysis yet; no thorough examination of file drawer problem
• Results look promising for internal validity of RD in the crucible of practice.
• But SEs are about 3X larger for same sample size

Conclusion re Simple RD
• You can trust simple RD to give unbiased causal estimate at the cutoff despite great heterogeneity in how it is implemented
• It is less efficient than RCT by a factor of about 3
• In practice, many RDs have larger sample sizes than a comparable RCT would, thus reducing the statistical power loss in practice.
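
The collinearity argument on the slide “Why Larger Standard Errors, less Power, in Simple RD?” can be illustrated with a small simulation. This is a minimal sketch, not the analysis used in any of the WSCs: it assumes a standard-normal assignment variable, a cutoff at its mean, and a linear model containing only the treatment indicator and the assignment variable. Under those assumptions the variance of the treatment coefficient is inflated by 1/(1 − r²), where r is the correlation between the indicator and the assignment variable. All function names are illustrative.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def variance_inflation(assignment, cutoff):
    """Variance inflation of the treatment coefficient in a sharp RD.

    Assumption: units at or above the cutoff are treated, and the model is
    outcome ~ treatment indicator + assignment variable (linear).
    """
    treated = [1.0 if score >= cutoff else 0.0 for score in assignment]
    r = pearson(treated, assignment)
    return 1.0 / (1.0 - r * r)


# Hypothetical data: standard-normal assignment variable, cutoff at the mean.
rng = random.Random(0)
scores = [rng.gauss(0.0, 1.0) for _ in range(20000)]
vif = variance_inflation(scores, cutoff=0.0)
se_ratio = math.sqrt(vif)  # inflation expressed on the standard-error scale
print(f"variance inflation: {vif:.2f}, SE ratio: {se_ratio:.2f}")
```

Under these particular assumptions the inflation comes out near 2.75 in variance terms (about 1.66 on the standard-error scale). Other assignment-variable distributions, cutoff locations or bandwidth choices would give different numbers.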
Alas, Simple RD is very limited
• Less Statistical Power than RCT
• Functional Form or Bandwidth Assumption
• Lesser causal generalization – LATE vs ATE
• Is there a way to do RD better so that there is (a) support for the extrapolation that is always needed; (b) more power; and especially (c) unbiased causal estimates in all the treated area and not just at the cutoff?

Now we will examine Comparative Regression Discontinuity (CRD)
• Visually, what is it?
[Figure: posttest regression and pretest regression]

Non-Equivalent Regression Function
• From Pretest
• From Non-Equivalent Comparison Group
• From Non-Equivalent Dependent variables
• Ludwig and Miller health results 5 years after Head Start – pretest health; local cohort too old for HS; health outcomes do or do not affect little kids – pulmonary problems vs accidents
• How do we study CRD? Via form of WSC

Creating the synthetic RD from the RCT

Now Imagine
• Dropping the treated cases in the untreated part of the assignment variable
• Dropping the untreated cases in the treated part of the assignment variable.
• You are left with two groups instead of the four when the RCT is subdivided by cutoff score. This is the very simplest RD.
[Figure: posttest regression and pretest regression]

Walk you through three examples, testing
• How well does CRD do relative to both RCT and simple RD with respect to:
  • Functional form estimation
  • Statistical power
  • Bias in all area away from cutoff

Example One
Wing & Cook (JPAM; 2013)

Case 1: Cash and Counseling Demonstration: RCT
• T = having control over Medicaid funds for disability services vs
• Business as usual = Medicaid selecting providers of service
• DV = total expenditures for disability services cos Medicaid dispenses less than the allotment usually
• Question: DO families spend more when they have control over funds?
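
The construction on the “Creating the synthetic RD from the RCT” slide can be sketched in a few lines. This is an illustrative sketch only: the record layout (assignment value, treated flag, outcome) and the function name are hypothetical, and it assumes the treated side is at or above the cutoff, as in the age-70 example that follows.

```python
def synthetic_rd(records, cutoff):
    """Build a sharp-RD sample from RCT records.

    records: iterable of (assignment_value, treated, outcome) tuples.
    Keeps treated cases at or above the cutoff and untreated cases below it,
    dropping the other two of the four RCT-by-cutoff cells.
    """
    kept = []
    for assignment_value, treated, outcome in records:
        on_treated_side = assignment_value >= cutoff
        if on_treated_side == bool(treated):
            kept.append((assignment_value, treated, outcome))
    return kept


# Hypothetical records: (age, treated, outcome)
rct = [
    (65, True, 10.0),   # treated case in the untreated area: dropped
    (65, False, 8.0),   # untreated case in the untreated area: kept
    (75, True, 12.0),   # treated case in the treated area: kept
    (75, False, 9.0),   # untreated case in the treated area: dropped
    (70, True, 11.0),   # exactly at the cutoff counts as treated side: kept
]
rd_sample = synthetic_rd(rct, cutoff=70)
print(rd_sample)
```

The full RCT still supplies the benchmark estimate, while `rd_sample` is what the simple RD analysis sees. CRD would add an untreated comparison function (pretest or comparison group) on top of it.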
CRD Design Specifics
• Set an assignment variable = age
• Set cutoff: 35, 50, and 70 – use only 70 here cos of age distribution
• Comparison function is pretest spending
• RDD analysis both parametric and non-para (LLR) – report only LLR here
• 3 States: NJ, Ark, Fla.
• About 1,000 cases per site

Research questions again:
• If you add pretest RDD function as in this example, do you
  • Have more confidence in functional form? – How comparable are the 3 untreated regression segments?
  • Have lower standard errors, how close to RCT
  • Get causal estimates for whole age range of treated from 70 to 90+ and not just at 70?
• Begin with support

What about standard errors of estimates away from cutoff?
• For cutoff at age 70, higher than RCT by 1.3 across all the area away from the cutoff
• At the cutoff, smaller than the RD across all comparisons of RD and CRD-Pre at age 70 (and other ages, too).

What about Bias?: Comparisons

State        Estimation   Cut-off   Bias at cutoff:   Bias at cutoff:   Bias above cutoff:
                                    post-test only    pre-test design   pre-test design
Arkansas     LLR          70        -0.06              0.07              0.04
New Jersey   LLR          70         0.01              0.08              0.12
Florida      LLR          70         0.08             -0.02             -0.04

Example 2: Effects of Head Start – Tang & Cook (2014)
• Random selection of HS centers (89% agree) followed by random assignment within centers of 3 year olds
• Outcome = math, literacy; social behavior
• CRD-Pre has pretest as no-treatment regression function, as in Wing and Cook
• CRD-CG has non-equivalent group of 4 year olds from same locations, not in W & C

Two Forms of CRD tested
• CRD-Pre – Supplement the basic RD design with pretest scores of the same individuals
• CRD-CG – Supplement the basic RD design with a nonequivalent comparison group.
  – Two different cutoff scores for replication – a test score is one and date of testing is the other

Sample sizes
• RCT is 2326
• RD with IRT-generated PPVT as the assignment variable: 1163
• CRD-Pre: 1163 subjects with 2326 observations
• RD with date of assessment as the assignment variable: 1045
• CRD-CG: 1856 subjects (observations)

What about support?
• 3 untreated segments of CRD-Pre – CRD-CG similar

Results: Precision of CRD-Pre
[Figure: bar chart of standard-error ratios for the Math, Literacy and Social-Emotion outcomes. Series: ratio of RD s.e. to RCT s.e. at the cutoff; ratio of CRD s.e. to RCT s.e. at the cutoff; ratio of CRD s.e. to RCT s.e. above the cutoff. Bar labels range from 0.89 to 1.94]

[Figure slides: Results: Precision of CRD-CG; Results: bias of CRD-Pre above the cutoff; Results: bias of CRD-CG above the cutoff; Summary: CRD-Pre above the cutoff; Summary: CRD-CG above the cutoff]

Case 3: Stress Test
Effects of Training
Kisbu-Sakarya, Tang & Cook

Shadish, Clark & Steiner (2008)
[Diagram: N = 445 undergraduate psychology students randomly assigned to (a) a randomized experiment, N = 235, randomly assigned to mathematics training (N = 119) or vocabulary training (N = 116), or (b) an observational study, N = 210, self-selected into mathematics training (N = 79) or vocabulary training (N = 131); the question is whether the two ATEs are equal (ATE ?= ATE)]

“Stress Test” due to Modest Ns
• N for RCT is 235
• N for basic RD and CRD-Pre is 123 for math and 112 for vocabulary
• N for CRD-CG is 254 for the math outcome (123+131) and 191 for vocab (112+79).
• These are small sample sizes for regression techniques with individual data

[Figure slides: Support for Regression Assumption: CRD-Pre math outcome; Support for Regression Assumption: CRD-CG math outcome: Lowess; Above cutoff for math; Support for Regression Assumption: CRD-Pre vocabulary outcome; Support for Regression Assumption: CRD-CG vocabulary outcome: Lowess; Above cutoff for vocab; SEs: At cutoff for vocab]

Overall Conclusions about CRD
• With either CRD-Pre or CRD-CG, the added functional form can help if the untreated functional forms are parallel-ish and if sample size is large enough for reasonable stability. The addition will:
  • Increase confidence in functional form extrapolation
  • Increase power relative to RD and close to that of RCT
  • Lead to unbiased causal inference at the cutoff AND ALSO AWAY FROM IT.
• CRD shrinks the advantages of RCT, but without entirely eliminating them

Why do Simple RD?
• Why tolerate its disadvantages if they are so easily mitigated by a non-treated regression function that can be observed and will be even more feasible in the “big data” era?
• Why is the design of choice not automatically some form of CRD rather than RD?
• Analog here to the development of RCT. How many posttest-only RCT designs are there in social science practice? Most have covariates at least

MORE PRETEST DATA POINTS: RCT VS. INTERRUPTED TIME SERIES (ITS) AND ESPECIALLY COMPARATIVE INTERRUPTED TIME SERIES (CITS)

Interrupted Time Series Can Provide Strong Evidence for Causal Effects
• Clear Intervention Time Point
• Huge and Immediate Effect
• Clear Pretest Functional Form + many Observations
• No Alternative at Intervention Can Explain Change

Limitations of Simple One-Group ITS
• History, around the intervention point
• Instrumentation
• Stat Regression
• Functional form extrapolation needed
• Analysis has to account for correlated errors (we will not deal with this issue here)
• Suggest the advisability of a comparative ITS

WSCs on Simple ITS
• All except one done by Fretheim.
Now almost a dozen datasets comparing RCT and ITS
• Inconsistency in ability to recreate RCT results
• Why? Inherent weakness of design?
• Let’s look at most feasible alternative

[Figure: hypothetical NCLB effects on public (red) versus private (blue) schools; y-axis: NAEP test score; x-axis: time, with the start of NCLB marked]

WSC and CITS
• Six studies in medicine, four in education, one in environmental sciences
• All claim causal inferences similar
• No meta-analysis to date
• No analysis of file drawer problem
• Remarkable cos the internal validity threats of differential history, instrumentation and regression could have operated but did not

St. Clair, Cook, & Hallberg (2014)
• RCT: Study of Indiana’s system for feedback on student performance (schools as unit of assignment)
• Comparative ITS comparison groups
  – Basically all schools in the state
  – Matched schools in the state

[Figure: Math (All schools). Yearly means, years 1 to 7, for treatment schools and all other schools in the state]

[Figure: Math: WSC Results. Bias for each model: naive comparison of post-test means; 1, 2, 3, 4, 5 and 6 pre-test time points; 6 pre-test time points with slope terms]

[Figure: ELA (All Schools). Yearly means, years 1 to 7, for treatment schools and all other schools in the state]

[Figure: ELA: WSC Results. Bias for each model: naive comparison of post-test means; 1, 2, 3, 4, 5 and 6 pre-test time points; 6 pre-test time points with slope terms]

What about Matching C to T Units?
• We can match C to T units, though this entails some case loss.
Then no need to assume functional form is correct
• Same results
• Somers et al got the same results
• Environmental science study replicated the RCT only with matching
• Matching safest analysis unless sure of FF

CITS Summary
• To date, CITS does well relative to RCT. Matching is the most consistent to date
• Models with the correct functional form do well; and one can observe the functional form
• Similar effects despite possible group differences in (a) pre-treatment trend; (b) historical events at treatment; (c) changes in instrument; (d) stat regression – these have never been confounds

Less Elaborate QEs - NECGDs
• NO known selection process and no pretest time trends
• Probably the bulk of all current QEs, but will change with bigger data towards CITS
• Within currently dominant practice, trick is:
  • (a) To reduce the size of initial difference through how the comparison case or cases are sampled – overlap maximization; and then
  • How to choose (b) covariates and (c) mode of data analysis to reduce remaining selection bias – most action is with (b) and (c), though (a) is likely more important, (b) next and (c) quite trivial.

NEXT SECTION
• Non-Equivalent Control Group Designs without RD or pretest time series
• This is a matter of
  • How to select comparison population so as to reduce the initial group non-equivalence
  • How to select covariates so as to reduce selection
  • How to analyze the data

Flavor of Two Positions
• Rubin: Study the process of selection into treatment in one or more of many different ways and use this to select covariates.
• Heckman and his students – choose local comparisons, choose pretest measure of outcome, choose “rich” collection of other covariates

1. SELECTING NON-EQUIVALENT COMPARISON GROUPS TO REDUCE INITIAL NONEQUIVALENCE

The Trick with most QEs is
• To select an intact C group as similar to T as possible to minimize selection difference thru sampling.
Contrast is with making them seem similar through individual case matching
• To use covariates in analysis that reduce any selection difference still remaining. This is where propensity scores, ANCOVA come in.

What does Local “Mean”?
• Identical twins, non-identical, sibs, cousins
• Same grade cohort in schools, birth cohort
• Schools in same district vs other
• Job training sites in same local labor market
• Towns at border of different states vs all state
• More local the better since it matches on more unobservables as well as observables

Local intact comparison groups
• Past empirical research in Cook et al. (2008) shows 3 cases in different fields where local choice eliminated all bias. Two more WSCs since, and two others earlier with same result.
• But some counter-cases in job training. Always reduces bias but DOES NOT ALWAYS ELIMINATE IT
• Problem is: Not all local matches are good
• How can we take advantage of its bias-reduction qualities without bias elimination? Come back to this later after discussing covariate choice

2. GIVEN AN OBSERVED PRETEST DIFFERENCE BETWEEN TREATMENT AND CONTROLS, HOW TO MODEL (A) STRONGLY SUSPECTED SELECTION PROCESS

Statistical Theory
• Knowing selection and measuring it perfectly gives unbiased causal inference
• BUT rarely know it fully – RDD exception
• Yet we often know major selection elements: why retained in grade; why self-select into divorce; why use emergency rooms?
• How to make selection process better known?
• Here’s one example – why students self-select into learning English or math

Strongly suspected selection process
Shadish, Clark & Steiner (2008)
[Diagram: N = 445 undergraduate psychology students randomly assigned to (a) a randomized experiment, N = 235, randomly assigned to mathematics training (N = 119) or vocabulary training (N = 116), or (b) an observational study, N = 210, self-selected into mathematics training (N = 79) or vocabulary training (N = 131); the question is whether the two ATEs are equal (ATE ?= ATE)]

23 Constructs and 5 Construct Domains assessed prior to Intervention
• Proxy-pretests (2 multi-item constructs): 36-item Vocabulary Test II, 15-item Arithmetic Aptitude Test
• Prior academic achievement (3 multi-item constructs): High school GPA, current college GPA, ACT college admission score
• Topic preference (6 multi-item constructs): Liking literature, liking mathematics, preferring mathematics over literature, number of prior mathematics courses, major field of study (math-intensive or not), 25-item mathematics anxiety scale

Construct Domains
• Psychological predisposition (6 multi-item constructs): Big five personality factors (50 items on extroversion, emotional stability, agreeableness, openness to experience, conscientiousness), Short Beck Depression Inventory (13 items)
• Demographics (5 single-item constructs): Student’s age, sex, race (Caucasian, Afro-American, Hispanic), marital status, credit hours

Was there Bias in the QE with Self-Selection into Tracks?
• RCT showed effects for each outcome.
• Both math and vocab effects larger than in RCT when there was self-selection into T versus C – thus, bias in QE.
• Our question is: How much of self-selection bias is reduced by use of covariates measuring several different possible selection processes?
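
The question of how much self-selection bias is removed can be made concrete with a small helper. This is a sketch under a stated assumption: bias is taken as the QE estimate minus the RCT benchmark, and bias reduction as the share of the unadjusted bias that the covariate adjustment removes. The function name and the numbers in the example are hypothetical, not results from the study.

```python
def percent_bias_reduction(rct_estimate, unadjusted_qe, adjusted_qe):
    """Percent of the initial QE bias removed by covariate adjustment.

    Assumed definitions (not taken from the slides):
      initial bias   = unadjusted QE estimate - RCT estimate
      remaining bias = adjusted QE estimate   - RCT estimate
    Values above 100 mean the adjustment overshot the benchmark.
    """
    initial_bias = unadjusted_qe - rct_estimate
    remaining_bias = adjusted_qe - rct_estimate
    if initial_bias == 0:
        raise ValueError("no initial bias to reduce")
    return 100.0 * (1.0 - remaining_bias / initial_bias)


# Hypothetical estimates: RCT benchmark 4.0, naive QE 5.0.
partly_corrected = percent_bias_reduction(4.0, 5.0, 4.25)  # adjusted estimate still too high
over_corrected = percent_bias_reduction(4.0, 5.0, 3.5)     # adjusted estimate below benchmark
print(partly_corrected, over_corrected)
```

Because the RCT benchmark itself carries sampling error, a figure from a single study is best read as approximate, which is one reason the bias-reduction plots that follow can show values above 100%.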
[Figure: Bias Reduction: Construct Domains, Mathematics. y-axis: bias reduction (%); x-axis: combinations of covariate domains (dem, pre, psy, aca, top); series: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA]

[Figure: Bias Reduction: Single Constructs, Mathematics. y-axis: bias reduction (%); x-axis: single constructs from the proxy-pretest domain (vocab.pre, math.pre) and the topic preference domain (mars, like.lit, like.math, pref.math, numbmath, major), plus all covariates and all covariates except like.math and/or pref.math; series: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA]

[Figure: Bias Reduction: Construct Domains, Vocabulary. y-axis: bias reduction (%); x-axis: combinations of covariate domains (dem, pre, psy, aca, top); series: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA]

[Figure: Bias Reduction: Single Constructs, Vocabulary. y-axis: bias reduction (%); x-axis: single constructs from the proxy-pretest domain (vocab.pre, math.pre) and the topic preference domain (mars, like.lit, like.math, pref.math, numbmath, major), plus all covariates and all covariates except vocab.pre and/or pref.math; series: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA]

Given Initial Group Differences
• 1. Choice of covariates is crucial
• 2. Reliability counts, but secondary within bounds of 1 to .60.
• 3. Mode of analyzing covariates (OLS and PS matching) makes little difference, though PS preferred in theory
• 4. Replicated in Pohl et al. (2011)

2. GIVEN OBSERVED DIFFERENCE, HOW SPECIAL IS (B) PRETEST MEASURE OF STUDY OUTCOME FOR BIAS REDUCTION?

Claims about Pretest
• Claim that pretest is privileged for bias reduction; yet by itself did little for math in Shadish et al.
• In studies modeling the outcome only, pretest often the most highly correlated single variable
• But issue is correlation of pretest with selection into T
• Though we suspect selection on pretest to be frequent, we do not know how often and when
• Next WSC studies vary when the pretest does and does not vary with selection

Existing Empirical Evidence
• WSCs support privileging true pretest because it is better than others at reducing bias – Heckman
• Sometimes reduces all by itself – Magnet school study (Bifulco, 2010) and earlier CITS studies here
• But it does not always reduce all bias – e.g., Shadish et al. and workforce development lit
• This study examines bias reduction due to pretest when we vary the correlation with selection both between and within studies

Between-Studies: Kindergarten Retention
• Hong and Raudenbush (2005; 2006) used rich covariates in ECLS-K to estimate the effect of kindergarten retention on math and reading
• Two prior waves
• Evidence of selection-maturation: Retained have lower mean and lower rate of change.
• Selection process largely known: past perf and teacher ratings – both available at 2 pretest times

Dataset 1: Correlation with Selection

Correlation with Retention in Kindergarten
                  Correlation   Lower bound   Percent of lower bound
Reading Pretest   -0.185*       -0.38         48.7%
Math Pretest      -0.179*       -0.37         48.4%

Dataset 1: Analytic Approach
• Broke 144 covariates into three groups:
  – One wave of pretest data (spring of K)
  – Two waves (fall and spring of K)
  – 140 other covariates
• Created propensity scores with each cov set and estimated reading and math effects
• Note: Bias reduction compared to benchmark model, not RCT!
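
The step “created propensity scores with each cov set and estimated … effects” can be sketched on simulated data. This is a toy illustration under loud assumptions, not the ECLS-K analysis: there is a single covariate (a pretest) that drives both selection and the outcome, the propensity score comes from a hand-rolled logistic regression fitted by gradient ascent, and the effect is estimated by stratifying on propensity-score quintiles. All names and parameter values are hypothetical.

```python
import math
import random

TRUE_EFFECT = 1.0  # hypothetical effect built into the simulated data


def mean(values):
    return sum(values) / len(values)


def fit_logistic(xs, ts, steps=200, lr=0.5):
    """Fit P(T=1 | x) = sigmoid(b0 + b1*x) by plain gradient ascent."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, t in zip(xs, ts):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += t - p
            g1 += (t - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def stratified_effect(ps, ts, ys, n_strata=5):
    """Size-weighted average of within-stratum treated-minus-control means."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    size = len(order) // n_strata
    total, used = 0.0, 0
    for s in range(n_strata):
        idx = order[s * size:] if s == n_strata - 1 else order[s * size:(s + 1) * size]
        treated = [ys[i] for i in idx if ts[i] == 1]
        control = [ys[i] for i in idx if ts[i] == 0]
        if treated and control:  # skip strata with no overlap
            total += len(idx) * (mean(treated) - mean(control))
            used += len(idx)
    return total / used


# Simulated data: the pretest x drives both selection and the outcome.
rng = random.Random(1)
xs, ts, ys = [], [], []
for _ in range(3000):
    x = rng.gauss(0.0, 1.0)
    t = 1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0
    xs.append(x)
    ts.append(t)
    ys.append(2.0 * x + TRUE_EFFECT * t + rng.gauss(0.0, 1.0))

naive = mean([y for y, t in zip(ys, ts) if t == 1]) - mean([y for y, t in zip(ys, ts) if t == 0])
b0, b1 = fit_logistic(xs, ts)
ps = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
adjusted = stratified_effect(ps, ts, ys)
print(f"naive: {naive:.2f}  PS-stratified: {adjusted:.2f}  true: {TRUE_EFFECT}")
```

Repeating the same steps with different covariate sets (one pretest wave, two waves, other covariates) and comparing each adjusted estimate to the benchmark is the pattern summarized in the results figures. Here the covariate works only because, by construction, it is the selection variable.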
[Figure: Dataset 1: Math Results. Treatment effect (in sd units) relative to benchmark for: no covariates; one pretest covariate; all covariates minus pretest; two or more pretest covariates; all covariates]

[Figure: Dataset 1: ELA Results. Treatment effect (in sd units) relative to benchmark for: no covariates; all covariates minus pretest; one pretest covariate; two or more pretest covariates; all covariates]

Dataset 2: Indiana Benchmark Assessment Study (Grade 5)
• 56 K-8 schools, 5th graders, randomly assigned to:
  – Treatment: state benchmark assess system (n=34)
  – Control schools: business as usual (n=22)
  – Outcomes: Math and ELA ISAT scores
• QE comparison group from all other schools in state serving 5th grade students (n = 681)
• Rich set of student and school covariates with multiple waves of pretest data

Dataset 2: Selection
• Schools selected into study cos interested in implementing the program
• Principals interviewed and cited
  – Taking advantage of free resource from the state
  – A commitment to data driven decision making
  – Knowledge of other schools implementing
  – No mention of participation due to school’s past academic performance – i.e., the pretest

Dataset 2: No Correlation with Selection

Correlation with Selection into Benchmark Assessment System
Reading Pretest    0.041
Math Pretest      -0.012

[Figure: Dataset 2: Math Results. Treatment effect (in sd units) relative to benchmark for: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest]

[Figure: Dataset 2: ELA Results. Treatment effect (in sd units) relative to benchmark for: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest]

Shadish et al.: Correlation with Selection

Correlation with Selection into Vocabulary Training
Reading Pretest    0.169*
Math Pretest      -0.090

[Figure: Dataset 3: ELA Results where Pretest and Selection correlate. Treatment effect (in sd units) relative to benchmark for: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest]

[Figure: Math Results where Pretest and Selection do not correlate. Treatment effect (in sd units) relative to benchmark for: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest]

Summary of Pretest Results
• Cannot assume the pretest is always related to selection, even if it often is
• You should probably always include it
• But you are better guided by theoretical explication of all plausible selection processes
• Better supplementing it with more waves and other covariates.

2. GIVEN PRETEST DIFFERENCE, (C) WHAT HAPPENS IF THE SELECTION PROCESS IS NOT KNOWN BUT HAVE “RICH” SET OF COVARIATES?

Steiner, Cook & Li (2015)
• “Rich” covariates – more domains (presumptively independent constructs) and higher reliability (number of items assessing each construct)
• Theory = pick up increasingly more parts of the true but unknown selection process
• Two data sets – one with 156 covariates at one pretest and the other with 144 over two pretest time points.
• Each has reasonably known theory of selection
• We identify it and then throw away those variables to ask: How well do the remaining covariates function collectively, though they are individually imperfect?
[Figure: Remove effective single covariates, Mathematics. Bias reduction (%) by covariate set for PS-stratification, PS-ANCOVA, PS-weighting and ANCOVA, contrasting all covariates with critical covariates removed]

Conclusion: “Rich” Covariates w/o Independent Info on Selection
• Helps reduce some bias
• More so with more reliable assessments
• Within limits we imposed of 12 domains, still 40% of bias remaining
• If more domains, each of 5 items, who knows?

“Rich” Covariates
• Useful cos it increases chances of choosing the true selection variables
• But no guarantee
• If put together “rich” covariates, local comparison group choice and pretest (Heckman), each does mostly OK by self and the three together might be even better
• But an even better option is possible

Hybrid sampling model of Stuart and Rubin (2008)
• Define caliper for adequacy of a match
• Match all LOCAL Cs to T that fall within caliper
• For others, perform a match using a PS predicated on analysis of selection processes
• Result = mix of acceptably matched local Cs that control for more unobservables, and acceptably matched non-local Cs, but matched only on observables

Hallberg, Wong, & Cook (in press)
• This paper draws on a WSC to examine correspondence with the RCT benchmark (Indiana student feedback study) after matching
  – Within district as long as the schools do not differ by more than 0.75 standard deviations of the propensity score (Local)
  – For others match on observed school-level covariates known to be highly correlated with the outcome of interest (Focal)
  – Combine both T and C matched cases (Hybrid)

[Figure: Performance of local, focal and hybrid matching across two dependent variables. Treatment effect (in sd units) relative to benchmark, Math and ELA, for: naive effect; local match; focal match; hybrid match]

[Figure: Percentage of times observational approach performed best across 1000 replications, Math and ELA, for: naive effect; covariate match; within district; hybrid approach]

Summary
• Intact group matching increases overlap. Useful first stage in a QE design strategy?
• But have been counter-cases in job training
• We will see focal matching is no guarantee either, though we know when it is better
• Is this hybrid model best?
• Too early to tell. Need more studies of it

Conclusions re Weaker Designs than RD and ITS
• It is not just a matter of analysis. Minor
• It’s not just a matter of reliability of covariates
• It’s a matter of how you select intact comparison groups – local and hybrid
• Matter of how much you know about remaining selection bias
• Matter of correspondence between your covariates and knowledge of selection

Conclusions re Weaker Designs than RD and ITS
• Heckman’s Advice? Pretest, local comparisons and rich covariates – probably OK but not yet tested
• “Rich Covariates” alone – problematic?
• “Which variables are on hand” – disaster
• Demographics only – disaster
• Best = hybrid matching? If so, more needed on caliper choice for local part, focal part needs all the care needed when initial difference –

BIG PICTURE CONCLUSIONS
• RD is advisable, but CRD is much preferred to it, though its assumptions need to be checked
• CITS is advisable, but ITS is not
• For NECGDs, Rubin’s advice is helpful but not complete and sometimes impossible
• Heckman’s advice seems very likely to work
• Hybrid Model may be better than all the others but not clear yet.
• THINKING HELPS; JUST ANALYZING DOES NOT
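
The hybrid matching idea from the Stuart and Rubin (2008) and Hallberg, Wong, & Cook slides (take a local match when it falls within the caliper, otherwise look for a non-local match) can be sketched as follows. This is a simplified illustration, not the published procedure: the `Unit` record and `hybrid_match` function are hypothetical, a single matching score stands in for both the propensity score and the focal covariates, matching is one-to-one nearest neighbour with replacement, and the 0.75 default caliper simply echoes the value on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    name: str
    district: str
    score: float  # matching score in SD units (illustrative stand-in)


def nearest(target, pool):
    """Closest unit in pool by absolute score distance, or None if pool is empty."""
    return min(pool, key=lambda c: abs(c.score - target.score), default=None)


def hybrid_match(treated, controls, caliper=0.75):
    """Return {treated name: (control name or None, 'local' | 'focal')}.

    Local: nearest control in the same district, accepted only within the caliper.
    Focal: otherwise, nearest control from outside the district.
    """
    matches = {}
    for t in treated:
        local = nearest(t, [c for c in controls if c.district == t.district])
        if local is not None and abs(local.score - t.score) <= caliper:
            matches[t.name] = (local.name, "local")
        else:
            focal = nearest(t, [c for c in controls if c.district != t.district])
            matches[t.name] = (focal.name if focal else None, "focal")
    return matches


# Hypothetical schools
treated_units = [Unit("T1", "A", 0.0), Unit("T2", "A", 2.0)]
control_units = [Unit("C1", "A", 0.5), Unit("C2", "B", 0.1), Unit("C3", "B", 1.8)]
result = hybrid_match(treated_units, control_units)
print(result)
```

Here T1 gets a local match because C1 lies within the caliper, while T2 falls back to a non-local match. The caliper choice for the local part and the care needed in the focal part are exactly the open issues raised in the conclusions.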
