Better Quasi-Experimental Design

Thomas D. Cook
Northwestern University and
Mathematica, Inc.
Stockholm, 2015
Introduction
• The theory of cause we discuss today –
manipulability theory
• RCT as the best reflection of it because of
• Primacy of manipulation in each theory
• Statistical theory of comparability on all
variables, in expectation at least
• Assumptions clear and testable
• Social consensus about when met or not
RCTs not always possible; we need
Alternatives
• Closest approximations are quasi-experiments
• QEs CAN work in theory – Rubin Causal Model
• BUT the conditions under which they work are opaque
• Need a method to learn which QEs are usually
effective or not in producing results like RCTs
• Today will present a method for identifying QE
design and analysis alternatives that “often”
produce results similar to those of RCT
What is at Stake? Evidence-Based
Practice
• Rhetoric of evidence-based practice
• What is acceptable as “evidence”
• In causal realm, RCTs are sometimes sole
(Evidence-based Coalition), always preferred
and often more heavily weighted than other
causal designs (WWC)
• Entails slow rate of progress and requires
• Accepting all the limitations of RCTs –external
validity
At Stake: Future of Bigger Datasets
• More variables, hence more constructs
• More versions of same construct – reliability
• More frequent assessments – more time series
• More cases
• More local sampling – better comparisons
• More linkage potential to other data sets
• Are better QE designs those with data attributes
like those that are becoming more available?
We will explore today
• Which alternatives to the RCT are worth
trusting and not trusting so that we can
accumulate results faster over greater range
of UTOS
• One concept of trustworthy is theory; another
is that the causal estimates are routinely close
to those of an RCT.
• Today is about the Empirical Criterion of
Correspondence between RCT and QE results
Overview of Day
• First, present and discuss the method called
Within Study Comparison (WSC) or design
experiment. Then break
• Then report results of WSCs for two versions
of RD and for CITS. Then break.
• Then report results of WSCs for simpler
designs with no pretest time series nor a fully
known selection mechanism. Then break
• Finally, Conclusions and Open discussion
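WSC Design: Three-Arm Study
[Diagram: the overall population is sampled/selected into a randomized experiment and into a comparison group. Within the randomized experiment, cases are randomly assigned to a treatment group or a control group. Treatment vs control gives the ITT of the RCT; treatment vs comparison group gives the ITT of the OS.]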
WSC Design: Four-Arm Study
[Diagram: the population is randomly assigned to a randomized experiment or an observational study, each with a treatment and a control arm. The question is whether the ATE from the two studies is equal.]
Within-Study Comparison aka Design
Experiment
• Have a benchmark– RCT in 64 of the 72 examples
to date. Compute causal estimate and SE
• Quasi-experiment of many different “types” – ITS,
RD, NECGD with pretest and local match, NECGD
with fully known selection process, with partially
known, with scarcely known etc. –
• Attach same treatment group to RCT and QE,
adjust QE, and then compute QE estimate
• Compare QE and RCT estimates, and conclude
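The steps above can be sketched as a small simulation. This is only an illustration under stated assumptions, not the procedure of any study cited in these slides: the data-generating model, the single covariate called "ability", the comparison group's lower mean, and the function names `simulate_wsc` and `ancova_difference` are all hypothetical.

```python
import random
import statistics


def ancova_difference(x_t, y_t, x_c, y_c):
    """Treatment-minus-comparison difference adjusted for one covariate."""
    mx_t, my_t = statistics.fmean(x_t), statistics.fmean(y_t)
    mx_c, my_c = statistics.fmean(x_c), statistics.fmean(y_c)
    # Pooled within-group slope of the outcome on the covariate.
    sxy = sum((x - mx_t) * (y - my_t) for x, y in zip(x_t, y_t))
    sxy += sum((x - mx_c) * (y - my_c) for x, y in zip(x_c, y_c))
    sxx = sum((x - mx_t) ** 2 for x in x_t) + sum((x - mx_c) ** 2 for x in x_c)
    slope = sxy / sxx
    return (my_t - my_c) - slope * (mx_t - mx_c)


def simulate_wsc(n=20000, true_effect=0.5, seed=1):
    rng = random.Random(seed)

    def draw(mean_ability, treated):
        ability = rng.gauss(mean_ability, 1.0)
        outcome = ability + true_effect * treated + rng.gauss(0.0, 1.0)
        return ability, outcome

    # Benchmark: RCT treatment and control groups from the same population.
    treat = [draw(0.0, 1) for _ in range(n // 2)]
    control = [draw(0.0, 0) for _ in range(n // 2)]
    # Non-equivalent comparison group (assumed): untreated, lower mean ability.
    comparison = [draw(-0.8, 0) for _ in range(n // 2)]

    x_t, y_t = zip(*treat)
    x_c, y_c = zip(*comparison)
    rct = statistics.fmean(y_t) - statistics.fmean(o for _, o in control)
    # The same treatment group is attached to the QE comparison group.
    naive = statistics.fmean(y_t) - statistics.fmean(y_c)
    adjusted = ancova_difference(x_t, y_t, x_c, y_c)
    return {"rct": rct, "qe_naive": naive, "qe_adjusted": adjusted}


result = simulate_wsc()
print(result)
```

In this toy setup the selection process is fully captured by the one covariate, so the adjusted QE estimate lands near the RCT benchmark; a real WSC asks whether that happens when selection is only partly known.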
Evolution of WSC Purposes
• Began : CAN we get similar results? existence proof;
job training; Limitations of approach
• Does widely used QE practice work? E.g, use of pretest,
of local comparisons, combining the 2; of PS analysis
• Bias-reduction potential of alternatives in separate
studies – e.g, covariate choice vs analysis
• Direct comparison of alternatives – Pretest when
correlated with selection process or not
• Novel alternatives – e.g., Stuart & Rubin (2008)
• Later = more sophisticated questions better linked to
statistical theory
Why do WSCs?
• All QE results can be disputed in terms of possible
alternative interpretations, but not necessarily plausible
• Single WSCs of limited value – if similar results by design,
existence proof of what stat theory tells us; if not similar,
would they have been by adding X, Y or Z?
• Goal is not to identify a QE that will always replicate the RCT finding;
it is to identify designs that often do so
• With multiple studies, (a) we could have external warrant
for claims that a given QE practice often works, given no
internal warrant from theory, as with RCT; (b) we will
escape from QE and OS language to be more specific in
language re designs
Conditions for a good WSC
• A well implemented RCT, with minimal
sampling error
• No third variable confounds – like from
measurement
• Comparable estimands – RD and RCT
• Blinding to the RCT or adjusted QE results
• Defensible criterion for correspondence of
RCT and adjusted QE results
Limitations of WSCs - I
• Blinding is rare and so are protocols – is there
reason to assume that folks most likely to do
WSCs want RCT or QE to “win out”?
• No perfect criterion for correspondence, given
sampling error in RCT and QE – of similar signs,
stat test results by design, of null and equivalence
tests of design differences
• Can only be done on topics with a benchmark, and
80%-90% of WSCs have an RCT benchmark. Don’t
we want to know how QEs do when no RCT is
possible?
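One way to make the correspondence criteria above concrete is to report sign agreement, a null test of the design difference, and an equivalence test together. The sketch below is an assumption-laden illustration: the function name `correspondence`, the normal approximation, the equivalence margin, and the treatment of the two estimates as independent are all choices made here, not taken from the slides. In a WSC that shares a treatment group across designs, the independence assumption is too simple.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def correspondence(rct_est, rct_se, qe_est, qe_se, margin):
    diff = qe_est - rct_est
    # Assumes independent estimates (a simplification).
    se = math.sqrt(rct_se ** 2 + qe_se ** 2)
    # Null test: is the design difference distinguishable from zero?
    p_difference = 2.0 * (1.0 - normal_cdf(abs(diff / se)))
    # Equivalence test (two one-sided tests) against +/- margin.
    p_lower = 1.0 - normal_cdf((diff + margin) / se)  # H0: diff <= -margin
    p_upper = normal_cdf((diff - margin) / se)        # H0: diff >= +margin
    return {
        "difference": diff,
        "same_sign": (rct_est > 0) == (qe_est > 0),
        "p_difference": p_difference,
        "p_equivalence": max(p_lower, p_upper),
    }


# Hypothetical numbers, in standard deviation units.
print(correspondence(rct_est=0.30, rct_se=0.05, qe_est=0.32, qe_se=0.05, margin=0.20))
```

Reporting both p-values matters because a non-significant difference alone can simply reflect large sampling error in either design.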
Discussion of WSCs
• Break.
WSC Results for Two Variants of the
Regression Discontinuity Design
(RD): Simple and Comparative RD
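RDD Visual Depiction
[Figure: outcome plotted against the assignment variable, with comparison cases on one side of the cutoff and treatment cases on the other. The counterfactual regression line is extended across the cutoff; the discontinuity at the cutoff is the treatment effect.]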
Stress for simple RD that
• Process of assignment into treatment
completely known and perfectly measured
• We should be surprised, therefore, if we do
not get same results as RCT estimated at same
point in sharp RD.
• Trivial theory test BUT Non-trivial test of
implementability of RD in real world
Why same Causal Estimate in Simple
RD and RCT?
• Rationale 1: process of selection into
treatment is completely known – “sharp” RD
• Rationale 2 for “sharp RD” – like RCT around
the cutoff and only there.
• Rationale 3: If there is contamination – T cases
in untreated area and/or C cases in treated
area, the way of dealing with this is the same
in RD and RCT – instrumental variables (IV)
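Rationale 3 can be illustrated with the simplest instrumental-variables estimator, the Wald ratio: the outcome gap between those assigned and not assigned, divided by the gap in treatment actually received. This is a minimal sketch with hypothetical data and a hypothetical function name `wald_estimate`; in a fuzzy RD the same ratio would be computed from cases near the cutoff, which is not shown here.

```python
def wald_estimate(y_assigned, y_not_assigned, d_assigned, d_not_assigned):
    """Ratio of the intent-to-treat effect to the difference in take-up."""

    def mean(values):
        return sum(values) / len(values)

    itt = mean(y_assigned) - mean(y_not_assigned)
    takeup_gap = mean(d_assigned) - mean(d_not_assigned)
    if takeup_gap == 0:
        raise ValueError("assignment does not change treatment receipt")
    return itt / takeup_gap


# Hypothetical data: 75% of assigned cases and 25% of non-assigned cases
# actually receive treatment (d = 1), so there is contamination on both sides.
effect = wald_estimate(
    y_assigned=[2, 2, 2, 0],
    y_not_assigned=[0, 0, 0, 2],
    d_assigned=[1, 1, 1, 0],
    d_not_assigned=[0, 0, 0, 1],
)
print(effect)
```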
Why Larger Standard Errors, less
Power, in Simple RD?
• RD analysis requires measure of the treatment
assignment (1/0) as treatment indicator, and
of the assignment variable as selection control
• The two are correlated, one a binary measure
totally located within the assignment variable
• Hence if any slope in the relationship of
assignment variable and outcome,
the treatment indicator and the
assignment variable will be CO-LINEAR
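The co-linearity argument can be checked numerically. In a linear regression with two predictors, the sampling variance of the treatment coefficient is inflated by 1 / (1 - r^2), where r is the correlation between the treatment indicator and the assignment variable. The sketch below assumes an evenly spread assignment variable with the cutoff at its midpoint and a correctly specified linear model; the names `pearson` and `variance_inflation` are illustrative. The factor depends on the distribution of the assignment variable and on the analysis, so this is not a derivation of the empirical figure reported for the WSCs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def variance_inflation(assignment, cutoff):
    """Variance inflation of the treatment coefficient in a sharp RD."""
    indicator = [1.0 if a >= cutoff else 0.0 for a in assignment]
    r = pearson(assignment, indicator)
    return 1.0 / (1.0 - r * r)


# Assumed: assignment scores evenly spread on [-1, 1], cutoff at 0.
grid = [i / 1000 for i in range(-1000, 1001)]
vif = variance_inflation(grid, cutoff=0.0)
se_ratio = math.sqrt(vif)
print(round(vif, 2), round(se_ratio, 2))
```

Under these assumptions the comparison is with a design of the same size and residual variance in which the treatment indicator is uncorrelated with the assignment variable, as it is in expectation in an RCT.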
13 WSCs of RD vs RCT at the Cutoff
• Almost all causal results at cutoff are similar
• More so as N increases, and across parametric
and non-parametric analyses
• Results in many different substantive fields
• No meta-analysis yet; no thorough examination
of file drawer problem
• Results look promising for internal validity of RD
in the crucible of practice.
• But SEs are about 3X larger for same sample size
Conclusion re Simple RD
• You can trust simple RD to give unbiased
causal estimate at the cutoff despite great
heterogeneity in how it is implemented
• It is less efficient than RCT by a factor of about
3
• In practice, many RDs have larger sample sizes
than a comparable RCT would, thus reducing
the statistical power loss in practice.
Alas, Simple RD is very limited
• Less Statistical Power than RCT
• Functional Form or Bandwidth Assumption
• Lesser causal generalization – LATE vs ATE
• Is there a way to do RD better so that there is (a)
support for the extrapolation always needed; (b)
more power, and especially (c) unbiased
causal estimates in all the treated area and
not just at the cutoff?
Now we will examine Comparative
Regression Discontinuity (CRD)
• Visually, what is it?
[Figure: posttest regression and pretest regression plotted against the assignment variable]
Non-Equivalent Regression Function
• From Pretest
• From Non-Equivalent Comparison Group
• From Non-Equivalent Dependent variables
• Ludwig and Miller health results 5 years after
Head Start – pretest health; local cohort too
old for HS; health outcomes do or do not affect
little kids – pulmonary problems vs accidents
• How do we study CRD? Via form of WSC
Creating the synthetic RD from the RCT
Now Imagine
• Dropping the treated cases in the untreated
part of the assignment variable
• Dropping the untreated cases in the treated
part of the assignment variable.
• You are left with two groups instead of the
four when the RCT is subdivided by cutoff
score. This is the very simplest RD.
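The two dropping steps can be written as a single filter over the RCT cases. This is a minimal sketch: the record layout (`assign`, `treated`) and the function name `synthetic_rd_from_rct` are hypothetical, and it assumes the treated side is at or above the cutoff.

```python
def synthetic_rd_from_rct(cases, cutoff):
    """Keep treated cases at/above the cutoff and untreated cases below it."""
    return [c for c in cases if (c["assign"] >= cutoff) == bool(c["treated"])]


# Hypothetical RCT cases: assignment-variable score and randomized arm.
rct_cases = [
    {"assign": 65, "treated": True},   # treated case in the untreated area: dropped
    {"assign": 65, "treated": False},  # kept
    {"assign": 75, "treated": True},   # kept
    {"assign": 75, "treated": False},  # untreated case in the treated area: dropped
]
rd_cases = synthetic_rd_from_rct(rct_cases, cutoff=70)
print(rd_cases)
```

The dropped cases are what make the comparison possible: the full RCT supplies the benchmark, and the retained cases supply the RD estimate.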
[Figure: posttest regression and pretest regression]
Walk you through three examples, testing
• How well does CRD do relative to both RCT
and simple RD with respect to:
• Functional form estimation
• Statistical power
• Bias in all Area away from Cutoff
Example One
Wing & Cook (JAPAM; 2013)
Case 1: Cash and Counseling
Demonstration: RCT
• T=having control over Medicaid funds for
disability services vs
• Business as usual = Medicaid selecting
providers of service
• DV = total expenditures for disability services
cos Medicaid dispenses less than the
allotment usually
• Question: DO families spend more when they
have control over funds?
CRD Design Specifics
• Set an assignment variable = age
• Set cutoff: 35, 50, and 70 – use only 70 here
cos of age distribution
• Comparison function is pretest spending
• RDD analysis both parametric and non-para
(LLR) – report only LLR here
• 3 States: NJ, Ark, Fla.
• About 1,000 cases per site
Research questions again:
• If you add pretest RDD function as in this
example, do you
• Have more confidence in functional form? How
comparable are the 3 untreated regression segments?
• Have lower standard errors, how close to RCT
• Get causal estimates for whole age range of
treated from 70 to 90+ and not just at 70?
• Begin with support
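A minimal sketch of how a pretest function can support estimates away from the cutoff, assuming linear and parallel untreated functions: fit the pretest regression over the whole assignment range, estimate the constant pre-to-post shift from untreated cases below the cutoff, and use the shifted pretest line as the counterfactual above the cutoff. The function names and simulated data are hypothetical, and this is not the local linear regression analysis reported in the examples.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - slope * mx, slope


def crd_pre_effect(x, pre, post, cutoff):
    """Average effect above the cutoff under a parallel-functions assumption."""
    intercept, slope = fit_line(x, pre)  # pretest function, full range
    below = [p - (intercept + slope * a) for a, p in zip(x, post) if a < cutoff]
    shift = sum(below) / len(below)      # untreated pre-to-post shift
    above = [
        p - (intercept + slope * a + shift)
        for a, p in zip(x, post)
        if a >= cutoff
    ]
    return sum(above) / len(above)


def simulate(n=4000, effect=2.0, cutoff=70.0, seed=7):
    """Hypothetical data: age as assignment variable, treated at/above cutoff."""
    rng = random.Random(seed)
    x = [rng.uniform(50.0, 90.0) for _ in range(n)]
    pre = [10.0 + 0.2 * a + rng.gauss(0.0, 1.0) for a in x]
    post = [
        11.5 + 0.2 * a + (effect if a >= cutoff else 0.0) + rng.gauss(0.0, 1.0)
        for a in x
    ]
    return x, pre, post


x, pre, post = simulate()
print(crd_pre_effect(x, pre, post, cutoff=70.0))
```

If the untreated pretest and posttest functions are not parallel, the constant shift is wrong and the estimate away from the cutoff is biased, which is why the slides begin with support for the functional form.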
What about standard errors of
estimates away from cutoff?
• For cutoff at age 70, higher than RCT by 1.3
across all the area away from the cutoff
• At the cutoff, smaller than the RD across all
comparisons of RD and CRD-Pre at age 70
(and other ages, too).
What about Bias?: Comparisons
State        Estimation  Cut-Off  Bias at Cut-Off:  Bias at Cut-Off:  Bias Above Cut-Off:
                                  Post-Test Only    Pre-Test Design   Pre-Test Design
Arkansas     LLR         70       -0.06              0.07              0.04
New Jersey   LLR         70        0.01              0.08              0.12
Florida      LLR         70        0.08             -0.02             -0.04
Example 2: Effects of Head Start – Tang
& Cook (2014)
• Random selection of HS centers (89% agree)
followed by random assignment within
centers of 3 year olds
• Outcome = math, literacy; social behavior
• CRD-Pre has pretest as no-treatment
regression function, as Wing and Cook
• CRD-CG has non-equivalent group of 4 year
olds from same locations, not in W & C
Two Forms of CRD tested
• CRD-Pre
– Supplement the basic RD design with pretest
scores of the same individuals
• CRD-CG
– Supplement the basic RD design with a non-equivalent comparison group.
– Two different cutoff scores for replication – a test
score is one and date of testing is the other
Sample sizes
• RCT is 2326
• RD with IRT-generated PPVT as the
assignment variable: 1163
• CRD-Pre: 1163 subjects with 2326
observations
• RD with date of assessment as the
assignment variable: 1045
• CRD-CG: 1856 subjects (observations)
What about support? 3 untreated segments of
CRD-Pre – CRD-CG similar
Results: Precision of CRD-Pre
[Figure: bar chart of standard error ratios for three outcomes (Math, Literacy, Social-Emotion). Series: ratio of RD s.e. to RCT s.e. at the cutoff; ratio of CRD s.e. to RCT s.e. at the cutoff; ratio of CRD s.e. above the cutoff to RCT s.e. above the cutoff. Vertical axis from 0.50 to 2.10. Bar labels in the chart: 1.94, 1.90, 1.52, 1.40, 1.35, 1.15, 1.02, 0.95, 0.89.]
Results: Precision of CRD-CG
Results: bias of CRD-Pre above the cutoff
Results: bias of CRD-CG above the cutoff
Summary: CRD-Pre above the cutoff
Summary: CRD-CG above the cutoff
Case 3: Stress Test
Effects of Training
Kisbu-Sakarya, Tang & Cook
Shadish, Clark & Steiner (2008)
[Diagram: N = 445 undergraduate psychology students were randomly assigned to a randomized experiment (N = 235) or an observational study (N = 210). In the randomized experiment, students were randomly assigned to mathematics training (N = 119) or vocabulary training (N = 116). In the observational study, students self-selected into mathematics training (N = 79) or vocabulary training (N = 131). The question is whether the ATE from the two studies is equal.]
“Stress Test” due to Modest Ns
• N for RCT is 235
• N for basic RD and CRD-Pre is 123 for math
and 112 for vocabulary
• N for CRD-CG is 254 for the math outcome
(123+131) and 191 for vocab (112+79).
• These are small sample sizes for regression
techniques with individual data
Support for Regression Assumption:
CRD-Pre math outcome
Support for Regression Assumption:
CRD-CG math outcome: Lowess
Above cutoff for math
Support for Regression Assumption:
CRD-Pre vocabulary outcome
Support for Regression Assumption:
CRD-CG vocabulary outcome: Lowess
Above cutoff for vocab
SEs: At cutoff for vocab
Overall Conclusions about CRD
• With either CRD-Pre or CRD-CG, the added functional
form can help if the untreated functional forms are
parallel-ish and if sample size large enough for
reasonable stability. The addition will:
• Increase confidence in functional form extrapolation
• Increase power relative to RD and close to that of RCT
• Lead to unbiased causal inference at the cutoff AND
ALSO AWAY FROM IT.
• CRD shrinks the advantages of RCT, but without
entirely eliminating them
Why do Simple RD?
• Why tolerate its disadvantages if they are so
easily mitigated by a non-treated regression
function that can be observed and will be
even more feasible in the “big data” era?
• Why is the design of choice not automatically
some form of CRD rather than RD?
• Analog here to the development of RCT. How
many posttest-only RCT designs are there in social
science practice? Most have covariates at least
MORE PRETEST DATA POINTS:
RCT VS. INTERRUPTED TIME
SERIES (ITS) AND ESPECIALLY
COMPARATIVE INTERRUPTED
TIME SERIES (CITS)
Interrupted Time Series Can Provide Strong Evidence
for Causal Effects
• Clear Intervention Time Point
• Huge and Immediate Effect
• Clear Pretest Functional Form + many Observations
• No Alternative at Intervention Can Explain Change
Limitations of Simple One-Group ITS
• History, around the intervention point
• Instrumentation
• Stat Regression
• Functional form extrapolation needed
• Analysis has to account for correlated errors
(we will not deal with this issue here)
• Suggest the advisability of a comparative ITS
WSCs on Simple ITS
• All except one done by Fretheim. Now almost
a dozen datasets comparing RCT and ITS
• Inconsistency in ability to recreate RCT results
• Why? Inherent weakness of design?
• Let’s look at the most feasible alternative.
[Figure: Hypothetical NCLB effects on public (red) versus private schools (blue). NAEP test score plotted against time, with the introduction of NCLB marked.]
WSC and CITS
• Six studies in medicine, four in education, one
in environmental sciences
• All claim causal inferences similar
• No meta-analysis to date
• No analysis of file drawer problem
• Remarkable cos the internal validity threats of
differential history, instrumentation and
regression could have operated but did not
St. Clair, Cook, & Hallberg (2014)
• RCT: Study of Indiana’s system for feedback on
student performance (schools as unit of
assignment)
• Comparative ITS comparison groups
– Basically all schools in the state
– Matched schools in the state
[Figure: Math (All schools). Scores by year (years 1 to 7) for treatment schools and all other schools in the state; vertical axis from -.6 to .4.]
Math: WSC Results
[Figure: bias (axis from -.6 to .2) for each model: naive comparison of post-test means; 1, 2, 3, 4, 5 and 6 pre-test time points; 6 pre-test time points with slope terms.]
[Figure: ELA (All Schools). Scores by year (years 1 to 7) for treatment schools and all other schools in the state; vertical axis from -.6 to .4.]
ELA: WSC Results
[Figure: bias (axis from -.6 to .2) for each model: naive comparison of post-test means; 1, 2, 3, 4, 5 and 6 pre-test time points; 6 pre-test time points with slope terms.]
What about Matching C to T Units?
• We can match C to T units, though this entails
some case loss. Then no need to assume
functional form is correct
• Same results
• Somers et al got the same results
• The environmental science study replicated the RCT
only with matching
• Matching safest analysis unless sure of FF
CITS Summary
• To date, CITS does well relative to RCT
• Matching is the most consistent to date
• Models with the correct functional form do
well; and one can observe the functional form
• Similar effects despite possible group
differences in (a) pre-treatment trend,(b)
historical events at treatment; (c) changes in
instrument; (d) stat regression– have never
been confounds
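The logic of a CITS model with baseline mean and slope terms can be sketched as follows: fit a pre-intervention line to each group, project it into the post period, and take the treatment group's deviation from its projection minus the comparison group's deviation from its own. The function names, the linear functional form and the numbers are assumptions for illustration; correlated errors are ignored, as in the slides.

```python
def fit_line(ts, ys):
    """Ordinary least squares intercept and slope for one predictor."""
    n = len(ts)
    mt, my = sum(ts) / n, sum(ys) / n
    slope = sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sum(
        (t - mt) ** 2 for t in ts
    )
    return my - slope * mt, slope


def cits_effect(times, treat, comp, intervention_time):
    """Mean post-period deviation from trend: treatment minus comparison."""
    pre = [i for i, t in enumerate(times) if t < intervention_time]
    post = [i for i, t in enumerate(times) if t >= intervention_time]
    a_t, b_t = fit_line([times[i] for i in pre], [treat[i] for i in pre])
    a_c, b_c = fit_line([times[i] for i in pre], [comp[i] for i in pre])
    deviations = [
        (treat[i] - (a_t + b_t * times[i])) - (comp[i] - (a_c + b_c * times[i]))
        for i in post
    ]
    return sum(deviations) / len(deviations)


# Hypothetical series: six pretest years, treatment in year 7.
# Both groups get a shared history shock of 0.2 in year 7;
# the treatment group also gets a true effect of 0.3.
years = [1, 2, 3, 4, 5, 6, 7]
treat = [1.0 + 0.10 * t for t in years[:6]] + [1.0 + 0.10 * 7 + 0.2 + 0.3]
comp = [0.5 + 0.05 * t for t in years[:6]] + [0.5 + 0.05 * 7 + 0.2]
print(cits_effect(years, treat, comp, intervention_time=7))
```

The groups differ in level and slope before treatment, and a shared history event hits both at the intervention point, yet the comparison of deviations still recovers the effect; differential history would not be removed this way.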
Less Elaborate QEs - NECGDs
• NO known selection process and no pretest time
trends
• Probably the bulk of all current QEs, but will change
with bigger data towards CITS
• Within currently dominant practice, trick is:
• (1) To reduce the size of initial difference through how
the comparison case is sampled or comparison cases
are sampled - overlap maximization; and then
• (2) how to choose (a) covariates and (b) mode of data
analysis to reduce remaining selection bias – most
action is with 2(a) and 2(b), though (1) is likely more
important, 2(a) next and 2(b) quite trivial.
NEXT SECTION
• Non-Equivalent Control Group Designs
without RD or pretest time series
• This is a matter of
• How to select comparison population so as to
reduce the initial group non-equivalence
• How to select covariates so as to reduce
selection bias
• How to analyze the data
Flavor of Two Positions
• Rubin: Study the process of selection into
treatment in one or many of many different
ways and use this to select covariates.
• Heckman and his students – choose local
comparisons, choose pretest measure of
outcome, choose “rich” collection of other
covariates
1. SELECTING NON-EQUIVALENT
COMPARISON GROUPS
TO REDUCE INITIAL NON-EQUIVALENCE
The Trick with most QEs is
• To select an intact C group as similar to T as
possible to minimize selection difference thru
sampling. Contrast is with making them seem
similar through individual case matching
• To use covariates in analysis that reduce any
selection difference still remaining. This is
where propensity scores, ANCOVA come in.
What does Local “Mean”?
• Identical twins, non-identical, sibs, cousins
• Same grade cohort in schools, birth cohort
• Schools in same district vs other
• Job training sites in same local labor market
• Towns at border of different states vs all state
• More local the better since it matches on
more unobservables as well as observables
Local intact comparison groups
• Past empirical research in Cook et al. (2008)
shows 3 cases in different fields where local
choice eliminated all bias. Two more WSCs since,
and two others earlier with same result.
• But some counter-cases in job training. Always
reduces bias but DOES NOT ALWAYS ELIMINATE IT
• Problem is: Not all local matches are good
• How can we take advantage of its bias-reduction
qualities without bias elimination? Come back to
this later after discussing covariate choice
2. GIVEN AN OBSERVED PRETEST
DIFFERENCE BETWEEN
TREATMENT AND CONTROLS,
HOW TO MODEL (A) STRONGLY
SUSPECTED SELECTION PROCESS
Statistical Theory
• Knowing selection and measuring it perfectly
gives unbiased causal inference
• BUT rarely know it fully – RDD exception
• Yet we often know major selection elements:
why retained in grade; why self-select into
divorce; why use emergency rooms?
• How to make selection process better known?
• Here’s one example – why students self-select
into learning English or math
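Strongly suspected selection process
Shadish, Clark & Steiner (2008)
[Diagram: N = 445 undergraduate psychology students were randomly assigned to a randomized experiment (N = 235) or an observational study (N = 210). In the randomized experiment, students were randomly assigned to mathematics training (N = 119) or vocabulary training (N = 116). In the observational study, students self-selected into mathematics training (N = 79) or vocabulary training (N = 131). The question is whether the ATE from the two studies is equal.]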
23 Constructs and 5 Construct
Domains assessed prior to Intervention
• Proxy-pretests (2 multi-item constructs):
36-item Vocabulary Test II, 15-item Arithmetic Aptitude Test
• Prior academic achievement (3 multi-item
constructs):
High school GPA, current college GPA, ACT college
admission score
• Topic preference (6 multi-item constructs):
Liking literature, liking mathematics, preferring mathematics
over literature, number of prior mathematics courses, major
field of study (math-intensive or not), 25-item mathematics
anxiety scale
Construct Domains
• Psychological predisposition (6 multi-item
constructs):
Big five personality factors (50 items on extroversion,
emotional stability, agreeableness, openness to experience,
conscientiousness), Short Beck Depression Inventory (13
items)
• Demographics (5 single-item constructs):
Student’s age, sex, race (Caucasian, Afro-American,
Hispanic), marital status, credit hours
Was there Bias in the QE with Self-Selection into Tracks?
• RCT showed effects for each outcome.
• Both math and vocab effects larger than in
RCT when there was self-selection into T
versus C – thus, bias in QE.
• Our question is: How much of self-selection
bias is reduced by use of covariates measuring
several different possible selection processes?
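The bias-reduction percentages reported next can be read through a simple formula: the share of the initial QE-minus-RCT difference that disappears after adjustment. The sketch below is one common way to compute it; the function name and the example numbers are hypothetical, and the exact definition used in the original analyses may differ.

```python
def percent_bias_reduction(naive_est, adjusted_est, benchmark_est):
    """Share of the initial bias removed by covariate adjustment, in percent."""
    initial_bias = naive_est - benchmark_est
    if initial_bias == 0:
        raise ValueError("no initial bias to reduce")
    remaining_bias = adjusted_est - benchmark_est
    return 100.0 * (1.0 - remaining_bias / initial_bias)


# Hypothetical estimates: RCT benchmark 4.0, unadjusted QE 5.0, adjusted QE 4.5.
print(percent_bias_reduction(naive_est=5.0, adjusted_est=4.5, benchmark_est=4.0))
```

On this scale 100 means the adjusted QE matches the benchmark, values above 100 mean over-correction, and negative values mean the adjustment increased the bias.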
Bias Reduction: Construct Domains
Mathematics
[Figure: bias reduction (%) on the vertical axis, from -20 to 140. Horizontal axis: covariate sets made up of single construct domains and combinations of domains (dem = demographics, pre = proxy-pretests, aca = prior academic achievement, top = topic preference, psy = psychological predisposition). Four analysis methods are plotted as points numbered 1 to 4: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA.]
Bias Reduction: Single Constructs
Mathematics
[Figure: bias reduction (%) on the vertical axis, from -40 to 140. Horizontal axis: proxy-pretest covariates (vocab.pre, math.pre); topic preference covariates (like.math, pref.math, major, numbmath, like.lit, mars); all covariates; and all covariates except like.math, except pref.math, or except both. Four analysis methods are plotted as points numbered 1 to 4: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA.]
Bias Reduction: Construct Domains
Vocabulary
[Figure: bias reduction (%) on the vertical axis, from -20 to 140. Horizontal axis: covariate sets made up of single construct domains and combinations of domains (dem = demographics, pre = proxy-pretests, aca = prior academic achievement, top = topic preference, psy = psychological predisposition). Four analysis methods are plotted as points numbered 1 to 4: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA.]
Bias Reduction: Single Constructs
Vocabulary
[Figure: bias reduction (%) on the vertical axis, from -40 to 140. Horizontal axis: proxy-pretest covariates (vocab.pre, math.pre); topic preference covariates (pref.math, like.lit, like.math, major, mars, numbmath); all covariates; and all covariates except vocab.pre, except pref.math, or except both. Four analysis methods are plotted as points numbered 1 to 4: PS-stratification, PS-ANCOVA, PS-weighting, ANCOVA.]
Given Initial Group Differences
• 1. Choice of covariates is crucial
• 2. Reliability counts, but secondary within
bounds of 1 to .60.
• 3. Mode of analyzing covariates (OLS and PS
matching) makes little difference, though PS
preferred in theory
• 4. Replicated in Pohl et al. (2011)
2. GIVEN OBSERVED
DIFFERENCE, HOW SPECIAL IS (B)
PRETEST MEASURE OF STUDY
OUTCOME FOR BIAS
REDUCTION?
Claims about Pretest
• Claim that pretest is privileged for bias reduction;
yet by itself did little for math in Shadish et al.
• In studies modeling the outcome only, pretest
often the most highly correlated single variable
• But the issue is the correlation of pretest with selection into T
• Though we suspect selection on pretest to be
frequent, we do not know how often and when
• Next WSC studies vary when the pretest does and
does not vary with selection
Existing Empirical Evidence
• WSCs support privileging true pretest because it
is better than others at reducing bias - Heckman
• Sometimes reduces all by itself -- Magnet school study
(Bifulco, 2010) and earlier CITS studies here
• But it does not always reduce all bias – e.g.,
Shadish et al. and workforce development lit
• This study examines bias reduction due to pretest
when we vary the correlation with selection both
between and within studies
Between-Studies: Kindergarten
Retention
• Hong and Raudenbush (2005; 2006) used rich
covariates in ECLS-K to estimate the effect of
kindergarten retention on math and reading
• Two prior waves
• Evidence of selection-maturation: Retained
have lower mean and lower rate of change.
• Selection process largely known: past perf and
teacher ratings –both available at 2 pretest
times
Dataset 1: Correlation with Selection
                  Correlation with   Correlation   Percent of
                  Retention in       Lower Bound   lower bound
                  Kindergarten
Reading Pretest   -0.185*            -0.38         48.7%
Math Pretest      -0.179*            -0.37         48.4%
Data set 1: Analytic Approach
• Broke 144 covariates into three groups:
– One wave of pretest data (spring of K)
– Two waves (fall and spring of K)
– 140 other covariates
• Created propensity scores with each cov set
and estimated reading and math effects
• Note: Bias reduction compared to benchmark
model, not RCT!
Dataset 1: Math Results
[Figure: Math. Treatment effect (in sd units) relative to benchmark, axis from -.7 to .3, for each model: no covariates; one pretest covariate; all covariates minus pretest; two or more pretest covariates; all covariates.]
Dataset 1: ELA Results
[Figure: ELA. Treatment effect (in sd units) relative to benchmark, axis from -.7 to .3, for each model: no covariates; all covariates minus pretest; one pretest covariate; two or more pretest covariates; all covariates.]
Dataset 2:
Indiana Benchmark Assessment Study (Grade 5)
• 56 K-8 schools 5th graders randomly assigned
to:
– Treatment: state benchmark assess system (n=34)
– Control schools: business as usual (n=22)
– Outcomes: Math and ELA ISAT scores
• QE comparison group from all other schools in
state serving 5th grade students (n = 681)
• Rich set of student and school covariates with
multiple waves of pretest data
Dataset 2: Selection
• Schools selected into study cos interested in
implementing the program
• Principals interviewed and cited
– Taking advantage of free resource from the state
– A commitment to data driven decision making
– Knowledge of other schools implementing
– No mention of participation due to school’s past
academic performance – i.e., the pretest
Dataset 2: No Correlation with Selection
                  Correlation with Selection into
                  Benchmark Assessment System
Reading Pretest    0.041
Math Pretest      -0.012
Dataset 2: Math Results
[Figure: Math. Treatment effect (in sd units) relative to benchmark, axis from -.7 to .3, for each model: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest.]
Dataset 2: ELA Results
[Figure: ELA. Treatment effect (in sd units) relative to benchmark, axis from -.7 to .3, for each model: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest.]
Shadish et al. Correlation with
Selection
                  Correlation with Selection into
                  Vocabulary Training
Reading Pretest    0.169*
Math Pretest      -0.090
Dataset 3: ELA Results where Pretest
and Selection correlate
[Figure: ELA. Treatment effect (in sd units) relative to benchmark, axis from -.7 to .3, for each model: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest.]
Math Results where Pretest and
Selection do not correlate
[Figure: Math. Treatment effect (in sd units) relative to benchmark, axis from -.7 to .3, for each model: no covariates; one pretest covariate; two or more pretest covariates; all covariates; all covariates minus pretest.]
Summary of Pretest Results
• Cannot assume the pretest is always related to
selection, even if it often is
• You should probably always include it
• But you are better guided by theoretical
explication of all plausible selection processes
• Better supplementing it with more waves and
other covariates.
2. GIVEN PRETEST DIFFERENCE, (C) WHAT
HAPPENS IF THE SELECTION PROCESS IS NOT
KNOWN BUT HAVE “RICH” SET OF COVARIATES?
Steiner, Cook & Li (2015)
• “Rich” covariates – more domains (presumptively
independent constructs) and higher reliability (number
of items assessing each construct)
• Theory = pick up increasingly more parts of the true
but unknown selection process
• Two data sets – one with 156 covariates at one pretest
and the other with 144 over two pretest time points.
• Each has reasonably known theory of selection;
• We identify it and then throw away those variables to
ask: How well do the remaining covariates function
collectively, though they are individually imperfect?
Remove effective single covariates
Mathematics
[Figure: the single-construct bias reduction chart for mathematics, with bias reduction (%) from -40 to 140 for PS-stratification, PS-ANCOVA, PS-weighting and ANCOVA, contrasting All Covariates with Critical Covariates Removed (all covariates except like.math, except pref.math, or except both).]
Conclusion: “Rich” Covariates w/o
Independent Info on Selection
• Helps reduce some bias
• More so with more reliable assessments
• Within limits we imposed of 12 domains, still
40% of bias remaining
• If more domains, each of 5 items, who knows?
“Rich” Covariates
• Useful cos it increases chances of choosing the
true selection variables
• But no guarantee
• If put together “rich” covariates, local
comparison group choice and pretest
(Heckman), each does mostly OK by self and
the three together might be even better
• But an even better option is possible
Hybrid sampling model of Stuart and
Rubin (2008)
• Define caliper for adequacy of a match
• Match all LOCAL Cs to T that fall within caliper
• For others, perform a match using a PS
predicated on analysis of selection processes
• Result = mix of acceptably matched local Cs
that control for more unobservables, and
acceptably matched non-local Cs, but
matched only on observables
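The hybrid rule above can be sketched as a two-tier nearest-neighbour match. Everything here is a simplified illustration rather than the Stuart and Rubin procedure itself: the record fields (`id`, `district`, `score`), the function name `hybrid_match`, matching with replacement, and the absence of a caliper on the second tier are all assumptions, and `score` stands in for an already estimated propensity score.

```python
def hybrid_match(treated, controls, caliper):
    """Match each treated unit locally if possible, otherwise non-locally."""
    matches = {}
    for t in treated:

        def distance(c, t=t):
            return abs(c["score"] - t["score"])

        # Tier 1: closest control in the same district, if within the caliper.
        local = [c for c in controls if c["district"] == t["district"]]
        best_local = min(local, key=distance, default=None)
        if best_local is not None and distance(best_local) <= caliper:
            matches[t["id"]] = (best_local["id"], "local")
            continue
        # Tier 2: closest control from any other district, on the score only.
        others = [c for c in controls if c["district"] != t["district"]]
        best_other = min(others, key=distance, default=None)
        if best_other is None:
            matches[t["id"]] = (None, "unmatched")
        else:
            matches[t["id"]] = (best_other["id"], "non-local")
    return matches


# Hypothetical units.
treated = [
    {"id": "T1", "district": "A", "score": 0.50},
    {"id": "T2", "district": "B", "score": 0.80},
]
controls = [
    {"id": "C1", "district": "A", "score": 0.52},
    {"id": "C2", "district": "B", "score": 0.30},
    {"id": "C3", "district": "C", "score": 0.78},
]
print(hybrid_match(treated, controls, caliper=0.05))
```

The tier label is kept with each match so that local and non-local pairs can be examined separately, since only the local pairs are expected to control for unobservables.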
Hallberg, Wong, & Cook (in press)
• This paper draws on a WSC to examine
correspondence with the RCT benchmark
(Indiana student feedback study) after matching
– Within district as long as the schools do not differ by
more than 0.75 standard deviations of the propensity
score (Local)
– For others match on observed school-level covariates
known to be highly correlated with the outcome of
interest (Focal)
– Combine both T and C matched cases (Hybrid)
Performance of local, focal and hybrid matching
across two dependent variables
[Figure: treatment effect (in sd units) relative to benchmark, axis from -.3 to .3, for Math and ELA under each approach: naive effect; local match; focal match; hybrid match.]
Percentage of times observational approach
performed best across 1000 replications
[Figure: percentage (axis from 0 to 80) for Math and ELA under each approach: naive effect; covariate match; within district; hybrid approach.]
Summary
• Intact group matching increases overlap.
Useful first stage in a QE design strategy?
• But have been counter-cases in job training
• We will see focal matching is no guarantee
either, though we know when it is better
• Is this hybrid model best?
• Too early to tell. Need more studies of it
Conclusions re Weaker Designs
than RD and ITS
• It is not just a matter of analysis. Minor
• It’s not just a matter of reliability of covariates
• It’s a matter of how you select intact
comparison groups – local and hybrid
• Matter of how much you know about
remaining selection bias
• Matter of correspondence between your
covariates and knowledge of selection
Conclusions re Weaker Designs
than RD and ITS
• Heckman’s Advice? Pretest, local comparisons
and rich covariates – probably OK but not yet
tested
• “Rich Covariates” alone - problematic?
• “Which variables are on hand” Disaster
• Demographics only – disaster
• Best = hybrid matching? If so, more needed on
caliper choice for local part, focal part needs
all the care needed when initial difference –
BIG PICTURE CONCLUSIONS
• RD is advisable, but CRD is much preferred to
it, though its assumptions need to be checked
• CITS is advisable, but ITS is not
• For NECGDs, Rubin’s advice is helpful but not
complete and sometimes impossible
• Heckman’s advice seems very likely to work
• Hybrid Model may be better than all the
others but not clear yet.
• THINKING HELPS; JUST ANALYZING DOES NOT