Returned variables from -psmatch2-
Aim
Here, the convenience variables generated by the Stata command psmatch2 are explained using the example dataset cattaneo2.
At stephenporter.org an unknown international expert on matching is quoted for:
I can never figure out what psmatch2 is doing, and that’s actually one of
the reasons I don’t use it much and don’t tend to recommend it.
The documentation isn’t very clear and there are a number of things where
I think it should be doing something but then it’s not clear that it is.
This leads Stephen Porter to conclude:
Not exactly a compelling endorsement.
This is why I now recommend to my students that they avoid psmatch2:
how can you justify using a software procedure if you have no idea what it is doing?
Especially considering so many other procedures available for Stata, as well as matchit in R.
The main problem for Stephen Porter is the deriving of the weights, _weight.
It turns out that -psmatch2- quite elegant transforms trimming and caliper matching into simple weights.
From Stata version 13, there has been -teffects- in [TE] and
the standard errors for ATT are conservative
Note as written in the link above, that there now is an option ai(#) or
altvariance which makes -psmatch2- return the Abadie-Imbens standard errors.
The beauty of -psmatch2- is that the returned variables makes the user a lot freer in the following analysis. Hence, it is still interesting to use -psmatch2-.
The -psmatch2- help file
From psmatch2 version 4.0.12 30jan2016 E. Leuven, B. Sianesi, the help file says:
Convenience variables in Stata’s -psmatch2- (Caliper matching)
-psmatch2- creates a number of variables for the convenience of the user:
*_treated* is a variable that equals 0 for control observations and 1 for
treatment observations.
*_support* is an indicator variable with equals 1 if the observation is
on the common support and 0 if the observation is off the support.
*_pscore* is the estimated propensity score or a copy of the one
provided by pscore().
*_outcome_variable* for every treatment observation stores the value of
the matched outcome.
*_weight*. For nearest neighbor matching, it holds the frequency with
which the observation is used as a match;
with option ties and k-nearest neighbors matching it holds the normalized
weight;
for kernel matching, and llr matching with a weight other than
stata's tricube, it stores the overall weight given to the matched
observation.
When estimating att only *_weight* = 1 for the treated.
*_id* In the case of one-to-one and nearest-neighbors matching, a new
identifier created for all observations.
*_nk* In the case of one-to-one and nearest-neighbors matching, for every
treatment observation, it stores the observation number of the k-th
matched control observation.
Do not forget to sort by _id if you want to use the observation
number (id) of for example the 1st nearest neighbor as in
sort _id
g x_of_match = x[_n1]
*_nn* In the case of nearest-neighbors matching, for every treatment
observation, it stores the number of matched control observations.
This quoted version of the -psmatch2- help comes after the Porter post.
It might be that the help for -psmatch2- has been modified after the post.
With respect to the points mentioned in the Porter post, I think that the -psmatch2- help is quite clear.
What is returned from -psmatch2-
We now look into what is returned from -psmatch2- using the same data as Stephen Porter does. Propensities are added.
First we look into what is returned in the one-to-one case, then the one-to-many. In both cases, estimates and comparison tests are reported.
The Stephen Porter example data
use http://www.stata-press.com/data/r13/cattaneo2, clear
Creating the propensities
logit mbsmoke prenatal1 fbaby mmarried medu fedu mage fage mrace frace
predict p
The returned variables from -psmatch2-, the one to one case
Using the propensities in the psmatch2 command
summarize p
display `=0.25*r(sd)'
.0295236
psmatch2 mbsmoke, outcome(bweight) pscore(p) neighbor(1) common ///
caliper(`=0.25*r(sd)') noreplacement logit
---------------------------------------------------------------------------------------- Variable Sample | Treated Controls Difference S.E. T-stat ----------------------------+----------------------------------------------------------- bweight Unmatched | 3137.65972 3412.91159 -275.251871 21.4528037 -12.83 ATT | 3136.60303 3381.92899 -245.32596 27.3360463 -8.97 ----------------------------+----------------------------------------------------------- Note: S.E. does not take into account that the propensity score is estimated. psmatch2: | psmatch2: Common Treatment | support assignment | Off suppo On suppor | Total -----------+----------------------+---------- Untreated | 0 3,778 | 3,778 Treated | 5 859 | 864 -----------+----------------------+---------- Total | 5 4,637 | 4,642
label variable _treated "Treatment"
For the description of the convenience variables only the outcome, the exposure and the propensity score is needed.
keep bweight mbsmoke p _*
format p _pscore %6.3f
Five non-treated rows
preserve
keep if !_treated
(864 observations deleted)
list in 1/5, noobs
+----------------------------------------------------------------------------------------------------------------+ | bweight mbsmoke p _pscore _treated _support _weight _bweight _id _n1 _nn _pdif | |----------------------------------------------------------------------------------------------------------------| | 3459 nonsmoker 0.120 0.120 Untreated On support . . 1674 . 0 . | | 3260 nonsmoker 0.409 0.409 Untreated On support 1 . 3606 . 0 . | | 3572 nonsmoker 0.239 0.239 Untreated On support 1 . 2980 . 0 . | | 2948 nonsmoker 0.168 0.168 Untreated On support 1 . 2546 . 0 . | | 2410 nonsmoker 0.099 0.099 Untreated On support . . 983 . 0 . | +----------------------------------------------------------------------------------------------------------------+
restore
Five treated rows
preserve
keep if _treated
(3,778 observations deleted)
list in 1/5, noobs
+------------------------------------------------------------------------------------------------------------------+ | bweight mbsmoke p _pscore _treated _support _weight _bweight _id _n1 _nn _pdif | |------------------------------------------------------------------------------------------------------------------| | 3090 smoker 0.099 0.099 Treated On support 1 2693 3829 1015 1 9.847e-06 | | 3166 smoker 0.293 0.293 Treated On support 1 3657 4342 3271 1 .00026615 | | 3147 smoker 0.157 0.157 Treated On support 1 2595 3992 2153 1 .00022986 | | 2580 smoker 0.433 0.433 Treated On support 1 3459 4546 3651 1 .0004093 | | 2693 smoker 0.375 0.375 Treated On support 1 3345 4468 3536 1 .00032921 | +------------------------------------------------------------------------------------------------------------------+
restore
_id
The variable _id a new identifier created for all observations. The values are row numbers.
_nn
The variable _nn counts the number of neighbors found. Note variable _nn can be zero when no neighbors are found.
_n#
The variable _n1 is the _id for the first neighbor etc. Note variable _n# can be missing when no neighbors are found.
_treated
This is a copy of the treatment variable
compare _treated mbsmoke
---------- Difference ---------- Count Minimum Average Maximum ------------------------------------------------------------------------ _treated=mbsmoke 4642 ---------- Jointly defined 4642 0 0 0 ---------- Total 4642
_support
The non-treated are always having _support equal to one.
Depending on choices in eg options caliper and trim the treated can be in common support (_support equal one) or not (_support equal zero).
Note that _support also is zero if no match is found or already taken (_n1 is missing and _nn is zero).
list _* if _treated & !_nn, noobs
+----------------------------------------------------------------------------------+ | _pscore _treated _support _weight _bweight _id _n1 _nn _pdif | |----------------------------------------------------------------------------------| | 0.577 Treated Off support . . 4627 . 0 . | | 0.577 Treated Off support . . 4625 . 0 . | | 0.577 Treated Off support . . 4626 . 0 . | | 0.658 Treated Off support . . 4635 . 0 . | | 0.648 Treated Off support . . 4634 . 0 . | +----------------------------------------------------------------------------------+
_pscore
This is the same as the variable containing the propensity score.
compare _pscore p
---------- Difference ---------- Count Minimum Average Maximum ------------------------------------------------------------------------ _pscore=p 4642 ---------- Jointly defined 4642 0 0 0 ---------- Total 4642
_bweight
When there is only one matched value it is given in the variable _outcome, here _bweight. To see how _bweight is the matched value of bweight, the rows must be sorted _id. Then the matched value can be found using the variable _n1.
sort _id
generate matched_bweight = bweight[_n1]
compare matched_bweight _bweight
---------- Difference ---------- Count Minimum Average Maximum ------------------------------------------------------------------------ matched~t=_bweight 859 ---------- Jointly defined 859 0 0 0 Jointly missing 3783 ---------- Total 4642
drop matched_bweight
_weight
The treated always get value one if the variable _support is one, ie that the row is in the common support. When there is no replacement, the matched non-treated row also get the value 1. If the matching is with replacement, the value of _weight is the number of times the non-treated value is matched to a treated value.
The weights sums up to the same number (the number of treated on support) for non-treated and treated.
sumat _weight, statistics(sum) rowby(_treated) decimals(0)
------------------------- sum ------------------------- Treatment(Untreated) 859 Treatment(Treated) 859 -------------------------
_pdif
The variable _pdif is the absolute difference between the propensity scores for the treated and non-treated.
sort _id
generate double tmpdif = abs(_pscore - _pscore[_n1])
compare tmpdif _pdif
---------- Difference ---------- Count Minimum Average Maximum ------------------------------------------------------------------------ tmpdif=_pdif 859 ---------- Jointly defined 859 0 0 0 Jointly missing 3783 ---------- Total 4642
drop tmpdif
Reproducing the estimates and test
The ATT can be reproduced using the command -ttest- (compare with ATT estimates from -psmatch2- above)
ttest bweight == _bweight if _nn, unpaired
Two-sample t test with equal variances ------------------------------------------------------------------------------ Variable | Obs Mean Std. err. Std. dev. [95% conf. interval] ---------+-------------------------------------------------------------------- bweight | 859 3136.603 19.15138 561.302 3099.014 3174.192 _bweight | 859 3381.929 19.506 571.6957 3343.644 3420.214 ---------+-------------------------------------------------------------------- Combined | 1,718 3259.266 13.98103 579.4963 3231.844 3286.688 ---------+-------------------------------------------------------------------- diff | -245.326 27.33605 -298.9414 -191.7105 ------------------------------------------------------------------------------ diff = mean(bweight) - mean(_bweight) t = -8.9744 H0: diff = 0 Degrees of freedom = 1716 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
or regression with frequency weights (compare with ATT estimates from -psmatch2- above).
regress bweight i._treated [aw=_weight], noheader
(sum of wgt is 1,718) ------------------------------------------------------------------------------ bweight | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- _treated | Treated | -245.326 27.336 -8.97 0.00 -298.941 -191.710 _cons | 3381.929 19.330 174.96 0.00 3344.017 3419.841 ------------------------------------------------------------------------------
The variable _weight from -psmatch2-, the one to many case
The data is (again)
use http://www.stata-press.com/data/r13/cattaneo2, clear
The scores are found the same way.
logit mbsmoke prenatal1 fbaby mmarried medu fedu mage fage mrace frace
predict p
And -psmatch2- is called similar as above now requring 5 neighbors and trimming 5 percent in each end.
summarize p
display `=0.25*r(sd)'
.0295236
psmatch2 mbsmoke, outcome(bweight) pscore(p) neighbor(5) caliper(`=0.25*r(sd)') logit
---------------------------------------------------------------------------------------- Variable Sample | Treated Controls Difference S.E. T-stat ----------------------------+----------------------------------------------------------- bweight Unmatched | 3137.65972 3412.91159 -275.251871 21.4528037 -12.83 ATT | 3137.65972 3372.793 -235.133275 24.657814 -9.54 ----------------------------+----------------------------------------------------------- Note: S.E. does not take into account that the propensity score is estimated. | psmatch2: psmatch2: | Common Treatment | support assignment | On suppor | Total -----------+-----------+---------- Untreated | 3,778 | 3,778 Treated | 864 | 864 -----------+-----------+---------- Total | 4,642 | 4,642
label variable _treated "Treatment"
keep bweight mbsmoke p _*
format p _pscore %6.3f
The sum of _weight is (again) the same for non-treated and treated, the number of treated in common support.
sumat _weight, statistics(sum) rowby(_treated) decimals(0)
------------------------- sum ------------------------- Treatment(Untreated) 864 Treatment(Treated) 864 -------------------------
To understand how the weights are calculated we look at the last of the treated.
sort _treated _id
local treated = _id[_N]
Then all the matched id's for the last treated is found and saved in the local untreat. A variable slct is generated to mark the rows where the matched id's appear.
local untreat ""
generate slct = .
forvalues val = 1/5 {
local untreated`val' = _n`val'[_N]
if `untreated`val'' < . {
local untreat "`untreat', `untreated`val''"
replace slct = inlist(`untreated`val'', _n1, _n2, _n3, _n4, _n5)
}
}
We see that the last treated with id 4642 is matched with the non-treated id's (3777, 3776, 3775).
list bweight _* if inlist(_id, 4642 , 3777, 3776, 3775), noobs
+-----------------------------------------------------------------------------------------------------------------------------+ | bweight _pscore _treated _support _weight _bweight _id _n1 _n2 _n3 _n4 _n5 _nn _pdif | |-----------------------------------------------------------------------------------------------------------------------------| | 3204 0.708 Untreated On support .73333333 . 3775 . . . . . 0 . | | 3289 0.719 Untreated On support .53333333 . 3776 . . . . . 0 . | | 3740 0.720 Untreated On support .33333333 . 3777 . . . . . 0 . | | 2350 0.737 Treated On support 1 3411 4642 3777 3776 3775 . . 3 .01672489 | +-----------------------------------------------------------------------------------------------------------------------------+
The value of _bweight for the treated id is the mean of the non-treated values.
summarize bweight if inlist(_id , 3777, 3776, 3775)
Variable | Obs Mean Std. dev. Min Max -------------+--------------------------------------------------------- bweight | 3 3411 288.0746 3204 3740
The _weight values for the non-treated are:
mkmat _weight if inlist(_id `untreat'), rownames(_id )
matrix roweq _weight = "non-treated id's"
_weight
------------------ ------ --------- non-treated id's 3775 0.7333
3776 0.5333
3777 0.3333
To understand the weights for the non-treated one must look where they are matched.
list bweight _* if slct, noobs
+----------------------------------------------------------------------------------------------------------------------------+ | bweight _pscore _treated _support _weight _bweight _id _n1 _n2 _n3 _n4 _n5 _nn _pdif | |----------------------------------------------------------------------------------------------------------------------------| | 3280 0.704 Treated On support 1 3476.4 4640 3773 3774 3772 3775 3771 5 .00034734 | | 2722 0.706 Treated On support 1 3396.2 4641 3775 3774 3773 3772 3776 5 .00193231 | | 2350 0.737 Treated On support 1 3411 4642 3777 3776 3775 . . 3 .01672489 | +----------------------------------------------------------------------------------------------------------------------------+
Eg, the non-treated with id 3775 appear in the means for the treated with id's 4640 (n = 5), 4641 (n = 5) and 4642 (n = 3).
So the weight for the non-treated with id 3775 is:
$$\frac{1}{5} + \frac{1}{5} + \frac{1}{3} = \frac{11}{15} = 0.7333$$
The weight for the non-treated with id 3776 is:
$$\frac{1}{5} + \frac{1}{3} = \frac{8}{15} = 0.5333$$
Finally, the weight for the non-treated with id 3777 is:
$$\frac{1}{3} = 0.3333$$
Quite simple.
Reproducing the estimates and test
Again, the ATT can be reproduced using the command -ttest- (compare with ATT estimates from -psmatch2- above)
ttest bweight == _bweight if _nn, unpaired
Two-sample t test with equal variances ------------------------------------------------------------------------------ Variable | Obs Mean Std. err. Std. dev. [95% conf. interval] ---------+-------------------------------------------------------------------- bweight | 864 3137.66 19.08197 560.8931 3100.207 3175.112 _bweight | 864 3372.793 9.298175 273.3094 3354.543 3391.043 ---------+-------------------------------------------------------------------- Combined | 1,728 3255.226 10.98101 456.472 3233.689 3276.764 ---------+-------------------------------------------------------------------- diff | -235.1333 21.22681 -276.7663 -193.5003 ------------------------------------------------------------------------------ diff = mean(bweight) - mean(_bweight) t = -11.0772 H0: diff = 0 Degrees of freedom = 1726 Ha: diff < 0 Ha: diff != 0 Ha: diff > 0 Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
or regression with frequency weights (compare with ATT estimates from -psmatch2- above).
regress bweight i._treated [aw=_weight], noheader
(sum of wgt is 1,728) ------------------------------------------------------------------------------ bweight | Coefficient Std. err. t P>|t| [95% conf. interval] -------------+---------------------------------------------------------------- _treated | Treated | -235.133 21.242 -11.07 0.00 -276.785 -193.482 _cons | 3372.793 15.021 224.54 0.00 3343.341 3402.245 ------------------------------------------------------------------------------
Last update: 2022-04-22, Stata version 17