Returned variables from -psmatch2-

Aim

Here, the convenience variables generated by the Stata command psmatch2 are explained using the example dataset cattaneo2.

At stephenporter.org an unknown international expert on matching is quoted for:

    I can never figure out what psmatch2 is doing, and that’s actually one of 
    the reasons I don’t use it much and don’t tend to recommend it. 
    The documentation isn’t very clear and there are a number of things where 
    I think it should be doing something but then it’s not clear that it is.

This leads Stephen Porter to conclude:

    Not exactly a compelling endorsement. 
    This is why I now recommend to my students that they avoid psmatch2: 
    how can you justify using a software procedure if you have no idea what it is doing? 
    Especially considering so many other procedures available for Stata, as well as matchit in R.

The main problem for Stephen Porter is the deriving of the weights, _weight.

It turns out that -psmatch2- quite elegant transforms trimming and caliper matching into simple weights.

From Stata version 13, there has been -teffects- in [TE] and
the standard errors for ATT are conservative Note as written in the link above, that there now is an option ai(#) or altvariance which makes -psmatch2- return the Abadie-Imbens standard errors.

The beauty of -psmatch2- is that the returned variables makes the user a lot freer in the following analysis. Hence, it is still interesting to use -psmatch2-.

The -psmatch2- help file

From psmatch2 version 4.0.12 30jan2016 E. Leuven, B. Sianesi, the help file says:

    Convenience variables in Stata’s -psmatch2- (Caliper matching)

    -psmatch2- creates a number of variables for the convenience of the user:

            *_treated* is a variable that equals 0 for control observations and 1 for 
            treatment observations.

            *_support* is an indicator variable with equals 1 if the observation is 
            on the common support and 0 if the observation is off the support.

            *_pscore* is the estimated propensity score or a copy of the one 
            provided by pscore().

            *_outcome_variable* for every treatment observation stores the value of 
            the matched outcome.

            *_weight*. For nearest neighbor matching, it holds the frequency with 
            which the observation is used as a match; 
            with option ties and k-nearest neighbors matching it holds the normalized
            weight; 
            for kernel matching, and llr matching with a weight other than 
            stata's tricube, it stores the overall weight given to the matched 
            observation. 
            When estimating att only *_weight* = 1 for the treated.

            *_id* In the case of one-to-one and nearest-neighbors matching, a new 
            identifier created for all observations.

            *_nk* In the case of one-to-one and nearest-neighbors matching, for every 
            treatment observation, it stores the observation number of the k-th 
            matched control observation. 
            Do not forget to sort by _id if you want to use the observation 
            number (id) of for example the 1st nearest neighbor as in

                    sort _id
                    g x_of_match = x[_n1]

            *_nn* In the case of nearest-neighbors matching, for every treatment 
            observation, it stores the number of matched control observations.

This quoted version of the -psmatch2- help comes after the Porter post.

It might be that the help for -psmatch2- has been modified after the post.

With respect to the points mentioned in the Porter post, I think that the -psmatch2- help is quite clear.

What is returned from -psmatch2-

We now look into what is returned from -psmatch2- using the same data as Stephen Porter does. Propensities are added.

First we look into what is returned in the one-to-one case, then the one-to-many. In both cases, estimates and comparison tests are reported.

The Stephen Porter example data

use http://www.stata-press.com/data/r13/cattaneo2, clear

Creating the propensities

logit mbsmoke prenatal1 fbaby mmarried medu fedu mage fage mrace frace 
predict p

The returned variables from -psmatch2-, the one to one case

Using the propensities in the psmatch2 command

summarize p
display `=0.25*r(sd)'

.0295236
psmatch2 mbsmoke, outcome(bweight) pscore(p) neighbor(1) common ///
        caliper(`=0.25*r(sd)') noreplacement logit

----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
         bweight  Unmatched | 3137.65972   3412.91159  -275.251871   21.4528037   -12.83
                        ATT | 3136.60303   3381.92899   -245.32596   27.3360463    -8.97
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

 psmatch2: |   psmatch2: Common
 Treatment |        support
assignment | Off suppo  On suppor |     Total
-----------+----------------------+----------
 Untreated |         0      3,778 |     3,778 
   Treated |         5        859 |       864 
-----------+----------------------+----------
     Total |         5      4,637 |     4,642 
label variable _treated "Treatment"

For the description of the convenience variables only the outcome, the exposure and the propensity score is needed.

keep bweight mbsmoke p _*
format p _pscore %6.3f

Five non-treated rows

preserve
keep if !_treated

(864 observations deleted)
list in 1/5, noobs

  +----------------------------------------------------------------------------------------------------------------+
  | bweight     mbsmoke       p   _pscore    _treated     _support   _weight   _bweight    _id   _n1   _nn   _pdif |
  |----------------------------------------------------------------------------------------------------------------|
  |    3459   nonsmoker   0.120     0.120   Untreated   On support         .          .   1674     .     0       . |
  |    3260   nonsmoker   0.409     0.409   Untreated   On support         1          .   3606     .     0       . |
  |    3572   nonsmoker   0.239     0.239   Untreated   On support         1          .   2980     .     0       . |
  |    2948   nonsmoker   0.168     0.168   Untreated   On support         1          .   2546     .     0       . |
  |    2410   nonsmoker   0.099     0.099   Untreated   On support         .          .    983     .     0       . |
  +----------------------------------------------------------------------------------------------------------------+
restore

Five treated rows

preserve
keep if _treated

(3,778 observations deleted)
list in 1/5, noobs

  +------------------------------------------------------------------------------------------------------------------+
  | bweight   mbsmoke       p   _pscore   _treated     _support   _weight   _bweight    _id    _n1   _nn       _pdif |
  |------------------------------------------------------------------------------------------------------------------|
  |    3090    smoker   0.099     0.099    Treated   On support         1       2693   3829   1015     1   9.847e-06 |
  |    3166    smoker   0.293     0.293    Treated   On support         1       3657   4342   3271     1   .00026615 |
  |    3147    smoker   0.157     0.157    Treated   On support         1       2595   3992   2153     1   .00022986 |
  |    2580    smoker   0.433     0.433    Treated   On support         1       3459   4546   3651     1    .0004093 |
  |    2693    smoker   0.375     0.375    Treated   On support         1       3345   4468   3536     1   .00032921 |
  +------------------------------------------------------------------------------------------------------------------+
restore

_id

The variable _id a new identifier created for all observations. The values are row numbers.

_nn

The variable _nn counts the number of neighbors found. Note variable _nn can be zero when no neighbors are found.

_n#

The variable _n1 is the _id for the first neighbor etc. Note variable _n# can be missing when no neighbors are found.

_treated

This is a copy of the treatment variable

compare _treated mbsmoke

                                        ---------- Difference ----------
                            Count       Minimum      Average     Maximum
------------------------------------------------------------------------
_treated=mbsmoke             4642
                       ----------
Jointly defined              4642             0            0           0
                       ----------
Total                        4642

_support

The non-treated are always having _support equal to one.

Depending on choices in eg options caliper and trim the treated can be in common support (_support equal one) or not (_support equal zero).

Note that _support also is zero if no match is found or already taken (_n1 is missing and _nn is zero).

list _* if _treated & !_nn, noobs

  +----------------------------------------------------------------------------------+
  | _pscore   _treated      _support   _weight   _bweight    _id   _n1   _nn   _pdif |
  |----------------------------------------------------------------------------------|
  |   0.577    Treated   Off support         .          .   4627     .     0       . |
  |   0.577    Treated   Off support         .          .   4625     .     0       . |
  |   0.577    Treated   Off support         .          .   4626     .     0       . |
  |   0.658    Treated   Off support         .          .   4635     .     0       . |
  |   0.648    Treated   Off support         .          .   4634     .     0       . |
  +----------------------------------------------------------------------------------+

_pscore

This is the same as the variable containing the propensity score.

compare _pscore p

                                        ---------- Difference ----------
                            Count       Minimum      Average     Maximum
------------------------------------------------------------------------
_pscore=p                    4642
                       ----------
Jointly defined              4642             0            0           0
                       ----------
Total                        4642

_bweight

When there is only one matched value it is given in the variable _outcome, here _bweight. To see how _bweight is the matched value of bweight, the rows must be sorted _id. Then the matched value can be found using the variable _n1.

sort _id
generate matched_bweight = bweight[_n1]
compare matched_bweight _bweight

                                        ---------- Difference ----------
                            Count       Minimum      Average     Maximum
------------------------------------------------------------------------
matched~t=_bweight            859
                       ----------
Jointly defined               859             0            0           0
Jointly missing              3783
                       ----------
Total                        4642
drop matched_bweight

_weight

The treated always get value one if the variable _support is one, ie that the row is in the common support. When there is no replacement, the matched non-treated row also get the value 1. If the matching is with replacement, the value of _weight is the number of times the non-treated value is matched to a treated value.

The weights sums up to the same number (the number of treated on support) for non-treated and treated.

sumat _weight, statistics(sum) rowby(_treated) decimals(0)


-------------------------
                      sum
-------------------------
Treatment(Untreated)  859
Treatment(Treated)    859
-------------------------

_pdif

The variable _pdif is the absolute difference between the propensity scores for the treated and non-treated.

sort _id
generate double tmpdif = abs(_pscore - _pscore[_n1])
compare tmpdif _pdif

                                        ---------- Difference ----------
                            Count       Minimum      Average     Maximum
------------------------------------------------------------------------
tmpdif=_pdif                  859
                       ----------
Jointly defined               859             0            0           0
Jointly missing              3783
                       ----------
Total                        4642
drop tmpdif

Reproducing the estimates and test

The ATT can be reproduced using the command -ttest- (compare with ATT estimates from -psmatch2- above)

ttest bweight == _bweight if _nn, unpaired

Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
 bweight |     859    3136.603    19.15138     561.302    3099.014    3174.192
_bweight |     859    3381.929      19.506    571.6957    3343.644    3420.214
---------+--------------------------------------------------------------------
Combined |   1,718    3259.266    13.98103    579.4963    3231.844    3286.688
---------+--------------------------------------------------------------------
    diff |            -245.326    27.33605               -298.9414   -191.7105
------------------------------------------------------------------------------
    diff = mean(bweight) - mean(_bweight)                         t =  -8.9744
H0: diff = 0                                     Degrees of freedom =     1716

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

or regression with frequency weights (compare with ATT estimates from -psmatch2- above).

regress bweight i._treated [aw=_weight], noheader

(sum of wgt is 1,718)
------------------------------------------------------------------------------
     bweight | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    _treated |
    Treated  |   -245.326     27.336    -8.97    0.00     -298.941    -191.710
       _cons |   3381.929     19.330   174.96    0.00     3344.017    3419.841
------------------------------------------------------------------------------

The variable _weight from -psmatch2-, the one to many case

The data is (again)

use http://www.stata-press.com/data/r13/cattaneo2, clear

The scores are found the same way.

logit mbsmoke prenatal1 fbaby mmarried medu fedu mage fage mrace frace 
predict p

And -psmatch2- is called similar as above now requring 5 neighbors and trimming 5 percent in each end.

summarize p
display `=0.25*r(sd)'

.0295236
psmatch2 mbsmoke, outcome(bweight) pscore(p) neighbor(5) caliper(`=0.25*r(sd)') logit

----------------------------------------------------------------------------------------
        Variable     Sample |    Treated     Controls   Difference         S.E.   T-stat
----------------------------+-----------------------------------------------------------
         bweight  Unmatched | 3137.65972   3412.91159  -275.251871   21.4528037   -12.83
                        ATT | 3137.65972     3372.793  -235.133275    24.657814    -9.54
----------------------------+-----------------------------------------------------------
Note: S.E. does not take into account that the propensity score is estimated.

           | psmatch2:
 psmatch2: |   Common
 Treatment |  support
assignment | On suppor |     Total
-----------+-----------+----------
 Untreated |     3,778 |     3,778 
   Treated |       864 |       864 
-----------+-----------+----------
     Total |     4,642 |     4,642 
label variable _treated "Treatment"
keep bweight mbsmoke p _*
format p _pscore %6.3f

The sum of _weight is (again) the same for non-treated and treated, the number of treated in common support.

sumat _weight, statistics(sum) rowby(_treated) decimals(0)


-------------------------
                      sum
-------------------------
Treatment(Untreated)  864
Treatment(Treated)    864
-------------------------

To understand how the weights are calculated we look at the last of the treated.

sort _treated _id
local treated = _id[_N]

Then all the matched id's for the last treated is found and saved in the local untreat. A variable slct is generated to mark the rows where the matched id's appear.

local untreat ""
generate slct = .
forvalues val = 1/5 {
        local untreated`val' = _n`val'[_N]
        if `untreated`val'' < . {
                local untreat "`untreat', `untreated`val''"
                replace slct = inlist(`untreated`val'', _n1, _n2, _n3, _n4, _n5)
        }
}

We see that the last treated with id 4642 is matched with the non-treated id's (3777, 3776, 3775).

list bweight _* if inlist(_id, 4642 , 3777, 3776, 3775), noobs


  +-----------------------------------------------------------------------------------------------------------------------------+
  | bweight   _pscore    _treated     _support     _weight   _bweight    _id    _n1    _n2    _n3   _n4   _n5   _nn       _pdif |
  |-----------------------------------------------------------------------------------------------------------------------------|
  |    3204     0.708   Untreated   On support   .73333333          .   3775      .      .      .     .     .     0           . |
  |    3289     0.719   Untreated   On support   .53333333          .   3776      .      .      .     .     .     0           . |
  |    3740     0.720   Untreated   On support   .33333333          .   3777      .      .      .     .     .     0           . |
  |    2350     0.737     Treated   On support           1       3411   4642   3777   3776   3775     .     .     3   .01672489 |
  +-----------------------------------------------------------------------------------------------------------------------------+

The value of _bweight for the treated id is the mean of the non-treated values.

summarize bweight if inlist(_id , 3777, 3776, 3775)


    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
     bweight |          3        3411    288.0746       3204       3740

The _weight values for the non-treated are:

mkmat _weight if inlist(_id `untreat'), rownames(_id )
matrix roweq _weight = "non-treated id's" 

                              _weight

------------------ ------ --------- non-treated id's 3775 0.7333

                  3776         0.5333

                  3777         0.3333

To understand the weights for the non-treated one must look where they are matched.

list bweight _* if slct, noobs

  +----------------------------------------------------------------------------------------------------------------------------+
  | bweight   _pscore   _treated     _support   _weight   _bweight    _id    _n1    _n2    _n3    _n4    _n5   _nn       _pdif |
  |----------------------------------------------------------------------------------------------------------------------------|
  |    3280     0.704    Treated   On support         1     3476.4   4640   3773   3774   3772   3775   3771     5   .00034734 |
  |    2722     0.706    Treated   On support         1     3396.2   4641   3775   3774   3773   3772   3776     5   .00193231 |
  |    2350     0.737    Treated   On support         1       3411   4642   3777   3776   3775      .      .     3   .01672489 |
  +----------------------------------------------------------------------------------------------------------------------------+

Eg, the non-treated with id 3775 appear in the means for the treated with id's 4640 (n = 5), 4641 (n = 5) and 4642 (n = 3).

So the weight for the non-treated with id 3775 is:

$$\frac{1}{5} + \frac{1}{5} + \frac{1}{3} = \frac{11}{15} = 0.7333$$

The weight for the non-treated with id 3776 is:

$$\frac{1}{5} + \frac{1}{3} = \frac{8}{15} = 0.5333$$

Finally, the weight for the non-treated with id 3777 is:

$$\frac{1}{3} = 0.3333$$

Quite simple.

Reproducing the estimates and test

Again, the ATT can be reproduced using the command -ttest- (compare with ATT estimates from -psmatch2- above)

ttest bweight == _bweight if _nn, unpaired

Two-sample t test with equal variances
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
 bweight |     864     3137.66    19.08197    560.8931    3100.207    3175.112
_bweight |     864    3372.793    9.298175    273.3094    3354.543    3391.043
---------+--------------------------------------------------------------------
Combined |   1,728    3255.226    10.98101     456.472    3233.689    3276.764
---------+--------------------------------------------------------------------
    diff |           -235.1333    21.22681               -276.7663   -193.5003
------------------------------------------------------------------------------
    diff = mean(bweight) - mean(_bweight)                         t = -11.0772
H0: diff = 0                                     Degrees of freedom =     1726

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

or regression with frequency weights (compare with ATT estimates from -psmatch2- above).

regress bweight i._treated [aw=_weight], noheader

(sum of wgt is 1,728)
------------------------------------------------------------------------------
     bweight | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
    _treated |
    Treated  |   -235.133     21.242   -11.07    0.00     -276.785    -193.482
       _cons |   3372.793     15.021   224.54    0.00     3343.341    3402.245
------------------------------------------------------------------------------

The do file for this document

Last update: 2022-04-22, Stata version 17