How to hide small steps in a Kaplan-Meier plot

Results based on Danish national registries may not report groups of fewer than 5 persons. Because the individual steps in a Kaplan-Meier plot are often based on fewer than 5 persons, reporting Kaplan-Meier plots for small datasets is a problem.

Two solutions are presented here: report a lowess-smoothed version of the Kaplan-Meier curve, or build the Kaplan-Meier curve so that each step is based on at least 5 persons.

The example data

We use a classical Stata example dataset:


webuse drug2, clear
stset, clear

The variables are:


describe

Contains data from http://www.stata-press.com/data/r14/drug2.dta
  obs:            48                          Patient Survival in Drug Trial
 vars:             4                          3 Mar 2014 02:17
 size:           192                          
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
studytime       byte    %8.0g                 Months to death or end of exp.
died            byte    %8.0g                 1 if patient died
drug            byte    %8.0g                 Drug type (0=placebo)
age             byte    %8.0g                 Patient's age at start of exp.
------------------------------------------------------------------------------
Sorted by: 

The first rows of the data look like this (each row is a person):


list in 1/6, sepby(studytime) abbreviate(20)

     +-------------------------------+
     | studytime   died   drug   age |
     |-------------------------------|
  1. |         1      1      0    61 |
  2. |         1      1      0    65 |
     |-------------------------------|
  3. |         2      1      0    59 |
     |-------------------------------|
  4. |         3      1      0    52 |
     |-------------------------------|
  5. |         4      1      0    56 |
  6. |         4      1      0    67 |
     +-------------------------------+

Generating the data behind the Kaplan-Meier plots

First, sts generate is used to obtain the Kaplan-Meier survival probabilities. The failure probabilities are then calculated as 1 minus the survival probabilities.


stset studytime, failure(died) noshow

     failure event:  died != 0 & died < .
obs. time interval:  (0, studytime]
 exit on or before:  failure

------------------------------------------------------------------------------
         48  total observations
          0  exclusions
------------------------------------------------------------------------------
         48  observations remaining, representing
         31  failures in single-record/single-failure data
        744  total analysis time at risk and under observation
                                                at risk from t =         0
                                     earliest observed entry t =         0
                                          last observed exit t =        39

sts generate survival = s
generate failure = 1 - survival
label variable failure "KM failure"
format failure %6.2f

A lowess-smoothed twoway graph of failure against studytime is one way to report the Kaplan-Meier plot without showing the individual steps.
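The smoothed curve on its own could be drawn along the following lines. This is a minimal sketch, not part of the original do file: the setup lines are repeated so that the block runs on its own, and the graph name km_lowess and the axis titles are assumptions chosen for illustration. The bandwidth 0.8 is Stata's default for lowess and is written out only to show where it can be tuned.

```stata
* Setup repeated from above so that the block runs on its own
webuse drug2, clear
stset studytime, failure(died) noshow
sts generate survival = s
generate failure = 1 - survival

* Lowess-smoothed failure curve
* bwidth(0.8) is the default; smaller values follow the steps more closely
* km_lowess and the axis titles are illustrative choices
twoway (lowess failure studytime, bwidth(0.8) lcolor(blue)), ///
        ytitle("Failure probability") xtitle("Months") ///
        name(km_lowess, replace)
```

Lowess is a local regression without constraints, so the smoothed curve is not guaranteed to be monotone or to stay between 0 and 1. It is worth inspecting the result before reporting it.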

Setting the step size to at least 5

The variable n_prsns counts the number of persons at each time (variable studytime). The count is stored only in the last row for each time; all other rows are set to 0.


bysort studytime: generate n_prsns = cond(_n == _N, _N, 0)

To accumulate the number of persons over time, one can use relative references ([_n-1]) and the function cond. The running count restarts as soon as the previous row has reached 5 or more:


generate acc_prsns = n_prsns if _n == 1
replace acc_prsns = cond(acc_prsns[_n-1] < 5, n_prsns + acc_prsns[_n-1], n_prsns) if _n > 1
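To see how the two counters behave, the same commands can be run on a small toy dataset. The ten studytime values below are hypothetical and chosen only for illustration; they are not taken from drug2.

```stata
* Hypothetical toy data: one row per person
clear
input studytime
1
1
2
3
4
4
5
5
5
8
end

* Same commands as above
bysort studytime: generate n_prsns = cond(_n == _N, _N, 0)
generate acc_prsns = n_prsns if _n == 1
replace acc_prsns = cond(acc_prsns[_n-1] < 5, n_prsns + acc_prsns[_n-1], n_prsns) if _n > 1

* n_prsns   is 0 2 1 1 0 2 0 0 3 1
* acc_prsns is 0 2 3 4 4 6 0 0 3 4
list studytime n_prsns acc_prsns, sepby(studytime)
```

In this toy data the running count first reaches 5 or more in the last row of studytime 4, where it is 6. The count restarts in the next row, and the remaining 4 persons never add up to 5, so only that one row would qualify as a step.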

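Only the failure values based on at least 5 persons are kept. The first and last rows are then set to the smallest and largest retained values, so that the curve spans the full time range: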


generate failure2 = failure if acc_prsns > 4
quietly summarize failure2
replace failure2 = `r(min)' if _n == 1
replace failure2 = `r(max)' if _n == _N
label variable failure2 "KM failure with steps of at least 5"
format failure2 %6.2f
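Before reporting, it may be worth checking that every retained step really rests on at least 5 accumulated persons. The following is a sketch of such a check, not part of the original do file. The setup is repeated so that the block runs on its own, and the first and last rows are excluded from the check because they are filled in by hand.

```stata
* Setup repeated from the steps above so that the block runs on its own
webuse drug2, clear
stset studytime, failure(died) noshow
sts generate survival = s
generate failure = 1 - survival
bysort studytime: generate n_prsns = cond(_n == _N, _N, 0)
generate acc_prsns = n_prsns if _n == 1
replace acc_prsns = cond(acc_prsns[_n-1] < 5, n_prsns + acc_prsns[_n-1], n_prsns) if _n > 1
generate failure2 = failure if acc_prsns > 4
quietly summarize failure2
replace failure2 = `r(min)' if _n == 1
replace failure2 = `r(max)' if _n == _N

* Every interior step must rest on at least 5 accumulated persons;
* assert stops with an error if any row violates this
assert acc_prsns >= 5 if !missing(failure2) & _n > 1 & _n < _N

* Show the steps that will appear in the plot
list studytime acc_prsns failure2 if !missing(failure2), noobs
```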

A graph comparison

Finally, the classical Kaplan-Meier curve, the lowess-smoothed version, and the Kaplan-Meier curve based on steps of at least 5 persons are compared in one graph:


twoway ///
        (line failure studytime, lcolor(black) connect(stairstep)) ///
        (lowess failure studytime, lcolor(blue) ) ///
        (line failure2 studytime, lcolor(red) connect(stairstep)) ///
        , legend(on position(5) ring(0)  cols(1) ///
                order(1 "Kaplan-Meier" 2 "Kaplan-Meier lowess" 3 "Kaplan-Meier steps of at least 5") ///
                ) ///
        name(km, replace)


The do file for this document

Last update: 2017-06-03, Stata version 14.2