/*
cls
cd "C:\Users\au24026\Documents\STATA\StataHacks\docs\Statistical Analysis\Prediction modelling in logistic regression"
do2md using icu_example
*/
//OFF
********************************************************************************
*** Code block *****************************************************************
********************************************************************************
capture program drop simple_logistic
program simple_logistic
syntax varlist(min=2 fv) [, style(passthru) decimals(passthru)]
quietly {
capture matrix drop _tbl
capture matrix drop tbl
logistic `1' , or
matrix tmp = r(table)
matrix tmp = tmp[1,1], tmp[5,1], tmp[6,1], ., e(ll), ., ., .
matrix rownames tmp = `1'
matrix tbl = tmp
forvalues v = 2/`=wordcount("`varlist'")' {
logistic `1' ``v'' , or
mata: simple_regression("tmp")
matrix _tbl = nullmat(_tbl) \ tmp
}
}
matrix tbl = _tbl \ tbl
matprint tbl, `style' `decimals'
end
capture mata mata drop simple_regression()
mata:
void simple_regression(string scalar matrixname)
{
real scalar r, R
tbl = st_matrix("r(table)")'[., (1,5,6,4)]
R = rows(tbl) - 1
tbl = tbl[1..R, .], J(R,1,( st_numscalar("e(ll)"),
st_numscalar("e(chi2)"),
st_numscalar("e(df_m)"),
st_numscalar("e(p)")
)
)
st_matrix(matrixname, tbl)
cn = J(cols(tbl), 2, "")
cn[.,2] = ("Estimate", "lb", "ub", "p", "Loglikelihood", "G", "f", "P")'
st_matrixcolstripe(matrixname, cn)
rn = st_matrixcolstripe("r(table)")[1..R,.]
for(r=1;r<=R;r++) {
if ( regexm(rn[r,2], "^([0-9]+)b?\.(.+)$") ) {
rn[r,2] = sprintf("%s(%s)", regexs(2), regexs(1))
}
rn[r,1] = ""
}
st_matrixrowstripe(matrixname, rn)
}
end
********************************************************************************
/*
run icu_example_generate_data.do
*/
capture graph drop *
//ON
/***
# Objective
To demonstrate how to build a prediction model in logistic regression using tools
available in Stata 12 and above.
# The data
The ICU data set consists of a sample of 200 subjects who were part of a much
larger study on survival of patients following admission to an adult intensive
care unit.
The major goal of this study was to develop a logistic regression model to
predict the probability of survival to hospital discharge of these patients and
to study the risk factors in this dataset associated with ICU mortality.
[The ICU dataset used in this document](icu.dta)
***/
/**/use icu, clear
/***
The dataset can be summarised as:
***/
/**/metadata *
/****/metadata *, style(html)
/***
# Screening for risk factors to use
In prediction modelling the objective is to obtain the most parsimonious and
yet best fitting model possible.
Here the outcome is dichotomous: Alive/Dead at hospital discharge.
At first a base of risk factors is chosen based on significant main effects.
Significance can be decided using e.g. a t-test or similar when the risk factor
is continuous and a two-sided chi-square test when the risk factor is
categorical, or by fitting simple logistic regressions.
The first can be done e.g. using the command -basetable- and the second by
running through a set of simple logistic regressions.
If necessary, risk factors can also be chosen for clinical reasons.
However, they will usually have no effect on the predictions if they aren't
significant in the final prediction model.
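As a sketch of the screening step using only built-in Stata commands (assuming,
as above, that sta is the outcome, age is a continuous risk factor and typ a
categorical one):
```
* continuous risk factor: t-test between the outcome groups
ttest age, by(sta)
* categorical risk factor: two-sided chi-square test
tabulate typ sta, chi2
* or a simple logistic regression per risk factor
logistic sta age, or
```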
## Using basetable
The command -basetable- can do comparisons between the outcome (Alive/Dead at
hospital discharge) and the risk factors.
When the risk factor is categorical a chi-square test is used, and since the
median is reported for continuous risk factors, a Kruskal-Wallis test is used.
***/
/**/basetable sta age(%6.1f) sex(male) race(c) can(yes) crn(yes) cpr(yes) sys(%6.1f) ///
hra(%6.1f) typ(emergency) po2(=<60) ph(=<7.25) pco(<=45) ///
locd(coma or deep stupor), continousreport(iqi)
/****/basetable sta age(%6.1f) sex(male) race(c) can(yes) crn(yes) cpr(yes) sys(%6.1f) ///
hra(%6.1f) typ(emergency) po2(=<60) ph(=<7.25) pco(<=45) ///
locd(coma or deep stupor), continousreport(iqi) style(html)
/***
Significant risk factors are here age, cpr, sys, typ and locd.
## Estimating crude main effects
Crude main effects can be estimated by repeatedly using -logistic- (or -logit-), e.g. by
```
foreach var of varlist age-locd {
logistic sta `var', or
}
logistic sta, or
```
The last -logistic- outside the foreach loop fits the constant-only model,
generating the crude estimate.
The summary below is inspired by a PowerPoint presentation by S. Lemeshow from 2015.
***/
/****/simple_logistic sta age sex i.race can crn cpr sys hra typ po2 ph pco locd ///
, style(html) decimals((3,3,3,3,3,3,0,3))
/***
The same risk factors as before (age, cpr, sys, typ and locd) are significant.
## Conclusion
**age**, **crn**, **cpr**, **sys**, **typ** and **locd** are included
due to statistical significance.
**can** and **pco** are included due to clinical significance.
# Fitting the multivariable model
## Fitting the main effects
Based on the fitting of the univariable models and on clinical considerations
we identified a set of potential variables for a multivariable model.
The results of fitting a model containing these variables follow:
***/
/**/logistic sta age i.crn i.cpr sys i.typ i.locd i.can i.pco, or
/**/estimates store first
/**/estout first, cells("b(fmt(2)) ci(fmt(2)) p(fmt(2))") varwidth(30) eform drop(0.* _cons)
//OFF
matrix coefs = r(coefs)
matrix colnames coefs = b "lower 95%ci" "upper 95%ci" p
//ON
/****/matprint coefs, style(html) d(2)
/***
We try fitting a new model dropping crn and cpr since they are highly
non-significant.
***/
/**/logistic sta age sys i.typ i.locd i.can i.pco, or
/**/estimates store second
/***
A likelihood ratio test comparing the first and second is performed:
***/
lrtest first second
/****/display "The p-value from the likelihood ratio test is " %4.2f `r(p)' "."
/***
So there is no significant difference between the two models,
and hence we choose the smaller one.
The variables can and locd are kept in the model due to clinical significance.
Below are the likelihood ratio tests for adding the remaining variables one at
a time.
The code used is:
```
capture matrix drop tbl
capture matrix drop row
quietly logistic sta age sys typ locd can pco, or
estimates store A
foreach var in sex i.race hra po2 ph {
quietly logistic sta `var' age sys typ locd can pco, or
estimates store B
lrtest B A
matrix row = r(chi2), r(df), r(p)
matrix rownames row = `=subinstr("`var'", "i.", "", .)'
matrix tbl = nullmat(tbl) \ row
}
matrix colnames tbl = "LR chi2" df P
```
And the results are:
***/
/*
/****/simple_logistic sta sex i.race hra po2 ph ///
, style(html) decimals((3,3,3,3,3,3,0,3)) ///
adjustments(age sys typ locd can pco)
*/
//OFF
capture matrix drop tbl
capture matrix drop row
quietly logistic sta age sys typ locd can pco, or
estimates store A
foreach var in sex i.race hra po2 ph {
quietly logistic sta `var' age sys typ locd can pco, or
estimates store B
lrtest B A
matrix row = r(chi2), r(df), r(p)
matrix rownames row = `=subinstr("`var'", "i.", "", .)'
matrix tbl = nullmat(tbl) \ row
}
matrix colnames tbl = "LR chi2" df P
//ON
/****/matprint tbl, style(html) decimals((2, 0, 2))
/***
The variable ph is significant and is hence added to the model.
## Assessing the scale of the continuous risk factors
Next the continuous risk factors age and sys are examined for non-linearity.
There are several methods for doing this:
* Quick and Dirty Methods
- Quartile design variables
- Testing for quadratic effect
* More computer intensive methods
- Smoothed Scatter Plot using -lowess-
- Fractional polynomials using -fp-
- Restricted cubic splines, -mkspline-
### The continuous risk factor age
#### Quartile design for age
The idea is to generate quartiles for the continuous risk factor and look at
the development of the odds ratios across the 4 groups.
The x-values are e.g. the mean or median of age in each group.
**In logistic regression a linear effect means the log odds are a linear
function of the risk factor (age). So the points (median age, log odds) should
appear on a line.**
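In formulas, a linear effect of age on the logit scale means
$$logit(\pi) = \ln\left(\frac{\pi}{1-\pi}\right) = \beta_0 + \beta_1 \cdot age$$
so plotting the empirical log odds against the median age of each group should
give a straight line.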
The recipe is:
* Generate group variable by -xtile-
***/
/**/xtile age_grp = age, n(4)
/***
* Calculate and save log odds
***/
/**/quietly logistic sta i.age_grp sys i.typ i.locd i.can i.pco i.ph
/**/matrix logodds = r(table)'
/**/matrix logodds = logodds[1..4, "b"]
/***
* Calculate and save the median (or mean) age
***/
/**/sumat age, stat(p50) rowby(age_grp)
/**/matrix median_age = r(sumat)
/**/matrix tbl = median_age, logodds
/***
* Convert median age and log odds to variables to plot
***/
/**/svmat tbl, names(col)
/**/twoway (connected b p50, sort) (lfit b p50), name(age_quartile_design, replace)
//OFF
drop b p50
graph export age_quartile_design.png, replace width(1600) height(1200)
//ON
/***
![Quartile design for age](age_quartile_design.png)
The (age, log odds) points appear to lie on a line.
#### Smoothed Scatter Plot for age
This analysis is for univariable models only, so no adjustments are possible.
The deprecated command -ksm- has been replaced by the command -lowess-.
***/
/**/lowess sta age, logit name(age_smothed_scatter_plot, replace)
//OFF
graph export age_smothed_scatter_plot.png, replace width(1600) height(1200)
//ON
/***
![Smoothed Scatter Plot for age](age_smothed_scatter_plot.png)
Here, too, a monotonic/linear progression is present.
#### Conclusion for age
The effect of risk factor age is assumed linear.
### The continuous risk factor sys
***/
//OFF
xtile sys_grp = sys, n(4)
quietly logistic sta i.sys_grp age i.typ i.locd i.can i.pco i.ph
estimates store sys_grp
matrix logodds = r(table)'
matrix logodds = logodds[1..4, "b"]
sumat sys, stat(p50) rowby(sys_grp)
matrix median_sys = r(sumat)
matrix tbl = median_sys, logodds
svmat tbl, names(col)
twoway (connected b p50, sort) (lfit b p50 if p50 < 160), name(sys_quartile_design, replace)
graph export sys_quartile_design.png, replace width(1600) height(1200)
//ON
/***
#### Quartile design for sys
The plot below indicates non-linearity.
![Quartile design for sys](sys_quartile_design.png)
However, looking at the odds ratios for the quartiles, it might be only the
last quartile that is significant.
Looking at the estimates it seems fair to assume that only the fourth quartile
is necessary in the model:
***/
/**/estout sys_grp, cells("b(fmt(2)) ci(fmt(2)) p(fmt(2))") varwidth(30) ///
eform noabbrev keep(*.sys*)
//OFF
matrix coefs = r(coefs)
matrix colnames coefs = b "lower 95%ci" "upper 95%ci" p
//ON
/****/matprint coefs, style(html) d(2)
/***
This is confirmed below:
***/
/**/quietly logistic sta i.sys_grp age i.typ i.locd i.can i.pco i.ph
/**/estimates store sys_grp_all
/**/quietly logistic sta 4.sys_grp age i.typ i.locd i.can i.pco i.ph
/**/estimates store sys_grp_4
/**/estadd lrtest sys_grp_all
/**/estout sys_grp_all sys_grp_4, cells("b(fmt(2)) ci(fmt(2)) p(fmt(2))") ///
varwidth(30) noabbrev eform keep(*.sys*) stats(lrtest_p, labels("LR p"))
//OFF
matrix stats = r(stats) # (1,.,.,.)
matrix roweq stats = "sta"
matrix rownames stats = "LR p"
matrix coefs = r(coefs) \ stats
matrix colnames coefs = b "lower 95%ci" "upper 95%ci" p b "lower 95%ci" "upper 95%ci" p
//ON
/****/matprint coefs, style(html)
/***
One can choose to keep just the fourth quartile in the model to handle the
non-linearity.
#### Smoothed Scatter Plot for sys
***/
//OFF
lowess sta sys, logit name(sys_smothed_scatter_plot, replace)
graph export sys_smothed_scatter_plot.png, replace width(1600) height(1200)
//ON
/***
![Smoothed Scatter Plot for sys](sys_smothed_scatter_plot.png)
This does not appear to be linear, just as the quartile design plot indicated.
#### Quadratic effect for sys
Testing for a quadratic effect can be done simply by adding the squared risk
factor (c.sys#c.sys) to the model and performing a likelihood ratio test for
the squared term.
***/
/**/quietly logistic sta sys c.sys#c.sys age i.typ i.locd i.can i.pco i.ph
/**/estimates store sys_quadratic
/**/quietly logistic sta sys age i.typ i.locd i.can i.pco i.ph
/**/estimates store sys_linear
/**/estadd lrtest sys_quadratic
/**/estout sys_quadratic sys_linear, cells("b(fmt(2)) ci(fmt(2)) p(fmt(2))") ///
varwidth(30) noabbrev eform keep(*sys*) stats(lrtest_p, labels("LR P value"))
//OFF
matrix stats = r(stats) # (1,.,.,.)
matrix roweq stats = "sta"
matrix rownames stats = "LR p"
matrix coefs = r(coefs) \ stats
matrix colnames coefs = b "lower 95%ci" "upper 95%ci" p b "lower 95%ci" "upper 95%ci" p
//ON
/****/matprint coefs, style(html)
/***
There seems to be no quadratic effect.
#### Modelling non-linearity with cubic splines
The dependency on sys might not be quadratic.
More complex dependencies can be modelled by restricted cubic splines.
First the cubic splines are generated by -mkspline-.
Here sys is modelled by 4 joined cubic functions between the 5 knots.
Outside the first and last knot the effect of sys is assumed to be linear.
***/
mkspline sys_csp = sys, nknots(5) cubic
/***
Then risk factor sys is replaced by the 4 variables generated by -mkspline-.
However, the first variable is always the identity, so sys = sys_csp1.
Testing linearity is then a matter of testing that the other generated
variables equal 0.
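As an alternative to the likelihood ratio test used below, a Wald test of the
spline terms can be sketched (assuming the default -mkspline- stub names):
```
logistic sta sys_csp* age i.typ i.locd i.can i.pco i.ph
test sys_csp2 sys_csp3 sys_csp4
```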
***/
quietly logistic sta sys_csp* age i.typ i.locd i.can i.pco i.ph
estimates store sys_csp
quietly logistic sta sys_csp1 age i.typ i.locd i.can i.pco i.ph
estimates store sys_linear
/**/estadd lrtest sys_csp
/**/estout sys_csp sys_linear, cells("b(fmt(2)) ci(fmt(2)) p(fmt(2))") ///
varwidth(30) noabbrev eform keep(sys*) stats(lrtest_p, labels("LR p"))
//OFF
matrix stats = r(stats) # (1,.,.,.)
matrix roweq stats = "sta"
matrix rownames stats = "LR p"
matrix coefs = r(coefs) \ stats
matrix colnames coefs = b "lower 95%ci" "upper 95%ci" p ///
b "lower 95%ci" "upper 95%ci" p
//ON
/****/matprint coefs, style(html)
/****/display "The p-value from the likelihood ratio test is " %4.2f `r(p)' "."
/***
So there is no significant non linear effect according to this analysis.
#### Modelling non-linearity with fractional polynomials
In Stata there is a command, -fp-, for doing fractional polynomials all in one
step.
The idea is to generate an optimal set of transformed variables of sys
(default 2) in optimal powers.
The set of optimal variables replaces the original variable in the regression.
In the output there is a test of linearity:
the p-value for the test comparing each model with the lowest-deviance model
is given in the P(*) column.
```
fp <sys>, replace: logistic sta <sys> age i.typ i.locd i.can i.pco
```
***/
/***/fp <sys>, replace: logistic sta <sys> age i.typ i.locd i.can i.pco
/***
From P(*) in the fractional polynomial comparisons it is seen that sys can be
ignored entirely.
#### Conclusion for sys
There is no non-linear effect of sys, but there seems to be an effect when sys
is in the highest quartile.
## Conclusion
After defining a variable for the upper quartile of the systolic blood pressure:
***/
/**/generate sys_grp_4q = (sys_grp == 4) if !missing(sys_grp)
/**/label variable sys_grp_4q "4th quartile systolic BP"
/**/label define sys_grp_4q 0 "No" 1 "Yes"
/**/label values sys_grp_4q sys_grp_4q
/***
the final model is chosen to be
***/
/**/quietly logistic sta sys_grp_4q age i.typ i.locd i.can i.pco i.ph
/**/estimates store sys_grp_4
/**/estout sys_grp_4, cells("b(fmt(2)) ci(fmt(2)) p(fmt(2))") varwidth(30) ///
drop(0.*) eform noabbrev
//OFF
matrix coefs = r(coefs)
matrix colnames coefs = b "lower 95%ci" "upper 95%ci" p
//ON
/****/matprint coefs, style(html)
/***
# Examining the final fit of the model
## Pearson or Hosmer-Lemeshow goodness-of-fit test
The postestimation command -estat gof- performs the Hosmer-Lemeshow
goodness-of-fit test in Stata when the option group is set with a number.
A deprecated command doing the same is -lfit-.
***/
estat gof, table group(8)
/***
It is worth noting that the first 5 expected values in the event of death at
discharge from hospital are all below 5.
Usually in a chi-square test the expected values are required to be above 5.
If the option group is not set, -estat gof- performs the Pearson test.
***/
estat gof
/***
The Hosmer-Lemeshow goodness-of-fit test is criticised for being dependent on
the number of groups chosen for the test.
E.g. [Long & Freese](https://www.stata.com/bookstore/regression-models-categorical-dependent-variables/#)
show on page 223 an example of opposite conclusions depending on whether the
number of groups is 9 or 11 (model does not fit, p ~ 3%) in contrast to 10
(model fits, p ~ 8%).
For numbers of groups from 5 to 15, 5 groups returns a p-value of 25.7%
whereas 12 groups returns a p-value of 0.7%.
So there is quite a variation depending on the number of groups.
However, in this dataset the variation does not lead to different conclusions
with respect to fit.
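The code used is:
```
capture matrix drop gof
numlist "10(5)40"
local numlist `r(numlist)'
foreach n in `numlist' {
estat gof, group(`n')
matrix gof = nullmat(gof) \ (r(df), r(chi2), chi2tail(r(df), r(chi2)))
}
matrix roweq gof = Number of groups
matrix rownames gof = `numlist'
matrix colnames gof = df chi2 p
```
And the results are: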
***/
//OFF
capture matrix drop gof
numlist "10(5)40"
local numlist `r(numlist)'
foreach n in `numlist' {
estat gof, group(`n')
matrix gof = nullmat(gof) \ (r(df), r(chi2), chi2tail(r(df), r(chi2)))
}
matrix roweq gof = Number of groups
matrix rownames gof = `numlist'
matrix colnames gof = df chi2 p
//ON
/****/matprint gof, style(html) d((0,0,2)) title(Hosmer-Lemeshow goodness-of-fit ///
test depends on number of groups chosen)
/***
## Other measures of fit
In [Long & Freese](https://www.stata.com/bookstore/regression-models-categorical-dependent-variables/#)
a command -fitstat- is demonstrated.
It needs to be installed by `findit spost9` or `findit spost13` depending on
your Stata version.
As a guideline, all [R squared](https://stats.idre.ucla.edu/other/mult-pkg/faq/general/faq-what-are-pseudo-r-squareds/)
measures have the same interpretation: they lie between 0 and 1 and measure
the variation explained by the model.
The [Akaike information criterion (AIC)](https://en.wikipedia.org/wiki/Akaike_information_criterion)
and the [Bayesian information criterion (BIC)](https://en.wikipedia.org/wiki/Bayesian_information_criterion)
are both goodness-of-fit measures that penalise the number of parameters used
in the model; they are useful for comparing different models (the lowest value
being best).
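For a model with maximised log likelihood ln L, k estimated parameters and N
observations, the two criteria are:
$$AIC = -2\ln L + 2k \qquad BIC = -2\ln L + k\ln N$$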
***/
fitstat
/***
## Residual plots
The recipe below is inspired by [Dupont](http://www.cambridge.org/dk/academic/subjects/medicine/epidemiology-public-health-and-medical-statistics/statistical-modeling-biomedical-researchers-simple-introduction-analysis-complex-data-2nd-edition?format=HB&isbn=9780521849524#d3EVCLGKD1gBYQSf.97).
First restore the estimates for the selected model
***/
/**/estimates restore sys_grp_4
/***
* Numbering the covariate patterns
***/
/**/predict J, number
/**/label variable J "jth covariate pattern"
/***
* Estimate the pi's for the covariate patterns
***/
/**/predict p, p
/**/label variable p "Estimate of pi for the jth covariate pattern"
/***
* Calculate Pearson (standardized) residuals
***/
/**/predict rstandard, rstandard
/**/label variable rstandard "Pearson residuals"
/***
* Calculate the Delta chi-squared influence statistic
$$\Delta\chi_j^2 = \frac{Pearson \, residual_j^2}{1-leverage_j}$$
***/
/**/predict dx2, dx2
/**/label variable dx2 "Delta chi-squared influence"
/***
* Calculate the Delta-beta influence
***/
/**/predict dbeta, dbeta
/**/label variable dbeta "Delta-Beta influence"
/***
* For graph purposes, separate the Delta chi-squared depending on the sign of the Pearson residual
***/
/**/generate dx2_pos = dx2 if rstandard >= 0
/**/generate dx2_neg = dx2 if rstandard < 0
/***
* Do the graph
***/
/**/twoway ///
(scatter dx2_pos p [weight=dbeta], symbol(oh) mlwidth(thick) color(black)) ///
(scatter dx2_neg p [weight=dbeta], symbol(oh) mlwidth(thick) color(gs8)) ///
(scatter dx2_pos p [weight=dbeta] if dx2_pos > 3, color(black) mlabel(J) mlabposition(3) msymbol(none) mlabgap(medium) mcolor(black)) ///
(scatter dx2_neg p [weight=dbeta] if dx2_neg > 3, color(gs8) mlabel(J) mlabposition(3) msymbol(none) mlabgap(medium) mcolor(black)) ///
, yline(3.84) legend(order(1 "Positive" 2 "Negative")) ///
ytitle(Squared Std Pearson Residuals) ///
name(sqr_std_P_res, replace)
//OFF
graph export sqr_std_P_res.png, replace width(1600) height(1200)
//ON
/***
![Residual plot for logistic regression](sqr_std_P_res.png)
One could consider whether one or more of the influential covariate patterns
in the list below should be excluded.
Or maybe it is possible to identify covariate patterns that the model predicts
poorly.
***/
//OFF
preserve
//ON
/**/sort dx2
/**/keep sta sys_grp_4q age typ locd can pco ph dbeta dx2 p
list if dx2 > 3.84
//OFF
restore
//ON
/***
## Sensitivity and specificity
The command -estat classification- reports summary statistics such as the
classification table for the model.
A deprecated command doing the same is -lstat-.
***/
estat classification
/*
*** Final **********************************************************************
*lstat// lstat has been replaced with the estat classification command
estat classification
lsens
*estat classification, cutoff(0.25)
*lroc
fitstat
********************************************************************************
*** residual plot
********************************************************************************
*linktest?
*/