-basetable-
Description
In any empirical work, a description of the data used is necessary. In medical science, Table 1 is a standardized way of presenting data.
With the command -basetable-, it is easy to interactively build a summary of data in a Table 1 format.
The final layout can be styled in different formats: smcl (the default), csv, tex/latex, html, or markdown (pandoc version). The output can also be exported to Excel sheets.
Sometimes legislation does not allow reporting information on individuals. In that case, -basetable- offers the possibility of blurring the data. How this is done for continuous data is explained in Pseudo percentiles.
Installation
To install use the command: ssc install basetable
Demonstration
The dataset
We use the Stata example dataset mheart5.dta:
use https://www.stata-press.com/data/r17/mheart5.dta, clear
metadata
----------------------------------------------------------------------------------------------------------
Name    Index  Label                    Value Label Name  Format  Value Label Values    n  unique  missing
----------------------------------------------------------------------------------------------------------
attack      1  Outcome (heart attack)                     %9.0g                       154       2        0
smokes      2  Current smoker                             %9.0g                       154       2        0
age         3  Age, in years                              %9.0g                       142     142       12
bmi         4  Body mass index, kg/m^2                    %9.0g                       126     126       28
female      5  Gender                                     %9.0g                       154       2        0
hsgrad      6  High school graduate                       %9.0g                       154       2        0
----------------------------------------------------------------------------------------------------------
Using -basetable- for comparing groups
We want to compare the data by gender, i.e., by the variable female.
Categorical data
To start, just use the variable female as the argument:
basetable female
----------------------------------------------------------------
Columns by: Gender           0           1        Total  P-value
----------------------------------------------------------------
n (%)               116 (75.3)   38 (24.7)  154 (100.0)
----------------------------------------------------------------
The result is a table reporting the counts and percentages of males, females, and totals.
To add the categorical variable smokes (being a current smoker), write the variable name followed by, e.g., a "c" in parentheses:
basetable female smokes(c)
-------------------------------------------------------------------
Columns by: Gender              0           1        Total  P-value
-------------------------------------------------------------------
n (%)                  116 (75.3)   38 (24.7)  154 (100.0)
Current smoker, n (%)
  0                     69 (59.5)   21 (55.3)    90 (58.4)
  1                     47 (40.5)   17 (44.7)    64 (41.6)     0.65
-------------------------------------------------------------------
For each combination of female and smokes, and for the totals of smokes, the counts and column percentages are reported. A p-value from a Pearson chi-square test is added to the right. So far, this is quite similar to:
tab smokes female, col chi2
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

   Current |        Gender
    smoker |         0          1 |     Total
-----------+----------------------+----------
         0 |        69         21 |        90
           |     59.48      55.26 |     58.44
-----------+----------------------+----------
         1 |        47         17 |        64
           |     40.52      44.74 |     41.56
-----------+----------------------+----------
     Total |       116         38 |       154
           |    100.00     100.00 |    100.00

          Pearson chi2(1) =   0.2098   Pr = 0.647
The p-value from Fisher's exact test can be requested instead with the option exact(#), where # is a positive integer, just as in the -tab- command:
basetable female smokes(c), exact(1)
-------------------------------------------------------------------
Columns by: Gender              0           1        Total  P-value
-------------------------------------------------------------------
n (%)                  116 (75.3)   38 (24.7)  154 (100.0)
Current smoker, n (%)
  0                     69 (59.5)   21 (55.3)    90 (58.4)
  1                     47 (40.5)   17 (44.7)    64 (41.6)     0.71
-------------------------------------------------------------------
which is similar to
tab smokes female, col exact(1)
+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

   Current |        Gender
    smoker |         0          1 |     Total
-----------+----------------------+----------
         0 |        69         21 |        90
           |     59.48      55.26 |     58.44
-----------+----------------------+----------
         1 |        47         17 |        64
           |     40.52      44.74 |     41.56
-----------+----------------------+----------
     Total |       116         38 |       154
           |    100.00     100.00 |    100.00

           Fisher's exact =                 0.706
   1-sided Fisher's exact =                 0.392
Note that Fisher's exact test does not always lead to a result. If that happens, it sometimes helps to increase the integer, which allows the computation to use more memory.
If row percentages are preferred, just replace "c" with "r":
basetable female smokes(r)
-------------------------------------------------------------------
Columns by: Gender              0           1        Total  P-value
-------------------------------------------------------------------
n (%)                  116 (75.3)   38 (24.7)  154 (100.0)
Current smoker, n (%)
  0                     69 (76.7)   21 (23.3)   90 (100.0)
  1                     47 (73.4)   17 (26.6)   64 (100.0)     0.65
-------------------------------------------------------------------
One could say that one of the rows of smokes is redundant. To report only current smokers (1), insert "1" in the parentheses:
basetable female smokes(1)
-----------------------------------------------------------------------
Columns by: Gender                  0           1        Total  P-value
-----------------------------------------------------------------------
n (%)                      116 (75.3)   38 (24.7)  154 (100.0)
Current smoker (1), n (%)   47 (40.5)   17 (44.7)    64 (41.6)     0.65
-----------------------------------------------------------------------
Sometimes, e.g. when reporting adverse events, one would prefer to report n only
basetable female smokes(c), categoricalreport(n)
------------------------------------------------
Columns by: Gender      0      1  Total  P-value
------------------------------------------------
n                     116     38    154
Current smoker, n
  0                    69     21     90
  1                    47     17     64     0.65
------------------------------------------------
And some might prefer to report only percentages
basetable female smokes(c), ca(p)
------------------------------------------------
Columns by: Gender      0      1  Total  P-value
------------------------------------------------
%                    75.3   24.7  100.0
Current smoker, %
  0                  59.5   55.3   58.4
  1                  40.5   44.7   41.6     0.65
------------------------------------------------
In some cases, 95% confidence intervals for the proportions should be reported:
basetable female smokes(ci)
------------------------------------------------------------------------------------------------
Columns by: Gender                              0                  1              Total  P-value
------------------------------------------------------------------------------------------------
n (%)                                  116 (75.3)          38 (24.7)        154 (100.0)
Current smoker (0), % (95% CI)  59.5 (50.5; 68.4)  55.3 (39.5; 71.1)  58.4 (50.7; 66.2)
Current smoker (1), % (95% CI)  40.5 (31.6; 49.5)  44.7 (28.9; 60.5)  41.6 (33.8; 49.3)     0.65
------------------------------------------------------------------------------------------------
Or, to leave out redundant information, report only current smokers:
basetable female smokes(1, ci)
------------------------------------------------------------------------------------------------
Columns by: Gender                              0                  1              Total  P-value
------------------------------------------------------------------------------------------------
n (%)                                  116 (75.3)          38 (24.7)        154 (100.0)
Current smoker (1), % (95% CI)  40.5 (31.6; 49.5)  44.7 (28.9; 60.5)  41.6 (33.8; 49.3)     0.65
------------------------------------------------------------------------------------------------
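The intervals above can be cross-checked with Stata's built-in -ci proportions-. The sketch below assumes that -basetable- reports Wald (normal approximation) intervals; this is inferred from the numbers shown (e.g. 47/116 = 40.5% with limits 31.6% and 49.5%), not stated by the command's documentation. Note that -ci proportions- reports proportions, not percentages.

```stata
* Assumption: the intervals reported by -basetable- are Wald intervals
use https://www.stata-press.com/data/r17/mheart5.dta, clear

* Proportion of current smokers with a 95% Wald interval, by gender
ci proportions smokes if female == 0, wald
ci proportions smokes if female == 1, wald

* ... and for the total column
ci proportions smokes, wald
```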
The number of decimals can be changed for both percentages and p-values with the options pctformat and pvformat, respectively. The argument to each option is a Stata format. To report percentages with 2 decimals and p-values with 3, one would write:
basetable female smokes(c), pctformat(%6.2f) pvformat(%6.3f)
----------------------------------------------------------------------
Columns by: Gender               0            1         Total  P-value
----------------------------------------------------------------------
n (%)                  116 (75.32)   38 (24.68)  154 (100.00)
Current smoker, n (%)
  0                     69 (59.48)   21 (55.26)    90 (58.44)
  1                     47 (40.52)   17 (44.74)    64 (41.56)    0.647
----------------------------------------------------------------------
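As a reminder of how fixed Stata formats of the form %w.df behave (total width w, d digits after the decimal point), the following small sketch uses -display- on arbitrary illustrative numbers that are not taken from the data:

```stata
* Illustrative numbers only; the format controls width and decimals
display %6.1f 75.3247
display %6.2f 75.3247
display %6.3f 0.6468
```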
The p-value can also be placed at the top instead of at the bottom
basetable female smokes(c), pvformat(, top)
-------------------------------------------------------------------
Columns by: Gender              0           1        Total  P-value
-------------------------------------------------------------------
n (%)                  116 (75.3)   38 (24.7)  154 (100.0)
Current smoker, n (%)                                          0.65
  0                     69 (59.5)   21 (55.3)    90 (58.4)
  1                     47 (40.5)   17 (44.7)    64 (41.6)
-------------------------------------------------------------------
Continuous data
A continuous variable like age is added with a Stata format in parentheses. To report age values with one decimal, use e.g. the format "%6.1f". The default report is the mean and standard deviation.
When the mean is reported, the p-value is from an ANOVA test. Note that the (equal-variance) t-test and ANOVA return the same p-value when comparing two groups.
basetable female age(%6.1f)
------------------------------------------------------------------------
Columns by: Gender                  0            1        Total  P-value
------------------------------------------------------------------------
n (%)                      116 (75.3)    38 (24.7)  154 (100.0)
Age, in years, mean (sd)  56.0 (11.2)  57.8 (12.7)  56.4 (11.6)     0.42
------------------------------------------------------------------------
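The equivalence of ANOVA and the t-test for two groups can be checked with the built-in commands -oneway- and -ttest-. This is a sketch for comparison only; it does not show how -basetable- computes its p-value internally.

```stata
use https://www.stata-press.com/data/r17/mheart5.dta, clear

* One-way ANOVA of age by gender
oneway age female

* Two-sample t-test with equal variances (the default of -ttest-);
* its two-sided p-value equals the ANOVA p-value for two groups
ttest age, by(female)
```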
When the median is reported, the p-value is from a Kruskal-Wallis test comparing the ranks of the empirical distributions. Note that the asymptotic Mann-Whitney p-value and the Kruskal-Wallis p-value are the same when comparing two groups.
The median and the interquartile interval (what some call the interquartile range) can be reported by:
basetable female age(%6.1f, iqi)
---------------------------------------------------------------------------------------------
Columns by: Gender                           0                  1              Total  P-value
---------------------------------------------------------------------------------------------
n (%)                               116 (75.3)          38 (24.7)        154 (100.0)
Age, in years, median (iqi)  55.0 (49.5; 65.2)  57.6 (47.9; 68.7)  55.1 (48.3; 65.5)     0.66
---------------------------------------------------------------------------------------------
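The rank-based p-value can likewise be compared with the built-in commands -kwallis- and -ranksum-. Again a sketch for comparison, not a description of the internals of -basetable-:

```stata
use https://www.stata-press.com/data/r17/mheart5.dta, clear

* Kruskal-Wallis test of age across the groups defined by female
kwallis age, by(female)

* Wilcoxon rank-sum (Mann-Whitney) test for the same two groups
ranksum age, by(female)
```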
There are several possibilities for reporting continuous data:
- sd (mean and sd, default)
- ci (mean and 95% confidence interval)
- gci (geometric mean and confidence interval)
- pi (mean and prediction interval)
- iqr (median and interquartile range)
- iqi (median and interquartile interval)
- idr (median and interdecile range)
- idi (median and interdecile interval)
- imr (median and range)
- imi (median, min, and max)
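Most of the statistics behind these report types are also available from the built-in -tabstat-, which can serve as a rough cross-check. This is a sketch; the percentile definition used by -tabstat- may differ slightly from the one used by -basetable-.

```stata
use https://www.stata-press.com/data/r17/mheart5.dta, clear

* Mean, sd, quartiles, deciles and range of age by gender
tabstat age, by(female) ///
    statistics(mean sd p25 median p75 iqr p10 p90 min max range) ///
    columns(statistics) format(%6.1f)
```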
The default report for continuous data can be changed.
basetable female age(%6.1f), continuousreport(iqr)
---------------------------------------------------------------------------
Columns by: Gender                     0            1        Total  P-value
---------------------------------------------------------------------------
n (%)                         116 (75.3)    38 (24.7)  154 (100.0)
Age, in years, median (iqr)  55.0 (15.7)  57.6 (20.8)  55.1 (17.3)     0.66
---------------------------------------------------------------------------
Reporting more variables
-basetable- can have any combination of variables as arguments. It is also possible to use a varlist for variables that need the same appearance.
To simplify the use of varlists, consider using the commands -rename- group and -order-.
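As a sketch of how -order- can help: a range such as age-bmi refers to variables that are adjacent in the dataset, so variables needing the same report can be moved next to each other first. The particular ordering below and the use of the range smokes-hsgrad with (c) are illustrative assumptions, by analogy with age-bmi(%6.1f).

```stata
use https://www.stata-press.com/data/r17/mheart5.dta, clear

* Place the categorical variables next to each other, then the continuous
order female smokes hsgrad age bmi, after(attack)
describe, simple

* Assumed usage: one range per report type
basetable female smokes-hsgrad(c) age-bmi(%6.1f)
```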
The result could be:
basetable female smokes(c) age-bmi(%6.1f) hsgrad(c)
----------------------------------------------------------------------------------
Columns by: Gender                            0            1        Total  P-value
----------------------------------------------------------------------------------
n (%)                                116 (75.3)    38 (24.7)  154 (100.0)
Current smoker, n (%)
  0                                   69 (59.5)    21 (55.3)    90 (58.4)
  1                                   47 (40.5)    17 (44.7)    64 (41.6)     0.65
Age, in years, mean (sd)            56.0 (11.2)  57.8 (12.7)  56.4 (11.6)     0.42
Body mass index, kg/m^2, mean (sd)   25.2 (4.0)   25.2 (4.3)   25.2 (4.0)     0.95
High school graduate, n (%)
  0                                   29 (25.0)     9 (23.7)    38 (24.7)
  1                                   87 (75.0)    29 (76.3)   116 (75.3)     0.87
----------------------------------------------------------------------------------
However, the value labels are not set, which makes for a poor appearance. They can be set by:
label define female 0 "male" 1 "female"
label values female female
label define n_y 0 "no" 1 "yes"
label values attack smokes hsgrad n_y
After setting the value labels, the table looks like:
basetable female smokes(c) age-bmi(%6.1f) hsgrad(c)
----------------------------------------------------------------------------------
Columns by: Gender                         male       female        Total  P-value
----------------------------------------------------------------------------------
n (%)                                116 (75.3)    38 (24.7)  154 (100.0)
Current smoker, n (%)
  no                                  69 (59.5)    21 (55.3)    90 (58.4)
  yes                                 47 (40.5)    17 (44.7)    64 (41.6)     0.65
Age, in years, mean (sd)            56.0 (11.2)  57.8 (12.7)  56.4 (11.6)     0.42
Body mass index, kg/m^2, mean (sd)   25.2 (4.0)   25.2 (4.3)   25.2 (4.0)     0.95
High school graduate, n (%)
  no                                  29 (25.0)     9 (23.7)    38 (24.7)
  yes                                 87 (75.0)    29 (76.3)   116 (75.3)     0.87
----------------------------------------------------------------------------------
The total and the p-value columns can be removed by the options nototal and nopvalue, respectively. Also, a missing report can be added using the option missing:
basetable female smokes(c) age-bmi(%6.1f) hsgrad(c), nototal nopvalue missing
--------------------------------------------------------------------------------
Columns by: Gender                         male       female  Missings / N (Pct)
--------------------------------------------------------------------------------
n (%)                                116 (75.3)    38 (24.7)       0 / 154 (0.0)
Current smoker, n (%)
  no                                  69 (59.5)    21 (55.3)
  yes                                 47 (40.5)    17 (44.7)       0 / 154 (0.0)
Age, in years, mean (sd)            56.0 (11.2)  57.8 (12.7)      12 / 154 (7.8)
Body mass index, kg/m^2, mean (sd)   25.2 (4.0)   25.2 (4.3)     28 / 154 (18.2)
High school graduate, n (%)
  no                                  29 (25.0)     9 (23.7)
  yes                                 87 (75.0)    29 (76.3)       0 / 154 (0.0)
--------------------------------------------------------------------------------
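The missing counts in the last column can be cross-checked with built-in commands, for instance -misstable- and -count-. A minimal sketch:

```stata
use https://www.stata-press.com/data/r17/mheart5.dta, clear

* Summary of missing values for the two continuous variables
misstable summarize age bmi

* The same counts one variable at a time
count if missing(age)
count if missing(bmi)
```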
Conditioning
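The -basetable- report can e.g. be limited to the participants aged above 60. Note that Stata treats missing values as larger than any number, so the condition age > 60 is also true for observations with missing age unless !missing(age) is added.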
basetable female smokes(yes) bmi(%6.1f) if age > 60
-------------------------------------------------------------------------------
Columns by: Gender                        male      female       Total  P-value
-------------------------------------------------------------------------------
n (%)                                53 (77.9)   15 (22.1)  68 (100.0)
Current smoker (yes), n (%)          22 (41.5)    9 (60.0)   31 (45.6)     0.20
Body mass index, kg/m^2, mean (sd)  24.7 (4.1)  25.9 (3.3)  25.0 (3.9)     0.36
-------------------------------------------------------------------------------
But subtables by subconditions can also be inserted using titles in square brackets (# means add counts). Note that a condition can be added after a comma inside the square brackets. The condition is in scope until the next set of square brackets.
basetable female smokes(yes) bmi(%6.1f) [The elderly #, if age > 60] smokes(yes) bmi(%6.1f)
--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
The elderly
n (%)                                53 (77.9)   15 (22.1)   68 (100.0)
Current smoker (yes), n (%)          22 (41.5)    9 (60.0)    31 (45.6)     0.20
Body mass index, kg/m^2, mean (sd)  24.7 (4.1)  25.9 (3.3)   25.0 (3.9)     0.36
--------------------------------------------------------------------------------
Using subtables and subconditions can produce rather complex tables easily.
local to_see smokes(yes) bmi(%6.1f)
basetable female ///
[High school #, if hsgrad] `to_see' ///
[][No high school #, if !hsgrad] `to_see' ///
[][total #] `to_see', notopcount
--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
High school
n (%)                                87 (75.0)   29 (25.0)  116 (100.0)
Current smoker (yes), n (%)          33 (37.9)   14 (48.3)    47 (40.5)     0.33
Body mass index, kg/m^2, mean (sd)  25.1 (4.0)  25.2 (4.5)   25.1 (4.1)     0.93
No high school
n (%)                                29 (76.3)    9 (23.7)   38 (100.0)
Current smoker (yes), n (%)          14 (48.3)    3 (33.3)    17 (44.7)     0.43
Body mass index, kg/m^2, mean (sd)  25.6 (3.9)  25.2 (4.2)   25.5 (3.9)     0.80
total
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
--------------------------------------------------------------------------------
Styling tables
A -basetable- report can be styled in the markups smcl (the default), latex/tex, html, csv, or markdown (md). This makes it easy to integrate -basetable- reports into a final document in one of these styles using log2markup.
Styling in tex is done by:
basetable female smokes(yes) bmi(%6.1f), style(tex)
\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\hline
\hline
Columns by: Gender & male & female & Total & P-value \\
\hline
n (\%) & 116 (75.3) & 38 (24.7) & 154 (100.0) & \\
Current smoker (yes), n (\%) & 47 (40.5) & 17 (44.7) & 64 (41.6) & 0.65 \\
Body mass index, kg/m\^2, mean (sd) & 25.2 (4.0) & 25.2 (4.3) & 25.2 (4.0) & 0.95 \\
\hline
\hline
\end{tabular}
\end{table}
Exporting tables to Excel
The string matrix can also be exported to Excel (and then maybe copied into Word).
basetable female smokes(yes) bmi(%6.1f), toxl(tables, tbl1, replace)
--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
--------------------------------------------------------------------------------
Table saved in "tables.xlsx", in sheet "tbl1"...
The column widths in Excel can be reduced. Below, the first column has width 40 and the rest have width 15. The default setting for Excel column widths is (70, 20).
basetable female smokes(yes) bmi(%6.1f), toxl(tables, tbl2, replace, (40,15))
--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
--------------------------------------------------------------------------------
Table saved in "tables.xlsx", in sheet "tbl2"...
The saved Excel file can be seen here
Blurring data
When working on public registries, it is necessary to blur information on individuals. -basetable- offers an approach for blurring continuous and categorical data. For continuous data, this is done using pseudo percentiles with the options pseudo and small.
Numerically, there is little difference between report
basetable female age-bmi(%6.1f, iqi)
-------------------------------------------------------------------------------------------------------
Columns by: Gender                                  male             female              Total  P-value
-------------------------------------------------------------------------------------------------------
n (%)                                         116 (75.3)          38 (24.7)        154 (100.0)
Age, in years, median (iqi)            55.0 (49.5; 65.2)  57.6 (47.9; 68.7)  55.1 (48.3; 65.5)     0.66
Body mass index, kg/m^2, median (iqi)  24.9 (21.9; 27.6)  23.9 (22.5; 27.4)  24.7 (22.0; 27.6)     0.69
-------------------------------------------------------------------------------------------------------
and report
basetable female age-bmi(%6.1f, iqi), pseudo
-------------------------------------------------------------------------------------------------------
Columns by: Gender                                  male             female              Total  P-value
-------------------------------------------------------------------------------------------------------
n (%)                                         116 (75.3)          38 (24.7)        154 (100.0)
Age, in years, median (iqi)            55.0 (49.0; 65.1)  56.7 (47.8; 68.4)  55.1 (48.4; 65.7)     0.66
Body mass index, kg/m^2, median (iqi)  24.9 (21.9; 27.7)  24.3 (22.2; 27.8)  24.7 (22.0; 27.6)     0.69
-------------------------------------------------------------------------------------------------------
However, every value in the latter is a mean of 5 neighboring values and is hence acceptable to report according to e.g. Registry Denmark. The value 5 can be changed using the option small.
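The idea of a pseudo percentile can be sketched by hand for the median of age among males: sort the values and average the 5 observations centred on the median position. Which 5 neighbors -basetable- actually uses is an assumption here, so this illustrates the principle only and need not reproduce the reported value exactly.

```stata
* Sketch only; not the code used by -basetable-
use https://www.stata-press.com/data/r17/mheart5.dta, clear
keep if female == 0 & !missing(age)
sort age

* Assumed choice: the 5 sorted values centred on the median position
local mid = ceil(_N / 2)
local lo  = `mid' - 2
local hi  = `mid' + 2

* The mean of these 5 values is the pseudo median
summarize age in `lo'/`hi', meanonly
display %6.1f r(mean)
```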
For categorical variables, the option hidesmall sets counts less than 5 to 5 in counts and totals, displayed as "< 5" (and, e.g., as "< 10" for the total of two such cells). Percentages are set to missing. The value 5 can be changed using the option small.
To show this, a new dataset is used.
webuse lbw, clear
(Hosmer & Lemeshow data)
describe
Contains data from https://www.stata-press.com/data/r17/lbw.dta
 Observations:           189                  Hosmer & Lemeshow data
    Variables:            11                  15 Jan 2020 05:01
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
id              int     %8.0g                 Identification code
low             byte    %8.0g                 Birthweight<2500g
age             byte    %8.0g                 Age of mother
lwt             int     %8.0g                 Weight at last menstrual period
race            byte    %8.0g      race       Race
smoke           byte    %9.0g      smoke      Smoked during pregnancy
ptl             byte    %8.0g                 Premature labor history (count)
ht              byte    %8.0g                 Has history of hypertension
ui              byte    %8.0g                 Presence, uterine irritability
ftv             byte    %8.0g                 Number of visits to physician during 1st trimester
bwt             int     %8.0g                 Birthweight (grams)
-------------------------------------------------------------------------------
Sorted by:
Showing "Number of visits to physician during 1st trimester" by "Smoked during pregnancy"
basetable smoke ftv(c), hidesmall
-------------------------------------------------------------------------------------------------------
Columns by: Smoked during pregnancy                         Nonsmoker      Smoker        Total  P-value
-------------------------------------------------------------------------------------------------------
n (%)                                                      115 (60.8)   74 (39.2)  189 (100.0)
Number of visits to physician during 1st trimester, n (%)
  0                                                         55 (47.8)   45 (60.8)   100 (52.9)
  1                                                         35 (30.4)   12 (16.2)    47 (24.9)
  2                                                         19 (16.5)   11 (14.9)    30 (15.9)
  3                                                           < 5 (.)     < 5 (.)     < 10 (.)
  4                                                           < 5 (.)     < 5 (.)     < 10 (.)
  6                                                           0 (0.0)     < 5 (.)      < 5 (.)     0.16
-------------------------------------------------------------------------------------------------------
clearly demonstrates what is done.
Sometimes, it is too easy to find the missing value by sums and subtractions like below (1 = 189-(100+47+30+7+4)).
basetable _none ftv(c), hidesmall small(4)
----------------------------------------------------------------------
Variables                                                      Summary
----------------------------------------------------------------------
n (%)                                                      189 (100.0)
Number of visits to physician during 1st trimester, n (%)
  0                                                         100 (52.9)
  1                                                          47 (24.9)
  2                                                          30 (15.9)
  3                                                            7 (3.7)
  4                                                            4 (2.1)
  6                                                            < 4 (.)
----------------------------------------------------------------------
In that case, it is recommended to collapse cells, e.g. collapsing 3, 4, and 6 to "> 2".
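One way to collapse the cells is the built-in -recode- before calling -basetable-. In the sketch below, the new variable name ftv_grp and its value labels are hypothetical choices made for illustration.

```stata
webuse lbw, clear

* Collapse 3, 4 and 6 visits into one category labelled "> 2"
* (ftv_grp is a hypothetical name for the new variable)
recode ftv (0 = 0 "0") (1 = 1 "1") (2 = 2 "2") (3/max = 3 "> 2"), generate(ftv_grp)
label variable ftv_grp "Number of visits to physician during 1st trimester"

* Check the collapsed counts, then report them
tabulate ftv_grp
basetable _none ftv_grp(c), hidesmall small(4)
```

With 7, 4, and 1 observations in the collapsed levels, the new category holds 12 observations, so no cell falls below the threshold and nothing can be derived by subtraction.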
Last update: 2022-04-21, Stata version 17