-basetable-

Description

Any empirical work needs a description of the data used. In medical science, Table 1 is a standardized way of presenting the data.

With the command -basetable-, it is easy to interactively build a summary of the data in a Table 1 format.

The final layout can be styled in different formats: smcl (the default), csv, tex/latex, html, or markdown (pandoc version). The output can also be exported to Excel sheets.

Sometimes legislation does not allow information on individuals to be reported. In that case, -basetable- offers the possibility of blurring the data. How this is done for continuous data is explained in Pseudo percentiles.

Installation

To install use the command: ssc install basetable

Demonstration

The dataset

We use the Stata example dataset mheart5.dta:

use https://www.stata-press.com/data/r17/mheart5.dta, clear

metadata


----------------------------------------------------------------------------------------------------------
Name    Index  Label                    Value Label Name  Format  Value Label Values    n  unique  missing
----------------------------------------------------------------------------------------------------------
attack      1  Outcome (heart attack)                     %9.0g                       154       2        0
smokes      2  Current smoker                             %9.0g                       154       2        0
age         3  Age, in years                              %9.0g                       142     142       12
bmi         4  Body mass index, kg/m^2                    %9.0g                       126     126       28
female      5  Gender                                     %9.0g                       154       2        0
hsgrad      6  High school graduate                       %9.0g                       154       2        0
----------------------------------------------------------------------------------------------------------

Using -basetable- for comparing groups

We want to compare the data by gender, i.e., by the variable female.

Categorical data

To start, just use the variable female as the argument:

basetable female


---------------------------------------------------------------
Columns by: Gender           0          1        Total  P-value
---------------------------------------------------------------
n (%)               116 (75.3)  38 (24.7)  154 (100.0)         
---------------------------------------------------------------

The result is a table reporting the counts and percentages of males, females, and totals.

To add the categorical variable smokes (being a current smoker), just write the variable name followed by, e.g., a "c" in parentheses:

basetable female smokes(c)


------------------------------------------------------------------
Columns by: Gender              0          1        Total  P-value
------------------------------------------------------------------
n (%)                  116 (75.3)  38 (24.7)  154 (100.0)         
Current smoker, n (%)                                             
  0                     69 (59.5)  21 (55.3)    90 (58.4)         
  1                     47 (40.5)  17 (44.7)    64 (41.6)     0.65
------------------------------------------------------------------

For each combination of female and smokes, and for the totals of smokes, the counts and column percentages are reported. To the right, a p-value from a Pearson chi-square test is added. So far, this is quite similar to:

tab smokes female, col chi2

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

   Current |        Gender
    smoker |         0          1 |     Total
-----------+----------------------+----------
         0 |        69         21 |        90 
           |     59.48      55.26 |     58.44 
-----------+----------------------+----------
         1 |        47         17 |        64 
           |     40.52      44.74 |     41.56 
-----------+----------------------+----------
     Total |       116         38 |       154 
           |    100.00     100.00 |    100.00 

          Pearson chi2(1) =   0.2098   Pr = 0.647

It is possible to get the p-value from Fisher's exact test instead by using the option exact(#), where # is a positive integer, just as in the -tab- command:

basetable female smokes(c), exact(1)


------------------------------------------------------------------
Columns by: Gender              0          1        Total  P-value
------------------------------------------------------------------
n (%)                  116 (75.3)  38 (24.7)  154 (100.0)         
Current smoker, n (%)                                             
  0                     69 (59.5)  21 (55.3)    90 (58.4)         
  1                     47 (40.5)  17 (44.7)    64 (41.6)     0.71
------------------------------------------------------------------

which is similar to

tab smokes female, col exact(1)

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

   Current |        Gender
    smoker |         0          1 |     Total
-----------+----------------------+----------
         0 |        69         21 |        90 
           |     59.48      55.26 |     58.44 
-----------+----------------------+----------
         1 |        47         17 |        64 
           |     40.52      44.74 |     41.56 
-----------+----------------------+----------
     Total |       116         38 |       154 
           |    100.00     100.00 |    100.00 

           Fisher's exact =                 0.706
   1-sided Fisher's exact =                 0.392

Note that Fisher's exact test does not always lead to a result. If that happens, it sometimes helps to allocate more memory to the computation by increasing the integer.

If row percentages are preferred, just replace "c" with an "r":

basetable female smokes(r)


------------------------------------------------------------------
Columns by: Gender              0          1        Total  P-value
------------------------------------------------------------------
n (%)                  116 (75.3)  38 (24.7)  154 (100.0)         
Current smoker, n (%)                                             
  0                     69 (76.7)  21 (23.3)   90 (100.0)         
  1                     47 (73.4)  17 (26.6)   64 (100.0)     0.65
------------------------------------------------------------------

One could say that one of the rows of smokes is redundant. To report only current smokers (1), insert "1" in the parentheses:

basetable female smokes(1)


----------------------------------------------------------------------
Columns by: Gender                  0          1        Total  P-value
----------------------------------------------------------------------
n (%)                      116 (75.3)  38 (24.7)  154 (100.0)         
Current smoker (1), n (%)   47 (40.5)  17 (44.7)    64 (41.6)     0.65
----------------------------------------------------------------------

Sometimes, e.g. when reporting adverse events, one would prefer to report n only:

basetable female smokes(c), categoricalreport(n)


-------------------------------------------
Columns by: Gender    0   1  Total  P-value
-------------------------------------------
n                   116  38    154         
Current smoker, n                          
  0                  69  21     90         
  1                  47  17     64     0.65
-------------------------------------------

And some might prefer to report only percentages (here the option name categoricalreport is abbreviated to ca):

basetable female smokes(c), ca(p)


----------------------------------------------
Columns by: Gender     0     1  Total  P-value
----------------------------------------------
%                   75.3  24.7  100.0         
Current smoker, %                             
  0                 59.5  55.3   58.4         
  1                 40.5  44.7   41.6     0.65
----------------------------------------------

In some cases, 95% confidence intervals for the proportions are wanted:

basetable female smokes(ci)


------------------------------------------------------------------------------------------------
Columns by: Gender                              0                  1              Total  P-value
------------------------------------------------------------------------------------------------
n (%)                                  116 (75.3)          38 (24.7)        154 (100.0)         
Current smoker (0), % (95% CI)  59.5 (50.5; 68.4)  55.3 (39.5; 71.1)  58.4 (50.7; 66.2)         
Current smoker (1), % (95% CI)  40.5 (31.6; 49.5)  44.7 (28.9; 60.5)  41.6 (33.8; 49.3)     0.65
------------------------------------------------------------------------------------------------

Or, dropping the redundant information by reporting only current smokers:

basetable female smokes(1, ci)


------------------------------------------------------------------------------------------------
Columns by: Gender                              0                  1              Total  P-value
------------------------------------------------------------------------------------------------
n (%)                                  116 (75.3)          38 (24.7)        154 (100.0)         
Current smoker (1), % (95% CI)  40.5 (31.6; 49.5)  44.7 (28.9; 60.5)  41.6 (33.8; 49.3)     0.65
------------------------------------------------------------------------------------------------
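
The intervals in the table are consistent with a normal-approximation (Wald) interval for a proportion. This is inferred from the numbers shown, not from the -basetable- source, so the following is only a rough cross-check under that assumption. The scalar names are made up for the illustration; the counts (47 smokers among 116 in group 0) are taken from the table above.

```stata
* Assumed formula: p +/- z * sqrt(p*(1-p)/n), with z the 97.5% normal quantile
scalar prop    = 47/116
scalar se_prop = sqrt(prop*(1 - prop)/116)
scalar zcrit   = invnormal(0.975)
scalar ci_lo   = 100*(prop - zcrit*se_prop)
scalar ci_hi   = 100*(prop + zcrit*se_prop)
display %4.1f 100*prop " (" %4.1f ci_lo "; " %4.1f ci_hi ")"
```

Under this assumption the line displays 40.5 (31.6; 49.5), which agrees with the first column of the table.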

The number of decimals can be changed for both percentages and p-values with the options pctformat and pvformat, respectively. The argument to each option is a Stata format. To report percentages with 2 decimals and p-values with 3, one would write

basetable female smokes(c), pctformat(%6.2f) pvformat(%6.3f)


---------------------------------------------------------------------
Columns by: Gender               0           1         Total  P-value
---------------------------------------------------------------------
n (%)                  116 (75.32)  38 (24.68)  154 (100.00)         
Current smoker, n (%)                                                
  0                     69 (59.48)  21 (55.26)    90 (58.44)         
  1                     47 (40.52)  17 (44.74)    64 (41.56)    0.647
---------------------------------------------------------------------

The p-value can also be placed at the top instead of at the bottom

basetable female smokes(c), pvformat(, top)


------------------------------------------------------------------
Columns by: Gender              0          1        Total  P-value
------------------------------------------------------------------
n (%)                  116 (75.3)  38 (24.7)  154 (100.0)         
Current smoker, n (%)                                         0.65
  0                     69 (59.5)  21 (55.3)    90 (58.4)         
  1                     47 (40.5)  17 (44.7)    64 (41.6)         
------------------------------------------------------------------

Continuous data

A continuous variable like age is added with a Stata format in parentheses. To report age values with one decimal, use e.g. the format "%6.1f". The default report is the mean and standard deviation.

When the mean is reported, the p-value is from an ANOVA test. Note that the t-test (assuming equal variances) and ANOVA return the same p-value when comparing two groups.

basetable female age(%6.1f)


------------------------------------------------------------------------
Columns by: Gender                  0            1        Total  P-value
------------------------------------------------------------------------
n (%)                      116 (75.3)    38 (24.7)  154 (100.0)         
Age, in years, mean (sd)  56.0 (11.2)  57.8 (12.7)  56.4 (11.6)     0.42
------------------------------------------------------------------------
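
The equivalence of ANOVA and the equal-variance t-test for two groups can be checked with Stata's built-in commands. The following is a small sketch that is independent of -basetable-; the scalar names are arbitrary, and preserve/restore is used only so that the data in memory are left untouched.

```stata
preserve
use https://www.stata-press.com/data/r17/mheart5.dta, clear
* One-way ANOVA of age by gender; the p-value is derived from the stored F statistic
oneway age female
scalar p_anova = Ftail(r(df_m), r(df_r), r(F))
* Two-sample t-test with equal variances (the default of -ttest-)
ttest age, by(female)
scalar p_ttest = r(p)
display "ANOVA p = " %5.3f p_anova ", t-test p = " %5.3f p_ttest
restore
```

Both p-values should agree with each other and, after rounding, with the p-value in the table above.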

When the median is reported, the p-value is from a Kruskal-Wallis test comparing the ranks of the empirical distributions. Note that the asymptotic Mann-Whitney p-value and the Kruskal-Wallis p-value are the same when comparing two groups.

The median and the interquartile interval (what some call the interquartile range) can be reported by

basetable female age(%6.1f, iqi)


---------------------------------------------------------------------------------------------
Columns by: Gender                           0                  1              Total  P-value
---------------------------------------------------------------------------------------------
n (%)                               116 (75.3)          38 (24.7)        154 (100.0)         
Age, in years, median (iqi)  55.0 (49.5; 65.2)  57.6 (47.9; 68.7)  55.1 (48.3; 65.5)     0.66
---------------------------------------------------------------------------------------------
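
As a cross-check of the rank-based p-value, the two tests mentioned above can be run with built-in commands. This is a sketch that does not depend on -basetable-; the scalar names are arbitrary, and the asymptotic Mann-Whitney p-value is computed here from the stored z statistic.

```stata
preserve
use https://www.stata-press.com/data/r17/mheart5.dta, clear
* Kruskal-Wallis test of age by gender; the tie-adjusted chi-square is stored in r(chi2_adj)
kwallis age, by(female)
scalar p_kw = chi2tail(r(df), r(chi2_adj))
* Wilcoxon rank-sum (Mann-Whitney) test; asymptotic two-sided p-value from r(z)
ranksum age, by(female)
scalar p_mw = 2*normal(-abs(r(z)))
display "Kruskal-Wallis p = " %5.3f p_kw ", Mann-Whitney p = " %5.3f p_mw
restore
```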

There are several possibilities for reporting continuous data:

  • sd (mean and sd, default)
  • ci (mean and 95% confidence interval)
  • gci (geometric mean and confidence interval)
  • pi (mean and prediction interval)
  • iqr (median and interquartile range)
  • iqi (median and interquartile interval)
  • idr (median and interdecile range)
  • idi (median and interdecile interval)
  • imr (median and range)
  • imi (median, min, and max)

The default report for continuous data can be changed.

basetable female age(%6.1f), continuousreport(iqr)


---------------------------------------------------------------------------
Columns by: Gender                     0            1        Total  P-value
---------------------------------------------------------------------------
n (%)                         116 (75.3)    38 (24.7)  154 (100.0)         
Age, in years, median (iqr)  55.0 (15.7)  57.6 (20.8)  55.1 (17.3)     0.66
---------------------------------------------------------------------------

Reporting more variables

-basetable- can have any combination of variables as arguments. It is also possible to use a varlist for variables that need the same appearance.

To simplify the use of varlists, consider using the commands -rename- (see help rename group) and -order-.

The result could be:

basetable female smokes(c) age-bmi(%6.1f) hsgrad(c)


----------------------------------------------------------------------------------
Columns by: Gender                            0            1        Total  P-value
----------------------------------------------------------------------------------
n (%)                                116 (75.3)    38 (24.7)  154 (100.0)         
Current smoker, n (%)                                                             
  0                                   69 (59.5)    21 (55.3)    90 (58.4)         
  1                                   47 (40.5)    17 (44.7)    64 (41.6)     0.65
Age, in years, mean (sd)            56.0 (11.2)  57.8 (12.7)  56.4 (11.6)     0.42
Body mass index, kg/m^2, mean (sd)   25.2 (4.0)   25.2 (4.3)   25.2 (4.0)     0.95
High school graduate, n (%)                                                       
  0                                   29 (25.0)     9 (23.7)    38 (24.7)         
  1                                   87 (75.0)    29 (76.3)   116 (75.3)     0.87
----------------------------------------------------------------------------------

However, the value labels are not set, which makes for a poor appearance.

label define female 0 "male" 1 "female"
label values female female
label define n_y 0 "no" 1 "yes"
label values attack smokes hsgrad n_y

After setting the value labels, the table looks like:

basetable female smokes(c) age-bmi(%6.1f) hsgrad(c)


----------------------------------------------------------------------------------
Columns by: Gender                         male       female        Total  P-value
----------------------------------------------------------------------------------
n (%)                                116 (75.3)    38 (24.7)  154 (100.0)         
Current smoker, n (%)                                                             
  no                                  69 (59.5)    21 (55.3)    90 (58.4)         
  yes                                 47 (40.5)    17 (44.7)    64 (41.6)     0.65
Age, in years, mean (sd)            56.0 (11.2)  57.8 (12.7)  56.4 (11.6)     0.42
Body mass index, kg/m^2, mean (sd)   25.2 (4.0)   25.2 (4.3)   25.2 (4.0)     0.95
High school graduate, n (%)                                                       
  no                                  29 (25.0)     9 (23.7)    38 (24.7)         
  yes                                 87 (75.0)    29 (76.3)   116 (75.3)     0.87
----------------------------------------------------------------------------------

The total and the p-value columns can be removed by the options nototal and nopvalue, respectively. Also, a missing report can be added using the option missing:

basetable female smokes(c) age-bmi(%6.1f) hsgrad(c), nototal nopvalue missing


--------------------------------------------------------------------------------
Columns by: Gender                         male       female  Missings / N (Pct)
--------------------------------------------------------------------------------
n (%)                                116 (75.3)    38 (24.7)       0 / 154 (0.0)
Current smoker, n (%)                                                           
  no                                  69 (59.5)    21 (55.3)                    
  yes                                 47 (40.5)    17 (44.7)       0 / 154 (0.0)
Age, in years, mean (sd)            56.0 (11.2)  57.8 (12.7)      12 / 154 (7.8)
Body mass index, kg/m^2, mean (sd)   25.2 (4.0)   25.2 (4.3)     28 / 154 (18.2)
High school graduate, n (%)                                                     
  no                                  29 (25.0)     9 (23.7)                    
  yes                                 87 (75.0)    29 (76.3)       0 / 154 (0.0)
--------------------------------------------------------------------------------

Conditioning

The -basetable- report can, e.g., be limited to the participants aged above 60.

basetable female smokes(yes) bmi(%6.1f) if age > 60


-------------------------------------------------------------------------------
Columns by: Gender                        male      female       Total  P-value
-------------------------------------------------------------------------------
n (%)                                53 (77.9)   15 (22.1)  68 (100.0)         
Current smoker (yes), n (%)          22 (41.5)    9 (60.0)   31 (45.6)     0.20
Body mass index, kg/m^2, mean (sd)  24.7 (4.1)  25.9 (3.3)  25.0 (3.9)     0.36
-------------------------------------------------------------------------------
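
One caveat with a condition such as if age > 60: Stata treats a missing numeric value as larger than any number, so age > 60 is also true when age is missing, and age has missing values in this dataset. Whether that matters depends on the analysis. The sketch below uses only built-in commands to count the difference; the scalar names are arbitrary.

```stata
preserve
use https://www.stata-press.com/data/r17/mheart5.dta, clear
* Missing ages satisfy age > 60 because missing sorts above every number
count if age > 60
scalar n_loose = r(N)
* Excluding missing ages explicitly
count if age > 60 & !missing(age)
scalar n_strict = r(N)
scalar n_diff = n_loose - n_strict
display "Rows with missing age picked up by age > 60: " n_diff
restore
```

If the missing ages should be left out, the condition could be tightened to: basetable female smokes(yes) bmi(%6.1f) if age > 60 & !missing(age)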

Subtables by subconditions can also be inserted using titles in square brackets (# means that counts are added). Note that a condition can be added after a comma inside the square brackets. The condition stays in scope until the next set of square brackets.

basetable female smokes(yes) bmi(%6.1f) [The elderly #, if age > 60] smokes(yes) bmi(%6.1f)


--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)         
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
The elderly                                                                     
n (%)                                53 (77.9)   15 (22.1)   68 (100.0)         
Current smoker (yes), n (%)          22 (41.5)    9 (60.0)    31 (45.6)     0.20
Body mass index, kg/m^2, mean (sd)  24.7 (4.1)  25.9 (3.3)   25.0 (3.9)     0.36
--------------------------------------------------------------------------------

Subtables and subconditions make it easy to produce rather complex tables.

* Store the variable specification once and reuse it in every subtable
local to_see smokes(yes) bmi(%6.1f)
basetable female ///
  [High school #, if hsgrad] `to_see' ///
  [][No high school #, if !hsgrad] `to_see' ///
  [][total #] `to_see', notopcount


--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
High school                                                                     
n (%)                                87 (75.0)   29 (25.0)  116 (100.0)         
Current smoker (yes), n (%)          33 (37.9)   14 (48.3)    47 (40.5)     0.33
Body mass index, kg/m^2, mean (sd)  25.1 (4.0)  25.2 (4.5)   25.1 (4.1)     0.93

No high school                                                                  
n (%)                                29 (76.3)    9 (23.7)   38 (100.0)         
Current smoker (yes), n (%)          14 (48.3)    3 (33.3)    17 (44.7)     0.43
Body mass index, kg/m^2, mean (sd)  25.6 (3.9)  25.2 (4.2)   25.5 (3.9)     0.80

total                                                                           
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)         
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
--------------------------------------------------------------------------------

Styling tables

A -basetable- report can be styled in the markups smcl (the default), latex/tex, html, csv, or markdown (md). This makes it easy to integrate -basetable- reports into a final document in one of these styles using log2markup.

Styling in tex is done by:

basetable female smokes(yes) bmi(%6.1f), style(tex)

\begin{table}[h]
\centering
\begin{tabular}{lrrrr}
\hline
\hline
Columns by: Gender                 &       male &     female &       Total & P-value \\
\hline
n (\%)                              & 116 (75.3) &  38 (24.7) & 154 (100.0) &         \\
Current smoker (yes), n (\%)        &  47 (40.5) &  17 (44.7) &   64 (41.6) &    0.65 \\
Body mass index, kg/m\^2, mean (sd) & 25.2 (4.0) & 25.2 (4.3) &  25.2 (4.0) &    0.95 \\
\hline
\hline
\end{tabular}
\end{table}

Exporting tables to Excel

The resulting table can also be exported to Excel (and then perhaps copied into Word).

basetable female smokes(yes) bmi(%6.1f), toxl(tables, tbl1, replace)


--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)         
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
--------------------------------------------------------------------------------

Table saved in "tables.xlsx", in sheet "tbl1"... 

The column widths in Excel can be reduced. Below, the first column has width 40 and the rest have width 15. The default setting for Excel column widths is (70, 20).

basetable female smokes(yes) bmi(%6.1f), toxl(tables, tbl2, replace, (40,15))


--------------------------------------------------------------------------------
Columns by: Gender                        male      female        Total  P-value
--------------------------------------------------------------------------------
n (%)                               116 (75.3)   38 (24.7)  154 (100.0)         
Current smoker (yes), n (%)          47 (40.5)   17 (44.7)    64 (41.6)     0.65
Body mass index, kg/m^2, mean (sd)  25.2 (4.0)  25.2 (4.3)   25.2 (4.0)     0.95
--------------------------------------------------------------------------------

Table saved in "tables.xlsx", in sheet "tbl2"... 

The saved Excel file can be seen here

Blurring data

When working with public registries, it is necessary to blur information on individuals. -basetable- offers an approach for blurring continuous and categorical data. For continuous data, this is done with pseudo percentiles using the options pseudo and small.

Numerically, there is little difference between the report

basetable female age-bmi(%6.1f, iqi)


-------------------------------------------------------------------------------------------------------
Columns by: Gender                                  male             female              Total  P-value
-------------------------------------------------------------------------------------------------------
n (%)                                         116 (75.3)          38 (24.7)        154 (100.0)         
Age, in years, median (iqi)            55.0 (49.5; 65.2)  57.6 (47.9; 68.7)  55.1 (48.3; 65.5)     0.66
Body mass index, kg/m^2, median (iqi)  24.9 (21.9; 27.6)  23.9 (22.5; 27.4)  24.7 (22.0; 27.6)     0.69
-------------------------------------------------------------------------------------------------------

and report

basetable female age-bmi(%6.1f, iqi), pseudo


-------------------------------------------------------------------------------------------------------
Columns by: Gender                                  male             female              Total  P-value
-------------------------------------------------------------------------------------------------------
n (%)                                         116 (75.3)          38 (24.7)        154 (100.0)         
Age, in years, median (iqi)            55.0 (49.0; 65.1)  56.7 (47.8; 68.4)  55.1 (48.4; 65.7)     0.66
Body mass index, kg/m^2, median (iqi)  24.9 (21.9; 27.7)  24.3 (22.2; 27.8)  24.7 (22.0; 27.6)     0.69
-------------------------------------------------------------------------------------------------------

However, every value in the latter is a mean of 5 neighboring values and hence acceptable for reporting by, e.g., Registry Denmark. The value 5 can be changed using the option small.
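
To make the idea of a mean of neighboring values concrete, here is one possible reading of it for the median of age in group 0. This is only an illustration: the window (the 5 sorted values centered on the middle observation) and the scalar name are assumptions, and the algorithm actually used by -basetable- may differ in detail.

```stata
preserve
use https://www.stata-press.com/data/r17/mheart5.dta, clear
keep if female == 0 & !missing(age)
sort age
* Assumed window: the 5 sorted values centered on the middle observation
local mid = ceil(_N/2)
local lo  = `mid' - 2
local hi  = `mid' + 2
summarize age in `lo'/`hi', meanonly
scalar pseudo_median = r(mean)
display "Sketch of a pseudo median for age in group 0: " %6.1f pseudo_median
restore
```

The point of the construction is that the reported number is an average over several individuals rather than the value of a single one.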

For categorical variables, the option hidesmall sets counts less than 5 to 5 in counts and totals (shown as "< 5"). Percentages are set to missing. The value 5 can be changed using the option small.

To show this, a new dataset is used.

webuse lbw, clear

(Hosmer & Lemeshow data)
describe

Contains data from https://www.stata-press.com/data/r17/lbw.dta
 Observations:           189                  Hosmer & Lemeshow data
    Variables:            11                  15 Jan 2020 05:01
------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
------------------------------------------------------------------------------------------------
id              int     %8.0g                 Identification code
low             byte    %8.0g                 Birthweight<2500g
age             byte    %8.0g                 Age of mother
lwt             int     %8.0g                 Weight at last menstrual period
race            byte    %8.0g      race       Race
smoke           byte    %9.0g      smoke      Smoked during pregnancy
ptl             byte    %8.0g                 Premature labor history (count)
ht              byte    %8.0g                 Has history of hypertension
ui              byte    %8.0g                 Presence, uterine irritability
ftv             byte    %8.0g                 Number of visits to physician during 1st trimester
bwt             int     %8.0g                 Birthweight (grams)
------------------------------------------------------------------------------------------------
Sorted by: 

Showing "Number of visits to physician during 1st trimester" by "Smoked during pregnancy"

basetable smoke ftv(c), hidesmall


------------------------------------------------------------------------------------------------------
Columns by: Smoked during pregnancy                         Nonsmoker     Smoker        Total  P-value
------------------------------------------------------------------------------------------------------
n (%)                                                      115 (60.8)  74 (39.2)  189 (100.0)         
Number of visits to physician during 1st trimester, n (%)                                             
  0                                                         55 (47.8)  45 (60.8)   100 (52.9)         
  1                                                         35 (30.4)  12 (16.2)    47 (24.9)         
  2                                                         19 (16.5)  11 (14.9)    30 (15.9)         
  3                                                           < 5 (.)    < 5 (.)     < 10 (.)         
  4                                                           < 5 (.)    < 5 (.)     < 10 (.)         
  6                                                           0 (0.0)    < 5 (.)      < 5 (.)     0.16
------------------------------------------------------------------------------------------------------

clearly demonstrates what is done.

Sometimes it is too easy to recover a hidden value by sums and subtractions, as below (1 = 189-(100+47+30+7+4)).

basetable _none ftv(c), hidesmall small(4)


----------------------------------------------------------------------
Variables                                                      Summary
----------------------------------------------------------------------
n (%)                                                      189 (100.0)
Number of visits to physician during 1st trimester, n (%)             
  0                                                         100 (52.9)
  1                                                          47 (24.9)
  2                                                          30 (15.9)
  3                                                            7 (3.7)
  4                                                            4 (2.1)
  6                                                            < 4 (.)
----------------------------------------------------------------------

In that case, it is recommended to collapse cells, e.g. collapsing 3, 4, and 6 into "> 2".
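
One way to collapse the cells is the built-in command -recode-. The sketch below is an assumption about how this could be done, not part of -basetable-; the variable name ftv_grp is made up for the illustration.

```stata
webuse lbw, clear
* Collapse 3 or more visits into one category labelled "> 2"
recode ftv (3/max = 3 "> 2"), generate(ftv_grp)
* Carry over the original variable label so the table row reads as before
label variable ftv_grp "Number of visits to physician during 1st trimester"
tabulate ftv_grp
```

The collapsed variable can then be used in place of ftv, e.g. basetable smoke ftv_grp(c), hidesmall. Whether any cell still falls below the threshold has to be checked in the resulting table.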


The do file for this document

Last update: 2022-04-21, Stata version 17