Pseudo percentiles

Introduction

It has been hard to report centiles, e.g. the median, on data from Statistics Denmark. The reason is that any reported value must be based on at least 5 observations. A mean of 5 values is therefore acceptable. A median, however, may be a single observed value or a linear combination of two values, and hence it must not be reported.
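The difference between the two statistics can be seen in a small sketch. The five values below are made up for illustration and do not come from any real dataset.

```stata
* Illustration with five made-up values (not real data)
clear
input x
1
2
4
8
16
end
summarize x, detail
display "mean   = " r(mean)   // combines all five values
display "median = " r(p50)    // equals one observed value, the middle one
```

Here the mean involves every observation, whereas the median is simply the third value and would reveal an individual observation if reported.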

One solution to this reporting problem is to use pseudo centiles. Similar requirements may apply in places other than Statistics Denmark.

In this document pseudo percentiles are defined.

Pseudo percentiles are implemented in Stata commands -sumat- and -basetable-. These commands are demonstrated below.

It is important to have the newest versions of the commands from SSC.
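A minimal sketch of one way to check and update, assuming the commands were installed from SSC in the usual way. Both -adoupdate- and -which- are official Stata commands; nothing here is specific to -sumat- or -basetable-.

```stata
* Update all installed user-written packages (this includes SSC packages)
adoupdate, update

* Show where each command is installed and its version line, if it has one
which sumat
which basetable
```

If -which- reports that a command is not found, it has not been installed yet.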

How pseudo centiles are defined

A variable x is generated to demonstrate the definition.

set seed 123
set obs 15
generate id = _n
generate x = runiformint(1,7)

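First, a pseudo variable, pseudo_x, is generated from x as the moving average of five neighboring values in the sorted data: the value itself and the two values on either side. The first two and the last two observations have no complete window, so pseudo_x is missing for them.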

sort x
generate pseudo_x = (x[_n-2] + x[_n-1] + x[_n] + x[_n+1] + x[_n+2]) / 5 if inrange(_n, 3, _N-2)
format x pseudo_x %6.2f
list

     +----------------------+
     | id      x   pseudo_x |
     |----------------------|
  1. | 11   1.00          . |
  2. |  5   2.00          . |
  3. |  4   2.00       2.20 |
  4. |  9   3.00       2.60 |
  5. |  6   3.00       3.00 |
     |----------------------|
  6. |  1   3.00       3.60 |
  7. |  8   4.00       4.00 |
  8. | 10   5.00       4.40 |
  9. |  2   5.00       4.80 |
 10. | 15   5.00       5.20 |
     |----------------------|
 11. | 12   5.00       5.40 |
 12. |  3   6.00       5.80 |
 13. | 13   6.00       6.20 |
 14. | 14   7.00          . |
 15. |  7   7.00          . |
     +----------------------+

The calculation can be summarised as:

• Value 3 is: (1 + 2 + 2 + 3 + 3) / 5 = 2.2
• Value 4 is: (2 + 2 + 3 + 3 + 3) / 5 = 2.6
• Value 5 is: (2 + 3 + 3 + 3 + 4) / 5 = 3
• Value 6 is: (3 + 3 + 3 + 4 + 5) / 5 = 3.6
• Value 7 is: (3 + 3 + 4 + 5 + 5) / 5 = 4
• Value 8 is: (3 + 4 + 5 + 5 + 5) / 5 = 4.4
• Value 9 is: (4 + 5 + 5 + 5 + 5) / 5 = 4.8
• Value 10 is: (5 + 5 + 5 + 5 + 6) / 5 = 5.2
• Value 11 is: (5 + 5 + 5 + 6 + 6) / 5 = 5.4
• Value 12 is: (5 + 5 + 6 + 6 + 7) / 5 = 5.8
• Value 13 is: (5 + 6 + 6 + 7 + 7) / 5 = 6.2

In other words, every value of pseudo_x is an average of 5 neighboring x values. Hence, these values may be reported from Statistics Denmark, either as single values or as e.g. centiles.
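The same moving average can be written for other window widths. The sketch below is not taken from -sumat- or -basetable-. The local macros k and h and the variable pseudo_k are names chosen for illustration, and the window width is assumed to be odd. The x values are the 15 sorted values from the listing, entered by hand so that the result does not depend on the random number generator.

```stata
clear
input x
1
2
2
3
3
3
4
5
5
5
5
6
6
7
7
end
sort x
local k = 5                     // window width, assumed odd
local h = (`k' - 1) / 2         // number of neighbors on each side
generate double pseudo_k = x
forvalues j = 1/`h' {
    * out-of-range subscripts are missing, so incomplete windows become missing
    replace pseudo_k = pseudo_k + x[_n - `j'] + x[_n + `j']
}
replace pseudo_k = pseudo_k / `k'
list, separator(5)
```

With k = 5 this reproduces the pseudo_x column: missing in the first two and last two rows, 2.2 in row 3 and 6.2 in row 13.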

The pseudo centiles are based on the variable pseudo_x and are calculated in the same way as the default case described in Methods and formulas for the Stata command -centile-, with the difference that the number of observations (n) is that of the variable x (here 15) and not the number of non-missing pseudo_x values (here 11).
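As a sketch of that calculation, the default -centile- rule places centile p at position P = (n + 1)p/100 and interpolates linearly between the two values around P. The code below applies this rule to pseudo_x with n = 15 for the quartiles only. It is a hand calculation under that assumption, not the code used by -sumat- or -basetable-, and the local macro names are chosen for illustration. Centiles whose position falls where pseudo_x is missing (here below 3 or above 13) are not handled.

```stata
clear
input x
1
2
2
3
3
3
4
5
5
5
5
6
6
7
7
end
sort x
generate double pseudo_x = (x[_n-2] + x[_n-1] + x[_n] + x[_n+1] + x[_n+2]) / 5
local n = _N                                // n is the number of x values (15)
foreach p of numlist 25 50 75 {
    local P = (`n' + 1) * `p' / 100         // position of the centile
    local i = floor(`P')                    // integer part of the position
    local h = `P' - `i'                     // fractional part of the position
    local c = pseudo_x[`i'] + `h' * (pseudo_x[`i' + 1] - pseudo_x[`i'])
    display "pseudo p`p' = " %6.2f `c'
}
```

The positions are 4, 8 and 12, so the three results are pseudo_x[4] = 2.60, pseudo_x[8] = 4.40 and pseudo_x[12] = 5.80 from the listing.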

A demonstration of pseudo percentiles using -sumat-

Suppose the minimum, the maximum, the lower and upper percentiles, the quartiles and the lower and upper deciles are wanted.

Working outside Statistics Denmark one would simply do something like:

sumat x, statistics(min p01 p10 p25 p50 p75 p90 p99 max)

------------------------------------------------------
    min  p01   p10   p25   p50   p75   p90   p99   max
------------------------------------------------------
x  1.00       1.60  3.00  5.00  6.00  7.00  7.00  7.00
------------------------------------------------------

To get pseudo percentiles based on an average of 5, one simply adds the option hide(5):

sumat x, statistics(min p01 p10 p25 p50 p75 p90 p99 max) hide(5)

------------------------------------------------------
    min  p01   p10   p25   p50   p75   p90   p99   max
------------------------------------------------------
x  2.20       2.20  2.60  4.40  5.80  6.20  6.20  6.20
------------------------------------------------------

The pseudo percentiles differ somewhat from the real percentiles, most clearly in the tails (the minimum is 2.20 instead of 1.00 and the maximum is 6.20 instead of 7.00), but the levels are right. Pseudo percentiles are always averages of (in this case 5) neighboring values.

Pseudo centiles in commands -sumat- and -basetable-

An example using a proper dataset could be based on the lbw dataset:

webuse lbw, clear
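Metadata for the lbw dataset:
-----------------------------------------------------------------------------------------------------------------------------------------------
Name   Index  Label                                               Value label name  Format  Value label values               n  unique  missing
-----------------------------------------------------------------------------------------------------------------------------------------------
id         1  identification code                                                   %8.0g                                  189     189        0
low        2  birthweight<2500g                                                     %8.0g                                  189       2        0
age        3  age of mother                                                         %8.0g                                  189      24        0
lwt        4  weight at last menstrual period                                       %8.0g                                  189      76        0
race       5  race                                                race              %8.0g   1 "white" 2 "black" 3 "other"  189       3        0
smoke      6  smoked during pregnancy                             smoke             %9.0g   0 "nonsmoker" 1 "smoker"       189       2        0
ptl        7  premature labor history (count)                                       %8.0g                                  189       4        0
ht         8  has history of hypertension                                           %8.0g                                  189       2        0
ui         9  presence, uterine irritability                                        %8.0g                                  189       2        0
ftv       10  number of visits to physician during 1st trimester                    %8.0g                                  189       6        0
bwt       11  birthweight (grams)                                                   %8.0g                                  189     133        0
-----------------------------------------------------------------------------------------------------------------------------------------------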

Using -sumat-

A summary of quartiles by mother smoking during pregnancy and race

To get the real quartiles one could do:

sumat bwt, statistics(median iqi) decimals(1) rowby(smoke) colby(race) total label title(The real quartiles)

The real quartiles:
-----------------------------------------------------------------------------------------------------------------------------------------------
                                                  race(white)                  race(black)                  race(other)                   Total
                                       median  iq 25%  iq 75%       median  iq 25%  iq 75%       median  iq 25%  iq 75%  median  iq 25%  iq 75%
-----------------------------------------------------------------------------------------------------------------------------------------------
smoked during pregnancy(nonsmoke       3593.0  3062.0  3899.0       2920.0  2452.3  3359.8       2807.0  2301.0  3274.0  3100.0  2495.0  3629.0
smoked during pregnancy(smoker)        2775.5  2410.0  3274.5       2381.0  2253.5  2971.5       3146.5  2274.8  3316.5  2775.5  2363.5  3270.8
Total                                  3076.0  2566.3  3651.0       2849.0  2349.3  3125.8       2835.0  2301.0  3274.0  2977.0  2412.0  3481.0
-----------------------------------------------------------------------------------------------------------------------------------------------

Pseudo percentiles are reported by adding the option hide(5), which is the only difference from the command above:

sumat bwt, statistics(median iqi) decimals(1) rowby(smoke) colby(race) total label title(Quartiles based on pseudo percentiles) hide(5)

Quartiles based on pseudo percentiles:
-----------------------------------------------------------------------------------------------------------------------------------------------
                                                  race(white)                  race(black)                  race(other)                   Total
                                       median  iq 25%  iq 75%       median  iq 25%  iq 75%       median  iq 25%  iq 75%  median  iq 25%  iq 75%
-----------------------------------------------------------------------------------------------------------------------------------------------
smoked during pregnancy(nonsmoke       3562.0  3028.7  3895.4       2877.5  2356.7  3348.4       2801.0  2305.8  3297.4  3111.8  2502.8  3631.8
smoked during pregnancy(smoker)        2785.3  2383.0  3262.3       2534.2  2061.0  2947.0       2986.5  2120.3  3339.0  2785.3  2351.3  3254.7
Total                                  3074.0  2551.0  3657.8       2769.7  2309.7  3175.7       2826.4  2305.8  3286.4  2971.2  2414.9  3499.0
-----------------------------------------------------------------------------------------------------------------------------------------------
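The pseudo medians by smoking status can also be calculated by hand as a cross-check. The sketch below assumes that the moving average and the centile position are both computed within each group, using the group size as n. Whether -sumat- works exactly this way internally is an assumption, and the variables pseudo_bwt, P and pseudo_median are names chosen for illustration.

```stata
webuse lbw, clear

* Moving average of 5 neighboring birthweights within each smoking group
bysort smoke (bwt): generate double pseudo_bwt = ///
    (bwt[_n-2] + bwt[_n-1] + bwt + bwt[_n+1] + bwt[_n+2]) / 5

* Position of the median in each group, with the group size as n
by smoke: generate double P = (_N + 1) * 50 / 100

* Linear interpolation between the two pseudo values around position P
by smoke: generate double pseudo_median = pseudo_bwt[floor(P)] ///
    + (P - floor(P)) * (pseudo_bwt[floor(P) + 1] - pseudo_bwt[floor(P)])

tabstat pseudo_median, by(smoke) statistics(mean) format(%6.1f)
```

Since pseudo_median is constant within each group, the mean shown by -tabstat- is simply that constant. It can be compared with the Total column of the pseudo percentile table.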

Pseudo percentiles in -basetable-

A regular use of -basetable- is:

basetable smoke bwt(%6.1f, iqi), title(The real quartiles)

The real quartiles:
-----------------------------------------------------------------------------------------------------------------------
Columns by: smoked during pregnancy                nonsmoker                   smoker                    Total  P-value
-----------------------------------------------------------------------------------------------------------------------
n (%)                                             115 (60.8)                74 (39.2)              189 (100.0)
birthweight (grams), median (iqi)    3100.0 (2495.0; 3629.0)  2775.5 (2363.5; 3270.8)  2977.0 (2412.0; 3481.0)     0.01
-----------------------------------------------------------------------------------------------------------------------

The command -basetable- also handles pseudo percentiles. Two options control this:

• pseudo: indicating that pseudo percentiles should be used
• small: specifying number of values to use in the moving average. Default is 5.

Reporting pseudo percentiles (average of 5) is done by:

basetable smoke bwt(%6.1f, iqi) bwt(%6.1f, idi) bwt(%6.1f, imi), title(Using pseudo percentiles) pseudo

Using pseudo percentiles:
-----------------------------------------------------------------------------------------------------------------------
Columns by: smoked during pregnancy                nonsmoker                   smoker                    Total  P-value
-----------------------------------------------------------------------------------------------------------------------
n (%)                                             115 (60.8)                74 (39.2)              189 (100.0)
birthweight (grams), median (iqi)    3111.8 (2502.8; 3631.8)  2785.3 (2351.3; 3254.7)  2971.2 (2414.9; 3499.0)     0.01
birthweight (grams), median (idi)    2003.4 (2502.8; 3970.7)  1972.1 (2351.3; 3654.9)  1988.8 (2414.9; 3880.0)     0.01
-----------------------------------------------------------------------------------------------------------------------
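To base the pseudo percentiles on a wider window, the option small can be set explicitly. The line below is a sketch only: it assumes that the option takes the number of values as an argument in the form small(#), which should be checked against the help file for -basetable-. The value 10 and the title text are chosen for illustration.

```stata
* Assumed syntax: small(#) sets the number of values in the moving average
basetable smoke bwt(%6.1f, iqi), title(Pseudo percentiles based on 10 values) pseudo small(10)
```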

The do file for this document

Last update: 2019-09-13, Stata version 15.1