Generating a DREAM sample dataset

This document describes how the DREAM sample dataset is generated. Note that it is only vaguely similar to the original.

The idea is to have a dataset of a number of persons of different age and gender followed weekly over a long period.

At age 65 these people retire. Before age 65 they can be in four states: Working, On benefits, Early retirement or Died.

After age 65 the states are Retired or Died.

Sticky states are states that, once reached, are not left before death (itself a sticky state).

Initialisation

We follow 10000 persons (R) from the first week (week 32 in 1991) in the DREAM database to the last week in 2017 (lstwk).

Note that for simplicity we use the built-in (American) week notation, where every year has exactly 52 weeks.

The European notation has a week 53 in some years.

The generation is mainly done in Mata.

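mata {
        // Stata weekly dates count weeks from 1960w1 = 0 with 52 weeks per year
        frstwk = 31*53  // 1991w32: 31*52 + 31
        lstwk = 58*52-1 // 2017w52: 57*52 + 51
        wknames = "y" :+ strofreal(frstwk..lstwk, "%tw")        // variable names y1991w32 ... y2017w52
        //select(wknames', regexm(wknames', "53$"))     // check: no names end in week 53
        C = cols(wknames)       // number of weeks
        rseed(123)              // fix the seed for reproducibility
        R = 10000               // number of persons
}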
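The week arithmetic behind frstwk and lstwk can be checked by hand. The following is a minimal sketch, assuming only that Stata weekly dates count weeks from 1960w1 = 0 with 52 weeks per year; the names wk_first, wk_last and n_weeks are hypothetical and used only for this check.

```stata
mata:
// weekly date = 52*(year - 1960) + (week - 1)
wk_first = 52*(1991 - 1960) + (32 - 1)      // 1643, the same as 31*53
wk_last  = 52*(2017 - 1960) + (52 - 1)      // 3015, the same as 58*52-1
wk_first, 31*53
wk_last, 58*52 - 1
strofreal((wk_first, wk_last), "%tw")       // formatted as 1991w32 and 2017w52
n_weeks = wk_last - wk_first + 1            // number of weekly columns: 1373
n_weeks
end
```

So the dataset gets one state variable for each of the 1373 weeks.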

Generating age and gender

By design the males are on average 5 years older than the females: mean age 40 versus mean age 35.

mata {
        id = 1::R
        male = rbinomial(R, 1, 1, 0.48)
        age = rpoisson(R, 1, 15) :* !male + rpoisson(R, 1, 20) :* male :+ 20
}
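The design can be checked by simulation. The sketch below repeats the same draws with a larger, hypothetical sample (n) and a hypothetical seed, using the names m and a so that male and age are not overwritten. The Poisson means 15 and 20 plus the offset of 20 give the expected ages 35 and 40.

```stata
mata:
rseed(2018)                             // hypothetical seed, only for this check
n = 100000                              // hypothetical sample size
m = rbinomial(n, 1, 1, 0.48)            // 1 = male, 0 = female
a = rpoisson(n, 1, 15) :* !m + rpoisson(n, 1, 20) :* m :+ 20
mean(m)                                 // share of males, close to 0.48
mean(select(a, !m))                     // mean female age, close to 35
mean(select(a, m))                      // mean male age, close to 40
end
```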

Generating an age and gender dependent score and setting pension at 65

The probability of working is expected to decrease with age, so the score increases with age.

Further, males are modelled to have a higher score than females.

The score is saved in the variable scr. Retirement week is saved in the variable retired_at.

mata {
        scr = J(R, C, .)
        retired_at = J(R,1,.)
        for(r=1;r<=R;r++) for(c=1;c<=C;c++) {
                curr_age = age[r] + trunc(c / 52)       // age at start plus time in years
                scr[r,c] = male[r] + curr_age / 10
                if ( curr_age >= 65 & retired_at[r] == .) retired_at[r] = c
        }
}
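As a worked example of the loop above, consider a hypothetical man aged 60 in the first week. The names age0, is_male and first_c are assumptions made for this sketch only. His age increases by one for every 52 columns, so he reaches 65 at column 5 * 52 = 260.

```stata
mata:
age0 = 60                               // hypothetical age in the first week
is_male = 1                             // hypothetical gender
first_c = .
for (c = 1; c <= 520; c++) {
        curr_age = age0 + trunc(c / 52)         // age at start plus time in years
        if (curr_age >= 65 & first_c == .) first_c = c
}
first_c                                         // 260
scr_at = is_male + (age0 + trunc(first_c / 52)) / 10
scr_at                                          // 1 + 65/10 = 7.5
end
```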

Setting working life states

A normal random component with mean zero and a standard deviation of 5 is added to the score.

The state is saved in the variable state with values: 0 Working, 1 On benefits, 2 Early retirement, 3 Died and 4 Retired.

To ease calculations the function std_matrix01 is created.

real matrix std_matrix01(real matrix M)
{
        real scalar level, variation
        level = min(M)
        variation = max(M) - min(M)
        return((M :- level) :/ variation)
}

It turns a matrix of scores into a matrix of scores between zero and one.
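The rescaling can be illustrated on a small hypothetical matrix M. The sketch repeats the arithmetic of std_matrix01 inline so that it runs on its own; Z is a hypothetical name for the result.

```stata
mata:
M = (1, 2 \ 3, 5)                               // hypothetical scores
Z = (M :- min(M)) :/ (max(M) - min(M))          // same arithmetic as std_matrix01(M)
Z                                               // (0, .25 \ .5, 1)
end
```

The smallest value is mapped to zero and the largest to one. Note that a matrix where all values are equal has zero variation, so the division returns missing values.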

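mata {
        scr = std_matrix01(scr + rnormal(R,C,0,5) )     // noisy score rescaled to [0, 1]
        // sum of two zero/one indicators, so state is 0, 1 or 2 at this point
        state = (scr :> 0.65) + rbinomial(1,1,1, (scr :/ 1500))
}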

Early retirement and pension

Early retirement must happen before pension.

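First find the first week of early retirement. A matrix of R rows, each holding the values one to C (J(R,1,1..C)), is divided elementwise by a zero/one matrix (state :== 2). Since division by zero gives missing in Mata, each element is either a week index between one and C or missing (.). The row minimum is then the first week of early retirement, or missing if there is none.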

mata: early_retirement_at = rowmin( (J(R,1,1..C) :/ (state :== 2), J(R, 1, .)) )
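The division trick can be followed on a small example. The sketch below uses a hypothetical 3 by 4 state matrix s; the names s, idx and first_at are not part of the original code.

```stata
mata:
s = (0, 1, 2, 0 \ 0, 0, 0, 1 \ 2, 2, 0, 2)      // hypothetical states, 3 persons and 4 weeks
idx = J(3, 1, 1..4) :/ (s :== 2)                // week index where state is 2, else missing
idx
first_at = rowmin((idx, J(3, 1, .)))            // first week in state 2 for each person
first_at                                        // (3 \ . \ 1)
end
```

The second person is never in state 2, so the result for that row is missing.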

For each person, the weeks from the first week of early retirement onwards are set to early retirement.

mata { 
for(r=1;r<=R;r++) {
        if ( early_retirement_at[r] < . ) {
                state[r, early_retirement_at[r]..C] = J(1, C - early_retirement_at[r] + 1, 2)
        }
}       
}

Likewise for pension. Since pension is set after early retirement, no one is on early retirement after reaching the pension age of 65.

mata {
for(r=1;r<=R;r++) {
        if ( retired_at[r] < . ) {
                state[r, retired_at[r]..C] = J(1, C - retired_at[r] + 1, 4)
        }
}
}

A score of death (diedscr) is created by adding random noise with a standard deviation of 15 to the age and gender score (scr).

mata: diedscr = std_matrix01(scr + rnormal(R,C,0,15) )

The first week in which the death score reaches the threshold of 0.85 is taken as the week of death.

mata: died_at = rowmin( (J(R,1,1..C) :/ (diedscr :>= 0.85), J(R, 1, .)) )

Finally, every week after a death is set to the state Died.

mata {
for(r=1;r<=R;r++) {
        if ( died_at[r] < . ) {
                state[r, died_at[r]..C] = J(1, C - died_at[r] + 1, 3)   
        }
}
}
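The order of the three fills matters, since a later fill overwrites an earlier one. The sketch below applies the same three steps to a single hypothetical person followed for 8 weeks; the names s, K, early, pension and death are assumptions made for this illustration.

```stata
mata:
s = (0, 1, 2, 0, 1, 0, 0, 0)                    // hypothetical raw states for one person
K = cols(s)
early = 3                                       // first week of early retirement
pension = 5                                     // week of reaching age 65
death = 7                                       // week of death
s[1, early..K]   = J(1, K - early + 1, 2)       // early retirement is sticky
s[1, pension..K] = J(1, K - pension + 1, 4)     // pension overwrites early retirement
s[1, death..K]   = J(1, K - death + 1, 3)       // death overwrites everything
s                                               // (0, 1, 2, 2, 4, 4, 3, 3)
end
```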

These variables are inserted into a dataset.

mata {
        nhb_sae_addvars(("id", "male", "age"), (id, male, age))
        nhb_sae_addvars(wknames, state)
}

Labels, a dataset note etc. are added and the dataset is saved.

label define state 0 "Working" 1 "On benefits" 2 "Early retirement" 3 "Died" 4 "Retired"
label values y* state
label data "DREAM like dataset"
label variable male "Gender"
label define male 0 "Female" 1 "Male"
label values male male
label variable age "Age at week 32, 1991 (Years)"
label variable id "Id"
notes: Registrations (Working/On benefits/Early retirement/Died/Retired) each week for 10000 participants from start (week 32, 1991) until week 52, 2017
save dream, replace

Validation at start and at end

The distribution of states in the first and the last week is shown below.

crossmat y1991w32 y2017w52
Events at first and last week

                                          y2017w52
y1991w32      Working  On benefits  Early retirement   Died  Retired  Total
Working          4498          452               195   1180     3202   9527
On benefits       181           19                 9     56      206    471
Died                0            0                 0      2        0      2
Total            4679          471               204   1238     3408  10000

The DREAM example data


The do file for this document

Last update: 2018-12-11, Stata version 15.1