Using Mata code for data management for datasets with two dimensions

Datasets like DREAM has two dimensions, rows are individuals and columns are mainly many measurements over time.

Handling datasets with two dimensions are described in (2014) Jakob Hjort.

What is needed is to retrieve information regarding one or more events in a time window.

In essence, there is two challenges:

1. Getting a time window, eg. 52 weeks (a year) from a time point It is important for speed of calculations to limit the time window to what is needed

2. Getting information regarding events in that window

Using Mata is much faster than using Stata.

The DREAM example data is used. It is described on this website.

use dream.dta, clear


Most of the columns represent time points (one per week) for the participants.

A random start point (diagnosed) is added for later.

mata: rseed(123)
mata: nhb_sae_addvars("diagnosed", st_varname(runiformint(1, st_nobs(), 4, st_nvar()))')


It is very important that time point variables are ordered. Otherwise, the code examples below will lead to wrong results.

order *, alphabetic
order id male age diagnosed


Getting a time window

Time window with fixed starting/ending point

Assume that the starting point is week 7 in 2005 and that a time window of one year (52 weeks) is wanted.

Using the naming convention of the data first week is named y2005w07.

To get the index of variable y2005w07:

mata: start = st_varindex("y2005w07")
mata: time_window = st_data(., start..(start+51))


By subtracting 51 instead of adding it is easy to look at the year ending at y2005w07.

The first 10 rows for the first quarter (13 weeks) can be looked upon using:

mata: time_window[1..10,1..13]


1    2    3    4    5    6    7    8    9   10   11   12   13
+------------------------------------------------------------------+
1 |   2    2    2    2    2    2    2    2    2    2    2    2    2  |
2 |   0    0    0    0    0    0    1    1    0    0    0    0    0  |
3 |   0    0    0    0    0    1    0    0    0    0    1    0    1  |
4 |   0    0    0    0    0    0    0    1    0    0    0    1    0  |
5 |   0    0    0    0    0    0    0    0    0    1    0    0    0  |
6 |   0    1    0    0    0    0    0    1    0    0    0    0    0  |
7 |   1    0    0    0    0    0    0    0    0    0    0    0    0  |
8 |   0    0    0    0    1    0    0    0    0    0    0    0    0  |
9 |   0    0    0    0    0    0    1    1    0    0    0    0    0  |
10 |   0    1    0    0    0    0    0    0    0    0    0    0    0  |
+------------------------------------------------------------------+


Time window with a floating starting/ending point to the end of time

To find the starting point requires a little bit more code, since each participant has an individual starting point.

mata: tw_start = st_varindex(st_sdata(., "diagnosed")')'


To speed up the code the space required to save data needs to be specified. So, the maximal time window width (max) needs to be found.

mata: max = colmax( (st_nvar() :- tw_start) ) + 1


Hence, a matrix of the required size can be specified. The default value is set to a missing value (.m). Most participants will have a smaller time window than max.

Getting matrix column size (assuming last variable is the last time observed).

mata: time_window = J(st_nobs(), max, .m)


The end of the time window is found by:

mata: tw_end = J(st_nobs(), 1, st_nvar())


With the code below all time windows are aligned to the same starting point in matrix.

mata: for(r=1;r<=st_nobs();r++) time_window[r,1..(tw_end[r] - tw_start[r] + 1)] = st_data(r, tw_start[r]..tw_end[r])


Time window with a floating starting/ending point and fixed time interval

Finding the individual starting point and calculating the end point for the time window.

mata: tw_start = st_varindex(st_sdata(., "diagnosed")')'
mata: tw_end  = rowmin( ( tw_start :+ 51, J(st_nobs(), 1, st_nvar())) )


To speed up calculations a time window matrix of width 52 is defined.

mata: time_window = J(st_nobs(), 52, .m)


Individual time windows are aligned below, possibly having missings (.m) at the end of the time windows.

mata: for(r=1;r<=st_nobs();r++) time_window[r,1..(tw_end[r] - tw_start[r] + 1)] = st_data(r, tw_start[r]..tw_end[r])


Data management using Mata

Finding rows with a specific value at the last column

Below the rows not having a full (52) time window is found. Such a row will have the value .m in the last column.

mata: has_m = select(st_data(.,"id"), time_window[., 52] :== .m)


The first 5 are (note that has_m is a column vector so only one index is needed):

mata: has_m[1..5]


1
+-------+
1 |   39  |
2 |   61  |
3 |   85  |
4 |  100  |
5 |  101  |
+-------+


Count the number of a value for each participant

The number of missings .m can be found by

mata: countm = rowsum(time_window :== .m)


The first 5 counts are:

mata: countm[1..5]


1
+-----+
1 |  0  |
2 |  0  |
3 |  0  |
4 |  0  |
5 |  0  |
+-----+


The non-zero counts with ids

mata: count_nz = (has_m, select((st_data(.,"id"), countm), countm :> 0))


The first 15 counts with ids

mata: count_nz[1..15,.]


1     2     3
+-------------------+
1 |   39    39    33  |
2 |   61    61    15  |
3 |   85    85    25  |
4 |  100   100    19  |
5 |  101   101    38  |
6 |  132   132     9  |
7 |  152   152    42  |
8 |  197   197     8  |
9 |  201   201     9  |
10 |  213   213    49  |
11 |  216   216    23  |
12 |  249   249    20  |
13 |  280   280     4  |
14 |  283   283    42  |
15 |  326   326    41  |
+-------------------+


Looking at a single participant

Participant 213 has 49 missing values. The time window for 213 is:

mata time_window[213, .]


1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48   49
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1 |   3    3    3   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m   .m
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
50   51   52
----------------+
1    .m   .m   .m  |
----------------+


Below it is seen that participant 213 only is followed for three weeks before the dataset ends.

mata: st_varname(tw_start[213]), st_varname(tw_end[213]), st_varname(st_nvar())


1          2          3
+----------------------------------+
1 |  y2017w50   y2017w52   y2017w52  |
+----------------------------------+


The last three weeks for participant 213 matches what is found in the time window.

mata: st_data(213, tw_start[213]..st_nvar())


Finding unique values

First, the time window values are made into one column and the unique values are found.

mata: dta = colshape(time_window,1)
mata: uniq = uniqrows(dta)


To see the labels

label list state


state:
0 Working
1 On benefits
2 Early retirement
3 Died
4 Retired


Do count table for unique values

mata: uniq, rowsum(J(rows(uniq),1, dta') :== uniq)


1        2
+-------------------+
1 |       0   416875  |
2 |       1    30311  |
3 |       2     6783  |
4 |       3    31640  |
5 |       4    25273  |
6 |      .m     9118  |
+-------------------+


Finding first event

First number of rows and columns for the time window is found.

mata: R = rows(time_window); C = cols(time_window)


To find first event simply write

mata: early_retirement_at_1 = rowmin( J(R,1,1..C) :/ (time_window :== 2) )


The matrix rowmin( J(R,1,1..C) specifies the column indeces for each participant in the time window.

The matrix is one or zero depending on whether the time window value is 2 or not.

By dividing the two martrices cellwise the result is either a column index if the time window value is 2 or else missing due to division with zero.

In Stata the missing values are numerically plus infinity. By taking the rowmin the lowest column index different from missing is returned. In other words, for each participant the first column index where the time window value is 2.

To verify

1. The participants with events are found
2. Looking at selected participants with an event

The rows/participant having an event are found.

mata: has_1_ev = select(((1::R), early_retirement_at_1), early_retirement_at_1 :< .)


The first three rows with an event are

mata: has_1_ev[1..3,.]


1     2
+-------------+
1 |    1    35  |
2 |   53    43  |
3 |  170     1  |
+-------------+


The id of the participants for first three rows with an event are

mata: st_data(has_1_ev[1..3,1], "id")


1
+-------+
1 |    1  |
2 |   53  |
3 |  170  |
+-------+


The participant id value is the rownumber in this dataset.

Looking at the time window rows for the first 45 weeks, it is seen that the first event appears as calculated.

mata: time_window[has_1_ev[1..3,1],1..45]


1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 |   0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    0    0    1    0    0    0    0    2    2    2    2    2    2    2    2    2    2    2  |
2 |   0    0    0    0    0    0    1    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    1    0    0    1    0    0    0    0    0    0    2    2    2  |
3 |   2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2  |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


The do file for this document

Last update: 2019-09-11, Stata version 15.1