Regular expressions in Stata

Introduction

Regular expressions are a relatively easy, flexible method of searching strings. You can use them to search any string (e.g. variables, macros).

In Stata, there are three functions that use regular expressions.

Regular expressions can be very effective in cleansing string data.

Regular expression functions in Stata

Stata has the following regular expression functions:

So a regular expression is a string which is used as a filter for another string.

Regular expressions rules

A regular expression is a string working as a string filter. The filter is based on a set of characters:

The function =regexm(string, "[0-9]") below evaluates string acordingly to the regular expression "[0-9]", that says that a ciffer (from 0 to 9) is present in the text.

More special regular expression characters are:

Examples

Howto test a regular expression in Stata

It is important to test regular expressions before full scale usage.

The easiest way to do so is to use the command -display-:


display =regexm("This test will return a 1", "t[eo]")

1

display =regexm("This will return a 0", "t[eo]")

0

Howto use regexm and regexs to generate a grouping variable

We use the auto data:


sysuse auto, clear

List the make of cars containing either of the strings Datsun, Pont or Toyota


list make if regexm(make, "Datsun|Pont|Toyota")

     +------------------+
     | make             |
     |------------------|
 47. | Pont. Catalina   |
 48. | Pont. Firebird   |
 49. | Pont. Grand Prix |
 50. | Pont. Le Mans    |
 51. | Pont. Phoenix    |
     |------------------|
 52. | Pont. Sunbird    |
 56. | Datsun 200       |
 57. | Datsun 210       |
 58. | Datsun 510       |
 59. | Datsun 810       |
     |------------------|
 67. | Toyota Celica    |
 68. | Toyota Corolla   |
 69. | Toyota Corona    |
     +------------------+

Define a grouping variable for the strings Datsun, Pont or Toyota. Note that what is in soft brackets () can be extrated by the function regexs with a integer between 1 and 9 as argument:


generate grp = regexs(1) if regexm(make, "(Datsun|Pont|Toyota)")

And the result is:


list make grp if regexm(make, "Datsun|Pont|Toyota")

     +---------------------------+
     | make                  grp |
     |---------------------------|
 47. | Pont. Catalina       Pont |
 48. | Pont. Firebird       Pont |
 49. | Pont. Grand Prix     Pont |
 50. | Pont. Le Mans        Pont |
 51. | Pont. Phoenix        Pont |
     |---------------------------|
 52. | Pont. Sunbird        Pont |
 56. | Datsun 200         Datsun |
 57. | Datsun 210         Datsun |
 58. | Datsun 510         Datsun |
 59. | Datsun 810         Datsun |
     |---------------------------|
 67. | Toyota Celica      Toyota |
 68. | Toyota Corolla     Toyota |
 69. | Toyota Corona      Toyota |
     +---------------------------+

The variable grp is set to missing if the make does not match one of the 3 strings Datsun, Pont, or Toyota:


codebook grp

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
grp                                                                                                                                                                                                                                                 (unlabeled)
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                  type:  string (str6)

         unique values:  3                        missing "":  61/74

            tabulation:  Freq.  Value
                            61  ""
                             4  "Datsun"
                             6  "Pont"
                             3  "Toyota"

Howto standardise strings by using the regex replace function regexr

All variants of first name below must must be change to Niels Henrik


list

     +---------------------------+
     |         name   age    sex |
     |---------------------------|
  1. |     nh Bruun    52   male |
  2. | Henrik Bruun    52   male |
  3. | henrik Bruun    52   male |
     +---------------------------+

The solution:


replace name = regexr(name, "nh|[H|h]enrik", "Niels Henrik")  

And the new data are:


list

     +---------------------------------+
     |               name   age    sex |
     |---------------------------------|
  1. | Niels Henrik Bruun    52   male |
  2. | Niels Henrik Bruun    52   male |
  3. | Niels Henrik Bruun    52   male |
     +---------------------------------+

PS space matters!!!


replace name = regexr(name, "nh | [H|h]enrik", "Niels Henrik")  

(1 real change made)

And now the changes are:


list

     +--------------------------------+
     |              name   age    sex |
     |--------------------------------|
  1. | Niels HenrikBruun    52   male |
  2. |      Henrik Bruun    52   male |
  3. |      henrik Bruun    52   male |
     +--------------------------------+

Grouping strings, education at Denmark Statistics, into years of education

The variable AFSP4E must be transformed into another variable edu_time by the rules:

Here are some example values:


list

     +----------+
     |   AFSP4E |
     |----------|
  1. | 0C525000 |
  2. | 1C525000 |
  3. | 2C525000 |
  4. | 3A525000 |
  5. | 3B525000 |
     |----------|
  6. | 3C525000 |
  7. | 4A525000 |
  8. | 4B525000 |
  9. | 4C525000 |
 10. | 5A525000 |
     |----------|
 11. | 5B525000 |
     +----------+

And the code could be (capture added due to Stata version 12):


capture generate str edu_time = "<=10 year" * regexm(AFSP4E, "^[0-2]") ///
                                + ">10 year & <=15 year" * regexm(AFSP4E, "^[3-4]|^5B") ///
                                + ">15 year" * regexm(AFSP4E, "^5A")

And the result:


list

     +---------------------------------+
     |   AFSP4E               edu_time |
     |---------------------------------|
  1. | 0C525000              <=10 year |
  2. | 1C525000              <=10 year |
  3. | 2C525000              <=10 year |
  4. | 3A525000   >10 year & <=15 year |
  5. | 3B525000   >10 year & <=15 year |
     |---------------------------------|
  6. | 3C525000   >10 year & <=15 year |
  7. | 4A525000   >10 year & <=15 year |
  8. | 4B525000   >10 year & <=15 year |
  9. | 4C525000   >10 year & <=15 year |
 10. | 5A525000               >15 year |
     |---------------------------------|
 11. | 5B525000   >10 year & <=15 year |
     +---------------------------------+

capture drop edu_time

Another version of the code could be:


generate edu_time = 1 * regexm(AFSP4E, "^[0-2]") ///
                                                + 2 * regexm(AFSP4E, "^[3-4]|^5B") ///
                                                + 3 * regexm(AFSP4E, "^5A")

combined with:


label define edu_time 1 "<=10 year" 2 ">10 year & <=15 year" 3 ">15 year"
label values edu_time edu_time

And the result is the same (almost):


list

     +---------------------------------+
     |   AFSP4E               edu_time |
     |---------------------------------|
  1. | 0C525000              <=10 year |
  2. | 1C525000              <=10 year |
  3. | 2C525000              <=10 year |
  4. | 3A525000   >10 year & <=15 year |
  5. | 3B525000   >10 year & <=15 year |
     |---------------------------------|
  6. | 3C525000   >10 year & <=15 year |
  7. | 4A525000   >10 year & <=15 year |
  8. | 4B525000   >10 year & <=15 year |
  9. | 4C525000   >10 year & <=15 year |
 10. | 5A525000               >15 year |
     |---------------------------------|
 11. | 5B525000   >10 year & <=15 year |
     +---------------------------------+

Now the variable test is labeled number just like it is prefered in Stata.

Grouping numbers, social group at Denmark Statistics

Now assume that a number variable SOCIO02 has to be grouped into a new variable employment by the 3 leading digits. The grouping is:

    1. 111 112 113 114 120 131 132 133 134 135 139 310
    2. 210 220 321 330
    3. 410

First thing is that the variable must be a string variable in order to handled by regexm.

Second the note that "^111-114" is not a regex for 111, 112, 113 and 114. This is how one would formulate it for numbers, but we handle strings here. So to get the string headings 111, 112, 113 and 114 note that they all start with "11" and are followed by 1, 2, 3, or 4 or in regex "^11[1-4]".

Thirdly note that the regex "^11[1-38]" selects strings starting with 111, 112, 113 and 118 since square brackets means 1-3 or 8 and 1-3 means 1, 2 or 3.

A solution is shown below:


generate employment = 1 if regexm(string(SOCIO02), "^11[1-4]|^120|^13[1-59]|^310") 
replace employment = 2 if regexm(string(SOCIO02), "^210|^220|^321|^330")
replace employment = 3 if regexm(string(SOCIO02), "^410")

And the result is:


sort employment SOCIO02
list employment SOCIO02

     +--------------------+
     | employ~t   SOCIO02 |
     |--------------------|
  1. |        1       111 |
  2. |        1       112 |
  3. |        1       113 |
  4. |        1       114 |
  5. |        1       120 |
     |--------------------|
  6. |        1       131 |
  7. |        1       132 |
  8. |        1       133 |
  9. |        1       134 |
 10. |        1       135 |
     |--------------------|
 11. |        1       139 |
 12. |        1       310 |
 13. |        2       210 |
 14. |        2       220 |
 15. |        2       321 |
     |--------------------|
 16. |        2       330 |
 17. |        3       410 |
 18. |        .       115 |
 19. |        .       118 |
     +--------------------+

Getting the birthday from a danish social security number and more on testing

A danish social security number consist of 10 digits. The first 2 digits are day of birth, the next 2 digits are month of birth and the next 2 are the last 2 digits in the year of birth.

First generate a sample set to test the regular expressions:


clear
input str10 dksecnum

       dksecnum
  1. 2305123456
  2. 1210728998
  3. 121223
  4. end

list

     +------------+
     |   dksecnum |
     |------------|
  1. | 2305123456 |
  2. | 1210728998 |
  3. |     121223 |
     +------------+

To get birth dates from dksecnum (when it has proper values, ie 10 digits) simply do:


generate bday = mdy(real(regexs(2)), real(regexs(1)), 1900 + real(regexs(3))) ///
        if regexm(dksecnum, "^([0-9][0-9])([0-9][0-9])([0-9][0-9])[0-9][0-9][0-9][0-9]")
format %tdCCYY-NN-DD bday
list

     +-------------------------+
     |   dksecnum         bday |
     |-------------------------|
  1. | 2305123456   1912-05-23 |
  2. | 1210728998   1972-10-12 |
  3. |     121223            . |
     +-------------------------+

References

  1. Stata: What are regular expressions and how can I use them in Stata?
  2. UCLA: How can I extract a portion of a string variable using regular expressions?
  3. Rose Anne Medeiros: Using regular expressions for data management in Stata

The do file for this document

Last update: 2017-06-03, Stata version 14.2