Regular expressions in Stata
Introduction
Regular expressions are a relatively easy, flexible method of searching strings. You can use them to search any string (e.g. variables, macros).
In Stata, there are three functions that use regular expressions.
Regular expressions can be very effective in cleansing string data.
Regular expression functions in Stata
Stata has the following regular expression functions:

regexm(s, re) performs a match of a regular expression and evaluates to 1 if regular expression re (a string) is satisfied by the string s, otherwise returns 0

regexr(s1,re,s2) replaces the first substring within s1 that matches re with s2 and returns the resulting string. If s1 contains no substring that matches re, the unaltered s1 is returned.

regexs(n) returns subexpression n from a previous regexm() match, where 0 <= n < 10. Subexpression 0 is reserved for the entire string that satisfied the regular expression.
So a regular expression is a string which is used as a filter for another string.
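The three functions can be tried together on a small dataset. The sketch below is only an illustration: the variable names txt, has_num, num and masked and the two example strings are made up for this purpose.

```stata
* Two made-up strings: one with digits, one without
clear
input str20 txt
"Stata 17"
"no digits"
end

* regexm() flags the rows where the pattern is found
generate byte has_num = regexm(txt, "[0-9]+")

* regexs(0) returns the matched part, here taken only where regexm() succeeds
generate str20 num = regexs(0) if regexm(txt, "[0-9]+")

* regexr() replaces the first match and leaves non-matching strings unaltered
generate str20 masked = regexr(txt, "[0-9]+", "#")

list
```

The first row should get has_num = 1, num = "17" and masked = "Stata #"; the second row has no match, so num stays empty and masked equals txt.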
Regular expressions rules
A regular expression is built from ordinary characters and a set of special characters:

The dash operator (-) in a-z means "match a range of characters or numbers". The "a" and "z" are merely an example. It could also be 0-9, 5-8, F-M, etc.

Period (.) means "match any character"

A backslash (\) is used as an escape character to match characters that would otherwise be interpreted as a regular-expression operator

The pipe character (|) signifies a logical "or" between alternative expressions, and is often combined with parentheses, such as (Datsun|Toyota)

Square brackets ([]) denote a set of allowable characters to use in matching, such as [a-zA-Z0-9] for all alphanumeric characters
The function regexm(string, "[0-9]") below evaluates string according to the regular expression "[0-9]", which says that a digit (from 0 to 9) is present in the text.

regexm("a", "[0-9]") gives 0 since a character is not a digit

regexm("abc", "[0-9]") gives 0 since text is not a digit

regexm("4", "[0-9]") gives 1 since 4 is a digit

regexm("4 ", "[0-9]") gives 1 since 4 + space contains a digit

regexm("44", "[0-9]") gives 1 since 44 has 2 digits
More special regular expression characters are:

Asterisk (*) means "match zero or more" of the preceding expression

Plus sign (+) means "match one or more" of the preceding expression

When a caret (^) is placed at the beginning of a regular expression, it means "match expression at beginning of string". This character can be thought of as an "anchor" character since it does not directly match a character, only the location of the match

When the dollar sign ($) is placed at the end of a regular expression, it means "match expression at end of string". This is the other anchor character

regexm("a", "^[0-9]+$") gives 0 since a character is not a digit

regexm("abc", "^[0-9]+$") gives 0 since text is not a digit

regexm("4", "^[0-9]+$") gives 1 since 4 is a digit

regexm("4 ", "^[0-9]+$") gives 0 since 4 + space is not one or more digits

regexm("44", "^[0-9]+$") gives 1 since 44 has 2 digits

regexm("123.44", "^[0-9]+$") gives 0 since dot "." is not a digit
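To accept a decimal number such as 123.44, the dot has to be matched literally, which is where the backslash rule comes in. The lines below are a minimal sketch; the test strings are chosen for illustration only.

```stata
* Escaped dot: digits, one literal ".", then digits
display regexm("123.44", "^[0-9]+\.[0-9]+$")

* No dot in the string, so the escaped pattern does not match
display regexm("12344", "^[0-9]+\.[0-9]+$")

* Unescaped dot matches any character, so a digit is accepted in its place
display regexm("12344", "^[0-9]+.[0-9]+$")
```

The three commands should display 1, 0 and 1. The last line shows why forgetting the backslash can let unwanted strings through.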
Examples
How to test a regular expression in Stata
It is important to test regular expressions before full-scale usage.
The easiest way to do so is to use the command display:
display regexm("This test will return a 1", "t[eo]")
1
display regexm("This will return a 0", "t[eo]")
0
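Several candidate strings can be checked in one go with a loop. This is only a sketch: the four test words are invented, and the local macro name s is arbitrary.

```stata
* Print each test word followed by the result of regexm()
foreach s in "test" "Total" "photo" "THEORY" {
    display "`s'" _column(12) regexm("`s'", "t[eo]")
}
```

Only "test" and "photo" contain a lowercase t followed by e or o, so they should show 1. "Total" and "THEORY" should show 0, because matching is case sensitive.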
How to use regexm and regexs to generate a grouping variable
We use the auto data:
sysuse auto, clear
List the make of cars containing either of the strings Datsun, Pont or Toyota:
list make if regexm(make, "Datsun|Pont|Toyota")
     +-------------------+
     | make              |
     |-------------------|
 47. | Pont. Catalina    |
 48. | Pont. Firebird    |
 49. | Pont. Grand Prix  |
 50. | Pont. Le Mans     |
 51. | Pont. Phoenix     |
     |-------------------|
 52. | Pont. Sunbird     |
 56. | Datsun 200        |
 57. | Datsun 210        |
 58. | Datsun 510        |
 59. | Datsun 810        |
     |-------------------|
 67. | Toyota Celica     |
 68. | Toyota Corolla    |
 69. | Toyota Corona     |
     +-------------------+
Define a grouping variable for the strings Datsun, Pont or Toyota. Note that what is in parentheses () can be extracted by the function regexs with an integer between 1 and 9 as argument:
generate grp = regexs(1) if regexm(make, "(Datsun|Pont|Toyota)")
And the result is:
list make grp if regexm(make, "Datsun|Pont|Toyota")
     +----------------------------+
     | make               grp     |
     |----------------------------|
 47. | Pont. Catalina     Pont    |
 48. | Pont. Firebird     Pont    |
 49. | Pont. Grand Prix   Pont    |
 50. | Pont. Le Mans      Pont    |
 51. | Pont. Phoenix      Pont    |
     |----------------------------|
 52. | Pont. Sunbird      Pont    |
 56. | Datsun 200         Datsun  |
 57. | Datsun 210         Datsun  |
 58. | Datsun 510         Datsun  |
 59. | Datsun 810         Datsun  |
     |----------------------------|
 67. | Toyota Celica      Toyota  |
 68. | Toyota Corolla     Toyota  |
 69. | Toyota Corona      Toyota  |
     +----------------------------+
The variable grp is set to missing if the make does not match one of the 3 strings Datsun, Pont, or Toyota:
codebook grp
grp                                                      (unlabeled)

                  Type: String (str6)

         Unique values: 3                    Missing "": 61/74

            Tabulation: Freq.  Value
                           61  ""
                            4  "Datsun"
                            6  "Pont"
                            3  "Toyota"
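How to standardise strings by using the regex replace function regexr
All variants of the first name below must be changed to Niels Henrik:
list
     +----------------------------+
     | name           age   sex   |
     |----------------------------|
  1. | nh Bruun        52   male  |
  2. | Henrik Bruun    52   male  |
  3. | henrik Bruun    52   male  |
     +----------------------------+
The solution:
replace name = regexr(name, "nh|[Hh]enrik", "Niels Henrik")
And the new data are:
list
     +----------------------------------+
     | name                 age   sex   |
     |----------------------------------|
  1. | Niels Henrik Bruun    52   male  |
  2. | Niels Henrik Bruun    52   male  |
  3. | Niels Henrik Bruun    52   male  |
     +----------------------------------+
Note that spaces matter. With a space on each side of the pipe, the first alternative is "nh " (with a trailing space) and the second is " [Hh]enrik" (with a leading space):
replace name = regexr(name, "nh | [Hh]enrik", "Niels Henrik")
(1 real change made)
And now the changes are:
list
     +---------------------------------+
     | name                age   sex   |
     |---------------------------------|
  1. | Niels HenrikBruun    52   male  |
  2. | Henrik Bruun         52   male  |
  3. | henrik Bruun         52   male  |
     +---------------------------------+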
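A more defensive variant anchors the pattern at the start of the string and puts the space inside both the pattern and the replacement. The sketch below is an assumption about how one might tighten the rule, not part of the original solution. The fourth name, Sunhild Bruun, is a hypothetical row added to show that an "nh" in the middle of a name is left alone.

```stata
* Rebuild the example names plus one hypothetical name containing "nh"
clear
input str30 name
"nh Bruun"
"Henrik Bruun"
"henrik Bruun"
"Sunhild Bruun"
end

* Match only at the start of the string, followed by a space
replace name = regexr(name, "^(nh|[Hh]enrik) ", "Niels Henrik ")

list
```

The first three rows should become Niels Henrik Bruun, while Sunhild Bruun is unchanged. With the unanchored pattern "nh|[Hh]enrik" the "nh" inside Sunhild would have been replaced as well.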
Grouping strings, education at Denmark Statistics, into years of education
The variable AFSP4E must be transformed into another variable edu_time by the rules:
 If AFSP4E starts with 0, 1, 2 it must be "<=10 year"
 If AFSP4E starts with 3, 4 or 5B it must be ">10 year & <=15 year"
 If AFSP4E starts with 5A it must be ">15 year"
Here are some example values:
list
     +----------+
     | AFSP4E   |
     |----------|
  1. | 0C525000 |
  2. | 1C525000 |
  3. | 2C525000 |
  4. | 3A525000 |
  5. | 3B525000 |
     |----------|
  6. | 3C525000 |
  7. | 4A525000 |
  8. | 4B525000 |
  9. | 4C525000 |
 10. | 5A525000 |
     |----------|
 11. | 5B525000 |
     +----------+
And the code could be (capture is added so that the command does not stop a do-file under Stata version 12):
capture generate str edu_time = "<=10 year" * regexm(AFSP4E, "^[0-2]") ///
+ ">10 year & <=15 year" * regexm(AFSP4E, "^[34]|^5B") ///
+ ">15 year" * regexm(AFSP4E, "^5A")
And the result:
list
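     +---------------------------------+
     | AFSP4E     edu_time             |
     |---------------------------------|
  1. | 0C525000   <=10 year            |
  2. | 1C525000   <=10 year            |
  3. | 2C525000   <=10 year            |
  4. | 3A525000   >10 year & <=15 year |
  5. | 3B525000   >10 year & <=15 year |
     |---------------------------------|
  6. | 3C525000   >10 year & <=15 year |
  7. | 4A525000   >10 year & <=15 year |
  8. | 4B525000   >10 year & <=15 year |
  9. | 4C525000   >10 year & <=15 year |
 10. | 5A525000   >15 year             |
     |---------------------------------|
 11. | 5B525000   >10 year & <=15 year |
     +---------------------------------+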
capture drop edu_time
Another version of the code could be:
generate edu_time = 1 * regexm(AFSP4E, "^[0-2]") ///
+ 2 * regexm(AFSP4E, "^[34]|^5B") ///
+ 3 * regexm(AFSP4E, "^5A")
combined with:
label define edu_time 1 "<=10 year" 2 ">10 year & <=15 year" 3 ">15 year"
label values edu_time edu_time
And the result is the same (almost):
list
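     +---------------------------------+
     | AFSP4E     edu_time             |
     |---------------------------------|
  1. | 0C525000   <=10 year            |
  2. | 1C525000   <=10 year            |
  3. | 2C525000   <=10 year            |
  4. | 3A525000   >10 year & <=15 year |
  5. | 3B525000   >10 year & <=15 year |
     |---------------------------------|
  6. | 3C525000   >10 year & <=15 year |
  7. | 4A525000   >10 year & <=15 year |
  8. | 4B525000   >10 year & <=15 year |
  9. | 4C525000   >10 year & <=15 year |
 10. | 5A525000   >15 year             |
     |---------------------------------|
 11. | 5B525000   >10 year & <=15 year |
     +---------------------------------+
Now the variable edu_time is a labeled numeric variable, which is the preferred form in Stata.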
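In the numeric version, a value that matches none of the three patterns ends up as 0 rather than missing, because all three regexm() calls return 0. The sketch below shows one way to handle this, assuming that unmatched codes should be missing. The code 9X525000 is a hypothetical value invented to trigger the case.

```stata
* Three example codes; the last one is hypothetical and matches no rule
clear
input str8 AFSP4E
"0C525000"
"5A525000"
"9X525000"
end

generate edu_time = 1 * regexm(AFSP4E, "^[0-2]") ///
    + 2 * regexm(AFSP4E, "^[34]|^5B") ///
    + 3 * regexm(AFSP4E, "^5A")

* A sum of 0 means that no pattern matched
replace edu_time = . if edu_time == 0

list
```

After the replace, the rows should hold 1, 3 and missing.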
Grouping numbers, social group at Denmark Statistics
Now assume that a numeric variable SOCIO02 has to be grouped into a new variable employment by the 3 leading digits. The grouping is:
1. 111 112 113 114 120 131 132 133 134 135 139 310
2. 210 220 321 330
3. 410
First, the variable must be a string variable in order to be handled by regexm.
Second, note that "^111-114" is not a regex for 111, 112, 113 and 114. That is how one would write a range of numbers, but here we handle strings. The strings 111, 112, 113 and 114 all start with "11" followed by 1, 2, 3 or 4, which in regex is "^11[1-4]".
Third, note that the regex "^11[1-38]" selects strings starting with 111, 112, 113 and 118, since the square brackets mean 1-3 or 8, and 1-3 means 1, 2 or 3.
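The character-set reading described above can be checked quickly with display. This is a small sketch; the five codes are picked for illustration.

```stata
* Test which three-digit codes are selected by "^11[1-38]"
foreach code in 111 112 113 114 118 {
    display "`code'" _column(8) regexm("`code'", "^11[1-38]")
}
```

The codes 111, 112, 113 and 118 should show 1, while 114 should show 0 because 4 is neither in the range 1-3 nor equal to 8.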
A solution is shown below:
generate employment = 1 if regexm(string(SOCIO02), "^11[1-4]|^120|^13[1-59]|^310")
replace employment = 2 if regexm(string(SOCIO02), "^210|^220|^321|^330")
replace employment = 3 if regexm(string(SOCIO02), "^410")
And the result is:
sort employment SOCIO02
list employment SOCIO02
     +--------------------+
     | employ~t   SOCIO02 |
     |--------------------|
  1. |        1       111 |
  2. |        1       112 |
  3. |        1       113 |
  4. |        1       114 |
  5. |        1       120 |
     |--------------------|
  6. |        1       131 |
  7. |        1       132 |
  8. |        1       133 |
  9. |        1       134 |
 10. |        1       135 |
     |--------------------|
 11. |        1       139 |
 12. |        1       310 |
 13. |        2       210 |
 14. |        2       220 |
 15. |        2       321 |
     |--------------------|
 16. |        2       330 |
 17. |        3       410 |
 18. |        .       115 |
 19. |        .       118 |
     +--------------------+
Getting the birthday from a Danish social security number and more on testing
A Danish social security number consists of 10 digits. The first 2 digits are the day of birth, the next 2 digits are the month of birth and the next 2 are the last 2 digits of the year of birth.
First generate a sample set to test the regular expressions:
clear
input str10 dksecnum
2305123456
1210728998
121223
end
list
     +------------+
     | dksecnum   |
     |------------|
  1. | 2305123456 |
  2. | 1210728998 |
  3. | 121223     |
     +------------+
To get birth dates from dksecnum (when it has proper values, i.e. 10 digits) simply do:
generate bday = mdy(real(regexs(2)), real(regexs(1)), 1900 + real(regexs(3))) ///
if regexm(dksecnum, "^([0-9][0-9])([0-9][0-9])([0-9][0-9])[0-9][0-9][0-9][0-9]")
format %tdCCYYNNDD bday
list
     +-----------------------+
     | dksecnum         bday |
     |-----------------------|
  1. | 2305123456   19120523 |
  2. | 1210728998   19721012 |
  3. | 121223              . |
     +-----------------------+
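A string can have the right shape and still contain an impossible date, in which case mdy() returns missing. The sketch below flags such rows. The second number, 3102721234 (31 February), is a hypothetical value, and the variable name bad_date is made up for the illustration.

```stata
* One valid number and one hypothetical number with an impossible date
clear
input str10 dksecnum
"2305123456"
"3102721234"
end

generate bday = mdy(real(regexs(2)), real(regexs(1)), 1900 + real(regexs(3))) ///
    if regexm(dksecnum, "^([0-9][0-9])([0-9][0-9])([0-9][0-9])[0-9][0-9][0-9][0-9]$")
format %tdCCYYNNDD bday

* All characters are digits, yet no valid date could be built
generate byte bad_date = regexm(dksecnum, "^[0-9]+$") & missing(bday)

list
```

The first row should get a date and bad_date = 0. The second row matches the pattern, but mdy(2, 31, 1972) is missing, so bad_date = 1.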
References
 Stata: What are regular expressions and how can I use them in Stata?
 UCLA: How can I extract a portion of a string variable using regular expressions?
 Rose Anne Medeiros: Using regular expressions for data management in Stata
Last update: 20220418, Stata version 17