Extracting substrings with Stata regular expressions

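A forum member ran into a string-extraction problem at work: pull out the substring between the characters "\" and ".", then extract several fields from it. For example, given the path "D:/项目/2022年/项目工作/区县报表/更新纠错后\九龙区-2022年10月-二次供水.xlsx", the substring wanted is "九龙区-2022年10月-二次供水". From that substring, the district name (九龙), the year (2022), the month (10), and the water sample type (二次供水, secondary water supply) are then to be extracted.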

The following Stata commands generate the demonstration data:

clear
input str100 路径名称
"D:/项目/2022年/项目工作/区县报表/更新纠错后\九龙区-2022年10月-二次供水.xlsx"
"D:/项目/2022年/项目工作/区县报表/更新纠错后\九龙区-2022年10月-二次供水.xlsx"
"D:/项目/2022年/项目工作/区县报表/更新纠错后\九龙区-2022年10月-二次供水.xlsx"
"D:/项目/2022年/项目工作/区县报表/更新纠错后\九龙区-2022年10月-末梢水.xlsx"
"D:/项目/2022年/项目工作/区县报表/更新纠错后\九龙区-2022年10月-末梢水.xlsx"
"D:/项目/2022年/项目工作/区县报表/更新纠错后\九龙区-2022年10月-末梢水.xlsx"
end
list

The following Stata commands use regular expressions to extract the strings:

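// Stata Unicode regular expressions: ustrregexm() tests for a match, ustrregexs(n) returns the n-th captured group
// File name without extension: everything between "后\" and the last "."
gen 欲提取 = ustrregexs(1) if ustrregexm(路径名称, "后\\(.*)[.]")
// The remaining fields are taken from 欲提取 rather than the full path,
// so that the folder name "2022年" can never be matched in place of the file name
gen 区名 = ustrregexs(1) if ustrregexm(欲提取, "^(.*?)区")
gen 年份 = ustrregexs(1) if ustrregexm(欲提取, "([0-9]{4})年")
// {1,2} also accepts single-digit months such as "9月"
gen 月份 = ustrregexs(1) if ustrregexm(欲提取, "([0-9]{1,2})月")
gen 水样类型 = ustrregexs(1) if ustrregexm(欲提取, "月-(.*)")
list 路径名称 欲提取 区名 年份 月份 水样类型 in 1/5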
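
Because the extracted file name is hyphen-delimited, the fields can also be separated without one regular expression per field. The following is only a minimal sketch, assuming 欲提取 has been created by the commands above and always has the form district-yearmonth-type. The variable names 部分1 to 部分3, 年份数值 and 月份数值 are hypothetical and not part of the original post. ustrregexm() and ustrregexs() require Stata 14 or later.

```stata
* Sketch only: split the extracted file name at each hyphen
* (assumes exactly the pattern 区名-年月-水样类型, e.g. "九龙区-2022年10月-二次供水")
split 欲提取, parse("-") generate(部分)

* 部分1 holds the district with its "区" suffix, 部分2 the year and month,
* 部分3 the water sample type

* Convert year and month to numeric variables so they can be sorted or compared
gen 年份数值 = real(ustrregexs(1)) if ustrregexm(部分2, "([0-9]+)年")
gen 月份数值 = real(ustrregexs(1)) if ustrregexm(部分2, "年([0-9]+)月")

list 部分1 部分3 年份数值 月份数值 in 1/5
```

This variant depends on the hyphen never appearing inside a field. If a district or sample type could itself contain "-", the regular-expression approach is the safer choice.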

Resources on regular expressions:

What are regular expressions and how can I use them in Stata? https://www.stata.com/support/faqs/data-management/regular-expressions
How Can I Extract a Portion Of a String Variable Using Regular Expressions? | Stata FAQ. https://stats.oarc.ucla.edu/stata/faq/how-can-i-extract-a-portion-of-a-string-variable-using-regular-expressions
Stata Regular Expressions – An Introduction. https://www.techtips.surveydesign.com.au/post/stata-regular-expressions-an-introduction
Regular Expressions in Stata Cheat Sheet. https://jamesthomas.uk/pdf/Regular_expressions_cheat_sheet.pdf

End. To discuss this article, please visit: https://www.epiman.cn/thread-260866-1-1.html