1 Preparation

Load the packages
library(tidyverse)
library(stringr)

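1.1 Introduction to stringr

stringr is designed as a consistent, simple, and easy-to-use set of string tools. Its functions and arguments follow the same conventions throughout; for example, NA and zero-length vectors are handled the same way in every function.
String processing is not the headline feature of R, but it is indispensable: data cleaning, visualization, and many other tasks depend on it. The string functions in base R have accumulated over time and grown inconsistent, with irregular names and non-standard argument definitions that make them hard to pick up at a glance. Many other languages make string handling convenient, and base R lags behind in this respect. stringr addresses the problem by offering a simple, friendly interface for string operations.

stringr project page: https://cran.r-project.org/web/packages/stringr/index.html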

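1.2 Categories of stringr functions

1.2.1 String concatenation functions
str_c: concatenates strings.
str_join: concatenates strings; a legacy alias of str_c from early stringr versions, so prefer str_c.
str_trim: removes leading and trailing whitespace, including spaces and tabs (\t).
str_pad: pads a string to a given width.
str_dup: duplicates a string.
str_wrap: wraps text into formatted paragraphs.
str_sub: extracts a substring.
str_sub<-: replaces a substring by assignment, using the same positions as str_sub.

1.2.2 String calculation functions
str_count: counts the matches in a string.
str_length: returns the length of a string.
str_sort: sorts strings by value.
str_order: returns the sort order as indices, using the same rules as str_sort.

1.2.3 String matching functions
str_split: splits a string.
str_split_fixed: splits a string into a fixed number of pieces and returns a matrix.
str_subset: returns the strings that match a pattern.
word: extracts words from text.
str_detect: checks whether a string matches a pattern.
str_match: extracts the groups of the first match.
str_match_all: extracts the groups of every match.
str_replace: replaces the first match.
str_replace_all: replaces every match.
str_replace_na: turns NA into the string "NA".
str_locate: finds the position of the first match.
str_locate_all: finds the positions of every match.
str_extract: extracts the text of the first match.
str_extract_all: extracts the text of every match.

1.2.4 String transformation functions
str_conv: converts the character encoding.
str_to_upper: converts a string to upper case.
str_to_lower: converts a string to lower case, with the same rules as str_to_upper.
str_to_title: capitalizes the first letter of each word, with the same rules as str_to_upper.

1.2.5 Pattern-control functions; these only build arguments for other functions and are not used on their own
boundary: matches boundaries (character, word, line, sentence).
coll: matches using standard collation rules.
fixed: matches literal characters, so regex metacharacters and escapes lose their special meaning.
regex: defines a regular expression.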

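1.3 Important functions in the `stringr` package

Function	Description	Base R equivalent
Functions that use regular expressions
`str_extract()`	Extract the first match of a pattern	`regmatches()`
`str_extract_all()`	Extract every match of a pattern	`regmatches()`
`str_locate()`	Return the position of the first match	`regexpr()`
`str_locate_all()`	Return the positions of every match	`gregexpr()`
`str_replace()`	Replace the first match	`sub()`
`str_replace_all()`	Replace every match	`gsub()`
`str_split()`	Split a string at a pattern	`strsplit()`
`str_split_fixed()`	Split a string at a pattern into a fixed number of pieces	-
`str_detect()`	Detect whether a string contains a pattern	`grepl()`
`str_count()`	Count the occurrences of a pattern	-
Other important functions
`str_sub()`	Extract the characters at given positions	`substr()`
`str_dup()`	Duplicate a string a given number of times	`strrep()`
`str_length()`	Return the length of a string	`nchar()`
`str_pad()`	Pad a string	-
`str_trim()`	Remove padding, such as leading and trailing whitespace	`trimws()`
`str_c()`	Concatenate strings	`paste()`, `paste0()`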

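1.4 Special characters

The regex metacharacters ., ^, $, *, +, ?, [, ], (, ), {, }, | and \ must be escaped with \ to be matched literally. The forward slash / has no special meaning and needs no escape.
Run ?'"' or ?"'" to open the help page that lists every escape sequence allowed in a string literal. To match a newline, use the string "\n" as the regular expression.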

?"'"
?'"'

# \n     newline
# \r     carriage return
# \t     tab
# \b     backspace
# \a     alert (bell)
# \f     form feed
# \v     vertical tab
# \\     backslash \
# \'     ASCII apostrophe '
# \"     ASCII quotation mark "
# \`     ASCII grave accent (backtick) `
# \nnn   character with given octal code (1, 2 or 3 digits)
# \xnn   character with given hex code (1 or 2 hex digits)
# \unnnn     Unicode character with given code (1--4 hex digits)
# \Unnnnnnnn     Unicode character with given code (1--8 hex digits)

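2 String basics

# Strings can be created with either single or double quotes. Unlike some other languages, R treats the two the same. Double quotes (") are recommended.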
string1 <- "This is a string"
string2 <- 'To put a "quote" inside a string, use single quotes'


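# If you forget the closing quote, the console shows the continuation prompt `+`. Press `Esc` and type the command again.

# To include a literal single or double quote in a string, escape it with `\`: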

(double_quote <- "\"" )
# [1] "\""
('"')
# [1] "\""

(single_quote <- '\'')
# [1] "'"
("'")
# [1] "'"

# To include a literal backslash in a string, write two backslashes: \\
(x <- c("\"", "\\"))
# [1] "\"" "\\"
writeLines(x)
# "
# \

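# The printed representation of a string is not the same as the string itself, because **the printed form shows the escapes**. To see the raw contents, use `writeLines()` as above. Non-ASCII characters can be written with a Unicode escape: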
x <- "\u00b5"
x
# [1] "µ"

2.1 String length: `str_length()`

str_length(c("a", "R for data science", NA))
# [1]  1 18 NA

2.3 Combining strings: `str_c()`

# To combine two or more strings, use str_c():
str_c("x", "y")
# [1] "xy"

str_c("x", "y", "z")
# [1] "xyz"

# Use the sep argument to control how the pieces are separated:
str_c("x", "y", sep = "_")
# [1] "x_y"

# As with most R functions, missing values are contagious. To have them appear as the string "NA", use str_replace_na():
x <- c("abc", NA)
x
# [1] "abc" NA

str_c("|-", x, "-|")
# [1] "|-abc-|" NA 

str_c("|-", str_replace_na(x), "-|")
# [1] "|-abc-|" "|-NA-|" 


# str_c() is vectorized: shorter vectors are recycled to the length of the longest one
str_c("prefix-", c("a", "b", "c"), "-suffix")
# [1] "prefix-a-suffix" "prefix-b-suffix" "prefix-c-suffix"
# To collapse a character vector into a single string, use the collapse argument:
str_c(c("x", "y", "z"), collapse = ", ")
# [1] "x, y, z"

2.4 Subsetting strings: `str_sub()`

str_sub() extracts part of a string. Besides the string itself, it takes start and end arguments that give the inclusive positions of the substring:

x <- c("Apple", "Banana", "Pear")
str_sub(x, 1, 3)
# [1] "App" "Ban" "Pea"

# Negative numbers count backwards from the end
str_sub(x, -3, -1)
# [1] "ple" "ana" "ear"

# str_sub() does not fail if the string is too short; it returns as many characters as it can:
str_sub("a", 1, 5)
# [1] "a"

# The assignment form of str_sub() modifies strings:
str_sub(x, 1, 1) <- str_to_lower(str_sub(x, 1, 1))
x
#> [1] "apple" "banana" "pear"
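
The assignment form combines naturally with the case-conversion functions. The sketch below is only an illustration: `capitalize_first()` is a hypothetical helper, not a stringr function, and it assumes plain ASCII input.

```r
library(stringr)

# Hypothetical helper (not part of stringr): upper-case only the first character.
capitalize_first <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

fruits <- c("apple", "banana", "pear")
capitalized <- capitalize_first(fruits)

# Negative positions count from the end, so this keeps the last two characters.
tails <- str_sub(fruits, -2, -1)

print(capitalized)
print(tails)
```

Because `str_sub<-` modifies a copy inside the function, the original `fruits` vector is left unchanged.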

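2.5 Locales

# str_to_lower(), str_to_upper(), and str_to_title() convert the case of text
# Case conversion is more complicated than it looks, because different languages have different rules
# Turkish has two i's, with and without a dot, and they capitalize differently: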
str_to_upper(c("i", "ı"))
#> [1] "I" "I"
str_to_upper(c("i", "ı"), locale = "tr")
#> [1] "İ" "I"
x <- c("apple", "eggplant", "banana")
# English
str_sort(x, locale = "en")
#> [1] "apple" "banana" "eggplant"
# Hawaiian
str_sort(x, locale = "haw") 
#> [1] "apple" "eggplant" "banana"

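2.6 Exercises

2.6.1 In code that doesn't use stringr, you'll often see paste() and paste0(). What is the difference between the two functions? Which stringr functions are they equivalent to? How do the functions differ in their handling of NA?

Usage
paste (..., sep = " ", collapse = NULL)
paste0(..., collapse = NULL)
paste0() behaves like str_c(); paste() behaves like str_c(..., sep = " ")
Examples

# sep: the separator placed between the arguments being joined
# collapse: the separator used to join the elements of the result into one string

# paste() joins with a space " " by default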
paste("abc","def","ghi")
# [1] "abc def ghi"
paste("abc","def","ghi",sep = ".")
# [1] "abc.def.ghi"
# collapse has no effect here, because every argument has length 1
paste("abc","def","ghi",collapse = ".")
# [1] "abc def ghi"
paste(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"

# paste0() joins with no separator by default
paste0("abc","def","ghi")
# [1] "abcdefghi"
# paste0() has no sep argument, so sep = "." is treated as one more string to join
paste0("abc","def","ghi",sep = ".")
# [1] "abcdefghi."
# In paste0(), collapse has no effect when every argument has length 1
paste0("abc","def","ghi",collapse = ".")
# [1] "abcdefghi"
# In paste0(), collapse joins the elements of a vector
paste0(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"


# In str_c() NA is contagious; paste() and paste0() convert NA to the string "NA"
paste("abc","def","ghi",NA)
# [1] "abc def ghi NA"
paste0("abc","def","ghi",NA)
# [1] "abcdefghiNA"
str_c("abc","def","ghi",NA)
# [1] NA

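2.6.2 In your own words, describe the difference between the sep and collapse arguments to str_c().

# sep: the separator placed between the arguments being joined
# collapse: the separator used to join the elements of the result into one string
# str_c() joins with no separator by default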
str_c("abc","def","ghi",sep = ".")
# [1] "abc.def.ghi"
str_c("abc","def","ghi",collapse = ".")
# [1] "abcdefghi"
str_c(c("abc","def","ghi"),sep = ".")
# [1] "abc" "def" "ghi"
str_c(c("abc","def","ghi"),collapse = ".")
# [1] "abc.def.ghi"

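2.6.3 Use str_length() and str_sub() to extract the middle character from a string. What will you do if the string has an even number of characters?

# floor: round down to the largest integer not greater than the number
# ceiling: round up to the smallest integer not less than the number
# trunc: drop the fractional part
# round: round to a given number of decimal places
# signif: round to a given number of significant digits

x <- c("a", "abc", "abcd", "abcde", "abcdef")
# String lengths
(len <- str_length(x))
(m <- ceiling(len/2))
(n <- floor(len/2))
# Use the modulo operator "%%" to tell odd lengths from even ones; for an even length, return the two middle characters
ifelse (len%%2 !=0,str_sub(x, m, m),str_sub(x,n,n+1))
# if_else (len%%2 !=0,str_sub(x, m, m),str_sub(x,n,n+1))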
# [1] "a"  "b"  "bc" "c"  "cd"
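
The same idea can be wrapped in a function without a branch. This is a sketch under one assumption, namely that an even-length string should return its two middle characters; `middle_chars()` is a hypothetical name, not a stringr function.

```r
library(stringr)

# Hypothetical helper: return the middle character, or the two middle
# characters when the length is even.
middle_chars <- function(x) {
  len <- str_length(x)
  start <- ceiling(len / 2)   # odd: the middle position; even: the left middle
  end <- floor(len / 2) + 1   # odd: the same position; even: the right middle
  str_sub(x, start, end)
}

middles <- middle_chars(c("a", "abc", "abcd", "abcde", "abcdef"))
print(middles)
```

For an odd length, `start` and `end` coincide, so no `ifelse()` is needed.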

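2.6.4 What does str_wrap() do? When might you want to use it?

str_wrap() reflows text into paragraphs of a given width, which helps when long text has to fit a fixed-width display such as a console or a plot label.
str_wrap(string, width = 80, indent = 0, exdent = 0)

Argument	Description
string	Character vector of strings to reformat
width	Target line width in characters
indent	Indentation of the first line of each paragraph
exdent	Indentation of the following lines of each paragraph
Value	A character vector of reformatted strings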

thanks <- str_c(readLines(R.home("doc/THANKS")), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
cat(str_wrap(thanks, width = 40), "\n")
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")

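2.6.5 What does str_trim() do? What's the opposite of str_trim()?

# str_trim() removes whitespace from the start and end of a string; its opposite is str_pad(), which adds it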
str_trim("   ab cd  ")
# [1] "ab cd"
str_trim("   ab cd  ","both")
# [1] "ab cd"
str_trim("   ab cd  ","left")
# [1] "ab cd  "
str_trim("   ab cd  ","right")
# [1] "   ab cd"
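
For comparison, here is a short sketch of `str_pad()`, which adds padding, and `str_squish()`, which also collapses repeated internal whitespace. The widths and pad character are arbitrary example values.

```r
library(stringr)

# str_pad() pads on the left by default.
padded_left <- str_pad("ab", width = 5)
padded_right <- str_pad("ab", width = 5, side = "right")
padded_both <- str_pad("ab", width = 6, side = "both", pad = "*")

# str_trim() undoes the whitespace padding.
trimmed <- str_trim(padded_left)

# str_squish() trims both ends and reduces internal runs of spaces to one.
squished <- str_squish("   ab   cd  ")

print(c(padded_left, padded_right, padded_both, trimmed, squished))
```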

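2.6.6 Write a function that turns a character vector into a single string; for example, the vector c("a", "b", "c") becomes the string "a, b, and c". Think carefully about what it should do if given a vector of length 0, 1, or 2.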

str_commasep <- function(x, delim = ",") {
  n <- length(x)
  if (n == 0) {
    ""
  } else if (n == 1) {
    x
  } else if (n == 2) {
    # no comma before and when n == 2
    str_c(x[[1]], "and", x[[2]], sep = " ")
  } else {
    # commas after all n - 1 elements
    not_last <- str_c(x[seq_len(n - 1)], delim)
    # prepend "and" to the last element
    last <- str_c("and", x[[n]], sep = " ")
    # combine parts with spaces
    str_c(c(not_last, last), collapse = " ")
  }
}
str_commasep("")
#> [1] ""
str_commasep("a")
#> [1] "a"
str_commasep(c("a", "b"))
#> [1] "a and b"
str_commasep(c("a", "b", "c"))
#> [1] "a, b, and c"
str_commasep(c("a", "b", "c", "d"))
#> [1] "a, b, c, and d"

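3 Matching patterns with regular expressions

3.1 `str_view()` , `str_view_all()`

str_view() and str_view_all() are useful for learning regular expressions. Both take a character vector and a regular expression and show how they match.

str_view(): shows the first match
str_view_all(): shows all matches
Note that in stringr 1.5.0 and later, str_view() highlights every match and str_view_all() is deprecated.

3.1 Basic matches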

x <- c("apple", "banana", "pear")
str_view(x, "an")


# The wildcard "." matches any character (except a newline):
str_view(x, ".a.")

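Like strings, regular expressions use the backslash to strip characters of their special meaning. So to match a literal ., you need the regular expression \.. But \ is also the escape character in strings, so the regular expression \. has to be written as the string "\\.":
# Find "a.c" in the vector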
str_view(c("abc", "a.c", "bef"), "a\\.c")


# "\" is the escape character in regular expressions, so matching a literal "\" requires the regular expression \\
(x <- "a\\b")
# [1] "a\\b"
writeLines(x)
#> a\b

# Each "\" in that regular expression must itself be escaped in the string, so the pattern is written with four backslashes: "\\\\"
str_view(x, "\\\\")
#> a\b

# Here the backslash escapes the second ", so the string is never closed and the console shows the continuation prompt `+`
# a1 <- "\"

a2 <- "\\"
writeLines(a2)
# \

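# With an odd number of backslashes, the last one escapes the closing ", so the console again shows the continuation prompt `+`
# a3 <- "\\\"

# An even number of backslashes gives a complete string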
a4 <- "\\\\"
writeLines(a4)
# \\

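3.2 Exercises

3.2.1 Explain why each of these strings doesn't match a \: "\", "\\", "\\\".

With an odd number of backslashes ("\" and "\\\"), the last backslash escapes the closing double quote, so the string is never finished and the console shows the continuation prompt "+". "\\" is a complete string holding a single \, but as a regular expression a lone \ is an unfinished escape and raises an error.
To match a literal \, the regular expression is \\, and each of those backslashes must be escaped again in the string, so the pattern is written with four backslashes: "\\\\".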

3.2.2 How would you match the sequence "'\ ?

x <- "\"\'\\"
x
# [1] "\"'\\"
writeLines(x)
# "'\
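
The code above builds the string but does not yet match it. A minimal sketch of the matching step follows; the variable names `target` and `pattern` are only illustrative.

```r
library(stringr)

target <- "\"'\\"        # the three characters  " ' \
pattern <- "\"'\\\\"     # regex  "'\\  : the quotes are literal, the backslash is escaped

whole <- str_detect(target, pattern)
inside <- str_detect(c("say \"'\\ here", "no match"), pattern)

# fixed() compares literally, so the unescaped string can serve as the pattern.
literal <- str_detect(target, fixed(target))

print(c(whole, inside, literal))
```

Only the backslash needs escaping in the regular expression; the two quote characters have no special meaning there.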

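3.2.3 What patterns will the regular expression \..\..\.. match? How would you represent it as a string?

# It matches a dot followed by any character, three times over (e.g. ".a.b.c"). Each backslash must be doubled in the string:
y <- "\\..\\..\\.."
# With single backslashes the string is invalid: '\.' is an unrecognized escape in character string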

3.3 Anchors (`^`, `$`)

^ matches the start of the string.
$ matches the end of the string.

x <- c("apple", "banana", "pear")
str_view(x, "^a")
str_view(x, "a$")
str_view(x, "^apple$")

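3.4 Exercises

3.4.1 How would you match the literal string "$^$"?

# Assign a string that contains the sequence
y <- "???$^$**&&"
# The anchors "^" and "$" are metacharacters and must be escaped with "\", and each backslash must be escaped again in the string. To match the whole string and nothing else, anchor the pattern: "^\\$\\^\\$$"
str_view(y,"\\$\\^\\$")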

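3.4.2 Given the corpus of common words in stringr::words, create regular expressions that find all words that:

a. Start with "y".
b. End with "x".
c. Are exactly three letters long. (Don't cheat by using str_length()!)
d. Have seven letters or more.
Since this list is long, you might want to use the match argument to str_view() to show only the matching (match = TRUE) or non-matching (match = FALSE) words.

# Words that start with y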
str_view(words,"^y",match = T)
table(str_count(words,"^y"))
#  0   1 
#  974   6

# Words that end with x
str_view(words,"x$",match = T)
table(str_count(words,"x$"))
#  0   1 
#  976   4 

# Words that are exactly three characters long
str_view(words,"^(...)$",match = T)
table(str_count(words,"^(...)$"))
# 0   1 
# 870 110 
table(str_length(words))
# 1   2   3   4   5   6   7   8   9  10  11 
# 1  18 110 263 200 169 119  57  30   9   4 

# Words with seven or more characters: seven consecutive dots "." inside the parentheses
str_view(words,"(.......)",match = T)
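
With the quantifiers covered in section 3.7, the same length conditions can be written more compactly. The sketch below uses a small made-up vector, `sample_words`, so the result is easy to check by eye; the same patterns apply to `stringr::words`.

```r
library(stringr)

# A small illustrative vector (not the stringr::words corpus).
sample_words <- c("a", "cat", "yellow", "picture", "mailbox", "strength")

# Exactly three characters: anchor both ends around .{3}
three_exact <- str_subset(sample_words, "^.{3}$")

# Seven or more characters: .{7,}
seven_plus <- str_subset(sample_words, "^.{7,}$")

print(three_exact)
print(seven_plus)
```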

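3.5 Character classes and alternatives

# "." matches any character except a newline
# \d matches any digit
# \s matches any whitespace (e.g. space, tab, newline)
# [abc] matches a, b, or c
# [^abc] matches anything except a, b, or c
# "|" has low precedence, so abc|xyz matches abc or xyz
# Parentheses make the scope of "|" explicit: str_view(c("grey", "gray"), "gr(e|a)y")

To create a regular expression containing \d or \s, escape the \ in the string: type "\\d" or "\\s"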

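3.6 Exercises

3.6.1 Create regular expressions to find all words that:

a. Start with a vowel.
b. Only contain consonants. (Hint: think about matching "not"-vowels.)
c. End with ed, but not with eed.
d. End with ing or ize.

# Words that start with a vowel
str_view(words,"^[aoeiu]",match = T)

# Words that contain only consonants: every character from start to end must be a non-vowel
str_view(words,"^[^aoeiu]+$",match = T)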


# Words that end with ed, but not with eed
str_view(words,"([^e]ed)$",match = T)
str_view(words,"[^e]ed$",match = T)

# Words that end with ing or ize
str_view(words,"((ing)|(ize))$",match = T)

3.6.2 Empirically verify the rule "i before e except after c".

str_view(words,"((cei)|[^c]ie)",match = T)

3.6.3 Is "q" always followed by a "u"?

table(ifelse(str_detect(words,"qu"),"Yes","NO"))
# NO Yes 
# 970  10
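
Counting the words that contain "qu" does not by itself rule out a "q" followed by something else. A more direct check searches for the counterexample. The sketch below uses a made-up vector, `sample_words`; running the same pattern on `stringr::words` answers the exercise.

```r
library(stringr)

# Illustrative vector, chosen to include counterexamples.
sample_words <- c("quick", "equal", "Iraq", "qatar", "plaque")

# A q followed by anything other than u, or a q at the very end of the word.
q_without_u <- str_subset(sample_words, "q([^u]|$)")
print(q_without_u)

# The same check on the corpus: an empty result means q is always followed by u there.
str_subset(stringr::words, "q([^u]|$)")
```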

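3.6.4 Write a regular expression that matches a word if it's probably written in British English, not American English.

3.6.5 Create a regular expression that will match telephone numbers as commonly written in your country.

# Sample numbers
(x <- c("+86-18217047048","1223333333","10545384333"))
# Every metacharacter must be escaped: to match a literal "+" the regex is \+, and since \ must itself be escaped in the string, the pattern is written "\\+"; likewise "\\d"
# str_view(x,".86-1[89]\\d\\d\\d\\d\\d\\d\\d\\d\\d")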
str_view(x,"\\+86-1[89]\\d\\d\\d\\d\\d\\d\\d\\d\\d")
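
The nine repeated `\\d` can be shortened with a `{n}` quantifier, and anchors make the pattern reject longer strings. This is a sketch that keeps the original assumption that numbers start with "+86-" followed by 18 or 19; real numbering plans allow more prefixes.

```r
library(stringr)

numbers <- c("+86-18217047048", "1223333333", "10545384333")

# "+86-", then 1, then 8 or 9, then exactly nine more digits, and nothing else.
mobile_pattern <- "^\\+86-1[89]\\d{9}$"

is_mobile <- str_detect(numbers, mobile_pattern)
print(is_mobile)
```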

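3.7 Repetition

Another powerful feature of regular expressions is controlling how many times a pattern matches.

?: 0 or 1 times.
+: 1 or more times.
*: 0 or more times.
{n}: exactly n times.
{n,}: n or more times.
{0,m}: at most m times.
{n,m}: between n and m times.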

x <- "MDCCCCCCCLXXXCCCCLXXXVIII"

str_view(x, "CC?")


str_view(x, "CC+")


str_view_all(x, "CC+")


str_view(x, "C[LX]+")


str_view_all(x, "C[LX]+")


str_view(x, "C{2}")
str_view(x, "C{2,}")
str_view(x, "C{2,3}")
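# By default these matches are greedy: they match the longest string possible. Adding "?" after the quantifier makes them lazy, so they match the shortest string possible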
str_view(x, 'C{2,3}?')
str_view(x, 'C[LX]+?')

3.8 Exercises

3.8.1 Describe the equivalents of ?, +, and * in {m,n} form.

# `?`: 0 or 1 times
str_view("aaaaabbaabb","a?")
str_view("aaaaabbaabb","a{0,1}")

# `+`: 1 or more times
str_view("aaaaaabbaabb","a+")
str_view("aaaaaabbaabb","a{1,}")

# `*`: 0 or more times
str_view("aaaaaabbaabb","a*")
str_view("aaaaaabbaabb","a{0,}")

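3.8.2 Describe in words what these regular expressions match (read carefully to see whether each one is a regular expression or a string that defines a regular expression):
a. ^.*$
b. "\\{.+\\}"
c. \d{4}-\d{2}-\d{2}
d. "\\\\{4}"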
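
One way to work through this exercise is to try each pattern on a small input. In the sketch below all four patterns are written as R strings, so every regex backslash is doubled; the test inputs are made up for illustration.

```r
library(stringr)

# a. ^.*$ matches any string without a newline, including the empty string.
any_string <- str_detect(c("", "anything at all"), "^.*$")

# b. Regex \{.+\}: braces with at least one character between them.
braces <- str_extract("set {a, b} here", "\\{.+\\}")

# c. Regex \d{4}-\d{2}-\d{2}: four digits, two digits, two digits, dash-separated.
date_like <- str_extract("released 2024-01-15", "\\d{4}-\\d{2}-\\d{2}")

# d. Regex \\{4}: exactly four literal backslashes in a row.
four_slashes <- "a\\\\\\\\b"   # the characters a, four backslashes, b
has_four <- str_detect(four_slashes, "\\\\{4}")

print(list(any_string, braces, date_like, has_four))
```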

3.8.3 Create regular expressions to find all words that:

a. Start with three consonants.
b. Have three or more vowels in a row.
c. Have two or more vowel-consonant pairs in a row.

str_view(words,"^([^aoieu]{3})",match = T)
str_view(words,"[aoieu]{3,}",match = T)
str_view(words,"([aoieu][^aoieu]){2,}",match = T)

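3.9 Grouping and backreferences

Parentheses also define "groups", which you can refer back to with backreferences such as \1, \2, and so on.

# "\1" repeats whatever the group (..) matched; in a string it is written "\\1"
str_view(fruit, "(..)\\1", match = TRUE)
# Parentheses () create groups; \\n refers back to the n-th group
# In (..)\\1(...)(...)\\3, \\1 repeats the first group (..) and \\3 repeats the third group (...)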
str_view("abcdcdxyzefgefgefghijk", "(..)\\1", match = TRUE)
str_view("abcdcdxyzefgefgefghijk", "(..)\\1(...)(...)\\3", match = TRUE)

3.10 Exercises

3.10.1 Describe, in words, what these expressions will match:

a. (.)\1\1
b. "(.)(.)\\2\\1"
c. (..)\1
d. "(.).\\1.\\1"
e. "(.)(.)(.).*\\3\\2\\1"
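
Trying each pattern on a short input makes the backreferences concrete. The inputs below are made up for illustration, and every pattern is written as an R string with doubled backslashes.

```r
library(stringr)

# a. The same character three times in a row.
triple <- str_extract("aaab", "(.)\\1\\1")

# b. Two characters followed by the same two in reverse order.
mirror <- str_extract("abba", "(.)(.)\\2\\1")

# c. Any two characters, repeated.
pair <- str_extract("banana", "(..)\\1")

# d. A character that recurs at every second position, three times.
alternating <- str_extract("abaca", "(.).\\1.\\1")

# e. Three characters, anything, then the same three reversed.
reversed <- str_extract("abc-xyz-cba", "(.)(.)(.).*\\3\\2\\1")

print(c(triple, mirror, pair, alternating, reversed))
```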

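3.10.2 Construct regular expressions to match words that:

a. Start and end with the same character.
b. Contain a repeated pair of letters (e.g. "church" contains "ch" repeated twice).
c. Contain one letter repeated in at least three places (e.g. "eleven" contains three "e"s).

# (\\1?$) handles words that consist of a single character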
str_view(words,"^([A-Za-z])((.*(\\1$))|(\\1?$))", match = TRUE)
str_view(words,"([A-Za-z][A-Za-z])(.*)\\1", match = TRUE)
str_view(words,"([A-Za-z])(.*)\\1(.*)\\1", match = TRUE)

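4 Tools

With regular expressions and the stringr functions, you can:

Determine which strings match a pattern;
Find the positions of matches;
Extract the content of matches;
Replace matches with new values;
Split a string based on a match.

4.1 Detect matches

str_detect()
To determine whether a character vector matches a pattern, use str_detect(). It returns a logical vector the same length as the input.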

x <- c("apple", "banana", "pear")
str_detect(x, "e")
#> [1] TRUE FALSE TRUE

# In a numeric context, FALSE becomes 0 and TRUE becomes 1, which makes sum() and mean() useful for summarizing matches across a large vector
# How many common words start with t?
sum(str_detect(words, "^t"))
#> [1] 65
# What proportion of common words end with a vowel?
mean(str_detect(words, "[aeiou]$"))
#> [1] 0.277

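# For complex logical conditions, it is often easier to combine several str_detect() calls with logical operators than to write a single regular expression
# Find all words containing at least one vowel, and negate
no_vowels_1 <- !str_detect(words, "[aeiou]")
# Find all words consisting only of consonants (non-vowels)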
no_vowels_2 <- str_detect(words, "^[^aeiou]+$")
identical(no_vowels_1, no_vowels_2)
#> [1] TRUE

str_subset()

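# A common use of str_detect() is to select the elements that match a pattern. You can do this with logical subsetting, or with the convenient str_subset() wrapper
# words[...] subsets the vector with a logical index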
words[str_detect(words, "x$")]
#> [1] "box" "sex" "six" "tax"
str_subset(words, "x$")
#> [1] "box" "sex" "six" "tax"

# Typically, however, your strings will be one column of a data frame, and you'll want to use filter() instead
df <- tibble(
  word = words,
  i = seq_along(word)
)
df %>%
  filter(str_detect(word, "x$"))
# # A tibble: 4 x 2
# word      i
# <chr> <int>
# 1 box     108
# 2 sex     747
# 3 six     772
# 4 tax     841

str_count()
A variation on str_detect(): rather than a simple yes or no, it tells you how many matches there are in a string:

x <- c("apple", "banana", "pear")
str_count(x, "a")
#> [1] 1 3 1
# On average, how many vowels per word?
mean(str_count(words, "[aeiou]"))
#> [1] 1.99
str_count() works naturally with mutate():
df %>%
  mutate(
    vowels = str_count(word, "[aeiou]"),
    consonants = str_count(word, "[^aeiou]")
  )

Many stringr functions come in pairs: one works on a single match and the other, with the suffix _all, works on all matches.

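4.2 Exercises

For each of the following challenges, try solving it both with a single regular expression and with a combination of multiple str_detect() calls.
a. Find all words that start or end with x.
b. Find all words that start with a vowel and end with a consonant.
c. Are there any words that contain at least one of each different vowel?
d. What word has the highest number of vowels? What word has the highest proportion of vowels? (Hint: what is the denominator?)

# Words that start or end with x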
words[str_detect(words,"(^x(.*))|((.*)x$)")]
str_subset(words,"(^x(.*))|((.*)x$)")

# Words that start with a vowel and end with a consonant
words[str_detect(words,"^[aoiue](.*)[^aoiue]$")]
words[(str_detect(words,"^[aoiue](.*)"))&(str_detect(words,"(.*)[^aoiue]$"))]
str_subset(words,"^[aoiue](.*)[^aoiue]$")

# Words that contain every vowel
words[str_detect(words, "a") &
        str_detect(words, "e") &
        str_detect(words, "i") &
        str_detect(words, "o") &
        str_detect(words, "u")
      ]
# character(0)

# Words with the highest number of vowels
words[
  which(
    str_count(words, "[aeiou]") == max(str_count(words, "[aeiou]"))
    )
  ]
# words[str_count(words, "[aeiou]") == max(str_count(words, "[aeiou]"))]
# [1] "appropriate" "associate"   "available"   "colleague"   "encourage"  
# [6] "experience"  "individual"  "television"
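
The second half of part d, the highest proportion of vowels, is not worked above. A possible sketch follows, taking the word length as the denominator. It runs on a small made-up vector, `sample_words`; substituting `stringr::words` answers the exercise.

```r
library(stringr)

# Illustrative vector (not the stringr::words corpus).
sample_words <- c("a", "area", "idea", "tree", "rhythm")

# Proportion of vowels = number of vowels / number of characters.
vowel_share <- str_count(sample_words, "[aeiou]") / str_length(sample_words)
names(vowel_share) <- sample_words

highest <- sample_words[vowel_share == max(vowel_share)]

print(vowel_share)
print(highest)
```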

4.3 Extract matches

To extract the actual text of a match, use str_extract().

stringr::sentences
length(sentences)
# [1] 720
head(sentences)
# [1] "The birch canoe slid on the smooth planks." 
# [2] "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well."     
# [4] "These days a chicken leg is a rare dish."   
# [5] "Rice is often served in round bowls."       
# [6] "The juice of lemons makes fine punch."

# Create a vector of colour names, then turn it into a single regular expression:

colors <- c("red", "orange", "yellow", "green", "blue", "purple")
color_match <- str_c(colors, collapse = "|")
color_match
# [1] "red|orange|yellow|green|blue|purple"
# Select the sentences that contain a colour, then extract the colour
has_color <- str_subset(sentences, color_match)
matches <- str_extract(has_color, color_match)
head(matches)
#> [1] "blue" "blue" "red" "red" "red" "blue"

more <- sentences[str_count(sentences,color_match)>1]
str_view_all(more,color_match)
str_extract(more,color_match)
# [1] "blue"   "green"  "orange"

str_extract_all(more,color_match)
# [[1]]
# [1] "blue" "red" 
# 
# [[2]]
# [1] "green" "red"  
# 
# [[3]]
# [1] "orange" "red"

# With simplify = TRUE, str_extract_all() returns a matrix in which short matches are expanded to the same length as the longest:
str_extract_all(more, color_match, simplify = TRUE)
#> [,1] [,2]
#> [1,] "blue" "red"
#> [2,] "green" "red"
#> [3,] "orange" "red"
x <- c("a", "a b", "a b c")

str_extract_all(x, "[a-z]", simplify = TRUE)
#> [,1] [,2] [,3]
#> [1,] "a" "" ""
#> [2,] "a" "b" ""
#> [3,] "a" "b" "c"

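4.4 Exercises

4.4.1 In the previous example, you might have noticed that the regular expression matched "flickered", which is not a colour. Modify the regex to fix the problem.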

colors <- c(
  "red", "orange", "yellow", "green", "blue", "purple"
)
color_match <- str_c(colors, collapse = "|")
color_match
# [1] "red|orange|yellow|green|blue|purple"
# Fix
## As noted in the discussion of anchors, \b matches the boundary between words.
color_match2 <- str_c("\\b(", str_c(colors, collapse = "|"), ")\\b")
color_match2
# [1] "\\b(red|orange|yellow|green|blue|purple)\\b"
more2 <- sentences[str_count(sentences, color_match2) > 1]
str_view_all(more2, color_match2, match = TRUE)

4.4.2 From the Harvard sentences data, extract:

a. The first word from each sentence.
b. All words ending in ing.
c. All plurals.

# The first word of each sentence
str_extract(sentences, "[A-Za-z]+") %>% 
head()
# [1] "The"   "Glue"  "It"    "These" "Rice"  "The" 

# All words ending in ing.
pattern <- "\\b[A-Za-z]+ing\\b"
sentences_with_ing <- str_detect(sentences, pattern)
unique(unlist(str_extract_all(sentences[sentences_with_ing], pattern)))
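
Part c has no exact regex solution, because English plurals are irregular. One rough heuristic, offered only as an assumption and not as the intended answer, is to take words of more than three letters that end in "s". The sketch below runs on made-up sentences to show both the hits and the mistakes.

```r
library(stringr)

# Made-up sentences chosen to expose the limits of the heuristic.
text <- c("The cats chase mice.", "This bus is late.", "Dogs and boxes.")

# Heuristic: at least three letters followed by a final "s".
plural_like <- "\\b[A-Za-z]{3,}s\\b"

candidates <- str_extract_all(text, plural_like)
print(candidates)
```

The heuristic finds "cats", "Dogs", and "boxes", but it also picks up "This" and misses the irregular plural "mice", so the results need manual review.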

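4.5 Grouped matches

Parentheses clarify precedence and create groups that can be referenced with backreferences while matching. They can also be used to extract the parts of a complex match.

# Find every word that comes after "a" or "the"
# Defining a "word" in a regular expression is a little tricky, so a simple approximation is used here: a sequence of at least one character that isn't a space
noun <- "(a|the) ([^ ]+)"
has_noun <- sentences %>%
  str_subset(noun) %>%
  head(10)
has_noun %>%
  str_extract(noun)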
#> [1] "the smooth" "the sheet" "the depth" "a chicken"
#> [5] "the parked" "the sun" "the huge" "the ball"
#> [9] "the woman" "a helps"

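str_extract() gives the complete match;
str_match() gives each individual group. Instead of a character vector, it returns a matrix, with one column for the complete match followed by one column for each group:

str_match(has_noun, noun)
# or has_noun %>% str_match(noun)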
#       [,1]         [,2]  [,3]     
# [1,] "the smooth" "the" "smooth" 
# [2,] "the sheet"  "the" "sheet"  
# [3,] "the depth"  "the" "depth"  
# [4,] "a chicken"  "a"   "chicken"
# [5,] "the parked" "the" "parked" 
# [6,] "the sun"    "the" "sun"    
# [7,] "the huge"   "the" "huge"   
# [8,] "the ball"   "the" "ball"   
# [9,] "the woman"  "the" "woman"  
# [10,] "a helps"    "a"   "helps"

If your data is in a tibble, it's often easier to use tidyr::extract(). It works like str_match() but requires you to name the groups, which are then placed in new columns:

tibble(sentence = sentences) %>%
  tidyr::extract(
    sentence, c("article", "noun"), "(a|the) ([^ ]+)",
    remove = FALSE
  )
#> # A tibble: 720 × 3
#> sentence article noun
#> * <chr> <chr> <chr>
#> 1 The birch canoe slid on the smooth planks. the smooth
#> 2 Glue the sheet to the dark blue background. the sheet
#> 3 It's easy to tell the depth of a well. the depth
#> 4 These days a chicken leg is a rare dish. a chicken
#> 5 Rice is often served in round bowls. <NA> <NA>
#> 6 The juice of lemons makes fine punch. <NA> <NA>
#> # ... with 714 more rows

As with str_extract(), if you want all matches for each string, you'll need str_match_all().

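4.6 Exercises

4.6.1 Find all words that come after a "number" like "one", "two", "three" etc. Pull out both the number and the word.

# \b: word boundary
# \w: any word character
# \W: any non-word character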

numword <- "\\b(one|two|three|four|five|six|seven|eight|nine|ten) +(\\w+)"
sentences[str_detect(sentences, numword)] %>%  str_extract(numword)

4.6.2 Find all contractions. Separate out the pieces before and after the apostrophe.

contraction <- "([A-Za-z]+)'([A-Za-z]+)"
sentences[str_detect(sentences, contraction)] %>%
  str_extract(contraction) %>%
  str_split("'")
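
Since the pattern already contains two groups, `str_match()` can return the pieces directly without a second split. The sketch below uses made-up sentences; the column layout (full match, then one column per group) is the one described in section 4.5.

```r
library(stringr)

text <- c("It's easy to tell.", "We don't know.", "No apostrophe here.")
contraction <- "([A-Za-z]+)'([A-Za-z]+)"

# Column 1: full match; column 2: before the apostrophe; column 3: after it.
parts <- str_match(text, contraction)
print(parts)
```

Strings without a contraction produce a row of NA, so the result stays aligned with the input.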

4.7 Replacing matches

str_replace() and str_replace_all() replace matches with new strings. The simplest use is to replace a pattern with a fixed string:

x <- c("apple", "pear", "banana")
str_replace(x, "[aeiou]", "-")
#> [1] "-pple" "p-ar" "b-nana"
str_replace_all(x, "[aeiou]", "-")
#> [1] "-ppl-" "p--r" "b-n-n-"

With str_replace_all() you can perform multiple replacements at once by supplying a named vector:

x <- c("1 house", "2 cars", "3 people")
str_replace_all(x, c("1" = "one", "2" = "two", "3" = "three"))
#> [1] "one house" "two cars" "three people"

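Instead of replacing with a fixed string, you can use backreferences to insert components of the match. The following code flips the order of the second and third words:

sentences %>%
  str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") %>%
  head(5)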
#> [1] "The canoe birch slid on the smooth planks."
#> [2] "Glue sheet the to the dark blue background."
#> [3] "It's to easy tell the depth of a well."
#> [4] "These a days chicken leg is a rare dish."
#> [5] "Rice often is served in round bowls."

4.8 Exercises

4.8.1 Replace all forward slashes in a string with backslashes.

x <- c("a/b","a/b/c")
(str_replace_all(x,"/","\\\\"))
# [1] "a\\b"    "a\\b\\c"
writeLines(str_replace_all(x,"/","\\\\"))
# a\b
# a\b\c

4.8.2 Implement a simple version of str_to_lower() using str_replace_all().

LETTERS2letters <- letters
names(LETTERS2letters) <- LETTERS
str_replace_all(words, LETTERS2letters)

4.8.3 Switch the first and last letters in words. Which of those strings are still words?
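
One possible approach uses three groups and backreferences in the replacement, then intersects the result with the original vector. `swap_ends()` is a hypothetical helper, and the sketch runs on a small made-up vector; replacing `sample_words` with `stringr::words` answers the exercise.

```r
library(stringr)

# Hypothetical helper: swap the first and last letters of each word.
# Single-letter words do not match the pattern and are returned unchanged.
swap_ends <- function(x) {
  str_replace(x, "^([A-Za-z])(.*)([A-Za-z])$", "\\3\\2\\1")
}

sample_words <- c("dog", "god", "tops", "spot", "a", "level")
swapped <- swap_ends(sample_words)

# Swapped strings that also occur in the original vector.
still_words <- intersect(swapped, sample_words)

print(swapped)
print(still_words)
```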

4.9 Splitting

str_split() splits a string up into pieces.

# Split sentences into words
# Each element may contain a different number of pieces, so str_split() returns a list
y <- str_split(head(sentences,5)," ")
# [[1]]
# [1] "The"     "birch"   "canoe"   "slid"    "on"      "the"     "smooth" 
# [8] "planks."
# 
# [[2]]
# [1] "Glue"        "the"         "sheet"       "to"          "the"        
# [6] "dark"        "blue"        "background."
# 
# [[3]]
# [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well."
# 
# [[4]]
# [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"      
# [8] "rare"    "dish."  
# 
# [[5]]
# [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls."
y[[1]][1]
# [1] "The"

# If you're working with a length-1 vector, the easiest thing is to extract the first element of the list:
"a|b|c|d" %>%
  str_split("\\|") %>%
  .[[1]]
#> [1] "a" "b" "c" "d"

# Set simplify = TRUE to return a matrix
str_split(head(sentences,5)," ", simplify = TRUE)
#      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]          [,9]   
# [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks."     ""     
# [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background." ""     
# [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"           "well."
# [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"        "dish."
# [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls." ""            ""     



# You can also request a maximum number of pieces
str_split(head(sentences,5)," ", n=2,simplify = TRUE)
#      [,1]    [,2]                                    
# [1,] "The"   "birch canoe slid on the smooth planks."
# [2,] "Glue"  "the sheet to the dark blue background."
# [3,] "It's"  "easy to tell the depth of a well."     
# [4,] "These" "days a chicken leg is a rare dish."    
# [5,] "Rice"  "is often served in round bowls." 




# Besides patterns, strings can be split at character, line, sentence, and word boundaries with boundary()
# boundary(type = c("character", "line_break", "sentence", "word"),  skip_word_none = NA, ...)
str_split(head(sentences,5),boundary("word"))
# [[1]]
# [1] "The"    "birch"  "canoe"  "slid"   "on"     "the"    "smooth" "planks"
# 
# [[2]]
# [1] "Glue"       "the"        "sheet"      "to"         "the"        "dark"       "blue"      
# [8] "background"
# 
# [[3]]
# [1] "It's"  "easy"  "to"    "tell"  "the"   "depth" "of"    "a"     "well" 
# 
# [[4]]
# [1] "These"   "days"    "a"       "chicken" "leg"     "is"      "a"       "rare"    "dish"   
# 
# [[5]]
# [1] "Rice"   "is"     "often"  "served" "in"     "round"  "bowls" 
str_split(head(sentences,5),boundary("word"),simplify = TRUE)
#      [,1]    [,2]    [,3]    [,4]      [,5]  [,6]    [,7]     [,8]         [,9]  
# [1,] "The"   "birch" "canoe" "slid"    "on"  "the"   "smooth" "planks"     ""    
# [2,] "Glue"  "the"   "sheet" "to"      "the" "dark"  "blue"   "background" ""    
# [3,] "It's"  "easy"  "to"    "tell"    "the" "depth" "of"     "a"          "well"
# [4,] "These" "days"  "a"     "chicken" "leg" "is"    "a"      "rare"       "dish"
# [5,] "Rice"  "is"    "often" "served"  "in"  "round" "bowls"  ""           ""    
str_split(head(sentences,5),boundary("word"),n=2,simplify = TRUE)
#     [,1]    [,2]                                    
# [1,] "The"   "birch canoe slid on the smooth planks."
# [2,] "Glue"  "the sheet to the dark blue background."
# [3,] "It's"  "easy to tell the depth of a well."     
# [4,] "These" "days a chicken leg is a rare dish."    
# [5,] "Rice"  "is often served in round bowls."

sentences %>%
  head(2) %>%
  str_split(boundary("character"))
# [[1]]
# [1] "T" "h" "e" " " "b" "i" "r" "c" "h" " " "c" "a" "n" "o" "e" " " "s" "l" "i" "d" " " "o" "n" " " "t" "h" "e" " " "s" "m" "o" "o" "t"
# [34] "h" " " "p" "l" "a" "n" "k" "s" "."
# 
# [[2]]
# [1] "G" "l" "u" "e" " " "t" "h" "e" " " "s" "h" "e" "e" "t" " " "t" "o" " " "t" "h" "e" " " "d" "a" "r" "k" " " "b" "l" "u" "e" " " "b"
# [34] "a" "c" "k" "g" "r" "o" "u" "n" "d" "."
sentences %>%
  head(5) %>%
  str_split(boundary("line_break"))
# [[1]]
# [1] "The "    "birch "  "canoe "  "slid "   "on "     "the "    "smooth " "planks."
# 
# [[2]]
# [1] "Glue "       "the "        "sheet "      "to "         "the "        "dark "       "blue "       "background."
# 
# [[3]]
# [1] "It's "  "easy "  "to "    "tell "  "the "   "depth " "of "    "a "     "well." 
# 
# [[4]]
# [1] "These "   "days "    "a "       "chicken " "leg "     "is "      "a "       "rare "    "dish."   
# 
# [[5]]
# [1] "Rice "   "is "     "often "  "served " "in "     "round "  "bowls." 
sentences %>%
  head(5) %>%
  str_split(boundary("sentence"))
# [[1]]
# [1] "The birch canoe slid on the smooth planks."
# 
# [[2]]
# [1] "Glue the sheet to the dark blue background."
# 
# [[3]]
# [1] "It's easy to tell the depth of a well."
# 
# [[4]]
# [1] "These days a chicken leg is a rare dish."
# 
# [[5]]
# [1] "Rice is often served in round bowls."

4.10 Exercises

4.10.1 Split up a string like "apples, pears, and bananas" into individual components.

"apples, pears, and bananas" %>%
  str_split(boundary("character"))
# [[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a" "n" "a" "n" "a" "s"

"apples, pears, and bananas" %>%
  str_split(boundary("word"))
# [[1]]
# [1] "apples"  "pears"   "and"     "bananas"

"apples, pears, and bananas" %>%
  str_split(boundary("sentence"))
# [[1]]
# [1] "apples, pears, and bananas"

4.10.2 Why is it better to split up by boundary("word") than " "?

# Splitting on a space leaves the punctuation attached to the words
"apples, pears, and bananas" %>%
  str_split(" ")
# [[1]]
# [1] "apples," "pears,"  "and"     "bananas"

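4.10.3 What does splitting with an empty string ("") do? Experiment, and then read the documentation.

# Splitting on an empty string ("") splits the string into individual characters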
"apples, pears, and bananas" %>%
  str_split("")
# [[1]]
# [1] "a" "p" "p" "l" "e" "s" "," " " "p" "e" "a" "r" "s" "," " " "a" "n" "d" " " "b" "a" "n" "a" "n" "a" "s"

4.11 Find matches

str_locate() and str_locate_all() give the start and end positions of each match.
Use str_locate() to find the matching pattern, then str_sub() to extract or modify the match.

head(sentences,5)
# [1] "The birch canoe slid on the smooth planks."  "Glue the sheet to the dark blue background."
# [3] "It's easy to tell the depth of a well."      "These days a chicken leg is a rare dish."   
# [5] "Rice is often served in round bowls."       
str_locate(head(sentences,5),"days")
#       start end
# [1,]    NA  NA
# [2,]    NA  NA
# [3,]    NA  NA
# [4,]     7  10
# [5,]    NA  NA
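
The two-step workflow described above can be sketched as follows. The example passes the start and end columns of the position matrix straight to `str_sub()`; the two sentences are taken from the output shown earlier.

```r
library(stringr)

x <- c("These days a chicken leg is a rare dish.",
       "Rice is often served in round bowls.")

# Step 1: find where the pattern occurs (NA where it does not).
loc <- str_locate(x, "days")

# Step 2: use those positions to pull the text out.
found <- str_sub(x, loc[, "start"], loc[, "end"])

print(loc)
print(found)
```

An NA position carries through to an NA result, so non-matching strings are easy to spot.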

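5 Other types of pattern

When you use a string as a pattern, stringr automatically wraps it in a call to regex():

The usual call:

str_view(fruit, "nana")

is shorthand for:

str_view(fruit, regex("nana"))
You can use the other arguments of regex() to control details of the match.

ignore_case = TRUE allows characters to match either their uppercase or lowercase forms. This always uses the current locale: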

bananas <- c("banana", "Banana", "BANANA")
str_view(bananas, "banana")
str_view(bananas, regex("banana", ignore_case = TRUE))

multiline = TRUE allows ^ and $ to match the start and end of each line rather than the start and end of the complete string:

x <- "Line 1\nLine 2\nLine 3"
str_extract_all(x, "^Line")[[1]]
#> [1] "Line"
str_extract_all(x, regex("^Line", multiline = TRUE))[[1]]
#> [1] "Line" "Line" "Line"

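comments = TRUE allows you to use comments and whitespace to make complex regular expressions easier to understand. Spaces are ignored, as is everything after #. To match a literal space, you need to escape it: "\\ ":

phone <- regex("
  \\(?     # optional opening parens
  (\\d{3}) # area code
  [) -]?   # optional closing parens, space, or dash
  (\\d{3}) # another three numbers
  [ -]?    # optional space or dash
  (\\d{3}) # three more numbers
  ", comments = TRUE)
str_match("514-791-8141", phone)
#> [,1] [,2] [,3] [,4]
#> [1,] "514-791-814" "514" "791" "814"
dotall = TRUE allows . to match everything, including \n.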
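
A small sketch of the effect of dotall, using a made-up two-line string:

```r
library(stringr)

x <- "first line\nsecond line"

# By default "." stops at the newline, so the match stays on the first line.
default_match <- str_extract(x, "first.*line")

# With dotall = TRUE, "." also matches "\n" and the greedy match spans both lines.
dotall_match <- str_extract(x, regex("first.*line", dotall = TRUE))

print(default_match)
print(dotall_match)
```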

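5.1 Besides regex(), there are three other functions you can use.

fixed() matches exactly the specified sequence of bytes. It ignores all special regular expression characters and operates at a very low level. This lets you avoid complex escaping and can be much faster than regular expressions. The following microbenchmark shows that it is about 3 times faster in this simple example:

microbenchmark::microbenchmark(
  fixed = str_detect(sentences, fixed("the")),
  regex = str_detect(sentences, "the"),
  times = 20
)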
#> Unit: microseconds
#> expr min lq mean median uq max neval cld
#> fixed 116 117 136 120 125 389 20 a
#> regex 333 337 346 338 342 467 20 b

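Be careful when using fixed() with non-English data. It can cause problems because there are often multiple ways of representing the same character. For example, there are two ways to define "á": either as a single character or as an "a" plus an accent: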

a1 <- "\u00e1"
a2 <- "a\u0301"
c(a1, a2)
#> [1] "á" "á"
a1 == a2
#> [1] FALSE

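The two render identically, but because they are defined differently, fixed() doesn't find a match. Instead, you can use coll(), described next, which respects human character comparison rules: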

str_detect(a1, fixed(a2))
#> [1] FALSE
str_detect(a1, coll(a2))
#> [1] TRUE

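coll() compares strings using standard collation rules. This is useful for case-insensitive matching. Note that coll() takes a locale argument that controls which rules are used for comparing characters. Unfortunately, different parts of the world use different rules!

# This means you still need to be aware of the differences between rules when doing case-insensitive matches: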
i <- c("I", "İ", "i", "ı")
i
#> [1] "I" "İ" "i" "ı"
str_subset(i, coll("i", ignore_case = TRUE))
#> [1] "I" "i"
str_subset(
  i,
  coll("i", ignore_case = TRUE, locale = "tr")
)
#> [1] "İ" "i"

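Both fixed() and regex() have an ignore_case argument, but neither lets you pick the locale: they always use the default locale. You can see what that is with the following code (more on the stringi package later):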

stringi::stri_locale_info()
#> $Language
#> [1] "en"
#>
#> $Country
#> [1] "US"
#>
#> $Variant
#> [1] ""
#>
#> $Name
#> [1] "en_US"

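The downside of coll() is speed. Because the rules for recognising which characters are the same are complicated, coll() is relatively slow compared to regex() and fixed().

As you saw with str_split(), you can use boundary() to match boundaries. You can also use it with the other functions: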

x <- "This is a sentence."
str_view_all(x, boundary("word"))
str_extract_all(x, boundary("word"))
#> [[1]]
#> [1] "This" "is" "a" "sentence"

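6 Other uses of regular expressions

There are two useful functions in base R that also use regular expressions.
• apropos() searches all objects available from the global environment. This is useful if you can't quite remember the name of a function: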

apropos("replace")
#> [1] "%+replace%" "replace" "replace_na"
#> [4] "str_replace" "str_replace_all" "str_replace_na"
#> [7] "theme_replace"

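• dir() lists all the files in a directory. Its pattern argument takes a regular expression and returns only the file names that match it. For example, you can find all the R Markdown files in the current directory with: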

head(dir(pattern = "\\.Rmd$"))
#> [1] "communicate-plots.Rmd" "communicate.Rmd"
#> [3] "datetimes.Rmd" "EDA.Rmd"
#> [5] "explore.Rmd" "factors.Rmd"

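7 stringi

stringr is built on top of stringi. stringr is easy to learn because it exposes a small, carefully chosen set of functions that cover the most common string manipulation tasks. stringi, by contrast, is designed to be comprehensive and contains almost every function you might ever need: stringi has 234 functions to stringr's 42.

If you find yourself struggling to do something in stringr, it's worth taking a look at stringi. The two packages work very similarly, so you should be able to carry your stringr knowledge over naturally. The main difference is the prefix: str_ versus stri_.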

