Python Standard Library: re

Regular Expressions

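Python's re module is the standard library's core tool for regular expressions. It supports matching, replacing, and splitting strings. The common usage patterns, with examples, are as follows: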

1. Basic matching

re.match(pattern, string) tries the pattern only at the beginning of the string. It returns a match object on success and None on failure.

import re
result = re.match(r'hello', 'hello world')
if result:
    print(result.group())  # Output: hello

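re.search(pattern, string) scans the whole string and returns a match object for the first position where the pattern matches, or None if it matches nowhere.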

result = re.search(r'world', 'hello world')
print(result.group())  # Output: world
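
To make the difference between the two functions concrete, here is a minimal sketch that runs the same pattern through re.match and re.search. It also uses re.fullmatch, which the text above does not cover, and the variable names are chosen only for illustration.

```python
import re

text = "hello world"

# re.match only tries the pattern at position 0.
at_start = re.match(r"world", text)     # None: "world" is not at the start
# re.search scans forward until the pattern matches somewhere.
anywhere = re.search(r"world", text)    # match object covering "world"
# re.fullmatch requires the pattern to cover the entire string.
whole = re.fullmatch(r"hello", text)    # None: " world" is left over

# Both functions can return None, so guard before calling .group().
found = anywhere.group() if anywhere else None
span = anywhere.span() if anywhere else None
print(at_start, found, span, whole)
```

Checking for None first avoids the AttributeError raised when .group() is called on a failed match.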

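2. Finding all matches

re.findall(pattern, string) returns all non-overlapping matches as a list of strings. If the pattern contains capture groups, the list holds the groups instead of the whole match.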

matches = re.findall(r'\d+', '2021年7月7日，2002年02月03日')
print(matches)  # Output: ['2021', '7', '7', '2002', '02', '03']

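re.finditer(pattern, string) returns an iterator that yields match objects one at a time. This suits large texts, because no complete list is built in memory.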

for match in re.finditer(r'\d+', '2021年7月7日'):
    print(match.group())  # Prints in turn: 2021, 7, 7
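
The two functions differ in what they hand back. The following sketch, using an ASCII date string made up for illustration, shows that findall returns tuples once the pattern has capture groups, while finditer keeps the positional information of each match.

```python
import re

text = "2021-07-07 and 2002-02-03"

# With capture groups, findall returns one tuple of groups per match.
parts = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)

# finditer yields match objects, so the start offset is available too.
located = [(m.group(), m.start()) for m in re.finditer(r"\d{4}-\d{2}-\d{2}", text)]

print(parts)
print(located)
```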

3. String replacement

re.sub(pattern, repl, string) returns a copy of string in which every match of pattern is replaced by repl.

text = "dawdada"
text_new = re.sub(r'a', '%', text)  # replace every 'a' with '%'
print(text_new)  # Output: d%wd%d%

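# Backreferences in the replacement (\1 refers to the first group)
# Only "！@" matches here: "。" is followed by "！", which is not in [%@]
para = re.sub(r'([！。])([%@])', r'\1分隔\2', "hell。！@%o")
print(para)  # Output: hell。！分隔@%o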
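
Besides a replacement string, re.sub also accepts a function as repl and a count argument that limits the number of replacements. The sketch below illustrates both; the helper name double and the sample sentence are hypothetical.

```python
import re

def double(match):
    """Return the matched number multiplied by two, as a string."""
    return str(int(match.group()) * 2)

# The function is called once per match and its return value is substituted.
doubled = re.sub(r"\d+", double, "3 apples and 10 pears")

# count limits how many matches are replaced (passed by keyword here).
first_only = re.sub(r"a", "%", "dawdada", count=1)

print(doubled)
print(first_only)
```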

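4. Precompiling regular expressions

re.compile(pattern) compiles the pattern once into a pattern object, which can then be reused without passing the pattern string on every call.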

pattern = re.compile(r'\d+')
matches = pattern.findall('2021年7月7日')
print(matches)  # Output: ['2021', '7', '7']
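
Reuse is where a compiled pattern pays off. As a rough sketch, assuming a small list of made-up log lines, one pattern object can be applied repeatedly inside a loop:

```python
import re

# Compile once, outside the loop. DATE and the sample lines are illustrative.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

lines = ["start 2023-12-25", "no date here", "end 2024-01-01"]

years = []
for line in lines:
    m = DATE.search(line)   # pattern objects offer search, match, findall, sub, ...
    if m:
        years.append(m.group(1))

print(years)
```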

5. Splitting strings

re.split(pattern, string) splits the string at every match of the pattern.

parts = re.split(r'[,\s]+', 'apple, banana  cherry')
print(parts)  # Output: ['apple', 'banana', 'cherry']
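
A few behaviors of re.split are easy to overlook. The sketch below shows the maxsplit argument, the effect of a capture group in the pattern, and the empty string produced by a leading delimiter; the sample inputs are made up for illustration.

```python
import re

# maxsplit stops after the given number of splits.
head_tail = re.split(r"[,\s]+", "apple, banana  cherry", maxsplit=1)

# A capture group in the pattern keeps the delimiters in the result.
with_delims = re.split(r"(-)", "a-b-c")

# A delimiter at the very start produces a leading empty string.
leading = re.split(r",", ",a,b")

print(head_tail, with_delims, leading)
```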

6. Groups and capturing

Parentheses () create capture groups, which extract the text matched by each sub-pattern.

text = "2021年7月7日"
match = re.search(r'(\d+)年(\d+)月(\d+)日', text)
if match:
    print(match.groups())  # Output: ('2021', '7', '7')
    print(match.group(1))  # Output: 2021
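
Groups can also be named with the (?P<name>...) syntax, which is often easier to read than numeric indexes. The following is a minimal sketch on the same date string; the group names year, month, and day are chosen here for illustration.

```python
import re

text = "2021年7月7日"
m = re.search(r"(?P<year>\d+)年(?P<month>\d+)月(?P<day>\d+)日", text)

if m:
    fields = m.groupdict()      # all named groups as a dict
    year = m.group("year")      # access a single group by name
    whole = m.group(0)          # group 0 is always the entire match
    print(fields, year, whole)
```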

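7. Common regex symbols

\d: matches a decimal digit. For str patterns this covers any Unicode decimal digit; it is equivalent to [0-9] only when re.ASCII is set.
\w: matches word characters: letters, digits, and the underscore. For str patterns this includes Unicode letters such as Chinese characters.
\s: matches whitespace characters (space, tab, newline, etc.).
.: matches any character except a newline.
*: matches the preceding element (a character, character class, or group) 0 or more times.
+: matches the preceding element 1 or more times.
?: matches the preceding element 0 or 1 times.
{n}: matches the preceding element exactly n times.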
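
The symbols above can be checked directly in an interpreter. The sketch below uses made-up sample strings to show the Unicode behavior of \d and \w, the effect of re.ASCII, and quantifiers applied to a single character and to a group.

```python
import re

# \d and \w are Unicode-aware for str patterns; re.ASCII restricts them.
digits_unicode = re.findall(r"\d", "a1２")            # "２" is a full-width digit
digits_ascii = re.findall(r"\d", "a1２", flags=re.ASCII)
words_unicode = re.findall(r"\w+", "re库 v2_0")
words_ascii = re.findall(r"\w+", "re库 v2_0", flags=re.ASCII)

# Quantifiers apply to the preceding element, which may be a whole group.
exact = re.fullmatch(r"ab{2}c?", "abb") is not None     # b exactly twice, c optional
grouped = re.fullmatch(r"(ab)+", "ababab") is not None  # the group "ab" repeated

print(digits_unicode, digits_ascii, words_unicode, words_ascii, exact, grouped)
```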

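8. Flags

re.IGNORECASE (re.I): case-insensitive matching.
re.MULTILINE (re.M): multiline mode; ^ and $ also match at the start and end of each line.
re.DOTALL (re.S): lets . match newline characters as well.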

text = "Hello\nWorld"
matches = re.findall(r'^[a-z]+', text, flags=re.I | re.M)
print(matches)  # Output: ['Hello', 'World']
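
To see what each flag changes, it helps to run a pattern with and without it. This short sketch reuses the two-line string from the example above; the variable names are illustrative.

```python
import re

text = "Hello\nWorld"

# Without re.M, ^ anchors only at the start of the whole string.
first_line_only = re.findall(r"^[a-z]+", text, flags=re.I)

# Without re.S, "." refuses to cross the newline.
no_dotall = re.findall(r"Hello.World", text)
with_dotall = re.findall(r"Hello.World", text, flags=re.S)

print(first_line_only, no_dotall, with_dotall)
```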

Combined examples

import re

# Example 1: find dates
text = "日期：2023-12-25，2024-01-01"
dates = re.findall(r'\d{4}-\d{2}-\d{2}', text)
print(dates)  # Output: ['2023-12-25', '2024-01-01']

# Example 2: reformat a phone number
text = "电话：123-4567-8900"
new_text = re.sub(r'(\d{3})-(\d{4})-(\d{4})', r'(\1) \2-\3', text)
print(new_text)  # Output: 电话：(123) 4567-8900

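Notes

Greedy matching: quantifiers are greedy by default, so .* matches as much as possible. Append ? (as in .*?) to get the shortest match instead.
For complex patterns, the re.VERBOSE flag allows whitespace and comments inside the pattern, which improves readability.
In Python 3, str patterns match Unicode by default, so Chinese text needs no extra flag. re.UNICODE is redundant and kept only for backward compatibility.
Use raw strings: regular expressions are usually written as raw strings (such as r'\n') to avoid backslash-escaping problems. For example, r"\n" is the two characters backslash and n, while "\n" is a single newline character.
Performance: for frequently used patterns, prefer precompiling with re.compile(). The re module also caches recently compiled patterns, so the gain matters mostly in tight loops.
Alternatives for complex tasks: for jobs such as HTML parsing or thorough email validation, re may not be robust enough. A dedicated library (such as Beautiful Soup or email-validator) is recommended.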
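
Several of these notes are easier to grasp with a concrete case. The sketch below, built on made-up sample strings, contrasts greedy and non-greedy matching, writes a date pattern with re.VERBOSE, and compares a raw string with a normal one.

```python
import re

html = "<b>one</b><b>two</b>"

# Greedy: .* runs to the last "</b>" in the string.
greedy = re.findall(r"<b>.*</b>", html)
# Non-greedy: .*? stops at the first "</b>" it can.
lazy = re.findall(r"<b>(.*?)</b>", html)

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
DATE = re.compile(
    r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
    """,
    re.VERBOSE,
)
date_parts = DATE.search("date: 2023-12-25").groups()

# A raw string keeps the backslash; a normal string turns \n into a newline.
raw_len = len(r"\n")
plain_len = len("\n")

print(greedy, lazy, date_parts, raw_len, plain_len)
```

Even in non-greedy form, this kind of tag matching only works on simple, well-formed input, which is why the note above points to a dedicated parser for real HTML.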

Combining these functions flexibly makes string matching, cleaning, and extraction tasks efficient to handle.