This post introduces regular expressions in Python.

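1. Introduction to regular expressions

In the previous posts we worked step by step up to fetching the full HTML of a web page. The data we get back, however, is buried in markup and very messy. How do we pull the useful information out of it? The answer is our secret weapon: regular expressions.

Regular expressions are not part of the Python language itself. They are a powerful tool for processing strings, with their own syntax and a separate matching engine. They may be slower than the built-in str methods, but they are far more capable. Because of this independence, the core syntax is largely the same in every language that provides regular expressions; implementations differ mainly in how much of the syntax they support, and the unsupported parts are usually the rarely used ones. If you have already used regular expressions in another language, a quick skim is enough to get started.

To be honest, regular expressions still make my head spin; I suspect it takes regular use to get comfortable with them and remember the syntax. I found a tutorial that explains things reasonably well: https://www.jb51.net/tools/zhengze.html (have a look for a detailed walk-through of the syntax). I also recommend this online tool for debugging regular expressions: http://tool.chinaz.com/regex/?qq-pf-to=pcqq.c2c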

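2. Regular expression syntax rules

I strongly recommend reading the tutorial above first and then coming back to the figure below. It will make much more sense that way; otherwise it is easy to get lost: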

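3. Notes on regular expressions

1. Greedy and non-greedy (lazy) quantifiers

When a regular expression contains a quantifier that accepts repetition, the default behavior is to match as many characters as possible while still allowing the whole expression to match. Take a.*b: it matches the longest string that starts with a and ends with b. Applied to aabab, it matches the entire string aabab. This is called greedy matching.

Sometimes we want lazy matching instead, that is, matching as few characters as possible. Any of the quantifiers above can be made lazy by appending a question mark ?. So .*? means: repeat any number of times, but use the fewest repetitions that still let the overall match succeed. Here is the lazy version of the example:

a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab (characters 1 to 3) and ab (characters 4 to 5).

Why is the first match aab (characters 1 to 3) rather than ab (characters 2 to 3)? Put simply, regular expressions have another rule that takes priority over the lazy/greedy rule: the match that begins earliest wins.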
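The greedy and lazy behavior described above can be checked quickly in Python. This is only a minimal sketch; the variable names are illustrative and not part of the original text:

```python
import re

text = 'aabab'

# Greedy: .* grabs as much as it can, so the whole string is one match.
greedy = re.findall(r'a.*b', text)

# Lazy: .*? stops at the first b that lets the match succeed.
# The first match still starts at index 0, because the earliest match wins.
lazy = re.findall(r'a.*?b', text)

print(greedy)  # ['aabab']
print(lazy)    # ['aab', 'ab']
```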

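Table 5. Lazy quantifiers
Syntax	Description
*?	Repeat any number of times, but as few as possible
+?	Repeat 1 or more times, but as few as possible
??	Repeat 0 or 1 times, but as few as possible
{n,m}?	Repeat n to m times, but as few as possible
{n,}?	Repeat n or more times, but as few as possible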
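To see the table in action, the sketch below matches each quantifier, greedy and lazy, at the start of the same string. The test string and the list names are assumptions chosen for illustration:

```python
import re

text = 'aaaa'
patterns = ['a*', 'a+', 'a?', 'a{2,3}', 'a{2,}']

# Greedy form: take as many a's as the quantifier allows.
greedy = [re.match(p, text).group() for p in patterns]

# Lazy form: append ? and the quantifier settles for its minimum.
lazy = [re.match(p + '?', text).group() for p in patterns]

print(greedy)  # ['aaaa', 'aaaa', 'a', 'aaa', 'aaaa']
print(lazy)    # ['', 'a', '', 'aa', 'aa']
```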

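2. Escaping metacharacters needs extra care

If you want to find a metacharacter itself, such as . or *, you hit a problem: you cannot write it directly, because it would be interpreted with its special meaning. Use \ to cancel the special meaning, so write \. and \*. To find \ itself, write \\.

For example, deerchao\.net matches deerchao.net, and C:\\Windows matches C:\Windows.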
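The escaping rules above can be tried out as follows. This is a sketch with made-up sample strings; re.escape is a standard-library helper that the original text does not mention, shown here as one way to escape a literal string automatically:

```python
import re

# Unescaped, the dot matches any character, so this pattern is too permissive.
loose = re.search(r'deerchao.net', 'deerchaoXnet')

# Escaped, the dot only matches a literal dot.
strict_miss = re.search(r'deerchao\.net', 'deerchaoXnet')
strict_hit = re.search(r'deerchao\.net', 'visit deerchao.net today')

# A literal backslash is written \\ in the pattern (inside a raw string).
path_hit = re.search(r'C:\\Windows', r'C:\Windows\System32')

# re.escape builds a pattern that matches the given text literally.
literal_hit = re.search(re.escape('a.b*c'), 'xa.b*cx')

print(loose is not None)    # True
print(strict_miss)          # None
print(strict_hit.group())   # deerchao.net
print(path_hit.group())     # C:\Windows
print(literal_hit.group())  # a.b*c
```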

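4. Overview of common functions in the re module

Python ships with the re module, which provides regular expression support. This section gives a rough overview of what it contains.

Open the Python 3.6 standard library documentation at https://docs.python.org/3.6/library/re.html and find section 6.2, on regular expression operations. Everything discussed in this section is on that page (unfortunately, all in English). Section 6.2.1 covers regular expression syntax; I suggest reading sections 1 and 2 of this post first and then skimming 6.2.1 in the documentation.

Here we focus on section 6.2.2.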

Module Contents

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.

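In other words, the module-level functions are shortcuts for the methods of a compiled pattern object, and most non-trivial programs use the compiled form.

The main functions are listed below (all of this is in the standard library documentation, just in English):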

# Compile a regular expression into a reusable pattern object
re.compile(pattern, flags=0)

# Matching functions
re.match(pattern, string, flags=0)
re.search(pattern, string, flags=0)
re.fullmatch(pattern, string, flags=0)
re.split(pattern, string, maxsplit=0, flags=0)
re.findall(pattern, string, flags=0)
re.finditer(pattern, string, flags=0)
re.sub(pattern, repl, string, count=0, flags=0)
re.subn(pattern, repl, string, count=0, flags=0)
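As a rough illustration of the relationship between re.compile and the module-level functions, the sketch below runs the same search both ways. The sample string and the name digits are assumptions for illustration:

```python
import re

# Compile once, then reuse the pattern object's methods.
digits = re.compile(r'\d+')
via_pattern = digits.findall('a1b22c333')

# The module-level function takes the pattern as its first argument instead.
via_module = re.findall(r'\d+', 'a1b22c333')

print(via_pattern)  # ['1', '22', '333']
print(via_module)   # ['1', '22', '333']
```

Both calls give the same result; compiling mainly pays off when the same pattern is used many times.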

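1. re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Even in MULTILINE mode, re.match() only matches at the beginning of the string, not at the beginning of each line. To find a match anywhere in the string, use search() instead (see the "search() vs. match()" section of the documentation).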
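The point about MULTILINE mode can be seen in a small experiment. This is a sketch with a made-up two-line string:

```python
import re

text = 'hello\nworld'

# re.match is anchored at index 0, and MULTILINE does not change that.
at_start = re.match(r'world', text, re.M)

# re.search with ^ and MULTILINE can match at the start of the second line.
at_line_start = re.search(r'^world', text, re.M)

print(at_start)               # None
print(at_line_start.group())  # world
print(at_line_start.start())  # 6
```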

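2. re.search(pattern, string, flags=0)

Scan through string looking for the first location where the pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.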
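The difference between "no match" and "a zero-length match" is easy to miss, so here is a minimal sketch of it. The patterns and the input string are chosen only for illustration:

```python
import re

# x* can match zero characters, so this succeeds with an empty match at index 0.
empty = re.search(r'x*', 'abc')

# x+ needs at least one x, so there is no match at all.
missing = re.search(r'x+', 'abc')

print(repr(empty.group()), empty.span())  # '' (0, 0)
print(missing)                            # None

# A Match object is always truthy, even when it matched the empty string.
print(bool(empty))  # True
```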

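3. re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.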

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

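4. re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left to right, and matches are returned in the order found. If the pattern contains exactly one group, return a list of that group's matches; if it contains more than one group, return a list of tuples. Empty matches are included in the result.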
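How the return value changes with the number of groups is easiest to see side by side. The sketch below uses an invented key=value string:

```python
import re

text = 'a=1, b=2'

# No groups: a list of the full matches.
no_group = re.findall(r'\w+=\d+', text)

# One group: a list of what that group captured.
one_group = re.findall(r'\w+=(\d+)', text)

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', text)

print(no_group)    # ['a=1', 'b=2']
print(one_group)   # ['1', '2']
print(two_groups)  # [('a', '1'), ('b', '2')]
```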

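5. re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches of pattern in string. The string is scanned left to right, and matches are returned in the order found. Empty matches are included in the result. See also the notes on findall().

I won't go through the remaining three one by one; open the standard library link and read about them yourself.

You may have noticed that re.findall(pattern, string, flags=0), like every function above, takes a flags parameter. Its possible values are described in detail in the standard library documentation linked above. flags selects the matching mode, and values can be combined with the bitwise OR operator | so that several take effect at once, for example re.I | re.M.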

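• re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
• re.M (full name: MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line (see the figure above)
• re.S (full name: DOTALL): dot-matches-all mode; changes '.' so that it also matches a newline
• re.L (full name: LOCALE): makes \w \W \b \B and case-insensitive matching depend on the current locale; in Python 3 it can only be used with bytes patterns
• re.U (full name: UNICODE): makes the predefined classes \w \W \b \B \s \S \d \D follow Unicode character properties; in Python 3 this is already the default for str patterns
• re.X (full name: VERBOSE): verbose mode. The pattern may span several lines, whitespace is ignored, and comments may be added.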
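A short sketch of the most commonly used flags follows. The sample strings and variable names are assumptions made for illustration:

```python
import re

# re.I | re.M: ignore case, and let ^ match at the start of every line.
starts = re.findall(r'^hello', 'Hello\nhello\nsay hello', re.I | re.M)

# re.S: let '.' match a newline as well.
without_s = re.findall(r'a.b', 'a\nb')
with_s = re.findall(r'a.b', 'a\nb', re.S)

# re.X: whitespace in the pattern is ignored and comments are allowed.
phone = re.compile(r"""
    \d{3}   # first three digits
    -       # literal separator
    \d{4}   # last four digits
""", re.X)
number = phone.search('call 555-1234 now').group()

print(starts)     # ['Hello', 'hello']
print(without_s)  # []
print(with_s)     # ['a\nb']
print(number)     # 555-1234
```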

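5. A closer look at the functions

All 8 re functions above take a pattern argument. Note that flags in these 8 functions again means the matching mode. If flags were already given when the pattern was compiled, do not pass them again to the functions below: passing non-zero flags together with a compiled pattern raises ValueError.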
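The interaction between a compiled pattern and the flags argument can be sketched like this. The variable names are illustrative; the ValueError is what CPython's re module raises when non-zero flags accompany a compiled pattern:

```python
import re

# Flags are fixed when the pattern is compiled.
pattern = re.compile(r'hello', re.I)

# The pattern's own method needs no flags argument.
first = pattern.match('HELLO world').group()

# Module-level functions accept a compiled pattern too, as long as flags stays 0.
second = re.match(pattern, 'HELLO world').group()

# Passing flags again alongside a compiled pattern is rejected.
try:
    re.match(pattern, 'HELLO world', re.I)
    error = None
except ValueError as exc:
    error = str(exc)

print(first)   # HELLO
print(second)  # HELLO
print(error)
```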

The 8 functions are described in detail below:

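1. re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Even in MULTILINE mode, re.match() only matches at the beginning of the string, not at the beginning of each line. To find a match anywhere in the string, use search() instead (see the "search() vs. match()" section of the documentation).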

# Import the re module
import re

# Compile the regular expression into a Pattern object; the r prefix marks a raw string
pattern = re.compile(r'hello')

# Use re.match to match the text; it returns None when there is no match
result1 = re.match(pattern, 'hello')
result2 = re.match(pattern, 'helloo CQC!')
result3 = re.match(pattern, 'helo CQC!')
result4 = re.match(pattern, 'hello CQC!')

# If match 1 succeeded
if result1:
    # Use the Match object to get the matched text
    print(result1.group())
else:
    print('Match 1 failed!')

# If match 2 succeeded
if result2:
    # Use the Match object to get the matched text
    print(result2.group())
else:
    print('Match 2 failed!')

# If match 3 succeeded
if result3:
    # Use the Match object to get the matched text
    print(result3.group())
else:
    print('Match 3 failed!')

# If match 4 succeeded
if result4:
    # Use the Match object to get the matched text
    print(result4.group())
else:
    print('Match 4 failed!')

Running it produces:

D:\python\python.exe E:/技术学习/Python代码/14.正则表达式/march.py
hello
hello
Match 3 failed!
hello

Process finished with exit code 0

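The code above prints result.group(). What does that mean? Let's look at the attributes and methods of the match object.
A Match object is the result of one match and carries a lot of information about it; use its read-only attributes and its methods to get at that information.

Section 6.2.4 of the Python standard library documentation mentioned above covers match objects and the methods you can call on them. I encourage you to open this link and read section 6.2.4 yourself: https://docs.python.org/3.6/library/re.html#re.findall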

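Attributes:
1. string: the text that was matched against.
2. re: the Pattern object used for the match.
3. pos: the index in the text where the regex engine started searching. Same value as the pos argument of Pattern.match() and Pattern.search().
4. endpos: the index in the text where the regex engine stopped searching. Same value as the endpos argument of Pattern.match() and Pattern.search().
5. lastindex: the integer index of the last matched capturing group, or None if no group was matched.
6. lastgroup: the name of the last matched capturing group, or None if that group has no name or no group was matched.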

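Methods:
1. group([group1, …]):
Return the string captured by one or more groups; with several arguments the result is a tuple. group1 can be a number or a name; number 0 stands for the whole matched substring; with no argument, group(0) is returned; a group that captured nothing returns None; a group that captured several times returns its last capture.
2. groups([default]):
Return the strings captured by all groups as a tuple, equivalent to calling group(1, 2, …, last). default is the value used for groups that captured nothing; it defaults to None.
3. groupdict([default]):
Return a dictionary whose keys are the names of the named groups and whose values are the substrings those groups captured; unnamed groups are not included. default has the same meaning as above.
4. start([group]):
Return the start index in string of the substring captured by the given group (the index of its first character). group defaults to 0.
5. end([group]):
Return the end index in string of the substring captured by the given group (the index of its last character + 1). group defaults to 0.
6. span([group]):
Return (start(group), end(group)).
7. expand(template):
Substitute the matched groups into template and return the result. template can refer to groups with \id, \g<id>, or \g<name>; \0 does not refer to the whole match (use \g<0> for that). \id and \g<id> are equivalent, but \10 is treated as group 10; to express \1 followed by the character '0', use \g<1>0.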
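To make the attributes and methods listed above concrete, here is a sketch that applies them to a pattern with two named groups. The pattern, the input strings, and the names m and n are assumptions for illustration:

```python
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Isaac Newton, physicist')

print(m.group())        # Isaac Newton
print(m.group(1, 2))    # ('Isaac', 'Newton')
print(m.group('last'))  # Newton
print(m.groups())       # ('Isaac', 'Newton')
print(m.groupdict())    # {'first': 'Isaac', 'last': 'Newton'}
print(m.start('last'), m.end('last'), m.span(1))  # 6 12 (0, 5)
print(m.lastindex, m.lastgroup)                   # 2 last
print(m.expand(r'\g<last>, \g<first>'))           # Newton, Isaac

# A group that did not take part in the match is None unless a default is given.
n = re.match(r'(\d+)\.?(\d+)?', '24')
print(n.groups())     # ('24', None)
print(n.groups('0'))  # ('24', '0')
```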

All of this is described in detail in section 6.2.4 at the link above; it is worth a look.

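2. re.search(pattern, string, flags=0)

Scan through string looking for the first location where the pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.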

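# Import the re module
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'world')
# Use search() to find a matching substring; it returns None if there is none
# match() would fail in this example, because the string does not start with 'world'
match = re.search(pattern, 'hello world!')
if match:
    # Use the Match object to get the matched text
    print(match.group())
    ### Output ###
    # world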

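3. re.fullmatch(pattern, string, flags=0)

If the whole string matches the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match. The pattern has to cover the string from start to end.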

# Import the re module
import re

# Compile the regular expression into a Pattern object
pattern = re.compile(r'hello world!')
# Use fullmatch() to test the whole string; it returns None unless the entire string matches
match = re.fullmatch(pattern, 'hello world!')
if match:
    # Use the Match object to get the matched text
    print(match.group())
    ### Output ###
    # hello world!
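For contrast with re.match, the following sketch applies both functions to the same string with a pattern that only covers its beginning. The variable names are illustrative:

```python
import re

text = 'hello world!'

# match() only requires the pattern to match at the start of the string.
prefix = re.match(r'hello', text)

# fullmatch() requires the pattern to cover the entire string.
whole = re.fullmatch(r'hello', text)
whole_ok = re.fullmatch(r'hello.*', text)

print(prefix.group())    # hello
print(whole)             # None
print(whole_ok.group())  # hello world!
```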

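4. re.split(pattern, string, maxsplit=0, flags=0)

Split string at every substring that matches pattern and return the pieces as a list. maxsplit sets the maximum number of splits; if it is not given, the string is split at every match. The example below gives a feel for it.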

import re

pattern = re.compile(r'\d+')
print(re.split(pattern, 'one1two2three3four4'))

### Output ###
# ['one', 'two', 'three', 'four', '']
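The example above does not exercise maxsplit, so here is a small sketch that does, on the same input string. The variable names are assumptions:

```python
import re

text = 'one1two2three3four4'

# Default: split at every match.
all_parts = re.split(r'\d+', text)

# maxsplit=2: stop after two splits; the rest of the string stays in one piece.
two_splits = re.split(r'\d+', text, maxsplit=2)

# With a capturing group, the separators are kept in the result as well.
with_seps = re.split(r'(\d+)', text, maxsplit=2)

print(all_parts)   # ['one', 'two', 'three', 'four', '']
print(two_splits)  # ['one', 'two', 'three3four4']
print(with_seps)   # ['one', '1', 'two', '2', 'three3four4']
```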

5. re.findall(pattern, string, flags=0)

"""
The main job of a regular expression is to match strings
"""

import re

# Basic usage
buffer = re.findall('world', "hello world**Worldworld")  # find the given string; results come back as a list
print(buffer)  # ['world', 'world']
"""
Metacharacters
"""

# 1. '.' wildcard: stands for any character, one dot per character
buffer = re.findall('w...d', "hello world")
print(buffer)  # ['world']

buffer = re.findall('w...d', "hello w\nrld")
print(buffer)  # [] '.' matches anything except \n; pass re.S as findall's third argument (flags) to match \n too

# 2. '^' caret: the match must start at the beginning of the string, even if the pattern occurs later on
buffer = re.findall('^w...d', "hello world")
print(buffer)  # []

buffer = re.findall('^w...d', "worldhello world")
print(buffer)  # ['world']

# 3. '$' dollar: the match must end at the end of the string
buffer = re.findall('w...d', "hello world!!")
print(buffer)  # ['world']

buffer = re.findall('w...d$', "hello world!!")
print(buffer)  # []

buffer = re.findall('w...d$', "hello world!!world")
print(buffer)  # ['world']

# *************************************************************

# 4. '*' repetition: the character before * may repeat any number of times, including zero

buffer = re.findall('hello*world', 'hellooooooworld')
print(buffer)  # ['hellooooooworld']

buffer = re.findall('hello.*world', 'hello@@sssworld')  # with the wildcard '.', any characters can be matched
print(buffer)  # ['hello@@sssworld']

# 5. '+' repetition too, but at least once
buffer = re.findall('hello*world', 'hellworld')
print(buffer)  # ['hellworld']

buffer = re.findall('hello+world', 'hellworld')
print(buffer)  # []
# In other words, the o before '+' must occur at least once in the target string, while '*' allows zero occurrences

# 6. '?' repetition again, but only 0 or 1 times; more than that fails
buffer = re.findall('hello?world', 'hellworld')
print(buffer)  # ['hellworld']

buffer = re.findall('hello?world', 'helloworld')
print(buffer)  # ['helloworld']

buffer = re.findall('hello?world', 'helloooworld')
print(buffer)  # []

# 7. '{}' braces are repetition too, with a count you choose
buffer = re.findall('a{5}b', 'aaaabbaaa')  # a must repeat exactly 5 times
print(buffer)  # []

buffer = re.findall('a{5}b', 'aaaaabbaaa')
print(buffer)  # ['aaaaab']

buffer = re.findall('a{1,3}b', 'ba***aab***aaab***aaaaaaaabaaa')
print(buffer)  # ['aab', 'aaab', 'aaab']

The corresponding output is:

D:\python\python.exe E:/技术学习/Python代码/14.正则表达式/正则表达式.py
['world', 'world']
['world']
[]
[]
['world']
['world']
[]
['world']
['hellooooooworld']
['hello@@sssworld']
['hellworld']
[]
['hellworld']
['helloworld']
[]
[]
['aaaaab']
['aab', 'aaab', 'aaab']

Process finished with exit code 0

6. re.finditer(pattern, string, flags=0)

Search string and return an iterator that yields each match (a Match object) in order. The example below gives a feel for it.

import re

pattern = re.compile(r'\d+')
for m in re.finditer(pattern, 'one1two2three3four4'):
    print(m.group())

### Output ###
# 1
# 2
# 3
# 4
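Because finditer yields Match objects rather than plain strings, positions are available too. A minimal sketch, with a made-up input string and an illustrative list name:

```python
import re

text = 'one1two22three333'

# Collect each match together with its (start, end) position in the string.
found = [(m.group(), m.span()) for m in re.finditer(r'\d+', text)]

print(found)  # [('1', (3, 4)), ('22', (7, 9)), ('333', (14, 17))]
```

When only the matched text is needed, findall is shorter; finditer is the better fit when positions or groups matter.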

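7. re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing every match in string with repl. When repl is a string, it can refer to groups with \id, \g<id>, or \g<name>; \0 does not refer to the whole match (use \g<0> for that). When repl is a function, it must accept a single argument (a Match object) and return the replacement string; group references inside the returned string are not expanded.
count sets the maximum number of replacements; if it is not given, every match is replaced.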

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(re.sub(pattern, r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.sub(pattern, func, s))

### output ###
# say i, world hello!
# I Say, Hello World!
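The count parameter and the \g<...> reference forms are not covered by the example above, so here is a short sketch of them. The input strings and variable names are assumptions:

```python
import re

# count limits how many matches are replaced.
limited = re.sub(r'\d', '#', 'a1b2c3', count=2)

# \g<1>0 means "group 1 followed by a literal 0"; \10 would mean group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# Named groups can be referenced with \g<name>.
swapped = re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})', r'\g<month>/\g<year>', '2018-05')

# \g<0> stands for the whole match.
bracketed = re.sub(r'\d+', r'[\g<0>]', 'a1b22')

print(limited)    # a#b#c3
print(padded)     # a10b20
print(swapped)    # 05/2018
print(bracketed)  # a[1]b[22]
```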

8. re.subn(pattern, repl, string, count=0, flags=0)

Return the tuple (new_string, number_of_substitutions), where new_string is what re.sub(pattern, repl, string, count) would return.

import re

pattern = re.compile(r'(\w+) (\w+)')
s = 'i say, hello world!'

print(re.subn(pattern, r'\2 \1', s))

def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

print(re.subn(pattern, func, s))

### output ###
# ('say i, world hello!', 2)
# ('I Say, Hello World!', 2)

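6. Summary

Writing this post took me about 7 hours, and it deepened my own understanding of regular expressions along the way. Thank you for your support; in the next posts we will work through some small practice projects together.

Please credit when reposting: 燕骏博客 » Python3.6爬虫入门自学教程之九：python中的正则表达式学习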
