Should a WSGI URL router narrow the set of candidate regexes before regex matching?

Background

Business requirements called for implementing a WSGI application on top of the existing logic, and while implementing it I ran into the problem of URL matching.
Someone suggested preprocessing the patterns, much like preparing a query in SQL, to narrow down the set of regexes that actually have to be matched.

There are a couple of points worth considering here.
- How to narrow the set of regexes to be matched.
- Whether the cost of one extra string comparison (startswith) is lower than the cost of the regex matching it avoids.

Verification

The first step is to process each regular expression and extract the "leading few" characters that can be used for plain string matching. That requires parsing the regex.
Fortunately, Python's mature standard library ships with a module that can parse regular expressions.

Regular expression parsing

re.sre_parse parses a regular expression and returns the type of each token: "literal" for ordinary characters, "in" for character classes/sets, "max_repeat" for greedy repetition, and so on.

In [28]: re.sre_parse.parse("/asdas\wd\sa\Wsssss\Safadf\Dsd\d+a[^a]*dasdasd")
Out[28]: [('literal', 47), ('literal', 97), ('literal', 115), ('literal', 100), ('literal', 97), ('literal', 115), ('in', [('category', 'category_word')]), ('literal', 100), ('in', [('category', 'category_space')]), ('literal', 97), ('in', [('category', 'category_not_word')]),('literal', 115), ('literal', 115), ('literal', 115), ('literal', 115), ('literal', 115), ('in', [('category', 'category_not_space')]), ('literal', 97), ('literal', 102), ('literal', 97), ('literal', 100), ('literal', 102), ('in', [('category', 'category_not_digit')]), ('literal', 115), ('literal', 100), ('max_repeat', (1, 65535, [('in', [('category', 'category_digit')])])), ('literal', 97), ('max_repeat', (0, 65535, [('not_literal', 97)])), ('literal', 100), ('literal', 97), ('literal', 115), ('literal', 100), ('literal', 97), ('literal', 115), ('literal', 100)]

With re.sre_parse, the processing becomes simple: as soon as a non-literal token is encountered, cut the pattern at that position and use the characters before it as the match key.
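
To make that concrete, here is a minimal sketch of the prefix extraction (the helper name literal_prefix is mine, not from the original code): it walks the parsed token stream and stops at the first token that is not a plain literal character.

import re

def literal_prefix(pattern):
    # Each parsed token is an (opcode, argument) pair; collect characters
    # while the opcode is 'literal' and stop at the first non-literal token.
    chars = []
    for op, arg in re.sre_parse.parse(pattern):
        if str(op).lower() != "literal":
            break
        chars.append(chr(arg))
    return "".join(chars)

# literal_prefix("/ddd\d+") -> "/ddd"
# literal_prefix("/sss")    -> "/sss"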

Implementing the rest of the logic

The rest of the logic is straightforward: simulate the URL-matching step of a WSGI application directly.

import re

match_list = [
    ("/sss", "dataA"),
    ("/ddd\d+", "dataB"),
    ("/asd\W+", "dataC"),
    ("/bbb", "dataD"),
]

match_list_start = []
for n, i in enumerate(match_list):
    # Mark each parsed token: 0 for a plain literal character, 1 for anything else.
    reg_nodes = [0 if node[0] == "literal" else 1 for node in re.sre_parse.parse(i[0])]
    position = -1
    try:
        # Index of the first non-literal token, if there is one.
        position = reg_nodes.index(1)
    except ValueError:
        pass

    # The literal prefix becomes the cheap startswith() key for this pattern.
    match_list_start.append(i[0][:position] if position > -1 else i[0])

    match_list[n] = (re.compile(i[0]), i[1])

def str_match(ls):
    for i in ls:
        for n, j in enumerate(match_list_start):
            if i.startswith(j):
                match = match_list[n][0].match(i)
                if match:
                    # do the matching here
                    pass

def regex_match(ls):
    for i in ls:
        for j in match_list:
            match = j[0].match(i)
            if match:
                # do the matching here
                pass

example_str = [
    "/ssssssssssss",
    "/ddd123123",
    "/asd4134134asdasd",
    "/bbbbbbbbbbbbbbb",
    "/bbb",
    "/sss",
    "/dddwww",
    "/asdWWWWWW",
] * 800000

str_match(example_str)    # 29s
regex_match(example_str)  # 21s
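
The 29 s and 21 s figures in the comments above came from profiling runs. The exact harness is not shown in the original, but something like the following cProfile invocation would produce comparable per-call breakdowns (an assumed reconstruction, not the original script):

import cProfile

# Profile both strategies; the cumulative time attributed to str.startswith
# versus the compiled patterns' match() is what the conclusion below refers to.
cProfile.run("str_match(example_str)")
cProfile.run("regex_match(example_str)")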

Conclusion

According to the cProfile output, the startswith calls alone accounted for about 16 s of str_match's 29 s, while running only regex.match (regex_match) took 21 s in total.
The string pre-filter by itself costs almost as much as the entire pure-regex approach, so it cannot pay for itself; matching the regexes directly is the more efficient choice.