Using Web Server Logs, Part 2: Feeding Log Information to a Program


Posted by look_w on 2018-8-23 21:07


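Feeding log information to a program

As you have seen, these log formats are well structured, so the individual fields are easy to extract with a regular expression. Listing 3 is a sample program that parses one log line and prints a summary of the information in it. The program is written in Python, but its essential part is the regular expression, so it is easy to port to any other language.

Listing 3. Code to parse a log line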


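import re

# This regular expression is the heart of the code.
# Python uses Perl-style regex syntax, so it should be readily portable.
# The r'' string form is just a convenience so you don't have to escape backslashes.
COMBINED_LOGLINE_PAT = re.compile(
    r'(?P<origin>\d+\.\d+\.\d+\.\d+) '
    + r'(?P<identd>-|\w*) (?P<auth>-|\w*) '
    + r'\[(?P<date>[^\[\]:]+):(?P<time>\d+:\d+:\d+) (?P<tz>[\-\+]?\d\d\d\d)\] '
    + r'"(?P<method>\w+) (?P<path>[\S]+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<bytes>-|\d+)'
    + r'( (?P<referrer>"[^"]*")( (?P<client>"[^"]*")( (?P<cookie>"[^"]*"))?)?)?\s*\Z'
)

logline = input("Paste the Apache log line then press enter: ")

match_info = COMBINED_LOGLINE_PAT.match(logline)
print()  # Add a new line

# Print all named groups matched in the regular expression
if match_info is None:
    print("Unable to parse log line")
else:
    for key, value in match_info.groupdict().items():
        print(key, ":", value)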

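The pattern COMBINED_LOGLINE_PAT is designed for the combined format, but because the three trailing fields (referrer, client, and cookie) are optional in the pattern, it works for the common format as well. The pattern uses named capturing groups, written in Python's (?P<name>...) syntax, to give each field in the log line a logical name. If you port it to a regex flavor without named groups, use ordinary groups and refer to the fields by number. Notice how fine-grained the pattern is. Rather than grabbing the request line as a single unit, it captures the HTTP method, request path, and protocol version separately, which is more convenient. The date/time is likewise split into date, time, and time zone. Figure 1 shows the output of running Listing 3 on the sample log line shown in Listing 2.

Figure 1. Output of Listing 3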
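
As a rough illustration of the porting advice above, the following sketch rewrites the pattern with plain numbered groups and looks fields up by position. The names POSITIONAL_LOGLINE_PAT, FIELD_ORDER, and parse_positional, as well as both sample log lines, are made up for this example and do not come from the article. It assumes the target regex flavor supports non-capturing groups, (?:...), so that the optional wrappers around the trailing fields do not use up group numbers.

```python
import re

# Same structure as COMBINED_LOGLINE_PAT, but with plain numbered groups.
# (?:...) is a non-capturing group, so only the 14 fields get numbers.
POSITIONAL_LOGLINE_PAT = re.compile(
    r'(\d+\.\d+\.\d+\.\d+) '                               # 1 origin
    r'(-|\w*) (-|\w*) '                                    # 2 identd, 3 auth
    r'\[([^\[\]:]+):(\d+:\d+:\d+) ([\-\+]?\d\d\d\d)\] '    # 4 date, 5 time, 6 tz
    r'"(\w+) (\S+) ([^"]+)" (\d+) (-|\d+)'                 # 7-9 request, 10 status, 11 bytes
    r'(?: ("[^"]*")(?: ("[^"]*")(?: ("[^"]*"))?)?)?\s*\Z'  # 12 referrer, 13 client, 14 cookie
)

# Field names in group order: group 1 is origin, group 14 is cookie.
FIELD_ORDER = (
    'origin', 'identd', 'auth', 'date', 'time', 'tz',
    'method', 'path', 'protocol', 'status', 'bytes',
    'referrer', 'client', 'cookie',
)


def parse_positional(line):
    """Return a dict of fields for one log line, or None if it does not parse."""
    match = POSITIONAL_LOGLINE_PAT.match(line)
    if match is None:
        return None
    # groups() yields None for optional fields that are absent.
    return dict(zip(FIELD_ORDER, match.groups()))


# Hypothetical sample lines, one in common format and one in combined format.
COMMON_LINE = ('192.0.2.10 - - [23/Aug/2018:21:07:00 +0800] '
               '"GET /index.html HTTP/1.1" 200 1043')
COMBINED_LINE = COMMON_LINE + ' "http://example.com/" "ExampleBrowser/1.0"'

if __name__ == '__main__':
    for line in (COMMON_LINE, COMBINED_LINE):
        fields = parse_positional(line)
        print(fields['method'], fields['path'], fields['client'])
```

Both sample lines parse. For the common-format line the referrer, client, and cookie fields simply come back as None, which is why one pattern can serve both formats.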

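Avoiding spiders

You can learn many more interesting things from log files, but only if you can tell spider visits apart from human visits. Some major search engines, such as Google and Yahoo!, index very aggressively, so even if a site is not very popular, its logs can be dominated by traffic from these spiders. Eliminating 100% of spider traffic is almost impossible, but you can weed out most of it by checking the logs for common spider patterns. The "client" field is the key here. Listing 4 is another Python program that is also easy to port to other languages. It reads a log file from standard input and writes to standard output every line except those it identifies as spider traffic.

Listing 4. Removing search engine spider traffic from a log file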


import re
import sys

# This regular expression is the heart of the code.
# Python uses Perl-style regex syntax, so it should be readily portable.
# The r'' string form is just a convenience so you don't have to escape backslashes.
COMBINED_LOGLINE_PAT = re.compile(
    r'(?P<origin>\d+\.\d+\.\d+\.\d+) '
    + r'(?P<identd>-|\w*) (?P<auth>-|\w*) '
    + r'\[(?P<date>[^\[\]:]+):(?P<time>\d+:\d+:\d+) (?P<tz>[\-\+]?\d\d\d\d)\] '
    + r'"(?P<method>\w+) (?P<path>[\S]+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<bytes>-|\d+)'
    + r'( (?P<referrer>"[^"]*")( (?P<client>"[^"]*")( (?P<cookie>"[^"]*"))?)?)?\s*\Z'
)

# Patterns in the client field for sniffing out bots
BOT_TRACES = [
    (re.compile(r".*http://help\.yahoo\.com/help/us/ysearch/slurp.*"),
        "Yahoo robot"),
    (re.compile(r".*\+http://www\.google\.com/bot\.html.*"),
        "Google robot"),
    (re.compile(r".*\+http://about\.ask\.com/en/docs/about/webmasters\.shtml.*"),
        "Ask Jeeves/Teoma robot"),
    (re.compile(r".*\+http://search\.msn\.com/msnbot\.htm.*"),
        "MSN robot"),
    (re.compile(r".*http://www\.entireweb\.com/about/search_tech/speedy_spider/.*"),
        "Speedy Spider"),
    (re.compile(r".*\+http://www\.baidu\.com/search/spider_jp\.html.*"),
        "Baidu spider"),
    (re.compile(r".*\+http://www\.gigablast\.com/spider\.html.*"),
        "Gigabot robot"),
]

for line in sys.stdin:
    match_info = COMBINED_LOGLINE_PAT.match(line)
    if not match_info:
        sys.stderr.write("Unable to parse log line\n")
        continue
    # The client group is None for common-format lines, which lack that field
    client = match_info.group('client') or ''
    isbot = False
    for pat, botname in BOT_TRACES:
        if pat.match(client):
            isbot = True
            break
    if not isbot:
        sys.stdout.write(line)

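The list of spider client regular expressions here is not complete. New search engines appear all the time. When you spot a new spider while reviewing your traffic, you should be able to add it to the list by following the pattern of the existing entries.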
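
As a minimal sketch of what adding a spider might look like, the snippet below copies two entries from the listing and appends one more. The "Example robot" entry, its URL, and the helper identify_bot are hypothetical and exist only to show the shape of a new entry. A real entry would use whatever URL or token the new spider actually puts in its client string.

```python
import re

# Two entries copied from the listing, kept short for the example.
BOT_TRACES = [
    (re.compile(r".*\+http://www\.google\.com/bot\.html.*"), "Google robot"),
    (re.compile(r".*\+http://search\.msn\.com/msnbot\.htm.*"), "MSN robot"),
]

# Hypothetical new spider noticed in the logs; the URL and name are made up.
# Escape the dots so they match literally, as the existing entries do.
BOT_TRACES.append(
    (re.compile(r".*\+http://bot\.example\.com/about\.html.*"), "Example robot")
)


def identify_bot(client):
    """Return the name of the first matching bot, or None for other clients."""
    if not client:
        # Common-format lines have no client field at all.
        return None
    for pat, botname in BOT_TRACES:
        if pat.match(client):
            return botname
    return None


if __name__ == '__main__':
    sample = '"Mozilla/5.0 (compatible; ExampleBot/1.0; +http://bot.example.com/about.html)"'
    print(identify_bot(sample))
```

Returning the bot name rather than a plain flag is a small design choice: it makes it easy to count traffic per spider later, while the filter itself only needs to check whether the result is None.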
