Using Web server logs - 3: Basic tools for log statistics


Posted by look_w on 2018-8-23 21:08


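Basic tools for log statistics

Many popular tools are available for analyzing Web server logs and reporting statistics about a Web site. With the building blocks presented so far in this article, it is easy to develop specialized displays of log information. One more building block is the conversion from the Apache log format to JavaScript Object Notation (JSON). Once the information is in JSON form, it is easy to analyze, manipulate, and present with JavaScript.
You do not have to write the presentation tools yourself, either. In this section, I show how to convert an Apache log file into the flavor of JSON used by Exhibit, a powerful data-presentation tool from the MIT SIMILE project. I covered this tool in an earlier article, "Practical linked, open data with Exhibit". Given nothing more than JSON, Exhibit can create a rich, dynamic system for displaying, filtering, and searching data. Listing 5 (apachelog2exhibit.py) builds on the earlier examples, but converts the Apache log to Exhibit-flavored JSON.
Listing 5 (apachelog2exhibit.py). Converting non-spider traffic log entries to Exhibit JSON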


import re
import sys
import time
import http.client
import datetime
import itertools

# You'll need to install the simplejson module
# http://pypi.python.org/pypi/simplejson
import simplejson

# This regular expression is the heart of the code.
# Python uses Perl-style regex, so it should be readily portable
# The r'' string form is just a convenience so you don't have to escape backslashes
COMBINED_LOGLINE_PAT = re.compile(
    r'(?P<origin>\d+\.\d+\.\d+\.\d+) '
    + r'(?P<identd>-|\w*) (?P<auth>-|\w*) '
    + r'\[(?P<ts>(?P<date>[^\[\]:]+):(?P<time>\d+:\d+:\d+)) (?P<tz>[\-\+]?\d\d\d\d)\] '
    + r'"(?P<method>\w+) (?P<path>[\S]+) (?P<protocol>[^"]+)" (?P<status>\d+) (?P<bytes>-|\d+)'
    + r'( (?P<referrer>"[^"]*")( (?P<client>"[^"]*")( (?P<cookie>"[^"]*"))?)?)?\s*\Z'
)

# Patterns in the client field for sniffing out bots
BOT_TRACES = [
    (re.compile(r".*http://help\.yahoo\.com/help/us/ysearch/slurp.*"),
        "Yahoo robot"),
    (re.compile(r".*\+http://www\.google\.com/bot\.html.*"),
        "Google robot"),
    (re.compile(r".*\+http://about\.ask\.com/en/docs/about/webmasters.shtml.*"),
        "Ask Jeeves/Teoma robot"),
    (re.compile(r".*\+http://search\.msn\.com\/msnbot\.htm.*"),
        "MSN robot"),
    (re.compile(r".*http://www\.entireweb\.com/about/search_tech/speedy_spider/.*"),
        "Speedy Spider"),
    (re.compile(r".*\+http://www\.baidu\.com/search/spider_jp\.html.*"),
        "Baidu spider"),
    (re.compile(r".*\+http://www\.gigablast\.com/spider\.html.*"),
        "Gigabot robot"),
]

MAXRECORDS = 1000

# Apache's date/time format is very messy, so dealing with it is messy
# This class provides support for managing timezones in the Apache time field
# Reuses some code from: http://seehuhn.de/blog/52
class timezone(datetime.tzinfo):
    def __init__(self, name="+0000"):
        self.name = name
        # Apply the sign to both the hours and the minutes of the offset
        sign = -1 if name.startswith('-') else 1
        digits = name.lstrip('+-')
        seconds = sign * (int(digits[:2]) * 3600 + int(digits[2:]) * 60)
        self.offset = datetime.timedelta(seconds=seconds)

    def utcoffset(self, dt):
        return self.offset

    def dst(self, dt):
        return datetime.timedelta(0)

    def tzname(self, dt):
        return self.name

def parse_apache_date(date_str, tz_str):
    '''
    Parse the timestamp from the Apache log file, and return a datetime object
    '''
    tt = time.strptime(date_str, "%d/%b/%Y:%H:%M:%S")
    # Keep year..second, then add microseconds (0) and the tzinfo object
    tt = tt[:6] + (0, timezone(tz_str))
    return datetime.datetime(*tt)

def bot_check(match_info):
    '''
    Return True if the matched line looks like a robot
    '''
    # The client field is optional in the pattern, so it may be None
    client = match_info.group('client') or ''
    for pat, botname in BOT_TRACES:
        if pat.match(client):
            return True
    return False

entries = []

# enumerate lets you iterate over the lines in the file, maintaining a count variable
# itertools.islice lets you iterate over only a subset of the lines in the file
for count, line in enumerate(itertools.islice(sys.stdin, 0, MAXRECORDS)):
    match_info = COMBINED_LOGLINE_PAT.match(line)
    if not match_info:
        sys.stderr.write("Unable to parse log line\n")
        continue
    # If you want to include robot clients, comment out the next two lines
    if bot_check(match_info):
        continue
    entry = {}
    timestamp = parse_apache_date(match_info.group('ts'), match_info.group('tz'))
    timestamp_str = timestamp.isoformat()
    # To make Exhibit happy, set id and label fields that give some information
    # about the entry, but are unique across all entries (ensured by appending count)
    entry['id'] = match_info.group('origin') + ':' + timestamp_str + ':' + str(count)
    entry['label'] = entry['id']
    entry['origin'] = match_info.group('origin')
    entry['timestamp'] = timestamp_str
    entry['path'] = match_info.group('path')
    entry['method'] = match_info.group('method')
    entry['protocol'] = match_info.group('protocol')
    status = match_info.group('status')
    # Append the standard reason phrase when the status code is a known one
    reason = http.client.responses.get(int(status))
    entry['status'] = (status + ' ' + reason) if reason else status
    if match_info.group('bytes') != '-':
        entry['bytes'] = match_info.group('bytes')
    # Skip the referrer when it is missing or logged as "-"
    if match_info.group('referrer') not in (None, '"-"'):
        entry['referrer'] = match_info.group('referrer')
    entry['client'] = match_info.group('client')
    entries.append(entry)

print(simplejson.dumps({'items': entries}, indent=4))
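
Before running a converter like this over a full log, it can help to check the parsing approach against a single line. The following is a minimal sketch of that idea, not part of the original script: SAMPLE_PAT is a simplified, hypothetical pattern, SAMPLE_LINE is a made-up log entry, and parse_line is an illustrative helper. It also assumes Python 3, where the %z directive of strptime reads the Apache timezone offset directly.

```python
import re
from datetime import datetime

# Simplified combined-log pattern (hypothetical, for illustration only)
SAMPLE_PAT = re.compile(
    r'(?P<origin>\S+) (?P<identd>\S+) (?P<auth>\S+) '
    r'\[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\w+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d+) (?P<bytes>-|\d+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<client>[^"]*)")?\s*\Z'
)

# Made-up log line in Apache combined format
SAMPLE_LINE = (
    '192.0.2.10 - - [27/Apr/2009:08:21:42 -0500] '
    '"GET /index.html HTTP/1.1" 200 2638 "-" "ExampleBrowser/1.0"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = SAMPLE_PAT.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # %z reads the +HHMM/-HHMM offset, yielding a timezone-aware datetime
    stamp = datetime.strptime(fields['ts'], '%d/%b/%Y:%H:%M:%S %z')
    fields['timestamp'] = stamp.isoformat()
    return fields

parsed = parse_line(SAMPLE_LINE)
print(parsed['origin'], parsed['status'], parsed['timestamp'])
```

Returning None for unparseable lines mirrors the "Unable to parse log line" branch of the listing, so a caller can decide whether to skip or report the line.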

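To capture the JSON output, simply feed an Apache log file to python apachelog2exhibit.py on standard input. Listing 6 shows a simple example of the resulting JSON.
Listing 6. Sample Exhibit JSON converted from an Apache log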


{
    "items": [
        {
            "origin": "208.111.154.16",
            "status": "200 OK",
            "protocol": "HTTP/1.1",
            "timestamp": "2009-04-27T08:21:42-05:00",
            "bytes": "2638",
            "auth": "-",
            "label": "208.111.154.16:2009-04-27T08:21:42-05:00:2",
            "identd": "-",
            "method": "GET",
            "client": "Mozilla/5.0 (compatible; Charlotte/1.1; http://www.searchme.com/support/)",
            "referrer": "-",
            "path": "/uche.ogbuji.net",
            "id": "208.111.154.16:2009-04-27T08:21:42-05:00:2"
        },
        {
            "origin": "65.103.181.249",
            "status": "200 OK",
            "protocol": "HTTP/1.1",
            "timestamp": "2009-04-27T09:11:54-05:00",
            "bytes": "6767",
            "auth": "-",
            "label": "65.103.181.249:2009-04-27T09:11:54-05:00:4",
            "identd": "-",
            "method": "GET",
            "client": "Mozilla/5.0 (compatible; MJ12bot/v1.2.4; http://www.majestic12.co.uk/bot.php?+)",
            "referrer": "-",
            "path": "/",
            "id": "65.103.181.249:2009-04-27T09:11:54-05:00:4"
        }
    ]
}
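
Exhibit is one consumer of this JSON, but the same file also lends itself to quick statistics in a few lines of script. The sketch below is one possible way to do that and is not from the original article: EXHIBIT_JSON is a small stand-in with made-up values, and summarize is a hypothetical helper that counts requests per status and path and totals the bytes served.

```python
import json
from collections import Counter

# A tiny stand-in for the converter's output (values are made up)
EXHIBIT_JSON = '''
{"items": [
  {"id": "a:1", "origin": "192.0.2.10", "status": "200 OK", "path": "/", "bytes": "2638"},
  {"id": "a:2", "origin": "192.0.2.11", "status": "404 Not Found", "path": "/missing"},
  {"id": "a:3", "origin": "192.0.2.10", "status": "200 OK", "path": "/", "bytes": "6767"}
]}
'''

def summarize(items):
    """Count requests per status and per path, and total the bytes served."""
    by_status = Counter(item['status'] for item in items)
    by_path = Counter(item['path'] for item in items)
    # 'bytes' is absent when Apache logged "-", so treat it as zero
    total_bytes = sum(int(item.get('bytes', 0)) for item in items)
    return by_status, by_path, total_bytes

items = json.loads(EXHIBIT_JSON)['items']
by_status, by_path, total_bytes = summarize(items)
print(by_status.most_common())
print(by_path.most_common())
print(total_bytes)
```

In practice the string would be replaced by the contents of the captured output file; the counting logic stays the same.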
