Build a Web spider on Linux (5) Example: Web site crawler

Posted by look_w on 2018-7-20 08:48


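In this final example, I explore a Web spider that crawls a Web site. To be safe, I avoid wandering outside the site and instead dig into a single Web page.
To crawl a Web site and follow the links it provides, you must parse its HTML pages. If you can successfully parse a Web page, you can identify its links to other resources. Some of these links point to local resources (files), and others point to non-local resources (such as links to other Web pages).
To crawl the Web, you start with a given Web page, identify all the links on that page, append them to a to-visit queue, and then repeat the process with the first item in the to-visit queue. This yields a breadth-first traversal (as opposed to a depth-first traversal, in which you descend into the first link you find).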
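The queue-driven traversal described above can be sketched as follows. This is a minimal illustration, not the spider itself: the `links` dictionary and the `breadth_first` function are hypothetical names, and the dictionary stands in for fetching and parsing real pages.

```python
from collections import deque

# Hypothetical site map: page -> links found on that page.
# A real spider would fetch and parse each page instead.
links = {
    "/": ["/a.html", "/b.html"],
    "/a.html": ["/c.html", "/b.html"],
    "/b.html": ["/d.html"],
    "/c.html": [],
    "/d.html": ["/"],
}


def breadth_first(start):
    """Return pages in the order a breadth-first crawl visits them."""
    to_visit = deque([start])   # links waiting to be visited
    seen = {start}              # links already queued, to avoid repeats
    order = []
    while to_visit:
        page = to_visit.popleft()        # always take the oldest entry
        order.append(page)
        for link in links.get(page, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)    # new links go to the back
    return order


print(breadth_first("/"))
# ['/', '/a.html', '/b.html', '/c.html', '/d.html']
```

Taking the newest entry instead of the oldest (`pop()` rather than `popleft()`) would make the traversal follow each branch downward first, which is the depth-first behavior mentioned above.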
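If you avoid non-local links and visit only local Web pages, you have a Web crawler for a single Web site, as shown in Listing 7. In this example, I switch from Ruby to Python to take advantage of Python's very useful HTMLParser class.
Listing 7. Simple Python Web site crawler (minispider.py)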

#!/usr/local/bin/python
# Python 2 script (uses httplib, the HTMLParser module, and the print statement).

import httplib
import sys
import re
from HTMLParser import HTMLParser


class miniHTMLParser( HTMLParser ):

  # Declared in the class body, so both lists are shared class attributes.
  viewedQueue = []   # links already seen
  instQueue = []     # links waiting to be interrogated

  def get_next_link( self ):
    # Return the next link to interrogate, or '' when the queue is empty.
    if self.instQueue == []:
      return ''
    else:
      return self.instQueue.pop(0)

  def gethtmlfile( self, site, page ):
    # Fetch the page over HTTP; return an empty string on any failure.
    try:
      httpconn = httplib.HTTPConnection(site)
      httpconn.request("GET", page)
      resp = httpconn.getresponse()
      resppage = resp.read()
    except Exception:
      resppage = ""

    return resppage

  def handle_starttag( self, tag, attrs ):
    # Called by the parser for every start tag. Anchors without any
    # attributes are skipped so that attrs[0] cannot raise IndexError.
    if tag == 'a' and attrs:
      # Uses the value of the first attribute, whatever its name.
      newstr = str(attrs[0][1])
      if re.search('http', newstr) is None:
        if re.search('mailto', newstr) is None:
          if re.search('htm', newstr) is not None:
            if newstr not in self.viewedQueue:
              print "  adding", newstr
              self.instQueue.append( newstr )
              self.viewedQueue.append( newstr )
          else:
            print "  ignoring", newstr
        else:
          print "  ignoring", newstr
      else:
        print "  ignoring", newstr


def main():

  # Both the site and the starting link are required.
  if len(sys.argv) < 3:
    print "usage is ./minispider.py site link"
    sys.exit(2)

  mySpider = miniHTMLParser()

  link = sys.argv[2]

  while link != '':

    print "\nChecking link ", link

    # Get the file from the site and link
    retfile = mySpider.gethtmlfile( sys.argv[1], link )

    # Feed the file into the HTML parser
    mySpider.feed(retfile)

    # Search the retfile here

    # Get the next link in level traversal order
    link = mySpider.get_next_link()

  mySpider.close()

  print "\ndone\n"

if __name__ == "__main__":
  main()

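The basic design of this crawler is to load the first link into a queue. This queue is the next-to-interrogate queue. When a link is interrogated, any new links found are added to the same queue, which gives a breadth-first search. A second queue of already-viewed links prevents the crawler from revisiting links it has seen before. That is essentially all there is to it; most of the real work is done by the HTML parser.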
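First, I derive a new class, miniHTMLParser, from Python's HTMLParser class. This class does several things. First, it serves as the HTML parser, providing a callback method (handle_starttag) that is invoked whenever a start HTML tag is encountered. Second, it is used to access the links encountered during the crawl (get_next_link) and to retrieve the file that a link represents (in this case, an HTML file).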
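The callback mechanism can be seen in isolation in the following sketch. It assumes Python 3, where the parser lives in the `html.parser` module rather than the Python 2 `HTMLParser` module used in Listing 7, and `LinkCollector` is a hypothetical class name. Unlike Listing 7, which takes the value of the first attribute regardless of its name, this sketch looks up `href` by name.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


collector = LinkCollector()
collector.feed('<p><a class="nav" href="news/index.html">News</a> '
               '<a href="mailto:webmaster@example.org">Mail</a></p>')
collector.close()
print(collector.hrefs)
# ['news/index.html', 'mailto:webmaster@example.org']
```

Looking the attribute up by name avoids picking up values such as a `class` name when `href` is not the first attribute, which is the likely source of entries like `hiddenStructure` in Listing 8.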
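The class also contains two queue variables: viewedQueue, which holds the links examined so far, and instQueue, which holds the links still to be interrogated. Both are declared in the class body, so they are class attributes shared by every instance rather than per-instance variables; with the single parser instance used here, the difference has no visible effect.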
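Because viewedQueue and instQueue are declared in the class body of Listing 7, every instance of the class refers to the same two lists. The sketch below, using a hypothetical `Spider` class, shows the difference between a list declared that way and one created in `__init__`.

```python
class Spider:
    queue = []                 # class attribute: one list shared by all instances

    def __init__(self):
        self.own_queue = []    # instance attribute: one list per instance


first, second = Spider(), Spider()
first.queue.append("a.html")
first.own_queue.append("b.html")

print(second.queue)       # ['a.html']  (the shared list sees the entry)
print(second.own_queue)   # []          (the per-instance list does not)
```

With a single parser instance, as in Listing 7, the shared lists behave the same as per-instance ones. The distinction matters only if several spiders are created in one process.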
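As you can see, the methods of the class are simple. The get_next_link method checks whether instQueue is empty and, if so, returns ''. Otherwise, it returns the next item through the pop method. The gethtmlfile method uses HTTPConnection to connect to the site and returns the contents of the specified page. Finally, handle_starttag is called for every start tag in a Web page (the page is passed to the HTML parser through the feed method). In this function, I check whether the link is non-local (it contains http), whether it is an e-mail address (it contains mailto), and whether it contains 'htm', which indicates (with high probability) that it is a Web page. I also check that the link has not been visited before; if it has not, I add it to both the interrogate queue and the viewed queue.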
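The link tests described above can be pulled out into a small function to make their effect easier to see. This is a sketch in Python 3; `classify` and its return values are hypothetical names, and the sample links are taken from Listing 8.

```python
import re


def classify(link, viewed):
    """Apply the same three substring tests as Listing 7 to one link."""
    if re.search("http", link) is not None:
        return "ignore"      # treated as non-local
    if re.search("mailto", link) is not None:
        return "ignore"      # e-mail address
    if re.search("htm", link) is None:
        return "ignore"      # probably not a Web page
    if link in viewed:
        return "seen"        # already queued once; Listing 7 prints nothing here
    viewed.append(link)
    return "add"


viewed = []
for link in ["http://www.fsf.org/news", "mailto:webmaster@fsf.org",
             "/news/RSS", "campaigns/priority.html", "campaigns/priority.html"]:
    print(classify(link, viewed), link)
```

Because `re.search` matches anywhere in the string, these tests are heuristics: a local link such as `docs/http-intro.html` would also be ignored, since it contains `http`.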
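The main function is simple. It creates a new miniHTMLParser instance and starts with the user-defined site (argv[1]) and link (argv[2]). It then gets the contents of the link, feeds them to the HTML parser, and gets the next link to visit, if one exists. The loop continues as long as there are links left to visit.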
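The fetch, feed, and dequeue cycle of the main loop can be tried without a network connection. The sketch below assumes Python 3 and replaces the HTTP request with a lookup in a hypothetical `PAGES` dictionary; `QueueingParser` and `crawl` are likewise illustrative names, not part of the original script.

```python
from html.parser import HTMLParser

# Hypothetical in-memory pages standing in for HTTP responses.
PAGES = {
    "index.html": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "a.html": '<a href="b.html">B</a> <a href="index.html">Home</a>',
    "b.html": "<p>No links here</p>",
}


class QueueingParser(HTMLParser):
    def __init__(self, start):
        super().__init__()
        self.pending = []        # links still to be fetched
        self.viewed = [start]    # links already queued or fetched

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href and href not in self.viewed:
            self.pending.append(href)
            self.viewed.append(href)

    def get_next_link(self):
        # Same convention as Listing 7: '' means the queue is empty.
        return self.pending.pop(0) if self.pending else ""


def crawl(fetch, start):
    """Run the fetch/feed/dequeue loop and return the links checked, in order."""
    parser = QueueingParser(start)
    checked = []
    link = start
    while link != "":
        checked.append(link)
        parser.feed(fetch(link))         # callbacks fill parser.pending
        link = parser.get_next_link()
    parser.close()
    return checked


print(crawl(lambda page: PAGES.get(page, ""), "index.html"))
# ['index.html', 'a.html', 'b.html']
```

Passing the fetch step in as a function keeps the loop identical whether the pages come from a dictionary, as here, or from an HTTP connection.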
To invoke the Web spider, provide a Web site address and a link:
./minispider.py www.fsf.org /
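In this case, I request the root file of the Free Software Foundation. The result of this command is shown in Listing 8. You can see the new links that are added to the interrogate queue and the links that are ignored, such as non-local links. At the bottom of the listing, you can see the interrogation of the links found in the root file.
Listing 8. Output of the minispider script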

[mtj@camus]$ ./minispider.py www.fsf.org /

Checking link  /
  ignoring hiddenStructure
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/events
  ignoring http://www.fsf.org/campaigns
  ignoring http://www.fsf.org/resources
  ignoring http://www.fsf.org/donate
  ignoring http://www.fsf.org/associate
  ignoring http://www.fsf.org/licensing
  ignoring http://www.fsf.org/blogs
  ignoring http://www.fsf.org/about
  ignoring https://www.fsf.org/login_form
  ignoring http://www.fsf.org/join_form
  ignoring http://www.fsf.org/news/fs-award-2005.html
  ignoring http://www.fsf.org/news/fsfsysadmin.html
  ignoring http://www.fsf.org/news/digital-communities.html
  ignoring http://www.fsf.org/news/patents-defeated.html
  ignoring /news/RSS
  ignoring http://www.fsf.org/news
  ignoring http://www.fsf.org/blogs/rms/entry-20050802.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050712.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050601.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050526.html
  ignoring http://www.fsf.org/blogs/rms/entry-20050513.html
  ignoring http://www.fsf.org/index_html/SimpleBlogFullSearch
  ignoring documentContent
  ignoring http://www.fsf.org/index_html/sendto_form
  ignoring javascript:this.print();
  adding licensing/essays/free-sw.html
  ignoring /licensing/essays
  ignoring http://www.gnu.org/philosophy
  ignoring http://www.freesoftwaremagazine.com
  ignoring donate
  ignoring join_form
  adding associate/index_html
  ignoring http://order.fsf.org
  adding donate/patron/index_html
  adding campaigns/priority.html
  ignoring http://r300.sf.net/
  ignoring http://developer.classpath.org/mediation/OpenOffice2GCJ4
  ignoring http://gcc.gnu.org/java/index.html
  ignoring http://www.gnu.org/software/classpath/
  ignoring http://gplflash.sourceforge.net/
  ignoring campaigns
  adding campaigns/broadcast-flag.html
  ignoring http://www.gnu.org
  ignoring /fsf/licensing
  ignoring http://directory.fsf.org
  ignoring http://savannah.gnu.org
  ignoring mailto:webmaster@fsf.org
  ignoring http://www.fsf.org/Members/root
  ignoring http://www.plonesolutions.com
  ignoring http://www.enfoldtechnology.com
  ignoring http://blacktar.com
  ignoring http://plone.org
  ignoring http://www.section508.gov
  ignoring http://www.w3.org/WAI/WCAG1AA-Conformance
  ignoring http://validator.w3.org/check/referer
  ignoring http://jigsaw.w3.org/css-validator/check/referer
  ignoring http://plone.org/browsersupport

Checking link  licensing/essays/free-sw.html
  ignoring mailto:webmaster

Checking link  associate/index_html
  ignoring mailto:webmaster

Checking link  donate/patron/index_html
  ignoring mailto:webmaster

Checking link  campaigns/priority.html
  ignoring mailto:webmaster

Checking link  campaigns/broadcast-flag.html
  ignoring mailto:webmaster

done

[mtj@camus]$
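Listing 8 shows relative links such as licensing/essays/free-sw.html being requested exactly as they appeared in the page, without a leading slash. One possible refinement, which is not part of the original script, is to resolve each link against the URL of the page it came from before requesting it. The sketch below assumes Python 3 and uses the standard `urllib.parse` module; `request_path` is a hypothetical helper name.

```python
from urllib.parse import urljoin, urlsplit


def request_path(base_url, link):
    """Resolve link against base_url and return the absolute path to request."""
    absolute = urljoin(base_url, link)
    return urlsplit(absolute).path or "/"


base = "http://www.fsf.org/"
print(request_path(base, "licensing/essays/free-sw.html"))
# /licensing/essays/free-sw.html
print(request_path(base, "/news/RSS"))
# /news/RSS
print(request_path("http://www.fsf.org/campaigns/priority.html",
                   "broadcast-flag.html"))
# /campaigns/broadcast-flag.html
```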

This example demonstrates the crawling phase of a Web spider. After the client reads a file, the content of the page is scanned, just as an indexer would do.
