在 Linux 上构建 Web spider（2）实例:简单的 scraper

论坛元老

Rank: 8 Rank: 8

UID: 1066743

1^#

打印

字体大小: tT

look_w发表于 2018-7-20 08:46 | 只看该作者

在 Linux 上构建 Web spider（2）实例:简单的 scraper

这个例子向您展示了该如何确定给定的 Web 站点正在运行哪种 Web 服务器。这可能非常有趣，而且如果能在一个足够大的示例上实现，还可以提供关于 Web 服务器在政府、学术界和工业界中的普及率的有趣统计数据。
清单 1 给出了用来搜索 Web 站点以确定 HTTP 服务器的 Ruby 脚本。Net::HTTP 类实现了一个 HTTP 客户机和 GET、HEAD 和 POST 方法。只要向 HTTP 服务器发起一个请求，HTTP 响应消息的一部分就会指出这些内容是由哪个服务器提供的。这里使用 HEAD 方法来获取有关根页面（'/'）的信息，而没有从站点上下载一个页面。只要 HTTP 服务器成功响应（由响应代码 "200" 指示），就会循环迭代响应消息的每行内容，来寻找 server 关键字，如果找到，就打印这个值。这个关键字的值是一个代表 HTTP 服务器的字符串。
清单 1. 用来简单搜索元数据的 Ruby 脚本（srvinfo.rb）

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32

#!/usr/local/bin/ruby
require 'net/http'

# Get the first argument from the command-line (the URL)
url = ARGV[0]

begin

  # Create a new HTTP connection
  httpCon = Net::HTTP.new( url, 80 )

  # Perform a HEAD request
  resp, data = httpCon.head( "/", nil )

  # If it succeeded (200 is success)
  if resp.code == "200" then

# Iterate through the response hash
resp.each {|key,val|

   # If the key is the server, print the value
   if key == "server" then

      print "  The server at "+url+" is "+val+"\n"

   end

}

  end

end

除了显示如何使用 srvinfo 脚本之外，清单 2 还给出了 scraper 在很多政府、学术和商业 Web 站点上的应用。这些服务器有很大差异，从 Apache（占 68% ）到 Sun 和 Microsoft® 的 IIS（Internet Information Services）。您还可以看到其中有一个应用没有给出所使用的服务器。有趣的是就当密克罗尼西亚联邦政府还在运行一个旧版本的 Apache（应该更新了）的时候，Apache.org 却在技术上不断大胆尝试、推陈出新。
清单 2. 服务器 scraper 的示例应用

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33

[mtj@camus]$ ./srvrinfo.rb www.whitehouse.gov
  The server at www.whitehouse.gov is Apache
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/2.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.ru
  The server at www.gov.ru is Apache/1.3.29 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.gov.cn
[mtj@camus]$ ./srvrinfo.rb www.kantei.go.jp
  The server at www.kantei.go.jp is Apache
[mtj@camus]$ ./srvrinfo.rb www.pmo.gov.to
  The server at www.pmo.gov.to is Apache/2.0.46 (Red Hat Linux)
[mtj@camus]$ ./srvrinfo.rb www.mozambique.mz
  The server at www.mozambique.mz is Apache/1.3.27
(Unix) PHP/3.0.18 PHP/4.2.3
[mtj@camus]$ ./srvrinfo.rb www.cisco.com
  The server at www.cisco.com is Apache/1.0 (Unix)
[mtj@camus]$ ./srvrinfo.rb www.mit.edu
  The server at www.mit.edu is MIT Web Server Apache/1.3.26 Mark/1.5
(Unix) mod_ssl/2.8.9 OpenSSL/0.9.7c
[mtj@camus]$ ./srvrinfo.rb www.stanford.edu
  The server at www.stanford.edu is Apache/2.0.54 (Debian GNU/Linux)
mod_fastcgi/2.4.2 mod_ssl/2.0.54 OpenSSL/0.9.7e WebAuth/3.2.8
[mtj@camus]$ ./srvrinfo.rb www.fsmgov.org
  The server at www.fsmgov.org is Apache/1.3.27 (Unix) PHP/4.3.1
[mtj@camus]$ ./srvrinfo.rb www.csuchico.edu
  The server at www.csuchico.edu is Sun-ONE-Web-Server/6.1
[mtj@camus]$ ./srvrinfo.rb www.sun.com
  The server at www.sun.com is Sun Java System Web Server 6.1
[mtj@camus]$ ./srvrinfo.rb www.microsoft.com
  The server at www.microsoft.com is Microsoft-IIS/6.0
[mtj@camus]$ ./srvrinfo.rb www.apache.org
The server at www.apache.org is Apache/2.2.3 (Unix)
mod_ssl/2.2.3 OpenSSL/0.9.7g

这些数据都非常有用，可以从中看到政府和学术机构都使用了何种 Web 服务器。下一个例子将展示一些更加有用的信息：一个搜集股票价格的 scraper。

收藏分享评分

回复引用

订阅 TOP

返回列表