Downloading an HTML page in Java: saving web page content as a local HTML file (1)

look_w posted on 2019-4-20 14:40

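Earlier, when we covered fetching page content with HttpClient, we usually took the page source (content) and stored it in a database.

See the following posts for details:

HttpGet and HttpPost in the HttpClient module

Commonly used basic fetching classes for HttpClient

So what should we do if, besides getting the page source, we also want to save the page locally as an HTML file?

It is actually simple. First, let's look at the code that requests the page and obtains the content: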

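      // Imports needed by the two methods below (Apache HttpClient 4.x).
      import java.io.IOException;
      import java.io.InputStream;
      import java.io.InputStreamReader;
      import java.io.Reader;
      import java.nio.charset.Charset;

      import org.apache.http.HttpEntity;
      import org.apache.http.HttpResponse;
      import org.apache.http.client.entity.DeflateDecompressingEntity;
      import org.apache.http.client.entity.GzipDecompressingEntity;
      import org.apache.http.client.methods.HttpGet;
      import org.apache.http.impl.client.DefaultHttpClient;
      import org.apache.http.params.CoreConnectionPNames;
      import org.apache.http.protocol.HTTP;
      import org.apache.http.util.CharArrayBuffer;
      import org.apache.http.util.EntityUtils;

      // `encode` is a field of the enclosing class whose declaration is not
      // shown in the original post; it is assumed here to be a Charset.
      private static String getUrlContent(DefaultHttpClient httpPostClient,
            String urlString) throws IOException {
         // Set the timeouts before executing the request; if they are set
         // afterwards they only take effect for later requests.
         httpPostClient.getParams().setParameter(
                  CoreConnectionPNames.CONNECTION_TIMEOUT, 10000);// connect timeout: 10 s
         httpPostClient.getParams().setParameter(
                  CoreConnectionPNames.SO_TIMEOUT, 8000);// socket read timeout: 8 s
         HttpGet httpGet = new HttpGet(urlString);// HttpGet implements HttpUriRequest
         HttpResponse httpGetResponse = httpPostClient.execute(httpGet);
         if (httpGetResponse.getStatusLine().getStatusCode() == 200) {
            HttpEntity httpEntity = httpGetResponse.getEntity();
            // Wrap the entity so a compressed body is decompressed on read.
            if (httpEntity.getContentEncoding() != null) {
                  if ("gzip".equalsIgnoreCase(httpEntity.getContentEncoding()
                        .getValue())) {
                     httpEntity = new GzipDecompressingEntity(httpEntity);
                  } else if ("deflate".equalsIgnoreCase(httpEntity
                        .getContentEncoding().getValue())) {
                     httpEntity = new DeflateDecompressingEntity(httpEntity);
                  }
            }
            // The original post calls enCodetoString(...), which is not shown;
            // the helper below is called directly instead.
            String result = enCodetoStringDo(httpEntity, encode);// read the response body as a string
            // System.out.println(result);
            return result;
         }
         // Release the connection when the response is not 200.
         EntityUtils.consume(httpGetResponse.getEntity());
         return "";
      }

      public static String enCodetoStringDo(final HttpEntity entity,
            Charset defaultCharset) throws IOException {
         if (entity == null) {
            throw new IllegalArgumentException("HTTP entity may not be null");
         }
         InputStream instream = entity.getContent();
         if (instream == null) {
            return null;
         }
         try {
            if (entity.getContentLength() > Integer.MAX_VALUE) {
                  throw new IllegalArgumentException(
                        "HTTP entity too large to be buffered in memory");
            }
            // Initial buffer size: the declared length, or 4096 if unknown.
            int i = (int) entity.getContentLength();
            if (i < 0) {
                  i = 4096;
            }
            // Charset detection from the Content-Type header is disabled in
            // the original post, so the caller's default is always used:
            // ContentType contentType = ContentType.get(entity);
            // if (contentType != null) {
            // charset = contentType.getCharset();
            // }
            Charset charset = defaultCharset;
            if (charset == null) {
                  charset = HTTP.DEF_CONTENT_CHARSET;
            }
            Reader reader = new InputStreamReader(instream, charset);
            CharArrayBuffer buffer = new CharArrayBuffer(i);
            char[] tmp = new char[1024];
            int l;
            // Decode the stream chunk by chunk until end of input.
            while ((l = reader.read(tmp)) != -1) {
                  buffer.append(tmp, 0, l);
            }
            return buffer.toString();
         } finally {
            instream.close();
         }
      }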
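The gzip branch in getUrlContent relies on HttpClient's wrapper entity to decompress the body. To show what that decompression amounts to, here is a minimal sketch that uses only the JDK's java.util.zip classes. The class name GzipSketch and its two helper methods are made up for this illustration and are not part of HttpClient or of the original post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipSketch {

    // Compresses bytes the way a server would for "Content-Encoding: gzip".
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    // Decompresses a gzip body back into the raw page bytes.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(page));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
    }
}
```

The bytes only become text after decompression, which is why the entity is wrapped before it is passed to the string-decoding helper.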

Once we have the content, we can write it straight to a local file.
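The post does not show the saving step itself, so the following is only a sketch of one way to do it with java.nio.file. The class name HtmlSaver, the method saveAsHtml, and the output path pages/index.html are assumptions made for this example; the string literal stands in for the value returned by getUrlContent.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, for illustration only.
class HtmlSaver {

    // Writes the page source to the target file in the given charset,
    // creating missing parent directories first.
    static Path saveAsHtml(String content, Path target, Charset charset) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        // With no options, Files.write creates the file or overwrites an existing one.
        return Files.write(target, content.getBytes(charset));
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the string returned by getUrlContent(...).
        String content = "<html><body>hello</body></html>";
        Path saved = saveAsHtml(content, Paths.get("pages", "index.html"), StandardCharsets.UTF_8);
        System.out.println("Saved to " + saved.toAbsolutePath());
    }
}
```

It is probably safest to write the file in the same charset that was used to decode the response, so that any charset declared inside the page still matches the bytes on disk.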
