A complete fix for garbled data fetched with HttpClient (2)

look_w, posted 2019-4-20 11:03

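If none of the approaches above solved the problem, you are probably hitting one of the following two special cases:

Case 1: the response body is compressed

While extracting web pages with HttpClient, a packet-capture tool showed that the request headers include the field Accept-Encoding: gzip, deflate.

When a request carries this header, the server may compress the response body with gzip or deflate before sending it, and it marks the compressed response with a Content-Encoding header. So far I have only seen gzip used in practice. What the client receives is the compressed data, which cuts the server's outbound traffic and the amount of data sent over the network.

If the header is present and you do not handle the compression, the result is garbled text. That is inevitable, because you are decoding compressed bytes as if they were plain text. The code below uses HttpClient to handle responses to requests sent with Accept-Encoding: gzip, deflate, so that the content decodes correctly.

Add the following code: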

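    // Needs: org.apache.http.Header, org.apache.http.HttpEntity, org.apache.http.util.EntityUtils,
    // org.apache.http.client.entity.GzipDecompressingEntity and DeflateDecompressingEntity.
    if (httpResponse.getStatusLine().getStatusCode() == 200) {
        HttpEntity httpEntity = httpResponse.getEntity();
        Header contentEncoding = httpEntity.getContentEncoding();
        if (contentEncoding != null) {
            // Wrap the entity so its content stream is decompressed as it is read.
            if ("gzip".equalsIgnoreCase(contentEncoding.getValue())) {
                httpEntity = new GzipDecompressingEntity(httpEntity);
            } else if ("deflate".equalsIgnoreCase(contentEncoding.getValue())) {
                httpEntity = new DeflateDecompressingEntity(httpEntity);
            }
        }
        result = EntityUtils.toString(httpEntity, encode); // read the response body as a string
        // System.out.println(result);
    }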
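
The effect described in case 1 can be reproduced without any network call. The following is a minimal sketch that uses only JDK classes (java.util.zip), not HttpClient; the class name GzipBodyDemo and its helper methods are made up for illustration. It compresses a small page the way a server would for Content-Encoding: gzip, then shows that decoding the compressed bytes directly does not give back the page, while decompressing first does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of HttpClient.
final class GzipBodyDemo {

    // Compresses UTF-8 text, standing in for a server that answers with Content-Encoding: gzip.
    static byte[] gzip(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Decompresses a gzip body first, then decodes the plain bytes as UTF-8.
    static String gunzip(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            byte[] buf = new byte[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String page = "<html>\u4f60\u597d, HttpClient</html>";
        byte[] body = gzip(page);

        // Decoding the still-compressed bytes as text produces garbage.
        String garbled = new String(body, StandardCharsets.UTF_8);
        System.out.println(garbled.equals(page));      // prints false

        // Decompressing before decoding recovers the original page.
        System.out.println(gunzip(body).equals(page)); // prints true
    }
}
```

GzipDecompressingEntity plays the role of gunzip here: it wraps the entity so that the bytes are decompressed before EntityUtils.toString turns them into characters.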

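Case 2: System.out.println(EntityUtils.toString(entity, "GBK")) decodes most pages correctly, but a small number still come out garbled

This happens because the encoding we passed in did not take effect. As the parameter name defaultCharset in the source below indicates, the argument is only a fallback: the method first uses the charset declared in the response's Content-Type, and falls back to the argument only when none is declared.

Here is the EntityUtils source from httpcore-4.2.4.jar: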

   public static String toString(
            final HttpEntity entity, final String defaultCharset) throws IOException, ParseException {
         return toString(entity, defaultCharset != null ? Charset.forName(defaultCharset) : null);
      }

public static String toString(
            final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
         if (entity == null) {
            throw new IllegalArgumentException("HTTP entity may not be null");
         }
         InputStream instream = entity.getContent();
         if (instream == null) {
            return null;
         }
         try {
            if (entity.getContentLength() > Integer.MAX_VALUE) {
                  throw new IllegalArgumentException("HTTP entity too large to be buffered in memory");
            }
            int i = (int)entity.getContentLength();
            if (i < 0) {
                  i = 4096;
            }
            Charset charset = null;
            try {
                  ContentType contentType = ContentType.get(entity);
                  if (contentType != null) {
                     charset = contentType.getCharset();
                  }
            } catch (final UnsupportedCharsetException ex) {
                  throw new UnsupportedEncodingException(ex.getMessage());
            }
            if (charset == null) {
                  charset = defaultCharset;
            }
            if (charset == null) {
                  charset = HTTP.DEF_CONTENT_CHARSET;
            }
            Reader reader = new InputStreamReader(instream, charset);
            CharArrayBuffer buffer = new CharArrayBuffer(i);
            char[] tmp = new char[1024];
            int l;
            while((l = reader.read(tmp)) != -1) {
                  buffer.append(tmp, 0, l);
            }
            return buffer.toString();
         } finally {
            instream.close();
         }
      }
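
The order of precedence in the source above can be sketched with JDK classes alone. This is an illustration, not HttpClient code: the class CharsetPrecedenceDemo and both helper methods are hypothetical, and the header parsing is deliberately simplified compared with ContentType.get. It assumes a page whose bytes are really GBK while the Content-Type header declares UTF-8, which is one way the "GBK" argument ends up being ignored.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo class; it mimics the charset selection in the quoted toString method.
final class CharsetPrecedenceDemo {

    // Simplified stand-in for ContentType.get(entity).getCharset():
    // returns the charset parameter of a Content-Type value, or null if there is none.
    static Charset charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                return Charset.forName(name);
            }
        }
        return null;
    }

    // Same order as the quoted source: declared charset, then the caller's default,
    // then ISO-8859-1 (the value of HTTP.DEF_CONTENT_CHARSET).
    static Charset resolve(String contentType, Charset defaultCharset) {
        Charset charset = charsetFromContentType(contentType);
        if (charset == null) {
            charset = defaultCharset;
        }
        if (charset == null) {
            charset = StandardCharsets.ISO_8859_1;
        }
        return charset;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String text = "\u4e2d\u6587";
        byte[] body = text.getBytes(gbk); // the page is really encoded as GBK

        // The header declares UTF-8, so the GBK default is never consulted.
        Charset used = resolve("text/html; charset=utf-8", gbk);
        System.out.println(used);                                // prints UTF-8
        System.out.println(new String(body, used).equals(text)); // prints false

        // Decoding the raw bytes with the charset you trust bypasses the header.
        System.out.println(new String(body, gbk).equals(text));  // prints true
    }
}
```

With HttpClient itself, one possible way to apply the same idea is to fetch the raw bytes with EntityUtils.toByteArray(entity) and build the string with new String(bytes, "GBK"). This is only appropriate when you are sure the declared charset is wrong for that page.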
