五种基于 MapReduce 的并行计算框架介绍及性能测试（3）WordCount 实验基本原理

论坛元老

Rank: 8 Rank: 8

UID: 1066743

1^#

打印

字体大小: tT

look_w发表于 2018-7-7 13:43 | 只看该作者

五种基于 MapReduce 的并行计算框架介绍及性能测试（3）WordCount 实验基本原理

WordCount 实验

基本原理

单词计数 (WordCount) 是最简单也是最能体现 MapReduce 思想的程序之一，可以称为 MapReduce 版"Hello World"。单词计数主要完成功能是：统计一系列文本文件中每个单词出现的次数。

本次实验步骤

本次实验的硬件资源基于 x86 服务器 1 台，配置为内存 32GB DDR3、E5 CPU/12 核、GPU，实验数据样本为 10M/50M/100M/500M/1000M 的文本文件五个，我们使用 Hadoop MapReduce、Spark、Phoenix、Disco、Mars 等 MapReduce 框架分别运行文本分析程序，基于结果一致的前提下统计出运行时间、运行时 CPU 占有率、运行时内存占有率等数据，并采用这些数据绘制成柱状图。
Hadoop MapReduce首先需要将文件拆分成 splits，由于测试用的文件较小，所以每个文件为一个 split，并将文件按行分割形成<key,value>对，图 12 分割过程图所示。这一步由 MapReduce 框架自动完成，其中偏移量（即 key 值）包括了回车所占的字符数（Windows 和 Linux 环境会不同）。
图 5 . 分割过程图

将分割好的<key,value>对交给用户定义的 map 方法进行处理，生成新的<key,value>对，图 6 执行 map 方法所示。
图 6 . 执行 Map 方法过程图

得到 map 方法输出的<key,value>对后，Mapper 会将它们按照 key 值进行排序，并执行 Combine 过程，将 key 相同的 value 值累加，得到 Mapper 的最终输出结果。图 7Map 端排序及 Combine 过程所示。
图 7 . Map 端排序及 Combine 过程

Reducer 先对从 Mapper 接收的数据进行排序，再交由用户自定义的 reduce 方法进行处理，得到新的<key,value>对，并作为 WordCount 的输出结果，图 15Reduce 端排序及输出结果所示。
图 8 . Reduce 端排序及输出结果流程图

清单 1 . 第一代 Hadoop MapReduce WordCount 示例代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
public class WordCount {
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
// 开始 Map 过程
public void map(Object key, Text value, Context context)
throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
//遍历 Map 里面的字符串
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
//开始 Reduce 过程
public void reduce(Text key, Iterable<IntWritable> values,Context context)
throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
if (otherArgs.length != 2) {
System.err.println("Usage: wordcount <in> <out>");
System.exit(2);
}
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

收藏分享评分

回复引用

订阅 TOP

返回列表