[MapReduce] WordCount Case Practice

ZSYL 2021-09-15 08:53:59

1. Local testing

1) Requirement

Count and output the total number of times each word appears in a given text file.

(1) Input data


ss ss
cls cls
jiao
banzhang
xue
hadoop

(2) Expected output data

banzhang 1
cls 2
hadoop 1
jiao 1
ss 2
xue 1

2) Requirement analysis

Following the MapReduce programming specification, write the Mapper, Reducer, and Driver classes separately.

Requirement: count the number of occurrences of each word in a set of files (the WordCount case).
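
For the sample input above, the records flow through the stages roughly as follows (a sketch; the shuffle step that sorts and groups by key is handled by the framework):

Map input  (offset, line):   (0, "ss ss"), (6, "cls cls"), (14, "jiao"), ...
Map output (word, 1):        (ss, 1), (ss, 1), (cls, 1), (cls, 1), (jiao, 1), ...
Shuffle    (word, [counts]): (banzhang, [1]), (cls, [1, 1]), (hadoop, [1]), (jiao, [1]), (ss, [1, 1]), (xue, [1])
Reduce out (word, total):    banzhang 1, cls 2, hadoop 1, jiao 1, ss 2, xue 1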

3) Environment preparation

(1) Create a Maven project named MapReduceDemo.

(2) Add the following dependencies to the pom.xml file:

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>3.1.3</version>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>1.7.30</version>
    </dependency>
</dependencies>

(3) In the project's src/main/resources directory, create a new file named "log4j.properties" and fill it with:

log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%d %p [%c] - %m%n
log4j.appender.logfile=org.apache.log4j.FileAppender
log4j.appender.logfile.File=target/spring.log
log4j.appender.logfile.layout=org.apache.log4j.PatternLayout
log4j.appender.logfile.layout.ConversionPattern=%d %p [%c] - %m%n

(4) Create a package named com.zs.mapreduce.wordcount.

4) Write the program

(1) Write the Mapper class

package com.zs.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper; // new API (2.x/3.x); the old 1.x API lives in org.apache.hadoop.mapred

import java.io.IOException;

/**
 * KEYIN:    type of the map-stage input key    (LongWritable, the byte offset of the line)
 * VALUEIN:  type of the map-stage input value  (Text, one line of the file)
 * KEYOUT:   type of the map-stage output key   (Text, a word)
 * VALUEOUT: type of the map-stage output value (IntWritable, the count 1)
 */
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Reuse the output objects across map() calls to save resources
    private Text outK = new Text();
    private IntWritable outV = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // 1. Get one line
        String line = value.toString();
        // 2. Split the line into words
        String[] words = line.split(" ");
        // 3. Write out each word with a count of 1
        for (String word : words) {
            outK.set(word);
            context.write(outK, outV);
        }
    }
}
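
A quick way to sanity-check the mapper without a cluster is MRUnit. It is not part of the pom above (and the project is retired, though it still works against the mapreduce API used here), so this is only a sketch, assuming the org.apache.mrunit:mrunit:1.1.0 dependency (classifier hadoop2) is added and the test is placed under src/test/java in the same package:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void testMap() throws Exception {
        // Feed one line in and assert the (word, 1) pairs that come out
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("ss ss"))
                .withOutput(new Text("ss"), new IntWritable(1))
                .withOutput(new Text("ss"), new IntWritable(1))
                .runTest();
    }
}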

(2) Write the Reducer class

package com.zs.mapreduce.wordcount;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import java.io.IOException;

/**
 * KEYIN:    type of the reduce-stage input key    (Text, the map output key)
 * VALUEIN:  type of the reduce-stage input value  (IntWritable, the map output value)
 * KEYOUT:   type of the reduce-stage output key   (Text, a word)
 * VALUEOUT: type of the reduce-stage output value (IntWritable, the total count)
 */
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reuse the output object across reduce() calls
    private IntWritable outV = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Accumulate the counts for this word
        for (IntWritable value : values) {
            sum += value.get();
        }
        outV.set(sum);
        // Write out the word and its total
        context.write(key, outV);
    }
}
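
And correspondingly for the reducer (same hypothetical MRUnit dependency as above):

import java.util.Arrays;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
import org.junit.Test;

public class WordCountReducerTest {

    @Test
    public void testReduce() throws Exception {
        // Two 1s for "ss" should sum to 2
        ReduceDriver.newReduceDriver(new WordCountReducer())
                .withInput(new Text("ss"), Arrays.asList(new IntWritable(1), new IntWritable(1)))
                .withOutput(new Text("ss"), new IntWritable(2))
                .runTest();
    }
}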

(3) Write the Driver class

package com.zs.mapreduce.wordcount;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class WordCountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the job instance
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);
        // 2. Set the jar path
        job.setJarByClass(WordCountDriver.class);
        // 3. Associate the mapper and reducer
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths (the output directory must not already exist)
        FileInputFormat.setInputPaths(job, new Path("D:\\software\\hadoop\\input\\inputword"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\software\\hadoop\\output\\output1"));
        // 7. Submit the job and exit with its status
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
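
FileOutputFormat fails with an exception when the output directory already exists, so re-runs of the driver above will error out until the directory is deleted. A minimal sketch of an optional guard that could go right before step 6 (it needs one extra import, org.apache.hadoop.fs.FileSystem):

// Delete the output directory if it already exists, so re-runs don't fail
FileSystem fs = FileSystem.get(conf);
Path output = new Path("D:\\software\\hadoop\\output\\output1");
if (fs.exists(output)) {
    fs.delete(output, true); // true = delete recursively
}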

5) Local testing

(1) First configure the HADOOP_HOME environment variable and the Windows runtime dependencies (winutils).
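
A sketch of one way to set this up from an elevated Windows command prompt (the install path is an example; winutils.exe and hadoop.dll must be present under %HADOOP_HOME%\bin, and note that setx can truncate a very long PATH, so the System Properties dialog works just as well):

setx HADOOP_HOME "D:\software\hadoop\hadoop-3.1.3"
setx PATH "%PATH%;%HADOOP_HOME%\bin"

Restart the IDE afterwards so it picks up the new environment.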

(2) Run the program in IDEA/Eclipse.


2. Submitting to the cluster for testing

Testing on the cluster:

(1) Build the jar with Maven. First add the following packaging plugins to pom.xml:

<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>3.6.1</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>

Note: if the project shows a red cross, right-click the project -> Maven -> Reimport to refresh it.

(2) Package the program into a jar.


(3) Rename the jar built without dependencies to wc.jar, and copy it to the /opt/module/hadoop-3.1.3 directory on the Hadoop cluster.

You can transfer it to Linux with XShell.
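
If XShell is not at hand, scp works too (assuming the node is reachable by the hostname used below):

scp wc.jar zs@hadoop102:/opt/module/hadoop-3.1.3/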

(4) Start the Hadoop cluster

[zs@hadoop102 hadoop-3.1.3]$ sbin/start-dfs.sh
[zs@hadoop103 hadoop-3.1.3]$ sbin/start-yarn.sh
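
Before running the job, the input file must exist in HDFS. A sketch of the upload, assuming the sample text above is saved locally as word.txt (a hypothetical file name):

[zs@hadoop102 hadoop-3.1.3]$ hadoop fs -mkdir -p /user/zs/input
[zs@hadoop102 hadoop-3.1.3]$ hadoop fs -put word.txt /user/zs/input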

(5) Run the WordCount program. Note that the Driver above hard-codes local Windows paths; for the cluster run, change step 6 to read the paths from the command line (new Path(args[0]) and new Path(args[1])) and repackage, otherwise the two path arguments below are ignored.

[zs@hadoop102 hadoop-3.1.3]$ hadoop jar wc.jar com.zs.mapreduce.wordcount.WordCountDriver /user/zs/input /user/zs/output
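
When the job finishes, check the result (the part-r-00000 file name assumes a single reducer, which is the default here):

[zs@hadoop102 hadoop-3.1.3]$ hadoop fs -cat /user/zs/output/part-r-00000
banzhang 1
cls 2
hadoop 1
jiao 1
ss 2
xue 1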

Keep going!

Thank you!

Keep striving!

Please include a link to the original when reprinting. Thanks!