Spark currently supports several distributed deployment modes: 1. Standalone Deploy Mode; 2. Amazon EC2; 3. Apache Mesos; 4. Hadoop YARN. The first is a standalone deployment that does not depend on an external resource manager; the other three require deploying Spark onto the corresponding resource manager.

Besides the multiple deployment modes, newer versions of Spark support multiple Hadoop platforms. For instance, starting with version 0.8.1, separate builds are provided for Hadoop 1 (HDP1, CDH3), CDH4, and Hadoop 2 (HDP2, CDH5). Currently, when Cloudera's CDH5 is installed with CM, the Spark service can be selected for installation directly.

The latest Spark version at present is 1.0.0.

Using version 1.0.0 as the example, let's see how to install a distributed Spark cluster:

1. Spark 1.0.0 requires JDK 1.6 or later; we use JDK 1.6.0_31.

2. Spark 1.0.0 requires Scala 2.10 (it is built against the 2.10.x series); we use Scala 2.10.3.

3. Download the appropriate binary package to install. We choose the CDH4 build, spark-1.0.0-bin-cdh4.tgz, and download it to tongjihadoop165.

4. Unpack the binary package: tar -zxf spark-1.0.0-bin-cdh4.tgz

5. Rename the directory: mv spark-1.0.0-bin-cdh4 spark-1.0.0-cdh4

6. cd spark-1.0.0-cdh4

mv ./conf/ ./conf/

7. vi ./conf/ and add the following:

export SCALA_HOME=/usr/lib/scala-2.10.3

export JAVA_HOME=/usr/java/jdk1.6.0_31







SPARK_MASTER_IP is the IP address of the master; SPARK_MASTER_PORT is the master's port; SPARK_MASTER_WEBUI_PORT is the port of the web UI used to check on the cluster; SPARK_WORKER_PORT is the port used by each worker; SPARK_WORKER_MEMORY is the amount of memory each worker may use.
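As a rough illustration of the variables just described, the settings might look like the sketch below. Every value here is an assumption chosen for illustration (only port 8090 is suggested by the web UI URL used later in this post, and 7077 is Spark's default master port); substitute the values for your own cluster.

```shell
# Sketch of the Spark environment settings described above.
# All values are illustrative assumptions, not the original configuration.
export SPARK_MASTER_IP=tongjihadoop165   # assumed: a hostname used in place of an IP address
export SPARK_MASTER_PORT=7077            # Spark's default standalone master port
export SPARK_MASTER_WEBUI_PORT=8090      # assumed from the web UI URL used later
export SPARK_WORKER_PORT=7078            # assumed value
export SPARK_WORKER_MEMORY=4g            # assumed value; size it to the worker machines
```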

8. vi ./conf/slaves and list one worker hostname per line. The contents are as follows:

9. (Optional) Set the SPARK_HOME environment variable and add SPARK_HOME/bin to PATH:

vi /etc/profile and add the following:

export SPARK_HOME=/usr/lib/spark-1.0.0-cdh4
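Step 9 also calls for adding SPARK_HOME/bin to PATH. A minimal sketch of the lines for /etc/profile, assuming Spark lives at the path used in this post, could be:

```shell
# Assumes Spark is installed at the path used in this post; adjust if it lives elsewhere.
export SPARK_HOME=/usr/lib/spark-1.0.0-cdh4
# Prepend Spark's bin directory so its launch scripts are found first.
export PATH="$SPARK_HOME/bin:$PATH"
```

Prepending rather than appending is a choice made here so that this installation takes precedence over any other Spark on the machine.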


10. Copy Spark from tongjihadoop165 to tongjihadoop166 and tongjihadoop167:

sudo scp -r hadoop@  /usr/lib

Scala can be installed the same way: copy the files remotely and edit the environment variable file /etc/profile. Don't forget to run source after the change.
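When there are several workers, the copy can be scripted. The sketch below pushes the directory from the first machine to each worker, which is one possible approach and not necessarily the command used above. The function name copy_spark, the DRY_RUN switch, and the variable names are all hypothetical; by default it only prints the scp commands so they can be reviewed first.

```shell
# Print (dry run) or execute the copy of the Spark directory to each worker host.
# SPARK_DIR, REMOTE_USER, copy_spark and DRY_RUN are illustrative assumptions.
SPARK_DIR=/usr/lib/spark-1.0.0-cdh4
REMOTE_USER=hadoop

copy_spark() {
  for host in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      # Default: only show the command that would run.
      echo "scp -r $SPARK_DIR $REMOTE_USER@$host:/usr/lib"
    else
      # The remote user needs write permission on /usr/lib for this to succeed.
      scp -r "$SPARK_DIR" "$REMOTE_USER@$host:/usr/lib"
    fi
  done
}

copy_spark tongjihadoop166 tongjihadoop167
```

Setting DRY_RUN=0 before calling copy_spark performs the actual copy.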

11. Run ./sbin/ to start the Spark cluster.

If start-all fails to start the processes normally, check the error messages in the $SPARK_HOME/logs directory. In fact, as with Hadoop, you can start each process separately by running the following commands on the respective nodes:

On the Master, run: ./sbin/

On each Worker, run: ./sbin/ 3 spark:// --webui-port 8090

12. Check that the processes have started: run the jps command, which should show a Worker process or a Master process. Then open the web UI at http://tongjihadoop165:8090/ to see all worker nodes along with their CPU counts and memory information.
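The jps check can be automated with a small helper. This is only a sketch: has_daemon is a hypothetical function name, and it assumes jps prints one "PID ClassName" pair per line.

```shell
# has_daemon NAME: succeeds if a jps-style listing on stdin contains
# a JVM whose main class (second column) is exactly NAME.
has_daemon() {
  awk -v name="$1" '$2 == name { found = 1 } END { exit !found }'
}

# Typical use on a cluster node (requires the JDK's jps on PATH):
#   jps | has_daemon Master && echo "Master is running"
#   jps | has_daemon Worker && echo "Worker is running"
```

Matching the whole second column avoids false positives from other JVMs whose names merely contain "Master" or "Worker".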

13. Local mode demo

For example: ./bin/run-example SparkLR 2 local   or   ./bin/run-example SparkPi 2 local

Of these two examples, the former computes a logistic regression iteratively, and the latter computes pi.

14. Start interactive mode: ./bin/spark-shell --master spark:// . If MASTER is configured in conf/ (add the line export MASTER=spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}), you can launch the shell directly with ./bin/spark-shell.
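To see what that MASTER value expands to, the sketch below builds the URL from the two variables. The fallback host and port are assumptions used only when the variables are not already set.

```shell
# Build the master URL from the same variables used in the Spark environment settings.
# The fallback values after ":-" are illustrative assumptions.
SPARK_MASTER_IP="${SPARK_MASTER_IP:-tongjihadoop165}"
SPARK_MASTER_PORT="${SPARK_MASTER_PORT:-7077}"
export MASTER="spark://${SPARK_MASTER_IP}:${SPARK_MASTER_PORT}"
echo "$MASTER"
```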

spark-shell is itself an application: it submits jobs to the Spark cluster, which then assigns them to specific workers. A worker reads local files when processing a job.

This shell is a modified Scala shell. Opening one shows up in the web UI as a running application, as in the picture below:

At the bottom are the completed applications; the workers list shows the nodes in the cluster.

In this shell we can run some computations on data stored in HDFS. Enter the following in turn:

A. val file = sc.textFile("hdfs://")  // loads a file from HDFS

B.          // counts the number of characters in the file

The run looks like this, as in the picture below:

The result shows 346658513 characters. It is fast, taking less than 3 s.

Alternatively, at step B run val count = file.flatMap(line => line.split("\t")).map(word => (word, 1)).reduceByKey(_+_) followed by count.saveAsTextFile("hdfs://") to store the results under the /spark directory on HDFS.
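To sanity-check what this job computes, the same tab-separated count can be reproduced on a small local sample with standard Unix tools. This is a sketch, not part of the Spark job: count_fields is a hypothetical helper, and it assumes the fields are non-empty and contain no whitespace.

```shell
# count_fields: read lines from stdin, split them on tabs,
# and print "<field> <count>" for each distinct field, sorted by field.
count_fields() {
  tr '\t' '\n' | sort | uniq -c | awk '{ print $2, $1 }'
}

# Example: two lines of tab-separated fields.
printf 'spark\thadoop\nspark\tyarn\n' | count_fields
```

Comparing this output against a few entries of the saved result is a quick way to confirm that the split and the reduce behave as expected.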

You can also run ./bin/spark-shell --master local[2] to start a local shell. The [2] specifies the number of threads; the default is 1.

Run exit to quit the shell.

15. Run ./sbin/ to stop the Spark cluster.

Individual processes can also be stopped with their own stop scripts.

Note: the Spark directory must be the same on all three machines, because the master logs in to the workers to run commands and assumes that each worker's Spark path is the same as its own.

