Hadoop and Spark Cluster Installation

I. Building a Distributed Hadoop Cluster

1 Cluster Deployment Preparation

Node        Hostname     IP address
Master      spark        192.168.59.137
Slave1      sparkslave   192.168.59.138

Two CentOS virtual machines are used; their details are as follows:

[root@spark ~]# uname -a
Linux spark 3.10.0-693.el7.x86_64 #1 SMP Tue Aug 22 21:09:27 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@spark ~]# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core)

2 Set the Hostnames

2.1 Log in to the Master node as root and edit the /etc/hostname file so that it contains only the hostname:

spark

On CentOS releases before 7, the hostname is instead set in /etc/sysconfig/network, as HOSTNAME=spark.

2.2 Edit the /etc/hosts file so that both nodes can resolve each other:

192.168.59.137 spark
192.168.59.138 sparkslave

2.3 Reboot the system (reboot). After the restart, verify that the hostname command prints spark:

    hostname

2.4 Apply the same configuration on the Slave node, using the hostname sparkslave.
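The host entries from step 2.2 can be scripted so that both nodes end up with identical files. This is only a sketch; the cluster_hosts helper is our own name, and the IPs come from the deployment table above:

```shell
# Print the cluster's /etc/hosts entries (IPs from the deployment table).
cluster_hosts() {
  cat <<'EOF'
192.168.59.137 spark
192.168.59.138 sparkslave
EOF
}

# On each node, as root, append once:
# cluster_hosts >> /etc/hosts
cluster_hosts
```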

3 Passwordless SSH Login

3.1 Create the hadoop user on all nodes

groupadd hadoop
useradd -u XXXX -g hadoop -d /home/hadoop -c "Hadoop User." -m -s /bin/bash hadoop
passwd hadoop

Note: the hadoop group must exist before useradd -g hadoop can succeed, hence the groupadd.

3.2 Passwordless login from the Master node

Perform the following operations on the Master node.

3.2.1 First make sure the ssh service is installed, then log in as the hadoop user

su - hadoop

3.2.2 Next, generate a public/private key pair as hadoop; just press Enter at every prompt.

ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/hadoop/.ssh/id_rsa):
Created directory '/home/hadoop/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/hadoop/.ssh/id_rsa.
Your public key has been saved in /home/hadoop/.ssh/id_rsa.pub.
The key fingerprint is:
SHA256:KlyIXT+aFZrvqtl6Mt5UdwnflSd1pAkJYlz+YRAxYVw hadoop@spark
The key's randomart image is:
+---[RSA 2048]----+
|       .ooX*E. .+|
|       ..+.o.. +o|
|      . . o o +.o|
|   o o + . = + o.|
|  . o + S . = .  |
|   . . B o .     |
|    o = .        |
|    o*..         |
|   .=*+..        |
+----[SHA256]-----+

When it finishes, the private and public keys, id_rsa and id_rsa.pub, appear under /home/hadoop/.ssh/:

[hadoop@spark ~]$ cd .ssh
[hadoop@spark .ssh]$ ls -ltr
total 8
-rw-r--r--. 1 hadoop hadoop  394 Mar 15 13:02 id_rsa.pub
-rw-------. 1 hadoop hadoop 1679 Mar 15 13:02 id_rsa

3.2.3 Next, install the public key into the slave's authorized_keys file:

[hadoop@spark .ssh]$ ssh-copy-id hadoop@sparkslave
/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/hadoop/.ssh/id_rsa.pub"
The authenticity of host 'sparkslave (192.168.59.138)' can't be established.
ECDSA key fingerprint is SHA256:3yVmsQP6CIq8vj0Pd3lOW/q98EqFlF2g1YxyjFZD6Dk.
ECDSA key fingerprint is MD5:90:00:e5:c6:c6:3f:1a:73:67:2b:72:b8:1b:f4:5c:33.
Are you sure you want to continue connecting (yes/no)? yes
/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
hadoop@sparkslave's password:

Number of key(s) added: 1

Now try logging into the machine, with:   "ssh 'hadoop@sparkslave'"
and check to make sure that only the key(s) you wanted were added.

Check the result on sparkslave:

[hadoop@sparkslave .ssh]$ ls -ltr
total 16
-rw-r--r--. 1 hadoop hadoop  399 Mar 15 13:02 id_rsa.pub
-rw-------. 1 hadoop hadoop 1675 Mar 15 13:02 id_rsa
-rw-------. 1 hadoop hadoop  394 Mar 15 13:04 authorized_keys
-rw-r--r--. 1 hadoop hadoop  182 Mar 15 13:12 known_hosts

Also append this node's own key to the end of authorized_keys (so that starting applications as the hadoop user does not need a password):

 cat id_rsa.pub >> authorized_keys

3.2.4 Test whether a password is still required. The first login asks you to type yes to confirm the host key; after that, being able to log in without a password proves the setup succeeded.
ssh hadoop@sparkslave

3.3 Generate and upload the public key

So that the Master can log in to each Slave node without a password, and the Slave nodes can likewise log in to the Master as the hadoop user without a password, perform the following operations on each Slave node.

3.3.1 First make sure the ssh service is installed, then log in as the hadoop user

su - hadoop

3.3.2 Next, generate a public/private key pair as hadoop; when prompted, just press Enter each time.

ssh-keygen -t rsa

3.3.3 Then upload the public key to the Master node. You will be prompted for the hadoop user's password on the Master; enter it correctly and the upload begins.

ssh-copy-id hadoop@spark

Also append this node's own key to the end of authorized_keys (so that starting applications as the hadoop user does not need a password):

 cat id_rsa.pub >> authorized_keys

3.3.4 Test

After the upload, test from this node whether logging in to the Master requires a password.
ssh hadoop@spark

If you can log in without a password, the passwordless SSH configuration is complete.
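With keys exchanged in both directions, the whole mesh can be verified non-interactively. A minimal sketch, assuming the two node names from the table (the check_cmd helper is ours); BatchMode makes ssh fail instead of prompting when key authentication is not in place:

```shell
# Check passwordless SSH to each node without ever prompting.
nodes="spark sparkslave"

check_cmd() {
  # Build the non-interactive probe command for one node.
  echo "ssh -o BatchMode=yes -o ConnectTimeout=5 hadoop@$1 true"
}

for n in $nodes; do
  # Uncomment the next line to actually probe the node:
  # $(check_cmd "$n") && echo "$n OK" || echo "$n FAILED"
  check_cmd "$n"
done
```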

4 Install the JDK

4.1 Remove the bundled OpenJDK

First, the OpenJDK that ships with the system needs to be uninstalled.

[root@spark ~]# rpm -qa | grep java
tzdata-java-2017b-1.el7.noarch
python-javapackages-3.4.1-11.el7.noarch
java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64
java-1.8.0-openjdk-headless-1.8.0.131-11.b12.el7.x86_64
javapackages-tools-3.4.1-11.el7.noarch

Uninstall each of the installed packages shown above by running:

rpm -e --nodeps java-1.8.0-openjdk-1.8.0.131-11.b12.el7.x86_64
rpm -e --nodeps java-1.8.0-openjdk-headless-1.8.0.131-11.b12.el7.x86_64

Note: to remove the JDK completely, every one of the uninstall commands must be executed.
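The per-package removal above can be scripted in one pass. A sketch (the openjdk_pkgs helper is our own name, not part of rpm), filtering rpm -qa output down to the OpenJDK packages:

```shell
# Keep only package names that are OpenJDK builds proper
# (tzdata-java, python-javapackages, javapackages-tools are left alone).
openjdk_pkgs() {
  grep -E '^java-.*openjdk'
}

# Usage, as root:
# rpm -qa | openjdk_pkgs | xargs -r rpm -e --nodeps
```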

4.2 Install the Oracle JDK

mkdir -p /usr/local/java

Download the JDK from the official site, then extract it into that directory:

tar -zxvf jdk-8u162-linux-x64.tar.gz -C /usr/local/java

Then edit the system profile:

vi  /etc/profile
#JAVA ENV
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin

After saving and exiting, run the following command to make the profile take effect immediately (otherwise it only applies after logging in again).

source /etc/profile

Finally, run java -version in a terminal; output like the following confirms success:

[root@spark jdk1.8.0_162]# java -version
java version "1.8.0_162"
Java(TM) SE Runtime Environment (build 1.8.0_162-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.162-b12, mixed mode)
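For scripting this verification, note that java -version writes to stderr. A small sketch (java_version_of is a hypothetical helper) that extracts the quoted version string from the captured output:

```shell
# Extract the version from captured `java -version 2>&1` output.
java_version_of() {
  printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1
}

# Usage:
# v=$(java_version_of "$(java -version 2>&1)")
# [ "$v" = "1.8.0_162" ] && echo "JDK OK"
```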

5. Install and Configure Hadoop

5.1 Prepare directories and permissions

su - root
mkdir -p /usr/local/hadoop
mkdir -p /var/local/hadoop
chmod -R 777 /usr/local/hadoop   # set permissions
chmod -R 777 /var/local/hadoop   # set permissions
chown -R hadoop:hadoop /usr/local/hadoop # set ownership
chown -R hadoop:hadoop /var/local/hadoop # set ownership
su - hadoop
mkdir -p /var/local/hadoop/tmp
mkdir -p /var/local/hadoop/dfs/name
mkdir -p /var/local/hadoop/dfs/data
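The directory creation in 5.1 can be wrapped in a helper so the Master and the Slaves get identical layouts. A sketch (make_hdfs_dirs is our own name, with the layout chosen above):

```shell
# Create the HDFS working directories under a given root.
make_hdfs_dirs() {
  root=$1
  for d in tmp dfs/name dfs/data; do
    mkdir -p "$root/$d"
  done
}

# On each node, as the hadoop user:
# make_hdfs_dirs /var/local/hadoop
```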

5.2 Download Hadoop

From the Hadoop mirror page (http://www.apache.org/dyn/closer.cgi/hadoop/common), choose a suitable release; here I picked the fairly recent 2.7.5 (to match the Spark version used later).

5.3 Extract

su - hadoop
tar -xzvf hadoop-2.7.5.tar.gz
mv hadoop-2.7.5 /usr/local/hadoop/

5.4 Update the environment variables by adding Hadoop's entries to /etc/profile, as follows:

# Hadoop environment variables
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

After the change, run source /etc/profile to apply it immediately.

5.5 Edit the configuration files (under hadoop-2.7.5/etc/hadoop)

(1) Configure core-site.xml

<configuration>
   <property>
        <name>hadoop.tmp.dir</name>
        <value>/var/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
   </property>
   <property>
        <name>fs.defaultFS</name>
        <value>hdfs://spark:9000</value>
   </property>
   <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
  </property>
</configuration>

(2) Configure hdfs-site.xml

<configuration>
<property>
  <name>dfs.http.address</name>
  <value>spark:50070</value>
  <description>The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port.
  </description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>sparkslave:50090</value>                                 
 </property>
  <property>
   <name>dfs.namenode.name.dir</name>
   <value>/var/local/hadoop/dfs/name</value>
 </property>

 <property>
  <name>dfs.datanode.data.dir</name>
  <value>/var/local/hadoop/dfs/data</value>
  </property>

 <property>
  <name>dfs.replication</name>
  <value>2</value>
 </property>

 <property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
 </property>
</configuration>

Note: per the official tutorial, the cluster can run with only fs.defaultFS and dfs.replication configured, but if hadoop.tmp.dir is not set, the default temporary directory /tmp/hadoop-hadoop is used, and because that directory may be wiped by the system on reboot, the namenode format would then have to be re-run. Also specify dfs.namenode.name.dir and dfs.datanode.data.dir explicitly, or the following steps may fail.
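A quick way to confirm that the properties this note insists on are actually present is a grep over the config files. A minimal sketch (has_prop is a hypothetical helper; this is a textual check, not a real XML parse):

```shell
# True if FILE contains a <name>NAME</name> element.
has_prop() {
  grep -q "<name>$2</name>" "$1"
}

# Usage, from hadoop-2.7.5/etc/hadoop:
# for p in hadoop.tmp.dir fs.defaultFS; do
#   has_prop core-site.xml "$p" || echo "missing: $p"
# done
```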

(3) Configure yarn-site.xml

<configuration>
<!-- Site specific YARN configuration properties -->
<property>  
    <name>yarn.resourcemanager.hostname</name>  
    <value>spark</value>  
</property>  

<property>  
    <name>yarn.nodemanager.aux-services</name>  
    <value>mapreduce_shuffle</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.address</name>  
    <value>spark:8032</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.scheduler.address</name>  
    <value>spark:8030</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.resource-tracker.address</name>  
    <value>spark:8031</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.admin.address</name>  
    <value>spark:8033</value>  
</property>  
<property>  
    <name>yarn.resourcemanager.webapp.address</name>  
    <value>spark:8088</value>  
</property> 

</configuration>

(4) Configure mapred-site.xml

Copy mapred-site.xml.template to mapred-site.xml and edit it to read:

<configuration>
<property>
   <name>mapreduce.framework.name</name>
   <value>yarn</value>
   <final>true</final> 
 </property>
 <property>
   <name>mapreduce.jobtracker.http.address</name>
   <value>spark:50030</value>
 </property>
 <property>
  <name>mapreduce.jobhistory.address</name>
  <value>spark:10020</value>
 </property>
 <property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>spark:19888</value>
 </property>
</configuration>

(5) Configure slaves

[hadoop@sparkslave hadoop]$ cat slaves
spark
sparkslave

(6) Configure hadoop-env.sh; change:

export JAVA_HOME=/usr/local/java/jdk1.8.0_162     # an absolute path must be used here

(7) Configure the Slave

Copy the whole hadoop-2.7.5 directory from the Master to every Slave node.
Take care to set the directory permissions first:

su - root
mkdir -p /usr/local/hadoop
mkdir -p /var/local/hadoop
chmod -R 777 /usr/local/hadoop   #设置权限
chmod -R 777 /var/local/hadoop   #设置权限
chown -R hadoop:root /usr/local/hadoop #设置所属
chown -R hadoop:root /var/local/hadoop #设置所属
su - hadoop
mkdir -p /var/local/hadoop/tmp
mkdir -p /var/local/hadoop/dfs/name
mkdir -p /var/local/hadoop/dfs/data

Modify /etc/profile on the Slave as in step 5.4, then copy Hadoop over:

scp -r /usr/local/hadoop/hadoop-2.7.5 hadoop@sparkslave:/usr/local/hadoop/

(8) Format the NameNode

hdfs namenode -format

Output containing "Exiting with status 0" means success; "Exiting with status 1" means an error occurred.

18/03/26 15:40:37 INFO util.ExitUtil: Exiting with status 0
18/03/26 15:40:37 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at spark/192.168.59.187
************************************************************/
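That success check can itself be scripted. A sketch (format_ok is our own helper) that inspects the captured output:

```shell
# True if the captured `hdfs namenode -format` output reports status 0.
format_ok() {
  printf '%s\n' "$1" | grep -q 'Exiting with status 0'
}

# Usage:
# out=$(hdfs namenode -format 2>&1)
# format_ok "$out" && echo "format succeeded" || echo "format FAILED"
```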

(9) Start the Hadoop cluster

[hadoop@spark ~]$ start-dfs.sh
Starting namenodes on [spark]
spark: starting namenode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-namenode-spark.out
sparkslave: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-datanode-sparkslave.out
spark: starting datanode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-datanode-spark.out
Starting secondary namenodes [sparkslave]
sparkslave: starting secondarynamenode, logging to /usr/local/hadoop/hadoop-2.7.5/logs/hadoop-hadoop-secondarynamenode-sparkslave.out
[hadoop@spark ~]$ start-yarn.sh
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/hadoop-2.7.5/logs/yarn-hadoop-resourcemanager-spark.out
sparkslave: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.5/logs/yarn-hadoop-nodemanager-sparkslave.out
spark: starting nodemanager, logging to /usr/local/hadoop/hadoop-2.7.5/logs/yarn-hadoop-nodemanager-spark.out

Once startup completes, use the jps command to check whether everything started successfully.
Master node:

[hadoop@spark ~]$ jps
15011 NodeManager
14468 NameNode
15224 Jps
14894 ResourceManager

Slave1 node:

[hadoop@sparkslave hadoop]$ jps
10705 NodeManager
10867 Jps
10565 SecondaryNameNode
10478 DataNode

After a Java program starts, it creates a /tmp/hsperfdata_username directory whose files are named after the Java processes' PIDs. When jps lists processes, it is essentially enumerating the file names under /tmp/hsperfdata_username. If the owner or group of that directory does not match the user who started the processes, the processes cannot write into it after startup, the directory stays empty, and jps naturally shows nothing. In that case, fix the ownership as root:

[root@spark ~]# chown -R hadoop /tmp/hsperfdata_hadoop
[root@spark ~]# chgrp -R hadoop /tmp/hsperfdata_hadoop

(10) Write to HDFS to test it

[hadoop@spark ~]$ hadoop fs -mkdir /test
[hadoop@spark ~]$ hadoop fs -ls /
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2018-03-15 13:31 /test

NameNode web UI: http://spark:50070/
ResourceManager web UI: http://spark:8088/
NodeManager web UI: http://spark:8042/


If the pages cannot be reached, the firewall may be the cause:

[root@spark ~]# systemctl stop firewalld.service
[root@spark ~]# firewall-cmd --state
not running
[root@spark ~]# systemctl disable firewalld.service
Removed symlink /etc/systemd/system/multi-user.target.wants/firewalld.service.
Removed symlink /etc/systemd/system/dbus-org.fedoraproject.FirewallD1.service.

II. Installing Spark

1 Install Scala

(1) Download Scala from the official site (http://www.scala-lang.org/download/all.html); the version chosen here is Scala 2.12.4. After extracting, place it under /usr/local/hadoop. (Note: the prebuilt Spark 2.3.0 package used below bundles its own Scala 2.11.8, so this system-wide Scala mainly provides the standalone scala shell.)

(2) Edit /etc/profile:

#JAVA ENV
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
#Scala ENV
export SCALA_HOME=/usr/local/hadoop/scala-2.12.4
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$PATH

(3) Run the following in a terminal

[hadoop@spark hadoop]$ source /etc/profile
[hadoop@spark hadoop]$ scala
Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162).
Type in expressions for evaluation. Or try :help.
scala>

2 Download and Install Spark

(1) Get the package
Download the archive from http://spark.apache.org/downloads.html. Since my Hadoop version is 2.7.5, I downloaded the spark-2.3.0 tgz built as "Pre-built for Apache Hadoop 2.7 and later" and extracted it under /usr/local/hadoop.

2.1 Configure Spark

(1) Configure spark-env.sh

[hadoop@spark conf]$ pwd
/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/conf
[hadoop@spark conf]$ cp spark-env.sh.template spark-env.sh
[hadoop@spark conf]$ vi spark-env.sh
[hadoop@spark conf]$ cat spark-env.sh
#JAVA_HOME
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
#Hadoop_HOME
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
#Scala_HOME
export SCALA_HOME=/usr/local/hadoop/scala-2.12.4
#Spark_HOME
export SPARK_HOME=/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7
export SPARK_MASTER_IP=spark
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_CORES=2
[hadoop@spark ~]$ echo " export HADOOP_CONF_DIR=/usr/local/hadoop/hadoop-2.7.5/etc/hadoop" >> /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/conf/spark-env.sh

(2) Configure slaves

[hadoop@spark conf]$ cp slaves.template slaves
[hadoop@spark conf]$ vi slaves
[hadoop@spark conf]$ cat slaves
# A Spark Worker will be started on each of the machines listed below.
#
spark
sparkslave

(3) Configure /etc/profile

#JAVA ENV
export JAVA_HOME=/usr/local/java/jdk1.8.0_162
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export HADOOP_HOME=/usr/local/hadoop/hadoop-2.7.5
#Scala ENV
export SCALA_HOME=/usr/local/hadoop/scala-2.12.4
#Spark ENV
export SPARK_HOME=/usr/local/hadoop/spark-2.3.0-bin-hadoop2.7
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SCALA_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

Spark standalone mode
Starting Spark's own master and worker services in this way is Spark's standalone deployment mode.

(4) Copy to the other nodes
After Spark has been installed and configured on the Master node, copy the whole spark directory to the other nodes and update the environment variables in /etc/profile on each node.

(5) Test
Start the cluster on the Master node:

[hadoop@spark ~]$ /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-spark.out
sparkslave: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-sparkslave.out
spark: starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/hadoop/spark-2.3.0-bin-hadoop2.7/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-spark.out
[hadoop@spark ~]$ jps
15011 NodeManager
14468 NameNode
15957 Jps
15737 Master
15837 Worker
14894 ResourceManager

Slave:

[hadoop@sparkslave hadoop]$ jps
10705 NodeManager
11473 Worker
10565 SecondaryNameNode
10478 DataNode
11599 Jps

Open Master:8080 in a browser; if the Workers show up as alive, the installation, configuration, and startup succeeded.


Spark on YARN mode
In this mode, Spark needs to be installed and configured on just one node of the Hadoop cluster; a cluster-wide install is unnecessary, because once a Spark application is submitted to YARN, YARN handles the cluster's resource scheduling.
We keep the Spark on the Master node and rename away the installation directory on the Slave:

[hadoop@sparkslave hadoop]$ cd /usr/local/hadoop/
[hadoop@sparkslave hadoop]$ ls
hadoop-2.7.5  scala-2.12.4  spark-2.3.0-bin-hadoop2.7
[hadoop@sparkslave hadoop]$ mv spark-2.3.0-bin-hadoop2.7 spark-2.3.0-bin-hadoop2.7-bak
[hadoop@sparkslave hadoop]$ ls
hadoop-2.7.5  scala-2.12.4  spark-2.3.0-bin-hadoop2.7-bak
[hadoop@sparkslave hadoop]$ pwd
/usr/local/hadoop

(6) Running spark-shell on YARN

(1) yarn-client mode
Run the command spark-shell --master yarn --deploy-mode client:

[hadoop@spark ~]$ spark-shell --master yarn --deploy-mode client
2018-03-26 16:30:49 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-03-26 16:31:28 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Spark context Web UI available at http://spark:4040
Spark context available as 'sc' (master = yarn, app id = application_1522051219440_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

If you instead see output like the following:

[hadoop@spark ~]$ spark-shell --master yarn --deploy-mode client
2018-03-26 16:30:49 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-03-26 16:31:28 WARN  Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2018-03-26 16:36:28  ERROR spark.SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
    at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
    at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
    at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
    at $line3.$read$$iw$$iw.<init>(<console>:15)
    at $line3.$read$$iw.<init>(<console>:42)
    at $line3.$read.<init>(<console>:44)
    at $line3.$read$.<init>(<console>:48)
    at $line3.$read$.<clinit>(<console>)
    at $line3.$eval$.$print$lzycompute(<console>:7)
    at $line3.$eval$.$print(<console>:6)
    at $line3.$eval.$print(<console>)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:786)
    at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1047)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:638)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:637)
    at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
    at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
    at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:637)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:569)
    at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:565)
    at scala.tools.nsc.interpreter.ILoop.interpretStartingWith(ILoop.scala:807)
    at scala.tools.nsc.interpreter.ILoop.command(ILoop.scala:681)
    at scala.tools.nsc.interpreter.ILoop.processLine(ILoop.scala:395)
    at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply$mcV$sp(SparkILoop.scala:38)
    at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
    at org.apache.spark.repl.SparkILoop$$anonfun$initializeSpark$1.apply(SparkILoop.scala:37)
    at scala.tools.nsc.interpreter.IMain.beQuietDuring(IMain.scala:214)
    at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:37)
    at org.apache.spark.repl.SparkILoop.loadFiles(SparkILoop.scala:98)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply$mcZ$sp(ILoop.scala:920)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    at scala.tools.nsc.interpreter.ILoop$$anonfun$process$1.apply(ILoop.scala:909)
    at scala.reflect.internal.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:97)
    at scala.tools.nsc.interpreter.ILoop.process(ILoop.scala:909)
    at org.apache.spark.repl.Main$.doMain(Main.scala:70)
    at org.apache.spark.repl.Main$.main(Main.scala:53)
    at org.apache.spark.repl.Main.main(Main.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:755)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
2018-03-26 16:39:08  WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
2018-03-26 16:39:09 WARN metrics.MetricsSystem: Stopping a MetricsSystem that is not running
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)
  at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
  at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:909)
  at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
  at org.apache.spark.repl.Main$.createSparkSession(Main.scala:97)
  ... 47 elided
<console>:14: error: not found: value spark
       import spark.implicits._
              ^
<console>:14: error: not found: value spark
       import spark.sql
              ^
Spark context available as 'sc' (master = yarn, app id = application_1522051219440_0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

This happens because, running inside virtual machines, the containers' virtual memory usage can exceed the configured limit. The fix:
stop the YARN services first, then modify yarn-site.xml, adding the following:

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
    <description>Whether virtual memory limits will be enforced for containers</description>
</property>
<property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>4</value>
    <description>Ratio between virtual memory to physical memory when setting memory limits for containers</description>
</property>

(2) The YARN web UI
Open the YARN web page: 192.168.59.187:8088
You can see the Spark shell application running; click its ID link to see the application's details.


scala> val rdd=sc.parallelize(1 to 100,5)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24

scala> rdd.count
res0: Long = 100

scala>
