The Official Spark Programming Guide: A Detailed Walkthrough with Practice

Introduction

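This article is drawn mainly from the Spark Programming Guide on the official Spark website. An earlier article already covered the basic concepts involved, so they are not repeated here (see "Common Spark Concepts").
The main idea of this article is to walk through the commonly used APIs of JavaRDD and JavaPairRDD by reading code.
Let's get started.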

Linking with Spark

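Use Maven or SBT to create a local Java/Scala application project.
The following shows how to compile and run Spark Java code on a single Windows machine (Scala code works similarly).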

Using IDEA

Create a new Maven project with the following pom.xml:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.paulHome.app</groupId>
    <artifactId>learnSparkJavaApi</artifactId>
    <version>1.0</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql_2.11 -->
        <!-- Every Spark artifact must use the same Scala binary version as spark-core (_2.11) -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-streaming_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>

        <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-mllib_2.11 -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>com.github.fommil.netlib</groupId>
            <artifactId>all</artifactId>
            <version>1.1.2</version>
            <type>pom</type>
        </dependency>

    </dependencies>

</project>

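This file is fairly complete: it includes SQL, Streaming, and MLlib. For other Maven-related configuration, see my other article, "Setting Up a Spark Environment in a Virtual Machine". One more tip: remember to enable automatic download of source files in IDEA's Maven settings, which makes later reading and study easier. The biggest benefit of debugging Spark programs on your local machine is that you can set breakpoints, which is a great way to read and understand the source code.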

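Then create your project files however you like. Mine is shown in the figure below. (One more tip: when installing the JDK, never put it in the default Program Files directory, whose name contains a space. This is a trap you will fall into if you also need HDFS, that is, when you install Hadoop. I haven't changed mine yet because I'm not sure whether I'll need Hadoop at home, but if I do, I'll definitely change it.)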

Figure: project directory structure for learning Spark

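Next, set up the Run configuration. The key step is to put -Dspark.master=local[4] in the VM options field, which runs Spark locally with 4 worker threads (probably the first major point of this article).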

Run configuration
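The -D flag only defines a JVM system property, and as far as I know a default-constructed SparkConf picks up system properties whose names start with spark. The Spark-free sketch below shows how code could check which master the Run configuration passed in. The class name, the method name, and the fallback value are illustrative assumptions, not part of Spark.

```java
// Sketch: read the master URL that -Dspark.master=... placed in the JVM system properties.
final class MasterSettingSketch {

    /** Returns the spark.master system property, or the given fallback when it is not set. */
    static String resolveMaster(String fallback) {
        return System.getProperty("spark.master", fallback);
    }

    public static void main(String[] args) {
        // With the VM option -Dspark.master=local[4] this prints local[4];
        // without it, the hypothetical fallback local[2] is printed.
        System.out.println(resolveMaster("local[2]"));
    }
}
```

If you prefer to set the master in code rather than in the Run configuration, the resolved string is the kind of value you would hand to SparkConf.setMaster.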

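To make it easier to follow along, here is the Java source as well. I wrote it casually while learning, mainly to try each example statement from the official site, so the code has no particular theme. It is the code from my first Spark application (yes, I only started learning recently).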

/**
 * Created by Paul Yang on 2017/4/15.
 */
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.broadcast.Broadcast;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.util.AccumulatorV2;
import org.apache.spark.util.LongAccumulator;
import scala.Tuple2;
import scala.collection.immutable.List;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.regex.Pattern;

public class simpleRddMain {

    //Used to sum
    static int countSum = 0;

    public static void main(String[] args) {

        SparkConf conf = new SparkConf().setAppName("simple RDD opt")
                .set("spark.hadoop.validateOutputSpecs", "false");
        JavaSparkContext sc = new JavaSparkContext(conf);

        //parallel a RDD
        ArrayList<Integer> intList = new ArrayList<Integer>(){{
            add(1);
            add(2);
            add(3);
            add(4);
            add(5);
        }};

        JavaRDD<Integer> integerRdd = sc.parallelize(intList); // Get a RDD from a list.
        System.out.println("Integer RDD:");
        System.out.println(integerRdd.collect()); // collect() returns a List; print it so the result is not discarded

        //Lambda expressions
        JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
        JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
        intLineLength.persist(StorageLevel.MEMORY_ONLY());
        int totalLen = intLineLength.reduce((a, b) -> a + b);
        System.out.println("Lines(" + stringRdd.count() + ")<<<Lambda expressions>>>: Total len = " + totalLen);

        //anonymous inner class or a name one
        class GetLenFunc implements Function<String, Integer> {
            @Override
            public Integer call(String s) throws Exception {
                return s.length();
            }
        }
        JavaRDD<Integer> funcLineLengths = stringRdd.map( new GetLenFunc() );
        int funcTotalLen = funcLineLengths.reduce( new Function2<Integer, Integer, Integer>() {
           public Integer call (Integer a, Integer b) {return a + b;}
        });
        System.out.println("<<<anonymous inner class or a name one>>>: Total Len = " + funcTotalLen);


        //Wordcount Process
//        JavaRDD<String> wordsRdd = stringRdd.flatMap(new FlatMapFunction<String, String>() {
//            @Override
//            public Iterator<String> call(String line) throws Exception {
//                return Arrays.asList( line.split(" ")).iterator();
//            }
//        });
        JavaRDD<String> wordsRdd = stringRdd.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
        JavaPairRDD<String, Integer> eachWordRdd = wordsRdd.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairRDD<String, Integer> wordCntRdd = eachWordRdd.reduceByKey( (a, b) -> a + b );
        wordCntRdd.collect();
        wordCntRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                System.out.println(stringIntegerTuple2._1 + "@@@" + stringIntegerTuple2._2);
            }
        });

        //Understanding closures
        integerRdd.foreach(new VoidFunction<Integer>() {
            @Override
            public void call(Integer integer) throws Exception {
                countSum += integer.intValue();
            }
        });
        System.out.println("#~~~~~scope and life cycle of variables and methods~~~~~~# countSum = " + countSum);

        //Working with Key-Value Pairs
        JavaPairRDD<String, Integer> strIntPairRdd = stringRdd.mapToPair(s -> new Tuple2<>(s, 1));
        JavaPairRDD<String, Integer> strCountRdd = strIntPairRdd.reduceByKey((a, b) -> a + b);
        //strCountRdd.sortByKey();
        strCountRdd.collect();
        System.out.println("###Working with Key-Value Pairs### :" + strCountRdd.toString());
        strCountRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            @Override
            public void call(Tuple2<String, Integer> stringIntegerTuple2) throws Exception {
                System.out.println(stringIntegerTuple2._1 + ":" + stringIntegerTuple2._2);
            }
        });

        //Broadcast Variables
        Broadcast<double[]> broadcastVar = sc.broadcast(new double[] {1.1, 2.2, 3.3});
        broadcastVar.value();

        //Accumulator
        LongAccumulator longAccum = sc.sc().longAccumulator();
        integerRdd.foreach(x -> longAccum.add(x));
        System.out.println("\n\n\nAccumulator: " + longAccum.value() + "\n\n\n\n");

        //AccumulatorV2
        class MyVector {
            double[] vals;

            public MyVector(int vecLen) {
                vals = new double[vecLen];
            }

            public void reset() {
                for(int i = 0; i < vals.length; i++) {
                    vals[i] = 0;
                }
            }

            public void add(MyVector inVec) {
                for(int i = 0; i < vals.length; i++) {
                    vals[i] += inVec.vals[i];
                }
            }
        }
        class VectorAccumulatorV2 extends AccumulatorV2<MyVector,MyVector> {
            private MyVector selfVect = null;

            public VectorAccumulatorV2(int vecLen) {
                selfVect = new MyVector(vecLen);
            }

            @Override
            public boolean isZero() {
                for(int i = 0; i < selfVect.vals.length; i++) {
                    if(selfVect.vals[i] != 0) return false;
                }
                return true;
            }

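            @Override
            public AccumulatorV2<MyVector, MyVector> copy() {
                // Build a new accumulator of the same length and copy the current values into it.
                // Calling copy() from inside copy() would recurse until the stack overflows.
                VectorAccumulatorV2 ret = new VectorAccumulatorV2(selfVect.vals.length);
                ret.selfVect.add(selfVect);
                return ret;
            }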

            @Override
            public void reset() {
                selfVect.reset();
            }

            @Override
            public void add(MyVector v) {
                selfVect.add(v);
            }

            @Override
            public void merge(AccumulatorV2<MyVector, MyVector> other) {
                MyVector minVec = null, maxVec = null;
                if(other.value().vals.length < selfVect.vals.length) {
                    minVec = other.value();
                    maxVec = selfVect;
                }
                else {
                    minVec = selfVect;
                    maxVec = other.value();
                }
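                // One possible merge: element-wise sum into a new vector as long as the longer input.
                MyVector merged = new MyVector(maxVec.vals.length);
                for (int i = 0; i < maxVec.vals.length; i++) {
                    merged.vals[i] = maxVec.vals[i] + (i < minVec.vals.length ? minVec.vals[i] : 0);
                }
                selfVect = merged;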
            }

            @Override
            public MyVector value() {
                return selfVect;
            }
        }
        VectorAccumulatorV2 myVecAcc = new VectorAccumulatorV2(5);
        sc.sc().register(myVecAcc, "MyVectorAcc1");


    }
}

After you click Run, you may hit two errors. The first is this one:

java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries.

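This error can actually be ignored: Spark keeps running even when it cannot find the Hadoop binaries; you just can't use HDFS.
If it bothers you and you want to get rid of it, it only takes a few steps: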

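  1. Download winutils.exe. The GitHub link I posted is for the Hadoop 2.7 build, which differs from the older versions you may find on other blogs (though in practice they seem to work the same).
  2. Set the environment variable: point HADOOP_HOME at the downloaded directory (the parent of bin).
  3. Restart IDEA (on reflection, I decided to write this step down too).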

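If you like, you can trace the code that reports the error and set a breakpoint, and you will find another approach: add a configuration statement in the program itself. Then HADOOP_HOME doesn't need to be set, which helps if you already have Hadoop on your Windows machine and don't want to change it or add the winutils files to its bin directory.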
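As far as I can tell from the Hadoop code that raises this error, it looks up the hadoop.home.dir JVM system property before falling back to the HADOOP_HOME environment variable, which is why the message contains "null" when neither is set. A minimal sketch of the in-program setting, assuming a hypothetical directory C:/hadoop that contains bin/winutils.exe (the class and method names are mine):

```java
// Sketch: set hadoop.home.dir from code so that HADOOP_HOME need not be configured.
// "C:/hadoop" is a hypothetical path; it must contain bin/winutils.exe.
final class HadoopHomeSketch {

    /** Sets hadoop.home.dir unless it is already defined and returns the effective value. */
    static String ensureHadoopHome(String fallbackDir) {
        String current = System.getProperty("hadoop.home.dir");
        if (current == null || current.isEmpty()) {
            System.setProperty("hadoop.home.dir", fallbackDir);
            current = fallbackDir;
        }
        return current;
    }

    public static void main(String[] args) {
        // Call this before the JavaSparkContext is created.
        System.out.println(ensureHadoopHome("C:/hadoop"));
    }
}
```

Because the property is only set when it is missing, a -Dhadoop.home.dir=... VM option still takes precedence over the value in code.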

The second error may occur at:

JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");

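It reports that the file cannot be found. Running the code above doesn't actually require downloading any release package from the Spark website, because Maven has already downloaded the dependencies for us. The file from the release package is used here only to avoid constructing a separate data file. Just correct the path, and remember to include the drive letter; otherwise the path is resolved relative to the IDEA project directory.
Once these two errors are resolved, you should see the results.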

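When looking at the results, you may find Spark's built-in log output too noisy. You can limit it by changing the log level: changing log4j.rootCategory=INFO, console to log4j.rootCategory=WARN, console stops Spark from printing INFO-level logs to the console (in the file below, the corresponding setting is root.logger=WARN,console). To show more detail instead, change INFO to DEBUG.
The contents of log4j.properties are as follows: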

log4j.rootLogger=${root.logger}
root.logger=WARN,console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
shell.log.level=WARN
log4j.logger.org.eclipse.jetty=WARN
log4j.logger.org.spark-project.jetty=WARN
log4j.logger.org.spark-project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
log4j.logger.org.apache.spark.repl.Main=${shell.log.level}
log4j.logger.org.apache.spark.api.python.PythonGatewayServer=${shell.log.level}

This file must be placed where the program loads it automatically, for example under the resources directory:

Figure: location of log4j.properties, under the resources directory

With this in place, the log output of the next run is much cleaner.

Initializing Spark

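On the official site, this part mainly explains using SparkConf in code to create a JavaSparkContext. I covered that in an earlier article, so it is not repeated here.
Running from the shell is not the focus of this article either; the focus is learning Spark's actual RDD API.
So let's move on to the main topic: RDDs.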

RDDs

For the concept and characteristics of RDDs, again see my earlier article "Common Spark Concepts", which I think summarizes them reasonably well.

Creating RDDs

Broadly, there are two ways to obtain an RDD:

  • Parallelizing a collection in code
/** Distribute a local Scala collection to form an RDD. */
  def parallelize[T](list: java.util.List[T]): JavaRDD[T] =
    parallelize(list, sc.defaultParallelism)

Reading the source shows that parallelize() takes a List<T> and returns a JavaRDD<T>.

  • Reading from a file
/**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(path: String): JavaRDD[String] = sc.textFile(path)

textFile() takes a string that names either a specific file or a directory. For a directory, it reads all files in that directory and returns a JavaRDD<String>.

  • Other APIs
    There are also other APIs that create RDDs, such as the very handy one that creates an empty RDD:
/** Get an RDD that has no partitions or elements. */
  def emptyRDD[T]: JavaRDD[T] = {
    implicit val ctag: ClassTag[T] = fakeClassTag
    JavaRDD.fromRDD(new EmptyRDD[T](sc))
  }

And others:

/** Distribute a local Scala collection to form an RDD. */
  def parallelizePairs[K, V](list: java.util.List[Tuple2[K, V]]): JavaPairRDD[K, V] =
    parallelizePairs(list, sc.defaultParallelism)

/** Distribute a local Scala collection to form an RDD. */
  def parallelizeDoubles(list: java.util.List[java.lang.Double]): JavaDoubleRDD =
    parallelizeDoubles(list, sc.defaultParallelism)

All of the above are methods of the JavaSparkContext class.

RDD Operations

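RDD operations come in two kinds: (1) transformations and (2) actions. A transformation maps the elements of an RDD into a new RDD: it may be a 1-to-1 map, a 1-to-N (N >= 0) flatMap, or a mapToPair that attaches a key to produce (Key, Value) pairs. An action computes over the elements of an RDD and returns a value. The following introduces several commonly used transformation and action APIs,
mainly from JavaRDD and JavaPairRDD.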

  • Transformation
    • map
      This method applies an implementation of the Function interface to every element of the RDD and returns a new RDD.
/**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[R](f: JFunction[T, R]): JavaRDD[R] =
    new JavaRDD(rdd.map(f)(fakeClassTag))(fakeClassTag)

Say I have a dataset in the format: id,value
Calling sc.textFile("pathOfFile") then yields an RDD of String elements that must be parsed according to that format. The map call looks like this:

//Lambda version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(s -> s.split(",", -1));
//Non-lambda version
JavaRDD<String[]> strArrayIdValue = stringRdd.map(new Function<String, String[]>() {
            @Override
            public String[] call(String v1) throws Exception {
                return v1.split(",", -1);
            }
        });

From here on, to show the return types more clearly, I will no longer use lambda syntax.

  • filter
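    Keeps the elements for which the filter rule returns true and drops those for which it returns false, producing a new RDD.
    Continuing the example above, suppose the dataset contains malformed or incomplete records, and we simply discard any row whose String array does not have exactly 2 entries: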
        JavaRDD<String[]> strFiltedRdd = strArrayIdValue.filter(new Function<String[], Boolean>() {
            @Override
            public Boolean call(String[] v1) throws Exception {
                return v1.length == 2;
            }
        });
  • flatMap
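    Similar to map, except that each element is not necessarily mapped to exactly one new element; it is a 1-to-N mapping with N >= 0. Suppose the value field in the example above is structured as value1|value2|value3... Then dropping the id and storing the individual values in one RDD is written like this: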
        JavaRDD<String> strValueNRdd = strFiltedRdd.flatMap(new FlatMapFunction<String[], String>() {
            @Override
            public Iterator<String> call(String[] strings) throws Exception {
                return Arrays.asList(strings[1].split("\\|", -1)).iterator();
            }
        });

This splits each element's collection of values into a new RDD in which every element holds a single value.
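To try the parse, filter, and flatten logic without starting Spark, the same three steps can be run on an in-memory list with java.util.stream, whose map, filter, and flatMap behave analogously to the RDD methods of the same names. This is only a sketch of the element-level logic; the class name and the sample lines are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch: the id,value pipeline on a plain list (no Spark involved).
final class IdValuePipelineSketch {

    /** Parses "id,value1|value2|..." lines and returns every individual value. */
    static List<String> extractValues(List<String> lines) {
        return lines.stream()
                .map(line -> line.split(",", -1))        // 1-to-1, like RDD.map
                .filter(fields -> fields.length == 2)    // drop malformed rows, like RDD.filter
                .flatMap(fields -> Arrays.stream(fields[1].split("\\|", -1))) // 1-to-N, like RDD.flatMap
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,a|b", "broken line", "2,c");
        System.out.println(extractValues(lines)); // prints [a, b, c]
    }
}
```

The malformed row disappears at the filter step, and the row with two values contributes two output elements, which is the 1-to-N behaviour described above.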

  • mapPartitions
    This applies a mapping to the RDD one partition at a time. The source definition is:
  /**
   * Return a new RDD by applying a function to each partition of this RDD.
   */
  def mapPartitions[U](f: FlatMapFunction[JIterator[T], U]): JavaRDD[U] = {
    def fn: (Iterator[T]) => Iterator[U] = {
      (x: Iterator[T]) => f.call(x.asJava).asScala
    }
    JavaRDD.fromRDD(rdd.mapPartitions(fn)(fakeClassTag[U]))(fakeClassTag[U])
  }

Converting the earlier map call to this form:

        JavaRDD<String[]> strArrayIdValue = stringRdd.mapPartitions(new FlatMapFunction<Iterator<String>, String[]>() {
            @Override
            public Iterator<String[]> call(Iterator<String> stringIterator) throws Exception {
                ArrayList<String[]> arrList = new ArrayList<String[]>();
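                // Consume every line in the partition, not just the first one;
                // the loop also handles an empty partition safely.
                while (stringIterator.hasNext()) {
                    arrList.add(stringIterator.next().split(",", -1));
                }
                return arrList.iterator();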
            }
        });
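Collecting a partition into an ArrayList before returning keeps that whole partition in memory. One alternative, sketched below under my own naming, is to wrap the incoming iterator so that each line is split only when it is requested. The helper is plain Java and can be tested without Spark.

```java
import java.util.Arrays;
import java.util.Iterator;

// Sketch: a per-partition function that transforms lines lazily instead of buffering them.
final class LazyPartitionSketch {

    /** Wraps the partition's line iterator; each line is split only when it is requested. */
    static Iterator<String[]> splitLazily(Iterator<String> lines) {
        return new Iterator<String[]>() {
            @Override
            public boolean hasNext() {
                return lines.hasNext();
            }

            @Override
            public String[] next() {
                return lines.next().split(",", -1);
            }
        };
    }

    public static void main(String[] args) {
        Iterator<String[]> it = splitLazily(Arrays.asList("1,a", "2,b").iterator());
        while (it.hasNext()) {
            System.out.println(Arrays.toString(it.next())); // prints [1, a] then [2, b]
        }
    }
}
```

Since the method takes an Iterator<String> and returns an Iterator<String[]>, it should fit the FlatMapFunction<Iterator<String>, String[]> shape shown above, for example as a method reference passed to mapPartitions; I have not measured whether it makes a noticeable difference.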
  • union
    Merges the input RDD with the RDD it is called on, producing a new RDD. This API is useful when reading several separate files into one RDD, for example:
        JavaRDD<String> unionAllFilesRdd = sc.emptyRDD();
        for(String name : fileNames) {
            unionAllFilesRdd = unionAllFilesRdd.union(sc.textFile(name));
        }
  • intersection
    Returns the intersection of the input RDD and the source RDD, without duplicates.
    Source documentation:
  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.intersection(other.rdd))
  • distinct
    Returns a new RDD made of the distinct elements of the RDD, that is, deduplication. It takes no arguments.
    Source:
  /**
   * Return a new RDD containing the distinct elements in this RDD.
   */
  def distinct(): JavaRDD[T] = wrapRDD(rdd.distinct())
  • subtract
    Finds the elements that exist in this RDD but not in the input RDD and combines them into a new RDD.
  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be less than or equal to us.
   */
  def subtract(other: JavaRDD[T]): JavaRDD[T] = wrapRDD(rdd.subtract(other))

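Combining the methods above, suppose we have this task: add the data in RDD A that is not in RDD B to B, and replace the part where A and B overlap according to some condition.
1. Get the part of A that is not in B: AoutB = A.subtract(B)
2. Find the subset of A that satisfies the replacement condition: replaceCandidateA = A.filter(condition)
3. Find the subset of A that can actually replace: realReplaceA = replaceCandidateA.intersection(B)
4. Find the subset of B to discard: discardB = B.intersection(realReplaceA)
5. What remains of B after discarding: B = B.subtract(discardB)
6. Merge the replacement set and the newly added set into B: newB = B.union(realReplaceA).union(AoutB)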
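The six steps above can be traced on ordinary Java sets, where removeAll, retainAll, and addAll play the roles of subtract, intersection, and union. This is a Spark-free sketch with names of my own choosing, and it ignores duplicates, which RDD union would keep.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch: the subtract / filter / intersection / union recipe on java.util.Set.
final class SetMergeSketch {

    static <T> Set<T> subtract(Set<T> a, Set<T> b) {
        Set<T> r = new HashSet<>(a);
        r.removeAll(b);
        return r;
    }

    static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        Set<T> r = new HashSet<>(a);
        r.retainAll(b);
        return r;
    }

    static <T> Set<T> union(Set<T> a, Set<T> b) {
        Set<T> r = new HashSet<>(a);
        r.addAll(b);
        return r;
    }

    static <T> Set<T> mergeInto(Set<T> a, Set<T> b, Predicate<T> condition) {
        Set<T> aOutB = subtract(a, b);                                        // step 1
        Set<T> replaceCandidateA = a.stream().filter(condition).collect(Collectors.toSet()); // step 2
        Set<T> realReplaceA = intersection(replaceCandidateA, b);             // step 3
        Set<T> discardB = intersection(b, realReplaceA);                      // step 4
        Set<T> remainingB = subtract(b, discardB);                            // step 5
        return union(union(remainingB, realReplaceA), aOutB);                 // step 6
    }

    public static void main(String[] args) {
        Set<Integer> a = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> b = new HashSet<>(Arrays.asList(3, 4));
        System.out.println(new TreeSet<>(mergeInto(a, b, x -> x > 2))); // prints [1, 2, 3, 4]
    }
}
```

One thing the trace makes visible: because intersection compares whole elements, the elements "replaced" in B are equal to the ones they are replaced with, so on plain elements the result is simply the union of A and B. Replacing records that share only an id would need keyed data, for which a JavaPairRDD operation such as subtractByKey is probably the better fit.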

  • mapToPair
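    This API converts a JavaRDD into a JavaPairRDD, that is, it extracts a key. For example, after the earlier id,value data has been turned into [id, value] String arrays, you can build a JavaPairRDD keyed by id: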
        JavaPairRDD<String, String> keyValuePairRdd = strFiltedRdd.mapToPair(new PairFunction<String[], String, String>() {
            @Override
            public Tuple2<String, String> call(String[] strings) throws Exception {
                return new Tuple2<>(strings[0], strings[1]);
            }
        });

Next come some APIs that are specific to JavaPairRDD.

  • groupByKey
  /**
   * Group the values for each key in the RDD into a single sequence. Hash-partitions the
   * resulting RDD with the existing partitioner/parallelism level.
   *
   * @note If you are grouping in order to perform an aggregation (such as a sum or average) over
   * each key, using `JavaPairRDD.reduceByKey` or `JavaPairRDD.combineByKey`
   * will provide much better performance.
   */
  def groupByKey(): JavaPairRDD[K, JIterable[V]] =
    fromRDD(groupByResultToJava(rdd.groupByKey()))

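This API is not very pleasant to use: you cannot customize how the values are combined, and its execution details need some attention. See "Avoid GroupByKey" and "深入理解groupByKey、reduceByKey" (Understanding groupByKey and reduceByKey in Depth).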

  • reduceByKey
    Combines ("adds") the values of the PairRDD's elements by key; the concrete "add" operation is supplied by a class that implements the Function2 interface.
    Source:
  /**
   * Merge the values for each key using an associative and commutative reduce function. This will
   * also perform the merging locally on each mapper before sending results to a reducer, similarly
   * to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/
   * parallelism level.
   */
  def reduceByKey(func: JFunction2[V, V, V]): JavaPairRDD[K, V] = {
    fromRDD(reduceByKey(defaultPartitioner(rdd), func))
  }

As you can see, reduceByKey does not change the type of V. For example, merging the values of the earlier PairRDD looks like this:

        JavaPairRDD<String, String> byIdValuesPairRdd = keyValuePairRdd.reduceByKey(new Function2<String, String, String>() {
            @Override
            public String call(String v1, String v2) throws Exception {
                return v1+"|"+v2;
            }
        });
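What the merge function does per key can be mimicked on a plain list of pairs with Map.merge, which also combines an existing value with a new one through a two-argument function. The sketch below is not Spark code; the class, the pair helper, and the sample data are made up for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Sketch: per-key reduction on an in-memory list, mirroring the shape of reduceByKey.
final class ReduceByKeySketch {

    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    /** Folds all values that share a key with func; the value type V stays the same. */
    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> func) {
        Map<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            result.merge(p.getKey(), p.getValue(), func);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> pairs =
                Arrays.asList(pair("1", "a"), pair("1", "b"), pair("2", "c"));
        System.out.println(reduceByKey(pairs, (v1, v2) -> v1 + "|" + v2)); // prints {1=a|b, 2=c}
    }
}
```

The source comment quoted above asks for an associative and commutative function. String concatenation is not commutative, so on a real cluster the pieces of a merged value may well come out in a different order from run to run, whereas this single-threaded sketch always keeps list order.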

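In this example V is String. In practice you could also first map each String to an ArrayList<String> and then use reduceByKey to merge them into one large ArrayList.
So is there a method that transforms String directly into ArrayList<String>? Read on.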

  • aggregateByKey
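    This method takes more arguments, mainly because its purpose is to merge the elements of the RDD by key into a U. Since the type U differs from V, you must specify how a V is merged into a U (second argument) and how two U's are merged (third argument), and you must also supply an initial U (first argument; for example an empty collection, 0 for integer addition, or 1 for integer multiplication). Here comes the good part. First the source: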
  /**
   * Aggregate the values of each key, using given combine functions and a neutral "zero value".
   * This function can return a different result type, U, than the type of the values in this RDD,
   * V. Thus, we need one operation for merging a V into a U and one operation for merging two U's.
   * The former operation is used for merging values within a partition, and the latter is used for
   * merging values between partitions. To avoid memory allocation, both of these functions are
   * allowed to modify and return their first argument instead of creating a new U.
   */
  def aggregateByKey[U](zeroValue: U, seqFunc: JFunction2[U, V, U], combFunc: JFunction2[U, U, U]):
      JavaPairRDD[K, U] = {
    implicit val ctag: ClassTag[U] = fakeClassTag
    fromRDD(rdd.aggregateByKey(zeroValue)(seqFunc, combFunc))
  }

Now for the hands-on part:

        JavaPairRDD<String, ArrayList<String>> keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), new Function2<ArrayList<String>, String, ArrayList<String>>() {
            @Override
            public ArrayList<String> call(ArrayList<String> v1, String v2) throws Exception {
                v1.add(v2);
                return v1;
            }
        }, new Function2<ArrayList<String>, ArrayList<String>, ArrayList<String>>() {
            @Override
            public ArrayList<String> call(ArrayList<String> v1, ArrayList<String> v2) throws Exception {
                v1.addAll(v2);
                return v1;
            }
        });

The three arguments take care of the conversion from String to ArrayList<String>.
An opportunity this good to show off with lambdas can't be missed:

    val initialSet = mutable.HashSet.empty[String]
    val addToSet = (s: mutable.HashSet[String], v: String) => s += v
    val mergePartitionSets = (p1: mutable.HashSet[String], p2: mutable.HashSet[String]) => p1 ++= p2

    val keyValuelistPairRdd = keyValuePairRdd.aggregateByKey(initialSet)(addToSet, mergePartitionSets)

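Why did the style suddenly switch to Scala? I couldn't help it; I really couldn't work out how to write this as a lambda in Java (Java is doomed). If anyone knows how to write it with lambdas in Java, please let me know in the comments. Many thanks!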

        JavaPairRDD<String, ArrayList<String>> keyValuelistLambda = keyValuePairRdd.aggregateByKey(new ArrayList<String>(), (uList,vStr) -> {uList.add(vStr); return uList;}, (u1, u2) -> {u1.addAll(u2); return u1;});

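Well... I did come up with a way to write it myself (otherwise it would be too embarrassing; I've only been seriously learning Java for 3 months, but that's no excuse). It looks a bit odd, though, so if anyone has a better way, you are very welcome to share it in the comments. Thanks!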
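To see how the zero value, the within-partition function, and the between-partition function cooperate, here is a Spark-free sketch that simulates two partitions as nested lists. All names are mine. The zero value is passed as a Supplier so that every key starts from its own fresh U; my understanding is that Spark achieves the same by deserializing a new copy of the zero value for each key, which is what makes mutating the first argument safe.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;
import java.util.function.Supplier;

// Sketch: aggregateByKey semantics over simulated partitions (no Spark involved).
final class AggregateByKeySketch {

    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    static <K, V, U> Map<K, U> aggregateByKey(List<List<Map.Entry<K, V>>> partitions,
                                              Supplier<U> zero,
                                              BiFunction<U, V, U> seqFunc,
                                              BinaryOperator<U> combFunc) {
        Map<K, U> merged = new LinkedHashMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            // Within one partition: fold each V into that key's U, starting from a fresh zero.
            Map<K, U> local = new LinkedHashMap<>();
            for (Map.Entry<K, V> p : partition) {
                U acc = local.containsKey(p.getKey()) ? local.get(p.getKey()) : zero.get();
                local.put(p.getKey(), seqFunc.apply(acc, p.getValue()));
            }
            // Between partitions: combine the per-partition U's of the same key.
            for (Map.Entry<K, U> e : local.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), combFunc);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, String>>> partitions = Arrays.asList(
                Arrays.asList(pair("1", "a"), pair("2", "b")),
                Arrays.asList(pair("1", "c")));
        Supplier<List<String>> zero = ArrayList::new;
        BiFunction<List<String>, String, List<String>> seqFunc = (list, v) -> { list.add(v); return list; };
        BinaryOperator<List<String>> combFunc = (l1, l2) -> { l1.addAll(l2); return l1; };
        System.out.println(aggregateByKey(partitions, zero, seqFunc, combFunc)); // prints {1=[a, c], 2=[b]}
    }
}
```

The two lambdas in main have the same bodies as the ones in the Java lambda version above; only the way the zero value is supplied differs.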

  • sortByKey
    When K is sortable, this method sorts by key, in ascending order by default. Source:
  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements in
   * ascending order. Calling `collect` or `save` on the resulting RDD will return or output an
   * ordered list of records (in the `save` case, they will be written to multiple `part-X` files
   * in the filesystem, in order of the keys).
   */
  def sortByKey(): JavaPairRDD[K, V] = sortByKey(true)
  /**
   * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
   * `collect` or `save` on the resulting RDD will return or output an ordered list of records
   * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
   * order of the keys).
   */
  def sortByKey(ascending: Boolean): JavaPairRDD[K, V] = {
    val comp = com.google.common.collect.Ordering.natural().asInstanceOf[Comparator[K]]
    sortByKey(comp, ascending)
  }
  • join
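    This method takes the intersection of two PairRDDs by key: if k exists in both this RDD and the input RDD, it is added to the returned RDD, and each element has the form (k, (v1, v2)), where the second part is a tuple with v1 from this RDD and v2 from the input RDD. Source: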
  /**
   * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
   * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
   * (k, v2) is in `other`. Uses the given Partitioner to partition the output RDD.
   */
  def join[W](other: JavaPairRDD[K, W], partitioner: Partitioner): JavaPairRDD[K, (V, W)] =
    fromRDD(rdd.join(other, partitioner))
        JavaPairRDD<String, Tuple2<String, String>> joinRDD = byIdValuesPairRdd.join(keyValuePairRdd);
  • leftOuterJoin & rightOuterJoin & fullOuterJoin
    join has three variants. Let's see how each behaves in practice:
        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
                add(new Tuple2<>(7, "str77"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);

        JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
        joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));

        JavaPairRDD<Integer, Tuple2<String, Optional<String>>> leftOuterJoinRdd = paralPairRdd.leftOuterJoin(otherParalPairRdd);
        leftOuterJoinRdd.foreach(s -> System.out.println("leftOuterJoin*"+ s.toString()));

        JavaPairRDD<Integer, Tuple2<Optional<String>, String>> rightOuterJoinRdd = paralPairRdd.rightOuterJoin(otherParalPairRdd);
        rightOuterJoinRdd.foreach(s -> System.out.println("rightOuterJoin*"+ s.toString()));

        JavaPairRDD<Integer, Tuple2<Optional<String>, Optional<String>>> fullOuterJoinRdd = paralPairRdd.fullOuterJoin(otherParalPairRdd);
        fullOuterJoinRdd.foreach(s -> System.out.println("fullOuterJoin*"+ s.toString()));

The code above produces:

join*(2,(str2,str2))
join*(4,(str44,str4))

leftOuterJoin*(2,(str2,Optional[str2]))
leftOuterJoin*(4,(str44,Optional[str4]))
leftOuterJoin*(1,(str1,Optional.empty))
leftOuterJoin*(1,(str11,Optional.empty))

rightOuterJoin*(2,(Optional[str2],str2))
rightOuterJoin*(4,(Optional[str44],str4))
rightOuterJoin*(3,(Optional.empty,str3))
rightOuterJoin*(7,(Optional.empty,str77))
rightOuterJoin*(5,(Optional.empty,str5))

fullOuterJoin*(4,(Optional[str44],Optional[str4]))
fullOuterJoin*(2,(Optional[str2],Optional[str2]))
fullOuterJoin*(3,(Optional.empty,Optional[str3]))
fullOuterJoin*(7,(Optional.empty,Optional[str77]))
fullOuterJoin*(1,(Optional[str1],Optional.empty))
fullOuterJoin*(1,(Optional[str11],Optional.empty))
fullOuterJoin*(5,(Optional.empty,Optional[str5]))
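So join keeps only the keys present on both sides (the intersection), leftOuterJoin keeps every key of this RDD, rightOuterJoin keeps every key of the input RDD, and fullOuterJoin keeps the union of the keys. The order of keys in the output is not guaranteed; in this run the values within a key kept their input order.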
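The difference between join and leftOuterJoin can also be traced by hand with two nested loops over in-memory lists. The sketch below is plain Java with names of my own, and it uses java.util.Optional as a stand-in for Spark's Optional class, which prints in the same style. Rows are formatted as strings so they can be compared with the output above.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Sketch: inner join and left outer join on lists of (key, value) pairs (no Spark involved).
final class JoinSketch {

    static Map.Entry<Integer, String> pair(int key, String value) {
        return new SimpleEntry<>(key, value);
    }

    /** Inner join: one row per (left value, right value) combination that shares a key. */
    static List<String> join(List<Map.Entry<Integer, String>> left,
                             List<Map.Entry<Integer, String>> right) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> l : left) {
            for (Map.Entry<Integer, String> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    rows.add("(" + l.getKey() + ",(" + l.getValue() + "," + r.getValue() + "))");
                }
            }
        }
        return rows;
    }

    /** Left outer join: every left row survives; a missing right side becomes Optional.empty. */
    static List<String> leftOuterJoin(List<Map.Entry<Integer, String>> left,
                                      List<Map.Entry<Integer, String>> right) {
        List<String> rows = new ArrayList<>();
        for (Map.Entry<Integer, String> l : left) {
            boolean matched = false;
            for (Map.Entry<Integer, String> r : right) {
                if (l.getKey().equals(r.getKey())) {
                    rows.add("(" + l.getKey() + ",(" + l.getValue() + "," + Optional.of(r.getValue()) + "))");
                    matched = true;
                }
            }
            if (!matched) {
                rows.add("(" + l.getKey() + ",(" + l.getValue() + "," + Optional.empty() + "))");
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> left = Arrays.asList(
                pair(1, "str1"), pair(1, "str11"), pair(2, "str2"), pair(4, "str44"));
        List<Map.Entry<Integer, String>> right = Arrays.asList(
                pair(2, "str2"), pair(3, "str3"), pair(4, "str4"), pair(5, "str5"), pair(7, "str77"));
        join(left, right).forEach(row -> System.out.println("join*" + row));
        leftOuterJoin(left, right).forEach(row -> System.out.println("leftOuterJoin*" + row));
    }
}
```

rightOuterJoin is the mirror image with the Optional on the left side, and fullOuterJoin wraps both sides in Optional. Unlike Spark, the sketch emits rows in the order of the left list.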

  • cogroup
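    This method groups two or more PairRDDs together (it accepts several input RDDs and combines all of them). If a key in one RDD does not appear in another, an empty collection is recorded for it. Talk is cheap; here are the code and the results.
    Source first: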
  /**
   * For each key k in `this` or `other`, return a resulting RDD that contains a tuple with the
   * list of values for that key in `this` as well as `other`.
   */
  def cogroup[W](other: JavaPairRDD[K, W]): JavaPairRDD[K, (JIterable[V], JIterable[W])] =
    fromRDD(cogroupResultToJava(rdd.cogroup(other)))

There are several overloads; counting the RDD the method is called on, cogroup supports up to 4 RDDs at once.


Figure: the overloaded versions of cogroup

Hands-on:

        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);
        JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<String>>> coGroupRdd = paralPairRdd.cogroup(otherParalPairRdd);
        coGroupRdd.foreach(s -> System.out.println("+++"+ s.toString()));

The result:

+++(3,([],[str3]))
+++(1,([str1, str11],[]))
+++(2,([str2],[str2]))
+++(4,([str44],[str4]))
+++(5,([],[str5]))
As you can see, the order of keys is not guaranteed, while in this run the values within each key kept their input order.
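The grouping that cogroup performs can be sketched in plain Java as a map from each key to a pair of lists, where a side that never saw the key simply keeps its empty list. The class and helper names below are my own, and a Map.Entry stands in for the Tuple2 of two Iterables.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: cogroup semantics on two in-memory lists of (key, value) pairs (no Spark involved).
final class CogroupSketch {

    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleEntry<>(key, value);
    }

    /** For every key on either side, collects the left values and the right values. */
    static <K, V, W> Map<K, Map.Entry<List<V>, List<W>>> cogroup(List<Map.Entry<K, V>> left,
                                                                  List<Map.Entry<K, W>> right) {
        Map<K, Map.Entry<List<V>, List<W>>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : left) {
            slot(groups, e.getKey()).getKey().add(e.getValue());
        }
        for (Map.Entry<K, W> e : right) {
            slot(groups, e.getKey()).getValue().add(e.getValue());
        }
        return groups;
    }

    /** Returns the pair of lists for a key, creating two empty lists on first use. */
    private static <K, V, W> Map.Entry<List<V>, List<W>> slot(Map<K, Map.Entry<List<V>, List<W>>> groups, K key) {
        return groups.computeIfAbsent(key,
                k -> new SimpleEntry<List<V>, List<W>>(new ArrayList<V>(), new ArrayList<W>()));
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> left = Arrays.asList(
                pair(1, "str1"), pair(1, "str11"), pair(2, "str2"), pair(4, "str44"));
        List<Map.Entry<Integer, String>> right = Arrays.asList(
                pair(2, "str2"), pair(3, "str3"), pair(4, "str4"), pair(5, "str5"));
        // First line printed: +++(1,([str1, str11],[]))
        cogroup(left, right).forEach((k, g) ->
                System.out.println("+++(" + k + ",(" + g.getKey() + "," + g.getValue() + "))"));
    }
}
```

Compared with the join variants, nothing is dropped and nothing is multiplied: each key appears exactly once, with all of its values from both sides.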

  • intersection
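    Takes the intersection of this RDD and the input RDD. The difference from join is that join keeps pairs whose keys match even when the values differ, whereas intersection keeps only data whose key and value both match: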
  /**
   * Return the intersection of this RDD and another one. The output will not contain any duplicate
   * elements, even if the input RDDs did.
   *
   * @note This method performs a shuffle internally.
   */
  def intersection(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
    new JavaPairRDD[K, V](rdd.intersection(other.rdd))

An example:

        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
                add(new Tuple2<>(7, "str77"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);

        JavaPairRDD<Integer, Tuple2<String, String>> joinRdd = paralPairRdd.join(otherParalPairRdd);
        joinRdd.foreach(s -> System.out.println("join*"+ s.toString()));

        JavaPairRDD<Integer, String> intersectRdd = paralPairRdd.intersection(otherParalPairRdd);
        intersectRdd.foreach(s -> System.out.println("intersection*" + s.toString()));

Result:

join*(4,(str44,str4))
join*(2,(str2,str2))
intersection*(2,str2)
So intersection does not change the element type, whereas join does, because the value becomes a Tuple.

  • subtract
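    This method takes the elements that exist in this RDD but not in the input RDD and forms an RDD of the same type. In other words, the comparison is based not only on the key but also on the value: as long as either the key or the value differs, the element is included in the returned RDD.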
  /**
   * Return an RDD with the elements from `this` that are not in `other`.
   *
   * Uses `this` partitioner/partition size, because even if `other` is huge, the resulting
   * RDD will be <= us.
   */
  def subtract(other: JavaPairRDD[K, V]): JavaPairRDD[K, V] =
    fromRDD(rdd.subtract(other))

Example code:

        ArrayList<Tuple2<Integer, String>> idValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(1, "str1"));
                add(new Tuple2<>(1, "str11"));
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(4, "str44"));
            };
        };
        JavaPairRDD<Integer, String> paralPairRdd = sc.parallelizePairs(idValList);

        ArrayList<Tuple2<Integer, String>> otherIdValList = new ArrayList<Tuple2<Integer, String>>(){
            {
                add(new Tuple2<>(2, "str2"));
                add(new Tuple2<>(3, "str3"));
                add(new Tuple2<>(4, "str4"));
                add(new Tuple2<>(5, "str5"));
                add(new Tuple2<>(7, "str77"));
            }
        };
        JavaPairRDD<Integer, String> otherParalPairRdd = sc.parallelizePairs(otherIdValList);

        JavaPairRDD<Integer, String> substractRdd = paralPairRdd.subtract(otherParalPairRdd);
        substractRdd.foreach(s -> System.out.println("substract*" + s.toString()));

Result:
substract*(4,str44)
substract*(1,str1)
substract*(1,str11)
*不过还要小心有坑,这里的数据结构是Tuple2<Integer, String>,如果换成Tuple2<Integer, String[]>,那么即使String[]的内容一样,也同样被认为是不同的值,切记!!!
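The array pitfall can be reproduced without Spark. The following is a minimal plain-Java sketch, assuming that the comparison ultimately relies on equals()/hashCode() of the value; the class name ArrayValuePitfall and the helper distinctCount are made up for illustration and are not part of the Spark API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ArrayValuePitfall {
    // Counts distinct values the way a hash-based comparison sees them,
    // i.e. through equals() and hashCode(). Hypothetical helper.
    static int distinctCount(List<?> values) {
        Set<Object> seen = new HashSet<>(values);
        return seen.size();
    }

    public static void main(String[] args) {
        String[] a = {"x", "y"};
        String[] b = {"x", "y"};

        // Arrays inherit equals()/hashCode() from Object: identity only.
        System.out.println(a.equals(b));          // false
        System.out.println(Arrays.equals(a, b));  // true (content comparison)

        // Lists compare by content, so wrapping the array avoids the pitfall.
        List<String> la = Arrays.asList(a);
        List<String> lb = Arrays.asList(b);
        System.out.println(la.equals(lb));        // true

        System.out.println(distinctCount(Arrays.asList(a, b)));   // 2
        System.out.println(distinctCount(Arrays.asList(la, lb))); // 1
    }
}
```

One possible workaround, under the same assumption, is to store the value as a List<String> (or another type with content-based equals) instead of a String[].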

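  • coalesce
    This method reduces the number of partitions of the RDD to the number passed in. It is said to be fairly efficient, but in my experience it did not manage to get the data saved into a single file in the end.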
  /**
   * Return a new RDD that is reduced into `numPartitions` partitions.
   */
  def coalesce(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.coalesce(numPartitions))
  • repartition
    This method reshuffles the data of the RDD randomly to create either more or fewer partitions and balances the data across them. This always shuffles all data over the network.
 /**
  * Return a new RDD that has exactly numPartitions partitions.
  *
  * Can increase or decrease the level of parallelism in this RDD. Internally, this uses
  * a shuffle to redistribute data.
  *
  * If you are decreasing the number of partitions in this RDD, consider using `coalesce`,
  * which can avoid performing a shuffle.
  */
 def repartition(numPartitions: Int): JavaPairRDD[K, V] = fromRDD(rdd.repartition(numPartitions))

It is sometimes used as a last step to save the result RDD into a single part file.

     coGroupRdd.repartition(1).saveAsTextFile(fileName);
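  • Actions
    • reduce
      This is similar to reduceByKey. However, reduceByKey outputs a collection of elements holding the new value for each key, which is still an RDD, whereas reduce aggregates all elements of the RDD into one single output that is no longer an RDD. That is why this operation is classified as an action, while reduceByKey is a transformation.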
  /**
   * Reduces the elements of this RDD using the specified commutative and associative binary
   * operator.
   */
  def reduce(f: JFunction2[T, T, T]): T = rdd.reduce(f)

In practice:

        JavaRDD<String> stringRdd = sc.textFile("G:/ImportantTools/spark-2.1.0-bin-hadoop2.7/README.md");
        JavaRDD<Integer> intLineLength = stringRdd.map(s -> s.length());
        intLineLength.persist(StorageLevel.MEMORY_ONLY());
        int totalLen = intLineLength.reduce((a, b) -> a + b);
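The source comment for reduce asks for a commutative and associative operator. The plain-Java sketch below illustrates why, by simulating partitions with nested lists: each "partition" is folded on its own and the partial results are then merged. It is only a simulation, not how Spark is implemented; the class and method names (PartitionedReduceSketch, fold, reduceByPartitions) are hypothetical, and every partition is assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

class PartitionedReduceSketch {
    // Left fold over a non-empty list.
    static int fold(List<Integer> xs, BinaryOperator<Integer> op) {
        int acc = xs.get(0);
        for (int i = 1; i < xs.size(); i++) {
            acc = op.apply(acc, xs.get(i));
        }
        return acc;
    }

    // Folds every partition separately, then folds the partial results.
    static int reduceByPartitions(List<List<Integer>> partitions, BinaryOperator<Integer> op) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            partials.add(fold(partition, op));
        }
        return fold(partials, op);
    }

    public static void main(String[] args) {
        List<List<Integer>> twoParts = Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3, 4));
        List<List<Integer>> onePart = Arrays.asList(Arrays.asList(1, 2, 3, 4));

        BinaryOperator<Integer> sum = (a, b) -> a + b;
        BinaryOperator<Integer> minus = (a, b) -> a - b;

        // Addition is associative: the partitioning does not matter.
        System.out.println(reduceByPartitions(twoParts, sum));   // 10
        System.out.println(reduceByPartitions(onePart, sum));    // 10

        // Subtraction is not: the result depends on the partitioning.
        System.out.println(reduceByPartitions(twoParts, minus)); // (1-2)-(3-4) = 0
        System.out.println(reduceByPartitions(onePart, minus));  // ((1-2)-3)-4 = -8
    }
}
```

The line-length sum in the example above uses addition, so it gives the same total however the file happens to be split into partitions.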
  • collect
    Pay particular attention to the note in the source:
  /**
   * Return an array that contains all of the elements in this RDD.
   *
   * @note this method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collect(): JList[T] =
    rdd.collect().toSeq.asJava
  • collectAsMap
    This API is specific to JavaPairRDD. It returns a Map holding the K-to-V mapping of the original RDD.
/**
   * Return the key-value pairs in this RDD to the master as a Map.
   *
   * @note this method should only be used if the resulting data is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def collectAsMap(): java.util.Map[K, V] = mapAsSerializableJavaMap(rdd.collectAsMap())

One thing to note here, though: the Map type returned can cause problems when it is broadcast.
For example:

                final Map<String, MyInfoClass> kvMap = keyValuePairRDD.collectAsMap();
                final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);

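The code above converts a JavaPairRDD into a Map and then broadcasts it, so that the executors can use it while running methods such as map(). Written this way, however, it fails with the following error some of the time: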

17/06/14 11:43:53 INFO scheduler.DAGScheduler: ShuffleMapStage 1 (repartition at FromBSID2Gps.java:214) failed in 1.182 s due to Job aborted due to stage failure: Task 5 in stage 1.0 failed 4 times, most recent failure: Lost task 5.3 in stage 1.0 (TID 19, s36.dc.taiyear, executor 3): java.io.IOException: java.lang.UnsupportedOperationException
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1213)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
.... ....
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.UnsupportedOperationException
at java.util.AbstractMap.put(AbstractMap.java:203)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:135)
at com.esotericsoftware.kryo.serializers.MapSerializer.read(MapSerializer.java:17)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:228)
at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:217)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1206)
... 20 more
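The trace shows Kryo's MapSerializer calling put() while deserializing the map and ending up in AbstractMap.put, which throws UnsupportedOperationException. So, to keep the map from being handled as an AbstractMap, state the concrete Map type explicitly, as in the following code.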

                Map<String, MyInfoClass> kvMap = new HashMap<>();
                kvMap.putAll(keyValuePairRDD.collectAsMap());
                final Broadcast<Map<String, MyInfoClass>> broadcastKvMap = sc.broadcast(kvMap);
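The failure mode can be reproduced in plain Java. The sketch below assumes that the map returned by collectAsMap behaves like a read-only view built on java.util.AbstractMap, which is what the stack trace suggests; readOnlyView is a hypothetical stand-in written for this illustration, not the Spark class itself.

```java
import java.util.AbstractMap;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class ReadOnlyMapSketch {
    // Hypothetical stand-in for a read-only map view: it extends AbstractMap
    // and implements only entrySet(), so put() keeps AbstractMap's default
    // behaviour of throwing UnsupportedOperationException.
    static <K, V> Map<K, V> readOnlyView(Map<K, V> backing) {
        return new AbstractMap<K, V>() {
            @Override
            public Set<Map.Entry<K, V>> entrySet() {
                return Collections.unmodifiableSet(backing.entrySet());
            }
        };
    }

    // Returns true if put() is rejected by the given map.
    static boolean putFails(Map<String, Integer> map) {
        try {
            map.put("probe", 0);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> source = new HashMap<>();
        source.put("a", 1);

        Map<String, Integer> view = readOnlyView(source);
        System.out.println(view.get("a"));   // 1, reading works
        System.out.println(putFails(view));  // true, put() is unsupported

        // Copying into a HashMap gives a map that a put()-based
        // deserializer can rebuild entry by entry.
        Map<String, Integer> copy = new HashMap<>(view);
        System.out.println(putFails(copy));  // false
    }
}
```

This mirrors the fix above: the HashMap copy carries the same entries but has a concrete type that supports put().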
  • count
  /**
   * Return the number of elements in the RDD.
   */
  def count(): Long = rdd.count()
  • first
  /**
   * Return the first element in this RDD.
   */
  def first(): T = rdd.first()
  • take
  /**
   * Take the first num elements of the RDD. This currently scans the partitions *one by one*, so
   * it will be slow if a lot of partitions are required. In that case, use collect() to get the
   * whole RDD instead.
   *
   * @note this method should only be used if the resulting array is expected to be small, as
   * all the data is loaded into the driver's memory.
   */
  def take(num: Int): JList[T] =
    rdd.take(num).toSeq.asJava
  • saveAsTextFile
    This does not save to a single file. It saves into a directory at the given path and writes one file per partition.
  /**
   * Save this RDD as a text file, using string representations of elements.
   */
  def saveAsTextFile(path: String): Unit = {
    rdd.saveAsTextFile(path)
  }
  • countByKey
    The return value is a hash Map.
  /** Count the number of elements for each key, and return the result to the master as a Map. */
  def countByKey(): java.util.Map[K, jl.Long] =
    mapAsSerializableJavaMap(rdd.countByKey()).asInstanceOf[java.util.Map[K, jl.Long]]
  • foreach
  /**
   * Applies a function f to all elements of this RDD.
   */
  def foreach(f: VoidFunction[T]) {
    rdd.foreach(x => f.call(x))
  }

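Other RDD types

Diagram of the class hierarchy inherited from JavaRDDLike

A brief introduction follows.

  • JavaDoubleRDD
Judging from this small sample, this RDD is mainly responsible for statistics-related functions.

Shared variables

Because a Spark application is serialized and then distributed to the Worker nodes to run, special mechanisms are needed for every worker node to obtain a valid, consistent value of a global variable.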
