Spark03

Operator:

  • Transformation: lazily evaluated, so calling one does not run anything by itself; RDD => RDD: map, filter, flatMap, sample, groupByKey, reduceByKey, union, join, cogroup, crossProduct, mapValues, sort, partitionBy
  • Action: triggers execution, one job per action; RDD => other values: count, collect, reduce, lookup, save

A Spark application is made up of multiple jobs: every action produces exactly one job.
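A minimal sketch of the difference, assuming an existing SparkContext named sc (the names are illustrative):

val nums = sc.parallelize(1 to 100)   // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)   // transformation: lazy, nothing runs yet
val total = evens.count()             // action: triggers job #1
val sample = evens.take(5)            // action: triggers job #2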

Persistence

A flawed version: errors is recomputed from scratch, so the file is read and filtered twice, once per action:

val lines = sc.textFile("hdfs:// ... ")
val errors = lines.filter(_.startsWith("ERROR"))
val mysql_errors = errors.filter(_.contains("mysql")).count()   // first pass over the file
val http_errors = errors.filter(_.contains("http")).count()     // second pass: errors recomputed
  • cache
def cache() : this.type = persist()
  • persist
def persist() : this.type = persist(StorageLevel.MEMORY_ONLY)
  • checkpoint: writes the RDD to reliable storage and truncates its lineage (see the sketch below)
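A minimal fix for the flawed example, assuming the same sc: caching errors means HDFS is scanned and the ERROR filter applied only once. The optional checkpoint lines show how to also truncate the lineage; both paths are placeholders.

val lines = sc.textFile("hdfs:// ... ")
val errors = lines.filter(_.startsWith("ERROR")).cache()        // materialized by the first action
val mysql_errors = errors.filter(_.contains("mysql")).count()   // computes and caches errors
val http_errors = errors.filter(_.contains("http")).count()     // reuses the cached partitions

sc.setCheckpointDir("hdfs:// ... ")   // checkpoint requires a reliable directory
errors.checkpoint()                   // saved during the next job; lineage then truncated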

Submit

Scala / PySpark

Options from ./bin/spark-submit --help:

  • --deploy-mode
  • --master (Cluster Managers)
    1. Yarn
    2. Mesos
    3. Standalone
    4. Kubernetes
    5. local
  • Driver & Executor resources (Cores & Memory)
    1. --driver-memory
    2. --driver-cores
    3. --num-executors
    4. --executor-memory
    5. --executor-cores
    6. --total-executor-cores (Standalone and Mesos only)
./bin/spark2-submit \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 8g \
   --executor-memory 16g \
   --executor-cores 2  \
   --class org.apache.spark.examples.SparkPi \
   /spark-home/examples/jars/spark-examples_versionxx.jar 80
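Since this section also covers PySpark: the same flags apply to a Python application, with a .py file in place of the jar and no --class (a sketch; wordcount.py and its argument are hypothetical):

./bin/spark-submit \
   --master yarn \
   --deploy-mode cluster \
   --driver-memory 8g \
   --executor-memory 16g \
   --executor-cores 2 \
   wordcount.py /input/path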

Java

--jars xxx: comma-separated runtime dependencies, added to the driver and executor classpaths, e.g.:

./bin/spark-submit \
--master yarn \
--deploy-mode cluster \
--conf "spark.sql.shuffle.partitions=20000" \
--jars "dependency1.jar,dependency2.jar" \
--class com.sparkbyexamples.WordCountExample \
spark-by-examples.jar

Driver, Executor

  • Driver: creates the SparkConf and SparkContext, builds the lineage, and schedules jobs
  • Executor: runs the operator logic (tasks) against its data partitions
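A minimal word-count sketch of the split, assuming only the standard Spark API (the input path is a placeholder): the marked lines run in the driver JVM, while the closures passed to the operators are serialized and executed inside executors.

import org.apache.spark.{SparkConf, SparkContext}

object DriverVsExecutor {
  def main(args: Array[String]): Unit = {
    // driver side: configuration, context creation, job scheduling
    val conf = new SparkConf().setAppName("DriverVsExecutor")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:// ... ")   // driver records the lineage
      .flatMap(_.split(" "))                   // executor side: these closures are
      .map(word => (word, 1))                  // shipped to executors and applied
      .reduceByKey(_ + _)                      // partition by partition

    counts.take(10).foreach(println)           // results are collected back to the driver
    sc.stop()
  }
}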

How it works:

When an action is called, the driver turns the RDD lineage into a DAG, the DAGScheduler splits the DAG into stages at shuffle boundaries, and the TaskScheduler ships the resulting tasks to executors.

(figure: process)
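One way to see this stage split, reusing counts from the driver/executor sketch above: toDebugString prints the indented lineage, and the shuffle introduced by reduceByKey is where a new stage begins.

println(counts.toDebugString)   // indented lineage; the ShuffledRDD line marks the stage boundary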
