Spark-Core源码精读(12)、Task的提交流程分析

上一篇文章我们谈到，DAGScheduler将Job划分成由Stage组成的DAG后，就根据Stage的具体类型来生成ShuffleMapTask和ResultTask，然后使用TaskSet对其进行封装，最后调用TaskScheduler的submitTasks方法提交具体的TaskSet，而实际上是调用的TaskSchedulerImpl的submitTasks方法，下面我们就来分析具体Tasks提交的过程。

直接进入TaskSchedulerImpl的submitTasks方法：

override def submitTasks(taskSet: TaskSet) {
  val tasks = taskSet.tasks
  logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
  // 使用synchronized来确保Task的同步提交
  this.synchronized {
    // 首先创建一个TaskSetManager，主要负责调度TaskSet中的Tasks
    // 默认的最大失败重试次数是4次，可以通过spark.task.maxFailures进行配置
    val manager = createTaskSetManager(taskSet, maxTaskFailures)
    val stage = taskSet.stageId
    val stageTaskSets =
      taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
    stageTaskSets(taskSet.stageAttemptId) = manager
    val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
      ts.taskSet != taskSet && !ts.isZombie
    }
    if (conflictingTaskSet) {
      throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
        s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
    }
    // 将TaskSetManager及TaskSet的属性信息添加到schedulableBuilder中
    // SchedulableBuilder会确定TaskSetManager的调度顺序是FIFO还是FAIR，默认是FIFO
    schedulableBuilder.addTaskSetManager(manager, manager.taskSet.properties)
    if (!isLocal && !hasReceivedTask) {
      starvationTimer.scheduleAtFixedRate(new TimerTask() {
        override def run() {
          if (!hasLaunchedTask) {
            logWarning("Initial job has not accepted any resources; " +
              "check your cluster UI to ensure that workers are registered " +
              "and have sufficient resources")
          } else {
            this.cancel()
          }
        }
      }, STARVATION_TIMEOUT_MS, STARVATION_TIMEOUT_MS)
    }
    hasReceivedTask = true
  }
  backend.reviveOffers()
}

进入backend的reviveOffers()方法：

override def reviveOffers() {
  driverEndpoint.send(ReviveOffers)
}

可以看到是给DriverEndpoint发送了一条ReviveOffers消息(我们在前面分析DriverEndpoint注册的时候就给自己发送了一条ReviveOffers消息)，DriverEndpoint接收到该消息后的处理如下：

case ReviveOffers =>
  makeOffers()

接下来看makeOffers()：

private def makeOffers() {
  // Filter out executors under killing
  // 过滤出Alive的Executors
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  // 构建由WorkerOffer组成的集合，WorkerOffer就代表Executor上可用的计算资源
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  // 将Tasks提交到Executors上
  launchTasks(scheduler.resourceOffers(workOffers))
}

在makeOffers()方法中首先准备好可以用于计算的workOffers(代表所有可用的ExecutorBackend中可以使用的cores的信息)。

在launchTasks之前，我们重点看一下scheduler.resourceOffers(workOffers)都做了什么工作：

def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // Mark each slave as alive and remember its hostname
  // Also track if new executor is added
  // 标记slave为alive并记录hostname的信息
  var newExecAvail = false
  for (o <- offers) {
    executorIdToHost(o.executorId) = o.host
    executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
    if (!executorsByHost.contains(o.host)) {
      executorsByHost(o.host) = new HashSet[String]()
      executorAdded(o.executorId, o.host)
      newExecAvail = true
    }
    // 跨机架的情况，我们这里不做考虑
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }
  // 将offers打乱，为的就是负载均衡
  // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
  val shuffledOffers = Random.shuffle(offers)
  // Build a list of tasks to assign to each worker.
  // 这里我们清楚的看见每个Executor上的Tasks的个数取决于该Executor上可用的cores的个数
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
  // 每个executor上可用的cores的个数组成的数组
  val availableCpus = shuffledOffers.map(o => o.cores).toArray
  val sortedTaskSets = rootPool.getSortedTaskSetQueue
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) {
      taskSet.executorAdded()
    }
  }
  // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
  // of locality levels so that it gets a chance to launch local tasks on all of them.
  // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
  var launchedTask = false
  for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
    do {
      launchedTask = resourceOfferSingleTaskSet(
          taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
    } while (launchedTask)
  }
  if (tasks.size > 0) {
    hasLaunchedTask = true
  }
  return tasks
}

可以看到TaskSchedulerImpl的resourceOffers方法的主要作用就是为每个Task具体分配计算资源，输入的就是可用的资源，输出的是由TaskDescription组成的二维数组(Seq[Seq[TaskDescription]])，TaskDescription中保存了executorId，即每个Task具体运行在哪个ExecutorBackend上，下面我们就具体分析该方法的执行过程：

对可用的资源进行标记并记录hostname，同时判断是否有新的executor加入

即resourceOffers方法的如下部分：

var newExecAvail = false
for (o <- offers) {
  executorIdToHost(o.executorId) = o.host
  executorIdToTaskCount.getOrElseUpdate(o.executorId, 0)
  if (!executorsByHost.contains(o.host)) {
    executorsByHost(o.host) = new HashSet[String]()
    executorAdded(o.executorId, o.host)
    newExecAvail = true
  }
  for (rack <- getRackForHost(o.host)) {
    hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
  }
}

这部分不是我们考虑的重点，所以大家只要知道他的作用就可以了。

将所有可用的计算资源随机打散

对应的源码：

// Randomly shuffle offers to avoid always placing tasks on the same set of workers.
val shuffledOffers = Random.shuffle(offers)

通过Random.shuffle的方法将所有的计算资源重新进行“洗牌”，以追求最大化的负载均衡

根据每个ExecutorBackend的cores的个数声明类型为TaskDescription的ArrayBuffer数组

对应的源码：

val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))

此处只是声明，并没有加入TaskDescription

获得根据具体的调度策略排序后的TaskSets

对应的源码：

val sortedTaskSets = rootPool.getSortedTaskSetQueue

调度策略即SchedulableBuilder，具体的实现分为FIFOSchedulableBuilder和FairSchedulableBuilder，而默认使用的就是FIFO的调度策略。

如果有新的Executor加入，此时会调用TaskSet的executorAdded方法来获取最新的完整的可用计算资源

对应的源码：

for (taskSet <- sortedTaskSets) {
  logDebug("parentName: %s, name: %s, runningTasks: %s".format(
    taskSet.parent.name, taskSet.name, taskSet.runningTasks))
  if (newExecAvail) {
    taskSet.executorAdded()
  }
}

确定最高优先级本地性

var launchedTask = false
for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
  do {
    launchedTask = resourceOfferSingleTaskSet(
        taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
  } while (launchedTask)
}

这里的LocalityLevel从高到低依次为：PROCESS_LOCAL、NODE_LOCAL、NO_PREF、RACK_LOCAL、ANY，其中NO_PREF是指机器的本地性，因为一台机器上有可能有很多node。

下面我们来看这个resourceOfferSingleTaskSet方法：

private def resourceOfferSingleTaskSet(
    taskSet: TaskSetManager,
    maxLocality: TaskLocality,
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int],
    tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
  var launchedTask = false
  for (i <- 0 until shuffledOffers.size) {
    val execId = shuffledOffers(i).executorId
    val host = shuffledOffers(i).host
    if (availableCpus(i) >= CPUS_PER_TASK) {
      try {
        for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
          tasks(i) += task
          val tid = task.taskId
          taskIdToTaskSetManager(tid) = taskSet
          taskIdToExecutorId(tid) = execId
          executorIdToTaskCount(execId) += 1
          executorsByHost(host) += execId
          availableCpus(i) -= CPUS_PER_TASK
          assert(availableCpus(i) >= 0)
          launchedTask = true
        }
      } catch {
        case e: TaskNotSerializableException =>
          logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
          // Do not offer resources for this task, but don't throw an error to allow other
          // task sets to be submitted.
          return launchedTask
      }
    }
  }
  return launchedTask
}

首先循环遍历判断每个Executor上可用的cores的个数是否满足每个Task所需要的CPU的个数，默认CPUS_PER_TASK的个数为1；然后通过调用TaskSetManager的resourceOffer方法最终确定每个Task具体运行在哪个ExecutorBackend的具体的Locality Level：

@throws[TaskNotSerializableException]
def resourceOffer(
    execId: String,
    host: String,
    maxLocality: TaskLocality.TaskLocality)
  : Option[TaskDescription] =
{
  if (!isZombie) {
    val curTime = clock.getTimeMillis()
    var allowedLocality = maxLocality
    if (maxLocality != TaskLocality.NO_PREF) {
      allowedLocality = getAllowedLocalityLevel(curTime)
      if (allowedLocality > maxLocality) {
        // We're not allowed to search for farther-away tasks
        allowedLocality = maxLocality
      }
    }
    dequeueTask(execId, host, allowedLocality) match {
      case Some((index, taskLocality, speculative)) => {
        // Found a task; do some bookkeeping and return a task description
        val task = tasks(index)
        val taskId = sched.newTaskId()
        // Do various bookkeeping
        copiesRunning(index) += 1
        val attemptNum = taskAttempts(index).size
        val info = new TaskInfo(taskId, index, attemptNum, curTime,
          execId, host, taskLocality, speculative)
        taskInfos(taskId) = info
        taskAttempts(index) = info :: taskAttempts(index)
        // Update our locality level for delay scheduling
        // NO_PREF will not affect the variables related to delay scheduling
        if (maxLocality != TaskLocality.NO_PREF) {
          currentLocalityIndex = getLocalityIndex(taskLocality)
          lastLaunchTime = curTime
        }
        // Serialize and return the task
        // 将task序列化
        val startTime = clock.getTimeMillis()
        val serializedTask: ByteBuffer = try {
          Task.serializeWithDependencies(task, sched.sc.addedFiles, sched.sc.addedJars, ser)
        } catch {
          // If the task cannot be serialized, then there's no point to re-attempt the task,
          // as it will always fail. So just abort the whole task-set.
          case NonFatal(e) =>
            val msg = s"Failed to serialize task $taskId, not attempting to retry it."
            logError(msg, e)
            abort(s"$msg Exception during serialization: $e")
            throw new TaskNotSerializableException(e)
        }
        // 判断序列化后的大小是否超过了TaskSetManager的限制，默认是100k
        if (serializedTask.limit > TaskSetManager.TASK_SIZE_TO_WARN_KB * 1024 &&
            !emittedTaskSizeWarning) {
          emittedTaskSizeWarning = true
          logWarning(s"Stage ${task.stageId} contains a task of very large size " +
            s"(${serializedTask.limit / 1024} KB). The maximum recommended task size is " +
            s"${TaskSetManager.TASK_SIZE_TO_WARN_KB} KB.")
        }
        addRunningTask(taskId)
        // We used to log the time it takes to serialize the task, but task size is already
        // a good proxy to task serialization time.
        // val timeTaken = clock.getTime() - startTime
        val taskName = s"task ${info.id} in stage ${taskSet.id}"
        logInfo(s"Starting $taskName (TID $taskId, $host, partition ${task.partitionId}," +
          s"$taskLocality, ${serializedTask.limit} bytes)")
        // 向DAGScheduler汇报Task开始
        sched.dagScheduler.taskStarted(task, info)
        // 返回TaskDescription
        return Some(new TaskDescription(taskId = taskId, attemptNumber = attemptNum, execId,
          taskName, index, serializedTask))
      }
      case _ =>
    }
  }
  None
}

该方法传入的maxLocality就是上面TaskSchedulerImpl的resourceOffers方法中那个for循环中传入的maxLocality，该循环是按照上面提到的Locality Level的级别由高到低进行的，所以最优的计算本地性就是PROCESS_LOCAL，最后返回Some(TaskDescription)给resourceOfferSingleTaskSet方法中，然后的一系列操作是更新数据结构以及从可用的计算资源中减掉刚才Task使用的cores的个数(默认每个Task使用一个core)，最后将分配好的Task(Seq[ArrayBuffer[TaskDescription]])返回给TaskSchedulerImpl的resourceOffers方法，而该方法又将最后的结果(Seq[Seq[TaskDescription]])返回给CoarseGrainedSchedulerBackend的makeOffers方法，最后执行launchTask：

private def makeOffers() {
  // Filter out executors under killing
  val activeExecutors = executorDataMap.filterKeys(executorIsAlive)
  val workOffers = activeExecutors.map { case (id, executorData) =>
    new WorkerOffer(id, executorData.executorHost, executorData.freeCores)
  }.toSeq
  launchTasks(scheduler.resourceOffers(workOffers))
}

这里需要补充一点：我们之前已经分析过DAGScheduler是从数据的层面，也就是RDD的层面考虑的preferredLocation，即DAGScheduler部分已经确定了Task要被发到哪个Executor上运行；而TaskScheduler是从具体计算Task的角度考虑计算的本地性，也就是说具体的计算是发生在内存中还是发生在本地磁盘等等(PROCESS_LOCAL、NODE_LOCAL...)，由此也印证了DAGScheduler负责高层的调度任务，而TaskScheduler负责底层的调度任务。

接下来我们就进入launchTasks方法：

private def launchTasks(tasks: Seq[Seq[TaskDescription]]) {
  for (task <- tasks.flatten) {
    // 注意此处是对TaskDescription进行序列化操作
    val serializedTask = ser.serialize(task)
    // 判断序列化的大小是否超过限制
    if (serializedTask.limit >= akkaFrameSize - AkkaUtils.reservedSizeBytes) {
      scheduler.taskIdToTaskSetManager.get(task.taskId).foreach { taskSetMgr =>
        try {
          var msg = "Serialized task %s:%d was %d bytes, which exceeds max allowed: " +
            "spark.akka.frameSize (%d bytes) - reserved (%d bytes). Consider increasing " +
            "spark.akka.frameSize or using broadcast variables for large values."
          msg = msg.format(task.taskId, task.index, serializedTask.limit, akkaFrameSize,
            AkkaUtils.reservedSizeBytes)
          taskSetMgr.abort(msg)
        } catch {
          case e: Exception => logError("Exception in error callback", e)
        }
      }
    }
    else {
      // 获得该TaskDescription要发送到的那个Executor的信息
      val executorData = executorDataMap(task.executorId)
      // 减掉使用的计算资源，即cores
      executorData.freeCores -= scheduler.CPUS_PER_TASK
      // 将Task发送到具体的Executor上
      executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
    }
  }
}

这里akkaFrameSize的默认大小是128M(Spark 1.6.3版本，可以通过spark.akka.frameSize进行配置)，而AkkaUtils.reservedSizeBytes的大小是200k，也就是说序列化后的大小不能超过128MB-200k，最后如果小于该限制就会将task发送到具体的ExecutorBackend上。

ExecutorBackend(Standalone模式下就是CoarseGrainedExecutorBackend)接收到该消息后就会调用具体的executor的launchTask方法去执行task：

case LaunchTask(data) =>
  if (executor == null) {
    logError("Received LaunchTask command but executor was null")
    System.exit(1)
  } else {
    val taskDesc = ser.deserialize[TaskDescription](data.value)
    logInfo("Got assigned task " + taskDesc.taskId)
    executor.launchTask(this, taskId = taskDesc.taskId, attemptNumber = taskDesc.attemptNumber,
      taskDesc.name, taskDesc.serializedTask)
  }

由此可见真正发送tasks的不是TaskSetManager，而是CoarseGrainedSchedulerBackend，TaskSetManager只是负责监控task的运行。

至此Tasks的提交过程执行完成，下一篇文章我们将继续分析Tasks的运行过程。

本文参照的是Spark 1.6.3版本的源码，同时给出Spark 2.1.0版本的连接：

Spark 1.6.3 源码

Spark 2.1.0 源码

本文为原创，欢迎转载，转载请注明出处、作者，谢谢！

最后编辑于：2017.12.06 03:21:45

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 157,924评论 4赞 360
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 66,902评论 1赞 290
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 107,716评论 0赞 239
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,783评论 0赞 203
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,166评论 3赞 286
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,510评论 1赞 216
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,784评论 2赞 311
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,476评论 0赞 196
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,196评论 1赞 241
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,459评论 2赞 243
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 31,978评论 1赞 258
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,321评论 2赞 252
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 32,964评论 3赞 235
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,046评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,803评论 0赞 193
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,530评论 2赞 271
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,420评论 2赞 265

Spark-Core源码精读(12)、Task的提交流程分析

对可用的资源进行标记并记录hostname，同时判断是否有新的executor加入

将所有可用的计算资源随机打散

根据每个ExecutorBackend的cores的个数声明类型为TaskDescription的ArrayBuffer数组

获得根据具体的调度策略排序后的TaskSets

如果有新的Executor加入，此时会调用TaskSet的executorAdded方法来获取最新的完整的可用计算资源

确定最高优先级本地性

推荐阅读更多精彩内容