Spark Core Source Code Close Reading Plan #16: Dynamic Executor Allocation with ExecutorAllocationManager


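Preface

Following the order of SparkContext initialization, the next components to appear are the three pillars of the scheduling system: SchedulerBackend, TaskScheduler, and DAGScheduler. They involve far too many details to cover in one or two articles, so we will come back to them in depth when we discuss how Spark jobs are executed. This article covers the second-to-last component initialized in SparkContext: the Executor allocation manager, ExecutorAllocationManager. As mentioned earlier, ExecutorAllocationManager talks to the cluster manager and dynamically adds or removes Executors according to the current load, which makes it a fairly intelligent mechanism.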

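Initialization

The code that initializes ExecutorAllocationManager in SparkContext is shown in Code #2.13, so here we only walk through the initialization flow:

  1. Decide whether to enable dynamic Executor allocation. It is enabled when spark.dynamicAllocation.enabled is true and at least one of the following holds: spark.dynamicAllocation.testing is true, or the application is not running in local mode. (To save space, s.d is used below as shorthand for the spark.dynamicAllocation prefix.)
  2. Check whether the SchedulerBackend implementation extends the ExecutorAllocationClient trait; currently only CoarseGrainedSchedulerBackend does. If so, construct an ExecutorAllocationManager from the SchedulerBackend, ListenerBus, SparkConf, and BlockManagerMaster instances.
  3. Call ExecutorAllocationManager.start() to start it.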
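
The check in step 1 can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not Spark's own code: the object name DynamicAllocationCheck is hypothetical, the configuration is modeled as a plain Map rather than a SparkConf, and local mode is approximated by a master URL that starts with "local".

```scala
object DynamicAllocationCheck {
  // Hypothetical helper: the configuration is a plain Map, not a SparkConf.
  def isEnabled(conf: Map[String, String]): Boolean = {
    val enabled = conf.getOrElse("spark.dynamicAllocation.enabled", "false").toBoolean
    val testing = conf.getOrElse("spark.dynamicAllocation.testing", "false").toBoolean
    // Assumption: a master URL beginning with "local" means local mode.
    val isLocal = conf.getOrElse("spark.master", "").startsWith("local")
    // Enabled flag must be on, and we must be either testing or on a real cluster.
    enabled && (testing || !isLocal)
  }
}
```

With these assumptions, a configuration with the flag on and a "yarn" master passes the check, while the same flag with a "local[4]" master passes only when the testing flag is also set.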

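Member fields of the ExecutorAllocationManager class

ExecutorAllocationManager has a large number of member fields; the listing below shows almost all of them.

Code #16.1 - Member fields of the o.a.s.ExecutorAllocationManager class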

  private val minNumExecutors = conf.get(DYN_ALLOCATION_MIN_EXECUTORS)
  private val maxNumExecutors = conf.get(DYN_ALLOCATION_MAX_EXECUTORS)
  private val initialNumExecutors = Utils.getDynamicAllocationInitialExecutors(conf)

  private val schedulerBacklogTimeoutS = conf.getTimeAsSeconds(
    "spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
  private val sustainedSchedulerBacklogTimeoutS = conf.getTimeAsSeconds(
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", s"${schedulerBacklogTimeoutS}s")
  private val executorIdleTimeoutS = conf.getTimeAsSeconds(
    "spark.dynamicAllocation.executorIdleTimeout", "60s")
  private val cachedExecutorIdleTimeoutS = conf.getTimeAsSeconds(
    "spark.dynamicAllocation.cachedExecutorIdleTimeout", s"${Integer.MAX_VALUE}s")

  private val tasksPerExecutor =
    conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)

  private var numExecutorsToAdd = 1
  private var numExecutorsTarget = initialNumExecutors

  private val executorsPendingToRemove = new mutable.HashSet[String]
  private val executorIds = new mutable.HashSet[String]

  private var addTime: Long = NOT_SET
  private val removeTimes = new mutable.HashMap[String, Long]

  val listener = new ExecutorAllocationListener
  private val executor =
    ThreadUtils.newDaemonSingleThreadScheduledExecutor("spark-dynamic-executor-allocation")
  private var localityAwareTasks = 0
  private var hostToLocalTaskCount: Map[String, Int] = Map.empty

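Below we describe these fields by category.

Executor-related configuration

  • minNumExecutors/maxNumExecutors: correspond to s.d.minExecutors and s.d.maxExecutors, the lower and upper bounds on the number of Executors during dynamic allocation. The defaults are 0 and Int.MaxValue (effectively unbounded).
  • initialNumExecutors: the initial number of Executors, determined by Utils.getDynamicAllocationInitialExecutors(). Its value is the largest of s.d.minExecutors, s.d.initialExecutors, and spark.executor.instances.
  • tasksPerExecutor: the approximate number of Tasks one Executor can run at a time, computed as spark.executor.cores divided by spark.task.cpus (integer division).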
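
The two derived values above come down to simple arithmetic. The following is a minimal sketch with a hypothetical object name (ExecutorSizing) and plain Int parameters standing in for the configuration lookups; it is not the Spark implementation itself.

```scala
object ExecutorSizing {
  // Initial executor count: the largest of the three settings described above.
  def initialExecutors(minExecutors: Int, initialSetting: Int, executorInstances: Int): Int =
    Seq(minExecutors, initialSetting, executorInstances).max

  // Integer division, so leftover cores that cannot fit a whole task are ignored.
  def tasksPerExecutor(executorCores: Int, taskCpus: Int): Int =
    executorCores / taskCpus
}
```

For example, with 5 cores per Executor and 2 CPUs per Task, each Executor is counted as having 2 Task slots; the fifth core does not contribute.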

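Timeout configuration

  • schedulerBacklogTimeoutS: set by s.d.schedulerBacklogTimeout. Once Tasks have been waiting for longer than this, dynamic allocation starts requesting Executors. The default is 1s.
  • sustainedSchedulerBacklogTimeoutS: set by s.d.sustainedSchedulerBacklogTimeout. While the backlog persists after the first request, this is the interval between subsequent requests. It defaults to the value of schedulerBacklogTimeoutS.
  • executorIdleTimeoutS: set by s.d.executorIdleTimeout. It is how long an Executor may stay idle (running no Tasks) before it is removed. The default is 60s.
  • cachedExecutorIdleTimeoutS: set by s.d.cachedExecutorIdleTimeout. It is the idle timeout for Executors that hold cached blocks. Because cached data should not be discarded casually, the default is Integer.MAX_VALUE seconds, which is effectively infinite.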
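
To make the interplay of the two backlog timeouts concrete, the sketch below computes when successive add requests would fire if a backlog started at a given time and never cleared. It is an illustration only: the object and method names are hypothetical, and it ignores the polling granularity of the real scheduler loop.

```scala
object BacklogTimers {
  // Timestamps (ms) at which the first `rounds` add requests would fire,
  // assuming the backlog starts at backlogStartMs and never clears.
  def addTriggerTimes(backlogStartMs: Long, backlogTimeoutS: Long,
      sustainedTimeoutS: Long, rounds: Int): Seq[Long] = {
    // The first request waits for the initial backlog timeout.
    val first = backlogStartMs + backlogTimeoutS * 1000
    // Every later request waits one sustained timeout after the previous one.
    (0 until rounds).map(i => first + i * sustainedTimeoutS * 1000)
  }
}
```

With the default 1s for both timeouts and a backlog starting at time 0, requests would fire at roughly 1000 ms, 2000 ms, 3000 ms, and so on.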

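Counters, caches, and others

  • numExecutorsToAdd: the number of Executors to add in the next round of dynamic allocation.
  • numExecutorsTarget: the target number of Executors at the current moment. Keeping this count makes it possible to quickly re-request the needed number if a large number of Executors is suddenly lost.
  • executorsPendingToRemove: the IDs of Executors that are about to be removed but have not been killed yet.
  • executorIds: the IDs of all currently known Executors.
  • addTime: the timestamp at which the next Executor addition is triggered, or NOT_SET when no addition is scheduled.
  • removeTimes: a map from Executor ID to the timestamp at which that Executor is due to be removed.
  • listener: a listener of type ExecutorAllocationListener that watches Executor-related events, including Stage and Task submission and completion, Executor addition and removal, and so on.
  • executor: a single-threaded scheduled thread pool that runs the periodic check-and-allocate task.
  • localityAwareTasks: the number of Tasks, across all currently active Stages, that have a locality preference (that is, they prefer to run on the node where their data lives).
  • hostToLocalTaskCount: the approximate number of Tasks that may run on each host. It is passed to the cluster manager together with localityAwareTasks.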

ExecutorAllocationManager implementation details

Startup

During initialization, ExecutorAllocationManager.start() has already been called. Here is the code of that method.

Code #16.2 - The o.a.s.ExecutorAllocationManager.start() method

  def start(): Unit = {
    listenerBus.addToManagementQueue(listener)
    val scheduleTask = new Runnable() {
      override def run(): Unit = {
        try {
          schedule()
        } catch {
          case ct: ControlThrowable =>
            throw ct
          case t: Throwable =>
            logWarning(s"Uncaught exception in thread ${Thread.currentThread().getName}", t)
        }
      }
    }
    executor.scheduleWithFixedDelay(scheduleTask, 0, intervalMillis, TimeUnit.MILLISECONDS)
    client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)
  }

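As shown, when ExecutorAllocationManager starts it first registers the ExecutorAllocationListener with the LiveListenerBus. It then creates a task that runs schedule() and has the scheduled thread pool executor run it repeatedly, with a delay of intervalMillis (100 ms) between runs. Any exception other than a ControlThrowable is caught and logged inside the task, so a single failed run does not stop the periodic schedule. Finally, it calls requestTotalExecutors() on the ExecutorAllocationClient (in practice the CoarseGrainedSchedulerBackend) to request Executors; that method will be covered later when we discuss SchedulerBackend.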
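
The try/catch inside the Runnable matters because a ScheduledExecutorService suppresses all later runs of a periodic task once one run throws. The following is a minimal standalone sketch of that pattern using only java.util.concurrent; PeriodicDemo and its method are hypothetical names, and it uses NonFatal as a simplification of the ControlThrowable/Throwable split in the Spark code.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger
import scala.util.control.NonFatal

object PeriodicDemo {
  // Runs a task that fails on its first tick and reports whether it still
  // reached three ticks, which is only possible if the failure is caught.
  def runThreeTicks(): Boolean = {
    val pool = Executors.newSingleThreadScheduledExecutor()
    val ticks = new AtomicInteger(0)
    val done = new CountDownLatch(3)
    val task = new Runnable {
      override def run(): Unit = {
        try {
          if (ticks.incrementAndGet() == 1) throw new IllegalStateException("simulated failure")
        } catch {
          case NonFatal(_) => () // a real implementation would log the error here
        } finally {
          done.countDown()
        }
      }
    }
    pool.scheduleWithFixedDelay(task, 0L, 10L, TimeUnit.MILLISECONDS)
    // Wait for three ticks, then always stop the pool.
    try done.await(5, TimeUnit.SECONDS) finally pool.shutdownNow()
  }
}
```

If the catch clause were removed, the first tick would propagate its exception, the executor would cancel the remaining runs, and the latch would never reach zero.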

Let's look at the schedule() method first.

Scheduling the dynamic adjustment

Code #16.3 - The o.a.s.ExecutorAllocationManager.schedule() method

  private def schedule(): Unit = synchronized {
    val now = clock.getTimeMillis
    updateAndSyncNumExecutorsTarget(now)

    val executorIdsToBeRemoved = ArrayBuffer[String]()
    removeTimes.retain { case (executorId, expireTime) =>
      val expired = now >= expireTime
      if (expired) {
        initializing = false
        executorIdsToBeRemoved += executorId
      }
      !expired
    }
    if (executorIdsToBeRemoved.nonEmpty) {
      removeExecutors(executorIdsToBeRemoved)
    }
  }

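As shown, schedule() does two things: it calls updateAndSyncNumExecutorsTarget() to recompute and sync the number of Executors currently needed, and it calls removeExecutors() to remove the Executors whose idle timers have expired. Note that the initializing flag is cleared as soon as any Executor's timer expires.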
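
The expiry step can be illustrated in isolation. The sketch below is a simplified stand-in, with a hypothetical object name, that drains expired entries from a removeTimes-style map and returns their IDs. It avoids retain, which is deprecated in Scala 2.13 in favor of filterInPlace, so it compiles cleanly on both older and newer Scala versions.

```scala
import scala.collection.mutable

object IdleExpiry {
  // Removes every entry whose deadline has passed and returns the expired IDs.
  def drainExpired(removeTimes: mutable.HashMap[String, Long], now: Long): Seq[String] = {
    // Same comparison as in schedule(): an entry expires once now >= expireTime.
    val expired = removeTimes.collect { case (id, expireTime) if now >= expireTime => id }.toList
    removeTimes --= expired
    expired
  }
}
```

The returned order follows the hash map's iteration order, so callers should not rely on it.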

Recomputing the number of Executors

Below is the source of updateAndSyncNumExecutorsTarget() and of maxNumExecutorsNeeded(), which it calls.

Code #16.4 - The o.a.s.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget() method

  private def updateAndSyncNumExecutorsTarget(now: Long): Int = synchronized {
    val maxNeeded = maxNumExecutorsNeeded

    if (initializing) {
      0
    } else if (maxNeeded < numExecutorsTarget) {
      val oldNumExecutorsTarget = numExecutorsTarget
      numExecutorsTarget = math.max(maxNeeded, minNumExecutors)
      numExecutorsToAdd = 1

      if (numExecutorsTarget < oldNumExecutorsTarget) {
        client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)
        logDebug(s"Lowering target number of executors to $numExecutorsTarget (previously " +
          s"$oldNumExecutorsTarget) because not all requested executors are actually needed")
      }
      numExecutorsTarget - oldNumExecutorsTarget
    } else if (addTime != NOT_SET && now >= addTime) {
      val delta = addExecutors(maxNeeded)
      logDebug(s"Starting timer to add more executors (to " +
        s"expire in $sustainedSchedulerBacklogTimeoutS seconds)")
      addTime = now + (sustainedSchedulerBacklogTimeoutS * 1000)
      delta
    } else {
      0
    }
  }

  private def maxNumExecutorsNeeded(): Int = {
    val numRunningOrPendingTasks = listener.totalPendingTasks + listener.totalRunningTasks
    (numRunningOrPendingTasks + tasksPerExecutor - 1) / tasksPerExecutor
  }

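The method works as follows:

  1. Call maxNumExecutorsNeeded() to compute maxNeeded, the maximum number of Executors currently needed. It takes the pending and running Task counts from the listener, adds them together, and divides the sum by tasksPerExecutor, rounding up. (Adding tasksPerExecutor - 1 before the integer division is what produces the rounding up.)
  2. If ExecutorAllocationManager is still initializing, return 0 right away. Note that the return value of this method is the change in the number of Executors, not the total.
  3. If maxNeeded is smaller than numExecutorsTarget, that is, the target exceeds the actual need, set numExecutorsTarget to the larger of maxNeeded and minNumExecutors and reset numExecutorsToAdd to 1. If the target actually dropped, call ExecutorAllocationClient.requestTotalExecutors(), which tells the cluster manager to cancel Executor requests that have not been fulfilled yet and to stop adding new ones. The method returns the (non-positive) change in the target.
  4. Otherwise, if addTime is set and the current timestamp has reached it, call addExecutors(). This asks the cluster manager for more Executors, then pushes addTime forward by sustainedSchedulerBacklogTimeoutS and returns the number of Executors added.
  5. In every other case, return 0.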
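
The arithmetic in steps 1 and 3 is easy to check by hand. Below is a minimal sketch that lifts the two formulas out of the class into pure functions; the object name TargetMath and the parameter lists are hypothetical, while the expressions themselves mirror Code #16.4.

```scala
object TargetMath {
  // Ceiling division: the smallest executor count whose task slots cover all tasks.
  def maxNumExecutorsNeeded(pendingTasks: Int, runningTasks: Int, tasksPerExecutor: Int): Int =
    (pendingTasks + runningTasks + tasksPerExecutor - 1) / tasksPerExecutor

  // New target when demand drops below the current target; never below the minimum.
  def loweredTarget(maxNeeded: Int, minNumExecutors: Int): Int =
    math.max(maxNeeded, minNumExecutors)
}
```

For instance, 7 pending plus 3 running Tasks with 4 slots per Executor gives (10 + 3) / 4 = 3 Executors, whereas exactly 8 Tasks gives (8 + 3) / 4 = 2.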

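Next we look at the logic for removing and adding Executors in turn, starting with removal.

Removing Executors

Below is the callback that runs when an Executor becomes idle; it is invoked by the listener. To keep this article short we do not go into the details of ExecutorAllocationListener. It is not hard to follow, and readers can consult the source themselves.

Code #16.5 - The o.a.s.ExecutorAllocationManager.onExecutorIdle() method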

  private def onExecutorIdle(executorId: String): Unit = synchronized {
    if (executorIds.contains(executorId)) {
      if (!removeTimes.contains(executorId) && !executorsPendingToRemove.contains(executorId)) {
        val hasCachedBlocks = blockManagerMaster.hasCachedBlocks(executorId)
        val now = clock.getTimeMillis()
        val timeout = {
          if (hasCachedBlocks) {
            now + cachedExecutorIdleTimeoutS * 1000
          } else {
            now + executorIdleTimeoutS * 1000
          }
        }
        val realTimeout = if (timeout <= 0) Long.MaxValue else timeout // overflow
        removeTimes(executorId) = realTimeout
        logDebug(s"Starting idle timer for $executorId because there are no more tasks " +
          s"scheduled to run on the executor (to expire in ${(realTimeout - now)/1000} seconds)")
      }
    } else {
      logWarning(s"Attempted to mark unknown executor $executorId idle")
    }
  }

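The method first checks that the Executor is known (present in executorIds) and that its ID is in neither removeTimes nor executorsPendingToRemove, then checks whether the Executor holds cached blocks. If it does, the deadline is computed from cachedExecutorIdleTimeoutS, which is effectively infinite by default; otherwise the normal executorIdleTimeoutS is used. A deadline that overflows to a non-positive value is clamped to Long.MaxValue. Finally, the Executor ID and the timestamp at which it is due for removal are stored in the removeTimes map. Next comes removeExecutors(), which is called from Code #16.3.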
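
The overflow guard is worth a closer look, since a very large configured timeout can wrap the sum around to a negative number. The sketch below isolates that computation; the object and method names are hypothetical, and the expression follows Code #16.5.

```scala
object IdleDeadline {
  // Deadline in ms for an idle executor; a wrapped (non-positive) sum is
  // clamped to Long.MaxValue, which in effect means "never expires".
  def deadline(nowMs: Long, timeoutS: Long): Long = {
    val timeout = nowMs + timeoutS * 1000
    if (timeout <= 0) Long.MaxValue else timeout
  }
}
```

With the default cached-block timeout of Integer.MAX_VALUE seconds the sum still fits in a Long, so the clamp only comes into play for much larger user-supplied values.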

Code #16.6 - The o.a.s.ExecutorAllocationManager.removeExecutors() method

  private def removeExecutors(executors: Seq[String]): Seq[String] = synchronized {
    val executorIdsToBeRemoved = new ArrayBuffer[String]

    logInfo("Request to remove executorIds: " + executors.mkString(", "))
    val numExistingExecutors = allocationManager.executorIds.size - executorsPendingToRemove.size

    var newExecutorTotal = numExistingExecutors
    executors.foreach { executorIdToBeRemoved =>
      if (newExecutorTotal - 1 < minNumExecutors) {
        logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " +
          s"$newExecutorTotal executor(s) left (minimum number of executor limit $minNumExecutors)")
      } else if (newExecutorTotal - 1 < numExecutorsTarget) {
        logDebug(s"Not removing idle executor $executorIdToBeRemoved because there are only " +
          s"$newExecutorTotal executor(s) left (number of executor target $numExecutorsTarget)")
      } else if (canBeKilled(executorIdToBeRemoved)) {
        executorIdsToBeRemoved += executorIdToBeRemoved
        newExecutorTotal -= 1
      }
    }
    if (executorIdsToBeRemoved.isEmpty) {
      return Seq.empty[String]
    }

    val executorsRemoved = if (testing) {
      executorIdsToBeRemoved
    } else {
      client.killExecutors(executorIdsToBeRemoved, adjustTargetNumExecutors = false,
        countFailures = false, force = false)
    }

    client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)
    newExecutorTotal = numExistingExecutors
    if (testing || executorsRemoved.nonEmpty) {
      executorsRemoved.foreach { removedExecutorId =>
        newExecutorTotal -= 1
        logInfo(s"Removing executor $removedExecutorId because it has been idle for " +
          s"$executorIdleTimeoutS seconds (new desired total will be $newExecutorTotal)")
        executorsPendingToRemove.add(removedExecutorId)
      }
      executorsRemoved
    } else {
      logWarning(s"Unable to reach the cluster manager to kill executor/s " +
        s"${executorIdsToBeRemoved.mkString(",")} or no executor eligible to kill!")
      Seq.empty[String]
    }
  }

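The method works as follows:

  1. Compute the number of Executors that currently exist, excluding those already pending removal.
  2. Iterate over the list of Executor IDs to remove. If removing an Executor would leave fewer Executors than the minimum allowed, or fewer than the target number, that Executor is kept. Otherwise, if canBeKilled() confirms that the Executor is in executorIds and not already in executorsPendingToRemove, it is marked as removable.
  3. Call ExecutorAllocationClient.killExecutors() to actually kill the Executors (this call is skipped in testing mode). Then call requestTotalExecutors() to resync the target number with the cluster manager.
  4. Add the Executors that the kill call reported as being removed to executorsPendingToRemove, where they wait for the removal to complete. If none could be killed, log a warning and return an empty list.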
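
Step 2 is the heart of the method, and it can be sketched as a pure function. The version below is a simplified illustration with hypothetical names; it leaves out the canBeKilled() check and all interaction with the cluster manager, and keeps only the minimum/target guard from Code #16.6.

```scala
object RemovalFilter {
  // Picks, in order, the candidates that can be removed without dropping
  // below either the minimum or the current target number of executors.
  def selectRemovable(candidates: Seq[String], existing: Int,
      minNumExecutors: Int, numExecutorsTarget: Int): Seq[String] = {
    val floor = math.max(minNumExecutors, numExecutorsTarget)
    var remaining = existing
    val selected = List.newBuilder[String]
    candidates.foreach { id =>
      // Only remove if at least `floor` executors would be left afterwards.
      if (remaining - 1 >= floor) {
        selected += id
        remaining -= 1
      }
    }
    selected.result()
  }
}
```

With 5 existing Executors, a minimum of 1, and a target of 3, only the first two of three idle candidates are selected; removing the third would leave 2, which is below the target.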

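Finally, the listener invokes onExecutorRemoved(), the callback for when an Executor has been removed. It mainly cleans up the various caches; the logic is simple, so we will not go into it.

Adding Executors

Last, let's look at addExecutors(), which is called from Code #16.4.

Code #16.7 - The o.a.s.ExecutorAllocationManager.addExecutors() method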

  private def addExecutors(maxNumExecutorsNeeded: Int): Int = {
    if (numExecutorsTarget >= maxNumExecutors) {
      logDebug(s"Not adding executors because our current target total " +
        s"is already $numExecutorsTarget (limit $maxNumExecutors)")
      numExecutorsToAdd = 1
      return 0
    }
    val oldNumExecutorsTarget = numExecutorsTarget
    numExecutorsTarget = math.max(numExecutorsTarget, executorIds.size)
    numExecutorsTarget += numExecutorsToAdd
    numExecutorsTarget = math.min(numExecutorsTarget, maxNumExecutorsNeeded)
    numExecutorsTarget = math.max(math.min(numExecutorsTarget, maxNumExecutors), minNumExecutors)

    val delta = numExecutorsTarget - oldNumExecutorsTarget
    if (delta == 0) {
      if (listener.pendingTasks == 0 && listener.pendingSpeculativeTasks > 0) {
        numExecutorsTarget =
          math.max(math.min(maxNumExecutorsNeeded + 1, maxNumExecutors), minNumExecutors)
      } else {
        numExecutorsToAdd = 1
        return 0
      }
    }

    val addRequestAcknowledged = try {
      testing ||
        client.requestTotalExecutors(numExecutorsTarget, localityAwareTasks, hostToLocalTaskCount)
    } catch {
      case NonFatal(e) =>
        logInfo("Error reaching cluster manager.", e)
        false
    }
    if (addRequestAcknowledged) {
      val executorsString = "executor" + { if (delta > 1) "s" else "" }
      logInfo(s"Requesting $delta new $executorsString because tasks are backlogged" +
        s" (new desired total will be $numExecutorsTarget)")
      numExecutorsToAdd = if (delta == numExecutorsToAdd) {
        numExecutorsToAdd * 2
      } else {
        1
      }
      delta
    } else {
      logWarning(
        s"Unable to reach the cluster manager to request $numExecutorsTarget total executors!")
      numExecutorsTarget = oldNumExecutorsTarget
      0
    }
  }

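The logic that computes the Executor target numExecutorsTarget is tedious to describe in words, so I will not go further; reading the code once gives the general idea. The one thing that deserves special attention is numExecutorsToAdd * 2. It means the number of new Executors requested grows exponentially from one round to the next, as long as the target rose by the full numExecutorsToAdd in the previous round (delta == numExecutorsToAdd); otherwise it resets to 1. Why this strategy? In practice, most applications need only a few Executors at startup, so requests start small. Once resources become tight, however, exponential growth reduces the number of rounds needed to acquire enough of them.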
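
The ramp-up is easiest to see by simulating a few rounds. The following is a simplified model of addExecutors(), not the real method: the names are hypothetical, every request is assumed to be acknowledged, and the executorIds.size adjustment and the speculative-task branch are left out. It also assumes demand is at least the current target, which is the only situation in which Code #16.4 calls addExecutors().

```scala
object RampUp {
  // One simplified round: returns (newTarget, newNumExecutorsToAdd).
  def step(target: Int, toAdd: Int, maxNeeded: Int, minExec: Int, maxExec: Int): (Int, Int) =
    if (target >= maxExec) {
      (target, 1) // already at the configured upper limit
    } else {
      val raised = math.min(target + toAdd, maxNeeded)
      val newTarget = math.max(math.min(raised, maxExec), minExec)
      val delta = newTarget - target
      if (delta == 0) (target, 1)
      // Double the step only when the full step was applied; otherwise reset.
      else (newTarget, if (delta == toAdd) toAdd * 2 else 1)
    }

  // Targets after each of `rounds` consecutive rounds with constant demand.
  def targets(start: Int, maxNeeded: Int, minExec: Int, maxExec: Int, rounds: Int): Seq[Int] = {
    val states = Iterator.iterate((start, 1)) { case (t, a) => step(t, a, maxNeeded, minExec, maxExec) }
    states.drop(1).take(rounds).map(_._1).toList
  }
}
```

Starting from 0 with a constant demand of 20 Executors, the target moves through 1, 3, 7, 15 and then 20: the step doubles each round (1, 2, 4, 8) until demand caps it, so the target is reached in five rounds instead of twenty.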

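Summary

Before I started writing this article, I had assumed that the logic of ExecutorAllocationManager would not be this complicated. Clearly I was too naive.

I'm really tired now and am going to get some rest; I still have to get up at 3 a.m. to watch the Champions League final.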

YNWA!
