k8s scheduling algorithms

  1. Predicate algorithms: filter the nodes
  2. Priority algorithms: score the nodes
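
The two phases can be sketched in a few lines of Go. This is only a minimal illustration, not the scheduler's real code: the `Node`, `Predicate` and `Priority` types and the helpers `fitsCPU` and `mostFreeCPU` are hypothetical stand-ins made up for this example.

```go
package main

import "fmt"

// Node is a hypothetical, simplified stand-in for a cluster node.
type Node struct {
	Name    string
	FreeCPU int64 // free CPU in millicores
}

// Predicate reports whether a pod requesting cpu millicores fits on a node.
type Predicate func(cpu int64, n Node) bool

// Priority returns a score for a node; higher is better.
type Priority func(cpu int64, n Node) int

// fitsCPU is an example predicate: the node must have enough free CPU.
func fitsCPU(cpu int64, n Node) bool { return n.FreeCPU >= cpu }

// mostFreeCPU is an example priority: one point per 1000 millicores left over.
func mostFreeCPU(cpu int64, n Node) int { return int((n.FreeCPU - cpu) / 1000) }

// schedule drops every node that fails a predicate, sums the priority
// scores of the survivors, and returns the name of the best one.
func schedule(cpu int64, nodes []Node, preds []Predicate, prios []Priority) (string, bool) {
	best, bestScore, found := "", 0, false
	for _, n := range nodes {
		fits := true
		for _, p := range preds {
			if !p(cpu, n) {
				fits = false
				break
			}
		}
		if !fits {
			continue
		}
		score := 0
		for _, pr := range prios {
			score += pr(cpu, n)
		}
		if !found || score > bestScore {
			best, bestScore, found = n.Name, score, true
		}
	}
	return best, found
}

func main() {
	nodes := []Node{
		{Name: "a", FreeCPU: 100},
		{Name: "b", FreeCPU: 2000},
		{Name: "c", FreeCPU: 4000},
	}
	name, ok := schedule(500, nodes, []Predicate{fitsCPU}, []Priority{mostFreeCPU})
	fmt.Println(name, ok)
}
```

Node "a" is removed in the filtering phase, and the remaining nodes compete only on score.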

ps. scheduler_algorithm

Predicates

Method signature

func Predicates(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {}

There are 20 predicates in total.

ps. Rather than translate them clumsily, I quote the code comments directly. Source: kubernetes-master\pkg\scheduler\algorithm\predicates\predicates.go

volume

  • NoDiskConflict (important): evaluates if a pod can fit due to the volumes it requests, and those that are already mounted. If there is already a volume mounted on that node, another pod that uses the same volume can't be scheduled there.
  • NewMaxPDVolumeCountPredicate (important): creates a predicate which evaluates whether a pod can fit based on the number of volumes which match a filter that it requests, and those that are already present.
    The predicate looks for both volumes used directly, as well as PVC volumes that are backed by relevant volume types, counts the number of unique volumes, and rejects the new pod if it would place the total count over the maximum.
  • NewVolumeZonePredicate (important): evaluates if a pod can fit due to the volumes it requests, given that some volumes may have zone scheduling constraints. The requirement is that any volume zone-labels must match the equivalent zone-labels on the node. It is OK for the node to have more zone-label constraints (for example, a hypothetical replicated volume might allow region-wide access).
    Currently this is only supported with PersistentVolumeClaims, and looks to the labels only on the bound PersistentVolume.
    Working with volumes declared inline in the pod specification (i.e. not using a PersistentVolume) is likely to be harder, as it would require determining the zone of a volume during scheduling, and that is likely to require calling out to the cloud provider. It seems that we are moving away from inline volume declarations anyway.
  • NewVolumeBindingPredicate: evaluates if a pod can fit due to the volumes it requests, for both bound and unbound PVCs.
    For PVCs that are bound, it checks that the corresponding PV's node affinity is satisfied by the given node.
    For PVCs that are unbound, it tries to find available PVs that can satisfy the PVC requirements and whose node affinity is satisfied by the given node.
    The predicate returns true if all bound PVCs have compatible PVs with the node, and if all unbound PVCs can be matched with an available and node-compatible PV.
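
To make the counting rule of NewMaxPDVolumeCountPredicate concrete, here is a rough sketch. It assumes the volumes have already been filtered down to the relevant type and are identified by plain string IDs; the function name `fitsMaxVolumeCount` is hypothetical and not taken from the source.

```go
package main

import "fmt"

// fitsMaxVolumeCount is a simplified sketch of the "max volume count" idea.
// existing holds the IDs of matching volumes already on the node, requested
// holds the IDs the new pod asks for. Duplicates count only once.
func fitsMaxVolumeCount(existing, requested []string, maxVolumes int) bool {
	onNode := make(map[string]bool, len(existing))
	for _, id := range existing {
		onNode[id] = true
	}
	// Only volumes that are not on the node yet add to the total.
	newVolumes := make(map[string]bool)
	for _, id := range requested {
		if !onNode[id] {
			newVolumes[id] = true
		}
	}
	if len(newVolumes) == 0 {
		return true // the pod adds nothing new, so it cannot exceed the limit
	}
	return len(onNode)+len(newVolumes) <= maxVolumes
}

func main() {
	existing := []string{"vol-a", "vol-b"}
	fmt.Println(fitsMaxVolumeCount(existing, []string{"vol-b", "vol-c"}, 3))
	fmt.Println(fitsMaxVolumeCount(existing, []string{"vol-b", "vol-c"}, 2))
}
```

With a limit of 3 the pod fits, because "vol-b" is shared and only "vol-c" is new; with a limit of 2 it is rejected.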

pod

  • PodFitsResources (important): checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc., to run a pod.
  • PodMatchNodeSelector (important): checks if a pod's node selector matches the node's labels.
  • PodFitsHost (important): checks if a pod spec's node name matches the current node.
  • CheckNodeLabelPresence: checks whether all of the specified labels exist on a node or not, regardless of their value.
    If "presence" is false, then returns false if any of the requested labels matches any of the node's labels, otherwise returns true.
    If "presence" is true, then returns false if any of the requested labels does not match any of the node's labels, otherwise returns true.
    Consider the cases where the nodes are placed in regions/zones/racks and these are identified by labels.
    In some cases, it is required that only nodes that are part of ANY of the defined regions/zones/racks be selected.
    Alternately, eliminating nodes that have a certain label, regardless of value, is also useful. A node may have a label with "retiring" as key and the date as the value, and it may be desirable to avoid scheduling new pods on this node.
  • checkServiceAffinity: a predicate which matches nodes in such a way as to force ServiceAffinity.labels to be homogeneous for pods that are scheduled to a node (i.e. it returns true IFF this pod can be added to this node such that all other pods in the same service are running on nodes with the exact same ServiceAffinity.label values).
    For example:
    If the first pod of a service was scheduled to a node with label "region=foo",
    all subsequent pods belonging to the same service will be scheduled on
    nodes with the same "region=foo" label.
  • PodFitsHostPorts (important): checks if a node has free ports for the requested pod ports.
  • GeneralPredicates: checks whether noncriticalPredicates and EssentialPredicates pass. noncriticalPredicates are the predicates that only non-critical pods need; noncriticalPredicates is just PodFitsResources.
  • EssentialPredicates: the predicates that all pods, including critical pods, need, namely PodFitsHost, PodFitsHostPorts and PodMatchNodeSelector.
  • InterPodAffinityMatches: checks if a pod can be scheduled on the specified node with pod affinity/anti-affinity configuration.
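
The "presence" rule of CheckNodeLabelPresence is easy to get backwards, so here is a small sketch of it. The function `checkNodeLabelPresence` below is my own simplified version that works on a plain label map; it is not the signature used in predicates.go.

```go
package main

import "fmt"

// checkNodeLabelPresence sketches the presence rule described above.
// Only the label keys matter; the values are ignored.
//   presence == true:  every listed label must exist on the node.
//   presence == false: none of the listed labels may exist on the node.
func checkNodeLabelPresence(nodeLabels map[string]string, labels []string, presence bool) bool {
	for _, key := range labels {
		_, exists := nodeLabels[key]
		if exists != presence {
			return false
		}
	}
	return true
}

func main() {
	node := map[string]string{"zone": "a", "retiring": "2018-01-01"}
	// Require that the node belongs to some zone.
	fmt.Println(checkNodeLabelPresence(node, []string{"zone"}, true))
	// Avoid nodes that carry a "retiring" label, whatever the date is.
	fmt.Println(checkNodeLabelPresence(node, []string{"retiring"}, false))
}
```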

node

ps. Use kubectl describe no {node-name} to view a node's status:

  • CheckNodeUnschedulablePredicate: checks if a pod can be scheduled on a node with Unschedulable spec, i.e. it checks the node's unschedulable state.
  • PodToleratesNodeTaints: checks if a pod's tolerations can tolerate the node's taints (the node taint mechanism).
  • PodToleratesNodeNoExecuteTaints: checks if a pod's tolerations can tolerate the node's NoExecute taints.
  • CheckNodeMemoryPressurePredicate (important): checks if a pod can be scheduled on a node reporting memory pressure condition.
  • CheckNodeDiskPressurePredicate (important): checks if a pod can be scheduled on a node reporting disk pressure condition.
  • CheckNodePIDPressurePredicate: checks if a pod can be scheduled on a node reporting pid pressure condition.
  • CheckNodeConditionPredicate: checks if a pod can be scheduled on a node reporting out of disk, network unavailable and not ready condition. Only node conditions are accounted in this predicate.
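
As an illustration of the taint check, the sketch below matches tolerations against taints. The `Taint` and `Toleration` structs are cut-down, hypothetical versions of the API types, and I am assuming that only NoSchedule and NoExecute taints can block scheduling in this predicate.

```go
package main

import "fmt"

// Taint and Toleration are simplified stand-ins for the API types.
type Taint struct{ Key, Value, Effect string }
type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates reports whether a single toleration covers a single taint.
// An empty Effect or Key on the toleration acts as a wildcard.
func tolerates(t Toleration, taint Taint) bool {
	if t.Effect != "" && t.Effect != taint.Effect {
		return false
	}
	if t.Key != "" && t.Key != taint.Key {
		return false
	}
	switch t.Operator {
	case "", "Equal": // an empty operator is treated like Equal
		return t.Value == taint.Value
	case "Exists":
		return true
	}
	return false
}

// podToleratesNodeTaints returns true when every blocking taint on the
// node is covered by at least one toleration of the pod.
func podToleratesNodeTaints(tolerations []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		if taint.Effect != "NoSchedule" && taint.Effect != "NoExecute" {
			continue // assumption: other effects do not block scheduling here
		}
		covered := false
		for _, t := range tolerations {
			if tolerates(t, taint) {
				covered = true
				break
			}
		}
		if !covered {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "gpu", Effect: "NoSchedule"}}
	fmt.Println(podToleratesNodeTaints(nil, taints))
	fmt.Println(podToleratesNodeTaints([]Toleration{{Key: "dedicated", Operator: "Exists"}}, taints))
}
```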

Priorities

ps. Source: kubernetes-master\pkg\scheduler\algorithm\priorities

ResourceAllocationPriority

// ResourceAllocationPriority contains information to calculate resource allocation priority.
type ResourceAllocationPriority struct {
    Name   string
    scorer func(requested, allocable *schedulercache.Resource, includeVolumes bool, requestedVolumes int, allocatableVolumes int) int64
}

// PriorityMap prioritizes nodes according to the resource allocations on the node.
// It will use `scorer` function to calculate the score.
func (r *ResourceAllocationPriority) PriorityMap(
    pod *v1.Pod,
    meta interface{},
    nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error)

  • balancedResourceScorer (important): favors nodes with balanced resource usage rate.
    It should NOT be used alone, and MUST be used together with LeastRequestedPriority. It calculates the difference between the cpu and memory fraction of capacity, and prioritizes the host based on how close the two metrics are to each other.
    Formula: 10 - variance(cpuFraction, memoryFraction, volumeFraction) * 10
    Selects the node whose resource usage is the most balanced.
  • leastResourceScorer (important): favors nodes with fewer requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the minimum of the average of the fraction of requested to capacity.
    Formula: (cpu((capacity - sum(requested)) * 10 / capacity) + memory((capacity - sum(requested)) * 10 / capacity)) / 2
    Selects the most idle node.
  • mostResourceScorer: favors nodes with most requested resources. It calculates the percentage of memory and CPU requested by pods scheduled on the node, and prioritizes based on the maximum of the average of the fraction of requested to capacity.
    Formula: (cpu(10 * sum(requested) / capacity) + memory(10 * sum(requested) / capacity)) / 2
    Tries to use up the resources of a single node as fully as possible.
  • requested_to_capacity_ratio: assigns 1.0 to resource when all capacity is available and 0.0 when requested amount is equal to capacity.
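
The three formulas above translate almost directly into Go. The sketch below follows the formulas as written in this post; the function names and the handling of edge cases (zero capacity, requests above capacity, a fraction of 1 or more) are my own assumptions.

```go
package main

import "fmt"

// leastRequested scores one resource: (capacity - requested) * 10 / capacity.
func leastRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0 // assumed edge-case handling
	}
	return (capacity - requested) * 10 / capacity
}

// mostRequested scores one resource: 10 * requested / capacity.
func mostRequested(requested, capacity int64) int64 {
	if capacity == 0 || requested > capacity {
		return 0 // assumed edge-case handling
	}
	return 10 * requested / capacity
}

// leastResourceScore averages the cpu and memory scores; idle nodes win.
func leastResourceScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (leastRequested(cpuReq, cpuCap) + leastRequested(memReq, memCap)) / 2
}

// mostResourceScore averages the cpu and memory scores; busy nodes win.
func mostResourceScore(cpuReq, cpuCap, memReq, memCap int64) int64 {
	return (mostRequested(cpuReq, cpuCap) + mostRequested(memReq, memCap)) / 2
}

// balancedResourceScore computes 10 - variance(fractions) * 10.
func balancedResourceScore(cpuFraction, memFraction, volFraction float64) int64 {
	if cpuFraction >= 1 || memFraction >= 1 || volFraction >= 1 {
		return 0 // assumption: a fully used resource gets the lowest score
	}
	mean := (cpuFraction + memFraction + volFraction) / 3
	variance := ((cpuFraction-mean)*(cpuFraction-mean) +
		(memFraction-mean)*(memFraction-mean) +
		(volFraction-mean)*(volFraction-mean)) / 3
	return int64(10 - variance*10)
}

func main() {
	// 2000 of 4000 millicores and 2 of 8 GiB are already requested.
	fmt.Println(leastResourceScore(2000, 4000, 2, 8))
	fmt.Println(mostResourceScore(2000, 4000, 2, 8))
	fmt.Println(balancedResourceScore(0.5, 0.5, 0.5))
}
```

Note that the integer division truncates, so the per-resource scores are whole numbers between 0 and 10.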

image_locality (important)

Favors nodes that already have the images of the requested pod's containers.
It detects whether the requested images are present on a node, and then calculates a score ranging from 0 to 10
based on the total size of those images.

  • If none of the images are present, this node will be given the lowest priority.
  • If some of the images are present on a node, the larger their sizes' sum, the higher the node's priority.
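
A rough sketch of such a size-based score is shown below. The two thresholds and the linear scaling between them are assumptions chosen for illustration, and `imageLocalityScore` is a hypothetical name; check the priorities package for the exact values and rounding.

```go
package main

import "fmt"

const (
	mb = int64(1024 * 1024)
	// Assumed thresholds: below minImageSize the score is 0,
	// at or above maxImageSize the score is 10.
	minImageSize = 23 * mb
	maxImageSize = 1000 * mb
)

// imageLocalityScore sums the sizes of the requested images that are
// already on the node and maps that sum onto the range 0..10.
// nodeImages maps an image name to its size in bytes.
func imageLocalityScore(requested []string, nodeImages map[string]int64) int64 {
	var sum int64
	for _, image := range requested {
		sum += nodeImages[image] // a missing image contributes 0
	}
	switch {
	case sum < minImageSize:
		return 0
	case sum >= maxImageSize:
		return 10
	default:
		return 10 * (sum - minImageSize) / (maxImageSize - minImageSize)
	}
}

func main() {
	nodeImages := map[string]int64{"nginx": 100 * mb, "redis": 500 * mb}
	fmt.Println(imageLocalityScore([]string{"busybox"}, nodeImages))
	fmt.Println(imageLocalityScore([]string{"redis"}, nodeImages))
	fmt.Println(imageLocalityScore([]string{"nginx", "redis"}, nodeImages))
}
```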

interpod_affinity (important)

Computes a sum by iterating through the elements of weightedPodAffinityTerm and adding "weight" to the sum if the corresponding PodAffinityTerm is satisfied for that node; the node(s) with the highest sum are the most preferred.
Symmetry needs to be considered for preferredDuringSchedulingIgnoredDuringExecution from podAffinity and podAntiAffinity,
and also for hard requirements from podAffinity.

node_affinity (important)

Prioritizes nodes according to the scheduling preferences indicated in PreferredDuringSchedulingIgnoredDuringExecution. Each time a node matches a preferredSchedulingTerm, its score increases by preferredSchedulingTerm.Weight. Thus, the more preferredSchedulingTerms the node satisfies, and the higher the weights of the satisfied terms, the higher the score the node gets.
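
The weight summing can be sketched as follows. To keep it short, a term here matches on a single exact label key/value pair, which is much simpler than a real node selector term; `PreferredTerm` and `nodeAffinityScore` are hypothetical names, and the result is the raw sum before any normalization the scheduler may apply.

```go
package main

import "fmt"

// PreferredTerm is a simplified preferred scheduling term: the node
// matches when it carries the label Key with exactly the value Value.
type PreferredTerm struct {
	Weight int
	Key    string
	Value  string
}

// nodeAffinityScore adds up the weights of all terms the node satisfies.
func nodeAffinityScore(nodeLabels map[string]string, terms []PreferredTerm) int {
	sum := 0
	for _, term := range terms {
		if value, ok := nodeLabels[term.Key]; ok && value == term.Value {
			sum += term.Weight
		}
	}
	return sum
}

func main() {
	node := map[string]string{"zone": "a", "disk": "ssd"}
	terms := []PreferredTerm{
		{Weight: 80, Key: "zone", Value: "a"},
		{Weight: 20, Key: "disk", Value: "hdd"},
		{Weight: 5, Key: "disk", Value: "ssd"},
	}
	fmt.Println(nodeAffinityScore(node, terms))
}
```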

node_label

checks whether a particular label exists on a node or not, regardless of its value.
If presence is true, prioritizes nodes that have the specified label, regardless of value.
If presence is false, prioritizes nodes that do not have the specified label.

node_prefer_avoid_pods

Prioritizes nodes according to the node annotation "scheduler.alpha.kubernetes.io/preferAvoidPods".

selector_spreading (important)

  • SelectorSpreadPriority: spreads pods across hosts, considering pods belonging to the same service, RC, RS or StatefulSet.
    When a pod is scheduled, it looks for services, RCs, RSs and StatefulSets that match the pod, then finds existing pods that match those selectors.
    It favors nodes that have fewer existing matching pods,
    i.e. it pushes the scheduler towards a node where there's the smallest number of pods which match the same service, RC, RS or StatefulSet selectors as the pod being scheduled.
  • ServiceAntiAffinityPriority: spreads pods by minimizing the number of pods belonging to the same service on a given machine.
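
One simple way to turn "fewer matching pods is better" into a 0 to 10 score is to scale each node's count against the largest count. The sketch below does exactly that; it is an assumption about the shape of the calculation, ignores any zone-level weighting, and `spreadScores` is a hypothetical name.

```go
package main

import "fmt"

// spreadScores takes the number of already running matching pods per node
// and returns a score per node: 10 for the emptiest possible node,
// 0 for the node with the most matching pods.
func spreadScores(matchingPods map[string]int) map[string]int {
	maxCount := 0
	for _, count := range matchingPods {
		if count > maxCount {
			maxCount = count
		}
	}
	scores := make(map[string]int, len(matchingPods))
	for node, count := range matchingPods {
		score := 10 // no matching pods anywhere: every node is equally good
		if maxCount > 0 {
			score = 10 * (maxCount - count) / maxCount
		}
		scores[node] = score
	}
	return scores
}

func main() {
	scores := spreadScores(map[string]int{"a": 0, "b": 2, "c": 4})
	fmt.Println(scores["a"], scores["b"], scores["c"])
}
```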

taint_toleration

Prepares the priority list for all the nodes based on the number of intolerable taints on the node. See taint-and-toleration for details.
