[k8s source code analysis][kubelet] devicemanager: how kubelet requests resources

1. Preface

Please credit the original source when reposting, and respect the work of others!

Source location: https://github.com/nicktming/kubernetes/tree/tming-v1.13/pkg/kubelet/cm/devicemanager
Branch: tming-v1.13 (based on v1.13)

k8s-device-plugin
Branch: tming-v1.11 (based on v1.11)

device manager and device plugin
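1. [k8s source code analysis][kubelet] devicemanager: pod_devices and checkpoint
2. [k8s source code analysis][kubelet] devicemanager: using a device-plugin (simulated GPU)
3. [k8s source code analysis][kubelet] devicemanager: device-plugin registration with kubelet
4. [k8s source code analysis][kubelet] devicemanager: how kubelet requests resources
5. [k8s source code analysis][kubelet] devicemanager: restarting kubelet and the device-plugin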

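The previous article, [k8s source code analysis][kubelet] devicemanager: device-plugin registration with kubelet, analyzed how a device plugin registers with the device manager. This article analyzes how kubelet requests resources, that is, how it asks the device manager for devices.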

Manager

// Manager manages all the Device Plugins running on a node.
type Manager interface {
    // Start starts device plugin registration service.
    Start(activePods ActivePodsFunc, sourcesReady config.SourcesReady) error
    // Allocate configures and assigns devices to pods. The pods are provided
    // through the pod admission attributes in the attrs argument. From the
    // requested device resources, Allocate will communicate with the owning
    // device plugin to allow setup procedures to take place, and for the
    // device plugin to provide runtime settings to use the device (environment
    // variables, mount points and device files). The node object is provided
    // for the device manager to update the node capacity to reflect the
    // currently available devices.
    Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error
    // Stop stops the manager.
    Stop() error
    // GetDeviceRunContainerOptions checks whether we have cached containerDevices
    // for the passed-in <pod, container> and returns its DeviceRunContainerOptions
    // for the found one. An empty struct is returned in case no cached state is found.
    GetDeviceRunContainerOptions(pod *v1.Pod, container *v1.Container) (*DeviceRunContainerOptions, error)
    // GetCapacity returns the amount of available device plugin resource capacity, resource allocatable
    // and inactive device plugin resources previously registered on the node.
    GetCapacity() (v1.ResourceList, v1.ResourceList, []string)
    GetWatcherHandler() watcher.PluginHandler
    // GetDevices returns information about the devices assigned to pods and containers
    GetDevices(podUID, containerName string) []*podresourcesapi.ContainerDevices
}

// DeviceRunContainerOptions contains the combined container runtime settings to consume its allocated devices.
type DeviceRunContainerOptions struct {
    // The environment variables list.
    Envs []kubecontainer.EnvVar
    // The mounts for the container.
    Mounts []kubecontainer.Mount
    // The host devices mapped into the container.
    Devices []kubecontainer.DeviceInfo
    // The Annotations for the container
    Annotations []kubecontainer.Annotation
}

This is the definition of the Manager interface: every device manager must implement these methods.

ManagerImpl

type monitorCallback func(resourceName string, devices []pluginapi.Device)
type ManagerImpl struct {
    // Socket address: /var/lib/kubelet/device-plugins/kubelet.sock
    socketname string
    socketdir  string
    // Maps each resourceName to its endpoint
    endpoints map[string]endpointInfo // Key is ResourceName
    mutex     sync.Mutex
    // gRPC server
    server *grpc.Server
    wg     sync.WaitGroup
    // Returns the active pods on the node; used to refresh the node's resource information,
    // because devices held by pods that have finished running must be reclaimed by the device manager
    activePods ActivePodsFunc
    sourcesReady config.SourcesReady
    // Callback function
    callback monitorCallback
    // Maps each resourceName to all of its healthy devices
    healthyDevices map[string]sets.String
    // Maps each resourceName to all of its unhealthy devices
    unhealthyDevices map[string]sets.String
    // Maps each resourceName to its devices that have already been allocated
    allocatedDevices map[string]sets.String
    // podDevices records each pod and the devices that pod holds
    podDevices        podDevices
    // Persistence
    checkpointManager checkpointmanager.CheckpointManager
}
type endpointInfo struct {
    e    endpoint
    opts *pluginapi.DevicePluginOptions
}
func NewManagerImpl() (*ManagerImpl, error) {
    // pluginapi.KubeletSocket=/var/lib/kubelet/device-plugins/kubelet.sock
    return newManagerImpl(pluginapi.KubeletSocket)
}
func newManagerImpl(socketPath string) (*ManagerImpl, error) {
    klog.V(2).Infof("Creating Device Plugin manager at %s", socketPath)
    if socketPath == "" || !filepath.IsAbs(socketPath) {
        return nil, fmt.Errorf(errBadSocket+" %s", socketPath)
    }
    dir, file := filepath.Split(socketPath)
    manager := &ManagerImpl{
        endpoints: make(map[string]endpointInfo),
        socketname:       file,
        socketdir:        dir,
        healthyDevices:   make(map[string]sets.String),
        unhealthyDevices: make(map[string]sets.String),
        allocatedDevices: make(map[string]sets.String),
        podDevices:       make(podDevices),
    }
    manager.callback = manager.genericDeviceUpdateCallback
    // The following structs are populated with real implementations in manager.Start()
    // Before that, initializes them to perform no-op operations.
    // The real activePods and sourcesReady are passed in when Start is called
    manager.activePods = func() []*v1.Pod { return []*v1.Pod{} }
    manager.sourcesReady = &sourcesReadyStub{}
    checkpointManager, err := checkpointmanager.NewCheckpointManager(dir)
    if err != nil {
        return nil, fmt.Errorf("failed to initialize checkpoint manager: %v", err)
    }
    manager.checkpointManager = checkpointManager
    return manager, nil
}

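ManagerImpl is an implementation of the Manager interface. A few fields deserve attention:
healthyDevices: maps each resourceName to all of its healthy devices.
unhealthyDevices: maps each resourceName to all of its unhealthy devices.
allocatedDevices: maps each resourceName to its devices that have already been allocated.
podDevices: records each pod and the devices that pod holds. See [k8s source code analysis][kubelet] devicemanager: pod_devices and checkpoint
activePods: returns the active pods on the node and is used to refresh the node's resource information. Devices held by pods that have finished running must be reclaimed by the device manager.
callback: the callback function, already analyzed in [k8s source code analysis][kubelet] devicemanager: device-plugin registration with kubelet. It updates healthyDevices and unhealthyDevices.
endpoints: a map from resourceName to the corresponding endpoint information, also analyzed in [k8s source code analysis][kubelet] devicemanager: device-plugin registration with kubelet.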

Note also that the constructor NewManagerImpl uses /var/lib/kubelet/device-plugins/kubelet.sock as the default address.
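To make the path handling in newManagerImpl concrete, here is a minimal sketch of the same validate-then-split step. The function name splitSocketPath and the error text are my own (hypothetical); only filepath.IsAbs and filepath.Split come from the original code. It assumes a Unix-like system, where a path starting with / is absolute.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// splitSocketPath is a hypothetical helper that mirrors the check in
// newManagerImpl: the path must be non-empty and absolute, and it is then
// split into the directory and the socket file name.
func splitSocketPath(socketPath string) (dir, file string, err error) {
	if socketPath == "" || !filepath.IsAbs(socketPath) {
		return "", "", fmt.Errorf("bad socket path %q: must be an absolute path", socketPath)
	}
	dir, file = filepath.Split(socketPath)
	return dir, file, nil
}

func main() {
	dir, file, err := splitSocketPath("/var/lib/kubelet/device-plugins/kubelet.sock")
	fmt.Println(dir, file, err)
}
```

The directory part is what ends up in socketdir (and is also handed to the checkpoint manager), while the file part ends up in socketname.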

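As mentioned in [k8s source code analysis][kubelet] devicemanager: device-plugin registration with kubelet, the Start method receives new activePods and sourcesReady values. kubelet passes them in when it starts the device manager, which tells the device manager how to obtain the node's active pods.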

func (m *ManagerImpl) Start(activePods ActivePodsFunc, sourcesReady config.SourcesReady) error {
...
m.activePods = activePods
m.sourcesReady = sourcesReady
...
}
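The pattern here (a no-op default installed by the constructor, replaced by the real function in Start) can be sketched in isolation. The types pod, activePodsFunc and manager below are simplified stand-ins that I made up for illustration; they are not the kubelet types.

```go
package main

import "fmt"

// pod is a simplified stand-in for *v1.Pod; only the UID matters here.
type pod struct{ UID string }

// activePodsFunc plays the role of ActivePodsFunc in the original code.
type activePodsFunc func() []pod

// manager is a stripped-down stand-in for ManagerImpl.
type manager struct {
	activePods activePodsFunc
}

// newManager installs a no-op activePods, so the field is safe to call
// even before Start has run.
func newManager() *manager {
	return &manager{activePods: func() []pod { return []pod{} }}
}

// Start swaps in the real implementation supplied by the caller.
func (m *manager) Start(activePods activePodsFunc) {
	m.activePods = activePods
}

func main() {
	m := newManager()
	fmt.Println("before Start:", len(m.activePods()))

	m.Start(func() []pod { return []pod{{UID: "pod-a"}, {UID: "pod-b"}} })
	fmt.Println("after Start:", len(m.activePods()))
}
```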

Allocate

func (m *ManagerImpl) Allocate(node *schedulercache.NodeInfo, attrs *lifecycle.PodAdmitAttributes) error {
    // The pod requesting resources
    pod := attrs.Pod
    // Try to allocate resources for this pod
    err := m.allocatePodResources(pod)
    ...
    // Check whether podDevices has an entry for this pod; if not, the pod requested no device plugin resources
    if _, podRequireDevicePluginResource := m.podDevices[string(pod.UID)]; !podRequireDevicePluginResource {
        return nil
    }
    // Allocation succeeded; adjust the node information
    m.sanitizeNodeAllocatable(node)
    return nil
}

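1. Call allocatePodResources to try to allocate resources for the pod; return on error.
2. Check podDevices for an allocation record for this pod; if there is none (the pod requested no device plugin resources), return.
3. Allocation succeeded, so call sanitizeNodeAllocatable to adjust the node information.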

allocatePodResources
func (m *ManagerImpl) allocatePodResources(pod *v1.Pod) error {
    devicesToReuse := make(map[string]sets.String)
    for _, container := range pod.Spec.InitContainers {
        if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
            return err
        }
        m.podDevices.addContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
    }
    for _, container := range pod.Spec.Containers {
        if err := m.allocateContainerResources(pod, &container, devicesToReuse); err != nil {
            return err
        }
        m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
    }
    return nil
}

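Allocates resources for every container in the pod.
1. For pod.Spec.Containers, each container is given its own distinct devices.
2. For pod.Spec.InitContainers, pod.Spec.Containers start only after all InitContainers have finished, so the devices allocated to InitContainers are of no further use once they finish. These devices can therefore be handed on to pod.Spec.Containers; otherwise they would be wasted, because no other pod can request them either. devicesToReuse is explained in detail in the last part.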

allocateContainerResources
func (m *ManagerImpl) allocateContainerResources(pod *v1.Pod, container *v1.Container, devicesToReuse map[string]sets.String) error {
    podUID := string(pod.UID)
    contName := container.Name
    allocatedDevicesUpdated := false
    for k, v := range container.Resources.Limits {
        resource := string(k)
        needed := int(v.Value())
        klog.V(3).Infof("needs %d %s", needed, resource)
        if !m.isDevicePluginResource(resource) {
            continue
        }
        // Refresh the allocated-device information once
        if !allocatedDevicesUpdated {
            m.updateAllocatedDevices(m.activePods())
            allocatedDevicesUpdated = true
        }
        // Get the devices to allocate to this container
        allocDevices, err := m.devicesToAllocate(podUID, contName, resource, needed, devicesToReuse[resource])
        ...
        // Get the endpoint for this resourceName; through it, requests can be sent to the device plugin that registered this resourceName
        eI, ok := m.endpoints[resource]
        ...
        devs := allocDevices.UnsortedList()
        // Send a request to the device plugin to get the container settings for these devices
        // For example, the nvidia device plugin returns NVIDIA_VISIBLE_DEVICES=<UUIDs of the allocated GPUs>
        resp, err := eI.e.allocate(devs)
        ...
        // Record the allocation in podDevices
        m.podDevices.insert(podUID, contName, resource, allocDevices, resp.ContainerResponses[0])
        ...
    }
    // Persist to kubelet_internal_checkpoint
    return m.writeCheckpoint()
}

// k8s-device-plugin/server.go
func (m *NvidiaDevicePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
    devs := m.devs
    name := fmt.Sprintf("NVIDIA_VISIBLE_DEVICES/%v", resourceName)
    responses := pluginapi.AllocateResponse{}
    for _, req := range reqs.ContainerRequests {
        response := pluginapi.ContainerAllocateResponse{
            Envs: map[string]string{
                name: strings.Join(req.DevicesIDs, ","),
            },
        }

        for _, id := range req.DevicesIDs {
            if !deviceExists(devs, id) {
                return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
            }
        }

        responses.ContainerResponses = append(responses.ContainerResponses, &response)
    }

    return &responses, nil
}

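Code that does not affect the main logic has been omitted.
1. The first time a container requests a device plugin resource, updateAllocatedDevices is called once to refresh the device manager's view of allocated devices, because some pods holding devices may already have terminated.
2. For each resource, the following steps are performed:
2.1 Call devicesToAllocate to get the concrete devices of this resource to give to the container. If it returns an error, return immediately; there is no point in processing the remaining resources, because the container fails as soon as any one resource cannot be allocated.
2.2 Get the endpoint for this resource; through it, requests can be sent to the device plugin that registered this resource.
2.3 Send a request to the device plugin to get the container runtime settings for these devices. For example, the nvidia device plugin returns NVIDIA_VISIBLE_DEVICES=<UUIDs of the allocated GPUs>, so that when the container is actually started, nvidia docker maps the corresponding host GPUs into the container.
2.4 Record in podDevices that the container (contName) of this pod (podUID) obtained these devices (allocDevices) for this resource (resource).
3. Persist the device manager's current state to kubelet_internal_checkpoint. One more remark: why not persist only after all containers have been allocated? My understanding is that this container really has been given these devices, and they are already recorded in m.allocatedDevices (as devicesToAllocate shows later). Whether the later containers of this pod succeed has nothing to do with this container, since the allocations are independent, so the state can be written to disk as soon as one container has been allocated successfully.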
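Step 2.3 is easier to picture with concrete values. The sketch below is a simplified imitation of the plugin-side Allocate shown above: containerResponse stands in for pluginapi.ContainerAllocateResponse, and the device IDs (GPU-aaa, GPU-bbb) are made up for illustration. It only shows how the requested IDs are validated and joined into a single environment variable.

```go
package main

import (
	"fmt"
	"strings"
)

// containerResponse is a simplified stand-in for
// pluginapi.ContainerAllocateResponse; only Envs is modeled.
type containerResponse struct {
	Envs map[string]string
}

// allocate rejects unknown device IDs and otherwise returns an environment
// variable whose value is the comma-separated list of requested IDs.
func allocate(known map[string]bool, envName string, ids []string) (*containerResponse, error) {
	for _, id := range ids {
		if !known[id] {
			return nil, fmt.Errorf("invalid allocation request: unknown device: %s", id)
		}
	}
	return &containerResponse{
		Envs: map[string]string{envName: strings.Join(ids, ",")},
	}, nil
}

func main() {
	// Hypothetical device IDs known to the plugin.
	known := map[string]bool{"GPU-aaa": true, "GPU-bbb": true}

	resp, err := allocate(known, "NVIDIA_VISIBLE_DEVICES", []string{"GPU-aaa", "GPU-bbb"})
	fmt.Println(resp.Envs, err)

	_, err = allocate(known, "NVIDIA_VISIBLE_DEVICES", []string{"GPU-zzz"})
	fmt.Println(err)
}
```

The device manager stores this response in podDevices next to the device IDs, which is how the environment variable later reaches the container runtime options.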

updateAllocatedDevices
func (m *ManagerImpl) updateAllocatedDevices(activePods []*v1.Pod) {
    if !m.sourcesReady.AllReady() {
        return
    }
    m.mutex.Lock()
    defer m.mutex.Unlock()
    activePodUids := sets.NewString()
    for _, pod := range activePods {
        activePodUids.Insert(string(pod.UID))
    }
    // allocatedPodUids: the pods the device manager currently has records for
    // activePodUids: the pods that are actually still active on the node
    allocatedPodUids := m.podDevices.pods()
    // podsToBeRemoved: pods that hold devices on this node but are no longer active
    podsToBeRemoved := allocatedPodUids.Difference(activePodUids)
    if len(podsToBeRemoved) <= 0 {
        return
    }
    klog.V(3).Infof("pods to be removed: %v", podsToBeRemoved.List())
    m.podDevices.delete(podsToBeRemoved.List())
    // Regenerated allocatedDevices after we update pod allocation information.
    m.allocatedDevices = m.podDevices.devices()
}

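This method updates two fields of the device manager: podDevices and allocatedDevices.
allocatedPodUids: the pods the device manager currently has records for.
activePodUids: the pods that are actually still active on the node.
allocatedPodUids - activePodUids is podsToBeRemoved: the pods that hold devices on this node but have already terminated. Their devices can be released, which means updating the data in allocatedDevices.
podDevices is updated because its old data is stale.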
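The set arithmetic can be sketched with plain Go maps. This is a simplified model, not the kubelet code: podDevices here maps podUID directly to resource name and device IDs (the real type has an extra container level), the function name reclaim is hypothetical, and unlike the original it always regenerates the allocated view instead of returning early when nothing needs to be removed.

```go
package main

import (
	"fmt"
	"sort"
)

// podDevices is a simplified model: podUID -> resource name -> device IDs.
type podDevices map[string]map[string][]string

// reclaim drops every pod that is not active and then rebuilds the
// resource -> allocated device IDs view from the remaining pods.
func reclaim(pd podDevices, activePods []string) map[string][]string {
	active := make(map[string]bool, len(activePods))
	for _, uid := range activePods {
		active[uid] = true
	}
	// allocatedPodUids - activePodUids: delete the stale pods.
	for uid := range pd {
		if !active[uid] {
			delete(pd, uid)
		}
	}
	// Regenerate allocatedDevices from what is left.
	allocated := make(map[string][]string)
	for _, resources := range pd {
		for resource, ids := range resources {
			allocated[resource] = append(allocated[resource], ids...)
		}
	}
	for _, ids := range allocated {
		sort.Strings(ids) // sorted only to make the output stable
	}
	return allocated
}

func main() {
	pd := podDevices{
		"pod-a": {"example.com/gpu": {"gpu-0", "gpu-1"}},
		"pod-b": {"example.com/gpu": {"gpu-2"}},
	}
	// pod-a has terminated, so only pod-b is reported as active.
	fmt.Println(reclaim(pd, []string{"pod-b"}))
}
```

After the call, gpu-0 and gpu-1 no longer count as allocated and can be given to the next container that asks for them.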

devicesToAllocate
func (m *ManagerImpl) devicesToAllocate(podUID, contName, resource string, required int, reusableDevices sets.String) (sets.String, error) {
    m.mutex.Lock()
    defer m.mutex.Unlock()
    needed := required
    // Gets list of devices that have already been allocated.
    // This can happen if a container restarts for example.
    // Check whether this container already has an allocation record
    devices := m.podDevices.containerDevices(podUID, contName, resource)
    if devices != nil {
        klog.V(3).Infof("Found pre-allocated devices for resource %s container %q in Pod %q: %v", resource, contName, podUID, devices.List())
        needed = needed - devices.Len()
        // A pod's resource is not expected to change once admitted by the API server,
        // so just fail loudly here. We can revisit this part if this no longer holds.
        if needed != 0 {
            // Return an error if the number of previously allocated devices differs from the number now requested
            return nil, fmt.Errorf("pod %q container %q changed request for resource %q from %d to %d", podUID, contName, resource, devices.Len(), required)
        }
    }
    if needed == 0 {
        // No change, no work.
        return nil, nil
    }
    klog.V(3).Infof("Needs to allocate %d %q for pod %q container %q", needed, resource, podUID, contName)
    // Needs to allocate additional devices.
    if _, ok := m.healthyDevices[resource]; !ok {
        return nil, fmt.Errorf("can't allocate unregistered device %s", resource)
    }
    devices = sets.NewString()
    // Allocates from reusableDevices list first.
    // Reuse devices from init containers; an InitContainer can also reuse those of the previous InitContainer, since InitContainers run one at a time
    for device := range reusableDevices {
        devices.Insert(device)
        needed--
        if needed == 0 {
            return devices, nil
        }
    }
    // Needs to allocate additional devices.
    if m.allocatedDevices[resource] == nil {
        m.allocatedDevices[resource] = sets.NewString()
    }
    // Gets Devices in use.
    devicesInUse := m.allocatedDevices[resource]
    // Gets a list of available devices.
    available := m.healthyDevices[resource].Difference(devicesInUse)
    if int(available.Len()) < needed {
        return nil, fmt.Errorf("requested number of devices unavailable for %s. Requested: %d, Available: %d", resource, needed, available.Len())
    }
    allocated := available.UnsortedList()[:needed]
    // Record the devices in allocatedDevices
    for _, device := range allocated {
        m.allocatedDevices[resource].Insert(device)
        devices.Insert(device)
    }
    return devices, nil
}

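The logic here is actually quite simple; it is basic arithmetic.
devicesInUse := m.allocatedDevices[resource] is the set of devices of this resource that have already been allocated.
available := m.healthyDevices[resource].Difference(devicesInUse) is the set of devices of this resource that can currently be allocated.
needed is the number of devices of this resource that the container requests.

So allocation succeeds only if needed <= available.Len(). On success, the allocated devices are added to the device manager's allocatedDevices.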
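The arithmetic above can be tried out with a small sketch. Everything here is a simplified model: set is a plain map instead of sets.String, the function name pick is hypothetical, and keys are sorted only so that the result is deterministic (the original takes an arbitrary slice of UnsortedList).

```go
package main

import (
	"fmt"
	"sort"
)

// set is a minimal stand-in for sets.String.
type set map[string]bool

// sortedKeys returns the members of s in sorted order.
func sortedKeys(s set) []string {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// pick chooses `needed` devices: reusable ones first, then devices from
// healthy minus inUse. Newly taken devices are marked in inUse.
func pick(healthy, inUse, reusable set, needed int) ([]string, error) {
	if needed == 0 {
		return nil, nil
	}
	var chosen []string
	for _, id := range sortedKeys(reusable) {
		chosen = append(chosen, id)
		needed--
		if needed == 0 {
			return chosen, nil
		}
	}
	var available []string
	for _, id := range sortedKeys(healthy) {
		if !inUse[id] {
			available = append(available, id)
		}
	}
	// Fails only when available < needed, so needed == available succeeds.
	if len(available) < needed {
		return nil, fmt.Errorf("requested %d devices, only %d available", needed, len(available))
	}
	for _, id := range available[:needed] {
		inUse[id] = true
		chosen = append(chosen, id)
	}
	return chosen, nil
}

func main() {
	healthy := set{"gpu-0": true, "gpu-1": true, "gpu-2": true}
	inUse := set{"gpu-0": true}

	fmt.Println(pick(healthy, inUse, set{}, 2)) // exactly as many as are free
	fmt.Println(pick(healthy, inUse, set{}, 1)) // nothing left
}
```

The first call requests exactly the two free devices and succeeds, which is the boundary case needed == available; the second call fails because every healthy device is now in use.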

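One more point to note:
Devices to reuse are taken from reusableDevices first, that is, from the init containers. An InitContainer can also reuse the devices of the previous InitContainer, because InitContainers run one at a time.
For example, suppose every InitContainer in the pod requests 1 GPU. The first InitContainer is allocated one GPU. The second InitContainer starts only after the first has finished, so the GPU it gets is the first InitContainer's GPU, which it reuses. The third, fourth and fifth InitContainers likewise reuse that same GPU.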

Now let's look back at the allocatePodResources method to understand what devicesToReuse means.

func (m *ManagerImpl) allocatePodResources(pod *v1.Pod) error {
    devicesToReuse := make(map[string]sets.String)
    for _, container := range pod.Spec.InitContainers {
        ...
        m.podDevices.addContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
    }
    for _, container := range pod.Spec.Containers {
        ...
        m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, devicesToReuse)
    }
    return nil
}

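As shown, the devices that pod.Spec.Containers can reuse come from the InitContainers, because pod.Spec.Containers start (all at the same time) only after every InitContainer has run to completion, one by one and in order. From the analysis above, the number of devices that pod.Spec.Containers can reuse is max(number of devices requested by each InitContainer).

If the largest InitContainer requests 10 GPUs while all pod.Spec.Containers together request only 1 GPU, the pod needlessly wastes 9 GPUs.
If the largest InitContainer requests 1 GPU while all pod.Spec.Containers together request 10 GPUs, the pod first reuses the 1 GPU already allocated to the InitContainer and then requests the other 9 GPUs.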
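The two cases above can be checked with a small simulation. This is a sketch under simplifying assumptions, not the kubelet code: there is a single resource, the node type and the names newNode, allocate and admitPod are all made up, and the reuse pool is a plain map that plays the role of devicesToReuse (union after each init container, difference after each regular container).

```go
package main

import "fmt"

// node tracks the devices of one resource on a node.
type node struct {
	free  []string        // healthy devices not yet allocated
	taken map[string]bool // devices allocated to some pod
}

// newNode creates a node with n free devices named gpu-0 ... gpu-(n-1).
func newNode(n int) *node {
	nd := &node{taken: map[string]bool{}}
	for i := 0; i < n; i++ {
		nd.free = append(nd.free, fmt.Sprintf("gpu-%d", i))
	}
	return nd
}

// allocate gives one container `needed` devices, drawing from the reuse
// pool first and taking the remainder from the free list.
func (n *node) allocate(needed int, reuse map[string]bool) (map[string]bool, error) {
	got := map[string]bool{}
	for id := range reuse {
		if len(got) == needed {
			break
		}
		got[id] = true
	}
	missing := needed - len(got)
	if missing > len(n.free) {
		return nil, fmt.Errorf("need %d more devices, only %d free", missing, len(n.free))
	}
	for _, id := range n.free[:missing] {
		got[id] = true
		n.taken[id] = true
	}
	n.free = n.free[missing:]
	return got, nil
}

// admitPod mirrors the shape of allocatePodResources: devices of init
// containers are added to the reuse pool, and devices given to regular
// containers are removed from it.
func (n *node) admitPod(initReqs, appReqs []int) error {
	reuse := map[string]bool{}
	for _, req := range initReqs {
		got, err := n.allocate(req, reuse)
		if err != nil {
			return err
		}
		for id := range got {
			reuse[id] = true
		}
	}
	for _, req := range appReqs {
		got, err := n.allocate(req, reuse)
		if err != nil {
			return err
		}
		for id := range got {
			delete(reuse, id)
		}
	}
	return nil
}

func main() {
	a := newNode(16)
	_ = a.admitPod([]int{10}, []int{1})
	fmt.Println("init 10, app 1:", len(a.taken), "devices held by the pod")

	b := newNode(16)
	_ = b.admitPod([]int{1}, []int{10})
	fmt.Println("init 1, app 10:", len(b.taken), "devices held by the pod")
}
```

In both cases the pod ends up holding 10 devices. In the first case only 1 of them is used by a regular container, so 9 sit idle; in the second case 1 device is reused and 9 more are taken from the free list.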

Summary

conclusion.png
