浅谈 k8s service&kube-proxy

背景

最开始听到同事 k8s 分享时比较困惑我的一个问题是 k8s 怎么实现一个私有 ip(虚拟 ip，以下简称 vip)到另一个私有ip收发包的。

不过其实我想知道的应该是 k8s 通信机制，它是怎么实现服务发现的，新建的 pod 是怎么感知到的，万一有些 pod 节点变更 vip 变了 k8s 是如何感知的。

基于这个问题，做一下关于 k8s service&kube-proxy 的分享。

Service&kube-proxy 概述

首先我建了一个 replcas = 4 lebel: app=service_test_pod的 python server deployment 来打出当前 pod 的 Hostname

apiVersion:  apps/v1
kind: Deployment
metadata:
  name: service-test
spec:
  replicas: 4
  selector:
    matchLabels:
      app: service_test_pod
  template:
    metadata:
      labels:
        app: service_test_pod
    spec:
      containers:
      - name: simple-http
        image: python:2.7
        imagePullPolicy: IfNotPresent
        command: ["/bin/bash"]
        args: ["-c", "echo \"<p>Hello from $(hostname)</p>\" > index.html; python -m SimpleHTTPServer 9999"]
        ports:
        - name: http
          containerPort: 9999

可以看到启动了4个 pod

service-test-69764ddb4c-4hr2x                                     1/1       Running             0          8s        172.18.234.21    brand6
service-test-69764ddb4c-gstft                                     1/1       Running             0          8s        172.18.83.225    belba2
service-test-69764ddb4c-nnv29                                     1/1       Running             0          8s        172.18.156.140   brand2
service-test-69764ddb4c-vx5pn                                     0/1       ContainerCreating   0          8s        <none>           belba3

我必须挨个 curl 才能得到他们的 hostname

lilingzhi@belba1 ~/k8s/test $ curl 172.18.234.21:9999
<p>Hello from service-test-69764ddb4c-4hr2x</p>

但不能让其他 pod 直接通过 vip 访问这些 pod ，需要一个更上层的一个抽象，把这4个提供相同服务的 pod 打包成一个对外的服务，通过某个入口地址来访问它，并且把请求均衡到4个pod上，这样一层的抽象包装是实现一个服务网络(service mesh)的基础。

而 k8s service 就是做这个的。

首先我们来看下什么是 k8s service:

A Kubernetes Service is an abstraction which defines a logical set of Pods and a policy by which to access them - sometimes called a micro-service. The set of Pods targeted by a Service is (usually) determined by a Label Selector (see below for why you might want a Service without a selector)

而打包pod成service并发布的微服务得支持 k8s 内部及外部的访问。

故 k8s service 提供了以下三种暴露 service 入口的模式:

ClusterIP: use a cluster-internal IP only - this is the default and is discussed above. Choosing this value means that you want this service to be reachable only from inside of the cluster.

NodePort: on top of having a cluster-internal IP, expose the service on a port on each node of the cluster (the same port on each node). You’ll be able to contact the service on any :NodePort address.

LoadBalancer: on top of having a cluster-internal IP and exposing service on a NodePort also, ask the cloud provider for a load balancer which forwards to the Service exposed as a :NodePort for each Node.

可以简单理解为 ClusterIp 是提供对内的访问入口，NodePort 和 LoadBalancer 是提供对外的，不过 LoadBalancer 是在暴露 NodePort 基础上提供可以接入外部的 LB。

那么我新建一个 service 给刚刚的 4 个 pod

lilingzhi@belba1 ~/k8s/test $ kubectl expose deployment service-test --type="NodePort" --port 9098 --target-port=9999
service/service-test exposed

可以看到

lilingzhi@belba1 ~/k8s/test $ kubectl get service -o wide
NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                               AGE       SELECTOR
service-test           NodePort    172.19.97.3     <none>        9098:30255/TCP                        16s       app=service_test_pod

service-test 这儿就映射到 clusterIp 的 172.19.97.3:9098 端口上

再看下 endpoints

lilingzhi@belba1 ~/k8s/test $ kubectl get endpoints -o wide
NAME                   ENDPOINTS                                                               AGE
service-test           172.18.156.140:9999,172.18.193.66:9999,172.18.234.21:9999 + 1 more...   3m

可以看到也建了一个包含 4个 host:port 的元组的 endpoint

通过不断 curl service ip:port 会发现请求已经均衡到4个 pod 上了

lilingzhi@belba1 ~/k8s/test $ curl 172.19.97.3:9098
<p>Hello from service-test-69764ddb4c-4hr2x</p>
lilingzhi@belba1 ~/k8s/test $ curl 172.19.97.3:9098
<p>Hello from service-test-69764ddb4c-gstft</p>
lilingzhi@belba1 ~/k8s/test $ curl 172.19.97.3:9098
<p>Hello from service-test-69764ddb4c-4hr2x</p>
lilingzhi@belba1 ~/k8s/test $ curl 172.19.97.3:9098
<p>Hello from service-test-69764ddb4c-gstft</p>

由于 k8s service 路由是通过 kube-proxy 决定的，默认是走的 iptables 转发的(可换成用户态 proxy 或 ipvs)，所以查一下相应的 iptables 规则

$ sudo iptables -L -v -n -t nat

Chain KUBE-SERVICES (2 references)
    0     0 KUBE-MARK-MASQ  tcp  --  *      *      !172.18.0.0/16        172.19.97.3          /* default/service-test: cluster IP */ tcp dpt:9098
    0     0 KUBE-SVC-LY73ZDGF4KGO4YFJ  tcp  --  *      *       0.0.0.0/0            172.19.97.3          /* default/service-test: cluster IP */ tcp dpt:9098

看到有条 chain KUBE-SVC-LY73ZDGF4KGO4YFJ 定义了 service-test 的转发规则，于是查看相应的 chain

Chain KUBE-SVC-LY73ZDGF4KGO4YFJ (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SEP-2T6K76SEPIPV3QKW  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/service-test: */ statistic mode random probability 0.25000000000
    0     0 KUBE-SEP-75XULILUFIHXBBLY  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/service-test: */ statistic mode random probability 0.33332999982
    0     0 KUBE-SEP-FFUABGNOKSNXGH7E  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/service-test: */ statistic mode random probability 0.50000000000
    0     0 KUBE-SEP-MKK3T4CLASKIDMJS  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/service-test: */

可以看到这条 KUBE-SVC-LY73ZDGF4KGO4YFJ chain 里有4个下一跳的 chain KUBE-SEP-2T6K76SEPIPV3QKW ，KUBE-SEP-75XULILUFIHXBBLY, KUBE-SEP-FFUABGNOKSNXGH7E, KUBE-SEP-MKK3T4CLASKIDMJS, 他们分别对应4个 pod 的 vip，并且设置了 iptables probability, 由上至下分别为 25%, 33.33%, 50%, 100%, 由于 iptables 是顺序读的，这样确保了每个 pod 都是 25% 的请求分发的机率。

那么我们挑其中一条子 chain KUBE-SEP-2T6K76SEPIPV3QKW 跟进去看看

Chain KUBE-SEP-2T6K76SEPIPV3QKW (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-MARK-MASQ  all  --  *      *       172.18.156.140       0.0.0.0/0            /* default/service-test: */
    0     0 DNAT       tcp  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/service-test: */ tcp to:172.18.156.140:9999

果不其然，这儿的 source ip 就是对应 pod 的 vip, 一个请求就这样转发到 pod 上，可以看到一共要经过3层 chain。

chains

不过有二义性的地方在于，这儿的 kube-proxy 实际上并不起一个 proxy 的作用，而是 watch 变更并更新 iptables，也就是说，client 的请求直接通过 iptables 路由，所以如果我们直接修改 iptables 也是可以奏效的。

kube-proxy 源码分析

因为 kube-proxy 源码相对比较少，所以读了下源码，但还是蛮复杂的

kube-proxy 会作为 daemon 跑在每个节点上，对 api-server 中的 service & endpoint 进行 watch ,一旦检测到更新则往 iptables 里全量推送新的转发规则，那么我们根据 kubernetes/cmd/kube-proxy/proxy.go 里找到 kube-proxy 真正的入口函数 Run()

func (s *ProxyServer) Run() error {
    ...
    serviceConfig := config.NewServiceConfig(informerFactory.Core().V1().Services(), s.ConfigSyncPeriod)
    serviceConfig.RegisterEventHandler(s.ServiceEventHandler)
    go serviceConfig.Run(wait.NeverStop)

    endpointsConfig := config.NewEndpointsConfig(informerFactory.Core().V1().Endpoints(), s.ConfigSyncPeriod)
    endpointsConfig.RegisterEventHandler(s.EndpointsEventHandler)
    go endpointsConfig.Run(wait.NeverStop)

    // This has to start after the calls to NewServiceConfig and NewEndpointsConfig because those
    // functions must configure their shared informer event handlers first.
    go informerFactory.Start(wait.NeverStop)

    // Birth Cry after the birth is successful
    s.birthCry()

    // Just loop forever for now...
    s.Proxier.SyncLoop()
    return nil
}

这就是 kube-proxy 的入口函数，实际上即起一个 proxy server daemon

可以看到通过 informerFactory新建了2个 config 对象(serviceConfig, endpointsConfig), 这个 informerFactory 是什么呢？

k8s 里所有资源都存在 etcd 中提供 api 通过 apiserver 的接口访问，其中有个核心的公共组件即 informer 是对 apiserver 资源访问的一层包装，其中包括 api 访问, localcache 等...

所以这里的两个 config 对象即用来获取 etcd 中 service 和 endpoints 的信息，他们都调用了 RegisterEventHandler 注册了一个回调函数，这个函数用来监听变更并发送变更信号。

最后用 goroutine 跑起来。

之后的 informerFactory.Start() 则用来初始化 informer 对象注册的回调函数。

最后把 proxy server loop 跑起来 Proxier.SyncLoop() 等待信号。

接着我们先看下 service 的 NewServiceConfig 这个函数，因为 endpoint 估计也是类似的。

// NewServiceConfig creates a new ServiceConfig.
func NewServiceConfig(serviceInformer coreinformers.ServiceInformer, resyncPeriod time.Duration) *ServiceConfig {
    result := &ServiceConfig{
        lister:       serviceInformer.Lister(),
        listerSynced: serviceInformer.Informer().HasSynced,
    }

    serviceInformer.Informer().AddEventHandlerWithResyncPeriod(
        cache.ResourceEventHandlerFuncs{
            AddFunc:    result.handleAddService,
            UpdateFunc: result.handleUpdateService,
            DeleteFunc: result.handleDeleteService,
        },
        resyncPeriod,
    )

    return result
}

看到它结构体里就两个对象 lister 和 listerSynced，所以我们接着看下 serviceInformer

type ServiceInformer interface {
    Informer() cache.SharedIndexInformer
    Lister() v1.ServiceLister
}

看到是个通用对象 SharedIndexInformer 即上述提及的公共组件，故不往里深究。

接着回来看注册进去的回调函数，举个栗子，这儿的 UpdateFunc: result.handleUpdateService 最终会调到 iptables.go 里的 OnServiceUpdate

func (c *ServiceConfig) handleUpdateService(oldObj, newObj interface{}) {
    oldService, ok := oldObj.(*v1.Service)
    if !ok {
        utilruntime.HandleError(fmt.Errorf("unexpected object type: %v", oldObj))
        return
    }
    service, ok := newObj.(*v1.Service)
    if !ok {
        utilruntime.HandleError(fmt.Errorf("unexpected object type: %v", newObj))
        return
    }
    for i := range c.eventHandlers {
        glog.V(4).Infof("Calling handler.OnServiceUpdate")
        c.eventHandlers[i].OnServiceUpdate(oldService, service)
    }
}


func (proxier *Proxier) OnServiceUpdate(oldService, service *v1.Service) {
    if proxier.serviceChanges.Update(oldService, service) && proxier.isInitialized() {
        proxier.syncRunner.Run()
    }
}

所以监听到变更之后调用的这个 proxier.syncRunner.Run() 是什么呢，得看下这个 syncRunner 在做什么

proxier.syncRunner = async.NewBoundedFrequencyRunner("sync-runner", proxier.syncProxyRules, minSyncPeriod, syncPeriod, burstSyncs)

func (bfr *BoundedFrequencyRunner) Run() {
    // If it takes a lot of time to run the underlying function, noone is really
    // processing elements from <run> channel. So to avoid blocking here on the
    // putting element to it, we simply skip it if there is already an element
    // in it.
    select {
    case bfr.run <- struct{}{}:
    default:
    }
}

func (bfr *BoundedFrequencyRunner) Loop(stop <-chan struct{}) {
    glog.V(3).Infof("%s Loop running", bfr.name)
    bfr.timer.Reset(bfr.maxInterval)
    for {
        select {
        case <-stop:
            bfr.stop()
            glog.V(3).Infof("%s Loop stopping", bfr.name)
            return
        case <-bfr.timer.C():
            bfr.tryRun()
        case <-bfr.run:
            bfr.tryRun()
        }
    }
}

syncRunner 里注册了个 syncProxyRules 的回调函数，而刚刚 updateSync 中触发的 run 函数则用 select 发送了一个 bfr.run 信号，之前所提及的 proxy server 一旦收到信号就会跑一次 tryRun()调用到这个 syncProxyRules 的回调函数。

于是我们走到 syncProxyRules, 这个函数特别长，但简而言之就是做一些 iptables 的 ensure 和 update 操作。

这样一来，整个流程就走通了。

靠北，开发 k8s 的 google 工程师真的很机车，代码真的有点绕哦！

最后

最后我们来理一下 service && endpoint && kube-proxy 的关系。

service / endpoint 是pod对外暴露访问地址的封装，Kube-proxy 用来管理这些封装，做一些 ensure&update 的操作

kube-proxy model

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 158,560评论 4赞 361
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 67,104评论 1赞 291
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 108,297评论 0赞 243
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 43,869评论 0赞 204
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 52,275评论 3赞 287
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 40,563评论 1赞 216
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 31,833评论 2赞 312
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 30,543评论 0赞 197
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 34,245评论 1赞 241
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 30,512评论 2赞 244
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 32,011评论 1赞 258
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 28,359评论 2赞 253
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 33,006评论 3赞 235
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 26,062评论 0赞 8
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 26,825评论 0赞 194
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 35,590评论 2赞 273
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 35,501评论 2赞 268

浅谈 k8s service&kube-proxy

背景

Service&kube-proxy 概述

kube-proxy 源码分析

最后

推荐阅读更多精彩内容