OpenShift Production Deployment Configuration Checklist

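1. Recommended host specifications

master: 16 cores, 32 GB RAM, NIC bandwidth of at least 1 Gb/s.
CPU: x86_64 architecture. The core count grows linearly with the number of hosts, at 0.1 core per additional host: total cores = 4 + 0.1 * number of hosts (4.5 cores for 5 hosts).
Memory: grows linearly with the number of hosts, at 200 MB per additional host: total memory in GB = 7 + 0.2 * number of hosts (8 GB for 5 hosts).
node: 40 cores, 256 GB RAM, NIC bandwidth of at least 1 Gb/s.
Estimate the actual node size from the workloads it will run.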
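
The two linear rules for master sizing can be evaluated for any cluster size. The following is a minimal sketch of that arithmetic; the function names are hypothetical and only restate the formulas given above.

```python
def master_cores(host_count: int) -> float:
    """Master cores from the rule: 4 + 0.1 * number of hosts."""
    return 4 + 0.1 * host_count


def master_memory_gb(host_count: int) -> float:
    """Master memory in GB from the rule: 7 + 0.2 * number of hosts."""
    return 7 + 0.2 * host_count


if __name__ == "__main__":
    # Cluster sizes chosen only for illustration.
    for hosts in (5, 50, 100):
        print(f"{hosts:>3} hosts: {master_cores(hosts):.1f} cores, "
              f"{master_memory_gb(hosts):.1f} GB")
```

For 5 hosts this reproduces the 4.5 cores and 8 GB quoted in the rule.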

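2. Disk layout and mount points

master
Filesystem: xfs with ftype=1
/ : 10 GB
/var/log : 50 GB
/var/lib/docker : 100 GB, on RAID for redundancy
/var/lib/etcd [SSD] : 20 GB, on RAID for redundancy
/var : 50 GB, adjust as needed; emptyDir volumes are stored under /var/lib/origin
node
Filesystem: xfs with ftype=1
/ : 10 GB
/var/log : 50 GB
/var/lib/docker : 100 GB, on RAID for redundancy
/var : 50 GB, adjust as needed; emptyDir volumes are stored under /var/lib/origin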

mkfs.xfs -n ftype=1 /path/to/your/device

Note: on an xfs filesystem, the Docker overlay2 storage driver requires ftype=1.

3. Disable swap

swapoff -a
vi /etc/fstab ## comment out the swap entry so that swap stays off after a reboot

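4. Enable SELinux

sed -i 's/SELINUX=disabled/SELINUX=permissive/' /etc/selinux/config

Note: this command only moves a host from disabled to permissive, and the change takes effect after a reboot. The OpenShift 3.x installation prerequisites call for SELINUX=enforcing with SELINUXTYPE=targeted.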

5. Set kernel parameters

$ cat /etc/sysctl.conf
# Disable IPv6 on all interfaces system-wide
net.ipv6.conf.all.disable_ipv6 = 1
vm.swappiness = 0
net.netfilter.nf_conntrack_max = 1000000
$ lsmod | grep conntrack || modprobe ip_conntrack
$ sysctl -w net.netfilter.nf_conntrack_max=1000000
$ sysctl -p /etc/sysctl.conf

6. Update resolv.conf

$ cat /etc/resolv.conf
search cluster.local
nameserver 192.168.0.2

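7. Time synchronization

$ ansible all -m package -a 'name=chrony state=present'

## chronyd server configuration
$ cat /etc/chrony.conf
server 55.15.226.193 iburst
allow 55.15.226.0/24
local stratum 10

## chrony client: check the sources, then force a one-off time sync
chronyc sources -v
systemctl stop chronyd
chronyd -q 'pool 55.15.226.193 iburst'
systemctl start chronyd ## bring the daemon back up after the one-off sync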

8. Create the docker group

groupadd docker

Add regular users to the docker group:

usermod -aG docker ${USER}

9. Docker settings

/etc/sysconfig/docker-storage
DOCKER_STORAGE_OPTIONS="--storage-driver overlay2 "
/etc/sysconfig/docker
OPTIONS=" --log-opt max-size=1M --log-opt max-file=3 --live-restore=true "

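Set the cgroup driver of both Docker and the kubelet to systemd. An OpenShift installation uses systemd by default, whereas the upstream (community) kubelet defaults to cgroupfs, so make sure the two match.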

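10. NIC configuration

Multi-queue NICs: run ethtool -l eth0 to check the number of Combined queues.
NetworkManager is a tool that detects and configures networks. Nodes need it to set up dnsmasq automatically as the node's default DNS entry point.
In /etc/sysconfig/network-scripts/ifcfg-eth*, NM_CONTROLLED defaults to yes. If it is set to no, NetworkManager will not generate the dnsmasq configuration, and dnsmasq must be configured by hand.

Add the following files: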

$ cat /etc/dnsmasq.d/origin-upstream-dns.conf
server=192.168.0.2
$ cat /etc/origin/node/resolv.conf
nameserver 192.168.0.2

Reference: install-config-network-using-firewalld

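11. Multiple networks

Management network: communication between cluster components and between nodes and masters
Application network: traffic between applications and between pods
Storage network: traffic to storage devices
The network path to an external image registry can be planned as a separate network as well.
For each network, bond two NICs to improve throughput and availability.
The management and application networks must be able to reach each other, otherwise some component services become unavailable.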

12. Components hosted outside the cluster nodes

Time synchronization (chronyd)
DNS (dnsmasq)
Image registry (docker-distribution)
Load balancer (HAProxy)

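13. Trusting and authenticating to an external image registry

Copy the private registry's CA certificate to /etc/pki/ca-trust/source/anchors/ on every host, then refresh the trust store on each of them:

$ ansible all -m copy -a 'src=registry.crt dest=/etc/pki/ca-trust/source/anchors/registry.crt'
$ ansible all -m command -a 'update-ca-trust'

Set the default registry credentials for the OpenShift nodes:

$ # Add the registry user to /etc/ansible/hosts
oreg_auth_user="<username>"
oreg_auth_password="<password>"

$ docker login <registry-url> -u <username>
$ ansible -m copy -a 'src=/root/.docker dest=/var/lib/origin' all
$ ansible -m service -a 'name=origin-node state=restarted' all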

14. Kernel tuning (applied automatically by the OpenShift installer)

$ ansible all -m package -a 'name=tuned state=present'
$ ansible all -m service -a 'name=tuned state=started enabled=true'
$ ansible all -m shell -a 'tuned-adm profile throughput-performance'

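15. Reserving node resources through Ansible

General rule recommended by OpenShift:
Reserve 5%-10% of each node's resources to protect the node; the more you reserve, the safer the node.
AWS rules:
Reserved memory (AWS):

Reserved memory = 255MiB + 11MiB * MAX_POD_PER_INSTANCE

Reserved CPU (AWS):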

6% of the first core
1% of the next core (up to 2 cores)
0.5% of the next 2 cores (up to 4 cores)
0.25% of any cores above 4 cores
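
The AWS memory rule depends only on the pod limit of the instance. A minimal sketch of it follows; the function name is hypothetical, and the pod counts in the usage loop are illustrative values, not AWS defaults.

```python
def aws_reserved_memory_mib(max_pods_per_instance: int) -> int:
    """Reserved memory in MiB: 255 MiB base plus 11 MiB per schedulable pod."""
    return 255 + 11 * max_pods_per_instance


if __name__ == "__main__":
    # Illustrative pod limits only.
    for max_pods in (29, 110, 250):
        print(f"max pods {max_pods:>3}: "
              f"{aws_reserved_memory_mib(max_pods)} MiB reserved")
```

With a limit of 250 pods per node, the rule reserves 3005 MiB, roughly 3 GiB.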

GKE rules:
Reserved memory (GKE):

255 MiB of memory for machines with less than 1 GB of memory
25% of the first 4GB of memory
20% of the next 4GB of memory (up to 8GB)
10% of the next 8GB of memory (up to 16GB)
6% of the next 112GB of memory (up to 128GB)
2% of any memory above 128GB

Reserved CPU (GKE):

6% of the first core
1% of the next core (up to 2 cores)
0.5% of the next 2 cores (up to 4 cores)
0.25% of any cores above 4 cores

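Example: 2 vCPU and 7.5 GB

Reserved memory = 0.25 * 4 (first 4GB) + 0.2 * 3.5 (remaining 3.5GB) = 1.7 GB
Reserved CPU = 0.06 * 1 (first core) + 0.01 * 1 (second core) = 0.07 cores

Before the eviction threshold is subtracted, that leaves 5.8 GB of memory and 1.93 cores allocatable.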
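
The tiered GKE rules are easy to misapply by hand, so here is a minimal sketch that walks the tiers for an arbitrary machine size. The function and constant names are hypothetical; the tier sizes and percentages are the ones listed above, and the result is the reserved amount (allocatable is capacity minus reserved, before the eviction threshold).

```python
# Each tier is (tier size, reserved fraction); None marks the open-ended last tier.
GKE_MEMORY_TIERS_GB = [(4, 0.25), (4, 0.20), (8, 0.10), (112, 0.06), (None, 0.02)]
CPU_TIERS_CORES = [(1, 0.06), (1, 0.01), (2, 0.005), (None, 0.0025)]


def tiered_reserve(total, tiers):
    """Sum fraction * portion over the tiers until the total is used up."""
    reserved = 0.0
    remaining = total
    for size, fraction in tiers:
        if remaining <= 0:
            break
        portion = remaining if size is None else min(remaining, size)
        reserved += portion * fraction
        remaining -= portion
    return reserved


def gke_reserved_memory_gb(total_gb):
    """Reserved memory in GB; machines under 1 GB get a flat 255 MiB."""
    if total_gb < 1:
        return 255 / 1024
    return tiered_reserve(total_gb, GKE_MEMORY_TIERS_GB)


def reserved_cpu_cores(cores):
    """Reserved CPU in cores, using the tier list shared by the rules above."""
    return tiered_reserve(cores, CPU_TIERS_CORES)


if __name__ == "__main__":
    mem = gke_reserved_memory_gb(7.5)
    cpu = reserved_cpu_cores(2)
    print(f"reserved:    {mem:.2f} GB, {cpu:.2f} cores")
    print(f"allocatable: {7.5 - mem:.2f} GB, {2 - cpu:.2f} cores")
```

For the 2 vCPU, 7.5 GB machine this yields 1.7 GB and 0.07 cores reserved.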

Azure rules:
Reserved memory (Azure):

255 MiB of memory for machines with less than 1 GB of memory
25% of the first 4GB of memory
20% of the next 4GB of memory (up to 8GB)
10% of the next 8GB of memory (up to 16GB)
6% of the next 112GB of memory (up to 128GB)
2% of any memory above 128GB

Reserved CPU (Azure):

Cores    Reserved (millicores)
1        60
2        100
4        140
8        180
16       260
32       420
64       740

Additionally:
The hard eviction threshold is 100 MB on Google's and Amazon's offerings, and 750 MB on AKS.

[OSEv3:vars]

# Reference values for smaller nodes

openshift_node_kubelet_args={'pods-per-core': ['10'], 'max-pods': ['250'], 'image-gc-high-threshold': ['85'], 'image-gc-low-threshold': ['80'], 'system-reserved':['cpu=200m', 'memory=1G'], 'kube-reserved':['cpu=200m','memory=1G']}

# Reference values for larger nodes

openshift_node_kubelet_args={'pods-per-core': ['10'], 'max-pods': ['250'], 'image-gc-high-threshold': ['85'], 'image-gc-low-threshold': ['80'], 'system-reserved':['cpu=500m', 'memory=1G'], 'kube-reserved':['cpu=1','memory=2G']}
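
To see what these kubelet arguments mean for a given node, the sketch below computes the effective pod limit and the CPU left for pods. It assumes the documented kubelet behavior that the lower of max-pods and pods-per-core * cores applies, and that allocatable equals capacity minus kube-reserved and system-reserved; the function names are hypothetical and the eviction threshold is left out for brevity.

```python
def effective_max_pods(cores: int, pods_per_core: int = 10, max_pods: int = 250) -> int:
    """Pod limit the kubelet enforces when both settings are present."""
    if pods_per_core == 0:  # 0 disables the per-core limit
        return max_pods
    return min(max_pods, pods_per_core * cores)


def allocatable_millicores(capacity_cores: int, kube_reserved_m: int,
                           system_reserved_m: int) -> int:
    """CPU left for pods, in millicores, after both reservations."""
    return capacity_cores * 1000 - kube_reserved_m - system_reserved_m


if __name__ == "__main__":
    # 40-core node with the larger profile: kube-reserved cpu=1, system-reserved cpu=500m.
    print(effective_max_pods(40))                  # limited by max-pods
    print(effective_max_pods(8))                   # limited by pods-per-core
    print(allocatable_millicores(40, 1000, 500))   # millicores left for pods
```

On nodes with fewer than 25 cores, pods-per-core of 10 becomes the binding limit rather than max-pods of 250.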

16. Configure the certificate for the master console's public hostname and the certificate for the application Route domain

openshift_master_cluster_hostname=master.example.com
openshift_master_cluster_public_hostname=master_public.example.com
openshift_master_default_subdomain=apps.example.com
openshift_master_named_certificates=[{"certfile": "/data/cert/master_public.example.com.crt", "keyfile": "/data/cert/master_public.example.com.key", "names": ["master_public.example.com"], "cafile": "/data/cert/example.com_ca.crt"}]
openshift_master_overwrite_named_certificates=true
openshift_hosted_router_certificate={"certfile": "/data/cert/apps.example.com.crt", "keyfile": "/data/cert/apps.example.com.key", "cafile": "/data/cert/example.com_ca.crt"}

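Do not give these certificate files the same names as the Master components' default certificates, otherwise they will overwrite the self-signed certificates used between components.

You can also generate long-lived certificates by signing them yourself, as follows:

Create the root certificate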

$ openssl genrsa -out ca.key 2048
$ openssl req -new -x509 -days 36500 -key ca.key -out ca.crt -subj "/C=CN/ST=shanxi/L=taiyuan/O=cn/OU=test/CN=example.com"
$ # Or: openssl req -new -x509 -days 36500 -key ca.key -out ca.crt, entering the subject fields interactively

Create a certificate and sign it with the root certificate

$ openssl genrsa -out app.key 2048
$ openssl req -new -key app.key -out app.csr
$ openssl x509 -req -in app.csr -CA ca.crt -CAkey ca.key -out app.crt -days 3650  -CAcreateserial

Inspect the certificate with the openssl tool

$ openssl x509 -in app.crt -noout -dates
$ openssl x509 -in app.crt -noout -subject
$ openssl x509 -in app.crt -noout -text

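17. Enable automatic approval of certificate signing requests

In OpenShift 3.11, node certificates are valid for 1 year by default and are renewed automatically when they expire. During renewal the node sends a certificate signing request (CSR) to the cluster, and it can rejoin the cluster only after the request is approved.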

[OSEv3:vars]
openshift_master_bootstrap_auto_approve=true

Note: on a cluster that is already deployed, apply the setting by running the playbook:

$ ansible-playbook -vvv openshift-ansible/playbooks/openshift-master/enable_bootstrap.yml -e openshift_master_bootstrap_auto_approve=true

18. Set the Docker storage type and the extra disks for Docker and etcd in Ansible

[OSEv3:vars]
# Docker setup for extra disks on nodes
container_runtime_docker_storage_setup_device=/dev/vdb
container_runtime_docker_storage_type=overlay2
openshift_node_local_quota_per_fsgroup=512Mi

[masters:vars]
container_runtime_extra_storage=[{'device': '/dev/vdc', 'path': '/var/lib/origin/openshift.local.volumes', 'options': 'gquota', 'filesystem': 'xfs', 'format': 'True'}, {'device': '/dev/vdd', 'path': '/var/lib/etcd', 'hosts': 'masters', 'filesystem': 'xfs', 'format': 'True'}]

[nodes:vars]
container_runtime_extra_storage=[{'device': '/dev/vdc', 'path': '/var/lib/origin/openshift.local.volumes', 'options': 'gquota', 'filesystem': 'xfs', 'format': 'True'}]

19. Automatic log rotation

journal log rotation
Limit the size of /var/log/journal through /etc/systemd/journald.conf

$ cat /etc/systemd/journald.conf
[Journal]
Storage=persistent
Compress=yes
#Seal=yes
#SplitMode=uid
SyncIntervalSec=1s
RateLimitInterval=1s
RateLimitBurst=10000
SystemMaxUse=1G
SystemKeepFree=20%
SystemMaxFileSize=10M
#RuntimeMaxUse=
#RuntimeKeepFree=
#RuntimeMaxFileSize=
MaxRetentionSec=3days
MaxFileSec=1day
ForwardToSyslog=False
#ForwardToKMsg=no
#ForwardToConsole=no
ForwardToWall=False
#TTYPath=/dev/console
#MaxLevelStore=debug
#MaxLevelSyslog=debug
#MaxLevelKMsg=notice
#MaxLevelConsole=info
#MaxLevelWall=emerg
$ systemctl restart systemd-journald

Alternatively, update the following file at deployment time (OpenShift 3.9 and later):
roles/openshift_node/defaults/main.yml

...
journald_vars_to_replace:
- { var: Storage, val: persistent }
- { var: Compress, val: yes }
- { var: SyncIntervalSec, val: 1s }
- { var: RateLimitInterval, val: 1s }
- { var: RateLimitBurst, val: 10000 }
- { var: SystemMaxUse, val: 1G }
- { var: SystemKeepFree, val: 20% }
- { var: SystemMaxFileSize, val: 10M }
- { var: MaxRetentionSec, val: 3days }
- { var: MaxFileSec, val: 1day }
- { var: ForwardToSyslog, val: no }
- { var: ForwardToWall, val: no }
...

messages log rotation
Log only warning level and above to /var/log/messages, in /etc/rsyslog.conf

$ cat /etc/rsyslog.conf
*.warning;mail.none;authpriv.none;cron.none  /var/log/messages

Keep only the last three days of the messages log

$ cat /etc/logrotate.d/syslog
/var/log/cron
/var/log/messages
{
  daily
  rotate 3
  sharedscripts
  postrotate
     /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
  endscript
}

To let regular users read /var/log/messages, add the following rsyslog directive near the top of /etc/rsyslog.conf, before the rules, so that the file is created readable:

$umask 0000

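20. Periodically clean up exited containers, unused volumes and unused images on each host (needed when releases are very frequent)

$ cat /usr/bin/prune_docker.sh
#!/bin/bash
docker container prune -f # remove all stopped containers
docker volume prune -f # remove unused volumes
docker image prune -f # remove dangling images (add -a to remove every unused image)

Run it periodically as a cron job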

$ crontab -l
0 0 * * * /usr/bin/prune_docker.sh >> /var/log/prune_docker.log 2>&1

21. Periodically prune the private image registry (needed when releases are very frequent)

$ cat > /usr/bin/cleanregistry.sh <<EOF
#!/bin/bash
oc login -u admin -p password
oc adm prune builds --orphans --keep-complete=25 --keep-failed=5 --keep-younger-than=60m --confirm
oc adm prune deployments --orphans --keep-complete=25 --keep-failed=10 --keep-younger-than=60m --confirm
#oc rollout latest docker-registry -n default
#sleep 20
oc adm prune images --keep-younger-than=400m --confirm
EOF
$ crontab -l
0 0 * * * /usr/bin/cleanregistry.sh >> /var/log/cleanregistry.log 2>&1

22. Comment out DefaultIOAccounting in origin-accounting.conf

$ cat /etc/systemd/system.conf.d/origin-accounting.conf
[Manager]
DefaultCPUAccounting=yes
DefaultMemoryAccounting=yes
# systemd v230 or newer
# DefaultIOAccounting=yes
# Deprecated, remove in future
DefaultBlockIOAccounting=yes

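23. Plan the Pod and Service network ranges

The Service network range of the cluster
The Pod network range of the cluster
The per-host Pod subnet, sized according to each host's capacity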
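
The relationship between the cluster Pod range, the per-host subnet and the Service range can be checked before installation. The sketch below uses Python's standard ipaddress module; the function name is hypothetical, and the CIDR values in the usage example are illustrative (they match the usual OpenShift 3.x defaults for osm_cluster_network_cidr, osm_host_subnet_length and openshift_portal_net, which this article does not itself specify).

```python
import ipaddress


def plan_pod_network(cluster_cidr: str, host_subnet_length: int, service_cidr: str) -> dict:
    """Derive per-host subnet sizing and verify the two ranges do not overlap."""
    cluster = ipaddress.ip_network(cluster_cidr)
    service = ipaddress.ip_network(service_cidr)
    if cluster.overlaps(service):
        raise ValueError("Pod and Service ranges must not overlap")

    # host_subnet_length is the number of host bits given to each node.
    host_prefix = cluster.max_prefixlen - host_subnet_length
    if host_prefix < cluster.prefixlen:
        raise ValueError("per-host subnet is larger than the cluster range")

    return {
        "host_prefix": host_prefix,
        "max_hosts": 2 ** (host_prefix - cluster.prefixlen),
        "addresses_per_host": 2 ** host_subnet_length,
        "first_host_subnet": str(next(cluster.subnets(new_prefix=host_prefix))),
    }


if __name__ == "__main__":
    plan = plan_pod_network("10.128.0.0/14", 9, "172.30.0.0/16")
    for key, value in plan.items():
        print(f"{key}: {value}")
```

A per-host subnet should offer comfortably more addresses than the max-pods value configured on the node.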

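24. Tune the Router environment variables

ROUTER_THREADS: set to the number of CPU cores
ROUTER_MAX_CONNECTIONS: defaults to 20000

25. Set the Router's default 503 page (service not found)

Write the page HTML over /var/lib/haproxy/conf/error-page-503.http
See also: Customizing the OpenShift Router configuration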

26. Compute node tuning

MTU: typically 1450 on standard Ethernet and 8950 on jumbo-frame Ethernet
In the node configuration file:

networkConfig:
  mtu: 1450
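
Both MTU values are the NIC MTU minus the overlay encapsulation overhead. The sketch below makes that relationship explicit, assuming the 50-byte VXLAN overhead used by OpenShift SDN; the constant and function names are hypothetical.

```python
# Assumption: OpenShift SDN encapsulates pod traffic in VXLAN, which costs 50 bytes per packet.
VXLAN_OVERHEAD_BYTES = 50


def sdn_mtu(nic_mtu: int) -> int:
    """MTU to put in networkConfig.mtu for a given physical NIC MTU."""
    if nic_mtu <= VXLAN_OVERHEAD_BYTES:
        raise ValueError("NIC MTU is too small to carry the overlay")
    return nic_mtu - VXLAN_OVERHEAD_BYTES


if __name__ == "__main__":
    print(sdn_mtu(1500))  # standard Ethernet
    print(sdn_mtu(9000))  # jumbo frames
```

If the NICs use a different MTU, derive the node value the same way rather than copying 1450.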

Enable parallel image pulls to improve efficiency.
In the node configuration file:

kubeletArguments:
  serialize-image-pulls:
  - "false"

Container cleanup: let the kubelet remove exited containers automatically
In the node configuration file:

kubeletArguments:
  minimum-container-ttl-duration:
    - "10s"
  maximum-dead-containers-per-container:
    - "1"
  maximum-dead-containers:
    - "20"

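minimum-container-ttl-duration: the minimum age a container must reach before it can be garbage-collected. The default is 0, meaning no limit. The value takes a unit suffix, such as h for hours, m for minutes and s for seconds.
maximum-dead-containers-per-container: the number of dead instances to retain per pod container. The default is 1.
maximum-dead-containers: the maximum total number of dead containers on the node. The default is -1, meaning no limit.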

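27. Extend certificate validity (100 years)

The key steps are:

Set the expiry of the main OCP component certificates at deployment time, in /etc/ansible/hosts
Update every place in the deployment scripts that generates a certificate so that it uses the long expiry

For the detailed procedure, see the author's earlier article: How to extend component certificate validity when deploying OpenShift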
