k8s/kubeflow troubleshooting

Notes on the problems I ran into while deploying kubeflow.

ingress-nginx deployment failure

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
After running the command, the controller logs the following error:

time="2021-12-30T01:47:24Z" level=fatal msg="error creating kubernetes client config: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"

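Solution:
Edit kube-apiserver.yaml and remove --disable-admission-plugins=ServiceAccount, wait for kube-apiserver to restart, then redeploy. With the ServiceAccount admission plugin disabled, pods do not get a service account token mounted, which is why the controller cannot find /var/run/secrets/kubernetes.io/serviceaccount/token.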
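
To confirm the flag is really gone before redeploying, a small check along these lines can help. It is only a sketch: the function name sa_admission_disabled is made up for illustration, and the manifest path is passed in as an argument because it depends on how the cluster was installed (kubeadm, for example, keeps it at /etc/kubernetes/manifests/kube-apiserver.yaml).

```shell
# Hypothetical helper: prints the --disable-admission-plugins line if it
# still lists ServiceAccount. Returns 0 when the fix is still needed and
# non-zero when ServiceAccount is not disabled.
sa_admission_disabled() {
    manifest="$1"
    # "--" stops option parsing so the pattern may start with dashes.
    grep -E -- '--disable-admission-plugins=([^[:space:]]*,)?ServiceAccount(,|[[:space:]]|$)' "$manifest"
}
```

Calling sa_admission_disabled with the manifest path prints the offending line while ServiceAccount is still disabled, and prints nothing once the flag has been removed.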

Pod fails to start: "cni0" already has an IP address
Warning  FailedCreatePodSandBox  35s (x137911 over 4d21h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "de753a2dbc417b7a05f99044d63aa285dac37beddc2bcbe806c4d3ff1772d027" network for pod "coredns-78fcd69978-h4lqn": networkPlugin cni failed to set up pod "coredns-78fcd69978-h4lqn_kube-system" network: failed to set bridge addr: "cni0" already has an IP address different from 10.63.0.1/24

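Solution:

# Take the stale bridge down and delete it; the CNI plugin should recreate it with the expected address
ifconfig cni0 down
ip link delete cni0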

Installing MySQL fails with: chown: changing ownership of ‘/var/lib/postgresql/data’: Operation not permitted

Solution:

# The PV is backed by NFS
vim /etc/exports
# Add no_root_squash so that the entry reads as follows
/nfs 10.1.55.0/24(rw,sync,no_subtree_check,no_root_squash)

# Restart the NFS server
/etc/init.d/nfs-kernel-server restart

# Edit the Deployment and add readOnly: false
volumeMounts:
- name: mysql-storage
  mountPath: /var/lib/mysql
  readOnly: false
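
On an NFS server with several exports it is easy to patch the wrong line. The following is a minimal sketch for listing the entries that still lack no_root_squash; the function name exports_without_no_root_squash is hypothetical, and the exports file is passed as an argument so the check can also be run against a copy.

```shell
# Hypothetical helper: prints every export entry that does not carry
# the no_root_squash option.
exports_without_no_root_squash() {
    exports_file="$1"
    # First drop comments and blank lines, then keep entries lacking the option.
    grep -Ev '^[[:space:]]*(#|$)' "$exports_file" | grep -v 'no_root_squash'
}
```

Running exports_without_no_root_squash /etc/exports on the NFS server should print nothing once every export used by the cluster has been updated.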

kubeflow/cache-server fails to start: secret "webhook-server-tls" not found
MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found

Solution:

Get the other pods into the Running state first, then restart the cache-server Pod. After that the problem went away.

Creating a Notebook Server from the kubeflow web UI fails with the following error
No default Storage Class is set. Can't create new Disks for the new Notebook. Please use an Existing Disk.

Solution:

# The cluster already has a storageclass configured, so it is enough to mark that storageclass as the default
kubectl patch storageclass managed-nfs-storage -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Creating a notebook or a volume fails
kubeflow 403 could not find csrf cookie xsrf-token in the request

Solution:

# k is assumed here to be a shell alias for kubectl
k edit deploy jupyter-web-app-deployment -n kubeflow

# Add the following environment variable
- name: APP_SECURE_COOKIES
  value: "false"


k edit deploy volumes-web-app-deployment -n kubeflow

# Add the following environment variable
- name: APP_SECURE_COOKIES
  value: "false"

The Namespace creation page does not appear on first login to kubeflow

Solution:

k edit deploy centraldashboard -n kubeflow

# Change it to the following
- name: REGISTRATION_FLOW
  value: "true"

Alternatively, before deploying, edit the file apps/centraldashboard/upstream/base/params.env:

CD_CLUSTER_DOMAIN=cluster.local
CD_USERID_HEADER=kubeflow-userid
CD_USERID_PREFIX=
CD_REGISTRATION_FLOW=true

coredns fails to start
root@ai-10-1-55-12-dev-sz:~# k logs coredns-6c76c8bb89-g98cq -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[FATAL] plugin/loop: Loop (127.0.0.1:40293 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4247289002016518455.6908858903734866660."

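Solution: remove the loopback entry nameserver 127.0.0.53 from /etc/resolv.conf on the node. CoreDNS forwards to the nameservers in that file, so a loopback address sends its queries straight back to itself.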
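
To see which nodes are affected, the loopback entries can be listed with a short check. This is a sketch only: loopback_nameservers is a hypothetical helper name, and it treats any 127.x.x.x address or ::1 as loopback.

```shell
# Hypothetical helper: prints every loopback nameserver found in a
# resolv.conf-style file (defaults to /etc/resolv.conf).
loopback_nameservers() {
    resolv_conf="${1:-/etc/resolv.conf}"
    awk '$1 == "nameserver" && ($2 ~ /^127\./ || $2 == "::1") { print $2 }' "$resolv_conf"
}
```

If loopback_nameservers prints anything on a node, CoreDNS pods scheduled there are likely to hit the same loop error.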

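After installing kubeflow, all traffic entering through istio-ingressgateway is redirected to /auth/dex, including traffic for hosts that have nothing to do with kubeflow, which causes a lot of problems.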

Solution:

k get envoyfilter -n istio-system
k delete envoyfilter authn-filter -n istio-system

k edit cm istio -n istio-system
# Add the following, save, then restart istiod
data:
  mesh: |-
    extensionProviders:
    - name: "dex-auth-provider"
      envoyExtAuthzHttp:
        service: "authservice.istio-system.svc.cluster.local"
        port: "8080" # port of the authservice Service
        includeHeadersInCheck: ["authorization", "cookie", "x-auth-token"] # headers sent to authservice in the check request.
        headersToUpstreamOnAllow: ["kubeflow-userid"] # headers sent to backend application when request is allowed.

# Create an AuthorizationPolicy in kf-ap.yaml, then run: k apply -f kf-ap.yaml

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: dex-auth
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: CUSTOM
  provider:
    # The provider name must match the extension provider defined in the mesh config.
    name: dex-auth-provider
  rules:
  # The rules specify when to trigger the external authorizer.
  - to:
    - operation:
        hosts: ["kubeflow.xxx.com"]

The harbor dashboard worked through ingress-nginx, but after switching to istio-ingressgateway, logging in to the dashboard returns 403
2022-01-21T06:43:14Z [DEBUG] [/lib/http/error.go:59]: {"errors":[{"code":"FORBIDDEN","message":"CSRF token invalid"}]}

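At first I assumed istio-ingressgateway was blocking the request, so I checked every AuthorizationPolicy and found that traffic through istio-ingressgateway is allowed by default. The harbor-core logs then showed that the traffic was already reaching the cluster, so my hunch was that it had something to do with https. Reinstalling harbor with an http externalURL solved the problem.
Solution:

helm uninstall harbor -n harbor
# https -> http
helm upgrade --cleanup-on-fail --install harbor . --namespace=harbor --set externalURL=http://tharbor.fiture.com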

Istio containers stay stuck in creation. MountVolume.SetUp failed for volume "istio-token": failed to fetch token: the API server does not have TokenRequest endpoints enabled

Solution:

vim /etc/manifests/kube-apiserver.yaml
# Add the following flags to the kube-apiserver command
- --service-account-key-file=/etc/kubernetes/pki/sa.pub
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
- --service-account-issuer=api
- --service-account-api-audiences=api,vault,factors
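
Before restarting anything, it can be worth checking which of these flags the manifest already has. The sketch below is an illustration only: missing_apiserver_flags is a hypothetical helper, and it checks just the key file, signing key file, and issuer flags from the list above.

```shell
# Hypothetical helper: prints each service-account flag from the list
# above that does not appear in the given kube-apiserver manifest.
missing_apiserver_flags() {
    manifest="$1"
    for flag in service-account-key-file \
                service-account-signing-key-file \
                service-account-issuer; do
        # "--" stops option parsing so the pattern may start with dashes.
        grep -q -- "--${flag}=" "$manifest" || echo "--${flag}"
    done
}
```

An empty result from missing_apiserver_flags, given the manifest path, means all three flags are present; any printed flag still has to be added.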

docker login fails: Error response from daemon: Get "https://tharbor.fiture.com/v2/": dial tcp 10.1.55.16:443: connect: connection refused

Solution:

# Add the registry to insecure-registries, then restart docker
vim /etc/docker/daemon.json

{
  "insecure-registries": ["http://harbor.com"]
}

The kubeflow dashboard shows a blank page

Solution:

k edit vs dex -n auth
# Change host to the matching domain name

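After logging in to kubeflow, clicking an item in the left-hand menu shows: Sorry, /jupyter/ is not a valid page

Solution: the cause is that the dex and centraldashboard VirtualServices (vs) were both configured with host: kubeflow.xx.com. Removing the host from both fixes it.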

Creating a notebook reports Insufficient nvidia.com/gpu.

Solution:

# Install nvidia-device-plugin
k apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

A pod (kubeflow notebook) fails to start when the mounted directory contains too many files

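Solution:

# As I understand it, when an fsGroup is set the group ownership of the volume's files is changed recursively at mount time,
# which can time out on large directories. The two settings below stop the fsGroup from being injected.

# 1. On the istio side
k edit deploy istiod -n istio-system
# Add the following environment variable
- name: ENABLE_LEGACY_FSGROUP_INJECTION
  value: "false"

# 2. On the kubeflow side
k edit deploy notebook-controller-deployment -n kubeflow
# Add the following environment variable
- name: ADD_FSGROUP
  value: "false"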

Cephfs mount error

Solution:

# Install ceph-common on every node
apt install ceph-common

# Base64-encode the ceph key (this is encoding, not encryption)
https://github.com/kubernetes-retired/kube-deploy/issues/264#issuecomment-292815926
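
The linked comment comes down to storing the key base64-encoded, which is the form Kubernetes expects for values under a Secret's data field. A minimal sketch, assuming the key is already available as a string (for example from ceph auth get-key client.admin); encode_ceph_key is a hypothetical helper name:

```shell
# Hypothetical helper: base64-encodes a ceph key for use in a Secret.
encode_ceph_key() {
    # printf avoids the trailing newline that echo would append to the key;
    # tr removes the line wrapping some base64 implementations add.
    printf '%s' "$1" | base64 | tr -d '\n'
}
```

A stray newline in the encoded value is a common cause of authentication failures at mount time, which is why the sketch uses printf rather than echo.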

Opening tensorboard returns 404

Solution: delete the tensorboard server and create a new one. Its name must not be the same as the notebook's name.

metrics-server not ready: x509: cannot validate certificate for because it doesn’t contain any IP SANs

Solution:

# Add --kubelet-insecure-tls to the Deployment's startup arguments
containers:
- args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls

[Docker] Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Solution:

apt-get install nvidia-container-runtime
systemctl restart docker

ping gitlab.xxx.com occasionally fails with Name or service not known.

Solution:

# Add policy sequential so the upstream nameservers are tried in the order listed instead of at random
k edit cm coredns -n kube-system
forward . /etc/resolv.conf {
    max_concurrent 1000
    policy sequential
}

# Then delete the coredns pods