k8s/kubeflow troubleshooting

Notes on problems encountered while deploying kubeflow. In the commands below, k appears to be a shell alias for kubectl.

ingress-nginx deployment failure

kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
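After running the command, the following error was reported:

time="2021-12-30T01:47:24Z" level=fatal msg="error creating kubernetes client config: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"

Solution:
Edit kube-apiserver.yaml and remove --disable-admission-plugins=ServiceAccount, wait for kube-apiserver to restart, then redeploy. With the ServiceAccount admission plugin disabled, pods do not get the service account token mounted, which is why the token file is missing.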

Pod fails to start: "cni0" already has an IP address

Warning  FailedCreatePodSandBox  35s (x137911 over 4d21h)  kubelet  (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "de753a2dbc417b7a05f99044d63aa285dac37beddc2bcbe806c4d3ff1772d027" network for pod "coredns-78fcd69978-h4lqn": networkPlugin cni failed to set up pod "coredns-78fcd69978-h4lqn_kube-system" network: failed to set bridge addr: "cni0" already has an IP address different from 10.63.0.1/24

Solution:

ifconfig cni0 down
ip link delete cni0
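Before deleting the bridge, it can help to confirm that its address really differs from the one the CNI plugin expects. The following is a minimal Python sketch, not part of the original fix. The function name and the 10.244.0.1/24 value are made up for illustration, and 10.63.0.1/24 is taken from the event above.

```python
import ipaddress


def bridge_addr_conflicts(current: str, expected: str) -> bool:
    """Return True when the bridge's current CIDR differs from the expected one."""
    # ip_interface keeps both the host address and the prefix length,
    # so 10.63.0.1/16 and 10.63.0.1/24 are treated as different.
    return ipaddress.ip_interface(current) != ipaddress.ip_interface(expected)


# 10.244.0.1/24 is a hypothetical stale address left on cni0.
print(bridge_addr_conflicts("10.244.0.1/24", "10.63.0.1/24"))
```

The current address would come from something like ip addr show cni0 on the affected node.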

Installing MySQL fails with the following error: `chown: changing ownership of ‘/var/lib/postgresql/data’: Operation not permitted`

Solution:

# The PV is backed by NFS
vim /etc/exports
# Add no_root_squash so that the entry looks like this
/nfs 10.1.55.0/24(rw,sync,no_subtree_check,no_root_squash)

# Restart the NFS server
/etc/init.d/nfs-kernel-server restart

# Edit the deployment and add readOnly: false
volumeMounts:
- name: mysql-storage
  mountPath: /var/lib/mysql
  readOnly: false
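The chown fails because an NFS export squashes root to an unprivileged user unless no_root_squash is set. As a rough check, the sketch below parses a single /etc/exports line and reports whether any client is granted that option. The function name is made up here and the parsing is deliberately simplified: it assumes the common "path client(options)" layout.

```python
def export_allows_root(exports_line: str) -> bool:
    """Return True if an /etc/exports line grants no_root_squash to some client."""
    # Drop trailing comments, then split into "path client(opts) client(opts) ...".
    fields = exports_line.split("#", 1)[0].split()
    for client in fields[1:]:
        if "(" not in client:
            continue
        options = client[client.index("(") + 1:].rstrip(")").split(",")
        if "no_root_squash" in options:
            return True
    return False


print(export_allows_root("/nfs 10.1.55.0/24(rw,sync,no_subtree_check,no_root_squash)"))
```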

kubeflow/cache-server fails to start: secret "webhook-server-tls" not found

MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found

Solution:

Get the other pods into the Running state first, then restart this Pod. That resolved the problem.

After logging in to the kubeflow web UI, creating a Notebook Server fails with the following error

No default Storage Class is set. Can't create new Disks for the new Notebook. Please use an Existing Disk.

Solution:

# A storageclass is already configured in the cluster, so simply mark it as the default storageclass
kubectl patch storageclass managed-nfs-storage -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
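The -p argument is just a JSON document that sets one annotation. The sketch below builds the same payload in Python, which is handy when the patch is generated by a script. The helper name is an assumption for illustration. Passing False yields the payload that would unmark a class that was previously the default.

```python
import json

DEFAULT_CLASS_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def default_class_patch(is_default: bool) -> str:
    """Build the JSON body for kubectl patch storageclass <name> -p '<body>'."""
    # Annotation values must be strings, hence "true"/"false" rather than booleans.
    value = "true" if is_default else "false"
    return json.dumps({"metadata": {"annotations": {DEFAULT_CLASS_ANNOTATION: value}}})


print(default_class_patch(True))
```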

Creating a notebook or a volume fails

kubeflow 403 could not find csrf cookie xsrf-token in the request

Solution:

k edit deploy jupyter-web-app-deployment -n kubeflow

# Add the following environment variable
- name: APP_SECURE_COOKIES
  value: "false"

k edit deploy volumes-web-app-deployment -n kubeflow

# Add the following environment variable
- name: APP_SECURE_COOKIES
  value: "false"

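The Namespace creation page does not appear on first login to kubeflow

Solution:

k edit deploy centraldashboard -n kubeflow

# Change to the following
- name: REGISTRATION_FLOW
  value: "true"

Alternatively, set this before deploying by editing the file apps/centraldashboard/upstream/base/params.env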

CD_CLUSTER_DOMAIN=cluster.local
CD_USERID_HEADER=kubeflow-userid
CD_USERID_PREFIX=
CD_REGISTRATION_FLOW=true

coredns fails to start

root@ai-10-1-55-12-dev-sz:~# k logs coredns-6c76c8bb89-g98cq -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[FATAL] plugin/loop: Loop (127.0.0.1:40293 -> :53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 4247289002016518455.6908858903734866660."

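Solution: remove the local address nameserver 127.0.0.53 from /etc/resolv.conf. CoreDNS forwards upstream queries to the nameservers listed in that file, so a loopback entry makes it send queries back to itself, which the loop plugin detects.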
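To spot such entries on a node, a small script can list every loopback nameserver in a resolv.conf file. This is a minimal sketch with a made-up function name. It only looks at nameserver lines and ignores everything else.

```python
import ipaddress
from typing import List


def loopback_nameservers(resolv_conf: str) -> List[str]:
    """Return the nameserver addresses in resolv.conf text that are loopback."""
    found = []
    for line in resolv_conf.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "nameserver":
            try:
                if ipaddress.ip_address(parts[1]).is_loopback:
                    found.append(parts[1])
            except ValueError:
                # Not a plain IP address; skip it.
                continue
    return found


# 10.1.55.1 is a hypothetical upstream resolver used only for this example.
sample = "nameserver 127.0.0.53\nnameserver 10.1.55.1\noptions edns0\n"
print(loopback_nameservers(sample))
```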

After installing kubeflow, all traffic entering through istio-ingressgateway is redirected to /auth/dex, which causes many problems.

Solution:

k get envoyfilter -n istio-system
k delete envoyfilter authn-filter -n istio-system

k edit cm istio -n istio-system
# Add the following, save, then restart istiod
data:
  mesh: |-
    extensionProviders:
    - name: "dex-auth-provider"
      envoyExtAuthzHttp:
        service: "authservice.istio-system.svc.cluster.local"
        port: "8080" # The default port used by oauth2-proxy.
        includeHeadersInCheck: ["authorization", "cookie", "x-auth-token"] # headers sent to the oauth2-proxy in the check request.
        headersToUpstreamOnAllow: ["kubeflow-userid"] # headers sent to backend application when request is allowed.

# Create an AuthorizationPolicy in kf-ap.yaml, then run k apply -f kf-ap.yaml

apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: dex-auth
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: CUSTOM
  provider:
    # The provider name must match the extension provider defined in the mesh config.
    name: dex-auth-provider
  rules:
  # The rules specify when to trigger the external authorizer.
  - to:
    - operation:
        hosts: ["kubeflow.xxx.com"]

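Accessing the harbor dashboard through ingress-nginx worked; after switching to istio-ingressgateway, logging in to the dashboard returned 403

2022-01-21T06:43:14Z [DEBUG] [/lib/http/error.go:59]: {"errors":[{"code":"FORBIDDEN","message":"CSRF token invalid"}]}

At first I suspected a restriction in istio-ingressgateway, so I checked every AuthorizationPolicy and found that traffic to istio-ingressgateway is allowed by default. Looking at harbor-core then showed that the traffic had already reached the cluster. My hunch was that it was related to https, and reinstalling harbor with an http externalURL fixed the problem.
Solution:

helm uninstall harbor -n harbor
# Switch externalURL from https to http
helm upgrade --cleanup-on-fail --install harbor . --namespace=harbor --set externalURL=http://tharbor.fiture.com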

Istio containers stay stuck in creation. MountVolume.SetUp failed for volume "istio-token": failed to fetch token: the API server does not have TokenRequest endpoints enabled

Solution:

vim /etc/kubernetes/manifests/kube-apiserver.yaml
# Add the following
- --service-account-key-file=/etc/kubernetes/pki/sa.pub
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
- --service-account-issuer=api
- --service-account-api-audiences=api,vault,factors

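docker login fails: Error response from daemon: Get "https://tharbor.fiture.com/v2/": dial tcp 10.1.55.16:443: connect: connection refused

Solution:

# Add insecure-registries
vim /etc/docker/daemon.json

{
  "insecure-registries": ["http://harbor.com"]
}

# Restart docker so that the change takes effect
systemctl restart docker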
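daemon.json must be valid JSON and may already hold other settings, so editing it by hand is easy to get wrong. The sketch below adds a registry to the insecure-registries list without touching other keys. The function name and the log-driver entry are assumptions used only for illustration.

```python
import json


def add_insecure_registry(daemon_json: str, registry: str) -> str:
    """Return daemon.json text with the registry added to insecure-registries."""
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(daemon_json) if daemon_json.strip() else {}
    registries = config.setdefault("insecure-registries", [])
    if registry not in registries:
        registries.append(registry)
    return json.dumps(config, indent=2)


# "log-driver" stands in for whatever settings the file already has.
print(add_insecure_registry('{"log-driver": "json-file"}', "harbor.com"))
```

Calling it twice with the same registry leaves the list unchanged, so it is safe to rerun.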

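The kubeflow dashboard shows a blank page

Solution:

k edit vs dex -n auth
# Change host to the corresponding domain name

After logging in to kubeflow, clicking a left-hand menu item shows "Sorry, /jupyter/ is not a valid page"

Solution: the cause is that vs dex and vs centraldashboard were both configured with host: kubeflow.xx.com. Removing host from both fixes it.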

Creating a notebook reports Insufficient nvidia.com/gpu.

Solution:

# Install nvidia-device-plugin

k apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml

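When the mounted directory contains too many files, the pod fails to start and the kubeflow notebook fails to start

Solution:

# 1. Cause on the istio side
k edit deploy istiod -n istio-system
# Add the following environment variable
- name: ENABLE_LEGACY_FSGROUP_INJECTION
  value: "false"

# 2. Cause on the kubeflow side
k edit deploy notebook-controller-deployment -n kubeflow
# Add the following environment variable
- name: ADD_FSGROUP
  value: "false"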

Cephfs mount error

Solution:

# Install ceph-common on every node
apt install ceph-common

# Base64-encode the ceph key
https://github.com/kubernetes-retired/kube-deploy/issues/264#issuecomment-292815926
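Values under data in a Kubernetes Secret must be base64-encoded, so the ceph key is encoded once more even though it already looks like base64. A minimal sketch of that step follows. The function name is made up, and the sample key is an obvious placeholder rather than a real key.

```python
import base64


def encode_ceph_key(key: str) -> str:
    """Base64-encode a ceph key for use as a Secret data value."""
    # Strip the trailing newline that command output usually carries,
    # otherwise it would become part of the encoded secret.
    return base64.b64encode(key.strip().encode("ascii")).decode("ascii")


# "PLACEHOLDER-CEPH-KEY" is a stand-in for the real key.
print(encode_ceph_key("PLACEHOLDER-CEPH-KEY\n"))
```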

Clicking tensorboard returns 404

Solution: delete the tensorboard server and create a new one. Its name must not be the same as the notebook name.

metrics-server not ready: k8s metrics server x509: cannot validate certificate for because it doesn't contain any IP SANs

Solution:

# Add --kubelet-insecure-tls to the Deployment's startup arguments
containers:
- args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls

[Docker] Error response from daemon: could not select device driver "" with capabilities: [[gpu]]

Solution:

apt-get install nvidia-container-runtime
systemctl restart docker

ping gitlab.xxx.com occasionally fails with Name or service not known.

Solution:

# Add policy sequential
k edit cm coredns -n kube-system
        forward . /etc/resolv.conf {
          max_concurrent 1000
          policy sequential
        }

# Then delete the coredns pods
