Notes on problems encountered while deploying Kubeflow.
ingress-nginx deployment failure
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v1.1.0/deploy/static/provider/cloud/deploy.yaml
After running the command, the following error appeared:
time="2021-12-30T01:47:24Z" level=fatal msg="error creating kubernetes client config: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory"
Solution:
Edit kube-apiserver.yaml and remove --disable-admission-plugins=ServiceAccount, wait for kube-apiserver to restart, then redeploy.
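To confirm the flag is really gone before redeploying, a small check along these lines may help. It is only a sketch: sa_admission_disabled is a hypothetical helper name, and the demo runs against a throwaway file rather than the live manifest, whose location depends on how the cluster was installed.

```shell
# Hypothetical helper: exits 0 if the given manifest still disables the
# ServiceAccount admission plugin (alone or inside a comma-separated list).
sa_admission_disabled() {
    grep -Eq -- '--disable-admission-plugins=([^[:space:]]*,)?ServiceAccount(,|[[:space:]]|$)' "$1"
}

# Demo on a temporary file, not on the real kube-apiserver.yaml.
demo=$(mktemp)
printf '%s\n' '    - --disable-admission-plugins=ServiceAccount' > "$demo"
if sa_admission_disabled "$demo"; then
    echo "ServiceAccount admission plugin is still disabled"
fi
rm -f "$demo"
```

On a control-plane node the function would be pointed at the actual kube-apiserver.yaml instead of the temporary file.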
Pod fails to start: "cni0" already has an IP address
Warning FailedCreatePodSandBox 35s (x137911 over 4d21h) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "de753a2dbc417b7a05f99044d63aa285dac37beddc2bcbe806c4d3ff1772d027" network for pod "coredns-78fcd69978-h4lqn": networkPlugin cni failed to set up pod "coredns-78fcd69978-h4lqn_kube-system" network: failed to set bridge addr: "cni0" already has an IP address different from 10.63.0.1/24
Solution:
ifconfig cni0 down
ip link delete cni0
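Before deleting the bridge, it can be useful to see which address cni0 actually holds and compare it with the one named in the error. The sketch below parses a line in the format printed by `ip -o -4 addr show dev cni0`; bridge_ipv4 is a hypothetical helper and the sample line uses made-up values.

```shell
# Hypothetical helper: read one-line `ip -o -4 addr show` output on stdin
# and print the address/prefix, which is the 4th whitespace-separated field.
bridge_ipv4() {
    awk '$3 == "inet" { print $4 }'
}

# Made-up sample line; on a node you would pipe the real command instead:
#   ip -o -4 addr show dev cni0 | bridge_ipv4
sample='5: cni0    inet 10.63.1.1/24 brd 10.63.1.255 scope global cni0'
actual=$(printf '%s\n' "$sample" | bridge_ipv4)
expected='10.63.0.1/24'   # the address named in the kubelet error

if [ "$actual" != "$expected" ]; then
    echo "cni0 has $actual but the CNI plugin expects $expected"
fi
```

After the stale bridge is removed, the CNI plugin normally recreates it with the expected address the next time a pod sandbox is set up on that node.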
Installing MySQL fails with: chown: changing ownership of '/var/lib/postgresql/data': Operation not permitted
Solution:
# The PV is backed by NFS
vim /etc/exports
# Add no_root_squash so that the entry reads as follows
/nfs 10.1.55.0/24(rw,sync,no_subtree_check,no_root_squash)
# Restart the NFS server
/etc/init.d/nfs-kernel-server restart
# Edit the Deployment and add readOnly: false
volumeMounts:
- name: mysql-storage
  mountPath: /var/lib/mysql
  readOnly: false
kubeflow/cache-server fails to start: secret "webhook-server-tls" not found
MountVolume.SetUp failed for volume "webhook-tls-certs" : secret "webhook-server-tls" not found
Solution:
Once the other pods had reached the Running state, restarting the pod resolved the problem.
After logging in to the Kubeflow web UI, creating a Notebook Server fails with the following error
No default Storage Class is set. Can't create new Disks for the new Notebook. Please use an Existing Disk.
Solution:
# A StorageClass is already configured in the cluster, so it is enough to mark it as the default StorageClass
kubectl patch storageclass managed-nfs-storage -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
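If another StorageClass is already marked as default, it may be worth unsetting it so that only one default remains. The sketch below builds the annotation patch for either case; default_sc_patch is a hypothetical helper, old-default is a placeholder name, and the kubectl invocations are left as comments because they need a live cluster.

```shell
# Hypothetical helper: print the patch that sets the default-class
# annotation to the given value ("true" or "false").
default_sc_patch() {
    printf '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"%s"}}}' "$1"
}

# Show the patch that marks a class as default.
default_sc_patch true
echo

# Against a cluster:
#   kubectl patch storageclass managed-nfs-storage -p "$(default_sc_patch true)"
#   kubectl patch storageclass old-default -p "$(default_sc_patch false)"
#   kubectl get storageclass    # the default class is listed with "(default)"
```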
Creating a notebook fails; creating a volume fails
kubeflow 403 could not find csrf cookie xsrf-token in the request
Solution:
k edit deploy jupyter-web-app-deployment -n kubeflow
# Add the following environment variable
- name: APP_SECURE_COOKIES
  value: "false"
k edit deploy volumes-web-app-deployment -n kubeflow
# Add the following environment variable
- name: APP_SECURE_COOKIES
  value: "false"
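The same change can be made without opening an editor by using kubectl set env. This is a sketch, assuming both Deployments live in the kubeflow namespace as in the commands above; disable_secure_cookies is a hypothetical wrapper that is only defined here, not run.

```shell
# Hypothetical wrapper: set APP_SECURE_COOKIES=false on both web apps.
# Each call triggers a rollout of the affected Deployment.
disable_secure_cookies() {
    for d in jupyter-web-app-deployment volumes-web-app-deployment; do
        kubectl -n kubeflow set env "deployment/$d" APP_SECURE_COOKIES=false
    done
}

# Run it against a cluster with:
#   disable_secure_cookies
```

Turning secure cookies off is generally only appropriate when the UI is served over plain HTTP; behind HTTPS the default is usually the better choice.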
The namespace creation page does not appear on first login to Kubeflow
Solution:
k edit deploy centraldashboard -n kubeflow
# Change it as follows
- name: REGISTRATION_FLOW
  value: "true"
Alternatively, the file apps/centraldashboard/upstream/base/params.env can be edited before deploying:
CD_CLUSTER_DOMAIN=cluster.local
CD_USERID_HEADER=kubeflow-userid
CD_USERID_PREFIX=
CD_REGISTRATION_FLOW=true
CoreDNS fails at startup
root@ai-10-1-55-12-dev-sz:~# k logs coredns-6c76c8bb89-g98cq -n kube-system
Solution: remove the local stub address, nameserver 127.0.0.53, from /etc/resolv.conf.
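To preview what the file would look like without loopback resolvers, a filter like the following can be run on a copy first. It is a sketch only: strip_loopback_nameservers is a hypothetical helper, and 10.1.55.1 is a made-up upstream address used for the demo.

```shell
# Hypothetical helper: print a resolv.conf with every nameserver entry
# in 127.0.0.0/8 removed. The input file is not modified.
strip_loopback_nameservers() {
    grep -Ev '^[[:space:]]*nameserver[[:space:]]+127\.' "$1"
}

# Demo on a temporary file instead of the real /etc/resolv.conf.
demo=$(mktemp)
printf '%s\n' 'nameserver 127.0.0.53' 'nameserver 10.1.55.1' 'options edns0' > "$demo"
strip_loopback_nameservers "$demo"
rm -f "$demo"
```

On hosts running systemd-resolved, /etc/resolv.conf is often a managed symlink, so manual edits may be overwritten. Pointing the kubelet at a different file through its resolvConf setting is another commonly used option.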
After installing Kubeflow, all traffic entering through istio-ingressgateway is redirected to /auth/dex, which causes many problems.
Solution:
k get envoyfilter -n istio-system
k delete envoyfilter authn-filter -n istio-system
k edit cm istio -n istio-system
# Add the following, save, then restart istiod
data:
  mesh: |-
    extensionProviders:
    - name: "dex-auth-provider"
      envoyExtAuthzHttp:
        service: "authservice.istio-system.svc.cluster.local"
        port: "8080" # The default port used by oauth2-proxy.
        includeHeadersInCheck: ["authorization", "cookie", "x-auth-token"] # headers sent to the oauth2-proxy in the check request.
        headersToUpstreamOnAllow: ["kubeflow-userid"] # headers sent to backend application when request is allowed.
# Create an AuthorizationPolicy in kf-ap.yaml, then run: k apply -f kf-ap.yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: dex-auth
  namespace: istio-system
spec:
  selector:
    matchLabels:
      istio: ingressgateway
  action: CUSTOM
  provider:
    # The provider name must match the extension provider defined in the mesh config.
    name: dex-auth-provider
  rules:
  # The rules specify when to trigger the external authorizer.
  - to:
    - operation:
        hosts: ["kubeflow.xxx.com"]
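Accessing the Harbor dashboard through ingress-nginx worked, but after switching to istio-ingressgateway, logging in to the dashboard returns 403
2022-01-21T06:43:14Z [DEBUG] [/lib/http/error.go:59]: {"errors":[{"code":"FORBIDDEN","message":"CSRF token invalid"}]}
At first I suspected a restriction in istio-ingressgateway, so I checked every AuthorizationPolicy and found that traffic through istio-ingressgateway is allowed by default. The harbor-core logs then showed that the traffic had already reached the cluster. My hunch was that the problem was related to HTTPS, and reinstalling Harbor with an HTTP external URL fixed it.
Solution:
helm uninstall harbor -n harbor
# https -> http
helm upgrade --cleanup-on-fail --install harbor . --namespace=harbor --set externalURL=http://tharbor.fiture.com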
Istio containers stay stuck in creation. MountVolume.SetUp failed for volume "istio-token": failed to fetch token: the API server does not have TokenRequest endpoints enabled
Solution:
vim /etc/manifests/kube-apiserver.yaml
# Add the following flags
- --service-account-key-file=/etc/kubernetes/pki/sa.pub
- --service-account-signing-key-file=/etc/kubernetes/pki/sa.key
- --service-account-issuer=api
- --service-account-api-audiences=api,vault,factors
docker login fails: Error response from daemon: Get "https://tharbor.fiture.com/v2/": dial tcp 10.1.55.16:443: connect: connection refused
Solution:
Add insecure-registries:
vim /etc/docker/daemon.json
"insecure-registries": ["http://harbor.com"]
The Kubeflow dashboard shows a blank page
Solution:
k edit vs dex -n auth
# Change the host to the corresponding domain name
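After logging in to Kubeflow, clicking an item in the left-hand menu shows: Sorry, /jupyter/ is not a valid page
Solution: the cause was that both the dex and centraldashboard VirtualServices had host: kubeflow.xx.com configured. Removing the host from both fixes it.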
Creating a notebook reports Insufficient nvidia.com/gpu.
Solution:
# Install nvidia-device-plugin
k apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
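Pod fails to start when the mounted directory contains too many files; the Kubeflow notebook fails to start
Solution:
# 1. Cause on the Istio side
k edit deploy istiod -n istio-system
# Add the following environment variable
- name: ENABLE_LEGACY_FSGROUP_INJECTION
  value: "false"
# 2. Cause on the Kubeflow side
k edit deploy notebook-controller-deployment -n kubeflow
# Add the following environment variable
- name: ADD_FSGROUP
  value: "false"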
Cephfs mount error
Solution:
# Install ceph-common on every node
apt install ceph-common
# Base64-encode the Ceph key, see:
https://github.com/kubernetes-retired/kube-deploy/issues/264#issuecomment-292815926
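When encoding the key by hand, a trailing newline is an easy mistake: echo appends one, and it ends up inside the encoded value. A minimal sketch, with encode_ceph_key as a hypothetical helper and an obviously fake key:

```shell
# Hypothetical helper: base64-encode a key without adding a trailing
# newline to the input, and without line wrapping in the output.
encode_ceph_key() {
    printf '%s' "$1" | base64 | tr -d '\n'
}

# Fake key for illustration only.
encode_ceph_key 'not-a-real-key'
echo

# On a Ceph admin node the real key could be fed in like this
# (client.admin is an assumption; use the client your cluster mounts with):
#   encode_ceph_key "$(ceph auth get-key client.admin)"
```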
Clicking TensorBoard returns 404
Solution: delete the TensorBoard server and create a new one. Its name must not be the same as the notebook's name.
metrics-server not ready: k8s metrics server x509: cannot validate certificate for because it doesn't contain any IP SANs
Solution:
# Add --kubelet-insecure-tls to the Deployment's startup arguments
containers:
- args:
  - --cert-dir=/tmp
  - --secure-port=4443
  - --kubelet-preferred-address-types=InternalIP,ExternalIP,Hostname
  - --kubelet-use-node-status-port
  - --metric-resolution=15s
  - --kubelet-insecure-tls
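Instead of editing the Deployment by hand, the flag can also be appended with a JSON patch. The sketch assumes the upstream defaults, that is a Deployment named metrics-server in kube-system whose first container is metrics-server; both helper names are hypothetical, and nothing is applied until add_insecure_tls is called.

```shell
# Hypothetical helper: JSON patch that appends one argument to the
# first container's args list ("/-" means "end of the array").
insecure_tls_patch() {
    printf '%s' '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
}

# Hypothetical wrapper: apply the patch (assumes metrics-server in kube-system).
add_insecure_tls() {
    kubectl -n kube-system patch deployment metrics-server --type=json -p "$(insecure_tls_patch)"
}

# Print the patch for review; call add_insecure_tls to apply it.
insecure_tls_patch
echo
```

Skipping kubelet certificate verification is a workaround suited to test clusters; issuing kubelet serving certificates that include IP SANs is the cleaner fix.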
[Docker] Error response from daemon: could not select device driver "" with capabilities: [[gpu]]
Solution:
apt-get install nvidia-container-runtime
systemctl restart docker
ping gitlab.xxx.com occasionally fails with Name or service not known.
Solution:
# Add policy sequential
k edit cm coredns -n kube-system
forward . /etc/resolv.conf {
    max_concurrent 1000
    policy sequential
}
# Then delete the CoreDNS pods
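A quick way to confirm the edit landed, and to restart CoreDNS without deleting pods one by one, might look like the sketch below. Both function names are hypothetical, and restart_coredns assumes the Deployment is called coredns, as on kubeadm clusters; it is only defined here, not run.

```shell
# Hypothetical helper: exits 0 if the given Corefile contains a
# "policy sequential" line.
has_sequential_policy() {
    grep -Eq '^[[:space:]]*policy[[:space:]]+sequential[[:space:]]*$' "$1"
}

# Hypothetical wrapper: rolling restart instead of deleting pods by hand
# (assumes a Deployment named coredns in kube-system).
restart_coredns() {
    kubectl -n kube-system rollout restart deployment coredns
}

# Demo on a temporary file. Against a cluster, the Corefile can be dumped with:
#   kubectl -n kube-system get configmap coredns -o jsonpath='{.data.Corefile}'
demo=$(mktemp)
cat > "$demo" <<'EOF'
forward . /etc/resolv.conf {
    max_concurrent 1000
    policy sequential
}
EOF
if has_sequential_policy "$demo"; then
    echo "forward uses the sequential policy"
fi
rm -f "$demo"
```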