Deploying a Prometheus + Grafana Monitoring Platform on an AKS Cluster with Helm

张开发
2026/6/1 12:05:03, 15 min read
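1. Deployment goals

Deploy a complete monitoring platform on an AKS cluster with Helm. It consists of:

- Prometheus
- Grafana
- Alertmanager
- kube-state-metrics
- node-exporter
- Prometheus Adapter

This guide targets network environments in mainland China, where:

- nodes can reach the public internet, but pulls from Docker Hub and registry.k8s.io often time out
- the AKS cluster is operated remotely with --kubeconfig ./kubeconfig.yaml
- domestic image mirrors are used to avoid ImagePullBackOff

2. Prerequisites

Make sure the local machine has:

- kubectl
- helm

Check the cluster connection:

    kubectl get nodes --kubeconfig ./kubeconfig.yaml

Create the monitoring namespace:

    kubectl create namespace monitoring --kubeconfig ./kubeconfig.yaml

3. Add the Helm repositories

    helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
    helm repo add grafana https://grafana.github.io/helm-charts
    helm repo update

List the available versions:

    helm search repo prometheus-community/kube-prometheus-stack --versions | head

4. Download the resource bundle (for offline environments)

    https://github.com/prometheus-operator/kube-prometheus/tree/release-0.17

Download the offline resource bundle from GitHub, choosing the release that matches your cluster version, and upload it to the server.

5. Check the files

Change into the kube-prometheus directory and first apply all YAML files under manifests/setup, which creates the namespace and the CRDs:

    kubectl apply --server-side -f manifests/setup --force-conflicts

6. Install the monitoring platform

    kubectl apply -f manifests/

7. Check Pod status

    kubectl get pods -n monitoring --kubeconfig ./kubeconfig.yaml

Under normal conditions every Pod reaches the Running state. If you see either of the following statuses, the image could not be pulled from Docker Hub or registry.k8s.io:

- ImagePullBackOff
- ErrImagePull

8. Handling common image pull failures

The commands below point each failing container at a mirror image. They assume the Deployment and container names shown here; confirm the actual names with kubectl get deploy -n monitoring before running them.

Grafana image fails to pull:

    kubectl set image deployment/prometheus-grafana \
      grafana=swr.cn-north-4.myhuaweicloud.com/ddn-k8s/docker.io/grafana/grafana:12.4.1 \
      -n monitoring --kubeconfig ./kubeconfig.yaml

kube-state-metrics image fails to pull:

    kubectl set image deployment/prometheus-kube-state-metrics \
      kube-state-metrics=swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.18.0 \
      -n monitoring --kubeconfig ./kubeconfig.yaml

prometheus-adapter image fails to pull:

    kubectl set image deployment/prometheus-prometheus-adapter \
      prometheus-adapter=swr.cn-north-4.myhuaweicloud.com/ddn-k8s/registry.k8s.io/prometheus-adapter/prometheus-adapter:v0.12.0 \
      -n monitoring --kubeconfig ./kubeconfig.yaml

Delete the failing Pods so they are recreated (this command deletes every Pod in the namespace):

    kubectl delete pod -n monitoring --all --kubeconfig ./kubeconfig.yaml

Check again:

    kubectl get pods -n monitoring --kubeconfig ./kubeconfig.yaml

9. View Services

    kubectl get svc -n monitoring --kubeconfig ./kubeconfig.yaml

Example (abridged):

    NAME                                                   TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)
    grafana                                                LoadBalancer   10.0.131.131   20.x.x.x      3000:31054/TCP
    prometheus-kube-prometheus-prometheus                  ClusterIP
    alertmanager-prometheus-kube-prometheus-alertmanager   ClusterIP

10. Change Grafana to LoadBalancer

If the Grafana Service is not of type LoadBalancer by default, run:

    kubectl patch svc prometheus-grafana -n monitoring \
      -p '{"spec":{"type":"LoadBalancer"}}' \
      --kubeconfig ./kubeconfig.yaml

Look up the public IP:

    kubectl get svc prometheus-grafana -n monitoring --kubeconfig ./kubeconfig.yaml

Once EXTERNAL-IP has been assigned, open:

    http://EXTERNAL-IP:3000

11. Grafana login

Default account:

    admin

Default password:

    admin123

If you have forgotten the password, inspect the Secret. The values under data are base64-encoded and must be decoded before use.

    kubectl get secret -n monitoring --kubeconfig ./kubeconfig.yaml | grep grafana
    kubectl get secret prometheus-grafana -n monitoring -o yaml --kubeconfig ./kubeconfig.yaml

12. Configure the Prometheus data source

After logging in to Grafana, go to:

    Connections → Data sources → Add data source

Select Prometheus and enter the following URL:

    http://prometheus-kube-prometheus-prometheus.monitoring.svc:9090

Click Save & test.

13. Recommended Dashboard IDs

In Grafana, go to Dashboards → Import. The following Dashboards are recommended:

    315     Kubernetes cluster monitoring
    1860    Node Exporter Full
    15757   Kubernetes / Compute Resources / Cluster
    15759   Kubernetes / Views / Nodes
    15760   Kubernetes / Views / Pods
    15761   Kubernetes / API Server
    13332   kube-state-metrics

Import these first:

    1860
    15757
    15759
    15760

Together these Dashboards cover most of the following:

- node CPU
- node memory
- node disk
- Pod status
- Pod restarts
- Deployment
- Namespace
- API Server
- container resource usage

15. Check Prometheus Targets

Open:

    http://prometheus-service-ip:9090/targets

Or use port forwarding:

    kubectl port-forward svc/prometheus-kube-prometheus-prometheus 9090:9090 \
      -n monitoring --kubeconfig ./kubeconfig.yaml

Then open locally:

    http://127.0.0.1:9090/targets

Check that all Targets are UP.

16. Extending the monitoring later

Monitoring can later be extended to:

- Nginx Ingress
- Redis
- MySQL
- PostgreSQL
- Kafka
- RabbitMQ
- Elasticsearch
- JVM
- Spring Boot
- custom business services

Usually it is enough to deploy the corresponding exporter and then import a matching Grafana Dashboard.

17. Common troubleshooting commands

In the commands below, replace pod-name and deployment-name with the actual resource names.

View Pods:

    kubectl get pods -n monitoring --kubeconfig ./kubeconfig.yaml

View Services:

    kubectl get svc -n monitoring --kubeconfig ./kubeconfig.yaml

View Deployments:

    kubectl get deploy -n monitoring --kubeconfig ./kubeconfig.yaml

View images:

    kubectl get deploy -n monitoring -o yaml --kubeconfig ./kubeconfig.yaml | grep image:

View events for a failing Pod:

    kubectl describe pod pod-name -n monitoring --kubeconfig ./kubeconfig.yaml

View logs:

    kubectl logs pod-name -n monitoring --kubeconfig ./kubeconfig.yaml

Find image pull failures:

    kubectl get pods -n monitoring --kubeconfig ./kubeconfig.yaml | grep ImagePullBackOff

Delete a failing Pod:

    kubectl delete pod pod-name -n monitoring --kubeconfig ./kubeconfig.yaml

Restart a Deployment:

    kubectl rollout restart deployment deployment-name -n monitoring --kubeconfig ./kubeconfig.yaml

Inspired by the following post, with thanks to its author: https://cloud.tencent.com/developer/article/2216613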
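The image pull check among the troubleshooting commands matches only ImagePullBackOff, while the Pod status section also names ErrImagePull. As a minimal sketch, the check could be wrapped in a small filter that catches both and prints only the Pod names. The function name failed_image_pods and the sample output are hypothetical and not part of the original article; the filter assumes the default kubectl get pods column layout, with STATUS as the third column.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the original article.
# Reads `kubectl get pods` output on stdin and prints the name of every Pod
# whose STATUS column (3rd field) ends in ImagePullBackOff or ErrImagePull.
failed_image_pods() {
  awk 'NR > 1 && $3 ~ /(ImagePullBackOff|ErrImagePull)$/ { print $1 }'
}

# Made-up sample output so the sketch runs without a cluster.
sample_output='NAME                                  READY   STATUS             RESTARTS   AGE
prometheus-grafana-5d9c7b8f6d-abcde   0/1     ImagePullBackOff   0          2m
prometheus-operator-7c8d9f6b5-fghij   1/1     Running            0          2m
node-exporter-klmno                   0/1     ErrImagePull       0          2m'

# Prints the two failing Pod names from the sample, one per line.
printf '%s\n' "$sample_output" | failed_image_pods
```

Against a real cluster, the same filter would be fed from kubectl get pods -n monitoring --kubeconfig ./kubeconfig.yaml instead of the sample text.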
