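K8s CRD Operator Development in Practice: 6 Production Patterns from CRD Design to Controller Implementation
When you hand-write 2,000 lines of YAML to deploy a single application
Have you been through this pain: deploying one microservice means hand-writing seven kinds of resources (Deployment, Service, ConfigMap, Secret, Ingress, HPA, PDB), and every environment needs its own image tag, replica count, and environment variables? A new colleague needs a week just to understand the deployment process? Worse, at 3 a.m. during a production incident you discover that the database connection string in a ConfigMap is wrong, but the YAML files are scattered across five Git repositories and you have no idea which one to change?
This is the "combinatorial explosion" problem of the native K8s API: the low-level resources are too fine-grained and carry no business-level abstraction; YAML-driven operations have no validation, so a single indentation error can fail an entire deployment; and multi-environment configuration never converges, so the differences between dev, staging, and prod YAML are compared by eye.
CRDs plus an Operator change all of this. They let you define your own business API inside K8s and automate the deployment logic with a Controller: you write a single MyApp resource, and the Operator creates the seven child resources, handles version upgrades, and manages configuration changes. This article starts from scratch and walks through six production-grade patterns for CRD Operator development.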
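Core concepts at a glance
| Concept | Full name | Description |
|---|---|---|
| CRD | Custom Resource Definition | Defines a custom resource type in K8s: its API path, fields, and validation rules |
| CR | Custom Resource | An instance of a CRD-defined type; the concrete custom object a user creates |
| Operator | Operator Pattern | The pattern of automating application management in K8s with a CRD plus a Controller |
| Controller | Controller | A control loop that continuously watches cluster state and drives actual state toward desired state |
| Reconcile | Reconciliation Loop | The Controller's core logic: compare desired and actual state, then act on the difference |
| Kubebuilder | Kubebuilder | A framework for building Operators on top of controller-runtime, providing project scaffolding and code generation |
| controller-runtime | controller-runtime | The Go library for implementing K8s Controllers, providing Watch, Reconcile, Event, and related machinery |
| Finalizer | Finalizer | A hook that blocks deletion until the Controller has finished its cleanup |
| Status Subresource | Status Subresource | Splits Spec and Status into two independent update paths to avoid update conflicts |
| Owner Reference | Owner Reference | The ownership link between resources, enabling cascading deletion and garbage collection |
| Event | Event | A K8s event record that shows users what the Controller did and what went wrong |
| Webhook | Admission Webhook | Intercepts and checks API requests; comes in Mutating and Validating variants |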
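To make the Controller and Reconcile rows concrete, here is a minimal sketch of a level-triggered control loop. It is a toy model, not controller-runtime: the `state` map and the `reconcile` function are hypothetical names invented for illustration, and the "cluster" is just an in-memory map of resource name to replica count.

```go
package main

import "fmt"

// state is a toy stand-in for a cluster: resource name -> replica count.
type state map[string]int

// reconcile drives actual toward desired and returns the actions it took.
// It only looks at the current difference between the two maps, so running
// it again after convergence produces no actions (it is idempotent).
func reconcile(desired, actual state) []string {
	var actions []string
	for name, want := range desired {
		got, exists := actual[name]
		switch {
		case !exists:
			actual[name] = want
			actions = append(actions, fmt.Sprintf("create %s replicas=%d", name, want))
		case got != want:
			actual[name] = want
			actions = append(actions, fmt.Sprintf("update %s replicas=%d", name, want))
		}
	}
	// Anything that exists but is no longer desired gets removed.
	for name := range actual {
		if _, wanted := desired[name]; !wanted {
			delete(actual, name)
			actions = append(actions, "delete "+name)
		}
	}
	return actions
}

func main() {
	desired := state{"web": 3}
	actual := state{"web": 1, "stale": 2}
	fmt.Println("first pass: ", reconcile(desired, actual))
	fmt.Println("second pass:", reconcile(desired, actual))
}
```

The second pass returning no actions is the property a real Reconcile function should also have: being called twice for the same desired state must not create or change anything the second time.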
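Six challenges: why CRD Operator development is more than "just write a CRD"
- CRD schema design traps: imprecise field types and missing OpenAPI v3 validation let users submit invalid data; without version compatibility planned up front, the v1 to v2 migration becomes a nightmare
- Reconcile loop idempotency: a Controller's Reconcile can be triggered repeatedly, and non-idempotent logic will create duplicate resources or repeat operations. This is the most common and most damaging bug
- Status update conflicts: several Controllers updating the same resource's Status, or a Controller updating Status after the Spec has changed, lead to optimistic-lock conflicts and lost updates
- Finalizer deadlock: if the Controller crashes or is removed after a Finalizer has been added, the resource stays in Terminating forever and can be neither deleted nor recreated
- Event storms: in a large cluster, one CRD change can trigger the creation or update of hundreds of child resources, and the workqueue backs up when the Controller cannot keep pace
- Missing production features: without Leader Election, multiple Controller replicas do the same work; without Graceful Shutdown, Reconciles are cut off midway; without Metrics, Controller health cannot be monitored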
Six steps in practice: from CRD design to Controller implementation
Step 1: CRD definition and OpenAPI v3 schema
CRD definition (with full validation):
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: webapps.apps.toolsku.dev
spec:
  group: apps.toolsku.dev
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - image
                - replicas
              properties:
                image:
                  type: string
                  pattern: '^[a-zA-Z0-9][a-zA-Z0-9._-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$'
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                port:
                  type: integer
                  minimum: 1
                  maximum: 65535
                  default: 8080
                resources:
                  type: object
                  properties:
                    requests:
                      type: object
                      properties:
                        cpu:
                          type: string
                          pattern: '^([0-9]+m|[0-9]+(\.[0-9]+)?)$'
                        memory:
                          type: string
                          pattern: '^[0-9]+(Ki|Mi|Gi|Ti)$'
                    limits:
                      type: object
                      properties:
                        cpu:
                          type: string
                        memory:
                          type: string
                env:
                  type: array
                  items:
                    type: object
                    required:
                      - name
                    properties:
                      name:
                        type: string
                        maxLength: 256
                      value:
                        type: string
                      valueFrom:
                        type: object
                        properties:
                          secretKeyRef:
                            type: object
                            required:
                              - name
                              - key
                            properties:
                              name:
                                type: string
                              key:
                                type: string
            status:
              type: object
              properties:
                availableReplicas:
                  type: integer
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum:
                          - "True"
                          - "False"
                          - Unknown
                      # Declared so the observedGeneration written by the Controller is not pruned.
                      observedGeneration:
                        type: integer
                        format: int64
                      lastTransitionTime:
                        type: string
                        format: date-time
                      reason:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
        scale:
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.availableReplicas
      additionalPrinterColumns:
        - name: Image
          type: string
          jsonPath: .spec.image
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Available
          type: integer
          jsonPath: .status.availableReplicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: webapps
    singular: webapp
    kind: WebApp
    shortNames:
      - wa
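Before shipping a schema like this, it helps to run the `pattern` expressions against a few sample values. The sketch below copies the three patterns from the CRD above and checks them with Go's `regexp` package. The sample values are assumptions chosen for illustration, and the API server's pattern evaluation may differ from Go's engine in edge cases, so treat this as a quick sanity check rather than a substitute for applying the CRD to a test cluster.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns copied verbatim from the CRD schema.
var (
	imagePattern  = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
	cpuPattern    = regexp.MustCompile(`^([0-9]+m|[0-9]+(\.[0-9]+)?)$`)
	memoryPattern = regexp.MustCompile(`^[0-9]+(Ki|Mi|Gi|Ti)$`)
)

// check prints whether each sample value is accepted by the given pattern.
func check(field string, pattern *regexp.Regexp, samples ...string) {
	for _, s := range samples {
		fmt.Printf("%-6s %-26q accepted=%v\n", field, s, pattern.MatchString(s))
	}
}

func main() {
	// Illustrative sample values, not taken from the article.
	check("image", imagePattern, "nginx:1.25", "nginx", "toolsku/webapp:v1.0.0")
	check("cpu", cpuPattern, "500m", "0.5", "1.5m")
	check("memory", memoryPattern, "128Mi", "1Gi", "128M")
}
```

One thing this check surfaces: the image pattern has no `/` in its character classes, so an image reference that includes a repository path or registry host (the third image sample) is rejected. If such images must be allowed, the pattern would need to be widened.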
Step 2: Scaffolding the project with Kubebuilder
# Initialize the Kubebuilder project
mkdir webapp-operator && cd webapp-operator
kubebuilder init --domain toolsku.dev --repo github.com/toolsku/webapp-operator
# Create the API (CRD + Controller)
kubebuilder create api --group apps --version v1alpha1 --kind WebApp
# Create the webhook
kubebuilder create webhook --group apps --version v1alpha1 --kind WebApp --defaulting --programmatic-validation
# Project layout
# ├── api/v1alpha1/
# │ ├── webapp_types.go # CRD type definitions
# │ ├── webapp_webhook.go # Webhook logic
# │ ├── webhook_suite_test.go
# │ └── groupversion_info.go
# ├── internal/controller/
# │ ├── webapp_controller.go # Controller reconcile logic
# │ └── suite_test.go
# ├── cmd/
# │ └── main.go # Entry point
# ├── config/
# │ ├── crd/ # CRD YAML
# │ ├── rbac/ # RBAC configuration
# │ ├── manager/ # Controller Manager deployment
# │ └── samples/ # Sample CRs
# └── Dockerfile
CRD type definitions (Go):
package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
type WebAppSpec struct {
Image string `json:"image"`
Replicas *int32 `json:"replicas"`
Port *int32 `json:"port,omitempty"`
Resources *WebAppResourceRequirements `json:"resources,omitempty"`
Env []WebAppEnvVar `json:"env,omitempty"`
}
type WebAppResourceRequirements struct {
Requests *ResourceList `json:"requests,omitempty"`
Limits *ResourceList `json:"limits,omitempty"`
}
type ResourceList struct {
CPU string `json:"cpu,omitempty"`
Memory string `json:"memory,omitempty"`
}
type WebAppEnvVar struct {
Name string `json:"name"`
Value string `json:"value,omitempty"`
ValueFrom *EnvVarSource `json:"valueFrom,omitempty"`
}
type EnvVarSource struct {
SecretKeyRef *SecretKeyRef `json:"secretKeyRef,omitempty"`
}
type SecretKeyRef struct {
Name string `json:"name"`
Key string `json:"key"`
}
type WebAppStatus struct {
AvailableReplicas int32 `json:"availableReplicas,omitempty"`
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type WebApp struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec WebAppSpec `json:"spec,omitempty"`
Status WebAppStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
type WebAppList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []WebApp `json:"items"`
}
func init() {
SchemeBuilder.Register(&WebApp{}, &WebAppList{})
}
Step 3: Implementing the Controller Reconcile loop
package controller
import (
"context"
"fmt"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
"k8s.io/apimachinery/pkg/util/intstr"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)
type WebAppReconciler struct {
client.Client
Scheme *runtime.Scheme
}
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
if errors.IsNotFound(err) {
logger.Info("WebApp resource not found, ignoring")
return ctrl.Result{}, nil
}
logger.Error(err, "Failed to get WebApp")
return ctrl.Result{}, err
}
if webapp.DeletionTimestamp != nil {
return r.handleDeletion(ctx, &webapp)
}
if err := r.ensureFinalizer(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
deployment, err := r.reconcileDeployment(ctx, &webapp)
if err != nil {
r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
return ctrl.Result{}, err
}
service, err := r.reconcileService(ctx, &webapp)
if err != nil {
r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
return ctrl.Result{}, err
}
_ = deployment
_ = service
if err := r.updateStatus(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
r.updateCondition(ctx, &webapp, "Available", metav1.ConditionTrue, "ReconcileSuccess", "WebApp is available")
return ctrl.Result{}, nil
}
func (r *WebAppReconciler) reconcileDeployment(ctx context.Context, webapp *appsv1alpha1.WebApp) (*appsv1.Deployment, error) {
logger := log.FromContext(ctx)
var deployment appsv1.Deployment
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)
if err != nil && errors.IsNotFound(err) {
desired := r.buildDeployment(webapp)
if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
return nil, fmt.Errorf("failed to set OwnerReference: %w", err)
}
logger.Info("Creating Deployment", "name", desired.Name)
if err := r.Create(ctx, desired); err != nil {
return nil, fmt.Errorf("failed to create Deployment: %w", err)
}
return desired, nil
} else if err != nil {
return nil, fmt.Errorf("failed to get Deployment: %w", err)
}
desired := r.buildDeployment(webapp)
if r.deploymentNeedsUpdate(&deployment, desired) {
deployment.Spec = desired.Spec
logger.Info("Updating Deployment", "name", deployment.Name)
if err := r.Update(ctx, &deployment); err != nil {
return nil, fmt.Errorf("failed to update Deployment: %w", err)
}
}
return &deployment, nil
}
func (r *WebAppReconciler) buildDeployment(webapp *appsv1alpha1.WebApp) *appsv1.Deployment {
replicas := int32(1)
if webapp.Spec.Replicas != nil {
replicas = *webapp.Spec.Replicas
}
port := int32(8080)
if webapp.Spec.Port != nil {
port = *webapp.Spec.Port
}
envVars := make([]corev1.EnvVar, 0, len(webapp.Spec.Env))
for _, e := range webapp.Spec.Env {
ev := corev1.EnvVar{Name: e.Name, Value: e.Value}
if e.ValueFrom != nil && e.ValueFrom.SecretKeyRef != nil {
ev.ValueFrom = &corev1.EnvVarSource{
SecretKeyRef: &corev1.SecretKeySelector{
LocalObjectReference: corev1.LocalObjectReference{Name: e.ValueFrom.SecretKeyRef.Name},
Key: e.ValueFrom.SecretKeyRef.Key,
},
}
}
envVars = append(envVars, ev)
}
return &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
Labels: map[string]string{
"app.kubernetes.io/name": "webapp",
"app.kubernetes.io/instance": webapp.Name,
"app.kubernetes.io/managed-by": "webapp-operator",
},
},
Spec: appsv1.DeploymentSpec{
Replicas: &replicas,
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"app.kubernetes.io/instance": webapp.Name,
},
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{
"app.kubernetes.io/name": "webapp",
"app.kubernetes.io/instance": webapp.Name,
},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{
Name: "webapp",
Image: webapp.Spec.Image,
Ports: []corev1.ContainerPort{{ContainerPort: port}},
Env: envVars,
},
},
},
},
},
}
}
func (r *WebAppReconciler) reconcileService(ctx context.Context, webapp *appsv1alpha1.WebApp) (*corev1.Service, error) {
logger := log.FromContext(ctx)
port := int32(8080)
if webapp.Spec.Port != nil {
port = *webapp.Spec.Port
}
var svc corev1.Service
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &svc)
if err != nil && errors.IsNotFound(err) {
desired := &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
Labels: map[string]string{
"app.kubernetes.io/instance": webapp.Name,
"app.kubernetes.io/managed-by": "webapp-operator",
},
},
Spec: corev1.ServiceSpec{
Selector: map[string]string{
"app.kubernetes.io/instance": webapp.Name,
},
Ports: []corev1.ServicePort{
{Port: port, TargetPort: intstr.FromInt32(port)},
},
},
}
if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
return nil, err
}
logger.Info("Creating Service", "name", desired.Name)
if err := r.Create(ctx, desired); err != nil {
return nil, err
}
return desired, nil
} else if err != nil {
return nil, err
}
return &svc, nil
}
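// deploymentNeedsUpdate compares only replicas and image, so a change to port or env
// in the WebApp spec is not propagated to an existing Deployment by this check.
func (r *WebAppReconciler) deploymentNeedsUpdate(current, desired *appsv1.Deployment) bool {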
if *current.Spec.Replicas != *desired.Spec.Replicas {
return true
}
if current.Spec.Template.Spec.Containers[0].Image != desired.Spec.Template.Spec.Containers[0].Image {
return true
}
return false
}
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&appsv1alpha1.WebApp{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Complete(r)
}
Step 4: Managing the Status subresource and updating Conditions
package controller
import (
"context"
appsv1 "k8s.io/api/apps/v1"
"k8s.io/apimachinery/pkg/api/meta"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/log"
appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)
const (
finalizerName = "apps.toolsku.dev/finalizer"
)
func (r *WebAppReconciler) updateStatus(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
var deployment appsv1.Deployment
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)
if err != nil {
return err
}
availableReplicas := int32(0)
if deployment.Status.AvailableReplicas > 0 {
availableReplicas = deployment.Status.AvailableReplicas
}
if webapp.Status.AvailableReplicas == availableReplicas {
return nil
}
webapp.Status.AvailableReplicas = availableReplicas
return r.Status().Update(ctx, webapp)
}
func (r *WebAppReconciler) updateCondition(ctx context.Context, webapp *appsv1alpha1.WebApp, condType string, status metav1.ConditionStatus, reason, message string) {
condition := metav1.Condition{
Type: condType,
Status: status,
Reason: reason,
Message: message,
ObservedGeneration: webapp.Generation,
}
meta.SetStatusCondition(&webapp.Status.Conditions, condition)
if err := r.Status().Update(ctx, webapp); err != nil {
log.FromContext(ctx).Error(err, "Failed to update Condition")
}
}
func (r *WebAppReconciler) ensureFinalizer(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
if !containsString(webapp.Finalizers, finalizerName) {
webapp.Finalizers = append(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, webapp); err != nil {
return err
}
}
return nil
}
func (r *WebAppReconciler) handleDeletion(ctx context.Context, webapp *appsv1alpha1.WebApp) (ctrl.Result, error) {
logger := log.FromContext(ctx)
if !containsString(webapp.Finalizers, finalizerName) {
return ctrl.Result{}, nil
}
logger.Info("Running cleanup logic", "webapp", webapp.Name)
if err := r.cleanupExternalResources(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
logger.Info("Finalizer cleanup complete", "webapp", webapp.Name)
return ctrl.Result{}, nil
}
func (r *WebAppReconciler) cleanupExternalResources(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
logger := log.FromContext(ctx)
logger.Info("Cleaning up external resources", "webapp", webapp.Name)
return nil
}
func containsString(slice []string, s string) bool {
for _, item := range slice {
if item == s {
return true
}
}
return false
}
func removeString(slice []string, s string) []string {
var result []string
for _, item := range slice {
if item == s {
continue
}
result = append(result, item)
}
return result
}
Step 5: Validation with an Admission Webhook
package v1alpha1
import (
"fmt"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/webhook"
"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)
func (w *WebApp) SetupWebhookWithManager(mgr ctrl.Manager) error {
return ctrl.NewWebhookManagedBy(mgr).
For(w).
Complete()
}
var _ webhook.Defaulter = &WebApp{}
func (w *WebApp) Default() {
if w.Spec.Port == nil {
port := int32(8080)
w.Spec.Port = &port
}
if w.Spec.Replicas == nil {
replicas := int32(1)
w.Spec.Replicas = &replicas
}
}
var _ webhook.Validator = &WebApp{}
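func (w *WebApp) ValidateCreate() (admission.Warnings, error) {
if err := validateWebApp(w); err != nil {
return nil, err
}
return admission.Warnings{"A Deployment and a Service will be created automatically for this WebApp"}, nil
}
// All three methods return (admission.Warnings, error) so that WebApp satisfies webhook.Validator.
func (w *WebApp) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
oldWA, ok := old.(*WebApp)
if !ok {
return nil, fmt.Errorf("expected a WebApp but got %T", old)
}
// Guard against nil pointers before dereferencing the replica counts.
if oldWA.Spec.Replicas != nil && w.Spec.Replicas != nil && *oldWA.Spec.Replicas > 0 && *w.Spec.Replicas == 0 {
return nil, fmt.Errorf("scaling replicas to 0 is not allowed; delete the WebApp resource instead")
}
return nil, validateWebApp(w)
}
func (w *WebApp) ValidateDelete() (admission.Warnings, error) {
return nil, nil
}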
func validateWebApp(w *WebApp) error {
if w.Spec.Image == "" {
return fmt.Errorf("image must not be empty")
}
if w.Spec.Replicas != nil && *w.Spec.Replicas > 100 {
return fmt.Errorf("replicas must not exceed 100, got: %d", *w.Spec.Replicas)
}
if w.Spec.Port != nil {
if *w.Spec.Port < 1 || *w.Spec.Port > 65535 {
return fmt.Errorf("port must be between 1 and 65535, got: %d", *w.Spec.Port)
}
}
for i, env := range w.Spec.Env {
if env.Name == "" {
return fmt.Errorf("env[%d].name must not be empty", i)
}
if env.Value == "" && env.ValueFrom == nil {
return fmt.Errorf("env[%d] must set either value or valueFrom", i)
}
}
return nil
}
Step 6: Production configuration: Leader Election, Graceful Shutdown, and Metrics
package main
import (
"flag"
"os"
"k8s.io/apimachinery/pkg/runtime"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/healthz"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
"sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"
appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
"github.com/toolsku/webapp-operator/internal/controller"
)
var (
scheme = runtime.NewScheme()
setupLog = ctrl.Log.WithName("setup")
)
func init() {
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
utilruntime.Must(appsv1alpha1.AddToScheme(scheme))
}
func main() {
var metricsAddr string
var enableLeaderElection bool
var probeAddr string
var webhookPort int
flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "Address the metrics endpoint binds to")
flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "Address the health probe endpoint binds to")
flag.BoolVar(&enableLeaderElection, "leader-elect", true, "Enable leader election")
flag.IntVar(&webhookPort, "webhook-port", 9443, "Port the webhook server listens on")
opts := zap.Options{Development: true}
opts.BindFlags(flag.CommandLine)
flag.Parse()
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
Metrics: server.Options{
BindAddress: metricsAddr,
},
WebhookServer: webhook.NewServer(webhook.Options{
Port: webhookPort,
}),
HealthProbeBindAddress: probeAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: "webapp-operator.toolsku.dev",
LeaderElectionNamespace: os.Getenv("POD_NAMESPACE"),
})
if err != nil {
setupLog.Error(err, "Failed to create Manager")
os.Exit(1)
}
if err := (&controller.WebAppReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
}).SetupWithManager(mgr); err != nil {
setupLog.Error(err, "Failed to set up Controller")
os.Exit(1)
}
if err := (&appsv1alpha1.WebApp{}).SetupWebhookWithManager(mgr); err != nil {
setupLog.Error(err, "Failed to set up Webhook")
os.Exit(1)
}
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
setupLog.Error(err, "Failed to set up health check")
os.Exit(1)
}
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
setupLog.Error(err, "Failed to set up readiness check")
os.Exit(1)
}
setupLog.Info("Starting Operator",
"metrics", metricsAddr,
"probe", probeAddr,
"leader-election", enableLeaderElection,
"webhook-port", webhookPort,
)
// SetupSignalHandler cancels the context on SIGTERM/SIGINT so the Manager can shut down gracefully.
if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
setupLog.Error(err, "Failed to run Manager")
os.Exit(1)
}
}
Controller Manager deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-operator-controller-manager
  namespace: webapp-operator-system
spec:
  replicas: 2
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
      annotations:
        kubectl.kubernetes.io/default-container: manager
    spec:
      serviceAccountName: controller-manager
      containers:
        - name: manager
          image: toolsku/webapp-operator:v1.0.0
          args:
            - --leader-elect
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --webhook-port=9443
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
            - containerPort: 9443
              name: webhook
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 128Mi
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: cert
              mountPath: /tmp/k8s-webhook-server/serving-certs
              readOnly: true
      volumes:
        - name: cert
          secret:
            defaultMode: 420
            secretName: webhook-server-cert
      terminationGracePeriodSeconds: 60
Six pitfalls to avoid
Pitfall 1: A non-idempotent Reconcile creates duplicate resources
❌ Wrong:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
r.Get(ctx, req.NamespacedName, &webapp)
deploy := r.buildDeployment(&webapp)
r.Create(ctx, deploy)
return ctrl.Result{}, nil
}
✅ Right:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
if errors.IsNotFound(err) {
return ctrl.Result{}, nil
}
return ctrl.Result{}, err
}
var existing appsv1.Deployment
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &existing)
if err != nil && errors.IsNotFound(err) {
deploy := r.buildDeployment(&webapp)
if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
return ctrl.Result{}, err
}
if err := r.Create(ctx, deploy); err != nil {
return ctrl.Result{}, err
}
} else if err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
}
Pitfall 2: Status updates conflicting with Spec updates
❌ Wrong:
webapp.Status.AvailableReplicas = 3
if err := r.Update(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
✅ Right:
webapp.Status.AvailableReplicas = 3
if err := r.Status().Update(ctx, webapp); err != nil {
if errors.IsConflict(err) {
return ctrl.Result{Requeue: true}, nil
}
return ctrl.Result{}, err
}
Pitfall 3: A mishandled Finalizer leaves the resource stuck in Terminating
❌ Wrong:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
r.Get(ctx, req.NamespacedName, &webapp)
if webapp.DeletionTimestamp != nil {
return ctrl.Result{}, nil
}
webapp.Finalizers = append(webapp.Finalizers, "my-finalizer")
r.Update(ctx, &webapp)
return ctrl.Result{}, nil
}
✅ Right:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
if webapp.DeletionTimestamp != nil {
if containsString(webapp.Finalizers, finalizerName) {
if err := r.cleanupExternalResources(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
if !containsString(webapp.Finalizers, finalizerName) {
webapp.Finalizers = append(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
}
return r.reconcileNormal(ctx, &webapp)
}
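Why does a forgotten finalizer leave an object stuck? The toy model below sketches the rule: a delete request only marks the object, and the object actually disappears once its finalizer list is empty. The `object` and `store` types and their methods are hypothetical names made up for this sketch; the real API server behavior is richer, but the blocking rule is the same.

```go
package main

import "fmt"

// object is a hypothetical, minimal model of an object's deletion metadata.
type object struct {
	Name       string
	Finalizers []string
	Deleting   bool // stands in for a non-nil DeletionTimestamp
}

// store is a toy API server keyed by object name.
type store map[string]*object

// requestDelete marks the object for deletion. It is removed right away
// only when no finalizers remain.
func (s store) requestDelete(name string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	obj.Deleting = true
	s.collect(name)
}

// removeFinalizer drops one finalizer, then lets the store collect the
// object if that was the last thing blocking deletion.
func (s store) removeFinalizer(name, finalizer string) {
	obj, ok := s[name]
	if !ok {
		return
	}
	kept := make([]string, 0, len(obj.Finalizers))
	for _, f := range obj.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	obj.Finalizers = kept
	s.collect(name)
}

// collect deletes the object once it is marked and has no finalizers left.
func (s store) collect(name string) {
	if obj := s[name]; obj != nil && obj.Deleting && len(obj.Finalizers) == 0 {
		delete(s, name)
	}
}

func main() {
	s := store{"demo": &object{Name: "demo", Finalizers: []string{"apps.toolsku.dev/finalizer"}}}

	s.requestDelete("demo")
	_, stillThere := s["demo"]
	fmt.Println("after delete request, object still present:", stillThere)

	s.removeFinalizer("demo", "apps.toolsku.dev/finalizer")
	_, stillThere = s["demo"]
	fmt.Println("after finalizer removal, object still present:", stillThere)
}
```

In the wrong version above, the Reconcile returns early when `DeletionTimestamp` is set and never removes the finalizer, which corresponds to never calling `removeFinalizer` in this model.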
Pitfall 4: Missing OwnerReference leaks child resources
❌ Wrong:
deploy := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
},
}
r.Create(ctx, deploy)
✅ Right:
deploy := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
},
}
if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
return ctrl.Result{}, err
}
if err := r.Create(ctx, deploy); err != nil && !errors.IsAlreadyExists(err) {
return ctrl.Result{}, err
}
Pitfall 5: Returning Conflict errors from Reconcile instead of handling them
❌ Wrong:
if err := r.Update(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
✅ Right:
if err := r.Update(ctx, &webapp); err != nil {
if errors.IsConflict(err) {
return ctrl.Result{Requeue: true}, nil
}
return ctrl.Result{}, err
}
Pitfall 6: Webhook without certificates fails the TLS handshake
❌ Wrong:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
    sideEffects: None
    admissionReviewVersions: ["v1"]
✅ Right:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
  annotations:
    cert-manager.io/inject-ca-from: webapp-operator-system/webapp-operator-serving-cert
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
        port: 443
    sideEffects: None
    admissionReviewVersions: ["v1"]
    failurePolicy: Fail
    timeoutSeconds: 10
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: webapp-operator-serving-cert
  namespace: webapp-operator-system
spec:
  dnsNames:
    - webhook-service.webapp-operator-system.svc
    - webhook-service.webapp-operator-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: webapp-operator-selfsigned-issuer
  secretName: webhook-server-cert
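Troubleshooting cheat sheet
| # | Error | Cause | Fix |
|---|---|---|---|
| 1 | the server could not find the requested resource | The CRD is not installed, or the API version does not match | Run kubectl apply -f config/crd/bases/ and check that apiVersion matches the CRD |
| 2 | Operation cannot be fulfilled on webapps.apps.toolsku.dev: the object has been modified | Optimistic-lock conflict: another process modified the object during Reconcile | Catch the IsConflict error and return ctrl.Result{Requeue: true} so Reconcile runs again |
| 3 | webhook server: TLS handshake error | The webhook certificate is misconfigured or expired | Install cert-manager and configure a Certificate resource to issue certificates automatically |
| 4 | resource stuck in Terminating state | The Finalizer was never removed and the Controller is no longer running | Force removal with kubectl patch <cr> -p '{"metadata":{"finalizers":[]}}' --type=merge |
| 5 | no matches for kind "WebApp" in version "apps.toolsku.dev/v1alpha1" | The CRD is not installed, or the requested version has served: false | Check the CRD definition and make sure the target version has served: true |
| 6 | failed to call webhook: Post "https://...": x509: certificate signed by unknown authority | The webhook CA certificate was not injected into the ValidatingWebhookConfiguration | Add the cert-manager inject-ca-from annotation, or set caBundle manually |
| 7 | controller: Reconciler error: failed to get deployment: deployments is forbidden | Insufficient RBAC: the ServiceAccount lacks permissions on the API in question | Check the Role and RoleBinding under config/rbac/ and grant access to deployments in the apps group |
| 8 | leader-election: failed to renew lease | The Leader Election Lease object could not be updated, usually a network or permission problem | Check permissions on leases in coordination.k8s.io and verify network connectivity |
| 9 | too many open files | The Controller's watch connections exceed the system file descriptor limit | Raise ulimit -n or watch fewer resource types; WithEventFilter cuts reconcile work but does not reduce the number of watch connections |
| 10 | webhook: admission webhook denied the request: replicas cannot be 0 | The Validating Webhook rejected an invalid request | Check the webhook validation logic and confirm the request satisfies the business constraints |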
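Three advanced optimization tips
Tip 1: Reduce needless Reconciles with an event filter
By default, every update to a watched WebApp triggers a Reconcile, including the status-only updates the Controller itself writes. A predicate filters out the events that do not need one: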
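import (
"reflect"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/builder"
"sigs.k8s.io/controller-runtime/pkg/event"
"sigs.k8s.io/controller-runtime/pkg/predicate"
appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)
func WebAppPredicate() predicate.Predicate {
return predicate.Funcs{
UpdateFunc: func(e event.UpdateEvent) bool {
oldObj, okOld := e.ObjectOld.(*appsv1alpha1.WebApp)
newObj, okNew := e.ObjectNew.(*appsv1alpha1.WebApp)
if !okOld || !okNew {
// Not a WebApp: let the event through instead of panicking on the type assertion.
return true
}
// Generation only changes when the spec changes, so status-only updates are skipped.
if oldObj.Generation != newObj.Generation {
return true
}
// Compare the timestamps by value; != on *metav1.Time would compare pointer addresses.
if !oldObj.DeletionTimestamp.Equal(newObj.DeletionTimestamp) {
return true
}
if !reflect.DeepEqual(oldObj.Finalizers, newObj.Finalizers) {
return true
}
return false
},
CreateFunc: func(e event.CreateEvent) bool {
return true
},
DeleteFunc: func(e event.DeleteEvent) bool {
return true
},
GenericFunc: func(e event.GenericEvent) bool {
return false
},
}
}
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
// Scope the predicate to the WebApp watch only. WithEventFilter would also apply it to the
// owned Deployments and Services, whose status updates the Controller still needs to see.
For(&appsv1alpha1.WebApp{}, builder.WithPredicates(WebAppPredicate())).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Complete(r)
}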
Tip 2: Reconcile results and requeue strategy
Use RequeueAfter to schedule a delayed re-check rather than busy-polling inside Reconcile:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
if webapp.DeletionTimestamp != nil {
return r.handleDeletion(ctx, &webapp)
}
if err := r.reconcileResources(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
if !r.isWebAppReady(ctx, &webapp) {
return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
return ctrl.Result{}, nil
}
func (r *WebAppReconciler) isWebAppReady(ctx context.Context, webapp *appsv1alpha1.WebApp) bool {
var deployment appsv1.Deployment
if err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment); err != nil {
return false
}
return deployment.Status.ReadyReplicas == *webapp.Spec.Replicas
}
Tip 3: Expose Controller runtime metrics with custom Metrics
import (
"time"
"github.com/prometheus/client_golang/prometheus"
"sigs.k8s.io/controller-runtime/pkg/metrics"
)
var (
reconcileTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "webapp_operator_reconcile_total",
Help: "Total number of WebApp Operator reconciles",
},
[]string{"name", "namespace", "result"},
)
reconcileDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "webapp_operator_reconcile_duration_seconds",
Help: "Distribution of WebApp Operator reconcile durations",
Buckets: prometheus.DefBuckets,
},
[]string{"name", "namespace"},
)
managedResources = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "webapp_operator_managed_resources",
Help: "Number of WebApp resources currently managed",
},
[]string{"namespace"},
)
)
func init() {
metrics.Registry.MustRegister(reconcileTotal, reconcileDuration, managedResources)
}
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
start := time.Now()
defer func() {
reconcileDuration.WithLabelValues(req.Name, req.Namespace).Observe(time.Since(start).Seconds())
}()
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
if errors.IsNotFound(err) {
reconcileTotal.WithLabelValues(req.Name, req.Namespace, "not_found").Inc()
return ctrl.Result{}, nil
}
reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
return ctrl.Result{}, err
}
if err := r.reconcileResources(ctx, &webapp); err != nil {
reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
return ctrl.Result{}, err
}
reconcileTotal.WithLabelValues(req.Name, req.Namespace, "success").Inc()
return ctrl.Result{}, nil
}
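Comparing Operator development approaches
| Dimension | raw client-go | Kubebuilder | Operator SDK (Go) | Helm Operator |
|---|---|---|---|---|
| Language | Go | Go | Go | Helm |
| Scaffolding | None | Full CLI | Full CLI | Built into the SDK |
| Learning curve | Steep | Moderate | Moderate | Gentle |
| Code generation | None | CRD types + Webhook | CRD types + Webhook | None |
| Flexibility | Highest | High | Medium | Low |
| Controller complexity | Hand-written | controller-runtime | controller-runtime | Generated |
| Webhook support | Hand-written | Generated | Generated | Not supported |
| Leader Election | Hand-written | Built in | Built in | Built in |
| Metrics | Hand-written | Built in | Built in | Built in |
| Multi-version CRDs | Hand-written | Conversion Webhook | Conversion Webhook | Not supported |
| Community | Kubernetes project | Kubernetes SIG (kubernetes-sigs) | Operator Framework (CNCF) | Operator Framework (CNCF) |
| Best suited for | Heavy customization | General-purpose Operators | Rapid development | Simple applications |
| Production readiness | Needs substantial wrapping | Ready out of the box | Ready out of the box | Ready out of the box |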
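Summary
A CRD Operator is not as simple as "write a CRD and run a Controller". A production-grade Operator has to get six things right: the CRD schema must pin down validation rules for every field; the Reconcile loop must be idempotent and handle Conflicts correctly; the Status subresource must separate the Spec and Status update paths; Finalizers must guarantee that cleanup cannot deadlock; Webhooks need properly configured certificates and a failure policy; and the production setup must include Leader Election, Graceful Shutdown, and Metrics. Of the options compared here, Kubebuilder is the recommended framework because it gives the best balance between flexibility and development speed.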