K8s CRD Operator Development in Practice: 6 Production Patterns from CRD Design to Controller Implementation


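When You Hand-Write 2,000 Lines of YAML to Deploy One Application

Does this sound familiar? Deploying a single microservice means hand-writing seven kinds of resources (Deployment, Service, ConfigMap, Secret, Ingress, HPA, PDB), and every environment needs its own image tag, replica count, and environment variables. A new colleague needs a week just to understand the deployment process. Worse, at 3 a.m. during a production incident you discover that the database connection string in a ConfigMap is wrong, but the YAML files are scattered across five Git repositories and nobody knows which one to change.

This is the "combinatorial explosion" problem of the native K8s API. The low-level resources are too fine-grained and carry no business semantics. Hand-maintained YAML has no validation, so a single indentation error can fail an entire deployment. Multi-environment configuration never converges, so the differences between dev, staging, and prod YAML are tracked by comparing files by eye.

CRD + Operator addresses all of this. It lets you define your own business API inside K8s and automate the deployment logic with a Controller: you write a single MyApp resource, and the Operator creates the seven child resources, handles version upgrades, and manages configuration changes. This article starts from scratch and walks through 6 production-grade patterns for CRD Operator development.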

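Core Concepts Cheat Sheet

| Concept | Full name | Description |
| --- | --- | --- |
| CRD | Custom Resource Definition | Defines a custom resource type in K8s and declares its API path, fields, and validation rules |
| CR | Custom Resource | An instance of a CRD: the concrete custom resource object a user creates |
| Operator | Operator Pattern | The pattern of automating K8s application management with CRD + Controller |
| Controller | Controller | A control loop that continuously watches cluster state and drives actual state toward desired state |
| Reconcile | Reconciliation Loop | The core logic of a Controller: compare desired and actual state, then act on the difference |
| Kubebuilder | Kubebuilder | An SDK for building Operators on top of controller-runtime, providing project scaffolding and code generation |
| controller-runtime | controller-runtime | The Go library for implementing K8s Controllers, providing core mechanisms such as Watch, Reconcile, and Event |
| Finalizer | Finalizer | A pre-deletion hook that blocks deletion until the Controller has finished its cleanup |
| Status Subresource | Status Subresource | Splits Spec and Status into two independent update paths to avoid update conflicts |
| Owner Reference | Owner Reference | The ownership relation between resources, enabling cascading deletion and garbage collection |
| Event | Event | K8s event records that show users what the Controller did and which errors occurred |
| Webhook | Admission Webhook | Interception and validation of API requests, in two flavors: Mutating and Validating |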

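Six Challenges: Why CRD Operator Development Is Not Just "Write a CRD and You're Done"

  1. CRD Schema design traps: field types are defined imprecisely and OpenAPI v3 validation is missing, so users can submit invalid data; version compatibility is not planned up front, and the v1 to v2 migration turns into a nightmare

  2. Reconcile Loop idempotency: a Controller's Reconcile may be triggered repeatedly, and if the logic is not idempotent it creates duplicate resources or repeats operations. This is the most common and most damaging bug

  3. Status update conflicts: several Controllers update the Status of the same resource at once, or the Spec changes while a Controller is updating Status, which leads to optimistic-lock conflicts and lost updates

  4. Finalizer deadlock: if the Controller crashes or is removed after a Finalizer has been added, the resource stays in Terminating indefinitely and can neither be deleted nor recreated

  5. Event storms: in a large cluster, one CRD change can trigger the creation or update of hundreds of child resources, and the Workqueue backs up when the Controller cannot keep pace

  6. Missing production features: without Leader Election, multiple Controller replicas do the same work; without Graceful Shutdown, Reconcile is interrupted midway; without Metrics, Controller health cannot be monitored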
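
Challenge 2 is the easiest one to show in isolation. The following is a minimal sketch of the check-then-act shape of an idempotent reconcile. It uses only the Go standard library, and the `cluster` map is a hypothetical in-memory stand-in for the API server rather than any real client.

```go
package main

import "fmt"

// cluster is a hypothetical in-memory stand-in for the API server:
// object name -> replica count currently stored.
type cluster map[string]int32

// reconcile drives the stored state toward the desired state and reports
// what it did. Calling it again with the same input changes nothing.
func reconcile(c cluster, name string, replicas int32) string {
	current, exists := c[name]
	switch {
	case !exists:
		c[name] = replicas
		return "created"
	case current != replicas:
		c[name] = replicas
		return "updated"
	default:
		return "unchanged"
	}
}

func main() {
	c := cluster{}
	fmt.Println(reconcile(c, "demo", 3)) // first run creates the object
	fmt.Println(reconcile(c, "demo", 3)) // a repeated trigger is a no-op
	fmt.Println(reconcile(c, "demo", 5)) // a spec change becomes an update
	fmt.Println(len(c))                  // still exactly one object
}
```

The point of the sketch is that the function looks at what already exists before acting, so the number of times it is triggered does not affect the final state.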

Six Steps in Practice: From CRD Design to Controller Implementation

Step 1: CRD Definition and OpenAPI v3 Schema

CRD definition (with full validation):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: webapps.apps.toolsku.dev
spec:
  group: apps.toolsku.dev
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - image
                - replicas
              properties:
                image:
                  type: string
                  pattern: '^[a-zA-Z0-9][a-zA-Z0-9._-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$'
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                port:
                  type: integer
                  minimum: 1
                  maximum: 65535
                  default: 8080
                resources:
                  type: object
                  properties:
                    requests:
                      type: object
                      properties:
                        cpu:
                          type: string
                          pattern: '^([0-9]+m|[0-9]+(\.[0-9]+)?)$'
                        memory:
                          type: string
                          pattern: '^[0-9]+(Ki|Mi|Gi|Ti)$'
                    limits:
                      type: object
                      properties:
                        cpu:
                          type: string
                        memory:
                          type: string
                env:
                  type: array
                  items:
                    type: object
                    required:
                      - name
                    properties:
                      name:
                        type: string
                        maxLength: 256
                      value:
                        type: string
                      valueFrom:
                        type: object
                        properties:
                          secretKeyRef:
                            type: object
                            required:
                              - name
                              - key
                            properties:
                              name:
                                type: string
                              key:
                                type: string
            status:
              type: object
              properties:
                availableReplicas:
                  type: integer
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum:
                          - "True"
                          - "False"
                          - Unknown
                      lastTransitionTime:
                        type: string
                        format: date-time
                      reason:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
        scale:
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.availableReplicas
      additionalPrinterColumns:
        - name: Image
          type: string
          jsonPath: .spec.image
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Available
          type: integer
          jsonPath: .status.availableReplicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: webapps
    singular: webapp
    kind: WebApp
    shortNames:
      - wa
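
It is worth checking what the `spec.image` pattern above actually admits before relying on it. The sketch below copies that pattern into Go's `regexp` package and tries a few image references. The sample image names are illustrative, and the assumption is that the API server evaluates the pattern the same way Go's regexp engine does for this simple expression.

```go
package main

import (
	"fmt"
	"regexp"
)

// imagePattern is copied verbatim from the spec.image pattern in the CRD.
var imagePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

// imageAccepted reports whether the CRD pattern would admit the image string.
func imageAccepted(image string) bool {
	return imagePattern.MatchString(image)
}

func main() {
	samples := []string{
		"nginx:1.25",                           // name:tag
		"nginx",                                // no tag
		"toolsku/webapp:v1.0.0",                // repository path with a slash
		"registry.example.com:5000/webapp:v1", // registry host with a port
	}
	for _, img := range samples {
		fmt.Printf("%-40s accepted=%v\n", img, imageAccepted(img))
	}
}
```

Under that assumption, only the bare `name:tag` form passes. References that contain a `/` (a repository path or a registry host) are rejected, so a team that pulls from a private registry would probably need to widen the character class.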

Step 2: Kubebuilder Project Scaffolding

# Initialize the Kubebuilder project
mkdir webapp-operator && cd webapp-operator
kubebuilder init --domain toolsku.dev --repo github.com/toolsku/webapp-operator

# Create the API (CRD + Controller)
kubebuilder create api --group apps --version v1alpha1 --kind WebApp

# Create the Webhook
kubebuilder create webhook --group apps --version v1alpha1 --kind WebApp --defaulting --programmatic-validation

# Project layout
# ├── api/v1alpha1/
# │   ├── webapp_types.go      # CRD type definitions
# │   ├── webapp_webhook.go    # Webhook logic
# │   ├── webhook_suite_test.go
# │   └── groupversion_info.go
# ├── internal/controller/
# │   ├── webapp_controller.go # Controller reconcile logic
# │   └── suite_test.go
# ├── cmd/
# │   └── main.go              # Entry point
# ├── config/
# │   ├── crd/                 # CRD YAML
# │   ├── rbac/                # RBAC configuration
# │   ├── manager/             # Controller Manager deployment
# │   └── samples/             # Sample CRs
# └── Dockerfile

CRD type definitions (Go):

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type WebAppSpec struct {
	Image     string                      `json:"image"`
	Replicas  *int32                      `json:"replicas"`
	Port      *int32                      `json:"port,omitempty"`
	Resources *WebAppResourceRequirements `json:"resources,omitempty"`
	Env       []WebAppEnvVar              `json:"env,omitempty"`
}

type WebAppResourceRequirements struct {
	Requests *ResourceList `json:"requests,omitempty"`
	Limits   *ResourceList `json:"limits,omitempty"`
}

type ResourceList struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
}

type WebAppEnvVar struct {
	Name      string          `json:"name"`
	Value     string          `json:"value,omitempty"`
	ValueFrom *EnvVarSource   `json:"valueFrom,omitempty"`
}

type EnvVarSource struct {
	SecretKeyRef *SecretKeyRef `json:"secretKeyRef,omitempty"`
}

type SecretKeyRef struct {
	Name string `json:"name"`
	Key  string `json:"key"`
}

type WebAppStatus struct {
	AvailableReplicas int32              `json:"availableReplicas,omitempty"`
	Conditions       []metav1.Condition `json:"conditions,omitempty"`
}

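// The markers below tell controller-gen to generate DeepCopy methods for this
// root type and to enable the status subresource in the generated CRD.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type WebApp struct {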
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   WebAppSpec   `json:"spec,omitempty"`
	Status WebAppStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true
type WebAppList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []WebApp `json:"items"`
}

func init() {
	SchemeBuilder.Register(&WebApp{}, &WebAppList{})
}

Step 3: Implementing the Controller Reconcile Loop

package controller

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime" // needed for runtime.Scheme
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/intstr" // needed for intstr.FromInt32
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)

type WebAppReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		if errors.IsNotFound(err) {
			logger.Info("WebApp resource not found, ignoring")
			return ctrl.Result{}, nil
		}
		logger.Error(err, "Failed to get WebApp")
		return ctrl.Result{}, err
	}

	if webapp.DeletionTimestamp != nil {
		return r.handleDeletion(ctx, &webapp)
	}

	if err := r.ensureFinalizer(ctx, &webapp); err != nil {
		return ctrl.Result{}, err
	}

	deployment, err := r.reconcileDeployment(ctx, &webapp)
	if err != nil {
		r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
		return ctrl.Result{}, err
	}

	service, err := r.reconcileService(ctx, &webapp)
	if err != nil {
		r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
		return ctrl.Result{}, err
	}

	_ = deployment
	_ = service

	if err := r.updateStatus(ctx, &webapp); err != nil {
		return ctrl.Result{}, err
	}

	r.updateCondition(ctx, &webapp, "Available", metav1.ConditionTrue, "ReconcileSuccess", "WebApp is available")

	return ctrl.Result{}, nil
}

func (r *WebAppReconciler) reconcileDeployment(ctx context.Context, webapp *appsv1alpha1.WebApp) (*appsv1.Deployment, error) {
	logger := log.FromContext(ctx)

	var deployment appsv1.Deployment
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)

	if err != nil && errors.IsNotFound(err) {
		desired := r.buildDeployment(webapp)
		if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
			return nil, fmt.Errorf("failed to set OwnerReference: %w", err)
		}
		logger.Info("Creating Deployment", "name", desired.Name)
		if err := r.Create(ctx, desired); err != nil {
			return nil, fmt.Errorf("failed to create Deployment: %w", err)
		}
		return desired, nil
	} else if err != nil {
		return nil, fmt.Errorf("failed to get Deployment: %w", err)
	}

	desired := r.buildDeployment(webapp)
	if r.deploymentNeedsUpdate(&deployment, desired) {
		deployment.Spec = desired.Spec
		logger.Info("Updating Deployment", "name", deployment.Name)
		if err := r.Update(ctx, &deployment); err != nil {
			return nil, fmt.Errorf("failed to update Deployment: %w", err)
		}
	}

	return &deployment, nil
}

func (r *WebAppReconciler) buildDeployment(webapp *appsv1alpha1.WebApp) *appsv1.Deployment {
	replicas := int32(1)
	if webapp.Spec.Replicas != nil {
		replicas = *webapp.Spec.Replicas
	}
	port := int32(8080)
	if webapp.Spec.Port != nil {
		port = *webapp.Spec.Port
	}

	envVars := make([]corev1.EnvVar, 0, len(webapp.Spec.Env))
	for _, e := range webapp.Spec.Env {
		ev := corev1.EnvVar{Name: e.Name, Value: e.Value}
		if e.ValueFrom != nil && e.ValueFrom.SecretKeyRef != nil {
			ev.ValueFrom = &corev1.EnvVarSource{
				SecretKeyRef: &corev1.SecretKeySelector{
					LocalObjectReference: corev1.LocalObjectReference{Name: e.ValueFrom.SecretKeyRef.Name},
					Key:                  e.ValueFrom.SecretKeyRef.Key,
				},
			}
		}
		envVars = append(envVars, ev)
	}

	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      webapp.Name,
			Namespace: webapp.Namespace,
			Labels: map[string]string{
				"app.kubernetes.io/name":       "webapp",
				"app.kubernetes.io/instance":   webapp.Name,
				"app.kubernetes.io/managed-by": "webapp-operator",
			},
		},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{
					"app.kubernetes.io/instance": webapp.Name,
				},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: map[string]string{
						"app.kubernetes.io/name":     "webapp",
						"app.kubernetes.io/instance": webapp.Name,
					},
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{
						{
							Name:  "webapp",
							Image: webapp.Spec.Image,
							Ports: []corev1.ContainerPort{{ContainerPort: port}},
							Env:   envVars,
						},
					},
				},
			},
		},
	}
}

func (r *WebAppReconciler) reconcileService(ctx context.Context, webapp *appsv1alpha1.WebApp) (*corev1.Service, error) {
	logger := log.FromContext(ctx)
	port := int32(8080)
	if webapp.Spec.Port != nil {
		port = *webapp.Spec.Port
	}

	var svc corev1.Service
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &svc)

	if err != nil && errors.IsNotFound(err) {
		desired := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{
				Name:      webapp.Name,
				Namespace: webapp.Namespace,
				Labels: map[string]string{
					"app.kubernetes.io/instance":   webapp.Name,
					"app.kubernetes.io/managed-by": "webapp-operator",
				},
			},
			Spec: corev1.ServiceSpec{
				Selector: map[string]string{
					"app.kubernetes.io/instance": webapp.Name,
				},
				Ports: []corev1.ServicePort{
					{Port: port, TargetPort: intstr.FromInt32(port)},
				},
			},
		}
		if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
			return nil, err
		}
		logger.Info("Creating Service", "name", desired.Name)
		if err := r.Create(ctx, desired); err != nil {
			return nil, err
		}
		return desired, nil
	} else if err != nil {
		return nil, err
	}

	return &svc, nil
}

func (r *WebAppReconciler) deploymentNeedsUpdate(current, desired *appsv1.Deployment) bool {
	if *current.Spec.Replicas != *desired.Spec.Replicas {
		return true
	}
	if current.Spec.Template.Spec.Containers[0].Image != desired.Spec.Template.Spec.Containers[0].Image {
		return true
	}
	return false
}

func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.WebApp{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}

Step 4: Status Subresource Management and Condition Updates

package controller

import (
	"context"

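	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/meta" // needed for meta.SetStatusCondition
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime" // needed for ctrl.Result
	"sigs.k8s.io/controller-runtime/pkg/log"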

	appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)

const (
	finalizerName = "apps.toolsku.dev/finalizer"
)

func (r *WebAppReconciler) updateStatus(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
	var deployment appsv1.Deployment
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)
	if err != nil {
		return err
	}

	availableReplicas := int32(0)
	if deployment.Status.AvailableReplicas > 0 {
		availableReplicas = deployment.Status.AvailableReplicas
	}

	if webapp.Status.AvailableReplicas == availableReplicas {
		return nil
	}

	webapp.Status.AvailableReplicas = availableReplicas
	return r.Status().Update(ctx, webapp)
}

func (r *WebAppReconciler) updateCondition(ctx context.Context, webapp *appsv1alpha1.WebApp, condType string, status metav1.ConditionStatus, reason, message string) {
	condition := metav1.Condition{
		Type:               condType,
		Status:             status,
		Reason:             reason,
		Message:            message,
		ObservedGeneration: webapp.Generation,
	}

	meta.SetStatusCondition(&webapp.Status.Conditions, condition)

	if err := r.Status().Update(ctx, webapp); err != nil {
		log.FromContext(ctx).Error(err, "Failed to update Condition")
	}
}

func (r *WebAppReconciler) ensureFinalizer(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
	if !containsString(webapp.Finalizers, finalizerName) {
		webapp.Finalizers = append(webapp.Finalizers, finalizerName)
		if err := r.Update(ctx, webapp); err != nil {
			return err
		}
	}
	return nil
}

func (r *WebAppReconciler) handleDeletion(ctx context.Context, webapp *appsv1alpha1.WebApp) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	if !containsString(webapp.Finalizers, finalizerName) {
		return ctrl.Result{}, nil
	}

	logger.Info("Running cleanup logic", "webapp", webapp.Name)

	if err := r.cleanupExternalResources(ctx, webapp); err != nil {
		return ctrl.Result{}, err
	}

	webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
	if err := r.Update(ctx, webapp); err != nil {
		return ctrl.Result{}, err
	}

	logger.Info("Finalizer cleanup complete", "webapp", webapp.Name)
	return ctrl.Result{}, nil
}

func (r *WebAppReconciler) cleanupExternalResources(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
	logger := log.FromContext(ctx)
	logger.Info("Cleaning up external resources", "webapp", webapp.Name)
	return nil
}

func containsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

func removeString(slice []string, s string) []string {
	var result []string
	for _, item := range slice {
		if item == s {
			continue
		}
		result = append(result, item)
	}
	return result
}
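
The hand-written `containsString` and `removeString` helpers can also be expressed with the standard `slices` package (Go 1.21 and later). The sketch below is one possible shape, not the article's code: the `addFinalizer` and `removeFinalizer` names are hypothetical, and each returns a `changed` flag so a caller can skip an `Update` call when nothing changed. In a real Operator, the `controllerutil` package in controller-runtime offers `ContainsFinalizer`, `AddFinalizer`, and `RemoveFinalizer`, which work on the object directly.

```go
package main

import (
	"fmt"
	"slices"
)

const finalizerName = "apps.toolsku.dev/finalizer"

// addFinalizer returns the list with name present exactly once and reports
// whether the list changed. The input slice is never modified.
func addFinalizer(finalizers []string, name string) ([]string, bool) {
	if slices.Contains(finalizers, name) {
		return finalizers, false
	}
	return append(slices.Clone(finalizers), name), true
}

// removeFinalizer returns the list without name and reports whether it changed.
func removeFinalizer(finalizers []string, name string) ([]string, bool) {
	if !slices.Contains(finalizers, name) {
		return finalizers, false
	}
	out := slices.DeleteFunc(slices.Clone(finalizers), func(f string) bool {
		return f == name
	})
	return out, true
}

func main() {
	f, changed := addFinalizer(nil, finalizerName)
	fmt.Println(f, changed) // added on first call

	f, changed = addFinalizer(f, finalizerName)
	fmt.Println(len(f), changed) // second call is a no-op

	f, changed = removeFinalizer(f, finalizerName)
	fmt.Println(len(f), changed) // removed
}
```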

Step 5: Admission Webhook Validation

package v1alpha1

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

func (w *WebApp) SetupWebhookWithManager(mgr ctrl.Manager) error {
	return ctrl.NewWebhookManagedBy(mgr).
		For(w).
		Complete()
}

var _ webhook.Defaulter = &WebApp{}

func (w *WebApp) Default() {
	if w.Spec.Port == nil {
		port := int32(8080)
		w.Spec.Port = &port
	}
	if w.Spec.Replicas == nil {
		replicas := int32(1)
		w.Spec.Replicas = &replicas
	}
}

var _ webhook.Validator = &WebApp{}

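// All three Validate* methods must share the (admission.Warnings, error) return
// shape for *WebApp to satisfy webhook.Validator (the form used since
// controller-runtime v0.15; newer releases favor CustomValidator instead).
func (w *WebApp) ValidateCreate() (admission.Warnings, error) {
	if err := validateWebApp(w); err != nil {
		return nil, err
	}
	return admission.Warnings{"A Deployment and a Service will be created automatically for this WebApp"}, nil
}

func (w *WebApp) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
	oldWA, ok := old.(*WebApp)
	if !ok {
		return nil, fmt.Errorf("expected a WebApp but got %T", old)
	}
	// Replicas is a pointer, so check for nil before dereferencing.
	if oldWA.Spec.Replicas != nil && w.Spec.Replicas != nil &&
		*oldWA.Spec.Replicas > 0 && *w.Spec.Replicas == 0 {
		return nil, fmt.Errorf("scaling replicas down to 0 is not allowed; delete the WebApp resource instead")
	}
	return nil, validateWebApp(w)
}

func (w *WebApp) ValidateDelete() (admission.Warnings, error) {
	return nil, nil
}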

func validateWebApp(w *WebApp) error {
	if w.Spec.Image == "" {
		return fmt.Errorf("image must not be empty")
	}
	if w.Spec.Replicas != nil && *w.Spec.Replicas > 100 {
		return fmt.Errorf("replicas must not exceed 100, got: %d", *w.Spec.Replicas)
	}
	if w.Spec.Port != nil {
		if *w.Spec.Port < 1 || *w.Spec.Port > 65535 {
			return fmt.Errorf("port must be between 1 and 65535, got: %d", *w.Spec.Port)
		}
	}
	for i, env := range w.Spec.Env {
		if env.Name == "" {
			return fmt.Errorf("env[%d].name must not be empty", i)
		}
		if env.Value == "" && env.ValueFrom == nil {
			return fmt.Errorf("env[%d] must specify either value or valueFrom", i)
		}
	}
	return nil
}
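
`validateWebApp` stops at the first failing rule, so a user who got three fields wrong needs three round trips to find out. One alternative is to collect every violation and return them together. The sketch below does this with `errors.Join` from the standard library (Go 1.20 and later). The `spec` struct is a hypothetical, flattened mirror of `WebAppSpec` used only to keep the example self-contained.

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a hypothetical, simplified mirror of WebAppSpec (no pointers).
type spec struct {
	Image    string
	Replicas int32
	Port     int32
}

// validate checks every rule and returns all violations joined into one error.
// It returns nil when no rule is violated.
func validate(s spec) error {
	var errs []error
	if s.Image == "" {
		errs = append(errs, errors.New("image must not be empty"))
	}
	if s.Replicas < 1 || s.Replicas > 100 {
		errs = append(errs, fmt.Errorf("replicas must be between 1 and 100, got %d", s.Replicas))
	}
	if s.Port < 1 || s.Port > 65535 {
		errs = append(errs, fmt.Errorf("port must be between 1 and 65535, got %d", s.Port))
	}
	return errors.Join(errs...)
}

func main() {
	fmt.Println(validate(spec{Image: "nginx:1.25", Replicas: 2, Port: 8080}))
	fmt.Println(validate(spec{}))
}
```

For Kubernetes-style field paths in the messages, the `field.ErrorList` type from `k8s.io/apimachinery/pkg/util/validation/field` is the usual choice in webhook code; the standard-library version here only illustrates the collect-then-report idea.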

Step 6: Production Configuration: Leader Election, Graceful Shutdown, and Metrics

package main

import (
	"flag"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	// Registers all client auth plugins, as the Kubebuilder scaffold does.
	_ "k8s.io/client-go/plugin/pkg/client/auth"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	"sigs.k8s.io/controller-runtime/pkg/metrics/server"
	"sigs.k8s.io/controller-runtime/pkg/webhook"

	appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
	"github.com/toolsku/webapp-operator/internal/controller"
)

var (
	scheme   = runtime.NewScheme()
	setupLog = ctrl.Log.WithName("setup")
)

func init() {
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(appsv1alpha1.AddToScheme(scheme))
}

func main() {
	var metricsAddr string
	var enableLeaderElection bool
	var probeAddr string
	var webhookPort int

	flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "Address the metrics endpoint binds to")
	flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "Address the health probe endpoint binds to")
	flag.BoolVar(&enableLeaderElection, "leader-elect", true, "Enable Leader Election")
	flag.IntVar(&webhookPort, "webhook-port", 9443, "Webhook server port")
	opts := zap.Options{Development: true}
	opts.BindFlags(flag.CommandLine)
	flag.Parse()

	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme: scheme,
		Metrics: server.Options{
			BindAddress: metricsAddr,
		},
		WebhookServer: webhook.NewServer(webhook.Options{
			Port: webhookPort,
		}),
		HealthProbeBindAddress: probeAddr,
		LeaderElection:         enableLeaderElection,
		LeaderElectionID:       "webapp-operator.toolsku.dev",
		LeaderElectionNamespace: os.Getenv("POD_NAMESPACE"),
	})
	if err != nil {
		setupLog.Error(err, "Failed to create Manager")
		os.Exit(1)
	}

	if err := (&controller.WebAppReconciler{
		Client: mgr.GetClient(),
		Scheme: mgr.GetScheme(),
	}).SetupWithManager(mgr); err != nil {
		setupLog.Error(err, "Failed to set up Controller")
		os.Exit(1)
	}

	if err := (&appsv1alpha1.WebApp{}).SetupWebhookWithManager(mgr); err != nil {
		setupLog.Error(err, "Failed to set up Webhook")
		os.Exit(1)
	}

	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		setupLog.Error(err, "Failed to set up health check")
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		setupLog.Error(err, "Failed to set up readiness check")
		os.Exit(1)
	}

	setupLog.Info("Starting Operator",
		"metrics", metricsAddr,
		"probe", probeAddr,
		"leader-election", enableLeaderElection,
		"webhook-port", webhookPort,
	)

	// SetupSignalHandler cancels the context on SIGTERM/SIGINT so the Manager can shut down gracefully.
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "Manager exited with an error")
		os.Exit(1)
	}
}

Controller Manager deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-operator-controller-manager
  namespace: webapp-operator-system
spec:
  replicas: 2
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
      annotations:
        kubectl.kubernetes.io/default-container: manager
    spec:
      serviceAccountName: controller-manager
      containers:
        - name: manager
          image: toolsku/webapp-operator:v1.0.0
          args:
            - --leader-elect
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --webhook-port=9443
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
            - containerPort: 9443
              name: webhook
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 128Mi
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: cert
              mountPath: /tmp/k8s-webhook-server/serving-certs
              readOnly: true
      volumes:
        - name: cert
          secret:
            defaultMode: 420
            secretName: webhook-server-cert
      terminationGracePeriodSeconds: 60

Six Pitfalls to Avoid

Pitfall 1: A non-idempotent Reconcile creates resources repeatedly

Wrong:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	r.Get(ctx, req.NamespacedName, &webapp)

	deploy := r.buildDeployment(&webapp)
	r.Create(ctx, deploy)
	return ctrl.Result{}, nil
}

Right:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		if errors.IsNotFound(err) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	var existing appsv1.Deployment
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &existing)
	if err != nil && errors.IsNotFound(err) {
		deploy := r.buildDeployment(&webapp)
		if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
			return ctrl.Result{}, err
		}
		if err := r.Create(ctx, deploy); err != nil {
			return ctrl.Result{}, err
		}
	} else if err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

Pitfall 2: Status updates conflicting with Spec updates

Wrong:

webapp.Status.AvailableReplicas = 3
if err := r.Update(ctx, webapp); err != nil {
	return ctrl.Result{}, err
}

Right:

webapp.Status.AvailableReplicas = 3
if err := r.Status().Update(ctx, webapp); err != nil {
	if errors.IsConflict(err) {
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, err
}

Pitfall 3: Mishandled Finalizers leave resources stuck in Terminating

Wrong:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	r.Get(ctx, req.NamespacedName, &webapp)
	if webapp.DeletionTimestamp != nil {
		return ctrl.Result{}, nil
	}
	webapp.Finalizers = append(webapp.Finalizers, "my-finalizer")
	r.Update(ctx, &webapp)
	return ctrl.Result{}, nil
}

Right:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if webapp.DeletionTimestamp != nil {
		if containsString(webapp.Finalizers, finalizerName) {
			if err := r.cleanupExternalResources(ctx, &webapp); err != nil {
				return ctrl.Result{}, err
			}
			webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
			if err := r.Update(ctx, &webapp); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	if !containsString(webapp.Finalizers, finalizerName) {
		webapp.Finalizers = append(webapp.Finalizers, finalizerName)
		if err := r.Update(ctx, &webapp); err != nil {
			return ctrl.Result{}, err
		}
	}

	return r.reconcileNormal(ctx, &webapp)
}

Pitfall 4: A missing OwnerReference leaks child resources

Wrong:

deploy := &appsv1.Deployment{
	ObjectMeta: metav1.ObjectMeta{
		Name:      webapp.Name,
		Namespace: webapp.Namespace,
	},
}
r.Create(ctx, deploy)

Right:

deploy := &appsv1.Deployment{
	ObjectMeta: metav1.ObjectMeta{
		Name:      webapp.Name,
		Namespace: webapp.Namespace,
	},
}
if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
	return ctrl.Result{}, err
}
if err := r.Create(ctx, deploy); err != nil && !errors.IsAlreadyExists(err) {
	return ctrl.Result{}, err
}

Pitfall 5: Returning Conflict errors from Reconcile instead of handling them

Wrong:

if err := r.Update(ctx, &webapp); err != nil {
	return ctrl.Result{}, err
}

Right:

if err := r.Update(ctx, &webapp); err != nil {
	if errors.IsConflict(err) {
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, err
}

Pitfall 6: A Webhook without certificates fails the TLS handshake

Wrong:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
    sideEffects: None
    admissionReviewVersions: ["v1"]

Right:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
  annotations:
    cert-manager.io/inject-ca-from: webapp-operator-system/webapp-operator-serving-cert
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
        port: 443
    sideEffects: None
    admissionReviewVersions: ["v1"]
    failurePolicy: Fail
    timeoutSeconds: 10
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: webapp-operator-serving-cert
  namespace: webapp-operator-system
spec:
  dnsNames:
    - webhook-service.webapp-operator-system.svc
    - webhook-service.webapp-operator-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: webapp-operator-selfsigned-issuer
  secretName: webhook-server-cert

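Troubleshooting Cheat Sheet

| # | Error | Cause | Fix |
| --- | --- | --- | --- |
| 1 | the server could not find the requested resource | The CRD is not installed, or the API version does not match | Run kubectl apply -f config/crd/bases/ and check that apiVersion matches the CRD |
| 2 | Operation cannot be fulfilled on webapps.apps.toolsku.dev: the object has been modified | Optimistic-lock conflict: another process modified the object during Reconcile | Catch the IsConflict error and return ctrl.Result{Requeue: true} so Reconcile runs again |
| 3 | webhook server: TLS handshake error | The Webhook certificate is misconfigured or expired | Install cert-manager and configure a Certificate resource to issue certificates automatically |
| 4 | resource stuck in Terminating state | The Finalizer was never removed because the Controller has stopped running | Force removal with kubectl patch <cr> -p '{"metadata":{"finalizers":[]}}' --type=merge |
| 5 | no matches for kind "WebApp" in version "apps.toolsku.dev/v1alpha1" | The version entry in the CRD's versions list has served: false or storage: false | Check the CRD definition and make sure the target version has served: true |
| 6 | failed to call webhook: Post "https://...": x509: certificate signed by unknown authority | The Webhook CA certificate was not injected into the ValidatingWebhookConfiguration | Configure the cert-manager inject-ca-from annotation, or set caBundle manually |
| 7 | controller: Reconciler error: failed to get deployment: deployments is forbidden | Insufficient RBAC permissions: the ServiceAccount lacks access to the corresponding API | Check the Role and RoleBinding under config/rbac/ and add permissions for the deployments resource in the apps group |
| 8 | leader-election: failed to renew lease | Updating the Leader Election Lease object failed, usually a network or permission problem | Check permissions on the leases resource in coordination.k8s.io and confirm network connectivity |
| 9 | too many open files | The Controller's Watch connections exceed the system file descriptor limit | Raise ulimit -n, or watch fewer resource types and filter events with WithEventFilter |
| 10 | webhook: admission webhook denied the request: replicas cannot be 0 | The Validating Webhook rejected an invalid request | Check the Webhook validation logic and confirm the request satisfies the business constraints |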

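Three Advanced Optimization Techniques

Technique 1: Event filtering to reduce unnecessary Reconciles

By default, every Status update also triggers a Reconcile. A predicate filters out the events that need no handling: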

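import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"

	appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)

func WebAppPredicate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			// Use checked assertions: an unchecked one panics on any non-WebApp object.
			oldObj, okOld := e.ObjectOld.(*appsv1alpha1.WebApp)
			newObj, okNew := e.ObjectNew.(*appsv1alpha1.WebApp)
			if !okOld || !okNew {
				return true
			}

			// Generation only changes when the Spec changes.
			if oldObj.Generation != newObj.Generation {
				return true
			}

			// Compare by value, not by pointer: react when deletion starts.
			if oldObj.DeletionTimestamp.IsZero() != newObj.DeletionTimestamp.IsZero() {
				return true
			}

			if !reflect.DeepEqual(oldObj.Finalizers, newObj.Finalizers) {
				return true
			}

			return false
		},
		CreateFunc: func(e event.CreateEvent) bool {
			return true
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			return true
		},
		GenericFunc: func(e event.GenericEvent) bool {
			return false
		},
	}
}

func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	// Attach the predicate to the WebApp watch only. WithEventFilter would also
	// apply it to the owned Deployment and Service events, which are not WebApps.
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.WebApp{}, builder.WithPredicates(WebAppPredicate())).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}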

Technique 2: Reconcile results and Requeue strategy

Use RequeueAfter to schedule a delayed re-check instead of polling inside Reconcile:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if webapp.DeletionTimestamp != nil {
		return r.handleDeletion(ctx, &webapp)
	}

	if err := r.reconcileResources(ctx, &webapp); err != nil {
		return ctrl.Result{}, err
	}

	if !r.isWebAppReady(ctx, &webapp) {
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}

	return ctrl.Result{}, nil
}

func (r *WebAppReconciler) isWebAppReady(ctx context.Context, webapp *appsv1alpha1.WebApp) bool {
	var deployment appsv1.Deployment
	if err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment); err != nil {
		return false
	}
	// Guard against a nil Replicas pointer before dereferencing it.
	return webapp.Spec.Replicas != nil && deployment.Status.ReadyReplicas == *webapp.Spec.Replicas
}

Technique 3: Custom Metrics that expose Controller runtime indicators

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	reconcileTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "webapp_operator_reconcile_total",
			Help: "Total number of WebApp Operator reconciles",
		},
		[]string{"name", "namespace", "result"},
	)

	reconcileDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "webapp_operator_reconcile_duration_seconds",
			Help:    "Distribution of WebApp Operator reconcile durations",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"name", "namespace"},
	)

	managedResources = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "webapp_operator_managed_resources",
			Help: "Number of WebApp resources currently managed",
		},
		[]string{"namespace"},
	)
)

func init() {
	metrics.Registry.MustRegister(reconcileTotal, reconcileDuration, managedResources)
}

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	start := time.Now()
	defer func() {
		reconcileDuration.WithLabelValues(req.Name, req.Namespace).Observe(time.Since(start).Seconds())
	}()

	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		if errors.IsNotFound(err) {
			reconcileTotal.WithLabelValues(req.Name, req.Namespace, "not_found").Inc()
			return ctrl.Result{}, nil
		}
		reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
		return ctrl.Result{}, err
	}

	if err := r.reconcileResources(ctx, &webapp); err != nil {
		reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
		return ctrl.Result{}, err
	}

	reconcileTotal.WithLabelValues(req.Name, req.Namespace, "success").Inc()
	return ctrl.Result{}, nil
}

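Comparison of Operator Development Approaches

| Dimension | raw client-go | Kubebuilder | Operator SDK (Go) | Helm Operator |
| --- | --- | --- | --- | --- |
| Language | Go | Go | Go/Ansible/Helm | Helm |
| Scaffolding | | Full-featured CLI | Full-featured CLI | Built into the SDK |
| Learning curve | Steep | Moderate | Moderate | Gentle |
| Code generation | | CRD types + Webhook | CRD types + Webhook | |
| Flexibility | Highest | | | |
| Controller complexity | Hand-written | controller-runtime | controller-runtime | Generated |
| Webhook support | Hand-written | Generated | Generated | Not supported |
| Leader Election | Hand-written | Built in | Built in | Built in |
| Metrics | Hand-written | Built in | Built in | Built in |
| Multi-version CRD | Hand-written | Conversion Webhook | Conversion Webhook | Not supported |
| Community | Official K8s | CNCF | CNCF | CNCF |
| Best suited for | Heavy customization | General-purpose Operators | Rapid development | Simple applications |
| Production readiness | Needs extensive wrapping | Works out of the box | Works out of the box | Works out of the box |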

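Summary

A CRD Operator is more than "write a CRD and run a Controller". A production-grade Operator has to get six things right: the CRD Schema must pin down validation rules for every field; the Reconcile Loop must be idempotent and handle Conflict correctly; the Status Subresource must keep the Spec and Status update paths separate; Finalizers must never leave resource cleanup deadlocked; Webhooks need certificates and a failure policy; and the production configuration must include Leader Election, Graceful Shutdown, and Metrics. Of the approaches compared above, Kubebuilder is the framework this article recommends, because it offers a good balance between flexibility and development speed.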
