Kubernetes CRD Operator Development: 6 Production Patterns from CRD Design to Controller Implementation

Cloud Native

When You're Writing 2000 Lines of YAML Just to Deploy One App

Have you ever felt this pain: deploying a single microservice means hand-writing seven resource types (Deployment, Service, ConfigMap, Secret, Ingress, HPA, and PDB), with image versions, replica counts, and environment variables that differ in every environment. A new hire needs a week just to understand the deployment process. Worse, at 3 AM during a production incident you discover that the database connection string in a ConfigMap is wrong, but the YAML files are scattered across 5 Git repos and you have no idea which one to change.

This is the "combinatorial explosion" problem of the native K8s APIs: the low-level resources are too granular and carry no business semantics; hand-maintained YAML has no validation of its own, so a single indentation error can break an entire deployment; and multi-environment configs never converge, so the differences between dev, staging, and prod YAML are tracked entirely by manual comparison.

CRD + Operator changes this. You define your own business API in K8s and let a Controller automate the deployment logic: you write one custom resource (WebApp in this article), and the Operator creates the child resources, handles version upgrades, and manages config changes. This article walks through 6 production-grade CRD Operator development patterns from scratch.

Core Concepts Reference Table

Concept | Full Name | Description
CRD | Custom Resource Definition | Definition of a custom resource type in K8s, declaring API path, fields, and validation rules
CR | Custom Resource | An instance of a type defined by a CRD, i.e. a concrete object created by users
Operator | Operator Pattern | Pattern for automating K8s application management through CRD + Controller
Controller | Controller | Control loop that continuously watches cluster state and drives actual state toward desired state
Reconcile | Reconciliation Loop | Core logic of a Controller: compares desired and actual state and performs the operations needed to close the gap
Kubebuilder | Kubebuilder | Framework for building Operators on top of controller-runtime, providing project scaffolding and code generation
controller-runtime | controller-runtime | Go library for implementing K8s Controllers, providing the Watch/Reconcile/Event core mechanisms
Finalizer | Finalizer | Hook that blocks deletion of a resource until the Controller has completed its cleanup
Status Subresource | Status Subresource | Separates Spec and Status into two independent update paths to avoid update conflicts
Owner Reference | Owner Reference | Ownership relationship between resources, enabling cascading deletion and garbage collection
Event | Event | K8s event records that show Controller operations and errors to users
Webhook | Admission Webhook | API request interception and validation mechanism, with Mutating and Validating types

Six Challenges: Why CRD Operator Development Isn't "Just Write a CRD"

  1. CRD Schema Design Traps: Imprecise field type definitions, missing OpenAPI v3 validation, allowing users to submit invalid data; no advance planning for version compatibility, making v1-to-v2 migration a nightmare

  2. Reconcile Loop Idempotency: A Controller's Reconcile can be triggered repeatedly for the same object. If the logic isn't idempotent, it will create duplicate resources or repeat operations. This is the most common and most damaging bug

  3. Status Update Conflicts: Multiple Controllers updating the same resource's Status simultaneously, or Spec being modified while Controller updates Status, causing optimistic lock conflicts and lost updates

  4. Finalizer Deadlocks: If the Controller stops running or is uninstalled after adding a Finalizer, the resource stays in Terminating state until the Finalizer is removed by hand, and in the meantime it can be neither deleted nor recreated under the same name

  5. Event Storms: In large-scale clusters, a single CRD change can trigger creation/update of hundreds of child resources, overwhelming the Controller and causing Workqueue backlog

  6. Production-Grade Gaps: Missing Leader Election causing multi-replica Controllers to execute redundantly, missing Graceful Shutdown causing interrupted Reconcile, missing Metrics making Controller health unmonitorable
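
Challenge 5 depends on how the workqueue absorbs bursts of events. The following is a simplified, stdlib-only model of that idea, not client-go's actual workqueue: keys that are already pending are dropped, so many events for the same object collapse into one Reconcile. The names `dedupQueue`, `Add`, `Get`, and `Len` are illustrative assumptions.

```go
package main

import "fmt"

// dedupQueue is a toy model of a controller workqueue: it stores object
// keys (namespace/name) and ignores keys that are already waiting.
type dedupQueue struct {
	order   []string            // FIFO order of pending keys
	pending map[string]struct{} // set of keys currently in the queue
}

func newDedupQueue() *dedupQueue {
	return &dedupQueue{pending: make(map[string]struct{})}
}

// Add enqueues key unless it is already pending. It reports whether the
// key was actually added.
func (q *dedupQueue) Add(key string) bool {
	if _, ok := q.pending[key]; ok {
		return false
	}
	q.pending[key] = struct{}{}
	q.order = append(q.order, key)
	return true
}

// Get removes and returns the oldest pending key; ok is false when empty.
func (q *dedupQueue) Get() (key string, ok bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key = q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// Len returns the number of pending keys.
func (q *dedupQueue) Len() int { return len(q.order) }

func main() {
	q := newDedupQueue()
	// 300 events for three objects collapse into three queue entries.
	for i := 0; i < 100; i++ {
		q.Add("default/web-a")
		q.Add("default/web-b")
		q.Add("default/web-c")
	}
	fmt.Println("pending keys:", q.Len())
	for key, ok := q.Get(); ok; key, ok = q.Get() {
		fmt.Println("reconcile", key)
	}
}
```

Deduplication only helps when Reconcile reads the current state instead of relying on the event that triggered it, which is another reason to keep the loop idempotent.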

Six-Step Implementation: From CRD Design to Controller Implementation

Step 1: CRD Definition with OpenAPI v3 Schema

CRD Definition (with full validation):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: webapps.apps.toolsku.dev
spec:
  group: apps.toolsku.dev
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - image
                - replicas
              properties:
                image:
                  type: string
                  pattern: '^[a-zA-Z0-9][a-zA-Z0-9._/:-]*:[a-zA-Z0-9_][a-zA-Z0-9._-]*$'
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                port:
                  type: integer
                  minimum: 1
                  maximum: 65535
                  default: 8080
                resources:
                  type: object
                  properties:
                    requests:
                      type: object
                      properties:
                        cpu:
                          type: string
                          pattern: '^([0-9]+m|[0-9]+(\.[0-9]+)?)$'
                        memory:
                          type: string
                          pattern: '^[0-9]+(Ki|Mi|Gi|Ti)$'
                    limits:
                      type: object
                      properties:
                        cpu:
                          type: string
                        memory:
                          type: string
                env:
                  type: array
                  items:
                    type: object
                    required:
                      - name
                    properties:
                      name:
                        type: string
                        maxLength: 256
                      value:
                        type: string
                      valueFrom:
                        type: object
                        properties:
                          secretKeyRef:
                            type: object
                            required:
                              - name
                              - key
                            properties:
                              name:
                                type: string
                              key:
                                type: string
            status:
              type: object
              properties:
                availableReplicas:
                  type: integer
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum:
                          - "True"
                          - "False"
                          - Unknown
                      lastTransitionTime:
                        type: string
                        format: date-time
                      reason:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
        scale:
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.availableReplicas
      additionalPrinterColumns:
        - name: Image
          type: string
          jsonPath: .spec.image
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Available
          type: integer
          jsonPath: .status.availableReplicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: webapps
    singular: webapp
    kind: WebApp
    shortNames:
      - wa
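
Schema patterns are easy to get subtly wrong, so it can help to exercise them against sample values before applying the CRD. The sketch below uses Go's regexp package as a convenient stand-in for the API server's pattern check. It compares a strict `name:tag` image pattern with a broader one that also admits registry hosts and paths, and runs the CPU and memory request patterns from the schema. The variable names and sample values are assumptions for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// strictImage accepts only a plain "name:tag" reference.
var strictImage = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$`)

// broadImage is a wider rule that also admits "/" and ":" before the
// final tag separator (registry host, port, and repository path).
var broadImage = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._/:-]*:[a-zA-Z0-9_][a-zA-Z0-9._-]*$`)

// cpuQuantity and memQuantity are the request patterns from the schema.
var cpuQuantity = regexp.MustCompile(`^([0-9]+m|[0-9]+(\.[0-9]+)?)$`)
var memQuantity = regexp.MustCompile(`^[0-9]+(Ki|Mi|Gi|Ti)$`)

func main() {
	images := []string{"nginx:1.25", "nginx", "toolsku/webapp:v1.0.0", "localhost:5000/team/app:dev"}
	for _, img := range images {
		fmt.Printf("%-30s strict=%-5v broad=%v\n", img, strictImage.MatchString(img), broadImage.MatchString(img))
	}
	for _, q := range []string{"500m", "0.5", "2", "1.5m"} {
		fmt.Printf("cpu    %-6s ok=%v\n", q, cpuQuantity.MatchString(q))
	}
	for _, q := range []string{"128Mi", "1Gi", "512M", "1.5Gi"} {
		fmt.Printf("memory %-6s ok=%v\n", q, memQuantity.MatchString(q))
	}
}
```

Note that the strict pattern rejects any image with a registry path, and neither pattern accepts an untagged image or a digest reference, so the choice of pattern is a policy decision as much as a syntax check.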

Step 2: Kubebuilder Project Scaffolding

# Initialize Kubebuilder project
mkdir webapp-operator && cd webapp-operator
kubebuilder init --domain toolsku.dev --repo github.com/toolsku/webapp-operator

# Create API (CRD + Controller)
kubebuilder create api --group apps --version v1alpha1 --kind WebApp

# Create Webhook
kubebuilder create webhook --group apps --version v1alpha1 --kind WebApp --defaulting --programmatic-validation

# Project structure
# ├── api/v1alpha1/
# │   ├── webapp_types.go      # CRD type definitions
# │   ├── webapp_webhook.go    # Webhook logic
# │   ├── webhook_suite_test.go
# │   └── groupversion_info.go
# ├── internal/controller/
# │   ├── webapp_controller.go # Controller reconciliation logic
# │   └── suite_test.go
# ├── cmd/
# │   └── main.go              # Entry point
# ├── config/
# │   ├── crd/                 # CRD YAML
# │   ├── rbac/                # RBAC config
# │   ├── manager/             # Controller Manager deployment
# │   └── samples/             # Sample CR
# └── Dockerfile

CRD Type Definitions (Go):

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type WebAppSpec struct {
	Image     string                      `json:"image"`
	Replicas  *int32                      `json:"replicas"`
	Port      *int32                      `json:"port,omitempty"`
	Resources *WebAppResourceRequirements `json:"resources,omitempty"`
	Env       []WebAppEnvVar              `json:"env,omitempty"`
}

type WebAppResourceRequirements struct {
	Requests *ResourceList `json:"requests,omitempty"`
	Limits   *ResourceList `json:"limits,omitempty"`
}

type ResourceList struct {
	CPU    string `json:"cpu,omitempty"`
	Memory string `json:"memory,omitempty"`
}

type WebAppEnvVar struct {
	Name      string          `json:"name"`
	Value     string          `json:"value,omitempty"`
	ValueFrom *EnvVarSource   `json:"valueFrom,omitempty"`
}

type EnvVarSource struct {
	SecretKeyRef *SecretKeyRef `json:"secretKeyRef,omitempty"`
}

type SecretKeyRef struct {
	Name string `json:"name"`
	Key  string `json:"key"`
}

type WebAppStatus struct {
	AvailableReplicas int32              `json:"availableReplicas,omitempty"`
	Conditions       []metav1.Condition `json:"conditions,omitempty"`
}

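// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// WebApp is the Schema for the webapps API.
type WebApp struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   WebAppSpec   `json:"spec,omitempty"`
	Status WebAppStatus `json:"status,omitempty"`
}

// +kubebuilder:object:root=true

// WebAppList contains a list of WebApp.
type WebAppList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []WebApp `json:"items"`
}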

func init() {
	SchemeBuilder.Register(&WebApp{}, &WebAppList{})
}

Step 3: Controller Reconcile Loop Implementation

package controller

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/intstr"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"

	appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)

type WebAppReconciler struct {
	client.Client
	Scheme *runtime.Scheme
}

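// +kubebuilder:rbac:groups=apps.toolsku.dev,resources=webapps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps.toolsku.dev,resources=webapps/status,verbs=get;update;patch
// +kubebuilder:rbac:groups=apps.toolsku.dev,resources=webapps/finalizers,verbs=update
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=core,resources=services,verbs=get;list;watch;create;update;patch;delete

// Reconcile drives the Deployment and Service of a WebApp toward its Spec.
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {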
	logger := log.FromContext(ctx)

	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		if errors.IsNotFound(err) {
			logger.Info("WebApp resource not found, ignoring")
			return ctrl.Result{}, nil
		}
		logger.Error(err, "Failed to get WebApp")
		return ctrl.Result{}, err
	}

	if webapp.DeletionTimestamp != nil {
		return r.handleDeletion(ctx, &webapp)
	}

	if err := r.ensureFinalizer(ctx, &webapp); err != nil {
		return ctrl.Result{}, err
	}

	if _, err := r.reconcileDeployment(ctx, &webapp); err != nil {
		r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
		return ctrl.Result{}, err
	}

	if _, err := r.reconcileService(ctx, &webapp); err != nil {
		r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
		return ctrl.Result{}, err
	}

	if err := r.updateStatus(ctx, &webapp); err != nil {
		return ctrl.Result{}, err
	}

	r.updateCondition(ctx, &webapp, "Available", metav1.ConditionTrue, "ReconcileSuccess", "WebApp is available")

	return ctrl.Result{}, nil
}

func (r *WebAppReconciler) reconcileDeployment(ctx context.Context, webapp *appsv1alpha1.WebApp) (*appsv1.Deployment, error) {
	logger := log.FromContext(ctx)

	var deployment appsv1.Deployment
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)

	if err != nil && errors.IsNotFound(err) {
		desired := r.buildDeployment(webapp)
		if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
			return nil, fmt.Errorf("failed to set OwnerReference: %w", err)
		}
		logger.Info("Creating Deployment", "name", desired.Name)
		if err := r.Create(ctx, desired); err != nil {
			return nil, fmt.Errorf("failed to create Deployment: %w", err)
		}
		return desired, nil
	} else if err != nil {
		return nil, fmt.Errorf("failed to query Deployment: %w", err)
	}

	desired := r.buildDeployment(webapp)
	if r.deploymentNeedsUpdate(&deployment, desired) {
		deployment.Spec = desired.Spec
		logger.Info("Updating Deployment", "name", deployment.Name)
		if err := r.Update(ctx, &deployment); err != nil {
			return nil, fmt.Errorf("failed to update Deployment: %w", err)
		}
	}

	return &deployment, nil
}

func (r *WebAppReconciler) buildDeployment(webapp *appsv1alpha1.WebApp) *appsv1.Deployment {
	replicas := int32(1)
	if webapp.Spec.Replicas != nil {
		replicas = *webapp.Spec.Replicas
	}
	port := int32(8080)
	if webapp.Spec.Port != nil {
		port = *webapp.Spec.Port
	}

	envVars := make([]corev1.EnvVar, 0, len(webapp.Spec.Env))
	for _, e := range webapp.Spec.Env {
		ev := corev1.EnvVar{Name: e.Name, Value: e.Value}
		if e.ValueFrom != nil && e.ValueFrom.SecretKeyRef != nil {
			ev.ValueFrom = &corev1.EnvVarSource{
				SecretKeyRef: &corev1.SecretKeySelector{
					LocalObjectReference: corev1.LocalObjectReference{Name: e.ValueFrom.SecretKeyRef.Name},
					Key:                  e.ValueFrom.SecretKeyRef.Key,
				},
			}
		}
		envVars = append(envVars, ev)
	}

	return &appsv1.Deployment{
		ObjectMeta: metav1.ObjectMeta{
			Name:      webapp.Name,
			Namespace: webapp.Namespace,
			Labels: map[string]string{
				"app.kubernetes.io/name":       "webapp",
				"app.kubernetes.io/instance":   webapp.Name,
				"app.kubernetes.io/managed-by": "webapp-operator",
			},
		},
		Spec: appsv1.DeploymentSpec{
			Replicas: &replicas,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{
					"app.kubernetes.io/instance": webapp.Name,
				},
			},
			Template: corev1.PodTemplateSpec{
				ObjectMeta: metav1.ObjectMeta{
					Labels: map[string]string{
						"app.kubernetes.io/name":     "webapp",
						"app.kubernetes.io/instance": webapp.Name,
					},
				},
				Spec: corev1.PodSpec{
					Containers: []corev1.Container{
						{
							Name:  "webapp",
							Image: webapp.Spec.Image,
							Ports: []corev1.ContainerPort{{ContainerPort: port}},
							Env:   envVars,
						},
					},
				},
			},
		},
	}
}

func (r *WebAppReconciler) reconcileService(ctx context.Context, webapp *appsv1alpha1.WebApp) (*corev1.Service, error) {
	logger := log.FromContext(ctx)
	port := int32(8080)
	if webapp.Spec.Port != nil {
		port = *webapp.Spec.Port
	}

	var svc corev1.Service
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &svc)

	if err != nil && errors.IsNotFound(err) {
		desired := &corev1.Service{
			ObjectMeta: metav1.ObjectMeta{
				Name:      webapp.Name,
				Namespace: webapp.Namespace,
				Labels: map[string]string{
					"app.kubernetes.io/instance":   webapp.Name,
					"app.kubernetes.io/managed-by": "webapp-operator",
				},
			},
			Spec: corev1.ServiceSpec{
				Selector: map[string]string{
					"app.kubernetes.io/instance": webapp.Name,
				},
				Ports: []corev1.ServicePort{
					{Port: port, TargetPort: intstr.FromInt32(port)},
				},
			},
		}
		if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
			return nil, err
		}
		logger.Info("Creating Service", "name", desired.Name)
		if err := r.Create(ctx, desired); err != nil {
			return nil, err
		}
		return desired, nil
	} else if err != nil {
		return nil, err
	}

	return &svc, nil
}

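// deploymentNeedsUpdate compares only replicas and image. Changes to port
// or env are not detected by this check and would need their own comparison.
func (r *WebAppReconciler) deploymentNeedsUpdate(current, desired *appsv1.Deployment) bool {
	if current.Spec.Replicas == nil || *current.Spec.Replicas != *desired.Spec.Replicas {
		return true
	}
	containers := current.Spec.Template.Spec.Containers
	if len(containers) == 0 || containers[0].Image != desired.Spec.Template.Spec.Containers[0].Image {
		return true
	}
	return false
}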

func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&appsv1alpha1.WebApp{}).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}

Step 4: Status Subresource Management and Condition Updates

package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log"

	appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)

const (
	finalizerName = "apps.toolsku.dev/finalizer"
)

func (r *WebAppReconciler) updateStatus(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
	var deployment appsv1.Deployment
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)
	if err != nil {
		return err
	}

	availableReplicas := deployment.Status.AvailableReplicas
	if webapp.Status.AvailableReplicas == availableReplicas {
		return nil // nothing changed, skip the API call
	}

	webapp.Status.AvailableReplicas = availableReplicas
	return r.Status().Update(ctx, webapp)
}

func (r *WebAppReconciler) updateCondition(ctx context.Context, webapp *appsv1alpha1.WebApp, condType string, status metav1.ConditionStatus, reason, message string) {
	condition := metav1.Condition{
		Type:               condType,
		Status:             status,
		Reason:             reason,
		Message:            message,
		ObservedGeneration: webapp.Generation,
	}

	meta.SetStatusCondition(&webapp.Status.Conditions, condition)

	if err := r.Status().Update(ctx, webapp); err != nil {
		log.FromContext(ctx).Error(err, "Failed to update Condition")
	}
}

func (r *WebAppReconciler) ensureFinalizer(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
	if !containsString(webapp.Finalizers, finalizerName) {
		webapp.Finalizers = append(webapp.Finalizers, finalizerName)
		if err := r.Update(ctx, webapp); err != nil {
			return err
		}
	}
	return nil
}

func (r *WebAppReconciler) handleDeletion(ctx context.Context, webapp *appsv1alpha1.WebApp) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	if !containsString(webapp.Finalizers, finalizerName) {
		return ctrl.Result{}, nil
	}

	logger.Info("Executing cleanup logic", "webapp", webapp.Name)

	if err := r.cleanupExternalResources(ctx, webapp); err != nil {
		return ctrl.Result{}, err
	}

	webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
	if err := r.Update(ctx, webapp); err != nil {
		return ctrl.Result{}, err
	}

	logger.Info("Finalizer cleanup completed", "webapp", webapp.Name)
	return ctrl.Result{}, nil
}

func (r *WebAppReconciler) cleanupExternalResources(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
	logger := log.FromContext(ctx)
	logger.Info("Cleaning up external resources", "webapp", webapp.Name)
	return nil
}

func containsString(slice []string, s string) bool {
	for _, item := range slice {
		if item == s {
			return true
		}
	}
	return false
}

func removeString(slice []string, s string) []string {
	var result []string
	for _, item := range slice {
		if item == s {
			continue
		}
		result = append(result, item)
	}
	return result
}

Step 5: Admission Webhook Validation

package v1alpha1

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/webhook"
	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)

func (w *WebApp) SetupWebhookWithManager(mgr ctrl.Manager) error {
	return ctrl.NewWebhookManagedBy(mgr).
		For(w).
		Complete()
}

var _ webhook.Defaulter = &WebApp{}

func (w *WebApp) Default() {
	if w.Spec.Port == nil {
		port := int32(8080)
		w.Spec.Port = &port
	}
	if w.Spec.Replicas == nil {
		replicas := int32(1)
		w.Spec.Replicas = &replicas
	}
}

var _ webhook.Validator = &WebApp{}

func (w *WebApp) ValidateCreate() (admission.Warnings, error) {
	if err := validateWebApp(w); err != nil {
		return nil, err
	}
	return admission.Warnings{"WebApp will automatically deploy Deployment and Service upon creation"}, nil
}

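// Since controller-runtime v0.15, all three Validator methods return
// (admission.Warnings, error).
func (w *WebApp) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
	oldWA, ok := old.(*WebApp)
	if !ok {
		return nil, fmt.Errorf("expected a WebApp but got %T", old)
	}
	if oldWA.Spec.Replicas != nil && w.Spec.Replicas != nil &&
		*oldWA.Spec.Replicas > 0 && *w.Spec.Replicas == 0 {
		return nil, fmt.Errorf("scaling replicas to 0 is not allowed, please delete the WebApp resource directly")
	}
	return nil, validateWebApp(w)
}

func (w *WebApp) ValidateDelete() (admission.Warnings, error) {
	return nil, nil
}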

func validateWebApp(w *WebApp) error {
	if w.Spec.Image == "" {
		return fmt.Errorf("image cannot be empty")
	}
	if w.Spec.Replicas != nil && *w.Spec.Replicas > 100 {
		return fmt.Errorf("replicas cannot exceed 100, current value: %d", *w.Spec.Replicas)
	}
	if w.Spec.Port != nil {
		if *w.Spec.Port < 1 || *w.Spec.Port > 65535 {
			return fmt.Errorf("port must be between 1-65535, current value: %d", *w.Spec.Port)
		}
	}
	for i, env := range w.Spec.Env {
		if env.Name == "" {
			return fmt.Errorf("env[%d].name cannot be empty", i)
		}
		if env.Value == "" && env.ValueFrom == nil {
			return fmt.Errorf("env[%d] must specify value or valueFrom", i)
		}
	}
	return nil
}
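
validateWebApp returns on the first violation, so a user with three mistakes needs three round trips to find them. A possible alternative, sketched here with only the standard library (errors.Join needs Go 1.20 or later), is to collect every violation and report them together. The `spec` struct is a hypothetical, flattened subset of the WebApp spec used only for this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// spec is a hypothetical, flattened subset of the WebApp spec.
type spec struct {
	Image    string
	Replicas int32
	Port     int32
}

// validate collects every rule violation instead of stopping at the first.
func validate(s spec) error {
	var errs []error
	if s.Image == "" {
		errs = append(errs, errors.New("spec.image: cannot be empty"))
	}
	if s.Replicas < 1 || s.Replicas > 100 {
		errs = append(errs, fmt.Errorf("spec.replicas: must be between 1 and 100, got %d", s.Replicas))
	}
	if s.Port < 1 || s.Port > 65535 {
		errs = append(errs, fmt.Errorf("spec.port: must be between 1 and 65535, got %d", s.Port))
	}
	// errors.Join returns nil when errs is empty.
	return errors.Join(errs...)
}

func main() {
	if err := validate(spec{Image: "", Replicas: 500, Port: 8080}); err != nil {
		fmt.Println("rejected:")
		fmt.Println(err)
	}
	ok := validate(spec{Image: "nginx:1.25", Replicas: 3, Port: 8080}) == nil
	fmt.Println("valid spec accepted:", ok)
}
```

In a real webhook the same idea is usually expressed with field-path-aware error lists, but the aggregation principle is the same.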

Step 6: Production Configuration—Leader Election, Graceful Shutdown, and Metrics

package main

import (
	"flag"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	// Import the client auth plugins so out-of-cluster runs can use them.
	_ "k8s.io/client-go/plugin/pkg/client/auth"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"
	"sigs.k8s.io/controller-runtime/pkg/metrics/server"
	"sigs.k8s.io/controller-runtime/pkg/webhook"

	appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
	"github.com/toolsku/webapp-operator/internal/controller"
)

var (
	scheme   = runtime.NewScheme()
	setupLog = ctrl.Log.WithName("setup")
)

func init() {
	utilruntime.Must(clientgoscheme.AddToScheme(scheme))
	utilruntime.Must(appsv1alpha1.AddToScheme(scheme))
}

func main() {
	var metricsAddr string
	var enableLeaderElection bool
	var probeAddr string
	var webhookPort int

	flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "Metrics server address")
	flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "Health probe address")
	flag.BoolVar(&enableLeaderElection, "leader-elect", true, "Enable Leader Election")
	flag.IntVar(&webhookPort, "webhook-port", 9443, "Webhook port")
	opts := zap.Options{Development: true}
	opts.BindFlags(flag.CommandLine)
	flag.Parse()

	ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme: scheme,
		Metrics: server.Options{
			BindAddress: metricsAddr,
		},
		WebhookServer: webhook.NewServer(webhook.Options{
			Port: webhookPort,
		}),
		HealthProbeBindAddress: probeAddr,
		LeaderElection:         enableLeaderElection,
		LeaderElectionID:       "webapp-operator.toolsku.dev",
		LeaderElectionNamespace: os.Getenv("POD_NAMESPACE"),
	})
	if err != nil {
		setupLog.Error(err, "Failed to create Manager")
		os.Exit(1)
	}

	if err := (&controller.WebAppReconciler{
		Client: mgr.GetClient(),
		Scheme: mgr.GetScheme(),
	}).SetupWithManager(mgr); err != nil {
		setupLog.Error(err, "Failed to setup Controller")
		os.Exit(1)
	}

	if err := (&appsv1alpha1.WebApp{}).SetupWebhookWithManager(mgr); err != nil {
		setupLog.Error(err, "Failed to setup Webhook")
		os.Exit(1)
	}

	if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
		setupLog.Error(err, "Failed to setup health check")
		os.Exit(1)
	}
	if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
		setupLog.Error(err, "Failed to setup readiness check")
		os.Exit(1)
	}

	setupLog.Info("Starting Operator",
		"metrics", metricsAddr,
		"probe", probeAddr,
		"leader-election", enableLeaderElection,
		"webhook-port", webhookPort,
	)

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "Failed to run Manager")
		os.Exit(1)
	}
}

Controller Manager Deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-operator-controller-manager
  namespace: webapp-operator-system
spec:
  replicas: 2
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
      annotations:
        kubectl.kubernetes.io/default-container: manager
    spec:
      serviceAccountName: controller-manager
      containers:
        - name: manager
          image: toolsku/webapp-operator:v1.0.0
          args:
            - --leader-elect
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --webhook-port=9443
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
            - containerPort: 9443
              name: webhook
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 128Mi
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: cert
              mountPath: /tmp/k8s-webhook-server/serving-certs
              readOnly: true
      volumes:
        - name: cert
          secret:
            defaultMode: 420
            secretName: webhook-server-cert
      terminationGracePeriodSeconds: 60

Six Pitfall Guide

Pitfall 1: Non-Idempotent Reconcile Causing Duplicate Resource Creation

Wrong approach:

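func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	r.Get(ctx, req.NamespacedName, &webapp) // error ignored

	// Create runs on every Reconcile. With a fixed name the second run fails
	// with AlreadyExists; with GenerateName it creates a duplicate. Either
	// way the error is dropped.
	deploy := r.buildDeployment(&webapp)
	r.Create(ctx, deploy)
	return ctrl.Result{}, nil
}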

Correct approach:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		if errors.IsNotFound(err) {
			return ctrl.Result{}, nil
		}
		return ctrl.Result{}, err
	}

	var existing appsv1.Deployment
	err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &existing)
	if err != nil && errors.IsNotFound(err) {
		deploy := r.buildDeployment(&webapp)
		if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
			return ctrl.Result{}, err
		}
		if err := r.Create(ctx, deploy); err != nil {
			return ctrl.Result{}, err
		}
	} else if err != nil {
		return ctrl.Result{}, err
	}

	return ctrl.Result{}, nil
}

Pitfall 2: Status Update Conflicting with Spec Update

Wrong approach:

webapp.Status.AvailableReplicas = 3
if err := r.Update(ctx, webapp); err != nil {
	return ctrl.Result{}, err
}

Correct approach:

webapp.Status.AvailableReplicas = 3
if err := r.Status().Update(ctx, webapp); err != nil {
	if errors.IsConflict(err) {
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, err
}

Pitfall 3: Improper Finalizer Handling Causing Resources Stuck in Terminating

Wrong approach:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	r.Get(ctx, req.NamespacedName, &webapp)
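	if webapp.DeletionTimestamp != nil {
		// The finalizer is never removed, so the object stays in Terminating.
		return ctrl.Result{}, nil
	}
	// Appended on every Reconcile without checking whether it is already present.
	webapp.Finalizers = append(webapp.Finalizers, "my-finalizer")
	r.Update(ctx, &webapp)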
	return ctrl.Result{}, nil
}

Correct approach:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if webapp.DeletionTimestamp != nil {
		if containsString(webapp.Finalizers, finalizerName) {
			if err := r.cleanupExternalResources(ctx, &webapp); err != nil {
				return ctrl.Result{}, err
			}
			webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
			if err := r.Update(ctx, &webapp); err != nil {
				return ctrl.Result{}, err
			}
		}
		return ctrl.Result{}, nil
	}

	if !containsString(webapp.Finalizers, finalizerName) {
		webapp.Finalizers = append(webapp.Finalizers, finalizerName)
		if err := r.Update(ctx, &webapp); err != nil {
			return ctrl.Result{}, err
		}
	}

	// reconcileNormal is not shown here; it stands for the regular create/update path.
	return r.reconcileNormal(ctx, &webapp)
}
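
Why a forgotten finalizer leaves an object in Terminating is easier to see in a small model of the deletion flow. In the sketch below, a delete request only marks the object while finalizers remain, and the object disappears once the last finalizer is released. `cluster`, `resource`, and the method names are assumptions made for illustration.

```go
package main

import "fmt"

// resource models the two metadata fields involved in deletion.
type resource struct {
	Finalizers []string
	Deleting   bool // stands in for a non-nil deletionTimestamp
}

// cluster is a toy API server keyed by name.
type cluster struct{ objects map[string]*resource }

// requestDelete removes the object immediately when it has no finalizers;
// otherwise it only marks the object as deleting.
func (c *cluster) requestDelete(name string) {
	r, ok := c.objects[name]
	if !ok {
		return
	}
	if len(r.Finalizers) == 0 {
		delete(c.objects, name)
		return
	}
	r.Deleting = true
}

// removeFinalizer drops one finalizer and completes a pending deletion
// once none are left.
func (c *cluster) removeFinalizer(name, finalizer string) {
	r, ok := c.objects[name]
	if !ok {
		return
	}
	kept := r.Finalizers[:0]
	for _, f := range r.Finalizers {
		if f != finalizer {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
	if r.Deleting && len(r.Finalizers) == 0 {
		delete(c.objects, name)
	}
}

func main() {
	const fin = "apps.toolsku.dev/finalizer"
	c := &cluster{objects: map[string]*resource{"web": {Finalizers: []string{fin}}}}

	c.requestDelete("web")
	_, exists := c.objects["web"]
	fmt.Println("after delete request, still present:", exists)

	// The controller finishes cleanup and releases its finalizer.
	c.removeFinalizer("web", fin)
	_, exists = c.objects["web"]
	fmt.Println("after finalizer removal, still present:", exists)
}
```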

Pitfall 4: Missing OwnerReference Causing Child Resource Leaks

Wrong approach:

deploy := &appsv1.Deployment{
	ObjectMeta: metav1.ObjectMeta{
		Name:      webapp.Name,
		Namespace: webapp.Namespace,
	},
}
r.Create(ctx, deploy)

Correct approach:

deploy := &appsv1.Deployment{
	ObjectMeta: metav1.ObjectMeta{
		Name:      webapp.Name,
		Namespace: webapp.Namespace,
	},
}
if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
	return ctrl.Result{}, err
}
if err := r.Create(ctx, deploy); err != nil && !errors.IsAlreadyExists(err) {
	return ctrl.Result{}, err
}
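
The leak in the wrong approach follows from how garbage collection decides what to delete: only objects that point at a missing owner are removed. The following simplified model (not the real garbage collector) makes that visible. `object`, `collectGarbage`, and the UIDs are illustrative assumptions.

```go
package main

import "fmt"

// object is a minimal resource with an optional controller owner UID.
type object struct {
	UID      string
	OwnerUID string // empty when the object has no owner reference
}

// collectGarbage deletes every object whose owner no longer exists and
// repeats until nothing more can be removed, so grandchildren are collected
// too. It returns the number of deleted objects.
func collectGarbage(objects map[string]object) int {
	deleted := 0
	for {
		live := make(map[string]bool, len(objects))
		for _, o := range objects {
			live[o.UID] = true
		}
		removed := false
		for name, o := range objects {
			if o.OwnerUID != "" && !live[o.OwnerUID] {
				delete(objects, name)
				deleted++
				removed = true
			}
		}
		if !removed {
			return deleted
		}
	}
}

func main() {
	objects := map[string]object{
		"webapp/web":        {UID: "u1"},
		"deployment/web":    {UID: "u2", OwnerUID: "u1"},
		"replicaset/web-1":  {UID: "u3", OwnerUID: "u2"},
		"deployment/orphan": {UID: "u4"}, // created without an owner reference
	}
	delete(objects, "webapp/web") // the user deletes the WebApp

	fmt.Println("collected:", collectGarbage(objects))
	for name := range objects {
		fmt.Println("left behind:", name)
	}
}
```

The Deployment created without an owner reference survives the deletion of its WebApp, which is exactly the leak this pitfall describes.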

Pitfall 5: Ignoring Conflict Errors in Reconcile

Wrong approach:

if err := r.Update(ctx, &webapp); err != nil {
	return ctrl.Result{}, err
}

Correct approach:

if err := r.Update(ctx, &webapp); err != nil {
	if errors.IsConflict(err) {
		return ctrl.Result{Requeue: true}, nil
	}
	return ctrl.Result{}, err
}

Pitfall 6: Webhook Without Certificate Configuration Causing TLS Handshake Failure

Wrong approach:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
    sideEffects: None
    admissionReviewVersions: ["v1"]

Correct approach:

apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
  annotations:
    cert-manager.io/inject-ca-from: webapp-operator-system/webapp-operator-serving-cert
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
        port: 443
    sideEffects: None
    admissionReviewVersions: ["v1"]
    failurePolicy: Fail
    timeoutSeconds: 10
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: webapp-operator-serving-cert
  namespace: webapp-operator-system
spec:
  dnsNames:
    - webhook-service.webapp-operator-system.svc
    - webhook-service.webapp-operator-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: webapp-operator-selfsigned-issuer
  secretName: webhook-server-cert

Error Troubleshooting Reference Table

# | Error | Cause | Solution
--- | --- | --- | ---
1 | the server could not find the requested resource | CRD not installed or API version mismatch | Run kubectl apply -f config/crd/bases/, check that apiVersion matches the CRD
2 | Operation cannot be fulfilled on webapps.apps.toolsku.dev: the object has been modified | Optimistic lock conflict, object modified during Reconcile | Catch the IsConflict error and return ctrl.Result{Requeue: true} to re-run Reconcile
3 | webhook server: TLS handshake error | Webhook certificate not properly configured or expired | Install cert-manager and configure a Certificate resource for auto-issuance
4 | resource stuck in Terminating state | Finalizer not removed, typically because the Controller has stopped | Restore the Controller so it can finish cleanup; as a last resort, kubectl patch <cr> -p '{"metadata":{"finalizers":[]}}' --type=merge to force removal (skips cleanup)
5 | no matches for kind "WebApp" in version "apps.toolsku.dev/v1alpha1" | CRD not installed, or the requested version has served: false | Check the CRD definition and ensure the target version has served: true
6 | failed to call webhook: Post "https://...": x509: certificate signed by unknown authority | Webhook CA certificate not injected into ValidatingWebhookConfiguration | Configure cert-manager's inject-ca-from annotation, or manually set caBundle
7 | controller: Reconciler error: failed to get deployment: deployments is forbidden | Insufficient RBAC permissions, ServiceAccount lacks the required API permissions | Check Role and RoleBinding under config/rbac/, add deployments permissions for the apps group
8 | leader-election: failed to renew lease | Leader Election Lease update failure, usually a network or permission issue | Check leases permissions for coordination.k8s.io, confirm network connectivity
9 | too many open files | Controller Watch connections exceed the system file descriptor limit | Increase ulimit -n, or reduce the number of watched resource types and filter events with WithEventFilter
10 | webhook: admission webhook denied the request: replicas cannot be 0 | Validating Webhook rejected an invalid request | Check the Webhook validation logic, confirm the request meets business constraints

Three Advanced Optimization Techniques

Technique 1: EventFilter to Reduce Unnecessary Reconciles

By default, every update to a watched WebApp triggers a Reconcile, including the Controller's own Status writes. A predicate can drop the update events that do not require any work:

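import (
	"reflect"

	"sigs.k8s.io/controller-runtime/pkg/builder"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// WebAppPredicate lets a WebApp update through only when something other
// than Status changed.
func WebAppPredicate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldObj, okOld := e.ObjectOld.(*appsv1alpha1.WebApp)
			newObj, okNew := e.ObjectNew.(*appsv1alpha1.WebApp)
			if !okOld || !okNew {
				// Not a WebApp: do not filter rather than panic.
				return true
			}

			// With the status subresource enabled, Generation changes on
			// Spec updates but not on Status-only writes.
			if oldObj.Generation != newObj.Generation {
				return true
			}

			// Compare the timestamp values; != on the pointers would
			// compare addresses.
			if !oldObj.DeletionTimestamp.Equal(newObj.DeletionTimestamp) {
				return true
			}

			if !reflect.DeepEqual(oldObj.Finalizers, newObj.Finalizers) {
				return true
			}

			return false
		},
		CreateFunc: func(e event.CreateEvent) bool {
			return true
		},
		DeleteFunc: func(e event.DeleteEvent) bool {
			return true
		},
		GenericFunc: func(e event.GenericEvent) bool {
			return false
		},
	}
}

func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		// Scope the predicate to WebApp events only. WithEventFilter would
		// also apply it to the owned Deployments and Services, whose Status
		// changes must still trigger a Reconcile.
		For(&appsv1alpha1.WebApp{}, builder.WithPredicates(WebAppPredicate())).
		Owns(&appsv1.Deployment{}).
		Owns(&corev1.Service{}).
		Complete(r)
}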

Technique 2: Reconcile Result and Requeue Strategy

Use RequeueAfter to re-check state after a delay, instead of blocking inside Reconcile or requeueing immediately in a tight loop:

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if webapp.DeletionTimestamp != nil {
		return r.handleDeletion(ctx, &webapp)
	}

	if err := r.reconcileResources(ctx, &webapp); err != nil {
		return ctrl.Result{}, err
	}

	if !r.isWebAppReady(ctx, &webapp) {
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}

	return ctrl.Result{}, nil
}

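func (r *WebAppReconciler) isWebAppReady(ctx context.Context, webapp *appsv1alpha1.WebApp) bool {
	var deployment appsv1.Deployment
	if err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment); err != nil {
		return false
	}
	// Spec.Replicas is a pointer; avoid dereferencing nil. Falling back to 1
	// assumes the Deployment default applies when the field is unset.
	desired := int32(1)
	if webapp.Spec.Replicas != nil {
		desired = *webapp.Spec.Replicas
	}
	return deployment.Status.ReadyReplicas == desired
}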

Technique 3: Custom Metrics for Controller Runtime Indicators

import (
	"github.com/prometheus/client_golang/prometheus"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)

var (
	reconcileTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "webapp_operator_reconcile_total",
			Help: "Total number of WebApp Operator reconciliations",
		},
		[]string{"name", "namespace", "result"},
	)

	reconcileDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "webapp_operator_reconcile_duration_seconds",
			Help:    "WebApp Operator reconciliation duration distribution",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"name", "namespace"},
	)

	managedResources = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "webapp_operator_managed_resources",
			Help: "Current number of managed WebApp resources",
		},
		[]string{"namespace"},
	)
)

func init() {
	metrics.Registry.MustRegister(reconcileTotal, reconcileDuration, managedResources)
}

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	start := time.Now()
	defer func() {
		reconcileDuration.WithLabelValues(req.Name, req.Namespace).Observe(time.Since(start).Seconds())
	}()

	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		if errors.IsNotFound(err) {
			reconcileTotal.WithLabelValues(req.Name, req.Namespace, "not_found").Inc()
			return ctrl.Result{}, nil
		}
		reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
		return ctrl.Result{}, err
	}

	if err := r.reconcileResources(ctx, &webapp); err != nil {
		reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
		return ctrl.Result{}, err
	}

	reconcileTotal.WithLabelValues(req.Name, req.Namespace, "success").Inc()
	return ctrl.Result{}, nil
}

Operator Development Framework Comparison

Dimension | raw client-go | Kubebuilder | Operator SDK (Go) | Helm Operator
--- | --- | --- | --- | ---
Language | Go | Go | Go/Ansible/Helm | Helm
Scaffolding | None | Complete CLI | Complete CLI | SDK built-in
Learning Curve | Steep | Moderate | Moderate | Gentle
Code Generation | None | CRD types + Webhook | CRD types + Webhook | None
Flexibility | Highest | High | Medium | Low
Controller Complexity | Manual | controller-runtime | controller-runtime | Auto-generated
Webhook Support | Manual | Auto-generated | Auto-generated | Not supported
Leader Election | Manual | Built-in | Built-in | Built-in
Metrics | Manual | Built-in | Built-in | Built-in
Multi-version CRD | Manual | Conversion Webhook | Conversion Webhook | Not supported
Community Ecosystem | K8s Official | CNCF | CNCF | CNCF
Use Case | Highly customized | General Operator | Rapid development | Simple apps
Production Ready | Requires extensive wrapping | Out of the box | Out of the box | Out of the box

Summary

A CRD Operator isn't as simple as "write a CRD + run a Controller." A production-grade Operator needs to get 6 things right: the CRD Schema must carry validation rules for every field; the Reconcile Loop must be idempotent and handle Conflicts correctly; the Status Subresource must keep the Spec and Status update paths separate; Finalizers must clean up resources without leaving objects stuck in Terminating; Webhooks need properly configured certificates and failure policies; and production config must include Leader Election, Graceful Shutdown, and Metrics. Of the frameworks compared above, Kubebuilder strikes the best balance between flexibility and development efficiency for a general-purpose Go Operator.
