Kubernetes CRD Operator Development: 6 Production Patterns from CRD Design to Controller Implementation
When You're Writing 2000 Lines of YAML Just to Deploy One App
Have you ever felt this pain? Deploying one microservice means hand-writing a Deployment, Service, ConfigMap, Secret, Ingress, HPA, and PDB (seven resource types), with image versions, replica counts, and environment variables differing in every environment. A new hire needs a week just to understand the deployment process. Worse, at 3 AM during a production incident you discover that the database connection string in a ConfigMap is wrong, but the YAML files are scattered across 5 Git repos and you have no idea which one to change.
This is the "combinatorial explosion" problem of the native K8s APIs. Low-level resources are too granular and carry no business semantics; hand-edited YAML has no built-in validation, so a single indentation error can break an entire deployment; and per-environment configs never converge, so dev/staging/prod differences are tracked only by manual comparison.
CRD + Operator addresses this. It lets you define your own business-level API in K8s and move the deployment logic into a Controller: you write a single custom resource (WebApp in this article), and the Operator creates the child resources, handles version upgrades, and applies config changes. This article walks through 6 production-grade CRD Operator development patterns from scratch.
Core Concepts Reference Table
| Concept | Full Name | Description |
|---|---|---|
| CRD | Custom Resource Definition | Definition of a custom resource type in K8s, declaring API path, fields, and validation rules |
| CR | Custom Resource | An instance of a resource defined by a CRD, a specific custom resource object created by users |
| Operator | Operator Pattern | Pattern for automating K8s application management through CRD + Controller |
| Controller | Controller | Control loop that continuously monitors cluster state and drives actual state toward desired state |
| Reconcile | Reconciliation Loop | Core reconciliation logic of a Controller, comparing desired vs actual state and executing operations |
| Kubebuilder | Kubebuilder | Framework for building Operators on top of controller-runtime, providing project scaffolding and code generation |
| controller-runtime | controller-runtime | Go library for implementing K8s Controllers, providing Watch/Reconcile/Event core mechanisms |
| Finalizer | Finalizer | Interception mechanism before resource deletion, ensuring Controller completes cleanup before allowing deletion |
| Status Subresource | Status Subresource | Separating Spec and Status into two independent update paths to avoid update conflicts |
| Owner Reference | Owner Reference | Ownership relationship between resources, enabling cascading deletion and garbage collection |
| Event | Event | K8s event records for showing Controller operation status and error information to users |
| Webhook | Admission Webhook | API request interception and validation mechanism, including Mutating and Validating types |
Six Challenges: Why CRD Operator Development Isn't "Just Write a CRD"
- CRD Schema Design Traps: Imprecise field types and missing OpenAPI v3 validation let users submit invalid data; without advance planning for version compatibility, a v1-to-v2 migration becomes a nightmare
- Reconcile Loop Idempotency: Reconcile can be triggered repeatedly for the same object. If the logic isn't idempotent, it creates duplicate resources or repeats operations; this is the most common and most damaging bug
- Status Update Conflicts: Multiple Controllers updating the same resource's Status simultaneously, or Spec being modified while the Controller updates Status, cause optimistic lock conflicts and lost updates
- Finalizer Deadlocks: If the Controller crashes or is uninstalled after adding a Finalizer, the resource stays stuck in Terminating until the Finalizer is removed; in the meantime it can be neither deleted nor recreated
- Event Storms: In large-scale clusters, a single CRD change can trigger creation/update of hundreds of child resources, overwhelming the Controller and causing Workqueue backlog
- Production-Grade Gaps: Missing Leader Election makes multi-replica Controllers execute redundantly, missing Graceful Shutdown interrupts in-flight Reconciles, and missing Metrics leaves Controller health unmonitorable
Six-Step Implementation: From CRD Design to Controller Implementation
Step 1: CRD Definition with OpenAPI v3 Schema
CRD Definition (with full validation):
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: webapps.apps.toolsku.dev
spec:
  group: apps.toolsku.dev
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required:
                - image
                - replicas
              properties:
                image:
                  type: string
                  pattern: '^[a-zA-Z0-9][a-zA-Z0-9._-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$'
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                port:
                  type: integer
                  minimum: 1
                  maximum: 65535
                  default: 8080
                resources:
                  type: object
                  properties:
                    requests:
                      type: object
                      properties:
                        cpu:
                          type: string
                          pattern: '^([0-9]+m|[0-9]+(\.[0-9]+)?)$'
                        memory:
                          type: string
                          pattern: '^[0-9]+(Ki|Mi|Gi|Ti)$'
                    limits:
                      type: object
                      properties:
                        cpu:
                          type: string
                        memory:
                          type: string
                env:
                  type: array
                  items:
                    type: object
                    required:
                      - name
                    properties:
                      name:
                        type: string
                        maxLength: 256
                      value:
                        type: string
                      valueFrom:
                        type: object
                        properties:
                          secretKeyRef:
                            type: object
                            required:
                              - name
                              - key
                            properties:
                              name:
                                type: string
                              key:
                                type: string
            status:
              type: object
              properties:
                availableReplicas:
                  type: integer
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                        enum:
                          - "True"
                          - "False"
                          - Unknown
                      lastTransitionTime:
                        type: string
                        format: date-time
                      reason:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
        scale:
          specReplicasPath: .spec.replicas
          statusReplicasPath: .status.availableReplicas
      additionalPrinterColumns:
        - name: Image
          type: string
          jsonPath: .spec.image
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Available
          type: integer
          jsonPath: .status.availableReplicas
        - name: Age
          type: date
          jsonPath: .metadata.creationTimestamp
  scope: Namespaced
  names:
    plural: webapps
    singular: webapp
    kind: WebApp
    shortNames:
      - wa
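To see what the string patterns in this schema accept, it can help to run them outside the cluster. The sketch below copies the three patterns from the CRD above into Go's regexp package and tries a few sample values. `relaxedImagePattern` is a hypothetical variant, not part of the original schema, added only to show that the original image pattern rejects references containing a registry or organization path.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Patterns copied verbatim from the CRD schema above.
	imagePattern  = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
	cpuPattern    = regexp.MustCompile(`^([0-9]+m|[0-9]+(\.[0-9]+)?)$`)
	memoryPattern = regexp.MustCompile(`^[0-9]+(Ki|Mi|Gi|Ti)$`)

	// Hypothetical relaxed variant (an assumption, not in the original):
	// also allows "/" in the repository part of the image reference.
	relaxedImagePattern = regexp.MustCompile(`^[a-zA-Z0-9][a-zA-Z0-9._/-]*:[a-zA-Z0-9][a-zA-Z0-9._-]*$`)
)

func main() {
	for _, img := range []string{"nginx:1.25", "nginx", "ghcr.io/acme/web:v2"} {
		fmt.Printf("%-22s schema=%v relaxed=%v\n",
			img, imagePattern.MatchString(img), relaxedImagePattern.MatchString(img))
	}
	for _, q := range []string{"500m", "0.5", "128Mi", "1G"} {
		fmt.Printf("%-6s cpu=%v memory=%v\n",
			q, cpuPattern.MatchString(q), memoryPattern.MatchString(q))
	}
}
```

Untagged images and images with a path such as `ghcr.io/acme/web:v2` fail the original pattern, so if your registry layout needs them the schema pattern has to be widened accordingly.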
Step 2: Kubebuilder Project Scaffolding
# Initialize Kubebuilder project
mkdir webapp-operator && cd webapp-operator
kubebuilder init --domain toolsku.dev --repo github.com/toolsku/webapp-operator
# Create API (CRD + Controller)
kubebuilder create api --group apps --version v1alpha1 --kind WebApp
# Create Webhook
kubebuilder create webhook --group apps --version v1alpha1 --kind WebApp --defaulting --programmatic-validation
# Project structure
# ├── api/v1alpha1/
# │ ├── webapp_types.go # CRD type definitions
# │ ├── webapp_webhook.go # Webhook logic
# │ ├── webhook_suite_test.go
# │ └── groupversion_info.go
# ├── internal/controller/
# │ ├── webapp_controller.go # Controller reconciliation logic
# │ └── suite_test.go
# ├── cmd/
# │ └── main.go # Entry point
# ├── config/
# │ ├── crd/ # CRD YAML
# │ ├── rbac/ # RBAC config
# │ ├── manager/ # Controller Manager deployment
# │ └── samples/ # Sample CR
# └── Dockerfile
CRD Type Definitions (Go):
package v1alpha1
import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
type WebAppSpec struct {
Image string `json:"image"`
Replicas *int32 `json:"replicas"`
Port *int32 `json:"port,omitempty"`
Resources *WebAppResourceRequirements `json:"resources,omitempty"`
Env []WebAppEnvVar `json:"env,omitempty"`
}
type WebAppResourceRequirements struct {
Requests *ResourceList `json:"requests,omitempty"`
Limits *ResourceList `json:"limits,omitempty"`
}
type ResourceList struct {
CPU string `json:"cpu,omitempty"`
Memory string `json:"memory,omitempty"`
}
type WebAppEnvVar struct {
Name string `json:"name"`
Value string `json:"value,omitempty"`
ValueFrom *EnvVarSource `json:"valueFrom,omitempty"`
}
type EnvVarSource struct {
SecretKeyRef *SecretKeyRef `json:"secretKeyRef,omitempty"`
}
type SecretKeyRef struct {
Name string `json:"name"`
Key string `json:"key"`
}
type WebAppStatus struct {
AvailableReplicas int32 `json:"availableReplicas,omitempty"`
Conditions []metav1.Condition `json:"conditions,omitempty"`
}
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type WebApp struct {
metav1.TypeMeta `json:",inline"`
metav1.ObjectMeta `json:"metadata,omitempty"`
Spec WebAppSpec `json:"spec,omitempty"`
Status WebAppStatus `json:"status,omitempty"`
}
// +kubebuilder:object:root=true
type WebAppList struct {
metav1.TypeMeta `json:",inline"`
metav1.ListMeta `json:"metadata,omitempty"`
Items []WebApp `json:"items"`
}
func init() {
SchemeBuilder.Register(&WebApp{}, &WebAppList{})
}
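The optional fields above are pointers (`*int32`) so the Controller can tell "not set" apart from an explicit value. The following standard-library sketch illustrates that with a reduced copy of the spec; the `spec`, `parse`, and `applyDefaults` names are illustrative assumptions, not part of the generated project.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// spec mirrors a subset of WebAppSpec; it is a stand-in for illustration only.
type spec struct {
	Image    string `json:"image"`
	Replicas *int32 `json:"replicas"`
	Port     *int32 `json:"port,omitempty"`
}

// parse decodes a JSON document into spec; omitted pointer fields stay nil.
func parse(raw string) (spec, error) {
	var s spec
	err := json.Unmarshal([]byte(raw), &s)
	return s, err
}

// applyDefaults fills the port only when the user did not set it.
func applyDefaults(s *spec) {
	if s.Port == nil {
		p := int32(8080)
		s.Port = &p
	}
}

func main() {
	s, err := parse(`{"image":"nginx:1.25","replicas":2}`)
	if err != nil {
		panic(err)
	}
	fmt.Println("port set by user:", s.Port != nil)
	applyDefaults(&s)
	fmt.Println("port after defaulting:", *s.Port)
}
```

With a plain `int32` field, an omitted port and a port of 0 would be indistinguishable after decoding, which is why the pointer form pairs naturally with defaulting.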
Step 3: Controller Reconcile Loop Implementation
package controller
import (
"context"
"fmt"
appsv1 "k8s.io/api/apps/v1"
corev1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime" // runtime.Scheme in WebAppReconciler
"k8s.io/apimachinery/pkg/types"
"k8s.io/apimachinery/pkg/util/intstr" // intstr.FromInt32 in reconcileService
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/log"
appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)
type WebAppReconciler struct {
client.Client
Scheme *runtime.Scheme
}
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
logger := log.FromContext(ctx)
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
if errors.IsNotFound(err) {
logger.Info("WebApp resource not found, ignoring")
return ctrl.Result{}, nil
}
logger.Error(err, "Failed to get WebApp")
return ctrl.Result{}, err
}
if webapp.DeletionTimestamp != nil {
return r.handleDeletion(ctx, &webapp)
}
if err := r.ensureFinalizer(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
deployment, err := r.reconcileDeployment(ctx, &webapp)
if err != nil {
r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
return ctrl.Result{}, err
}
service, err := r.reconcileService(ctx, &webapp)
if err != nil {
r.updateCondition(ctx, &webapp, "Available", metav1.ConditionFalse, "ReconcileError", err.Error())
return ctrl.Result{}, err
}
_ = deployment
_ = service
if err := r.updateStatus(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
r.updateCondition(ctx, &webapp, "Available", metav1.ConditionTrue, "ReconcileSuccess", "WebApp is available")
return ctrl.Result{}, nil
}
func (r *WebAppReconciler) reconcileDeployment(ctx context.Context, webapp *appsv1alpha1.WebApp) (*appsv1.Deployment, error) {
logger := log.FromContext(ctx)
var deployment appsv1.Deployment
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)
if err != nil && errors.IsNotFound(err) {
desired := r.buildDeployment(webapp)
if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
return nil, fmt.Errorf("failed to set OwnerReference: %w", err)
}
logger.Info("Creating Deployment", "name", desired.Name)
if err := r.Create(ctx, desired); err != nil {
return nil, fmt.Errorf("failed to create Deployment: %w", err)
}
return desired, nil
} else if err != nil {
return nil, fmt.Errorf("failed to query Deployment: %w", err)
}
desired := r.buildDeployment(webapp)
if r.deploymentNeedsUpdate(&deployment, desired) {
deployment.Spec = desired.Spec
logger.Info("Updating Deployment", "name", deployment.Name)
if err := r.Update(ctx, &deployment); err != nil {
return nil, fmt.Errorf("failed to update Deployment: %w", err)
}
}
return &deployment, nil
}
func (r *WebAppReconciler) buildDeployment(webapp *appsv1alpha1.WebApp) *appsv1.Deployment {
replicas := int32(1)
if webapp.Spec.Replicas != nil {
replicas = *webapp.Spec.Replicas
}
port := int32(8080)
if webapp.Spec.Port != nil {
port = *webapp.Spec.Port
}
envVars := make([]corev1.EnvVar, 0, len(webapp.Spec.Env))
for _, e := range webapp.Spec.Env {
ev := corev1.EnvVar{Name: e.Name, Value: e.Value}
if e.ValueFrom != nil && e.ValueFrom.SecretKeyRef != nil {
ev.ValueFrom = &corev1.EnvVarSource{
SecretKeyRef: &corev1.SecretKeySelector{
LocalObjectReference: corev1.LocalObjectReference{Name: e.ValueFrom.SecretKeyRef.Name},
Key: e.ValueFrom.SecretKeyRef.Key,
},
}
}
envVars = append(envVars, ev)
}
return &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
Labels: map[string]string{
"app.kubernetes.io/name": "webapp",
"app.kubernetes.io/instance": webapp.Name,
"app.kubernetes.io/managed-by": "webapp-operator",
},
},
Spec: appsv1.DeploymentSpec{
Replicas: &replicas,
Selector: &metav1.LabelSelector{
MatchLabels: map[string]string{
"app.kubernetes.io/instance": webapp.Name,
},
},
Template: corev1.PodTemplateSpec{
ObjectMeta: metav1.ObjectMeta{
Labels: map[string]string{
"app.kubernetes.io/name": "webapp",
"app.kubernetes.io/instance": webapp.Name,
},
},
Spec: corev1.PodSpec{
Containers: []corev1.Container{
{
Name: "webapp",
Image: webapp.Spec.Image,
Ports: []corev1.ContainerPort{{ContainerPort: port}},
Env: envVars,
},
},
},
},
},
}
}
func (r *WebAppReconciler) reconcileService(ctx context.Context, webapp *appsv1alpha1.WebApp) (*corev1.Service, error) {
logger := log.FromContext(ctx)
port := int32(8080)
if webapp.Spec.Port != nil {
port = *webapp.Spec.Port
}
var svc corev1.Service
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &svc)
if err != nil && errors.IsNotFound(err) {
desired := &corev1.Service{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
Labels: map[string]string{
"app.kubernetes.io/instance": webapp.Name,
"app.kubernetes.io/managed-by": "webapp-operator",
},
},
Spec: corev1.ServiceSpec{
Selector: map[string]string{
"app.kubernetes.io/instance": webapp.Name,
},
Ports: []corev1.ServicePort{
{Port: port, TargetPort: intstr.FromInt32(port)},
},
},
}
if err := ctrl.SetControllerReference(webapp, desired, r.Scheme); err != nil {
return nil, err
}
logger.Info("Creating Service", "name", desired.Name)
if err := r.Create(ctx, desired); err != nil {
return nil, err
}
return desired, nil
} else if err != nil {
return nil, err
}
return &svc, nil
}
func (r *WebAppReconciler) deploymentNeedsUpdate(current, desired *appsv1.Deployment) bool {
if *current.Spec.Replicas != *desired.Spec.Replicas {
return true
}
if current.Spec.Template.Spec.Containers[0].Image != desired.Spec.Template.Spec.Containers[0].Image {
return true
}
return false
}
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&appsv1alpha1.WebApp{}).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Complete(r)
}
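Note that `deploymentNeedsUpdate` above compares only the replica count and the image, so a changed port or env list in the WebApp would not reach an existing Deployment. One possible way to widen the check, sketched here with plain structs, is to project the fields the Operator owns and compare them as a whole. The `ownedFields` type and its helpers are hypothetical and stand in for values that would be extracted from the real Deployment objects.

```go
package main

import (
	"fmt"
	"reflect"
)

// ownedFields is a hypothetical projection of the Deployment fields
// that the operator sets from the WebApp spec.
type ownedFields struct {
	Replicas int32
	Image    string
	Port     int32
	Env      map[string]string
}

// normalize treats a nil map and an empty map as the same thing,
// because reflect.DeepEqual would otherwise report them as different.
func normalize(f ownedFields) ownedFields {
	if len(f.Env) == 0 {
		f.Env = nil
	}
	return f
}

// needsUpdate reports whether any operator-owned field has drifted.
func needsUpdate(current, desired ownedFields) bool {
	return !reflect.DeepEqual(normalize(current), normalize(desired))
}

func main() {
	current := ownedFields{Replicas: 2, Image: "nginx:1.25", Port: 8080}
	desired := ownedFields{Replicas: 2, Image: "nginx:1.25", Port: 9090}
	fmt.Println("port change detected:", needsUpdate(current, desired))
	fmt.Println("no change detected:", needsUpdate(current, current))
}
```

Comparing only owned fields, rather than the whole Deployment spec, avoids endless updates caused by values the API server defaults on the live object.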
Step 4: Status Subresource Management and Condition Updates
package controller
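import (
"context"
appsv1 "k8s.io/api/apps/v1"
"k8s.io/apimachinery/pkg/api/meta" // meta.SetStatusCondition in updateCondition
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/types"
ctrl "sigs.k8s.io/controller-runtime" // ctrl.Result in handleDeletion
"sigs.k8s.io/controller-runtime/pkg/log"
appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
)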
const (
finalizerName = "apps.toolsku.dev/finalizer"
)
func (r *WebAppReconciler) updateStatus(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
var deployment appsv1.Deployment
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment)
if err != nil {
return err
}
availableReplicas := int32(0)
if deployment.Status.AvailableReplicas > 0 {
availableReplicas = deployment.Status.AvailableReplicas
}
if webapp.Status.AvailableReplicas == availableReplicas {
return nil
}
webapp.Status.AvailableReplicas = availableReplicas
return r.Status().Update(ctx, webapp)
}
func (r *WebAppReconciler) updateCondition(ctx context.Context, webapp *appsv1alpha1.WebApp, condType string, status metav1.ConditionStatus, reason, message string) {
condition := metav1.Condition{
Type: condType,
Status: status,
Reason: reason,
Message: message,
ObservedGeneration: webapp.Generation,
}
meta.SetStatusCondition(&webapp.Status.Conditions, condition)
if err := r.Status().Update(ctx, webapp); err != nil {
log.FromContext(ctx).Error(err, "Failed to update Condition")
}
}
func (r *WebAppReconciler) ensureFinalizer(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
if !containsString(webapp.Finalizers, finalizerName) {
webapp.Finalizers = append(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, webapp); err != nil {
return err
}
}
return nil
}
func (r *WebAppReconciler) handleDeletion(ctx context.Context, webapp *appsv1alpha1.WebApp) (ctrl.Result, error) {
logger := log.FromContext(ctx)
if !containsString(webapp.Finalizers, finalizerName) {
return ctrl.Result{}, nil
}
logger.Info("Executing cleanup logic", "webapp", webapp.Name)
if err := r.cleanupExternalResources(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
logger.Info("Finalizer cleanup completed", "webapp", webapp.Name)
return ctrl.Result{}, nil
}
func (r *WebAppReconciler) cleanupExternalResources(ctx context.Context, webapp *appsv1alpha1.WebApp) error {
logger := log.FromContext(ctx)
logger.Info("Cleaning up external resources", "webapp", webapp.Name)
return nil
}
func containsString(slice []string, s string) bool {
for _, item := range slice {
if item == s {
return true
}
}
return false
}
func removeString(slice []string, s string) []string {
var result []string
for _, item := range slice {
if item == s {
continue
}
result = append(result, item)
}
return result
}
Step 5: Admission Webhook Validation
package v1alpha1
import (
"fmt"
"k8s.io/apimachinery/pkg/runtime"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/webhook"
"sigs.k8s.io/controller-runtime/pkg/webhook/admission"
)
func (w *WebApp) SetupWebhookWithManager(mgr ctrl.Manager) error {
return ctrl.NewWebhookManagedBy(mgr).
For(w).
Complete()
}
var _ webhook.Defaulter = &WebApp{}
func (w *WebApp) Default() {
if w.Spec.Port == nil {
port := int32(8080)
w.Spec.Port = &port
}
if w.Spec.Replicas == nil {
replicas := int32(1)
w.Spec.Replicas = &replicas
}
}
var _ webhook.Validator = &WebApp{}
func (w *WebApp) ValidateCreate() (admission.Warnings, error) {
if err := validateWebApp(w); err != nil {
return nil, err
}
return admission.Warnings{"WebApp will automatically deploy Deployment and Service upon creation"}, nil
}
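// All three Validate* methods must return (admission.Warnings, error),
// matching ValidateCreate, for WebApp to satisfy webhook.Validator.
func (w *WebApp) ValidateUpdate(old runtime.Object) (admission.Warnings, error) {
oldWA, ok := old.(*WebApp)
if !ok {
return nil, fmt.Errorf("expected a WebApp but got %T", old)
}
// Guard against nil pointers before dereferencing the replica counts.
if oldWA.Spec.Replicas != nil && w.Spec.Replicas != nil &&
*oldWA.Spec.Replicas > 0 && *w.Spec.Replicas == 0 {
return nil, fmt.Errorf("scaling replicas to 0 is not allowed, please delete the WebApp resource directly")
}
return nil, validateWebApp(w)
}
func (w *WebApp) ValidateDelete() (admission.Warnings, error) {
return nil, nil
}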
func validateWebApp(w *WebApp) error {
if w.Spec.Image == "" {
return fmt.Errorf("image cannot be empty")
}
if w.Spec.Replicas != nil && *w.Spec.Replicas > 100 {
return fmt.Errorf("replicas cannot exceed 100, current value: %d", *w.Spec.Replicas)
}
if w.Spec.Port != nil {
if *w.Spec.Port < 1 || *w.Spec.Port > 65535 {
return fmt.Errorf("port must be between 1-65535, current value: %d", *w.Spec.Port)
}
}
for i, env := range w.Spec.Env {
if env.Name == "" {
return fmt.Errorf("env[%d].name cannot be empty", i)
}
if env.Value == "" && env.ValueFrom == nil {
return fmt.Errorf("env[%d] must specify value or valueFrom", i)
}
}
return nil
}
Step 6: Production Configuration—Leader Election, Graceful Shutdown, and Metrics
package main
import (
"flag"
"os"
"k8s.io/apimachinery/pkg/runtime"
utilruntime "k8s.io/apimachinery/pkg/util/runtime"
clientgoscheme "k8s.io/client-go/kubernetes/scheme"
_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/healthz"
"sigs.k8s.io/controller-runtime/pkg/log/zap"
"sigs.k8s.io/controller-runtime/pkg/metrics/server"
"sigs.k8s.io/controller-runtime/pkg/webhook"
appsv1alpha1 "github.com/toolsku/webapp-operator/api/v1alpha1"
"github.com/toolsku/webapp-operator/internal/controller"
)
var (
scheme = runtime.NewScheme()
setupLog = ctrl.Log.WithName("setup")
)
func init() {
utilruntime.Must(clientgoscheme.AddToScheme(scheme))
utilruntime.Must(appsv1alpha1.AddToScheme(scheme))
}
func main() {
var metricsAddr string
var enableLeaderElection bool
var probeAddr string
var webhookPort int
flag.StringVar(&metricsAddr, "metrics-bind-address", ":8080", "Metrics server address")
flag.StringVar(&probeAddr, "health-probe-bind-address", ":8081", "Health probe address")
flag.BoolVar(&enableLeaderElection, "leader-elect", true, "Enable Leader Election")
flag.IntVar(&webhookPort, "webhook-port", 9443, "Webhook port")
opts := zap.Options{Development: true}
opts.BindFlags(flag.CommandLine)
flag.Parse()
ctrl.SetLogger(zap.New(zap.UseFlagOptions(&opts)))
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
Scheme: scheme,
Metrics: server.Options{
BindAddress: metricsAddr,
},
WebhookServer: webhook.NewServer(webhook.Options{
Port: webhookPort,
}),
HealthProbeBindAddress: probeAddr,
LeaderElection: enableLeaderElection,
LeaderElectionID: "webapp-operator.toolsku.dev",
LeaderElectionNamespace: os.Getenv("POD_NAMESPACE"),
})
if err != nil {
setupLog.Error(err, "Failed to create Manager")
os.Exit(1)
}
if err := (&controller.WebAppReconciler{
Client: mgr.GetClient(),
Scheme: mgr.GetScheme(),
}).SetupWithManager(mgr); err != nil {
setupLog.Error(err, "Failed to setup Controller")
os.Exit(1)
}
if err := (&appsv1alpha1.WebApp{}).SetupWebhookWithManager(mgr); err != nil {
setupLog.Error(err, "Failed to setup Webhook")
os.Exit(1)
}
if err := mgr.AddHealthzCheck("healthz", healthz.Ping); err != nil {
setupLog.Error(err, "Failed to setup health check")
os.Exit(1)
}
if err := mgr.AddReadyzCheck("readyz", healthz.Ping); err != nil {
setupLog.Error(err, "Failed to setup readiness check")
os.Exit(1)
}
setupLog.Info("Starting Operator",
"metrics", metricsAddr,
"probe", probeAddr,
"leader-election", enableLeaderElection,
"webhook-port", webhookPort,
)
if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
setupLog.Error(err, "Failed to run Manager")
os.Exit(1)
}
}
Controller Manager Deployment YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-operator-controller-manager
  namespace: webapp-operator-system
spec:
  replicas: 2
  selector:
    matchLabels:
      control-plane: controller-manager
  template:
    metadata:
      labels:
        control-plane: controller-manager
      annotations:
        kubectl.kubernetes.io/default-container: manager
    spec:
      serviceAccountName: controller-manager
      containers:
        - name: manager
          image: toolsku/webapp-operator:v1.0.0
          args:
            - --leader-elect
            - --metrics-bind-address=:8080
            - --health-probe-bind-address=:8081
            - --webhook-port=9443
          ports:
            - containerPort: 8080
              name: metrics
            - containerPort: 8081
              name: health
            - containerPort: 9443
              name: webhook
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8081
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 10
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 128Mi
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          volumeMounts:
            - name: cert
              mountPath: /tmp/k8s-webhook-server/serving-certs
              readOnly: true
      volumes:
        - name: cert
          secret:
            defaultMode: 420
            secretName: webhook-server-cert
      terminationGracePeriodSeconds: 60
Six Pitfall Guide
Pitfall 1: Non-Idempotent Reconcile Causing Duplicate Resource Creation
❌ Wrong approach:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
r.Get(ctx, req.NamespacedName, &webapp)
deploy := r.buildDeployment(&webapp)
r.Create(ctx, deploy)
return ctrl.Result{}, nil
}
✅ Correct approach:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
if errors.IsNotFound(err) {
return ctrl.Result{}, nil
}
return ctrl.Result{}, err
}
var existing appsv1.Deployment
err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &existing)
if err != nil && errors.IsNotFound(err) {
deploy := r.buildDeployment(&webapp)
if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
return ctrl.Result{}, err
}
if err := r.Create(ctx, deploy); err != nil {
return ctrl.Result{}, err
}
} else if err != nil {
return ctrl.Result{}, err
}
return ctrl.Result{}, nil
}
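The property being protected here is that running Reconcile any number of times converges to the same state. A minimal in-memory sketch of that check-then-act shape follows; the `store` type is a hypothetical stand-in for the API server, and the counter exists only to make the effect observable.

```go
package main

import "fmt"

// store is a toy stand-in for the cluster: name -> image.
type store struct {
	objects map[string]string
	creates int
	updates int
}

// reconcile creates the object if it is missing, updates it if it has
// drifted, and otherwise does nothing. Calling it repeatedly is safe.
func (s *store) reconcile(name, image string) {
	current, exists := s.objects[name]
	switch {
	case !exists:
		s.objects[name] = image
		s.creates++
	case current != image:
		s.objects[name] = image
		s.updates++
	}
}

func main() {
	s := &store{objects: map[string]string{}}
	for i := 0; i < 3; i++ {
		s.reconcile("web", "nginx:1.25")
	}
	fmt.Println("objects:", len(s.objects), "creates:", s.creates, "updates:", s.updates)
	s.reconcile("web", "nginx:1.26")
	fmt.Println("after image change, updates:", s.updates)
}
```

Three identical calls result in exactly one create and no updates, which is the behaviour the corrected Reconcile aims for.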
Pitfall 2: Status Update Conflicting with Spec Update
❌ Wrong approach:
webapp.Status.AvailableReplicas = 3
if err := r.Update(ctx, webapp); err != nil {
return ctrl.Result{}, err
}
✅ Correct approach:
webapp.Status.AvailableReplicas = 3
if err := r.Status().Update(ctx, webapp); err != nil {
if errors.IsConflict(err) {
return ctrl.Result{Requeue: true}, nil
}
return ctrl.Result{}, err
}
Pitfall 3: Improper Finalizer Handling Causing Resources Stuck in Terminating
❌ Wrong approach:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
r.Get(ctx, req.NamespacedName, &webapp)
if webapp.DeletionTimestamp != nil {
return ctrl.Result{}, nil
}
webapp.Finalizers = append(webapp.Finalizers, "my-finalizer")
r.Update(ctx, &webapp)
return ctrl.Result{}, nil
}
✅ Correct approach:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
var webapp appsv1alpha1.WebApp
if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
return ctrl.Result{}, client.IgnoreNotFound(err)
}
if webapp.DeletionTimestamp != nil {
if containsString(webapp.Finalizers, finalizerName) {
if err := r.cleanupExternalResources(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
webapp.Finalizers = removeString(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
}
return ctrl.Result{}, nil
}
if !containsString(webapp.Finalizers, finalizerName) {
webapp.Finalizers = append(webapp.Finalizers, finalizerName)
if err := r.Update(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
}
return r.reconcileNormal(ctx, &webapp)
}
Pitfall 4: Missing OwnerReference Causing Child Resource Leaks
❌ Wrong approach:
deploy := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
},
}
r.Create(ctx, deploy)
✅ Correct approach:
deploy := &appsv1.Deployment{
ObjectMeta: metav1.ObjectMeta{
Name: webapp.Name,
Namespace: webapp.Namespace,
},
}
if err := ctrl.SetControllerReference(&webapp, deploy, r.Scheme); err != nil {
return ctrl.Result{}, err
}
if err := r.Create(ctx, deploy); err != nil && !errors.IsAlreadyExists(err) {
return ctrl.Result{}, err
}
Pitfall 5: Ignoring Conflict Errors in Reconcile
❌ Wrong approach:
if err := r.Update(ctx, &webapp); err != nil {
return ctrl.Result{}, err
}
✅ Correct approach:
if err := r.Update(ctx, &webapp); err != nil {
if errors.IsConflict(err) {
return ctrl.Result{Requeue: true}, nil
}
return ctrl.Result{}, err
}
Pitfall 6: Webhook Without Certificate Configuration Causing TLS Handshake Failure
❌ Wrong approach:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
    sideEffects: None
    admissionReviewVersions: ["v1"]
✅ Correct approach:
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: webapp-validator
  annotations:
    cert-manager.io/inject-ca-from: webapp-operator-system/webapp-operator-serving-cert
webhooks:
  - name: webapp.apps.toolsku.dev
    clientConfig:
      service:
        name: webhook-service
        namespace: webapp-operator-system
        path: /validate-apps-toolsku-dev-v1alpha1-webapp
        port: 443
    sideEffects: None
    admissionReviewVersions: ["v1"]
    failurePolicy: Fail
    timeoutSeconds: 10
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: webapp-operator-serving-cert
  namespace: webapp-operator-system
spec:
  dnsNames:
    - webhook-service.webapp-operator-system.svc
    - webhook-service.webapp-operator-system.svc.cluster.local
  issuerRef:
    kind: Issuer
    name: webapp-operator-selfsigned-issuer
  secretName: webhook-server-cert
Error Troubleshooting Reference Table
| # | Error | Cause | Solution |
|---|---|---|---|
| 1 | the server could not find the requested resource | CRD not installed or API version mismatch | Run kubectl apply -f config/crd/bases/, check apiVersion matches CRD |
| 2 | Operation cannot be fulfilled on webapps.apps.toolsku.dev: the object has been modified | Optimistic lock conflict, object modified during Reconcile | Catch IsConflict error, return ctrl.Result{Requeue: true} to re-execute Reconcile |
| 3 | webhook server: TLS handshake error | Webhook certificate not properly configured or expired | Install cert-manager, configure Certificate resource for auto-issuance |
| 4 | resource stuck in Terminating state | Finalizer not removed because the Controller has stopped or cleanup keeps failing | Restore the Controller first; as a last resort, kubectl patch <cr> -p '{"metadata":{"finalizers":[]}}' --type=merge to force removal (this skips cleanup) |
| 5 | no matches for kind "WebApp" in version "apps.toolsku.dev/v1alpha1" | CRD not installed, or the requested version is not listed with served: true | Check the CRD definition, ensure the target version has served: true |
| 6 | failed to call webhook: Post "https://...": x509: certificate signed by unknown authority | Webhook CA certificate not injected into ValidatingWebhookConfiguration | Configure cert-manager's inject-ca-from annotation, or manually set caBundle |
| 7 | controller: Reconciler error: failed to get deployment: deployments is forbidden | Insufficient RBAC permissions, ServiceAccount missing API operation permissions | Check Role and RoleBinding under config/rbac/, add deployments resource permissions for apps group |
| 8 | leader-election: failed to renew lease | Leader Election Lease update failure, usually network or permission issues | Check leases resource permissions for coordination.k8s.io, confirm network connectivity |
| 9 | too many open files | Controller Watch connections exceed system file descriptor limits | Increase ulimit -n, or reduce Watch resource types, use WithEventFilter to filter events |
| 10 | webhook: admission webhook denied the request: replicas cannot be 0 | Validating Webhook rejected invalid request | Check Webhook validation logic, confirm request parameters meet business constraints |
Three Advanced Optimization Techniques
Technique 1: EventFilter to Reduce Unnecessary Reconciles
By default, every update to a watched WebApp enqueues a Reconcile, including the Controller's own Status writes. Use an EventFilter to drop the events that cannot change the outcome:
import (
"reflect"
"sigs.k8s.io/controller-runtime/pkg/event"
"sigs.k8s.io/controller-runtime/pkg/predicate"
)
func WebAppPredicate() predicate.Predicate {
return predicate.Funcs{
UpdateFunc: func(e event.UpdateEvent) bool {
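oldObj, okOld := e.ObjectOld.(*appsv1alpha1.WebApp)
newObj, okNew := e.ObjectNew.(*appsv1alpha1.WebApp)
if !okOld || !okNew {
// WithEventFilter applies to every watch, including the owned
// Deployments and Services; let those events through unfiltered.
return true
}
if oldObj.Generation != newObj.Generation {
return true
}
// Compare the timestamps by value; comparing the pointers would only
// compare addresses.
if !oldObj.DeletionTimestamp.Equal(newObj.DeletionTimestamp) {
return true
}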
if !reflect.DeepEqual(oldObj.Finalizers, newObj.Finalizers) {
return true
}
return false
},
CreateFunc: func(e event.CreateEvent) bool {
return true
},
DeleteFunc: func(e event.DeleteEvent) bool {
return true
},
GenericFunc: func(e event.GenericEvent) bool {
return false
},
}
}
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&appsv1alpha1.WebApp{}).
WithEventFilter(WebAppPredicate()).
Owns(&appsv1.Deployment{}).
Owns(&corev1.Service{}).
Complete(r)
}
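The filter works because the API server increments `metadata.generation` when the spec changes but not when only the status subresource is written. A cluster-free sketch of the same decision follows; `objectMeta` and `shouldReconcile` are hypothetical stand-ins for the real types. For the generation check alone, controller-runtime also ships `predicate.GenerationChangedPredicate`.

```go
package main

import "fmt"

// objectMeta keeps only the metadata the filter looks at.
type objectMeta struct {
	Generation int64
	Deleting   bool
	Finalizers []string
}

func sameFinalizers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// shouldReconcile mirrors the UpdateFunc logic: react to spec changes,
// deletion, and finalizer changes; ignore everything else.
func shouldReconcile(old, cur objectMeta) bool {
	return old.Generation != cur.Generation ||
		old.Deleting != cur.Deleting ||
		!sameFinalizers(old.Finalizers, cur.Finalizers)
}

func main() {
	before := objectMeta{Generation: 3, Finalizers: []string{"apps.toolsku.dev/finalizer"}}
	statusOnly := before // a status write leaves generation untouched
	specChange := before
	specChange.Generation = 4
	fmt.Println("status-only update:", shouldReconcile(before, statusOnly))
	fmt.Println("spec update:", shouldReconcile(before, specChange))
}
```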
Technique 2: Reconcile Result and Requeue Strategy
Return RequeueAfter to schedule a delayed re-check while the WebApp is not yet ready, instead of requeueing immediately in a tight loop:
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		// The object was deleted after the event was queued: nothing to do.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if webapp.DeletionTimestamp != nil {
		return r.handleDeletion(ctx, &webapp)
	}
	// Returning an error requeues the request with rate-limited backoff.
	if err := r.reconcileResources(ctx, &webapp); err != nil {
		return ctrl.Result{}, err
	}
	// Not ready yet: check again in 5 seconds without reporting an error.
	if !r.isWebAppReady(ctx, &webapp) {
		return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
	}
	// Ready: wait for the next watch event.
	return ctrl.Result{}, nil
}
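// isWebAppReady reports whether the Deployment has the desired number of ready replicas.
func (r *WebAppReconciler) isWebAppReady(ctx context.Context, webapp *appsv1alpha1.WebApp) bool {
	var deployment appsv1.Deployment
	if err := r.Get(ctx, types.NamespacedName{Name: webapp.Name, Namespace: webapp.Namespace}, &deployment); err != nil {
		return false // not found or API error: not ready yet
	}
	// Avoid a nil dereference when replicas is omitted; 1 is the Deployment default (assumed here).
	desired := int32(1)
	if webapp.Spec.Replicas != nil {
		desired = *webapp.Spec.Replicas
	}
	return deployment.Status.ReadyReplicas == desired
}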
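A fixed 5-second delay is simple, but a WebApp that stays unready for a long time keeps being reconciled at the same rate. One common alternative is a capped exponential delay. The following is a minimal sketch, assuming the caller tracks how many consecutive not-ready checks have occurred (that counter is an assumption and not part of the WebApp type shown in this article); `requeueDelay`, `baseDelay`, and `maxDelay` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 5 * time.Second // first re-check, matching the fixed delay above
	maxDelay  = 2 * time.Minute // upper bound so checks never stop entirely
)

// requeueDelay returns baseDelay doubled once per previous not-ready
// check, capped at maxDelay. attempt is zero-based; negative values are
// treated as zero.
func requeueDelay(attempt int) time.Duration {
	delay := baseDelay
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay
		}
	}
	return delay
}

func main() {
	for attempt := 0; attempt <= 6; attempt++ {
		fmt.Printf("attempt %d: requeue after %s\n", attempt, requeueDelay(attempt))
	}
}
```

In Reconcile, the result would feed `ctrl.Result{RequeueAfter: ...}` in place of the constant. Because the loop returns as soon as the cap is reached, large attempt counts cannot overflow the duration.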
Technique 3: Custom Metrics for Controller Runtime Indicators
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/metrics"
)
var (
	// reconcileTotal counts reconciliations per object and outcome.
	reconcileTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "webapp_operator_reconcile_total",
			Help: "Total number of WebApp Operator reconciliations",
		},
		[]string{"name", "namespace", "result"},
	)
	// reconcileDuration records how long each reconciliation takes.
	reconcileDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "webapp_operator_reconcile_duration_seconds",
			Help:    "WebApp Operator reconciliation duration distribution",
			Buckets: prometheus.DefBuckets,
		},
		[]string{"name", "namespace"},
	)
	// managedResources is registered here; the Reconcile below does not set it yet.
	managedResources = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "webapp_operator_managed_resources",
			Help: "Current number of managed WebApp resources",
		},
		[]string{"namespace"},
	)
)

// Register with controller-runtime's registry so the metrics are served
// on the manager's metrics endpoint.
func init() {
	metrics.Registry.MustRegister(reconcileTotal, reconcileDuration, managedResources)
}
func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	// Record the duration on every return path.
	start := time.Now()
	defer func() {
		reconcileDuration.WithLabelValues(req.Name, req.Namespace).Observe(time.Since(start).Seconds())
	}()

	var webapp appsv1alpha1.WebApp
	if err := r.Get(ctx, req.NamespacedName, &webapp); err != nil {
		if errors.IsNotFound(err) {
			// The object is gone: count it, but do not requeue.
			reconcileTotal.WithLabelValues(req.Name, req.Namespace, "not_found").Inc()
			return ctrl.Result{}, nil
		}
		reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
		return ctrl.Result{}, err
	}

	if err := r.reconcileResources(ctx, &webapp); err != nil {
		reconcileTotal.WithLabelValues(req.Name, req.Namespace, "error").Inc()
		return ctrl.Result{}, err
	}

	reconcileTotal.WithLabelValues(req.Name, req.Namespace, "success").Inc()
	return ctrl.Result{}, nil
}
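The handler above increments reconcileTotal in four separate places, which makes it easy to miss a branch as the function grows. One option is to derive the result label from the returned error in a single deferred call. The sketch below is dependency-free and only models the idea: `errNotFound` and `errTransient` stand in for API errors, `counts` stands in for the Prometheus counter, and `resultLabel` and `reconcile` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for a Kubernetes NotFound error and any other failure.
var (
	errNotFound  = errors.New("not found")
	errTransient = errors.New("transient failure")
)

// counts stands in for a counter vector keyed by the "result" label.
var counts = map[string]int{}

// resultLabel maps a reconcile error to the value of the "result" label.
func resultLabel(err error) string {
	switch {
	case err == nil:
		return "success"
	case errors.Is(err, errNotFound):
		return "not_found"
	default:
		return "error"
	}
}

// reconcile runs step and records exactly one result, whichever path
// returns. A not-found error is counted and then swallowed, because a
// deleted object should not be retried.
func reconcile(step func() error) (err error) {
	defer func() {
		counts[resultLabel(err)]++
		if errors.Is(err, errNotFound) {
			err = nil
		}
	}()
	return step()
}

func main() {
	_ = reconcile(func() error { return nil })
	_ = reconcile(func() error { return errNotFound })
	_ = reconcile(func() error { return errTransient })
	fmt.Println(counts)
}
```

The named return value is what lets the deferred function both read the final error and clear it; with the real counter, the map increment would become a `WithLabelValues(...).Inc()` call.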
Operator Development Framework Comparison
| Dimension | raw client-go | Kubebuilder | Operator SDK (Go) | Helm Operator |
|---|---|---|---|---|
| Language | Go | Go | Go | Helm |
| Scaffolding | None | Complete CLI | Complete CLI | SDK built-in |
| Learning Curve | Steep | Moderate | Moderate | Gentle |
| Code Generation | None | CRD types + Webhook | CRD types + Webhook | None |
| Flexibility | Highest | High | Medium | Low |
| Controller Implementation | Manual | controller-runtime | controller-runtime | Auto-generated |
| Webhook Support | Manual | Auto-generated | Auto-generated | Not supported |
| Leader Election | Manual | Built-in | Built-in | Built-in |
| Metrics | Manual | Built-in | Built-in | Built-in |
| Multi-version CRD | Manual | Conversion Webhook | Conversion Webhook | Not supported |
| Community Ecosystem | K8s Official | CNCF | CNCF | CNCF |
| Use Case | Highly customized | General Operator | Rapid development | Simple apps |
| Production Ready | Requires extensive wrapping | Out of the box | Out of the box | Out of the box |
Summary
A production-grade CRD Operator takes more than "write a CRD + run a Controller." It needs attention in 6 key dimensions: the CRD Schema must carry validation rules for every field; the Reconcile Loop must be idempotent and handle Conflict errors correctly; the Status Subresource must keep the Spec and Status update paths separate; Finalizers must guarantee cleanup without deadlocking deletion; Webhooks need correctly configured certificates and failure policies; and production configuration must include Leader Election, Graceful Shutdown, and Metrics. For most Go-based Operators, Kubebuilder is a sound default, balancing flexibility against development efficiency.
Recommended Tools
- JSON Formatter - Format CRD definitions and Status output JSON data
- Base64 Encoder - Encode Webhook certificates and Secret data
- Hash Calculator - Calculate content hashes for ConfigMaps and Secrets to detect config changes
- JWT Decoder - Decode ServiceAccount JWT tokens for RBAC permission troubleshooting