Terraform IaC Best Practices: 5 Production Patterns from Module Design to GitOps

DevOps

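In 2026, the question with Terraform IaC is no longer whether you can do it, but whether you do it well

After Terraform switched its license to BSL 1.1 in 2023, the community forked OpenTofu. Whichever of the two you choose, HCL remains the most widely used language in the IaC space. The question is no longer "should we use IaC" but "how do we make IaC production-ready".

Too many teams' Terraform code looks like this: one giant main.tf, state files kept on a laptop, every environment sharing one set of variables, modules without version management, and terraform apply run by hand in CI/CD. That is not IaC; it is "manual operations written as code".

This article covers 5 production-grade IaC patterns, from composable module design to GitOps automation, to help you take Terraform from "it works" to "it works well".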

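Key Takeaways

  • Master composable module design: a module architecture that is reusable, testable, and versionable
  • Understand the 3 layers of protection for remote state: remote backend, state locking, drift detection
  • Apply best practices for Workspace environment isolation and variable management
  • Migrate from Terraform to OpenTofu with minimal friction
  • Integrate a GitOps workflow: Atlantis + CI/CD for automated plan/apply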

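Table of Contents

  • Terraform IaC Core Concepts
  • Pattern 1: Composable Module Design
  • Pattern 2: State Management
  • Pattern 3: Workspace Environment Isolation
  • Pattern 4: Migrating to OpenTofu
  • Pattern 5: GitOps Integration
  • 5 Common Pitfalls and Fixes
  • Troubleshooting 10 Common Errors
  • Advanced Optimization Tips
  • Comparative Analysis
  • Recommended Online Tools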

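Terraform IaC Core Concepts

IaC Maturity Model

┌──────────────────────────────────────────────────────────────────────────────┐
│                              IaC Maturity Model                              │
├──────────────────────┬─────────────────────┬─────────────────────────────────┤
│ Level 1              │ Level 2             │ Level 3                         │
│ Scripted             │ Modular             │ Platform                        │
├──────────────────────┼─────────────────────┼─────────────────────────────────┤
│ Single file          │ Split into modules  │ Module composition + registry   │
│ Local state          │ Remote state        │ Layered, isolated state         │
│ Manual runs          │ CI/CD-triggered     │ GitOps automation               │
│ No tests             │ Basic tests         │ Policy as Code                  │
│ No versioning        │ Git versioning      │ Semantic versioning + changelog │
│ Coupled environments │ Workspace isolation │ Multi-environment abstraction   │
└──────────────────────┴─────────────────────┴─────────────────────────────────┘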

Terraform Core Workflow

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Write     │────▶│     Plan     │────▶│    Apply     │────▶│    State     │
│ (author HCL) │     │  (preview)   │     │  (execute)   │     │   (update)   │
└──────────────┘     └──────────────┘     └──────────────┘     └──────────────┘
       │                    │                    │                    │
       ▼                    ▼                    ▼                    ▼
   Git Commit        terraform plan       terraform apply      Remote Backend
   Pull Request      Plan file output     Create/update        S3/GCS/Cloud

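Key Changes in the Terraform Ecosystem in 2026

┌───────────────────┬─────────────────────────────────────┬──────────────────────────────────┐
│ Change            │ Impact                              │ Response                         │
├───────────────────┼─────────────────────────────────────┼──────────────────────────────────┤
│ BSL 1.1 license   │ Limits on competing commercial use  │ Evaluate an OpenTofu migration   │
│ OpenTofu 1.9+     │ Community-driven alternative        │ Prefer it for new projects       │
│ Terraform 1.10+   │ Native tests (terraform test, 1.6+) │ Adopt terraform test             │
│ Crossplane's rise │ Kubernetes-native IaC               │ Complementary, not a replacement │
│ Pulumi's maturity │ IaC in general-purpose languages    │ Choose by team skills            │
└───────────────────┴─────────────────────────────────────┴──────────────────────────────────┘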

Pattern 1: Composable Module Design

Modules are the foundation of Terraform IaC. Yet most teams stop at "splitting files" and never reach "composable". Production-grade module design calls for a three-tier architecture: base modules, composition modules, and environment modules.

Three-Tier Module Architecture

┌──────────────────────────────────────────────────────┐
│                  Environment Module                  │
│  ┌────────────────────────────────────────────────┐  │
│  │               Composition Module               │  │
│  │   ┌──────────┐   ┌──────────┐   ┌──────────┐   │  │
│  │   │   Base   │   │   Base   │   │   Base   │   │  │
│  │   │  Module  │   │  Module  │   │  Module  │   │  │
│  │   │  (VPC)   │   │  (RDS)   │   │  (ECS)   │   │  │
│  │   └──────────┘   └──────────┘   └──────────┘   │  │
│  └────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────┘
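
As a rough sketch of how the tiers nest, an environment module can be little more than one call to a composition module with environment-specific values. The example assumes the modules/app-stack composition module that appears later in this pattern and mirrors its input names; the project name and container port are hypothetical values chosen for illustration.

```hcl
# environments/prod/main.tf (sketch, not the article's actual layout)

variable "db_password" {
  description = "Database password, supplied outside version control"
  type        = string
  sensitive   = true
}

module "app_stack" {
  source = "../../modules/app-stack"

  environment  = "prod"
  project_name = "myapp" # hypothetical

  # Networking inputs, passed down to the VPC base module
  vpc_cidr             = "10.100.0.0/16"
  public_subnet_cidrs  = ["10.100.1.0/24", "10.100.2.0/24", "10.100.3.0/24"]
  private_subnet_cidrs = ["10.100.10.0/24", "10.100.11.0/24", "10.100.12.0/24"]

  # Database inputs, passed down to the RDS base module
  db_engine            = "postgres"
  db_engine_version    = "16.4"
  db_instance_class    = "db.r6g.xlarge"
  db_allocated_storage = 500
  database_name        = "appdb_prod"
  db_username          = "appadmin"
  db_password          = var.db_password

  # Container inputs, passed down to the ECS base module
  container_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:v2.1.0"
  container_port  = 8080 # hypothetical
  desired_count   = 3
  cpu             = 1024
  memory          = 2048

  extra_environment_variables = {}
  tags                        = {}
}
```

With this shape, the only things that differ between dev, staging, and prod are the values in this one block, while wiring between VPC, RDS, and ECS stays inside the composition module.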

Base Module: VPC

modules/
└── vpc/
    ├── main.tf
    ├── variables.tf
    ├── outputs.tf
    ├── versions.tf
    └── README.md
# modules/vpc/versions.tf
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0, < 6.0"
    }
  }
}
# modules/vpc/variables.tf
variable "cidr_block" {
  description = "VPC CIDR block"
  type        = string
  default     = "10.0.0.0/16"
}

variable "environment" {
  description = "Environment name"
  type        = string
}

variable "public_subnets" {
  description = "List of public subnet CIDR blocks"
  type        = list(string)
  default     = ["10.0.1.0/24", "10.0.2.0/24"]
}

variable "private_subnets" {
  description = "List of private subnet CIDR blocks"
  type        = list(string)
  default     = ["10.0.10.0/24", "10.0.11.0/24"]
}

variable "enable_nat_gateway" {
  description = "Enable NAT Gateway for private subnets"
  type        = bool
  default     = true
}

variable "single_nat_gateway" {
  description = "Use single NAT Gateway to reduce cost"
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags for all resources"
  type        = map(string)
  default     = {}
}
# modules/vpc/main.tf
resource "aws_vpc" "this" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = true
  enable_dns_support   = true

  tags = merge(
    {
      Name        = "${var.environment}-vpc"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

resource "aws_internet_gateway" "this" {
  vpc_id = aws_vpc.this.id

  tags = merge(
    {
      Name        = "${var.environment}-igw"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

resource "aws_subnet" "public" {
  count                   = length(var.public_subnets)
  vpc_id                  = aws_vpc.this.id
  cidr_block              = var.public_subnets[count.index]
  availability_zone       = data.aws_availability_zones.available.names[count.index]
  map_public_ip_on_launch = true

  tags = merge(
    {
      Name        = "${var.environment}-public-${count.index + 1}"
      Environment = var.environment
      Tier        = "public"
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnets)
  vpc_id            = aws_vpc.this.id
  cidr_block        = var.private_subnets[count.index]
  availability_zone = data.aws_availability_zones.available.names[count.index]

  tags = merge(
    {
      Name        = "${var.environment}-private-${count.index + 1}"
      Environment = var.environment
      Tier        = "private"
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

resource "aws_eip" "nat" {
  count  = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.private_subnets)) : 0
  domain = "vpc"

  tags = merge(
    {
      Name        = "${var.environment}-nat-eip-${count.index + 1}"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

resource "aws_nat_gateway" "this" {
  count         = var.enable_nat_gateway ? (var.single_nat_gateway ? 1 : length(var.private_subnets)) : 0
  allocation_id = aws_eip.nat[count.index].id
  subnet_id     = aws_subnet.public[count.index % length(aws_subnet.public)].id

  tags = merge(
    {
      Name        = "${var.environment}-nat-${count.index + 1}"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags
  )

  depends_on = [aws_internet_gateway.this]
}

resource "aws_route_table" "public" {
  vpc_id = aws_vpc.this.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.this.id
  }

  tags = merge(
    {
      Name        = "${var.environment}-public-rt"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

resource "aws_route_table" "private" {
  count  = length(var.private_subnets)
  vpc_id = aws_vpc.this.id

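  # Add the default route only when NAT gateways exist; indexing
  # aws_nat_gateway.this while enable_nat_gateway = false would fail.
  dynamic "route" {
    for_each = var.enable_nat_gateway ? [1] : []
    content {
      cidr_block     = "0.0.0.0/0"
      nat_gateway_id = var.single_nat_gateway ? aws_nat_gateway.this[0].id : aws_nat_gateway.this[count.index].id
    }
  }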

  tags = merge(
    {
      Name        = "${var.environment}-private-rt-${count.index + 1}"
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

resource "aws_route_table_association" "public" {
  count          = length(var.public_subnets)
  subnet_id      = aws_subnet.public[count.index].id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "private" {
  count          = length(var.private_subnets)
  subnet_id      = aws_subnet.private[count.index].id
  route_table_id = aws_route_table.private[count.index].id
}

data "aws_availability_zones" "available" {
  state = "available"
}
# modules/vpc/outputs.tf
output "vpc_id" {
  description = "VPC ID"
  value       = aws_vpc.this.id
}

output "vpc_cidr" {
  description = "VPC CIDR block"
  value       = aws_vpc.this.cidr_block
}

output "public_subnet_ids" {
  description = "List of public subnet IDs"
  value       = aws_subnet.public[*].id
}

output "private_subnet_ids" {
  description = "List of private subnet IDs"
  value       = aws_subnet.private[*].id
}

output "nat_gateway_ids" {
  description = "List of NAT Gateway IDs"
  value       = aws_nat_gateway.this[*].id
}

output "igw_id" {
  description = "Internet Gateway ID"
  value       = aws_internet_gateway.this.id
}

Composition Module: Full Application Infrastructure

# modules/app-stack/main.tf
module "vpc" {
  source = "../vpc"

  cidr_block        = var.vpc_cidr
  environment       = var.environment
  public_subnets    = var.public_subnet_cidrs
  private_subnets   = var.private_subnet_cidrs
  enable_nat_gateway = true
  single_nat_gateway = var.environment != "prod"
  tags              = local.common_tags
}

module "rds" {
  source = "../rds"

  environment      = var.environment
  vpc_id           = module.vpc.vpc_id
  subnet_ids       = module.vpc.private_subnet_ids
  engine           = var.db_engine
  engine_version   = var.db_engine_version
  instance_class   = var.db_instance_class
  allocated_storage = var.db_allocated_storage
  database_name    = var.database_name
  username         = var.db_username
  password         = var.db_password
  tags             = local.common_tags
}

module "ecs" {
  source = "../ecs"

  environment    = var.environment
  vpc_id         = module.vpc.vpc_id
  subnet_ids     = module.vpc.private_subnet_ids
  cluster_name   = "${var.environment}-cluster"
  container_image = var.container_image
  container_port = var.container_port
  desired_count  = var.desired_count
  cpu            = var.cpu
  memory         = var.memory
  environment_variables = merge(
    {
      DATABASE_URL = "postgresql://${var.db_username}:${var.db_password}@${module.rds.endpoint}/${var.database_name}"
      ENVIRONMENT  = var.environment
    },
    var.extra_environment_variables
  )
  tags = local.common_tags
}

locals {
  common_tags = merge(
    {
      Environment = var.environment
      Project     = var.project_name
      ManagedBy   = "terraform"
    },
    var.tags
  )
}

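Module Versioning

# The blocks below are alternative source styles, not one file
# (module names must be unique within a configuration).
# Use a module from the Terraform Registry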
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"
}

# Use a module from a Git repository (pinned to a tag)
module "app_stack" {
  source = "git::https://github.com/myorg/terraform-modules.git//modules/app-stack?ref=v2.1.0"
}

# Use a local module (during development)
module "vpc" {
  source = "../../modules/vpc"
}

# Use a module archive stored in S3
module "app_stack" {
  source = "s3::https://my-terraform-modules.s3.amazonaws.com/app-stack/v2.1.0.zip"
}

Module Registry (Private Registry)

# Publish a module to a Terraform private registry
# 1. Create a Git tag
git tag v2.1.0
git push origin v2.1.0

# 2. Configure the module source in Terraform Cloud
# Settings > Modules > Add module
# Source: myorg/terraform-modules

# 3. Use the module from the private registry
module "vpc" {
  source  = "app.myorg.local/myorg/vpc/aws"
  version = "~> 2.0"
}

Module Testing

# modules/vpc/tests/main.tftest.hcl
run "validate_vpc_cidr" {
  command = plan

  variables {
    cidr_block  = "10.0.0.0/16"
    environment = "test"
  }

  assert {
    condition     = aws_vpc.this.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block should match input"
  }
}

run "validate_subnets" {
  command = plan

  variables {
    cidr_block      = "10.0.0.0/16"
    environment     = "test"
    public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
    private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]
  }

  assert {
    condition     = length(aws_subnet.public) == 2
    error_message = "Should create 2 public subnets"
  }

  assert {
    condition     = length(aws_subnet.private) == 2
    error_message = "Should create 2 private subnets"
  }
}

run "validate_nat_gateway_production" {
  command = plan

  variables {
    cidr_block        = "10.0.0.0/16"
    environment       = "prod"
    enable_nat_gateway = true
    single_nat_gateway = false
    private_subnets   = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
  }

  assert {
    condition     = length(aws_nat_gateway.this) == 3
    error_message = "Production should have one NAT Gateway per AZ"
  }
}
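
The runs above use the real AWS provider, so even a plan needs credentials to read the availability zones. A possible alternative, sketched here under the assumption of Terraform 1.7 or later (where provider mocking was added to the test framework), is a separate test file that mocks the provider. The file name and the AZ names are illustrative, and it is worth confirming in your own setup that every attribute you assert on is known under a mock.

```hcl
# modules/vpc/tests/mocked.tftest.hcl (hypothetical file name)

# Replace the real AWS provider with a mock so no credentials are needed.
mock_provider "aws" {
  # Mocked data sources return empty collections by default, so give the
  # AZ lookup enough names for the subnets to index into.
  mock_data "aws_availability_zones" {
    defaults = {
      names = ["us-east-1a", "us-east-1b", "us-east-1c"]
    }
  }
}

run "subnet_count_with_mocked_provider" {
  command = plan

  variables {
    environment     = "test"
    public_subnets  = ["10.0.1.0/24", "10.0.2.0/24"]
    private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]
  }

  assert {
    condition     = length(aws_subnet.public) == 2
    error_message = "Should create 2 public subnets"
  }

  assert {
    condition     = length(aws_subnet.private) == 2
    error_message = "Should create 2 private subnets"
  }
}
```

Mocked runs are fast and safe for pull requests; they complement, rather than replace, an occasional run against a real account.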
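# Run the tests for one module
cd modules/vpc
terraform init -backend=false
terraform test

# Run the tests for every module: terraform test has no recursive flag,
# so loop over the module directories from the repository root
for dir in modules/*/; do
  (cd "$dir" && terraform init -backend=false && terraform test) || exit 1
done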

Pattern 2: State Management

The Terraform state file is the most critical piece of data in IaC. Losing state means losing control of your infrastructure. Production environments must use a remote backend, enable state locking, and check for drift regularly.

Remote Backend Architecture

┌──────────────────────────────────────────────────────────┐
│             Three Layers of State Protection             │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 1: Remote backend                         │    │
│  │  S3 + DynamoDB / GCS / Azure Blob                │    │
│  │  (durable state, shared by the team)             │    │
│  └──────────────────────────────────────────────────┘    │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 2: State locking                          │    │
│  │  DynamoDB / GCS native lock / Azure Blob lease   │    │
│  │  (prevents concurrent writes, serializes apply)  │    │
│  └──────────────────────────────────────────────────┘    │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 3: Drift detection                        │    │
│  │  terraform plan -refresh-only + scheduled CI/CD  │    │
│  │  (finds manual changes, keeps state consistent)  │    │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘

S3 + DynamoDB Backend Configuration

# backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "app-infra/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/abc123"

    # The lock wait time is not a backend argument; pass it on the CLI,
    # e.g. terraform apply -lock-timeout=30m
  }
}
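
The CI examples later in this article call terraform init -backend-config=backend.hcl, which implies a partial backend configuration. A minimal sketch of that split follows, reusing the bucket and table names from the block above; treat the file layout as an assumption rather than the article's exact setup. The two files are shown in one listing for brevity.

```hcl
# --- backend.tf ---
# Partial configuration: only the backend type is fixed in code.
terraform {
  backend "s3" {}
}

# --- backend.hcl (separate file) ---
# Supplied at init time: terraform init -backend-config=backend.hcl
bucket         = "myorg-terraform-state"
key            = "app-infra/terraform.tfstate"
region         = "us-east-1"
encrypt        = true
dynamodb_table = "terraform-locks"
```

Keeping the settings in backend.hcl lets each environment or pipeline point the same code at a different state key without editing backend.tf.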

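Bootstrapping the Backend Infrastructure

# bootstrap/main.tf
# This configuration uses local state to create the remote backend resources.
# Once they exist, migrate its state to the remote backend.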

resource "aws_s3_bucket" "terraform_state" {
  bucket = "myorg-terraform-state"

  lifecycle {
    prevent_destroy = true
  }
}

resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.terraform_state.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_lifecycle_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    id     = "cleanup-old-versions"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket
    filter {}

    noncurrent_version_transition {
      noncurrent_days = 90
      storage_class   = "GLACIER"
    }

    noncurrent_version_expiration {
      noncurrent_days = 365
    }
  }
}

resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}

resource "aws_kms_key" "terraform_state" {
  description             = "Terraform state encryption key"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_kms_alias" "terraform_state" {
  name          = "alias/terraform-state"
  target_key_id = aws_kms_key.terraform_state.key_id
}
# Bootstrap flow
cd bootstrap
terraform init
terraform apply

# Migrate to the remote backend
# Run this after adding backend.tf
terraform init -migrate-state

# Verify that the state now lives in S3
aws s3 ls s3://myorg-terraform-state/app-infra/

GCS Backend Configuration

terraform {
  backend "gcs" {
    bucket = "myorg-terraform-state"
    prefix = "app-infra"
  }
}

State Layering

┌──────────────────────────────────────────────────────────┐
│               State Layering Architecture                │
│                                                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 0: bootstrap (local state)                │    │
│  │  S3 Bucket / DynamoDB / KMS / IAM                │    │
│  └──────────────────────────────────────────────────┘    │
│                     │ data flow                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 1: networking (remote state)              │    │
│  │  VPC / Subnets / Route Tables / NAT GW           │    │
│  └──────────────────────────────────────────────────┘    │
│                     │ data flow                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 2: compute (remote state)                 │    │
│  │  ECS / EKS / RDS / ElastiCache                   │    │
│  └──────────────────────────────────────────────────┘    │
│                     │ data flow                          │
│  ┌──────────────────────────────────────────────────┐    │
│  │  Layer 3: services (remote state)                │    │
│  │  DNS / CDN / Monitoring / Alerts                 │    │
│  └──────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────┘
# Layer 2 reads the outputs of Layer 1
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "myorg-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

module "ecs" {
  source = "../../modules/ecs"

  vpc_id     = data.terraform_remote_state.networking.outputs.vpc_id
  subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
}
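
For the lookup above to work, the networking layer has to publish those values as root-level outputs, because terraform_remote_state only exposes root module outputs. A minimal sketch, assuming the networking root module calls the VPC base module under the name "vpc":

```hcl
# networking/outputs.tf (sketch; assumes a module "vpc" block in this layer)

output "vpc_id" {
  description = "VPC ID consumed by the compute layer"
  value       = module.vpc.vpc_id
}

output "private_subnet_ids" {
  description = "Private subnet IDs consumed by the compute layer"
  value       = module.vpc.private_subnet_ids
}
```

Treat these outputs as a contract between layers: renaming or removing one breaks every downstream layer that reads it.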

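Drift Detection

# Manual drift detection
terraform plan -detailed-exitcode
# 0 = no changes
# 1 = error
# 2 = changes present (drift exists)

# Reconcile state with the real infrastructure
# (terraform refresh is deprecated in favor of -refresh-only)
terraform apply -refresh-only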
# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: "0 8 * * 1-5"
  workflow_dispatch:

jobs:
  drift-detection:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
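          terraform_version: "1.10.0"
          # Disable the wrapper so exit codes and JSON output reach the shell unmodified
          terraform_wrapper: false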

      - name: Terraform Init
        run: terraform init -backend-config=backend.hcl

      - name: Check Drift
        run: |
          exit_code=0
          terraform plan -detailed-exitcode -out=plan.out || exit_code=$?
          if [ "$exit_code" -eq 1 ]; then
            # Exit code 1 means the plan itself failed, not that drift exists
            echo "::error::terraform plan failed"
            exit 1
          elif [ "$exit_code" -eq 2 ]; then
            echo "::warning::Infrastructure drift detected!"
            terraform show -json plan.out | jq -r '.resource_changes[]? | select(.change.actions != ["no-op"]) | "\(.type).\(.name): \(.change.actions | join(", "))"'
            exit 1
          fi

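State Commands

# List the resources in the current state
terraform state list

# Show the state of one resource
terraform state show aws_vpc.this

# Move a resource (when refactoring)
terraform state mv aws_vpc.this aws_vpc.main

# Remove a resource from state (Terraform stops managing it)
terraform state rm aws_vpc.this

# Import an existing resource
terraform import aws_vpc.this vpc-abc123

# Force-unlock (when a state lock is stuck)
terraform force-unlock <lock-id>

# Pull the remote state to a local file
terraform state pull > state.json

# Push a local state file to the remote backend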
terraform state push state.json
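
Several of these imperative commands have declarative counterparts that live in the configuration and therefore go through plan and code review. A minimal sketch, assuming Terraform 1.7 or later (moved blocks arrived in 1.1, import blocks in 1.5, removed blocks in 1.7); the bucket and table names are hypothetical.

```hcl
# Rename: equivalent to "terraform state mv aws_vpc.this aws_vpc.main".
moved {
  from = aws_vpc.this
  to   = aws_vpc.main
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

# Adopt an existing resource: equivalent to "terraform import".
import {
  to = aws_s3_bucket.legacy_logs
  id = "myorg-legacy-logs" # hypothetical bucket name
}

resource "aws_s3_bucket" "legacy_logs" {
  bucket = "myorg-legacy-logs"
}

# Stop managing without destroying: equivalent to "terraform state rm".
# The matching resource block must already be deleted from the configuration.
removed {
  from = aws_dynamodb_table.old_locks # hypothetical resource

  lifecycle {
    destroy = false
  }
}
```

Because these blocks show up in terraform plan, reviewers can see a rename or import before it touches state, which the CLI commands do not offer.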

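Pattern 3: Workspace Environment Isolation

Managing multiple environments is the most basic requirement in IaC: dev, staging, and prod share modules but differ in configuration. Terraform workspaces offer lightweight environment isolation, but they only work well when paired with disciplined variable management.

Workspaces vs. Directory Isolation

┌─────────────────────────────────────────────────────────────────────────────────┐
│                     Two Approaches to Environment Isolation                     │
├───────────────────────────────────────┬─────────────────────────────────────────┤
│ Workspace isolation                   │ Directory isolation                     │
├───────────────────────────────────────┼─────────────────────────────────────────┤
│ One state per workspace, same backend │ Separate state file per environment     │
│ One shared copy of the code           │ Separate code directory per environment │
│ Switch with workspace select          │ Switch by changing directory            │
│ Suits simple setups                   │ Suits complex setups                    │
│ State path (S3 backend):              │ State path:                             │
│ env:/dev/<key>                        │ dev/terraform.tfstate                   │
│ env:/prod/<key>                       │ prod/terraform.tfstate                  │
└───────────────────────────────────────┴─────────────────────────────────────────┘

Recommended: Directory Isolation + Shared Modules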

infra/
├── modules/
│   ├── vpc/
│   ├── rds/
│   └── ecs/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── backend.tf
│   │   └── terraform.tfvars
│   ├── staging/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── backend.tf
│   │   └── terraform.tfvars
│   └── prod/
│       ├── main.tf
│       ├── variables.tf
│       ├── outputs.tf
│       ├── backend.tf
│       └── terraform.tfvars
└── shared/
    └── locals.tf

Environment Configuration

# environments/dev/backend.tf
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "dev/app-infra/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
# environments/dev/main.tf
module "vpc" {
  source = "../../modules/vpc"

  cidr_block         = "10.0.0.0/16"
  environment        = "dev"
  public_subnets     = ["10.0.1.0/24"]
  private_subnets    = ["10.0.10.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = true
}

module "rds" {
  source = "../../modules/rds"

  environment       = "dev"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  engine            = "postgres"
  engine_version    = "16.4"
  instance_class    = "db.t3.micro"
  allocated_storage = 20
  database_name     = "appdb_dev"
  username          = "appadmin"
  password          = var.db_password
}

module "ecs" {
  source = "../../modules/ecs"

  environment     = "dev"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
  container_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:dev-latest"
  desired_count   = 1
  cpu             = 256
  memory          = 512
}
# environments/prod/main.tf
module "vpc" {
  source = "../../modules/vpc"

  cidr_block         = "10.100.0.0/16"
  environment        = "prod"
  public_subnets     = ["10.100.1.0/24", "10.100.2.0/24", "10.100.3.0/24"]
  private_subnets    = ["10.100.10.0/24", "10.100.11.0/24", "10.100.12.0/24"]
  enable_nat_gateway = true
  single_nat_gateway = false
}

module "rds" {
  source = "../../modules/rds"

  environment       = "prod"
  vpc_id            = module.vpc.vpc_id
  subnet_ids        = module.vpc.private_subnet_ids
  engine            = "postgres"
  engine_version    = "16.4"
  instance_class    = "db.r6g.xlarge"
  allocated_storage = 500
  database_name     = "appdb_prod"
  username          = "appadmin"
  password          = var.db_password
  multi_az          = true
}

module "ecs" {
  source = "../../modules/ecs"

  environment     = "prod"
  vpc_id          = module.vpc.vpc_id
  subnet_ids      = module.vpc.private_subnet_ids
  container_image = "123456789012.dkr.ecr.us-east-1.amazonaws.com/app:v2.1.0"
  desired_count   = 3
  cpu             = 1024
  memory          = 2048
}

Variable Management

# environments/dev/terraform.tfvars
environment     = "dev"
aws_region      = "us-east-1"
db_password     = "dev-password-change-me"
instance_type   = "t3.micro"
desired_count   = 1
# environments/prod/terraform.tfvars
environment     = "prod"
aws_region      = "us-east-1"
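# db_password is deliberately not set here: values in terraform.tfvars take
# precedence over TF_VAR_db_password, so an empty string would override it.
# Supply it through the environment or Vault instead.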
instance_type   = "r6g.xlarge"
desired_count   = 3
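
One way to keep the production password out of tfvars files entirely is to read it from a secrets store at plan time. The sketch below assumes AWS Secrets Manager and a hypothetical secret name; Vault or SSM Parameter Store would follow the same shape with their own data sources.

```hcl
# environments/prod/secrets.tf (sketch; the secret name is hypothetical)

# Read the current version of the secret at plan/apply time.
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/appdb/master-password"
}

locals {
  # Pass this to the RDS module instead of var.db_password,
  # e.g. password = local.db_password
  db_password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```

The value still ends up in the state file, so this approach relies on the encrypted, access-controlled remote backend described in Pattern 2.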

When to Use Workspaces

# Workspaces suit simple cases (such as multiple tenants in one environment)
terraform workspace new tenant-a
terraform workspace new tenant-b

terraform workspace select tenant-a
terraform apply -var="tenant_id=tenant-a"

terraform workspace select tenant-b
terraform apply -var="tenant_id=tenant-b"
# Branch on terraform.workspace
resource "aws_instance" "app" {
  ami           = var.ami
  instance_type = terraform.workspace == "prod" ? "r6g.xlarge" : "t3.micro"

  tags = {
    Environment = terraform.workspace
  }
}
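
Inline ternaries on terraform.workspace become hard to read once more than one or two settings vary. A common alternative, sketched here with illustrative values (the staging sizes are assumptions), is a single lookup map with a fallback entry:

```hcl
locals {
  # Per-workspace settings; the "default" entry doubles as the fallback
  # for any workspace that is not listed explicitly.
  workspace_settings = {
    default = { instance_type = "t3.micro", desired_count = 1 }
    staging = { instance_type = "t3.small", desired_count = 2 }
    prod    = { instance_type = "r6g.xlarge", desired_count = 3 }
  }

  settings = lookup(
    local.workspace_settings,
    terraform.workspace,
    local.workspace_settings["default"]
  )
}

output "instance_type" {
  description = "Instance type selected for the current workspace"
  value       = local.settings.instance_type
}
```

Resources then reference local.settings.instance_type and local.settings.desired_count, and adding a workspace means adding one map entry rather than touching every conditional.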

Variable Validation

# variables.tf
variable "environment" {
  description = "Environment name"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be one of: dev, staging, prod."
  }
}

variable "db_password" {
  description = "Database password"
  type        = string
  sensitive   = true

  validation {
    condition     = length(var.db_password) >= 16
    error_message = "Database password must be at least 16 characters."
  }
}

variable "allowed_cidrs" {
  description = "List of CIDR blocks allowed to access the application"
  type        = list(string)

  validation {
    condition     = length(var.allowed_cidrs) > 0
    error_message = "At least one CIDR block must be specified."
  }
}
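
The length check above only guarantees that the list is non-empty. As a sketch of a stricter variant, a second validation block can verify that each entry parses as a CIDR block, using the built-in can() and cidrhost() functions. This version would replace the allowed_cidrs declaration above rather than sit beside it.

```hcl
variable "allowed_cidrs" {
  description = "List of CIDR blocks allowed to access the application"
  type        = list(string)

  validation {
    condition     = length(var.allowed_cidrs) > 0
    error_message = "At least one CIDR block must be specified."
  }

  validation {
    # cidrhost() raises an error on malformed input; can() turns that into false.
    condition     = alltrue([for c in var.allowed_cidrs : can(cidrhost(c, 0))])
    error_message = "Every entry must be a valid CIDR block, for example 10.0.0.0/16."
  }
}
```

Catching a malformed CIDR at validation time gives a clear message at plan, instead of a provider error halfway through an apply.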

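Pattern 4: Migrating to OpenTofu

In August 2023, HashiCorp changed Terraform's license from the Mozilla Public License 2.0 to the Business Source License 1.1. BSL 1.1 prohibits "competitive use", which includes using Terraform to offer a competing IaC service. That is not a problem for most enterprises, but the open-source community's reaction gave rise to OpenTofu.

Migration Assessment

┌────────────────────────┬──────────────────────────────────┬───────────────────────────────────────┐
│ Dimension              │ Terraform (BSL 1.1)              │ OpenTofu (MPL 2.0)                    │
├────────────────────────┼──────────────────────────────────┼───────────────────────────────────────┤
│ License                │ BSL 1.1 (competitive-use limits) │ MPL 2.0 (open source)                 │
│ CLI compatibility      │ Native                           │ Drop-in for Terraform 1.6-era configs │
│ Provider compatibility │ Native                           │ Community providers via its registry  │
│ State file             │ Native format                    │ Reads Terraform state files           │
│ Enterprise support     │ HashiCorp support                │ Linux Foundation community            │
│ New features           │ Native testing (1.6+)            │ State encryption (1.7+)               │
│ Registry               │ Terraform Registry               │ OpenTofu Registry                     │
└────────────────────────┴──────────────────────────────────┴───────────────────────────────────────┘

Migration Steps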

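# Step 1: Install OpenTofu
# macOS
brew install opentofu

# Linux (the installer script expects an explicit install method)
curl --proto '=https' --tlsv1.2 -fsSL https://get.opentofu.org/install-opentofu.sh -o install-opentofu.sh
chmod +x install-opentofu.sh
./install-opentofu.sh --install-method standalone

# Check the installed version
tofu version
# OpenTofu v1.9.0

# Step 2: Replace the CLI commands
# terraform init → tofu init
# terraform plan → tofu plan
# terraform apply → tofu apply
# terraform destroy → tofu destroy

# Step 3: Verify compatibility
tofu init
tofu plan
# If the plan output matches what terraform plan produced, the migration succeeded

# Step 4: Update the CI/CD configuration
# Replace every terraform command with tofu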

CI/CD Migration Example

# .github/workflows/terraform.yml → tofu.yml
name: OpenTofu CI/CD
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  plan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # Assumes the environments/ directory layout from Pattern 3
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4

      - name: Setup OpenTofu
        uses: opentofu/setup-opentofu@v1
        with:
          tofu_version: "1.9.0"

      - name: Tofu Init
        run: tofu init -backend-config=backend.hcl
        working-directory: environments/${{ matrix.environment }}

      - name: Tofu Plan
        run: tofu plan -out=plan.out
        working-directory: environments/${{ matrix.environment }}

      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: plan-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}/plan.out

  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    environment: production
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4

      - name: Setup OpenTofu
        uses: opentofu/setup-opentofu@v1
        with:
          tofu_version: "1.9.0"

      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: plan-${{ matrix.environment }}
          # Place plan.out where the apply step runs
          path: environments/${{ matrix.environment }}

      # A fresh runner has no .terraform directory, so init again before apply
      - name: Tofu Init
        run: tofu init -backend-config=backend.hcl
        working-directory: environments/${{ matrix.environment }}

      - name: Tofu Apply
        run: tofu apply plan.out
        working-directory: environments/${{ matrix.environment }}

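OpenTofu-Only Feature: State Encryption

# OpenTofu 1.7+ supports native state encryption
# (referencing a variable inside this block needs OpenTofu 1.8+)
terraform {
  encryption {
    key_provider "pbkdf2" "mykey" {
      passphrase  = var.encryption_passphrase # at least 16 characters
      key_length  = 32
      iterations  = 600000
      salt_length = 32 # the salt is generated by OpenTofu, not set by hand
    }

    method "aes_gcm" "myencryption" {
      keys = key_provider.pbkdf2.mykey
    }

    # Only needed while migrating existing unencrypted state and plans
    method "unencrypted" "migrate" {}

    state {
      method = method.aes_gcm.myencryption
      fallback {
        method = method.unencrypted.migrate
      }
    }

    plan {
      method = method.aes_gcm.myencryption
      fallback {
        method = method.unencrypted.migrate
      }
    }
  }
}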

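Phased Migration Strategy

┌──────────────────────────────────────────────────────────┐
│                 Phased Migration Roadmap                 │
│                                                          │
│  Phase 1: Assessment (1-2 weeks)                         │
│  ├── Inventory all Terraform projects                    │
│  ├── Check provider and module compatibility             │
│  └── Set migration priorities                            │
│                                                          │
│  Phase 2: Non-production migration (2-4 weeks)           │
│  ├── Switch dev/staging to OpenTofu                      │
│  ├── Verify plan output consistency                      │
│  └── Update CI/CD pipelines                              │
│                                                          │
│  Phase 3: Production migration (1-2 weeks)               │
│  ├── Switch production to OpenTofu                       │
│  ├── Enable state encryption                             │
│  └── Monitor for 1 week to confirm stability             │
│                                                          │
│  Phase 4: Cleanup (1 week)                               │
│  ├── Remove the Terraform CLI dependency                 │
│  ├── Update docs and runbooks                            │
│  └── Complete team training                              │
└──────────────────────────────────────────────────────────┘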

Pattern 5: GitOps Integration

The end goal of Terraform IaC is GitOps: a commit triggers a plan, and an approved change is applied automatically. Atlantis is currently the most mature GitOps tool for Terraform; it runs terraform plan and terraform apply directly from Pull Requests.

Atlantis Architecture

┌────────────────────────────────────────────────────────┐
│              Atlantis GitOps Architecture              │
│                                                        │
│  ┌──────────┐     webhook     ┌──────────────────┐     │
│  │  GitHub  │────────────────▶│     Atlantis     │     │
│  │ /GitLab  │                 │      Server      │     │
│  │          │◀────────────────│                  │     │
│  │          │   PR comment    │ ┌──────────────┐ │     │
│  │          │  (plan/apply)   │ │  terraform   │ │     │
│  └──────────┘                 │ │  plan/apply  │ │     │
│                               │ └──────────────┘ │     │
│                               └────────┬─────────┘     │
│                                        │               │
│                               ┌────────▼─────────┐     │
│                               │   AWS / GCP /    │     │
│                               │    Azure API     │     │
│                               └──────────────────┘     │
└────────────────────────────────────────────────────────┘

Deploying Atlantis

# atlantis-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atlantis
  namespace: atlantis
spec:
  replicas: 1
  selector:
    matchLabels:
      app: atlantis
  template:
    metadata:
      labels:
        app: atlantis
    spec:
      containers:
        - name: atlantis
          image: ghcr.io/runatlantis/atlantis:v0.30.0
          ports:
            - containerPort: 4141
          env:
            - name: ATLANTIS_GH_USER
              value: "myorg-bot"
            - name: ATLANTIS_GH_TOKEN
              valueFrom:
                secretKeyRef:
                  name: atlantis-secrets
                  key: github-token
            - name: ATLANTIS_GH_WEBHOOK_SECRET
              valueFrom:
                secretKeyRef:
                  name: atlantis-secrets
                  key: webhook-secret
            - name: ATLANTIS_ALLOW_REPO_CONFIG
              value: "true"
            - name: ATLANTIS_PARALLEL_PLAN_COUNT
              value: "4"
            - name: ATLANTIS_PARALLEL_APPLY_COUNT
              value: "2"
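            # The server requires a repo allowlist; the org name is assumed from the bot user above
            - name: ATLANTIS_REPO_ALLOWLIST
              value: "github.com/myorg/*"
            # Autoplan is on by default; this flag only makes that explicit
            - name: ATLANTIS_DISABLE_AUTOPLAN
              value: "false"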
            - name: ATLANTIS_REPO_CONFIG_JSON
              value: |
                {
                  "repos": [
                    {
                      "id": "/.*/",
                      "apply_requirements": ["approved", "mergeable"],
                      "plan_requirements": [],
                      "import_requirements": [],
                      "allowed_overrides": ["apply_requirements", "workflow"],
                      "allow_custom_workflows": true
                    }
                  ]
                }
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
          volumeMounts:
            - name: atlantis-data
              mountPath: /home/atlantis
            - name: repo-config
              mountPath: /etc/atlantis
      volumes:
        - name: atlantis-data
          persistentVolumeClaim:
            claimName: atlantis-data
        - name: repo-config
          configMap:
            name: atlantis-config

Atlantis Repository Configuration

# atlantis.yaml (repository root)
version: 3
projects:
  - name: dev-infra
    dir: environments/dev
    workflow: terraform
    autoplan:
      when_modified: ["../../modules/**/*.tf", "*.tf", "*.tfvars"]
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: staging-infra
    dir: environments/staging
    workflow: terraform
    autoplan:
      when_modified: ["../../modules/**/*.tf", "*.tf", "*.tfvars"]
      enabled: true
    apply_requirements:
      - approved
      - mergeable

  - name: prod-infra
    dir: environments/prod
    workflow: terraform
    autoplan:
      when_modified: ["../../modules/**/*.tf", "*.tf", "*.tfvars"]
      enabled: true
    apply_requirements:
      - approved
      - mergeable
      - undiverged

workflows:
  terraform:
    plan:
      steps:
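        - env:
            name: TF_VAR_db_password
            # "value" is taken literally; "command" reads DB_PASSWORD from the
            # Atlantis server's environment at run time
            command: 'printenv DB_PASSWORD'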
        - run: terraform init -backend-config=backend.hcl -reconfigure
        - run: terraform plan -out=$PLANFILE -var-file=terraform.tfvars
        - run: terraform show -json $PLANFILE > $SHOWFILE
    apply:
      steps:
        - run: terraform apply $PLANFILE

CI/CD Pipeline (without Atlantis)

# .github/workflows/terraform-cicd.yml
name: Terraform CI/CD

on:
  push:
    branches: [main]
    paths:
      - "environments/**"
      - "modules/**"
  pull_request:
    branches: [main]
    paths:
      - "environments/**"
      - "modules/**"

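env:
  TF_VERSION: "1.10.0"
  AWS_REGION: "us-east-1"

# id-token is needed for OIDC role assumption, pull-requests for the plan comment
permissions:
  id-token: write
  contents: read
  pull-requests: write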

jobs:
  detect-changes:
    runs-on: ubuntu-latest
    outputs:
      environments: ${{ steps.changes.outputs.environments }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so the diffs below can resolve origin/main and HEAD~1

      - name: Detect changed environments
        id: changes
        run: |
          if [ "${{ github.event_name }}" = "pull_request" ]; then
            RANGE="origin/main...HEAD"
          else
            RANGE="HEAD~1 HEAD"
          fi
          # jq -c keeps the JSON array on a single line, as $GITHUB_OUTPUT requires
          CHANGED=$(git diff --name-only $RANGE | grep -oP 'environments/\K[^/]+' | sort -u | jq -R . | jq -sc .)
          echo "environments=$CHANGED" >> "$GITHUB_OUTPUT"

  plan:
    needs: detect-changes
    if: needs.detect-changes.outputs.environments != '[]'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}
          terraform_wrapper: false  # keep "terraform show -json" output clean for redirection

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: ${{ env.AWS_REGION }}

      - name: Terraform Init
        run: terraform init -backend-config=backend.hcl
        working-directory: environments/${{ matrix.environment }}

      - name: Terraform Plan
        run: |
          terraform plan -out=plan.out -var-file=terraform.tfvars
          terraform show -json plan.out > plan.json
        working-directory: environments/${{ matrix.environment }}

      - name: Upload Plan Artifact
        uses: actions/upload-artifact@v4
        with:
          name: plan-${{ matrix.environment }}
          path: |
            environments/${{ matrix.environment }}/plan.out
            environments/${{ matrix.environment }}/plan.json

      - name: Comment PR with Plan
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('environments/${{ matrix.environment }}/plan.json', 'utf8');
            const planObj = JSON.parse(plan);
            // resource_changes can be absent when the plan is empty
            const changes = (planObj.resource_changes || []).filter(c => c.change.actions.some(a => a !== 'no-op'));
            let body = `## Terraform Plan: ${{ matrix.environment }}\n\n`;
            body += `| Action | Resource Type | Resource Name |\n|--------|--------------|---------------|\n`;
            for (const c of changes) {
              body += `| ${c.change.actions.join(', ')} | ${c.type} | ${c.name} |\n`;
            }
            // Await the call so a failed comment fails the step
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: body
            });

  apply:
    needs: [detect-changes, plan]
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: ${{ fromJson(needs.detect-changes.outputs.environments) }}
    environment: ${{ matrix.environment }}
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: ${{ env.TF_VERSION }}

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-ci
          aws-region: ${{ env.AWS_REGION }}

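      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: plan-${{ matrix.environment }}
          # Restore plan.out next to the configuration it was created from
          path: environments/${{ matrix.environment }}

      # A fresh runner has no .terraform directory, so init must run before apply
      - name: Terraform Init
        run: terraform init -backend-config=backend.hcl
        working-directory: environments/${{ matrix.environment }}

      - name: Terraform Apply
        run: terraform apply plan.out
        working-directory: environments/${{ matrix.environment }}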

Policy as Code: Sentinel / OPA

# sentinel/require-tags.sentinel
# Require Environment and ManagedBy tags on every managed resource
import "tfplan/v2" as tfplan

allResources = filter tfplan.resource_changes as _, rc {
    rc.mode == "managed" and
    rc.type != "null_resource" and
    rc.change.actions != ["delete"]
}

tagsRequired = rule {
    all allResources as _, rc {
        rc.change.after.tags contains "Environment" and
        rc.change.after.tags contains "ManagedBy"
    }
}

main = rule {
    tagsRequired
}

# opa/require-tags.rego
package terraform

import future.keywords.contains
import future.keywords.if
import future.keywords.in

# "deny contains msg if" builds a set of messages; with the "if" keyword,
# "deny[msg] if" would define an object entry keyed by msg instead.
deny contains msg if {
    some rc in input.resource_changes
    rc.mode == "managed"
    rc.change.actions[_] != "delete"
    not "Environment" in object.keys(rc.change.after.tags)
    msg := sprintf("Resource %s of type %s missing Environment tag", [rc.name, rc.type])
}

deny contains msg if {
    some rc in input.resource_changes
    rc.mode == "managed"
    rc.change.actions[_] != "delete"
    not "ManagedBy" in object.keys(rc.change.after.tags)
    msg := sprintf("Resource %s of type %s missing ManagedBy tag", [rc.name, rc.type])
}

# Check a Terraform plan with OPA
terraform plan -out=plan.out
terraform show -json plan.out > plan.json

# Evaluate the policy; every message returned is a violation
opa eval --data opa/ --input plan.json "data.terraform.deny"

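5 Common Pitfalls and How to Fix Them

Pitfall 1: State file corruption

Symptom: terraform plan fails with "state snapshot was created by a newer version" or "invalid character".

Cause: the state file was edited by hand, damaged by a disk failure, rolled back to an older S3 object version, or written by a newer Terraform binary than the one now reading it.

Solution

# 1. Restore from S3 version history
aws s3api list-object-versions \
  --bucket myorg-terraform-state \
  --prefix dev/app-infra/terraform.tfstate

# Copy the previous version back over the current object
# (PREVIOUS_VERSION is a placeholder; the quotes stop the shell from globbing "?")
aws s3api copy-object \
  --bucket myorg-terraform-state \
  --copy-source "myorg-terraform-state/dev/app-infra/terraform.tfstate?versionId=PREVIOUS_VERSION" \
  --key dev/app-infra/terraform.tfstate

# 2. Pull the state, repair it, and push it back
terraform state pull > state.json
# Repair the JSON by hand (with care) and bump the "serial" field so the push counts as newer
terraform state push state.json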

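Pitfall 2: Unpinned provider versions break CI

Symptom: terraform plan works locally, but CI/CD fails to download a provider or reports an incompatible version.

Cause: the local machine has a plugin cache, while CI installs everything from scratch on each run, and the provider version is not pinned.

Solution

# versions.tf - pin the provider version
terraform {
  required_version = ">= 1.5.0, < 2.0.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.80.0"
    }
  }
}

# Record checksums for every platform that runs Terraform, then commit the lock file
terraform providers lock -platform=linux_amd64 -platform=darwin_arm64
git add .terraform.lock.hcl
git commit -m "chore: lock provider versions"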

Pitfall 3: Circular dependencies

Symptom: terraform plan fails with "Cycle: module.x, module.y".

Cause: an output of module A depends on an output of module B, while module B in turn depends on an output of module A.

Solution

# Wrong: circular dependency
module "vpc" {
  source = "./vpc"
  # Depends on the ECS security group ID
  ecs_security_group_id = module.ecs.security_group_id
}

module "ecs" {
  source = "./ecs"
  # Depends on the VPC subnet IDs
  subnet_ids = module.vpc.private_subnet_ids
}

# Right: split into 3 layers with one-way dependencies
# Layer 1: network
module "vpc" {
  source = "./vpc"
}

# Layer 2: security groups
module "security_groups" {
  source = "./security-groups"
  vpc_id = module.vpc.vpc_id
}

# Layer 3: compute
module "ecs" {
  source             = "./ecs"
  subnet_ids         = module.vpc.private_subnet_ids
  security_group_ids = module.security_groups.app_ids
}
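
When the cycle comes from two security groups that reference each other, another common way to break it is to create both groups without inline rules and attach the cross-reference as a standalone rule resource. The following is a minimal sketch, assuming AWS provider 5.x; the resource names, var.vpc_id, and port 5432 are illustrative and do not come from the configuration above.

```hcl
variable "vpc_id" {
  type = string
}

# Both groups are created without inline rules, so neither depends on the other
resource "aws_security_group" "app" {
  name   = "app"
  vpc_id = var.vpc_id
}

resource "aws_security_group" "db" {
  name   = "db"
  vpc_id = var.vpc_id
}

# The rule depends on both groups; the dependency graph stays acyclic
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```

The design choice is the same as the layered split: every edge in the graph points one way, from the rule to the groups.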

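Pitfall 4: Sensitive variables leak into the state file

Symptom: terraform show reveals plaintext passwords, and the state file contains sensitive values.

Cause: sensitive = true only redacts CLI output; it does not encrypt the value stored in state.

Solution

# Pick one of the options below. Values read through data sources are still
# recorded in state, so the backend must also be encrypted and access-controlled.

# Option 1: AWS SSM Parameter Store
data "aws_ssm_parameter" "db_password" {
  name            = "/app/${var.environment}/db-password"
  with_decryption = true
}

module "rds" {
  source   = "../../modules/rds"
  password = data.aws_ssm_parameter.db_password.value
}

# Option 2: AWS Secrets Manager
data "aws_secretsmanager_secret_version" "db_creds" {
  secret_id = "app/${var.environment}/db-credentials"
}

module "rds" {
  source   = "../../modules/rds"
  password = jsondecode(data.aws_secretsmanager_secret_version.db_creds.secret_string)["password"]
}

# Option 3: inject through an environment variable (CI/CD scenario)
# export TF_VAR_db_password=$(vault kv get -field=password secret/app/prod/db)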
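
Because secrets passed to resources end up in state whichever way they are supplied, it also helps to encrypt the state object itself. Below is a minimal sketch of an S3 backend with server-side encryption. The bucket, key, region, and lock table reuse names that appear elsewhere in this article; the KMS key ARN is a made-up placeholder.

```hcl
terraform {
  backend "s3" {
    bucket         = "myorg-terraform-state"
    key            = "dev/app-infra/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"

    # Encrypt the state object at rest
    encrypt = true
    # Placeholder ARN: replace with a real customer-managed key, or omit to use the bucket default
    kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/00000000-0000-0000-0000-000000000000"
  }
}
```

Encryption at rest does not replace access control: anyone allowed to read the object and use the key can still read the secrets, so the bucket policy and key policy should be scoped to the CI role and a small set of operators.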

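Pitfall 5: Module refactoring triggers resource re-creation

Symptom: after renaming a module or moving a resource, terraform plan wants to destroy and re-create it.

Cause: Terraform identifies resources by address (for example module.vpc.aws_vpc.this). When the address changes, Terraform treats it as a new resource.

Solution

# Before refactoring: move the module in state; every resource inside it moves along
terraform state mv module.vpc module.networking_vpc

# Then change the code
# Move the module files
mv modules/vpc modules/networking/vpc

# Update the module reference
# module "vpc" → module "networking_vpc"

# Verify
terraform plan  # should report no changes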
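
As an alternative to running terraform state mv by hand, Terraform 1.1 and later can record the rename declaratively with a moved block, so the move is reviewed in the PR and applied in every environment by the normal plan/apply flow. A minimal sketch, assuming the module is called from an environment directory such as environments/dev (the source path is an assumption based on that layout):

```hcl
# Root module after the rename
module "networking_vpc" {
  source = "../../modules/networking/vpc"
}

# Tells Terraform that everything previously tracked under module.vpc
# now lives under module.networking_vpc, so nothing is destroyed
moved {
  from = module.vpc
  to   = module.networking_vpc
}
```

Once every environment has applied the change, the moved block can stay in place as documentation or be removed in a later cleanup.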

10 Common Errors and How to Troubleshoot Them

1. Error: Failed to load plugin

# Clear the provider cache and download again
rm -rf .terraform/providers
terraform init -upgrade

# Check the network proxy
export HTTPS_PROXY=http://proxy.internal:8080
terraform init

2. Error: Error locking state: Error acquiring the state lock

# Inspect the locks in DynamoDB
aws dynamodb scan --table-name terraform-locks

# Make sure no other Terraform process is still running
# If the lock is confirmed to be stale, force-unlock it
terraform force-unlock <lock-id>

3. Error: Provider produced inconsistent result after apply

# This is a provider bug; the steps below usually resolve it
# 1. Upgrade the provider
terraform init -upgrade

# 2. For a known provider bug, ignore the affected attribute with lifecycle
resource "aws_instance" "app" {
  # ...
  lifecycle {
    ignore_changes = [user_data_replace_on_change]
  }
}

4. Error: Resource already managed by Terraform

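# Typically raised by terraform import when the target address is already tracked in state
# List the resources in state
terraform state list

# If the entry is stale (for example its code was deleted), drop it from state;
# this does not destroy the real resource
terraform state rm aws_instance.old_resource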
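
The same "forget this resource without destroying it" step can also be expressed in configuration. A minimal sketch, assuming Terraform 1.7 or later, where the removed block is available; the address matches the command above:

```hcl
# Stop managing the instance but leave the real resource untouched
removed {
  from = aws_instance.old_resource

  lifecycle {
    destroy = false
  }
}
```

Unlike an ad hoc terraform state rm, this goes through plan and review like any other change.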

5. Error: Module not found

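# Clear the module cache
rm -rf .terraform/modules
terraform init -upgrade

# Check the module source path
# Local module paths are resolved relative to the directory of the calling file
module "vpc" {
  source = "../../modules/vpc"  # relative to the calling module's directory
}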

6. Error: Invalid for_each argument

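# Wrong: the for_each value is not known at plan time
resource "aws_subnet" "private" {
  for_each = toset(module.vpc.private_subnet_cidrs)  # fails if the list is only computed during apply
  # ...
}

# Right: drive for_each from values known at plan time
variable "private_subnets" {
  type = list(string)
}

resource "aws_subnet" "private" {
  for_each   = toset(var.private_subnets)
  vpc_id     = module.vpc.vpc_id
  cidr_block = each.value
}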
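
The restriction applies to the keys of for_each, not to the values assigned inside the resource. One way to take advantage of that is a map with static keys. The sketch below is illustrative: the variable name private_subnet_map, the CIDRs, and the availability zones are assumptions; module.vpc.vpc_id is the output used earlier in this article.

```hcl
variable "private_subnet_map" {
  description = "Static keys (known at plan time) mapped to subnet settings"
  type = map(object({
    cidr = string
    az   = string
  }))
  default = {
    a = { cidr = "10.0.1.0/24", az = "us-east-1a" }
    b = { cidr = "10.0.2.0/24", az = "us-east-1b" }
  }
}

resource "aws_subnet" "private" {
  # Keys "a" and "b" are known at plan time, so for_each is valid
  for_each = var.private_subnet_map

  # Argument values may stay unknown until apply; only the keys must be known
  vpc_id            = module.vpc.vpc_id
  cidr_block        = each.value.cidr
  availability_zone = each.value.az
}
```

Stable keys also keep resource addresses such as aws_subnet.private["a"] unchanged when a CIDR is edited.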

7. Error: Value for unconfigurable attribute

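# Wrong: assigning a read-only attribute
resource "aws_eip" "nat" {
  instance = aws_instance.nat.id
  domain   = "vpc"  # read-only in AWS provider 4.x; configurable from 5.0 onward
}

# Right: check the docs for the pinned provider version and set only writable arguments
# (shown here for AWS provider 5.x)
resource "aws_eip" "nat" {
  domain = "vpc"
}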

8. Error: Backend configuration changed

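# Re-initialize after the backend configuration changes
terraform init -migrate-state

# If the automatic migration fails, migrate by hand
# Pull the state while the old backend is still configured
terraform state pull > state.json
# Edit backend.tf, then initialize against the new backend without migrating
terraform init -reconfigure
terraform state push state.json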

9. Error: Invalid terraform configuration: No required_providers

# Every module should declare the providers it uses in required_providers
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}

10. Error: Incompatible API version

# The provider version is incompatible with the Terraform version
# Check compatibility
terraform version
terraform providers

# Update to a compatible version
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "5.80.0"  # pin a compatible version
    }
  }
}

Advanced Optimization Techniques

1. Terraform Cloud / Enterprise

# terraform-cloud.tf
# Use Terraform Cloud as the remote execution environment
terraform {
  cloud {
    organization = "myorg"
    workspaces {
      tags = ["app-infra"]
    }
  }
}

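2. Wrapping internal APIs in a custom provider

// internal/provider/resource_internal_service.go
// Custom provider resource built on the Terraform Plugin Framework.
// Skeleton only: a complete resource.Resource also needs Metadata, Schema, Update, and Delete.
package provider

import (
    "context"

    "github.com/hashicorp/terraform-plugin-framework/resource"
)

type internalServiceResource struct{}

// Create calls the internal API to create the service.
func (r *internalServiceResource) Create(ctx context.Context, req resource.CreateRequest, resp *resource.CreateResponse) {
    // Call the internal API and write the result into resp.State
}

// Read calls the internal API to refresh the service state.
func (r *internalServiceResource) Read(ctx context.Context, req resource.ReadRequest, resp *resource.ReadResponse) {
    // Call the internal API and update resp.State
}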

3. Terraform test framework

# tests/integration/main.tftest.hcl
run "create_infrastructure" {
  command = apply

  module {
    source = "../../environments/dev"
  }

  variables {
    db_password = "test-password-12345678"
  }
}

run "validate_endpoints" {
  command = apply

  variables {
    api_endpoint = run.create_infrastructure.api_url
  }

  # Assumes the configuration under test declares data "http" "check" against var.api_endpoint.
  # Compare directly: can() only reports whether the expression evaluates, not whether it is true.
  assert {
    condition     = data.http.check.status_code == 200
    error_message = "API endpoint should return 200"
  }
}
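
Integration tests that run apply create real infrastructure and are slow. For fast feedback, a plan-only unit test with a mocked provider can run without cloud credentials. The following is a sketch, assuming Terraform 1.7 or later (where mock_provider is available) and a hypothetical VPC module that declares resource aws_vpc.this and an input variable cidr_block; the file path is also an assumption.

```hcl
# modules/vpc/tests/unit.tftest.hcl (hypothetical path)

# Replace the real AWS provider with generated fake data; no credentials needed
mock_provider "aws" {}

variables {
  cidr_block = "10.0.0.0/16" # hypothetical input variable of the module
}

run "vpc_uses_requested_cidr" {
  # plan only: nothing is created, the assertion runs against planned values
  command = plan

  assert {
    condition     = aws_vpc.this.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR should match the cidr_block input"
  }
}
```

Run it with terraform test from the module directory; keeping unit tests like this next to each module and reserving apply-based tests for a dedicated environment is one reasonable split.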

4. Cost Estimation

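# Estimate cost with Infracost
infracost breakdown --path=plan.json \
  --format=json \
  --out-file=infracost.json

# Post the estimate as a PR comment
# (PR_NUMBER and GITHUB_TOKEN are placeholders that must be set in the environment)
infracost comment github --path=infracost.json \
  --repo="$GITHUB_REPOSITORY" \
  --pull-request="$PR_NUMBER" \
  --github-token="$GITHUB_TOKEN" \
  --behavior=update

# .github/workflows/infracost.yml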
name: Infracost
on:
  pull_request:
    paths:
      - "environments/**"
      - "modules/**"

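jobs:
  infracost:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed to post the cost comment
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_wrapper: false  # keep "terraform show -json" output clean

      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}

      # Assumes cloud credentials are configured as in the main pipeline
      - name: Terraform Plan
        run: |
          terraform init -backend-config=backend.hcl
          terraform plan -out=plan.out -var-file=terraform.tfvars
          terraform show -json plan.out > plan.json
        working-directory: environments/dev

      - name: Infracost Breakdown
        run: infracost breakdown --path=plan.json --format=json --out-file=/tmp/infracost.json
        working-directory: environments/dev

      - name: Infracost Comment
        run: |
          infracost comment github --path=/tmp/infracost.json \
            --repo=${{ github.repository }} \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ github.token }} \
            --behavior=update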

5. Auto-generating module documentation

# Install terraform-docs
brew install terraform-docs

# Generate the README
terraform-docs markdown table ./modules/vpc > ./modules/vpc/README.md

# Or generate it automatically with pre-commit hooks
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.92.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_docs
        args:
          - '--args=--lockfile=false'
      - id: terraform_tflint
      - id: terraform_trivy
      - id: terraform_checkov

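Comparison: Terraform vs OpenTofu vs Pulumi vs CDK

┌────────────────────┬────────────────────┬──────────────────┬─────────────────┬──────────────────────┐
│ Dimension          │ Terraform          │ OpenTofu         │ Pulumi          │ CDK                  │
├────────────────────┼────────────────────┼──────────────────┼─────────────────┼──────────────────────┤
│ Language           │ HCL                │ HCL              │ TS/Python/Go/C# │ TS/Python/Java/C#    │
│ License            │ BSL 1.1            │ MPL 2.0          │ Apache 2.0      │ Apache 2.0           │
│ Providers          │ 3000+              │ 3000+            │ 200+            │ Mainly AWS           │
│ State encryption   │ Backend (S3 + KMS) │ Native           │ Pulumi Cloud    │ N/A (CloudFormation) │
│ Testing            │ terraform test     │ tofu test        │ Mocha/Jest      │ Jest                 │
│ GitOps             │ Atlantis           │ Atlantis         │ Pulumi Cloud    │ CDK Pipelines        │
│ Learning curve     │                    │                  │                 │                      │
│ Multi-cloud        │ Native             │ Native           │ Native          │ Mainly AWS           │
│ Community          │ Largest            │ Growing          │ Growing         │ AWS ecosystem        │
│ Enterprise support │ HashiCorp          │ Linux Foundation │ Pulumi Corp     │ AWS                  │
└────────────────────┴────────────────────┴──────────────────┴─────────────────┴──────────────────────┘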

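Decision tree

Is the team already comfortable with HCL?
├── Yes → Is license compliance a concern?
│         ├── Yes → OpenTofu
│         └── No  → Terraform
└── No  → Is the workload mainly on AWS?
          ├── Yes → CDK
          └── No  → Does the team prefer a general-purpose language?
                    ├── Yes → Pulumi
                    └── No  → Terraform/OpenTofu (HCL is easy to pick up)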


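Summary: Terraform IaC best practice comes down to five production patterns. Modular composition makes code reusable and testable; remote state management keeps state safe and reliable; Workspace-based environment isolation handles multiple environments; migrating to OpenTofu addresses license compliance; and GitOps integration automates plan/apply. In 2026, whether you choose Terraform or OpenTofu, HCL remains the most mature option for IaC. Key practices: a three-layer module architecture, an S3 + DynamoDB remote backend, directory-based environment isolation, Atlantis GitOps, and Policy as Code. IaC is not a one-off project but a continuously evolving platform.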

