The Config That Lived in the Code


System Design Scenario

Every configuration change becomes a deployment gamble when feature flags and timeouts are baked into your source code

It’s Friday at 4:47 PM when the alert hits: payment processing is failing for 30% of transactions. Marcus, the platform engineer, knows exactly what happened - the third-party payment API changed their timeout requirements from 10 seconds to 15 seconds yesterday. No big deal, right? Just change one line in the code.

Except that timeout is hardcoded as a constant in the payment service. Changing it requires a full deployment through three environments - dev, staging, and production. The deploy pipeline takes 45 minutes if everything goes perfectly. It’s Friday evening, the QA team has left, and every minute costs the company $2,400 in failed transactions.

The codebase is littered with hardcoded values: FEATURE_NEW_CHECKOUT = false, PAYMENT_TIMEOUT = 10000, THIRD_PARTY_API_URL = "https://api-v2.payments.com", MAX_RETRY_ATTEMPTS = 3. It’s like building a house where adjusting the thermostat requires ripping out walls and rewiring the electrical system. Each configuration change becomes an architectural decision, complete with code reviews, testing cycles, and deployment risks.

At $2,400 per minute, even a flawless 45-minute pipeline run costs $108,000 in failed transactions, and a Friday-evening deploy without QA rarely goes flawlessly. This is the deployment coupling problem - where operational changes require engineering changes, turning simple tweaks into complex releases.

Why This Happens

The path to hardcoded configuration is gradual and logical. Early in development, hardcoding values is the fastest path forward - no infrastructure to build, no external dependencies to manage, no additional complexity to debug. A startup with 100 users doesn’t need sophisticated configuration management.

The problem compounds as systems grow:

initial simplicity
  -> small config changes require deploys
    -> deploys become risky and slow
      -> teams avoid necessary adjustments
        -> systems become fragile and inflexible
          -> outages from inability to adapt quickly

The core issue is coupling operational concerns with development concerns - mixing what the system does with how it should behave in specific environments.

Key Insight

Hardcoded configuration creates deployment coupling - operational changes require engineering changes, turning runtime adjustments into development cycles.

The Naive Solution (and where it breaks)

Most teams first reach for environment variables - move constants out of code and into environment files:

# Move from this
PAYMENT_TIMEOUT = 10000  # milliseconds
MAX_RETRIES = 3

# To this
import os

# Values are read once, when the module is imported at process start
PAYMENT_TIMEOUT = int(os.environ.get('PAYMENT_TIMEOUT', 10000))
MAX_RETRIES = int(os.environ.get('MAX_RETRIES', 3))

Environment variables feel like a house with a separate control panel - you can adjust settings without rewiring, but you still need to turn the power off and on again to make changes take effect. The approach works for basic cases but hits limits quickly.

[Figure: Environment variables still require restarts and deployments for changes]

The problems emerge at scale:

Small scale: env vars work for basic config
Large scale: still requires restarts for changes,
  no change history, no rollback capability,
  difficult coordination across multiple services

Environment variables are better than hardcoded constants, but they’re still deployment-coupled. Changing a timeout still requires updating container configs, restarting services, and coordinating changes across environments.
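To make the restart coupling concrete, here is a minimal sketch. The `Settings` class and `load_settings` helper are hypothetical names for illustration, and a plain dict stands in for the process environment (a real process cannot have its environment edited from outside at all). The point is only that an environment value is a snapshot taken when the process reads it.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    payment_timeout_ms: int
    max_retries: int


def load_settings(env=os.environ):
    """Read settings once; the result is a snapshot of the environment."""
    return Settings(
        payment_timeout_ms=int(env.get("PAYMENT_TIMEOUT", 10000)),
        max_retries=int(env.get("MAX_RETRIES", 3)),
    )


# Simulate the process starting with the old timeout.
startup_env = {"PAYMENT_TIMEOUT": "10000"}
settings = load_settings(startup_env)

# An operator edits the environment definition afterwards...
startup_env["PAYMENT_TIMEOUT"] = "15000"

# ...but the running process still holds the value it read at startup.
print(settings.payment_timeout_ms)   # 10000

# Only a fresh load (in practice, a restart) picks up the change.
restarted = load_settings(startup_env)
print(restarted.payment_timeout_ms)  # 15000
```

The second `load_settings` call is the restart: nothing short of re-reading the environment makes the new timeout visible.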

Watch Out

Environment variables create restart coupling - you’ve moved config out of code but changes still require service restarts, making rapid adjustments impossible during incidents.

External Configuration Store

Here’s what actually fixes this: move configuration to an external store that services can poll or receive updates from without restarts.

The pattern separates configuration data from application code entirely. Services read config from external sources and can pick up changes dynamically, turning configuration updates into data changes instead of code changes.

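# Configuration service client with caching and refresh
import time

import requests


class ConfigService:
    def __init__(self, config_url, refresh_interval=30, request_timeout=2):
        self.config_url = config_url
        self.refresh_interval = refresh_interval  # seconds between refreshes
        self.request_timeout = request_timeout    # seconds; never hang on the config store
        self.cache = {}
        self.last_refresh = float("-inf")         # forces a refresh on first read

    def get(self, key, default=None):
        # Refresh lazily once the cached copy is older than the interval
        if time.monotonic() - self.last_refresh > self.refresh_interval:
            self.refresh_config()
        return self.cache.get(key, default)

    def refresh_config(self):
        # Record the attempt first so a failing store is retried once per
        # interval instead of on every get() call
        self.last_refresh = time.monotonic()
        try:
            response = requests.get(f"{self.config_url}/config",
                                    timeout=self.request_timeout)
            response.raise_for_status()
            self.cache.update(response.json())
        except (requests.RequestException, ValueError):
            # Keep existing config on network errors, HTTP errors, or bad JSON
            pass

# Usage in application code
config = ConfigService("https://config.internal.com")

def process_payment():
    timeout = config.get('payment_timeout', 10000)
    max_retries = config.get('max_retries', 3)
    # ... rest of payment logic

[Figure: External configuration store allows runtime updates without deployments]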

Services poll the configuration store every 30 seconds, picking up changes automatically. Updating a timeout becomes a data operation - change the value in the config store, wait 30 seconds, and all services have the new setting.
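One way to keep polling off the request path is a background thread that swaps in a fresh snapshot. The sketch below is illustrative only: `PollingConfig` is a hypothetical class, and the `fetch` callable is injected so the example runs without a real config store (here a plain dict plays that role).

```python
import threading


class PollingConfig:
    """Keeps a config snapshot fresh from a background thread."""

    def __init__(self, fetch, interval_seconds=30.0):
        self._fetch = fetch              # callable returning a dict; may raise
        self._interval = interval_seconds
        self._snapshot = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def poll_once(self):
        """Fetch once; on any failure keep the last-known-good snapshot."""
        try:
            fresh = dict(self._fetch())
        except Exception:
            return False
        with self._lock:
            self._snapshot = fresh       # replace the whole snapshot at once
        return True

    def start(self):
        def loop():
            # Event.wait returns False on timeout and True once stop() is called
            while not self._stop.wait(self._interval):
                self.poll_once()
        self.poll_once()                 # prime the cache before serving traffic
        threading.Thread(target=loop, daemon=True).start()

    def stop(self):
        self._stop.set()

    def get(self, key, default=None):
        with self._lock:
            return self._snapshot.get(key, default)


# Stand-in for the config store: a dict that an operator edits.
store = {"payment_timeout": 10000}
config = PollingConfig(lambda: store, interval_seconds=30.0)
config.poll_once()
before = config.get("payment_timeout")

store["payment_timeout"] = 15000         # operator changes the value
config.poll_once()                       # what the background loop does each interval
after = config.get("payment_timeout")
print(before, after)
```

With this shape, `get()` is always an in-memory lookup, and a config store outage only means the snapshot stops changing.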

Real World

Netflix’s Archaius configuration system handles millions of config changes per day across thousands of services, enabling rapid response to production issues without deployments - they can disable features or adjust timeouts within seconds.

Add Feature Flag Service

The second layer: dedicated feature flag management that goes beyond simple boolean toggles to provide gradual rollouts, user targeting, and instant rollbacks.

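# Feature flag service with percentage rollouts and targeting
import hashlib


class FeatureFlagService:
    def __init__(self, flag_service_url):
        self.flag_service_url = flag_service_url
        self.cache = {}  # flag_name -> flag config

    def get_flag_config(self, flag_name):
        # Assumption: flags are already cached locally; fetching from
        # flag_service_url is omitted here
        return self.cache.get(flag_name)

    def is_enabled(self, flag_name, user_id=None, context=None):
        flag_config = self.get_flag_config(flag_name)

        if not flag_config:
            return False

        # Check user targeting first, so explicitly targeted users are not
        # excluded by the percentage gate below
        if user_id is not None and user_id in flag_config.get('target_users', ()):
            return True

        # Check percentage rollout. hashlib gives the same bucket in every
        # process; the built-in hash() is salted per process for strings.
        if 'percentage' in flag_config:
            if user_id is None:
                return False  # no stable identity to bucket on
            digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
            if int(digest[:8], 16) % 100 >= flag_config['percentage']:
                return False

        # Check context-based rules
        if 'rules' in flag_config and context:
            return self.evaluate_rules(flag_config['rules'], context)

        return flag_config.get('default', False)

    def evaluate_rules(self, rules, context):
        # Assumption: each rule is a dict of required context attributes,
        # and any one fully matching rule enables the flag
        return any(all(context.get(k) == v for k, v in rule.items())
                   for rule in rules)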

Feature flags become sophisticated control mechanisms rather than simple on/off switches. You can enable features for 5% of users, target specific user segments, or enable features only in certain geographic regions.
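The percentage rollout relies on one property: a user's bucket must be stable. A minimal sketch of that idea follows, using hypothetical `bucket` and `in_rollout` helpers and made-up user IDs. It hashes the flag name together with the user ID, then checks that raising the percentage only ever adds users.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def in_rollout(flag_name, user_id, percentage):
    return bucket(flag_name, user_id) < percentage


users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if in_rollout("new_checkout", u, 5)}
at_20 = {u for u in users if in_rollout("new_checkout", u, 20)}

print(len(at_5) / len(users))    # close to 0.05
print(at_5 <= at_20)             # True: ramping up never removes a user
```

Because the bucket depends only on the flag name and user ID, every service instance reaches the same decision without coordinating.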

Real World

Facebook’s Gatekeeper system manages over 100,000 feature flags across their platform, enabling them to test new features on small user populations and kill bad features within minutes if metrics show problems.

Runtime Configuration Updates

The third layer: push-based configuration updates that eliminate polling delays and provide instant config changes during incidents.

# Kubernetes ConfigMap with automatic reload
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-config
data:
  config.yaml: |
    payment:
      timeout: 10000
      max_retries: 3
      api_url: "https://api-v2.payments.com"
    features:
      new_checkout: false
      enhanced_security: true
---
# Deployment with config mounting and reload capability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment-service
        image: payment-service:latest  # placeholder; the real image is not named here
        volumeMounts:
        - name: config
          mountPath: /etc/config  # mount the directory; subPath mounts do not receive updates
        env:
        - name: CONFIG_PATH
          value: /etc/config/config.yaml
        - name: CONFIG_RELOAD_SIGNAL  # read by the application itself, not by Kubernetes
          value: "SIGUSR1"
      volumes:
      - name: config
        configMap:
          name: payment-service-config

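The service watches the mounted file and reloads its config when the contents change; Kubernetes itself does not signal the process. When you change the ConfigMap, the kubelet refreshes the mounted file on its next periodic sync, so the update reaches the container without a restart, but the delay can be as long as the kubelet sync period plus a cache propagation delay rather than being immediate. ConfigMaps mounted with subPath, or consumed as environment variables, are not updated this way.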

[Figure: Runtime configuration updates with instant propagation and rollback capability]

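Configuration changes reach all service instances without a deploy or restart: within seconds over a dedicated push channel, or after the kubelet sync delay when relying on mounted ConfigMaps. Combined with health checks and circuit breakers, bad config changes can be detected and rolled back automatically.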
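The reload side of this can be sketched in a few lines. `FileConfig` below is a hypothetical helper, not the payment service's actual loader, and it reads JSON rather than YAML only to stay within the standard library. It compares a content hash so that it reloads when the file really changed, and it keeps the last good config if the new contents do not parse.

```python
import hashlib
import json
import os
import tempfile


class FileConfig:
    """Reloads a JSON config file when its contents change."""

    def __init__(self, path):
        self._path = path
        self._digest = None
        self.data = {}
        self.maybe_reload()

    def maybe_reload(self):
        """Return True if a new, valid config was loaded."""
        try:
            with open(self._path, "rb") as f:
                raw = f.read()
        except OSError:
            return False                 # file missing mid-update; keep current config
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._digest:
            return False                 # unchanged
        try:
            parsed = json.loads(raw)
        except ValueError:
            return False                 # invalid content; keep last good config
        self._digest = digest
        self.data = parsed
        return True


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.json")
    with open(path, "w") as f:
        json.dump({"payment": {"timeout": 10000}}, f)
    cfg = FileConfig(path)

    with open(path, "w") as f:
        json.dump({"payment": {"timeout": 15000}}, f)
    reloaded = cfg.maybe_reload()
    timeout_after = cfg.data["payment"]["timeout"]

    with open(path, "w") as f:
        f.write("{not json")             # a bad edit must not take the service down
    rejected = not cfg.maybe_reload()
    timeout_kept = cfg.data["payment"]["timeout"]

print(reloaded, timeout_after, rejected, timeout_kept)
```

In a running service, `maybe_reload()` would be called from a timer or from a handler for the reload signal.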

Key Insight

Push-based configuration updates eliminate the polling delay and enable instant rollbacks - critical for incident response when every second matters.

The Full Architecture

[Figure: Complete configuration management system with external stores, feature flags, and runtime updates]

The architecture handles all configuration concerns: static configuration through external stores, feature flags through dedicated services, and runtime updates through push mechanisms.

Configuration flows through multiple channels based on urgency and type:

  1. Application config (timeouts, URLs, thresholds) comes from external config store with 30-second polling
  2. Feature flags come from dedicated flag service with real-time updates via WebSocket
  3. Emergency config changes push through Kubernetes ConfigMaps, with services reloading on file change or on a reload signal
  4. All changes are versioned, audited, and rollback-capable through the config management layer

The separation enables different update patterns - feature flags can change instantly for rapid experimentation, while infrastructure config can have approval workflows and staged rollouts.
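When several channels can supply the same key, the service needs a precedence rule. The sketch below assumes one possible order (emergency overrides, then flag values, then the config store, then built-in defaults); the article does not prescribe an order, and the dictionaries are illustrative stand-ins for the real sources.

```python
from collections import ChainMap

defaults = {"payment_timeout": 10000, "max_retries": 3, "new_checkout": False}
store_config = {"payment_timeout": 15000}   # from the external config store
flag_values = {"new_checkout": True}        # from the feature flag service
emergency_overrides = {}                    # pushed during incidents

# ChainMap searches left to right, so earlier maps take precedence.
effective = ChainMap(emergency_overrides, flag_values, store_config, defaults)

print(effective["payment_timeout"])   # 15000, from the config store
print(effective["max_retries"])       # 3, from defaults

# An emergency override wins immediately because ChainMap holds references.
emergency_overrides["payment_timeout"] = 20000
print(effective["payment_timeout"])   # 20000
```

Clearing the override dict restores the config store's value, which gives the bypass mechanism a natural way to expire.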

Key Insight

Configuration needs different update patterns - feature flags need instant changes for experimentation, infrastructure settings need approval workflows, and emergency changes need bypass mechanisms.

Component Deep Dives

Configuration Store Service

The config store’s job is providing versioned, auditable configuration with efficient caching and change propagation.

# Configuration store with versioning and change tracking
import json


class ConfigStore:
    def __init__(self, database_url):
        # DatabaseConnection stands in for your database client
        self.db = DatabaseConnection(database_url)

    def get_config(self, service_name, version='latest'):
        # Add the version filter only for a specific version: comparing an
        # integer version column to the string 'latest' fails on strictly
        # typed databases
        query = """
        SELECT config_data, version, created_at AS last_modified
        FROM service_configs
        WHERE service_name = %s
        """
        params = [service_name]
        if version != 'latest':
            query += " AND version = %s"
            params.append(version)
        query += " ORDER BY version DESC LIMIT 1"

        result = self.db.execute(query, params)
        if result:
            return {
                'data': json.loads(result['config_data']),
                'version': result['version'],
                'last_modified': result['last_modified']
            }
        return None

    def update_config(self, service_name, config_data, user_id):
        # Increment version number (a service with no config yet starts at 1)
        current_version = self.get_latest_version(service_name) or 0
        new_version = current_version + 1

        # Store new config with audit trail. A unique constraint on
        # (service_name, version) is assumed, so concurrent writers cannot
        # both claim the same version number.
        self.db.execute("""
        INSERT INTO service_configs
        (service_name, version, config_data, updated_by, created_at)
        VALUES (%s, %s, %s, %s, NOW())
        """, [service_name, new_version, json.dumps(config_data), user_id])

        # Trigger change notifications
        self.notify_config_change(service_name, new_version)
        return new_version

Every configuration change is versioned and auditable. The system tracks who made changes, when they occurred, and provides rollback to any previous version.
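Rollback deserves a closer look, because the simplest safe form is not "delete the bad version" but "write the old data again as a new version". The sketch below illustrates that with an in-memory SQLite database; the table layout mirrors the one above, while the helper functions and the sample values are assumptions made for the example.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE service_configs (
        service_name TEXT NOT NULL,
        version      INTEGER NOT NULL,
        config_data  TEXT NOT NULL,
        updated_by   TEXT NOT NULL,
        PRIMARY KEY (service_name, version)
    )
""")


def latest_version(service):
    row = db.execute(
        "SELECT MAX(version) FROM service_configs WHERE service_name = ?",
        (service,),
    ).fetchone()
    return row[0] or 0


def get_config(service, version=None):
    version = version or latest_version(service)
    row = db.execute(
        "SELECT config_data FROM service_configs "
        "WHERE service_name = ? AND version = ?",
        (service, version),
    ).fetchone()
    return json.loads(row[0]) if row else None


def update_config(service, data, user):
    new_version = latest_version(service) + 1
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO service_configs VALUES (?, ?, ?, ?)",
            (service, new_version, json.dumps(data), user),
        )
    return new_version


def rollback(service, to_version, user):
    """Roll back by writing the old data as a new version, keeping history."""
    old = get_config(service, to_version)
    if old is None:
        raise ValueError(f"no version {to_version} for {service}")
    return update_config(service, old, user)


v1 = update_config("payment", {"timeout": 10000}, "marcus")
v2 = update_config("payment", {"timeout": 1500}, "marcus")   # typo: missing a zero
v3 = rollback("payment", v1, "marcus")
print(v3, get_config("payment"))
```

The bad version 2 stays in the table, so the audit trail shows both the mistake and the rollback that corrected it.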

Feature Flag Management

Feature flag services require sophisticated evaluation engines that can handle complex targeting rules and gradual rollouts.

# Feature flag evaluation with complex targeting
import hashlib


class FeatureFlagEngine:
    def evaluate_flag(self, flag_name, context):
        # Unknown flags evaluate to the default (off) instead of raising
        flag = self.get_flag_definition(flag_name) or {}

        # Evaluate targeting rules in priority order
        for rule in flag.get('rules', []):
            if self.evaluate_rule(rule, context):
                return rule['variation']

        # Fall back to percentage rollout
        if 'rollout' in flag:
            return self.evaluate_percentage_rollout(
                flag_name,
                flag['rollout'],
                context.get('user_id')
            )

        return flag.get('default_variation', False)

    def evaluate_percentage_rollout(self, flag_name, rollout, user_id):
        # Consistent hash-based distribution. Salting with the flag name keeps
        # rollouts independent, so the same users are not first in line for
        # every flag.
        user_hash = hashlib.sha1(f"{flag_name}:{user_id}".encode()).hexdigest()
        user_percentage = int(user_hash[:8], 16) % 100

        # Bucket percentages are cumulative upper bounds, e.g. 10 then 100
        for bucket in rollout['buckets']:
            if user_percentage < bucket['percentage']:
                return bucket['variation']

        return False

The evaluation engine supports complex rules like “enable for users in California with premium accounts” while maintaining consistent assignment - the same user always gets the same variation.
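A rule such as "users in California with premium accounts" can be represented as a list of conditions that must all hold. The sketch below is one hypothetical shape for `evaluate_rule`; the rule schema (`attribute`, `op`, `value`) and the operator names are assumptions for illustration, not a format defined by any particular flag service.

```python
import operator

# Supported comparison operators; the names are an assumption of this sketch.
OPERATORS = {
    "eq": operator.eq,
    "in": lambda actual, expected: actual in expected,
    "gte": operator.ge,
}


def evaluate_rule(rule, context):
    """A rule matches only if every one of its conditions matches."""
    for cond in rule["conditions"]:
        attr = cond["attribute"]
        if attr not in context:
            return False                 # a missing attribute never matches
        if not OPERATORS[cond["op"]](context[attr], cond["value"]):
            return False
    return True


california_premium = {
    "variation": True,
    "conditions": [
        {"attribute": "region", "op": "eq", "value": "CA"},
        {"attribute": "plan", "op": "in", "value": ["premium", "enterprise"]},
    ],
}

print(evaluate_rule(california_premium, {"region": "CA", "plan": "premium"}))  # True
print(evaluate_rule(california_premium, {"region": "CA", "plan": "free"}))     # False
print(evaluate_rule(california_premium, {"plan": "premium"}))                  # False
```

Treating a missing attribute as a non-match keeps anonymous or partially identified requests on the default variation.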

Configuration Change Pipeline

Robust configuration management requires approval workflows, staged rollouts, and automatic validation.

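# Configuration change pipeline with approvals
name: config-change-pipeline
on:
  pull_request:
    types: [opened, synchronize, reopened, closed]
    paths: ['configs/**']
  pull_request_review:
    types: [submitted]

jobs:
  validate-config:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Validate Configuration Schema
      run: |
        # Validate all config files against JSON schemas
        pip install jsonschema
        for config in configs/*.yaml; do
          yq eval -o=json '.' "$config" > /tmp/config.json
          jsonschema -i /tmp/config.json schemas/config-schema.json
        done

    - name: Test Configuration Changes
      run: |
        # Run integration tests with new config
        docker compose -f test-compose.yml up --abort-on-container-exit

  deploy-staging:
    needs: validate-config
    # Runs when a reviewer approves the pull request
    if: github.event_name == 'pull_request_review' && github.event.review.state == 'approved'
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Deploy to Staging
      # Assumes cluster credentials were configured in an earlier step (not shown)
      run: kubectl apply -f configs/staging/

    - name: Validate Staging Health
      run: |
        # Wait for services to stabilize with new config
        sleep 30
        curl -f https://staging-api.company.com/health

  deploy-production:
    # Staging ran earlier, on the approval event; this job runs on the merge event
    needs: validate-config
    if: github.event_name == 'pull_request' && github.event.action == 'closed' && github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4

    - name: Deploy to Production
      run: kubectl apply -f configs/production/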

Configuration changes go through the same rigor as code changes - validation, testing, staged deployment, and health checks - but without requiring application deployments.
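The validation step is where a typo like a 15 ms timeout should be caught. As a rough illustration of what such a check does, here is a standard-library-only sketch; `validate_payment_config` is a hypothetical function, and the allowed ranges are assumptions chosen for the example rather than values from the payment service.

```python
def validate_payment_config(config):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    payment = config.get("payment")
    if not isinstance(payment, dict):
        return ["payment: section is missing"]

    timeout = payment.get("timeout")
    # bool is a subclass of int in Python, so exclude it explicitly
    if not isinstance(timeout, int) or isinstance(timeout, bool):
        errors.append("payment.timeout: must be an integer (milliseconds)")
    elif not 1000 <= timeout <= 60000:
        errors.append("payment.timeout: must be between 1000 and 60000 ms")

    retries = payment.get("max_retries")
    if not isinstance(retries, int) or isinstance(retries, bool):
        errors.append("payment.max_retries: must be an integer")
    elif not 0 <= retries <= 10:
        errors.append("payment.max_retries: must be between 0 and 10")

    url = payment.get("api_url")
    if not isinstance(url, str) or not url.startswith("https://"):
        errors.append("payment.api_url: must be an https:// URL")
    return errors


good = {"payment": {"timeout": 15000, "max_retries": 3,
                    "api_url": "https://api-v2.payments.com"}}
bad = {"payment": {"timeout": 15, "max_retries": "3",
                   "api_url": "http://api-v2.payments.com"}}

print(validate_payment_config(good))        # []
for problem in validate_payment_config(bad):
    print(problem)
```

Range checks matter as much as type checks here: `15` is a perfectly valid integer and still a production incident.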

Configuration Monitoring

Configuration systems need monitoring to detect bad config changes and enable rapid rollbacks.

# Configuration health monitoring
import time


class ConfigHealthMonitor:
    def __init__(self, metrics_client):
        self.metrics = metrics_client

    def monitor_config_change(self, service_name, config_version):
        start_time = time.monotonic()

        # Baseline should reflect service health from before the config change
        baseline_metrics = self.get_baseline_metrics(service_name)

        while time.monotonic() - start_time < 300:  # Monitor for 5 minutes
            current_metrics = self.get_current_metrics(service_name)

            # Check for significant degradation
            if self.detect_degradation(baseline_metrics, current_metrics):
                self.trigger_rollback(service_name, config_version - 1)
                break

            time.sleep(30)

    def detect_degradation(self, baseline, current):
        # Alert on significant increases in errors (>50%) or p95 latency (>30%)
        error_rate_increase = self._relative_increase(
            baseline.error_rate, current.error_rate)
        latency_increase = self._relative_increase(
            baseline.p95_latency, current.p95_latency)

        return error_rate_increase > 0.5 or latency_increase > 0.3

    @staticmethod
    def _relative_increase(baseline, current):
        # A zero baseline (e.g. no errors before the change) would divide by
        # zero; treat any rise from zero as unbounded degradation
        if baseline == 0:
            return float('inf') if current > 0 else 0.0
        return (current - baseline) / baseline

The monitoring system watches service health metrics after configuration changes and automatically rolls back if key metrics degrade significantly.

Comparison Table

| Approach | Change Speed | Rollback Capability | Audit Trail | Operational Risk | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| Hardcoded | Hours (deploy cycle) | None (requires new deploy) | Git history only | High (code changes for config) | Development/prototypes |
| Environment Variables | Minutes (restart) | Manual (restore env) | Limited | Medium (restart disruption) | Simple service configuration |
| External Config Store | Seconds (poll interval) | Instant (version rollback) | Complete | Low (data-only changes) | Runtime application settings |
| Feature Flag Service | Instant | Instant (toggle off) | Complete with targeting | Very Low | Feature rollouts, A/B testing |
| Hybrid System | Instant to minutes | Instant | Complete | Very Low | Production systems at scale |

The hybrid approach provides the flexibility to handle all configuration scenarios - instant feature toggles for experiments, validated config changes for infrastructure settings, and emergency overrides when needed.

Key Takeaways

• Deployment coupling occurs when operational changes require development changes, making simple adjustments into complex releases
• External configuration stores decouple config from deployments, enabling runtime changes without service restarts
• Feature flag services provide sophisticated targeting and rollout capabilities beyond simple boolean toggles
• Runtime configuration updates eliminate polling delays and enable instant rollbacks during incidents
• Configuration versioning provides audit trails and rollback capabilities essential for production systems
• Approval workflows balance change velocity with operational safety for different types of configuration
• Health monitoring detects bad configuration changes and enables automatic rollbacks before user impact
• Push-based updates are critical for incident response when every second counts

The pattern works for any system where operational flexibility matters more than architectural simplicity. Design configuration systems like data systems - with versioning, monitoring, rollback capabilities, and appropriate access controls.

Frequently Asked Questions

Q: How do you handle configuration secrets with external config stores?
A: Use dedicated secret management systems like HashiCorp Vault or AWS Secrets Manager. Never store secrets in plain text config stores. Reference secrets by ID and inject them at runtime through secure channels.

Q: What happens if the configuration service goes down?
A: Services should cache configuration locally with reasonable TTLs and continue operating with last-known-good config. Implement circuit breakers around config fetching and graceful degradation when config services are unavailable.

Q: How do you test configuration changes before production deployment?
A: Use staged configuration environments - test config changes in staging first, validate service health, then promote to production. Feature flags can also test config changes on small user populations.

Q: What about configuration drift between environments?
A: Use configuration-as-code with version control and automated deployment pipelines. Treat configuration like infrastructure - define environments declaratively and deploy consistently through automation.

Q: How do you handle configuration changes that require coordination between multiple services?
A: Implement configuration change orchestration with dependency awareness. Some changes need staged rollouts across services, others need atomic updates. Design configuration schemas with backward compatibility to minimize coordination needs.

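Q: What’s the performance impact of polling configuration stores?
A: Modern config systems use HTTP caching, ETags, and push mechanisms to minimize overhead. When polling runs in the background, reads on the request path are in-memory lookups. If the refresh happens inline, as with a lazy refresh on read, the one request that triggers it pays for the round trip to the config store, so keep that call on a short timeout.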
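For the question above about the configuration service going down, the last-known-good idea can be sketched as a snapshot on local disk. `load_with_fallback` and the two stand-in fetch functions below are hypothetical; a real client would combine this with the in-memory cache and timeouts discussed earlier.

```python
import json
import os
import tempfile


def load_with_fallback(fetch, snapshot_path):
    """Try the config service; fall back to the last snapshot saved on disk."""
    try:
        config = fetch()
    except Exception:
        # Service unreachable: use the last-known-good copy. If no snapshot
        # exists yet, this raises FileNotFoundError and startup should fail.
        with open(snapshot_path) as f:
            return json.load(f), "snapshot"

    # Write to a temp file and rename, so a crash never leaves a
    # half-written snapshot behind.
    tmp_path = snapshot_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(config, f)
    os.replace(tmp_path, snapshot_path)
    return config, "service"


def healthy_service():
    return {"payment_timeout": 15000}


def broken_service():
    raise ConnectionError("config service unavailable")


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "last_good.json")
    first = load_with_fallback(healthy_service, path)
    second = load_with_fallback(broken_service, path)

print(first)    # ({'payment_timeout': 15000}, 'service')
print(second)   # ({'payment_timeout': 15000}, 'snapshot')
```

Returning the source alongside the config makes it easy to emit a metric or log line whenever a service is running on a snapshot.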

Interview Questions

Q: Design a configuration system that can handle both gradual feature rollouts and emergency feature kills.
Expected depth: Discuss push vs pull mechanisms, caching strategies, circuit breakers, health monitoring, and the tradeoffs between consistency and availability for config changes.

Q: How would you migrate a legacy system with 200+ hardcoded configuration values?
Expected depth: Cover migration strategies like strangler fig patterns, dual-read/dual-write approaches, risk assessment, rollback planning, and measuring success metrics.

Q: What are the security implications of externalized configuration?
Expected depth: Discuss secret management, access controls, audit logging, configuration tampering prevention, and the attack surface differences between hardcoded and external config.

Q: How do you ensure configuration consistency across a distributed system with 50+ microservices?
Expected depth: Cover configuration schemas, validation pipelines, dependency management, rollout coordination, and monitoring configuration drift across services.

Q: Design the monitoring and alerting strategy for a configuration management system.
Expected depth: Discuss metrics like config change frequency, rollback rates, service health correlation with config changes, and SLAs for configuration propagation times.
