The Config That Lived in the Code
Every configuration change becomes a deployment gamble when feature flags and timeouts are baked into your source code
It’s Friday at 4:47 PM when the alert hits: payment processing is failing for 30% of transactions. Marcus, the platform engineer, knows exactly what happened - the third-party payment API changed their timeout requirements from 10 seconds to 15 seconds yesterday. No big deal, right? Just change one line in the code.
Except that timeout is hardcoded as a constant in the payment service. Changing it requires a full deployment through three environments - dev, staging, and production. The deploy pipeline takes 45 minutes if everything goes perfectly. It’s Friday evening, the QA team has left, and every minute costs the company $2,400 in failed transactions.
The codebase is littered with hardcoded values: FEATURE_NEW_CHECKOUT = false, PAYMENT_TIMEOUT = 10000, THIRD_PARTY_API_URL = "https://api-v2.payments.com", MAX_RETRY_ATTEMPTS = 3. It’s like building a house where adjusting the thermostat requires ripping out walls and rewiring the electrical system. Each configuration change becomes an architectural decision, complete with code reviews, testing cycles, and deployment risks.
At $2,400 a minute, even a flawless 45-minute pipeline run costs $108,000 in failed transactions, all for a one-line change to a hardcoded value. This is the deployment coupling problem - where operational changes require engineering changes, turning simple tweaks into complex releases.
Why This Happens
The path to hardcoded configuration is gradual and logical. Early in development, hardcoding values is the fastest path forward - no infrastructure to build, no external dependencies to manage, no additional complexity to debug. A startup with 100 users doesn’t need sophisticated configuration management.
The problem compounds as systems grow:
initial simplicity
-> small config changes require deploys
-> deploys become risky and slow
-> teams avoid necessary adjustments
-> systems become fragile and inflexible
-> outages from inability to adapt quickly
The core issue is coupling operational concerns with development concerns - mixing what the system does with how it should behave in specific environments.
Hardcoded configuration creates deployment coupling - operational changes require engineering changes, turning runtime adjustments into development cycles.
The Naive Solution (and where it breaks)
Most teams first reach for environment variables - move constants out of code and into environment files:
import os

# Move from this (timeout in milliseconds)
PAYMENT_TIMEOUT = 10000
MAX_RETRIES = 3

# To this (values are read once, when the module is imported)
PAYMENT_TIMEOUT = int(os.environ.get('PAYMENT_TIMEOUT', 10000))
MAX_RETRIES = int(os.environ.get('MAX_RETRIES', 3))
Environment variables feel like a house with a separate control panel - you can adjust settings without rewiring, but you still need to turn the power off and on again to make changes take effect. The approach works for basic cases but hits limits quickly.
The problems emerge at scale:
Small scale: env vars work for basic config
Large scale: still requires restarts for changes,
no change history, no rollback capability,
difficult coordination across multiple services
Environment variables are better than hardcoded constants, but they’re still deployment-coupled. Changing a timeout still requires updating container configs, restarting services, and coordinating changes across environments.
Environment variables create restart coupling - you’ve moved config out of code but changes still require service restarts, making rapid adjustments impossible during incidents.
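The restart coupling is easy to reproduce in a few lines. The sketch below is illustrative only: `Settings`, `load_settings`, and the `startup_env` dictionary are hypothetical names standing in for a real process environment, and the defaults mirror the constants shown earlier.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    payment_timeout_ms: int
    max_retries: int


def load_settings(env=os.environ):
    """Read settings once; services typically do this at process startup."""
    return Settings(
        payment_timeout_ms=int(env.get("PAYMENT_TIMEOUT", 10000)),
        max_retries=int(env.get("MAX_RETRIES", 3)),
    )


# Simulate the environment the process was started with
startup_env = {"PAYMENT_TIMEOUT": "10000"}
settings = load_settings(startup_env)

# An operator later edits the environment definition...
startup_env["PAYMENT_TIMEOUT"] = "15000"

# ...but the running process still holds the value it read at startup
print(settings.payment_timeout_ms)                    # 10000
# Only a fresh load (in practice, a restart) sees the new value
print(load_settings(startup_env).payment_timeout_ms)  # 15000
```

The value is captured once and never revisited, which is exactly why an environment change only takes effect after the service restarts.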
External Configuration Store
Here’s what actually fixes this: move configuration to an external store that services can poll or receive updates from without restarts.
The pattern separates configuration data from application code entirely. Services read config from external sources and can pick up changes dynamically, turning configuration updates into data changes instead of code changes.
# Configuration service client with caching and refresh
import time
import requests

class ConfigService:
    def __init__(self, config_url, refresh_interval=30):
        self.config_url = config_url
        self.refresh_interval = refresh_interval  # seconds between refreshes
        self.cache = {}
        self.last_refresh = 0

    def get(self, key, default=None):
        # Refresh lazily once the cached copy is older than the interval
        if time.time() - self.last_refresh > self.refresh_interval:
            self.refresh_config()
        return self.cache.get(key, default)

    def refresh_config(self):
        # Record the attempt up front so a config-store outage does not
        # trigger a blocking HTTP call on every get()
        self.last_refresh = time.time()
        try:
            response = requests.get(f"{self.config_url}/config", timeout=2)
            response.raise_for_status()
            self.cache.update(response.json())
        except (requests.RequestException, ValueError):
            # Keep existing config on failure
            pass

# Usage in application code
config = ConfigService("https://config.internal.com")

def process_payment():
    timeout = config.get('payment_timeout', 10000)
    max_retries = config.get('max_retries', 3)
    # ... rest of payment logic
Services refresh their cached configuration at most every 30 seconds, picking up changes automatically. Updating a timeout becomes a data operation: change the value in the config store, and within about 30 seconds every service instance that reads the setting sees the new value.
Netflix’s open-source Archaius library is built around this model: services read dynamic properties that are refreshed from a configuration source at runtime, so a feature can be disabled or a timeout adjusted without a deployment.
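The client above refreshes inline inside `get()`, so whichever request trips the interval pays for the HTTP call. One possible variation, sketched here under assumptions, moves polling to a background thread and swaps in a whole snapshot at a time. `BackgroundConfig` and its `fetch` callable are hypothetical; in a real service `fetch` would wrap the HTTP request to the config store.

```python
import threading


class BackgroundConfig:
    """Keeps a config snapshot fresh off the request path."""

    def __init__(self, fetch, refresh_interval=30.0):
        self._fetch = fetch                  # callable returning a dict
        self._interval = refresh_interval    # seconds between polls
        self._snapshot = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def poll_once(self):
        """Fetch a full snapshot; keep the old one if the fetch fails."""
        try:
            fresh = dict(self._fetch())
        except Exception:
            return False                     # last-known-good stays in place
        with self._lock:
            self._snapshot = fresh           # swap, so deleted keys disappear
        return True

    def start(self):
        """Poll in a daemon thread until stop() is called."""
        def loop():
            while not self._stop.is_set():
                self.poll_once()
                self._stop.wait(self._interval)
        threading.Thread(target=loop, daemon=True).start()

    def stop(self):
        self._stop.set()

    def get(self, key, default=None):
        with self._lock:
            return self._snapshot.get(key, default)


# Usage with a stand-in fetch function instead of a real HTTP call
store = {"payment_timeout": 10000}
config = BackgroundConfig(fetch=lambda: store)
config.poll_once()
store["payment_timeout"] = 15000             # operator changes the value
config.poll_once()                           # next poll picks it up
print(config.get("payment_timeout"))         # 15000
```

Replacing the snapshot rather than merging into it means a key deleted in the store also disappears from the service, at the cost of requiring the store to always return the full configuration.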
Add Feature Flag Service
The second layer: dedicated feature flag management that goes beyond simple boolean toggles to provide gradual rollouts, user targeting, and instant rollbacks.
# Feature flag service with percentage rollouts and targeting
import hashlib

class FeatureFlagService:
    def __init__(self, flag_service_url):
        self.flag_service_url = flag_service_url
        self.cache = {}

    def is_enabled(self, flag_name, user_id=None, context=None):
        # get_flag_config and evaluate_rules are assumed to be defined elsewhere
        flag_config = self.get_flag_config(flag_name)
        if not flag_config:
            return False

        # Check user targeting first so explicitly targeted users always win
        if user_id is not None and user_id in flag_config.get('target_users', []):
            return True

        # Check context-based rules
        if 'rules' in flag_config and context:
            if self.evaluate_rules(flag_config['rules'], context):
                return True

        # Check percentage rollout. hashlib gives the same bucket in every
        # process; the built-in hash() is randomized per process for strings.
        if 'percentage' in flag_config and user_id is not None:
            digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
            if int(digest[:8], 16) % 100 < flag_config['percentage']:
                return True

        return flag_config.get('default', False)
Feature flags become sophisticated control mechanisms rather than simple on/off switches. You can enable features for 5% of users, target specific user segments, or enable features only in certain geographic regions.
Facebook’s Gatekeeper system applies the same idea at large scale: new features are exposed to small user populations first, and a feature that hurts key metrics can be switched off without shipping new code.
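Percentage rollouts only work if a user lands in the same bucket on every server and after every restart. Python's built-in `hash()` is randomized per process for strings, so a sketch like the one below uses `hashlib` instead. The helper names `rollout_bucket` and `in_rollout` and the synthetic user IDs are assumptions made for illustration.

```python
import hashlib


def rollout_bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 100


def in_rollout(flag_name, user_id, percentage):
    return rollout_bucket(flag_name, user_id) < percentage


users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if in_rollout("new_checkout", u, 5)}
at_20 = {u for u in users if in_rollout("new_checkout", u, 20)}

# Widening the rollout only adds users; nobody who had the feature loses it
print(at_5 <= at_20)            # True
print(len(at_5), len(at_20))    # roughly 500 and 2000
```

Because the bucket depends only on the flag name and user ID, raising the percentage from 5 to 20 keeps the original 5% enrolled, which avoids users seeing a feature appear and disappear during a ramp-up.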
Runtime Configuration Updates
The third layer: push-based configuration updates that eliminate polling delays and provide instant config changes during incidents.
# Kubernetes ConfigMap with automatic reload
apiVersion: v1
kind: ConfigMap
metadata:
  name: payment-service-config
data:
  config.yaml: |
    payment:
      timeout: 10000
      max_retries: 3
      api_url: "https://api-v2.payments.com"
    features:
      new_checkout: false
      enhanced_security: true
---
# Deployment with config mounting and reload capability
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
        - name: payment-service
          image: payment-service:latest  # placeholder image name
          volumeMounts:
            - name: config
              mountPath: /etc/config
          env:
            - name: CONFIG_PATH
              value: /etc/config/config.yaml
            # Read by the application itself; Kubernetes sends no signal
            - name: CONFIG_RELOAD_SIGNAL
              value: "SIGUSR1"
      volumes:
        - name: config
          configMap:
            name: payment-service-config
The service watches the mounted file and reloads its configuration when the contents change, so no container restart is needed. Kubernetes refreshes a mounted ConfigMap on its own after you update it, but not instantly: the kubelet syncs volumes periodically, so the new file usually appears after tens of seconds to a couple of minutes depending on kubelet settings, and mounts that use subPath never receive updates.
Once the file lands, every service instance picks it up without a rollout. Combined with health checks and circuit breakers, bad config changes can be detected and rolled back automatically.
Push-based configuration updates eliminate the polling delay and enable instant rollbacks - critical for incident response when every second matters.
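On the application side, reloading a mounted file can be as small as the sketch below. It is a minimal illustration, not the article's implementation: `ReloadableFileConfig` is a hypothetical class, the file is JSON rather than YAML so the example stays within the standard library, and change detection compares a content hash instead of relying on file timestamps.

```python
import hashlib
import json
import os
import tempfile


class ReloadableFileConfig:
    """Re-reads a JSON config file whenever its contents change."""

    def __init__(self, path):
        self.path = path
        self._digest = None
        self.data = {}
        self.reload_if_changed()

    def reload_if_changed(self):
        """Return True if new, valid config was loaded."""
        try:
            with open(self.path, "rb") as f:
                raw = f.read()
        except OSError:
            return False                  # file unreadable: keep current config
        digest = hashlib.sha256(raw).hexdigest()
        if digest == self._digest:
            return False                  # contents unchanged
        try:
            parsed = json.loads(raw)
        except ValueError:
            return False                  # keep last-known-good on a bad file
        self.data, self._digest = parsed, digest
        return True


# Usage: simulate an operator updating the mounted file
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump({"payment": {"timeout": 10000}}, f)

cfg = ReloadableFileConfig(path)
with open(path, "w") as f:
    json.dump({"payment": {"timeout": 15000}}, f)

print(cfg.reload_if_changed())         # True
print(cfg.data["payment"]["timeout"])  # 15000
```

A service would call `reload_if_changed()` from a short timer or from a signal handler. Rejecting a file that fails to parse keeps one malformed update from taking the service down.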
The Full Architecture
The architecture handles all configuration concerns: static configuration through external stores, feature flags through dedicated services, and runtime updates through push mechanisms.
Configuration flows through multiple channels based on urgency and type:
- Application config (timeouts, URLs, thresholds) comes from external config store with 30-second polling
- Feature flags come from dedicated flag service with real-time updates via WebSocket
- Emergency config changes go through Kubernetes ConfigMaps, with services reloading as soon as the mounted file changes
- All changes are versioned, audited, and rollback-capable through the config management layer
The separation enables different update patterns - feature flags can change instantly for rapid experimentation, while infrastructure config can have approval workflows and staged rollouts.
Configuration needs different update patterns - feature flags need instant changes for experimentation, infrastructure settings need approval workflows, and emergency changes need bypass mechanisms.
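One way to picture how these channels combine inside a single service is a layered lookup in which the most urgent source wins. The sketch below uses `collections.ChainMap` from the standard library; the three layer names and their contents are hypothetical.

```python
from collections import ChainMap

# Hypothetical layers, highest priority first
emergency_overrides = {}                      # pushed during incidents
store_config = {                              # polled from the config store
    "payment_timeout": 10000,
    "max_retries": 3,
}
code_defaults = {                             # last-resort values shipped with the code
    "payment_timeout": 10000,
    "max_retries": 3,
    "api_url": "https://api-v2.payments.com",
}

config = ChainMap(emergency_overrides, store_config, code_defaults)

print(config["payment_timeout"])     # 10000, from the config store
emergency_overrides["payment_timeout"] = 15000
print(config["payment_timeout"])     # 15000, the override wins immediately
del emergency_overrides["payment_timeout"]
print(config["payment_timeout"])     # 10000, removing the override is the rollback
```

Keeping overrides in their own layer means an emergency change never overwrites the reviewed value underneath it, so undoing it is a delete rather than a second edit.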
Component Deep Dives
Configuration Store Service
The config store’s job is providing versioned, auditable configuration with efficient caching and change propagation.
# Configuration store with versioning and change tracking
import json

class ConfigStore:
    def __init__(self, database_url):
        # DatabaseConnection stands in for whatever database client you use
        self.db = DatabaseConnection(database_url)

    def get_config(self, service_name, version='latest'):
        # Build the filter in Python; comparing an integer version column
        # with the string 'latest' inside SQL fails on strictly typed databases
        params = [service_name]
        version_filter = ""
        if version != 'latest':
            version_filter = "AND version = %s"
            params.append(version)
        query = f"""
            SELECT config_data, version, created_at
            FROM service_configs
            WHERE service_name = %s {version_filter}
            ORDER BY version DESC LIMIT 1
        """
        result = self.db.execute(query, params)
        if result:
            return {
                'data': json.loads(result['config_data']),
                'version': result['version'],
                'last_modified': result['created_at']
            }
        return None

    def update_config(self, service_name, config_data, user_id):
        # Increment version number. Run this read-then-insert in a single
        # transaction (or rely on a unique constraint on service_name + version)
        # so two concurrent writers cannot claim the same version.
        current_version = self.get_latest_version(service_name)
        new_version = current_version + 1

        # Store new config with audit trail
        self.db.execute("""
            INSERT INTO service_configs
            (service_name, version, config_data, updated_by, created_at)
            VALUES (%s, %s, %s, %s, NOW())
        """, [service_name, new_version, json.dumps(config_data), user_id])

        # Trigger change notifications
        self.notify_config_change(service_name, new_version)
Every configuration change is versioned and auditable. The system tracks who made changes, when they occurred, and provides rollback to any previous version.
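Rollback in an append-only store is usually just another write: the old version is re-published as the newest one, so the audit trail stays intact. The sketch below shows that idea with an in-memory SQLite table. The schema is a simplified assumption modelled on the `service_configs` table above, and `latest`, `update`, and `rollback` are hypothetical helpers.

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE service_configs (
        service_name TEXT NOT NULL,
        version INTEGER NOT NULL,
        config_data TEXT NOT NULL,
        updated_by TEXT NOT NULL,
        UNIQUE (service_name, version)
    )
""")


def latest(service):
    """Return (version, data) for the newest config, or (0, None)."""
    row = db.execute(
        "SELECT version, config_data FROM service_configs "
        "WHERE service_name = ? ORDER BY version DESC LIMIT 1",
        (service,),
    ).fetchone()
    return (row[0], json.loads(row[1])) if row else (0, None)


def update(service, data, user):
    version = latest(service)[0] + 1
    db.execute(
        "INSERT INTO service_configs VALUES (?, ?, ?, ?)",
        (service, version, json.dumps(data), user),
    )
    return version


def rollback(service, to_version, user):
    """Roll back by re-publishing an old version as the newest one."""
    row = db.execute(
        "SELECT config_data FROM service_configs "
        "WHERE service_name = ? AND version = ?",
        (service, to_version),
    ).fetchone()
    if row is None:
        raise LookupError(f"{service} has no version {to_version}")
    return update(service, json.loads(row[0]), user)


update("payments", {"payment_timeout": 10000}, "marcus")   # version 1
update("payments", {"payment_timeout": 1500}, "marcus")    # version 2, a typo
rollback("payments", 1, "marcus")                          # version 3
print(latest("payments"))   # (3, {'payment_timeout': 10000})
```

The bad version 2 is never deleted, so the history still shows what was live during the incident and who reverted it.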
Feature Flag Management
Feature flag services require sophisticated evaluation engines that can handle complex targeting rules and gradual rollouts.
# Feature flag evaluation with complex targeting
import hashlib

class FeatureFlagEngine:
    def evaluate_flag(self, flag_name, context):
        flag = self.get_flag_definition(flag_name)

        # Evaluate targeting rules in priority order
        for rule in flag.get('rules', []):
            if self.evaluate_rule(rule, context):
                return rule['variation']

        # Fall back to percentage rollout
        if 'rollout' in flag:
            return self.evaluate_percentage_rollout(
                flag_name,
                flag['rollout'],
                context.get('user_id')
            )

        return flag.get('default_variation', False)

    def evaluate_percentage_rollout(self, flag_name, rollout, user_id):
        # Consistent hash-based distribution. Including the flag name keeps
        # the same users from landing in the early buckets of every flag.
        user_hash = hashlib.sha1(f"{flag_name}:{user_id}".encode()).hexdigest()
        user_percentage = int(user_hash[:8], 16) % 100

        # Bucket percentages are cumulative upper bounds, e.g. 10 then 50
        for bucket in rollout['buckets']:
            if user_percentage < bucket['percentage']:
                return bucket['variation']
        return False
The evaluation engine supports complex rules like “enable for users in California with premium accounts” while maintaining consistent assignment - the same user always gets the same variation.
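A rule such as "users in California with premium accounts" can be expressed as a list of conditions that must all hold. The rule format, the operator names in `OPERATORS`, and the attribute names below are assumptions chosen for illustration, not the format of any particular flag product.

```python
import operator

# Hypothetical operator names for the rule format below
OPERATORS = {
    "eq": operator.eq,
    "in": lambda value, allowed: value in allowed,
    "gte": operator.ge,
}


def evaluate_rule(rule, context):
    """A rule matches only if every one of its conditions matches."""
    for cond in rule["conditions"]:
        if cond["attribute"] not in context:
            return False  # missing attributes never match
        compare = OPERATORS[cond["op"]]
        if not compare(context[cond["attribute"]], cond["value"]):
            return False
    return True


california_premium = {
    "conditions": [
        {"attribute": "state", "op": "eq", "value": "CA"},
        {"attribute": "plan", "op": "in", "value": ["premium", "enterprise"]},
    ],
    "variation": True,
}

print(evaluate_rule(california_premium, {"state": "CA", "plan": "premium"}))  # True
print(evaluate_rule(california_premium, {"state": "CA", "plan": "free"}))     # False
print(evaluate_rule(california_premium, {"plan": "premium"}))                 # False
```

Treating a missing attribute as a non-match is a deliberately conservative choice: an anonymous or partially identified user falls through to the percentage rollout or the default rather than matching a targeting rule by accident.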
Configuration Change Pipeline
Robust configuration management requires approval workflows, staged rollouts, and automatic validation.
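# Configuration change pipeline with approvals
name: config-change-pipeline
on:
  pull_request:
    # 'closed' is included so the merged check on deploy-production can be true
    types: [opened, synchronize, reopened, closed]
    paths: ['configs/**']
jobs:
  validate-config:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Configuration Schema
        run: |
          # Validate all config files against JSON schemas
          # (assumes the yq and jsonschema CLIs are installed on the runner)
          for config in configs/*.yaml; do
            yq eval -o=json '.' "$config" > /tmp/config.json
            jsonschema -i /tmp/config.json schemas/config-schema.json
          done
      - name: Test Configuration Changes
        run: |
          # Run integration tests with new config
          docker-compose -f test-compose.yml up --abort-on-container-exit
  deploy-staging:
    needs: validate-config
    # Approval gate: assumes required reviewers are configured on the
    # 'staging' environment; the pull_request payload has no 'approved' field
    environment: staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Staging
        # Assumes kubectl is already authenticated against the cluster
        run: kubectl apply -f configs/staging/
      - name: Validate Staging Health
        run: |
          # Wait for services to stabilize with new config
          sleep 30
          curl -f https://staging-api.company.com/health
  deploy-production:
    needs: deploy-staging
    if: github.event.pull_request.merged == true
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Production
        run: kubectl apply -f configs/production/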
Configuration changes go through the same rigor as code changes - validation, testing, staged deployment, and health checks - but without requiring application deployments.
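Schema checks catch malformed files, but a typo such as a timeout of 15 instead of 15000 is structurally valid. A small range check run in the validation stage can catch that class of mistake. The sketch below is an assumption-heavy illustration: `validate_config` is a hypothetical helper and the allowed ranges are invented bounds, not values from the original system.

```python
def validate_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    payment = config.get("payment", {})

    timeout = payment.get("timeout")
    if not isinstance(timeout, int) or isinstance(timeout, bool):
        errors.append("payment.timeout must be an integer (milliseconds)")
    elif not 100 <= timeout <= 60_000:
        errors.append("payment.timeout must be between 100 and 60000 ms")

    retries = payment.get("max_retries")
    if not isinstance(retries, int) or isinstance(retries, bool):
        errors.append("payment.max_retries must be an integer")
    elif not 0 <= retries <= 10:
        errors.append("payment.max_retries must be between 0 and 10")

    url = payment.get("api_url", "")
    if not isinstance(url, str) or not url.startswith("https://"):
        errors.append("payment.api_url must be an https:// URL")

    return errors


good = {"payment": {"timeout": 15000, "max_retries": 3,
                    "api_url": "https://api-v2.payments.com"}}
typo = {"payment": {"timeout": 15, "max_retries": 3,
                    "api_url": "https://api-v2.payments.com"}}

print(validate_config(good))   # []
print(validate_config(typo))   # ['payment.timeout must be between 100 and 60000 ms']
```

Returning every problem at once, rather than failing on the first, gives the person fixing the pull request a complete list in a single pipeline run.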
Configuration Monitoring
Configuration systems need monitoring to detect bad config changes and enable rapid rollbacks.
# Configuration health monitoring
import time

class ConfigHealthMonitor:
    def __init__(self, metrics_client):
        self.metrics = metrics_client

    def monitor_config_change(self, service_name, config_version):
        start_time = time.time()

        # Baseline must come from the window before the config change
        baseline_metrics = self.get_baseline_metrics(service_name)

        while time.time() - start_time < 300:  # Monitor for 5 minutes
            current_metrics = self.get_current_metrics(service_name)

            # Check for significant degradation
            if self.detect_degradation(baseline_metrics, current_metrics):
                self.trigger_rollback(service_name, config_version - 1)
                break

            time.sleep(30)

    def detect_degradation(self, baseline, current):
        # Alert on significant relative increases in errors or latency.
        # The floors avoid dividing by zero when the baseline is perfectly clean.
        error_rate_increase = (current.error_rate - baseline.error_rate) / max(baseline.error_rate, 0.001)
        latency_increase = (current.p95_latency - baseline.p95_latency) / max(baseline.p95_latency, 1.0)
        return error_rate_increase > 0.5 or latency_increase > 0.3
The monitoring system watches service health metrics after configuration changes and automatically rolls back if key metrics degrade significantly.
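A few concrete numbers make the thresholds easier to reason about. The sketch below replays the same relative-increase check on sample metrics. The `Metrics` dataclass, the `relative_increase` helper, and the floor values are assumptions added so the example runs on its own; the 50% and 30% thresholds come from the monitor above.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float     # fraction of failed requests, e.g. 0.02 for 2%
    p95_latency: float    # milliseconds


def relative_increase(baseline, current, floor):
    """Relative change versus baseline, with a floor to avoid dividing by zero."""
    return (current - baseline) / max(baseline, floor)


def detect_degradation(baseline, current):
    error_up = relative_increase(baseline.error_rate, current.error_rate, floor=0.001)
    latency_up = relative_increase(baseline.p95_latency, current.p95_latency, floor=1.0)
    return error_up > 0.5 or latency_up > 0.3


baseline = Metrics(error_rate=0.02, p95_latency=400)

# Errors up 25% and latency up 12.5%: within tolerance
print(detect_degradation(baseline, Metrics(0.025, 450)))           # False
# Latency up 40%: past the 30% threshold
print(detect_degradation(baseline, Metrics(0.02, 560)))            # True
# A zero-error baseline still triggers once errors appear
print(detect_degradation(Metrics(0.0, 400), Metrics(0.05, 400)))   # True
```

Relative thresholds are sensitive when the baseline is tiny, so a real monitor would likely pair them with an absolute minimum, such as ignoring error rates below some small fraction of traffic.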
Comparison Table
| Approach | Change Speed | Rollback Capability | Audit Trail | Operational Risk | Best Use Case |
|---|---|---|---|---|---|
| Hardcoded | Hours (deploy cycle) | None (requires new deploy) | Git history only | High (code changes for config) | Development/prototypes |
| Environment Variables | Minutes (restart) | Manual (restore env) | Limited | Medium (restart disruption) | Simple service configuration |
| External Config Store | Seconds (poll interval) | Instant (version rollback) | Complete | Low (data-only changes) | Runtime application settings |
| Feature Flag Service | Instant | Instant (toggle off) | Complete with targeting | Very Low | Feature rollouts, A/B testing |
| Hybrid System | Instant to minutes | Instant | Complete | Very Low | Production systems at scale |
The hybrid approach provides the flexibility to handle all configuration scenarios - instant feature toggles for experiments, validated config changes for infrastructure settings, and emergency overrides when needed.
Key Takeaways
- Deployment coupling occurs when operational changes require development changes, making simple adjustments into complex releases
- External configuration stores decouple config from deployments, enabling runtime changes without service restarts
- Feature flag services provide sophisticated targeting and rollout capabilities beyond simple boolean toggles
- Runtime configuration updates eliminate polling delays and enable instant rollbacks during incidents
- Configuration versioning provides audit trails and rollback capabilities essential for production systems
- Approval workflows balance change velocity with operational safety for different types of configuration
- Health monitoring detects bad configuration changes and enables automatic rollbacks before user impact
- Push-based updates are critical for incident response when every second counts
The pattern works for any system where operational flexibility matters more than architectural simplicity. Design configuration systems like data systems - with versioning, monitoring, rollback capabilities, and appropriate access controls.
Frequently Asked Questions
Q: How do you handle configuration secrets with external config stores?
A: Use dedicated secret management systems like HashiCorp Vault or AWS Secrets Manager. Never store secrets in plain text config stores. Reference secrets by ID and inject them at runtime through secure channels.
Q: What happens if the configuration service goes down?
A: Services should cache configuration locally with reasonable TTLs and continue operating with last-known-good config. Implement circuit breakers around config fetching and graceful degradation when config services are unavailable.
Q: How do you test configuration changes before production deployment?
A: Use staged configuration environments - test config changes in staging first, validate service health, then promote to production. Feature flags can also test config changes on small user populations.
Q: What about configuration drift between environments?
A: Use configuration-as-code with version control and automated deployment pipelines. Treat configuration like infrastructure - define environments declaratively and deploy consistently through automation.
Q: How do you handle configuration changes that require coordination between multiple services?
A: Implement configuration change orchestration with dependency awareness. Some changes need staged rollouts across services, others need atomic updates. Design configuration schemas with backward compatibility to minimize coordination needs.
Q: What’s the performance impact of polling configuration stores?
A: Modern config systems use HTTP caching, ETags, and push mechanisms to minimize overhead. When the poll runs in the background and requests read from an in-memory cache, a 30-second interval adds negligible overhead to request processing. Polling inline on the request path, by contrast, adds a full round trip to whichever request triggers the refresh.
Interview Questions
Q: Design a configuration system that can handle both gradual feature rollouts and emergency feature kills.
Expected depth: Discuss push vs pull mechanisms, caching strategies, circuit breakers, health monitoring, and the tradeoffs between consistency and availability for config changes.
Q: How would you migrate a legacy system with 200+ hardcoded configuration values?
Expected depth: Cover migration strategies like strangler fig patterns, dual-read/dual-write approaches, risk assessment, rollback planning, and measuring success metrics.
Q: What are the security implications of externalized configuration?
Expected depth: Discuss secret management, access controls, audit logging, configuration tampering prevention, and the attack surface differences between hardcoded and external config.
Q: How do you ensure configuration consistency across a distributed system with 50+ microservices?
Expected depth: Cover configuration schemas, validation pipelines, dependency management, rollout coordination, and monitoring configuration drift across services.
Q: Design the monitoring and alerting strategy for a configuration management system.
Expected depth: Discuss metrics like config change frequency, rollback rates, service health correlation with config changes, and SLAs for configuration propagation times.