Select * From Everything
performance databases api-design
System Design Scenario
Select * From Everything
When one lazy query multiplied by 10K requests/second turns your database into a data courier for columns nobody reads
It’s 3:14 AM on a Thursday and the PagerDuty alert reads db_connection_pool_saturated. You SSH into the primary PostgreSQL instance and run pg_stat_activity. There are 247 active queries, all identical: SELECT * FROM users WHERE active = true. The users table has 47 columns. The endpoint consuming this query needs three of them: id, name, and avatar_url. The other 44 columns - including a 4 KB preferences JSONB blob, a bio text field averaging 800 bytes, and an address column nobody has queried intentionally since 2021 - travel from disk to PostgreSQL shared buffers, across the network to your application server, get deserialized into objects, serialized back to JSON, and sent over the wire to a mobile client that immediately discards everything except those three fields.
Imagine ordering a pizza and the restaurant sends the entire menu with your delivery. Every single time. Ten thousand times per second.
The math is brutal. Each row weighs roughly 12 KB when fully hydrated. The three fields the client actually reads total about 200 bytes. At 10,000 requests per second, your database is shipping 120 MB/s of data when 2 MB/s would suffice. That’s 118 MB/s of pure waste - wasted disk I/O reading columns from heap pages, wasted memory in shared buffers caching data nobody reads, wasted CPU serializing JSONB blobs into wire format, wasted network bandwidth between your database and application tier.
The database isn’t slow. It’s doing exactly what you asked. You asked for everything, so it’s giving you everything. The problem isn’t the database’s performance - it’s that 94% of its effort produces data that gets immediately thrown away.
GitHub discovered this pattern in 2012 when they traced why their Rails API was consistently slower than expected. The ORM’s default behavior - User.find(id) - generated SELECT * on a table with dozens of columns including serialized preference hashes. Shopify’s engineering blog documented a similar finding: switching from SELECT * to explicit column lists on their highest-traffic product endpoints reduced PostgreSQL CPU usage by 35%.
This is the over-fetching problem.
Why This Happens
The root cause is a mismatch between the convenience API developers interact with and the actual cost of database operations. ORMs are the primary vector. When you write User.objects.get(id=123) in Django or User.find(123) in Rails, the generated SQL is SELECT * FROM users WHERE id = 123. The ORM doesn’t know which attributes you’ll access on the resulting object, so it fetches all of them defensively.
This made sense when tables had 8 columns and traffic was 50 requests per second. It stops making sense when tables accumulate years of feature additions and traffic grows three orders of magnitude.
The failure chain looks like this:
Developer writes: User.objects.filter(active=True)
ORM generates: SELECT * FROM users WHERE active = true
PostgreSQL does: Sequential scan → reads all 47 columns from heap pages
→ fills shared_buffers with full rows
→ serializes all columns to wire protocol
Network carries: 12 KB per row × 10,000 rows/sec = 120 MB/s
App server does: Deserialize all 47 fields into memory objects
→ serialize to JSON (still all 47 fields)
→ GC pressure from short-lived large objects
Client receives: 12 KB JSON payload, reads 3 fields, discards rest
The damage compounds at every layer. PostgreSQL can’t use a covering index (which serves queries entirely from the index without touching heap pages) because SELECT * by definition requires the full row. The operating system page cache fills with complete 8 KB heap pages when only a fraction of each page contains relevant data. The application server’s garbage collector works overtime processing objects that are 94% waste. Network buffers carry 60x more data than necessary.
Core Insight
Over-fetching isn’t a single point of waste - it’s a tax applied at every layer of the stack. Fix it at the query level and you get compounding savings in I/O, memory, network, serialization, and GC pressure simultaneously.
The Naive Solution
The first instinct most engineers have is to add a cache. If the database is doing too much work, put Redis in front of it and serve repeated requests from memory. This approach treats the symptom (database load) while ignoring the disease (fetching data nobody needs).
Here’s what happens when you cache the over-fetched response: you now have 12 KB entries in Redis for data that should be 200 bytes. Your Redis cluster fills up 60x faster than necessary. Cache evictions happen more frequently because each entry is bloated. Memory costs scale with the waste, not with the useful data.
The other naive fix is pagination - fetch fewer rows per request. This reduces the per-request payload but doesn’t address the per-row waste. You’re still fetching 47 columns for every row; you’re just fetching fewer rows at a time. The database still can’t use covering indexes. The per-row I/O cost is identical.
With cache (naive):
Cache HIT: Redis serves 12 KB (still 94% waste in memory)
Cache MISS: DB query still fetches all 47 columns
Result: Redis fills up 60x faster, eviction pressure increases
With pagination (naive):
Page size 20: 20 rows × 12 KB = 240 KB per page
Needed: 20 rows × 200 B = 4 KB per page
Result: Same 94% waste ratio, just in smaller batches
Neither approach fixes the fundamental problem: you’re asking for data you don’t need. The cache just moves the waste from the database layer to the caching layer. Pagination just splits the waste into smaller pieces.
Common Trap
Caching over-fetched data is like putting a faster engine in a truck that’s hauling empty pallets. You move the waste more quickly, but you’re still hauling waste. Fix the payload first, then optimize delivery.
The Better Solution
The fix operates at three layers: the query layer (what you ask the database for), the API layer (what you expose to clients), and the monitoring layer (how you detect regressions). Each layer reinforces the others.
Layer 1: Column Projection at the Query Level
Replace SELECT * with explicit column lists. In raw SQL, this is trivial. The challenge is making ORMs cooperate.
# Django: explicit column selection
users = User.objects.filter(active=True).only('id', 'name', 'avatar_url')
# Generates: SELECT id, name, avatar_url FROM users WHERE active = true
# SQLAlchemy: load only specific columns
from sqlalchemy import select
stmt = select(User.id, User.name, User.avatar_url).where(User.active == True)
-- The covering index that makes this fast
CREATE INDEX idx_users_active_projection
ON users (active) INCLUDE (id, name, avatar_url);
-- PostgreSQL can now serve this query entirely from the index
-- No heap page access needed (index-only scan)
EXPLAIN SELECT id, name, avatar_url FROM users WHERE active = true;
-- Index Only Scan using idx_users_active_projection
The covering index is the key performance unlock. When all requested columns exist in the index (either as key columns or INCLUDE columns), PostgreSQL performs an index-only scan - it reads data directly from the index pages without touching the heap. This eliminates random I/O to heap pages entirely.
Real-World Result
Shopify reported that adding covering indexes for their top 10 product listing queries reduced PostgreSQL read I/O by 62% and P99 latency from 180ms to 12ms on their catalog service.
Layer 2: Sparse Fieldsets at the API Level
Let clients declare which fields they need. This is the sparse fieldsets pattern, standardized in JSON:API and supported natively by GraphQL.
# Flask endpoint with sparse fieldsets (JSON:API style)
@app.route('/api/users')
def get_users():
requested_fields = request.args.get('fields[users]', 'id,name,avatar_url')
allowed_fields = {'id', 'name', 'email', 'avatar_url', 'created_at'}
# Validate against allowlist (prevent SQL injection / info leaks)
fields = set(requested_fields.split(',')) & allowed_fields
if not fields:
fields = {'id', 'name', 'avatar_url'} # safe default
users = db.session.query(*[getattr(User, f) for f in fields])\
.filter(User.active == True).all()
return jsonify([{f: getattr(u, f) for f in fields} for u in users])
# GraphQL: clients request exactly what they need
query {
users(active: true) {
id
name
avatarUrl
}
}
# The resolver fetches ONLY id, name, avatar_url from the database
GraphQL solves the over-fetching vs under-fetching tradeoff structurally. Clients describe their data requirements in the query itself, and the server can use that declaration to build minimal database queries. The tradeoff is complexity - you need a schema, resolvers, and careful attention to the N+1 query problem that GraphQL introduces.
For REST APIs, Google’s API design guide recommends FieldMask for gRPC services and fields query parameters for JSON APIs. Both follow the same principle: the client tells the server what it needs, and the server fetches only that.
Real-World Result
Netflix’s migration from a REST API serving full entity payloads to a GraphQL gateway reduced mobile app bandwidth consumption by 40% and cut median API response times in half, primarily by eliminating over-fetching on their title detail endpoints.
Layer 3: Query Analysis and Monitoring
You need visibility into which columns are actually consumed versus which columns are fetched. Without this, over-fetching creeps back in with every new feature addition.
# Middleware that tracks column usage (production sampling)
import random
from functools import wraps
class ColumnUsageTracker:
def __init__(self, sample_rate=0.01):
self.sample_rate = sample_rate
def track(self, endpoint, fetched_columns, accessed_columns):
if random.random() > self.sample_rate:
return
waste_ratio = 1 - (len(accessed_columns) / len(fetched_columns))
metrics.gauge('column_waste_ratio', waste_ratio,
tags=[f'endpoint:{endpoint}'])
if waste_ratio > 0.5:
logger.warning(f"Over-fetching detected: {endpoint} "
f"fetches {len(fetched_columns)} columns, "
f"uses {len(accessed_columns)}")
The Full Architecture
The happy path flows like this: a client request arrives with a fields parameter (REST), a GraphQL query document, or a gRPC FieldMask. The API gateway extracts the requested field set and validates it against an allowlist. The field resolver checks for field dependencies (requesting full_name might require fetching both first_name and last_name). The query builder generates a SELECT statement with only the projected columns. The database serves the query using a covering index where possible. The response serializer builds a JSON payload containing exactly the requested fields. A query analyzer samples requests and logs waste ratios for monitoring.
Component Deep Dives
Field Allowlist and Dependency Resolution
Not every column should be exposable to clients. Sensitive fields (ssn_hash, password_digest) must be excluded regardless of what the client requests. Some fields have dependencies - requesting a computed field may require fetching its source columns.
# Field configuration with dependencies and access control
FIELD_CONFIG = {
'users': {
'id': {'type': 'uuid', 'always_include': True},
'name': {'type': 'str', 'db_column': 'name'},
'email': {'type': 'str', 'db_column': 'email'},
'avatar_url': {'type': 'str', 'db_column': 'avatar_url'},
'full_name': {
'type': 'str',
'computed': True,
'depends_on': ['first_name', 'last_name'],
},
'created_at': {'type': 'datetime', 'db_column': 'created_at'},
# Never exposed regardless of request
'ssn_hash': {'type': 'str', 'internal_only': True},
'password_digest': {'type': 'str', 'internal_only': True},
}
}
def resolve_fields(resource: str, requested: set[str]) -> set[str]:
config = FIELD_CONFIG[resource]
db_columns = set()
for field in requested:
if field not in config or config[field].get('internal_only'):
continue
if config[field].get('computed'):
db_columns.update(config[field]['depends_on'])
else:
db_columns.add(config[field].get('db_column', field))
# Always include primary key
db_columns.add('id')
return db_columns
Dynamic Query Builder with Injection Prevention
The query builder must translate a set of validated column names into a safe SQL query. Never interpolate column names from user input directly - even after allowlist validation, use parameterized approaches.
from sqlalchemy import select, column
from sqlalchemy.orm import Session
def build_projected_query(model, columns: set[str], filters: dict):
"""Build a SELECT with only the requested columns."""
# Map validated field names to SQLAlchemy column objects
sa_columns = [getattr(model, col) for col in columns if hasattr(model, col)]
if not sa_columns:
raise ValueError("No valid columns requested")
stmt = select(*sa_columns)
for key, value in filters.items():
if hasattr(model, key):
stmt = stmt.where(getattr(model, key) == value)
return stmt
# Usage
columns = resolve_fields('users', {'id', 'name', 'avatar_url'})
query = build_projected_query(User, columns, {'active': True})
# Generates: SELECT users.id, users.name, users.avatar_url
# FROM users WHERE users.active = true
Covering Index Strategy
For your highest-traffic endpoints, create covering indexes that include all columns in the projection. This enables index-only scans and eliminates heap access entirely.
-- Identify top queries by column access pattern
SELECT query, calls, mean_exec_time
FROM pg_stat_statements
WHERE query LIKE '%FROM users%'
ORDER BY calls * mean_exec_time DESC
LIMIT 10;
-- Create covering indexes for the top access patterns
-- Pattern 1: User list (most common - 10K rps)
CREATE INDEX idx_users_list_v1
ON users (active) INCLUDE (id, name, avatar_url)
WHERE active = true;
-- Pattern 2: User profile card
CREATE INDEX idx_users_profile_v1
ON users (id) INCLUDE (name, email, avatar_url, bio);
-- Monitor index-only scan effectiveness
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public' AND relname = 'users';
Real-World Result
GitHub’s API team reported that after implementing field-aware covering indexes on their repository metadata queries, their PostgreSQL instances dropped from 78% CPU utilization to 31% without any infrastructure changes - same hardware, same traffic, just fetching less data per query.
Comparison Table
| Approach | Over-fetch Fix | Client Flexibility | Implementation Cost | N+1 Risk | Best For |
|---|---|---|---|---|---|
SELECT * (default ORM) | None | None - fixed payload | Zero | Low | Prototypes, internal tools |
| Explicit column lists | Full (server-side) | None - server decides | Low | Low | Single-client APIs, batch jobs |
| Sparse fieldsets (REST) | Full | High - client chooses | Medium | Low | Public REST APIs, mobile apps |
| GraphQL | Full | Maximum - per-query | High | High (needs DataLoader) | Multi-client platforms, BFF |
| gRPC FieldMask | Full | High - typed contract | Medium | Low | Inter-service communication |
Key Takeaways
- Column projection is the single highest-leverage database optimization most teams ignore. Replacing
SELECT *with explicit column lists can reduce I/O by 90%+ on wide tables. - Covering indexes turn projection into an index-only scan, eliminating heap page access entirely. Create them for your top 5 query patterns.
- Sparse fieldsets give REST clients the ability to request only what they need without the complexity of a full GraphQL layer.
- GraphQL solves the over-fetching vs under-fetching tradeoff structurally but introduces N+1 complexity that requires DataLoader or equivalent batching.
- Query analysis monitoring is not optional. Without tracking which columns are actually accessed in production, over-fetching will creep back in with every schema migration.
- Every additional column in a
SELECTadds cost at five layers: disk I/O, memory buffers, network transfer, application deserialization, and client processing. - The ORM is not your enemy - it just needs explicit guidance.
.only(),.defer(), andselect()exist in every major ORM. Use them. - gRPC’s
FieldMaskprovides typed, contract-enforced projection for service-to-service calls where GraphQL’s flexibility isn’t needed.
The pattern here applies beyond databases. Over-fetching happens in API-to-API calls, event payloads, cache entries, and log lines. Any time you’re moving data across a boundary, ask: does the consumer actually need all of this? If not, don’t send it.
FAQ
Q: Won’t explicit column lists break when I add new columns to the table?
No - that’s the point. With SELECT *, adding a new column silently increases every query’s payload. With explicit columns, new columns are opt-in. The API contract is explicit. You add the column to the projection only when a consumer needs it.
Q: Doesn’t GraphQL solve this entirely? Why bother with sparse fieldsets?
GraphQL solves the client-facing API problem elegantly, but it introduces server-side complexity: schema management, resolver architecture, N+1 query prevention (DataLoader), and query cost analysis to prevent abuse. For teams with a single frontend client or low API traffic, sparse fieldsets on REST achieve 80% of the benefit at 20% of the operational cost.
Q: How do I handle computed fields that depend on multiple columns?
Build a dependency graph. If the client requests full_name, your field resolver knows it needs to fetch first_name and last_name from the database. The computation happens in the application layer after the minimal query returns.
Q: What about ORM lazy-loading - doesn’t that solve over-fetching?
Lazy-loading trades over-fetching for under-fetching. Instead of one query with 47 columns, you get one query with 3 columns followed by N additional queries when you access a deferred attribute in a loop. This is the N+1 problem, and it’s often worse than over-fetching. Explicit .only() with known access patterns is the correct approach.
Q: How do I measure the actual impact of column projection?
Compare EXPLAIN ANALYZE output before and after. Look for the switch from Seq Scan or Index Scan to Index Only Scan. Monitor pg_stat_user_tables.seq_tup_read vs idx_tup_fetch. At the network layer, measure response payload sizes. A 60x reduction in payload size at 10K RPS translates directly into measurable bandwidth, CPU, and latency improvements.
Q: Should I create a covering index for every query pattern?
No. Indexes have write-side costs (every INSERT/UPDATE must maintain the index) and storage costs. Create covering indexes for your top 3-5 read-heavy query patterns - the ones accounting for 80%+ of your read traffic. Use pg_stat_statements to identify them.
Interview Questions
Q: An endpoint returns a user object with 40+ fields but mobile clients only use 5 of them. Walk me through how you’d reduce the data transfer.
Expected depth: Start at the API layer (sparse fieldsets or GraphQL), show how field selection propagates to the query layer (explicit SELECT), explain covering indexes for index-only scans, discuss the monitoring layer to detect regressions. Mention the tradeoff between client flexibility and server complexity. Touch on cache implications - projected responses can use field-set-specific cache keys.
Q: Compare GraphQL, REST sparse fieldsets, and gRPC FieldMask for solving over-fetching. When would you choose each?
Expected depth: GraphQL for multi-client platforms where each client has different data needs (web vs mobile vs third-party). REST sparse fieldsets for public APIs where simplicity matters and you don’t want to maintain a GraphQL schema. gRPC FieldMask for inter-service communication where type safety and performance matter more than client flexibility. Discuss the N+1 tradeoff GraphQL introduces and how DataLoader solves it.
Q: You’ve identified that a PostgreSQL table is serving queries that fetch all columns but consumers use only 3. How do you implement the fix without breaking existing clients?
Expected depth: Phased rollout. First, add query analysis to measure actual column usage. Second, create covering indexes for the projected column set. Third, add fields parameter support to the API with the current behavior as the default. Fourth, update clients one by one to request only needed fields. Fifth, deprecate the full-payload default after migration. Discuss backward compatibility and versioning.
Q: Your query analyzer shows that 70% of API endpoints over-fetch by more than 50%. How do you prioritize and fix this at scale?
Expected depth: Sort by (waste_ratio * request_rate) to find highest-impact endpoints first. Automate the covering index creation for common patterns. Add ORM linting rules that flag SELECT * in CI. Build a dashboard that tracks column waste ratios per endpoint over time. Discuss the organizational challenge - over-fetching often happens because teams don’t know what downstream consumers actually read.
Premium Content
Unlock the full article along with everything else in the archive — all in one place.