Published: 2026-03-01
27 min read

How to Optimise Backend Performance: A Practical Playbook

Backend Performance Database Caching Django DevOps

TL;DR: Backend performance work is a loop: observe, profile, fix, verify. This post covers the full cycle: setting up observability, identifying bottlenecks with percentile metrics, applying targeted fixes (N+1 queries, indexing, caching, async offloading), and verifying improvements against p75/p95/p99 latency targets.


Why Percentiles Matter More Than Averages

Average response time is misleading. An endpoint averaging 80 ms might seem fine until you realise 5% of your users are waiting 800 ms or more.

Percentile metrics give you the actual picture:

Metric | What It Tells You
p50 (median) | The typical user experience
p75 | Where the experience starts degrading
p95 | The worst experience for most users
p99 | The tail: your worst-case under normal load

The goal I worked towards: p95 under 200 ms, p99 under 500 ms, and critical queries completing in under 50 ms.

When you optimise, you're compressing the gap between p50 and p99. A fast median with a slow tail means your system is unpredictable, and users notice unpredictability more than raw speed.
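To make the gap concrete, here is a tiny nearest-rank percentile helper (a generic sketch, not from any monitoring library) applied to simulated latencies where the average looks healthy but the tail does not:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample value that at least
    p% of all samples are less than or equal to."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# 100 simulated response times: 90 fast requests, 10 slow ones
latencies = [80] * 90 + [800] * 10

average = sum(latencies) / len(latencies)  # 152 ms, looks acceptable
p50 = percentile(latencies, 50)            # 80 ms, the typical experience
p95 = percentile(latencies, 95)            # 800 ms, the tail the average hides
```

With real data you would feed in per-request durations from your APM export; the point is that a 152 ms average and an 800 ms p95 describe the same traffic.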


Step 1: Establish Observability

Before touching any code, you need visibility into what your system is actually doing. I've seen teams spend weeks optimising the wrong endpoint because they didn't have the data to tell them where the real problems were.

Application Performance Monitoring (APM)

APM tools trace requests end-to-end through your stack. They break down where time goes: application code, database queries, external API calls, template rendering, serialisation.

Tools: Datadog APM, New Relic, Elastic APM, Jaeger (open-source)

What to look for in APM data:

  • Flame graphs give you a visual breakdown of time spent in each function call
  • Trace waterfalls show sequential vs. parallel execution of sub-operations
  • Service maps lay out which services call which, and where dependencies bottleneck

Database Profiling

Most backend latency lives in the database layer. Profiling queries tells you exactly which ones are slow and why.

Tools: Datadog Database Monitoring, pganalyze (PostgreSQL), django-debug-toolbar (local development). For a hands-on walkthrough of using django-debug-toolbar and snakeviz for local profiling, see my Case Study: Profiling Django APIs with Debug Toolbar and snakeviz.

Key metrics to track:

  • Query execution time: how long the database spends running each query
  • Query frequency: a 5 ms query executed 200 times per request (a full second of database time) is worse than a single 100 ms query
  • Lock wait time: queries blocked waiting for row or table locks
  • Rows scanned vs. rows returned: a high ratio points to missing indexes

Structured Logging

Logs are your investigation trail. When APM shows a slow trace, logs tell you what happened during that request.

import structlog

logger = structlog.get_logger()

logger.info(
    "order_processed",
    order_id=order.id,
    duration_ms=elapsed,
    item_count=len(order.items),
    cache_hit=cache_hit,
)

Log with enough context to reconstruct the request path: IDs, durations, counts, cache hit/miss status.

Dashboards and Alerting

Combine these signals into dashboards. I use Datadog dashboards tracking:

  • p75 / p95 / p99 latency per endpoint over time
  • Error rate alongside latency (slow responses often precede errors)
  • Database query count per request, where a sudden jump signals a regression
  • Queue depth for async workers

Set up both threshold-based and rate-of-change alerts. Static thresholds catch known-bad states; rate-of-change alerts catch regressions as they happen.
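The two alert styles can be sketched as plain functions (hypothetical helpers, not any vendor's API). The static check catches known-bad absolute values; the rate-of-change check catches a regression while it is still under the threshold:

```python
def threshold_alert(p95_ms, limit_ms=200):
    """Static threshold: fire when the latest p95 exceeds a known-bad level."""
    return p95_ms > limit_ms

def rate_of_change_alert(previous_p95_ms, current_p95_ms, max_increase=0.5):
    """Rate of change: fire when p95 grew by more than max_increase (50%)
    between two windows, even if both values pass the static check."""
    if previous_p95_ms <= 0:
        return False
    growth = (current_p95_ms - previous_p95_ms) / previous_p95_ms
    return growth > max_increase

# 90 ms -> 160 ms stays under the 200 ms threshold,
# but the ~78% jump is exactly the regression you want to catch early
assert not threshold_alert(160)
assert rate_of_change_alert(90, 160)
```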

Screenshot to add (dd-dashboard-latency.png): Datadog dashboard showing p75/p95/p99 latency timeseries for a single endpoint. Capture a view where the three percentile lines are visible and diverging (e.g., p50 flat around 80 ms while p99 spikes to 800 ms). Include the time range selector and the endpoint name in the title. This gives readers a concrete reference for what "observability" looks like in practice.


Reading Profiling Data

Setting up observability tools is step one. Getting useful information out of them is where most people get stuck. Below is how I read the output from the three profiling interfaces I use most: Python's cProfile, Django Debug Toolbar, and Datadog APM.

Python cProfile

cProfile is built into Python and requires no dependencies. It profiles function-level execution time.

Running a profile:

import cProfile
import pstats

# Profile a function call
cProfile.run('my_slow_function()', 'output.prof')

# Read the results
stats = pstats.Stats('output.prof')
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20 functions

For profiling a Django view in isolation:

import cProfile
from django.test import RequestFactory

factory = RequestFactory()
request = factory.get('/api/orders/')

profiler = cProfile.Profile()
profiler.enable()
response = my_view(request)
profiler.disable()

profiler.print_stats(sort='cumulative')

Reading the output:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      1    0.000    0.000    1.842    1.842 views.py:45(order_list)
    200    0.003    0.000    1.650    0.008 models.py:12(get_customer)
    200    1.580    0.008    1.580    0.008 base.py:330(execute)
      1    0.001    0.001    0.180    0.180 serializers.py:88(to_representation)
      1    0.000    0.000    0.012    0.012 pagination.py:22(paginate)
Column | What It Means
ncalls | How many times this function was called
tottime | Time spent inside this function, excluding sub-calls
percall (first) | tottime / ncalls
cumtime | Total time spent in this function, including sub-calls
percall (second) | cumtime / ncalls

How to read this:

Start from the top (sorted by cumtime). In the example above, order_list takes 1.84 seconds total. Drilling down, get_customer is called 200 times and accounts for 1.65 seconds — that's 89% of the total. The actual time is spent in base.py:execute, which is Django's database query executor. This is a textbook N+1: 200 individual queries to fetch customer data.

What to focus on:

  • High ncalls on database functions: N+1 queries
  • High tottime on a single function: CPU-bound bottleneck (serialisation, computation)
  • High cumtime with low tottime: the function itself is fast but calls something slow
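When a hot function shows up, pstats can also tell you who is calling it. The sketch below profiles a toy pair of functions (helper and main are invented stand-ins for a query executor and a view) and then filters the caller report:

```python
import cProfile
import io
import pstats

def helper():
    # Stand-in for an expensive call, e.g. a database query
    return sum(range(1000))

def main():
    # Calls helper 50 times: the shape of an N+1 pattern
    return [helper() for _ in range(50)]

profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats('cumulative')
stats.print_callers('helper')  # only functions matching 'helper', with their callers
report = stream.getvalue()
```

In real profiles, pointing print_callers at the database executor (base.py:execute in the output above) reveals which of your functions is issuing the repeated queries.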

For a visual alternative to the text output, pipe cProfile data into snakeviz — it renders the same data as an interactive flame graph in the browser:

python -m cProfile -o output.prof my_script.py
snakeviz output.prof

snakeviz reads the profile top-down. Each box is a function, and boxes nested below others mean they were called by the function above. Wider boxes took more time. Click a box to zoom in, and sort the table below by ncalls or cumtime to find outliers.

Screenshot to add (snakeviz-flamegraph.png): snakeviz browser output showing a sunburst or icicle chart for a Django view profile. Ideally capture a view where one function (e.g., a database query) is visibly wider than the rest, with the stats table below showing ncalls, tottime, and cumtime columns. Annotate or circle the wide block to show what "this is where the time goes" looks like.

Django Debug Toolbar Profiling

Django Debug Toolbar gives you per-request profiling without writing any code. It has several panels, but for performance work the most useful are the SQL panel and the Profiling panel.

Enabling the profiler:

# settings.py (development only)
INSTALLED_APPS = [
    ...
    'debug_toolbar',
]

MIDDLEWARE = [
    'debug_toolbar.middleware.DebugToolbarMiddleware',
    ...
]

DEBUG_TOOLBAR_PANELS = [
    'debug_toolbar.panels.sql.SQLPanel',
    'debug_toolbar.panels.profiling.ProfilingPanel',
    'debug_toolbar.panels.timer.TimerPanel',
    'debug_toolbar.panels.cache.CachePanel',
]

INTERNAL_IPS = ['127.0.0.1']

The SQL Panel:

This is usually the first place to look. It shows:

  • Total number of queries and total time
  • Each individual query with its SQL, execution time, and stack trace
  • Duplicate queries highlighted (immediate N+1 indicator)
  • EXPLAIN output for each query (click to expand)

What to look for:

  • "Similar" or "Duplicated" badges — these are N+1 queries. The toolbar groups identical query patterns and shows how many times each pattern was executed
  • Total query count — a list API returning 50 items should not fire 150 queries. If it does, you're missing select_related or prefetch_related
  • Query time distribution — if one query takes 200 ms and the rest take 1 ms each, that single query is your target
  • The stack trace — click on any query to see exactly which line of Python code triggered it. This tells you whether the query came from the view, the serialiser, a model method, or a template

Screenshot to add (ddt-sql-panel.png): Django Debug Toolbar SQL panel on a page with an N+1 problem. Capture the panel showing a high query count (e.g., "187 queries in 420 ms") with several queries marked "Duplicated" or "Similar" in red/orange badges. Expand one query to show the SQL text and the stack trace link. This is the most common first encounter with N+1 queries for Django developers.

The Profiling Panel:

The profiling panel is disabled by default. Click its checkbox in the toolbar to activate it. On Python 3.12+, you need to run the dev server with --nothreading for it to work:

python manage.py runserver --nothreading

Once enabled, it shows a collapsible call tree for the current request, similar to cProfile output but rendered as an indented HTML table. Each row shows a function, its cumulative time, own time, and call count. You can expand and collapse levels to drill into the call hierarchy:

GET /api/orders/ — 1842 ms
├── OrderListView.get() — 1842 ms (cumtime)
│   ├── OrderQuerySet.all() — 12 ms
│   ├── OrderSerializer.to_representation() — 1650 ms
│   │   ├── CustomerField.to_representation() × 200 — 1580 ms
│   │   │   └── SQL: SELECT * FROM customers WHERE id = %s × 200
│   │   └── ItemSerializer.to_representation() × 200 — 60 ms
│   └── Paginator.paginate() — 180 ms

Reading this:

  • Nesting shows the call hierarchy. A slow parent with a fast own-time means the parent is slow because of its children
  • Call count (× 200) is the key signal. If a function repeats many times inside a loop, you're probably looking at an N+1 or a missing batch operation
  • Start from the deepest nodes with the highest cumulative time and work upward
  • You can adjust PROFILER_MAX_DEPTH (default: 10) and PROFILER_THRESHOLD_RATIO (default: 8) in DEBUG_TOOLBAR_CONFIG to control how deep the tree goes and which functions get included
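Both knobs go in DEBUG_TOOLBAR_CONFIG in your development settings; the values below are illustrative, not recommendations:

```python
# settings.py (development only)
DEBUG_TOOLBAR_CONFIG = {
    # Render the Profiling panel's call tree up to 15 levels deep (default: 10)
    "PROFILER_MAX_DEPTH": 15,
    # Functions below cumtime / ratio are hidden; a higher ratio keeps
    # more of the fast calls visible in the tree (default: 8)
    "PROFILER_THRESHOLD_RATIO": 100,
}
```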

Screenshot to add (ddt-profiling-panel.png): Django Debug Toolbar Profiling panel showing the call tree for a request. Capture a view with several levels expanded, where one branch has a high cumulative time and a high call count (e.g., a serialiser method called 200× inside a loop). The indented table format with CumTime, TotTime, and Per Call columns should be visible.

Datadog APM Resource Pages

When you open a resource (endpoint) in Datadog APM, you see several tabs and visualisations. Here's what each one tells you.

The Resource Page Overview:

The top of the page shows aggregate metrics for the selected endpoint over your chosen time range:

  • Requests/sec — throughput
  • Latency — shown as p50, p75, p90, p95, p99 over time
  • Errors — error rate as a percentage
  • Total time — the proportion of your service's total processing time spent on this resource

Latency Distribution:

A histogram showing how response times are distributed. You want a tight cluster on the left. A long tail to the right means outlier requests are much slower than typical ones. Bimodal distributions (two humps) suggest two distinct code paths — for example, cache hits completing in 20 ms and cache misses in 400 ms.

Screenshot to add (dd-latency-distribution.png): Datadog resource page latency distribution histogram for a problematic endpoint. Capture one showing a long tail (bulk of requests clustered around 50–100 ms but a visible tail stretching to 800+ ms). If you have an example of a bimodal distribution (two distinct humps), capture that as a second image (dd-latency-bimodal.png) — it's a much clearer illustration of the cache-hit vs. cache-miss pattern.
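You can simulate the cache-hit vs. cache-miss shape in a few lines to build intuition for what a bimodal histogram represents (synthetic numbers, chosen to mirror the 20 ms / 400 ms example above):

```python
import random
from statistics import median

random.seed(42)

# An endpoint with an ~80% cache hit rate: hits cluster near 20 ms,
# misses near 400 ms, producing two humps in the latency histogram
samples = [
    random.gauss(20, 3) if random.random() < 0.8 else random.gauss(400, 40)
    for _ in range(1000)
]

hits = [s for s in samples if s < 200]     # left hump
misses = [s for s in samples if s >= 200]  # right hump

hit_ratio = len(hits) / len(samples)  # roughly 0.8
hit_p50 = median(hits)                # near 20 ms
miss_p50 = median(misses)             # near 400 ms
```

A single p95 over this traffic tells you little; the split medians are what point you at the cache.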

Spans (Trace Waterfall):

When you click into an individual trace, you get the span waterfall. Each span represents a unit of work:

[django.request]──────────────────────── 1200 ms
  [django.middleware]──── 5 ms
  [django.view]──────────────────────── 1190 ms
    [postgresql.query]── 4 ms    SELECT * FROM orders WHERE ...
    [postgresql.query]── 3 ms    SELECT * FROM customers WHERE id = 1
    [postgresql.query]── 4 ms    SELECT * FROM customers WHERE id = 2
    [postgresql.query]── 3 ms    SELECT * FROM customers WHERE id = 3
    ... (197 more identical spans)
    [serialization]───── 15 ms

What each span tells you:

Span Attribute | What It Shows
Service | Which service produced this span (web app, database, cache, external API)
Operation | The type of work (e.g., postgresql.query, redis.command, http.request)
Duration | How long this span took
Resource | The specific query, URL, or cache key
Error flag | Whether this span resulted in an error
Child count | Number of child spans (sub-operations)

How to read the waterfall:

  • Spans stacked vertically with small gaps are executing sequentially. This is normal for database queries within a single-threaded request
  • A tall stack of identical spans (same operation, same resource pattern) is an N+1. In the example above, 200 postgresql.query spans with SELECT * FROM customers WHERE id = ? is the smoking gun
  • Spans with long durations but no children indicate time spent in application code (CPU-bound work, synchronous I/O)
  • A single very wide span early in the waterfall followed by fast spans suggests a slow initial query or connection setup

Screenshot to add (dd-trace-waterfall-n1.png): Datadog trace waterfall view for a request with an N+1 problem. Capture a trace where you can see many identical postgresql.query spans stacked vertically (each 3–5 ms, but dozens of them). The total trace should be visibly long (1000+ ms). The span colours should show postgresql spans in a distinct colour from the django spans. This is the "smoking gun" visual for N+1.

Span List tab:

The span list groups spans by resource and service, sorted by span count. Instead of a timeline, you see a table with columns for resource name, number of spans, average duration, execution time, and percentage of total trace time. This is useful when you want to quickly answer "which database query ran the most times?" or "which service consumed the most time?" without scrolling through a long waterfall.

Sort by SPANS to find N+1 patterns (one query repeated hundreds of times), or by % EXEC TIME to find the single heaviest operation.

Flame Graph tab:

The flame graph is the default trace visualisation in Datadog. It shows all spans from a trace laid out on a timeline, colour-coded by service.

  • The x-axis is time. Wider spans took longer
  • The y-axis is call depth. Each row is a child of the row above it
  • Colours represent services by default (you can switch to group by host or container)
  • Spans from different services are visually distinct, so you can tell at a glance whether time is spent in your application code, the database, or an external API call

Reading the flame graph:

  1. Look for the widest spans at the deepest level. These are where actual time is spent
  2. Hover over any span to see the service name, operation, resource, and duration
  3. Click a span to open the detail panel below, which includes the full query text, error details, and related logs
  4. Use the legend at the top to see what percentage of total execution time each service accounts for. If postgresql takes 80% of the trace, the database is your bottleneck
  5. Toggle the Errors checkbox under "Filter Spans" to highlight error spans in the graph
  6. Compare flame graphs before and after a fix. The previously wide span should be narrower or gone

Screenshot to add (dd-flamegraph.png): Datadog flame graph for a trace, colour-coded by service. Capture one where the postgresql service takes a large portion of the total width, with many narrow child spans visible. The legend at the top should show the % Exec Time breakdown per service. If possible, capture a second image (dd-flamegraph-fixed.png) of the same endpoint after an N+1 fix — the postgresql portion should be visibly smaller.


Walkthrough: Finding a Bottleneck End-to-End

Here's how the full process looks on a real endpoint. I'll use a simplified version of a case I've worked through.

The symptom:

Datadog dashboard shows /api/orders/ with p95 at 1200 ms, well above the 200 ms target. The endpoint handles 50,000 requests/day, making it a critical priority.

Step 1: Check the Datadog resource page

Open the resource page for /api/orders/. The latency distribution shows a long tail — p50 is 180 ms, but p95 jumps to 1200 ms. The tail requests correlate with customers who have many orders.

Step 2: Drill into a slow trace

Filter traces by duration > 1000 ms. Open one. The span waterfall shows:

[django.request] ────────────────────────── 1180 ms
  [django.view] ─────────────────────────── 1170 ms
    [postgresql.query] ─── 8 ms   SELECT * FROM orders WHERE user_id = 42 ...
    [postgresql.query] ─── 3 ms   SELECT * FROM customers WHERE id = 42
    [postgresql.query] ─── 4 ms   SELECT * FROM order_items WHERE order_id = 101
    [postgresql.query] ─── 3 ms   SELECT * FROM products WHERE id = 55
    [postgresql.query] ─── 4 ms   SELECT * FROM order_items WHERE order_id = 102
    [postgresql.query] ─── 3 ms   SELECT * FROM products WHERE id = 23
    ... (380 more query spans)

Total query count: 384. Total database time: ~980 ms. The rest is serialisation overhead.

Step 3: Identify the pattern

Two N+1 patterns:

  1. For each order, a separate query fetches its items (SELECT * FROM order_items WHERE order_id = ?)
  2. For each item, a separate query fetches the product (SELECT * FROM products WHERE id = ?)

Step 4: Reproduce locally with Debug Toolbar

Hit the same endpoint locally with django-debug-toolbar enabled. The SQL panel confirms: 384 queries, with "Duplicated" badges on the order_items and products queries.

Step 5: Apply the fix

from django.db.models import Prefetch

# Before
orders = Order.objects.filter(user=request.user)

# After
orders = (
    Order.objects
    .filter(user=request.user)
    .select_related('customer')
    .prefetch_related(
        Prefetch(
            'items',
            queryset=OrderItem.objects.select_related('product')
        )
    )
)

Query count drops from 384 to 2: one query for the orders (with customers joined in via select_related), and one prefetch query for the items (with products joined in via select_related on the prefetch queryset).

Step 6: Verify

Push to staging. Monitor the Datadog resource page for /api/orders/:

  • p50: 180 ms → 45 ms
  • p95: 1200 ms → 95 ms
  • p99: 2400 ms → 180 ms
  • Query count per request: 384 → 2

The latency distribution shifts from a long-tail shape to a tight cluster under 100 ms. Ship to production.

Screenshot to add (walkthrough-before-after.png): Side-by-side or stacked comparison of the Datadog latency distribution for /api/orders/ before and after the fix. The "before" should show a long tail; the "after" should show a tight cluster. If a side-by-side isn't possible, use two separate images (walkthrough-before.png and walkthrough-after.png). This is the payoff visual — it shows the reader what success looks like in Datadog.


Step 2: Identify and Prioritise Bottlenecks

With observability in place, the next step is triage. Not every slow endpoint matters equally.

Prioritisation Framework

Rank endpoints by impact × frequency:

Endpoint | p95 Latency | Requests/day | Priority
/api/orders/ | 1200 ms | 50,000 | Critical
/api/users/profile/ | 400 ms | 30,000 | High
/api/reports/monthly/ | 3000 ms | 200 | Low
/api/dashboard/ | 600 ms | 15,000 | High

A 3-second report endpoint used 200 times a day is less urgent than a 1.2-second orders endpoint hit 50,000 times. Fix what affects the most users first.
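The ranking is mechanical enough to script. A sketch using the table above (the scoring function is a made-up helper, not a standard metric):

```python
endpoints = [
    {"path": "/api/orders/",          "p95_ms": 1200, "requests_per_day": 50_000},
    {"path": "/api/users/profile/",   "p95_ms": 400,  "requests_per_day": 30_000},
    {"path": "/api/reports/monthly/", "p95_ms": 3000, "requests_per_day": 200},
    {"path": "/api/dashboard/",       "p95_ms": 600,  "requests_per_day": 15_000},
]

def daily_wait_seconds(endpoint):
    """Impact x frequency: total user-time spent waiting per day."""
    return endpoint["p95_ms"] / 1000 * endpoint["requests_per_day"]

ranked = sorted(endpoints, key=daily_wait_seconds, reverse=True)
# /api/orders/ comes first; the slow-but-rare report endpoint lands last
```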

Common Bottleneck Patterns

These are the patterns I've run into most often:

  1. N+1 queries: a list endpoint fires one query per item instead of batching
  2. Missing database indexes: full table scans on filtered or sorted columns
  3. Over-fetching: loading entire rows when only a few columns are needed, especially with large text or JSON fields
  4. Synchronous blocking: waiting on external APIs, email sending, or file processing in the request cycle
  5. No caching: recomputing identical results on every request
  6. Unoptimised serialisation: serialisers performing additional queries or heavy computation

As any system scales, these patterns get more noticeable. An N+1 that's invisible with 10 records becomes a real problem at 10,000.


Step 3: Apply Targeted Fixes

Start with small wins. They're often low-effort but make a disproportionate difference.

Fix N+1 Queries

N+1 is probably the most common performance bug in ORM-based backends. It happens when you load a list of objects and then access a related object on each one individually.

The problem:

# This fires 1 query for orders + N queries for customer (one per order)
orders = Order.objects.all()
for order in orders:
    print(order.customer.name)  # Each access = 1 query

The fix:

# select_related: single JOIN query for ForeignKey/OneToOne
orders = Order.objects.select_related('customer').all()

# prefetch_related: two queries for ManyToMany/reverse FK
orders = Order.objects.prefetch_related('items').all()

Detecting N+1 queries:

  • Datadog APM traces showing dozens of identical SELECT statements per request
  • django-debug-toolbar showing query count spikes on list views
  • Middleware that logs query count per request (useful in staging or with a debug flag):
import structlog
from django.db import connection

logger = structlog.get_logger()

class QueryCountMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        initial = len(connection.queries)
        response = self.get_response(request)
        total = len(connection.queries) - initial
        if total > 20:
            logger.warning("high_query_count", path=request.path, count=total)
        return response

Note: connection.queries only populates when DEBUG=True. In production, rely on APM tracing or a package like django-querycount instead.

Add Database Indexes

Indexes are the highest-leverage single change for query performance. Without one, the database scans every row.

Identifying missing indexes:

-- PostgreSQL: find slow queries and their execution plans
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE status = 'pending' AND created_at > '2026-01-01'
ORDER BY created_at DESC;

Look for Seq Scan in the output. That means no index is being used.

Adding targeted indexes:

-- Single-column index for filtered lookups
CREATE INDEX idx_orders_status ON orders(status);

-- Composite index for queries that filter + sort
CREATE INDEX idx_orders_status_created ON orders(status, created_at DESC);

-- Partial index for a common filter condition
CREATE INDEX idx_orders_pending ON orders(created_at DESC)
WHERE status = 'pending';

In Django migrations:

from django.db import migrations, models


class Migration(migrations.Migration):
    operations = [
        migrations.AddIndex(
            model_name='order',
            index=models.Index(
                fields=['status', '-created_at'],
                name='idx_order_status_created',
            ),
        ),
    ]

Index trade-offs:

  • Indexes speed up reads but slow down writes (every INSERT/UPDATE must update the index)
  • On very large tables (100M+ rows), adding an index can lock the table. Use CREATE INDEX CONCURRENTLY in PostgreSQL to avoid this
  • Over-indexing wastes storage and makes the query planner's job harder

Stop Over-Fetching Data

Loading columns you don't need wastes memory and network bandwidth, especially with large text or JSON fields.

The problem:

# Loads ALL columns including a 50KB description field
products = Product.objects.all()

The fix:

# Only fetch what you need
products = Product.objects.only('id', 'name', 'price', 'status')

# Or explicitly defer heavy fields
products = Product.objects.defer('description', 'metadata_json')

# For read-only list views, use values/values_list
product_rows = Product.objects.values_list('id', 'name')  # list of (id, name) tuples

Implement Caching

Cache results that are expensive to compute and don't change frequently.

Layer 1: Application-level cache (Redis/Memcached)

from django.core.cache import cache

def get_dashboard_stats(user_id):
    cache_key = f"dashboard_stats:{user_id}"
    stats = cache.get(cache_key)

    if stats is None:
        stats = compute_expensive_stats(user_id)
        cache.set(cache_key, stats, timeout=300)  # 5 minutes

    return stats

Layer 2: Per-request memoisation with cached_property

from django.utils.functional import cached_property
from rest_framework import serializers

class OrderSerializer(serializers.ModelSerializer):
    @cached_property
    def _prefetched_items(self):
        # Evaluated once per serializer instance, then reused on later access
        return list(self.instance.items.select_related('product'))

Layer 3: HTTP caching for read-heavy endpoints

from django.views.decorators.cache import cache_page

@cache_page(60 * 5)  # Cache for 5 minutes
def product_list(request):
    ...

Cache invalidation strategies:

  • Time-based (TTL): simplest, works for data that can tolerate staleness
  • Event-based: invalidate on write operations using signals or hooks
  • Versioned keys: append a version counter to cache keys, increment on data changes
from django.core.cache import cache
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=Order)
def invalidate_order_cache(sender, instance, **kwargs):
    cache.delete(f"dashboard_stats:{instance.user_id}")
    cache.delete(f"order_detail:{instance.id}")
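The versioned-keys strategy needs no deletes at all: bump a namespace version and every old key becomes unreachable, left to expire via TTL. A framework-free sketch with a plain dict standing in for Redis (the class is invented for illustration):

```python
class VersionedCache:
    """Cache keys embed a per-namespace version; bumping the version
    invalidates the whole namespace without touching individual keys."""

    def __init__(self):
        self._store = {}     # stands in for Redis
        self._versions = {}  # namespace -> current version

    def _key(self, namespace, key):
        version = self._versions.get(namespace, 0)
        return f"{namespace}:v{version}:{key}"

    def get(self, namespace, key):
        return self._store.get(self._key(namespace, key))

    def set(self, namespace, key, value):
        self._store[self._key(namespace, key)] = value

    def bump(self, namespace):
        """Invalidate every key in the namespace in O(1)."""
        self._versions[namespace] = self._versions.get(namespace, 0) + 1
```

In production the version counter itself lives in Redis (e.g. an INCR on a version key) so all workers see the bump.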

Offload Async Work

Anything that doesn't need to happen before the HTTP response can be moved out of the request cycle. I use this pattern extensively in my Message Scheduler project — Celery workers handle email and Telegram delivery asynchronously, keeping the API response under 50 ms even during peak load.

Common candidates:

  • Sending emails and notifications
  • Generating reports or PDFs
  • Processing uploaded files
  • Syncing data with external services
  • Updating search indexes
  • Aggregating analytics

Using Celery in Django:

# tasks.py
from celery import shared_task

@shared_task
def send_order_confirmation(order_id):
    order = Order.objects.get(id=order_id)
    send_email(
        to=order.customer.email,
        subject=f"Order #{order.id} confirmed",
        body=render_confirmation(order),
    )

# views.py
def create_order(request):
    order = Order.objects.create(...)
    send_order_confirmation.delay(order.id)  # Non-blocking
    return Response({"id": order.id}, status=201)

This takes email sending (which can take 500 ms–2 s depending on the provider) out of the response path entirely. For a production example of Celery + Redis in action with retry logic and idempotency keys, see the Message Scheduler case study.

Use Connection Pooling

Opening a new database connection per request is expensive. Connection pooling keeps a pool of reusable connections.

For Django with PostgreSQL:

# settings.py
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'mydb',
        'CONN_MAX_AGE': 600,  # Keep connections alive for 10 minutes
    }
}

CONN_MAX_AGE keeps connections alive across requests within a thread. For actual connection pooling with control over pool size, use PgBouncer as an external pooler between Django and PostgreSQL. This matters when you're running multiple application workers.

Use Read Replicas

For read-heavy workloads, route read queries to replica databases while writes go to the primary.

Django database router:

class PrimaryReplicaRouter:
    def db_for_read(self, model, **hints):
        return 'replica'

    def db_for_write(self, model, **hints):
        return 'default'

    def allow_relation(self, obj1, obj2, **hints):
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        return db == 'default'

Caveat: Read replicas have replication lag (usually milliseconds, but it can spike under load). Don't route reads to replicas immediately after a write if the user expects to see their own changes. This causes "read-your-own-write" inconsistency.
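One common mitigation is to pin a user's reads to the primary for a short window after any write. A framework-free sketch of the idea (the class and its signatures are hypothetical; Django's real router methods receive a model and hints rather than a user id, so implementations usually carry the pin in the session or a cookie):

```python
import time

class PinningRouter:
    """Route reads to the replica, except for users who wrote recently."""

    PIN_SECONDS = 5  # should comfortably exceed worst observed replication lag

    def __init__(self):
        self._pinned_at = {}  # user_id -> time of last write

    def record_write(self, user_id):
        self._pinned_at[user_id] = time.monotonic()

    def db_for_read(self, user_id):
        pinned = self._pinned_at.get(user_id)
        if pinned is not None and time.monotonic() - pinned < self.PIN_SECONDS:
            return "default"  # primary: the user must see their own write
        return "replica"
```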


Step 4: Query-Level Deep Dives

When quick fixes aren't enough, you need to go deeper into individual query performance.

Using EXPLAIN ANALYZE

EXPLAIN ANALYZE executes the query and shows the plan the database used. It's where you go when you need to understand exactly why a specific query is slow.

EXPLAIN ANALYZE
SELECT o.id, o.status, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at > '2026-01-01'
  AND o.status IN ('pending', 'processing')
ORDER BY o.created_at DESC
LIMIT 50;

What to look for in the output:

Indicator | Meaning | Action
Seq Scan | Full table scan | Add an index on the filtered columns
Nested Loop with high row count | Looping join over a large result set | Consider a hash join, or add indexes
Sort with high cost | Sorting without an index | Add an index that matches the ORDER BY
Rows Removed by Filter (high number) | Index not selective enough | Use a more specific composite index
Buffers: shared read (high) | Data not in memory | Increase shared_buffers or optimise the query to touch fewer pages

Screenshot to add (explain-analyze-output.png): Terminal output from running EXPLAIN ANALYZE on a query with a Seq Scan. Highlight or annotate the Seq Scan node and the rows= vs Rows Removed by Filter values. If you have a second capture showing the same query after adding an index (showing Index Scan instead), include that as explain-analyze-indexed.png for a before/after comparison.

Reproducing Production Queries Safely

Don't run EXPLAIN ANALYZE on production directly, since it actually executes the query. Instead:

  1. Copy the slow query from Datadog/APM traces
  2. Run it in a read replica or staging environment with production-like data
  3. Compare plans between staging and production (data distribution matters)
# Connect to read replica
psql -h replica-host -U readonly_user -d mydb

# Set statement timeout as a safety net
SET statement_timeout = '30s';

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT ...;

Step 5: Verify and Monitor

Every fix needs measurement. The process I follow: push to staging, monitor, confirm improvement, then ship to production. Or revert.

The Verification Process

Push fix to staging
       ↓
Monitor Datadog dashboards (p75, p95, p99)
       ↓
  ┌─── Improved? ───┐
  ↓                  ↓
 YES                 NO
  ↓                  ↓
Push to prod     Investigate further
  ↓                  ↓
Monitor prod     Iterate or revert

What to Check After Each Fix

  • Did p95/p99 latency actually drop?
  • Did N+1 fixes reduce total query count per request?
  • Did the fix introduce any new errors? (Performance "fixes" sometimes break things.)
  • Did database CPU and I/O improve? Reduced query time should show up here too.
  • For caching changes, is the hit ratio trending upward?

Tracking Regressions

Performance work isn't something you do once and move on. Set up monitors that alert on:

  • p75, p95, p99 latency exceeding a threshold for more than 5 minutes
  • Query count per request increasing by more than 20%
  • Cache hit rate dropping below 80%
  • New slow queries appearing (anything exceeding 500 ms)

Trade-offs and Gotchas

Performance work involves trade-offs. Here are the ones I've dealt with most.

Indexing on Large Tables

Adding indexes to tables with hundreds of millions of rows isn't always possible. CREATE INDEX can lock the table for minutes. CREATE INDEX CONCURRENTLY avoids locking but takes longer and can fail under high write throughput. Sometimes the answer is redesigning: partition the table, archive old data, or use a materialised view.

Cache Invalidation

Incorrect invalidation leads to stale data: users seeing outdated information, balance mismatches, ghost records. Start with short TTLs and event-based invalidation. Don't cache data that changes on every request.

Over-Optimisation

Not every endpoint needs 50 ms latency. A monthly report endpoint used by 3 internal users can take 5 seconds and nobody will notice. Spend your time where users are, not where the numbers look bad in isolation.

One useful approach: set alerting thresholds relative to traffic. If an endpoint handles 50,000 requests/day, alert at 150 ms p75, 300 ms p95, 500 ms p99. A low-traffic internal endpoint can have much looser thresholds.
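Written down as code, the sliding scale looks like this (the high-traffic tier mirrors the numbers above; the lower tiers are illustrative placeholders, tune them to your own traffic):

```python
def latency_thresholds(requests_per_day):
    """Alert thresholds in milliseconds, scaled to endpoint traffic."""
    if requests_per_day >= 10_000:   # high-traffic, user-facing
        return {"p75": 150, "p95": 300, "p99": 500}
    if requests_per_day >= 1_000:    # moderate traffic
        return {"p75": 300, "p95": 600, "p99": 1000}
    # low-traffic or internal endpoints can run much looser
    return {"p75": 1000, "p95": 3000, "p99": 5000}
```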

Read Replica Lag

Replication lag is usually sub-second, but under load it can spike. Design your application to tolerate this. Route reads-after-writes to the primary, not the replica.


Tools Reference

Category | Tools
APM / Tracing | Datadog APM, New Relic, Elastic APM, Jaeger
Database Profiling | Datadog DB Monitoring, pganalyze, django-debug-toolbar — see also my Django profiling walkthrough
Logging | structlog, ELK Stack, Datadog Logs
Caching | Redis, Memcached, Django cache framework
Async Workers | Celery + Redis/RabbitMQ — production example in Message Scheduler
Connection Pooling | PgBouncer, Django CONN_MAX_AGE
Load Testing | Locust, k6, Apache Bench
Python Profiling | cProfile, snakeviz, Pyinstrument
Query Analysis | EXPLAIN ANALYZE, pg_stat_statements, auto_explain

For language-specific performance trade-offs: my HealthLab project uses Go for a single-binary deployment with goroutine-based concurrency, which avoids the GIL limitations I hit in Python for CPU-bound bot processing. And my portfolio system shows how Jinja2 template rendering with LaTeX achieves a 5x improvement over the previous manual workflow.


Key Takeaways

  1. Measure first, optimise second. Observability is a prerequisite.

  2. Percentiles over averages. p95 and p99 show what users actually experience.

  3. Fix the boring stuff first. N+1 queries, missing indexes, and over-fetching account for most backend latency.

  4. If it doesn't need to happen before the response, move it out of the request cycle.

  5. Cache deliberately: short TTLs, event-based invalidation, clear cache keys.

  6. Verify every change. The loop is observe → fix → measure → ship, not fix → hope.

  7. Accept trade-offs. Not everything needs to be fast, and some optimisations create new problems.


Performance work is incremental. As traffic grows and features ship, new bottlenecks surface. The system is never "done." The point is having a process that catches regressions early and fixes them before users feel it.

All my projects — including architecture diagrams, tradeoff analysis, and failure mode documentation — are at ankitjang.one/projects.

About me: I'm Ankit Jangwan, a Senior Software Engineer building backend systems with Django, PostgreSQL, Celery, and Go. See my case studies at ankitjang.one/case-studies.