I've been building backend systems for about five years now. Some handle 10,000+ messages per day. Others serve AI-powered responses in real-time. A few run on a single VPS for under $5/month.
Over that time, I've settled on a set of principles that keep my systems predictable and boring in production. Boring is the goal.
This post covers the patterns I return to on every project.
Start With the Failure Modes
Most backend design starts with the happy path. I start with a different question: what breaks first?
On my Message Scheduler project, the answer was clear early on. External APIs fail. Amazon SES returns transient errors. Telegram's Bot API has rate limits. If a scheduled message fails at delivery time, the user never gets it.
So the first architectural decision wasn't about the framework or database. It was about retry behaviour. I went with exponential backoff — 3 attempts over 15 minutes — before marking a message as failed. That single decision drove the rest of the stack: Celery for async workers, Redis as the broker, idempotency keys to prevent duplicate deliveries.
The principle: design for what goes wrong before designing for what goes right.
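That retry policy is simple enough to sketch framework-free. This is a minimal illustration, not the project's actual code; the function names are mine, and `base_delay=300` is chosen so that three attempts span the 15-minute window described above:

```python
import time

def deliver_with_retries(send, payload, max_attempts=3, base_delay=300, sleep=time.sleep):
    """Retry `send` with exponential backoff before giving up.

    With base_delay=300 the waits are 5 min, then 10 min: three attempts
    spread over 15 minutes. `send` is any callable that raises on transient
    failure (an SES or Telegram API wrapper, say). Returns False once
    attempts are exhausted so the caller can mark the message as failed
    instead of silently dropping it.
    """
    for attempt in range(max_attempts):
        try:
            send(payload)
            return True
        except ConnectionError:
            if attempt == max_attempts - 1:
                return False
            sleep(base_delay * 2 ** attempt)  # 300s after attempt 1, 600s after attempt 2
    return False
```

In Celery this maps onto the task options `autoretry_for`, `retry_backoff`, and `max_retries`; the standalone version just makes the arithmetic visible.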
When I built HealthLab — a pathology lab management system — the critical failure mode was slot overbooking. Two patients booking the same time slot simultaneously could cause real operational problems. The fix was an atomic database update with a `WHERE booked < capacity` guard. If a second booking hits after the slot fills, it fails cleanly instead of creating a conflict.
```go
func (r *TimeSlotRepository) IncrementBooked(ctx context.Context, slotID uint) error {
	res := r.db.WithContext(ctx).Model(&model.TimeSlot{}).
		Where("id = ? AND booked < capacity", slotID).
		Update("booked", gorm.Expr("booked + 1"))
	if res.Error == nil && res.RowsAffected == 0 {
		return ErrSlotFull // guard matched no rows: the slot is already full
	}
	return res.Error
}
```

No distributed locks. No complex coordination. Just a conditional update that makes the race condition impossible. Note the `RowsAffected` check: GORM returns a nil error even when the `WHERE` clause matches nothing, so without it a full slot would look like a success.
Keep the Request Cycle Fast
Anything that doesn't need to happen before the HTTP response should be moved out of the request cycle. This sounds obvious, but I've seen (and written) plenty of views that send emails, process files, and sync with external APIs synchronously.
In the Message Scheduler, creating a scheduled message is instant. The actual delivery happens later through Celery:
```python
def create_order(request):
    order = Order.objects.create(...)
    send_order_confirmation.delay(order.id)  # Non-blocking
    return Response({"id": order.id}, status=201)
```

The user gets a 201 response in milliseconds. The email sends whenever Celery picks up the task.
For my portfolio's AI chatbot, I took this further with streaming responses. Nobody wants to stare at a spinner while an LLM generates a full answer. So the backend yields tokens via Server-Sent Events as they're generated. Sub-second time-to-first-token.
I also run follow-up suggestion generation in parallel using a ThreadPoolExecutor. The suggestion chips appear the moment the main response finishes, with zero additional wait.
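The SSE framing itself is tiny. This is an illustrative sketch independent of any web framework (the `[DONE]` sentinel is a common convention, not necessarily what my backend emits); in FastAPI you would wrap a generator like this in a `StreamingResponse` with `media_type="text/event-stream"`:

```python
def sse_stream(tokens):
    """Format LLM tokens as Server-Sent Events frames as they arrive.

    `tokens` is any iterator yielding text chunks (e.g. from an LLM client's
    streaming API). Each chunk goes out immediately as a `data:` frame, so
    the client sees the first token without waiting for the full answer.
    """
    for token in tokens:
        yield f"data: {token}\n\n"
    yield "data: [DONE]\n\n"  # sentinel so the client knows the stream ended
```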
The pattern works across all my projects:
- Accept the request
- Validate inputs
- Do the minimum work needed for the response
- Offload everything else
Choose Boring Technology When You Can
Django + PostgreSQL + Celery + Redis. I use this stack on most Python projects because I know exactly how it behaves under load, how it fails, and how to debug it.
For the Message Scheduler, I considered FastAPI with a custom scheduler. FastAPI would have been slightly faster and lower on memory. But Django gave me the ORM, the admin panel, and battle-tested middleware. Development speed mattered more than shaving 50ms off response times.
Go was the right choice for HealthLab — single binary deployment, goroutines for handling concurrent bot conversations, and compile-time type safety. But that was a deliberate decision for a specific set of constraints, not a default.
My defaults:
| Need | Default Choice | Why |
|---|---|---|
| Web API (Python) | Django + DRF | ORM, admin, middleware ecosystem |
| Task queue | Celery + Redis | Proven, debuggable, good monitoring |
| Database | PostgreSQL | JSON support, full-text search, partial indexes |
| Async delivery | Celery `apply_async(eta=...)` | Built-in ETA scheduling, no polling |
| Connection pooling | PgBouncer or `CONN_MAX_AGE` | Reuse connections across requests |
I've written about the specific performance patterns I follow in more detail on my blog: How to Optimise Backend Performance.
Make Scheduling Precise, Not Approximate
The Message Scheduler handles timezone-aware scheduling across different user locations. The rule I follow: store everything in UTC, convert at display and delivery time.
I've seen bugs from storing local times in database columns. A user in IST schedules for 9 AM, another in PST schedules for 9 AM — if you store "09:00" without timezone context, one of them gets it wrong.
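The rule takes a few lines with the standard library (the helper name is mine, for illustration):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(wall_clock: datetime, tz_name: str) -> datetime:
    """Interpret a naive wall-clock time in the user's timezone, return UTC."""
    return wall_clock.replace(tzinfo=ZoneInfo(tz_name)).astimezone(ZoneInfo("UTC"))

# The same "9 AM" becomes two different instants:
# to_utc(datetime(2024, 6, 3, 9, 0), "Asia/Kolkata") is 03:30 UTC,
# to_utc(datetime(2024, 6, 3, 9, 0), "America/Los_Angeles") is 16:00 UTC.
```

Store the UTC instant; keep the timezone name alongside it only if you need to re-render the user's local view.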
The scheduling architecture uses Celery's eta parameter instead of polling:
- Same-day messages: scheduled immediately at creation time with `apply_async(eta=send_at)`
- Future messages: a daily cron at midnight schedules that day's messages
This means one cron job per day instead of checking every minute. The tasks sit in Redis until their ETA, then fire at the exact scheduled time.
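The branch is simple enough to show as plain Python. This is a sketch with illustrative names: `enqueue` stands in for `task.apply_async(eta=...)`, and `Msg` for whatever model holds `send_at`:

```python
def schedule_message(message, now, enqueue):
    """Same-day messages go straight to the broker with an exact ETA;
    future ones wait in the database for the midnight cron to pick up."""
    if message.send_at.date() == now.date():
        enqueue(message.id, eta=message.send_at)  # Celery: task.apply_async(eta=...)
        return "enqueued"
    return "deferred"
```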
Natural language date parsing rounds this out. Type "next Friday" or "in 2 weeks" and it converts to UTC. I implemented this with chrono-node on the frontend, normalizing everything to UTC before persistence.
The full case study is on my site: Message Scheduler Case Study.
Observability Is Not Optional
I wrote a full post on this — How to Optimise Backend Performance — but the short version: you can't fix what you can't see.
The minimum I set up on every project:
- Structured logging with enough context to reconstruct requests (IDs, durations, cache hit/miss)
- Query count tracking per request (a sudden jump from 3 queries to 200 is an N+1)
- Health endpoints that check actual dependencies (database connectivity, Redis, external APIs)
- Percentile metrics (p75, p95, p99) over averages — averages hide the worst experiences
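The percentile point deserves a tiny worked example. This nearest-rank helper is my own illustration, not code from any of the projects:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples (ms)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# 98 fast requests plus two 5-second outliers:
latencies = [50] * 98 + [5000] * 2
# The mean is 149 ms and looks healthy; percentile(latencies, 99)
# is 5000 ms and shows what the slowest users actually experienced.
```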
On the HealthLab project, the health endpoint checks DB connectivity. On the Message Scheduler, Celery Flower provides worker monitoring. On my portfolio, the Cloudflare Worker proxy handles error states gracefully.
Authentication Should Match the Client
Different clients have different trust levels. On HealthLab, I built a multi-auth middleware stack:
- JWT for the admin dashboard (human users with sessions)
- API keys for bot integrations (long-lived, server-to-server)
- Webhook secret tokens for Telegram callbacks (verify origin authenticity)
```go
// JWT for dashboard (admin users)
v1.Use(middleware.JWTAuth())

// API key for bot integrations
botGroup := v1.Group("/bot")
botGroup.Use(middleware.APIKeyAuth())

// Secret token for Telegram webhooks
telegramGroup.Use(middleware.TelegramSecretToken())
```

Each auth strategy exists because it's the right fit for that client type. Forcing JWT on a webhook endpoint or API keys on a human-facing dashboard creates friction for no security benefit.
State Belongs Where It's Cheapest
For the portfolio's AI chatbot, chat history lives in the visitor's browser (localStorage with a 1-hour TTL). The last 5 exchanges get sent to the backend with each request so the LLM can handle follow-ups.
No session database. No Redis for state. The backend is stateless, which means it scales horizontally without coordination.
For HealthLab's Telegram bot, conversation state (the 5-step booking flow) lives in a Go `sync.Map` in memory. Booking conversations are short-lived — under 5 minutes. If the server restarts, the user starts the flow over. That's an acceptable tradeoff for a much simpler architecture.
The question I ask: how long does this state need to live, and what happens if it disappears?
- Booking flow (5 minutes, restartable) → in-memory
- Chat history (1 hour, nice-to-have) → client-side storage
- Scheduled messages (days or months, critical) → PostgreSQL
- Task queue (hours, retriable) → Redis
Deployment Should Be One Command
Every project I ship is containerized or packaged for single-command deployment.
- Message Scheduler: `docker-compose up` → Django + Celery + Redis + PostgreSQL + Nginx
- HealthLab: `docker-compose up` → Go API + PostgreSQL + React dashboard
- Telegram Chat Manager: single PyInstaller executable — no Docker, no Python, no dependencies
- Portfolio AI backend: FastAPI container on a VPS with systemd
The Telegram Chat Manager took this the furthest. The entire app — FastAPI server, embedded HTML template, Telethon client — packages into a ~20MB executable that runs on Windows, Linux, and Mac without Python installed.
What I'd Do Differently
Across all these projects, a few patterns came up too late:
- Add structured logging from day one. Debugging async workers without it is painful.
- Implement rate limiting early. It's harder to retrofit than to build in.
- Write integration tests for external API interactions. Mocking SES and Telegram in unit tests is fine, but you also need tests that hit the real API in staging.
- Use dead-letter queues. Failed messages can be silently dropped without them. Currently monitoring these through Celery Flower, but a proper DLQ would be better.
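The DLQ shape is roughly this. A sketch only, with names of my own choosing; in practice `dlq_push` would be something like a Redis `RPUSH` onto a dead-letters list, which is not what the projects currently do:

```python
import json

def deliver_or_dead_letter(send, message, dlq_push):
    """Attempt delivery; on permanent failure, park the message in a
    dead-letter queue for inspection instead of dropping it."""
    try:
        send(message)
        return True
    except Exception as exc:
        dlq_push(json.dumps({"message": message, "error": repr(exc)}))
        return False
```

The payoff is that "failed" becomes a queryable state rather than an absence: you can count, inspect, and replay dead-lettered messages.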
The Recurring Theme
Looking across my projects, the reliable systems share a few things:
- They handle failures explicitly, not hopefully
- They keep the request cycle minimal
- They use boring, proven technology as the default
- They store state where it's cheapest and most appropriate
- They deploy in one command
The details change — Go vs Python, Celery vs goroutines, PostgreSQL vs in-memory maps — but the principles hold.
If you want to see the full implementation details, including architecture diagrams, code samples, and failure mode analysis, check out my case studies.
About me: I'm Ankit Jangwan, a Senior Software Engineer building backend systems, AI integrations, and developer tools. You can see my work at ankitjang.one.