Recurring payments are async, stateful, and fail in ways you don't expect.

We built the payment infrastructure for GoMarble's AI marketing platform, processing renewals daily across Stripe and Cashfree. The system looks simple on the surface — create a subscription, charge every month, cancel when asked. In production, it's a distributed system where lack of confirmation leaves you guessing, and ordering matters.
This is how recurring payments actually work: what data lives where, why webhooks are necessary but insufficient, and the edge cases that break your mental model.
1. The Deceptively Simple Model
The conceptual model of a subscription is straightforward:
Create → Activate → Renew → Cancel

This works in a tutorial. In production, each arrow hides complexity:
Create: API call to provider, pending state in your DB, waiting for payment
Activate: Webhook arrives (maybe), state syncs (eventually), user gets access
Renew: Provider charges automatically, webhook arrives (hopefully), invoice generated
Cancel: Immediate or at period end? Provider supports both? What if the API call fails?
The arrows are async. The states are distributed. And the absence of an event proves nothing.
Checkout completed ≠ payment settled
Here's a trap: when a user completes checkout, the provider says "checkout succeeded." That doesn't mean money moved.
"Checkout succeeded" means the user finished the flow and authorized the payment. But the actual charge might still be pending. The card could decline on the first attempt. The subscription might be in trial with no charge yet.
Activating the subscription immediately after checkout grants access before the money has settled, which is a recipe for unpaid access and support headaches.
The pattern that works: treat checkout and payment as separate events.
Checkout completed: User linked their payment method. Record the subscription as pending or authorized.
Payment settled: Money actually moved. The provider sends an event like invoice.paid or payment.succeeded. This is when you activate.
Example: Stripe's checkout.session.completed fires when the user finishes the checkout flow. invoice.paid fires when the charge succeeds — which might be seconds later, or hours later if the charge retries. Activating on the wrong event creates drift between "user thinks they paid" and "you granted access."
Treat "checkout succeeded" as a promise, not proof. Wait for the payment event.
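A minimal sketch of that separation, assuming Stripe-style event names and a hypothetical in-memory Subscription record (the status names and helper functions are illustrative, not a provider API):

```python
from dataclasses import dataclass

# Event names modelled on the examples above; other providers use different names.
CHECKOUT_EVENTS = {"checkout.session.completed"}
PAYMENT_EVENTS = {"invoice.paid", "payment.succeeded"}


@dataclass
class Subscription:
    id: str
    status: str = "created"  # created -> pending -> active


def apply_event(sub: Subscription, event_type: str) -> Subscription:
    """Advance local status; only a payment event grants access."""
    if event_type in PAYMENT_EVENTS:
        sub.status = "active"
    elif event_type in CHECKOUT_EVENTS and sub.status != "active":
        # Checkout is a promise, not proof. If this event shows up after
        # the payment event, it must not move an active subscription back.
        sub.status = "pending"
    return sub


def has_access(sub: Subscription) -> bool:
    return sub.status == "active"
```

The guard on the checkout branch matters because the two events can arrive in either order.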
2. The Two States Problem
A subscription isn't a single database row. It's state split across two separate systems:

The provider side stores financial truth: Did the charge succeed? When is the next billing date?
Your side stores application truth: Who is this user? What plan features should they access? Are they scheduled to downgrade?
You can't trust just one state. The provider doesn't know your users; you don't know if cards declined. Both states must stay in sync, and the primary mechanism for that is webhooks.
3. Three Ways to Sync State
You have three options for keeping your state synchronized with the provider:

Why Webhooks Win (Mostly)
Webhooks are event-driven. When something happens — payment succeeds, subscription cancels, renewal fails — the provider makes an HTTPS POST to your server with signed event data.
✅ Secure: Signature verification prevents spoofing.
✅ Real-time: Updates arrive shortly after events happen.
✅ Complete: Includes all state changes, not just API-initiated ones.
But they aren't perfect. They are eventually consistent, delivery isn't guaranteed, and events can arrive out of order.
The minimum data model that makes this real
You don't need a big billing system to do this correctly. You need a small set of persistent facts that let you dedupe events, detect drift, and answer "why does this user have access?"
subscriptions: provider, provider_subscription_id, status, stage (scheduled intent), current_period_end, plan_id, updated_at.
provider_events: event_id (unique), provider, event_type, event_time, received_at, processed_at, error (plus optional payload/hash for audits/replay).
Example: Stripe gives you an event id (evt_xxx) and event time; Cashfree uses signature hashes. Providers use globally unique formats, so a simple unique constraint on event_id works for deduplication.
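As a rough sketch, the two tables and the dedupe check could look like this in SQLite. The column types, the open_db and record_event helpers, and the in-memory database are assumptions for illustration; only the column names come from the list above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE subscriptions (
    id INTEGER PRIMARY KEY,
    provider TEXT NOT NULL,
    provider_subscription_id TEXT NOT NULL,
    status TEXT NOT NULL,
    stage TEXT,                  -- scheduled intent, NULL when nothing is pending
    current_period_end TEXT,
    plan_id TEXT,
    updated_at TEXT
);
CREATE TABLE provider_events (
    event_id TEXT NOT NULL UNIQUE,
    provider TEXT NOT NULL,
    event_type TEXT NOT NULL,
    event_time TEXT,
    received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    processed_at TEXT,
    error TEXT,
    payload TEXT                 -- optional raw body for audits/replay
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_event(conn, event_id, provider, event_type, event_time, payload=None) -> bool:
    """Store the event. Returns False when the insert violates a constraint,
    which for well-formed input means this event_id was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO provider_events "
                "(event_id, provider, event_type, event_time, payload) "
                "VALUES (?, ?, ?, ?, ?)",
                (event_id, provider, event_type, event_time, payload),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Letting the database reject the duplicate keeps the check atomic, which a "select, then insert" sequence would not be under concurrent deliveries.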
4. Why Webhooks Aren't Enough
Webhooks handle the happy path beautifully. They break in three specific ways:
Problem 1: Missed Webhooks
Your server was restarting when the renewal webhook arrived. The provider retries for 72 hours, but if your server is down that entire time, the event is lost forever. Your DB shows active; the provider shows past_due.
Problem 2: Duplicate Webhooks
Provider retry logic can send the same event multiple times. If you don't handle idempotency, you might process a payment twice.
Problem 3: Out-of-Order Webhooks
Events don't always arrive chronologically. This is chaos for state management.

If your webhook handler assumes chronological order, you'll overwrite correct state with stale data.
The Solution: Idempotency + Guarded Updates
Positive transitions (payment success, activation) apply immediately. Negative transitions (failed, cancelled) check with the provider first — this prevents out-of-order "failed" webhooks from incorrectly suspending active subscriptions.
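One way to express that guard, as a sketch only: the status names and the fetch_provider_status callable are hypothetical stand-ins for your own status vocabulary and a provider API lookup.

```python
from typing import Callable

POSITIVE = {"active"}
NEGATIVE = {"past_due", "failed", "canceled"}


def next_status(current: str, incoming: str,
                fetch_provider_status: Callable[[], str]) -> str:
    """Decide the new local status for an incoming webhook status."""
    if incoming in POSITIVE:
        # Positive transitions apply immediately, no provider call needed.
        return incoming
    if incoming in NEGATIVE:
        # Negative transitions are confirmed against the provider first.
        live = fetch_provider_status()
        if live in NEGATIVE:
            return live
        # The provider says the subscription is healthy, so this webhook
        # is stale (for example, a "failed" that arrived after the retry paid).
        return current
    return current  # unknown status: leave local state untouched
```

The extra API call only happens on the negative path, so the cost stays low while the most damaging mistake (suspending a paying user) is covered.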
Webhook ingestion in production: receive, persist, process
A resilient webhook handler stores the event, processes it, then responds:
Verify: validate signature + basic schema.
Persist: write a row to provider_events with unique constraint on event_id.
Process: execute business logic (subscription updates, emails, analytics).
Ack: return 2xx after completion.
For high-volume systems, consider an async pipeline: persist → enqueue for background worker → ack immediately (<200ms). This prevents provider retries due to slow downstream work. Start simple; optimize when webhook processing time consistently exceeds 3-5 seconds.
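The synchronous version of those four steps can be sketched as below. The HMAC-SHA256 scheme is a generic assumption (each provider defines its own header format and signing rules), and the dict-based store stands in for the provider_events table.

```python
import hashlib
import hmac
import json


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


def handle_webhook(secret, body, signature_hex, store, process) -> int:
    """Verify, persist, process, ack. Returns the HTTP status to send back."""
    if not verify_signature(secret, body, signature_hex):
        return 400
    try:
        event = json.loads(body)
        event_id = event["id"]
    except (ValueError, KeyError, TypeError):
        return 400
    # Persist before processing; setdefault keeps the first copy on redelivery.
    record = store.setdefault(event_id, {"event": event, "processed": False, "error": None})
    if record["processed"]:
        return 200  # duplicate delivery: ack without reprocessing
    try:
        process(event)
    except Exception as exc:
        record["error"] = str(exc)
        return 500  # non-2xx asks the provider to retry later
    record["processed"] = True
    record["error"] = None
    return 200
```

Tracking processed separately from "seen" is deliberate: if processing fails, the provider's retry must be allowed to run the business logic again rather than being dropped as a duplicate.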
5. The Safety Net: Daily Enforcement
Webhooks handle 99%+ of cases. The daily cron enforces local invariants and handles scheduled operations.

The cron runs every 24 hours and enforces local invariants: suspending subscriptions past their expiry (with a grace period), executing scheduled cancellations and downgrades, and alerting on impossible states (such as an active subscription with no expiry date). It doesn't query the provider to reconcile drift. It operates on your local data and calls the provider API only when enforcing a decision (like canceling at the gateway).
Optional enhancement: Add drift detection by periodically querying the provider's subscription list and comparing statuses. This catches webhooks missed during outages, but adds complexity and API quota usage.
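A sketch of the local enforcement pass, assuming subscriptions are plain dicts and a three-day grace period (both are illustrative choices, not values from our system):

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(days=3)  # assumed grace period


def enforce_invariants(subs, now):
    """Return (ids_to_suspend, ids_to_alert_on) using local data only."""
    to_suspend, alerts = [], []
    for sub in subs:
        if sub["status"] != "active":
            continue
        period_end = sub.get("current_period_end")
        if period_end is None:
            # Impossible state: active but no expiry date.
            alerts.append(sub["id"])
        elif now > period_end + GRACE:
            to_suspend.append(sub["id"])
    return to_suspend, alerts
```

Returning the decisions instead of applying them inside the loop makes the job easy to dry-run before it touches real subscriptions.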
6. The 'Stage' Field: Handling Scheduled Changes
The status field tracks current state (e.g., active). The stage field tracks scheduled operations.
Why? If a user on a $50/month plan cancels on day 5, they should keep access until day 30. Not all providers handle this "cancel at period end" logic.
User clicks "cancel": Status remains active. We set stage = "cancellation_requested".
Cron runs near period end: Sees the stage, calls the provider's immediate cancel API. Status becomes canceled.
User resubscribes mid-month: We just clear the stage field.
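The three steps above can be sketched as follows. The Sub dataclass, the one-day window for "near period end", and the cancel_at_provider callable are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


@dataclass
class Sub:
    status: str
    current_period_end: datetime
    stage: Optional[str] = None  # scheduled intent, separate from status


def request_cancel(sub: Sub) -> None:
    sub.stage = "cancellation_requested"  # status stays "active"


def resubscribe(sub: Sub) -> None:
    sub.stage = None  # nothing was cancelled at the provider yet


def run_cron(sub: Sub, now: datetime, cancel_at_provider: Callable[[], None],
             window: timedelta = timedelta(days=1)) -> None:
    due = now >= sub.current_period_end - window
    if sub.status == "active" and sub.stage == "cancellation_requested" and due:
        # If this call raises, status and stage are left untouched,
        # so the next cron run tries again.
        cancel_at_provider()
        sub.status = "canceled"
        sub.stage = None
```

Because the provider is only contacted at the end of the period, undoing a cancellation mid-cycle is a purely local change.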
Upgrades and downgrades aren't the same operation
When a user changes plans, the obvious approach is: delete the old subscription, create a new one. That works in a spreadsheet. In production, upgrades and downgrades need different timing.
Upgrades should happen immediately: User pays more and expects features right now. Prorate the old plan (credit unused time), charge the difference, and switch them instantly. Most providers support this pattern. For example, Stripe's proration_behavior parameter computes the prorated delta when the plan switches: 'create_prorations' adds it to the next invoice, while 'always_invoice' bills it right away.
Downgrades should wait until period end: User already paid for this month. Downgrading them immediately means you owe a refund — messy accounting, bad UX. Better: let them keep the current plan until their billing date, then switch. Track the pending change with a field like stage = "downgrade_scheduled", and let your cron apply it when the period ends.
Why the asymmetry? Upgrades increase revenue and meet user expectations ("I just paid, give me access"). Downgrades preserve value delivery and avoid refund complexity.
One edge case: what if an upgrade payment fails? (Card declined when charging the difference.) The system needs to handle rollback — revert to the original plan, notify the user, and keep access stable. Avoid leaving the user in a half-upgraded state where they lost features but weren't charged.
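To make the proration arithmetic concrete, here is a day-based simplification. Providers usually do this calculation for you and may prorate at finer granularity, and the function name and parameters are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP


def prorated_upgrade_charge(old_price: str, new_price: str,
                            days_in_period: int, days_remaining: int) -> Decimal:
    """Amount to charge now: the price difference for the unused part of the period."""
    fraction = Decimal(days_remaining) / Decimal(days_in_period)
    delta = (Decimal(new_price) - Decimal(old_price)) * fraction
    return delta.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Upgrading from $50 to $100 halfway through a 30-day period:
# credit $25 for the unused half of the old plan, owe $50 for the
# same half of the new plan, so the charge is $25.00.
charge = prorated_upgrade_charge("50", "100", days_in_period=30, days_remaining=15)
```

Using Decimal rather than float avoids rounding surprises when the amounts end up on an invoice.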
7. Production Edge Cases
Theory is clean. Production has sharp edges.


8. The Architecture
Putting it all together, here is the payment service architecture designed for failure recovery.

9. Plan Changes: Upgrades, Downgrades, and Migrations
When users change plans, the obvious approach is to delete the old subscription and create a new one. That works in a tutorial. In production, timing matters.
Upgrades: Immediate with Proration
User pays more, expects features immediately. The pattern: charge the prorated difference and switch instantly. Most providers support this. With Stripe, proration_behavior: 'create_prorations' calculates the delta and switches the plan in one API call, adding the prorated amount to the next invoice; 'always_invoice' bills it immediately.
Edge case: What if the upgrade payment fails? You need rollback logic to revert to the original plan without leaving the user in a broken state. Don't assume the payment will succeed.
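A sketch of that rollback path, under the assumption that the provider call both switches the plan and attempts the charge. PaymentDeclined, switch_and_charge, revert_at_provider and notify are all hypothetical names.

```python
class PaymentDeclined(Exception):
    """Raised by the (hypothetical) provider wrapper when the charge fails."""


def upgrade_plan(sub: dict, new_plan: str, switch_and_charge,
                 revert_at_provider, notify) -> bool:
    original_plan = sub["plan_id"]
    try:
        switch_and_charge(new_plan)
    except PaymentDeclined:
        # Put the provider back on the original plan and tell the user.
        revert_at_provider(original_plan)
        notify(f"Upgrade to {new_plan} failed; you remain on {original_plan}.")
        return False
    # Local state changes only after the charge succeeded, so the user is
    # never left half-upgraded on our side.
    sub["plan_id"] = new_plan
    return True
```

Updating the local record last keeps feature access stable for the whole duration of the attempt.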
Downgrades: Wait Until Period End
User already paid for this month. Downgrading immediately means you owe a refund—messy accounting, bad UX. Better pattern: keep them on the current plan until billing date, then switch.
Use the stage field: set stage = "downgrade_scheduled". Your daily cron checks for approaching period ends and applies the change. The user keeps access they paid for, and you avoid refund complexity.
Provider Migrations: The Hard Mode
User switches from one payment provider to another (e.g., Cashfree → Stripe) mid-cycle. This requires careful coordination:
Backdate the new subscription to align billing cycles with the old provider
Apply credit for unused time from the old provider (prorate remaining days)
Cancel the old subscription only after the new one activates successfully
Handle race conditions: webhooks from both providers can arrive simultaneously
The critical pattern: create the new subscription record in your DB before calling the provider API. This prevents webhook race conditions where the provider's webhook arrives before your API response.
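A sketch of that ordering, assuming the local row's ID is passed to the new provider as metadata so an early webhook can find it. The function names, the dict-based db, and the metadata mechanism are illustrative assumptions.

```python
import uuid


def start_migration(db: dict, user_id: str, old_subscription_id: str,
                    create_remote) -> str:
    # 1. Create the local record first, in a pending state.
    local_id = str(uuid.uuid4())
    db[local_id] = {
        "user_id": user_id,
        "status": "pending",
        "provider_subscription_id": None,
        "replaces": old_subscription_id,
    }
    # 2. Only then call the provider, handing it local_id as metadata.
    remote_id = create_remote(local_id)
    db[local_id]["provider_subscription_id"] = remote_id
    return local_id


def on_new_subscription_active(db: dict, local_id: str, remote_id: str,
                               cancel_old) -> None:
    """Webhook handler: may run before start_migration has returned."""
    row = db[local_id]
    row["provider_subscription_id"] = remote_id
    if row["status"] != "active":
        row["status"] = "active"
        # Cancel the old subscription only once the new one is active.
        cancel_old(row["replaces"])
```

The status check makes the handler safe to run twice, so a duplicate activation webhook does not cancel the old subscription a second time.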
10. The Audit Trail You'll Wish You Had
When a user says "I cancelled but got charged," you need to prove what happened and when. An audit log is your source of truth for disputes, debugging, and compliance.
What to log for every subscription change
This structure answers the critical questions:
Who made the change? (User action vs. webhook vs. system)
Why was it changed? (Payment success vs. manual cancellation vs. drift correction)
What changed? (Before/after snapshots for every field)
When did it happen? (Timestamp with timezone)
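A record that answers these four questions could look like the sketch below. The AuditEntry fields and the update_subscription wrapper are hypothetical, and an append-only list stands in for an audit table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    subscription_id: str
    actor: str    # who: "webhook", "user", "cron" or "admin"
    reason: str   # why: e.g. "payment success", "manual cancellation"
    before: dict  # what: snapshot before the change
    after: dict   # what: snapshot after the change
    at: str = field(  # when: UTC timestamp with explicit offset
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def update_subscription(sub: dict, changes: dict, actor: str,
                        reason: str, log: list) -> dict:
    """Apply changes and append one audit entry for them."""
    before = dict(sub)
    sub.update(changes)
    log.append(AuditEntry(sub["id"], actor, reason, before, dict(sub)))
    return sub
```

Routing every write through one function is what keeps the log complete; a single direct update elsewhere leaves a gap in the timeline.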
Use cases where audit logs save you

Other scenarios audit logs solve:
Compliance: "Show me all subscription changes for user X in the last 90 days"
Debugging: "Why is this user still active when their card declined?"
Forensics: "Did this webhook actually arrive, or did we miss it?"
Rollback: "What was the state before this bad deployment?"
Implementation tip: For every updateSubscription call, log the actor (webhook/user/cron/admin), the reason, and the before/after state. Your future self will thank you during the first payment dispute.
11. Testing & Debugging (Without Breaking Production)
Testing webhooks isn't something you can "just deploy and hope." You need a way to simulate events locally, replay past failures, and trace what happened when a user says "my payment went through but I still don't have access."
Testing webhook handlers before going live
Provider test/sandbox mode: Most providers (Stripe, Cashfree, PayPal) offer test environments that send real webhooks. Tools like the Stripe CLI (stripe listen --forward-to localhost:3000/webhooks) or ngrok forward them to localhost.
Replay from an event log: If you store raw webhook payloads (as recommended in the data model section), build an internal admin endpoint that re-processes an old event by ID. Idempotency guards protect against double-application, and you can safely test "what if this event arrived now?"
Synthetic events: Craft JSON matching the provider's schema and POST it to your handler. In test mode, skip signature verification or use a test signing secret. This pattern works well for unit/integration tests.
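For the synthetic-event approach, a small helper can build a payload and sign it with a test secret. The payload shape and the plain HMAC-SHA256 scheme are assumptions; match them to whatever your handler and provider actually use.

```python
import hashlib
import hmac
import json

TEST_SECRET = b"test_signing_secret"  # hypothetical, never a live secret


def make_signed_event(event_id: str, event_type: str, subscription_id: str):
    """Return (raw_body, signature_hex) for POSTing to a webhook handler under test."""
    body = json.dumps(
        {"id": event_id, "type": event_type,
         "data": {"subscription_id": subscription_id}},
        sort_keys=True,
    ).encode()
    signature = hmac.new(TEST_SECRET, body, hashlib.sha256).hexdigest()
    return body, signature


body, signature = make_signed_event("evt_test_1", "invoice.paid", "sub_123")
```

Sign the exact bytes you send: re-serializing the JSON between signing and posting changes the body and invalidates the signature.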
Debugging production issues
When a user reports "I paid but nothing happened," a timeline is everything. The system should log every incoming webhook with:
Provider event ID, event type, and timestamp
Subscription ID it references
Before/after status (what changed)
Any errors or skipped updates
With that, you can trace scenarios like: "invoice.paid arrived at 10:03:22 and updated the subscription to active, but the user's access-check at 10:03:15 happened 7 seconds too early — they just need to refresh."
Or: "The webhook never arrived. The provider shows it was sent at 10:00:00 but got a 500 response. The server was restarting. Manual intervention needed or wait for the next retry."
Production payment bugs are almost always timing issues or missed events. A good event log turns "I don't know what happened" into "here's the exact sequence, and here's the gap."
Conclusion
Subscriptions in production are distributed systems. You don't control both states. And silence is ambiguous — it might mean success, failure, or "check back later."
The architecture that works relies on redundancy:
✅ Webhooks as primary sync mechanism (with idempotency).
✅ Daily cron as a safety net (for expiry enforcement and scheduled changes, plus optional drift detection).
✅ 'Stage' field for scheduling (isolating future intent from current status).
✅ Audit logs for every change (actor, reason, before/after state).
✅ Alerting on "impossible" states (like multiple active subscriptions).
If you're building this for the first time: start with Stripe. Their features like built-in proration and cancel-at-period-end save weeks of engineering. If you must support multiple providers, build the abstraction layer and the safety nets from day one.
Written by Abhinav Singhal, GoMarble Engineering Team. Building AI tools for performance marketers at gomarble.ai.