Introduction
Email deliverability is not just a marketing metric. For platform engineers, it is a reliability problem that affects user flows, security workflows, and developer automation. When your platform relies on inbound email for ticket intake, approvals, data ingestion, or automated processing, failure to receive an email is a production incident. The work spans DNS, TLS, queues, parsing, and observability. Treating email as a first-class integration, with the same rigor you apply to APIs and message buses, is how you ensure reliable email receipt at scale.
This guide focuses on inbound email deliverability and production-grade processing. You will learn how to configure DNS correctly, harden TLS, architect webhook ingestion that does not drop messages, parse MIME safely, and monitor the full delivery path. The outcome is a platform where emails are accepted, parsed, and processed with predictable latency and high availability.
Email Deliverability Fundamentals for Platform Engineers
Inbound deliverability vs acceptance
Deliverability is often discussed for outbound campaigns, but inbound deliverability matters too. Think of it as the probability that a message sent from the internet successfully arrives at your MX, your provider accepts it, and your platform processes it end-to-end. Acceptance is the SMTP layer decision to 250-OK the message. Your responsibility continues beyond acceptance, through persistence, parsing, and business logic. Design for zero data loss, idempotent processing, and fast feedback to senders when something goes wrong.
DNS building blocks
- MX records: Publish MX records for a dedicated subdomain used for inbound traffic, such as in.example.com. Point these MX records to your inbound provider or MTA. Prefer multiple MX targets with different priorities for failover.
- NS delegation: For better isolation, delegate a subdomain like in.example.com to the provider's nameservers. This lets the provider manage MX at the subdomain level without touching apex DNS.
- SPF, DKIM, DMARC: These primarily affect outbound trust. They also help your platform evaluate inbound authenticity. Store SPF/DKIM/DMARC results with each message for downstream decisions.
- DNSSEC: Sign your zones to reduce spoofing of MX and MTA-STS records.
TLS and transport hardening
- STARTTLS: Require TLS for SMTP where possible. Track the negotiated cipher and TLS version per message.
- MTA-STS: Publish an MTA-STS policy so sending MTAs use TLS to your MX. Host the policy at https://mta-sts.example.com/.well-known/mta-sts.txt and publish a DNS TXT record at _mta-sts.example.com.
- TLS-RPT: Enable SMTP TLS reporting via a _smtp._tls.example.com TXT record to receive aggregate reports of failures and misconfigurations.
Parsing and payload integrity
Inbound email is MIME, not JSON. Parsing MIME is non-trivial due to nested multiparts, encodings, and large attachments. In production, treat parsing as a distinct step with its own failure modes, backpressure, and metrics. Persist the raw RFC 5322 message so you can reparse if your parser or business rules change.
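As a minimal sketch of that distinct parsing step, Python's stdlib email package can parse a raw RFC 5322 message and walk nested multiparts. The message bytes below are a constructed example, not a provider payload:

```python
# Minimal sketch: parse a raw RFC 5322 message with Python's stdlib email
# package and walk its nested MIME parts. The raw bytes are illustrative.
from email import message_from_bytes
from email.policy import default

raw = (
    b"From: alice@example.com\r\n"
    b"To: support@in.example.com\r\n"
    b"Subject: Invoice attached\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="b1"\r\n'
    b"\r\n"
    b"--b1\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"See attachment.\r\n"
    b"--b1\r\n"
    b"Content-Type: application/pdf\r\n"
    b'Content-Disposition: attachment; filename="invoice.pdf"\r\n'
    b"Content-Transfer-Encoding: base64\r\n"
    b"\r\n"
    b"JVBERi0xLjQ=\r\n"
    b"--b1--\r\n"
)

msg = message_from_bytes(raw, policy=default)

# Walk every leaf part; multipart containers themselves are skipped.
parts = []
for part in msg.walk():
    if part.is_multipart():
        continue
    parts.append((part.get_content_type(), part.get_filename()))

print(msg["Subject"])  # Invoice attached
print(parts)           # [('text/plain', None), ('application/pdf', 'invoice.pdf')]
```

Because the raw bytes are persisted separately, this stage can be rerun whenever the walk logic or routing rules change.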
Practical Implementation
Domain strategy and DNS examples
Use a dedicated inbound subdomain per environment:
- in.prod.example.com for production
- in.stg.example.com for staging
- in.dev.example.com for development
Point MX to your provider and enable MTA-STS. Example using Cloudflare-style records via Terraform:
# MX for inbound subdomain
resource "cloudflare_record" "mx_inbound" {
  zone_id  = var.zone_id
  name     = "in"
  type     = "MX"
  value    = "mx1.inbound.example.net"
  priority = 10
  ttl      = 300
}

resource "cloudflare_record" "mx_inbound_backup" {
  zone_id  = var.zone_id
  name     = "in"
  type     = "MX"
  value    = "mx2.inbound.example.net"
  priority = 20
  ttl      = 300
}

# MTA-STS TXT record
resource "cloudflare_record" "mta_sts" {
  zone_id = var.zone_id
  name    = "_mta-sts"
  type    = "TXT"
  value   = "v=STSv1; id=2024050101;"
  ttl     = 300
}

# TLS-RPT TXT record
resource "cloudflare_record" "tls_rpt" {
  zone_id = var.zone_id
  name    = "_smtp._tls"
  type    = "TXT"
  value   = "v=TLSRPTv1; rua=mailto:tlsrpt@example.com;"
  ttl     = 300
}
MTA-STS policy file served over HTTPS:
version: STSv1
mode: enforce
mx: mx1.inbound.example.net
mx: mx2.inbound.example.net
max_age: 604800
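Before publishing a policy file like this, it is worth sanity-checking the required fields from RFC 8461. A hedged sketch of such a validator (the checks cover only the basic key set, not every constraint in the spec):

```python
# Sketch: validate the basic shape of an MTA-STS policy file before
# publishing it. Field names follow RFC 8461; this is not a full validator.
def parse_mta_sts_policy(text):
    policy = {"mx": []}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "mx":
            policy["mx"].append(value)  # mx may repeat once per pattern
        else:
            policy[key] = value
    assert policy.get("version") == "STSv1", "version must be STSv1"
    assert policy.get("mode") in ("enforce", "testing", "none"), "invalid mode"
    assert policy["mx"], "at least one mx pattern required"
    assert policy.get("max_age", "").isdigit(), "max_age must be an integer"
    return policy

policy = parse_mta_sts_policy("""\
version: STSv1
mode: enforce
mx: mx1.inbound.example.net
mx: mx2.inbound.example.net
max_age: 604800
""")
print(policy["mode"], policy["mx"])
```

Running this in CI against the file you serve at the .well-known URL catches typos before a sending MTA does.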
Webhook ingestion that never drops messages
Your first goal is to acknowledge receipt quickly, then push work to durable storage. Keep the webhook lean and synchronous only for validation and enqueue. Everything else happens asynchronously.
// Node.js - Express, enqueue to SQS or Kafka
import express from "express";
import { Kafka } from "kafkajs";

const app = express();
app.use(express.json({ limit: "25mb" })); // plan for big attachments encoded in payload

const kafka = new Kafka({ clientId: "inbound", brokers: ["kafka1:9092"] });
const producer = kafka.producer();

function verifySignature(req) {
  const sig = req.header("x-signature");
  const ts = req.header("x-timestamp");
  // Perform HMAC verification with the shared secret over body and timestamp,
  // and reject if the timestamp skew is too large. Stubbed here: this only
  // checks that the headers are present.
  return Boolean(sig && ts);
}

app.post("/webhooks/email", async (req, res) => {
  if (!verifySignature(req)) {
    return res.status(401).send("invalid signature");
  }
  // Provide idempotency via message_id from the provider
  const messageId = req.body?.message?.id;
  if (!messageId) {
    return res.status(400).send("missing message id");
  }
  // Immediately enqueue; do not parse MIME here
  await producer.send({
    topic: "inbound-email",
    messages: [{ key: messageId, value: JSON.stringify(req.body) }],
  });
  // Acknowledge quickly
  res.status(202).send("accepted");
});

// Connect the producer once at startup, not on every request
await producer.connect();
app.listen(8080, () => console.log("listening on 8080"));
Parsing stage and business processing
Downstream consumers read from the queue and perform MIME parsing, attachment extraction, and routing to services. If your provider already supplies structured JSON, store it along with the raw MIME for parity checks. Persist to object storage so you can rehydrate messages for reprocessing. Use idempotency keys to guard against retries and duplicates.
# Python - consumer skeleton with idempotency and retries
import json

from some_queue_client import poll_messages, ack
from db import insert_if_absent
from mime_parser import parse_message   # your library or provider payload
from storage import save_to_s3          # thin wrapper around your object store
from routing import route_for, dispatch
from observability import log_error

def handle(payload):
    msg_id = payload["message"]["id"]
    if not insert_if_absent("processed_messages", msg_id):
        return  # duplicate, already processed
    raw = payload["message"]["raw"]  # base64 or URL to raw RFC 5322
    parsed = parse_message(raw)
    # Save artifacts
    save_to_s3(f"raw/{msg_id}.eml", raw)
    save_to_s3(f"json/{msg_id}.json", json.dumps(parsed))
    # Route based on headers or aliases
    route = route_for(parsed)
    # Hand off to your domain services
    dispatch(route, parsed)

for m in poll_messages("inbound-email"):
    try:
        handle(json.loads(m.value))
        ack(m)
    except Exception as e:
        # rely on queue DLQ and backoff
        log_error(e)
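The insert_if_absent helper in the skeleton above is left abstract; one minimal way to implement it is a table with a unique key, sketched here with stdlib sqlite3 in memory (a real deployment would use your shared database):

```python
# Hypothetical insert_if_absent implementation: a unique-keyed table makes the
# database arbitrate duplicates, so concurrent consumers stay idempotent.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (msg_id TEXT PRIMARY KEY)")

def insert_if_absent(table, msg_id):
    """Return True if msg_id was newly recorded, False if already seen."""
    try:
        # Table name comes from our own code, never from message data.
        conn.execute(f"INSERT INTO {table} (msg_id) VALUES (?)", (msg_id,))
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery; caller skips processing

print(insert_if_absent("processed_messages", "msg-001"))  # True, first delivery
print(insert_if_absent("processed_messages", "msg-001"))  # False, retry/duplicate
```

Letting the unique constraint decide avoids the check-then-insert race you would get from a SELECT followed by an INSERT.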
If you want to go deeper on parsing format details and common MIME edge cases, see MIME Parsing: A Complete Guide | MailParse. For webhook reliability patterns, read Webhook Integration: A Complete Guide | MailParse. For API-centric ingestion or polling, explore Email Parsing API: A Complete Guide | MailParse.
When to use REST polling instead of webhooks
Use polling as a fallback during webhook outages, for cold start environments where public ingress is not available, or when you must traverse strict firewalls. Cap the polling interval, use ETags or since-cursors, and rate limit to avoid surprise throttling. Keep both paths enabled so you can switch quickly during incidents.
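The since-cursor pattern can be sketched as a small loop where the provider's list endpoint is injected as a function (fetch_page and its response shape are assumptions, not a real API), so the logic is testable offline:

```python
# Sketch of since-cursor polling. fetch_page stands in for your provider's
# list endpoint and its {"messages": [...], "next_cursor": ...} shape is an
# assumption made for illustration.
def poll_once(fetch_page, cursor):
    """Fetch one page of messages newer than cursor; return (messages, new_cursor)."""
    page = fetch_page(since=cursor)
    messages = page["messages"]
    # Persist the new cursor only after the page is durably enqueued, so a
    # crash between fetch and enqueue re-reads messages instead of skipping.
    new_cursor = page.get("next_cursor", cursor)
    return messages, new_cursor

# Fake provider for illustration: one page of messages, then an empty page.
def fake_fetch(since):
    data = {
        None: {"messages": ["m1", "m2"], "next_cursor": "c1"},
        "c1": {"messages": [], "next_cursor": "c1"},
    }
    return data[since]

msgs, cur = poll_once(fake_fetch, None)
print(msgs, cur)  # ['m1', 'm2'] c1
msgs, cur = poll_once(fake_fetch, cur)
print(msgs, cur)  # [] c1
```

Because the cursor advances only after durable enqueue, the fallback path delivers at-least-once, and the idempotency key downstream absorbs the occasional re-read.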
Tools and Libraries
- DNS and TLS: dig, nslookup, openssl s_client -starttls smtp, Hardenize, CheckTLS, and SMTP TLS Reporting aggregators.
- Observability: Prometheus metrics for webhook latency, acceptance rate, parse duration, and queue depth. Grafana dashboards. OpenTelemetry spans across webhook, queue, and parsing stages.
- Queues: Kafka for high-throughput topics, SQS for managed queues with DLQ, or NATS for lightweight streaming. Use a dedicated topic or queue per environment.
- MIME parsing: dkimpy for DKIM verification in Python, mailparser or equivalent libraries for Node and Go, libmagic for attachment type detection. Validate the declared MIME type against magic bytes.
- Security: Hash attachments with SHA-256, scan using ClamAV or a managed scanning service, and quarantine suspicious content.
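The magic-byte check can be illustrated without libmagic by comparing a declared MIME type against a file's leading bytes. The signature table below covers only a few common formats; libmagic knows far more:

```python
# Illustrative magic-byte check: a declared MIME type is trusted only if the
# file's leading bytes are consistent with it. Signatures here are a small,
# well-known subset; use libmagic in production.
MAGIC = {
    "application/pdf": b"%PDF-",
    "image/png": b"\x89PNG\r\n\x1a\n",
    "image/jpeg": b"\xff\xd8\xff",
    "application/zip": b"PK\x03\x04",
}

def matches_declared_type(declared, data):
    """Return True if data's magic bytes do not contradict the declared type."""
    sig = MAGIC.get(declared)
    if sig is None:
        return True  # no signature on file for this type; cannot contradict
    return data.startswith(sig)

print(matches_declared_type("application/pdf", b"%PDF-1.7 rest..."))  # True
print(matches_declared_type("application/pdf", b"MZ\x90\x00"))        # False: executable
```

A mismatch is a strong signal to quarantine the attachment rather than pass it downstream under its declared type.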
Providers that deliver structured JSON and original MIME remove a huge parsing burden. With MailParse, you can provision instant email addresses, receive inbound messages on a dedicated subdomain, and consume normalized JSON over webhooks or a REST polling API without reimplementing MIME edge case handling.
Common Mistakes Platform Engineers Make with Email Deliverability
- Mixing production and staging MX records: Keep a separate inbound subdomain for each environment. Never point staging MX to production parsing pipelines.
- Slow webhooks: Doing MIME parsing synchronously in the webhook path causes timeouts and retries. Acknowledge fast, enqueue, and parse asynchronously.
- No idempotency: Retries from the provider or your queue will insert duplicates without a stable key like
message.idor the RFC 5322Message-Id. - Ignoring attachment limits: Accept large emails up to your provider's maximum, then apply size-based routing and streaming storage to avoid memory spikes.
- Skipping TLS hardening: Without MTA-STS, senders may downgrade or fail TLS. Enable TLS-RPT so you can see who is failing to negotiate secure transport.
- Assuming SPF/DKIM are only for send: Record the verification results for inbound messages and feed them into anti-abuse or routing logic.
- No backpressure or DLQ: Spikes in attachment size or parsing complexity will create backlogs. Use consumer concurrency controls, circuit breakers, and DLQs.
- Dropping raw MIME: Always store the original .eml. You will need it for reprocessing, security forensics, and parser upgrades.
Advanced Patterns
Tenant-aware addressing and routing
Provide each tenant with a unique sub-address like support+tenant123@in.example.com or a dedicated alias. Map the recipient to a tenant ID during ingestion, then embed it in your message metadata, topic partition key, and storage path for clean multi-tenancy.
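Extracting the tenant from a plus-addressed recipient is a small string operation; the address shapes below are illustrative examples:

```python
# Sketch: map a plus-addressed recipient like support+tenant123@in.example.com
# to a (mailbox, tenant_id) pair for multi-tenant routing.
def tenant_from_recipient(address):
    """Return (mailbox, tenant_id); tenant_id is None without a sub-address."""
    local, _, domain = address.rpartition("@")  # rpartition: '@' may not recur in local part
    mailbox, plus, tag = local.partition("+")
    return mailbox, (tag if plus else None)

print(tenant_from_recipient("support+tenant123@in.example.com"))  # ('support', 'tenant123')
print(tenant_from_recipient("support@in.example.com"))            # ('support', None)
```

The resulting tenant ID then flows into the partition key and storage path, keeping each tenant's messages grouped end to end.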
Attachment streaming and offloading
Stream large attachments directly to object storage to avoid memory pressure, using signed URLs. Persist only metadata and object keys in your database. Enforce extension and MIME-type allow lists. Apply antivirus scanning and checksum verification asynchronously before exposing files to downstream systems.
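Checksum verification fits the same streaming discipline: hash in chunks so a multi-gigabyte attachment never sits fully in memory. A minimal sketch (the 64 KiB chunk size is an arbitrary choice):

```python
# Sketch: incremental SHA-256 over a file-like object, so attachments of any
# size are hashed in constant memory. Chunk size is an arbitrary default.
import hashlib
import io

def sha256_stream(fileobj, chunk_size=64 * 1024):
    """Hash a binary file-like object incrementally; return the hex digest."""
    h = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

digest = sha256_stream(io.BytesIO(b"attachment bytes"))
print(digest)
```

The same loop shape works for streaming the bytes to object storage, so hashing and upload can share one pass over the data.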
Authentication and trust signals
- DKIM and ARC: Verify DKIM signatures and, when available, ARC chains. Store pass or fail with reasons. Use this to downrank spoofed emails.
- SPF alignment: Compute whether the envelope sender aligns with the From domain. Record alignment results to support abuse detection.
- DMARC evaluation: If the sender publishes DMARC, store the policy and whether the message would pass alignment, even if you are the receiver.
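Alignment itself is a domain comparison. The sketch below uses a naive two-label heuristic for the organizational domain, which is an assumption that breaks on public suffixes like co.uk; a real implementation consults the Public Suffix List:

```python
# Naive alignment sketch. Relaxed alignment compares organizational domains;
# the two-label heuristic below is a stand-in for a Public Suffix List lookup
# and is wrong for suffixes like co.uk.
def org_domain(domain):
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])

def spf_aligned(envelope_from_domain, header_from_domain, strict=False):
    """Relaxed: organizational domains match. Strict: exact domain match."""
    if strict:
        return envelope_from_domain.lower() == header_from_domain.lower()
    return org_domain(envelope_from_domain) == org_domain(header_from_domain)

print(spf_aligned("bounce.example.com", "example.com"))        # True (relaxed)
print(spf_aligned("bounce.example.com", "example.com", True))  # False (strict)
```

Storing the alignment result alongside the raw SPF/DKIM verdicts lets routing and abuse logic reason about spoofing without re-deriving it.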
High availability and failover
- Multi-MX, multi-region: Ensure your MX targets represent different regions or providers. Test failover by blackholing one MX and measuring acceptance continuity.
- Dual-path ingestion: Keep both webhooks and REST polling available. Switch to polling during network incidents or WAF lockouts.
- Replay pipeline: Build a replay tool that can re-enqueue messages from object storage into the queue for reprocessing.
Operational excellence
- SLIs and SLOs: Define SLIs for time-to-parse, time-to-first-byte on webhook, acceptance rate, and percent of messages with verified DKIM. Set SLOs appropriate to your business flows.
- Alerting: Alert on rising webhook 5xx, queue depth, DLQ rate, and TLS-RPT failures mentioning your MX hosts.
- Runbooks: Document DNS changes, certificate renewals for MTA-STS hostnames, and playbooks for greylisting and throttling from specific sending networks.
Conclusion
Email deliverability for platform engineers is a reliability discipline. Set up DNS correctly, enforce TLS with MTA-STS, and operate a lean webhook that never blocks. Parse MIME in a dedicated stage, persist raw messages, and process idempotently through durable queues. Add observability at each hop and treat deliverability signals like SPF, DKIM, and DMARC as first-class metadata. With the right patterns, inbound email becomes a dependable event source that fuels internal platforms and developer tools.
FAQ
What is the difference between inbound deliverability and spam filtering?
Inbound deliverability is about ensuring the message reaches your MX, your provider accepts it, and your platform processes it reliably. Spam filtering is a classification step that may quarantine or route messages after acceptance. Keep these concerns separate. First guarantee reliable transport and storage, then apply classification and policy.
Should I point my apex domain MX to my inbound provider?
Prefer a delegated subdomain like in.example.com for application ingestion. Keep user mailboxes and corporate mail separate from automated processing. This reduces blast radius, simplifies policy, and prevents your application logic from interfering with human mail workflows.
How do I monitor end-to-end email processing?
Instrument at four layers: DNS and TLS health, SMTP acceptance metrics, webhook or polling latency and error rates, and downstream parsing plus business processing success. Use distributed tracing across webhook, queue, and parser. Consume TLS-RPT to detect transport failures at remote senders and alert when reports spike.
What timeout and retry settings are safe for webhooks?
Keep webhook processing under 500 ms, acknowledge with 2xx promptly, and rely on the queue for the heavy work. If the provider retries, ensure exponential backoff and a maximum retry window that does not exceed your idempotency retention window. Always make your endpoint idempotent using stable message IDs.
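To check that a retry policy fits inside your idempotency retention window, you can compute the schedule explicitly. The numbers below are illustrative defaults, not provider settings:

```python
# Sketch: compute an exponential backoff schedule and its total window, so the
# maximum retry span can be compared against idempotency-key retention.
# base/factor/cap values are illustrative.
def backoff_schedule(base_seconds=30, factor=2, max_retries=6, cap_seconds=3600):
    delays = []
    delay = base_seconds
    for _ in range(max_retries):
        delays.append(min(delay, cap_seconds))
        delay *= factor
    return delays

delays = backoff_schedule()
print(delays)       # [30, 60, 120, 240, 480, 960]
print(sum(delays))  # 1890 seconds total retry window
```

If the total window exceeded your dedup-key retention, a final retry could slip past idempotency and be processed twice, so retention should comfortably exceed this sum.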
How should I handle very large attachments?
Increase webhook body size limits carefully or accept a pointer to stored MIME on the provider. Stream attachments directly to object storage, do not hold them in memory. Scan asynchronously, compute checksums, and move business processing forward only after files are verified and persisted.