Why Email to JSON matters for DevOps and SRE teams
Operational pipelines do not stop at HTTP. Vendors, legacy systems, and humans often send alerts, approvals, and data imports by email. Turning those raw messages into clean JSON unlocks automation that fits your current stack. For DevOps engineers, email-to-json eliminates POP or IMAP polling, reduces ad hoc glue code, and makes email just another event source for your observability and workflow engines.
Whether you manage MX records and inbound routing or you own the internal event bus, reliable email ingestion affects uptime and on-call latency. Clean, structured JSON lets you push email content into queues, functions, or chat tools with predictable behavior. It also supports compliance, since you can store the raw EML alongside normalized JSON for auditing and replay.
Email to JSON fundamentals for DevOps engineers
Understand the inbound path
- DNS and MX records: Use a dedicated subdomain for inbound workflows, for example inbound.example.com, and point MX to your ingress provider or MTA.
- SMTP reception: Your inbound edge should accept TLS, validate sender reputation where possible, and enforce connection limits and size caps.
- Routing: Plus addressing and unique aliases provide a clean mapping from recipient to tenant or workflow, for example ticket+123@inbound.example.com.
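The alias-to-workflow mapping can be sketched in a few lines. The regex, field names, and example addresses below are illustrative assumptions, not any particular provider's API:

```python
import re

# Split a plus-addressed recipient such as "ticket+123@inbound.example.com"
# into routing components: workflow (local part), tag (after "+"), domain.
ALIAS_RE = re.compile(r'^(?P<user>[^+@]+)(?:\+(?P<tag>[^@]+))?@(?P<domain>.+)$')

def route_recipient(address):
    m = ALIAS_RE.match(address.strip().lower())
    if not m:
        return None  # malformed recipient; route to a dead-letter queue
    return {
        'workflow': m.group('user'),   # e.g. "ticket"
        'tag': m.group('tag'),         # e.g. "123"; None when absent
        'domain': m.group('domain'),   # e.g. "inbound.example.com"
    }
```

Keeping this mapping pure and table-driven makes it easy to unit test and to reuse at both the SMTP edge and the parser.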
MIME is the schema of email
Every email is a tree of parts. To build a stable email-to-json schema, you need to parse MIME accurately. Key structures:
- Headers: From, To, Cc, Subject, Message-ID, In-Reply-To, References, Date, Return-Path, List-*. Preserve original casing and provide a normalized map.
- Body variants: text/plain and text/html may appear together or individually. Some messages include only HTML. Handle both.
- Attachments: Binary files, inline images, and calendar invites. Decode content-transfer-encoding and charsets. Keep each attachment's metadata and provide a safe access path.
- Encodings: Quoted-printable, base64, and tricky charsets. Always normalize to UTF-8 in JSON output, retaining the original charset in metadata.
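As a sketch of that normalization step, using only the Python standard library and assuming the transfer encoding and charset come from the part's headers:

```python
import base64
import quopri

# Normalize a MIME part's raw bytes to UTF-8 text while recording the
# declared source charset in metadata, as suggested above.
def decode_part(raw, transfer_encoding, charset):
    if transfer_encoding == 'base64':
        raw = base64.b64decode(raw)
    elif transfer_encoding == 'quoted-printable':
        raw = quopri.decodestring(raw)
    # errors='replace' keeps the pipeline alive on mislabeled charsets
    text = raw.decode(charset or 'utf-8', errors='replace')
    return {'text': text, 'sourceCharset': charset or 'utf-8'}
```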
Deep dive: MIME Parsing: A Complete Guide | MailParse
A practical JSON schema
Design a schema that downstream services can rely on. Aim for a minimal core with extensible metadata:
- id: Stable unique ID for the message; prefer Message-ID with a collision-resistant fallback.
- timestamp: Parsed RFC 5322 date in ISO 8601.
- from, to, cc, bcc: Arrays of structured addresses, for example { name, address }.
- subject, text, html: Canonical body fields, with HTML sanitized at render time.
- attachments: Array with { filename, contentType, size, contentId, disposition, sha256, url or storageKey }.
- headers: Case-preserving map or array of tuples, so nothing is lost.
- security: DKIM, SPF, and DMARC evaluations, TLS details, spam score.
- routing: Envelope recipient, MX domain, alias metadata.
Choose field names and types once and treat changes as versioned migrations. Stability matters more than exhaustive coverage.
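As an illustration, a single message under such a schema might serialize as below; every value here is invented for the example:

```python
import json

# Example payload following the schema sketched above; addresses,
# header values, and security results are all illustrative.
message = {
    'id': '<20240101120000.abc123@mail.example.com>',
    'timestamp': '2024-01-01T12:00:00+00:00',
    'from': [{'name': 'Alerts', 'address': 'alerts@vendor.example'}],
    'to': [{'name': '', 'address': 'ticket+123@inbound.example.com'}],
    'cc': [],
    'bcc': [],
    'subject': 'Disk usage above 90% on db-1',
    'text': 'Disk usage above 90% on db-1.',
    'html': None,
    'attachments': [],
    'headers': [['Message-ID', '<20240101120000.abc123@mail.example.com>']],
    'security': {'dkim': 'pass', 'spf': 'pass', 'dmarc': 'pass'},
    'routing': {'envelopeTo': 'ticket+123@inbound.example.com',
                'mxDomain': 'inbound.example.com'},
}
payload = json.dumps(message)
```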
Trust, security, and governance
- Authentication results: Record DKIM, SPF, DMARC results. Do not reject solely on failure unless policy demands it. Use the signals for scoring and quarantine rules.
- Attachment safety: Enforce type and size limits, virus scan, and blocklist risky MIME types by default.
- HTML handling: Strip scripts on display and rewrite external images to avoid tracking beacons. Keep raw HTML for archival only.
- PII and retention: Store raw EML in encrypted object storage with time-bound retention and access logs.
- Idempotency: Generate a deterministic message key to avoid duplicate processing when retries occur.
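One possible sketch of such a key, combining Message-ID with a digest of the raw body so retries map to the same value even when Message-ID is missing or reused:

```python
import hashlib

# Deterministic idempotency key: identical inputs always yield the same
# key, so duplicate deliveries can be detected downstream.
def message_key(message_id, raw_body):
    h = hashlib.sha256()
    h.update((message_id or '').encode('utf-8'))
    h.update(hashlib.sha256(raw_body).digest())
    return h.hexdigest()
```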
Practical implementation
Reference architecture
A production email-to-json pipeline typically follows this flow:
- SMTP ingress on a dedicated subdomain.
- Durable storage for raw EML, for example S3 with object locks.
- MIME parser workers that emit normalized JSON.
- A message bus or queue for delivery, for example Kafka, SQS, or NATS.
- Webhook delivery or a REST polling API that downstream apps consume.
- Observability: metrics, logs, and traces keyed by Message-ID.
Webhook receiver pattern
A webhook simplifies consumption because your app receives JSON as soon as the message is parsed. Keep the handler fast and idempotent:
// Node.js Express example
const express = require('express');
const crypto = require('crypto');
const app = express();

// Capture the raw body for signature verification; re-serializing
// req.body with JSON.stringify is not guaranteed to match the bytes
// the sender actually signed.
app.use(express.json({
  limit: '10mb',
  verify: (req, res, buf) => { req.rawBody = buf; }
}));

function verifySignature(req, secret) {
  const signature = req.get('X-Signature');
  if (!signature) return false;
  const hmac = crypto.createHmac('sha256', secret)
    .update(req.rawBody)
    .digest('hex');
  const a = Buffer.from(signature, 'hex');
  const b = Buffer.from(hmac, 'hex');
  // timingSafeEqual throws if lengths differ, so check first
  return a.length === b.length && crypto.timingSafeEqual(a, b);
}

app.post('/inbound-email', async (req, res) => {
  if (!verifySignature(req, process.env.WEBHOOK_SECRET)) {
    return res.status(401).send('invalid signature');
  }
  // Idempotency using message ID
  const msgId = req.body.id || req.body.headers['message-id'];
  // Fast ack
  res.status(204).end();
  // Offload processing; `queue` is your message-bus client
  queue.publish('emails.parsed', { id: msgId, payload: req.body });
});

app.listen(3000, () => console.log('Webhook up'));
For webhook design details and HMAC signature patterns, see Webhook Integration: A Complete Guide | MailParse.
Polling API pattern
Polling is useful for air-gapped or legacy systems that cannot expose inbound HTTP. Use long polling with backoff and checkpointing:
# Polling with curl
curl -H 'Authorization: Bearer <TOKEN>' \
  'https://api.example.com/v1/messages?after=cursor123&limit=100'

# Python checkpoint loop
import requests, time

token = 'REDACTED'
cursor = None
while True:
    params = {'limit': 100}
    if cursor:
        params['after'] = cursor
    r = requests.get('https://api.example.com/v1/messages', params=params,
                     headers={'Authorization': f'Bearer {token}'}, timeout=30)
    r.raise_for_status()
    batch = r.json()
    for msg in batch['items']:
        process(msg)  # process() must be idempotent
        cursor = msg['cursor']  # checkpoint after each processed message
    time.sleep(2)
Parsing MIME to JSON yourself
If you manage your own parsing layer, treat it like any other critical parser with robust test fixtures:
# Python 3 - standard library parsing
from email import policy
from email.parser import BytesParser
import json

def parse_eml_to_json(eml_bytes):
    msg = BytesParser(policy=policy.default).parsebytes(eml_bytes)

    def addrlist(header):
        # policy.default parses address headers into Address objects
        addrs = []
        for value in msg.get_all(header, []):
            for a in getattr(value, 'addresses', []):
                addrs.append({'name': a.display_name, 'address': a.addr_spec})
        return addrs

    date = msg['Date']
    # Minimal reliable fields
    out = {
        'id': msg.get('Message-ID'),
        'timestamp': date.datetime.isoformat() if date and date.datetime else None,
        'subject': msg.get('Subject'),
        'from': addrlist('From'),
        'to': addrlist('To'),
        'cc': addrlist('Cc'),
        'headers': {k: str(v) for (k, v) in msg.items()},
        'text': None,
        'html': None,
        'attachments': []
    }
    if msg.is_multipart():
        for part in msg.walk():
            ctype = part.get_content_type()
            disp = part.get_content_disposition()
            # Check disposition first so text/plain attachments are not
            # mistaken for the message body
            if disp in ('attachment', 'inline') and part.get_filename():
                payload = part.get_payload(decode=True) or b''
                out['attachments'].append({
                    'filename': part.get_filename(),
                    'contentType': ctype,
                    'size': len(payload)
                })
            elif ctype == 'text/plain' and out['text'] is None:
                out['text'] = part.get_content()
            elif ctype == 'text/html' and out['html'] is None:
                out['html'] = part.get_content()
    else:
        ctype = msg.get_content_type()
        if ctype == 'text/plain':
            out['text'] = msg.get_content()
        elif ctype == 'text/html':
            out['html'] = msg.get_content()
    return out

# usage
# with open('sample.eml', 'rb') as f:
#     print(json.dumps(parse_eml_to_json(f.read()), indent=2))
Production parsers should handle charsets, malformed headers, weird boundaries, and truncated messages. Build a corpus of edge-case fixtures and add them to CI.
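A minimal fixture-style test might look like the following; the fixture bytes and expected values are invented for illustration, and in CI the fixtures would come from a corpus directory rather than being inlined:

```python
import unittest
from email import policy
from email.parser import BytesParser

# An edge-case fixture: RFC 2047 encoded-word subject plus CRLF endings.
FIXTURE = (
    b"Message-ID: <fix1@example.com>\r\n"
    b"Subject: =?utf-8?q?caf=C3=A9?=\r\n"
    b"From: a@example.com\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n\r\n"
    b"body\r\n"
)

class ParserFixtures(unittest.TestCase):
    def test_encoded_subject(self):
        msg = BytesParser(policy=policy.default).parsebytes(FIXTURE)
        # policy.default decodes encoded-words transparently
        self.assertEqual(str(msg['Subject']), 'café')
        self.assertEqual(str(msg['Message-ID']), '<fix1@example.com>')

if __name__ == '__main__':
    unittest.main()
```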
Where a managed service helps
If you want instant addresses, reliable MIME parsing, and delivery by webhook or REST without running SMTP infrastructure, a managed service like MailParse handles the heavy lifting. That lets your team focus on routing rules, security policy, and downstream automation rather than RFC parsing edge cases.
Tools and libraries DevOps teams rely on
Language libraries
- Node.js: mailparser (npm), battle-tested for multipart handling and attachments.
- Python: stdlib email package, or third-party mail-parser for simpler field extraction.
- Go: github.com/emersion/go-message and github.com/jhillyerd/enmime for robust MIME parsing.
- Java: Apache mime4j for low-level parsing and custom pipelines.
Inbound services and MTAs
- Cloud ingress: AWS SES inbound to S3 and Lambda, or third party inbound parse APIs.
- Self-hosted: Postfix with recipient delimiter routing, Procmail, or a custom pipe transport to a local parser.
- Storage: Object storage per raw EML with lifecycle rules for retention.
CLI utilities and filters
- ripmime and munpack for attachment extraction during incident response.
- rspamd or SpamAssassin for scoring and additional metadata.
- ClamAV for antivirus scanning in the attachment pipeline.
Common mistakes DevOps engineers make with email-to-json
- Assuming every message has text/plain. Many transactional emails are HTML only. Generate text from HTML if you require it downstream.
- Dropping headers. Teams often discard Message-ID, In-Reply-To, and References, which breaks threading and idempotency.
- Skipping charsets. Treat everything as UTF-8 and you will corrupt content. Detect and normalize at parse time and record the source charset.
- Parsing synchronously in webhooks. If the webhook does CPU heavy work and times out, you will get retries and duplicates. Ack fast, then process asynchronously.
- Not enforcing limits. Define size caps for total message, per attachment, and HTML image fetching. Reject or quarantine early.
- Ignoring error signals. Delivery Status Notifications and MDNs have specialized MIME types. Classify them separately to avoid polluting business workflows.
- Weak authentication on webhooks. Always sign payloads, verify timestamps or nonces, and rotate secrets. Apply IP allowlists and mutual TLS if possible.
- No replay strategy. Without storing raw EML, you cannot reparse when the schema evolves or a bug is fixed. Keep the source of truth immutable.
- Conflating envelope and header recipients. The SMTP envelope recipient can differ from the To header. Use the envelope for routing decisions.
Advanced patterns for production-grade email processing
Tenant isolation with recipient mapping
Use sub-addressing and UUID aliases for multi-tenant systems. Example: <tenant>+<uuid>@inbound.example.com. Store a lookup table from UUID to tenant and apply isolation at parsing and storage boundaries. This keeps a single MX while avoiding ambiguous routing rules.
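A sketch of that lookup, with an in-memory dict standing in for the real alias store (database or cache); the table contents are illustrative:

```python
import uuid

# Stand-in for a durable alias store mapping UUID -> tenant.
ALIAS_TABLE = {
    '1b4e28ba-2fa1-11d2-883f-0016d3cca427': 'acme-corp',
}

def resolve_tenant(envelope_to):
    local, _, _domain = envelope_to.lower().partition('@')
    _, _, tag = local.partition('+')
    try:
        key = str(uuid.UUID(tag))  # reject malformed UUIDs early
    except ValueError:
        return None                # unknown alias -> quarantine/reject
    return ALIAS_TABLE.get(key)
```

Validating the UUID before the lookup keeps garbage recipients out of the alias store's query path.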
Exactly-once semantics on at-least-once delivery
- Deterministic IDs: Hash Message-ID with the raw body or store a content digest. Use it as a primary key in downstream databases.
- Idempotent handlers: All consumers should detect duplicates using the deterministic ID and short-circuit safely.
- Transactional outbox: When writing side effects, use an outbox table or queue to make delivery resilient to crashes.
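The idempotent-handler pattern above can be sketched as follows; the in-memory set stands in for a durable store (for example a database table keyed on the deterministic ID):

```python
# Idempotent consumer: duplicates of an already-processed message are
# detected by ID and short-circuited without re-running side effects.
class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler
        self.seen = set()  # replace with durable storage in production

    def consume(self, message):
        if message['id'] in self.seen:
            return 'duplicate'       # safe short-circuit on redelivery
        self.handler(message)        # side effects run exactly once
        self.seen.add(message['id'])
        return 'processed'
```

Note that marking the ID as seen only after the handler succeeds gives at-least-once semantics on crash; a transactional outbox closes that gap.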
Quarantine and triage queues
Not all messages should reach business workflows. Create quarantine queues for high spam scores, failed DKIM, oversized attachments, or suspicious MIME types. Offer a manual review or automated remediation job that can reclassify or drop messages based on evolving rules.
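A rough sketch of such a classifier over the parsed JSON; the thresholds, queue names, and blocklist here are illustrative policy choices, not fixed values:

```python
# Route a parsed message to a delivery or quarantine queue based on the
# signals described above. All limits and names are example policy.
RISKY_TYPES = {'application/x-msdownload', 'application/x-sh'}

def classify(message, max_attachment_bytes=25 * 1024 * 1024):
    sec = message.get('security', {})
    if sec.get('spamScore', 0) >= 8:
        return 'quarantine.spam'
    if sec.get('dkim') == 'fail' and sec.get('dmarc') == 'fail':
        return 'quarantine.auth'
    for att in message.get('attachments', []):
        if att['contentType'] in RISKY_TYPES:
            return 'quarantine.attachment'
        if att['size'] > max_attachment_bytes:
            return 'quarantine.oversize'
    return 'emails.parsed'
```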
Attachment offload and streaming
Do not embed attachment bytes in webhook payloads for large files. Store them in object storage and reference them by URL with short lived signed tokens. For internal pipelines, use a storage key and fetch via a secure service-to-service request. Stream large attachments to avoid memory spikes in parsers.
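One stdlib-only sketch of short-lived signed references: the consumer presents the token to a service that calls verify_key before streaming the attachment. The URL layout and TTL are illustrative:

```python
import hashlib
import hmac
import time

# Mint a short-lived signed reference to an attachment's storage key,
# instead of embedding the bytes in the webhook payload.
def sign_key(storage_key, secret, ttl=300, now=None):
    expires = int(now if now is not None else time.time()) + ttl
    msg = f'{storage_key}:{expires}'.encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f'{storage_key}?expires={expires}&sig={sig}'

def verify_key(storage_key, expires, sig, secret, now=None):
    now = int(now if now is not None else time.time())
    msg = f'{storage_key}:{expires}'.encode()
    expected = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    # constant-time comparison plus expiry check
    return now < int(expires) and hmac.compare_digest(sig, expected)
```

Managed object stores offer equivalent presigned URLs out of the box; this sketch shows the underlying idea.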
Observability and SLOs
- Metrics: time from SMTP reception to JSON delivery, queue depth, parse duration, webhook latency, retry counts, and error classes.
- Structured logs: include Message-ID, tenant, and routing key for correlation across systems.
- Tracing: instrument parsing and delivery with spans; add the deterministic message ID as the trace parent or baggage.
- SLOs: set objectives for 95th percentile ingestion-to-delivery time and alert on sustained backlog growth.
Policy enforcement and compliance
- Data minimization: redact PII fields before forwarding to analytics or test environments.
- Retention: different policies for raw EML versus derived JSON. Retain raw for a shorter period when permissible.
- Encryption: KMS for object storage, key rotation policies, and per-tenant buckets with IAM boundaries.
- Regional routing: store and process messages in region to meet data residency requirements.
Conclusion
Email is not going away. Converting email to JSON gives operations teams a clean boundary between the messy world of SMTP and the deterministic world of event driven systems. With a robust schema, strong security, and a delivery model that fits your architecture, email becomes another reliable input to your automation.
If you prefer a managed path that provides instant addresses, rigorous MIME parsing, and push or pull delivery options, consider integrating with MailParse to plug email directly into your DevOps toolchain.
FAQ
How do I handle extremely large messages and attachments without timeouts?
Set strict per-message and per-attachment limits at the SMTP edge. Stream message bodies to object storage instead of buffering in memory. In the parser, stream parts and write attachments directly to storage while computing hashes. Do not include large bytes in webhook payloads; include a reference key or a short-lived signed URL instead. Increase client and server timeouts conservatively, but rely on asynchronous processing to avoid synchronous timeouts.
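The stream-and-hash step can be sketched like this; `sink` stands in for the object-storage upload, and `chunks` is any iterable of bytes (socket, multipart stream):

```python
import hashlib

# Hash an attachment while streaming it to storage chunk by chunk,
# so large files never sit fully in memory.
def stream_attachment(chunks, sink):
    h = hashlib.sha256()
    size = 0
    for chunk in chunks:
        h.update(chunk)
        sink.write(chunk)
        size += len(chunk)
    return {'sha256': h.hexdigest(), 'size': size}
```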
How can I secure webhooks used for email-to-json delivery?
Use HMAC-signed payloads with a shared secret, verify the signature with a timing-safe comparator, and reject stale timestamps. Pin a dedicated allowlist of source IPs or use mutual TLS for stronger authentication. Terminate TLS with modern ciphers only. Log signature verification outcomes and rotate secrets regularly. Keep your handler fast and idempotent to minimize retry windows.
What if my downstream systems require plain text but the email is HTML only?
Provide a safe HTML to text converter that preserves meaningful semantics like links and lists. Never render HTML on trust boundaries without sanitization. A common approach is to store both the raw HTML and a sanitized text fallback. For search indexing, prefer the text extraction. For display, sanitize HTML and rewrite external assets.
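A minimal standard-library sketch of such a converter; a production version needs far broader coverage (tables, nested lists, whitespace folding), and the output format here is an assumption:

```python
from html.parser import HTMLParser

# Extract readable text from HTML: keep link targets and list bullets,
# drop script and style content entirely.
class TextExtractor(HTMLParser):
    SKIP = {'script', 'style'}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skipping += 1
        elif tag == 'li':
            self.parts.append('\n- ')
        elif tag in ('p', 'br', 'div'):
            self.parts.append('\n')
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.parts.append(f'({href}) ')

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skipping:
            self.skipping -= 1

    def handle_data(self, data):
        if not self.skipping:
            self.parts.append(data)

def html_to_text(html):
    p = TextExtractor()
    p.feed(html)
    return ''.join(p.parts).strip()
```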
Can email-to-json replace IMAP polling for inbound processing?
Yes. Event based ingestion via webhooks or a polling API for parsed messages typically reduces latency and operational complexity compared to IMAP polling. It also avoids edge cases like partial fetches and mailbox state drift. Keep IMAP access for legacy mailboxes only and migrate business workflows to event based delivery.
How do I keep the JSON schema stable as requirements grow?
Version your schema and never change the meaning of existing fields. Add new fields as optional, and document them. Maintain a contract test suite that runs against real fixture emails to catch regressions. When breaking changes are unavoidable, publish a new versioned endpoint or topic and provide a migration window.