How to Build a Foundation for Running Extra-Large Language Models
What Actually Goes Into Running Large Language Models at Scale
When Cloudflare published their deep-dive into building infrastructure for extra-large language models, the engineering community paid close attention. Not because AI inference is new, but because the post laid bare something developers rarely get to see: the brutal, unglamorous work of making high-performance AI accessible at the edge. Custom memory management, quantization trade-offs, kernel-level optimizations — the kind of work that sits far beneath the surface of any chat interface.
But here's what that post implicitly highlights for the rest of us: running LLMs isn't just a matter of pointing a GPU at a model file and calling it done. There's an entire stack of infrastructure decisions, from DNS routing to SSL termination to cache behavior, that determines whether your AI-powered application is fast, reliable, and secure — or not.
This article takes a practical look at what that infrastructure foundation actually looks like, what you need to get right before you even think about model weights, and how to validate each layer of your stack.
The Infrastructure Stack Beneath Every LLM Deployment
Most AI inference discussions jump straight to GPUs, quantization, and throughput benchmarks. That's understandable — those are the hard problems. But the reality is that the infrastructure supporting your LLM deployment is just as capable of introducing latency, downtime, and security vulnerabilities as any model configuration issue.
Think about what a single inference request actually traverses before it reaches your model:
- DNS resolution to find your server
- TCP handshake and TLS negotiation
- Load balancer routing
- Reverse proxy handling
- Application layer processing
- Model inference
- Response caching (or not)
- Delivery back to the client
Each of those steps is a potential point of failure or latency. When Cloudflare talks about building a custom stack for LLMs, they're acknowledging that every layer matters. The same principle applies whether you're deploying on Cloudflare Workers AI, AWS, GCP, or your own bare metal.
DNS: The First Hop That People Ignore
DNS is almost always the first thing that breaks during a new deployment and the last thing developers think to check. For LLM deployments specifically, DNS configuration has a few unique considerations.
If you're running inference endpoints across multiple regions (which you should be, for latency reasons), you'll likely be using GeoDNS or latency-based routing to direct users to the nearest inference node. Getting this wrong means users in Singapore hitting your US-East endpoint, adding 200ms of round-trip latency before the model even starts generating tokens.
Here's a minimal example of what a well-configured DNS record set might look like for a multi-region inference API:
# Regional inference endpoint (US-East)
us-east.inference.yourllm.com. 60 IN A 203.0.113.10
# Latency-based routing via CNAME to the regional endpoint
api.yourllm.com. 60 IN CNAME us-east.inference.yourllm.com.
# Health check failover record
api-backup.yourllm.com. 60 IN A 203.0.113.20
Notice the TTL of 60 seconds. During active deployments or failover scenarios, you want DNS changes to propagate quickly. A 3600-second TTL means your users could be hitting a dead endpoint for an hour after you've already fixed the problem.
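The arithmetic behind that TTL choice is worth making explicit: resolvers that cached the old answer just before your fix keep serving it for up to one full TTL, so the worst-case window of users hitting a dead endpoint is roughly your detection-and-repair time plus the TTL. A trivial sketch:

```python
def worst_case_staleness(ttl_seconds: int, time_to_fix_seconds: int) -> int:
    """Worst-case seconds a user can keep hitting a dead endpoint:
    the time it takes you to notice and fix the record, plus one
    full TTL for cached answers to expire."""
    return time_to_fix_seconds + ttl_seconds

# With a 60s TTL and a 5-minute fix, users recover within ~6 minutes;
# with a 3600s TTL the same incident stretches past an hour.
print(worst_case_staleness(60, 300))    # 360
print(worst_case_staleness(3600, 300))  # 3900
```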
You can validate your DNS configuration at any time using the DNS Lookup tool, which lets you check A, AAAA, CNAME, MX, and TXT records without needing to drop into a terminal.
TLS and SSL: Non-Negotiable for AI APIs
Any API handling inference requests — particularly those that might include sensitive user prompts — must be running over HTTPS with a properly configured SSL certificate. This isn't a nice-to-have. It's table stakes.
But SSL configuration for high-performance APIs has some nuances that trip people up:
TLS Version and Cipher Suites
You should be enforcing TLS 1.2 at minimum, with TLS 1.3 preferred. TLS 1.3 reduces the handshake from two round trips to one, which matters when you're trying to minimize latency before inference even begins. Here's a typical Nginx configuration for an inference endpoint:
server {
    listen 443 ssl http2;
    server_name api.yourllm.com;

    ssl_certificate /etc/ssl/certs/yourllm.crt;
    ssl_certificate_key /etc/ssl/private/yourllm.key;

    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384;
    ssl_prefer_server_ciphers off;

    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;
    ssl_session_tickets off;

    # HSTS
    add_header Strict-Transport-Security "max-age=63072000" always;
}
The ssl_session_cache directive is particularly important for inference APIs with high request volumes. Session resumption means returning clients don't need to perform a full TLS handshake on every request, which reduces latency meaningfully at scale.
Certificate Expiry Monitoring
This is embarrassingly common: a production AI API goes down because someone forgot to renew the SSL certificate. Automate your renewals with Let's Encrypt and certbot, but also set up external monitoring. You can use the SSL Certificate Checker to verify your certificate's validity, expiry date, issuer chain, and whether it's properly trusted by major browsers — all without needing server access.
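External monitoring is also easy to script yourself. The sketch below uses only Python's standard ssl module; the hostname and the 30-day threshold in the usage comment are placeholders:

```python
import socket
import ssl
import time
from typing import Optional

def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days left on a certificate, given its 'notAfter' field
    (e.g. 'Jun  1 12:00:00 2026 GMT')."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400.0

def check_certificate(hostname: str, port: int = 443) -> float:
    """Fetch the leaf certificate over TLS and return days until expiry."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=5) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    return days_remaining(cert["notAfter"])

# Example alerting rule (hostname is a placeholder):
# if check_certificate("api.yourllm.com") < 30: trigger_renewal_alert()
```

Wiring a check like this into a cron job gives you an expiry alert independent of whatever certbot is doing on the server itself.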
Cloudflare as Infrastructure: What It Actually Provides
Cloudflare's blog post is, understandably, about their own infrastructure. But it's worth being specific about what running behind Cloudflare actually gives you for an LLM deployment, beyond just DDoS protection.
Anycast Routing
Cloudflare operates one of the largest anycast networks on the planet. When a user makes a request to your inference API, they're automatically routed to the nearest Cloudflare point of presence. This matters for time-to-first-token metrics — getting the request to your origin faster means the model starts generating sooner.
You can verify whether a domain is running behind Cloudflare using the Cloudflare Detection tool. This is also useful for competitive analysis or for debugging routing issues when you're not sure if traffic is hitting your origin directly or going through a proxy.
HTTP/2 and HTTP/3 Support
Cloudflare enables HTTP/2 and HTTP/3 (QUIC) by default. For streaming inference responses — where you're sending tokens back to the client as they're generated — HTTP/2's multiplexing and HTTP/3's reduced head-of-line blocking can make a noticeable difference in perceived responsiveness.
Caching Considerations for AI APIs
This is where things get interesting. Most LLM inference responses are dynamic and shouldn't be cached. But there are edge cases where caching makes sense: embedding generation for the same input, tokenizer outputs, or even semantic caching where similar prompts return cached responses.
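For the exact-match case, the idea is just a content-addressed cache in front of the embedding call. A minimal sketch, where the embed callable is a stand-in for your real embedding backend:

```python
import hashlib
from typing import Callable, Dict, List

class EmbeddingCache:
    """Cache embeddings keyed by a hash of the normalized input text,
    so identical inputs never hit the model twice."""

    def __init__(self, embed: Callable[[str], List[float]]):
        self._embed = embed
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, text: str) -> List[float]:
        key = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._embed(text)
        return self._store[key]

# Usage with a dummy backend: the second call is served from cache.
cache = EmbeddingCache(embed=lambda text: [float(len(text))])
cache.get("Hello world")
cache.get("  hello WORLD ")  # normalizes to the same key -> cache hit
print(cache.hits, cache.misses)  # 1 1
```

Semantic caching of full prompts works the same way in outline, but swaps the exact hash for a nearest-neighbor lookup over embeddings, which is a much bigger design decision.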
If you're implementing any form of response caching, your Cache-Control headers need to be precise:
# For non-cacheable inference responses
Cache-Control: no-store, no-cache, must-revalidate
# For cacheable embedding responses
Cache-Control: public, max-age=86400, s-maxage=86400
# For semantic cache with short TTL
Cache-Control: public, max-age=300, stale-while-revalidate=60
Getting these wrong can result in users receiving stale or incorrect responses, which is a particularly bad failure mode for AI applications.
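One way to keep these headers consistent is to centralize the policy rather than scatter literal strings across handlers. A small sketch; the endpoint category names are illustrative:

```python
# Map endpoint categories to the Cache-Control policies described above.
CACHE_POLICIES = {
    "inference": "no-store, no-cache, must-revalidate",
    "embedding": "public, max-age=86400, s-maxage=86400",
    "semantic_cache": "public, max-age=300, stale-while-revalidate=60",
}

def cache_control_for(endpoint_type: str) -> str:
    """Return the Cache-Control header for an endpoint category,
    defaulting to no-store so nothing is ever cached by accident."""
    return CACHE_POLICIES.get(endpoint_type, "no-store")

print(cache_control_for("embedding"))  # public, max-age=86400, s-maxage=86400
print(cache_control_for("unknown"))    # no-store
```

Defaulting the fallback to no-store means a newly added endpoint fails safe (uncached) rather than leaking stale responses.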
Security Headers: Protecting Your Inference Endpoints
Cloudflare's infrastructure post focuses on performance, but security is equally critical for any public-facing AI API. The attack surface for LLM deployments includes the usual suspects — injection attacks, authentication bypass, rate limit evasion — plus some AI-specific concerns like prompt injection and model extraction.
At the HTTP layer, your security headers should include at minimum:
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
X-XSS-Protection: 1; mode=block
Strict-Transport-Security: max-age=31536000; includeSubDomains
Content-Security-Policy: default-src 'self'
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: geolocation=(), microphone=(), camera=()
For APIs specifically, you can relax some of the browser-focused headers (CSP, X-Frame-Options) since API responses aren't rendered in browsers; note too that X-XSS-Protection is deprecated in modern browsers, so setting it to 0 or omitting it entirely is also defensible. But HSTS, CORS headers, and rate limiting responses should be present and correct.
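You can also sanity-check headers yourself before reaching for an external scanner. A minimal sketch that flags missing headers; the required set here is a subset suitable for API-only endpoints and is meant to be adjusted:

```python
from typing import Dict, List

# Headers every endpoint should carry; trim or extend per deployment.
REQUIRED_HEADERS = [
    "X-Content-Type-Options",
    "Strict-Transport-Security",
    "Referrer-Policy",
]

def missing_security_headers(response_headers: Dict[str, str]) -> List[str]:
    """Return required security headers absent from a response
    (header names compared case-insensitively, per HTTP semantics)."""
    present = {name.lower() for name in response_headers}
    return [h for h in REQUIRED_HEADERS if h.lower() not in present]

headers = {
    "content-type": "application/json",
    "strict-transport-security": "max-age=31536000; includeSubDomains",
}
print(missing_security_headers(headers))
# ['X-Content-Type-Options', 'Referrer-Policy']
```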
You can run a full security header audit using the Vulnerability Scanner, which checks for missing security headers, common misconfiguration patterns, and XSS vulnerabilities.
Performance Optimization Beyond the Model
Cloudflare's engineering work on LLM inference — custom CUDA kernels, quantization, KV cache management — is impressive. But for most teams deploying AI applications, the performance wins available at the infrastructure layer are often larger and easier to capture than model-level optimizations.
Connection Pooling
If your application server is making HTTP calls to an inference API on every request, you're paying the TCP handshake cost every time. Use connection pooling:
import httpx

# Create a persistent client with connection pooling
client = httpx.AsyncClient(
    base_url="https://api.yourllm.com",
    limits=httpx.Limits(
        max_keepalive_connections=20,
        max_connections=100,
        keepalive_expiry=30,
    ),
    timeout=httpx.Timeout(30.0, connect=5.0),
)

async def generate_completion(prompt: str) -> str:
    response = await client.post(
        "/v1/completions",
        json={"prompt": prompt, "max_tokens": 512},
    )
    return response.json()["choices"][0]["text"]
This alone can reduce p99 latency by 20-40ms for high-frequency inference calls.
Streaming Responses
For user-facing applications, streaming is almost always the right choice. Users perceive streaming responses as faster even when total generation time is the same, because they see output immediately rather than waiting for the full completion.
async def stream_completion(prompt: str):
    async with client.stream(
        "POST",
        "/v1/completions",
        json={"prompt": prompt, "max_tokens": 512, "stream": True},
    ) as response:
        async for chunk in response.aiter_text():
            yield chunk
The infrastructure implication here is that your load balancer and reverse proxy need to support long-lived connections and chunked transfer encoding. Some default configurations aggressively time out connections, which will cut off streaming responses.
Timeout Configuration
LLM inference is slow compared to typical API responses. A request that generates 1000 tokens might take 10-30 seconds. Your entire infrastructure stack — load balancers, reverse proxies, CDN edge nodes — needs to have timeouts configured appropriately:
# Nginx proxy timeouts for LLM inference
proxy_connect_timeout 10s;
proxy_send_timeout 120s;
proxy_read_timeout 120s;
# For streaming responses specifically
proxy_buffering off;
proxy_cache off;
Monitoring and Observability for AI Infrastructure
Running LLMs at scale requires different monitoring approaches than typical web services. The metrics that matter are different:
- Time to first token (TTFT): How long before the user sees any output
- Tokens per second (TPS): Generation throughput
- Request queue depth: How many requests are waiting for GPU capacity
- GPU memory utilization: Are you approaching OOM conditions?
- Cache hit rate: For KV cache and semantic caching
Beyond model-specific metrics, your standard infrastructure monitoring still applies: DNS resolution time, TLS handshake duration, connection establishment time, and HTTP response codes.
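Both TTFT and TPS fall out of per-token timestamps, which most streaming clients can record as chunks arrive. A quick sketch; the timestamps in the usage example are fabricated for illustration:

```python
from typing import List, Tuple

def ttft_and_tps(request_start: float, token_times: List[float]) -> Tuple[float, float]:
    """Time to first token (seconds) and tokens per second over the
    generation window, from a request start time and per-token timestamps."""
    ttft = token_times[0] - request_start
    duration = token_times[-1] - token_times[0]
    # (n - 1) inter-token intervals over the elapsed generation time
    tps = (len(token_times) - 1) / duration if duration > 0 else float("inf")
    return ttft, tps

# 5 tokens: first arrives 0.4s after the request, then one every 0.1s.
times = [10.4, 10.5, 10.6, 10.7, 10.8]
ttft, tps = ttft_and_tps(10.0, times)
print(round(ttft, 2), round(tps, 1))  # 0.4 10.0
```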
For SEO and discoverability of any public-facing AI tools or documentation you're building, it's worth running periodic audits with the SEO Audit tool to ensure your meta tags, structured data, and content hierarchy are properly configured.
The Deployment Checklist
Before going live with any LLM-powered application, work through this infrastructure checklist:
DNS
- TTLs are appropriate for your failover requirements
- GeoDNS or latency-based routing configured for multi-region
- Health check records configured
- DNS propagation verified across multiple resolvers
SSL/TLS
- Certificate valid and not expiring within 30 days
- TLS 1.3 enabled
- HSTS header configured
- Certificate auto-renewal in place
Security
- Security headers present and correctly configured
- Rate limiting on inference endpoints
- API authentication in place
- CORS policy configured appropriately
Performance
- HTTP/2 enabled
- Connection pooling configured
- Streaming responses working end-to-end
- Timeouts appropriate for inference latency
- Cache-Control headers correct for each endpoint type
Monitoring
- Uptime monitoring on inference endpoints
- SSL certificate expiry alerts
- Latency percentile tracking (p50, p95, p99)
- Error rate alerting
Conclusion
Cloudflare's work on custom inference infrastructure is a reminder that high-performance AI isn't just about the model — it's about every layer of the stack that carries requests to and from that model. The same engineering rigor that goes into kernel-level GPU optimizations needs to be applied to DNS configuration, TLS setup, security headers, and connection management.
For teams building AI-powered applications, the good news is that most of these infrastructure fundamentals are well-understood and well-tooled. You don't need to build custom CUDA kernels to get a fast, secure, reliable inference deployment — you need to get the basics right and verify them systematically.
OpDeck provides a suite of tools to audit and validate your infrastructure at every layer — from DNS Lookup and SSL Certificate Checker to Vulnerability Scanner and Cloudflare Detection. If you're preparing to deploy an AI application or want to audit your existing infrastructure, start with a systematic check of each layer at opdeck.co.