
CLI Over HTTPS Part 3: The Proxy Pattern
In Part 1 I showed that SSH burns 10-15 round trips before delivering a single byte of command output. In Part 2 I proved it: HTTPS batch is ~17x faster than SSH at real-world latencies when the device supports it natively, and even HTTPS keep-alive, with no batching, is 3.4x faster.
The obvious objection: most devices don’t support it natively. Your Cisco IOS switches, your Juniper routers, your Arista leaf nodes, they speak SSH. And while some of them have other interfaces, SSH is not changing anytime soon.
So the question isn’t “how do I get my switches to speak HTTPS.” The question is: where does the SSH happen?
Moving SSH to the Edge
SSH is slow because of round trips. Round trips are slow because of distance. If you move the SSH session closer to the device, the round trips get cheap.
A proxy co-located with the devices, in the same data center or local network, talks SSH to the devices over a 1-2ms link where the protocol overhead is negligible. Your automation platform talks HTTPS to the proxy over the WAN, where the round-trip savings from Part 1 actually matter.
The device never knows the difference; it sees an SSH session from a local IP. Your automation never touches SSH directly; it sends an HTTP request and gets CLI output back in the response body.
The Architecture
The proxy is the only component that touches SSH. Everything upstream is HTTPS: connection pooling, TLS 1.3, request batching, proper Content-Length framing. Everything downstream is SSH, but over a link where it doesn’t matter.
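Sketched out, the topology looks like this:

```
automation platform ---HTTPS over the WAN (30-150ms RTT)---> proxy ---SSH, local (1-2ms RTT)---> devices
   TLS 1.3, pooling, batching                        co-located, so SSH overhead is cheap
```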
Proving It
I added a proxy mode to the benchmark tool from Part 2. The proxy is an HTTPS server that receives commands via the same ASA-style endpoints (/admin/exec/, /admin/config), then opens an SSH session to a backend device and returns the output.
The test setup:
- Backend device: SSH listener with 2ms RTT (local latency)
- Proxy: HTTPS frontend with WAN latency, SSH client to backend
- Benchmark client: Talks HTTPS to the proxy, same as it would to a native HTTPS device
Two proxy modes:
- fresh-ssh: New SSH connection to the backend per request (worst case)
- pooled-ssh: Reuses one SSH connection across requests (what you’d deploy in production)
The proxy’s HTTPS listener gets the same WAN latency injection as the direct SSH and HTTPS tests from Part 2. The backend SSH link gets a fixed 2ms RTT. Both transports experience the same WAN conditions. The only difference is what happens on the last hop.
Results
All runs: 20 iterations, 5 commands per iteration. SSH direct numbers from Part 2 for comparison. The “SSH direct” column uses PTY/shell mode, which is what Netmiko and Ansible actually do.
| WAN RTT | SSH direct (PTY) | Proxy (fresh SSH) | Proxy (pooled SSH) | Speedup (pooled vs SSH PTY) |
|---|---|---|---|---|
| 30ms | 522ms | 132ms | 119ms | 4.4x |
| 70ms | 1,213ms | 271ms | 242ms | 5.0x |
| 150ms | 2,565ms | 494ms | 480ms | 5.3x |
The proxy pattern is 4.4-5.3x faster than SSH direct (PTY mode), even though the proxy still uses SSH on the backend.
At 150ms RTT, for a US NOC managing devices in Hong Kong, SSH direct (PTY) takes 2.6 seconds per device. The proxy with a pooled backend connection does it in 480ms. Scenarios like this are part of why I built NAAS (Netmiko as a Service), which implements exactly this pattern using Netmiko on the backend.
Why It Works
SSH direct (PTY mode) at 150ms RTT pays the full protocol tax on every round trip over the WAN:
- TCP handshake: 1 RT × 150ms
- SSH version exchange: 1 RT × 150ms
- Key exchange: 2 RT × 150ms
- Auth + channel + PTY + shell: 4 RT × 150ms
- Session prep: 2 RT × 150ms
- 5 commands with echo verification: 5 RT × 150ms
That’s ~15 round trips × 150ms = ~2,250ms of protocol overhead, plus processing time.
The proxy splits that cost across two links:
- WAN leg (HTTPS): TCP + TLS 1.3 + HTTP request = ~3 RT × 150ms = ~450ms
- Local leg (SSH): The same ~15 SSH round trips, but at 2ms = ~30ms
Total: ~480ms.
Now, to be very clear, the model is approximate. It ignores processing time, serialization, and the proxy’s own overhead. But the measured numbers land close enough to confirm the mechanism. The SSH overhead is still there. It’s just happening on a link where 15 round trips cost 30ms instead of 2,250ms.
The pooled-ssh mode is slightly faster because it eliminates the SSH handshake on the local leg for subsequent requests. But the difference is small (~14ms at 150ms WAN) because SSH setup at 2ms RTT is already cheap. The big win is relocating the WAN traffic to HTTPS.
SSH connection overhead is a non-issue at local distances. The entire thesis of this series is that SSH’s overhead is a latency problem, not a protocol problem. Move the SSH to where latency is low, and it works fine.
Fresh vs Pooled: Does It Matter?
At local latency, not much. The gap between fresh-ssh (132ms) and pooled-ssh (119ms) at 30ms WAN RTT is 13ms, the cost of one SSH handshake at 2ms RTT. In production you’d pool connections anyway for resource efficiency, but the performance argument for pooling is modest when the backend latency is low.
The operational argument is even stronger. A pooled connection means fewer SSH sessions on the device, and network devices have finite session limits. An ASA might handle 5 concurrent SSH sessions; a Catalyst might allow 16. If your proxy is serving 50 requests per second, fresh connections will exhaust those limits instantly. Pooling keeps one session open per device and multiplexes commands through it.
The pooling logic in clibench is straightforward.
getSSH() returns an existing connection if one is pooled, or dials a new one:
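A sketch of the shape, using golang.org/x/crypto/ssh; the names and details here are illustrative, not lifted from clibench's source:

```go
package main

import (
	"sync"

	"golang.org/x/crypto/ssh"
)

// sshPool keeps one live SSH client per backend device address.
type sshPool struct {
	mu    sync.Mutex
	conns map[string]*ssh.Client
}

// getSSH returns the pooled connection for addr, dialing a new one
// only if nothing is pooled yet.
func (p *sshPool) getSSH(addr string, cfg *ssh.ClientConfig) (*ssh.Client, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[addr]; ok {
		return c, nil
	}
	c, err := ssh.Dial("tcp", addr, cfg)
	if err != nil {
		return nil, err
	}
	if p.conns == nil {
		p.conns = make(map[string]*ssh.Client)
	}
	p.conns[addr] = c
	return c, nil
}
```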
The tradeoff is stale connections; devices reboot, sessions time out, firewalls drop idle flows. The proxy needs to detect dead connections and reconnect, the same problem as HTTP connection pooling or database connection pooling. In clibench, a failed session operation clears the pool so the next request gets a fresh connection. In production, you’d add periodic health checks and a circuit breaker for unreachable devices; which is what NAAS does, for example.
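Continuing the sketch above, that invalidation might look like this (dropSSH and runCommand are names I'm introducing for illustration, not necessarily clibench's):

```go
// dropSSH closes and forgets a connection so the next getSSH
// dials a fresh one.
func (p *sshPool) dropSSH(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if c, ok := p.conns[addr]; ok {
		c.Close()
		delete(p.conns, addr)
	}
}

// runCommand runs one command over a pooled connection. A session
// failure clears the pool entry, so the request after a device
// reboot or idle timeout gets a fresh dial instead of a dead socket.
func (p *sshPool) runCommand(addr string, cfg *ssh.ClientConfig, cmd string) ([]byte, error) {
	client, err := p.getSSH(addr, cfg)
	if err != nil {
		return nil, err
	}
	sess, err := client.NewSession()
	if err != nil {
		p.dropSSH(addr)
		return nil, err
	}
	defer sess.Close()
	out, err := sess.CombinedOutput(cmd)
	if err != nil {
		p.dropSSH(addr) // simplification: treats any error as a dead session
		return nil, err
	}
	return out, nil
}
```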
What This Looks Like in Practice
The Pattern Is Proven
If you have an internal API that accepts “run this command on this device” requests and returns the output, you’re already running a version of this.
Many examples in the ecosystem use exactly this pattern:
- Salt proxy minions with NAPALM behind the Salt REST API.
- AWX execution environments co-located with devices.
- Oxidized’s web interface.
- The Rackspace Go microservices from Part 1.
Each of these co-locates SSH with the devices (good). But most of them don’t expose a clean HTTPS interface upstream, or they bury it under job queues, inventory sync, and YAML sprawl. The core pattern is much simpler than any of those tools, and it’s worth considering on its own or as part of a larger system.
The Minimal Proxy: ~180 Lines of Go
The proxy in clibench is a minimal implementation.
It exposes the same ASA-style endpoints as the benchmark’s HTTPS server (/admin/exec/, /admin/config), but forwards commands to a backend SSH device instead of a local command engine.
The core loop is just: accept an HTTPS request, parse the commands, open an SSH session (or reuse a pooled one), execute each command, return the concatenated output:
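Continuing the same sketch (with net/http, net/url, and strings added to the imports), a condensed single-command version of that loop might look like:

```go
// proxy ties the HTTPS front end to the SSH back end.
type proxy struct {
	pool        sshPool
	backendAddr string
	sshConfig   *ssh.ClientConfig
}

// handleExec serves /admin/exec/<command>: one URL-encoded command
// in the path, plain-text CLI output in the response body.
func (p *proxy) handleExec(w http.ResponseWriter, r *http.Request) {
	cmd, err := url.PathUnescape(strings.TrimPrefix(r.URL.Path, "/admin/exec/"))
	if err != nil {
		http.Error(w, "bad command encoding", http.StatusBadRequest)
		return
	}
	out, err := p.pool.runCommand(p.backendAddr, p.sshConfig, cmd)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	w.Header().Set("Content-Type", "text/plain")
	w.Write(out)
}
```

Register that with http.HandleFunc("/admin/exec/", p.handleExec) behind a TLS listener and the front end is done; the /admin/config handler is the same idea with the commands read from the request body instead of the path.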
That’s enough to prove the concept and run benchmarks, but it’s not enough to run in production. A real deployment needs multi-vendor device support, connection pooling with health checks, async job handling for long-running commands, and proper credential management. And while clibench is largely for demonstration purposes, it still shows how little code is needed to start with this pattern.
NAAS: The Production Version
NAAS (Netmiko as a Service) is what this pattern looks like when you build it for real. Written in Python, it wraps Netmiko behind a REST API, which means it supports 100+ device platforms: Cisco IOS, NX-OS, ASA, Juniper Junos, Arista EOS, Palo Alto, and everything else Netmiko handles.
You POST a JSON payload with the device address, platform, credentials, and commands. NAAS opens the SSH session, runs the commands, and returns the output in the HTTP response:
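The payload looks roughly like this; the field names here mirror Netmiko’s connection parameters and are illustrative, not NAAS’s exact schema:

```json
{
  "host": "10.0.0.1",
  "device_type": "cisco_ios",
  "username": "automation",
  "password": "...",
  "commands": [
    "show version",
    "show ip interface brief"
  ]
}
```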
What NAAS handles that a minimal proxy doesn’t:
- Multi-vendor connection pooling. Persistent SSH connections with health checks and automatic reconnection.
- Async job queue. Long-running commands (show tech-support, bulk config pushes) run in a Redis-backed queue. Your automation gets a job ID back immediately and polls for results.
- Circuit breaker and observability. Stops hammering unreachable devices, exposes Prometheus metrics for connection pool health and per-device latency.
Deploy a NAAS instance in each data center or region, and your automation talks HTTPS to the nearest one. The SSH sessions stay local.
The architecture is the same as what the benchmarks measured: HTTPS over the WAN, SSH on the last hop. NAAS just handles everything a production deployment needs on top of that pattern.
Security
The proxy doesn’t make things more or less secure. It changes the trust model.
With SSH direct, your automation server holds the SSH keys and authenticates directly to every device. With the proxy pattern, the trust boundary splits in two: your automation authenticates to the proxy (over HTTPS, using API tokens, mTLS, or whatever your org uses for service-to-service auth), and the proxy authenticates to the devices (over SSH, using keys that are available to the proxy itself).
What actually changes:
- Where the SSH keys live. They move from the automation server to the proxy. The private keys never cross the WAN in either model (SSH public key auth sends a signature, not the key), but the proxy pattern puts the keys physically closer to the devices they unlock.
- The WAN-side auth mechanism. Your automation no longer speaks SSH to devices. It speaks HTTPS to the proxy. That’s not inherently better or worse. It’s a different credential type (API token or client cert vs SSH key) managed through whatever system your organization already runs for service authentication.
- The blast radius of a compromised proxy. The proxy has access to the SSH keys for every device it manages. Compromise the proxy, and you have access to the fleet. This is the same risk profile as an SSH bastion host, which most organizations already operate and already know how to harden: minimal attack surface, restricted network access, key rotation, session logging, and monitoring. The proxy deserves the same care you’d give a bastion.
When the Proxy Doesn’t Help
The proxy pattern assumes the WAN latency between your automation and the devices is the bottleneck. If your automation server is already co-located with the devices (same rack, same DC), there’s no WAN leg to optimize. SSH at 1-2ms RTT is fast enough.
It also doesn’t help if your bottleneck is device processing time rather than transport overhead. If a show tech-support takes 30 seconds to generate on the device, the transport saves you a few hundred milliseconds on a 30-second operation. Still worth it at scale, but the relative improvement is smaller.
And the proxy adds operational complexity. It’s another service to deploy, monitor, and maintain. For a team managing 50 devices in one location, the overhead isn’t justified. For a team managing thousands of devices across multiple continents, which is where SSH overhead actually hurts, the proxy pays for itself on the first automation run.
Try It
The benchmark code includes the proxy mode. Run it yourself:
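Something along these lines; the flag names below are placeholders, so check the clibench README for the actual interface:

```sh
# Illustrative flags only: consult the clibench README for the
# real ones. Mode names match the modes benchmarked above.
clibench -mode proxy -backend pooled-ssh -rtt 150ms -iterations 20 -commands 5
```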
To try a production version against real devices:
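An illustrative request; the endpoint path and field names may not match NAAS exactly, so treat the guide linked below as authoritative:

```sh
# Illustrative request shape: see the NAAS docs for the real
# endpoint and schema.
curl -sk -X POST https://naas.example.net/send_command \
  -H "Content-Type: application/json" \
  -d '{
        "host": "10.0.0.1",
        "device_type": "cisco_ios",
        "username": "automation",
        "password": "...",
        "commands": ["show version"]
      }'
```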
See the NAAS getting started guide for full setup and configuration.
In Part 4, I’ll lay out a decision framework for choosing between SSH direct, an edge proxy, and native HTTPS, and look at what the industry needs to build next.
My take: The proxy pattern isn’t a workaround. It’s the right architecture for managing geographically distributed network infrastructure. SSH is fine for the last hop. HTTPS is better for everything upstream. That’s why I built NAAS.