edge-llm-at-sea-lessons
Marine AI
Six months running an edge LLM on a vessel at sea: what broke, the real cost vs. satellite API calls, and who should actually build edge AI.
Connex Labs
There's a version of edge AI that lives in slide decks. It runs forever, costs nothing after deployment, and ships once. Then there's the version that runs on a steel hull somewhere in the South China Sea, six months in, where the only person who can power-cycle the box is a chief engineer who has actual problems to deal with.
We've spent the last year putting on-device LLMs into places where calling a hosted API isn't a fallback. It's not an option at all. This post is a debrief from one of those projects. What we shipped, what broke obviously, what broke later in ways we didn't expect, and the honest cost math against just calling a hosted model over satellite.
If you're thinking about edge LLM for something similar, hopefully this saves you a few of the months we lost.
The setup: why offline wasn't optional

We built an on-vessel intelligence system for a maritime client. The use case was document and log work: crew filling out inspection forms, querying maintenance records, summarizing engine room logs, and producing structured reports that eventually sync back to shore when the vessel has bandwidth to spare.
A hosted API was off the table for three reasons.
Satellite bandwidth on commercial vessels still costs north of $8 per MB on the older VSAT plans we kept running into. Newer LEO services are better, but latency and packet loss make real-time inference unreliable for anything beyond a short request. Connectivity is also intermittent. Crews routinely go twelve hours or more with no usable link, and an AI feature that only works half the time gets abandoned inside a week. And the data is sensitive, operational logs, cargo manifests, incident reports. None of it leaves the vessel without explicit sync rules.
So the model had to live on the box. The box had to live in an engine room. And neither could be touched for months at a time.
Hardware and model choice

We landed on an NVIDIA Jetson Orin NX 16GB module as the compute target. Enough headroom for a quantized 7-8B parameter model with room left over for the application stack, and power-efficient enough to run on the vessel's 24V DC bus without active cooling drama. We'll come back to that last claim.
The stack:
Compute: Jetson Orin NX 16GB. 8GB for model and KV cache, 8GB for OS and application services.
Model: Llama 3.1 8B Instruct, quantized to Q4_K_M via llama.cpp.
Runtime: llama.cpp with the CUDA backend, batch size 1 (single-user workload).
Storage: 256GB NVMe, with a strict log rotation policy that we did not have on day one.
OS: Ubuntu 22.04 LTS, hardened image, locked-down apt sources.
Quantization mattered more than we expected. Q4_K_M gave us around 4 tokens/second on prompt processing and quality that was good enough for structured extraction. Q5_K_M was noticeably better on reasoning-heavy prompts but pushed us close to the RAM ceiling once concurrent application processes were in the mix. We stayed on Q4_K_M and spent the engineering hours on better prompting and retrieval instead.
What broke in month 1: the obvious stuff
Month one is when you find out which of your local-dev assumptions don't survive a production environment.
Cold starts were the first one. The model takes about 40 seconds to load weights into VRAM after a power cycle. Crew don't read change logs, they tap the screen, assume the system is broken, and walk off. We added a persistent model server with a heartbeat endpoint and a visible loading state on the UI. Obvious in hindsight.
Long-input latency was the second. A 4,000-token inspection log fed in as a single prompt took ninety-something seconds to process. We rebuilt the input path to chunk and summarize incrementally, then run a final synthesis pass. Total time stayed roughly the same, but the user sees progress, which turns out to be the only thing that matters when someone is standing in front of a touchscreen with grease on their hands.
Power transients were the third, and the most annoying. Vessels have dirty power. Brownouts during engine start were corrupting the model file on disk about once every two weeks. We moved to a journaled filesystem, added checksum verification at boot, and kept a known-good copy of the model on a read-only partition for recovery. Three lines of bash, in the end. Two weeks to figure out we needed them.
These are failure modes you can plan for. They show up in week two, you fix them in week three, and you feel like you've shipped.
You haven't.
What broke in month 4: the unobvious stuff
Month four is where the expensive lessons live.
Storage bloat from logs. We were logging every inference request to a SQLite database for offline observability: prompt, response, latency, token counts. Reasonable engineering. After 90 days the database had grown past 15GB and queries were timing out. Nobody had thought about a retention policy because nobody was thinking about disk pressure on a box with 256GB of NVMe. We now ship with a 30-day rolling window, aggregated metrics beyond that, and a hard ceiling enforced at the application layer.

Thermal throttling under sustained load. The Jetson's thermal profile in a benchtop test is not its thermal profile inside a sealed enclosure in a 38°C engine room with no airflow. Around month three, inference latency started creeping up by roughly a third. The Orin was downclocking itself to stay under its thermal limit. We added a software governor that throttles concurrent requests when junction temp crosses a threshold, but the real fix was an external heat sink and repositioning the box. That had to wait for a port call. No other option.
Drift, but not the kind you read about. The weights don't drift. They're frozen. What drifted was the input distribution. The crew started using the system for things we hadn't designed prompts for. By month four, about a fifth of queries were "what's the procedure for X" questions. Knowledge-base lookups, not document analysis. The 8B model hallucinated procedures that didn't exist. We added a retrieval layer over the vessel's actual SOPs and routed those query types through RAG instead of free generation. Without telemetry sync back to shore we wouldn't have caught it for another quarter.
OTA updates without bandwidth. The patch for the thermal governor was a 40MB binary delta. At maritime satellite rates that's a few hundred dollars per vessel, per push. We rebuilt the update pipeline around binary diffs, compressed deltas, and a queue that holds updates until the vessel hits port Wi-Fi or a usable LEO window. Most updates now ship in under 2MB. The ones that don't, wait.
Clock drift. No NTP for weeks at a stretch. Logs came back to shore with timestamps that disagreed with our records by hours. We added a GPS-derived time source as a fallback, with a sanity check against the last good sync.
The pattern across all of these: every failure mode in month four was a second-order effect of something we'd already "solved" in month one.
The honest cost comparison
Here's the math that matters. Numbers are approximate and depend heavily on usage patterns, but the shape is what counts.
Edge deployment, per vessel, year one:
Hardware (Jetson, enclosure, sensors): around $1,800
Engineering amortization (deployment, integration, OTA infra): around $2,500 spread across the fleet
Maintenance and support: around $800
Total: roughly $5,100 in year one. About $1,000/year after that.
Hosted API over satellite, same workload:
Estimated 8,000 inference requests/month per vessel, averaging 2,500 input + 500 output tokens
API cost (mid-tier model): around $80/month
Satellite data cost: roughly 20MB/month of API traffic at $8/MB = $160/month
Reliability cost: features unavailable during outages, which the crew works around by not using them at all
Total: about $2,900/year per vessel, every year.
On a pure dollar basis, edge breaks even somewhere around month eighteen per vessel. On reliability, edge wins from day one, because the hosted version simply doesn't work during the thirty-some percent of operating hours with no connectivity.
The catch is that the cost comparison only looks good after you've already paid the engineering bill for OTA, observability, thermal management, and the dozen other things that don't appear in a hardware BOM. If you're deploying to one vessel, hosted is probably cheaper end to end. If you're deploying to fifty, edge wins, and the gap widens with every unit.
Who should actually do this
Edge LLM is the right answer if at least two of these are true:
Connectivity is genuinely unreliable or genuinely expensive. Not "we'd prefer not to pay for API calls." Actually expensive.
You have data residency or sensitivity constraints that rule out cloud inference.
You're deploying at fleet scale, where per-unit engineering amortizes.
Your use cases are bounded enough that an 8B-class model can handle them with good prompting and retrieval.
You have, or can build, the operational muscle for remote device management.
It's the wrong answer if:
You're optimizing for cost on a workload that fits comfortably in a cloud API budget.
You need frontier model capability. The gap between an 8B local model and a frontier hosted model is real, and it's not closing as fast as the hype suggests.
You don't have a plan for OTA, observability, or thermal management before you ship.
Your workload is bursty. Edge wins on sustained, predictable inference, not occasional heavy lifts.
The honest version: most teams asking us about edge LLM should start with a cloud-first prototype, prove the use case, and only move to edge when the connectivity or cost math forces them to. The teams that should go straight to edge usually already know they should. They're the ones whose users can't get a signal in the first place.
What we'd do differently next time
Three things, ranked by how much they would have saved us.
Treat observability as day-zero infrastructure, not month-six. We would have caught the input distribution drift, the storage bloat, and the thermal throttling weeks earlier if we'd built proper telemetry first.
Plan OTA economics before writing the first line of model code. Binary diffs, scheduled sync windows, and a ruthless update size budget aren't optimizations. They're the difference between a deployable system and a science project.
Build the retrieval layer first, then add generation. Most of what users actually asked for was retrieval, not generation. We over-indexed on the LLM and underweighted the boring infrastructure that made it useful.
If you're considering edge LLM for a similar deployment, whether maritime, fleet, industrial, or off-grid, we'd like to hear about it before you commit to a hardware target. The mistakes are expensive, and most of them are avoidable if you know where to look.
Connex Labs is an AI product engineering studio building custom AI products for enterprise and startup clients across govtech, marine, fleet management, healthtech, and mobility.
