A guest from Tokyo checks into a hotel in Istanbul. They want to ask about breakfast. The receptionist speaks Turkish and English. The guest writes in Japanese.
For decades this has meant confused gestures, dictionary apps, and guests who often give up entirely.
We spent two years building the infrastructure to remove this friction completely. The result runs in production across 700+ hotels in 50+ countries, translating live conversations between guests and staff with median latency under 200 milliseconds. This is how the system works under the hood.
Why Generic Translation APIs Failed for Us
We started where most teams would: piping messages through commercial translation APIs. Within three months in production we hit three walls that pushed us to build our own model.
The first wall was tone. Hospitality is a register-sensitive domain. A polite Japanese request like 「お湯をいただけますか」 ("could I please have some hot water?") came out as a flat imperative in English. The reverse was worse: neutral English requests landed in Japanese, Korean, and German as informal or even rude. Asian languages with explicit honorific systems and Slavic languages with ty/vy distinctions were consistently mistranslated, and guest satisfaction scores in those languages were measurably lower.
The second wall was domain vocabulary. Hospitality has its own lexicon. Concierge, do not disturb, continental breakfast, valet parking, early check-in, late checkout, no-show, compendium, turn-down service: these have established equivalents across the industry, but generic models translate them literally. A Russian guest's «услуга вечерней подготовки номера» should come back as turn-down service; generic APIs returned literal renderings like "evening room preparation service".
The third wall was operational. Per-character pricing scaled unpredictably as our network grew. Upstream rate limits caused tail latency spikes during peak hours when hotels needed reliability most. We had no control over model updates that could change behavior overnight, sometimes breaking carefully tuned hospitality phrasings without notice.
By month four we had decided to build our own. By month eight we had a working prototype. Today, two years in, iRoom LLM handles every message that flows through our network.
iRoom LLM: A Two-Year Engineering Investment
iRoom LLM is a transformer-based translation and reasoning model fine-tuned specifically on hospitality conversational data. It runs entirely on our own infrastructure across multi-region GPU clusters, with no dependency on third-party inference APIs.
The training corpus took eighteen months to assemble. We built it from four sources:
- A base of multilingual web text filtered for hospitality and travel domain content
- A curated dataset of professionally translated hotel collateral across 47 languages: welcome letters, in-room directories, service menus, signage, FAQ pages from luxury hotel chains
- Real conversational chat data from our own platform, with guest consent and full anonymization
- A synthetic dataset of hospitality dialogue scenarios we generated and then had native-speaker reviewers correct for naturalness in 30+ target languages
Training was iterative. We started from a 7B-parameter open-weight base, ran continued pretraining on our hospitality corpus, then progressively fine-tuned through supervised instruction-tuning and reinforcement learning from human feedback using hotel staff and native speakers as raters. The model was evaluated weekly against a held-out benchmark of 12,000 real guest-staff exchanges across the 25 most common language pairs in our network, scored on three axes: semantic accuracy, register preservation, and domain vocabulary fidelity.
The result is a model that consistently outperforms commercial translation APIs on hospitality benchmarks while running 4-6 times faster on our own hardware because we control the inference stack end-to-end.
The Production Architecture
The full message path:
When a guest scans the QR code in their room, the frontend opens a Progressive Web App. Before the chat interface even renders, three things happen in parallel: the browser language is captured, the country and likely language are inferred from the guest's IP via a GeoIP database we cache at the edge, and the room's stored guest profile is queried for any language preference set during check-in. These three signals collapse into a single confirmed guest_locale that is locked in for the session.
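In sketch form, the collapse might look like the following. This is a minimal illustration, not our production code; the helper names and the precedence order (explicit check-in preference beats the browser setting, which beats the GeoIP guess) are assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LocaleSignals:
    profile_preference: Optional[str]  # set at check-in, most authoritative
    browser_language: Optional[str]    # from Accept-Language / navigator.language
    geoip_language: Optional[str]      # likely language inferred from IP country

def normalize_locale(tag: str) -> str:
    # Collapse variants like "ja-JP" or "JA" to a canonical "ja".
    return tag.split("-")[0].strip().lower()

def resolve_guest_locale(signals: LocaleSignals, default: str = "en") -> str:
    """Collapse the three session-start signals into one guest_locale."""
    for candidate in (signals.profile_preference,
                      signals.browser_language,
                      signals.geoip_language):
        if candidate:
            return normalize_locale(candidate)
    return default
```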
When the guest types a message, the client sends the raw UTF-8 text plus the guest_locale and a session token to our edge nodes over a persistent WebSocket. We never trust client-side language detection — clients can be wrong, and WebViews on certain Android devices misreport language entirely.
At the edge, the message hits a router that does three things atomically:
- Writes the raw message to our primary store with the original language tagged
- Publishes the message ID to a translation queue keyed by target locale set
- Acknowledges the client
Acknowledgement happens within 20–40ms at the median (p50), because the original-language storage write is the only blocking operation.
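A sketch of that router step, with the store, queue, and socket behind hypothetical async interfaces. Only the storage write blocks before the ack:

```python
import asyncio
import time
import uuid

async def route_inbound_message(store, queue, ws, msg: dict) -> None:
    """Edge router: persist the original, enqueue translation work, ack.

    `store`, `queue`, and `ws` are illustrative clients; the post
    describes the behavior, not these exact interfaces.
    """
    message_id = str(uuid.uuid4())
    record = {
        "id": message_id,
        "text": msg["text"],                 # raw UTF-8, never mutated
        "source_locale": msg["guest_locale"],
        "session": msg["session_token"],
        "received_at": time.time(),
    }
    # 1. The only blocking step on the ack path: write the original,
    #    tagged with its source language.
    await store.write(record)
    # 2. Non-blocking: publish the message ID keyed by the set of
    #    target locales that may need a translation.
    asyncio.create_task(queue.publish(message_id, targets=msg["target_locales"]))
    # 3. Ack the client immediately; translation happens out of band.
    await ws.send_ack(message_id)
```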
The translation queue is consumed by inference workers running iRoom LLM. On a cache hit, the worker returns the translation in under 5 milliseconds. On a miss, it runs iRoom LLM inference at a p50 of 95ms on a warm GPU.
Each worker performs a cache lookup against a normalized representation of the message: Unicode NFC, casefold, punctuation strip, whitespace collapse, and named entities like guest names and room numbers replaced with typed placeholders. On a miss, it constructs a prompt containing the hospitality system instruction, the conversation context window of the last 3-5 messages, source and target language tags, and the message itself. The result is stored keyed by the normalized form, linked to the original message ID, and emitted on the WebSocket bus.
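A sketch of that worker loop, under the same caveat that the clients are illustrative; `normalize` and `build_prompt` stand in for the helpers sketched under decisions 3 and 5 below:

```python
async def handle_translation_job(cache, model, bus, job: dict,
                                 normalize, build_prompt) -> None:
    """Inference worker: a cache hit returns in ~5ms, a miss runs iRoom LLM."""
    key = (normalize(job["text"]), job["source_locale"], job["target_locale"])
    cached = await cache.get(key)
    if cached is not None:
        await bus.emit(job["message_id"], cached)    # hit: no GPU work
        return
    prompt = build_prompt(
        text=job["text"],
        context=job["recent_messages"],              # last 3-5 turns
        source=job["source_locale"],
        target=job["target_locale"],
    )
    translation = await model.generate(prompt)       # p50 ~95ms on a warm GPU
    await cache.set(key, translation)                # keyed by normalized form
    await bus.emit(job["message_id"], translation)   # linked to original message ID
```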
When staff read a message, the frontend requests rendering in their staff_locale. The backend returns a payload with both the original and the translated text. Staff see the translated version by default, with a tap or hover to reveal the original. Staff replies follow the same pipeline in reverse.
The Five Engineering Decisions That Made This Work
1. Original-as-source-of-truth, translation-as-derived-data
Every message is stored exactly as the sender wrote it, in the source language, with the language explicitly tagged. Translations are derivative artifacts linked by message ID. We never overwrite a stored message with its translation.
This sounds trivial but pays dividends repeatedly. When iRoom LLM was retrained at version 1.4 and again at 2.0, we re-translated portions of historical conversations to backfill improved quality without touching the canonical message store. When a hotel reported a translation error six months after the fact, we could trace the exact model version that produced it and deploy a corrected mapping. When a property requested an export of all guest conversations for legal compliance, we produced it in source-of-truth form, no language ambiguity.
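In sketch form, the split looks roughly like this; the field names are illustrative, but the shape follows directly from the guarantees above:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Message:
    """Canonical record: exactly what the sender wrote, never mutated."""
    id: str
    conversation_id: str
    text: str              # raw original, the source of truth
    source_locale: str     # explicitly tagged at write time, never re-inferred
    sent_at: datetime

@dataclass(frozen=True)
class Translation:
    """Derived artifact, always linked back to the canonical message."""
    message_id: str        # reference to Message.id
    target_locale: str
    text: str
    model_version: str     # lets us trace errors and backfill after retraining
    created_at: datetime
```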
2. Pivot through English, but smarter
We pivot non-English to non-English translations through English. Naive direct translation between every pair would require N × (N − 1) optimized model paths. With pivoting, we maintain N high-quality paths to and from English.
The obvious concern is that pivoting compounds errors: Turkish to English to Korean should be worse than direct Turkish to Korean. In practice we found the opposite for most pairs, because iRoom LLM's training data is much denser for non-English to English pairs than for non-English to non-English pairs. The English representation acts as a high-quality semantic intermediate.
We do bypass the pivot for a small set of high-volume pairs where we have enough native data to make direct translation reliably better — Japanese to Korean, Spanish to Portuguese, Russian to Ukrainian, Arabic to Farsi. The bypass list is data-driven and adjusted quarterly based on benchmark performance.
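The routing rule itself is small. A sketch using ISO 639-1 codes, with the bypass list as a plain set:

```python
# Direct-translation pairs where native data beats pivoting; the list
# is data-driven and revisited quarterly against the benchmark.
DIRECT_PAIRS = {("ja", "ko"), ("es", "pt"), ("ru", "uk"), ("ar", "fa")}

def translation_hops(source: str, target: str) -> list[tuple[str, str]]:
    """Return the sequence of model calls for a language pair."""
    if source == "en" or target == "en" or (source, target) in DIRECT_PAIRS:
        return [(source, target)]          # single optimized path
    # Default: pivot through English as the semantic intermediate.
    return [(source, "en"), ("en", target)]
```

So `translation_hops("tr", "ko")` yields `[("tr", "en"), ("en", "ko")]`, while `translation_hops("ja", "ko")` returns the single direct hop.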
3. Aggressive normalized caching
Hospitality conversation is more repetitive than developers expect. "What time is breakfast?" "Is the gym open?" "Where is the pool?" "Can I get extra towels?" These exact phrases, across thousands of variant spellings, capitalizations, and punctuation patterns, flow through the network millions of times per month.
Our cache layer stores translations keyed by normalized text plus source language plus target language. Normalization runs as a deterministic function: Unicode NFC normalization, casefold, punctuation stripping, whitespace collapse, and named-entity substitution. "Hi!" "hi" "HI!!!" all hit the same entry. "Can I get an extra towel for room 412?" and "Can I get an extra towel for room 815?" both normalize to the same form and share a cache slot.
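A sketch of that normalization function. The regex entity matcher here is a stand-in for illustration; the real typed-placeholder step uses a proper tagger that also covers guest names:

```python
import re
import unicodedata

# Stand-in for the production entity tagger: room numbers only.
ROOM_RE = re.compile(r"\b(room\s+)\d+\b", re.IGNORECASE)

def normalize_for_cache(text: str) -> str:
    """Deterministic cache key: NFC, casefold, entity placeholders,
    punctuation strip, whitespace collapse."""
    text = unicodedata.normalize("NFC", text)
    text = text.casefold()
    text = ROOM_RE.sub(r"\1<ROOM_NUM>", text)       # typed placeholder
    text = re.sub(r"[^\w\s<>]", "", text)           # punctuation strip
    text = re.sub(r"\s+", " ", text).strip()        # whitespace collapse
    return text
```

With this, "Hi!", "hi", and "HI!!!" all reduce to "hi", and the two towel requests both reduce to "can i get an extra towel for room <ROOM_NUM>".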
Production hit rate is 47% network-wide, with translation cost cut roughly in half and p50 latency on cache hits at 4ms. The cache is sharded across regions, eventually consistent, with TTLs tuned per language pair based on observed translation drift between iRoom LLM versions.
4. Lazy fan-out, never eager translation
A conversation might have one guest writing in Japanese and three staff members watching in Turkish, English, and Russian. Naive eager translation generates three translations per guest message. Lazy translation generates one per staff member, only when they actually read.
Most staff don't read most messages in real time: they read the queue when they sit down at the front desk between guest interactions. Lazy translation cut our compute budget by approximately 70% versus eager generation, with no perceptible impact on perceived latency, because translation happens during the read request, which already carries network latency the reader expects.
Cache hit rate on these requests is even higher than on guest writes because the same message is often read by multiple staff members in different languages — second and third reads are nearly always cached.
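A sketch of the read path, assuming the cached `translate` pipeline from the worker sketch above and the same illustrative clients:

```python
async def render_for_staff(store, translate, message_id: str,
                           staff_locale: str) -> dict:
    """Translate only at read time, in the reader's locale."""
    msg = await store.get(message_id)
    if msg["source_locale"] == staff_locale:
        translated = msg["text"]            # nothing to do
    else:
        # The first reader in this locale pays the (possibly cached) cost;
        # second and third reads in the same locale nearly always hit.
        translated = await translate(msg["text"],
                                     source=msg["source_locale"],
                                     target=staff_locale)
    # Both versions ship to the client: translated shown by default,
    # original revealed on tap or hover.
    return {"original": msg["text"], "translated": translated}
```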
5. Domain context as system prompt, conversation history as context window
iRoom LLM receives more than just the message to translate. Each inference call constructs a prompt containing the hospitality system instruction (preserve register, use formal honorifics where appropriate, recognize hospitality terminology, avoid literal translation of cultural idioms), the last 3-5 messages of conversation context, source and target language tags, and the message itself.
The conversation context window is the difference between coherent threading and isolated mistranslation. "Yes, please" translated standalone is ambiguous in many languages. With the previous message visible ("Would you like extra pillows?"), the translation becomes correct in every target language we support.
We tuned the context window length empirically. Within the 3-5 message window, three messages is the sweet spot: longer windows added latency and noise from older off-topic exchanges, while shorter windows lost critical referent information.
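A sketch of the prompt assembly; the system instruction here paraphrases the one described above rather than quoting our actual template, and the plain-text layout is an assumption:

```python
SYSTEM_INSTRUCTION = (
    "You are a hospitality translator. Preserve the register of the "
    "source message, use formal honorifics where the target language "
    "expects them, keep established hospitality terminology, and do "
    "not translate cultural idioms literally."
)

def build_prompt(text: str, context: list[dict],
                 source: str, target: str, window: int = 3) -> str:
    """Assemble an inference prompt: system instruction, recent
    conversation turns, language tags, and the message itself."""
    turns = "\n".join(
        f'{m["role"]}: {m["text"]}' for m in context[-window:]
    )
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Conversation so far:\n{turns}\n\n"
        f"Translate from [{source}] to [{target}]:\n{text}"
    )
```

With the "Would you like extra pillows?" turn inside `context`, a standalone "Yes, please" carries its referent into the target language.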
Production Numbers
Two years in, here is what the system delivers in steady state across the network:
- End-to-end p50 latency from send to translated display: 180ms
- End-to-end p95 latency: 480ms
- End-to-end p99 latency: 950ms
- iRoom LLM inference p50 on warm GPU: 95ms
- iRoom LLM inference p95: 240ms
- Cache hit rate network-wide: 47%
- Cache hit rate at high-volume properties: 70%+
- Translation cost per active room per month: ~$0.04
- Hospitality benchmark score combining BLEU-4, COMET, and register preservation: 0.91
- System availability over 12-month rolling window: 99.97%
For comparison, commercial translation APIs we benchmarked against scored 0.74-0.81 on the same hospitality benchmark.
What We Got Wrong on the Way Here
Plenty.
Our first translation cache implementation used unnormalized message text as the key. Hit rate was 8%. Adding deterministic normalization and named-entity substitution lifted it to 47%. We could have done this in week one but didn't think repetition would matter that much. It mattered enormously.
We initially built without a conversation context window. iRoom LLM was translating each message in isolation. Quality on multi-turn dialogues was significantly worse than on standalone messages, and we didn't catch it for months because our automated benchmarks tested isolated message pairs. Adding a 3-message context window fixed it; building the benchmark for multi-turn dialogues should have come first.
We over-engineered the streaming inference path early on. Token-by-token streaming of translations to the client felt like a good idea — staff would see translation appear as it generated. In practice, hotel staff prefer to see the complete translated message at once because partial messages look broken on their dashboards. We removed streaming and saved the implementation complexity.
Our first multi-region deployment routed all inference to a single primary region. Tail latency for Asia-Pacific hotels was awful. We migrated to region-local inference clusters with model weight replication, which cut p99 latency in half and increased cost only marginally.
Conclusion
This entire system powers the chat layer of iRoom Help — used by 700+ hotels worldwide. The chat is one feature among many, but it is the piece that took the most engineering investment and the one that most clearly removed a problem hotels could not solve before.
If you are building anything that involves real-time cross-lingual communication at scale, hopefully some of these decisions save you time. We learned most of them by getting them wrong first.
Frequently asked questions
What is the median end-to-end latency for a translated message in iRoom?
Median end-to-end latency from send to translated display is 180ms; p95 is 480ms and p99 is 950ms across our 50+ country network.
How many languages does iRoom LLM support?
iRoom LLM supports translation across 100+ languages, with the densest training data and best benchmark performance on the top 30 hospitality-relevant languages.
Why build a custom translation model instead of using commercial APIs?
Generic APIs failed on tone preservation, hospitality-specific vocabulary, and operational reliability. Our domain-tuned model scores 0.91 on a hospitality benchmark vs 0.74-0.81 for commercial alternatives.
How does iRoom keep translation costs low at scale?
Aggressive normalized caching with named-entity substitution achieves a 47% network-wide hit rate. Combined with lazy fan-out translation (only translate when staff actually reads), compute cost drops by roughly 70% versus naive eager translation.
Does iRoom store the original guest message or only the translation?
Every message is stored exactly as the sender wrote it, in the source language with explicit language tagging. Translations are derivative artifacts linked by message ID — the original is always preserved as the source of truth.