# Proposal — Live Visitor-Locale Auto-Detection & Bot Reply Language

**Status:** draft (audit iter 11, 2026-05-16)
**Owner:** TBD
**Effort:** 4 days (2 backend, 1 widget, 1 docs+QA)
**Plan gate:** all plans (table-stakes feature, no gating)

---

## Problem

Today the bot replies in the language set by the agent's
`language_default` (typically `en`). When a Spanish-speaking visitor
asks a question in Spanish on a workspace whose default is English:

- The system prompt is English.
- The bot replies in English even if the user wrote in Spanish.
- The visitor either silently drops off or asks again in broken English.

Buyer feedback (multiple LATAM/MENA buyers, batches v2-v4):
> "Mis visitantes hablan español. ¿Por qué responde en inglés?"
> ("My visitors speak Spanish. Why does it reply in English?")
>
> "Most of my traffic is from UAE; I need Arabic replies, not just
>  the launcher label."
>
> "The starter prompts translate, but actual answers don't."

Current state — only the WIDGET CHROME (launcher label, lead form
labels, error toasts) translates via `App\Services\I18n\LocaleCatalog`.
The conversation content stays in the agent's default language.

Competitors:
- **Intercom** — auto-detect via browser `Accept-Language` + per-message
  language detection; bot responds in detected language.
- **Drift** — same pattern.
- **Crisp** — full message-level detection, supports 45 languages
  natively.

Replying only in the agent's default language is a real blocker for
Pitchbar's LATAM/MENA buyers.

---

## Goals

1. Detect the visitor's preferred language, first at widget boot and
   then per message:
   - On boot, trust `navigator.language` first (the visitor's explicit
     browser preference).
   - Fall back to the first parseable `Accept-Language` entry seen at
     the widget-init endpoint.
   - During the conversation, apply message-level detection: if the
     visitor writes in Spanish even though the browser says English,
     switch the reply language.
2. The bot replies in the detected language by appending a small
   directive to the system prompt:
   > "The visitor's preferred language is {locale}. Reply in that
   >  language unless the visitor explicitly switches."
3. Curated answers already have a `lang` column, and starter prompts
   gain optional per-locale entries. Filter both by the detected
   language and fall back to the agent default.
4. Widget chrome (launcher label, lead form, error toasts) already
   localizes via `LocaleCatalog` — just wire the detected locale to
   that resolver instead of `app()->getLocale()`.

---

## Non-goals

- **No machine-translation pre-processing of the system prompt.** We
  pass the visitor's locale as a string to the LLM and let modern
  multilingual models (Llama 3.3, GPT-4o, Claude) handle the
  generation. The system prompt itself stays in English, because
  these models follow instructions most reliably when the prompt is
  left in the language it was written in.
- **No translation of crawled knowledge base.** Out of scope for v1.
  RAG retrieval stays in the indexed language (typically the website's
  language); the bot synthesizes the reply in the visitor's language
  from chunks in that indexed language. The large LLMs are competent
  at this cross-lingual synthesis.
- **No translation of operator-facing UI text.** That's a separate
  i18n project for the admin SPA.
- **No language switcher in the widget v1.** Auto-detect only. Manual
  switch is a follow-up.

---

## Detection pipeline

### Widget side
- On widget boot (`resources/widget/src/core/init.ts`), capture
  `navigator.language` and the first three entries of
  `navigator.languages`, then POST them to `/api/v1/widget/init` as
  `visitor_locales`.
- Send a ranked array rather than a single value, so the server can
  pick the highest-priority supported locale.
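
A minimal sketch of the capture step, assuming a hypothetical helper `collectVisitorLocales` and a `NavigatorLike` shape so the logic can run outside a browser. Deduplication is an assumption added here, since `navigator.language` usually repeats `navigator.languages[0]`.

```typescript
// Minimal shape of the browser Navigator that this sketch relies on.
interface NavigatorLike {
  language?: string;
  languages?: readonly string[];
}

// Hypothetical helper: navigator.language first, then the first three
// navigator.languages entries, with duplicates and blanks removed.
function collectVisitorLocales(nav: NavigatorLike): string[] {
  const candidates = [nav.language, ...(nav.languages ?? []).slice(0, 3)];
  const ranked: string[] = [];
  for (const tag of candidates) {
    const trimmed = tag?.trim();
    if (trimmed && !ranked.includes(trimmed)) {
      ranked.push(trimmed);
    }
  }
  return ranked;
}

// Hypothetical body for POST /api/v1/widget/init; only the
// visitor_locales field name comes from this proposal.
function buildInitPayload(nav: NavigatorLike): { visitor_locales: string[] } {
  return { visitor_locales: collectVisitorLocales(nav) };
}
```

In the widget this would be called as `buildInitPayload(navigator)` before the init request is sent.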

### Server side (new code)
- `App\Services\I18n\VisitorLocaleResolver::resolve(Request, Agent): string`
  - Parse `visitor_locales` body field.
  - Intersect with `LocaleCatalog::supportedLocales()` (already exists),
    matching a regional tag to its base language when only the base is
    supported (`es-AR` → `es`).
  - Fall back to `Accept-Language` header.
  - Final fallback: agent's `language_default`.
  - Return BCP-47 tag (`es`, `pt-BR`, `ar`, `zh-CN`).
- Persist on `conversations.visitor_locale` (new column, nullable).
- Per-turn drift detection:
  - `MessageStreamController` calls
    `LocaleDetector::detectFromText($message)` on the visitor's first
    1-2 messages. If detection differs from stored locale with high
    confidence (>0.85), update `conversations.visitor_locale`.
  - `LocaleDetector` impl uses a tiny n-gram heuristic (no extra
    dependency) — competitive with cld3 for the top 30 languages.
    Cached per-conversation.
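
The resolution order above can be sketched as follows. This is written in TypeScript only to illustrate the logic; the real resolver would live in the PHP service. All function names are hypothetical, and the exact-then-base matching rule is an assumption inferred from the `es-AR` → `es` test case.

```typescript
// Parse an Accept-Language header into tags ordered by descending q-value.
// Wildcards, empty entries and q=0 entries are dropped.
function parseAcceptLanguage(header: string): string[] {
  return header
    .split(",")
    .map((part, index) => {
      const pieces = part.trim().split(";");
      const tag = (pieces[0] ?? "").trim();
      const qParam = pieces.slice(1).map((p) => p.trim()).find((p) => p.startsWith("q="));
      const q = qParam ? Number.parseFloat(qParam.slice(2)) : 1;
      return { tag, q: Number.isNaN(q) ? 0 : q, index };
    })
    .filter((entry) => entry.tag !== "" && entry.tag !== "*" && entry.q > 0)
    .sort((a, b) => b.q - a.q || a.index - b.index)
    .map((entry) => entry.tag);
}

// Match one tag against the supported list: exact tag first, then the
// base language (assumption: "es-AR" may resolve to "es").
function matchSupported(tag: string, supported: readonly string[]): string | null {
  const wanted = tag.toLowerCase();
  const exact = supported.find((s) => s.toLowerCase() === wanted);
  if (exact) return exact;
  const base = wanted.split("-")[0] ?? wanted;
  return supported.find((s) => s.toLowerCase() === base) ?? null;
}

// Body field first, then Accept-Language, then the agent default.
function resolveVisitorLocale(
  visitorLocales: readonly string[],
  acceptLanguage: string | null,
  supported: readonly string[],
  agentDefault: string,
): string {
  const ranked = [...visitorLocales, ...parseAcceptLanguage(acceptLanguage ?? "")];
  for (const tag of ranked) {
    const match = matchSupported(tag, supported);
    if (match) return match;
  }
  return agentDefault;
}
```

Under this rule `pt-PT` does not resolve to `pt-BR`, because only the base tag `pt` is tried; whether sibling regions should match is left open here.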

### Prompt injection
`PromptBuilder::build($agent, $messages, $sources, $verticalFragment, $visitorLocale)`
appends:

```
The visitor's preferred reply language is {locale-label}. Reply in
that language. Quote source text verbatim when citing — translate
your own narration but never translate quoted source material.
```

Where `{locale-label}` resolves to `"Spanish (es)"`, `"Arabic (ar)"`,
etc. via the existing `LocaleCatalog`.
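
A sketch of how the directive could be assembled, in TypeScript for illustration only (the real `PromptBuilder` is PHP). The `LOCALE_LABELS` map is a hypothetical stand-in for `LocaleCatalog`, and returning an empty string for an unknown locale is an assumption.

```typescript
// Hypothetical stand-in for the LocaleCatalog label lookup.
const LOCALE_LABELS = new Map<string, string>([
  ["en", "English"],
  ["es", "Spanish"],
  ["ar", "Arabic"],
  ["pt-BR", "Brazilian Portuguese"],
]);

// Produces labels such as "Spanish (es)", or null for an unknown locale.
function localeLabel(locale: string): string | null {
  const languageName = LOCALE_LABELS.get(locale);
  return languageName ? `${languageName} (${locale})` : null;
}

// Builds the directive appended to the system prompt. An unknown locale
// yields an empty string, so the prompt is left unchanged (assumption).
function buildLocaleDirective(locale: string): string {
  const label = localeLabel(locale);
  if (label === null) return "";
  return [
    `The visitor's preferred reply language is ${label}. Reply in`,
    "that language. Quote source text verbatim when citing; translate",
    "your own narration but never translate quoted source material.",
  ].join(" ");
}
```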

---

## Data model

New migration:

```php
Schema::table('conversations', function (Blueprint $table) {
    $table->string('visitor_locale', 16)->nullable()->after('page_url')->index();
});
```

No new tables.

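`curated_answers.lang` column already exists. `CuratedAnswerMatcher`
must add a `where('lang', $visitorLocale)->orWhere('lang', $agentDefault)`
filter, grouped inside a `where(function ($q) { ... })` closure so the
`orWhere` cannot escape the existing workspace scope, and ordered so
that a visitor-locale match takes precedence over the agent default.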
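
The intended precedence can be sketched over in-memory rows, in TypeScript for illustration only (the real matcher is a PHP query). The row shape and the `workspaceId` field are hypothetical. The sketch keeps the language alternatives grouped inside the workspace check, so an OR on language cannot widen the tenant scope.

```typescript
// Hypothetical row shape; the real curated_answers columns may differ.
interface CuratedAnswerRow {
  id: number;
  workspaceId: number;
  lang: string;
  answer: string;
}

// Keep only rows in the caller's workspace AND in one of the two accepted
// languages, then prefer the visitor's locale over the agent default.
function pickCuratedAnswer(
  rows: readonly CuratedAnswerRow[],
  workspaceId: number,
  visitorLocale: string,
  agentDefault: string,
): CuratedAnswerRow | null {
  const eligible = rows.filter(
    (row) =>
      row.workspaceId === workspaceId &&
      (row.lang === visitorLocale || row.lang === agentDefault),
  );
  return (
    eligible.find((row) => row.lang === visitorLocale) ??
    eligible.find((row) => row.lang === agentDefault) ??
    null
  );
}
```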

`agents.starter_prompts` is already a JSON array — extend the renderer
to look for `{ lang: 'es', prompts: [...] }` shape entries first,
falling back to the flat array if no locale match.
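
A sketch of the renderer lookup, assuming localized entries and flat strings can coexist in the same JSON array. The type and function names are hypothetical.

```typescript
// A localized entry as described above: { lang: 'es', prompts: [...] }.
type LocalizedPrompts = { lang: string; prompts: string[] };

// Assumption: the JSON array may mix flat strings with localized entries.
type StarterPromptEntry = string | LocalizedPrompts;

// Look for a localized entry matching the visitor's locale first;
// otherwise fall back to the flat string prompts.
function resolveStarterPrompts(
  entries: readonly StarterPromptEntry[],
  visitorLocale: string,
): string[] {
  const localized = entries.filter(
    (entry): entry is LocalizedPrompts => typeof entry !== "string",
  );
  const match = localized.find((entry) => entry.lang === visitorLocale);
  if (match) return match.prompts;
  return entries.filter((entry): entry is string => typeof entry === "string");
}
```

With this shape, existing agents that store only a flat array keep rendering exactly as before.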

---

## Hot-path safety

- `VisitorLocaleResolver::resolve()` runs at widget INIT (not per-turn).
  Result cached on the conversation row.
- `LocaleDetector::detectFromText()` runs per-turn but on visitor's
  text only, NOT on retrieved chunks. Cost: ~1-2ms n-gram lookup. No
  DB call — uses an in-memory n-gram table loaded at boot.
- Prompt injection adds ~30 tokens — well within budget.
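- Zero new HTTP calls. Zero new DB writes before first token, provided
  the drift update to `conversations.visitor_locale` is deferred until
  after streaming begins and the in-memory value feeds the prompt for
  that turn.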

---

## Test plan

Pest feature tests:
- `VisitorLocaleResolver` ranks `['es-AR', 'en-US', 'pt-PT']` → returns `es`.
- Falls back to `Accept-Language` when body field empty.
- Falls back to `agent.language_default` when both empty / unsupported.
- `LocaleDetector` returns `es` for "Hola, ¿cuál es el precio?".
- `LocaleDetector` returns `ar` for Arabic text + RTL flag.
- `LocaleDetector` returns null (no change) when confidence <= 0.85.
- Stream test: visitor message in Spanish → conversation row updated
  → bot reply directive includes "Spanish (es)".
- Curated answer in `es` lang matched before agent-default `en`
  curated answer.
- Starter prompts: shape `{lang:'es',prompts:['…']}` rendered for
  Spanish visitor; flat array for English.
- Cross-tenant: workspace A's curated `es` answer not exposed to
  workspace B's conversation.

UI test plan:
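1. Change the browser's preferred language to `es-AR` so that
   `navigator.language` reports it (the property is read-only and
   cannot be assigned directly), then reload the widget.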
2. Open the demo agent's launcher → confirm starter prompts in
   Spanish (where translated).
3. Send "¿Cuáles son sus precios?" → confirm reply in Spanish.
4. Change the browser's preferred language back to `en-US` and reload;
   confirm the reply switches back to English.
5. Send English message on Spanish-locale conversation, then a
   Spanish message — confirm bot follows visitor's last-used language
   (drift detection).

---

## Rollout

1. Phase 1 (2 days): `VisitorLocaleResolver` + DB migration +
   PromptBuilder hookup + tests. Behind feature flag
   `agents.i18n_auto_detect_enabled` (default ON).
2. Phase 2 (1 day): widget bootstrap change + per-turn drift
   detection.
3. Phase 3 (1 day): curated/starter localization + docs +
   troubleshooting page.
4. Canary: enable on Pitchbar demo agent first. Watch
   `conversations.visitor_locale` distribution; investigate any
   `null` rate >5%.

---

## Risks / open questions

- **Model competence in long-tail languages.** Llama 3.3 is strong in
  EN/ES/PT/DE/FR/IT, decent in AR/ZH/JA, weak below the top-20.
  When `visitor_locale` is in the long tail, prompt directive still
  fires; model may reply in EN with broken target-language phrases.
  Acceptable degradation — better than English-always.
- **Mixed-language conversations.** Visitor writes EN then ES then
  EN. Drift detection switches the locale each time. Could be jarring;
  consider a 2-turn hysteresis (require 2 consecutive turns in a new
  language to flip).
- **RAG citation translation.** Bot is instructed to keep source
  text verbatim and only translate its own narration. Workers AI
  Llama 3.3 sometimes still translates quotes — acceptable but worth
  flagging.
- **Spam abuse.** Attacker forges `visitor_locales` to a long-tail
  locale to confuse the bot. No real risk — bot just replies in that
  language. Rate-limit not needed.
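
The 2-turn hysteresis suggested under mixed-language conversations could look like the sketch below, in TypeScript for illustration only. The state shape and names are hypothetical. Resetting the streak on a low-confidence turn is an assumption about what "consecutive" should mean.

```typescript
// Output of a per-message detector (hypothetical shape).
interface Detection {
  locale: string;
  confidence: number;
}

// Per-conversation drift state (hypothetical shape).
interface DriftState {
  current: string;        // locale currently used for replies
  pending: string | null; // candidate locale seen on recent turns
  streak: number;         // consecutive confident turns in `pending`
}

const CONFIDENCE_THRESHOLD = 0.85; // detections must exceed this value
const TURNS_TO_FLIP = 2;           // consecutive turns required to switch

function applyDetection(state: DriftState, detection: Detection | null): DriftState {
  // No detection or low confidence: keep the locale and reset the streak.
  if (detection === null || detection.confidence <= CONFIDENCE_THRESHOLD) {
    return { ...state, pending: null, streak: 0 };
  }
  // Confident detection of the current locale cancels any pending switch.
  if (detection.locale === state.current) {
    return { current: state.current, pending: null, streak: 0 };
  }
  const streak = detection.locale === state.pending ? state.streak + 1 : 1;
  if (streak >= TURNS_TO_FLIP) {
    return { current: detection.locale, pending: null, streak: 0 };
  }
  return { current: state.current, pending: detection.locale, streak };
}
```

With these rules an EN, ES, EN sequence never flips the reply language, while two confident Spanish turns in a row do.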

---

## Why now

- 3+ buyers in LATAM/MENA blocked or churned over this in the past 90
  days. Smallest engineering surface among the open proposals (~4
  days).
- Modern LLMs make this almost free — the whole feature is "tell the
  model the visitor's language and it handles it". The work is the
  detection + plumbing, NOT the multilingual NLU.
- Closes a competitive gap (Intercom/Drift/Crisp all ship this) AND
  unlocks new geographic markets for Pitchbar's CodeCanyon listing
  (whose buyers are currently ~85% English-speaking).