Auto-index is a per-agent toggle that grows your knowledge base automatically as visitors browse your site. When a visitor lands on a page the agent has never seen, that page goes into the crawl queue — silently, in the background.
Open the agent's settings page (/app/agents/{id}/settings) and
flip Auto-index visited pages. The change takes effect
immediately — there's no separate publish step for this toggle.
On every /v1/widget/init call, after the agent passes the
origin and quota checks, AutoIndexPageVisit::attempt() runs
a chain of seven guards. All seven must pass for a crawl to be queued:
- auto_index_visited_pages = true.
- Private ranges (10.x, 172.16-31.x, 192.168.x), loopback (localhost, 127.x, ::1), link-local (169.254.x, fe80:), 0.x, IPv6 ULA (fc00:), and .local / .internal domains are blocked.
- /admin, /login, /checkout, /profile, /account, /settings, /cart, /api are skipped.
- Origin header matches the agent's allowed_origins (or matches the page URL's origin when allowed_origins is *).
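To make the host guard concrete, here is a minimal Python sketch of how a check like it could look. It is not the product's implementation: the function name is_blocked_host, the CIDR widths chosen for each range, and the decision to treat unparseable hosts as ordinary domain names are all assumptions made for illustration.

```python
import ipaddress

# Assumed CIDR equivalents of the ranges named above (illustrative only).
BLOCKED_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in (
        "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # private IPv4
        "127.0.0.0/8", "::1/128",                          # loopback
        "169.254.0.0/16", "fe80::/10",                     # link-local
        "0.0.0.0/8",                                       # 0.x
        "fc00::/7",                                        # IPv6 ULA
    )
]
BLOCKED_SUFFIXES = (".local", ".internal")


def is_blocked_host(host: str) -> bool:
    """Return True when the host falls in one of the blocked categories."""
    host = host.strip().lower().rstrip(".")
    # Strip the brackets used around IPv6 literals in URLs, e.g. "[::1]".
    if host.startswith("[") and host.endswith("]"):
        host = host[1:-1]
    if host == "localhost" or host.endswith(BLOCKED_SUFFIXES):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: treat it as a regular domain name.
        # Resolving the name via DNS is out of scope for this sketch.
        return False
    return any(ip.version == net.version and ip in net for net in BLOCKED_NETWORKS)
```

For example, is_blocked_host("172.31.0.1") and is_blocked_host("printer.local") are both true, while is_blocked_host("172.32.0.1") and is_blocked_host("example.com") are false.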
If everything passes, we lazy-create a type=auto source and
dispatch a CrawlPageJob on the crawl queue. The
visitor's request returns immediately — auto-index never blocks the hot
path.
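The "all guards must pass, then enqueue and return" flow can be sketched in Python as below. Everything here is hypothetical: the Visit fields, the two guards shown (only the toggle and origin checks, not the full chain), the representation of a wildcard as a list containing "*", and the in-process queue standing in for the real crawl queue.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Callable, List
from urllib.parse import urlsplit


@dataclass
class Visit:
    # Hypothetical shape of what a widget init call knows about the visit.
    auto_index_enabled: bool
    page_url: str
    origin: str
    allowed_origins: List[str]


def toggle_on(visit: Visit) -> bool:
    return visit.auto_index_enabled


def origin_allowed(visit: Visit) -> bool:
    # With a wildcard, fall back to comparing against the page URL's origin.
    if "*" in visit.allowed_origins:
        page = urlsplit(visit.page_url)
        return visit.origin == f"{page.scheme}://{page.netloc}"
    return visit.origin in visit.allowed_origins


# Only two illustrative guards; the real chain has more.
GUARDS: List[Callable[[Visit], bool]] = [toggle_on, origin_allowed]

crawl_queue: Queue = Queue()  # stand-in for the background crawl queue


def attempt(visit: Visit) -> bool:
    """Queue a crawl only if every guard passes; never wait on the crawl."""
    if not all(guard(visit) for guard in GUARDS):
        return False
    crawl_queue.put_nowait(visit.page_url)  # a worker picks this up later
    return True
```

Because all() short-circuits, the first failing guard stops the chain, and the caller gets its answer without waiting for any crawl work.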
The path blocklist exists because authenticated pages are noisy and
risky to index — a logged-in /profile or
/account/orders page leaks the visitor's data into your
knowledge base. The full list lives in
AutoIndexPageVisit::SKIP_PATH_PATTERNS (case-insensitive,
matches the path segment with optional trailing slash):
- /account, /my-account, /profile, /settings
- /admin
- /login, /signin, /signup, /register, /logout, /auth, /password
- /checkout, /cart, /order, /orders
If your site serves authenticated pages under other paths (for example /portal, /customer), the default blocklist
won't catch them. Disable auto-index, or pre-list the exact public
URLs you want crawled and skip the toggle entirely.
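As an illustration of the matching rule, the Python sketch below builds one case-insensitive regular expression from the listed paths. It assumes a blocked word matches when it appears as a whole path segment anywhere in the URL path, which is one reading of "matches the path segment". The names SKIP_SEGMENTS and is_skipped_path are hypothetical, not the product's actual constants.

```python
import re
from urllib.parse import urlsplit

# Segments taken from the list above; the matching semantics are assumed.
SKIP_SEGMENTS = [
    "account", "my-account", "profile", "settings",
    "admin",
    "login", "signin", "signup", "register", "logout", "auth", "password",
    "checkout", "cart", "order", "orders",
]

# A blocked word must be a whole segment: preceded by "/" (or the start of
# the path) and followed by "/" or the end of the path.
SKIP_RE = re.compile(
    r"(?:^|/)(?:" + "|".join(re.escape(s) for s in SKIP_SEGMENTS) + r")(?:/|$)",
    re.IGNORECASE,
)


def is_skipped_path(url: str) -> bool:
    """Return True when the URL's path contains a blocked segment."""
    return SKIP_RE.search(urlsplit(url).path) is not None
```

Under these assumptions, /Account/ and /account/orders are skipped, while /accounting, /blog/login-tips, and /portal are not, which is why custom authenticated paths need separate handling.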
The 30-crawls-per-agent-per-hour cap is a fixed-window counter keyed in Redis as
auto-index:agent:{id}:hour:{YmdH}, so the count resets at the top of each hour. If a popular page on your
site is getting hammered, the limit kicks in quickly, but normal
traffic patterns rarely hit it.
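The hourly key format suggests a counter that starts fresh each clock hour. The Python sketch below shows that idea with a plain dictionary standing in for Redis (where the equivalent would be INCR on the key plus an expiry). The class name, the use of UTC for the hour stamp, and counting rejected attempts are assumptions.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

LIMIT_PER_HOUR = 30


def window_key(agent_id: int, now: datetime) -> str:
    # PHP's "YmdH" corresponds to strftime's "%Y%m%d%H".
    return f"auto-index:agent:{agent_id}:hour:{now.strftime('%Y%m%d%H')}"


class HourlyLimiter:
    """In-memory stand-in for a per-agent, per-hour counter."""

    def __init__(self, limit: int = LIMIT_PER_HOUR) -> None:
        self.limit = limit
        self.counts: Dict[str, int] = {}

    def allow(self, agent_id: int, now: Optional[datetime] = None) -> bool:
        now = now or datetime.now(timezone.utc)  # UTC is an assumption
        key = window_key(agent_id, now)
        # Redis equivalent: INCR key, then set an expiry on first increment.
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key] <= self.limit
```

With this shape, the 31st call for the same agent within one clock hour is rejected, and the first call in the next hour is allowed again because it lands on a new key.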
Auto-indexed pages become type=auto sources. They show up
in the regular Sources list with a small "auto" pill so
you can see what's been picked up. You can preview, reindex, or delete
them like any other source.
The auto-source's title is the page <title> if available,
otherwise the URL. Path-based deduplication means the same URL crawled
twice doesn't create two sources.
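The title fallback and deduplication can be sketched in Python as follows. The key normalization is an assumption: this sketch lowercases the scheme and host, drops the trailing slash, and ignores the query string and fragment, which may differ from how the product actually compares URLs. The names dedup_key, source_title, and AutoSources are hypothetical.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


def dedup_key(url: str) -> str:
    """Build a comparison key from scheme, host, and path only (assumed)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def source_title(page_title: Optional[str], url: str) -> str:
    """Use the page <title> when present, otherwise fall back to the URL."""
    title = (page_title or "").strip()
    return title or url


class AutoSources:
    """In-memory stand-in for the store of type=auto sources."""

    def __init__(self) -> None:
        self.by_key: Dict[str, dict] = {}

    def upsert(self, url: str, page_title: Optional[str] = None) -> dict:
        key = dedup_key(url)
        if key not in self.by_key:
            self.by_key[key] = {
                "type": "auto",
                "url": url,
                "title": source_title(page_title, url),
            }
        return self.by_key[key]  # an existing source is reused, not duplicated
```

Calling upsert twice with https://example.com/pricing and https://example.com/pricing/ returns the same record under these assumptions, so the Sources list shows a single entry.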
Turn the toggle off and no new pages will be queued, but existing
auto-sources stay. To clean them up, filter the sources list by
type = auto and delete in bulk.