ARA — Control Panel

Place a Call

Phone Number (E.164 format)

Language

Agent Slots —/—

Use Reset Slots if you see "All agents busy" but no active call is running — a slot may be stuck.

Last Call

—

initiated

Post-Call Analysis

Duration

—

User Turns

—

Bot Turns

—

Barge-ins

—

—✓ —✗

Conversation Timeline

No analysis loaded

✓ Accepted

✗ Rejected (too short / low-confidence)

Recent Calls

No calls yet

LLM Chat Test

Type a message to test the LLM using current prompt & settings.

Uses current Settings (temperature, max tokens) and Prompts tab system prompt.

BaseCamp — Prompt Builder

Answer the questions below to auto-generate and evaluate your voice bot prompt

0 of 10 questions answered

Start from a Template

1 Bot Goal *

Describe what the bot should accomplish — be specific.

2 Bot Name & Company

3 Tone

Professional Friendly Formal Casual

4 Response Length

Very short — 1 sentence Short — 1-2 sentences (recommended) Balanced — 2-3 sentences

5 Off-limits Topics

6 Escalation — When to Transfer

7 Fallback Phrase

8 Language Rule

English only Japanese only Match caller's language

9 Fixed Facts to Embed

10 Special Rules

Generate prompt in

Draft Prompt

Fill in the questionnaire and click Generate Draft

The AI will build a voice-optimised prompt for you

System Prompt

0 chars

Initial Greeting

Apply to Prompts will overwrite the saved EN or JP prompt (matching the language you generated in).

Quality Evaluation

Generate a draft and click Evaluate

The AI will score 5 quality dimensions and suggest improvements

Evaluating draft…

Top Suggestion

🇺🇸 English Prompt

Saved changes take effect on the next call. Use {date} for today's date.

System Prompt

0 chars

Initial Message (bot's first line)

🇯🇵 Japanese Prompt

Saved changes take effect on the next call. Use {date} for today's date.

System Prompt

0 chars

Initial Message (bot's first line)

Version History

Model & Call Settings

Temperature 0.67

0 — precise2 — creative

Max Tokens

200–500 recommended for voice calls.

TTS Speed 0.92

0.5 — slow2.0 — fast

Default Inbound Language

🇺🇸 English 🇯🇵 Japanese

Used when the INBOUND_LG env var is not set.

Prompt Evaluation

Auto-tests the active prompt — the LLM acts as judge and scores each response

8 default scenarios • click ✨ to generate prompt-specific tests

—

Pass Rate

—

Avg Score

Recommendations

Custom Test Scenarios

Loading…

Eval History

Score Trend — avg per run, oldest → newest

A/B Prompt Comparison

Prompt A — Saved Version

Prompt B — Current Active

Saved prompt (auto-loaded)

Runs all active scenarios against both versions sequentially

Conversation Simulator

Define a conversation script and auto-run it against the active bot prompt

Script — one user turn per row

Turn 1

Simulation Transcript

Simulation of Chat

Auto-generate best, worst, and neutral questions from your prompt — then score each response

0 / 10 0%

Click "Generate Questions" to create test scenarios from your current prompt

The AI will create best-case, worst-case, and neutral questions, then score each bot response

Average Score

—

Best Case Avg

—

Worst Case Avg

—

Neutral Avg

—

Best Case

Worst Case

Neutral

How to Use

📞 Dial Tab

Place a Call: Enter a phone number in E.164 format (e.g. +819012345678), select the language (EN or JP), then click Dial.
Language: Determines which system prompt and initial message the bot uses for that call.
Analyze: After a call, click 📊 Analyze to see the full conversation timeline, turn counts, barge-in events, and accepted/rejected interruptions.
Recent Calls: The last 20 calls are listed below the dial form. Click 📊 View on any row to analyze it.

🤖 Chat Tab

Interactive test: Chat directly with the LLM using the current active prompt and settings — without making a real phone call.
Language: Switch EN/JP to load the corresponding system prompt into the conversation context.
Clear: Resets the conversation history. Language switches also clear history automatically.

💬 Prompts Tab

System Prompt: The bot's core personality and instructions. Use {date} to inject today's date at call time.
Initial Message: The first line the bot speaks when a call connects.
Save: Saves the prompt — takes effect on the next call (no restart needed). Every save auto-creates a snapshot.
Reset to Default: Reverts to the factory-default prompt for that language.
📸 Snapshot: Manually save a named version of the current prompt (e.g. "Before tone change").
Version History: Lists all saved snapshots. Diff shows a side-by-side comparison against the current editor content. Restore overwrites the active prompt with the selected version.

⚙️ Settings Tab

Temperature: Controls response creativity. Lower = more deterministic answers; higher = more varied. Recommended: 0.5–0.8 for voice bots.
Max Tokens: Maximum response length. Recommended: 200–500 for voice calls to keep responses concise.
TTS Speed: Speech playback speed for the TTS engine. 0.9–1.0 sounds natural.
Default Inbound Language: Language used for inbound calls when no environment variable overrides it.

🧪 Eval Tab

Tests the active prompt by simulating single-turn user inputs and having the LLM judge the bot's response. Each scenario is scored 1–5 and marked PASS (≥4) or FAIL.

🧪 Scenarios: Run the 8 built-in test scenarios against the current prompt. Click ▶ Run All to test all at once, or run individual scenarios. Export results as CSV or PDF.
📝 Custom: Create your own test scenarios with a name, icon, user message, and pass criteria. Save them to the server and load them into the active eval set.
📈 History: Every time you use Run All, the results are saved locally in your browser. This tab shows a score trend chart and a history table. Export any run as CSV/PDF.
⚖️ A/B: Compare two prompt versions head-to-head. Select a saved version as Prompt A — the current active prompt is Prompt B. Run all scenarios against both and see which wins per scenario.

🎭 Simulate Tab

Purpose: Pre-define a multi-turn conversation script and run it automatically without typing each message manually.
Script: Each row is one user turn. Add turns with + Add Turn, remove with ×.
Run: Click ▶ Run Simulation — the bot responds to every turn in order using the current prompt and settings. The transcript appears live as each response arrives.
Use case: Regression testing after prompt changes, scripted demo walkthroughs, edge case testing across multiple turns.

🔬 RAG Evaluation Tab

Evaluate the bot against a real knowledge base. Upload a Q&A dataset, then use three sub-modes to measure accuracy, chat interactively with retrieved context, or auto-generate & run test scenarios.

Step 1 — Upload a Dataset

Supported formats: CSV, XLSX, XLS, PDF (max 10 MB).
Required columns: one column for questions and one for expected answers. An optional context column provides pre-retrieved text used for faithfulness scoring.
Column mapping: After upload, use the dropdowns to map detected columns to Question, Expected Answer, and optionally Context. Adjust the row limit to test a subset first.
Preview: The first 5 rows are shown so you can verify the column mapping before running evaluation.

Sub-mode: 📊 Evaluate

Metrics: Choose from 6 LLM-judge metrics — Correctness, Completeness, Relevance, Faithfulness, Conciseness, Context Utilisation. Each is scored 1–5; a row passes at ≥ 4.
Run: Click ▶ Run RAG Evaluation. The bot answers each question using the current prompt; an LLM judge then scores each answer per-metric.
Summary card: Shows average score per metric and the top improvement recommendations.
Per-question results: A table lists every question with bot answer, metric scores, overall score, and PASS/FAIL. Expand any row for detailed feedback.
Export: Download results as CSV or PDF.

Sub-mode: 💬 RAG Chat

Purpose: Conversationally test the bot with questions answered using the uploaded dataset as live context — mirrors real RAG behaviour at runtime.
How retrieval works: Each message is scored against every dataset row by keyword overlap. The top-N rows are injected into the bot's system prompt before the LLM is called.
Top-N: Select how many rows to retrieve as context (3, 5, 8, or 12). More rows give richer context but may reduce relevance precision.
Context citations: Each reply shows "📚 N context sources used" — expand it to see exactly which rows were injected so you can audit retrieval quality.
Multi-turn: The last 10 turns are sent with each request so the bot maintains conversational continuity.
Clear: Resets the conversation history.

Sub-mode: 🧪 Generate & Test

Generate Scenarios: The LLM analyses your current bot prompt + a random sample of your dataset and auto-creates 14–20 targeted test scenarios covering 7 test types:

✅ Happy Path — question is in the KB; bot should answer correctly.

🔍 Partial Match — question partially overlaps KB entries; tests completeness.

🚫 Out of Scope — question is not in the KB; bot must not hallucinate.

🔄 Rephrased — same KB question asked differently; tests semantic robustness.

🧩 Multi-Step — requires combining multiple KB entries.

↩ Off-Topic — unrelated to bot domain; bot should deflect gracefully.

⚡ Edge Case — ambiguous, very short, or wrong-language input.

Sample rows: Controls how many dataset rows are shown to the LLM when generating scenarios (8–20). More rows produce more diverse scenarios.
Run All Tests: Evaluates every generated scenario using Correctness, Completeness, Relevance, Faithfulness. The most relevant KB rows are auto-retrieved as context per question. Scenario cards update in real-time as results arrive.
Result summary: After completion, a pass-rate breakdown by test type shows where the bot struggles (e.g. Out-of-Scope or Edge-Case failures).
Export: Download all scenario results as CSV or PDF.

📞 ダイヤルタブ

電話をかける：E.164形式の電話番号（例：+819012345678）を入力し、言語（EN または JP）を選んでダイヤルをクリックします。
言語：選択した言語に対応したシステムプロンプトと最初のメッセージをボットが使用します。
通話分析：通話後に📊 Analyzeをクリックすると、会話タイムライン・ターン数・バージイン（割り込み）の受諾・拒否イベントが確認できます。
最近の通話：直近20件の通話がフォームの下に一覧表示されます。各行の📊 Viewから分析できます。

🤖 チャットタブ

インタラクティブテスト：実際の電話をかけずに、現在のプロンプトと設定を使ってLLMと直接会話できます。
言語切り替え：EN/JPを切り替えると、対応するシステムプロンプトが会話コンテキストに読み込まれます。
クリア：会話履歴をリセットします。言語切り替え時も自動的にリセットされます。

💬 プロンプトタブ

システムプロンプト：ボットの個性と指示を定義します。{date}と入力すると通話時に今日の日付が自動挿入されます。
最初のメッセージ：通話開始時にボットが最初に話す内容です。
保存：プロンプトを保存します。次回の通話から有効になります（再起動不要）。保存のたびに自動スナップショットが作成されます。
デフォルトに戻す：その言語の初期プロンプトに戻します。
📸 スナップショット：現在のプロンプトに名前をつけて手動保存できます（例：「トーン変更前」）。
バージョン履歴：保存済みスナップショットの一覧です。Diffで現在の編集内容との差分を並べて確認でき、Restoreで選択したバージョンを有効プロンプトに上書き復元できます。

⚙️ 設定タブ

Temperature（温度）：応答の創造性を制御します。低いほど一貫した回答、高いほど多様な回答になります。音声ボットには0.5〜0.8が推奨です。
Max Tokens：応答の最大長さです。音声通話では200〜500が推奨です。
TTS Speed：TTSエンジンの音声再生速度です。0.9〜1.0が自然に聞こえます。
デフォルト受信言語：環境変数の上書きがない場合に着信通話で使用される言語です。

🧪 Evalタブ

シングルターンのユーザー入力をシミュレートし、LLMがボットの回答を採点することで現在のプロンプトをテストします。各シナリオは1〜5点で採点され、4以上でPASS、3以下でFAILとなります。

🧪 シナリオ：8つの組み込みテストシナリオを実行します。▶ Run Allで全件実行、個別実行も可能です。結果はCSVまたはPDFでエクスポートできます。
📝 カスタム：独自のテストシナリオを作成できます（名前・アイコン・ユーザーメッセージ・判定基準）。サーバーに保存して評価セットとして利用できます。
📈 履歴：「Run All」を実行するたびに結果がブラウザのローカルストレージに保存されます。スコアトレンドチャートと履歴テーブルを確認でき、各実行結果をCSV/PDFでエクスポートできます。
⚖️ A/B比較：2つのプロンプトバージョンを比較します。保存済みバージョンをプロンプトA、現在の有効プロンプトをプロンプトBとして全シナリオを実行し、どちらが優れているかをシナリオごとに確認できます。

🎭 シミュレートタブ

目的：複数ターンの会話スクリプトを事前に定義し、手動で入力せずに自動実行できます。
スクリプト：各行が1ターンのユーザー発言です。+ Add Turnでターンを追加し、×で削除できます。
実行：▶ Run Simulationをクリックすると、現在のプロンプトと設定を使ってボットが各ターンに順番に応答します。応答が届くごとにリアルタイムでトランスクリプトが表示されます。
用途：プロンプト変更後のリグレッションテスト、スクリプト化したデモ、複数ターンにまたがるエッジケースのテスト。

🔬 RAG評価タブ

実際のナレッジベースに対してボットを評価します。Q&Aデータセットをアップロードし、3つのサブモードでパフォーマンス測定・インタラクティブなチャット・テストシナリオの自動生成＆実行が行えます。

Step 1 — データセットをアップロード

対応フォーマット：CSV、XLSX、XLS、PDF（最大10MB）。
必須列：質問列と期待される回答列が必要です。オプションでコンテキスト列を指定すると忠実性（Faithfulness）スコアリングに使用されます。
列マッピング：アップロード後、検出された列を質問・期待される回答・オプションでコンテキストにドロップダウンで対応付けます。まずサブセットを試す場合は行数制限も調整できます。
プレビュー：最初の5行がテーブル表示されるので、実行前にマッピングが正しいか確認できます。

サブモード：📊 Evaluate（評価）

メトリクス：6種類のLLM評価メトリクスから選択できます — 正確性・完全性・関連性・忠実性・簡潔性・コンテキスト活用度。各メトリクスは1〜5点で採点され、4以上でPASSとなります。
実行：▶ Run RAG Evaluationをクリックします。ボットが現在のプロンプトで各質問に回答し、LLMジャッジが各メトリクスのスコアを割り当てます。
サマリーカード：メトリクスごとの平均スコアとトップ改善提案が表示されます。
質問別結果：全質問についてボット回答・各メトリクスのスコア・総合スコア・PASS/FAILを一覧表示。行を展開すると詳細なフィードバックが確認できます。
エクスポート：結果をCSVまたはPDFでダウンロードできます。

サブモード：💬 RAG Chat

目的：アップロードしたナレッジベースをコンテキストとして使いながら対話形式で質問できます。実際のボットのRAGの動作をそのままミラーリングします。
取得の仕組み：メッセージを送信すると、キーワードの重複スコアで全データセット行を評価し、上位Top-N行をボットのシステムプロンプトに注入してからLLMを呼び出します。
Top-N：コンテキストとして取得する行数を制御します（3・5・8・12）。大きくするとコンテキストが増えますが、関連性が薄れる場合があります。
コンテキスト引用：各ボット回答に「📚 N個のコンテキストソースを使用」欄が表示され、展開すると注入された正確な行を確認できます。取得品質の検証に役立ちます。
マルチターン：直近10ターンの履歴が各リクエストに含まれ、会話の連続性が保たれます。
クリア：チャット履歴をリセットしてバブルエリアを空にします。

サブモード：🧪 Generate & Test（生成＆テスト）

シナリオ生成：LLMが現在のボットプロンプトとデータセットのサンプルを分析し、14〜20件のテストシナリオを自動作成します。

✅ ハッピーパス — KBにある質問。ボットは正確に答えられるべき。

🔍 部分一致 — KBと部分的に重複する質問。完全性をテスト。

🚫 スコープ外 — KBにない質問。ボットはハルシネーションしてはいけない。

🔄 言い換え — 同じKB質問を別の表現で聞く。意味的ロバスト性をテスト。

🧩 マルチステップ — 複数のKBエントリを組み合わせる必要がある質問。

↩ オフトピック — ボットのドメインと無関係。適切に対応できるかテスト。

⚡ エッジケース — 曖昧・非常に短い・言語が異なる入力など。

サンプル行数：シナリオ生成時にLLMに見せるデータセット行数を制御します（8〜20）。多いほど多様なシナリオが生成されます。
Run All Tests：正確性・完全性・関連性・忠実性を使って全シナリオを評価します。各質問に対して最も関連性の高いKB行が自動取得されてコンテキストに使用されます。シナリオカードはリアルタイムで更新されます。
結果サマリー：全テスト完了後、テストタイプごとのパス率がピル形式で表示されます。ボットが苦手な領域（例：スコープ外・エッジケース）を一目で把握できます。
エクスポート：生成されたシナリオと結果をCSVまたはPDFでダウンロードできます。

RAG Evaluation

Upload a Q&A dataset → evaluate, chat, or generate & test using the bot prompt + knowledge base

Upload Dataset

Supported: CSV, XLSX, XLS, PDF (max 10 MB). Required columns: question + expected answer. Optional: context.

Drop file here or browse

CSV · XLSX · XLS · PDF

Map columns:

Question column *

Expected Answer column *

Context column (optional)

Row limit

Select Evaluation Metrics

Factual Correctness

Is the bot answer factually correct vs the expected answer?

Completeness

Does the answer cover ALL key points from the expected answer?

Answer Relevance

Does the bot answer directly address the question?

Faithfulness / No Hallucination

Does the bot avoid inventing facts not in the expected answer?

Conciseness

Is the answer appropriately concise without padding?

Context Utilization

Does the answer correctly use the provided context? (Needs context column)

Dataset Preview

Ready

Evaluation Summary

Top Recommendations

Per-Question Results

⚠ No dataset loaded. Upload a file in the Evaluate tab first to enable RAG context retrieval.

💬 Chat with RAG Context

Ask questions — the bot will retrieve the most relevant rows from your dataset as context.

Top-N:

💬

Ask a question about your knowledge base…

⚠ No dataset loaded. Upload a file in the Evaluate tab first.

🧪 Generate Test Scenarios from RAG + Prompt

The LLM studies your current bot prompt + knowledge base and generates targeted test scenarios covering happy paths, out-of-scope, edge cases, and more.

Sample rows:

Generating scenarios…

RAG Hallucination Check

Upload a Q&A dataset, run each question against the bot, and detect hallucinated claims with match scoring

Upload Q&A Dataset

CSV, XLSX, or PDF with question and expected answer columns (max 10 MB)

Map Columns

Question column

Expected answer column

Context column (optional)

Row limit:

0 / 0 0%

Avg Hallucination Score

—

Passed

Failed

Avg Matching %

—

Place a Call

Post-Call Analysis

Recent Calls

BaseCamp — Prompt Builder

Draft Prompt

Quality Evaluation

🇺🇸 English Prompt

🇯🇵 Japanese Prompt

Version History

Model & Call Settings

Prompt Evaluation

Custom Test Scenarios

Eval History

A/B Prompt Comparison

Conversation Simulator

Simulation Transcript

Simulation of Chat

Best Case

Worst Case

Neutral

How to Use

📞 Dial Tab

🤖 Chat Tab

💬 Prompts Tab

⚙️ Settings Tab

🧪 Eval Tab

🎭 Simulate Tab

🔬 RAG Evaluation Tab

📞 ダイヤルタブ

🤖 チャットタブ

💬 プロンプトタブ

⚙️ 設定タブ

🧪 Evalタブ

🎭 シミュレートタブ

🔬 RAG評価タブ

RAG Evaluation

Upload Dataset

Select Evaluation Metrics

Dataset Preview

Evaluation Summary

Per-Question Results

RAG Hallucination Check

Upload Q&A Dataset

Map Columns

Prompt Diff