
Fix inconsistent evaluation results #28

Merged · 11 commits · Apr 9, 2024
12 changes: 6 additions & 6 deletions README.md
@@ -2,14 +2,14 @@

Agent Evaluation is an LLM-powered framework for testing virtual agents.

## Documentation
## 📚 Documentation

To get started, please visit the full documentation [here](https://awslabs.github.io/agent-evaluation/). To contribute, please refer to [CONTRIBUTING.md](./CONTRIBUTING.md).

## Creators
## ✨ Contributors

[<img src="https://github.com/tonykchen.png" width="60px;"/><br /><sub><a href="https://github.com/tonykchen">tonykchen</a></sub>](https://github.com/tonykchen/)
Shout out to these awesome contributors:

[<img src="https://github.com/bobbywlindsey.png" width="60px;"/><br /><sub><a href="https://github.com/bobbywlindsey">bobbywlindsey</a></sub>](https://github.com/bobbywlindsey/)

[<img src="https://github.com/sharonxiaohanli.png" width="60px;"/><br /><sub><a href="https://github.com/sharonxiaohanli">sharonxiaohanli</a></sub>](https://github.com/sharonxiaohanli/)
<a href="https://awslabs/agent-evaluation/graphs/contributors">
<img src="https://contrib.rocks/image?repo=awslabs/agent-evaluation" />
</a>
4 changes: 2 additions & 2 deletions docs/evaluators/bedrock/claude.md
@@ -7,8 +7,8 @@ This evaluator is implemented using [Anthropic's Claude](https://www.anthropic.c
The principal must have [InvokeModel](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html) access for the following models:

- Anthropic Claude 3 Sonnet (`anthropic.claude-3-sonnet-20240229-v1:0`)
- Anthropic Claude 3 Haiku (`anthropic.claude-3-haiku-20240307-v1:0`)
- Anthropic Claude (`anthropic.claude-v2:1`)
- Anthropic Claude Instant (`anthropic.claude-instant-v1`)

## Configurations

@@ -23,8 +23,8 @@ evaluator:
The Claude model to use as the Evaluator. This should be one of the following:

- `claude-sonnet`
- `claude-haiku`
- `claude`
- `claude-instant`

If unspecified, `claude-sonnet` will be used.
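
For example, a minimal `agenteval.yml` evaluator section that selects Claude 3 Haiku might look like the sketch below (the `type` and `model` keys follow the user guide's template; other settings are omitted):

```yaml
evaluator:
  type: bedrock-claude
  # model may be claude-sonnet (default), claude-haiku, or claude
  model: claude-haiku
```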

24 changes: 23 additions & 1 deletion docs/evaluators/index.md
@@ -1,6 +1,28 @@
# Evaluators

An Evaluator is an agent that evaluates a [Target](../targets/index.md) on a test.
An Evaluator is an agent that evaluates a [Target](../targets/index.md) on a test. The diagram below depicts the workflow used during evaluation.

``` mermaid
graph TD
A((Start)) --> B{Initial<br>prompt?}
B -->|yes| C(Invoke agent)
B -->|no| D(Generate initial prompt)
D --> C
C --> E(Get test status)
E --> F{All steps<br>attempted?}
F --> |yes| G(Evaluate conversation)
F --> |no| H{Max turns<br>reached?}
H --> |yes| I(Fail)
style I stroke:#f00
H --> |no| J(Generate user response)
J --> C
G --> K{All expected<br>results<br>observed?}
K --> |yes| L(Pass)
style L stroke:#0f0
K --> |no| I(Fail)
I --> M((End))
L --> M
```

---

35 changes: 23 additions & 12 deletions docs/user_guide.md
@@ -11,7 +11,6 @@ This will create an `agenteval.yml` file in the current directory.
```yaml
evaluator:
type: bedrock-claude
model: claude-sonnet
target:
type: bedrock-agent
bedrock_agent_id: null
@@ -64,40 +63,52 @@ tests:
- The agent returns a list of open claims.
```

If your test cases are complex, consider breaking them down into multiple `steps`, `expected_results`, and/or `tests`.
If your test case is complex, consider breaking it down into multiple, smaller `tests`.
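
For instance, a sketch reusing scenarios from this guide (the test names are illustrative), where two focused tests replace one broad test:

```yaml
tests:
# test names below are hypothetical examples
- name: ListOpenClaims
  steps:
  - Ask the agent which claims are open.
  expected_results:
  - The agent returns a list of open claims.
- name: GetClaimsWithMissingDocuments
  steps:
  - Ask the agent which claims still have missing documents.
  expected_results:
  - The agent returns claim-003 and claim-004
```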

### Multi-turn conversations

To test multiple user-agent interactions, you can provide multiple `steps` to orchestrate the interaction.

```yaml
tests:
- name: GetOpenClaimWithDetails
- name: GetOpenClaimsWithDetails
steps:
- Ask the agent which claims are open.
- Once the agent responds with the list of open claims, ask for the details
on claim-006.
- Ask the agent for details on claim-006.
expected_results:
- The agent returns a list of open claims.
- The agent returns the details on claim-006.
```

The maximum number of turns allowed for a conversation is configured using the `max_turns` parameter for the test (defaults to `2` when not specified).
If the number of turns in the conversation reaches the `max_turns` limit, then the test will fail.
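
For example, a sketch of the multi-turn test above with the limit raised to four turns (the value `4` is illustrative):

```yaml
tests:
- name: GetOpenClaimsWithDetails
  # max_turns is illustrative; defaults to 2 when omitted
  max_turns: 4
  steps:
  - Ask the agent which claims are open.
  - Ask the agent for details on claim-006.
  expected_results:
  - The agent returns a list of open claims.
  - The agent returns the details on claim-006.
```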

### Specify the first user message
### Providing data

By default, the first user message in the test is automatically generated based on the list of `steps`. To override this message, you can specify the `initial_prompt`.
You can test an agent's ability to prompt the user for data by including that data within the step. For example:

```yaml
tests:
- name: GetOpenClaimWithDetails
- name: GetAutoOpenClaims
steps:
- Ask the agent which claims are open.
- Once the agent responds with the list of open claims, ask for the details
on claim-006.
When the agent asks for the claim type, respond with "Auto".
expected_results:
- The agent returns the details on claim-006.
initial_prompt: Can you let me know which claims are open?
- The agent returns claim-001 and claim-002
```

### Specify the first user message

By default, the first user message in the test is automatically generated based on the first step. To override this message, you can specify the `initial_prompt`.

```yaml
tests:
- name: GetClaimsWithMissingDocuments
steps:
- Ask agent which claims still have missing documents.
initial_prompt: Can you let me know which claims still have missing documents?
expected_results:
- The agent returns claim-003 and claim-004
```

## Evaluation hooks
6 changes: 5 additions & 1 deletion mkdocs.yml
@@ -50,7 +50,11 @@ repo_url: https://github.com/awslabs/agent-evaluation
markdown_extensions:
- admonition
- pymdownx.details
- pymdownx.superfences
- pymdownx.superfences:
custom_fences:
- name: mermaid
class: mermaid
format: !!python/name:pymdownx.superfences.fence_code_format
- pymdownx.snippets
- attr_list
- md_in_html
1 change: 0 additions & 1 deletion src/agenteval/cli.py
@@ -73,7 +73,6 @@ def run(
num_threads: Optional[int],
work_dir: Optional[str],
):

try:
plan = Plan.load(plan_dir)
if work_dir:
118 changes: 74 additions & 44 deletions src/agenteval/evaluators/aws/bedrock/claude/claude_evaluator.py
@@ -14,25 +14,49 @@
_PROMPT_TEMPLATE_ROOT = "evaluators/claude"
_SYSTEM_PROMPT_DIR = "system"
_PROMPT_TEMPLATE_NAMES = [
"start_conversation",
"user_response",
"task_status",
"evaluate",
"generate_initial_prompt",
"generate_user_response",
"generate_test_status",
"generate_evaluation",
]

# StrEnum was added in Python 3.11; fall back to a str-based Enum on older versions
try:
from enum import StrEnum
except ImportError:
from enum import Enum

_TASK_STATUS_COMPLETED_CATEGORY = "A"
_TASK_STATUS_NOT_COMPLETED_CATEGORY = "B"
_TASK_STATUS_UNABLE_TO_COMPLETE_CATEGORY = "C"
_EVAL_ALL_EXPECTED_RESULT_OBSERVED_CATEGORY = "A"
_EVAL_NOT_ALL_EXPECTED_RESULT_OBSERVED_CATEGORY = "B"
class StrEnum(str, Enum):
pass


class TestStatusCategories(StrEnum):
ALL_STEPS_ATTEMPTED = "A"
NOT_ALL_STEPS_ATTEMPTED = "B"


class EvaluationCategories(StrEnum):
ALL_EXPECTED_RESULTS_OBSERVED = "A"
NOT_ALL_EXPECTED_RESULTS_OBSERVED = "B"


class Results(StrEnum):
MAX_TURNS_REACHED = "Maximum turns reached."
ALL_EXPECTED_RESULTS_OBSERVED = (
"All of the expected results can be observed in the conversation."
)
NOT_ALL_EXPECTED_RESULTS_OBSERVED = (
"Not all of the expected results can be observed in the conversation."
)


class ClaudeEvaluator(BedrockEvaluator):
def __init__(
self,
model: Literal[
"claude-sonnet", "claude", "claude-instant"
"claude-sonnet",
"claude-haiku",
"claude",
] = model_configs.DEFAULT_MODEL,
**kwargs,
):
@@ -86,10 +110,10 @@ def _generate(
return output, reasoning

def _generate_initial_prompt(self) -> str:
system_prompt = self._prompt_template_map["start_conversation"][
system_prompt = self._prompt_template_map["generate_initial_prompt"][
"system"
].render()
prompt = self._prompt_template_map["start_conversation"]["prompt"].render(
prompt = self._prompt_template_map["generate_initial_prompt"]["prompt"].render(
step=self.test.steps[0]
)

@@ -107,35 +131,39 @@ def _generate_initial_prompt(self) -> str:
)
return initial_prompt

def _generate_task_status(self) -> str:
system_prompt = self._prompt_template_map["task_status"]["system"].render()
prompt = self._prompt_template_map["task_status"]["prompt"].render(
def _generate_test_status(self) -> str:
system_prompt = self._prompt_template_map["generate_test_status"][
"system"
].render()
prompt = self._prompt_template_map["generate_test_status"]["prompt"].render(
steps=self.test.steps, conversation=self.conversation
)
task_status, reasoning = self._generate(
test_status, reasoning = self._generate(
system_prompt=system_prompt,
prompt=prompt,
output_xml_element="task_status",
output_xml_element="category",
)
self.trace.add_step(
system_prompt=system_prompt,
prompt=prompt,
task_status=task_status,
test_status=test_status,
reasoning=reasoning,
)
return task_status
return test_status

def _generate_evaluation(self) -> str:
system_prompt = self._prompt_template_map["evaluate"]["system"].render()
prompt = self._prompt_template_map["evaluate"]["prompt"].render(
def _generate_evaluation(self) -> tuple[str, str]:
system_prompt = self._prompt_template_map["generate_evaluation"][
"system"
].render()
prompt = self._prompt_template_map["generate_evaluation"]["prompt"].render(
expected_results=self.test.expected_results,
conversation=self.conversation,
)

evaluation, reasoning = self._generate(
system_prompt=system_prompt,
prompt=prompt,
output_xml_element="eval",
output_xml_element="category",
)
self.trace.add_step(
system_prompt=system_prompt,
@@ -147,8 +175,10 @@ def _generate_evaluation(self) -> str:
return evaluation, reasoning

def _generate_user_response(self) -> str:
system_prompt = self._prompt_template_map["user_response"]["system"].render()
prompt = self._prompt_template_map["user_response"]["prompt"].render(
system_prompt = self._prompt_template_map["generate_user_response"][
"system"
].render()
prompt = self._prompt_template_map["generate_user_response"]["prompt"].render(
steps=self.test.steps, conversation=self.conversation
)

@@ -174,12 +204,12 @@ def _invoke_target(self, user_input) -> str:

def evaluate(self) -> TestResult:
success = False
eval_reasoning = ""
result = "Max turns reached."
result = Results.MAX_TURNS_REACHED.value
reasoning = ""

while self.conversation.turns < self.test.max_turns:
if self.conversation.turns == 0:
# start convo
# start conversation
if self.test.initial_prompt:
user_input = self.test.initial_prompt
else:
@@ -188,29 +218,29 @@
# generate next user response
user_input = self._generate_user_response()

# add turn to the conversation
self.conversation.add_turn(user_input, self._invoke_target(user_input))

# get task status
task_status_category = self._generate_task_status()
if task_status_category in (
_TASK_STATUS_COMPLETED_CATEGORY,
_TASK_STATUS_UNABLE_TO_COMPLETE_CATEGORY,
):
# evaluate
eval_category, eval_reasoning = self._generate_evaluation()
if eval_category == _EVAL_ALL_EXPECTED_RESULT_OBSERVED_CATEGORY:
success = True
result = "All expected results observed."
elif task_status_category == _TASK_STATUS_UNABLE_TO_COMPLETE_CATEGORY:
result = "Agent was unable to complete a step."
# get test status
test_status = self._generate_test_status()
if test_status == TestStatusCategories.ALL_STEPS_ATTEMPTED:
# evaluate conversation
eval_category, reasoning = self._generate_evaluation()
if (
eval_category
== EvaluationCategories.NOT_ALL_EXPECTED_RESULTS_OBSERVED.value # noqa: W503
):
result = Results.NOT_ALL_EXPECTED_RESULTS_OBSERVED.value
else:
result = "Not all of the expected results were observed."
# break since task has been completed
result = Results.ALL_EXPECTED_RESULTS_OBSERVED.value
success = True

break

return TestResult(
test_name=self.test.name,
success=success,
result=result,
reasoning=eval_reasoning,
reasoning=reasoning,
conversation_handler=self.conversation,
)
@@ -1,7 +1,7 @@
CLAUDE_MODEL_ID_MAP = {
"claude": "anthropic.claude-v2:1",
"claude-instant": "anthropic.claude-instant-v1",
"claude-sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
"claude-haiku": "anthropic.claude-3-haiku-20240307-v1:0",
}
DEFAULT_MODEL = "claude-sonnet"

@@ -1,11 +1,12 @@
You are a quality assurance engineer that is testing a virtual agent.
You are a quality assurance engineer evaluating a conversation between a USER and an AGENT.

Your job is analyze the conversation in the <conversation> tags a list of expected results in <expected_results> tags.
Your job is to analyze the conversation in <conversation> tags and a list of expected results
in <expected_results> tags.

You will classify the conversation into one of the following categories:

- A: All of the expected results can be observed in the conversation.
- B: Not all of the expected results can be observed in the conversation.

Please think hard about the response in <thinking> tags before providing only the category letter
within <eval> tags.
within <category> tags.