
Fix inconsistent evaluation results #28

Merged · 11 commits · Apr 9, 2024
12 changes: 6 additions & 6 deletions README.md
@@ -2,14 +2,14 @@

Agent Evaluation is an LLM-powered framework for testing virtual agents.

## Documentation
## 📚 Documentation

To get started, please visit the full documentation [here](https://awslabs.github.io/agent-evaluation/). To contribute, please refer to [CONTRIBUTING.md](./CONTRIBUTING.md).

## Creators
## ✨ Contributors

[<img src="https://github.com/tonykchen.png" width="60px;"/><br /><sub><a href="https://github.com/tonykchen">tonykchen</a></sub>](https://github.com/tonykchen/)
Shout out to these awesome contributors:

[<img src="https://github.com/bobbywlindsey.png" width="60px;"/><br /><sub><a href="https://github.com/bobbywlindsey">bobbywlindsey</a></sub>](https://github.com/bobbywlindsey/)

[<img src="https://github.com/sharonxiaohanli.png" width="60px;"/><br /><sub><a href="https://github.com/sharonxiaohanli">sharonxiaohanli</a></sub>](https://github.com/sharonxiaohanli/)
<a href="https://awslabs/agent-evaluation/graphs/contributors">
<img src="https://contrib.rocks/image?repo=awslabs/agent-evaluation" />
</a>
4 changes: 2 additions & 2 deletions docs/evaluators/bedrock/claude.md
@@ -7,8 +7,8 @@ This evaluator is implemented using [Anthropic's Claude](https://www.anthropic.c
The principal must have [InvokeModel](https://docs.aws.amazon.com/bedrock/latest/APIReference/API_runtime_InvokeModel.html) access for the following models:

- Anthropic Claude 3 Sonnet (`anthropic.claude-3-sonnet-20240229-v1:0`)
- Anthropic Claude 3 Haiku (`anthropic.claude-3-haiku-20240307-v1:0`)
- Anthropic Claude (`anthropic.claude-v2:1`)
- Anthropic Claude Instant (`anthropic.claude-instant-v1`)

## Configurations

@@ -23,8 +23,8 @@ evaluator:
The Claude model to use as the Evaluator. This should be one of the following:

- `claude-sonnet`
- `claude-haiku`
- `claude`
- `claude-instant`

If unspecified, `claude-sonnet` will be used.
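
For example, a minimal `agenteval.yml` evaluator section that selects Claude 3 Haiku might look like the sketch below (the `type` and `model` keys follow the user guide's template; other settings are omitted):

```yaml
evaluator:
  type: bedrock-claude
  # model may be claude-sonnet (default), claude-haiku, or claude
  model: claude-haiku
```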

24 changes: 23 additions & 1 deletion docs/evaluators/index.md
@@ -1,6 +1,28 @@
# Evaluators

An Evaluator is an agent that evaluates a [Target](../targets/index.md) on a test.
An Evaluator is an agent that evaluates a [Target](../targets/index.md) on a test. The diagram below depicts the workflow used during evaluation.

``` mermaid
graph TD
A((Start)) --> B{Initial<br>prompt?}
B -->|yes| C(Invoke agent)
B -->|no| D(Generate initial prompt)
D --> C
C --> E(Get test status)
E --> F{All steps<br>attempted?}
F --> |yes| G(Evaluate conversation)
F --> |no| H{Max turns<br>reached?}
H --> |yes| I(Fail)
style I stroke:#f00
H --> |no| J(Generate user response)
J --> C
G --> K{All expected<br>results<br>observed?}
K --> |yes| L(Pass)
style L stroke:#0f0
K --> |no| I(Fail)
I --> M((End))
L --> M
```

---

35 changes: 23 additions & 12 deletions docs/user_guide.md
@@ -11,7 +11,6 @@ This will create an `agenteval.yml` file in the current directory.
```yaml
evaluator:
type: bedrock-claude
model: claude-sonnet
target:
type: bedrock-agent
bedrock_agent_id: null
@@ -64,40 +63,52 @@ tests:
- The agent returns a list of open claims.
```

If your test cases are complex, consider breaking them down into multiple `steps`, `expected_results`, and/or `tests`.
If your test case is complex, consider breaking it down into multiple, smaller `tests`.
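
For instance, a sketch reusing scenarios from this guide (the test names are illustrative), where two focused tests replace one broad test:

```yaml
tests:
# test names below are hypothetical examples
- name: ListOpenClaims
  steps:
  - Ask the agent which claims are open.
  expected_results:
  - The agent returns a list of open claims.
- name: GetClaimsWithMissingDocuments
  steps:
  - Ask the agent which claims still have missing documents.
  expected_results:
  - The agent returns claim-003 and claim-004
```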

### Multi-turn conversations

To test multiple user-agent interactions, you can provide multiple `steps` to orchestrate the interaction.

```yaml
tests:
- name: GetOpenClaimWithDetails
- name: GetOpenClaimsWithDetails
steps:
- Ask the agent which claims are open.
- Once the agent responds with the list of open claims, ask for the details
on claim-006.
- Ask the agent for details on claim-006.
expected_results:
- The agent returns a list of open claims.
- The agent returns the details on claim-006.
```

The maximum number of turns allowed for a conversation is configured using the `max_turns` parameter for the test (defaults to `2` when not specified).
If the number of turns in the conversation reaches the `max_turns` limit, then the test will fail.
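
For example, a sketch of the multi-turn test above with the limit raised to four turns (the value `4` is illustrative):

```yaml
tests:
- name: GetOpenClaimsWithDetails
  # max_turns is illustrative; defaults to 2 when omitted
  max_turns: 4
  steps:
  - Ask the agent which claims are open.
  - Ask the agent for details on claim-006.
  expected_results:
  - The agent returns a list of open claims.
  - The agent returns the details on claim-006.
```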

### Specify the first user message
### Providing data

By default, the first user message in the test is automatically generated based on the list of `steps`. To override this message, you can specify the `initial_prompt`.
You can test an agent's ability to prompt the user for data by including that data within the step. For example:

```yaml
tests:
- name: GetOpenClaimWithDetails
- name: GetAutoOpenClaims
steps:
- Ask the agent which claims are open.
- Once the agent responds with the list of open claims, ask for the details
on claim-006.
When the agent asks for the claim type, respond with "Auto".
expected_results:
- The agent returns the details on claim-006.
initial_prompt: Can you let me know which claims are open?
- The agent returns claim-001 and claim-002
```

### Specify the first user message

By default, the first user message in the test is automatically generated based on the first step. To override this message, you can specify the `initial_prompt`.

```yaml
tests:
- name: GetClaimsWithMissingDocuments
steps:
- Ask agent which claims still have missing documents.
initial_prompt: Can you let me know which claims still have missing documents?
expected_results:
- The agent returns claim-003 and claim-004
```

## Evaluation hooks
6 changes: 5 additions & 1 deletion mkdocs.yml
@@ -50,7 +50,11 @@ repo_url: https://github.com/awslabs/agent-evaluation
markdown_extensions:
- admonition
- pymdownx.details
- pymdownx.superfences
- pymdownx.superfences:
custom_fences:
- name: mermaid
class: mermaid
format: !!python/name:pymdownx.superfences.fence_code_format
- pymdownx.snippets
- attr_list
- md_in_html
1 change: 0 additions & 1 deletion src/agenteval/cli.py
@@ -73,7 +73,6 @@ def run(
num_threads: Optional[int],
work_dir: Optional[str],
):

try:
plan = Plan.load(plan_dir)
if work_dir:
118 changes: 74 additions & 44 deletions src/agenteval/evaluators/aws/bedrock/claude/claude_evaluator.py
@@ -14,25 +14,49 @@
_PROMPT_TEMPLATE_ROOT = "evaluators/claude"
_SYSTEM_PROMPT_DIR = "system"
_PROMPT_TEMPLATE_NAMES = [
"start_conversation",
"user_response",
"task_status",
"evaluate",
"generate_initial_prompt",
"generate_user_response",
"generate_test_status",
"generate_evaluation",
]

# StrEnum was added in Python 3.11; fall back to a str-based Enum on older versions
try:
from enum import StrEnum
except ImportError:
from enum import Enum

_TASK_STATUS_COMPLETED_CATEGORY = "A"
_TASK_STATUS_NOT_COMPLETED_CATEGORY = "B"
_TASK_STATUS_UNABLE_TO_COMPLETE_CATEGORY = "C"
_EVAL_ALL_EXPECTED_RESULT_OBSERVED_CATEGORY = "A"
_EVAL_NOT_ALL_EXPECTED_RESULT_OBSERVED_CATEGORY = "B"
class StrEnum(str, Enum):
pass


class TestStatusCategories(StrEnum):
ALL_STEPS_ATTEMPTED = "A"
NOT_ALL_STEPS_ATTEMPTED = "B"


class EvaluationCategories(StrEnum):
ALL_EXPECTED_RESULTS_OBSERVED = "A"
NOT_ALL_EXPECTED_RESULTS_OBSERVED = "B"


class Results(StrEnum):
MAX_TURNS_REACHED = "Maximum turns reached."
ALL_EXPECTED_RESULTS_OBSERVED = (
"All of the expected results can be observed in the conversation."
)
NOT_ALL_EXPECTED_RESULTS_OBSERVED = (
"Not all of the expected results can be observed in the conversation."
)


class ClaudeEvaluator(BedrockEvaluator):
def __init__(
self,
model: Literal[
"claude-sonnet", "claude", "claude-instant"
"claude-sonnet",
"claude-haiku",
"claude",
] = model_configs.DEFAULT_MODEL,
**kwargs,
):
@@ -86,10 +110,10 @@ def _generate(
return output, reasoning

def _generate_initial_prompt(self) -> str:
system_prompt = self._prompt_template_map["start_conversation"][
system_prompt = self._prompt_template_map["generate_initial_prompt"][
"system"
].render()
prompt = self._prompt_template_map["start_conversation"]["prompt"].render(
prompt = self._prompt_template_map["generate_initial_prompt"]["prompt"].render(
step=self.test.steps[0]
)

@@ -107,35 +131,39 @@ def _generate_initial_prompt(self) -> str:
)
return initial_prompt

def _generate_task_status(self) -> str:
system_prompt = self._prompt_template_map["task_status"]["system"].render()
prompt = self._prompt_template_map["task_status"]["prompt"].render(
def _generate_test_status(self) -> str:
system_prompt = self._prompt_template_map["generate_test_status"][
"system"
].render()
prompt = self._prompt_template_map["generate_test_status"]["prompt"].render(
steps=self.test.steps, conversation=self.conversation
)
task_status, reasoning = self._generate(
test_status, reasoning = self._generate(
system_prompt=system_prompt,
prompt=prompt,
output_xml_element="task_status",
output_xml_element="category",
)
self.trace.add_step(
system_prompt=system_prompt,
prompt=prompt,
task_status=task_status,
test_status=test_status,
reasoning=reasoning,
)
return task_status
return test_status

def _generate_evaluation(self) -> str:
system_prompt = self._prompt_template_map["evaluate"]["system"].render()
prompt = self._prompt_template_map["evaluate"]["prompt"].render(
def _generate_evaluation(self) -> tuple[str, str]:
system_prompt = self._prompt_template_map["generate_evaluation"][
"system"
].render()
prompt = self._prompt_template_map["generate_evaluation"]["prompt"].render(
expected_results=self.test.expected_results,
conversation=self.conversation,
)

evaluation, reasoning = self._generate(
system_prompt=system_prompt,
prompt=prompt,
output_xml_element="eval",
output_xml_element="category",
)
self.trace.add_step(
system_prompt=system_prompt,
@@ -147,8 +175,10 @@ def _generate_evaluation(self) -> str:
return evaluation, reasoning

def _generate_user_response(self) -> str:
system_prompt = self._prompt_template_map["user_response"]["system"].render()
prompt = self._prompt_template_map["user_response"]["prompt"].render(
system_prompt = self._prompt_template_map["generate_user_response"][
"system"
].render()
prompt = self._prompt_template_map["generate_user_response"]["prompt"].render(
steps=self.test.steps, conversation=self.conversation
)

@@ -174,12 +204,12 @@ def _invoke_target(self, user_input) -> str:

def evaluate(self) -> TestResult:
success = False
eval_reasoning = ""
result = "Max turns reached."
result = Results.MAX_TURNS_REACHED.value
reasoning = ""

while self.conversation.turns < self.test.max_turns:
if self.conversation.turns == 0:
# start convo
# start conversation
if self.test.initial_prompt:
user_input = self.test.initial_prompt
else:
@@ -188,29 +218,29 @@
# generate next user response
user_input = self._generate_user_response()

# add turn to the conversation
self.conversation.add_turn(user_input, self._invoke_target(user_input))

# get task status
task_status_category = self._generate_task_status()
if task_status_category in (
_TASK_STATUS_COMPLETED_CATEGORY,
_TASK_STATUS_UNABLE_TO_COMPLETE_CATEGORY,
):
# evaluate
eval_category, eval_reasoning = self._generate_evaluation()
if eval_category == _EVAL_ALL_EXPECTED_RESULT_OBSERVED_CATEGORY:
success = True
result = "All expected results observed."
elif task_status_category == _TASK_STATUS_UNABLE_TO_COMPLETE_CATEGORY:
result = "Agent was unable to complete a step."
# get test status
test_status = self._generate_test_status()
if test_status == TestStatusCategories.ALL_STEPS_ATTEMPTED:
# evaluate conversation
eval_category, reasoning = self._generate_evaluation()
if (
eval_category
== EvaluationCategories.NOT_ALL_EXPECTED_RESULTS_OBSERVED.value # noqa: W503
):
result = Results.NOT_ALL_EXPECTED_RESULTS_OBSERVED.value
else:
result = "Not all of the expected results were observed."
# break since task has been completed
result = Results.ALL_EXPECTED_RESULTS_OBSERVED.value
success = True

break

return TestResult(
test_name=self.test.name,
success=success,
result=result,
reasoning=eval_reasoning,
reasoning=reasoning,
conversation_handler=self.conversation,
)
@@ -1,7 +1,7 @@
CLAUDE_MODEL_ID_MAP = {
"claude": "anthropic.claude-v2:1",
"claude-instant": "anthropic.claude-instant-v1",
"claude-sonnet": "anthropic.claude-3-sonnet-20240229-v1:0",
"claude-haiku": "anthropic.claude-3-haiku-20240307-v1:0",
}
DEFAULT_MODEL = "claude-sonnet"

@@ -1,11 +1,12 @@
You are a quality assurance engineer that is testing a virtual agent.
You are a quality assurance engineer evaluating a conversation between a USER and an AGENT.

Your job is analyze the conversation in the <conversation> tags a list of expected results in <expected_results> tags.
Your job is to analyze the conversation in <conversation> tags and a list of expected results
in <expected_results> tags.

You will classify the conversation into one of the following categories:

- A: All of the expected results can be observed in the conversation.
- B: Not all of the expected results can be observed in the conversation.

Please think hard about the response in <thinking> tags before providing only the category letter
within <eval> tags.
within <category> tags.