Conversation

@MilesHolland (Contributor)

Add new service based groundedness evaluator, which uses the rai service to determine groundedness.

This has a few extra adaptations compared to a normal rai service evaluator, including:

  • A new column-remapping function in the evaluate function to rename the evaluator's output label to a passing score when aggregated into a metric.
  • Custom column renaming within the evaluator itself, since the desired output column prefix (groundedness_pro) differs from the rai service name for this evaluation. It also needs some post-processing to convert the numeric groundedness score into a true/false label. (Note: this will be further adapted once the binarization PR is merged.)
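The post-processing described above amounts to thresholding the numeric score into a boolean label. A minimal standalone sketch (the prefix and threshold names mirror the PR, but this is an illustration, not the actual evaluator code):

```python
# Illustrative sketch of the score-to-label post-processing described above.
# Names mirror the PR, but this is not the actual evaluator implementation.
OUTPUT_PREFIX = "groundedness_pro"
PASSING_SCORE = 3  # placeholder threshold, pending the binarization PR


def to_label(service_result: dict) -> dict:
    """Rename the service's numeric score into a <prefix>_label boolean."""
    score = service_result["groundedness_score"]
    return {OUTPUT_PREFIX + "_label": score >= PASSING_SCORE}


print(to_label({"groundedness_score": 4}))  # {'groundedness_pro_label': True}
```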

@MilesHolland MilesHolland requested a review from a team as a code owner October 23, 2024 21:28
@github-actions github-actions bot added the Evaluation Issues related to the client library for Azure AI Evaluation label Oct 23, 2024
@azure-sdk (Collaborator)

API change check

APIView has identified API-level changes in this PR and created the following API reviews.

azure-ai-evaluation

@MilesHolland MilesHolland changed the title Eval/feature/groundedness pro Add groundedness pro eval Oct 24, 2024
@MilesHolland MilesHolland merged commit 578b16c into Azure:main Oct 25, 2024
21 checks passed
result = await super()._do_eval(eval_input)
real_result = {}
real_result[self._output_prefix + "_label"] = (
    result[EvaluationMetrics.GROUNDEDNESS + "_score"] >= self._passing_score
)
Member
@MilesHolland, why do we not return the binary output as part of the AACS API? Is it because it is not part of the service call?

    azure_ai_project,
    **kwargs,
):
    self._passing_score = 3  # TODO update once the binarization PR is merged
@changliu2 (Member) Oct 28, 2024

@posaninagendra, to reach parity with AACS groundedness, any detected ungrounded content sets the AACS binary output ungroundedDetected to True.

  1. Isn't ungroundedDetected part of the service call output?
  2. If not (meaning the SDK only receives ungroundedPercentage as the output), then to match the logic for ungroundedDetected, this self._passing_score should be 5, right?
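The reviewer's point can be illustrated with a small sketch. Assumption (not taken from the service docs): groundedness is scored 1-5, where 5 means no ungrounded content was found at all.

```python
# Sketch of the threshold discussion above. Assumption: groundedness is
# scored 1-5, with 5 meaning no ungrounded content detected at all.
def ungrounded_detected(groundedness_score: int, passing_score: int) -> bool:
    """Mirror AACS ungroundedDetected: True when the score fails the bar."""
    return groundedness_score < passing_score


# With passing_score = 3, a score of 4 (some ungrounded content) still passes:
print(ungrounded_detected(4, 3))  # False
# With passing_score = 5, any score below perfect flags ungrounded content,
# matching the "any ungrounded content detected" semantics:
print(ungrounded_detected(4, 5))  # True
```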

Member

For reference, here is the sample output on the AACS doc:

{
  "ungroundedDetected": true,
  "ungroundedPercentage": 1,
  "ungroundedDetails": [
    {
      "text": "12/hour.",
      "offset": { "utf8": 0, "utf16": 0, "codePoint": 0 },
      "length": { "utf8": 8, "utf16": 8, "codePoint": 8 },
      "reason": "None. The premise mentions a pay of \"10/hour\" but does not mention \"12/hour.\" It's neutral. "
    }
  ]
}
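For illustration, the sample payload above can be parsed with the standard library to pull out the binary signal and the percentage discussed in this thread (a sketch, not SDK code):

```python
import json

# The sample AACS response from the doc, as a JSON string.
sample = (
    '{ "ungroundedDetected": true, "ungroundedPercentage": 1, '
    '"ungroundedDetails": [ { "text": "12/hour.", '
    '"offset": { "utf8": 0, "utf16": 0, "codePoint": 0 }, '
    '"length": { "utf8": 8, "utf16": 8, "codePoint": 8 }, '
    '"reason": "None." } ] }'
)

result = json.loads(sample)
# The top-level boolean already carries the binary signal discussed above.
print(result["ungroundedDetected"])            # True
print(result["ungroundedPercentage"])          # 1
print(result["ungroundedDetails"][0]["text"])  # 12/hour.
```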

l0lawrence pushed a commit to l0lawrence/azure-sdk-for-python that referenced this pull request Feb 19, 2025
* Adding service based groundedness

* groundedness pro eval

* remove groundedness and fix unit tests

* run black

* change evaluate label

* Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py

Co-authored-by: Neehar Duvvuri <[email protected]>

* Update sdk/evaluation/azure-ai-evaluation/azure/ai/evaluation/_evaluators/_service_groundedness/_service_groundedness.py

Co-authored-by: Neehar Duvvuri <[email protected]>

* comments and CL

* re record tests

* black and pylint

* comments

* nits

* analysis

* re cast

* more mypy appeasement

---------

Co-authored-by: Ankit Singhal <[email protected]>
Co-authored-by: Neehar Duvvuri <[email protected]>
