Skip to content

Conversation

@misrasaurabh1
Copy link
Contributor

@misrasaurabh1 misrasaurabh1 commented Jun 3, 2025

User description

Each test invocation id now has its precise associated stdout which can be accurately compared across runs
This solves the problem with test fixtures printing non-deterministic stuff on stdout.


PR Type

Enhancement, Tests


Description

  • Introduce test_stdout_tag for distinct stdout markers

  • Use new $######...######$ tag format

  • Update regex and parsing logic to extract output

  • Adjust tests to expect new stdout tags


Changes walkthrough 📝

Relevant files
Enhancement
instrument_existing_tests.py
Add stdout tagging and print enhancements                               

codeflash/code_utils/instrument_existing_tests.py

  • Add AST assignment for test_stdout_tag variable
  • Switch print statements to $###### tag format
  • Simplify printing logic for performance mode
  • +51/-74 
    parse_test_output.py
    Enhance test output parsing logic                                               

    codeflash/verification/parse_test_output.py

  • Split regex into start/end match patterns
  • Implement pairing of begin/end match segments
  • Extract runtime and iteration_id correctly
  • +35/-21 
    Tests
    test_instrument_all_and_run.py
    Update instrument_all_and_run tests                                           

    tests/test_instrument_all_and_run.py

  • Update expected prints to use $###### tags
  • Adjust multi-line string assertions for new tags
  • +23/-45 
    test_instrument_tests.py
    Adjust instrumentation test templates                                       

    tests/test_instrument_tests.py

  • Modify instrumentation strings to define test_stdout_tag
  • Replace print calls with new $###### tag format
  • +172/-82
    test_instrumentation_run_results_aiservice.py
    Revise aiservice instrumentation stdout tags                         

    tests/test_instrumentation_run_results_aiservice.py

  • Revise behavior logging code to define test_stdout_tag
  • Replace prints to use $###### tag format
  • Remove redundant imports
  • +35/-24 
    test_test_runner.py
    Fix spacing in test_runner assertion                                         

    tests/test_test_runner.py

    • Fix equality assertion spacing
    +1/-3     

    Need help?
  • Type /help how to ... in the comments thread for any questions about PR-Agent usage.
  • Check out the documentation for more information.
  • @github-actions
    Copy link

    github-actions bot commented Jun 3, 2025

    PR Reviewer Guide 🔍

    (Review updated until commit 1070815)

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 No security concerns identified
    ⚡ Recommended focus areas for review

    Regex robustness

    New regex for parsing start/end tags may not correctly handle edge cases like colons in names or unexpected whitespace; verify patterns against valid and invalid tag formats.

    matches_re_start = re.compile(r"!\$######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######\$!\n")
    matches_re_end = re.compile(r"!######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######!")
    AST instrumentation fragility

    Hard-coded lineno offsets in AST node insertion could break if wrapper function structure changes; consider using relative positions or validating offsets dynamically.

    ),
    *(
        [
            ast.Assign(
                targets=[ast.Name(id="test_stdout_tag", ctx=ast.Store())],
                value=ast.JoinedStr(
                    values=[
                        ast.FormattedValue(value=ast.Name(id="test_module_name", ctx=ast.Load()), conversion=-1),
                        ast.Constant(value=":"),
                        ast.FormattedValue(
                            value=ast.IfExp(
                                test=ast.Name(id="test_class_name", ctx=ast.Load()),
                                body=ast.BinOp(
                                    left=ast.Name(id="test_class_name", ctx=ast.Load()),
                                    op=ast.Add(),
                                    right=ast.Constant(value="."),
                                ),
                                orelse=ast.Constant(value=""),
                            ),
                            conversion=-1,
                        ),
                        ast.FormattedValue(value=ast.Name(id="test_name", ctx=ast.Load()), conversion=-1),
                        ast.Constant(value=":"),
                        ast.FormattedValue(value=ast.Name(id="function_name", ctx=ast.Load()), conversion=-1),
                        ast.Constant(value=":"),
                        ast.FormattedValue(value=ast.Name(id="loop_index", ctx=ast.Load()), conversion=-1),
                        ast.Constant(value=":"),
                        ast.FormattedValue(value=ast.Name(id="invocation_id", ctx=ast.Load()), conversion=-1),
                    ]
                ),
                lineno=lineno + 9,
            ),
    Performance overhead

    Printing detailed stdout tags for each invocation adds extra I/O; validate that performance impact is acceptable for large test suites.

        codeflash_wrap.index[test_id] = 0
    codeflash_test_index = codeflash_wrap.index[test_id]
    invocation_id = f'{{line_id}}_{{codeflash_test_index}}'
    test_stdout_tag = f"{{test_module_name}}:{{(test_class_name + '.' if test_class_name else '')}}{{test_name}}:{{function_name}}:{{loop_index}}:{{invocation_id}}"
    print(f"!$######{{test_stdout_tag}}######$!")

    @github-actions
    Copy link

    github-actions bot commented Jun 3, 2025

    PR Code Suggestions ✨

    Latest suggestions up to 1070815
    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix missing end markers check

    Replace the redundant or not begin_matches check with a check against missing end
    markers so that invocations are skipped only when either start or end markers are
    absent.

    codeflash/verification/parse_test_output.py [277]

    -if not begin_matches or not begin_matches:
    +if not begin_matches or not end_matches:
    Suggestion importance[1-10]: 8

    __

    Why: The condition if not begin_matches or not begin_matches redundantly checks begin_matches twice; replacing the second check with or not end_matches correctly skips when end markers are missing, avoiding parsing errors.

    Medium
    General
    Use positional matching for end markers

    Avoid using the groups tuple as a dict key since multiple invocations may share the
    same tag and overwrite entries. Instead, collect end matches in a list and select
    the next appropriate end marker by comparing positions.

    codeflash/verification/parse_test_output.py [269-275]

    -end_matches = {}
    -for match in matches_re_end.finditer(sys_stdout):
    +end_matches = list(matches_re_end.finditer(sys_stdout))
    +
    +for match_index, match in enumerate(begin_matches):
         groups = match.groups()
    -    if len(groups[5].split(":")) > 1:
    -        iteration_id = groups[5].split(":")[0]
    -        groups = groups[:5] + (iteration_id,)
    -    end_matches[groups] = match
    +    # find the first end match after this start
    +    end_match = next((em for em in end_matches if em.start() > match.end()), None)
    +    iteration_id, runtime = groups[5], None
    +    if end_match:
    +        stdout = sys_stdout[match.end():end_match.start()]
    +        # parse iteration_id and runtime as before...
    Suggestion importance[1-10]: 7

    __

    Why: Using the tuple of groups as a dict key can lead to overwrites when tags repeat; collecting end matches in a list and matching by position ensures each start marker pairs with the correct end marker and preserves all entries.

    Medium
    Anchor end regex to newline

    Anchor the end‐marker regex to include the trailing newline (as done for the start
    regex) to avoid capturing unintended trailing characters.

    codeflash/verification/parse_test_output.py [40]

    -matches_re_end = re.compile(r"!######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######!")
    +matches_re_end = re.compile(r"!######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######!\n")
    Suggestion importance[1-10]: 5

    __

    Why: Adding a \n anchor to the end‐marker regex aligns it with the start pattern and prevents stray characters from being captured, improving match precision without major side effects.

    Low

    Previous suggestions

    Suggestions up to commit bc47662
    CategorySuggestion                                                                                                                                    Impact
    Possible issue
    Fix missing end-match check

    The condition mistakenly repeats not begin_matches instead of checking for missing
    end markers. Change the second part to not end_matches to detect incomplete capture
    pairs.

    codeflash/verification/parse_test_output.py [277]

    -if not begin_matches or not begin_matches:
    +if not begin_matches or not end_matches:
         test_results.add(
             FunctionTestInvocation(
                 ...,
                 stdout="",
             )
         )
    Suggestion importance[1-10]: 9

    __

    Why: The duplicated not begin_matches condition skips checking end_matches, breaking the pair matching logic; correcting it addresses a real bug in incomplete capture detection.

    High
    General
    Include newline in end regex

    The end-marker regex doesn’t account for the newline that print appends, causing
    slicing offsets to be off by one. Include \n after ######! so the match spans the
    full marker.

    codeflash/verification/parse_test_output.py [40]

    -matches_re_end = re.compile(r"!######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######!")
    +matches_re_end = re.compile(r"!######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######!\n")
    Suggestion importance[1-10]: 6

    __

    Why: Incorporating \n into matches_re_end ensures the trailing newline from print is matched, preventing off-by-one slicing errors in extracted stdout.

    Low

    @misrasaurabh1
    Copy link
    Contributor Author

    will fix some remaining tests soon. The changes should be ready to review now though. I want to deploy this today

    @misrasaurabh1 misrasaurabh1 requested a review from a team June 3, 2025 09:10
    @misrasaurabh1 misrasaurabh1 marked this pull request as ready for review June 3, 2025 09:10
    @github-actions
    Copy link

    github-actions bot commented Jun 3, 2025

    Persistent review updated to latest commit 1070815

    @openhands-ai
    Copy link

    openhands-ai bot commented Jun 3, 2025

    Looks like there are a few issues preventing this PR from being merged!

    • GitHub Actions are failing:
      • end-to-end-test

    If you'd like me to help, just leave a comment, like

    @OpenHands please fix the failing actions on PR #273

    Feel free to include any additional details that might help me get this PR into a better state.

    You can manage your notification settings

    @misrasaurabh1 misrasaurabh1 merged commit 1ee8c8a into main Jun 4, 2025
    15 of 17 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    3 participants