chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341
     Merged
      
        
      
    
  
            91 commits
          
        
- f63e48a chore: LangChain based accuracy tests (himanshusinghs)
- 7efe7be chore: use vercel AI SDK instead of langchain (himanshusinghs)
- 6f7b99a chore: integrate capturing accuracy snapshots (himanshusinghs)
- add4204 chore: correct env names (himanshusinghs)
- f0c1d38 chore: more consolidated prompt tests (himanshusinghs)
- 8fe4942 chore: add a few more tests and some more models (himanshusinghs)
- d220f22 chore: add AzureOpenAI model in the model list (himanshusinghs)
- 1c58427 chore: use ListDatabasesTool response creator for tests (himanshusinghs)
- 5ce954e chore: use ListCollectionsTool response creators in tests (himanshusinghs)
- cfce256 chore: tests for collection-indexes tool (himanshusinghs)
- c3a0a72 modify prompt for list-collections prompt and log tools provided (himanshusinghs)
- c71ac44 chore: have mock generators return Promise of ToolResult as well (himanshusinghs)
- f6a8fcd chore: tests for collection-schema tool (himanshusinghs)
- ed0a6da chore: do not fail tests on dropped accuracy (himanshusinghs)
- c6da0b5 chore: added tests for find tool (himanshusinghs)
- 774640b chore: tests for insert-many tool (himanshusinghs)
- 6e894bc chore: tests for delete-many tool (himanshusinghs)
- 942bfc0 chore: add oepnai provider (himanshusinghs)
- 34bd4c2 chore: fixes accuracy scorer for position independent matching (himanshusinghs)
- 537fe2a chore: replace mock mcp client with real (mockable) mcp client (himanshusinghs)
- 0bd9167 chore: moved all existing tests to vercel mcp client (himanshusinghs)
- efefd9d chore: adds tests for the rest of the tools (himanshusinghs)
- 06422a7 chore: adds missed out tests for tools (himanshusinghs)
- 6039b1d chore: MongoDB based snapshot storage for accuracy runs (himanshusinghs)
- 8b39a1c chore: remove file based snapshot (himanshusinghs)
- ca49d40 wip: snapshot summary generator (himanshusinghs)
- 92413df chore: single entry point for running accuracy tests with different c… (himanshusinghs)
- 8c50ecf chore: reformat (himanshusinghs)
- 8c8a25b chore: lint fixes (himanshusinghs)
- ebe14d5 chore: simplified toolCallingAccuracy calculation (himanshusinghs)
- ad316f7 chore: account for types moved around (himanshusinghs)
- b34f6bc chore: adds accuracyRunStatus to snapshot entries (himanshusinghs)
- 815952d chore: add disk based accuracy storage for local runs (himanshusinghs)
- 5c99f85 chore: revert changes done to any of the src files (himanshusinghs)
- 0d6938a chore: handle test failures and appropriately mark them as failed in … (himanshusinghs)
- cbb137a chore: make snapshot storage independent of accuracyRunId and commitSHA (himanshusinghs)
- 9321563 chore: bail on first failure and add some explanation for update-accu… (himanshusinghs)
- f636c3f chore: refactor to make tests writing simpler and other QOL improveme… (himanshusinghs)
- ebcc19d chore: generate accuracy test summary post test (himanshusinghs)
- b1bf731 chore: add Github workflow to trigger test runs (himanshusinghs)
- 2e08208 chore: fix permissions issue (himanshusinghs)
- 509a23c chore: bring back packages post merge (himanshusinghs)
- be957b5 chore: update report generation to include comparison with baseline a… (himanshusinghs)
- bad3012 Update .github/workflows/accuracy-tests.yml (himanshusinghs)
- bc6e755 Update .github/workflows/accuracy-tests.yml (himanshusinghs)
- 3e094fa Update .github/workflows/accuracy-tests.yml (himanshusinghs)
- dca7217 Update .github/workflows/accuracy-tests.yml (himanshusinghs)
- 05c81c0 chore: secrets as per conventions (himanshusinghs)
- e47922f chore: updated how we store accuracy result (himanshusinghs)
- fe47c61 chore: move accuracy scripts inside accuracy (himanshusinghs)
- 727be10 chore: addresses more PR feedback (himanshusinghs)
- a0b9802 chore: use @ai-sdk/google (himanshusinghs)
- f4ddec2 chore: use npm script in ci (himanshusinghs)
- ea25ac5 chore: shift only when arguments are passed to the script (himanshusinghs)
- d50824d chore: azure url is on vars (himanshusinghs)
- 772a0a3 chore: use env vars for mongo namespace (himanshusinghs)
- 1c2295a chore: ensure the generated asset directory is present (himanshusinghs)
- a3ba9e0 chore: generate a markdown brief for PR comments (himanshusinghs)
- bf0e696 chore: use lockfile for updating local test results (himanshusinghs)
- e845e1a chore: make expectedToolCalls part of PromptResult (himanshusinghs)
- 4f41af5 chore: make omitted fields a const (himanshusinghs)
- e421125 chore: update formatRunStatus as per feedback (himanshusinghs)
- 2c2c428 chore: move saveModelResponseForPromptAtomic to atomic update pipeline (himanshusinghs)
- 34214ad chore: prefer exclusive reads for public interface (himanshusinghs)
- 508f906 chore: minor refactor of disk-storage (#370) (nirinchev)
- d3f1f73 chore: simplify getAccuracyResult (himanshusinghs)
- ea127bf chore: simplified the update pipeline and added tool call serialization (himanshusinghs)
- acba3b4 chore: use $literal instead of serializing the tool calls (himanshusinghs)
- f0d9c79 chore: don't import what is not used (himanshusinghs)
- 7798eb1 chore: should use $literal also for expectedToolCalls (himanshusinghs)
- f303bb4 chore: should recreate comment and hide previous one (himanshusinghs)
- eb24505 chore: rebase fixes and move to vitest (himanshusinghs)
- 8db0e6f chore: run unit and integration for test script (himanshusinghs)
- 83157d3 chore: PR feedback (himanshusinghs)
- 6c57c38 chore: add return type annotation for accuracy testing client (himanshusinghs)
- ba37196 chore: update test file names per naming convention (himanshusinghs)
- c2a51fd chore: update sdk file names per naming convention (himanshusinghs)
- a66553b chore: update accuracy file name per convention (himanshusinghs)
- ab99613 chore: move test config out of functions (himanshusinghs)
- 093ebcf chore: move left out test config out of functions (himanshusinghs)
- 8496b03 chore: remove unused func (himanshusinghs)
- 4bbcba1 chore: remove orphan checks (himanshusinghs)
- 7c3061d chore: update the test prompt (himanshusinghs)
- ec52ee5 chore: allow adding custom parameter scorers (himanshusinghs)
- 743cbfa chore: ts fixes (himanshusinghs)
- 3491a3b fix: tweak the arg shapes to improve tool accuracy (#381) (nirinchev)
- 2909e8a Replace the matcher framework (nirinchev)
- 49bfac4 remove microdiff (nirinchev)
- 356512b fix tests (nirinchev)
- 8a5a9d2 don't omit fields for MongoDB storage (nirinchev)
- 2d4e750 fix test coverage (nirinchev)
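
Two of the commit titles above (acba3b4 and 7798eb1) mention switching to `$literal` for the stored tool calls. The diff for that change is not shown here, so the motivation is an assumption: inside an aggregation-pipeline update, MongoDB evaluates `$`-prefixed keys and strings as expressions, so tool-call parameters such as a `find` filter containing `$gt` would be evaluated rather than stored as data. Wrapping the value in `$literal` stores it verbatim. The sketch below only builds the pipeline document and does not talk to a database. The `ToolCall` shape, the `llmToolCalls` field name, and the `buildUpdatePipeline` helper are hypothetical and not taken from the PR.

```typescript
// Hypothetical shape of a recorded tool call; not taken from the PR.
interface ToolCall {
  toolName: string;
  parameters: Record<string, unknown>;
}

// Builds a one-stage update pipeline that stores the tool calls as plain data.
function buildUpdatePipeline(toolCalls: ToolCall[]): Record<string, unknown>[] {
  return [
    {
      $set: {
        // `llmToolCalls` is an illustrative field name.
        // Without $literal, the nested `$gt` key below would be parsed as an
        // aggregation expression instead of being saved as part of the value.
        llmToolCalls: { $literal: toolCalls },
      },
    },
  ];
}

const pipeline = buildUpdatePipeline([
  { toolName: "find", parameters: { filter: { price: { $gt: 10 } } } },
]);

console.log(JSON.stringify(pipeline, null, 2));
```

A pipeline built this way could be passed as the update argument of an `updateOne` call in place of a classic update document, which is the form that allows operators such as `$literal`.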
    @@ -0,0 +1,55 @@
    +name: Accuracy Tests
    +
    +on:
    +  workflow_dispatch:
    +  push:
    +    branches:
    +      - main
    +  pull_request:
    +    types:
    +      - labeled
    +
    +jobs:
    +  run-accuracy-tests:
    +    name: Run Accuracy Tests
    +    runs-on: ubuntu-latest
    +    permissions:
    +      contents: read
    +      pull-requests: write
    +    if: |
    +      github.event_name == 'workflow_dispatch' ||
    +      (github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests')
    +    env:
    +      MDB_OPEN_AI_API_KEY: ${{ secrets.ACCURACY_OPEN_AI_API_KEY }}
    +      MDB_GEMINI_API_KEY: ${{ secrets.ACCURACY_GEMINI_API_KEY }}
    +      MDB_AZURE_OPEN_AI_API_KEY: ${{ secrets.ACCURACY_AZURE_OPEN_AI_API_KEY }}
    +      MDB_AZURE_OPEN_AI_API_URL: ${{ vars.ACCURACY_AZURE_OPEN_AI_API_URL }}
    +      MDB_ACCURACY_MDB_URL: ${{ secrets.ACCURACY_MDB_CONNECTION_STRING }}
    +      MDB_ACCURACY_MDB_DB: ${{ vars.ACCURACY_MDB_DB }}
    +      MDB_ACCURACY_MDB_COLLECTION: ${{ vars.ACCURACY_MDB_COLLECTION }}
    +      MDB_ACCURACY_BASELINE_COMMIT: ${{ github.event.pull_request.base.sha || '' }}
    +    steps:
    +      - uses: GitHubSecurityLab/actions-permissions/monitor@v1
    +      - uses: actions/checkout@v4
    +      - uses: actions/setup-node@v4
    +        with:
    +          node-version-file: package.json
    +          cache: "npm"
    +      - name: Install dependencies
    +        run: npm ci
    +      - name: Run accuracy tests
    +        run: npm run test:accuracy
    +      - name: Upload accuracy test summary
    +        if: always()
    +        uses: actions/upload-artifact@v4
    +        with:
    +          name: accuracy-test-summary
    +          path: .accuracy/test-summary.html
    +      - name: Comment summary on PR
    +        if: github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests'
    +        uses: marocchino/sticky-pull-request-comment@d2ad0de260ae8b0235ce059e63f2949ba9e05943 # v2
    +        with:
    +          # Hides the previous comment and add a comment at the end
    +          hide_and_recreate: true
    +          hide_classify: "OUTDATED"
    +          path: .accuracy/test-brief.md
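
The job-level `if:` in the workflow above is narrower than its `on:` triggers, which is easy to miss when reading the diff. As a rough illustration, the condition can be modelled as a plain predicate. The `WorkflowEvent` type and both function names below are hypothetical; only the event names, the `accuracy-tests` label, and the `|| ''` fallback come from the workflow itself.

```typescript
// Hypothetical model of the context fields the workflow's `if:` expression reads.
interface WorkflowEvent {
  eventName: "workflow_dispatch" | "push" | "pull_request";
  // Stands in for github.event.label.name, which is present on `labeled` events.
  labelName?: string;
}

// Mirrors the job-level condition:
//   github.event_name == 'workflow_dispatch' ||
//   (github.event_name == 'pull_request' && github.event.label.name == 'accuracy-tests')
// Simplification: GitHub expressions compare strings case-insensitively,
// while this sketch uses strict equality.
function shouldRunAccuracyTests(event: WorkflowEvent): boolean {
  return (
    event.eventName === "workflow_dispatch" ||
    (event.eventName === "pull_request" && event.labelName === "accuracy-tests")
  );
}

// Mirrors `${{ github.event.pull_request.base.sha || '' }}`: outside a pull
// request there is no base SHA, so the baseline falls back to an empty string.
function baselineCommit(baseSha?: string): string {
  return baseSha || "";
}

console.log(shouldRunAccuracyTests({ eventName: "workflow_dispatch" })); // true
console.log(shouldRunAccuracyTests({ eventName: "push" })); // false
console.log(
  shouldRunAccuracyTests({ eventName: "pull_request", labelName: "accuracy-tests" })
); // true
```

Read this way, a push to `main` still starts the workflow because of the `on.push` trigger, but the condition evaluates to false for it, so the job would be skipped unless the run is dispatched manually or a pull request receives the `accuracy-tests` label.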
  
    
    
  
  
    
    @@ -11,3 +11,5 @@ state.json
     
     tests/tmp
     coverage
    +# Generated assets by accuracy runs
    +.accuracy
      