fileames
Member

This PR makes the necessary changes to make sure our integrations pass the standard tests offered in langchain-tests.

Changes include:

  • Previously, inserting documents with duplicate IDs could raise a unique constraint error and fail the entire batch. We now pass batcherrors=True (https://python-oracledb.readthedocs.io/en/latest/user_guide/batch_statement.html#handling-data-errors) so that per-row errors don't invalidate the other inserts. Only the IDs that were successfully inserted are returned.

  • Optional upsert behavior: the standard tests expect rows with duplicate IDs to be updated rather than raising an error. To preserve backward compatibility, we introduced a constructor parameter, mutate_on_duplicate:
    False (default): preserve the previous behavior (no updates on duplicate IDs).
    True: update existing rows (texts, metadata, etc.) when duplicate IDs are provided.

  • New methods: Added get_by_ids and aget_by_ids.

  • ID handling and hashing

    • In our current implementation, when IDs aren't provided to add_texts, we generate them with uuid.uuid4() and store a hashed version in a RAW column. Users need the generated IDs to call delete or get_by_ids, so add_texts is expected to return them.
    • However, we returned the hashed versions, which do not work when passed to delete or get_by_ids, because those methods hash their input again before searching for the documents:
original_documents = [
    Document(page_content="foo1", metadata={"id": "1"}),
    Document(page_content="bar2", metadata={"id": "2"}),
]
ids = store.add_documents(original_documents)
store.delete(ids)

assert len(store.similarity_search("foo", k=10)) == 0 # FAILS
    • This is now fixed: the unhashed IDs are returned.

  • Documents returned by the similarity_search functions did not have the id field, because the original unhashed IDs were not saved to the DB. To keep the table structure unchanged for users with existing tables, the original IDs are now added to the metadata under the key "__orcl_internal_doc_id", which is then used to populate the id field of the returned Documents.
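
To make the ID round-trip problem from the "ID handling and hashing" bullet concrete, here is a minimal in-memory sketch. It is not the integration's code: ToyStore, hash_id, and the return_hashed flag are hypothetical names, and SHA-256 is only an assumed stand-in for whatever hash the integration applies before writing to the RAW column.

```python
import hashlib
import uuid


def hash_id(raw_id: str) -> str:
    # Assumption: SHA-256 stands in for the integration's actual hash.
    return hashlib.sha256(raw_id.encode("utf-8")).hexdigest()


class ToyStore:
    """Hypothetical in-memory stand-in for the vector store table."""

    def __init__(self, return_hashed: bool) -> None:
        self.rows = {}  # hashed id -> text
        self.return_hashed = return_hashed

    def add_texts(self, texts, ids=None):
        # Generate IDs when the caller does not supply them.
        ids = ids or [str(uuid.uuid4()) for _ in texts]
        for raw_id, text in zip(ids, texts):
            self.rows[hash_id(raw_id)] = text
        # Old behavior returned the hashed IDs; the fix returns the raw ones.
        return [hash_id(i) for i in ids] if self.return_hashed else list(ids)

    def delete(self, ids):
        # delete always hashes its input before looking rows up.
        for raw_id in ids:
            self.rows.pop(hash_id(raw_id), None)


# Old behavior: the returned IDs get hashed a second time, so nothing matches.
old = ToyStore(return_hashed=True)
old.delete(old.add_texts(["foo1", "bar2"]))
rows_left_old = len(old.rows)

# Fixed behavior: the returned IDs round-trip through delete.
new = ToyStore(return_hashed=False)
new.delete(new.add_texts(["foo1", "bar2"]))
rows_left_new = len(new.rows)
```

With return_hashed=True both rows survive the delete call, which mirrors the failing assertion above; with return_hashed=False the store ends up empty.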

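The batcherrors change in the first bullet can be sketched roughly as follows. Cursor.executemany(..., batcherrors=True) and Cursor.getbatcherrors() are python-oracledb calls; everything else is an assumption for illustration: insert_batch, the table name, and the two-column layout are hypothetical, ID hashing is omitted, and FakeCursor is a stand-in that only mimics a unique-constraint failure so the sketch runs without a database.

```python
from types import SimpleNamespace


def insert_batch(cursor, table_name, rows):
    """Insert (id, text) rows and return only the IDs that were inserted."""
    # Hypothetical table layout; the real integration's columns differ.
    sql = f"INSERT INTO {table_name} (id, text) VALUES (:1, :2)"
    # With batcherrors=True, failing rows are recorded instead of raising.
    cursor.executemany(sql, rows, batcherrors=True)
    # Each batch error exposes the offset of the failing row in `rows`.
    failed = {err.offset for err in cursor.getbatcherrors()}
    return [row[0] for i, row in enumerate(rows) if i not in failed]


class FakeCursor:
    """Stand-in for an oracledb cursor; simulates duplicate-ID failures."""

    def __init__(self):
        self.seen = set()
        self._errors = []

    def executemany(self, sql, params, batcherrors=False):
        self._errors = []
        for offset, (row_id, _text) in enumerate(params):
            if row_id in self.seen:
                self._errors.append(
                    SimpleNamespace(offset=offset, message="ORA-00001")
                )
            else:
                self.seen.add(row_id)

    def getbatcherrors(self):
        return self._errors


inserted = insert_batch(
    FakeCursor(), "docs", [("1", "foo"), ("2", "bar"), ("1", "dup")]
)
```

Here the third row reuses ID "1", so it is reported as a batch error and left out of the returned list, while the first two rows are still inserted.
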
@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Sep 19, 2025
@fileames
Member Author

Hi @cjbj, if you have any comments, I'd be happy to address them.
