Semantic search with pgvector #340
@@ -291,6 +293,15 @@ def router_read_many[T: BaseModel, I: Identifiable](  # noqa: PLR0913
        .limit(pagination_request.page_size)
    )

    # Add semantic similarity ordering if embedding is provided and model has embedding field
    if embedding is not None and hasattr(db_model_class, "embedding"):
        # Remove existing ordering clauses and replace with semantic similarity ordering
As quickly discussed before, this is an important thing to agree on, because it depends on the use case for the semantic search. It means that:
- the order of records is different (species, strain and brain_region are ordered by name by default; we don't allow ordering by other attributes, but it's technically possible)
- in general, the user cannot order the results by other attributes. With this approach it could be better to explicitly raise an error when order_by is passed by the user together with the semantic search, instead of silently ignoring it.
- all the records are returned, even the ones with very low similarity (they will be at the bottom of the list or in the last pages)
> in general, the user cannot order the results by other attributes. In any case with this approach it could be better to explicitly raise an error if the order_by is explicitly passed by the user together with the semantic search, instead of silently ignoring it.

We initially wanted to raise an exception, but then we realized that the Swagger UI attaches the order_by=name query param by default. That is why we decided to just ignore it when semantic_search is provided too. Alternatively, we could make sure order_by=name is not sent by default. Feel free to decide.
> the order of records is different (species, strain and brain_region are ordered by name by default. we don't allow ordering by other attributes, but it's technically possible)

Point taken that one can provide multiple entries in the order_by. Now, there are two options:
- semantic_search comes first. Then I would argue there is no need for other tiebreaker entries (e.g. name, created_date, ...) since no two entries in the DB will have the same distance. The current PR basically implements this case since it disregards all the other order_by clauses.
- semantic_search does not come first. If for some reason someone wants to sort by creation day (not date) first and only then by semantic search, then I do agree that our current PR does not support it. However, I would argue that it is not that useful. It can be implemented, but I would leave that for a future PR since the interaction between the FastAPI filter library and this custom semantic_search is not obvious.

> all the records are returned, even the ones with very low similarity (that will be at the bottom of the list or in the last pages)

I can make the same argument about order_by=name or any other column you order by. As discussed in person, we will use this endpoint by adding page_size=5 and not caring about the other pages.
openai_api_key = settings.OPENAI_API_KEY.get_secret_value()

# Generate embedding using OpenAI API
client = openai.OpenAI(api_key=openai_api_key)
Are there alternatives to calling the OpenAI API? Is there rate limiting?
Yes. There are many alternatives (other API providers and even self-hosting custom embedding models). However, since neuroagent uses OpenAI in production, we thought about keeping things simple for now. The clear downside is that entitycore will make calls to a 3rd-party API from now on, which requires an API key.
> Is there a rate limiting?

I think this is an important point - do we have any idea if they have an SLA? I see https://status.openai.com/ but I didn't find latency numbers.
Yes. There is rate limiting. Check https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-one

The reasons why we did not implement any retry logic or exception handling:
- in GET, semantic_search is optional, so the default behavior is not to use semantic search
- in POST, we assume that the three entities this PR is concerned with (species, strain and brain region) won't change much (or at all)

However, if you want we can write some extra logic.
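If retry logic is ever added, a minimal exponential-backoff wrapper could look like this sketch. The `RateLimitError` here is a local stand-in for the openai package's exception of the same name so the snippet stays self-contained; all names are illustrative:

```python
import time


class RateLimitError(Exception):
    """Stand-in for openai.RateLimitError; keeps the sketch self-contained."""


def with_retries(call, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry `call` on rate-limit errors, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # exhausted the retry budget
            time.sleep(base_delay * 2**attempt)


# Example: a flaky call that is rate-limited twice before succeeding.
attempts = {"n": 0}


def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError
    return [0.1, 0.2]
```

In the real code the wrapped call would be the `client.embeddings.create(...)` invocation.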
Co-authored-by: Gianluca Ficarelli <[email protected]>
    Raises:
        ValueError: If OpenAI API key is not configured
    """
    if settings.OPENAI_API_KEY is None:
I don't think we need to check the key every time; having it in the config will be enough.
It was mostly for the type checker, so that we can call get_secret_value(). Note that we want entitycore to be runnable (locally) without that env variable (of course the semantic search features would not work in that case).
app/service/species.py
Outdated
embedding = None

if semantic_search is not None:
    # Generate embedding using OpenAI API
I don't think these comments are needed; how the embedding is generated is a detail that isn't needed here, IMO.
Removed the comment in species.py, strain.py and brain_region.py.
app/schemas/species.py
Outdated
class NestedSpeciesRead(SpeciesCreate, IdentifiableMixin):
    pass
embedding: list[float] | None = Field(default=None, exclude=True)
Is the embedding required on the client side? To me it seems like an internal detail used for semantic search, so it would be nice to keep it internal.
The client should never see the embeddings. This is why here we use exclude=True, which effectively excludes it from the model_dump. Note that in the parent class SpeciesCreate we use embedding: SkipJsonSchema because we don't want it to show up in openapi.json, but we do want it to be included in model_dump to hook into entitycore's logic.
I see. The rule of thumb I've been going by is that the schemas are what enforce the API boundaries with respect to the clients (basically the JSON payloads for creates and responses). I wonder if we could, instead of adding the embedding to the schema, handle the generation of the vectors in router_create_one based on class or endpoint, or a create_embedding argument. That's probably something for @GianlucaFicarelli to decide.
It is your call @GianlucaFicarelli and @mgeplf. We can adjust the code. The main issue is that your current abstractions assume that the CreateEntity model will be exactly the same as what you are putting in the DB, which does not work in this case. We tried to bypass that mechanism in the simplest possible way.
    # Extract embeddings from response
    embeddings = [embedding.embedding for embedding in response.data]
else:
    # Use random vectors when OpenAI key is not provided
Do we really want random vectors? Since it's nullable, can't null be used?
How does the search work w/ null values?
Fair point. We avoided making the column nullable because IMO the documentation does not really document the behavior that well. There is some mention in the README at https://github.com/pgvector/pgvector, but it is in the section on an indexing algorithm, HNSW, that we don't use. Essentially, we wanted to avoid having to append WHERE embedding IS NOT NULL to queries. Anyway, if you find some good resource that documents the behavior, I would be happy to change it.
@@ -175,12 +176,14 @@ class BrainRegion(Identifiable):
    hierarchy_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("brain_region_hierarchy.id"), index=True
    )
    embedding: Mapped[Vector] = mapped_column(Vector(1536), nullable=False)
To avoid repeating the vector definition, we could define a mixin that can be reused if we want to add semantic search to more tables. It would also be helpful to add a comment explaining why the dimension of the vector is 1536.
@@ -69,6 +78,9 @@ def read_one(id_: uuid.UUID, db: SessionDep) -> StrainRead:
def create_one(
    json_model: StrainCreate, db: SessionDep, user_context: AdminContextDep
) -> StrainRead:
    # Generate embedding using OpenAI API
Minor: this comment has been removed in read_many but not in the create_one endpoints.
# Generate all embeddings in a single API call
names = [entity[2] for entity in all_entities]
response = client.embeddings.create(model="text-embedding-3-small", input=names)
For confirmation, is this request below the limits indicated in https://platform.openai.com/docs/api-reference/embeddings/create?
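If the number of names ever approaches the per-request input limit, the call could be split into batches. A minimal sketch, assuming the documented limit of 2048 inputs per embeddings request (worth re-checking against the current API reference):

```python
def chunked(items: list, max_batch: int = 2048):
    """Yield slices of `items` no larger than the per-request input limit
    (2048 inputs per embeddings request at the time of writing)."""
    for i in range(0, len(items), max_batch):
        yield items[i : i + max_batch]


# Hypothetical usage against the OpenAI client:
# embeddings = []
# for batch in chunked(names):
#     response = client.embeddings.create(model="text-embedding-3-small", input=batch)
#     embeddings.extend(d.embedding for d in response.data)
```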
def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    # Enable the pgvector extension
    op.execute("CREATE EXTENSION IF NOT EXISTS vector;")
The manual query to create the extension doesn't seem to work well with alembic automatic migration: running make migration on the up-to-date branch would create a migration file that reverts the change, trying to remove the extension.
To fix it, it's possible to add the following code in triggers.py:

    entities += [
        PGExtension(schema="public", signature="vector"),
    ]

With this change, alembic would also be able to automatically generate the commands

    public_vector = PGExtension(schema="public", signature="vector")
    op.create_entity(public_vector)

and

    public_vector = PGExtension(schema="public", signature="vector")
    op.drop_entity(public_vector)

We could also rename triggers.py to something more generic, such as entities.py or pg_entities.py or sql_entities.py, although entity is also a class and a table so it may be a bit misleading, but I don't have a better name.
# Generate embeddings based on available API key
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
The migration should fail if it is executed in staging or production and the API key isn't defined by mistake. There is an env variable ENVIRONMENT that represents whether the image is built for prod or dev, so it's not exactly the same as the running environment, but it should be enough to check that.
@@ -10,26 +11,28 @@ class SpeciesCreate(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    name: str
    taxonomy_id: str
    embedding: SkipJsonSchema[list[float] | None] = None
If I'm not wrong, the embedding attribute in the create schema is needed only because it's set in the create_one method. Since embedding is completely internal, it seems cleaner to not even define it in the pydantic schema, and pass an additional parameter to router_create_one, similarly to what has already been done in router_read_many. In either case, the embedding defined in the read schemas doesn't seem to be used, so it can be removed.
embedding = None

if semantic_search is not None:
    embedding = generate_embedding(semantic_search)
I agree that the call to generate_embedding can be done in the service layer because it doesn't really belong to the query/db layer, but on the other side it's an internal detail, so we could pass semantic_search directly to router_read_many and call generate_embedding there? The same can be done for router_create_one. This would also avoid repeating this call in each read_many endpoint where it's needed, especially if we think to extend it to other endpoints later.
# ### commands auto generated by Alembic - please adjust! ###
# Enable the pgvector extension
op.execute("CREATE EXTENSION IF NOT EXISTS vector;")
As a reminder: in staging and production there is postgresql 17.4 and the (pg)vector extension seems available.
However, before releasing and deploying we should ensure that no other actions are needed to activate the extension.
response = client.embeddings.create(model=model, input=text)

# Return the generated embedding
return response.data[0].embedding
This function would propagate any error (such as a missing API key or errors from the external API) and cause a generic 500 error, which would require someone to check the logs on the server side. Instead of propagating the errors, we could improve the error handling, but I'm OK also if we do that in a separate PR.
Closes #329

Useful links

TODO
- 0 or random vectors or NULL
- POST /embeddings using name and populate with real vectors (embedding not present)
- POST /embeddings and populate with real vectors
- ~~Implement import logic~~ - they don't maintain the importing logic script anymore, plus the population will happen during the migration script
- semantic_search: str | None = None and implement the querying by semantic_search: str if provided
- POST /species | /strain are only allowed for admin users
- db -> entitycore never has the embeddings (entitycore -> user does not)