Semantic search with pgvector #340
@@ -291,6 +293,15 @@ def router_read_many[T: BaseModel, I: Identifiable](  # noqa: PLR0913
        .limit(pagination_request.page_size)
    )

    # Add semantic similarity ordering if embedding is provided and model has embedding field
    if embedding is not None and hasattr(db_model_class, "embedding"):
        # Remove existing ordering clauses and replace with semantic similarity ordering
As quickly discussed before, this is an important thing to agree on, because it depends on the use case for the semantic search. It means that:
- the order of records is different (species, strain and brain_region are ordered by name by default; we don't allow ordering by other attributes, but it's technically possible)
- in general, the user cannot order the results by other attributes. With this approach it could be better to explicitly raise an error when order_by is passed by the user together with the semantic search, instead of silently ignoring it.
- all the records are returned, even the ones with very low similarity (they will be at the bottom of the list or in the last pages)
> in general, the user cannot order the results by other attributes. In any case with this approach it could be better to explicitly raise an error if the order_by is explicitly passed by the user together with the semantic search, instead of silently ignoring it.

We initially wanted to raise an exception, but then we realized that the Swagger UI attaches the order_by=name query param by default. That is why we decided to just ignore it when semantic_search is provided too. Alternatively, we could make sure order_by=name is not sent by default. Feel free to decide.
> the order of records is different (species, strain and brain_region are ordered by name by default. we don't allow ordering by other attributes, but it's technically possible)

Point taken that one can provide multiple entries in the order_by. Now, there are two options:
- semantic_search comes first. Then I would argue there is no need for other tiebreaker entries (e.g. name, created_date, ...) since no two entries in the DB will have the same distance. The current PR basically implements this case since it disregards all the other order_by clauses.
- semantic_search does not come first. If for some reason someone wants to sort by creation day (not date) first and only then by semantic search, then I do agree that our current PR does not support it. However, I would argue that it is not that useful. It can be implemented, but I would leave that for a future PR since the interaction between the FastAPI filter library and this custom semantic_search is not obvious.

> all the records are returned, even the ones with very low similarity (that will be at the bottom of the list or in the last pages)

I can make the same argument about order_by=name or any other column you order by. As discussed in person, we will use this endpoint by adding page_size=5 and not caring about the other pages.
openai_api_key = settings.OPENAI_API_KEY.get_secret_value()

# Generate embedding using OpenAI API
client = openai.OpenAI(api_key=openai_api_key)
Are there alternatives to calling the OpenAI API? Is there rate limiting?
Yes. There are many alternatives (other API providers and even self-hosting custom embedding models). However, since neuroagent uses OpenAI in production, we thought about keeping things simple for now. The clear downside is that entitycore will make calls to a 3rd-party API from now on, which requires an API key.
> Is there a rate limiting?

I think this is an important point - do we have any idea if they have an SLA? I see https://status.openai.com/ but I didn't find latency numbers.
Yes. There is rate limiting. Check https://platform.openai.com/docs/guides/rate-limits/usage-tiers?context=tier-one

The reasons why we did not implement any retry logic or exception handling:
- in GET, semantic_search is optional, so the default behavior is not to use semantic search
- in POST, we assume that the three entities this PR is concerned with (species, strain and brain region) won't change much (or at all)

However, if you want we can write some extra logic.
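If retry logic is ever added, a minimal exponential-backoff wrapper could look like this sketch. The `RateLimitError` here is a local stand-in for the openai package's exception of the same name so the snippet stays self-contained; all names are illustrative:

```python
import time


class RateLimitError(Exception):
    """Stand-in for openai.RateLimitError; keeps the sketch self-contained."""


def with_retries(call, max_attempts: int = 4, base_delay: float = 1.0):
    """Retry `call` on rate-limit errors, doubling the delay each attempt."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # exhausted the retry budget
            time.sleep(base_delay * 2**attempt)


# Example: a flaky call that is rate-limited twice before succeeding.
attempts = {"n": 0}


def flaky_embed():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RateLimitError
    return [0.1, 0.2]
```

In the real code the wrapped call would be the `client.embeddings.create(...)` invocation.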
Co-authored-by: Gianluca Ficarelli <[email protected]>
    Raises:
        ValueError: If OpenAI API key is not configured
    """
    if settings.OPENAI_API_KEY is None:
I don't think we need to check the key every time; having it in the config will be enough.
It was mostly for the type checker, so that we can call get_secret_value(). Note that we want entitycore to be runnable (locally) without that env variable (of course the semantic search features would not work in that case).
app/service/species.py
Outdated
embedding = None

if semantic_search is not None:
    # Generate embedding using OpenAI API
I don't think these comments are needed; how the embedding is generated is a detail that isn't needed here, IMO.
Removed the comment in species.py, strain.py and brain_region.py.
app/schemas/species.py
Outdated
class NestedSpeciesRead(SpeciesCreate, IdentifiableMixin):
    pass
embedding: list[float] | None = Field(default=None, exclude=True)
Is the embedding required on the client side? To me it seems like an internal detail used for semantic search, so it would be nice to keep it internal.
The client should never see the embeddings. This is why here we use exclude=True, which effectively excludes it from the model_dump. Note that in the parent class SpeciesCreate we use embedding: SkipJsonSchema because we don't want it to show up in openapi.json, but we do want it to be included in model_dump to hook into entitycore's logic.
I see. The rule of thumb I've been going by is that the schemas are what enforce the API boundaries with respect to the clients (basically the JSON payloads for creates and responses). I wonder if we could, instead of adding the embedding to the schema, handle the generation of the vectors in router_create_one based on class or endpoint, or a create_embedding argument. That's probably something for @GianlucaFicarelli to decide.
It is your call @GianlucaFicarelli and @mgeplf. We can adjust the code. The main issue is that your current abstractions assume that the CreateEntity model will be exactly the same as what you are putting in the DB, which does not work in this case. We tried to bypass that mechanism in the simplest possible way.
    # Extract embeddings from response
    embeddings = [embedding.embedding for embedding in response.data]
else:
    # Use random vectors when OpenAI key is not provided
Do we really want random vectors? Since it's nullable, can't null be used?
How does the search work w/ null values?
Fair point. We avoided making the column nullable because IMO the documentation does not really document the behavior that well. There is some mention in the README at https://github.com/pgvector/pgvector, but it is in the section on an indexing algorithm, HNSW, that we don't use. Essentially, we wanted to avoid having to append WHERE embedding IS NOT NULL to queries. Anyway, if you find some good resource that documents the behavior, I would be happy to change it.
@@ -175,12 +176,14 @@ class BrainRegion(Identifiable):
    hierarchy_id: Mapped[uuid.UUID] = mapped_column(
        ForeignKey("brain_region_hierarchy.id"), index=True
    )
    embedding: Mapped[Vector] = mapped_column(Vector(1536), nullable=False)
To avoid repeating the vector definition, we could define a mixin that can be reused if we want to add semantic search to more tables. It would also be helpful to add a comment explaining why the dimension of the vector is 1536.
@@ -69,6 +78,9 @@ def read_one(id_: uuid.UUID, db: SessionDep) -> StrainRead:
def create_one(
    json_model: StrainCreate, db: SessionDep, user_context: AdminContextDep
) -> StrainRead:
    # Generate embedding using OpenAI API
Minor: this comment has been removed in read_many but not in the create_one endpoints.
# Generate all embeddings in a single API call
names = [entity[2] for entity in all_entities]
response = client.embeddings.create(model="text-embedding-3-small", input=names)
For confirmation, is this request below the limits indicated in https://platform.openai.com/docs/api-reference/embeddings/create?
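If the number of names ever approaches the per-request input limit, the call could be split into batches. A minimal sketch, assuming the documented limit of 2048 inputs per embeddings request (worth re-checking against the current API reference):

```python
def chunked(items: list, max_batch: int = 2048):
    """Yield slices of `items` no larger than the per-request input limit
    (2048 inputs per embeddings request at the time of writing)."""
    for i in range(0, len(items), max_batch):
        yield items[i : i + max_batch]


# Hypothetical usage against the OpenAI client:
# embeddings = []
# for batch in chunked(names):
#     response = client.embeddings.create(model="text-embedding-3-small", input=batch)
#     embeddings.extend(d.embedding for d in response.data)
```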
def upgrade() -> None:
    # ### commands auto generated by Alembic - please adjust! ###
    # Enable the pgvector extension
    op.execute("CREATE EXTENSION IF NOT EXISTS vector;")
The manual query to create the extension doesn't seem to work well with alembic automatic migration: running make migration on the up-to-date branch would create a migration file that reverts the change, trying to remove the extension.
To fix it, it's possible to add the following code in triggers.py:

    entities += [
        PGExtension(schema="public", signature="vector"),
    ]

With this change, alembic would also be able to automatically generate the commands

    public_vector = PGExtension(schema="public", signature="vector")
    op.create_entity(public_vector)

and

    public_vector = PGExtension(schema="public", signature="vector")
    op.drop_entity(public_vector)

We could also rename triggers.py to something more generic, such as entities.py or pg_entities.py or sql_entities.py, although entity is also a class and a table so it may be a bit misleading, but I don't have a better name.
# Generate embeddings based on available API key
api_key = os.getenv("OPENAI_API_KEY")

if api_key:
The migration should fail if it is executed in staging or production and the API key isn't defined by mistake. There is an env variable ENVIRONMENT that represents whether the image is built for prod or dev, so it's not exactly the same as the running environment, but it should be enough to check that.
@@ -10,26 +11,28 @@ class SpeciesCreate(BaseModel):
    model_config = ConfigDict(from_attributes=True)
    name: str
    taxonomy_id: str
    embedding: SkipJsonSchema[list[float] | None] = None
If I'm not wrong, the embedding attribute in the create schema is needed only because it's set in the create_one method. Since embedding is completely internal, it seems cleaner to not even define it in the pydantic schema, and pass an additional parameter to router_create_one, similarly to what has already been done in router_read_many. In either case, the embedding defined in the read schemas doesn't seem to be used, so it can be removed.
embedding = None

if semantic_search is not None:
    embedding = generate_embedding(semantic_search)
I agree that the call to generate_embedding can be done in the service layer because it doesn't really belong to the query/db layer, but on the other side it's an internal detail, so we could pass semantic_search directly to router_read_many and call generate_embedding there? The same can be done for router_create_one. This would also avoid repeating this call in each read_many endpoint where it's needed, especially if we think to extend it to other endpoints later.
# ### commands auto generated by Alembic - please adjust! ###
# Enable the pgvector extension
op.execute("CREATE EXTENSION IF NOT EXISTS vector;")
As a reminder: in staging and production there is postgresql 17.4 and the (pg)vector extension seems available.
However, before releasing and deploying we should ensure that no other actions are needed to activate the extension.
response = client.embeddings.create(model=model, input=text)

# Return the generated embedding
return response.data[0].embedding
This function would propagate any error (such as a missing API key or errors from the external API) and cause a generic 500 error, which would require someone to check the logs on the server side. Instead of propagating the errors, we could improve the error handling, but I'm OK also if we do that in a separate PR.
Closes #329

Useful links

TODO
- 0 or random vectors or NULL
- POST /embeddings using name and populate with real vectors (embedding not present)
- POST /embeddings and populate with real vectors
- ~~Implement import logic~~ - they don't maintain the importing logic script anymore, plus the population will happen during the migration script
- semantic_search: str | None = None and implement the querying by semantic_search: str if provided
- POST /species | /strain are only allowed for admin users
- db -> entitycore never has the embeddings (entitycore -> user does not)