-
Notifications
You must be signed in to change notification settings - Fork 63
Move index creation inside CREATE TABLE for massive database creation speedup #273
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Move index creation inside CREATE TABLE for massive database creation speedup #273
Conversation
| index = element.target | ||
| assert isinstance(index, Index) | ||
| was_created = index.info.get("_cockroachdb_index_created_by_create_table", False) | ||
| assert was_created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do need to handle emitting CREATE INDEX in cases where we_aren't also creating the corresponding table then we might be able to do that here by doing something like:
if not was_created:
return compiler.visit_create_index(...)
(Assuming that sqlalchemy always does index creations after the corresponding table creation.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to be able to emit CREATE INDEX statements because an Alembic migration might want to add an index to an existing column. With the changes I proposed in wavemm#1 , this code works with the current master branch (bc87688)
from alembic.migration import MigrationContext
from alembic.operations import Operations
import sqlalchemy as sa
myengine = sa.create_engine("cockroachdb://root@localhost:26257/defaultdb")
conn = myengine.connect()
ctx = MigrationContext.configure(conn)
op = Operations(ctx)
op.drop_table("invoice", if_exists=True)
op.create_table(
"invoice",
sa.Column("invoice_number", sa.Integer(), nullable=False),
sa.Column("account_number", sa.Integer(), nullable=True),
sa.PrimaryKeyConstraint("invoice_number"),
)
op.create_index(op.f("ix_invoice_account_number"), "invoice", ["account_number"], unique=False)but your modified version fails with
Traceback (most recent call last):
File "/home/gord/git/sqlalchemy-cockroachdb/.gord_stuff/alembic_op.py", line 20, in <module>
op.create_index(op.f("ix_invoice_account_number"), "invoice", ["account_number"], unique=False)
File "<string>", line 3, in create_index
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/operations/ops.py", line 1013, in create_index
return operations.invoke(op)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/operations/base.py", line 441, in invoke
return fn(self, operation)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/operations/toimpl.py", line 112, in create_index
operations.impl.create_index(idx, **kw)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/ddl/postgresql.py", line 99, in create_index
self._exec(CreateIndex(index, **kw))
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/ddl/impl.py", line 246, in _exec
return conn.execute(construct, params)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1415, in execute
return meth(
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/ddl.py", line 187, in _execute_on_connection
return connection._execute_ddl(
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1523, in _execute_ddl
compiled = ddl.compile(
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 308, in compile
return self._compiler(dialect, **kw)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/ddl.py", line 76, in _compiler
return dialect.ddl_compiler(dialect, self, **kw)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py", line 886, in __init__
self.string = self.process(self.statement, **compile_kwargs)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py", line 932, in process
return obj._compiler_dispatch(self, **kwargs)
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/ext/compiler.py", line 538, in <lambda>
lambda *arg, **kw: existing(*arg, **kw),
File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/ext/compiler.py", line 591, in __call__
expr = fn(element, compiler, **kw)
File "/home/gord/git/sqlalchemy-cockroachdb/sqlalchemy_cockroachdb/ddl_compiler.py", line 67, in visit_create_index
assert was_created
AssertionError
| IDX_USING = re.compile(r"^(?:btree|hash|gist|gin|[\w_]+)$", re.I) | ||
|
|
||
|
|
||
| # Heavily based on DDLCompiler.visit_create_index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is almost PostgresqlDDLCompiler.visit_create_index. Differences are:
- Remove the
CREATEandON {table_name}bits - Replacing
USINGwithINVERTED(seems to be needed for crdb?) - Removing/commenting some features that I don't think crdb supports.
In the final version I would clean this up a lot more.
I don't think we should attempt to reuse PostgresqlDDLCompiler.visit_create_index - I think the string munging required for that would be quite brittle.
|
@gordthompson would you mind taking a look at this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After making the changes noted below I got
$ pytest -k test_create_table /home/gord/git/sqlalchemy-cockroachdb/test/test_suite_sqlalchemy.py
to run. (2 tests passed, 4 ignored, 6 tests total)
| @@ -1,5 +1,19 @@ | |||
| from sqlalchemy import exc | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing import re statement above this line.
| from sqlalchemy_cockroachdb.base import ( # type: ignore[import-untyped] | ||
| CockroachDBDialect, | ||
| ) | ||
| from sqlalchemy_cockroachdb.ddl_compiler import ( # type: ignore[import-untyped] | ||
| CockroachDDLCompiler, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had to remove these two imports to avoid
ImportError: cannot import name 'CockroachDBDialect' from partially initialized module 'sqlalchemy_cockroachdb.base' (most likely due to a circular import) (/home/gord/git/sqlalchemy-cockroachdb/sqlalchemy_cockroachdb/base.py)
|
Hi @gordthompson thanks for the review and for getting the tests passing. Do you have any thoughts on the general approach taken here? e.g. whether it's likely to have unexpected effects on edge case uses or cause maintenance difficulties in the future? If not I can clean this up to something more ready to merge. |
The general approach seems reasonable to me, but
yes, I agree that the change is significant enough that it probably should be an opt-in feature. |
|
Great, thanks! I'll keep tinkering with this stuff inside our codebase for a while longer. I'll probably want to upstream it around/before we start serious use of cockroachdb in production, so I'll aim to clean this PR up before then. So anytime from the next few weeks up to the end of this year. |
Hi!
We're in the process of migrating to cockroachdb. We found that naively creating dev databases using sqlalchemy and cockroachdb was dramatically slower than postgres (~20 minutes vs ~30 seconds). We narrowed most of this time down to
CREATE INDEXstatements.Talking with @data-matt he suggested doing index creation inside
CREATE TABLErather than as separate statements. This makes a huge difference, bringing our overall db initialisation time down to ~2m30.So I've experimented with getting sqlalchemy to do it this way. The only approach I could find is to have visit_create_table do the index creation and visit_create_index be a no-op. I've got some prototype code here which works for how we use sqlalchemy at Wave.
I don't know sqlalchemy's internals very well so I'm kind of uncertain about my approach, in particular:
Do you have any ideas/thoughts?
Then my other question: if this works, how do we release it? Presumably this is a breaking change, so do we need to put it behind some kind of config flag?