Move index creation inside CREATE TABLE for massive database creation speedup #273

davidshepherd7 · 2025-07-18T16:41:58Z

Hi!

We're in the process of migrating to cockroachdb. We found that naively creating dev databases using sqlalchemy and cockroachdb was dramatically slower than postgres (~20 minutes vs ~30 seconds). We narrowed most of this time down to CREATE INDEX statements.

When I talked with @data-matt, he suggested creating indexes inside CREATE TABLE rather than in separate statements. This makes a huge difference, bringing our overall db initialisation time down to ~2m30.

So I've experimented with getting sqlalchemy to do it this way. The only approach I could find is to have visit_create_table do the index creation and visit_create_index be a no-op. I've got some prototype code here which works for how we use sqlalchemy at Wave.
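To make the two DDL shapes concrete, here is a minimal sketch in plain Python. It is not the prototype code from this PR: `separate_ddl` and `inline_ddl` are hypothetical helpers that only build strings, and they do no identifier quoting.

```python
def separate_ddl(table, columns, indexes):
    """Return one CREATE TABLE plus one CREATE INDEX per index."""
    cols = ", ".join(f"{name} {type_}" for name, type_ in columns)
    statements = [f"CREATE TABLE {table} ({cols})"]
    for index_name, index_cols in indexes:
        statements.append(
            f"CREATE INDEX {index_name} ON {table} ({', '.join(index_cols)})"
        )
    return statements


def inline_ddl(table, columns, indexes):
    """Return a single CREATE TABLE with the INDEX clauses folded in."""
    parts = [f"{name} {type_}" for name, type_ in columns]
    parts += [
        f"INDEX {index_name} ({', '.join(index_cols)})"
        for index_name, index_cols in indexes
    ]
    return [f"CREATE TABLE {table} ({', '.join(parts)})"]


columns = [("invoice_number", "INT PRIMARY KEY"), ("account_number", "INT")]
indexes = [("ix_invoice_account_number", ["account_number"])]

for stmt in separate_ddl("invoice", columns, indexes):
    print(stmt)
for stmt in inline_ddl("invoice", columns, indexes):
    print(stmt)
```

With N indexes on a table, the first form sends 1 + N statements and the second sends one, which is where the speedup described above comes from.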

I don't know sqlalchemy's internals very well so I'm kind of uncertain about my approach, in particular:

Are there important cases where sqlalchemy could emit DDL for the index without emitting it for the table, e.g. does it have any native migration generation which does this?
I can't see any places in built-in sqlalchemy DDL Compilers where they return a no-op from a visit function. Is this just a completely insane idea?

Do you have any ideas/thoughts?

Then my other question: if this works, how do we release it? Presumably this is a breaking change, so do we need to put it behind some kind of config flag?

davidshepherd7 · 2025-07-18T16:44:28Z

sqlalchemy_cockroachdb/ddl_compiler.py

+    index = element.target
+    assert isinstance(index, Index)
+    was_created = index.info.get("_cockroachdb_index_created_by_create_table", False)
+    assert was_created


If we do need to handle emitting CREATE INDEX in cases where we aren't also creating the corresponding table, then we might be able to do that here by doing something like:

if not was_created: return compiler.visit_create_index(...)

(Assuming that sqlalchemy always does index creations after the corresponding table creation.)
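The marker-plus-fallback idea can be modelled without SQLAlchemy. In this sketch `StubIndex` is a stand-in for `sqlalchemy.Index` (only the `info` dict matters), and both visit functions are simplified, hypothetical versions of the compiler hooks. Returning an empty string for the no-op is an assumption; whether SQLAlchemy accepts that from a visit function is the open question raised above.

```python
from dataclasses import dataclass, field

MARKER = "_cockroachdb_index_created_by_create_table"


@dataclass
class StubIndex:
    """Stand-in for sqlalchemy.Index with only the fields this sketch needs."""
    name: str
    table: str
    columns: list
    info: dict = field(default_factory=dict)


def visit_create_table(table, column_sql, indexes):
    """Render CREATE TABLE with indexes inline and mark each index as done."""
    clauses = [column_sql]
    for index in indexes:
        clauses.append(f"INDEX {index.name} ({', '.join(index.columns)})")
        index.info[MARKER] = True
    return f"CREATE TABLE {table} ({', '.join(clauses)})"


def visit_create_index(index):
    """Skip indexes already created inline; otherwise emit CREATE INDEX."""
    if index.info.get(MARKER, False):
        return ""  # assumption: an empty string stands for "emit nothing"
    cols = ", ".join(index.columns)
    return f"CREATE INDEX {index.name} ON {index.table} ({cols})"


inline = StubIndex("ix_invoice_account_number", "invoice", ["account_number"])
visit_create_table("invoice", "invoice_number INT PRIMARY KEY", [inline])
print(repr(visit_create_index(inline)))  # skipped, created with the table

late = StubIndex("ix_invoice_number", "invoice", ["invoice_number"])
print(visit_create_index(late))  # never marked, so CREATE INDEX is emitted
```

One thing the sketch makes visible: the marker lives on the index object, so it stays set for any later DDL on the same object unless something clears it.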

We still need to be able to emit CREATE INDEX statements because an Alembic migration might want to add an index to an existing column. With the changes I proposed in wavemm#1, this code works with the current master branch (bc87688):

from alembic.migration import MigrationContext
from alembic.operations import Operations
import sqlalchemy as sa

myengine = sa.create_engine("cockroachdb://root@localhost:26257/defaultdb")
conn = myengine.connect()
ctx = MigrationContext.configure(conn)
op = Operations(ctx)

op.drop_table("invoice", if_exists=True)
op.create_table(
    "invoice",
    sa.Column("invoice_number", sa.Integer(), nullable=False),
    sa.Column("account_number", sa.Integer(), nullable=True),
    sa.PrimaryKeyConstraint("invoice_number"),
)
op.create_index(op.f("ix_invoice_account_number"), "invoice", ["account_number"], unique=False)

but your modified version fails with

Traceback (most recent call last):
  File "/home/gord/git/sqlalchemy-cockroachdb/.gord_stuff/alembic_op.py", line 20, in <module>
    op.create_index(op.f("ix_invoice_account_number"), "invoice", ["account_number"], unique=False)
  File "<string>", line 3, in create_index
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/operations/ops.py", line 1013, in create_index
    return operations.invoke(op)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/operations/base.py", line 441, in invoke
    return fn(self, operation)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/operations/toimpl.py", line 112, in create_index
    operations.impl.create_index(idx, **kw)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/ddl/postgresql.py", line 99, in create_index
    self._exec(CreateIndex(index, **kw))
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/alembic/ddl/impl.py", line 246, in _exec
    return conn.execute(construct, params)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1415, in execute
    return meth(
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/ddl.py", line 187, in _execute_on_connection
    return connection._execute_ddl(
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/engine/base.py", line 1523, in _execute_ddl
    compiled = ddl.compile(
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/elements.py", line 308, in compile
    return self._compiler(dialect, **kw)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/ddl.py", line 76, in _compiler
    return dialect.ddl_compiler(dialect, self, **kw)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py", line 886, in __init__
    self.string = self.process(self.statement, **compile_kwargs)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/sql/compiler.py", line 932, in process
    return obj._compiler_dispatch(self, **kwargs)
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/ext/compiler.py", line 538, in <lambda>
    lambda *arg, **kw: existing(*arg, **kw),
  File "/home/gord/git/sqlalchemy-cockroachdb/.venv/lib/python3.9/site-packages/sqlalchemy/ext/compiler.py", line 591, in __call__
    expr = fn(element, compiler, **kw)
  File "/home/gord/git/sqlalchemy-cockroachdb/sqlalchemy_cockroachdb/ddl_compiler.py", line 67, in visit_create_index
    assert was_created
AssertionError

davidshepherd7 · 2025-07-18T16:48:00Z

sqlalchemy_cockroachdb/ddl_compiler.py

+IDX_USING = re.compile(r"^(?:btree|hash|gist|gin|[\w_]+)$", re.I)
+
+
+# Heavily based on DDLCompiler.visit_create_index


This is almost PostgresqlDDLCompiler.visit_create_index. Differences are:

Removing the CREATE and ON {table_name} bits.

Replacing USING with INVERTED (seems to be needed for crdb?).

Removing or commenting out some features that I don't think crdb supports.

In the final version I would clean this up a lot more.

I don't think we should attempt to reuse PostgresqlDDLCompiler.visit_create_index - I think the string munging required for that would be quite brittle.
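As a rough illustration of what building the clause directly (rather than munging the PostgreSQL output) could look like, here is a sketch of a clause renderer. `render_index_clause` is a hypothetical helper, not the code in this PR; it reuses the `IDX_USING` pattern from the diff, does no identifier quoting, and assumes that only a `gin` access method maps to the `INVERTED` keyword.

```python
import re

# Same pattern as in the diff above; it accepts any single word.
IDX_USING = re.compile(r"^(?:btree|hash|gist|gin|[\w_]+)$", re.I)


def render_index_clause(name, columns, unique=False, using=None):
    """Render an index clause suitable for use inside CREATE TABLE."""
    if using is not None and not IDX_USING.match(using):
        raise ValueError(f"unexpected index access method: {using!r}")

    # Assumption for this sketch: "gin" becomes INVERTED, and any other
    # access method is dropped instead of being emitted as USING.
    inverted = using is not None and using.lower() == "gin"

    words = []
    if unique:
        words.append("UNIQUE")
    if inverted:
        words.append("INVERTED")
    words += ["INDEX", name, f"({', '.join(columns)})"]
    return " ".join(words)


print(render_index_clause("ix_invoice_account_number", ["account_number"]))
print(render_index_clause("uq_invoice_ref", ["ref"], unique=True))
print(render_index_clause("ix_invoice_doc", ["doc"], using="gin"))
```

Compared with the standalone statement, there is no leading `CREATE` and no `ON {table_name}`, matching the first difference listed above.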

data-matt · 2025-07-18T17:30:27Z

@rafiss

dikshant · 2025-07-28T15:06:18Z

@gordthompson would you mind taking a look at this?

gordthompson

After making the changes noted below I got

$ pytest -k test_create_table /home/gord/git/sqlalchemy-cockroachdb/test/test_suite_sqlalchemy.py

to run. (2 tests passed, 4 ignored, 6 tests total)

gordthompson · 2025-07-28T16:17:11Z

sqlalchemy_cockroachdb/ddl_compiler.py

@@ -1,5 +1,19 @@
 from sqlalchemy import exc


Missing import re statement above this line.

gordthompson · 2025-07-28T16:20:15Z

sqlalchemy_cockroachdb/ddl_compiler.py

+from sqlalchemy_cockroachdb.base import (  # type: ignore[import-untyped]
+    CockroachDBDialect,
+)
+from sqlalchemy_cockroachdb.ddl_compiler import (  # type: ignore[import-untyped]
+    CockroachDDLCompiler,
+)


I had to remove these two imports to avoid

ImportError: cannot import name 'CockroachDBDialect' from partially initialized module 'sqlalchemy_cockroachdb.base' (most likely due to a circular import) (/home/gord/git/sqlalchemy-cockroachdb/sqlalchemy_cockroachdb/base.py)
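If those names turn out to be needed only for type annotations (an assumption, suggested by the `# type: ignore` comments but not confirmed here), one common way to keep them without a runtime cycle is a `TYPE_CHECKING` guard. A minimal sketch, where `dialect_name` is a hypothetical function used only to show the annotation:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by type checkers only; this branch never runs at import
    # time, so it cannot take part in a circular import.
    from sqlalchemy_cockroachdb.base import CockroachDBDialect


def dialect_name(dialect: "CockroachDBDialect") -> str:
    """The string annotation leaves the name unresolved at runtime."""
    return type(dialect).__name__
```

If the classes are used at runtime instead, this does not help, and removing the imports or moving them inside the function that needs them is the simpler fix.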

davidshepherd7 · 2025-07-29T09:07:39Z

Hi @gordthompson thanks for the review and for getting the tests passing.

Do you have any thoughts on the general approach taken here? e.g. whether it's likely to have unexpected effects on edge case uses or cause maintenance difficulties in the future?

If not I can clean this up to something more ready to merge.

gordthompson · 2025-07-30T14:11:37Z

> Do you have any thoughts on the general approach taken here?

The general approach seems reasonable to me, but

> Presumably this is a breaking change, so do we need to put it behind some kind of config flag?

yes, I agree that the change is significant enough that it probably should be an opt-in feature.

davidshepherd7 · 2025-08-06T09:56:55Z

Great, thanks! I'll keep tinkering with this stuff inside our codebase for a while longer.

I'll probably want to upstream it around/before we start serious use of cockroachdb in production, so I'll aim to clean this PR up before then. So anytime from the next few weeks up to the end of this year.

Hacky WIP implementation of moving index creation inside CREATE TABLE

9132bbd

davidshepherd7 changed the title ~~Moving index creation inside CREATE TABLE for massive database creation speedup~~ Move index creation inside CREATE TABLE for massive database creation speedup Jul 18, 2025

dikshant requested a review from gordthompson July 28, 2025 15:23

gordthompson requested changes Jul 28, 2025
