Contributor

@theory theory commented Oct 30, 2025

Add a new pattern for "prepared inserts". It works like this:

  • Call BeginInsert with an INSERT query with optional columns and ending in VALUES. No values should be included in the string.
  • It returns a Block pre-configured with columns as declared in the INSERT statement.
  • Add data to the block and periodically call InsertData to insert data and clear the block.
  • Call EndInsert() or just let the Client object go out of scope to signal the server that it's done inserting.

This allows one to send smaller batches of blocks, thereby using less memory, but still in a single ClickHouse INSERT operation.
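A minimal sketch of the batching loop described above, under stated assumptions: `SketchBlock`, `SketchClient`, and `StreamRows` are hypothetical stand-ins that only borrow the method names from this PR. They are not the clickhouse-cpp API and send nothing over the network.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a block with one integer column.
struct SketchBlock {
    std::vector<int> ids;
    std::size_t RowCount() const { return ids.size(); }
    void Clear() { ids.clear(); }
};

// Hypothetical stand-in for the client. It only records how many rows and
// batches it was handed.
class SketchClient {
public:
    SketchBlock BeginInsert(const std::string& query) {
        assert(!inserting_ && "one insert at a time per client");
        query_ = query;
        inserting_ = true;
        return SketchBlock{};
    }

    // Mirrors the behavior described in this PR: send the block, then clear it.
    void InsertData(SketchBlock& block) {
        assert(inserting_);
        rows_sent_ += block.RowCount();
        ++batches_sent_;
        block.Clear();
    }

    void EndInsert() { inserting_ = false; }

    bool inserting() const { return inserting_; }
    std::size_t rows_sent() const { return rows_sent_; }
    std::size_t batches_sent() const { return batches_sent_; }

private:
    std::string query_;
    bool inserting_ = false;
    std::size_t rows_sent_ = 0;
    std::size_t batches_sent_ = 0;
};

// Streams total_rows rows in batches of at most batch_size rows, so only one
// batch is held in memory at a time. Requires batch_size > 0.
inline void StreamRows(SketchClient& client, std::size_t total_rows,
                       std::size_t batch_size) {
    assert(batch_size > 0);
    SketchBlock block = client.BeginInsert("INSERT INTO t (id) VALUES");
    for (std::size_t i = 0; i < total_rows; ++i) {
        block.ids.push_back(static_cast<int>(i));
        if (block.RowCount() >= batch_size) client.InsertData(block);
    }
    if (block.RowCount() > 0) client.InsertData(block);  // flush the remainder
    client.EndInsert();
}
```

With 10 rows and a batch size of 4, this sketch hands over three batches (4, 4, and 2 rows) inside a single begin/end pair.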

This should be useful in the Postgres foreign data wrapper (FDW) insert API, where a single statement can insert multiple rows but the API hands them over one at a time. It will also support the FDW COPY API, which can submit huge batches of data to insert.

Comment on lines 1191 to 1206
if (chtype->GetCode() == Type::LowCardinality) {
    chtype = col->As<ColumnLowCardinality>()->GetNestedType();
}
Contributor Author

I'm honestly not sure this is the right thing to do. Might one need Type::LowCardinality?

Contributor Author

Actually, I think we can probably do away with this elision of LowCardinality if we can fix this issue. I can't figure out what to construct to append there. The error from Append there is:

no suitable user-defined conversion from "clickhouse::ItemView" to "clickhouse::ColumnRef" (aka "std::__1::shared_ptr<clickhouse::Column>") exists C/C++(312)


void FinishInsert();

void SendData(const Block& block);
Contributor Author

I had to move this to public so that PreparedInsert can call it. Not in the header file, though, so shouldn't matter.

public:
    Block * GetBlock();
    void Execute();
    // XXX This shouldn't be public.
Contributor Author

I couldn't figure out how to make this private. Suggestions appreciated.

Contributor Author

Would be nice if it worked declared public in the .cpp file, but I think I could also use an Impl class like Client does to hide such things.

@theory theory force-pushed the insert-block branch 5 times, most recently from 51d8216 to c93c844 Compare October 31, 2025 20:50
@mshustov mshustov requested review from Copilot and slabko November 4, 2025 08:25
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a PreparedInsert pattern for more memory-efficient bulk data insertion. Instead of accumulating all data before sending, users can now prepare an INSERT statement once and execute multiple smaller batches within a single ClickHouse operation.

Key Changes:

  • Added PreparedInsert class with GetBlock(), Execute(), and Finish() methods for iterative data insertion
  • Implemented PrepareInsert() methods in Client for initiating prepared inserts
  • Added comprehensive unit test demonstrating the prepared insert workflow

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| clickhouse/client.h | Declared PreparedInsert nested class and PrepareInsert() methods with detailed documentation |
| clickhouse/client.cpp | Implemented PreparedInsert class methods, ReceivePreparePackets(), and refactored insert finalization logic |
| clickhouse/block.h | Fixed spelling in comments ("Convinience" → "Convenience") |
| ut/client_ut.cpp | Added PrepareInsert test case and fixed spelling in existing comment ("Spontaneosly" → "Spontaneously") |

Contributor

@slabko slabko left a comment

Thank you very much for contributing this feature. It has been on the list for quite some time, and I’m glad someone has started looking into it.

However, I have a few remarks.

In general, if you look at the codebase, there is no manual memory management: instead of using new and delete, we rely on std::unique_ptr and std::shared_ptr to manage heap-allocated resources. In fact, the delete keyword is never used anywhere in the project. Using manual memory management in the PreparedInsert class introduces a very bad situation where PreparedInsert can be inadvertently copied. The compiler will automatically generate the copy constructor and the copy assignment operator, which could lead to shallow copies of pointers and ultimately a double-free error if users are not careful. This can easily happen by accident.
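To illustrate the copy hazard, here is a small sketch with hypothetical types (`Payload`, `RawOwner`, and `UniqueOwner` are invented for this example and do not appear in the PR). It only checks, at compile time, which class the compiler allows to be copied.

```cpp
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical payload standing in for whatever the owner manages.
struct Payload {
    std::vector<int> rows;
};

// Holds a raw pointer. The compiler still generates copy operations, and a
// copy would duplicate the pointer, not the payload. If the class deleted the
// pointer in its destructor, two copies would free the same payload twice.
struct RawOwner {
    Payload* payload = nullptr;  // deliberately never allocated in this sketch
};

// Holds the payload through std::unique_ptr. The implicit copy operations are
// deleted, so an accidental copy is a compile error; moves still work.
struct UniqueOwner {
    std::unique_ptr<Payload> payload = std::make_unique<Payload>();
};

static_assert(std::is_copy_constructible<RawOwner>::value,
              "raw-pointer owner is silently copyable");
static_assert(!std::is_copy_constructible<UniqueOwner>::value,
              "unique_ptr owner cannot be copy-constructed");
static_assert(!std::is_copy_assignable<UniqueOwner>::value,
              "unique_ptr owner cannot be copy-assigned");
static_assert(std::is_move_constructible<UniqueOwner>::value,
              "unique_ptr owner can still be moved");
```

The same effect can be had by explicitly deleting the copy operations, but holding the resource in a smart pointer also removes the need for a hand-written destructor.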

My second remark is a bit tougher. I know you’ve put thought and care into this design, but I’ll have to ask for large changes. The PreparedInsert is not needed here, and the API is simpler without it. The insert operation should be simple and not require many visible moving parts. Ideally, I would approach it like this:

Block block = client.BeginInsert("INSERT INTO test_clickhouse_cpp_insert VALUES");
for (const auto& td : TEST_DATA) {
    id->Append(td.id);
    name->Append(td.name);
    f->Append(td.f);
}
client.SendData(block);
...
client.SendData(block);
...
client.SendData(block);
client.EndInsert();

The main points here are:

  1. BeginInsert and EndInsert clearly form a pair and serve one another.
  2. It’s unambiguous that no other insert or select statements should occur between them. The current PreparedInsert design creates room for sharing the PreparedInsert around, which risks losing track of the connection state and using the client object for something else in the meantime. The proposed pattern enforces a clear principle: one operation → one connection → one client object. Need another parallel operation? Create another client.
  3. Here the Block object is detached, and ownership is passed to the user code. The user knows it’s not an internal part of PreparedInsert and can freely modify it if needed.
  4. You can still preserve automatic EndInsert behavior when the client goes out of scope by tracking its state - if it’s in insert mode, call EndInsert in the destructor.
  5. I would avoid using the word Prepare... here, because it seems to imply something a bit different from what we are trying to achieve here.
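Point 4 could look roughly like the sketch below. `SketchConnection` and its log are hypothetical stand-ins, not the real Client; the log only makes the order of calls observable.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a client that tracks whether an insert is open.
class SketchConnection {
public:
    explicit SketchConnection(std::vector<std::string>& log) : log_(log) {}

    // One operation, one connection, one client object: no copies.
    SketchConnection(const SketchConnection&) = delete;
    SketchConnection& operator=(const SketchConnection&) = delete;

    // If the caller forgot EndInsert, finish the insert on destruction.
    ~SketchConnection() {
        if (inserting_) EndInsert();
    }

    void BeginInsert(const std::string& query) {
        assert(!inserting_ && "an insert is already in progress");
        inserting_ = true;
        log_.push_back("begin: " + query);
    }

    // Safe to call more than once; only the first call has an effect.
    void EndInsert() {
        if (!inserting_) return;
        inserting_ = false;
        log_.push_back("end");
    }

    bool inserting() const { return inserting_; }

private:
    std::vector<std::string>& log_;
    bool inserting_ = false;
};
```

A real implementation would talk to the server in EndInsert, so it would also need to make sure no exception escapes the destructor.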

Thank you again for your work. Please let me know if you’d like any help, I’d be happy to assist.

Contributor Author

theory commented Nov 5, 2025

Thank you for the design suggestions. I'll work on them this afternoon.

@theory theory changed the title Add PreparedInsert flow Add BeginInsert/InsertData/EndInsert flow Nov 5, 2025
Contributor Author

theory commented Nov 5, 2025

Done in a91ff8a.

*/
std::unique_ptr<Block> BeginInsert(const std::string& query);
std::unique_ptr<Block> BeginInsert(const std::string& query, const std::string& query_id);
void InsertData(Block& block);
Contributor Author

Holler if you'd rather pass a std::unique_ptr<Block>. Seems okay to me to pass a *block instead, but I'm not yet up to snuff on idiomatic C++.

Contributor

Does BeginInsert have to return std::unique_ptr? It seems to me that it doesn't need to be a pointer at all, i.e.:

Block BeginInsert(const std::string& query);

InsertData looks good, except it should be

void InsertData(const Block& block);

Contributor Author

But InsertData does modify the block, by design:

void Client::Impl::InsertData(Block& block) {
    assert(inserting);
    block.RefreshRowCount();
    SendData(block);
    block.Clear();
}

Would you rather that refreshing the count and clearing be done by the caller?

Contributor Author

Switched to returning a Block in 0a3da16, and also moved the docs to the README. Diff.

Member

@serprex serprex Nov 7, 2025

No, only new Block() would need delete. Destructor handles cleaning up vector's heap allocation

Contributor Author

Changed to void InsertData(const Block& block); in 17467b7.

@theory theory requested review from serprex and slabko November 5, 2025 19:32
@theory theory force-pushed the insert-block branch 5 times, most recently from d054c36 to 17467b7 Compare November 7, 2025 21:38