Add support for column types: PostgreSQL JSON/JSONB and RedShift Super #15
Hi @nicolasaldecoa, looks like a great start! It does look like a tricky problem, because the databases have very limited functionality to deal with things like that. However, false positives only need to affect performance, because we can post-process the rows before diffing them in Python. So, we can parse the JSON columns ourselves using So the algorithm would look something like this:
I'm also concerned about true negatives, though they will be rare. That could happen, for example if the JSON you're minifying isn't originally minified, and the other is. Perhaps that's a price we can pay, but that one would completely evade our notice, unlike the false positives. But I don't see a way out at the moment, so I think we should reluctantly accept it as an edge-case. As for the tests, that part is still in data-diff, because we haven't yet migrated it. You can find it here: https://github.com/datafold/data-diff/blob/master/tests/test_database_types.py But you can use it to test sqeleton, by setting the installation to your active branch. (like with The idea is to add a "json" type to the tests (like there is "uuid" and "boolean"), so in the future it will be easy to add tests against other databases. And of course you'll have to add Hopefully that all makes sense. If you have any questions, or need more help, don't hesitate to ask. |
@erezsh thank you for the comments. 1. I've actually implemented a solution like the one you mention in a layer that I'm building on top of data-diff. It's part of other stuff that I added as a post-processing step to the output of Here's a Gist with the logic that I'm using to discard false hits with JSON values (not the actual code, but the same logic):
So in my case, I have this kind of solved: I don't need that feature in data-diff, but if you consider those to be false positives, at least a warning is needed in my opinion.
2. By this you meant false negatives? Like cases where we don't see a difference if one JSON is minified in one table and not in the other? Regarding that, it is indeed a problem if the user cares about detecting two equivalent JSON objects with different serialization... I think that most of the time you would care more about not getting a false positive in those cases, because the way different engines serialize JSON may vary, especially if they save the object in binary. 3. Great, I'll take a look at the tests soon. |
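The post-processing idea discussed in the two comments above (parse the JSON values in Python and drop rows that differ only in serialization) could be sketched roughly as follows. This is an illustration, not the snippet or Gist from the original comments: the function names, the row layout (tuples), and the assumption that differing rows have already been paired by key are all hypothetical.

```python
import json
from typing import Any, Iterable, Iterator, Set, Tuple

Row = Tuple[Any, ...]


def same_json(a: Any, b: Any) -> bool:
    """Return True when two JSON texts decode to equal Python objects."""
    if a is None or b is None:
        return a is None and b is None
    try:
        return json.loads(a) == json.loads(b)
    except (TypeError, ValueError):
        # Not valid JSON text: fall back to a plain comparison.
        return a == b


def drop_json_false_positives(
    pairs: Iterable[Tuple[Row, Row]], json_columns: Set[int]
) -> Iterator[Tuple[Row, Row]]:
    """Yield only the row pairs that still differ after parsing JSON columns.

    `pairs` holds (row_from_table_a, row_from_table_b) tuples for the same key;
    `json_columns` holds the positions of the JSON columns (both hypothetical).
    """
    for left, right in pairs:
        equivalent = all(
            same_json(l, r) if i in json_columns else l == r
            for i, (l, r) in enumerate(zip(left, right))
        )
        if not equivalent:
            yield left, right
```

Comparing parsed objects ignores both whitespace and key order. One caveat: Python treats `True == 1` as true, so this comparison is slightly looser than strict JSON equality.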
Hey @erezsh, I've been busy these last couple of days, got back to this a few hours ago. I also started working on adding a json type to the tests, but I couldn't set up data-diff's dev environment in order to run them. Is that docker-compose supposed to work locally? Thanks! |
What's the status of this change? There is a related issue with BigQuery at the moment (datafold/data-diff#445), and I'd like to use the JSON functionality added here (BQ struct and array can be converted to JSON). I could work off of this branch as a base for now, but I'm wondering if there's still a desire to add this PR. |
Hello @dlawin, I was pushing to get this merged so I could use the main project instead of a custom fork, but haven't had time to keep following up on this. Not sure if there's anything in my code that should be changed. If you want to pick it up, add the other features and help get the branch merged, that would be great. In that case, let me know if I can help you out with anything. Not sure if @erezsh is still interested in adding this functionality. |
Sorry, I'm currently focused on other projects. Maybe @williebsweet can help you. |
Thanks for the context and work here @nicolasaldecoa , that's definitely possible -- I'll reach out if I have questions |
@dlawin Hello again, I forgot to link this related data-diff PR: |
Changes
JSONType, RedShiftSuper and PostgresqlJSON to database_types
normalize_json method to AbstractMixin_NormalizeValue
Mixin_NormalizeValue for postgresql and redshift
Normalization rationale
RedShift
representation of the json object.
json_serialize(NULL) returns '' (empty string), we use nvl2 to leave NULL values unchanged.
PostgreSQL
In PostgreSQL, when json type columns are cast to text, they keep the format that the object had when the value was
inserted. jsonb behaves differently and adds whitespace after the separators : and ,.
I couldn't find any native function in PostgreSQL that returns the minified JSON representation, except for
json_strip_nulls, which does it incidentally but removes keys that have null values.
The idea was to always serialize JSON values in a minified format, which is why we are using the string replace
function to standardize the output in PostgreSQL. This method has its pitfalls, but I'm not sure if there are any
better options in PostgreSQL.
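As a rough illustration of the two normalization strategies described above, the SQL expressions could be built along these lines. This is a sketch under assumptions, not the code in this PR: the function names are hypothetical, and the exact patterns passed to replace in the PostgreSQL case are a guess based on the separators mentioned in the text.

```python
def normalize_json_redshift(column: str) -> str:
    """Build a Redshift expression that serializes a SUPER column to text."""
    # json_serialize(NULL) returns '' on Redshift, so nvl2 keeps NULL as NULL.
    return f"nvl2({column}, json_serialize({column}), NULL)"


def normalize_json_postgresql(column: str) -> str:
    """Build a PostgreSQL expression that strips the spaces jsonb adds."""
    # Casting jsonb to text puts a space after ':' and ','.
    # The replaced patterns below are an assumption, not taken from the PR.
    expr = f"{column}::text"
    for pattern, replacement in (('": ', '":'), (', ', ',')):
        expr = f"replace({expr}, '{pattern}', '{replacement}')"
    return expr
```

For a column named `data`, the second function returns `replace(replace(data::text, '": ', '":'), ', ', ',')`.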
Tech debt:
files contain any of the patterns that we are replacing.
their keys in different order. This is easy to solve in post-processing using python, but I don't know if there is an
efficient way to do it as part of the query.
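The two tech-debt items above appear to concern string values that contain the replaced patterns, and objects whose keys are in a different order. A small Python sketch of both effects follows, assuming the replacement strips the space after ':' and ','. The helper names are hypothetical, and `naive_minify` only mimics the SQL replacement locally.

```python
import json


def naive_minify(text: str) -> str:
    """Mimic the SQL string replacement: strip the space after ':' and ','."""
    return text.replace('": ', '":').replace(', ', ',')


def canonical(text: str) -> str:
    """Parse and re-serialize with sorted keys and compact separators."""
    return json.dumps(json.loads(text), sort_keys=True, separators=(",", ":"))


# The string value "a, b" contains one of the replaced patterns.
sample = '{"note": "a, b", "id": 1}'
damaged = naive_minify(sample)   # the value becomes "a,b"
stable = canonical(sample)       # the value is preserved, keys are sorted
```

Here `naive_minify` rewrites the string value itself, so two rows that differ only in that whitespace would compare equal, while `canonical` leaves values intact and also makes key order irrelevant. The canonical form needs a real JSON parser, which is why it fits post-processing better than the query.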
So far, the results using this code have been good enough for us and this is much better than not having support for
the data type, but maybe there should be a warning saying that false positives are possible
(maybe setting supported=False in JSONtype as it is in Text).
Tests
Can you help me out with the tests? Would it just be testing that the normalized functions return the strings that
we expect, or do we need to create temporary tables to query from?
We have dummy PG and RS tables to test this locally with data-diff. These are the relevant parts of the examples
that we're using:
PostgreSQL
RedShift
Detected differences (using data-diff with the sqeleton branch of this PR installed)
True Negatives: 400
False Negatives: None
True Positives: 401, 402, 403, 404, 501, 502, 503, 504