This repository was archived by the owner on May 2, 2023. It is now read-only.

Conversation

nicolasaldecoa
Contributor

@nicolasaldecoa nicolasaldecoa commented Feb 1, 2023

Changes

  • Add JSONType, RedShiftSuper and PostgresqlJSON to database_types
  • Add normalize_json method to AbstractMixin_NormalizeValue
  • Implement normalize_json methods in Mixin_NormalizeValue for postgresql and redshift

Normalization rationale

RedShift

def normalize_json(self, value: str, _coltype: RedShiftSuper) -> str:
    return f'nvl2({value}, json_serialize({value}), NULL)'
  • json_serialize returns a minified varchar representation of the JSON object.
  • Since json_serialize(NULL) returns '' (an empty string), we use nvl2 to leave NULL values unchanged.
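To illustrate why the nvl2 guard is needed, here is a minimal Python stand-in for the Redshift semantics described above (the behavior of json_serialize on NULL is taken from the note; the helper functions are just illustrative stubs):

```python
def json_serialize(value):
    # Stand-in for Redshift's json_serialize; the real function minifies a
    # SUPER value, but (as noted above) yields '' for NULL rather than NULL.
    return "" if value is None else value

def nvl2(x, a, b):
    # nvl2(x, a, b) returns a when x is not NULL, otherwise b.
    return a if x is not None else b

# Without the guard, a NULL would turn into an empty string:
assert json_serialize(None) == ""
# With nvl2, the NULL survives normalization:
assert nvl2(None, json_serialize(None), None) is None
```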

PostgreSQL

def normalize_json(self, value: str, _coltype: PostgresqlJSON) -> str:
    """
    Converts json or jsonb values to their minified (most compact) string representation.
    Removes whitespace after the JSON separators (':' and ',') using string replacement.
    '{"a": 1, "b": true, "c": "x"}' (jsonb::text) -> '{"a":1,"b":true,"c":"x"}'
    Comparisons to JSON from other DB types can give false positives if string
    values include any of the replaced patterns, or if the keys have a different order.
    """
    return f"replace(replace({value}::text, '\": ', '\":'), ', \"', ',\"')"

In Postgres, json columns cast to text preserve the exact formatting the value had when it
was inserted. jsonb behaves differently: it stores a parsed representation and, when cast to
text, adds a whitespace after the separators ':' and ','.

I couldn't find any native function in Postgres that returns the minified JSON representation,
except for json_strip_nulls, which does it incidentally but also removes keys that have
null values.

The idea was to always serialize JSON in a minified format, which is why we use the string
replace function to standardize the output in Postgres. This approach has its pitfalls, but
I'm not sure there are better options in Postgres.
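The replace-based minification, and its main pitfall, can be mirrored in Python (the helper name here is made up for illustration; the patterns are the ones from normalize_json above):

```python
def minify_like_postgres(text: str) -> str:
    # Emulates: replace(replace(col::text, '": ', '":'), ', "', ',"')
    return text.replace('": ', '":').replace(', "', ',"')

# Intended effect: jsonb's text form loses the extra whitespace.
assert minify_like_postgres('{"a": 1, "b": true}') == '{"a":1,"b":true}'

# Pitfall: an escaped quote inside a string value can recreate one of the
# patterns, so the value itself gets rewritten (here 'x": y' becomes 'x":y').
assert minify_like_postgres('{"a": "x\\": y"}') == '{"a":"x\\":y"}'
```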

Tech debt:

  • False positives when looking for differences between Postgres and other types of DBs, if string
    values inside the JSON contain any of the patterns that we are replacing.
  • False positives when looking for differences if two jsonb objects hold the same information but are
    serialized with their keys in a different order. This is easy to solve in post-processing using
    Python, but I don't know of an efficient way to do it as part of the query.
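The second bullet is easy to reproduce; as noted, comparing the parsed objects in Python post-processing resolves it:

```python
import json

# Same data, different key order: the serialized strings differ...
a = '{"keyText":"matches","keyInt":3}'
b = '{"keyInt":3,"keyText":"matches"}'
assert a != b

# ...but the parsed objects compare equal, so post-processing can
# discard the false positive.
assert json.loads(a) == json.loads(b)
```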

So far, the results using this code have been good enough for us, and this is much better than having
no support for the data type, but maybe there should be a warning saying that false positives are
possible (or maybe set supported=False in JSONType, as is done for Text).

Tests

Can you help me out with the tests? Would it just be testing that the normalize functions return the
strings we expect, or do we need to create temporary tables to query from?

We have dummy PG and RS tables to test this locally with data-diff. These are the relevant parts of the examples
that we're using:

PostgreSQL

CREATE TABLE dev.data_diff (
    id int4 NOT NULL,
    field_json json NULL,
    field_jsonb jsonb NULL,
    CONSTRAINT data_diff_pkey PRIMARY KEY (id)
);

INSERT INTO dev.data_diff (id, field_json) VALUES
       (400, '{"keyText": "matches", "keyInt": 3, "keyFloat": 5.4445, "keyBoolean": true}'),
       (401, '{"keyInt": 3}'),
       (402, '{"keyFloat": 5.4445}'),
       (403, '{"keyBoolean": true}'),
       (404, '{"keyInt": 3, "keyFloat": 5.4445, "keyBoolean": true}');

INSERT INTO dev.data_diff (id, field_jsonb) VALUES
       (500, '{"keyText": "matches", "keyInt": 3, "keyFloat": 5.4445, "keyBoolean": true}'),
       (501, '{"keyInt": 3}'),
       (502, '{"keyFloat": 5.4445}'),
       (503, '{"keyBoolean": true}'),
       (504, '{"keyInt": 3, "keyFloat": 5.4445, "keyBoolean": true}');

RedShift

CREATE TABLE dev.data_diff (
    id int4 NOT NULL,
    field_json super NULL,
    field_jsonb super NULL,
    CONSTRAINT data_diff_pkey PRIMARY KEY (id)
);

INSERT INTO dev.data_diff (id, field_json) VALUES
       (400, JSON_PARSE('{"keyText": "matches", "keyInt": 3, "keyFloat": 5.4445, "keyBoolean": true}')),
       (401, JSON_PARSE('{"keyInt": 4}')),
       (402, JSON_PARSE('{"keyFloat": 5.43}')),
       (403, JSON_PARSE('{"keyBoolean": false}')),
       (404, JSON_PARSE('{"keyInt": 4, "keyFloat": 5.4445, "keyBoolean": true}'));

INSERT INTO dev.data_diff (id, field_jsonb) VALUES
       (500, JSON_PARSE('{"keyText": "matches", "keyInt": 3, "keyFloat": 5.4445, "keyBoolean": true}')),
       (501, JSON_PARSE('{"keyInt": 4}')),
       (502, JSON_PARSE('{"keyFloat": 5.43}')),
       (503, JSON_PARSE('{"keyBoolean": false}')),
       (504, JSON_PARSE('{"keyInt": 4, "keyFloat": 5.4445, "keyBoolean": true}'));

Detected differences (using data-diff with the sqeleton branch of this PR installed)

  • True Negatives: 400

  • False Negatives: None

  • True Positives: 401, 402, 403, 404, 501, 502, 503, 504

# ID 401
field_json:          '{"keyInt":3}' -> '{"keyInt":4}'

# ID 402
field_json:          '{"keyFloat":5.4445}' -> '{"keyFloat":5.43}'

# ID 403
field_json:          '{"keyBoolean":true}' -> '{"keyBoolean":false}'

# ID 404
field_json:          '{"keyInt":3,"keyFloat":5.4445,"keyBoolean":true}' -> '{"keyInt":4,"keyFloat":5.4445,"keyBoolean":true}'

# ID 501
field_jsonb:         '{"keyInt":3}' -> '{"keyInt":4}'

# ID 502
field_jsonb:         '{"keyFloat":5.4445}' -> '{"keyFloat":5.43}'

# ID 503
field_jsonb:         '{"keyBoolean":true}' -> '{"keyBoolean":false}'

# ID 504
field_jsonb:         '{"keyInt":3,"keyFloat":5.4445,"keyBoolean":true}' -> '{"keyInt":4,"keyFloat":5.4445,"keyBoolean":true}'
  • False Positives: 500 (I'm dealing with these types of cases in post-processing)
# ID 500
field_jsonb:         '{"keyInt":3,"keyText":"matches","keyFloat":5.4445,"keyBoolean":true}' -> '{"keyText":"matches","keyInt":3,"keyFloat":5.4445,"keyBoolean":true}'

@erezsh
Contributor

erezsh commented Feb 1, 2023

Hi @nicolasaldecoa ,

Looks like a great start!

It does look like a tricky problem, because the databases have very limited functionality to deal with things like that.

However, false positives only need to affect performance, because we can post-process the rows in Python before diffing them: parse the JSON columns ourselves using json.loads(), and also sort the keys. That mechanism doesn't currently exist, because we haven't needed it, but I believe it shouldn't be hard to add. The effect on performance could be noticeable, though, because we would be downloading rows that aren't actually different. I think the most user-friendly way to address this is, when we detect false positives (i.e. values that are equal, but compared unequal before parsing), to warn the user about them.

So the algorithm would look something like this:

different = []
for row1, row2 in pairs_of_rows_by_key:
    if row1 != row2:
        parsed1 = parse_row(row1)
        parsed2 = parse_row(row2)
        if parsed1 != parsed2:
            different.append((row1, row2))
        else:
            logger.warning("False-positive in row in column ..., may affect performance!")

I'm also concerned about true negatives, though they will be rare. That could happen, for example, if the JSON you're minifying isn't originally minified and the other one is. Perhaps that's a price we can pay, but that case would completely evade our notice, unlike the false positives.

But I don't see a way out at the moment, so I think we should reluctantly accept it as an edge-case.

As for the tests, that part is still in data-diff, because we haven't yet migrated it. You can find it here: https://github.com/datafold/data-diff/blob/master/tests/test_database_types.py

But you can use it to test sqeleton, by setting the installation to your active branch. (like with pip install -e .)

The idea is to add a "json" type to the tests (like the existing "uuid" and "boolean"), so in the future it will be easy to add tests against other databases. And of course you'll have to add a JsonFaker, where you can add all of the cases and edge cases we want to cover.
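For context, a JsonFaker could look something like this rough sketch. The interface here (an iterable with a max size and a list of hand-picked values) is my assumption about how the existing fakers in tests/test_database_types.py work, so the real shape may differ:

```python
class JsonFaker:
    # Hypothetical sketch -- the actual faker interface in data-diff's
    # tests/test_database_types.py may differ.
    MANUAL_FAKES = [
        '{"keyText": "matches", "keyInt": 3, "keyFloat": 5.4445}',
        '{"keyBoolean": true}',
        '{"nested": {"a": [1, 2, null]}}',
        '{}',
    ]

    def __init__(self, max_size):
        self.max_size = max_size

    def __len__(self):
        return min(self.max_size, len(self.MANUAL_FAKES))

    def __iter__(self):
        return iter(self.MANUAL_FAKES[: self.max_size])

assert len(list(JsonFaker(2))) == 2
```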

Hopefully that all makes sense. If you have any questions, or need more help, don't hesitate to ask.

@nicolasaldecoa
Contributor Author

nicolasaldecoa commented Feb 1, 2023

@erezsh thank you for the comments.

1

I've actually implemented a solution like the one you mention in a layer that I'm building on top of data-diff. It's part of other post-processing I added on top of the output of diff_tables, because I have some other specific requirements: splitting rows missing in A or B from rows with differences in the extra columns, getting only the values of the columns that differ, grouping types of differences, formatting for readability, tolerating certain differences, etc.

Here's a Gist with the logic that I'm using to discard false hits with jsons (not the actual code, but the same logic):

def json_match(val_a, val_b):
    try:
        a = json.loads(val_a)
        b = json.loads(val_b)
    except (ValueError, TypeError, json.decoder.JSONDecodeError):  # not valid JSON
        return False
    # compare the parsed objects
    return a == b

def _match(val_a, val_b):
    return (val_a == val_b) or (json_match(val_a, val_b))
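A quick usage check of that logic (repeating the helpers in slightly condensed form so the snippet runs standalone; note json.JSONDecodeError is already a subclass of ValueError):

```python
import json

def json_match(val_a, val_b):
    # Condensed version of the helper above, for a self-contained example.
    try:
        return json.loads(val_a) == json.loads(val_b)
    except (ValueError, TypeError):
        return False

def _match(val_a, val_b):
    return (val_a == val_b) or json_match(val_a, val_b)

# Equivalent JSON with different whitespace/key order is not a diff:
assert _match('{"a": 1, "b": 2}', '{"b":2,"a":1}')
# Genuinely different values still mismatch:
assert not _match('{"a": 1}', '{"a": 2}')
# Non-JSON strings fall back to plain string equality:
assert _match("abc", "abc")
```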

So in my case I have this kinda solved, and I don't need that feature in data-diff myself, but if you consider those to be false positives, at least a warning is needed in my opinion.

2

I'm also concerned about true negatives, though they will be rare.

By this, did you mean false negatives? That is, cases where we don't see a difference because one JSON is minified in one table and not in the other? Regarding that, it is indeed a problem if the user cares about detecting two equivalent JSON objects with different serializations... I think that most of the time you would care more about not getting a false positive in those cases, because the way different engines serialize JSON varies, especially if they store the object in binary.

3

Great, I'll take a look at the tests soon.

@erezsh
Contributor

erezsh commented Feb 1, 2023

  1. I consider false-positive detection necessary to merge this PR. But if you don't feel like doing that, maybe I can take care of that part.

  2. Yes, I meant false negatives, my bad.

  3. Great. Let me know if you have any questions.

@nicolasaldecoa
Contributor Author

  1. I might be able to help out with that if you point me to the part of the data-diff pipeline where you would add the comparison between JSON objects before recording them as a diff (I haven't had time to do a deep dive into data-diff's code yet). Maybe that step should be optional (controlled by a parameter), because it probably won't scale very well performance-wise, as you said.

  2. I'll open another PR in data-diff then with the tests and what we are trying to add in (1.)

@erezsh
Contributor

erezsh commented Feb 2, 2023

@nicolasaldecoa
Contributor Author

nicolasaldecoa commented Feb 7, 2023

Hey @erezsh, I've been busy these last couple of days; I got back to this a few hours ago.
I've just drafted a way to check whether the diff detections come from JSONs that were serialized differently (with this addition it might be OK to remove the replace in the Postgres normalize_json).
I would love some help with the comments I left there tagged as TODO:, because I'm not sure what we want to do when we detect that two rows were a false positive. Also, the whole thing could be made more efficient if we pass diff_sets a boolean indicating whether there are any JSON columns in the schema to worry about, or even the indices of those columns, so we can reduce the number of checks.

I also started working on adding a json type to the tests, but I couldn't set up data-diff's dev environment to run them. Is that docker-compose supposed to work locally?
I'm also a bit confused about the way data should be inserted and read in the mock DBs; I hope you can help me with that as well.

thanks!

@dlawin
Contributor

dlawin commented Mar 28, 2023

What's the status of this change? There is a related issue with BigQuery at the moment (datafold/data-diff#445) and I'd like to use the JSON functionality added here (BQ struct and array can be converted to JSON).

I could work off of this branch as a base for now, but wondering if there's still a desire to add this PR

@nicolasaldecoa
Contributor Author

What's the status of this change? There is a related issue with BigQuery at the moment datafold/data-diff#445 and I'd like to use the json functionality added here (BQ struct and array can be converted to json)

I could work off of this branch as a base for now, but wondering if there's still a desire to add this PR

Hello @dlawin, I was pushing to get this merged so I could use the main project instead of a custom fork, but I haven't had time to keep following up on this. Not sure if there's anything in my code that should be changed.

If you want to pick it up, add the other features and help getting the branch merged, that would be great. In that case, let me know if I can help you out with anything.

Not sure if @erezsh is still interested in adding this functionality.

@erezsh
Contributor

erezsh commented Mar 30, 2023

Sorry, I'm currently focused on other projects.

Maybe @williebsweet can help you.

@dlawin
Contributor

dlawin commented Mar 31, 2023

If you want to pick it up, add the other features and help getting the branch merged, that would be great. In that case, let me know if I can help you out with anything.

Thanks for the context and work here @nicolasaldecoa , that's definitely possible -- I'll reach out if I have questions

@nicolasaldecoa
Contributor Author

If you want to pick it up, add the other features and help getting the branch merged, that would be great. In that case, let me know if I can help you out with anything.

Thanks for the context and work here @nicolasaldecoa , that's definitely possible -- I'll reach out if I have questions

@dlawin Hello again, I forgot to link this related data-diff PR:
datafold/data-diff#383
