@luigidellaquila
Contributor

@luigidellaquila luigidellaquila commented Apr 19, 2022

Context:

In SQL and EQL we are trying to add support for VERSION fields, see #85502

But ORDER BY <a calculated value> does not work properly when the calculated value is of type Version, since the calculation is translated to a Painless script.

We tracked down the problem to the following issue: #85989

We identified the following root problems:

  • sort _script only accepts string and number as type; neither is a good fit for VERSION fields
  • defining a new ScriptSortType to handle generic BytesRef values (which we can obtain from VERSION, giving us the right sort order) does not help on its own, since a Version BytesRef is not a valid UTF-8 string and so cannot be properly formatted in the result
  • ScriptSortBuilder has a hardwired DocValueFormat.RAW formatter that cannot be configured in the query
  • mapper-version is an x-pack plugin, so we would probably not move the version parsing/formatting logic to Server
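
To illustrate the first point, here is a minimal Java sketch of why plain string ordering is a poor fit for version values. It is not Elasticsearch code: VersionOrderSketch and its methods are hypothetical names, and the comparison only handles dotted numeric versions (no pre-release or build suffixes).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for illustration only; not part of Elasticsearch.
class VersionOrderSketch {

    // Compares dotted numeric versions part by part, so "1.10.0" > "1.9.0".
    // Simplification: integer parts only, missing parts are treated as 0.
    static int compareVersions(String a, String b) {
        String[] left = a.split("\\.");
        String[] right = b.split("\\.");
        int length = Math.max(left.length, right.length);
        for (int i = 0; i < length; i++) {
            long l = i < left.length ? Long.parseLong(left[i]) : 0L;
            long r = i < right.length ? Long.parseLong(right[i]) : 0L;
            if (l != r) {
                return Long.compare(l, r);
            }
        }
        return 0;
    }

    // Lexicographic ordering, roughly what a "string" script sort would give.
    static List<String> sortAsStrings(List<String> versions) {
        List<String> copy = new ArrayList<>(versions);
        copy.sort(Comparator.naturalOrder());
        return copy;
    }

    // Version-aware ordering using the comparison above.
    static List<String> sortAsVersions(List<String> versions) {
        List<String> copy = new ArrayList<>(versions);
        copy.sort(VersionOrderSketch::compareVersions);
        return copy;
    }

    public static void main(String[] args) {
        List<String> versions = List.of("1.10.0", "1.9.0", "1.2.0");
        System.out.println(sortAsStrings(versions));   // 1.10.0 sorts first
        System.out.println(sortAsVersions(versions));  // 1.10.0 sorts last
    }
}
```

A number sort type does not help either, because a full version string cannot be reduced to a single number.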

Fixes #85989
Fixes #82287

@luigidellaquila
Contributor Author

luigidellaquila commented Apr 19, 2022

With the reproducer described in #85989 and with this PR applied, the following query works:

{
    "size": 1000,
    "_source": false,
    "fields": [
        {
            "field": "name"
        },
        {
            "field": "version"
        }
    ],
    "sort": [
        {
            "_script": {
                "script": {
                    "lang": "painless",
                    "source":"$('version', new Version('0'))"
                },
                "type": "custom",
                "format": "version"
            }
        }
    ]
}

"format": "version" is an addition that would be good to avoid, if we find a way to infer the DocValueFormat from the script

reduce the impact of the change
@luigidellaquila
Contributor Author

Further reduced the impact of the changes: the sort now uses type directly to infer the DocValueFormat, so the separate format parameter is no longer needed

{
    "size": 1000,
    "_source": false,
    "fields": [
        {
            "field": "name"
        },
        {
            "field": "version"
        }
    ],
    "sort": [
        {
            "_script": {
                "script": {
                    "lang": "painless",
                    "source":"$('version', new Version('0'))"
                },
                "type": "version"
            }
        }
    ]
}

@luigidellaquila
Contributor Author

@elasticmachine update branch

@cbuescher
Member

  • mapper-version is an x-pack plugin, so we would probably not move the version parsing/formatting logic to Server

This decision was made a while ago. Moving some (or even all) of the type implementation to server could be reconsidered if it would make things easier or allow for better reuse in this PR. I haven't looked closely at the code yet, but if you think this would help, we can certainly discuss what changes are possible here.

@luigidellaquila
Contributor Author

Thank you very much for your quick feedback @cbuescher

For now, the only blocker I have regarding x-pack is the following:

  • "source":"$('version', new Version('0'))" returns a Version, and I have the BytesRef, so everything works fine
  • "source": "doc['version'].value" returns a String, so it should be encoded to Version, but I didn't find a way to access VersionEncoder without breaking the dependencies

The solution with $ is enough for my immediate needs (i.e. the SQL fix), but it seems a half-baked solution to me, and since many users are used to doc['version'].value syntax, they would probably expect it to work as well.

Please take your time to review the code and ping me if you have some time to discuss the solution

Thanks

Luigi

public class Version implements ToXContent, BytesRefProducer, Comparable<Version> {
    protected String version;

    public Version(String version) {
Contributor

I wonder if we should have another constructor that takes the encoded version, this String version is mostly for users.

Contributor Author

Do you mean a protected Version(String version, BytesRef ref) for internal use only, so that we don't have to recalculate the BytesRef? I guess a constructor taking only a BytesRef would have the mirror-image problem, i.e. we would have to decode it to get the string value.

Member

From a quick look, I think we can have a ctor that takes only the BytesRef and decodes it to a String once internally; the String is then used for the xContent/toString part and the BytesRef for comparison. Version only seems to be created from VersionStringDocValuesField, where we can also get the bytes directly. In fact, we currently do the decoding step there once in getInternal; that could happen in the ctor instead.

Contributor

@stu-elastic stu-elastic Apr 26, 2022

We still need the Version(String) constructor, which is whitelisted for Painless users. Users need to be able to provide a Version as a default value when fetching version fields, as in this test.

Contributor Author

Yes, I'm keeping both (see last commit)
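
The two-constructor design discussed in this thread could look roughly like the sketch below. It is a stand-in, not the PR's code: VersionSketch is a hypothetical name, byte[] replaces Lucene's BytesRef, and the toy encoding (digit count followed by ASCII digits for each numeric part) is an assumption made so the example is self-contained. It is not the real VersionEncoder format and ignores pre-release suffixes and leading zeros.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: a simplified stand-in for the Version class in this PR.
class VersionSketch implements Comparable<VersionSketch> {
    private final String version;  // used for toString / xContent output
    private final byte[] encoded;  // used for comparison and sorting

    // User-facing constructor: encodes the string once.
    VersionSketch(String version) {
        this.version = version;
        this.encoded = encode(version);
    }

    // Internal constructor: decodes the stored bytes once.
    VersionSketch(byte[] encoded) {
        this.encoded = encoded.clone();
        this.version = decode(encoded);
    }

    // Toy encoding (assumption): each numeric part becomes [digit count][ASCII digits],
    // so unsigned byte order matches numeric order for parts without leading zeros.
    static byte[] encode(String version) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String part : version.split("\\.")) {
            byte[] digits = part.getBytes(StandardCharsets.US_ASCII);
            out.write(digits.length);
            out.write(digits, 0, digits.length);
        }
        return out.toByteArray();
    }

    // Reverses the toy encoding above.
    static String decode(byte[] encoded) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < encoded.length) {
            int length = encoded[i++];
            if (sb.length() > 0) {
                sb.append('.');
            }
            sb.append(new String(encoded, i, length, StandardCharsets.US_ASCII));
            i += length;
        }
        return sb.toString();
    }

    byte[] toBytes() {
        return encoded.clone();
    }

    @Override
    public int compareTo(VersionSketch other) {
        // Compare the encoded form, never the display string.
        return Arrays.compareUnsigned(encoded, other.encoded);
    }

    @Override
    public String toString() {
        return version;
    }
}
```

Both constructors leave the object with the display string and the encoded bytes, so neither comparison nor output has to convert again.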

@luigidellaquila
Contributor Author

With my last two commits, all the cases seem to be covered.

I added some test cases, but I'm sure there is much more to test. @stu-elastic @cbuescher any pointers would be much appreciated, in particular about BWC, as the addition of the new ScriptSortType.VERSION concerns me a bit.

I don't much like the way I get the DocValueFormat for "version", but I didn't find a better way without moving mapper-version out of x-pack. Any suggestions on this are welcome too.

@luigidellaquila
Contributor Author

@elasticmachine update branch

Member

@cbuescher cbuescher left a comment

I did a quick pass on this and have to admit I don't fully understand the details of setting up and wiring a new ScriptSortType, but if making "Version" comparable gets the job done, I think that approach works well even without moving the VersionEncoder or other code to the server module. If a move would make things easier on your side, that's certainly something we can consider. I'd have to find out who to talk to about it, though, because it would probably involve a slight license change for the code involved.
I left a few minor comments. Maybe it would be possible to use some of the ordering tests from VersionEncoderTests to check that the Version class's compare method works as expected. Perhaps just use the cases from testEncodingOrderingSemver(), because I think they cover most of the interesting cases.

};
}

private Version getVersionFromInternal(int index) {
Member

Is this needed if we create the Version from a BytesRef directly?

Contributor Author

Yes, it makes sense, we can do the decoding in the ctor directly. We won't need getVersionFromInternal.
Fixing.

Contributor Author

I started to do this change and I realised one thing that is worth pointing out: Version(String) constructor is public API (see last Painless test in 20_scripts.yml) and it has to encode the string to get the BytesRef. So if we decide to move Version to Server module, we will have to move VersionEncoder as well.


import org.apache.lucene.util.BytesRef;

public interface BytesRefProducer {
Member

Any plans to use this interface on anything other than Version? Otherwise, maybe the instance check where this is used could simply test for Version instead. Or is this here so you don't need to reference Version from within the server module?

Contributor Author

The interface is there only to decouple this fix from x-pack; I'd be happy to remove it if we move Version to the server module.
Just for completeness, we could use this interface (and the same fix) for at least another case, that is IP field type, but the interface would still be overkill (an if/else would be more than enough probably)

Member

I see, that's what I guessed. No worries, I think if moving complicates things right now, the interface is fine. Maybe it could benefit from a small class comment explaining why it's there, so the next person doesn't ask again.
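
The decoupling role of the interface can be sketched as follows. All names here are hypothetical (BytesProducerSketch, PluginVersionSketch, ScriptSortSketch, toBytes), byte[] stands in for Lucene's BytesRef, and the String fallback is an assumption added for contrast; the actual wiring in the PR may differ.

```java
import java.nio.charset.StandardCharsets;

// Illustration only. Defined on the "server" side in this sketch.
interface BytesProducerSketch {
    byte[] toBytes();
}

// Defined on the "plugin" side in this sketch; the sort code below never names this class.
class PluginVersionSketch implements BytesProducerSketch {
    private final byte[] encoded;

    PluginVersionSketch(byte[] encoded) {
        this.encoded = encoded.clone();
    }

    @Override
    public byte[] toBytes() {
        return encoded.clone();
    }
}

// Server-side sort code: depends only on the interface, not on the plugin type.
class ScriptSortSketch {
    static byte[] sortKey(Object scriptResult) {
        if (scriptResult instanceof BytesProducerSketch) {
            return ((BytesProducerSketch) scriptResult).toBytes();
        }
        if (scriptResult instanceof String) {
            // Assumed fallback for plain strings, for illustration only.
            return ((String) scriptResult).getBytes(StandardCharsets.UTF_8);
        }
        throw new IllegalArgumentException("unsupported script result type: " + scriptResult);
    }
}
```

If Version moved into the server module, the instanceof check could name it directly and the interface could be dropped, as noted above.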

- refactor Version ctor
- add test cases for sorting
- code cleanup
…ipt_sort' into poc/custom_bytesref_script_sort
@luigidellaquila luigidellaquila changed the title PoC: allow to sort by script value using SemVer semantics Allow to sort by script value using SemVer semantics Apr 26, 2022
@luigidellaquila luigidellaquila marked this pull request as ready for review April 26, 2022 15:57
@stu-elastic
Contributor

The solution with $ is enough for my immediate needs (i.e. the SQL fix), but it seems a half-baked solution to me, and since many users are used to doc['version'].value syntax, they would probably expect it to work as well.

We want to move users towards the fields API (the $ syntax). I would want to wait for strong user demand before adding a version that works with doc.

@luigidellaquila luigidellaquila added the :Core/Infra/Scripting Scripting abstractions, Painless, and Mustache label Apr 26, 2022
@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Apr 26, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine
Collaborator

Hi @luigidellaquila, I've created a changelog YAML for you.

@luigidellaquila luigidellaquila added the :Search Foundations/Mapping Index mappings, including merging and defining field types label Apr 26, 2022
@elasticmachine elasticmachine added the Team:Search Meta label for search team label Apr 26, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-search (Team:Search)


// decouples this module from org.elasticsearch.xpack.versionfield.Version field type
// that is defined in x-pack
public interface BytesRefProducer {
Contributor

Can you add a line in the Javadoc about how this class is used?

@sethmlarson sethmlarson added the Team:Clients Meta label for clients team label Apr 26, 2022
@elasticmachine
Collaborator

Pinging @elastic/clients-team (Team:Clients)

@luigidellaquila
Contributor Author

luigidellaquila commented Apr 26, 2022

We want to move users towards the fields API (the $ syntax). I would want to wait for strong user demand before adding a version that works with doc.

No problem for me, I can use the field API from SQL, so it's not a blocker.
If we agree on this, I can easily "disable" the support for doc[] syntax in this context



---
"Sort by Version script value (asc)":
Contributor

@stu-elastic stu-elastic Apr 26, 2022

This will fail on other versions. You'll have to move this into a new YAML test with this setup:

  - skip:
      version: " - 8.2.99"
      reason: "version script field sorting support was added in 8.3.0"

- remove support for sorting based on doc['versionfield'].value
- disable version sort tests on previous ES versions
- added comments
…ipt_sort' into poc/custom_bytesref_script_sort
@luigidellaquila
Contributor Author

@stu-elastic I removed the support for doc['version'].value in this context. In case of user demand, the fix is easy

Contributor

@stu-elastic stu-elastic left a comment

Can you add the doc-values version back? After reviewing the removal, it doesn't clean much up and I'm sure users would appreciate it.

The changes on the scripting side look good.

* See: https://github.com/elastic/elasticsearch/issues/82287
*/
- public class Version implements ToXContent {
+ public class Version implements ToXContent, BytesRefProducer, Comparable<Version> {
Member

Minor nit, unrelated to this change actually, but could you change this class to implement ToXContentFragment and remove the isFragment method? We use ToXContentFragment as a marker interface for things that don't output a full JSON object, and in this case I think the class itself only outputs a value to the xcontent builder.

- re-enable support for doc[field].value syntax
- let Version implement ToXContentFragment
- add tests
@luigidellaquila
Contributor Author

luigidellaquila commented Apr 28, 2022

@stu-elastic @cbuescher thank you very much for your support.

@sethmlarson I see you labeled this PR for Team:Clients as well, is there anything specific to check before merging?

Contributor

@sethmlarson sethmlarson left a comment

This looks good from a clients perspective, no comments from me. I have a script that adds Team:Clients to PRs that change API specs or YAML tests so we can be alerted of upcoming changes.

Labels

  • >bug
  • :Core/Infra/Scripting (Scripting abstractions, Painless, and Mustache)
  • :Search Foundations/Mapping (Index mappings, including merging and defining field types)
  • Team:Clients (Meta label for clients team)
  • Team:Core/Infra (Meta label for core/infra team)
  • Team:Search (Meta label for search team)
  • v8.3.0

6 participants