Skip to content

Conversation

@manonthemat
Copy link
Contributor

No description provided.

@manonthemat
Copy link
Contributor Author

While reading through https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-replace-charfilter.html, I found the claim that a filter

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

with a document of

{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

does in fact create these tokens:

{
    "tokens": [
        {
            "token": "My",
            "start_offset": 0,
            "end_offset": 2,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "credit",
            "start_offset": 3,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "card",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "is",
            "start_offset": 15,
            "end_offset": 17,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "123_456_789",
            "start_offset": 18,
            "end_offset": 29,
            "type": "<NUM>",
            "position": 4
        }
    ]
}

As you can see "is" and "123_456_789" are separate terms, thus the small change of a added comma in the doc.

@dnhatn dnhatn added >docs General docs changes :Search Relevance/Analysis How text is split into tokens labels Jul 31, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-search-aggs

@dnhatn
Copy link
Member

dnhatn commented Jul 31, 2018

@elasticmachine test this please

@mayya-sharipova mayya-sharipova merged commit a39f6f0 into elastic:master Aug 2, 2018
@mayya-sharipova
Copy link
Contributor

@manonthemat thanks for your PR! well noticed error!

dnhatn added a commit that referenced this pull request Aug 3, 2018
* master:
  HLRC: Move commercial clients from XPackClient (#32596)
  Add cluster UUID to Cluster Stats API response (#32206)
  Security: move User to protocol project (#32367)
  [TEST] Test for shard failures, add debug to testProfileMatchesRegular
  Minor fix for javadoc (applicable for java 11). (#32573)
  Painless: Move Some Lookup Logic to PainlessLookup (#32565)
  TEST: Avoid merges in testSeqNoAndCheckpoints
  [Rollup] Remove builders from HistoGroupConfig (#32533)
  Mutes failing SQL string function tests due to #32589
  fixed elements in array of produced terms (#32519)
  INGEST: Enable default pipelines (#32286)
  Remove cluster state initial customs (#32501)
  Mutes LicensingDocumentationIT due to #32580
  [ML] Remove multiple_bucket_spans (#32496)
  [ML] Rename JobProvider to JobResultsProvider (#32551)
  Correct minor typo in explain.asciidoc for HLRC
  Build: Add elastic maven to repos used by BuildPlugin (#32549)
  Clarify the error message when a pipeline agg is used in the 'order' parameter. (#32522)
  Revert "[test] turn on host io cache for opensuse (#32053)"
  Enable packaging tests on suse boxes
  [ML] Improve error when no available field exists for rule scope (#32550)
  [ML] Improve error for functions with limited rule condition support (#32548)
  Painless: Clean Up PainlessField (#32525)
  Add @AwaitsFix for #32554
  Remove broken @link in Javadoc
  Scripting: Conditionally use java time api in scripting (#31441)
  [ML] Fix thread leak when waiting for job flush (#32196) (#32541)
  Add AwaitsFix to failing test - see #32546
  Core: Minor size reduction for AbstractComponent (#32509)
  SQL: Added support for string manipulating functions with more than one parameter (#32356)
  [DOCS] Reloadable Secure Settings (#31713)
  Watcher: Reenable HttpSecretsIntegrationTests#testWebhookAction test (#32456)
  [Rollup] Remove builders from TermsGroupConfig (#32507)
  Use hostname instead of IP with SPNEGO test (#32514)
  Switch x-pack rolling restart to new style Requests (#32339)
  NETWORKING: Fix Netty Leaks by upgrading to 4.1.28 (#32511)
  [DOCS] Small fixes in rule configuration page (#32516)
  Painless: Clean up PainlessMethod (#32476)
  Build: Remove shadowing from benchmarks (#32475)
  Docs: Add all JDKs to CONTRIBUTING.md
  Add licensing enforcement for FIPS mode (#32437)
  SQL: Add test for handling of partial results (#32474)
  Mute testFilterCacheStats
  [ML][DOCS] Fix typo applied_to => applies_to
  Scripting: Fix painless compiler loader to know about context classes (#32385)
dnhatn added a commit that referenced this pull request Aug 6, 2018
* 6.x:
  [Kerberos] Use canonical host name (#32588)
  Cross-cluster search: preserve cluster alias in shard failures (#32608)
  [TEST] Allow to run in FIPS JVM (#32607)
  Handle AlreadyClosedException when bumping primary term
  [Test] Add ckb to the list of unsupported languages (#32611)
  SCRIPTING: Move Aggregation Scripts to their own context (#32068) (#32629)
  [TEST] Enhance failure message when bulk updates have failures
  [ML] Add ML result classes to protocol library (#32587)
  Suppress LicensingDocumentationIT.testPutLicense in release builds (#32613)
  [Rollup] Improve ID scheme for rollup documents (#32558)
  Mutes failing SQL string function tests due to #32589
  Suppress Wildfly test in FIPS JVMs (#32543)
  Add cluster UUID to Cluster Stats API response (#32206)
  [ML] Add some ML config classes to protocol library (#32502)
  [TEST]Split transport verification mode none tests (#32488)
  [Rollup] Remove builders from DateHistogramGroupConfig (#32555)
  [ML] Add Detector config classes to protocol library (#32495)
  [Rollup] Remove builders from MetricConfig (#32536)
  Fix race between replica reset and primary promotion (#32442)
  HLRC: Move commercial clients from XPackClient (#32596)
  Security: move User to protocol project (#32367)
  Minor fix for javadoc (applicable for java 11). (#32573)
  Painless: Move Some Lookup Logic to PainlessLookup (#32565)
  Core: Minor size reduction for AbstractComponent (#32509)
  INGEST: Enable default pipelines (#32286) (#32591)
  TEST: Avoid merges in testSeqNoAndCheckpoints
  [Rollup] Remove builders from HistoGroupConfig (#32533)
  fixed elements in array of produced terms (#32519)
  Mutes ReindexFailureTests.searchFailure dues to #28053
  Mutes LicensingDocumentationIT due to #32580
  Remove the SATA controller from OpenSUSE box
  [ML] Rename JobProvider to JobResultsProvider (#32551)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>docs General docs changes :Search Relevance/Analysis How text is split into tokens

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants