From e71301fa0dfd8a64c66abc7204793f6071bb32af Mon Sep 17 00:00:00 2001 From: David Roberts Date: Fri, 14 Sep 2018 17:18:22 +0100 Subject: [PATCH 1/3] [DOCS][ML] Document the ML find_file_structure endpoint Relates #33471 Relates #33630 --- .../ml/apis/find-file-structure.asciidoc | 477 ++++++++++++++++++ docs/reference/ml/apis/ml-api.asciidoc | 8 + 2 files changed, 485 insertions(+) create mode 100644 docs/reference/ml/apis/find-file-structure.asciidoc diff --git a/docs/reference/ml/apis/find-file-structure.asciidoc b/docs/reference/ml/apis/find-file-structure.asciidoc new file mode 100644 index 0000000000000..7811c9091c23d --- /dev/null +++ b/docs/reference/ml/apis/find-file-structure.asciidoc @@ -0,0 +1,477 @@ +[role="xpack"] +[testenv="basic"] +[[ml-find-file-structure]] +=== Find File Structure API +++++ +Find File Structure +++++ + +experimental[] + +Finds the structure of a text file that contains data suitable to be ingested +into Elasticsearch. + +==== Request + +`POST _xpack/ml/find_file_structure` + + +==== Description + +The aim of this endpoint is to provide a starting point for ingesting data into +Elasticsearch in a format that will be suitable for subsequent use with other +{ml} functionality. + +Unlike other Elasticsearch endpoints, the data that is POSTed to this endpoint +need not be in JSON format, nor even UTF-8 encoded. It must, however, be text; +binary file formats are not currently supported. + +The response from the endpoint contains: + +* A couple of messages from the beginning of the file. +* Statistics revealing the most common values of all fields detected within the + file, plus basic numeric statistics for numeric fields. +* Information about the structure of the file that would be useful when writing + ingest configurations to index the file contents. +* Appropriate mappings for an Elasticsearch index into which the file contents + could be ingested. + +All this can be calculated by the endpoint with no guidance. However, +optionally it is possible to override some of the decisions about the file +structure by specifying the desired values in one or more query parameters. + +Details of the output can be seen in the +<>. + +If the endpoint produces unexpected results for a particular file, the `explain` +query parameter causes an `explanation` to appear in the response, which should +help in determining why the returned structure was chosen. + +==== Query Parameters + +`charset`:: + (string) Optionally, the character set in which the sample file is encoded. + This must be a character set that is supported by the JVM that Elasticsearch, + for example "UTF-8", "UTF-16LE", "windows-1252" or "EUC-JP". If not specified + then the structure finder will choose an appropriate character set. + +`column_names`:: + (string) If `format` is set to `delimited` then this parameter may optionally + be used to specify the column names. The value of this parameter must be a + comma separated list of column names even if the `delimiter` for the sample + file is not comma. If not specified then the structure finder will use the + column names from the header row of the sample file if it has one, and + "column1", "column2", "column3", and so on if it doesn't. + +`delimiter`:: + (string) If `format` is set to `delimited` then this parameter may optionally + be used to specify the character used to delimit the values in each row. + Only a single character is supported; the delimiter may not have multiple + characters. 
If not specified then the structure finder will consider the + following possibilities: comma, tab, semi-colon and pipe (`|`). + +`explain`:: + (boolean) If this parameter is set to `true` then the response will include a + field named `explanation` that is an array of strings that indicate how the + structure finder produced its result. The default value is `false`. + +`format`:: + (string) Optionally, the high level structure of the file, which must be one + of `json`, `xml`, `delimited` or `semi_structured_text`. If not specified + then the structure finder will decide. + +`grok_pattern`:: + (string) If `format` is set to `semi_structured_text` then this parameter may + optionally supply the Grok pattern to be used to extract fields from every + message within the sample file. The name of the timestamp field within the + Grok pattern must match what is specified in `timestamp_field`, or be + "timestamp" if that parameter is not specified. If not specified then the + structure finder will create a Grok pattern. + +`has_header_row`:: + (boolean) If `format` is set to `delimited` then this parameter may optionally + be used to force the decision on whether the column names are in the first row + of the sample file. If not specified then the structure finder will guess + based on the similarity of the first row of the sample file and other rows. + +`lines_to_sample`:: + (unsigned integer) The number of lines from the beginning of the uploaded + sample to include in the structure analysis. The minimum is 2; the default + is 1000. If the sample contains fewer lines than this parameter specifies + then, providing there are at least two lines in the sample, the analysis will + proceed and will analyse all lines provided. The more lines that are analyzed + the slower the analysis will be. The more varied the lines that are analyzed + the more useful the analysis will be. For example, if you upload a log file + where the first 1000 lines are all variations on the same message then the + analysis will find more commonality than would be seen with a bigger sample. + But, if possible, it would be more efficient to upload a sample file with more + variety in the first 1000 lines than to request analysis of 100000 lines to + achieve some variety. + +`quote`:: + (string) If `format` is set to `delimited` then this parameter may optionally + be used to specify the character used to quote the values in each row if they + contain newlines or the delimiter character. Only a single character is + supported. If not specified the default is a double quote (`"`). (If your + delimited file format does not use quoting then a workaround is to set this + argument to a character that does not appear anywhere in the sample.) + +`should_trim_fields`:: + (boolean) If `format` is set to `delimited` then this parameter may optionally + be used to specify whether values between delimiters should have whitespace + trimmed from them. If not specified then the default is `true` if the + delimiter is pipe (`|`) and `false` otherwise. + +`timestamp_field`:: + (string) Optionally, the name of the field in the sample file that contains + the primary timestamp of each record (the one that would be used to populate + the `@timestamp` field if the file were ingested into an index). If `format` + is `semi_structured_text` then this field must match the name of the + appropriate extraction in the `grok_pattern`, therefore it is best not to + specify this parameter unless `grok_pattern` is also specified. 
For + structured file formats any `timestamp_field` specified must be present within + the file. (For structured file formats it is not compulsory to have a + timestamp within the file if `timestamp_field` is not specified.) If not + specified then the structure finder will make a decision about which field + (if any) should be the primary timestamp field. + +`timestamp_format`:: + (string) Optionally, the time format of the timestamp field in the sample + file. Currently there is a limitation that this format must be one that the + structure finder might choose by itself. (The reason for this restriction is + that to consistently set all the fields in the response the structure finder + needs a corresponding Grok pattern name and simple regular expression for each + timestamp format.) Therefore there is little value in specifying this + parameter for structured file formats: it is as good and less error-prone to + just specify `timestamp_field` if you know which field contains your primary + timestamp. The valuable use case for this parameter is when the format is + semi-structured text, there are multiple timestamp formats in the sample file + and you know which format corresponds to the primary timestamp, yet you do not + want to specify the full `grok_pattern`. If not specified then the structure + finder will choose the best format from the formats it knows, which are: + - `dd/MMM/YYYY:HH:mm:ss Z` + - `EEE MMM dd HH:mm zzz YYYY` + - `EEE MMM dd HH:mm:ss YYYY` + - `EEE MMM dd HH:mm:ss zzz YYYY` + - `EEE MMM dd YYYY HH:mm zzz` + - `EEE MMM dd YYYY HH:mm:ss zzz` + - `EEE, dd MMM YYYY HH:mm Z` + - `EEE, dd MMM YYYY HH:mm ZZ` + - `EEE, dd MMM YYYY HH:mm:ss Z` + - `EEE, dd MMM YYYY HH:mm:ss ZZ` + - `ISO8601` + - `MMM d HH:mm:ss` + - `MMM d HH:mm:ss,SSS` + - `MMM d YYYY HH:mm:ss` + - `MMM dd HH:mm:ss` + - `MMM dd HH:mm:ss,SSS` + - `MMM dd YYYY HH:mm:ss` + - `MMM dd, YYYY K:mm:ss a` + - `TAI64N` + - `UNIX` + - `UNIX_MS` + - `YYYY-MM-dd HH:mm:ss` + - `YYYY-MM-dd HH:mm:ss,SSS` + - `YYYY-MM-dd HH:mm:ss,SSS Z` + - `YYYY-MM-dd HH:mm:ss,SSSZ` + - `YYYY-MM-dd HH:mm:ss,SSSZZ` + - `YYYY-MM-dd HH:mm:ssZ` + - `YYYY-MM-dd HH:mm:ssZZ` + - `YYYYMMddHHmmss` + + +==== Request Body + +The file whose structure is to be analyzed. This does not necessarily have to +be in JSON format, and does not necessarily have to be UTF-8 encoded. The +size is still limited to the Elasticsearch HTTP receive buffer size (default +100 Mb). + + +==== Authorization + +You must have `monitor_ml`, or `monitor` cluster privileges to use this API. +For more information, see {stack-ov}/security-privileges.html[Security Privileges]. + + +[[ml-find-file-structure-examples]] +==== Examples + +Suppose you have a newline delimited JSON file containing information about some +books. Then you could send the contents to the `find_file_structure` endpoint: + +[source,js] +---- +POST _xpack/ml/find_file_structure +{"name": "Leviathan Wakes", "author": "James S.A. 
Corey", "release_date": "2011-06-02", "page_count": 561} +{"name": "Hyperion", "author": "Dan Simmons", "release_date": "1989-05-26", "page_count": 482} +{"name": "Dune", "author": "Frank Herbert", "release_date": "1965-06-01", "page_count": 604} +{"name": "Dune Messiah", "author": "Frank Herbert", "release_date": "1969-10-15", "page_count": 331} +{"name": "Children of Dune", "author": "Frank Herbert", "release_date": "1976-04-21", "page_count": 408} +{"name": "God Emperor of Dune", "author": "Frank Herbert", "release_date": "1981-05-28", "page_count": 454} +{"name": "Consider Phlebas", "author": "Iain M. Banks", "release_date": "1987-04-23", "page_count": 471} +{"name": "Pandora's Star", "author": "Peter F. Hamilton", "release_date": "2004-03-02", "page_count": 768} +{"name": "Revelation Space", "author": "Alastair Reynolds", "release_date": "2000-03-15", "page_count": 585} +{"name": "A Fire Upon the Deep", "author": "Vernor Vinge", "release_date": "1992-06-01", "page_count": 613} +{"name": "Ender's Game", "author": "Orson Scott Card", "release_date": "1985-06-01", "page_count": 324} +{"name": "1984", "author": "George Orwell", "release_date": "1985-06-01", "page_count": 328} +{"name": "Fahrenheit 451", "author": "Ray Bradbury", "release_date": "1953-10-15", "page_count": 227} +{"name": "Brave New World", "author": "Aldous Huxley", "release_date": "1932-06-01", "page_count": 268} +{"name": "Foundation", "author": "Isaac Asimov", "release_date": "1951-06-01", "page_count": 224} +{"name": "The Giver", "author": "Lois Lowry", "release_date": "1993-04-26", "page_count": 208} +{"name": "Slaughterhouse-Five", "author": "Kurt Vonnegut", "release_date": "1969-06-01", "page_count": 275} +{"name": "The Hitchhiker's Guide to the Galaxy", "author": "Douglas Adams", "release_date": "1979-10-12", "page_count": 180} +{"name": "Snow Crash", "author": "Neal Stephenson", "release_date": "1992-06-01", "page_count": 470} +{"name": "Neuromancer", "author": "William Gibson", "release_date": "1984-07-01", "page_count": 271} +{"name": "The Handmaid's Tale", "author": "Margaret Atwood", "release_date": "1985-06-01", "page_count": 311} +{"name": "Starship Troopers", "author": "Robert A. Heinlein", "release_date": "1959-12-01", "page_count": 335} +{"name": "The Left Hand of Darkness", "author": "Ursula K. Le Guin", "release_date": "1969-06-01", "page_count": 304} +{"name": "The Moon is a Harsh Mistress", "author": "Robert A. Heinlein", "release_date": "1966-04-01", "page_count": 288} +---- +// CONSOLE +// TEST + +If the request does not encounter errors, you receive the following result: +[source,js] +---- +{ + "num_lines_analyzed" : 24, <1> + "num_messages_analyzed" : 24, <2> + "sample_start" : "{\"name\": \"Leviathan Wakes\", \"author\": \"James S.A. Corey\", \"release_date\": \"2011-06-02\", \"page_count\": 561}\n{\"name\": \"Hyperion\", \"author\": \"Dan Simmons\", \"release_date\": \"1989-05-26\", \"page_count\": 482}\n", <3> + "charset" : "UTF-8", <4> + "has_byte_order_marker" : false, <5> + "format" : "json", <6> + "need_client_timezone" : false, <7> + "mappings" : { <8> + "author" : { + "type" : "keyword" + }, + "name" : { + "type" : "keyword" + }, + "page_count" : { + "type" : "long" + }, + "release_date" : { + "type" : "keyword" + } + }, + "field_stats" : { <9> + "author" : { + "count" : 24, + "cardinality" : 20, + "top_hits" : [ + { + "value" : "Frank Herbert", + "count" : 4 + }, + { + "value" : "Robert A. 
Heinlein", + "count" : 2 + }, + { + "value" : "Alastair Reynolds", + "count" : 1 + }, + { + "value" : "Aldous Huxley", + "count" : 1 + }, + { + "value" : "Dan Simmons", + "count" : 1 + }, + { + "value" : "Douglas Adams", + "count" : 1 + }, + { + "value" : "George Orwell", + "count" : 1 + }, + { + "value" : "Iain M. Banks", + "count" : 1 + }, + { + "value" : "Isaac Asimov", + "count" : 1 + }, + { + "value" : "James S.A. Corey", + "count" : 1 + } + ] + }, + "name" : { + "count" : 24, + "cardinality" : 24, + "top_hits" : [ + { + "value" : "1984", + "count" : 1 + }, + { + "value" : "A Fire Upon the Deep", + "count" : 1 + }, + { + "value" : "Brave New World", + "count" : 1 + }, + { + "value" : "Children of Dune", + "count" : 1 + }, + { + "value" : "Consider Phlebas", + "count" : 1 + }, + { + "value" : "Dune", + "count" : 1 + }, + { + "value" : "Dune Messiah", + "count" : 1 + }, + { + "value" : "Ender's Game", + "count" : 1 + }, + { + "value" : "Fahrenheit 451", + "count" : 1 + }, + { + "value" : "Foundation", + "count" : 1 + } + ] + }, + "page_count" : { + "count" : 24, + "cardinality" : 24, + "min_value" : 180.0, + "max_value" : 768.0, + "mean_value" : 387.0833333333333, + "median_value" : 329.5, + "top_hits" : [ + { + "value" : 180.0, + "count" : 1 + }, + { + "value" : 208.0, + "count" : 1 + }, + { + "value" : 224.0, + "count" : 1 + }, + { + "value" : 227.0, + "count" : 1 + }, + { + "value" : 268.0, + "count" : 1 + }, + { + "value" : 271.0, + "count" : 1 + }, + { + "value" : 275.0, + "count" : 1 + }, + { + "value" : 288.0, + "count" : 1 + }, + { + "value" : 304.0, + "count" : 1 + }, + { + "value" : 311.0, + "count" : 1 + } + ] + }, + "release_date" : { + "count" : 24, + "cardinality" : 20, + "top_hits" : [ + { + "value" : "1985-06-01", + "count" : 3 + }, + { + "value" : "1969-06-01", + "count" : 2 + }, + { + "value" : "1992-06-01", + "count" : 2 + }, + { + "value" : "1932-06-01", + "count" : 1 + }, + { + "value" : "1951-06-01", + "count" : 1 + }, + { + "value" : "1953-10-15", + "count" : 1 + }, + { + "value" : "1959-12-01", + "count" : 1 + }, + { + "value" : "1965-06-01", + "count" : 1 + }, + { + "value" : "1966-04-01", + "count" : 1 + }, + { + "value" : "1969-10-15", + "count" : 1 + } + ] + } + } +} +---- +// TESTRESPONSE[s/"sample_start" : ".*",/"sample_start" : "$body.sample_start",/] +// The substitution is because the "file" is pre-processed by the test harness, +// so the fields may get reordered in the JSON the endpoint sees + +<1> `num_lines_analyzed` says how many lines of the uploaded file were analyzed. +<2> `num_messages_analyzed` says how many distinct messages the lines contained. + For ND-JSON it will be the same as `num_lines_analyzed`, but for other + file formats messages can span several lines. +<3> `sample_start` reproduces the first two messages in the sample file + verbatim. This may help to diagnose parse errors, or accidental uploads of + the wrong file. +<4> `charset` the character encoding used to parse the uploaded file. +<5> `has_byte_order_marker` for UTF character encodings, did the uploaded file + begin with a byte order marker? +<6> `format` is one of `json`, `xml`, `delimited` or `semi_structured_text`. +<7> `need_client_timezone` will be `true` if a timestamp format is detected + that does not include a timezone, thus necessitating that the server that + parses it must be told the correct timezone by the client. +<8> `mappings` contains some suitable mappings for an index into which the data + could be ingested. 
In this case the `release_date` field has been given + `type` `keyword` as it is not considered specific enough to convert to the + `date` `type`. +<9> `field_stats` contains the most common values of each field, plus basic + numeric statistics for the numeric `page_count` field. This information + may provide clues that the data needs to be cleaned or transformed prior + to use by other {ml} functionality. + diff --git a/docs/reference/ml/apis/ml-api.asciidoc b/docs/reference/ml/apis/ml-api.asciidoc index 961eb37e9d7e0..661e86c5ba67b 100644 --- a/docs/reference/ml/apis/ml-api.asciidoc +++ b/docs/reference/ml/apis/ml-api.asciidoc @@ -70,6 +70,12 @@ machine learning APIs and in advanced job configuration options in Kibana. * <> * <> +[float] +[[ml-api-file-structure-endpoint]] +=== File Structure + +* <> + //ADD include::post-calendar-event.asciidoc[] include::put-calendar-job.asciidoc[] @@ -126,3 +132,5 @@ include::update-snapshot.asciidoc[] //VALIDATE //include::validate-detector.asciidoc[] //include::validate-job.asciidoc[] +//FILE-STRUCTURE +include::find-file-structure.asciidoc[] From e4fbf106d49ef77e18b91659a277eeb3a52369d7 Mon Sep 17 00:00:00 2001 From: lcawl Date: Mon, 17 Sep 2018 12:41:02 -0700 Subject: [PATCH 2/3] [DOCS] Edits file structure API --- .../ml/apis/find-file-structure.asciidoc | 310 +++++++++--------- docs/reference/ml/apis/ml-api.asciidoc | 5 +- 2 files changed, 164 insertions(+), 151 deletions(-) diff --git a/docs/reference/ml/apis/find-file-structure.asciidoc b/docs/reference/ml/apis/find-file-structure.asciidoc index 7811c9091c23d..3ae4927db2404 100644 --- a/docs/reference/ml/apis/find-file-structure.asciidoc +++ b/docs/reference/ml/apis/find-file-structure.asciidoc @@ -8,8 +8,8 @@ experimental[] -Finds the structure of a text file that contains data suitable to be ingested -into Elasticsearch. +Finds the structure of a text file. The text file must contain data that is +suitable to be ingested into {es}. ==== Request @@ -18,174 +18,187 @@ into Elasticsearch. ==== Description -The aim of this endpoint is to provide a starting point for ingesting data into -Elasticsearch in a format that will be suitable for subsequent use with other -{ml} functionality. +This API provides a starting point for ingesting data into {es} in a format that +is suitable for subsequent use with other {ml} functionality. -Unlike other Elasticsearch endpoints, the data that is POSTed to this endpoint -need not be in JSON format, nor even UTF-8 encoded. It must, however, be text; -binary file formats are not currently supported. +Unlike other {es} endpoints, the data that is posted to this endpoint does not +need to be UTF-8 encoded and in JSON format. It must, however, be text; binary +file formats are not currently supported. -The response from the endpoint contains: +The response from the API contains: * A couple of messages from the beginning of the file. -* Statistics revealing the most common values of all fields detected within the - file, plus basic numeric statistics for numeric fields. -* Information about the structure of the file that would be useful when writing +* Statistics that reveal the most common values for all fields detected within + the file and basic numeric statistics for numeric fields. +* Information about the structure of the file, which is useful when you write ingest configurations to index the file contents. -* Appropriate mappings for an Elasticsearch index into which the file contents - could be ingested. 
+* Appropriate mappings for an {es} index, which you could use to ingest the file + contents. -All this can be calculated by the endpoint with no guidance. However, -optionally it is possible to override some of the decisions about the file -structure by specifying the desired values in one or more query parameters. +All this information can be calculated by the structure finder with no guidance. +However, you can optionally override some of the decisions about the file +structure by specifying one or more query parameters. Details of the output can be seen in the <>. -If the endpoint produces unexpected results for a particular file, the `explain` -query parameter causes an `explanation` to appear in the response, which should -help in determining why the returned structure was chosen. +If the structure finder produces unexpected results for a particular file, +specify the `explain` query parameter. It causes an `explanation` to appear in +the response, which should help in determining why the returned structure was +chosen. ==== Query Parameters `charset`:: - (string) Optionally, the character set in which the sample file is encoded. - This must be a character set that is supported by the JVM that Elasticsearch, - for example "UTF-8", "UTF-16LE", "windows-1252" or "EUC-JP". If not specified - then the structure finder will choose an appropriate character set. + (string) The file's character set. It must be a character set that is supported + by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`, `windows-1252`, or + `EUC-JP`. If this parameter is not specified, the structure finder chooses an + appropriate character set. `column_names`:: - (string) If `format` is set to `delimited` then this parameter may optionally - be used to specify the column names. The value of this parameter must be a - comma separated list of column names even if the `delimiter` for the sample - file is not comma. If not specified then the structure finder will use the - column names from the header row of the sample file if it has one, and - "column1", "column2", "column3", and so on if it doesn't. + (string) If `format` is set to `delimited`, you can specify the column names + in a comma-separated list. If this parameter is not specified, the structure + finder uses the column names from the header row of the file. If the file does + not have a header role, columns are named "column1", "column2", "column3", etc. `delimiter`:: - (string) If `format` is set to `delimited` then this parameter may optionally - be used to specify the character used to delimit the values in each row. - Only a single character is supported; the delimiter may not have multiple - characters. If not specified then the structure finder will consider the - following possibilities: comma, tab, semi-colon and pipe (`|`). + (string) If `format` is set to `delimited`, you can specify the character used + to delimit the values in each row. Only a single character is supported; the + delimiter cannot have multiple characters. If this parameter is not specified, + the structure finder considers the following possibilities: comma, tab, + semi-colon, and pipe (`|`). `explain`:: - (boolean) If this parameter is set to `true` then the response will include a - field named `explanation` that is an array of strings that indicate how the - structure finder produced its result. The default value is `false`. 
+ (boolean) If this parameter is set to `true`, the response includes a field + named `explanation`, which is an array of strings that indicate how the + structure finder produced its result. The default value is `false`. `format`:: - (string) Optionally, the high level structure of the file, which must be one - of `json`, `xml`, `delimited` or `semi_structured_text`. If not specified - then the structure finder will decide. + (string) The high level structure of the file. Valid values are `json`, `xml`, + `delimited`, and `semi_structured_text`. If this parameter is not specified, + the structure finder chooses one. `grok_pattern`:: - (string) If `format` is set to `semi_structured_text` then this parameter may - optionally supply the Grok pattern to be used to extract fields from every - message within the sample file. The name of the timestamp field within the - Grok pattern must match what is specified in `timestamp_field`, or be - "timestamp" if that parameter is not specified. If not specified then the - structure finder will create a Grok pattern. + (string) If `format` is set to `semi_structured_text`, you can specify a Grok + pattern that is used to extract fields from every message in the file. The + name of the timestamp field in the Grok pattern must match what is specified + in the `timestamp_field` parameter. If that parameter is not specified, the + name of the timestamp field in the Grok pattern must match "timestamp". If + `grok_pattern` is not specified, the structure finder creates a Grok pattern. `has_header_row`:: - (boolean) If `format` is set to `delimited` then this parameter may optionally - be used to force the decision on whether the column names are in the first row - of the sample file. If not specified then the structure finder will guess - based on the similarity of the first row of the sample file and other rows. + (boolean) If `format` is set to `delimited`, you can use this parameter to + indicate whether the column names are in the first row of the file. If this + parameter is not specified, the structure finder guesses based on the similarity of + the first row of the file to other rows. `lines_to_sample`:: - (unsigned integer) The number of lines from the beginning of the uploaded - sample to include in the structure analysis. The minimum is 2; the default - is 1000. If the sample contains fewer lines than this parameter specifies - then, providing there are at least two lines in the sample, the analysis will - proceed and will analyse all lines provided. The more lines that are analyzed - the slower the analysis will be. The more varied the lines that are analyzed - the more useful the analysis will be. For example, if you upload a log file - where the first 1000 lines are all variations on the same message then the - analysis will find more commonality than would be seen with a bigger sample. - But, if possible, it would be more efficient to upload a sample file with more - variety in the first 1000 lines than to request analysis of 100000 lines to - achieve some variety. + (unsigned integer) The number of lines to include in the structural analysis, + starting from the beginning of the file. The minimum is 2; the default + is 1000. If the value of this parameter is greater than the number of lines in + the file, the analysis proceeds (as long as there are at least two lines in the + file) for all of the lines. + ++ +-- +NOTE: The number of lines and the variation of the lines affects the speed of +the analysis. 
For example, if you upload a log file where the first 1000 lines +are all variations on the same message, the analysis will find more commonality +than would be seen with a bigger sample. If possible, however, it is more +efficient to upload a sample file with more variety in the first 1000 lines than +to request analysis of 100000 lines to achieve some variety. +-- `quote`:: - (string) If `format` is set to `delimited` then this parameter may optionally - be used to specify the character used to quote the values in each row if they - contain newlines or the delimiter character. Only a single character is - supported. If not specified the default is a double quote (`"`). (If your - delimited file format does not use quoting then a workaround is to set this - argument to a character that does not appear anywhere in the sample.) + (string) If `format` is set to `delimited`, you can specify the character used + to quote the values in each row if they contain newlines or the delimiter + character. Only a single character is supported. If this parameter is not + specified, the default value is a double quote (`"`). If your delimited file + format does not use quoting, a workaround is to set this argument to a + character that does not appear anywhere in the sample. `should_trim_fields`:: - (boolean) If `format` is set to `delimited` then this parameter may optionally - be used to specify whether values between delimiters should have whitespace - trimmed from them. If not specified then the default is `true` if the - delimiter is pipe (`|`) and `false` otherwise. + (boolean) If `format` is set to `delimited`, you can specify whether values + between delimiters should have whitespace trimmed from them. If this parameter + is not specified and the delimiter is pipe (`|`), the default value is `true`. + Otherwise, the default value is `false`. `timestamp_field`:: - (string) Optionally, the name of the field in the sample file that contains - the primary timestamp of each record (the one that would be used to populate - the `@timestamp` field if the file were ingested into an index). If `format` - is `semi_structured_text` then this field must match the name of the - appropriate extraction in the `grok_pattern`, therefore it is best not to - specify this parameter unless `grok_pattern` is also specified. For - structured file formats any `timestamp_field` specified must be present within - the file. (For structured file formats it is not compulsory to have a - timestamp within the file if `timestamp_field` is not specified.) If not - specified then the structure finder will make a decision about which field - (if any) should be the primary timestamp field. + (string) The name of the field that contains the primary timestamp of each + record in the file. In particular, if the file were ingested into an index, + this is the field that would be used to populate the `@timestamp` field. + ++ +-- +If the `format` is `semi_structured_text`, this field must match the name of the +appropriate extraction in the `grok_pattern`. Therefore, for semi-structured +file formats, it is best not to specify this parameter unless `grok_pattern` is +also specified. + +For structured file formats, if you specify this parameter, the field must exist +within the file. + +If this parameter is not specified, the structure finder makes a decision about which +field (if any) is the primary timestamp field. For structured file formats, it +is not compulsory to have a timestamp in the file. 
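+
+As a minimal illustration (the `tstamp` column name and the two log rows are
+invented for this sketch), a delimited file whose primary timestamp is in a
+`tstamp` column could be posted with an explicit override:
+
+[source,js]
+----
+POST _xpack/ml/find_file_structure?format=delimited&timestamp_field=tstamp
+tstamp,level,message
+2018-09-14 17:18:22,INFO,Node started
+2018-09-14 17:18:23,INFO,License loaded
+----
+// NOTCONSOLE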
+-- `timestamp_format`:: - (string) Optionally, the time format of the timestamp field in the sample - file. Currently there is a limitation that this format must be one that the - structure finder might choose by itself. (The reason for this restriction is - that to consistently set all the fields in the response the structure finder - needs a corresponding Grok pattern name and simple regular expression for each - timestamp format.) Therefore there is little value in specifying this - parameter for structured file formats: it is as good and less error-prone to - just specify `timestamp_field` if you know which field contains your primary - timestamp. The valuable use case for this parameter is when the format is - semi-structured text, there are multiple timestamp formats in the sample file - and you know which format corresponds to the primary timestamp, yet you do not - want to specify the full `grok_pattern`. If not specified then the structure - finder will choose the best format from the formats it knows, which are: - - `dd/MMM/YYYY:HH:mm:ss Z` - - `EEE MMM dd HH:mm zzz YYYY` - - `EEE MMM dd HH:mm:ss YYYY` - - `EEE MMM dd HH:mm:ss zzz YYYY` - - `EEE MMM dd YYYY HH:mm zzz` - - `EEE MMM dd YYYY HH:mm:ss zzz` - - `EEE, dd MMM YYYY HH:mm Z` - - `EEE, dd MMM YYYY HH:mm ZZ` - - `EEE, dd MMM YYYY HH:mm:ss Z` - - `EEE, dd MMM YYYY HH:mm:ss ZZ` - - `ISO8601` - - `MMM d HH:mm:ss` - - `MMM d HH:mm:ss,SSS` - - `MMM d YYYY HH:mm:ss` - - `MMM dd HH:mm:ss` - - `MMM dd HH:mm:ss,SSS` - - `MMM dd YYYY HH:mm:ss` - - `MMM dd, YYYY K:mm:ss a` - - `TAI64N` - - `UNIX` - - `UNIX_MS` - - `YYYY-MM-dd HH:mm:ss` - - `YYYY-MM-dd HH:mm:ss,SSS` - - `YYYY-MM-dd HH:mm:ss,SSS Z` - - `YYYY-MM-dd HH:mm:ss,SSSZ` - - `YYYY-MM-dd HH:mm:ss,SSSZZ` - - `YYYY-MM-dd HH:mm:ssZ` - - `YYYY-MM-dd HH:mm:ssZZ` - - `YYYYMMddHHmmss` - + (string) The time format of the timestamp field in the file. + ++ +-- +NOTE: Currently there is a limitation that this format must be one that the +structure finder might choose by itself. The reason for this restriction is that +to consistently set all the fields in the response the structure finder needs a +corresponding Grok pattern name and simple regular expression for each timestamp +format. Therefore, there is little value in specifying this parameter for +structured file formats. If you know which field contains your primary timestamp, +it is as good and less error-prone to just specify `timestamp_field`. + +The valuable use case for this parameter is when the format is semi-structured +text, there are multiple timestamp formats in the file, and you know which +format corresponds to the primary timestamp, but you do not want to specify the +full `grok_pattern`. 
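+
+For instance, in the following sketch (the two log lines are invented, and the
+parameter value is URL-encoded because it contains spaces), each line carries
+two timestamps and the syslog-style `MMM dd HH:mm:ss` one is nominated as the
+primary timestamp:
+
+[source,js]
+----
+POST _xpack/ml/find_file_structure?format=semi_structured_text&timestamp_format=MMM%20dd%20HH:mm:ss
+Sep 14 17:18:22 host1 app[1234]: [2018-09-14T17:18:22,123] starting up
+Sep 14 17:18:23 host1 app[1234]: [2018-09-14T17:18:23,456] license loaded
+----
+// NOTCONSOLE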
+ +If this parameter is not specified, the structure finder chooses the best format from +the formats it knows, which are: + +* `dd/MMM/YYYY:HH:mm:ss Z` +* `EEE MMM dd HH:mm zzz YYYY` +* `EEE MMM dd HH:mm:ss YYYY` +* `EEE MMM dd HH:mm:ss zzz YYYY` +* `EEE MMM dd YYYY HH:mm zzz` +* `EEE MMM dd YYYY HH:mm:ss zzz` +* `EEE, dd MMM YYYY HH:mm Z` +* `EEE, dd MMM YYYY HH:mm ZZ` +* `EEE, dd MMM YYYY HH:mm:ss Z` +* `EEE, dd MMM YYYY HH:mm:ss ZZ` +* `ISO8601` +* `MMM d HH:mm:ss` +* `MMM d HH:mm:ss,SSS` +* `MMM d YYYY HH:mm:ss` +* `MMM dd HH:mm:ss` +* `MMM dd HH:mm:ss,SSS` +* `MMM dd YYYY HH:mm:ss` +* `MMM dd, YYYY K:mm:ss a` +* `TAI64N` +* `UNIX` +* `UNIX_MS` +* `YYYY-MM-dd HH:mm:ss` +* `YYYY-MM-dd HH:mm:ss,SSS` +* `YYYY-MM-dd HH:mm:ss,SSS Z` +* `YYYY-MM-dd HH:mm:ss,SSSZ` +* `YYYY-MM-dd HH:mm:ss,SSSZZ` +* `YYYY-MM-dd HH:mm:ssZ` +* `YYYY-MM-dd HH:mm:ssZZ` +* `YYYYMMddHHmmss` + +-- ==== Request Body -The file whose structure is to be analyzed. This does not necessarily have to -be in JSON format, and does not necessarily have to be UTF-8 encoded. The -size is still limited to the Elasticsearch HTTP receive buffer size (default -100 Mb). +The text file that you want to analyze. It must contain data that is suitable to +be ingested into {es}. It does not need to be in JSON format and it does not +need to be UTF-8 encoded. The size is limited to the {es} HTTP receive buffer +size, which defaults to 100 Mb. ==== Authorization @@ -197,8 +210,8 @@ For more information, see {stack-ov}/security-privileges.html[Security Privilege [[ml-find-file-structure-examples]] ==== Examples -Suppose you have a newline delimited JSON file containing information about some -books. Then you could send the contents to the `find_file_structure` endpoint: +Suppose you have a newline-delimited JSON file that contains information about +some books. You can send the contents to the `find_file_structure` endpoint: [source,js] ---- @@ -452,24 +465,23 @@ If the request does not encounter errors, you receive the following result: // The substitution is because the "file" is pre-processed by the test harness, // so the fields may get reordered in the JSON the endpoint sees -<1> `num_lines_analyzed` says how many lines of the uploaded file were analyzed. -<2> `num_messages_analyzed` says how many distinct messages the lines contained. - For ND-JSON it will be the same as `num_lines_analyzed`, but for other - file formats messages can span several lines. -<3> `sample_start` reproduces the first two messages in the sample file - verbatim. This may help to diagnose parse errors, or accidental uploads of - the wrong file. -<4> `charset` the character encoding used to parse the uploaded file. -<5> `has_byte_order_marker` for UTF character encodings, did the uploaded file - begin with a byte order marker? +<1> `num_lines_analyzed` indicates how many lines of the file were analyzed. +<2> `num_messages_analyzed` indicates how many distinct messages the lines contained. + For ND-JSON, this value is the same as `num_lines_analyzed`. For other file + formats, messages can span several lines. +<3> `sample_start` reproduces the first two messages in the file verbatim. This + may help to diagnose parse errors or accidental uploads of the wrong file. +<4> `charset` indicates the character encoding used to parse the file. +<5> For UTF character encodings, `has_byte_order_marker` indicates whether the + file begins with a byte order marker. <6> `format` is one of `json`, `xml`, `delimited` or `semi_structured_text`. 
-<7> `need_client_timezone` will be `true` if a timestamp format is detected - that does not include a timezone, thus necessitating that the server that - parses it must be told the correct timezone by the client. +<7> If a timestamp format is detected that does not include a timezone, + `need_client_timezone` will be `true`. The server that parses the file must + therefore be told the correct timezone by the client. <8> `mappings` contains some suitable mappings for an index into which the data - could be ingested. In this case the `release_date` field has been given - `type` `keyword` as it is not considered specific enough to convert to the - `date` `type`. + could be ingested. In this case, the `release_date` field has been given a + `keyword` type as it is not considered specific enough to convert to the + `date` type. <9> `field_stats` contains the most common values of each field, plus basic numeric statistics for the numeric `page_count` field. This information may provide clues that the data needs to be cleaned or transformed prior diff --git a/docs/reference/ml/apis/ml-api.asciidoc b/docs/reference/ml/apis/ml-api.asciidoc index 661e86c5ba67b..bb086435fb24c 100644 --- a/docs/reference/ml/apis/ml-api.asciidoc +++ b/docs/reference/ml/apis/ml-api.asciidoc @@ -95,6 +95,8 @@ include::delete-forecast.asciidoc[] include::delete-job.asciidoc[] include::delete-calendar-job.asciidoc[] include::delete-snapshot.asciidoc[] +//FIND +include::find-file-structure.asciidoc[] //FLUSH include::flush-job.asciidoc[] //FORECAST @@ -132,5 +134,4 @@ include::update-snapshot.asciidoc[] //VALIDATE //include::validate-detector.asciidoc[] //include::validate-job.asciidoc[] -//FILE-STRUCTURE -include::find-file-structure.asciidoc[] + From f9a243e2a5963f024663f79a78d5ad868284f9dd Mon Sep 17 00:00:00 2001 From: David Roberts Date: Wed, 19 Sep 2018 16:04:03 +0100 Subject: [PATCH 3/3] Make clearer that some parameters require user setting of `format` Previously it was easy to think that the correct auto-detected `format` made it possible to specify some of the other parameters. --- .../ml/apis/find-file-structure.asciidoc | 164 +++++++++--------- 1 file changed, 82 insertions(+), 82 deletions(-) diff --git a/docs/reference/ml/apis/find-file-structure.asciidoc b/docs/reference/ml/apis/find-file-structure.asciidoc index 3ae4927db2404..f9a583a027a4b 100644 --- a/docs/reference/ml/apis/find-file-structure.asciidoc +++ b/docs/reference/ml/apis/find-file-structure.asciidoc @@ -8,7 +8,7 @@ experimental[] -Finds the structure of a text file. The text file must contain data that is +Finds the structure of a text file. The text file must contain data that is suitable to be ingested into {es}. ==== Request @@ -18,126 +18,126 @@ suitable to be ingested into {es}. ==== Description -This API provides a starting point for ingesting data into {es} in a format that +This API provides a starting point for ingesting data into {es} in a format that is suitable for subsequent use with other {ml} functionality. -Unlike other {es} endpoints, the data that is posted to this endpoint does not -need to be UTF-8 encoded and in JSON format. It must, however, be text; binary +Unlike other {es} endpoints, the data that is posted to this endpoint does not +need to be UTF-8 encoded and in JSON format. It must, however, be text; binary file formats are not currently supported. The response from the API contains: * A couple of messages from the beginning of the file. 
-* Statistics that reveal the most common values for all fields detected within +* Statistics that reveal the most common values for all fields detected within the file and basic numeric statistics for numeric fields. -* Information about the structure of the file, which is useful when you write +* Information about the structure of the file, which is useful when you write ingest configurations to index the file contents. -* Appropriate mappings for an {es} index, which you could use to ingest the file - contents. +* Appropriate mappings for an {es} index, which you could use to ingest the file + contents. -All this information can be calculated by the structure finder with no guidance. -However, you can optionally override some of the decisions about the file +All this information can be calculated by the structure finder with no guidance. +However, you can optionally override some of the decisions about the file structure by specifying one or more query parameters. Details of the output can be seen in the <>. -If the structure finder produces unexpected results for a particular file, -specify the `explain` query parameter. It causes an `explanation` to appear in -the response, which should help in determining why the returned structure was +If the structure finder produces unexpected results for a particular file, +specify the `explain` query parameter. It causes an `explanation` to appear in +the response, which should help in determining why the returned structure was chosen. ==== Query Parameters `charset`:: - (string) The file's character set. It must be a character set that is supported - by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`, `windows-1252`, or - `EUC-JP`. If this parameter is not specified, the structure finder chooses an + (string) The file's character set. It must be a character set that is supported + by the JVM that {es} uses. For example, `UTF-8`, `UTF-16LE`, `windows-1252`, or + `EUC-JP`. If this parameter is not specified, the structure finder chooses an appropriate character set. `column_names`:: - (string) If `format` is set to `delimited`, you can specify the column names - in a comma-separated list. If this parameter is not specified, the structure - finder uses the column names from the header row of the file. If the file does - not have a header role, columns are named "column1", "column2", "column3", etc. + (string) If you have set `format` to `delimited`, you can specify the column names + in a comma-separated list. If this parameter is not specified, the structure + finder uses the column names from the header row of the file. If the file does + not have a header role, columns are named "column1", "column2", "column3", etc. `delimiter`:: - (string) If `format` is set to `delimited`, you can specify the character used - to delimit the values in each row. Only a single character is supported; the - delimiter cannot have multiple characters. If this parameter is not specified, - the structure finder considers the following possibilities: comma, tab, + (string) If you have set `format` to `delimited`, you can specify the character used + to delimit the values in each row. Only a single character is supported; the + delimiter cannot have multiple characters. If this parameter is not specified, + the structure finder considers the following possibilities: comma, tab, semi-colon, and pipe (`|`). 
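+
+As an illustrative sketch that ties the delimited-format parameters together
+(the two rows reuse data from the books example below, re-encoded as
+pipe-delimited text, and `%7C` is simply the URL-encoded pipe character):
+
+[source,js]
+----
+POST _xpack/ml/find_file_structure?format=delimited&delimiter=%7C&has_header_row=false&column_names=name,author,page_count
+Leviathan Wakes|James S.A. Corey|561
+Hyperion|Dan Simmons|482
+----
+// NOTCONSOLE
+
+Note that the value of `column_names` is always comma-separated, even when, as
+here, the delimiter in the file itself is not a comma.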
`explain`:: - (boolean) If this parameter is set to `true`, the response includes a field + (boolean) If this parameter is set to `true`, the response includes a field named `explanation`, which is an array of strings that indicate how the structure finder produced its result. The default value is `false`. `format`:: - (string) The high level structure of the file. Valid values are `json`, `xml`, - `delimited`, and `semi_structured_text`. If this parameter is not specified, + (string) The high level structure of the file. Valid values are `json`, `xml`, + `delimited`, and `semi_structured_text`. If this parameter is not specified, the structure finder chooses one. `grok_pattern`:: - (string) If `format` is set to `semi_structured_text`, you can specify a Grok - pattern that is used to extract fields from every message in the file. The - name of the timestamp field in the Grok pattern must match what is specified - in the `timestamp_field` parameter. If that parameter is not specified, the - name of the timestamp field in the Grok pattern must match "timestamp". If + (string) If you have set `format` to `semi_structured_text`, you can specify a Grok + pattern that is used to extract fields from every message in the file. The + name of the timestamp field in the Grok pattern must match what is specified + in the `timestamp_field` parameter. If that parameter is not specified, the + name of the timestamp field in the Grok pattern must match "timestamp". If `grok_pattern` is not specified, the structure finder creates a Grok pattern. `has_header_row`:: - (boolean) If `format` is set to `delimited`, you can use this parameter to - indicate whether the column names are in the first row of the file. If this - parameter is not specified, the structure finder guesses based on the similarity of + (boolean) If you have set `format` to `delimited`, you can use this parameter to + indicate whether the column names are in the first row of the file. If this + parameter is not specified, the structure finder guesses based on the similarity of the first row of the file to other rows. `lines_to_sample`:: - (unsigned integer) The number of lines to include in the structural analysis, + (unsigned integer) The number of lines to include in the structural analysis, starting from the beginning of the file. The minimum is 2; the default - is 1000. If the value of this parameter is greater than the number of lines in - the file, the analysis proceeds (as long as there are at least two lines in the + is 1000. If the value of this parameter is greater than the number of lines in + the file, the analysis proceeds (as long as there are at least two lines in the file) for all of the lines. + + --- -NOTE: The number of lines and the variation of the lines affects the speed of -the analysis. For example, if you upload a log file where the first 1000 lines -are all variations on the same message, the analysis will find more commonality -than would be seen with a bigger sample. If possible, however, it is more -efficient to upload a sample file with more variety in the first 1000 lines than +-- +NOTE: The number of lines and the variation of the lines affects the speed of +the analysis. For example, if you upload a log file where the first 1000 lines +are all variations on the same message, the analysis will find more commonality +than would be seen with a bigger sample. 
If possible, however, it is more +efficient to upload a sample file with more variety in the first 1000 lines than to request analysis of 100000 lines to achieve some variety. -- `quote`:: - (string) If `format` is set to `delimited`, you can specify the character used - to quote the values in each row if they contain newlines or the delimiter - character. Only a single character is supported. If this parameter is not - specified, the default value is a double quote (`"`). If your delimited file - format does not use quoting, a workaround is to set this argument to a + (string) If you have set `format` to `delimited`, you can specify the character used + to quote the values in each row if they contain newlines or the delimiter + character. Only a single character is supported. If this parameter is not + specified, the default value is a double quote (`"`). If your delimited file + format does not use quoting, a workaround is to set this argument to a character that does not appear anywhere in the sample. `should_trim_fields`:: - (boolean) If `format` is set to `delimited`, you can specify whether values - between delimiters should have whitespace trimmed from them. If this parameter - is not specified and the delimiter is pipe (`|`), the default value is `true`. + (boolean) If you have set `format` to `delimited`, you can specify whether values + between delimiters should have whitespace trimmed from them. If this parameter + is not specified and the delimiter is pipe (`|`), the default value is `true`. Otherwise, the default value is `false`. `timestamp_field`:: - (string) The name of the field that contains the primary timestamp of each - record in the file. In particular, if the file were ingested into an index, + (string) The name of the field that contains the primary timestamp of each + record in the file. In particular, if the file were ingested into an index, this is the field that would be used to populate the `@timestamp` field. + + -- If the `format` is `semi_structured_text`, this field must match the name of the -appropriate extraction in the `grok_pattern`. Therefore, for semi-structured -file formats, it is best not to specify this parameter unless `grok_pattern` is -also specified. +appropriate extraction in the `grok_pattern`. Therefore, for semi-structured +file formats, it is best not to specify this parameter unless `grok_pattern` is +also specified. -For structured file formats, if you specify this parameter, the field must exist -within the file. +For structured file formats, if you specify this parameter, the field must exist +within the file. -If this parameter is not specified, the structure finder makes a decision about which -field (if any) is the primary timestamp field. For structured file formats, it +If this parameter is not specified, the structure finder makes a decision about which +field (if any) is the primary timestamp field. For structured file formats, it is not compulsory to have a timestamp in the file. -- @@ -145,20 +145,20 @@ is not compulsory to have a timestamp in the file. (string) The time format of the timestamp field in the file. + + -- -NOTE: Currently there is a limitation that this format must be one that the -structure finder might choose by itself. The reason for this restriction is that -to consistently set all the fields in the response the structure finder needs a -corresponding Grok pattern name and simple regular expression for each timestamp -format. 
Therefore, there is little value in specifying this parameter for -structured file formats. If you know which field contains your primary timestamp, +NOTE: Currently there is a limitation that this format must be one that the +structure finder might choose by itself. The reason for this restriction is that +to consistently set all the fields in the response the structure finder needs a +corresponding Grok pattern name and simple regular expression for each timestamp +format. Therefore, there is little value in specifying this parameter for +structured file formats. If you know which field contains your primary timestamp, it is as good and less error-prone to just specify `timestamp_field`. -The valuable use case for this parameter is when the format is semi-structured -text, there are multiple timestamp formats in the file, and you know which -format corresponds to the primary timestamp, but you do not want to specify the -full `grok_pattern`. +The valuable use case for this parameter is when the format is semi-structured +text, there are multiple timestamp formats in the file, and you know which +format corresponds to the primary timestamp, but you do not want to specify the +full `grok_pattern`. -If this parameter is not specified, the structure finder chooses the best format from +If this parameter is not specified, the structure finder chooses the best format from the formats it knows, which are: * `dd/MMM/YYYY:HH:mm:ss Z` @@ -195,9 +195,9 @@ the formats it knows, which are: ==== Request Body -The text file that you want to analyze. It must contain data that is suitable to -be ingested into {es}. It does not need to be in JSON format and it does not -need to be UTF-8 encoded. The size is limited to the {es} HTTP receive buffer +The text file that you want to analyze. It must contain data that is suitable to +be ingested into {es}. It does not need to be in JSON format and it does not +need to be UTF-8 encoded. The size is limited to the {es} HTTP receive buffer size, which defaults to 100 Mb. @@ -210,7 +210,7 @@ For more information, see {stack-ov}/security-privileges.html[Security Privilege [[ml-find-file-structure-examples]] ==== Examples -Suppose you have a newline-delimited JSON file that contains information about +Suppose you have a newline-delimited JSON file that contains information about some books. You can send the contents to the `find_file_structure` endpoint: [source,js] @@ -467,19 +467,19 @@ If the request does not encounter errors, you receive the following result: <1> `num_lines_analyzed` indicates how many lines of the file were analyzed. <2> `num_messages_analyzed` indicates how many distinct messages the lines contained. - For ND-JSON, this value is the same as `num_lines_analyzed`. For other file + For ND-JSON, this value is the same as `num_lines_analyzed`. For other file formats, messages can span several lines. -<3> `sample_start` reproduces the first two messages in the file verbatim. This +<3> `sample_start` reproduces the first two messages in the file verbatim. This may help to diagnose parse errors or accidental uploads of the wrong file. <4> `charset` indicates the character encoding used to parse the file. -<5> For UTF character encodings, `has_byte_order_marker` indicates whether the +<5> For UTF character encodings, `has_byte_order_marker` indicates whether the file begins with a byte order marker. <6> `format` is one of `json`, `xml`, `delimited` or `semi_structured_text`. 
-<7> If a timestamp format is detected that does not include a timezone, - `need_client_timezone` will be `true`. The server that parses the file must +<7> If a timestamp format is detected that does not include a timezone, + `need_client_timezone` will be `true`. The server that parses the file must therefore be told the correct timezone by the client. <8> `mappings` contains some suitable mappings for an index into which the data - could be ingested. In this case, the `release_date` field has been given a + could be ingested. In this case, the `release_date` field has been given a `keyword` type as it is not considered specific enough to convert to the `date` type. <9> `field_stats` contains the most common values of each field, plus basic