Conversation

@jpountz
Contributor

@jpountz jpountz commented Nov 18, 2016

No description provided.

@jpountz jpountz added the >docs General docs changes label Nov 18, 2016
Contributor

@dadoonet dadoonet left a comment

Awesome!

@bczifra

bczifra commented Nov 18, 2016

Perhaps it's also worth noting that it can have a negative impact on proximity searches (per @mikemccand comment)?

@jpountz
Contributor Author

jpountz commented Nov 18, 2016

Thanks @bczifra, I just added a note about it.

@mikemccand
Contributor

LGTM, thanks @jpountz!

since their content will need to be retrieved by the `_search` API to build
the response. Inverting this document can use an amount of memory that is a
multiplier of the original size of the document. Proximity search (phrase
queries for instance) and <<search-request-highlighting,highlighting>> also
Member

Hitting the large document in the response is going to be slow in ES as well because we always go to stored fields for the id. Unless you turn off storing _source. I'm not sure if that needs to be in the list, but it feels important because as it reads now I'd think "well, I just have to avoid phrase queries and highlighting".

Contributor Author

I'll do that!

Contributor

@clintongormley clintongormley left a comment

LGTM, minor style comments

[[maximum-document-size]]
=== Avoid large documents

Given that the default <<modules-http,`http.max_context_length`>> is set to
Contributor

that -> than


Given that the default <<modules-http,`http.max_context_length`>> is set to
100MB, Elasticsearch will refuse to index any document that is larger that
that. You might decide to increase that particular setting, but Lucene still
Contributor

at -> of

100MB, Elasticsearch will refuse to index any document that is larger that
that. You might decide to increase that particular setting, but Lucene still
has a limit at about 2GB.

Contributor

But even -> Even


But even without considering hard limits, large documents are usually not
practical. Large documents put more stress on network, disk and on memory usage
since their content will need to be retrieved by the `_search` API to build
Contributor

Inverting -> Indexing (or) Indexing this document into the inverted index

original document.

It is sometimes useful to reconsider what the unit of information should be.
For instance, the fact you want to make books searchable doesn't necesarily
Contributor

a single document should consist of a whole book


It is sometimes useful to reconsider what the unit of information should be.
For instance, the fact you want to make books searchable doesn't necesarily
mean that a document should consist of a book. It might be a better idea to
Contributor

use chapters [delete comma]

mean that a document should consist of a book. It might be a better idea to
use chapters, or even paragraphs as documents, and then have a property in
these documents that identifies which book they belong to. This does not only
avoid the issues with large documents, it also makes the search experience
Contributor

delete likely
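
For readers of this thread, the chapter-per-document idea in the snippet above could look roughly like the following. This is only a sketch under stated assumptions: the function `split_book` and the field names (`book_id`, `book_title`, `chapter`, `chapter_title`, `text`) are hypothetical and do not come from the PR or from Elasticsearch itself. It only builds the smaller documents; sending them to a cluster is left out.

```python
import json


def split_book(book_id, title, chapters):
    """Turn one book into one small document per chapter.

    `chapters` is a list of (chapter_title, text) pairs. All field names
    below are hypothetical, chosen only for this illustration.
    """
    docs = []
    for number, (chapter_title, text) in enumerate(chapters, start=1):
        docs.append({
            "book_id": book_id,          # property identifying the parent book
            "book_title": title,
            "chapter": number,
            "chapter_title": chapter_title,
            "text": text,
        })
    return docs


# Placeholder content; a real book would supply the full chapter text.
book_docs = split_book(
    "book-1",
    "Example Book",
    [
        ("First chapter", "Text of the first chapter."),
        ("Second chapter", "Text of the second chapter."),
    ],
)

# Each line is a small, independent JSON document that could be indexed on its own.
for doc in book_docs:
    print(json.dumps(doc))
```

Because every chapter document carries the same `book_id`, results can still be grouped or filtered per book at search time, while each indexed document stays small.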

@clintongormley
Contributor

LGTM

@jpountz jpountz merged commit 52408fc into elastic:master Nov 21, 2016
@jpountz jpountz deleted the docs/large_docs branch November 21, 2016 14:01
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Nov 22, 2016
* master: (42 commits)
  Add support for merging custom meta data in tribe node (elastic#21552)
  [DOCS] Show EC2's auto attribute (elastic#21474)
  Add information about the removal of store throttling to the migration guide.
  Add a recommendation against large documents to the docs. (elastic#21652)
  Add indices options tests to search api REST tests (elastic#21701)
  Fixing indentation in geospatial querying example. (elastic#21682)
  Fix typo in filters aggregation docs (elastic#21690)
  Add BWC layer for Exceptions (elastic#21694)
  Add checkstyle rule to forbid empty javadoc comments (elastic#20881)
  Docs: Added offline install link for discovery-file plugin
  remove pointless catch exception in TransportSearchAction (elastic#21689)
  Rename ClusterState#lookupPrototypeSafe to `lookupPrototype` and remove previous "unsafe" unused variant (elastic#21686)
  Use a buffer to do character to byte conversion in StreamOutput#writeString (elastic#21680)
  Fix integer overflows when dealing with templates. (elastic#21628)
  Fix highlighting on a stored keyword field (elastic#21645)
  Set execute permissions for native plugin programs (elastic#21657)
  adjust visibility of DiscoveryNodes.Delta constructor
  Remove unused DiscoveryNodes.Delta constructor
  Remove unused DiscoveryNode#removeDeadMembers public method
  Remove minNodeVersion and corresponding public `getSmallestVersion` getter method from DiscoveryNodes
  ...