Commit 676b7af

Merge branch 'cn' of github.com:elasticsearch-cn/elasticsearch-definitive-guide into cn
2 parents 03b91b6 + 3e31f3e commit 676b7af

4 files changed: +165 -279 lines changed

050_Search/00_Intro.asciidoc

Lines changed: 23 additions & 40 deletions
@@ -1,60 +1,43 @@
11
[[search]]
2-
== Searching--The Basic Tools
2+
== Searching: The Basic Tools
33

4-
So far, we have learned how to use Elasticsearch as a simple NoSQL-style
5-
distributed document store. We can ((("searching")))throw JSON documents at Elasticsearch and
6-
retrieve each one by ID. But the real power of Elasticsearch lies in its
7-
ability to make sense out of chaos -- to turn Big Data into Big Information.
4+
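So far, we have learned how to use Elasticsearch as a simple NoSQL-style distributed document store. We can ((("searching")))throw a JSON document at Elasticsearch and retrieve it by ID. But the real strength of Elasticsearch lies in finding meaningful information in unordered data: turning "Big Data" into "Big Information".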
85

9-
This is the reason that we use structured JSON documents, rather than
10-
amorphous blobs of data. Elasticsearch not only _stores_ the document, but
11-
also _indexes_ the content of the document in order to make it searchable.
6+
Elasticsearch not only _stores_ a document but also _indexes_ its content so that it can be searched, which is why we use structured JSON documents rather than unstructured binary data.
127

13-
_Every field in a document is indexed and can be queried_. ((("indexing"))) And it's not just
14-
that. During a single query, Elasticsearch can use _all_ of these indices, to
15-
return results at breath-taking speed. That's something that you could never
16-
consider doing with a traditional database.
8+
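_Every field in a document is indexed and can be queried_. ((("indexing")))What's more, during a single query, Elasticsearch can use _all_ of these indexes to return results at remarkable speed. That is something you would never consider doing with a traditional database.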
179

18-
A _search_ can be any of the following:
10+
A _search_ can be any of the following:
1911

20-
* A structured query on concrete fields((("fields", "searching on")))((("searching", "types of searches"))) like `gender` or `age`, sorted by
21-
a field like `join_date`, similar to the type of query that you could construct
22-
in SQL
12+
* A structured query on concrete fields((("fields", "searching on")))((("searching", "types of searches"))) such as `gender` or `age`, sorted by a field such as `join_date`, much like a structured query in SQL.
2313

24-
* A full-text query, which finds all documents matching the search keywords,
25-
and returns them sorted by _relevance_
14+
* A full-text query, which finds all documents matching the search keywords and returns them sorted by _relevance_.
2615

27-
* A combination of the two
16+
* A combination of the two.
2817

29-
While many searches will just work out of((("full text search"))) the box, to use Elasticsearch to
30-
its full potential, you need to understand three subjects:
18+
Many searches work out of the box((("full text search"))), but to use Elasticsearch to its full potential you need to understand the following three concepts:
3119

32-
_Mapping_::
33-
How the data in each field is interpreted
34-
35-
_Analysis_::
36-
How full text is processed to make it searchable
37-
38-
_Query DSL_::
39-
The flexible, powerful query language used by Elasticsearch
20+
_Mapping_::
21+
How the data in each field is interpreted
4022

41-
Each of these is a big subject in its own right, and we explain them in
42-
detail in <<search-in-depth>>. The chapters in this section introduce the
43-
basic concepts of all three--just enough to help you to get an overall
44-
understanding of how search works.
23+
_Analysis_::
24+
How full text is processed to make it searchable
4525

46-
We will start by explaining the `search` API in its simplest form.
26+
_Query DSL_::
27+
The flexible, powerful query language used by Elasticsearch
4728

48-
.Test Data
29+
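Each of these is a big topic in its own right, and we cover them in detail in <<search-in-depth>>. The chapters in this section introduce the basic concepts of all three, just enough to give you an overall picture of how search works.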
30+
31+
We will start by introducing the `search` API in its simplest form.
32+
33+
.Test Data
4934

5035
****
5136
52-
The documents that we will use for test purposes in this chapter can be found
53-
in this gist: https://gist.github.com/clintongormley/8579281.
37+
The test data for this chapter can be found in this gist: https://gist.github.com/clintongormley/8579281.
5438
55-
You can copy the commands and paste them into your shell in order to follow
56-
along with this chapter.
39+
You can copy these commands and run them in your terminal to follow along with the examples in this chapter.
5740
58-
Alternatively, if you're in the online version of this book, you can link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open in Sense].
41+
Alternatively, if you are reading the online version, you can link:sense_widget.html?snippets/050_Search/Test_data.json[click here to open it in Sense].
5942
6043
****
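The three kinds of search listed in the `00_Intro.asciidoc` hunk above (structured, full-text, and a combination) can be sketched as request bodies. This is only an illustration, not part of the commit: the `gender`, `age`, and `join_date` fields come from the text, while the `about` field and the helper function names are hypothetical, and the `bool`/`filter` syntax assumes a more recent Elasticsearch version than this edition of the book may target.

```python
import json


def structured_search(gender, min_age):
    """Structured query on concrete fields, sorted by join_date.

    Roughly comparable to a SQL WHERE clause plus ORDER BY.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"gender": gender}},
                    {"range": {"age": {"gte": min_age}}},
                ]
            }
        },
        "sort": [{"join_date": {"order": "desc"}}],
    }


def full_text_search(keywords, field="about"):
    """Full-text query; without an explicit sort, hits are ordered by relevance."""
    # "about" is a hypothetical text field used only for illustration.
    return {"query": {"match": {field: keywords}}}


def combined_search(keywords, gender, field="about"):
    """Full-text relevance scoring restricted by a structured filter."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {field: keywords}}],
                "filter": [{"term": {"gender": gender}}],
            }
        }
    }


if __name__ == "__main__":
    # Each body would be sent as the JSON payload of a search request.
    for body in (
        structured_search("female", 30),
        full_text_search("rock climbing"),
        combined_search("rock climbing", "female"),
    ):
        print(json.dumps(body, indent=2))
```

The functions only build dictionaries, so they can be inspected or serialized without a running cluster; sending them to a real `_search` endpoint is left to whichever client is in use.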

410_Scaling/10_Intro.asciidoc

Lines changed: 11 additions & 23 deletions
@@ -1,29 +1,17 @@
11
[[scale]]
2-
== Designing for Scale
2+
== Designing for Scale
33

4-
Elasticsearch is used by some companies to index ((("scaling", "designing for scale")))and search petabytes of data
5-
every day, but most of us start out with something a little more humble in
6-
size. Even if we aspire to be the next Facebook, it is unlikely that our bank
7-
balance matches our aspirations. We need to build for what we have today, but
8-
in a way that will allow us to scale out flexibly and rapidly.
4+
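Some companies use Elasticsearch((("scaling", "designing for scale"))) to index and search petabytes of data every day,
5+
but most of us start out with something more modest in size. Even if we aspire to be the next Facebook, our bank balance is unlikely to keep pace with our ambitions.
6+
We need to build for what we need today, but in a way that lets us scale out flexibly and rapidly.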
97

10-
Elasticsearch is built to scale. It will run very happily on your laptop or
11-
in a cluster containing hundreds of nodes, and the experience is almost
12-
identical. Growing from a small cluster to a large cluster is almost entirely
13-
automatic and painless. Growing from a large cluster to a very large cluster
14-
requires a bit more planning and design, but it is still relatively painless.
8+
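Elasticsearch is built to scale. It runs well on your laptop or on a cluster of hundreds of nodes, and the experience is almost identical.
9+
Growing from a small cluster to a large cluster is almost entirely automatic and painless. Growing from a large cluster to a very large cluster takes some planning and design, but is still relatively painless.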
1510

16-
Of course, it is not magic. Elasticsearch has its limitations too. If you
17-
are aware of those limitations and work with them, the growing process will be
18-
pleasant. If you treat Elasticsearch badly, you could be in for a world of
19-
pain.
11+
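Of course, none of this is magic. Elasticsearch has its limitations too. If you understand those limitations and work with them, growing the cluster will be a pleasant process.
12+
If you treat Elasticsearch badly, you could be in for a world of pain.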
2013

21-
The default settings in Elasticsearch will take you a long way, but to get the
22-
most bang for your buck, you need to think about how data flows through your
23-
system. We will talk about two common data flows: time-based data (such as log
24-
events or social network streams, where relevance is driven by recency), and
25-
user-based data (where a large document collection can be subdivided by user or
26-
customer).
14+
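The default settings in Elasticsearch will take you a long way, but to get the most out of it you need to think about how data flows through your system.
15+
We will discuss two common data flows: time-based data (where relevance is driven by recency, such as log events or social network streams) and user-based data (a large document collection that can be subdivided by user or customer).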
2716

28-
This chapter will help you make the right decisions up front, to avoid
29-
nasty surprises later.
17+
This chapter will help you make the right decisions up front, before you run into unpleasant surprises.
Lines changed: 45 additions & 93 deletions
@@ -1,117 +1,69 @@
11
[[hardware]]
2-
=== Hardware
3-
4-
If you've been following the normal development path, you've probably been playing((("deployment", "hardware")))((("hardware")))
5-
with Elasticsearch on your laptop or on a small cluster of machines laying around.
6-
But when it comes time to deploy Elasticsearch to production, there are a few
7-
recommendations that you should consider. Nothing is a hard-and-fast rule;
8-
Elasticsearch is used for a wide range of tasks and on a bewildering array of
9-
machines. But these recommendations provide good starting points based on our experience with
10-
production clusters.
11-
12-
==== Memory
13-
14-
If there is one resource that you will run out of first, it will likely be memory.((("hardware", "memory")))((("memory")))
15-
Sorting and aggregations can both be memory hungry, so enough heap space to
16-
accommodate these is important.((("heap"))) Even when the heap is comparatively small,
17-
extra memory can be given to the OS filesystem cache. Because many data structures
18-
used by Lucene are disk-based formats, Elasticsearch leverages the OS cache to
19-
great effect.
20-
21-
A machine with 64 GB of RAM is the ideal sweet spot, but 32 GB and 16 GB machines
22-
are also common. Less than 8 GB tends to be counterproductive (you end up
23-
needing many, many small machines), and greater than 64 GB has problems that we will
24-
discuss in <<heap-sizing>>.
2+
=== Hardware
3+
4+
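If you have followed the normal development path,((("deployment", "hardware")))((("hardware"))) you have probably been using Elasticsearch on your laptop or on a small cluster of machines.
5+
But when it comes time to deploy Elasticsearch to production, there are a few recommendations you should consider. None of them is a hard-and-fast rule; Elasticsearch is used for a wide range of tasks on a wide variety of machines. Based on our experience with production Elasticsearch clusters, these recommendations give you a good starting point.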
6+
7+
==== Memory
8+
9+
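If there is one resource that runs out first, it is likely to be memory.((("hardware", "memory")))((("memory"))) Sorting and aggregations can both be memory hungry, so having enough heap space for them is important.((("heap"))) Even when the heap is comparatively small,
10+
extra memory can be given to the OS filesystem cache. Because many of the data structures Lucene uses are disk-based formats, Elasticsearch benefits greatly from the OS cache.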
11+
12+
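A machine with 64 GB of RAM is ideal, but 32 GB and 16 GB machines are also common. Less than 8 GB tends to be counterproductive (you end up needing many, many small machines), and machines with more than 64 GB have problems of their own,
13+
which we discuss in <<heap-sizing>>.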
2514

2615
==== CPUs
2716

28-
Most Elasticsearch deployments tend to be rather light on CPU requirements. As
29-
such,((("CPUs (central processing units)")))((("hardware", "CPUs"))) the exact processor setup matters less than the other resources. You should
30-
choose a modern processor with multiple cores. Common clusters utilize two to eight
31-
core machines.
17+
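Most Elasticsearch deployments tend to be fairly light on CPU requirements. As such,((("CPUs (central processing units)")))((("hardware", "CPUs"))) the exact processor setup matters less than the other resources. You should choose a modern processor with multiple cores; common clusters use machines with two to eight cores.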
3218

33-
If you need to choose between faster CPUs or more cores, choose more cores. The
34-
extra concurrency that multiple cores offers will far outweigh a slightly faster
35-
clock speed.
19+
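If you need to choose between faster CPUs and more cores, choose more cores. The extra concurrency that multiple cores offer far outweighs a slightly faster clock speed.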
3620

37-
==== Disks
21+
==== Disks
3822

39-
Disks are important for all clusters,((("disks")))((("hardware", "disks"))) and doubly so for indexing-heavy clusters
40-
(such as those that ingest log data). Disks are the slowest subsystem in a server,
41-
which means that write-heavy clusters can easily saturate their disks, which in
42-
turn become the bottleneck of the cluster.
23+
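Disks are important for all clusters,((("disks")))((("hardware", "disks"))) and doubly so for indexing-heavy clusters (such as those that ingest log data). Disks are the slowest subsystem in a server, which means write-heavy clusters can easily saturate their disks, making them the bottleneck of the cluster.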
4324

44-
If you can afford SSDs, they are by far superior to any spinning media. SSD-backed
45-
nodes see boosts in both query and indexing performance. If you can afford it,
46-
SSDs are the way to go.
25+
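If you can afford SSDs, they are far superior to any spinning media (note: mechanical hard disks, tape, and so on). SSD-backed nodes see improvements in both query and indexing performance. If you can afford them, SSDs are a good choice.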
4726

48-
.Check Your I/O Scheduler
49-
****
50-
If you are using SSDs, make sure your OS I/O scheduler is((("I/O scheduler"))) configured correctly.
51-
When you write data to disk, the I/O scheduler decides when that data is
52-
_actually_ sent to the disk. The default under most *nix distributions is a
53-
scheduler called `cfq` (Completely Fair Queuing).
54-
55-
This scheduler allocates _time slices_ to each process, and then optimizes the
56-
delivery of these various queues to the disk. It is optimized for spinning media:
57-
the nature of rotating platters means it is more efficient to write data to disk
58-
based on physical layout.
59-
60-
This is inefficient for SSD, however, since there are no spinning platters
61-
involved. Instead, `deadline` or `noop` should be used instead. The deadline
62-
scheduler optimizes based on how long writes have been pending, while `noop`
63-
is just a simple FIFO queue.
64-
65-
This simple change can have dramatic impacts. We've seen a 500-fold improvement
66-
to write throughput just by using the correct scheduler.
27+
.Check Your I/O Scheduler
6728
****
29+
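If you are using SSDs, make sure your OS I/O scheduler is((("I/O scheduler"))) configured correctly.
30+
When you write data to disk, the I/O scheduler decides when that data is actually sent to the disk.
31+
The default scheduler under most *nix distributions is called `cfq` (Completely Fair Queuing).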
32+
33+
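This scheduler allocates _time slices_ to each process and then optimizes the delivery of these various queues to the disk. But it is optimized for spinning media:
34+
the nature of rotating platters means it is more efficient to write data to disk based on physical layout.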
6835
69-
If you use spinning media, try to obtain the fastest disks possible (high-performance server disks, 15k RPM drives).
36+
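This is inefficient for SSDs, however, because no spinning platters are involved. `deadline` or `noop` should be used instead. The `deadline` scheduler optimizes based on how long writes have been pending,
37+
while `noop` is just a simple FIFO queue.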
7038
71-
Using RAID 0 is an effective way to increase disk speed, for both spinning disks
72-
and SSD. There is no need to use mirroring or parity variants of RAID, since
73-
high availability is built into Elasticsearch via replicas.
39+
This simple change can have a dramatic impact. We have seen a 500-fold improvement in write throughput just by using the correct scheduler.
40+
****
7441

75-
Finally, avoid network-attached storage (NAS). People routinely claim their
76-
NAS solution is faster and more reliable than local drives. Despite these claims,
77-
we have never seen NAS live up to its hype. NAS is often slower, displays
78-
larger latencies with a wider deviation in average latency, and is a single
79-
point of failure.
42+
If you use spinning media, try to obtain the fastest disks possible (high-performance server disks, 15k RPM drives).
8043

81-
==== Network
44+
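Using RAID 0 is an effective way to increase disk speed, for both spinning disks and SSDs. There is no need to use mirroring or parity variants of RAID, because high availability is built into Elasticsearch via replicas.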
8245

83-
A fast and reliable network is obviously important to performance in a distributed((("hardware", "network")))((("network")))
84-
system. Low latency helps ensure that nodes can communicate easily, while
85-
high bandwidth helps shard movement and recovery. Modern data-center networking
86-
(1 GbE, 10 GbE) is sufficient for the vast majority of clusters.
46+
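Finally, avoid network-attached storage (NAS). People routinely claim that their NAS solution is faster and more reliable than local drives. Despite these claims,
47+
we have never seen NAS live up to its hype. NAS is often slower, shows larger latencies with a wider deviation in average latency, and is a single point of failure.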
8748

88-
Avoid clusters that span multiple data centers, even if the data centers are
89-
colocated in close proximity. Definitely avoid clusters that span large geographic
90-
distances.
49+
==== Network
9150

92-
Elasticsearch clusters assume that all nodes are equal--not that half the nodes
93-
are actually 150ms distant in another data center. Larger latencies tend to
94-
exacerbate problems in distributed systems and make debugging and resolution
95-
more difficult.
51+
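A fast and reliable network is obviously important to performance in a distributed system((("hardware", "network")))((("network"))).
52+
Low latency helps ensure that nodes can communicate easily, while high bandwidth helps shard movement and recovery. Modern data-center networking (1 GbE, 10 GbE) is sufficient for the vast majority of clusters.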
9653

97-
Similar to the NAS argument, everyone claims that their pipe between data centers is
98-
robust and low latency. This is true--until it isn't (a network failure will
99-
happen eventually; you can count on it). From our experience, the hassle of
100-
managing cross&#x2013;data center clusters is simply not worth the cost.
54+
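Avoid clusters that span multiple data centers, even if the data centers are in close proximity. Definitely avoid clusters that span large geographic distances.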
10155

102-
==== General Considerations
56+
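Elasticsearch clusters assume that all nodes are equal, not that half the nodes are actually 150ms away in another data center. Larger latencies tend to exacerbate problems in distributed systems and make debugging and resolution more difficult.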
10357

104-
It is possible nowadays to obtain truly enormous machines:((("hardware", "general considerations"))) hundreds of gigabytes
105-
of RAM with dozens of CPU cores. Conversely, it is also possible to spin up
106-
thousands of small virtual machines in cloud platforms such as EC2. Which
107-
approach is best?
58+
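As with the NAS argument, everyone claims that the pipe between their data centers is robust and low latency. This is true, until it isn't (a network failure will happen eventually; you can count on it).
59+
In our experience, the hassle of managing cross-data-center clusters is simply not worth the cost.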
10860

109-
In general, it is better to prefer medium-to-large boxes. Avoid small machines,
110-
because you don't want to manage a cluster with a thousand nodes, and the overhead
111-
of simply running Elasticsearch is more apparent on such small boxes.
61+
==== General Considerations
11262

113-
At the same time, avoid the truly enormous machines. They often lead to imbalanced
114-
resource usage (for example, all the memory is being used, but none of the CPU) and can
115-
add logistical complexity if you have to run multiple nodes per machine.
63+
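It is possible nowadays to obtain truly enormous machines:((("hardware", "general considerations"))) hundreds of gigabytes of RAM with dozens of CPU cores.
64+
Conversely, it is also possible to spin up thousands of small virtual machines on cloud platforms such as EC2. Which approach is best?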
11665

66+
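In general, it is better to prefer medium-to-large boxes. Avoid small machines,
67+
because you don't want to manage a cluster with a thousand nodes, and the overhead of simply running Elasticsearch is more apparent on such small boxes.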
11768

69+
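At the same time, avoid the truly enormous machines. They often lead to imbalanced resource usage (for example, all the memory is being used, but none of the CPU) and can add logistical complexity if you have to run multiple nodes per machine.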
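The I/O scheduler advice in the hardware hunk above can be checked on a running machine. The following is a minimal sketch, not part of the commit, and it rests on an assumption the source does not state: that the host is Linux and exposes each block device's schedulers in `/sys/block/<device>/queue/scheduler`, with the active one shown in square brackets (for example `noop deadline [cfq]`). The function names are hypothetical, and newer kernels may list different scheduler names such as `mq-deadline` or `none`.

```python
import glob
import os
import re


def active_scheduler(sysfs_text):
    """Return the scheduler name shown in [brackets], or None if none is marked."""
    match = re.search(r"\[([^\]]+)\]", sysfs_text)
    return match.group(1) if match else None


def report_schedulers(pattern="/sys/block/*/queue/scheduler"):
    """Map each block device name to its active scheduler.

    Assumes the Linux sysfs layout; on other systems the glob matches
    nothing and an empty dict is returned.
    """
    report = {}
    for path in sorted(glob.glob(pattern)):
        # The device name is the directory two levels above the "scheduler" file.
        device = os.path.basename(os.path.dirname(os.path.dirname(path)))
        with open(path) as handle:
            report[device] = active_scheduler(handle.read())
    return report


if __name__ == "__main__":
    for device, scheduler in report_schedulers().items():
        print(f"{device}: {scheduler}")
```

A device that reports `cfq` while backed by an SSD is a candidate for the `deadline` or `noop` change described in the text; how to change the scheduler depends on the distribution and is not shown here.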
