Querying Data in Elasticsearch (Python Implementation): A Summary

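Background

Elasticsearch was created for full-text search, and in practice querying data is the main way a client interacts with the cluster. Let us first look at how Elasticsearch queries work: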

Elasticsearch is based on

This article uses the Python API (official package: https://github.com/elastic/elasticsearch-py) for its implementation examples.

Part 1: How Queries Work

https://pdai.tech/md/db/nosql-es/elasticsearch-y-th-4.html

Part 1: The `Search` API

1.1 API Example

We use the official Elasticsearch Python package to run the query:

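from elasticsearch import Elasticsearch

# Connect to the cluster over HTTP.
elastic_client = Elasticsearch('http://192.168.52.142:9200')

# Full-text match on the text_entry field.
query_body = {
    "query": {
        "bool": {
            "must": {
                "match": {
                    "text_entry": "we"
                }
            }
        }
    },
    "from": 1,  # zero-based offset: skip the first hit
    "size": 1   # return a single document
}

# Send the request body to the shakespeare index and print the raw response.
result = elastic_client.search(index="shakespeare", body=query_body)
print(result)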

The query returns:

{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 3109,
            "relation": "eq"
        },
        "max_score": 5.6665335,
        "hits": [
            {
                "_index": "shakespeare",
                "_type": "_doc",
                "_id": "35519",
                "_score": 5.6665335,
                "_source": {
                    "type": "line",
                    "line_id": 35520,
                    "play_name": "Hamlet",
                    "speech_number": 33,
                    "line_number": "4.5.119",
                    "speaker": "Danes",
                    "text_entry": "We will, we will."
                }
            }
        ]
    }
}

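The fields are explained below, in order:

took: the time Elasticsearch spent executing the query, in milliseconds;
timed_out: whether the query timed out;
_shards: information about the shards searched, where total is the number of shards, successful the number that succeeded, skipped the number skipped, and failed the number that failed;
hits: the query results;
hits.total: an object describing the total number of documents that match the search criteria;
hits.total.value: the number of hits (to be read together with hits.total.relation);
hits.total.relation: when it is eq, hits.total.value is an exact count; when it is gte, hits.total.value is only a lower bound;
hits.max_score: the highest score among the matching documents;
hits.hits: the list holding the returned documents (the first 10 by default; controlled by the size parameter of the query).

The sort field of each entry in hits.hits holds that document's sort key. It appears only when the request specifies a sort; otherwise results are ordered by score.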
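The field descriptions above can be turned into a small helper that pulls the commonly used values out of a response. This is a minimal sketch: `sample_response` is a trimmed copy of the response shown earlier, and `summarize_response` is a hypothetical helper name, not part of the elasticsearch-py package.

```python
from typing import Any, Dict

# Trimmed copy of the response shown above (only the fields used here).
sample_response: Dict[str, Any] = {
    "took": 0,
    "timed_out": False,
    "_shards": {"total": 1, "successful": 1, "skipped": 0, "failed": 0},
    "hits": {
        "total": {"value": 3109, "relation": "eq"},
        "max_score": 5.6665335,
        "hits": [
            {
                "_index": "shakespeare",
                "_id": "35519",
                "_score": 5.6665335,
                "_source": {"speaker": "Danes", "text_entry": "We will, we will."},
            }
        ],
    },
}


def summarize_response(response: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the fields described above into one flat dictionary."""
    hits = response["hits"]
    # hits.total can be absent when total tracking is disabled.
    total = hits.get("total", {})
    return {
        "took_ms": response["took"],
        "timed_out": response["timed_out"],
        "failed_shards": response["_shards"]["failed"],
        "total_value": total.get("value"),
        "total_is_exact": total.get("relation") == "eq",
        "documents": [hit["_source"] for hit in hits["hits"]],
    }


summary = summarize_response(sample_response)
print(summary["total_value"], summary["total_is_exact"])
```

With a live cluster, the dictionary returned by the search call in the example above could be passed to `summarize_response` in place of `sample_response`.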

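Now let us focus on the hits.total object. The track_total_hits parameter controls how accurate the total is; it can be set to false, true, or a positive integer. When it is false, hits are not counted and the response omits total. When it is true, the count is always exact. When it is a positive integer, hits are counted exactly up to that number, for example 1000. Whenever total is returned, it has the structure seen in the response above: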

"total": {
    "value": 3109,
    "relation": "eq"
}

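With track_total_hits set to 1000, if the number of hits is less than or equal to 1000, value is an exact count and relation is eq; if there are more than 1000 hits, value equals 1000 and relation is gte. If you need an exact count, set this value very high or set it to true; the setting exists mainly for performance reasons. In practice, check relation first: eq means value is the real number of hits, while gte means value is only a lower bound.
So how should track_total_hits be set? That depends on the business requirements and the use case. The following three principles are a reasonable guide:

Keep the default of 10000. This is enough for ordinary business needs: even on large e-commerce sites such as Taobao or JD.com, at 40 results per page 10000 results fill 250 pages, and it is unlikely that any user browses past page 250; most users only look at the first 10 pages.
If you need the exact number of matching documents, set track_total_hits to true. Be aware that when the number of matching documents is very large this slows the query down and uses a lot of memory, possibly to the point of out-of-memory errors.
If you are certain you do not need the number of hits, set track_total_hits to false, which improves query performance.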
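The three principles can be sketched in Python as follows. The helper names `with_total_tracking` and `describe_total` are hypothetical and only illustrate how the request body is built and how the relation field is checked before trusting value.

```python
from typing import Any, Dict, Union


def with_total_tracking(body: Dict[str, Any],
                        track_total_hits: Union[bool, int]) -> Dict[str, Any]:
    """Return a copy of the request body with track_total_hits set."""
    return {**body, "track_total_hits": track_total_hits}


def describe_total(hits: Dict[str, Any]) -> str:
    """Interpret the hits object of a search response."""
    total = hits.get("total")
    if total is None:
        # track_total_hits was false, so no total was computed.
        return "total not tracked"
    if total["relation"] == "eq":
        return f"exactly {total['value']} hits"
    # relation is gte: value is only a lower bound.
    return f"at least {total['value']} hits"


query = {"query": {"match": {"text_entry": "we"}}}
print(with_total_tracking(query, 1000))
print(describe_total({"total": {"value": 1000, "relation": "gte"}}))
```

The body returned by `with_total_tracking` would be passed to the client's search call in the same way as the query body in section 1.1.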

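Ordinary paging can be implemented with from and size, but this approach should be avoided for deep paging. As the requested page number grows, the memory and time consumed grow in proportion. To avoid the problems caused by deep paging, Elasticsearch has enforced a limit since version 2.1: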

index.max_result_window = 10000
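A request is rejected when from + size exceeds this window, so a client can check the bound before sending a page request. The sketch below assumes the default window of 10000; the function names are hypothetical.

```python
DEFAULT_MAX_RESULT_WINDOW = 10000  # default value of index.max_result_window


def fits_result_window(from_: int, size: int,
                       max_result_window: int = DEFAULT_MAX_RESULT_WINDOW) -> bool:
    """True when the page ends inside the result window (from + size <= window)."""
    return from_ + size <= max_result_window


def last_reachable_page(page_size: int,
                        max_result_window: int = DEFAULT_MAX_RESULT_WINDOW) -> int:
    """Number of full pages that can be fetched with from/size paging."""
    return max_result_window // page_size


print(fits_result_window(9990, 10))   # the page ending exactly at the window
print(fits_result_window(9991, 10))   # one document past the window
print(last_reachable_page(40))        # the 40-results-per-page case from above
```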

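1.2 API Notes

The search API can fetch data, but it has the following two problems:

1. Only 10 documents are returned by default

This is because Elasticsearch sets the default number of returned documents to 10, so no matter how many documents the index holds, only the first 10 are returned.

The size parameter sets how many documents are returned, but this raises another problem: what if there are more than 10000 documents (the upper bound on from + size)? Fetching data this way also appears to be slow.

2. More than 10000 documents, and the query is not fast enough

This is where the scan API comes in.

When a query is sent to Elasticsearch, the returned documents are sorted by their computed score by default.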

1.3 Python Implementation
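As a minimal sketch of from/size paging, the generator below walks the result pages until the hits run out or the result window is reached. The name `iter_pages` is hypothetical, and the search function is passed in as an argument (in real use, the search method of the client created in section 1.1) so that the loop can be tried without a cluster.

```python
from typing import Any, Callable, Dict, Iterator, List

SearchFn = Callable[..., Dict[str, Any]]


def iter_pages(search: SearchFn, index: str, query: Dict[str, Any],
               page_size: int = 10,
               max_result_window: int = 10000) -> Iterator[List[Dict[str, Any]]]:
    """Yield successive pages of hits using from/size paging."""
    offset = 0
    # Stop before from + size would exceed the result window.
    while offset + page_size <= max_result_window:
        body = {"query": query, "from": offset, "size": page_size}
        response = search(index=index, body=body)
        hits = response["hits"]["hits"]
        if not hits:
            break  # no more matching documents
        yield hits
        offset += page_size
```

Each page costs a full query, so this pattern suits shallow paging only; the scroll and scan approaches in the following parts are meant for large exports.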

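Part 2: The Scroll API

What is scroll in Elasticsearch?
It can be thought of as the counterpart of a cursor in MySQL. When a single request would return a very large amount of data, you can use a streaming interface such as scroll, which marks the results you need so that they can be fetched in batches.
However, a scroll query still sorts the data, which affects performance. If you simply want the data, you can use scan, which tells Elasticsearch not to sort. Scan mode walks the data in each shard and just collects the matches, without the sorting that scroll performs.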

https://zhuanlan.zhihu.com/p/392363821

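How slice is used and how it works: several threads can pull slices in parallel to speed up the query.

Slice can be used with search + scroll, and also when exporting clustering results.
Remember to delete a scroll as soon as you are done with it. By default Elasticsearch keeps at most 500 open scroll contexts; once that limit is reached, no new scroll can be opened.
Pay attention to how long the scroll state is kept alive. If processing each batch takes a long time, or the query itself is slow, set a suitably long keep-alive.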
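The slice idea can be sketched as follows: each worker gets the same query plus a slice object carrying its own id and the total number of slices, and the workers run in a thread pool. The helper names are hypothetical, and `fetch_slice` stands for whatever function runs one scroll per body; it is an assumption of this sketch rather than a library call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List


def sliced_bodies(query: Dict[str, Any], num_slices: int) -> List[Dict[str, Any]]:
    """Build one request body per slice: {"slice": {"id": i, "max": n}}."""
    if num_slices < 2:
        # Assumption: Elasticsearch expects max to be greater than 1.
        raise ValueError("num_slices must be at least 2")
    return [
        {"query": query, "slice": {"id": slice_id, "max": num_slices}}
        for slice_id in range(num_slices)
    ]


def run_slices(bodies: List[Dict[str, Any]],
               fetch_slice: Callable[[Dict[str, Any]], Any]) -> List[Any]:
    """Run fetch_slice for every body in parallel; results keep slice order."""
    with ThreadPoolExecutor(max_workers=len(bodies)) as pool:
        return list(pool.map(fetch_slice, bodies))


bodies = sliced_bodies({"match_all": {}}, 3)
print([body["slice"] for body in bodies])
```

Each slice opens its own scroll context, so the advice above about deleting scrolls and choosing a keep-alive applies to every worker.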

2.1 API Example

2.2 API Notes

2.3 Python Implementation

https://simplernerd.com/elasticsearch-scroll-python/
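A scroll loop might look like the sketch below. It assumes a 7.x elasticsearch-py client, whose search, scroll and clear_scroll methods are called with the keyword arguments shown; `scroll_all` is a hypothetical helper name, and the keep-alive and page size defaults are arbitrary choices.

```python
from typing import Any, Dict, Iterator


def scroll_all(client: Any, index: str, query: Dict[str, Any],
               keep_alive: str = "2m", page_size: int = 1000) -> Iterator[Dict[str, Any]]:
    """Yield every hit of a query by following the scroll cursor."""
    # The first request opens the scroll context and returns the first batch.
    response = client.search(index=index, body={"query": query},
                             scroll=keep_alive, size=page_size)
    scroll_id = response.get("_scroll_id")
    try:
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            # Each follow-up request renews the keep-alive and fetches the next batch.
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the scroll context as soon as we are done (or on error).
        if scroll_id is not None:
            client.clear_scroll(scroll_id=scroll_id)
```

The finally clause releases the scroll context even when the caller stops iterating early, which keeps the number of open contexts low.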

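Part 3: The Scan API

According to the official documentation:

No sorting; it simply retrieves the data.

If you need deep paging without sorting, the combination of scan and scroll is the best choice.

The scan + scroll streaming interface is simple to use: add search_type=scan to the URL and set scroll to 3 minutes, and Elasticsearch keeps the results of that query marked for 3 minutes. (search_type=scan belongs to older releases; it was removed in Elasticsearch 5.0, where a scroll sorted by _doc serves the same purpose.)
With a size of 1000, the limit applies to each shard; it does not cap the whole result at 1000. If the index has 10 shards, each batch can contain up to 1000 * 10 documents.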
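In Python, the scan helper of the elasticsearch-py package wraps the scroll calls and returns a generator. The sketch below is one possible way to use it: `iter_sources` is a hypothetical wrapper, the 3-minute keep-alive and size of 1000 mirror the values mentioned above, and the optional scan argument exists only so the function can be tried with a stub instead of a live cluster.

```python
from typing import Any, Callable, Dict, Iterator, Optional


def iter_sources(client: Any, index: str, query: Dict[str, Any],
                 page_size: int = 1000, keep_alive: str = "3m",
                 scan: Optional[Callable[..., Iterator[Dict[str, Any]]]] = None
                 ) -> Iterator[Dict[str, Any]]:
    """Yield the _source of every matching document, one at a time."""
    if scan is None:
        # Imported lazily so the function can be exercised with a stub
        # when the elasticsearch package is not installed.
        from elasticsearch.helpers import scan
    # helpers.scan takes the full request body as its query argument.
    for hit in scan(client, index=index, query={"query": query},
                    size=page_size, scroll=keep_alive):
        yield hit["_source"]
```

Because the documents arrive one by one, the caller never holds the whole result set in memory.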

Part 4: Asynchronous Search

https://blog.csdn.net/UbuntuTouch/article/details/107868114

https://www.elastic.co/guide/en/elasticsearch/reference/7.16/async-search.html

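Part 4: Summary

When you need to read a large amount of data from Elasticsearch, _search no longer fits the use case, and helpers.scan is recommended instead. helpers.scan returns a generator, which saves a great deal of memory, and for bulk reads it is much faster than search.

search controls the number of returned documents with the from and size parameters. Deep paging with from and size, such as size=10&from=10000, is very inefficient: every shard has to produce its top 10,010 sorted results, and the coordinating node then has to re-sort all of them only to return 10. This work is repeated for every page requested. scroll can page through the data and return large result sets as well. search, by contrast, returns its data as a list, so a large result set costs a lot of memory and is slow to transfer.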
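The cost of deep paging can be made concrete with a little arithmetic. In the sketch below the function name is hypothetical and the shard count of 5 is an assumption chosen for illustration; the figures are upper bounds, reached when every shard holds at least from + size matching documents.

```python
from typing import Dict


def deep_paging_cost(from_: int, size: int, shard_count: int) -> Dict[str, int]:
    """Upper bound on the work done for one from/size page."""
    per_shard = from_ + size           # each shard sorts and returns its top from + size
    merged = per_shard * shard_count   # entries the coordinating node must re-sort
    return {
        "per_shard": per_shard,
        "merged": merged,
        "discarded": merged - size,    # everything except the requested page
    }


cost = deep_paging_cost(from_=10000, size=10, shard_count=5)
print(cost)
```

For size=10 and from=10000 on 5 shards, more than fifty thousand sorted entries are gathered and almost all of them are thrown away, which is why scroll or helpers.scan is preferred for large reads.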

Part 5: Appendix

5.1 The `track_total_hits` Parameter

Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as “there are at least 10000 hits”, the default is set to 10,000. This means that requests will count the total hits accurately up to 10,000 hits. It is a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.

When set to true the search response will always track the number of hits that match the query accurately (e.g. total.relation will always be equal to "eq" when track_total_hits is set to true). Otherwise the "total.relation" returned in the "total" object in the search response determines how the "total.value" should be interpreted. A value of "gte" means that the "total.value" is a lower bound of the total hits that match the query and a value of "eq" indicates that "total.value" is the accurate count.

GET twitter/_search
{
    "track_total_hits": true,
     "query": {
        "match" : {
            "message" : "Elasticsearch"
        }
     }
}


… returns:

{
    "_shards": ...
    "timed_out": false,
    "took": 100,
    "hits": {
        "max_score": 1.0,
        "total" : {
            "value": 2048,    
            "relation": "eq"  
        },
        "hits": ...
    }
}

value: the total number of hits that match the query.
relation: the count is accurate (e.g. `"eq"` means equals).

It is also possible to set track_total_hits to an integer. For instance the following query will accurately track the total hit count that match the query up to 100 documents:

GET twitter/_search
{
    "track_total_hits": 100,
     "query": {
        "match" : {
            "message" : "Elasticsearch"
        }
     }
}


The hits.total.relation in the response will indicate if the value returned in hits.total.value is accurate ("eq") or a lower bound of the total ("gte").

For instance the following response:

{
    "_shards": ...
    "timed_out": false,
    "took": 30,
    "hits" : {
        "max_score": 1.0,
        "total" : {
            "value": 42,         
            "relation": "eq"     
        },
        "hits": ...
    }
}

value: 42 documents match the query,
relation: and the count is accurate (`"eq"`).

… indicates that the number of hits returned in the total is accurate.

If the total number of hits that match the query is greater than the value set in track_total_hits, the total hits in the response will indicate that the returned value is a lower bound:

{
    "_shards": ...
    "hits" : {
        "max_score": 1.0,
        "total" : {
            "value": 100,         
            "relation": "gte"     
        },
        "hits": ...
    }
}

value: there are at least 100 documents that match the query.
relation: this is a lower bound (`"gte"`).

If you don’t need to track the total number of hits at all you can improve query times by setting this option to false:

GET twitter/_search
{
    "track_total_hits": false,
     "query": {
        "match" : {
            "message" : "Elasticsearch"
        }
     }
}


… returns:

{
    "_shards": ...
    "timed_out": false,
    "took": 10,
    "hits" : { 
        "max_score": 1.0,
        "hits": ...
    }
}

hits: the total number of hits is unknown.

Finally you can force an accurate count by setting "track_total_hits" to true in the request.

References

1. Official Elasticsearch client library for Python: https://github.com/elastic/elasticsearch-py

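Contents

Background

Part 1: How Queries Work

Part 1: The Search API

1.1 API Example

1.2 API Notes

1.3 Python Implementation

Part 2: The Scroll API

2.1 API Example

2.2 API Notes

2.3 Python Implementation

Part 3: The Scan API

Part 4: Asynchronous Search

Part 4: Summary

Part 5: Appendix

5.1 The track_total_hits Parameter

References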
