Filter pushdown during IndexScan

Xscaper · August 28, 2024, 00:41

nebula version: 3.8
Deployment: single machine
Installation: compiled from source
In production: N

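First, I noticed that when an index can be used for a seek, the relevant conditions in the Filter are pushed down, but when the plan falls back to a full index scan, they are not.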

Below are two execution plans reproduced with the basketballplayer dataset; only the IndexScan operator is shown.

(root@nebula) [basketballplayer]> EXPLAIN MATCH(v:player) WHERE v.player.name == "a" and v.player.age == 1 RETURN v
-----+----------------+--------------+----------------+--------------------------------------------------------------
|  7 | IndexScan      | 2            |                | outputVar: {                                                |
|    |                |              |                |   "colNames": [                                             |
|    |                |              |                |     "_vid"                                                  |
|    |                |              |                |   ],                                                        |
|    |                |              |                |   "type": "DATASET",                                        |
|    |                |              |                |   "name": "__IndexScan_1"                                   |
|    |                |              |                | }                                                           |
|    |                |              |                | inputVar:                                                   |
|    |                |              |                | space: 59                                                   |
|    |                |              |                | dedup: 0                                                    |
|    |                |              |                | limit: 9223372036854775807                                  |
|    |                |              |                | filter:                                                     |
|    |                |              |                | orderBy: []                                                 |
|    |                |              |                | schemaId: 60                                                |
|    |                |              |                | isEdge: false                                               |
|    |                |              |                | returnCols: [                                               |
|    |                |              |                |   "_vid"                                                    |
|    |                |              |                | ]                                                           |
|    |                |              |                | indexCtx: [                                                 |
|    |                |              |                |   {                                                         |
|    |                |              |                |     "columnHints": [                                        |
|    |                |              |                |       {                                                     |
|    |                |              |                |         "includeEnd": false,                                |
|    |                |              |                |         "includeBegin": true,                               |
|    |                |              |                |         "endValue": "__EMPTY__",                            |
|    |                |              |                |         "beginValue": "a",                                  |
|    |                |              |                |         "scanType": "PREFIX",                               |
|    |                |              |                |         "column": "name"                                    |
|    |                |              |                |       }                                                     |
|    |                |              |                |     ],                                                      |
|    |                |              |                |     "filter": "((player.name==\"a\") AND (player.age==1))", |
|    |                |              |                |     "index_id": 65                                          |
|    |                |              |                |   }                                                         |
|    |                |              |                | ]                                                           |
-----+----------------+--------------+----------------+--------------------------------------------------------------


(root@nebula) [basketballplayer]> EXPLAIN MATCH(v:player) WHERE  v.player.age == 1 RETURN v
-----+----------------+--------------+----------------+-----------------------------------
|  7 | IndexScan      | 2            |                | outputVar: {                     |
|    |                |              |                |   "colNames": [                  |
|    |                |              |                |     "_vid"                       |
|    |                |              |                |   ],                             |
|    |                |              |                |   "type": "DATASET",             |
|    |                |              |                |   "name": "__IndexScan_1"        |
|    |                |              |                | }                                |
|    |                |              |                | inputVar:                        |
|    |                |              |                | space: 59                        |
|    |                |              |                | dedup: 0                         |
|    |                |              |                | limit: 9223372036854775807       |
|    |                |              |                | filter:                          |
|    |                |              |                | orderBy: []                      |
|    |                |              |                | schemaId: 60                     |
|    |                |              |                | isEdge: false                    |
|    |                |              |                | returnCols: [                    |
|    |                |              |                |   "_vid"                         |
|    |                |              |                | ]                                |
|    |                |              |                | indexCtx: [                      |
|    |                |              |                |   {                              |
|    |                |              |                |     "columnHints": [],           |
|    |                |              |                |     "filter": "",                |
|    |                |              |                |     "index_id": 64               |
|    |                |              |                |   }                              |
|    |                |              |                | ]                                |
-----+----------------+--------------+----------------+-----------------------------------

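As you can see, in the first plan, which triggers a PREFIX scan, the condition is also pushed down into indexCtx; in the second plan it is not.
I think the filter should be pushed down in the second case as well, since that could noticeably reduce memory usage and speed up index scans over large data sets.
I read the source (IndexScanRule) and found the root cause: the filter is only passed in and pushed down inside the functions that find the optimal index. So I changed part of the logic so that the filter is also pushed down when no index that supports a precise seek is found.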

// OptimizerUtils.cpp
Status OptimizerUtils::createIndexQueryCtx(std::vector<IndexQueryContext>& iqctx,
                                           graph::QueryContext* qctx,
                                           const IndexScan* node) {
  // No index supports a precise seek here, so fall back to the lightest index.
  auto index = findLightestIndex(qctx, node);
  if (index == nullptr) {
    return Status::IndexNotFound("No valid index found");
  }
  auto in = static_cast<const IndexScan*>(node);
  // Decode the filter carried by the node's first query context.
  auto* filter = Expression::decode(qctx->objPool(), in->queryContext().begin()->get_filter());
  auto* newFilter = ExpressionUtils::rewriteParameter(filter, qctx);
  // Changed: pass the filter along so that it is pushed down with the full index scan.
  return appendIQCtx(index, iqctx, newFilter);
}

I also modified appendIQCtx, which is called at the end, so that it pushes the filter down when a non-empty filter is passed in.
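The motivation for pushing the filter into a full index scan can be illustrated with a toy model. This is only a sketch: `full_index_scan`, the tuple layout, and the sample entries are all hypothetical and do not mirror NebulaGraph's storage code. It simply shows that applying the predicate while walking the index returns fewer vertex IDs to the caller.

```python
# Toy model: all names and data here are made up for illustration.
# Each index entry is (name, age, vid).
index = [
    ("n1", 42, "p1"),
    ("n2", 36, "p2"),
    ("n3", 1, "p3"),
    ("n4", 1, "p4"),
]


def full_index_scan(entries, pushed_filter=None):
    """Walk every index entry; apply the filter in place if one was pushed down."""
    vids = []
    for entry in entries:
        if pushed_filter is None or pushed_filter(entry):
            vids.append(entry[2])
    return vids


def age_is_one(entry):
    """Stand-in for the condition player.age == 1."""
    return entry[1] == 1


# Without pushdown, every vid leaves the scan and must be filtered later.
without_pushdown = full_index_scan(index)
# With pushdown, only matching vids leave the scan.
with_pushdown = full_index_scan(index, age_is_one)

print(len(without_pushdown), len(with_pushdown))  # 4 2
```

In this model the scan still visits every entry in both cases; the difference is only in how many rows are handed to the next operator.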

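I then found a second problem: even with this change, a condition that uses OR still cannot be pushed down.
I checked the source again and found the cause. When a MATCH statement looks for its starting point, it calls PropIndexSeek and then LabelIndexSeek in turn; the latter is called only when the former cannot match, and the former does not accept OR. In the latter, the filter's expression type is not converted (LabelTagPropertyExpression → TagPropertyExpression), so pushing it down causes storage to report an error and crash.
So I also changed the relevant logic (the MatchSolver::makeIndexFilter function) so that it accepts OR logical expressions and pushes them down correctly as well.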
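The expression conversion described above (LabelTagPropertyExpression → TagPropertyExpression, applied through both AND and OR) can be sketched in Python. This is a simplified model, not the C++ code in MatchSolver::makeIndexFilter: the classes `LabelTagProperty`, `TagProperty`, `Compare`, `Logical` and the helpers `to_storage_filter` and `render` are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LabelTagProperty:
    """Models a graph-side reference such as v.player.age."""
    label: str
    tag: str
    prop: str


@dataclass(frozen=True)
class TagProperty:
    """Models a storage-side reference such as player.age."""
    tag: str
    prop: str


@dataclass(frozen=True)
class Compare:
    """Models <property> == <constant>."""
    left: Union[LabelTagProperty, TagProperty]
    value: object


@dataclass(frozen=True)
class Logical:
    """Models a binary AND / OR."""
    op: str
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Compare, Logical]


def to_storage_filter(expr):
    """Recursively drop the label from property references, keeping AND/OR structure."""
    if isinstance(expr, Logical):
        if expr.op not in ("AND", "OR"):
            raise ValueError(f"unsupported logical operator: {expr.op}")
        return Logical(expr.op, to_storage_filter(expr.lhs), to_storage_filter(expr.rhs))
    if isinstance(expr, Compare):
        left = expr.left
        if isinstance(left, LabelTagProperty):
            left = TagProperty(left.tag, left.prop)
        return Compare(left, expr.value)
    raise TypeError(f"unsupported expression: {expr!r}")


def render(expr):
    """Print an expression in a form similar to the filter string in the plan."""
    if isinstance(expr, Logical):
        return f"({render(expr.lhs)} {expr.op} {render(expr.rhs)})"
    left = expr.left
    if isinstance(left, LabelTagProperty):
        name = f"{left.label}.{left.tag}.{left.prop}"
    else:
        name = f"{left.tag}.{left.prop}"
    return f"({name}=={expr.value!r})"


where = Logical(
    "OR",
    Compare(LabelTagProperty("v", "player", "name"), "a"),
    Compare(LabelTagProperty("v", "player", "age"), 1),
)
pushed = to_storage_filter(where)
print(render(where))   # ((v.player.name=='a') OR (v.player.age==1))
print(render(pushed))  # ((player.name=='a') OR (player.age==1))
```

The point of the sketch is that the rewrite has to recurse into both operands of every logical node; converting only the top-level expression would leave label-qualified references inside an OR.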

Could these changes cause errors in other scenarios? I have tested simple vertex queries and vertex-plus-edge queries myself, and they appear to work.

MuYi-方扬 · August 28, 2024, 22:41

You're welcome to try submitting a PR!

Xscaper · August 31, 2024, 03:14

While preparing the PR I also found a bug in the tck tests: a dict object nested one level deep is not parsed correctly.
In tests/common/plan_differ.py:

    def _is_subdict_nested(self, expect, resp):
        key_list = []
        extracted_expected_dict = expect

        # Extract the innermost dict of nested dict and save the keys into key_list
        while (len(extracted_expected_dict) == 1 and
               isinstance(list(extracted_expected_dict.values())[0], dict)):
            k = list(extracted_expected_dict.keys())[0]
            v = list(extracted_expected_dict.values())[0]
            key_list.append(k)
            extracted_expected_dict = v
        # The inner map cannot be empty
        if len(extracted_expected_dict) == 0:
            return None
        # Unnested dict, push the first key into list
        if extracted_expected_dict == expect:
            key_list.append(list(expect.keys())[0])

        extracted_resp_dict = {}
        if len(key_list) == 1:
            for k in resp:
                extracted_resp_dict[k] = _try_convert_json(resp[k])
        else:
            extracted_resp_dict = self._convert_jsonStr_to_dict(resp, key_list)

        for k in extracted_expected_dict:
            extracted_expected_dict[k] = _try_convert_json(extracted_expected_dict[k])

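In short, when there is no nesting, key_list is still empty after the loop, so the first key is appended to it and len(key_list) == 1.
But with one level of nesting, len(key_list) == 1 also holds, so the dict is handled as if it were not nested.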

The IndexScan operator is a typical example: it contains both {"filter": ""} and {"indexCtx": {"filter": "xxx"}},
and in this case the latter is treated as {"filter": "xxx"}, so the comparison fails.
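The ambiguity can be reproduced in isolation. The sketch below copies only the key-extraction steps from the snippet quoted above into a standalone function; the function name `extract_key_list` and the extra `descended` flag are hypothetical additions for illustration, and the flag is just one possible way to tell the two cases apart, not necessarily what the eventual fix does.

```python
def extract_key_list(expect):
    """Standalone copy of the key-extraction steps from _is_subdict_nested."""
    key_list = []
    inner = expect
    # Walk down while the dict has exactly one key whose value is a dict.
    while len(inner) == 1 and isinstance(list(inner.values())[0], dict):
        key_list.append(list(inner.keys())[0])
        inner = list(inner.values())[0]
    # The inner map cannot be empty.
    if len(inner) == 0:
        return None
    # Hypothetical addition: remember whether the loop descended at all.
    descended = len(key_list) > 0
    # Unnested dict: push the first key into the list.
    if inner == expect:
        key_list.append(list(expect.keys())[0])
    return key_list, inner, descended


flat = {"filter": ""}
nested = {"indexCtx": {"filter": "xxx"}}

flat_keys, flat_inner, flat_descended = extract_key_list(flat)
nested_keys, nested_inner, nested_descended = extract_key_list(nested)

# Both cases end with exactly one key, so len(key_list) == 1 cannot tell them apart.
print(flat_keys, flat_inner, flat_descended)        # ['filter'] {'filter': ''} False
print(nested_keys, nested_inner, nested_descended)  # ['indexCtx'] {'filter': 'xxx'} True
```

Because the nested case leaves `{'filter': 'xxx'}` as the inner dict, a later branch that checks only `len(key_list) == 1` compares it against the top-level `filter` field of the operator instead of the one inside `indexCtx`.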

Xscaper · August 31, 2024, 09:17

PR submitted:
https://github.com/vesoft-inc/nebula/pull/5938

system · September 30, 2024, 09:18

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.