LDBC Data Import and nGQL in Practice

xjc · August 18, 2021, 01:53

Overview

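I recently imported the LDBC dataset into a single-machine Nebula Graph cluster that I set up myself, and tried writing a few of the basic LDBC SNB queries (Short Reads) in nGQL.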

Data import

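The Nebula bench repo wraps the process of generating LDBC data and importing it into Nebula in Python scripts, so you can mostly just follow the steps in its documentation.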

A few small pitfalls I ran into:

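The YAML generated by python3 run.py importer sets the space's replica factor to 3 by default, which does not work on my single-machine cluster. You can either modify the Python code yourself, or do only a dry run and then run nebula importer by hand to load the data. I went with the latter.
In the imported data, some vertices of completely different types share the same VID, for example Person and Organisation, which feels a little odd later when running nGQL. The Nebula bench documentation mentions this too: it does not affect benchmarking, so it is left unhandled. Fair enough.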
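
For the first pitfall, one way to avoid touching the Python scripts is to patch the generated importer YAML as plain text before running nebula importer by hand. The following is a minimal sketch, assuming the generated file contains a CREATE SPACE statement with a replica_factor setting. The function name set_replica_factor and the sample statement are illustrative and not taken from Nebula bench.

```python
import re

def set_replica_factor(yaml_text: str, replicas: int = 1) -> str:
    """Return yaml_text with every `replica_factor=<n>` rewritten to `replicas`."""
    # Text-level substitution, so the YAML layout and comments stay untouched.
    return re.sub(r"(replica_factor\s*=\s*)\d+", rf"\g<1>{replicas}", yaml_text)

# Hypothetical fragment of a generated importer config (an assumption).
sample = (
    "CREATE SPACE IF NOT EXISTS ldbc1"
    "(partition_num=24, replica_factor = 3, vid_type=int64);"
)
patched = set_replica_factor(sample)
print(patched)
```

Working on the raw text rather than parsing the YAML keeps the sketch independent of the file's exact structure, which may differ between Nebula bench versions.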

After a successful import, take a moment to admire SHOW STATS:

(root@nebula) [ldbc1]> show stats;
+---------+------------------+----------+
| Type    | Name             | Count    |
+---------+------------------+----------+
| "Tag"   | "Comment"        | 2052169  |
+---------+------------------+----------+
| "Tag"   | "Forum"          | 90492    |
+---------+------------------+----------+
| "Tag"   | "Organisation"   | 7955     |
+---------+------------------+----------+
| "Tag"   | "Person"         | 9892     |
+---------+------------------+----------+
| "Tag"   | "Place"          | 1460     |
+---------+------------------+----------+
| "Tag"   | "Post"           | 1003605  |
+---------+------------------+----------+
| "Tag"   | "Tag"            | 16080    |
+---------+------------------+----------+
| "Tag"   | "Tagclass"       | 71       |
+---------+------------------+----------+
| "Edge"  | "CONTAINER_OF"   | 1003605  |
+---------+------------------+----------+
| "Edge"  | "HAS_CREATOR"    | 3055774  |
+---------+------------------+----------+
| "Edge"  | "HAS_INTEREST"   | 229166   |
+---------+------------------+----------+
| "Edge"  | "HAS_MEMBER"     | 1611869  |
+---------+------------------+----------+
| "Edge"  | "HAS_MODERATOR"  | 90492    |
+---------+------------------+----------+
| "Edge"  | "HAS_TAG"        | 3721409  |
+---------+------------------+----------+
| "Edge"  | "HAS_TYPE"       | 16080    |
+---------+------------------+----------+
| "Edge"  | "IS_LOCATED_IN"  | 3073620  |
+---------+------------------+----------+
| "Edge"  | "IS_PART_OF"     | 1454     |
+---------+------------------+----------+
| "Edge"  | "IS_SUBCLASS_OF" | 70       |
+---------+------------------+----------+
| "Edge"  | "KNOWS"          | 180623   |
+---------+------------------+----------+
| "Edge"  | "LIKES"          | 2190095  |
+---------+------------------+----------+
| "Edge"  | "REPLY_OF"       | 2052169  |
+---------+------------------+----------+
| "Edge"  | "STUDY_AT"       | 7949     |
+---------+------------------+----------+
| "Edge"  | "WORK_AT"        | 21654    |
+---------+------------------+----------+
| "Space" | "vertices"       | 3165488  |
+---------+------------------+----------+
| "Space" | "edges"          | 17256029 |
+---------+------------------+----------+
Got 25 rows (time spent 1344/16017 us)
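
As a quick sanity check, the per-type counts can be summed and compared with the Space totals. This is a small Python sketch using only the numbers copied from the output above; the variable names are my own.

```python
# Counts copied from the SHOW STATS output above.
tag_counts = {
    "Comment": 2052169, "Forum": 90492, "Organisation": 7955, "Person": 9892,
    "Place": 1460, "Post": 1003605, "Tag": 16080, "Tagclass": 71,
}
edge_counts = {
    "CONTAINER_OF": 1003605, "HAS_CREATOR": 3055774, "HAS_INTEREST": 229166,
    "HAS_MEMBER": 1611869, "HAS_MODERATOR": 90492, "HAS_TAG": 3721409,
    "HAS_TYPE": 16080, "IS_LOCATED_IN": 3073620, "IS_PART_OF": 1454,
    "IS_SUBCLASS_OF": 70, "KNOWS": 180623, "LIKES": 2190095,
    "REPLY_OF": 2052169, "STUDY_AT": 7949, "WORK_AT": 21654,
}
space_vertices = 3165488
space_edges = 17256029

tag_total = sum(tag_counts.values())
edge_total = sum(edge_counts.values())

# Edge counts should add up exactly to the Space total.
print(edge_total == space_edges)      # True
# Tag counts exceed the vertex total when one VID carries several tags.
print(tag_total - space_vertices)     # 16236
```

The edge counts add up exactly. The tag counts exceed the vertex total by 16236, which would be consistent with the shared VIDs noted earlier, assuming SHOW STATS counts a vertex once per tag it carries but only once in the Space total.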

nGQL queries

Below I try to solve the relatively basic query scenarios in the LDBC SNB Interactive workload, the Short Reads. See the spec for the detailed requirements of each scenario.
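
The queries in this section use $person_id and $message_id as placeholders. If they need to be replaced with literal values on the client before a statement is sent, a helper along the following lines could do it. This is a minimal sketch, assuming integer VIDs (as in the literal 6605817 used in Short Reads #5); the function name render is hypothetical.

```python
import re

def render(query: str, **params: int) -> str:
    """Replace `$name` placeholders with integer literals."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        # Only names passed in are replaced, so nGQL's own `$$.Tag.prop`
        # and `$-.col` references pass through unchanged.
        if name in params:
            return str(int(params[name]))  # int() rejects non-numeric input
        return match.group(0)
    return re.sub(r"\$([A-Za-z_]\w*)", substitute, query)

template = (
    "go from $message_id over HAS_CREATOR "
    "yield HAS_CREATOR._dst as personId, $$.Person.firstName"
)
statement = render(template, message_id=6605817)
print(statement)
```

A plain string.Template would not work for the go statements, because it treats $$ as an escape and rejects $-, so the sketch uses a regular expression that only touches the names it is given.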

Short Reads #1 - Profile of a person

match (v1:Person)-[:IS_LOCATED_IN]->(v2:Place) where id(v1)==$person_id
return v1.firstName, v1.lastName, v1.birthday, v1.locationIP, v1.browserUsed, id(v2), v1.gender, v1.creationDate

Short Reads #2 - Recent messages of a person

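Going from a comment to its post requires traversing an unbounded number of hops, which Nebula does not support yet, so the only option is to specify a sufficiently large upper bound. I arbitrarily picked 5.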

match (p1:Person)<-[:HAS_CREATOR]-(m:`Comment`)-[:REPLY_OF*..5]->(p:Post)-[:HAS_CREATOR]->(p2:Person)
where id(p1)==$person_id
return id(m) as messageId,
(case m.content is null when false then m.content when true then m.imageFile end) as content,
id(p), id(p2), p2.firstName, p2.lastName,
m.creationDate as creationDate
order by creationDate desc, messageId desc limit 10;

Short Reads #3 - Friends of a person

match (p1:Person)-[k:KNOWS]-(p2:Person) where id(p1)==$person_id 
return id(p2) as friendId,p2.firstName,p2.lastName,k.creationDate as creationDate 
order by creationDate desc, friendId;

Short Reads #4 - Content of a message

Finally, no match needed: this simple query can be handled directly with fetch.

fetch prop on Post $message_id 
yield Post.creationDate, Post.content, Post.imageFile

Short Reads #5 - Creator of a message

No match needed here either: go!

go from 6605817 over HAS_CREATOR yield HAS_CREATOR._dst as personId, $$.Person.firstName, $$.Person.lastName;

Short Reads #6 - Forum of a message

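More go. This one also runs into the unbounded-hops problem, which go does not support either, so I set the maximum number of hops to 5.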

go 0 to 5 steps from $message_id over REPLY_OF yield REPLY_OF._dst as postId 
| go from $-.postId over CONTAINER_OF REVERSELY yield CONTAINER_OF._dst as forumId, $$.Forum.title as title
| go from $-.forumId over HAS_MODERATOR yield $-.forumId, $-.title, HAS_MODERATOR._dst as moderatorId, $$.Person.firstName, $$.Person.lastName

Short Reads #7 - Replies of a message

As far as I can tell, this scenario needs openCypher's OPTIONAL MATCH, which Nebula does not support yet. I hope a later version adds it.
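
Until then, one possible workaround is to run two ordinary queries and combine them on the client: one for the replies and their authors, and one for the friends of the original message's author. Below is a minimal Python sketch of the combining step only, assuming both result sets have already been fetched. The row layout and the names Reply and flag_known_authors are hypothetical, and the sample data is made up.

```python
from typing import Iterable, List, NamedTuple, Tuple

class Reply(NamedTuple):
    comment_id: int
    author_id: int

def flag_known_authors(
    replies: Iterable[Reply], friend_ids: Iterable[int]
) -> List[Tuple[int, int, bool]]:
    """Pair each reply with whether its author is among friend_ids.

    Replies whose author is not a friend are kept with False, which is
    the role OPTIONAL MATCH would play in a single query.
    """
    friends = set(friend_ids)
    return [(r.comment_id, r.author_id, r.author_id in friends) for r in replies]

# Made-up data: two replies, and the original author's friends.
replies = [Reply(1, 10), Reply(2, 20)]
flagged = flag_known_authors(replies, friend_ids=[20, 30])
print(flagged)  # [(1, 10, False), (2, 20, True)]
```

This costs an extra round trip compared with a single OPTIONAL MATCH, so it is a stopgap rather than a replacement.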

the end.