Full-text index deployment fails; Listener stays OFFLINE

zcaa · November 29, 2022, 03:56

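Nebula version: 2.6.1
Deployment: single host
Installation method: RPM
Production environment: N
Hardware
- Disk: SSD
- CPU and memory: 20 cores, 64 GB
Problem description: I followed the official documentation to deploy the full-text index, but the deployment failed and the Listener is in the OFFLINE state.
I have two machines. Nebula Graph is deployed standalone on one of them, and ES is deployed on that same machine (IP: 172.16.0.20, OS: Ubuntu 18.04).
The Listener is deployed on the other machine (IP: 172.16.0.17, OS: CentOS 7). As the official documentation instructs, I installed the same Nebula Graph version, 2.6.1, there; only the operating system differs. The nebula-storaged-listener.conf configuration is as follows: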

########## nebula-storaged-listener ###########
########## basics ##########
# Whether to run as a daemon process
--daemonize=true
# The file to host the process id
--pid_file=pids_listener/nebula-storaged.pid
# Whether to use the configuration obtained from the configuration file
--local_config=true

########## logging ##########
# The directory to host logging files
--log_dir=logs_listener
# Log level, 0, 1, 2, 3 for INFO, WARNING, ERROR, FATAL respectively
--minloglevel=0
# Verbose log level, 1, 2, 3, 4, the higher of the level, the more verbose of the logging
--v=0
# Maximum seconds to buffer the log messages
--logbufsecs=0
# Whether to redirect stdout and stderr to separate output files
--redirect_stdout=true
# Destination filename of stdout and stderr, which will also reside in log_dir.
--stdout_log_file=storaged-stdout.log
--stderr_log_file=storaged-stderr.log
# Copy log messages at or above this level to stderr in addition to logfiles. The numbers of severity levels INFO, WARNING, ERROR, and FATAL are 0, 1, 2, and 3, respectively.
--stderrthreshold=2

########## networking ##########
# Meta server address
--meta_server_addrs=172.16.0.17:9559
# Local ip
--local_ip=172.16.0.17
# Storage daemon listening port
--port=9789
# HTTP service ip
--ws_ip=172.16.0.17
# HTTP service port
--ws_http_port=19789
# HTTP2 service port
--ws_h2_port=19790
# heartbeat with meta service
--heartbeat_interval_secs=10

########## storage ##########
# Listener wal directory. only one path is allowed.
--listener_path=data/listener
# This parameter can be ignored for compatibility. let's fill A default value of "data"
--data_path=data
# The type of part manager, [memory | meta]
--part_man_type=memory
# The default reserved bytes for one batch operation
--rocksdb_batch_size=4096
# The default block cache size used in BlockBasedTable.
# The unit is MB.
--rocksdb_block_cache=4
# The type of storage engine, `rocksdb', `memory', etc.
--engine_type=rocksdb
# The type of part, `simple', `consensus'...
--part_type=simple

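Next, on the Listener machine (172.16.0.17), I ran ./bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged-listener.conf to start the Listener.
Then I connected to Nebula Graph on 172.16.0.20 with Nebula Graph Console, ran USE test to select the test graph space, added the Listener with ADD LISTENER ELASTICSEARCH 172.16.0.17:9789; and finally ran SHOW LISTENER;
Every Listener turned out to be in the OFFLINE state.
There are too many logs and I don't know which one to post. Any guidance would be appreciated.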
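
One thing that may be worth checking in the flagfile above is which metad the Listener registers with. The sketch below is a minimal, hypothetical helper (parse_flagfile and check_listener_flags are names made up for illustration, not part of Nebula Graph). It parses gflags-style --key=value lines and warns when meta_server_addrs does not include the host assumed to run the cluster's metad. The assumption, which nothing in this thread confirms, is that the Listener has to use the same metad as the cluster it serves, which here would be on 172.16.0.20.

```python
def parse_flagfile(text):
    """Parse gflags-style lines (--key=value) into a dict, skipping comments."""
    flags = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith("--") or "=" not in line:
            continue  # blank line, comment, or section banner
        key, value = line[2:].split("=", 1)
        flags[key.strip()] = value.strip()
    return flags


def check_listener_flags(flags, cluster_meta_host):
    """Return a list of warnings; an empty list means nothing looked off."""
    warnings = []
    meta_hosts = []
    for addr in flags.get("meta_server_addrs", "").split(","):
        addr = addr.strip()
        if addr:
            meta_hosts.append(addr.rsplit(":", 1)[0])
    # Assumption: the Listener should talk to the cluster's own metad.
    if cluster_meta_host not in meta_hosts:
        warnings.append(
            f"meta_server_addrs={flags.get('meta_server_addrs')} "
            f"does not include {cluster_meta_host}"
        )
    # Loopback addresses are not reachable from the other machine.
    for key in ("local_ip", "ws_ip"):
        if flags.get(key, "").startswith("127."):
            warnings.append(f"{key} is a loopback address")
    return warnings


# A few lines copied from the configuration posted above.
SAMPLE = """\
# Meta server address
--meta_server_addrs=172.16.0.17:9559
# Local ip
--local_ip=172.16.0.17
--port=9789
"""

if __name__ == "__main__":
    for warning in check_listener_flags(parse_flagfile(SAMPLE), "172.16.0.20"):
        print("WARNING:", warning)
```

If the cluster's metad really does run on 172.16.0.20, this check flags the meta_server_addrs line; if the assumption is wrong, the warning can be ignored.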

SuperYoko · November 29, 2022, 04:13

Check the meta log and the listener's own log, and also check whether the listener process actually started.

zcaa · November 29, 2022, 05:27

The logs on the server where the Listener is deployed are as follows.
The meta ERROR log:

Log file created at: 2022/11/29 10:18:18
Running on machine: k3s01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E1129 10:18:18.319000 30795 FileUtils.cpp:369] Failed to read the directory "data/meta/nebula" (2): No such file or directory
E1129 10:18:20.322270 31097 ActiveHostsMan.cpp:246] Get last update time failed, error: E_KEY_NOT_FOUND

The meta INFO log:

Log file created at: 2022/11/29 10:18:18
Running on machine: k3s01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1129 10:18:18.275646 30795 MetaDaemon.cpp:262] localhost = "127.0.0.1":9559
I1129 10:18:18.290127 30795 NebulaStore.cpp:52] Start the raft service...
I1129 10:18:18.292140 30795 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 8388608 for each part by default
I1129 10:18:18.292737 30795 RaftexService.cpp:62] Init thrift server for raft service, port: 9560
I1129 10:18:18.293071 30897 RaftexService.cpp:93] Starting the Raftex Service
I1129 10:18:18.318042 30897 RaftexService.cpp:83] Starting the Raftex Service on 9560
I1129 10:18:18.318109 30897 RaftexService.cpp:103] Start the Raftex Service successfully
I1129 10:18:18.318922 30795 NebulaStore.cpp:84] Scan the local path, and init the spaces_
E1129 10:18:18.319000 30795 FileUtils.cpp:369] Failed to read the directory "data/meta/nebula" (2): No such file or directory
I1129 10:18:18.319690 30795 NebulaStore.cpp:170] Init data from partManager for "127.0.0.1":9559
I1129 10:18:18.319743 30795 NebulaStore.cpp:265] Create data space 0
I1129 10:18:18.339063 30795 RocksEngine.cpp:128] open rocksdb on data/meta/nebula/0/data
I1129 10:18:18.349751 30795 Part.cpp:54] [Port: 9560, Space: 0, Part: 0] Cannot fetch the last committed log id from the storage engine
I1129 10:18:18.349767 30795 RaftPart.cpp:278] [Port: 9560, Space: 0, Part: 0] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 0, lastLogTerm 0, committedLogId 0, term 0
I1129 10:18:18.349901 30795 NebulaStore.cpp:328] Space 0, part 0 has been added, asLearner 0
I1129 10:18:18.349928 30795 NebulaStore.cpp:77] Register handler...
I1129 10:18:18.349934 30795 MetaDaemon.cpp:98] Waiting for the leader elected...
I1129 10:18:18.349943 30795 MetaDaemon.cpp:110] Leader has not been elected, sleep 1s
I1129 10:18:19.049615 30844 RaftPart.cpp:957] [Port: 9560, Space: 0, Part: 0] Start leader election, reason: lastMsgDur 700, term 0
I1129 10:18:19.049746 30844 RaftPart.cpp:1095] [Port: 9560, Space: 0, Part: 0] Sending out an election request (space = 0, part = 0, term = 1, lastLogId = 0, lastLogTerm = 0, candidateIP = 127.0.0.1, candidatePort = 9560)
I1129 10:18:19.049773 30844 RaftPart.cpp:1059] [Port: 9560, Space: 0, Part: 0] Partition is elected as the new leader for term 1
I1129 10:18:19.049786 30844 RaftPart.cpp:1137] [Port: 9560, Space: 0, Part: 0] The partition is elected as the leader
I1129 10:18:19.350163 30795 KVBasedClusterIdMan.h:83] There is no clusterId existed in kvstore!
I1129 10:18:19.350242 30795 MetaDaemon.cpp:118] I am leader, create cluster Id
I1129 10:18:19.350258 30795 KVBasedClusterIdMan.h:30] Create ClusterId 5910483749134832870
I1129 10:18:19.350714 30839 KVBasedClusterIdMan.h:108] Put key __meta_cluster_id_key__, val 5910483749134832870
I1129 10:18:19.350883 30795 MetaDaemon.cpp:137] Get meta version is 2
I1129 10:18:19.351027 30839 MetaVersionMan.cpp:66] Write meta version 2 succeeds
I1129 10:18:19.351097 30795 MetaDaemon.cpp:164] Nebula store init succeeded, clusterId 5910483749134832870
I1129 10:18:19.351119 30795 MetaDaemon.cpp:275] Start http service
I1129 10:18:19.358739 30795 MetaDaemon.cpp:172] Starting Meta HTTP Service
I1129 10:18:19.365340 31023 WebService.cpp:124] Web service started on HTTP[19559], HTTP2[19560]
I1129 10:18:19.365657 30795 JobManager.cpp:56] JobManager initialized
I1129 10:18:19.365711 30795 MetaDaemon.cpp:305] Check and init root user
I1129 10:18:19.365813 30795 RootUserMan.h:30] Root user is not exists
I1129 10:18:19.365831 30795 RootUserMan.h:36] Init root user
I1129 10:18:19.365969 31029 JobManager.cpp:77] JobManager::runJobBackground() enter
I1129 10:18:19.366679 30795 MetaDaemon.cpp:331] The meta deamon start on "127.0.0.1":9559
I1129 10:18:20.321827 31097 HBProcessor.cpp:33] Receive heartbeat from "127.0.0.1":9669, role = GRAPH
E1129 10:18:20.322270 31097 ActiveHostsMan.cpp:246] Get last update time failed, error: E_KEY_NOT_FOUND
I1129 10:18:20.344841 31097 HBProcessor.cpp:33] Receive heartbeat from "127.0.0.1":9779, role = STORAGE
I1129 10:18:20.344913 31097 HBProcessor.cpp:38] Set clusterId for new host "127.0.0.1":9779!
E1129 10:18:20.345192 31097 ActiveHostsMan.cpp:246] Get last update time failed, error: E_KEY_NOT_FOUND
I1129 10:18:30.759070 31098 HBProcessor.cpp:33] Receive heartbeat from "127.0.0.1":9779, role = STORAGE
E1129 10:18:30.759660 31098 ActiveHostsMan.cpp:246] Get last update time failed, error: E_KEY_NOT_FOUND
I1129 10:18:30.804383 31098 HBProcessor.cpp:33] Receive heartbeat from "127.0.0.1":9669, role = GRAPH

Where can I find the listener's own log? Is the listener's startup status checked with the SHOW LISTENER command? It is always OFFLINE.

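SuperYoko · November 29, 2022, 09:07

Not with that command. Check the listener process itself. Usually the configuration is wrong, so the listener either never came up or cannot connect to meta.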

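zcaa · November 29, 2022, 09:30

I don't quite follow. How do I check the listener process? You said it is usually a configuration error; my nebula-storaged-listener.conf is posted above, so could you point out directly which setting is wrong? I also posted the meta ERROR log, which says Failed to read the directory "data/meta/nebula" (2): No such file or directory, and I don't know what reads that path. Please spell it out. A reply like this does not help users solve the problem.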

wey · November 29, 2022, 12:25

The listener's logs are still written to the storaged-*.log files.

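SuperYoko · November 29, 2022, 13:28

Sorry, I was taking things for granted. As @wey said, look for the logs in the directory of the listener process you started.
The error in the meta log is unrelated. Please check and post the storage logs under the listener's directory, along with the files @wey pointed to.

By the way, are you using 127.0.0.1 locally? Are your services on the same machine?

Local IP used to identify the nebula-metad process. Change it to an address other than loopback if the service is distributed or will be accessed remotely.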

zcaa · November 30, 2022, 01:52

Do you mean the logs under this path?

The nebula-storaged.INFO log is as follows:

Log file created at: 2022/11/29 10:31:07
Running on machine: k3s01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1129 10:31:07.889071  6270 StorageDaemon.cpp:125] localhost = "172.16.0.17":9789
I1129 10:31:07.909164  6270 MetaClient.cpp:57] Create meta client to "172.16.0.17":9559
I1129 10:31:07.948290  6270 MetaClient.cpp:3006] Load leader of "127.0.0.1":9779 in 0 space
I1129 10:31:07.948364  6270 MetaClient.cpp:3012] Load leader ok
I1129 10:31:07.948798  6270 MetaClient.cpp:117] Register time task for heartbeat!
I1129 10:31:07.948850  6270 StorageServer.cpp:167] Init schema manager
I1129 10:31:07.948865  6270 StorageServer.cpp:170] Init index manager
I1129 10:31:07.948879  6270 StorageServer.cpp:173] Init kvstore
I1129 10:31:07.949091  6270 NebulaStore.cpp:52] Start the raft service...
I1129 10:31:07.962484  6270 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 8388608 for each part by default
I1129 10:31:07.962870  6270 RaftexService.cpp:62] Init thrift server for raft service, port: 9790
I1129 10:31:07.963192  6327 RaftexService.cpp:93] Starting the Raftex Service
I1129 10:31:07.974342  6327 RaftexService.cpp:83] Starting the Raftex Service on 9790
I1129 10:31:07.974395  6327 RaftexService.cpp:103] Start the Raftex Service successfully
I1129 10:31:07.974691  6270 NebulaStore.cpp:188] Init listener from partManager for "172.16.0.17":9789
I1129 10:31:07.974771  6270 NebulaStore.cpp:77] Register handler...
I1129 10:31:07.974792  6270 StorageServer.cpp:85] Starting Storage HTTP Service
I1129 10:31:07.986286  6270 StorageServer.cpp:89] Http Thread Pool started
I1129 10:31:07.992103  6339 WebService.cpp:124] Web service started on HTTP[19789], HTTP2[19790]
I1129 10:31:07.992271  6270 TransactionManager.cpp:25] TransactionManager ctor()
I1129 10:31:08.021631  6270 RocksEngine.cpp:128] open rocksdb on data/nebula/0/data
I1129 10:31:08.022518  6270 AdminTaskManager.cpp:22] max concurrenct subtasks: 10
I1129 10:31:08.022733  6270 AdminTaskManager.cpp:35] exit AdminTaskManager::init()
I1129 10:31:08.023309  6364 AdminTaskManager.cpp:231] waiting for incoming task
I1129 10:31:08.024246  6367 StorageServer.cpp:261] The admin service start on "172.16.0.17":9788
I1129 10:31:08.024981  6368 StorageServer.cpp:286] The internal storage service start(same with admin) on "172.16.0.17":9787
I1129 10:31:08.037310  6366 StorageServer.cpp:232] The storage service start on "172.16.0.17":9789
W1129 10:41:37.884773  6317 RaftexService.cpp:165] Cannot find the part 89 in the graph space 48
W1129 10:41:37.886688  6298 RaftexService.cpp:165] Cannot find the part 96 in the graph space 48
W1129 10:41:47.869503  6303 RaftexService.cpp:165] Cannot find the part 17 in the graph space 48
W1129 10:41:47.872701  6298 RaftexService.cpp:165] Cannot find the part 95 in the graph space 48
W1129 10:41:57.874933  6328 RaftexService.cpp:165] Cannot find the part 48 in the graph space 48
W1129 10:41:57.884852  6304 RaftexService.cpp:165] Cannot find the part 77 in the graph space 48
W1129 10:42:07.878731  6298 RaftexService.cpp:165] Cannot find the part 49 in the graph space 48
W1129 10:42:07.897918  6312 RaftexService.cpp:165] Cannot find the part 25 in the graph space 48
W1129 10:42:17.884169  6303 RaftexService.cpp:165] Cannot find the part 48 in the graph space 48
W1129 10:42:17.901865  6330 RaftexService.cpp:165] Cannot find the part 91 in the graph space 48
W1129 10:42:27.889600  6317 RaftexService.cpp:165] Cannot find the part 1 in the graph space 48
W1129 10:42:27.908953  6334 RaftexService.cpp:165] Cannot find the part 65 in the graph space 48

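In addition, the two storaged-*.log files are empty (0 bytes).
Also, I have two servers locally: 172.16.0.20 (Ubuntu 18.04), with Nebula Graph and ES installed, and 172.16.0.17 (CentOS 7), with Nebula Graph installed as the Listener. I don't know which configuration the 127.0.0.1 comes from.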
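
The INFO log above already records which metad the Listener contacted and which host that metad reported back. Here is a small sketch that pulls those two facts out so they can be compared with the intended cluster address. It assumes the two message formats seen in this log, and summarize_listener_log is a hypothetical helper name.

```python
import re

# Patterns taken from the two MetaClient.cpp messages in the log above.
META_RE = re.compile(r'Create meta client to "([^"]+)":(\d+)')
LEADER_RE = re.compile(r'Load leader of "([^"]+)":(\d+)')


def summarize_listener_log(text):
    """Collect the metad addresses contacted and the hosts metad reported."""
    return {
        "meta_clients": [f"{host}:{port}" for host, port in META_RE.findall(text)],
        "reported_hosts": [f"{host}:{port}" for host, port in LEADER_RE.findall(text)],
    }


# Two lines copied from the nebula-storaged.INFO log posted above.
SAMPLE_LOG = (
    'I1129 10:31:07.909164  6270 MetaClient.cpp:57] '
    'Create meta client to "172.16.0.17":9559\n'
    'I1129 10:31:07.948290  6270 MetaClient.cpp:3006] '
    'Load leader of "127.0.0.1":9779 in 0 space\n'
)

if __name__ == "__main__":
    print(summarize_listener_log(SAMPLE_LOG))
```

In the log posted here, the Listener talks to a metad on 172.16.0.17, and that metad knows a host at 127.0.0.1:9779 (9779 being the default storaged port). Whether that is the metad of the cluster running on 172.16.0.20 may be worth confirming.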

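SuperYoko · November 30, 2022, 02:03

The conf files for meta, graph, and storage should all have this local_ip option. For a multi-machine deployment, it is best to set it to the IP of the corresponding network interface.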

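zcaa · November 30, 2022, 06:51

Following the official documentation, I had already modified nebula-storaged-listener.conf on the Listener machine; that file is shown in the first post above.
However, the official documentation does not ask you to modify nebula-storaged.conf, nebula-metad.conf, or nebula-graphd.conf on the Listener machine. This time I changed every 127.0.0.1 in those files to this machine's IP, 172.16.0.17, restarted Nebula Graph, and went through the official tutorial again. It still does not work: all listeners are OFFLINE.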

pandasheeps · November 30, 2022, 07:11

Run ps -ef | grep nebula to see whether the storage listener has started.
Then could you post the INFO log from the logs_listener directory?
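
To tell the Listener apart from an ordinary storaged in the process list, one option is to look at the --flagfile argument. The following is a rough sketch; find_listener_processes and current_ps_output are hypothetical helpers, and the parsing assumes `ps -ef` output in the usual UID PID PPID C STIME TTY TIME CMD layout.

```python
import subprocess


def find_listener_processes(ps_output, flagfile_hint="nebula-storaged-listener.conf"):
    """Return (pid, command) pairs for nebula-storaged started with the listener flagfile."""
    found = []
    for line in ps_output.splitlines():
        fields = line.split(None, 7)  # CMD is the 8th column of `ps -ef`
        if len(fields) < 8 or not fields[1].isdigit():
            continue  # header line or something unexpected
        cmd = fields[7]
        executable = cmd.split()[0]
        if executable.endswith("nebula-storaged") and flagfile_hint in cmd:
            found.append((int(fields[1]), cmd))
    return found


def current_ps_output():
    """Capture `ps -ef` from the local machine."""
    return subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout
```

Calling find_listener_processes(current_ps_output()) on the Listener machine returns an empty list when no Listener is running. More than one entry would suggest an older instance was never stopped.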

zcaa · November 30, 2022, 07:24

The INFO log is as follows:

Log file created at: 2022/11/29 10:31:07
Running on machine: k3s01
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1129 10:31:07.889071  6270 StorageDaemon.cpp:125] localhost = "172.16.0.17":9789
I1129 10:31:07.909164  6270 MetaClient.cpp:57] Create meta client to "172.16.0.17":9559
I1129 10:31:07.948290  6270 MetaClient.cpp:3006] Load leader of "127.0.0.1":9779 in 0 space
I1129 10:31:07.948364  6270 MetaClient.cpp:3012] Load leader ok
I1129 10:31:07.948798  6270 MetaClient.cpp:117] Register time task for heartbeat!
I1129 10:31:07.948850  6270 StorageServer.cpp:167] Init schema manager
I1129 10:31:07.948865  6270 StorageServer.cpp:170] Init index manager
I1129 10:31:07.948879  6270 StorageServer.cpp:173] Init kvstore
I1129 10:31:07.949091  6270 NebulaStore.cpp:52] Start the raft service...
I1129 10:31:07.962484  6270 NebulaSnapshotManager.cpp:25] Send snapshot is rate limited to 8388608 for each part by default
I1129 10:31:07.962870  6270 RaftexService.cpp:62] Init thrift server for raft service, port: 9790
I1129 10:31:07.963192  6327 RaftexService.cpp:93] Starting the Raftex Service
I1129 10:31:07.974342  6327 RaftexService.cpp:83] Starting the Raftex Service on 9790
I1129 10:31:07.974395  6327 RaftexService.cpp:103] Start the Raftex Service successfully
I1129 10:31:07.974691  6270 NebulaStore.cpp:188] Init listener from partManager for "172.16.0.17":9789
I1129 10:31:07.974771  6270 NebulaStore.cpp:77] Register handler...
I1129 10:31:07.974792  6270 StorageServer.cpp:85] Starting Storage HTTP Service
I1129 10:31:07.986286  6270 StorageServer.cpp:89] Http Thread Pool started
I1129 10:31:07.992103  6339 WebService.cpp:124] Web service started on HTTP[19789], HTTP2[19790]
I1129 10:31:07.992271  6270 TransactionManager.cpp:25] TransactionManager ctor()
I1129 10:31:08.021631  6270 RocksEngine.cpp:128] open rocksdb on data/nebula/0/data
I1129 10:31:08.022518  6270 AdminTaskManager.cpp:22] max concurrenct subtasks: 10
I1129 10:31:08.022733  6270 AdminTaskManager.cpp:35] exit AdminTaskManager::init()
I1129 10:31:08.023309  6364 AdminTaskManager.cpp:231] waiting for incoming task
I1129 10:31:08.024246  6367 StorageServer.cpp:261] The admin service start on "172.16.0.17":9788
I1129 10:31:08.024981  6368 StorageServer.cpp:286] The internal storage service start(same with admin) on "172.16.0.17":9787
I1129 10:31:08.037310  6366 StorageServer.cpp:232] The storage service start on "172.16.0.17":9789
W1129 10:41:37.884773  6317 RaftexService.cpp:165] Cannot find the part 89 in the graph space 48
W1129 10:41:37.886688  6298 RaftexService.cpp:165] Cannot find the part 96 in the graph space 48
W1129 10:41:47.869503  6303 RaftexService.cpp:165] Cannot find the part 17 in the graph space 48
W1129 10:41:47.872701  6298 RaftexService.cpp:165] Cannot find the part 95 in the graph space 48
W1129 10:41:57.874933  6328 RaftexService.cpp:165] Cannot find the part 48 in the graph space 48
W1129 10:41:57.884852  6304 RaftexService.cpp:165] Cannot find the part 77 in the graph space 48
W1129 10:42:07.878731  6298 RaftexService.cpp:165] Cannot find the part 49 in the graph space 48
W1129 10:42:07.897918  6312 RaftexService.cpp:165] Cannot find the part 25 in the graph space 48
W1129 10:42:17.884169  6303 RaftexService.cpp:165] Cannot find the part 48 in the graph space 48
W1129 10:42:17.901865  6330 RaftexService.cpp:165] Cannot find the part 91 in the graph space 48
W1129 10:42:27.889600  6317 RaftexService.cpp:165] Cannot find the part 1 in the graph space 48

pandasheeps · November 30, 2022, 07:42

Are you sure the storage listener was freshly started? Why is the date yesterday's?
I suspect you changed the IP but did not restart the storage listener.
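
The date in question comes from the `Log file created at:` header at the top of the log. An old date there suggests that the file, and possibly the process writing it, predates the latest restart. A small sketch of that check follows; log_created_at and looks_stale are hypothetical helper names, and the five-minute tolerance is an arbitrary choice.

```python
from datetime import datetime, timedelta

HEADER_PREFIX = "Log file created at: "


def log_created_at(text):
    """Return the creation time from the log header, or None if it is absent."""
    for line in text.splitlines():
        if line.startswith(HEADER_PREFIX):
            stamp = line[len(HEADER_PREFIX):].strip()
            return datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return None


def looks_stale(text, restarted_at, tolerance=timedelta(minutes=5)):
    """True if the log was created well before the claimed restart time."""
    created = log_created_at(text)
    return created is not None and created < restarted_at - tolerance


# Header copied from the INFO log posted above.
SAMPLE_HEADER = "Log file created at: 2022/11/29 10:31:07\nRunning on machine: k3s01\n"

if __name__ == "__main__":
    # The restart time below is only an example value.
    print(looks_stale(SAMPLE_HEADER, datetime(2022, 11, 30, 7, 0, 0)))
```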

zcaa · November 30, 2022, 07:47

I changed the IP in nebula-storaged.conf, nebula-metad.conf, and nebula-graphd.conf on the Listener machine from the default 127.0.0.1 to that machine's IP, 172.16.0.17. Then I restarted Nebula with sudo /usr/local/nebula/scripts/nebula.service restart all, and started the listener with ./bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged-listener.conf, as the official documentation says.

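zcaa · November 30, 2022, 07:48

The date is yesterday's because I started the deployment yesterday and it failed. Do I need to delete all these logs and run through everything again before looking at the logs?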

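pandasheeps · November 30, 2022, 08:31

./bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged-listener.conf
When you started it with this command, had the IP in /usr/local/nebula/etc/nebula-storaged-listener.conf also been changed?
In the screenshot you just posted, the storage listener was started yesterday. It does not seem to have been restarted.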

pandasheeps · November 30, 2022, 08:32

Restart it, then try ADD LISTENER on a new space.
It is almost certainly a configuration problem.

zcaa · November 30, 2022, 09:24

I changed the IP in nebula-storaged-listener.conf to 172.16.0.17 yesterday, but every listener was OFFLINE. @SuperYoko suggested also changing 127.0.0.1 to 172.16.0.17 in nebula-storaged.conf, nebula-metad.conf, and nebula-graphd.conf. Today, after changing those three files, I restarted Nebula Graph with sudo /usr/local/nebula/scripts/nebula.service restart all. After the restart I ran ./bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged-listener.conf again. Can this command not be used to restart the storage listener? How exactly should I restart it?

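pandasheeps · November 30, 2022, 09:31

The earlier instance probably never exited, so restarting had no effect.
You can simply kill it. Restarting the listener as a service is available in later versions.
Kill the old one, then try again with a new space.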

zcaa · November 30, 2022, 10:05

1. Stopped Nebula Graph on the listener machine (172.16.0.17) with sudo /usr/local/nebula/scripts/nebula.service stop all.
2. Stopped the listener with kill -9.
3. Started Nebula Graph again with sudo /usr/local/nebula/scripts/nebula.service start all.
4. Started the listener with ./bin/nebula-storaged --flagfile /usr/local/nebula/etc/nebula-storaged-listener.conf.
5. On the Nebula Graph instance at 172.16.0.20, created a new graph space and switched to it, then added the listener with ADD LISTENER ELASTICSEARCH 172.16.0.17:9789;
6. Ran SHOW LISTENER; and the state is still OFFLINE.