
I compiled an ARM build of nebula-graph successfully, but then ran into some problems.

  • nebula version: 1.2.0

  • Deployment (distributed / standalone / Docker / DBaaS): standalone

  • Hardware

    • Disk: SSD
    • CPU and memory: 96 CPU cores, 512 GB RAM
  • Description of the problem
    After the build finished, running
    ./nebula.service start all
    reports an error, and memory usage spikes to over 200 GB.

Logs:


[root@node1 scripts]# terminate called after throwing an instance of 'std::system_error'
  what():  Invalid argument
*** Aborted at 1611040526 (unix time) try "date -d @1611040526" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGABRT (@0x7e64) received by PID 32356 (TID 0xffffb0fe4560) from PID 32356; stack trace: ***
    @          0x1b16a1f (unknown)
    @     0xffffb0fa066b ([vdso]+0x66a)
    @     0xffffb0cc5238 __GI_raise
    @     0xffffb0cc68af __GI_abort
    @          0x1e5b49f __gnu_cxx::__verbose_terminate_handler()
    @          0x1e59ed3 __cxxabiv1::__terminate()
    @          0x1ef0933 __cxa_call_terminate
    @          0x1e59687 __gxx_personality_v0
    @          0x1efa743 _Unwind_RaiseException_Phase2
    @          0x1efac57 _Unwind_RaiseException
    @          0x1e5a09f __cxa_throw
    @           0xe43fc3 std::__throw_system_error()
    @          0x1ed0b9b std::thread::join()
    @          0x148658f nebula::WebService::~WebService()
    @           0xe44be3 main
    @     0xffffb0cb1723 __libc_start_main
    @           0xe6e423 (unknown)

[root@node1 scripts]# free -g
              total        used        free      shared  buff/cache   available
Mem:            510         260         114           0         135         176
Swap:             0           0           0

Please post the nebula-graph and nebula-storage logs.

**meta.INFO:**

Log file created at: 2021/01/19 17:51:06
Running on machine: node1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0119 17:51:06.046078 59513 NebulaStore.cpp:45] Start the raft service...
I0119 17:51:06.127620 59513 RaftexService.cpp:65] Init thrift server for raft service.
I0119 17:51:06.222126 59623 RaftexService.cpp:99] Starting the Raftex Service
I0119 17:51:06.973577 59623 RaftexService.cpp:87] Starting the Raftex Service on 45501
I0119 17:51:06.973639 59623 RaftexService.cpp:111] Start the Raftex Service successfully
I0119 17:51:06.973732 59513 NebulaStore.cpp:58] Scan the local path, and init the spaces_
I0119 17:51:06.973806 59513 NebulaStore.cpp:65] Scan path "data/meta/nebula/0"
I0119 17:51:07.093168 59513 RocksEngine.cpp:120] open rocksdb on data/meta/nebula/0/data
I0119 17:51:07.093192 59513 NebulaStore.cpp:95] Load space 0 from disk
I0119 17:51:07.093214 59513 NebulaStore.cpp:130] Need to open 1 parts of space 0
I0119 17:51:07.093353 59616 FileBasedWal.cpp:58] [Port: 45501, Space: 0, Part: 0] lastLogId in wal is 3, lastLogTerm is 2, path is data/meta/nebula/0/wal/0/0000000000000000001.wal
I0119 17:51:07.093410 59616 RaftPart.cpp:293] [Port: 45501, Space: 0, Part: 0] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 3, lastLogTerm 2, committedLogId 3, term 2
I0119 17:51:07.093506 59616 NebulaStore.cpp:160] Load part 0, 0 from disk
I0119 17:51:07.093519 59513 NebulaStore.cpp:175] Load space 0 complete
I0119 17:51:07.093528 59513 NebulaStore.cpp:183] Init data from partManager for [127.0.0.1:45500]
I0119 17:51:07.093538 59513 NebulaStore.cpp:247] Space 0 has existed!
I0119 17:51:07.093545 59513 NebulaStore.cpp:267] [Space: 0, Part: 0] has existed!
I0119 17:51:07.093555 59513 NebulaStore.cpp:200] Register handler...
I0119 17:51:07.093560 59513 MetaDaemon.cpp:93] Waiting for the leader elected...
I0119 17:51:07.093566 59513 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0119 17:51:08.093657 59513 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0119 17:51:09.093798 59513 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0119 17:51:10.093945 59513 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0119 17:51:11.094090 59513 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0119 17:51:12.094241 59513 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0119 17:51:12.305127 59617 RaftPart.cpp:991] [Port: 45501, Space: 0, Part: 0] Start leader election, reason: lastMsgDur 5211, term 2
I0119 17:51:12.305197 59617 RaftPart.cpp:1124] [Port: 45501, Space: 0, Part: 0] Sending out an election request (space = 0, part = 0, term = 3, lastLogId = 3, lastLogTerm = 2, candidateIP = 127.0.0.1, candidatePort = 45501)
I0119 17:51:12.305251 59617 RaftPart.cpp:1084] [Port: 45501, Space: 0, Part: 0] Partition is elected as the new leader for term 3
I0119 17:51:12.305258 59617 RaftPart.cpp:1179] [Port: 45501, Space: 0, Part: 0] The partition is elected as the leader
I0119 17:51:12.305302 59617 InMemoryLogBuffer.h:23] [Port: 45501, Space: 0, Part: 0] InMemoryLogBuffer ctor, firstLogId 4
I0119 17:51:13.094434 59513 MetaDaemon.cpp:133] Nebula store init succeeded, clusterId 2695977746709790328
I0119 17:51:13.094501 59513 MetaDaemon.cpp:223] Start http service
I0119 17:51:13.189043 59513 MetaDaemon.cpp:141] Starting Meta HTTP Service
I0119 17:51:13.720804 59706 Acceptor.cpp:453] Dropping all connections from Acceptor=0xffee002b0000 in thread 0xffedc0090000
I0119 17:51:13.721035 59709 Acceptor.cpp:453] Dropping all connections from Acceptor=0xffee002b0500 in thread 0xffed80080000
I0119 17:51:13.721204 59710 Acceptor.cpp:453] Dropping all connections from Acceptor=0xffee002b0a00 in thread 0xffed40090000
I0119 17:51:13.721369 59711 Acceptor.cpp:453] Dropping all connections from Acceptor=0xffee002b0f00 in thread 0xffed00080000
E0119 17:51:13.721725 59513 WebService.cpp:173] Failed to start web service: 98failed to bind to async server socket: 127.0.0.1:11000: Address already in use
E0119 17:51:13.722548 59513 MetaDaemon.cpp:231] Init web service failed: 98failed to bind to async server socket: 127.0.0.1:11000: Address already in use

**meta.ERROR**
Log file created at: 2021/01/19 17:51:13
Running on machine: node1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0119 17:51:13.721725 59513 WebService.cpp:173] Failed to start web service: 98failed to bind to async server socket: 127.0.0.1:11000: Address already in use
E0119 17:51:13.722548 59513 MetaDaemon.cpp:231] Init web service failed: 98failed to bind to async server socket: 127.0.0.1:11000: Address already in use

**storaged.INFO**
Log file created at: 2021/01/19 17:51:06
Running on machine: node1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0119 17:51:06.596592 59551 MetaClient.cpp:44] Create meta client to [127.0.0.1:45500]
I0119 17:51:06.636695 59551 StatsManager.cpp:112] registerHisto, bucketSize: 1000, min: 1, max: 1000000
I0119 17:51:06.637161 59551 GflagsManager.cpp:125] Prepare to register 14 gflags to meta
I0119 17:51:06.637181 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:06.637189 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:09.791154 59650 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:09.791594 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:09.791632 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:11.791716 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:11.791783 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:15.021251 59683 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:15.021394 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:15.021412 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:17.021502 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:17.021592 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:20.206735 59687 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:20.206872 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:20.206908 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:22.207000 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:22.207082 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:25.625253 59690 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:25.625372 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:25.625411 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:27.625492 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:27.625584 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:30.630848 59695 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:30.630954 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:30.631001 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:32.631081 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:32.631188 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:35.636394 59699 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:35.636488 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:35.636528 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:37.636620 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:37.636690 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:40.641834 59713 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:40.641929 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:40.641973 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:42.642051 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:42.642139 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:45.647318 59717 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:45.647429 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:45.647454 59551 MetaClient.cpp:76] Waiting for the metad to be ready!
I0119 17:51:47.647536 59551 ClusterIdMan.h:61] Try to open cluster.id
W0119 17:51:47.647624 59551 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0119 17:51:50.653057 59727 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:50.653194 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0119 17:51:50.653264 59551 MetaClient.cpp:76] Waiting for the metad to be ready!

storaged.ERROR
Log file created at: 2021/01/19 17:51:09
Running on machine: node1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0119 17:51:09.791154 59650 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:09.791594 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:15.021251 59683 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:15.021394 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:20.206735 59687 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:20.206872 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:25.625253 59690 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:25.625372 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:30.630848 59695 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:30.630954 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:35.636394 59699 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:35.636488 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:40.641834 59713 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:40.641929 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:45.647318 59717 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:45.647429 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0119 17:51:50.653057 59727 MetaClient.cpp:524] Send request to [127.0.0.1:45500], exceed retry limit
E0119 17:51:50.653194 59551 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused

E0119 17:51:13.721725 59513 WebService.cpp:173] Failed to start web service: 98failed to bind to async server socket: 127.0.0.1:11000: Address already in use
E0119 17:51:13.722548 59513 MetaDaemon.cpp:231] Init web service failed: 98failed to bind to async server socket: 127.0.0.1:11000: Address already in use

The meta service's HTTP port is occupied, which is why metad did not come up. Try changing the port number.

Also delete the files under the pids directory, then restart all the nebula services.
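If it is not obvious whether the port is still taken before restarting, you can check from a shell first. A minimal sketch (`ss` is from iproute2; 11000 is the meta HTTP port shown in the log above, and `ws_http_port` is the flag name assumed for nebula-metad.conf):

```shell
#!/bin/sh
# Sketch: check whether a TCP port is already bound before restarting metad.
# 11000 is the meta HTTP port from the log above; adjust to your config.
port_in_use() {
    # succeeds if any socket is listening on the given port
    ss -lntp 2>/dev/null | grep -q ":$1 "
}

if port_in_use 11000; then
    echo "port 11000 is taken -- pick another ws_http_port in nebula-metad.conf"
else
    echo "port 11000 is free"
fi
```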


Hi, I changed the port number as you suggested and restarted the services. There are no more errors, but memory usage is still very high: about 200 GB.

metad no longer logs any errors.
Here is nebula-metad.INFO:
I0120 11:47:38.309432 19799 NebulaStore.cpp:130] Need to open 1 parts of space 0
I0120 11:47:38.309633 19904 FileBasedWal.cpp:58] [Port: 45512, Space: 0, Part: 0] lastLogId in wal is 8, lastLogTerm is 5, path is data/meta/nebula/0/wal/0/0000000000000000001.wal
I0120 11:47:38.309717 19904 RaftPart.cpp:293] [Port: 45512, Space: 0, Part: 0] There are 0 peer hosts, and total 1 copies. The quorum is 1, as learner 0, lastLogId 8, lastLogTerm 5, committedLogId 8, term 5
I0120 11:47:38.309813 19904 NebulaStore.cpp:160] Load part 0, 0 from disk
I0120 11:47:38.309829 19799 NebulaStore.cpp:175] Load space 0 complete
I0120 11:47:38.309840 19799 NebulaStore.cpp:183] Init data from partManager for [127.0.0.1:45511]
I0120 11:47:38.309850 19799 NebulaStore.cpp:247] Space 0 has existed!
I0120 11:47:38.309859 19799 NebulaStore.cpp:267] [Space: 0, Part: 0] has existed!
I0120 11:47:38.309870 19799 NebulaStore.cpp:200] Register handler...
I0120 11:47:38.309876 19799 MetaDaemon.cpp:93] Waiting for the leader elected...
I0120 11:47:38.309881 19799 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0120 11:47:39.309958 19799 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0120 11:47:40.310088 19799 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0120 11:47:41.310230 19799 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0120 11:47:42.310357 19799 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0120 11:47:43.310490 19799 MetaDaemon.cpp:106] Leader has not been elected, sleep 1s
I0120 11:47:43.764725 19905 RaftPart.cpp:991] [Port: 45512, Space: 0, Part: 0] Start leader election, reason: lastMsgDur 5455, term 5
I0120 11:47:43.764781 19905 RaftPart.cpp:1124] [Port: 45512, Space: 0, Part: 0] Sending out an election request (space = 0, part = 0, term = 6, lastLogId = 8, lastLogTerm = 5, candidateIP = 127.0.0.1, candidatePort = 45512)
I0120 11:47:43.764837 19905 RaftPart.cpp:1084] [Port: 45512, Space: 0, Part: 0] Partition is elected as the new leader for term 6
I0120 11:47:43.764842 19905 RaftPart.cpp:1179] [Port: 45512, Space: 0, Part: 0] The partition is elected as the leader
I0120 11:47:43.764889 19905 InMemoryLogBuffer.h:23] [Port: 45512, Space: 0, Part: 0] InMemoryLogBuffer ctor, firstLogId 9
I0120 11:47:44.310640 19799 MetaDaemon.cpp:133] Nebula store init succeeded, clusterId 2695977746709790328
I0120 11:47:44.310693 19799 MetaDaemon.cpp:223] Start http service
I0120 11:47:44.421258 19799 MetaDaemon.cpp:141] Starting Meta HTTP Service
I0120 11:47:44.683601 19982 WebService.cpp:142] Web service started on HTTP[11011], HTTP2[11012]
I0120 11:47:44.971349 19799 JobManager.cpp:59] JobManager initialized
I0120 11:47:44.971406 19799 MetaDaemon.cpp:254] Check and init root user
I0120 11:47:44.971441 19799 RootUserMan.h:28] Root user exists
I0120 11:47:44.990859 19799 StatsManager.cpp:112] registerHisto, bucketSize: 1000, min: 1, max: 1000000
I0120 11:47:44.990900 19799 MetaDaemon.cpp:272] The meta deamon start on [127.0.0.1:45511]
I0120 11:47:45.026576 20001 JobManager.cpp:82] JobManager::runJobBackground() enter
I0120 11:47:47.923841 20108 HBProcessor.cpp:31] Receive heartbeat from [127.0.0.1:44511]
I0120 11:47:47.923902 20108 HBProcessor.cpp:34] Set clusterId for new host [127.0.0.1:44511]!

nebula-storaged.INFO

I0120 11:47:37.500488 19829 MetaClient.cpp:44] Create meta client to [127.0.0.1:45511]
I0120 11:47:37.556764 19829 StatsManager.cpp:112] registerHisto, bucketSize: 1000, min: 1, max: 1000000
I0120 11:47:37.557293 19829 GflagsManager.cpp:125] Prepare to register 14 gflags to meta
I0120 11:47:37.557315 19829 ClusterIdMan.h:61] Try to open cluster.id
W0120 11:47:37.557325 19829 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0120 11:47:40.765810 19924 MetaClient.cpp:524] Send request to [127.0.0.1:45511], exceed retry limit
E0120 11:47:40.766291 19829 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0120 11:47:40.766315 19829 MetaClient.cpp:76] Waiting for the metad to be ready!
I0120 11:47:42.766413 19829 ClusterIdMan.h:61] Try to open cluster.id
W0120 11:47:42.766489 19829 ClusterIdMan.h:64] Open file failed, error No such file or directory
E0120 11:47:45.921396 19963 MetaClient.cpp:524] Send request to [127.0.0.1:45511], exceed retry limit
E0120 11:47:45.921526 19829 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
I0120 11:47:45.921593 19829 MetaClient.cpp:76] Waiting for the metad to be ready!
I0120 11:47:47.921681 19829 ClusterIdMan.h:61] Try to open cluster.id
W0120 11:47:47.921779 19829 ClusterIdMan.h:64] Open file failed, error No such file or directory
I0120 11:47:47.924453 19968 MetaClient.cpp:1844] Persisit the cluster Id from metad 2695977746709790328
I0120 11:47:47.924517 19968 ClusterIdMan.h:42] Remove the existed file cluster.id
I0120 11:47:47.924748 19968 ClusterIdMan.h:55] Persiste clusterId 2695977746709790328 succeeded!
I0120 11:47:48.006475 19829 MetaClient.cpp:2198] Register gflags ok 14
I0120 11:47:48.092149 19829 MetaClient.cpp:87] Register time task for heartbeat!
I0120 11:47:48.092170 19829 StorageServer.cpp:122] Init schema manager
I0120 11:47:48.092198 19829 StorageServer.cpp:126] Init index manager
I0120 11:47:48.092218 19829 StorageServer.cpp:130] Init kvstore
I0120 11:47:48.092226 19829 NebulaStore.cpp:45] Start the raft service...
I0120 11:47:48.130596 19829 RaftexService.cpp:65] Init thrift server for raft service.
I0120 11:47:48.168812 20239 RaftexService.cpp:99] Starting the Raftex Service
I0120 11:47:48.208582 20239 RaftexService.cpp:87] Starting the Raftex Service on 44512
I0120 11:47:48.208603 20239 RaftexService.cpp:111] Start the Raftex Service successfully
I0120 11:47:48.208616 19829 NebulaStore.cpp:58] Scan the local path, and init the spaces_
E0120 11:47:48.208645 19829 FileUtils.cpp:384] Failed to read the directory "data/storage/nebula" (2): No such file or directory
I0120 11:47:48.208668 19829 NebulaStore.cpp:183] Init data from partManager for [127.0.0.1:44511]
I0120 11:47:48.208688 19829 NebulaStore.cpp:200] Register handler...
I0120 11:47:48.208693 19829 StorageServer.cpp:70] Starting Storage HTTP Service
I0120 11:47:48.281256 19829 StorageServer.cpp:74] Http Thread Pool started
I0120 11:47:48.340600 20246 WebService.cpp:142] Web service started on HTTP[12000], HTTP2[12002]

nebula-storaged.ERROR
Log file created at: 2021/01/20 11:47:40
Running on machine: node1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0120 11:47:40.765810 19924 MetaClient.cpp:524] Send request to [127.0.0.1:45511], exceed retry limit
E0120 11:47:40.766291 19829 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0120 11:47:45.921396 19963 MetaClient.cpp:524] Send request to [127.0.0.1:45511], exceed retry limit
E0120 11:47:45.921526 19829 MetaClient.cpp:58] Heartbeat failed, status:RPC failure in MetaClient: N6apache6thrift9transport19TTransportExceptionE: AsyncSocketException: connect failed, type = Socket not open, errno = 111 (Connection refused): Connection refused
E0120 11:47:48.208645 19829 FileUtils.cpp:384] Failed to read the directory "data/storage/nebula" (2): No such file or directory

The main problem now still seems to be memory usage: it reaches 200 GB as soon as the services start.

Could you check which process is using the most memory, nebula-graph or nebula-storage? Also, how much data do you have?
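One generic way to answer this on Linux is to read each daemon's resident set size from /proc. A small sketch (not nebula-specific; `rss_kb` is a hypothetical helper name, and you would pass the PIDs of nebula-metad, nebula-graphd, and nebula-storaged):

```shell
#!/bin/sh
# Sketch: print the resident set size (VmRSS, in kB) of a process by PID.
# Works on any Linux with /proc mounted; no extra tools needed.
rss_kb() {
    awk '/^VmRSS:/ {print $2}' "/proc/$1/status"
}

# Example: RSS of the current shell process
rss_kb $$
```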


I have only started the services; I have not imported any data or done anything else, just ran ./nebula.service start all.
Memory usage is as follows:
  PID USER  PR  NI    VIRT     RES    SHR S  %CPU %MEM    TIME+ COMMAND
19829 root  20   0  118.9g   65.5g  16768 S   0.3 12.8  0:03.68 nebula-storaged
19812 root  20   0  212.0g   36.5g  15104 S   0.3  7.2  0:03.95 nebula-graphd
19799 root  20   0  292.3g  150.0g  17920 S   0.3 29.4  0:22.07 nebula-metad

metad is using 150 GB of memory, nebula-graphd 36.5 GB, and storaged 65.5 GB.

Right after startup it immediately uses 120 GB, and over the next few minutes the usage keeps climbing.

My guess is that this is a compiler problem.

Try building with the ENABLE_ASAN option turned on, then start metad alone and see whether the memory problem still occurs.
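For reference, ENABLE_ASAN is passed when configuring the build with CMake. A hedged sketch, assuming an out-of-source build directory (the paths here are examples, not the thread's actual layout):

```shell
# Build-configuration sketch: enable AddressSanitizer in the nebula build.
# ENABLE_ASAN is a CMake option in the nebula source tree; "nebula/build"
# is just an example build directory.
cd nebula/build
cmake -DENABLE_ASAN=on ..
make -j"$(nproc)"
```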


Since metad uses the most memory, you can observe what metad is doing at startup. The steps are roughly:

  1. Install perf. After installation, run sudo perf top (press q to quit) to verify that the tool works.
  2. Stop the service with ./scripts/nebula.service stop metad; confirm with ./scripts/nebula.service status metad.
  3. Start the service again with ./scripts/nebula.service start metad; get metad's PID with ./scripts/nebula.service status metad.
  4. Watch the hotspots with sudo perf top -p PID (where PID is the metad process ID), then take a screenshot.

Also, please put the output of sudo sysctl -a somewhere we can see it, such as a Gist. You could also compare the kernel parameters between your two environments.


Hi, after enabling ENABLE_ASAN the problem is gone; all three processes behave normally and use only about 1 GB of memory. Does that mean this is a compiler problem? Should I switch compilers? I am currently using GCC 9.3.


Have you tried the steps dutor suggested?

Hi, after our ops colleague changed the kernel parameter swappiness, the problem has not recurred so far. It was probably caused by a misconfigured system parameter. Many thanks to everyone for the help.
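For readers hitting the same issue: vm.swappiness can be inspected and adjusted as below. A sketch only; the thread does not say which value was used, so the 10 here is just an example, not a recommendation:

```shell
#!/bin/sh
# Sketch: inspect the current swappiness value (world-readable on Linux).
cat /proc/sys/vm/swappiness

# Change it at runtime (needs root); 10 is only an example value:
#   sudo sysctl -w vm.swappiness=10
# Persist across reboots by adding "vm.swappiness = 10" to /etc/sysctl.conf,
# then reload with:
#   sudo sysctl -p
```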


:blush:

Could you share the build documentation?

Build the Source Code - Nebula Graph Database manual

You can follow this document to build from source. If you run into problems during the build, feel free to ask on the forum.

Thanks for sharing. We also need to deploy on a domestic ARM platform.

