Storage Error: RPC failure, probably timeout.

When using nebula-python to execute queries it throws:
Storage Error: RPC failure, probably timeout.

This happens even though the query is limited with LIMIT. Here's the query:
match (n:malware) return n.malware.name limit 3;

Mind you, running the exact same query in Nebula console works.

NebulaGraph is run using nebula-docker-compose; the version is 3.6.0.

All cluster containers are running and healthy, and the logs don't show any specific errors.

Any ideas?


I suspect there is a storage host failure :). I replied to you via the original issue and Discord :smiley:
Welcome to the Chinese Forum @salahaz!

Hello @wey, the storage hosts are online and connected.

Same for the meta and graph hosts:
graph-hosts

meta-hosts

So, this query got RPC failures, too, right?

Could you help check the logs under ./log/ of the compose folder, for graphd and all the storaged instances, when you reproduce the very same query?

Could you please run the same minimal query via both ipython and the console and share the logs? Or, even better, the query with profile as a prefix, like: profile match (n:malware) return n.malware.name limit 3;
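
For reference, a minimal ipython reproduction could look roughly like this (a sketch assuming the default docker-compose address 127.0.0.1:9669 and the root/nebula credentials; the space name below is a placeholder):

from nebula3.gclient.net import ConnectionPool
from nebula3.Config import Config

# minimal sketch: connect, switch space, run the profiled query, print the outcome
pool = ConnectionPool()
assert pool.init([('127.0.0.1', 9669)], Config())
with pool.session_context('root', 'nebula') as session:
    session.execute('USE your_space')  # placeholder: use your own space name
    result = session.execute('profile match (n:malware) return n.malware.name limit 3;')
    print(result.is_succeeded(), result.error_msg())
pool.close()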

Hello @wey,

  1. The RPC timeout failure happens for all queries executed via llama-index or nebula-python.
  2. Running the profile query in nebula console works, and returns the following:

However, running the following code in Python:

from nebula3.gclient.net import ConnectionPool
from nebula3.Config import Config
from nebula3.Exception import IOErrorException
from nebula3.fbthrift.transport.TTransport import TTransportException

# define a config
config = Config()
config.max_connection_pool_size = 10
# init connection pool
connection_pool = ConnectionPool()
# if the given servers are ok, return true, else return false
ok = connection_pool.init([('127.0.0.1', 9669)], config)

# option 1 control the connection release yourself
# get session from the pool
session = connection_pool.get_session('root', 'nebula')

# select space
session.execute('USE opencti')

# release session
session.release()

# option 2 with session_context, session will be released automatically
with connection_pool.session_context('root', 'nebula') as session:
    try:
        session.execute('USE opencti')
        result = session.execute('profile match (n:malware) return n.malware.name limit 3;')
        if result is None:
            raise ValueError("Query failed.")
        
        if not result.is_succeeded():
            raise ValueError(
                f"Query failed."
                f"Error message: {result.error_msg()}"
            )
    except (TTransportException, IOErrorException, RuntimeError) as e:
        print(
            f"Connection issue, try to recreate session pool "
            f"Erorr: {e}"
        )
        raise e

    except ValueError as e:
        # query failed on db side
        print(
            f"Query failed."
            f"Error message: {e}"
        )
        raise e
    except Exception as e:
        # other exceptions
        print(
            f"Query failed."
            f"Error message: {e}"
        )
        raise e

# close the pool
connection_pool.close()

It returns the timeout error:

The logs don't show anything specific about the queries or the timeout, as you can see below:

storaged-stderr.log

I20230828 15:26:23.183328     1 AdminTaskManager.cpp:40] exit AdminTaskManager::init()
I20230828 15:26:23.183528   120 AdminTaskManager.cpp:224] waiting for incoming task
I20230828 15:26:23.340339     1 MemoryUtils.cpp:171] MemoryTracker set static ratio: 0.8
I20230828 15:26:33.502504    69 MetaClient.cpp:3263] Load leader of "storaged0":9779 in 1 space
I20230828 15:26:33.503928    69 MetaClient.cpp:3263] Load leader of "storaged1":9779 in 1 space
I20230828 15:26:33.503968    69 MetaClient.cpp:3263] Load leader of "storaged2":9779 in 1 space
I20230828 15:26:33.503979    69 MetaClient.cpp:3269] Load leader ok
I20230828 15:26:43.527977    69 MetaClient.cpp:3263] Load leader of "storaged0":9779 in 1 space
I20230828 15:26:43.528112    69 MetaClient.cpp:3263] Load leader of "storaged1":9779 in 1 space
I20230828 15:26:43.528128    69 MetaClient.cpp:3263] Load leader of "storaged2":9779 in 1 space
I20230828 15:26:43.528137    69 MetaClient.cpp:3269] Load leader ok

metad-stderr.log

I20230828 15:26:20.882784     1 MetaDaemon.cpp:193] The meta daemon start on "metad0":9559
I20230828 15:26:20.883342     1 JobManager.cpp:88] Not leader, skip reading remaining jobs
I20230828 15:26:20.883716     1 JobManager.cpp:64] JobManager initialized
I20230828 15:26:20.883806   103 JobManager.cpp:150] JobManager::scheduleThread enter
I20230828 15:26:21.572297   129 HBProcessor.cpp:33] Receive heartbeat from "storaged1":9779, role = STORAGE
I20230828 15:26:21.572355   129 HBProcessor.cpp:41] Machine "storaged1":9779 is not registered
I20230828 17:57:20.924661    57 ThriftClientManager-inl.h:67] resolve "metad1":9560 as "172.19.0.7":9560
I20230829 09:14:15.053244    57 ThriftClientManager-inl.h:67] resolve "metad2":9560 as "172.19.0.3":9560

For graphd, there are only INFO logs and no ERROR logs:

I20230828 15:26:32.722048    67 MetaClient.cpp:3269] Load leader ok
I20230828 15:26:42.741894    67 MetaClient.cpp:3263] Load leader of "storaged0":9779 in 1 space
I20230828 15:26:42.741950    67 MetaClient.cpp:3263] Load leader of "storaged1":9779 in 1 space
I20230828 15:26:42.741958    67 MetaClient.cpp:3263] Load leader of "storaged2":9779 in 1 space
I20230828 15:26:42.741961    67 MetaClient.cpp:3269] Load leader ok

As I mentioned earlier, executing queries through the nebula console works completely fine; however, when done through nebula-python or llama-index there is always an RPC timeout error, regardless of the query.

Any ideas?

Hello @wey, I did some more investigation: running the same nebula-python code in a container on the same nebula-net network worked without the error. I used host.docker.internal as the IP in connection_pool.init().
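
Concretely, the only change compared to the script above was the address passed to connection_pool.init(); roughly (a sketch, assuming the same config as before):

from nebula3.gclient.net import ConnectionPool
from nebula3.Config import Config

# same code as before, but run from a container attached to the nebula-net network;
# the only difference is the init() target: host.docker.internal instead of 127.0.0.1
config = Config()
connection_pool = ConnectionPool()
ok = connection_pool.init([('host.docker.internal', 9669)], config)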

What are your thoughts on this? Because from outside the network it's able to connect to NebulaGraph using localhost/127.0.0.1 and switch to a space with USE space; however, any MATCH queries throw the RPC timeout error outside the network. This is the strange part.

I have tried the recommendation found here:

While it's possible to reach meta_server_addrs = 127.0.0.1:9559 from outside the network (on localhost), reaching storaged at local_ip=127.0.0.1 on port 9779 doesn't work. I wonder how to find the appropriate storaged IP address.

Connecting to the metad IP address from the local machine works:

Connecting to the storaged local IP address () from the local machine doesn't work:
storaged-ip-local
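
For reference, a rough way to check reachability of the individual ports from the host is a plain TCP probe like this (a sketch; the ports are the compose defaults, and the storaged host port mapping may differ in your setup):

import socket

# rough TCP reachability probe run from the host machine;
# the ports below are the docker-compose defaults (graphd 9669, metad 9559, storaged 9779)
for name, host, port in [('graphd', '127.0.0.1', 9669),
                         ('metad', '127.0.0.1', 9559),
                         ('storaged', '127.0.0.1', 9779)]:
    try:
        with socket.create_connection((host, port), timeout=2):
            print(f'{name} at {host}:{port} is reachable')
    except OSError as e:
        print(f'{name} at {host}:{port} is NOT reachable: {e}')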

Let me know what you think. And how can I find the correct storaged local IP?


Great catch, this seems to be related to the network then.

However, this RPC timeout seems to be graphd->storaged, so it shouldn't be related to how the graphd client was connected.

The graphd client only accesses graphd, which was exposed on 0.0.0.0:9669 already…

Anyway, docker-compose is more for test purposes (for production, use either binary packages or k8s), so you could put the client inside a container to work around this.

I really couldn't think of how this happened.

That case about accessing metad/storaged doesn't apply to our case, where only graphd is needed.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.