worker 1 (pid: 92) died, killed by signal 9 :( trying respawn

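  • nebula version: 2.6.0
  • Deployment: single machine
  • Installation: uwsgi+Django+Docker
  • Production environment: Y / N
  • Hardware
    • Disk (SSD recommended)
    • CPU and memory
  • Problem description
    Queries against the database work fine right after startup;
    after a few hours, a query fails with an error, and the very next query automatically reconnects to nebula and returns the correct answer.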

Log:

Thu Mar 10 17:40:36 2022 - *** HARAKIRI ON WORKER 1 (pid: 92, try: 1) ***
Thu Mar 10 17:40:36 2022 - HARAKIRI !!! worker 1 status !!!
Thu Mar 10 17:40:36 2022 - HARAKIRI [core 3] 172.17.0.1 - POST /kbqa since 1646905175
Thu Mar 10 17:40:36 2022 - HARAKIRI !!! end of worker 1 status !!!
DAMN ! worker 1 (pid: 92) died, killed by signal 9 :( trying respawn ...
Respawned uWSGI worker 1 (new pid: 150)
[2022-03-10 17:40:39,677] INFO     [ConnectionPool.py:176]:Get connection to ('10.4.40.13', 9669)

[2022-03-10 17:40:46,034] INFO     [base.py:166]:Scheduler started
[2022-03-10 17:40:46,084] INFO     [__init__.py:33]:Kbqa ready
WSGI app 0 (mountpoint='') ready in 9 seconds on interpreter 0x5603935f34a0 pid: 150 (default app)
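
The HARAKIRI line in the log above records when the killed request started, as a Unix timestamp (`since 1646905175`). The following is a minimal sketch for working out how long that request had been running when the worker was killed. It assumes the uWSGI log timestamps are in UTC+8, which the thread does not state; the function name is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the uWSGI host writes its log timestamps in UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))


def harakiri_elapsed_seconds(log_time, since_epoch):
    """Seconds between the request start (Unix epoch) and the HARAKIRI log line."""
    logged = datetime.strptime(log_time, "%a %b %d %H:%M:%S %Y").replace(tzinfo=LOCAL_TZ)
    started = datetime.fromtimestamp(since_epoch, tz=timezone.utc)
    return (logged - started).total_seconds()


# Values copied from the log above.
elapsed = harakiri_elapsed_seconds("Thu Mar 10 17:40:36 2022", 1646905175)
print(elapsed)
```

Under that time-zone assumption the request had been running for about a minute. uWSGI's harakiri option kills a worker whose request exceeds a configured number of seconds, so this would be consistent with a request that hung for roughly 60 seconds. That is an inference from the log, not something confirmed in the thread.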

https://uwsgi-docs.readthedocs.io/en/latest/Options.html

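Several options on that page are related to UWSGI_OPT_MEMORY. When the limit is exceeded, uwsgi reloads the worker, probably by sending signal 9. You can check whether memory usage was above that value when the sig 9 happened (if it is not configured, the default is probably a few GB).
This does not trigger every time, so even though your OS still had plenty of free memory when we looked yesterday, the worker may have been killed here at just a few GB.

And memory usage that high is probably related to the code.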


Code used:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from apscheduler.schedulers.background import BlockingScheduler

# Use a blocking scheduler: if a run comes due while the previous run is still in progress, that run is skipped
scheduler = BlockingScheduler()


from pyhanlp import *
from nebula2.gclient.net import ConnectionPool
from nebula2.Config import Config
import queue
# pip install nebula2-python==2.6.0



class QaModel():
    def __init__(self):

        # Define the configuration
        config = Config()
        config.max_connection_pool_size = 10

        # Initialize the connection pool
        connection_pool = ConnectionPool()
        # Returns True if the given servers are reachable, False otherwise.
        ok = connection_pool.init([('127.0.0.1', 9669)], config)
        print('Server status is '+str(ok))


        self.qq = queue.Queue()

        for i in range(10):
            # Option 1: manage releasing the session yourself.
            # Get a session from the connection pool
            session = connection_pool.get_session('root', '123456')
            # Select the graph space
            session.execute('USE test')
            # Run the SHOW TAGS command
            result = session.execute('SHOW TAGS')
            print(result)

            # Put the session into the queue
            self.qq.put(session)

        print('ok')



    def qa(self,ngql_sent):
        session = self.qq.get()

        res = session.execute(ngql_sent)

        self.qq.put(session)

        return res







qaModel=QaModel()

if __name__ == "__main__":

    while True:
        ngql_sent = input('Enter ngql_sent: ')
        answer = qaModel.qa(ngql_sent)
        print(ngql_sent)
        print(answer)
        print()

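Also, nebula is deployed with docker on machine 213, and the docker container running the query code is deployed on machine 174.
In class Session, retry_connect=True is the default, so in principle it should reconnect by default?
But after 30 minutes without any activity, the connection to nebula seems to drop.
Could you please take a look?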

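  • After a reconnect, if your ngql_sent does not include USE test, it should fail. When you call qq.get, you can check whether you need to run USE test separately (imagine your console timed out: after reconnecting you would need to run USE test again).
  • You can set the timeout to 30 seconds to get a faster edit code --> test feedback loop, instead of waiting 30 minutes.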
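
The first suggestion above can be sketched as follows. This is a minimal illustration, not the nebula2 API: `SessionQueue` and `FakeSession` are hypothetical names, and the only thing assumed about a real session is that it has an `execute(statement)` method, as in the code posted earlier. Instead of detecting a reconnect, the sketch simply re-issues `USE` before every query.

```python
import queue


class SessionQueue:
    """Hands out pooled sessions and re-selects the graph space before each query."""

    def __init__(self, sessions, space):
        self._space = space
        self._queue = queue.Queue()
        for session in sessions:
            self._queue.put(session)

    def execute(self, ngql):
        session = self._queue.get()
        try:
            # Re-issue USE every time, in case the session was silently reconnected.
            session.execute("USE " + self._space)
            return session.execute(ngql)
        finally:
            # Return the session even if execute() raised, so the queue never drains.
            self._queue.put(session)


class FakeSession:
    """Stand-in for a real session so the sketch runs without a server."""

    def __init__(self):
        self.statements = []

    def execute(self, stmt):
        self.statements.append(stmt)
        return "result of " + stmt


fake = FakeSession()
sessions = SessionQueue([fake], space="test")
print(sessions.execute("SHOW TAGS"))
print(fake.statements)
```

Always sending `USE` costs one extra round trip per query but avoids tracking connection state. The `finally` block matters too: in the posted `qa()` method, an exception from `session.execute` means the session is never put back, so after enough failures `qq.get()` would block with an empty queue.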

Hello, where can I change the timeout?

I previously tried creating a new session for every ngql_sent input,
i.e. putting the following inside the qa() function, but that did not solve the problem either:

    # Get a session from the connection pool
    session = connection_pool.get_session('root', '123456')
    # Select the graph space
    session.execute('USE test')
    # Run the SHOW TAGS command
    result = session.execute('SHOW TAGS')
    print(result)

    print('ok')
client_idle_timeout_secs	0	Timeout for idle connections. 0 means never time out. Unit: seconds.
session_idle_timeout_secs	0	Timeout for idle sessions. 0 means never time out. Unit: seconds.

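If you have not changed these, the timeout is not 30 minutes. I only just realized that the 30 minutes you mentioned earlier is the session timeout in Studio, which is a different matter. (Connect to the database - Nebula Graph Database Manual)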

I think your 30-minute problem is still a uwsgi problem. I suggest going back to the root cause of the sig 9 itself (a quick Google search shows that many cases are RAM-related, e.g. DAMN ! worker 1 (pid: 108) died, killed by signal 9 :( trying respawn ... · Issue #1779 · unbit/uwsgi · GitHub).

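The settings you mentioned,
client_idle_timeout_secs 0 Timeout for idle connections. 0 means never time out. Unit: seconds.
session_idle_timeout_secs 0 Timeout for idle sessions. 0 means never time out. Unit: seconds.
are they the same as idle_time in the Config below?

Indeed, I have not changed any of these parameters.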

class Config(object):
    # the min connection always in pool
    min_connection_pool_size = 0

    # the max connection in pool
    max_connection_pool_size = 10

    # connection or execute timeout, unit ms, 0 means no timeout
    timeout = 0

    # 0 means will never close the idle connection, unit ms,
    idle_time = 0

    # the interval to check idle time connection, unit second, -1 means no check
    interval_check = -1

Check whether the graphd container has been restarted.

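Hello,
I don't have the /usr/local/nebula/etc/ folder,
and I haven't touched any Meta service configuration.
Mine is a single-machine deployment, not a distributed one.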

docker network ls

5dffd04f07fa nebula-docker-compose_nebula-net bridge local

docker inspect graphd_container_id
Output:
[
    {
        "Id": "73096c6b1738333efe9884cd1094b1cc6dd2e76f286ad43258af65dd2ea5a290",
        "Created": "2022-02-08T09:00:54.648947791Z",
        "Path": "/usr/local/nebula/bin/nebula-graphd",
        "Args": [
            "--flagfile=/usr/local/nebula/etc/nebula-graphd.conf",
            "--daemonize=false",
            "--containerized=true",
            "--meta_server_addrs=metad0:9559,metad1:9559,metad2:9559",
            "--port=9669",
            "--local_ip=graphd1",
            "--ws_ip=graphd1",
            "--ws_http_port=19669",
            "--log_dir=/logs",
            "--v=0",
            "--minloglevel=0"
        ],
        "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 9305,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2022-03-08T07:05:51.790593317Z",
            "FinishedAt": "2022-03-07T10:55:08.713273714Z",
            "Health": {
                "Status": "healthy",
                "FailingStreak": 0,
                "Log": [
                    {
                        "Start": "2022-03-14T19:36:46.120388206+08:00",
                        "End": "2022-03-14T19:36:46.207990823+08:00",
                        "ExitCode": 0,
                        "Output": "{\"git_info_sha\":\"d113f4a\",\"status\":\"running\"}"
                    },
                    {
                        "Start": "2022-03-14T19:37:16.224836685+08:00",
                        "End": "2022-03-14T19:37:16.317549216+08:00",
                        "ExitCode": 0,
                        "Output": "{\"git_info_sha\":\"d113f4a\",\"status\":\"running\"}"
                    },
                    {
                        "Start": "2022-03-14T19:37:46.333781763+08:00",
                        "End": "2022-03-14T19:37:46.416375869+08:00",
                        "ExitCode": 0,
                        "Output": "{\"git_info_sha\":\"d113f4a\",\"status\":\"running\"}"
                    },
                    {
                        "Start": "2022-03-14T19:38:16.429030779+08:00",
                        "End": "2022-03-14T19:38:16.517025397+08:00",
                        "ExitCode": 0,
                        "Output": "{\"git_info_sha\":\"d113f4a\",\"status\":\"running\"}"
                    },
                    {
                        "Start": "2022-03-14T19:38:46.531922505+08:00",
                        "End": "2022-03-14T19:38:46.62123726+08:00",
                        "ExitCode": 0,
                        "Output": "{\"git_info_sha\":\"d113f4a\",\"status\":\"running\"}"
                    }
                ]
            }
        },
        "Image": "sha256:053fc6df1a3d3876ea2d40907f4e0ca9c07d4d76a2f987afb361598520b517ac",
        "ResolvConfPath": "/var/lib/docker/containers/73096c6b1738333efe9884cd1094b1cc6dd2e76f286ad43258af65dd2ea5a290/resolv.conf",
        "HostnamePath": "/var/lib/docker/containers/73096c6b1738333efe9884cd1094b1cc6dd2e76f286ad43258af65dd2ea5a290/hostname",
        "HostsPath": "/var/lib/docker/containers/73096c6b1738333efe9884cd1094b1cc6dd2e76f286ad43258af65dd2ea5a290/hosts",
        "LogPath": "/var/lib/docker/containers/73096c6b1738333efe9884cd1094b1cc6dd2e76f286ad43258af65dd2ea5a290/73096c6b1738333efe9884cd1094b1cc6dd2e76f286ad43258af65dd2ea5a290-json.log",
        "Name": "/nebula-docker-compose_graphd1_1",
        "RestartCount": 0,
        "Driver": "overlay2",
        "Platform": "linux",
        "MountLabel": "",
        "ProcessLabel": "",
        "AppArmorProfile": "docker-default",
        "ExecIDs": null,
        "HostConfig": {
            "Binds": [
                "/root/nebula-docker-compose/nebula-docker-compose/logs/graph1:/logs:rw"
            ],
            "ContainerIDFile": "",
            "LogConfig": {
                "Type": "json-file",
                "Config": {}
            },
            "NetworkMode": "nebula-docker-compose_nebula-net",
            "PortBindings": {
                "19669/tcp": [
                    {
                        "HostIp": "",
                        "HostPort": ""
                    }
                ],
                "19670/tcp": [
                    {
                        "HostIp": "",
                        "HostPort": ""
                    }
                ],
                "9669/tcp": [
                    {
                        "HostIp": "",
                        "HostPort": ""
                    }
                ]
            },
            "RestartPolicy": {
                "Name": "on-failure",
                "MaximumRetryCount": 0
            },
            "AutoRemove": false,
            "VolumeDriver": "",
            "VolumesFrom": [],
            "CapAdd": [
                "SYS_PTRACE"
            ],
            "CapDrop": null,
            "CgroupnsMode": "host",
            "Dns": [],
            "DnsOptions": [],
            "DnsSearch": [],
            "ExtraHosts": null,
            "GroupAdd": null,
            "IpcMode": "private",
            "Cgroup": "",
            "Links": null,
            "OomScoreAdj": 0,
            "PidMode": "",
            "Privileged": false,
            "PublishAllPorts": false,
            "ReadonlyRootfs": false,
            "SecurityOpt": null,
            "UTSMode": "",
            "UsernsMode": "",
            "ShmSize": 67108864,
            "Runtime": "runc",
            "ConsoleSize": [
                0,
                0
            ],
            "Isolation": "",
            "CpuShares": 0,
            "Memory": 0,
            "NanoCpus": 0,
            "CgroupParent": "",
            "BlkioWeight": 0,
            "BlkioWeightDevice": null,
            "BlkioDeviceReadBps": null,
            "BlkioDeviceWriteBps": null,
            "BlkioDeviceReadIOps": null,
            "BlkioDeviceWriteIOps": null,
            "CpuPeriod": 0,
            "CpuQuota": 0,
            "CpuRealtimePeriod": 0,
            "CpuRealtimeRuntime": 0,
            "CpusetCpus": "",
            "CpusetMems": "",
            "Devices": null,
            "DeviceCgroupRules": null,
            "DeviceRequests": null,
            "KernelMemory": 0,
            "KernelMemoryTCP": 0,
            "MemoryReservation": 0,
            "MemorySwap": 0,
            "MemorySwappiness": null,
            "OomKillDisable": false,
            "PidsLimit": null,
            "Ulimits": null,
            "CpuCount": 0,
            "CpuPercent": 0,
            "IOMaximumIOps": 0,
            "IOMaximumBandwidth": 0,
            "MaskedPaths": [
                "/proc/asound",
                "/proc/acpi",
                "/proc/kcore",
                "/proc/keys",
                "/proc/latency_stats",
                "/proc/timer_list",
                "/proc/timer_stats",
                "/proc/sched_debug",
                "/proc/scsi",
                "/sys/firmware"
            ],
            "ReadonlyPaths": [
                "/proc/bus",
                "/proc/fs",
                "/proc/irq",
                "/proc/sys",
                "/proc/sysrq-trigger"
            ]
        },
        "GraphDriver": {
            "Data": {
                "LowerDir": "/var/lib/docker/overlay2/77a15f579c63e152821629a0084d90325756c464b47a92f7cbb2f9e69271578e-init/diff:/var/lib/docker/overlay2/aee53bf05eae89821793bbb105446af39d5ba9653efb1beb9a794f19356a1a55/diff:/var/lib/docker/overlay2/f5130530e03cb31cd35f7ad8fdbcbe28ed9f0e9620400fc44e9f08d9f69c6041/diff:/var/lib/docker/overlay2/afd208d2dbe7ebc72217894462a3846264ff3ac89cd419029d52223dc62fdd5e/diff:/var/lib/docker/overlay2/658b31386eb049665841ed9c26c99954cec476a6aba663d137cff9951b25a15c/diff",
                "MergedDir": "/var/lib/docker/overlay2/77a15f579c63e152821629a0084d90325756c464b47a92f7cbb2f9e69271578e/merged",
                "UpperDir": "/var/lib/docker/overlay2/77a15f579c63e152821629a0084d90325756c464b47a92f7cbb2f9e69271578e/diff",
                "WorkDir": "/var/lib/docker/overlay2/77a15f579c63e152821629a0084d90325756c464b47a92f7cbb2f9e69271578e/work"
            },
            "Name": "overlay2"
        },
        "Mounts": [
            {
                "Type": "bind",
                "Source": "/root/nebula-docker-compose/nebula-docker-compose/logs/graph1",
                "Destination": "/logs",
                "Mode": "rw",
                "RW": true,
                "Propagation": "rprivate"
            }
        ],
        "Config": {
            "Hostname": "73096c6b1738",
            "Domainname": "",
            "User": "",
            "AttachStdin": false,
            "AttachStdout": false,
            "AttachStderr": false,
            "ExposedPorts": {
                "19669/tcp": {},
                "19670/tcp": {},
                "9669/tcp": {}
            },
            "Tty": false,
            "OpenStdin": false,
            "StdinOnce": false,
            "Env": [
                "USER=root",
                "TZ=UTC",
                "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
            ],
            "Cmd": [
                "--meta_server_addrs=metad0:9559,metad1:9559,metad2:9559",
                "--port=9669",
                "--local_ip=graphd1",
                "--ws_ip=graphd1",
                "--ws_http_port=19669",
                "--log_dir=/logs",
                "--v=0",
                "--minloglevel=0"
            ],
            "Healthcheck": {
                "Test": [
                    "CMD",
                    "curl",
                    "-sf",
                    "http://graphd1:19669/status"
                ],
                "Interval": 30000000000,
                "Timeout": 10000000000,
                "StartPeriod": 20000000000,
                "Retries": 3
            },
            "Image": "vesoft/nebula-graphd:v2.6.2",
            "Volumes": {
                "/logs": {}
            },
            "WorkingDir": "/usr/local/nebula",
            "Entrypoint": [
                "/usr/local/nebula/bin/nebula-graphd",
                "--flagfile=/usr/local/nebula/etc/nebula-graphd.conf",
                "--daemonize=false",
                "--containerized=true"
            ],
            "OnBuild": null,
            "Labels": {
                "com.docker.compose.config-hash": "2da72bf8b6f425444e328726cd55cef2a26f4169c50e58acd89cfe3ed526daa1",
                "com.docker.compose.container-number": "1",
                "com.docker.compose.oneoff": "False",
                "com.docker.compose.project": "nebula-docker-compose",
                "com.docker.compose.project.config_files": "docker-compose.yaml",
                "com.docker.compose.project.working_dir": "/root/nebula-docker-compose/nebula-docker-compose",
                "com.docker.compose.service": "graphd1",
                "com.docker.compose.version": "1.29.2",
                "org.label-schema.build-date": "20201113",
                "org.label-schema.license": "GPLv2",
                "org.label-schema.name": "CentOS Base Image",
                "org.label-schema.schema-version": "1.0",
                "org.label-schema.vendor": "CentOS",
                "org.opencontainers.image.created": "2020-11-13 00:00:00+00:00",
                "org.opencontainers.image.licenses": "GPL-2.0-only",
                "org.opencontainers.image.title": "CentOS Base Image",
                "org.opencontainers.image.vendor": "CentOS"
            }
        },
        "NetworkSettings": {
            "Bridge": "",
            "SandboxID": "07fed8027cce5ce5487b6fc17a0a37725dba935213911e72e1c07fa8a3706d1c",
            "HairpinMode": false,
            "LinkLocalIPv6Address": "",
            "LinkLocalIPv6PrefixLen": 0,
            "Ports": {
                "19669/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "49178"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "49178"
                    }
                ],
                "19670/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "49177"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "49177"
                    }
                ],
                "9669/tcp": [
                    {
                        "HostIp": "0.0.0.0",
                        "HostPort": "49179"
                    },
                    {
                        "HostIp": "::",
                        "HostPort": "49179"
                    }
                ]
            },
            "SandboxKey": "/var/run/docker/netns/07fed8027cce",
            "SecondaryIPAddresses": null,
            "SecondaryIPv6Addresses": null,
            "EndpointID": "",
            "Gateway": "",
            "GlobalIPv6Address": "",
            "GlobalIPv6PrefixLen": 0,
            "IPAddress": "",
            "IPPrefixLen": 0,
            "IPv6Gateway": "",
            "MacAddress": "",
            "Networks": {
                "nebula-docker-compose_nebula-net": {
                    "IPAMConfig": null,
                    "Links": null,
                    "Aliases": [
                        "graphd1",
                        "73096c6b1738"
                    ],
                    "NetworkID": "5dffd04f07fad533a48c137a47a0e057de5b7cdb98346333b7ebfe7033da8f1e",
                    "EndpointID": "3e421ea0f147a500798b31caf00d1434685246d919fd201535711b3971663b8f",
                    "Gateway": "172.18.0.1",
                    "IPAddress": "172.18.0.8",
                    "IPPrefixLen": 16,
                    "IPv6Gateway": "",
                    "GlobalIPv6Address": "",
                    "GlobalIPv6PrefixLen": 0,
                    "MacAddress": "02:42:ac:12:00:08",
                    "DriverOpts": null
                }
            }
        }
    }
]
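
The two fields in this output that answer the "has graphd been restarted" question are `State.StartedAt` and `RestartCount`. Below is a minimal sketch that compares them with the time of the failure. It assumes the failure coincides with the HARAKIRI log entry (17:40:36 on 2022-03-10) and that this log time is in UTC+8; neither is stated in the thread, and `parse_docker_time` is a helper made up for this example.

```python
from datetime import datetime, timedelta, timezone


def parse_docker_time(value):
    """Parse a docker inspect UTC timestamp, ignoring the fractional seconds."""
    return datetime.strptime(value[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


# Values copied from the docker inspect output above.
started_at = parse_docker_time("2022-03-08T07:05:51.790593317Z")
restart_count = 0

# Assumption: the HARAKIRI log time is local time in UTC+8.
failure_at = datetime(2022, 3, 10, 17, 40, 36, tzinfo=timezone(timedelta(hours=8)))

uptime_at_failure = failure_at - started_at
# True when the container had been running continuously before the failure.
was_running = restart_count == 0 and uptime_at_failure > timedelta(0)
print(uptime_at_failure, was_running)
```

Under those assumptions the container had already been up for about two days when the request failed, which would point away from a graphd restart as the cause.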

Exact error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/nebula2/fbthrift/transport/TSocket.py", line 305, in read
    buff = self.handle.recv(sz)
TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/nebula2/gclient/net/Connection.py", line 122, in execute
    resp = self._connection.execute(session_id, stmt)
  File "/usr/local/lib/python3.6/site-packages/nebula2/graph/GraphService.py", line 1110, in execute
    return self.recv_execute()
  File "/usr/local/lib/python3.6/site-packages/nebula2/graph/GraphService.py", line 1122, in recv_execute
    (fname, mtype, rseqid) = self._iprot.readMessageBegin()
  File "/usr/local/lib/python3.6/site-packages/nebula2/fbthrift/protocol/TBinaryProtocol.py", line 137, in readMessageBegin
    sz = self.readI32()
  File "/usr/local/lib/python3.6/site-packages/nebula2/fbthrift/protocol/TBinaryProtocol.py", line 216, in readI32
    buff = self.trans.readAll(4)
  File "/usr/local/lib/python3.6/site-packages/nebula2/fbthrift/transport/TTransport.py", line 72, in readAll
    chunk = self.read(need)
  File "/usr/local/lib/python3.6/site-packages/nebula2/fbthrift/transport/TTransport.py", line 183, in read
    self.__rbuf = StringIO(self.__trans.read(max(sz, self.__rbuf_size)))
  File "/usr/local/lib/python3.6/site-packages/nebula2/fbthrift/transport/TSocket.py", line 312, in read
    message='Socket read failed: {}'.format(str(e))
nebula2.fbthrift.transport.TTransport.TTransportException: Socket read failed: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apscheduler/executors/base.py", line 125, in run_job
    retval = job.func(*job.args, **job.kwargs)
  File "./apps/kbqa/kbqa_service.py", line 542, in redis_update
    kbqaModel.qa(question='你知道姚明的女儿是谁吗ka')
  File "./apps/kbqa/kbqa_service.py", line 405, in qa
    resp = session.execute(check)
  File "/usr/local/lib/python3.6/site-packages/nebula2/gclient/net/Session.py", line 39, in execute
    resp = self._connection.execute(self._session_id, stmt)
  File "/usr/local/lib/python3.6/site-packages/nebula2/gclient/net/Connection.py", line 128, in execute
    raise IOErrorException(IOErrorException.E_TIMEOUT, te.message)
nebula2.Exception.IOErrorException: Socket read failed: [Errno 110] Connection timed out

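This is gclient timing out (E_TIMEOUT) while talking to graphD. Can you confirm that graphD was still up at that moment?

docker ps shows how long each container has been up. In addition, something like docker events --filter event=restart --since=60m shows the restart events of all containers over the last 60 minutes.

A timeout means either the network was unreachable at that time, or graphD was not up.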
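
To help tell those two cases apart, a plain TCP probe run from inside the client container can show whether the graphD port is reachable at a given moment. This is a minimal sketch using only the standard library; `can_reach` is a name made up for this example, and the host and port are the ones from the ConnectionPool log line earlier in the thread.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks and timeouts.
        return False


if __name__ == "__main__":
    # Address taken from the "Get connection to" log line above.
    print(can_reach("10.4.40.13", 9669))
```

A successful probe only shows that a new connection can be opened right now. It does not prove that a pooled connection that has been idle for hours is still alive, so it is best run at the moment a query fails.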

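Also, the error you mentioned in the group chat is a different one: that one is a broken connection, not a timeout, and that reconnect issue was already fixed in 2.6.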

ERROR [exception.py:90]:Invalid HTTP_HOST header: '10.3.126.174:23356'. You may need to add '10.3.126.174' to ALLOWED_HOSTS.

Hello, the docker containers have been up the whole time.

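OK. This error appears to be because the address 10.3.126.174, which is talking to django, is not in the allowlist in the django settings. Can you check whose address this is? It is not one of the graphD instances, is it?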
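
For reference, the allowlist in question is Django's `ALLOWED_HOSTS` setting, as the error message itself suggests. This is a minimal sketch of the relevant fragment of `settings.py`, assuming 10.3.126.174 is an address that clients legitimately use to reach this service:

```python
# settings.py (fragment)
# Host names or IPs that Django accepts in the HTTP Host header.
# Assumption: 10.3.126.174 is a legitimate address for reaching this service.
ALLOWED_HOSTS = [
    "10.3.126.174",
]
```

Only the host part is listed; the port from the Host header (23356 in the error above) is not included in the entry.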

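@Aiee A question for you: the nebula2-python (2.6) client session timed out and graphD was never restarted. Apart from the network being unreachable, could it be some other problem on the client side?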

It's my local machine's IP.