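To reproduce the problem, we wrote a Spring Boot demo and deployed it to the test environment. The demo only implements a hello-world endpoint. Its application type is web_tomcat (deployed as a WAR), the base image is base_tomcat/java-centos6-jdk18-60-tom8050-ngx197, and the Java version in the image is 1.8.0_60. Given our earlier experience with MySQL being killed, our first guess was that Linux limits were to blame, so we deployed the same image to two different batches of machines. Sure enough, the service on the new machines died that night, while the service on the old machines kept running normally.
Let's look at the limit settings on the machine where the service died.
Is the Java process affected by limits?
In principle a Java process should not be affected by the system's open files limit (the maximum number of file descriptors). To verify this anyway, we changed the value to match the healthy machines. Because the demo is a web_tomcat application, we cannot modify its startup script, so we changed the limit of the running Java process with prlimit: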
prlimit -p 32672 --nofile=1048576
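To confirm that a change made with prlimit actually took effect, the limits of a running process can be read back from /proc/<pid>/limits. The following is a minimal sketch, assuming a Linux /proc filesystem; the helper name nofile_limits is made up for illustration.

```shell
# Print the soft and hard "Max open files" limits from a file in the
# /proc/<pid>/limits format. nofile_limits is a hypothetical helper.
nofile_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Usage (32672 is the Java PID from the prlimit command above):
#   nofile_limits /proc/32672/limits
```

Comparing the output on a healthy machine and a failing machine is a quick way to see whether the two batches really differ in this setting.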
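The service still died at around 00:00 that night, so the Java process's own open files limit does not appear to be the cause. dmesg showed nothing useful either.
Is the Java version too old, causing unreasonable memory allocation?
We asked the JDOS R&D team for help. They believed the Java version was the problem: an older version might not cap the amount of memory it requests. The reasoning is described here: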
https://blog.softwaremill.com/docker-support-in-new-java-8-finally-fd595df0ca54?gi=a0cc6736ed14
[Figure: Java memory usage on the failing machine]
[Figure: Java memory usage on the healthy machine]
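According to this article, limiting memory with Docker cgroups can get the JVM process terminated, because an older JVM still reads the host's memory and CPU figures rather than the limits imposed by the Docker cgroups. Newer Java versions fix this.
[Figure: screenshot of the fix proposed in the article]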
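We were skeptical, because our program already sets its JVM parameters explicitly.
Still, it was worth a try, so we added an experiment group running Java 11.0.8.
That night the Java process in the experiment group died as well, so the Java version is not the cause either.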
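One way to check how much heap a given JVM would actually claim inside a container is to read the MaxHeapSize value it reports. The following is a minimal sketch, assuming a java binary on the PATH and the usual -XX:+PrintFlagsFinal column layout (type, name, assignment operator, value); max_heap_bytes is a hypothetical helper.

```shell
# Extract the MaxHeapSize value (in bytes) from `java -XX:+PrintFlagsFinal`
# output read on stdin. max_heap_bytes is a hypothetical helper.
max_heap_bytes() {
    awk '$2 == "MaxHeapSize" { print $4 }'
}

# Usage, run inside the container:
#   java -XX:+PrintFlagsFinal -version 2>/dev/null | max_heap_bytes
```

If the reported value tracks the host's memory rather than the container's limit, the JVM is not honoring the cgroup. Explicit heap settings, as in our setup, sidestep that question, which is why the Java upgrade made no difference here.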
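Is a scheduled task in the container the cause?
Because the base image is the official one provided by JDOS, we had never suspected scheduled tasks. With no other leads left, we checked the container's scheduled tasks.
There was a scheduled task, but its execution time did not match the time the Java process died. We decided to delete it and see what happened.
The Java process died again that night, but this time dmesg had logs: crond was killed at the same moment as Java, and the reason was an OOM triggered by crond's excessive memory use.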
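Could there also be a system-level cron job? We checked /etc/crontab and, sure enough, found another cron job (who built this image?!).
Its schedule matched the time the Java process died. But here is the puzzle: the script the job is supposed to run, logrotate.sh, does not exist, so in theory the job should not cause any trouble.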
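The lesson from this step is that crontab -l only lists the current user's table, so system-wide jobs in /etc/crontab and /etc/cron.d are easy to miss. The following is a minimal sketch for listing the job lines in such a file; cron_jobs is a hypothetical helper, and the filter for environment assignments is a simplification.

```shell
# Print the job lines of a crontab-format file, skipping blank lines,
# comments, and VAR=value environment assignments.
cron_jobs() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" \
        | grep -Ev '^[[:space:]]*[A-Za-z_][A-Za-z0-9_]*[[:space:]]*='
}

# Usage: check the system-wide file and the drop-in directory, e.g.
#   cron_jobs /etc/crontab
#   for f in /etc/cron.d/*; do cron_jobs "$f"; done
```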
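To find out whether the cron job really was the cause, we changed its schedule to 11:00 and watched whether the Java process would die, while tracing the cron process with strace.
Sure enough, the Java process died at 11:00, so the cron job really is the cause. Let's look at the strace output: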
19:59:01 close(3) = 019:59:01 stat("/etc/pam.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 019:59:01 open("/etc/pam.d/crond", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=293, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "#/n# The PAM configuration file f"..., 4096) = 29319:59:01 open("/lib64/security/pam_access.so", O_RDONLY) = 519:59:01 read(5, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0000/17/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=18552, ...}) = 019:59:01 mmap(NULL, 2113800, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd76932200019:59:01 mprotect(0x7fd769325000, 2097152, PROT_NONE) = 019:59:01 mmap(0x7fd769525000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x3000) = 0x7fd76952500019:59:01 close(5) = 019:59:01 open("/etc/ld.so.cache", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=16203, ...}) = 019:59:01 mmap(NULL, 16203, PROT_READ, MAP_PRIVATE, 5, 0) = 0x7fd7707f800019:59:01 close(5) = 019:59:01 open("/lib64/libnsl.so.1", O_RDONLY) = 519:59:01 read(5, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0p@/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=113432, ...}) = 019:59:01 mmap(NULL, 2198192, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd76910900019:59:01 mprotect(0x7fd76911f000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd76931e000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x15000) = 0x7fd76931e00019:59:01 mmap(0x7fd769320000, 6832, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd76932000019:59:01 close(5) = 019:59:01 mprotect(0x7fd76931e000, 4096, PROT_READ) = 019:59:01 mprotect(0x7fd769525000, 4096, PROT_READ) = 019:59:01 munmap(0x7fd7707f8000, 16203) = 019:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 
019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080300019:59:01 read(5, "#%PAM-1.0/n# This file is auto-ge"..., 4096) = 69219:59:01 open("/lib64/security/pam_unix.so", O_RDONLY) = 619:59:01 read(6, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0/240&/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=51960, ...}) = 019:59:01 mmap(NULL, 2196352, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768ef000019:59:01 mprotect(0x7fd768efc000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7690fb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0xb000) = 0x7fd7690fb00019:59:01 mmap(0x7fd7690fd000, 45952, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fd7690fd00019:59:01 close(6) = 019:59:01 mprotect(0x7fd7690fb000, 4096, PROT_READ) = 019:59:01 read(5, "", 4096) = 019:59:01 close(5) = 019:59:01 munmap(0x7fd770803000, 4096) = 019:59:01 open("/lib64/security/pam_loginuid.so", O_RDONLY) = 519:59:01 read(5, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0/220/t/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(5, {st_mode=S_IFREG|0755, st_size=10240, ...}) = 019:59:01 mmap(NULL, 2105480, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 5, 0) = 0x7fd768ced00019:59:01 mprotect(0x7fd768cef000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd768eee000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x1000) = 0x7fd768eee00019:59:01 close(5) = 019:59:01 mprotect(0x7fd768eee000, 4096, PROT_READ) = 019:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080300019:59:01 read(5, "#%PAM-1.0/n# This file is auto-ge"..., 4096) = 69219:59:01 open("/lib64/security/pam_keyinit.so", O_RDONLY) = 619:59:01 read(6, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0`/10/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(6, 
{st_mode=S_IFREG|0755, st_size=10224, ...}) = 019:59:01 mmap(NULL, 2105488, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd768aea00019:59:01 mprotect(0x7fd768aec000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd768ceb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x1000) = 0x7fd768ceb00019:59:01 close(6) = 019:59:01 mprotect(0x7fd768ceb000, 4096, PROT_READ) = 019:59:01 open("/lib64/security/pam_limits.so", O_RDONLY) = 619:59:01 read(6, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0/320/20/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18600, ...}) = 019:59:01 mmap(NULL, 2113848, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7688e500019:59:01 mprotect(0x7fd7688e9000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd768ae8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd768ae800019:59:01 close(6) = 019:59:01 mprotect(0x7fd768ae8000, 4096, PROT_READ) = 019:59:01 open("/lib64/security/pam_succeed_if.so", O_RDONLY) = 619:59:01 read(6, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0/340/v/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=14384, ...}) = 019:59:01 mmap(NULL, 2109624, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7686e100019:59:01 mprotect(0x7fd7686e4000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7688e3000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x2000) = 0x7fd7688e300019:59:01 close(6) = 019:59:01 mprotect(0x7fd7688e3000, 4096, PROT_READ) = 019:59:01 read(5, "", 4096) = 019:59:01 close(5) = 019:59:01 munmap(0x7fd770803000, 4096) = 019:59:01 open("/etc/pam.d/password-auth", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=692, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080300019:59:01 read(5, "#%PAM-1.0/n# This file is auto-ge"..., 4096) = 69219:59:01 open("/lib64/security/pam_env.so", O_RDONLY) = 619:59:01 
read(6, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0/300/r/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=18592, ...}) = 019:59:01 mmap(NULL, 2113776, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7684dc00019:59:01 mprotect(0x7fd7684e0000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7686df000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0x3000) = 0x7fd7686df00019:59:01 close(6) = 019:59:01 mprotect(0x7fd7686df000, 4096, PROT_READ) = 019:59:01 open("/lib64/security/pam_deny.so", O_RDONLY) = 619:59:01 read(6, "/177ELF/2/1/1/0/0/0/0/0/0/0/0/0/3/0>/0/1/0/0/0000/5/0/0/0/0/0/0"..., 832) = 83219:59:01 fstat(6, {st_mode=S_IFREG|0755, st_size=5952, ...}) = 019:59:01 mmap(NULL, 2101272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 6, 0) = 0x7fd7682da00019:59:01 mprotect(0x7fd7682db000, 2093056, PROT_NONE) = 019:59:01 mmap(0x7fd7684da000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 6, 0) = 0x7fd7684da00019:59:01 close(6) = 019:59:01 mprotect(0x7fd7684da000, 4096, PROT_READ) = 019:59:01 read(5, "", 4096) = 019:59:01 close(5) = 019:59:01 munmap(0x7fd770803000, 4096) = 019:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/etc/pam.d/other", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=154, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "#%PAM-1.0/nauth required "..., 4096) = 15419:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash/n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 uname({sys="Linux", 
node="host-11-159-73-176", ...}) = 019:59:01 open("/etc/security/access.conf", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=4620, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "# Login access control table./n#/n"..., 4096) = 409619:59:01 read(3, " should get access from ipv4 net"..., 4096) = 52419:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 getuid() = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash/n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 geteuid() = 019:59:01 open("/etc/shadow", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG, st_size=901, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:$6$4.53VPrJ$1wxMpbsWYp4VKea"..., 4096) = 90119:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 319:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 019:59:01 readlink("/proc/self/exe", "/usr/sbin/crond", 4096) = 1519:59:01 sendto(3, "p/0/0/0M/4/5/0/1/0/0/0/0/0/0/0op=PAM:accountin"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 11219:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])19:59:01 recvfrom(3, "$/0/0/0/2/0/0/1/1/0/0/0/227/7/0/0/0/0/0/0p/0/0/0M/4/5/0/1/0/0/0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 recvfrom(3, "$/0/0/0/2/0/0/1/1/0/0/0/227/7/0/0/0/0/0/0p/0/0/0M/4/5/0/1/0/0/0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 close(3) = 019:59:01 open("/etc/security/pam_env.conf", O_RDONLY) = 
319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=2980, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "#/n# This is the configuration fi"..., 4096) = 298019:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/etc/environment", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 319:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 019:59:01 sendto(3, "p/0/0/0O/4/5/0/2/0/0/0/0/0/0/0op=PAM:setcred a"..., 112, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 11219:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])19:59:01 recvfrom(3, "$/0/0/0/2/0/0/1/2/0/0/0/227/7/0/0/0/0/0/0p/0/0/0O/4/5/0/2/0/0/0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 recvfrom(3, "$/0/0/0/2/0/0/1/2/0/0/0/227/7/0/0/0/0/0/0p/0/0/0O/4/5/0/2/0/0/0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 close(3) = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash/n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/proc/self/loginuid", O_WRONLY|O_TRUNC|O_NOFOLLOW) = 319:59:01 write(3, "0", 1) = 119:59:01 close(3) = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, 
"root:x:0:0:root:/root:/bin/bash/n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 getuid() = 019:59:01 getgid() = 019:59:01 keyctl(0, 0xfffffffd, 0, 0, 0) = 49646638519:59:01 keyctl(0, 0xfffffffb, 0, 0, 0x30) = 78570213219:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash/n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 getrlimit(RLIMIT_CPU, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_FSIZE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_DATA, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_STACK, {rlim_cur=8192*1024, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_CORE, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_RSS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 019:59:01 getrlimit(RLIMIT_MEMLOCK, {rlim_cur=64*1024, rlim_max=64*1024}) = 019:59:01 getrlimit(RLIMIT_AS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_LOCKS, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 getrlimit(RLIMIT_SIGPENDING, {rlim_cur=883632, rlim_max=883632}) = 019:59:01 getrlimit(RLIMIT_MSGQUEUE, {rlim_cur=800*1024, rlim_max=800*1024}) = 019:59:01 getrlimit(RLIMIT_NICE, {rlim_cur=0, rlim_max=0}) = 019:59:01 getrlimit(RLIMIT_RTPRIO, {rlim_cur=0, rlim_max=0}) = 019:59:01 getpriority(PRIO_PROCESS, 0) = 2019:59:01 open("/etc/security/limits.conf", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1835, ...}) = 019:59:01 mmap(NULL, 4096, 
PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "# /etc/security/limits.conf/n#/n#E"..., 4096) = 183519:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 open("/etc/security/limits.d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 319:59:01 getdents(3, /* 3 entries */, 32768) = 8819:59:01 open("/usr/lib64/gconv/gconv-modules.cache", O_RDONLY) = 519:59:01 fstat(5, {st_mode=S_IFREG|0644, st_size=26060, ...}) = 019:59:01 mmap(NULL, 26060, PROT_READ, MAP_SHARED, 5, 0) = 0x7fd7707f500019:59:01 close(5) = 019:59:01 getdents(3, /* 0 entries */, 32768) = 019:59:01 close(3) = 019:59:01 open("/etc/security/limits.d/90-nproc.conf", O_RDONLY) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=193, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "# Default limit for number of us"..., 4096) = 19319:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 setrlimit(RLIMIT_NPROC, {rlim_cur=RLIM_INFINITY, rlim_max=RLIM_INFINITY}) = 019:59:01 setpriority(PRIO_PROCESS, 0, 0) = 019:59:01 getuid() = 019:59:01 open("/etc/passwd", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=1057, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 read(3, "root:x:0:0:root:/root:/bin/bash/n"..., 4096) = 105719:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 socket(PF_NETLINK, SOCK_RAW, 9) = 319:59:01 fcntl(3, F_SETFD, FD_CLOEXEC) = 019:59:01 sendto(3, "t/0/0/0Q/4/5/0/3/0/0/0/0/0/0/0op=PAM:session_o"..., 116, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 11619:59:01 poll([{fd=3, events=POLLIN}], 1, 500) = 1 ([{fd=3, revents=POLLIN}])19:59:01 recvfrom(3, "$/0/0/0/2/0/0/1/3/0/0/0/227/7/0/0/0/0/0/0t/0/0/0Q/4/5/0/3/0/0/0"..., 8988, MSG_PEEK|MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, 
groups=00000000}, [12]) = 3619:59:01 recvfrom(3, "$/0/0/0/2/0/0/1/3/0/0/0/227/7/0/0/0/0/0/0t/0/0/0Q/4/5/0/3/0/0/0"..., 8988, MSG_DONTWAIT, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 3619:59:01 close(3) = 019:59:01 setgid(0) = 019:59:01 open("/proc/sys/kernel/ngroups_max", O_RDONLY) = 319:59:01 read(3, "65536/n", 31) = 619:59:01 close(3) = 019:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 319:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)19:59:01 close(3) = 019:59:01 socket(PF_FILE, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 319:59:01 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)19:59:01 close(3) = 019:59:01 open("/etc/group", O_RDONLY|O_CLOEXEC) = 319:59:01 fstat(3, {st_mode=S_IFREG|0644, st_size=497, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd77080400019:59:01 lseek(3, 0, SEEK_CUR) = 019:59:01 read(3, "root:x:0:/nbin:x:1:bin,daemon/ndae"..., 4096) = 49719:59:01 read(3, "", 4096) = 019:59:01 close(3) = 019:59:01 munmap(0x7fd770804000, 4096) = 019:59:01 setgroups(1, [0]) = 019:59:01 setreuid(0, 4294967295) = 019:59:01 rt_sigaction(SIGCHLD, {SIG_DFL, [CHLD], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, {0x558826e03b80, [], SA_RESTORER|SA_RESTART, 0x7fd76fa316a0}, 8) = 019:59:01 pipe([3, 5]) = 019:59:01 pipe([6, 7]) = 019:59:01 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd7707fca70) = 194619:59:01 gettid() = 194319:59:01 open("/proc/self/task/1943/attr/exec", O_RDWR) = 819:59:01 write(8, NULL, 0) = -1 EINVAL (Invalid argument)19:59:01 close(8) = 019:59:01 close(3) = 019:59:01 close(7) = 019:59:01 close(5) = 019:59:01 fcntl(6, F_GETFL) = 0 (flags O_RDONLY)19:59:01 fstat(6, {st_mode=S_IFIFO|0600, st_size=0, ...}) = 019:59:01 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x7fd770804000
19:59:01 lseek(6, 0, SEEK_CUR) = -1 ESPIPE (Illegal seek)
19:59:01 read(6, "/bin/bash: ./logrotate.sh: /346/262/241/346/234"..., 4096) = 55
19:59:01 uname({sys="Linux", node="host-11-159-73-176", ...}) = 0
19:59:01 getrlimit(RLIMIT_NOFILE, {rlim_cur=1073741816, rlim_max=1073741816}) = 0
19:59:01 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd6682da000
19:59:01 --- SIGCHLD (Child exited) @ 0 (0) ---
19:59:06 +++ killed by SIGKILL +++
The last system call is an mmap that allocates 4 GiB (4294967296 bytes) in one go, after which the process is killed.
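In a trace this long, the oversized allocation is easy to overlook. The following is a minimal sketch for pulling the largest mmap(NULL, ...) length out of strace output; largest_mmap is a hypothetical helper, and the trace file name in the usage line is an assumption.

```shell
# Print the largest length passed to any mmap(NULL, <len>, ...) call
# found in strace output read on stdin.
largest_mmap() {
    grep -oE 'mmap\(NULL, [0-9]+' | awk '{ print $2 }' | sort -n | tail -n 1
}

# Usage (crond.strace is a placeholder for wherever the trace was saved):
#   largest_mmap < crond.strace
```

Because grep -o matches each call individually, this also works on a trace whose line breaks were lost.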
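Right before the mmap there is a getrlimit(RLIMIT_NOFILE) call. Just as in the earlier MySQL incident, memory is being allocated in proportion to a system resource limit.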
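The numbers in the trace line up with a table that holds one small entry per possible file descriptor. The following is a minimal sketch of the arithmetic, assuming 4 bytes per entry (a typical pid_t on Linux); that entry size is an assumption, and fd_table_bytes is a hypothetical helper.

```shell
# Bytes needed for a table with one 4-byte entry per possible file descriptor.
# The 4-byte entry size is an assumption (a typical pid_t on Linux).
fd_table_bytes() {
    echo $(( $1 * 4 ))
}

fd_table_bytes 1073741816   # RLIMIT_NOFILE from the trace: prints 4294967264
fd_table_bytes 1048576      # the limit we set with prlimit: prints 4194304
```

With the limit from the trace, the table is within 32 bytes of the 4294967296 bytes requested by mmap. With a limit of 1048576 it would be only 4 MiB.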
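To confirm that cron is what gets the Java process killed, we manually killed the cron process so that no scheduled task would run, then checked again whether the Java process would die.
Do newer CentOS releases have the same problem?
In principle the OOM killer should only kill the process using the most memory, and the Java process was not the top consumer. Could the OOM killer's policy have changed in newer CentOS releases?
Let's check whether newer CentOS releases have this problem.
The current image runs CentOS release 6.6 (Final). To check whether newer releases behave the same way, we added two more experiment groups, upgrading the base image to CentOS release 6.10 (Final) and CentOS Linux release 7.9.2009 (Core) respectively, each with the same cron job.
On both CentOS release 6.10 (Final) and CentOS Linux release 7.9.2009 (Core), the Java process was not killed. Only the cron child process was.
Conclusion
Because the container's open files limit (the maximum number of file descriptors) was set unreasonably high, crond's memory usage spiked when it ran a job, and the container risked running out of memory. Linux protects itself by killing memory-hungry processes, so the cron child process and the Java process were killed together. (Which raises another question: why does this JDOS base image run a shell script that does not exist at all, and run it twice???) Newer CentOS releases do not kill the Java process. Our guess is that the kill-selection policy differs slightly between CentOS versions.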
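How cron runs a job
On Linux, the crontab tool is provided by the cronie package. Let's walk through how cron executes a job.
child_process() runs the cron child process. While running a job, cron may also send mail.
When cron_popen runs, it zeroes a block of memory whose size is derived from open files (the maximum number of file descriptors).
In summary, we have found the cause of the cron OOM: the open files limit was set far too high, and the cron job's output was not redirected, so cron went into its mail-sending logic. The memory it zeroed there exceeded the memory available to the container, which caused the OOM.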
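Whether a container is exposed to this can be estimated by comparing the size of that per-descriptor table with the container's memory limit. The following is a minimal sketch, assuming 4 bytes per table entry and a cgroup v1 memory controller path; both are assumptions, and fd_table_exceeds is a hypothetical helper.

```shell
# Return success (0) if a zeroed table with one 4-byte entry per file
# descriptor would be larger than the given memory limit in bytes.
# The 4-byte entry size is an assumption for illustration.
fd_table_exceeds() {
    nofile=$1
    mem_limit=$2
    [ $(( nofile * 4 )) -gt "$mem_limit" ]
}

# Usage inside a container (cgroup v1 path; it differs under cgroup v2,
# and `ulimit -n` must report a number rather than "unlimited"):
#   fd_table_exceeds "$(ulimit -n)" \
#       "$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)" \
#       && echo "table would not fit in this container"
```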
cronie fixed this problem after version 1.5.4. To check the cronie version in the current container, run:
rpm -q cronie
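The installed version can also be compared against 1.5.4 in a script. The following is a minimal sketch, assuming GNU sort with the -V (version sort) option; version_gt is a hypothetical helper.

```shell
# Return success (0) if version $1 is strictly newer than version $2.
# Relies on GNU sort's -V (version sort) option.
version_gt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | tail -n 1)" = "$1" ]
}

# Usage:
#   version_gt "$(rpm -q --qf '%{VERSION}' cronie)" 1.5.4 \
#       && echo "cronie is newer than 1.5.4"
```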
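The Linux kernel OOM killer
The Linux kernel has a mechanism called the OOM killer (Out Of Memory killer). When memory is about to run out, it looks for processes that use too much memory, especially ones whose usage grows very quickly, and kills one of them to keep the system alive. How the kernel detects the shortage, selects a process, and kills it can be read in the kernel source file linux/mm/oom_kill.c: when the system runs out of memory, out_of_memory() is triggered, and it calls select_bad_process() to choose a "bad" process to kill.
Some of the main selection criteria are:
Note: the OOM killer may behave differently across Linux versions.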
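The kernel exposes its current ranking through /proc/<pid>/oom_score, where a higher score means the process is more likely to be chosen. The following is a minimal sketch for listing the top candidates; oom_candidates is a hypothetical helper, and the scores themselves are computed differently across kernel versions.

```shell
# List "<oom_score> <pid>" pairs in descending score order.
# $1 is the proc root (default /proc), $2 the number of rows (default 5).
oom_candidates() {
    root=${1:-/proc}
    for dir in "$root"/[0-9]*; do
        [ -r "$dir/oom_score" ] || continue
        printf '%s %s\n' "$(cat "$dir/oom_score" 2>/dev/null)" "${dir##*/}"
    done | sort -rn | head -n "${2:-5}"
}

# Usage inside the container:
#   oom_candidates
```

Running this on each CentOS version while the cron job fires would be one way to see whether the Java process and the cron child are ranked differently.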
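Use a recent, stable CentOS release. If the business cannot upgrade CentOS, set a reasonable open files limit. application_worker applications can change the limit in their startup script. web_tomcat applications cannot modify the startup script, so they can either kill the cron process or delete the system cron job, or manually upgrade cronie to 1.5.7-5.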
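For applications that do control their startup script, lowering the limit can be done before the service is launched. The following is a minimal sketch, assuming a POSIX-style shell; cap_nofile is a hypothetical helper, 1048576 is the value we used with prlimit, and the java command line is only a placeholder.

```shell
# Lower this shell's open files limit to at most $1.
# Processes started from this shell afterwards inherit the lowered limit.
cap_nofile() {
    current=$(ulimit -n)
    if [ "$current" = "unlimited" ] || [ "$current" -gt "$1" ]; then
        ulimit -n "$1"
    fi
}

# Usage in a startup script:
#   cap_nofile 1048576
#   exec java -jar app.jar
```

This only affects processes started from that script. A crond started elsewhere in the container keeps whatever limit it inherited.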
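The open files limit is a deep pit, and we have now fallen into it twice. Please check that the CentOS version and limit settings of your own service's containers are reasonable. This incident happened in the test environment, so it did not cause an outage. Had it happened in production, the consequences would have been severe.
Because every machine in the new test batch has this problem, our team has reported it to the machine provider, who will fix the maximum file descriptor setting across the batch. If the issue is currently affecting your work, you can temporarily delete the jobs in the container's /etc/crontab.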
https://cloud.tencent.com/developer/article/1183262
https://github.com/cronie-crond/cronie