The Strange Case of an ES Node's CPU Climbing to 100%

Wesley13

This is a technical problem that, from the day it happened until now, I have not solved at the root, and it remains a huge puzzle for me. Written: Sunday, 2020-06-14, 14:00.

The problem has since been solved; the troubleshooting process and the solution are at the very end of this article (solution written 2020-06-20 15:26).

Problem one: Around 10 AM on 2020-06-04 (peak hours), one node in cluster A, which we'll call node 37, saw its CPU climb to 100% and stay there. The search queue backed up and large numbers of search requests were rejected. Scaling out fixed the immediate problem (we later added five more nodes), but even in the expanded cluster, node 37's everyday CPU usage remained far higher than the other nodes', about 10x. In other words, the other nodes sat idle watching the show; at least after the scale-out the CPU no longer maxed out.

Problem two: Around 10 AM on 2020-06-12 (peak hours), one node in cluster B, which we'll call node 2, saw its CPU climb to 100% and stay there. The search queue backed up and large numbers of search requests were rejected; scaling out fixed it, but node 2's everyday CPU usage remained about 10x higher than the other nodes'. We traced it to a massive surge in query volume from a third-party API. Because time was tight, we put a circuit breaker on the third-party query endpoint: anything above N QPS got an error response immediately. Node 2's CPU dropped straight back to normal, while the other nodes sat idle the whole time.
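The emergency breaker described above ("above N QPS, return an error immediately") can be sketched as a fixed-window counter. This is a minimal illustration under my own assumptions, not our actual implementation; the class name and the one-second window are mine:

```java
import java.util.concurrent.atomic.AtomicLong;

public class QpsBreaker {
    private final long maxQps;
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis());
    private final AtomicLong count = new AtomicLong(0);

    public QpsBreaker(long maxQps) {
        this.maxQps = maxQps;
    }

    /** Returns true if the call may proceed, false if it should be rejected with an error. */
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        // Roll over to a new one-second window; only one thread wins the CAS and resets.
        if (now - start >= 1000 && windowStart.compareAndSet(start, now)) {
            count.set(0);
        }
        return count.incrementAndGet() <= maxQps;
    }

    public static void main(String[] args) {
        QpsBreaker breaker = new QpsBreaker(5);
        int allowed = 0;
        for (int i = 0; i < 20; i++) {
            if (breaker.tryAcquire()) allowed++;
        }
        // All 20 calls land in the same one-second window, so only the first 5 pass.
        System.out.println("allowed=" + allowed);
    }
}
```

A fixed window lets a brief burst of up to 2N straddle a window boundary; a sliding window or token bucket smooths that out, but for an emergency stopgap the simple counter is enough.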

For problem one, we inspected the threads that were burning the most CPU inside the ES process on node 37 at the time:

"elasticsearch[logsearch-datanode-36][index][T#14]" #273 daemon prio=5 os_prio=0 tid=0x00007ef8e8084800 nid=0x6f62 runnable [0x00007ef4d37eb000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerDelete(InternalEngine.java:582)
    - locked <0x00000000c3c36900> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:558)
    at org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:641)
    at org.elasticsearch.action.delete.TransportDeleteAction.executeDeleteRequestOnPrimary(TransportDeleteAction.java:143)
    at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:136)
    at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:54)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae17720> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#13]" #272 daemon prio=5 os_prio=0 tid=0x00007ef8f4085000 nid=0x6f61 runnable [0x00007ef4d38ec000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
    - locked <0x00000001619253d8> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae10650> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#12]" #271 daemon prio=5 os_prio=0 tid=0x00007ef8f8083800 nid=0x6f60 runnable [0x00007ef4d39ed000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
    - locked <0x0000000156f6a138> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae1b570> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#11]" #270 daemon prio=5 os_prio=0 tid=0x00007ef8e0003000 nid=0x6f5f runnable [0x00007ef4d3aee000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.set(ThreadLocal.java:467)
    at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:298)
    at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:184)
    at java.lang.ThreadLocal.get(ThreadLocal.java:170)
    at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:481)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:450)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae15d78> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#10]" #269 daemon prio=5 os_prio=0 tid=0x00007ef910083800 nid=0x6f5e runnable [0x00007ef4d3bee000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerDelete(InternalEngine.java:582)
    - locked <0x000000016c7c8a98> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.delete(InternalEngine.java:558)
    at org.elasticsearch.index.shard.IndexShard.delete(IndexShard.java:641)
    at org.elasticsearch.action.delete.TransportDeleteAction.executeDeleteRequestOnPrimary(TransportDeleteAction.java:143)
    at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:136)
    at org.elasticsearch.action.delete.TransportDeleteAction.shardOperationOnPrimary(TransportDeleteAction.java:54)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae19da0> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#9]" #267 daemon prio=5 os_prio=0 tid=0x00007ef8f0082800 nid=0x6f5c runnable [0x00007ef4d3df0000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
    - locked <0x0000000156f71bf8> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae178c8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#8]" #266 daemon prio=5 os_prio=0 tid=0x00007ef8e8082800 nid=0x6f5b runnable [0x00007ef4d3ef1000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryAcquireShared(ReentrantReadWriteLock.java:481)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:450)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae19f48> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#7]" #265 daemon prio=5 os_prio=0 tid=0x00007ef8f4083000 nid=0x6f5a runnable [0x00007ef4d3ff3000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.set(ThreadLocal.java:467)
    at java.lang.ThreadLocal$ThreadLocalMap.access$100(ThreadLocal.java:298)
    at java.lang.ThreadLocal.setInitialValue(ThreadLocal.java:184)
    at java.lang.ThreadLocal.get(ThreadLocal.java:170)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.apache.lucene.index.CodecReader.getNumericDocValues(CodecReader.java:143)
    at org.apache.lucene.index.FilterLeafReader.getNumericDocValues(FilterLeafReader.java:430)
    at org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.<init>(PerThreadIDAndVersionLookup.java:77)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:83)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
    - locked <0x00000001aa0732e0> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae1b718> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#6]" #264 daemon prio=5 os_prio=0 tid=0x00007ef8f8082000 nid=0x6f59 runnable [0x00007ef4d40f4000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
    - locked <0x00000000c3c36b70> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnPrimary(TransportIndexAction.java:237)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:158)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.doRun(TransportReplicationAction.java:639)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:279)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:271)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae15f20> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"elasticsearch[logsearch-datanode-36][index][T#5]" #263 daemon prio=5 os_prio=0 tid=0x00007ef8e0001000 nid=0x6f58 runnable [0x00007ef4d41f5000]
   java.lang.Thread.State: RUNNABLE
    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)
    at java.lang.ThreadLocal$ThreadLocalMap.getEntry(ThreadLocal.java:419)
    at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:298)
    at java.lang.ThreadLocal.get(ThreadLocal.java:163)
    at org.apache.lucene.util.CloseableThreadLocal.get(CloseableThreadLocal.java:78)
    at org.elasticsearch.common.lucene.uid.Versions.getLookupState(Versions.java:81)
    at org.elasticsearch.common.lucene.uid.Versions.loadDocIdAndVersion(Versions.java:123)
    at org.elasticsearch.common.lucene.uid.Versions.loadVersion(Versions.java:140)
    at org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1032)
    at org.elasticsearch.index.engine.InternalEngine.innerIndex(InternalEngine.java:501)
    - locked <0x0000000156f69b08> (a java.lang.Object)
    at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:457)
    at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:601)
    at org.elasticsearch.index.engine.Engine$Index.execute(Engine.java:836)
    at org.elasticsearch.action.index.TransportIndexAction.executeIndexRequestOnReplica(TransportIndexAction.java:196)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:170)
    at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnReplica(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:392)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:291)
    at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:283)
    at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:300)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
    - <0x000000008ae1a0f0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
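Dumps like the ones above can be triaged by counting RUNNABLE threads per topmost stack frame; most of these threads share `ThreadLocalMap.getEntryAfterMiss` as the on-CPU frame. A minimal sketch of such a counter (a hypothetical helper I wrote for illustration, not a tool we used):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DumpTriage {
    /** Count RUNNABLE threads in a jstack-style dump, grouped by their topmost stack frame. */
    public static Map<String, Integer> topFrames(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        String[] lines = dump.split("\n");
        for (int i = 0; i < lines.length; i++) {
            if (lines[i].contains("java.lang.Thread.State: RUNNABLE")) {
                // The first "at ..." line after the state line is the frame currently on CPU.
                for (int j = i + 1; j < lines.length; j++) {
                    String t = lines[j].trim();
                    if (t.startsWith("at ")) {
                        String frame = t.substring(3, t.indexOf('('));
                        counts.merge(frame, 1, Integer::sum);
                        break;
                    }
                    if (t.isEmpty()) break;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String sample =
            "\"t1\" #1 daemon\n" +
            "   java.lang.Thread.State: RUNNABLE\n" +
            "    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)\n" +
            "    at java.lang.ThreadLocal.get(ThreadLocal.java:163)\n" +
            "\n" +
            "\"t2\" #2 daemon\n" +
            "   java.lang.Thread.State: RUNNABLE\n" +
            "    at java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss(ThreadLocal.java:444)\n";
        // → {java.lang.ThreadLocal$ThreadLocalMap.getEntryAfterMiss=2}
        System.out.println(topFrames(sample));
    }
}
```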

At first I suspected the ThreadLocalMap inside ThreadLocal, so I went back and brushed up on how ThreadLocalMap works. When a thread stores a thread-local variable, the map allocates an array for that thread; on insertion, the ThreadLocal's hash locates a slot in the array, and on a collision the next slot is tried (linear probing).

That is the set path. On get, the ThreadLocal's hash likewise locates the array slot; on a collision the probe moves to the next slot, and if a slot's key turns out to be null, the key has been garbage-collected, so a cleanup pass runs that expunges stale entries forward from that point. This is the so-called heuristic cleanup: there is no dedicated cleanup thread, and stale entries are reclaimed only when a lookup happens to stumble on them.

I originally suspected a problem here: with enough collisions the array grows, and every collision means probing onward through the array, a loop that can burn a lot of CPU. Could the array be almost, but not quite, full, so that lookups keep looping through slots hunting for the ThreadLocal? To test this I spun up 20 threads locally and ran 4 billion operations to verify the theory. Out of every ten million gets and sets, only three or four took longer than 100 ms, and those slow ones stayed around 200 ms, so in theory this has no real performance impact.
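The experiment above can be sketched roughly like this (a hypothetical, scaled-down reproduction of the idea rather than my actual 20-thread/4-billion-op harness; the class and method names are mine). Dropping strong references to half the ThreadLocal keys leaves stale map entries behind after GC, forcing later get()s down the probe path (`getEntryAfterMiss`) and triggering the heuristic cleanup:

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadLocalProbe {
    /**
     * Fill the current thread's ThreadLocalMap with n entries, drop strong
     * references to the even-indexed keys (their map entries go stale after GC),
     * then repeatedly get() the surviving odd-indexed half.
     */
    static long simulate(int n, int rounds) {
        List<ThreadLocal<Integer>> locals = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            ThreadLocal<Integer> tl = new ThreadLocal<>();
            tl.set(i);
            locals.add(tl);
        }
        for (int i = 0; i < n; i += 2) {
            locals.set(i, null);  // even-indexed keys become unreachable
        }
        System.gc();
        long sum = 0;
        for (int r = 0; r < rounds; r++) {
            for (int i = 1; i < n; i += 2) {
                sum += locals.get(i).get();  // probe past stale slots
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long sum = simulate(4096, 1000);
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println("sum=" + sum + " took " + ms + " ms");
    }
}
```

Timing a run like this on one thread is how you check whether probing plus cleanup ever turns into a sustained CPU cost; in my larger experiment it did not.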

Problem two:

Next I wondered whether data skew was the cause. We checked the size of the data on each shard: they were almost identical, differing by only about 1 GB, which is negligible.

We then suspected kernel tuning. That machine's CPU was configured in power-saving mode; we switched it to performance mode, with no noticeable change.

At this point another oddity surfaced: inbound traffic was the same on every node, but that one machine's outbound traffic was 10x that of the other nodes.

Perhaps the results of some query all landed on node 2, with the relevant shards sitting exactly there and the result set being enormous, but even a rough look turned up no query with an exceptionally large volume.

The last guess was the machines themselves: every other machine was fine, and only those two nodes in those two clusters had problems.

Both clusters run a fairly old ES version: 2.3.4.

The problem nodes are physical machines with very high specs.

Recording this for now; we'll see how to solve it later.

Problem solved, written 2020-06-20 15:26

  1. On the night of 6-16 we took machine 36, the high-CPU one, offline, and the cluster immediately became smooth and healthy.

  2. On paper the decommissioned physical machine 36 is very high-spec: over 60 GB of RAM and 32 CPU cores. Yet the cluster recovered as soon as this high-CPU node was removed. We also took node 02, whose CPU had spiked, out of the other cluster, and that cluster has been very stable overall.

  3. My guesses at the cause: first, these two nodes have hardware or OS problems; second, ES 2.3.4 copes poorly with heterogeneous machines within a cluster (nodes whose hardware specs don't match).

Full timeline of the incident

2020-06-04: the courier cluster hit high CPU load; both node 37 and node 36 showed very high CPU while the other nodes sat idle.

2020-06-04 10:00 to 12:00, symptoms: parcels had been deposited into lockers, but FC管家 could not look them up. No serious impact; writes into ES were just slow and Kafka messages backed up, and once things recovered, the affected backlog was written automatically.

Around 2020-06-04 13:16, node 37 suddenly dropped back to normal while node 36's CPU load stayed high.

Around 2020-06-04 14:30, node 36 snapped back to normal.

Around 2020-06-04 17:10, node 36's CPU hit 90%+ again; node 37 was slightly elevated but still healthy.

2020-06-04 17:10 to 18:30, symptoms: parcels had been picked up, but FC管家 still showed them as not picked up. No serious impact; writes into ES were just slow and Kafka messages backed up, and the affected backlog was written automatically after recovery.

Around 2020-06-04 18:55, node 36's CPU fell instantly from its peak to normal.

2020-06-04 20:00, scale-out began: six physical machines were added, finishing around 22:40.

2020-06-05 10:00, everything normal. Node 36 was slightly elevated but within normal range, around 20% CPU load, though noticeably higher than the other nodes.

Around 2020-06-12 10:00, the consumer cluster's post_consumer_data_02 node, which we'll call node 189, started showing high CPU load.

Around 2020-06-12 10:20, customer service and the company-wide group reported WeChat Pay Score order problems. Cause: WeChat Pay Score queries use the consumer cluster's indexes.

Around 2020-06-12 10:30, the open platform reported that large numbers of queries had become slow or returned nothing. Engineering traced it to a surge in queries from a third party (Huawei), whose query traffic flowed into the consumer cluster.

Around 2020-06-12 10:45, we urgently circuit-broke the third party's highest-traffic query endpoint.

At 2020-06-12 10:52, node 189's CPU fell instantly from 99% to 8% and the business recovered; later problem reports were just lagging behind a recovery that had already happened.

2020-06-12 11:00, we discussed query-volume limits with the third party, but the cause of node 189's CPU spike was still unknown.

2020-06-13, Saturday: third-party endpoint still circuit-broken and rate-limited. 2020-06-14, Sunday: same.

2020-06-15, Monday night: if we kept rate-limiting, third parties like PDD and HW would complain, so that night we switched the third-party query endpoints from the consumer cluster to the courier cluster. Since the courier cluster had been scaled out on the night of 2020-06-04, we routed the traffic to this beefier cluster and lifted the rate limit.

2020-06-16, Tuesday 10:19, the courier cluster's node 36 CPU started climbing while the other nodes sat idle. The customer-service system, the ticketing system, and FC管家's delivery lookups became laggy, slow, or returned nothing.

2020-06-16 11:30, we rate-limited the third-party query endpoint and node 36's CPU recovered instantly.

Around 2020-06-16 13:04, we migrated shards off node 37, moving two of its shards to other nodes; the effect was not very noticeable.

2020-06-16 13:44, we lifted the rate limit on the third-party query endpoint.

Around 2020-06-16 15:37, node 36's CPU started climbing again and swinging wildly. We discussed taking all the VM nodes out of the courier ES cluster and removing the badly misbehaving node 36.

Around 2020-06-16 23:50, we started preparing the courier cluster migration: remove all VM nodes, and remove problem node 36.

2020-06-17 04:27, the courier cluster's 14 data nodes, 2 master nodes, and 2 client nodes were all running on physical machines, and machine 36 was removed from the courier ES cluster.

2020-06-17 10:00, the courier cluster was healthy through peak hours. Node 37 is still a bit weaker than the other nodes, the worst one now that 36 is gone, but it stays in a healthy, manageable state.

Around 2020-06-17 22:55, we took the previously problematic consumer-cluster node 189 offline.

From 2020-06-18 to now, the clusters have been stable and under monitoring. The courier cluster's node 37 still runs roughly twice the CPU load of every other node, but it stays healthy and manageable.

Conclusion:

Courier-cluster node 36 and consumer-cluster node 189 have OS or machine problems; node 37 has a similar issue, but milder than the worst ones.
