A malformed query triggers a StackOverflowError in Elasticsearch

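Yesterday several nodes in our company's Elasticsearch cluster hit a StackOverflowError, after which the es process on each of them exited. The investigation traced the failure to the following exception in the node logs: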

[2019-08-05T20:31:35,367][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [order6] fatal error in thread [elasticsearch[order6][search][T#38]], exiting
java.lang.StackOverflowError: null
at org.apache.lucene.store.DataInput.readVLong(DataInput.java:184) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.store.DataInput.readVLong(DataInput.java:169) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.util.fst.FST.readUnpackedNodeTarget(FST.java:931) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.util.fst.FST.readNextRealArc(FST.java:1143) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.util.fst.FST.readFirstRealTargetArc(FST.java:992) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.util.fst.FST.findTargetArc(FST.java:1270) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.util.fst.FST.findTargetArc(FST.java:1186) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:483) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.search.TermQuery$TermWeight.getTermsEnum(TermQuery.java:132) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:100) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.elasticsearch.indices.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:155) ~[elasticsearch-5.1.1.jar:5.1.1]
at org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]
at org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:746) ~[lucene-core-6.3.0.jar:6.3.0 a66a44513ee8191e25b477372094bfa846450316 - shalin - 2016-11-02 19:47:11]

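We found that a bug in one of the client applications assembled the where condition incorrectly, producing a condition with extremely deep nesting. The es node then ran into java.lang.StackOverflowError while processing that request, which is consistent with the BooleanWeight.scorer and CachingWrapperWeight.scorer frames repeating over and over in the stack trace above.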
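One way to keep such a request from ever reaching the cluster is to check the nesting depth on the client side before sending it. The following is only a minimal sketch, not the fix we shipped and not anything provided by Elasticsearch: it assumes the query body is available as nested java.util.Map and Collection values (the shape a JSON library would typically produce), and the class name QueryDepthGuard, the helper nestedBool, and the limit of 20 used below are all made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical client-side guard; not part of Elasticsearch. */
final class QueryDepthGuard {

    private QueryDepthGuard() {}

    /** Pairs a node of the query tree with the depth at which it was found. */
    private static final class Frame {
        final Object node;
        final int depth;

        Frame(Object node, int depth) {
            this.node = node;
            this.depth = depth;
        }
    }

    /**
     * Returns the maximum nesting depth of maps and collections in a query body.
     * Scalars count as depth 0. An explicit stack is used instead of recursion,
     * so the check itself cannot overflow the call stack.
     */
    static int maxDepth(Object root) {
        int max = 0;
        Deque<Frame> stack = new ArrayDeque<>();
        stack.push(new Frame(root, 0));
        while (!stack.isEmpty()) {
            Frame frame = stack.pop();
            Collection<?> children;
            if (frame.node instanceof Map) {
                children = ((Map<?, ?>) frame.node).values();
            } else if (frame.node instanceof Collection) {
                children = (Collection<?>) frame.node;
            } else {
                continue; // scalar leaf (or null): nothing to descend into
            }
            int depth = frame.depth + 1;
            max = Math.max(max, depth);
            for (Object child : children) {
                stack.push(new Frame(child, depth));
            }
        }
        return max;
    }

    /** Throws if the query is nested more deeply than the (arbitrary) limit. */
    static void requireShallow(Object query, int limit) {
        int depth = maxDepth(query);
        if (depth > limit) {
            throw new IllegalArgumentException(
                "query nesting depth " + depth + " exceeds limit " + limit);
        }
    }

    /** Test helper: wraps a term query in `levels` nested bool/must clauses. */
    static Map<String, Object> nestedBool(int levels) {
        Map<String, Object> term = new HashMap<>();
        term.put("status", "paid");
        Map<String, Object> query = new HashMap<>();
        query.put("term", term);
        for (int i = 0; i < levels; i++) {
            Map<String, Object> bool = new HashMap<>();
            bool.put("must", Collections.singletonList(query));
            Map<String, Object> outer = new HashMap<>();
            outer.put("bool", bool);
            query = outer;
        }
        return query;
    }
}
```

With these definitions, nestedBool(3) has depth 11 and passes requireShallow(query, 20), while nestedBool(10000) is rejected with an IllegalArgumentException. The right limit depends on the application; 20 is just a placeholder.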

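What made me curious about this incident was that a java.lang.StackOverflowError caused by a single request brought down the entire es process. Looking at the es source afterwards, I found the following code in the org.elasticsearch.bootstrap.ElasticsearchUncaughtExceptionHandler class: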

class ElasticsearchUncaughtExceptionHandler implements Thread.UncaughtExceptionHandler {

    private final Supplier<String> loggingPrefixSupplier;

    ElasticsearchUncaughtExceptionHandler(final Supplier<String> loggingPrefixSupplier) {
        this.loggingPrefixSupplier = Objects.requireNonNull(loggingPrefixSupplier);
    }

    @Override
    public void uncaughtException(Thread t, Throwable e) {
        if (isFatalUncaught(e)) {
            try {
                onFatalUncaught(t.getName(), e);
            } finally {
                // we use specific error codes in case the above notification failed, at least we
                // will have some indication of the error bringing us down
                if (e instanceof InternalError) {
                    halt(128);
                } else if (e instanceof OutOfMemoryError) {
                    halt(127);
                } else if (e instanceof StackOverflowError) {
                    halt(126);
                } else if (e instanceof UnknownError) {
                    halt(125);
                } else if (e instanceof IOError) {
                    halt(124);
                } else {
                    halt(1);
                }
            }
        } else {
            onNonFatalUncaught(t.getName(), e);
        }
    }

    static boolean isFatalUncaught(Throwable e) {
        return e instanceof Error;
    }

    void onFatalUncaught(final String threadName, final Throwable t) {
        final Logger logger = Loggers.getLogger(ElasticsearchUncaughtExceptionHandler.class, loggingPrefixSupplier.get());
        logger.error(
            (org.apache.logging.log4j.util.Supplier<?>)
                () -> new ParameterizedMessage("fatal error in thread [{}], exiting", threadName), t);
    }

    void onNonFatalUncaught(final String threadName, final Throwable t) {
        final Logger logger = Loggers.getLogger(ElasticsearchUncaughtExceptionHandler.class, loggingPrefixSupplier.get());
        logger.warn((org.apache.logging.log4j.util.Supplier<?>)
            () -> new ParameterizedMessage("uncaught exception in thread [{}]", threadName), t);
    }

    void halt(int status) {
        AccessController.doPrivileged(new PrivilegedHaltAction(status));
    }

    static class PrivilegedHaltAction implements PrivilegedAction<Void> {

        private final int status;

        private PrivilegedHaltAction(final int status) {
            this.status = status;
        }

        @SuppressForbidden(reason = "halt")
        @Override
        public Void run() {
            // we halt to prevent shutdown hooks from running
            Runtime.getRuntime().halt(status);
            return null;
        }

    }
}

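As the code above shows, es treats every uncaught Error as fatal: when a StackOverflowError escapes a thread, es logs it and then deliberately halts its own JVM with status 126. That came as a bit of a surprise to me.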
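To look at the mechanism in isolation, here is a small sketch that mirrors the classification logic without killing the JVM. It is my own illustration rather than es code: UncaughtErrorDemo, runAndCapture, and overflowStack are hypothetical names, and the handler simply records the throwable instead of calling Runtime.halt.

```java
import java.io.IOError;
import java.util.concurrent.atomic.AtomicReference;

/** Illustration only; mirrors the status mapping quoted above without halting. */
final class UncaughtErrorDemo {

    private UncaughtErrorDemo() {}

    /** Same test as isFatalUncaught: any java.lang.Error counts as fatal. */
    static boolean isFatal(Throwable e) {
        return e instanceof Error;
    }

    /** Status es would halt with; only meaningful when isFatal(e) is true. */
    static int exitStatusFor(Throwable e) {
        if (e instanceof InternalError) {
            return 128;
        } else if (e instanceof OutOfMemoryError) {
            return 127;
        } else if (e instanceof StackOverflowError) {
            return 126;
        } else if (e instanceof UnknownError) {
            return 125;
        } else if (e instanceof IOError) {
            return 124;
        }
        return 1;
    }

    /** Unbounded recursion, standing in for a very deeply nested query. */
    private static int recurse(int n) {
        return recurse(n + 1) + 1;
    }

    static void overflowStack() {
        recurse(0);
    }

    /** Runs the task on its own thread and returns whatever reached the uncaught handler. */
    static Throwable runAndCapture(Runnable task) {
        AtomicReference<Throwable> captured = new AtomicReference<>();
        Thread worker = new Thread(task, "demo-search-thread");
        // Record the throwable instead of halting the JVM as es does.
        worker.setUncaughtExceptionHandler((thread, error) -> captured.set(error));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
        return captured.get();
    }

    public static void main(String[] args) {
        Throwable t = runAndCapture(UncaughtErrorDemo::overflowStack);
        System.out.println(isFatal(t) + " " + exitStatusFor(t)); // prints "true 126"
    }
}
```

An ordinary RuntimeException thrown on the worker thread also reaches the handler, but isFatal returns false for it, which corresponds to the onNonFatalUncaught branch where es only logs a warning and keeps running.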

Similar issues have been reported:
