Nuxeo Platform / NXP-20802

Fix long fulltext parsing time when importing HTML

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 8.3
    • Fix Version/s: 8.10
    • Component/s: Core IO

      Description

      DefaultFulltextParser loops for an extremely long time (effectively hanging until the transaction times out) when attempting to parse a tabular, text/plain file of the form shown below, containing 60,000 lines of text and roughly 6 MB in size.

      <key1a> value1a <key1b> value1b <key1c> value1c ... <key1p> value1p
      <key1a> value2a <key1b> value2b <key1c> value2c ... <key1p> value2p
      <key1a> value3a <key1b> value3b <key1c> value3c ... <key1p> value3p
      ...
      <key1a> value60000a <key1b> value60000b <key1c> value60000c ... <key1p> value60000p
      

      A thread dump of the server includes all four Nuxeo-Work-default threads in the following state with CPU utilization at 100%:

      "Nuxeo-Work-default-47" #3389 daemon prio=5 os_prio=0 tid=000000000015f08800 nid=0x2fd4 runnable [0x00002b5d4e90f000]
         java.lang.Thread.State: RUNNABLE
             at net.htmlparser.jericho.Source.getNextEndTag(Source.java:1319)
             at net.htmlparser.jericho.StartTag.getEndTagInternal(StartTag.java:559)
             at net.htmlparser.jericho.StartTag.getElement(StartTag.java:167)
             at net.htmlparser.jericho.Source.getChildElements(Source.java:742)
             at net.htmlparser.jericho.Renderer$Processor.appendTo(Renderer.java:823)
             at net.htmlparser.jericho.Renderer.appendTo(Renderer.java:140)
             at net.htmlparser.jericho.CharStreamSourceUtil.toString(CharStreamSourceUtil.java:63)
             at net.htmlparser.jericho.Renderer.toString(Renderer.java:150)
             at org.nuxeo.ecm.core.storage.DefaultFulltextParser.removeHtml(DefaultFulltextParser.java:93)
             at org.nuxeo.ecm.core.storage.DefaultFulltextParser.preprocessField(DefaultFulltextParser.java:83)
             at org.nuxeo.ecm.core.storage.DefaultFulltextParser.parse(DefaultFulltextParser.java:65)
             at org.nuxeo.ecm.core.storage.DefaultFulltextParser.parse(DefaultFulltextParser.java:52)
             at org.nuxeo.ecm.core.storage.FulltextExtractorWork.extractBinaryText(FulltextExtractorWork.java:151)
             at org.nuxeo.ecm.core.storage.FulltextExtractorWork.work(FulltextExtractorWork.java:108)
             at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransaction(AbstractWork.java:416)
             at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransactionAndCheckExceptions(AbstractWork.java:377)
             at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:338)
             at org.nuxeo.ecm.core.work.WorkHolder.run(WorkHolder.java:54)
             at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
             at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
             at java.lang.Thread.run(Thread.java:745)
      

      This problem appears to be caused by the excessive time and CPU resources the Jericho HTML parser needs to parse this type of file.

      In our environment this file changes frequently, as additional lines are appended to it as part of a check-in. Fulltext indexing of the file has always been quite slow, likely because it contains many "startTags", no "endTags", and is clearly not HTML. The file has apparently grown to the point that the FulltextParser can no longer finish parsing it before its transaction times out.

      This bug drives all Nuxeo servers in our cluster to 100% CPU utilization and causes the nuxeo:work:*:default Redis data structures to grow without bound until the Redis server exhausts its memory. Once the Nuxeo-Work-default threads hit the transaction timeout, the problematic sqlFulltextExtractorWork appears to be retried indefinitely. The only way we could temporarily work around the issue was to (a) move the problematic file to the Trash, (b) shut down the system, and (c) manually remove all the works related to the problematic document from the various Redis data structures.

      The following addition to the DefaultFulltextParser unit test appears to expose the problem: it takes approximately 3 minutes to parse a 20,000-line file, and the time likely grows non-linearly with the number of lines and key-value pairs. For a 6 MB, 60,000-line file, it is easy to see how the transaction timeout could be triggered.

      diff --git a/nuxeo-core/nuxeo-core-storage/src/test/java/org/nuxeo/ecm/core/storage/TestDefaultFulltextParser.java b/nuxeo-core/nuxeo-core-storage/src/test/java/org/nuxeo/ecm/core/storage/TestDefaultFulltextParser.java
      index d8febd6..b6fdf76 100644
      --- a/nuxeo-core/nuxeo-core-storage/src/test/java/org/nuxeo/ecm/core/storage/TestDefaultFulltextParser.java
      +++ b/nuxeo-core/nuxeo-core-storage/src/test/java/org/nuxeo/ecm/core/storage/TestDefaultFulltextParser.java
      @@ -47,6 +47,18 @@ public class TestDefaultFulltextParser extends NXRuntimeTestCase {
               // check html removal and entities unescape
               check("test|é|test", "test &eacute; test");
               check("test|é|test", "test <p style=\"something\">&eacute;</p> test");
      +        StringBuilder actual = new StringBuilder();
      +        StringBuilder expected = new StringBuilder();
      +        for (int i = 0; i < 20000; i++) {
      +            for (int j = 1; j < 8; j++) {
      +                actual.append("<Key" + j + "> value" + i + j + " ");
      +                expected.append("value" + i + j + "|");
      +            }
      +            actual = actual.replace(actual.length() - 1, actual.length(), System.getProperty("line.separator"));
      +        }
      +        actual = actual.deleteCharAt(actual.length() - 1);
      +        expected = expected.deleteCharAt(expected.length() - 1);
      +        check(expected.toString(), actual.toString());
           }
       
       }
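
      Outside JUnit, equivalent pathological input can be generated with a small standalone helper (a sketch: `PathologicalInput` is a hypothetical class, not part of Nuxeo, and it uses a `_` separator between the two indices for readability, unlike the test above):

```java
// Hypothetical standalone generator for the "many start tags, no end tags"
// pseudo-HTML input described in this issue. Not a Nuxeo class.
public class PathologicalInput {

    // Builds `lines` lines, each containing `pairsPerLine` "<KeyJ> valueI_J" pairs.
    public static String generate(int lines, int pairsPerLine) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            for (int j = 1; j <= pairsPerLine; j++) {
                sb.append("<Key").append(j).append("> value").append(i).append('_').append(j);
                if (j < pairsPerLine) {
                    sb.append(' ');
                }
            }
            sb.append(System.lineSeparator());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Feeding this string to DefaultFulltextParser.parse() should exhibit
        // the super-linear slowdown as `lines` grows.
        String input = generate(20000, 7);
        System.out.println("generated " + input.length() + " chars");
    }
}
```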
      

      Why is the work retried indefinitely, causing all Nuxeo-Work-default threads to be monopolized?

      How should the DefaultFulltextParser be modified to avoid HTML parsing of these kinds of files? Would it make sense to introduce some simple regex-based HTML detection (see https://github.com/dbennett455/DetectHtml) into the DefaultFulltextParser, so that removeHtml() with the Jericho parser is only attempted when HTML is actually detected? The current "HTML detection" heuristic, which checks only whether the text contains a "<" character, is extremely weak.
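
      One possible shape for such a guard (a hypothetical sketch inspired by the DetectHtml idea, not the library itself and not existing Nuxeo code): only hand the text to Jericho when a regex finds a recognizable HTML construct, such as a doctype, a well-known tag, or a character entity, rather than a bare "<".

```java
import java.util.regex.Pattern;

// Hypothetical regex-based HTML detector. A text is treated as HTML only if
// it contains a recognizable HTML construct, not merely a '<' character.
public class HtmlDetector {

    // Matches a doctype, a common paired/void HTML tag, or a character entity.
    private static final Pattern HTML_PATTERN = Pattern.compile(
            "<!DOCTYPE\\s+html"                                                                     // doctype
            + "|</?(?:html|head|body|div|span|p|br|a|table|tr|td|ul|ol|li|b|i|strong|em)\\b[^>]*>"  // common tags
            + "|&[a-zA-Z]{2,8};"                                                                    // named entity, e.g. &eacute;
            + "|&#\\d{1,6};",                                                                       // numeric entity
            Pattern.CASE_INSENSITIVE);

    public static boolean looksLikeHtml(String text) {
        return HTML_PATTERN.matcher(text).find();
    }
}
```

      With a guard like this, removeHtml() would fall through to plain-text handling for the tabular file above, since tags like <key1a> are not recognized HTML tags, while genuine HTML fragments such as "<p style=\"something\">&eacute;</p>" would still be routed to Jericho.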

      By placing this problematic file temporarily in the Trash, we have broken production processing that relies on it. Your timely feedback on this issue would be much appreciated.
