Nuxeo Platform / NXP-30851

Improve KV TransientStore GC resiliency

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.10, 2021.0
    • Fix Version/s: 2021.17
    • Component/s: S3, TransientStore
    • Release Notes Summary:
      The garbage collection for transient stores is more resilient
    • Backlog priority:
      850
    • Upgrade notes:
      Transient GC was not working in environments with segregated front and worker nodes.
      As a result, transient stores in S3 may have accumulated a lot of data, and the current transient GC implementation may not be able to clean them efficiently.
      In this case, it is recommended to manually purge all objects older than 3 days in transient stores before applying this hotfix.
      This can be done with scripts or by creating an Object Lifecycle Management rule with the correct prefix (/transient_*/).
    • Sprint:
      nxplatform #56, nxplatform #57
    • Story Points:
      3
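      The manual purge recommended in the upgrade notes boils down to selecting objects under a transient_* prefix that are older than 3 days. A minimal Java sketch of that selection step (the class, method, and in-memory listing are hypothetical; only the prefix and the 3-day cutoff come from the note):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical selection logic for the manual purge: given object keys and
// their last-modified timestamps, pick the transient-store objects older
// than 3 days, as recommended before applying the hotfix.
class TransientPurge {

    static final Duration MAX_AGE = Duration.ofDays(3);

    /** Keys under a transient_* prefix whose objects are older than MAX_AGE. */
    static List<String> keysToPurge(Map<String, Instant> lastModifiedByKey, Instant now) {
        return lastModifiedByKey.entrySet().stream()
                .filter(e -> e.getKey().startsWith("transient_")) // prefix from the upgrade note
                .filter(e -> Duration.between(e.getValue(), now).compareTo(MAX_AGE) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

      In practice the same cutoff and prefix can be expressed directly as an S3 Object Lifecycle Management expiration rule, which avoids listing the bucket yourself.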

      Description

      Problems

      The transient store garbage collection (TransiantStorageGCWork) is scheduled every 15 minutes.
      When using a KeyValueBlobTransientStore with S3 storage, several problems affect its resiliency:

      1. If the first scan of S3 blobs fails (during gc.start) for any reason, the GC work will never be executed again on this node until it is restarted; the cause of the error is also likely hidden.

      2. TransiantStorageGCWork, which calls the GC on all transient stores, only runs it on stores that have already been used, because transient stores are lazily initialized. This means that a transient store used only by front nodes (e.g. batch import) may never be GC'd by a worker node (where all Works are executed).

      3. TransiantStorageGCWork is run in the common pool with a default random identifier; as a result, the GC runs concurrently (possibly at both JVM and cluster level). This likely puts huge stress on S3 blob listing, as well as on the KV storage (MongoDB). There is no point in concurrency on a single node, and with KV+S3 there is no point in concurrency at cluster level. (TODO: check whether any other implementation has a case where it must run on all nodes, but that will not work with front/worker node segregation.)

      4. The GC is performed within a transaction and is subject to timeout -> NXP-30720

      5. Because the GC is implemented in memory (listing S3 objects, listing KV objects, diffing, deleting), it will likely fail to scale at some point.
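      Problem 1 can be illustrated with a minimal, self-contained model of the GC guard (the class below is hypothetical; the real check lives in AbstractBlobGarbageCollector.start, which throws "Already started" while a previous pass is still marked in progress):

```java
// Minimal model of the GC guard: start() refuses to run while a previous
// pass is still marked in progress (startTime != 0).
class GcGuard {
    private long startTime; // != 0 while a pass is considered "in progress"

    void start() {
        if (startTime != 0) {
            throw new IllegalStateException("Already started");
        }
        startTime = System.currentTimeMillis();
    }

    void stop() {
        startTime = 0;
    }

    /** Models the current code path: stop() is skipped when the scan throws. */
    void gcPass(Runnable scan) {
        start();
        scan.run(); // an S3 scan failure here leaves startTime != 0 forever
        stop();
    }
}
```

      Once a single scan has thrown, every later scheduled pass dies on "Already started", which is exactly the stack trace shown below.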

      Proposed solution

      1. This must be fixed, and we should add traces (in case of failure) and metrics to monitor when the transient GC is running.

      2. The code must be adapted to make sure the GC runs on all defined transient stores, not only those that have been used since the node started.

      3. Create a dedicated queue for TransientStorageGCWork with a single partition; this ensures there is no concurrency at cluster level when running the GC.
      We could also have one work per transient store, so that one problematic store does not make the GC fail for all stores.
      Moving the work out of the common queue will also bring monitoring to DD (running, timers, rates, ...).

      4. The GC should be run outside of a transaction.

      5. Scaling to any volume requires a refactoring based on stream processing; this is out of scope for this ticket, see NXP-28679.
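      For point 3, a dedicated queue could be contributed through the standard WorkManager queues extension point. A sketch of such a contribution, where the queue id, thread count, and category mapping are assumptions for illustration (the work's category would have to match):

```xml
<extension target="org.nuxeo.ecm.core.work.service" point="queues">
  <!-- Hypothetical dedicated queue: a single thread per node, so transient
       GC passes never run concurrently within one JVM, and the queue gets
       its own monitoring (running, timers, rates ...) instead of being
       hidden in the common pool. -->
  <queue id="transientStoreGC">
    <maxThreads>1</maxThreads>
    <category>transientStoreGC</category>
  </queue>
</extension>
```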


      Detail to reproduce problem 1 (gc.start)

      When a transient store garbage collection is launched with an underlying S3 blob store, we can end up in a situation where the GC fails with the following stack trace until the server is restarted:

      
      2022-01-27T10:41:00,007 ERROR [AbstractWork] Exception during work: TransiantStorageGCWork(, Progress(?%, ?/0), null)
      org.nuxeo.ecm.core.api.NuxeoException: Already started
      	at org.nuxeo.ecm.core.blob.AbstractBlobGarbageCollector.start(AbstractBlobGarbageCollector.java:52) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.blob.CachingBlobStore$CachingBinaryGarbageCollector.start(CachingBlobStore.java:469) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.doGC(KeyValueBlobTransientStore.java:283) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at java.util.HashMap$Values.forEach(HashMap.java:977) ~[?:?]
      	at org.nuxeo.ecm.core.transientstore.TransientStorageComponent.doGC(TransientStorageComponent.java:98) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork.work(TransiantStorageGCWork.java:41) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransaction(AbstractWork.java:533) [nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:385) [nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.WorkHolder.run(WorkHolder.java:57) [nuxeo-core-event-2021.15.6.jar:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
      	at java.lang.Thread.run(Thread.java:829) [?:?]
      2022-01-27T10:41:00,012 ERROR [WorkManagerImpl] Uncaught error on thread: Nuxeo-Work-common-4, current work might be lost, WorkManager metrics might be corrupted.
      org.nuxeo.ecm.core.api.NuxeoException: Work failed after 0 retries, class=class org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork id=61350420930031.2003512040 category=TransiantStorageGCWork title=Transient Store GC
      	at org.nuxeo.ecm.core.work.AbstractWork.workFailed(AbstractWork.java:479) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:398) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.WorkHolder.run(WorkHolder.java:57) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
      	at java.lang.Thread.run(Thread.java:829) [?:?]
      Caused by: org.nuxeo.ecm.core.api.NuxeoException: Already started
      	at org.nuxeo.ecm.core.blob.AbstractBlobGarbageCollector.start(AbstractBlobGarbageCollector.java:52) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.blob.CachingBlobStore$CachingBinaryGarbageCollector.start(CachingBlobStore.java:469) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.doGC(KeyValueBlobTransientStore.java:283) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at java.util.HashMap$Values.forEach(HashMap.java:977) ~[?:?]
      	at org.nuxeo.ecm.core.transientstore.TransientStorageComponent.doGC(TransientStorageComponent.java:98) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork.work(TransiantStorageGCWork.java:41) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransaction(AbstractWork.java:533) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:385) ~[nuxeo-core-event-2021.15.6.jar:?]
      	... 4 more
      

      Typical scenario to observe the issue:

      • install amazon-s3-online-storage addon
      • configure a transient store on an S3 bucket (e.g. nuxeo.s3storage.transient.bucket=xxxx-dev-sandbox)
      • start a server
      • turn off your network and wait up to 15 minutes (the transient store GC is scheduled to run every 15 minutes, see transientstore-service.xml) until you see the following error:
        Caused by: java.net.UnknownHostException: grenard-dev-sandbox.s3.eu-west-3.amazonaws.com
        	at java.net.InetAddress$CachedAddresses.get(InetAddress.java:797) ~[?:?]
        	at java.net.InetAddress.getAllByName0(InetAddress.java:1509) ~[?:?]
        	at java.net.InetAddress.getAllByName(InetAddress.java:1368) ~[?:?]
        	at java.net.InetAddress.getAllByName(InetAddress.java:1302) ~[?:?]
        	at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[httpclient-4.5.13.jar:4.5.13]
        	at jdk.internal.reflect.GeneratedMethodAccessor119.invoke(Unknown Source) ~[?:?]
        	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        	at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.conn.$Proxy85.connect(Unknown Source) ~[?:?]
        	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.13.jar:4.5.13]
        	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5219) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5165) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5159) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.getObjectLockConfiguration(AmazonS3Client.java:3280) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobStoreConfiguration.isS3RetentionEnabled(S3BlobStoreConfiguration.java:574) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobStoreConfiguration.<init>(S3BlobStoreConfiguration.java:301) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobProvider.getConfiguration(S3BlobProvider.java:109) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobProvider.getBlobStore(S3BlobProvider.java:76) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.core.blob.BlobStoreBlobProvider.initialize(BlobStoreBlobProvider.java:60) ~[nuxeo-core-api-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.blob.BlobManagerComponent.getBlobProvider(BlobManagerComponent.java:209) ~[nuxeo-core-api-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.blob.BlobManagerComponent.getBlobProviderWithNamespace(BlobManagerComponent.java:236) ~[nuxeo-core-api-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.getBlobProvider(KeyValueBlobTransientStore.java:213) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.doGC(KeyValueBlobTransientStore.java:279) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at java.util.HashMap$Values.forEach(HashMap.java:977) ~[?:?]
        	at org.nuxeo.ecm.core.transientstore.TransientStorageComponent.doGC(TransientStorageComponent.java:98) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork.work(TransiantStorageGCWork.java:41) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransaction(AbstractWork.java:533) ~[nuxeo-core-event-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:385) ~[nuxeo-core-event-2021.15.6.jar:?]
        	... 4 more
        
      • turn your network back on, wait up to another 15 minutes, and you will get the error mentioned at the beginning

      Indeed, KeyValueBlobTransientStore#doGC eventually calls S3BlobStore#computeToDelete, which may throw an exception (e.g. java.net.UnknownHostException in this particular case).

      Since gc.start() is outside the try/finally block, gc.stop(delete) is never called and startTime is never reset to 0, so we end up in a state where the GC never ends (i.e. startTime != 0 in AbstractBlobGarbageCollector#start).

      We can't recover from this state without restarting the server :|.
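      A sketch of the kind of remediation implied here, assuming a guard equivalent to AbstractBlobGarbageCollector (the class below is hypothetical): wrap the scan in try/finally so the in-progress flag is always reset and a transient S3 failure no longer wedges the GC:

```java
// Hypothetical resilient variant of the GC guard: stop() runs in a finally
// block, so a failed scan does not leave the guard in the "Already started"
// state and the next scheduled pass can run normally.
class ResilientGcGuard {
    private long startTime; // != 0 while a pass is "in progress"

    void start() {
        if (startTime != 0) {
            throw new IllegalStateException("Already started");
        }
        startTime = System.currentTimeMillis();
    }

    void stop() {
        startTime = 0;
    }

    void gcPass(Runnable scan) {
        start();
        try {
            scan.run(); // may throw, e.g. on an S3 network failure
        } finally {
            stop(); // always reset, so future passes are never blocked
        }
    }
}
```

      On top of this, the failure should be logged rather than swallowed, per point 1 of the proposed solution.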
