Nuxeo Platform / NXP-30851

Improve KV TransientStore GC resiliency

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.10, 2021.0
    • Fix Version/s: 2021.17
    • Component/s: S3, TransientStore
    • Release Notes Summary:
      The garbage collection for transient stores is more resilient
    • Backlog priority:
      850
    • Upgrade notes:
      Transient GC was not working in environments with segregated front and worker nodes.
      As a result, transient stores in S3 may have accumulated a lot of data, and the current transient GC implementation may not be able to clean them efficiently.
      In this case, it is recommended to manually purge all objects older than 3 days in transient stores before applying this hotfix.
      This can be done with scripts or by creating an Object Lifecycle Management rule with the correct prefix (/transient_*/).
    • Sprint:
      nxplatform #56, nxplatform #57
    • Story Points:
      3
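      The manual purge recommended in the upgrade notes boils down to selecting objects under a transient_* prefix that are older than 3 days. A minimal Java sketch of that selection step (the class, method, and in-memory listing are hypothetical; only the prefix and the 3-day cutoff come from the note):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical selection logic for the manual purge: given object keys and
// their last-modified timestamps, pick the transient-store objects older
// than 3 days, as recommended before applying the hotfix.
class TransientPurge {

    static final Duration MAX_AGE = Duration.ofDays(3);

    /** Keys under a transient_* prefix whose objects are older than MAX_AGE. */
    static List<String> keysToPurge(Map<String, Instant> lastModifiedByKey, Instant now) {
        return lastModifiedByKey.entrySet().stream()
                .filter(e -> e.getKey().startsWith("transient_")) // prefix from the upgrade note
                .filter(e -> Duration.between(e.getValue(), now).compareTo(MAX_AGE) > 0)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

      In practice the same cutoff and prefix can be expressed directly as an S3 Object Lifecycle Management expiration rule, which avoids listing the bucket yourself.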

      Description

      Problems

      The transient store garbage collection (TransiantStorageGCWork) is scheduled every 15 minutes.
      When using a KeyValueBlobTransientStore with S3 storage, several problems affect its resiliency:

      1. If the first scan of S3 blobs fails (during gc.start) for any reason, the GC work will never be executed again on this node until it is restarted; the cause of the error is also likely hidden.

      2. TransiantStorageGCWork, which calls the GC on all transient stores, only runs it on stores that have already been used, because transient stores are lazily initialized. This means that a transient store used only by front nodes (e.g. batch import) may never be GC'd by a worker node (where all Works are executed).

      3. TransiantStorageGCWork is run in the common pool with a default random identifier; as a result, the GC runs concurrently (possibly at both JVM and cluster level). This likely puts huge stress on S3 blob listing, as well as on the KV storage (MongoDB). There is no point in concurrency on a single node, and with KV+S3 there is no point in concurrency at cluster level. (TODO: check whether any other implementation has a case where it must run on all nodes, but that will not work with front/worker node segregation.)

      4. The GC is performed within a transaction and is subject to timeout -> NXP-30720

      5. Because the GC is implemented in memory (listing S3 objects, listing KV objects, diffing, deleting), it will likely fail to scale at some point.
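      Problem 1 can be illustrated with a minimal, self-contained model of the GC guard (the class below is hypothetical; the real check lives in AbstractBlobGarbageCollector.start, which throws "Already started" while a previous pass is still marked in progress):

```java
// Minimal model of the GC guard: start() refuses to run while a previous
// pass is still marked in progress (startTime != 0).
class GcGuard {
    private long startTime; // != 0 while a pass is considered "in progress"

    void start() {
        if (startTime != 0) {
            throw new IllegalStateException("Already started");
        }
        startTime = System.currentTimeMillis();
    }

    void stop() {
        startTime = 0;
    }

    /** Models the current code path: stop() is skipped when the scan throws. */
    void gcPass(Runnable scan) {
        start();
        scan.run(); // an S3 scan failure here leaves startTime != 0 forever
        stop();
    }
}
```

      Once a single scan has thrown, every later scheduled pass dies on "Already started", which is exactly the stack trace shown below.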

      Proposed solution

      1. This must be fixed, and we should add traces (in case of failure) and metrics to monitor when the transient GC is running.

      2. The code must be adapted to make sure the GC runs on all defined transient stores, not only those that have been used since the node started.

      3. Create a dedicated queue for TransientStorageGCWork with a single partition; this ensures there is no concurrency at cluster level when running the GC.
      We could also have one work per transient store, so that one problematic store does not make the GC fail for all stores.
      Moving the work out of the common queue will also bring monitoring to DD (running, timers, rates, ...).

      4. The GC should be run outside of a transaction.

      5. Scaling to any volume requires a refactoring based on stream processing; this is out of scope for this ticket, see NXP-28679.
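      For point 3, a dedicated queue could be contributed through the standard WorkManager queues extension point. A sketch of such a contribution, where the queue id, thread count, and category mapping are assumptions for illustration (the work's category would have to match):

```xml
<extension target="org.nuxeo.ecm.core.work.service" point="queues">
  <!-- Hypothetical dedicated queue: a single thread per node, so transient
       GC passes never run concurrently within one JVM, and the queue gets
       its own monitoring (running, timers, rates ...) instead of being
       hidden in the common pool. -->
  <queue id="transientStoreGC">
    <maxThreads>1</maxThreads>
    <category>transientStoreGC</category>
  </queue>
</extension>
```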


      Detail to reproduce problem 1 (gc.start)

      When a transient store garbage collection is launched with an underlying S3 blob store, we can end up in a situation where the GC fails with the following stack trace until the server is restarted:

      
      2022-01-27T10:41:00,007 ERROR [AbstractWork] Exception during work: TransiantStorageGCWork(, Progress(?%, ?/0), null)
      org.nuxeo.ecm.core.api.NuxeoException: Already started
      	at org.nuxeo.ecm.core.blob.AbstractBlobGarbageCollector.start(AbstractBlobGarbageCollector.java:52) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.blob.CachingBlobStore$CachingBinaryGarbageCollector.start(CachingBlobStore.java:469) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.doGC(KeyValueBlobTransientStore.java:283) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at java.util.HashMap$Values.forEach(HashMap.java:977) ~[?:?]
      	at org.nuxeo.ecm.core.transientstore.TransientStorageComponent.doGC(TransientStorageComponent.java:98) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork.work(TransiantStorageGCWork.java:41) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransaction(AbstractWork.java:533) [nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:385) [nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.WorkHolder.run(WorkHolder.java:57) [nuxeo-core-event-2021.15.6.jar:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
      	at java.lang.Thread.run(Thread.java:829) [?:?]
      2022-01-27T10:41:00,012 ERROR [WorkManagerImpl] Uncaught error on thread: Nuxeo-Work-common-4, current work might be lost, WorkManager metrics might be corrupted.
      org.nuxeo.ecm.core.api.NuxeoException: Work failed after 0 retries, class=class org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork id=61350420930031.2003512040 category=TransiantStorageGCWork title=Transient Store GC
      	at org.nuxeo.ecm.core.work.AbstractWork.workFailed(AbstractWork.java:479) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:398) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.WorkHolder.run(WorkHolder.java:57) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
      	at java.lang.Thread.run(Thread.java:829) [?:?]
      Caused by: org.nuxeo.ecm.core.api.NuxeoException: Already started
      	at org.nuxeo.ecm.core.blob.AbstractBlobGarbageCollector.start(AbstractBlobGarbageCollector.java:52) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.blob.CachingBlobStore$CachingBinaryGarbageCollector.start(CachingBlobStore.java:469) ~[nuxeo-core-api-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.doGC(KeyValueBlobTransientStore.java:283) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at java.util.HashMap$Values.forEach(HashMap.java:977) ~[?:?]
      	at org.nuxeo.ecm.core.transientstore.TransientStorageComponent.doGC(TransientStorageComponent.java:98) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork.work(TransiantStorageGCWork.java:41) ~[nuxeo-core-cache-2021.15-SNAPSHOT.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransaction(AbstractWork.java:533) ~[nuxeo-core-event-2021.15.6.jar:?]
      	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:385) ~[nuxeo-core-event-2021.15.6.jar:?]
      	... 4 more
      

      Typical scenario to observe the issue:

      • install amazon-s3-online-storage addon
      • configure a transient store on an S3 bucket (e.g. nuxeo.s3storage.transient.bucket=xxxx-dev-sandbox)
      • start a server
      • turn off your network and wait up to 15 minutes (the transient store GC is scheduled to run every 15 minutes, see transientstore-service.xml) until you see the following error:
        Caused by: java.net.UnknownHostException: grenard-dev-sandbox.s3.eu-west-3.amazonaws.com
        	at java.net.InetAddress$CachedAddresses.get(InetAddress.java:797) ~[?:?]
        	at java.net.InetAddress.getAllByName0(InetAddress.java:1509) ~[?:?]
        	at java.net.InetAddress.getAllByName(InetAddress.java:1368) ~[?:?]
        	at java.net.InetAddress.getAllByName(InetAddress.java:1302) ~[?:?]
        	at com.amazonaws.SystemDefaultDnsResolver.resolve(SystemDefaultDnsResolver.java:27) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.DelegatingDnsResolver.resolve(DelegatingDnsResolver.java:38) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:112) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376) ~[httpclient-4.5.13.jar:4.5.13]
        	at jdk.internal.reflect.GeneratedMethodAccessor119.invoke(Unknown Source) ~[?:?]
        	at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
        	at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
        	at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.conn.$Proxy85.connect(Unknown Source) ~[?:?]
        	at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.13.jar:4.5.13]
        	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.13.jar:4.5.13]
        	at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1333) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1145) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:802) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:770) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:744) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:704) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:686) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:550) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:530) ~[aws-java-sdk-core-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5219) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5165) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5159) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at com.amazonaws.services.s3.AmazonS3Client.getObjectLockConfiguration(AmazonS3Client.java:3280) ~[aws-java-sdk-s3-1.11.885.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobStoreConfiguration.isS3RetentionEnabled(S3BlobStoreConfiguration.java:574) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobStoreConfiguration.<init>(S3BlobStoreConfiguration.java:301) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobProvider.getConfiguration(S3BlobProvider.java:109) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.blob.s3.S3BlobProvider.getBlobStore(S3BlobProvider.java:76) ~[nuxeo-core-binarymanager-s3-2021.14.7.jar:?]
        	at org.nuxeo.ecm.core.blob.BlobStoreBlobProvider.initialize(BlobStoreBlobProvider.java:60) ~[nuxeo-core-api-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.blob.BlobManagerComponent.getBlobProvider(BlobManagerComponent.java:209) ~[nuxeo-core-api-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.blob.BlobManagerComponent.getBlobProviderWithNamespace(BlobManagerComponent.java:236) ~[nuxeo-core-api-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.getBlobProvider(KeyValueBlobTransientStore.java:213) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.transientstore.keyvalueblob.KeyValueBlobTransientStore.doGC(KeyValueBlobTransientStore.java:279) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at java.util.HashMap$Values.forEach(HashMap.java:977) ~[?:?]
        	at org.nuxeo.ecm.core.transientstore.TransientStorageComponent.doGC(TransientStorageComponent.java:98) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.transientstore.TransiantStorageGCWork.work(TransiantStorageGCWork.java:41) ~[nuxeo-core-cache-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.work.AbstractWork.runWorkWithTransaction(AbstractWork.java:533) ~[nuxeo-core-event-2021.15.6.jar:?]
        	at org.nuxeo.ecm.core.work.AbstractWork.run(AbstractWork.java:385) ~[nuxeo-core-event-2021.15.6.jar:?]
        	... 4 more
        
      • turn your network back on, wait up to another 15 minutes, and you will get the error mentioned at the beginning

      Indeed, KeyValueBlobTransientStore#doGC eventually calls S3BlobStore#computeToDelete, which may throw an exception (e.g. java.net.UnknownHostException in this particular case).

      Since gc.start() is outside the try/finally block, gc.stop(delete) is never called and startTime is never reset to 0, so we end up in a state where the GC never ends (i.e. startTime != 0 in AbstractBlobGarbageCollector#start).

      We can't recover from this state without restarting the server :|.
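      A sketch of the kind of remediation implied here, assuming a guard equivalent to AbstractBlobGarbageCollector (the class below is hypothetical): wrap the scan in try/finally so the in-progress flag is always reset and a transient S3 failure no longer wedges the GC:

```java
// Hypothetical resilient variant of the GC guard: stop() runs in a finally
// block, so a failed scan does not leave the guard in the "Already started"
// state and the next scheduled pass can run normally.
class ResilientGcGuard {
    private long startTime; // != 0 while a pass is "in progress"

    void start() {
        if (startTime != 0) {
            throw new IllegalStateException("Already started");
        }
        startTime = System.currentTimeMillis();
    }

    void stop() {
        startTime = 0;
    }

    void gcPass(Runnable scan) {
        start();
        try {
            scan.run(); // may throw, e.g. on an S3 network failure
        } finally {
            stop(); // always reset, so future passes are never blocked
        }
    }
}
```

      On top of this, the failure should be logged rather than swallowed, per point 1 of the proposed solution.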
