Wednesday, September 13, 2023

Alfresco repository performance tuning checklist

 

As Alfresco developers/admins, we often need to optimize the Alfresco repository for performance and stability. I often get messages in this regard, so I decided to put together a checklist that you should consider whenever you are dealing with Alfresco repository performance.

Part of this also comes back to how the system was sized in the first place. Not every system is sized with a pre-defined number of named users, content size, number of concurrent users, etc. In some cases, we may have to revisit the sizing and tune the repository to match growing user and content size requirements. In most cases this is an incremental process.

Performance tuning and improvement is not a one-time job; it is done through evaluation, trials, and updates in an incremental manner.

"Performance tuning is always an open ended process."

Having said that, here are some high-level pointers (to give you ideas) that you should consider as a starting point:

  • Analyze Thread dump, GC Log and Heap dump.
    • Analyze the thread dump and hot threads; this may help troubleshoot performance problems, CPU usage, and deadlocks. You can use the Support Tools for the Enterprise version and the OOTBee Support Tools for the Community version. You can also use FastThread to analyze thread dumps.
      • To export a thread dump, follow these steps:
        • Find the java process id (pid). Use this command: 
          • pgrep java
        • Export the thread dump; use this command with the process id (e.g. 1):
          • jstack [pid] > filepathtosave
            • jstack 1 > /usr/local/alfresco-community70/tomcat/threaddump.txt
      • To enable GC logging, use the following JVM parameter (Java 9 or later):
        • -Xlog:gc*:file=/usr/local/alfresco-community70/tomcat/repo_gc.log
      • To capture heap dumps automatically on OutOfMemoryError, use the following JVM parameters:
        • -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/alfresco-community70/tomcat/repoheapdump.bin
      • To export a heap dump, follow these steps:
        • Find the java process id (pid). Use this command:
          • pgrep java
        • Export the heap dump; use this command with the process id (e.g. 1):
          • jmap -dump:live,format=b,file=<filepathtosave> <pid> (the live option is optional; it dumps only reachable objects)
            • jmap -dump:live,format=b,file=/usr/local/alfresco-community70/tomcat/repoheapdump.bin 1
      • It is sometimes also helpful to review the Java system properties, VM flags and VM arguments. You can use the command below to see and review them:
        • jinfo <pid>
          • If you are trying to understand the Java memory allocations and usage while the app is running, then this command will be helpful:
            • jcmd <pid> GC.heap_info
          • If you want to know metaspace usage while app is running, then this command will be helpful:
            • jcmd <pid> VM.metaspace
          • You can run jcmd <pid> help to list all the other available options pertaining to the JVM of your running application.
      • Learn more on these terminologies here.
    • Analyze the number of users you need to support concurrently. Some of these inputs can be obtained by doing a load test in your target environment:
      • How many concurrent users are currently being handled by your system?
        • What is the target number of concurrent users down the line?
        • You can consider creating a client program (using the REST or CMIS API) to verify the system and analyze whether it can support the estimated number of users with the expected load before moving to PROD.
      • What is the total number of DB connections supported? (This relates to the total number of concurrent users: to support them, the allowed DB connections must be optimized and configured to an appropriate value.) The default Alfresco connection pool limit (db.pool.max) is 275.
        • Here is an example of how to increase the Tomcat thread pool and max DB connections to support your requirement. In this example we need to support a maximum of 400 connections; the DB instance type in AWS is db.t4g.medium.
    RUN sed -i "s/port=\"8080\"\ protocol=\"HTTP\/1.1\"/port=\"8080\"\ protocol=\"HTTP\/1.1\"\ maxHttpHeaderSize=\"32768\"\ maxThreads=\"325\"/g" $TOMCAT_DIR/conf/server.xml ;
    
    OR (server.xml)

    <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" URIEncoding="UTF-8" 
          redirectPort="8443" maxThreads="325" maxHttpHeaderSize="32768"/>

    ######### db.pool.max = maxThreads (325) + 75 = 400; set in JAVA_OPTS or alfresco-global.properties #########
    -Ddb.pool.max=400

    • Analyze the sources of content and current load:
      • What is the current size of the repository?
      • What will be the target size based on estimates? If applicable, collect information about document types to be managed including their size, number of versions, daily volumes, number of properties, distribution ratios, etc. This will yield information for the database and storage requirements that may be required to be upgraded to support the target state.
        • Consider creating a plan to increase resources over time. When the repository grows, apart from disk storage, resources like RAM and CPU should also be assessed and increased.
    • Revisit the JVM settings based on the above assessment. You can also refer to these docs for an overview:
      • Configure your Java heap such that there is enough room for the OS and the non-heap memory areas of your application, to avoid system crashes (see the illustrative flags below).
      • Some useful resources around JVM:
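      • As an illustrative starting point only (assuming a dedicated repository host with 16 GB of RAM; validate the numbers against your own GC logs and load tests), the heap could be pinned and GC logging enabled via JAVA_OPTS:
        -Xms8g -Xmx8g -XX:+UseG1GC -Xlog:gc*:file=/usr/local/alfresco-community70/tomcat/repo_gc.log
        • Setting -Xms equal to -Xmx avoids heap-resize pauses, and the remaining memory stays available for the OS, metaspace and other non-heap consumers.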
    • Re-validate if there is any latency between repository and DB.
      • Is your DB undergoing regular maintenance and tuning? (Think DB Vacuuming)
        • Regular maintenance and tuning of the database is necessary. Specifically, all of the database servers that Alfresco supports require, at the very least, some form of index statistics maintenance performed at frequent, regular intervals to maintain optimal performance. Index maintenance can have a severe impact on performance while in progress, hence it needs to be discussed with your project team and scheduled appropriately.
      • Round trip time should be less than 1ms (recommended)
      • Round trip times greater than 1ms indicate an unoptimized network that will adversely impact performance. Highly variable round trip time is also of concern, as it may increase the variability of Alfresco Content Services performance.
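      • A quick sanity check (the host name is illustrative; run it from the repository host) is to measure the network round trip to the DB host:
        • ping -c 10 your-db-host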
    • Revisit the repository cache settings if you have modified them (check your alfresco-global.properties or JMX settings to confirm).
      • Have you modified the default cache limits? (Modifying the limits without understanding them can lead to out-of-memory issues, as the caches consume heap memory; see the illustrative setting below.)
        • Checkout these posts/docs to deep dive into repo cache:
        • There is a critical connection between repository instances in a cluster. A cache layer sits just above the object-relational mapping layer, which in turn sits above the database tier within the repository. This cache is synced across all members of a repository cluster, and that syncing mechanism is sensitive to latency.
        • If you have a DR environment (active-active setup), the cloud regions are far from each other, and clustering is enabled, you will see slower performance as both environments try to connect to each other. So consider this implication and a proper remedy before setting "alfresco.cluster.enabled=true" for a DR environment.
          • If the DR environment is not set up as active, it should be fine to have clustering enabled, as you will only bring the DR environment up when the primary region is down.
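        • As an illustration of the kind of setting involved (the property name comes from the repository's caches.properties; the value is illustrative, not a recommendation), a shared cache limit can be overridden in alfresco-global.properties, remembering that every extra cached item consumes heap:
          cache.node.nodesSharedCache.tx.maxItems=250000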
      • As a general rule:
        • Do not keep more than 1,000 nodes (files/folders) under the same parent node (folder) if the Share UI is your users' primary interface.
        • Overall, do not keep more than 3,000 nodes (files/folders) under the same parent node in the repository.
      • However, it depends on the resources of the server; I have seen systems with 10k nodes per folder working fine. A nice addition would be an organization scheme that avoids uploading "all" the files into the same folder, such as a year/month/day structure.
    • A folder with linked nodes loads slowly in the Share UI; it usually takes longer than usual to display because the metadata of all the linked nodes is retrieved in the response as well. If the linked nodes have large custom properties, the JSON response will be huge and can have a considerable impact on the time taken to display the folder in Share. You can set a configuration to avoid eagerly loading these properties by adding the following to JAVA_OPTS.
    -Dstrip.linked-node.properties=true
    • Slower login response? -> We often hear about slow login responses and quickly end up concluding that the issue lies with Alfresco. But that is not always what causes slow login responses.
      • There may be an issue on the Alfresco side in terms of network delays between the repository and the database that can sometimes cause slower login response times. As indicated above, the round trip time should be less than 1ms (recommended).
      • Consider reviewing the authentication chain configured for your repository. We often miss this critical part of the flow, where the problem lies in slow connectivity/network delays with, for example, LDAP or external authentication systems configured in the authentication chain. Learn more on the authentication subsystem here.

      • Assess and analyze your custom code as well. Many times the problem lies within custom code, which we often don't consider reviewing.
        • Do you have excessive logging in your code?
          • Never run your production environment with debug enabled. Set the appropriate log level (e.g. error) for production. You can use the Support Tools (Enterprise) or the OOTBee Support Tools (Community) to enable/disable debug logs on an as-needed basis.
        • Are there unclosed Streams/IO/ResultSet/HttpClient/HttpResponse? 
          • Always close the result set (org.alfresco.service.cmr.search.ResultSet) after extracting the search results; see the sketch below.
          • Ensure proper exception handling and close resources via a finally block (or try-with-resources). Rule of thumb for exception handling: "Services should throw, and callers of services should catch. A handled exception must depict WWW (what, where and why)."
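          • A minimal sketch of the pattern, using the Java foundation SearchService (the query string is illustrative and searchService is assumed to be an injected bean):
            import java.util.List;
            import org.alfresco.service.cmr.repository.NodeRef;
            import org.alfresco.service.cmr.repository.StoreRef;
            import org.alfresco.service.cmr.search.ResultSet;
            import org.alfresco.service.cmr.search.SearchService;

            ResultSet results = null;
            try
            {
                results = searchService.query(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE,
                        SearchService.LANGUAGE_FTS_ALFRESCO, "TYPE:\"cm:content\"");
                List<NodeRef> nodeRefs = results.getNodeRefs();
                // ... work with nodeRefs ...
            }
            finally
            {
                if (results != null)
                {
                    results.close(); // releases the underlying search resources
                }
            }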
        • Do you have an implementation of behaviors wide open? (Hint: bind behaviors to the specific CLASS/ASSOCIATION/PROPERTIES/SERVICE that you need; see the sketch after this list.)
          • Class - Most commonly used, fired for a particular TYPE/ASPECT
          • Association - Fired for a particular association or association of a particular type.
          • Properties - Fired for properties of a particular type or with a particular name.
          • Service - Fired every time the service generates an event.
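          • For example, a minimal sketch of binding a behavior to one specific type instead of a broad class like cm:content (the my:invoice type and its namespace are hypothetical; policyComponent is assumed to be an injected bean):
            import org.alfresco.repo.node.NodeServicePolicies;
            import org.alfresco.repo.policy.Behaviour;
            import org.alfresco.repo.policy.JavaBehaviour;
            import org.alfresco.service.namespace.QName;

            // Hypothetical custom type from your own content model
            QName TYPE_INVOICE = QName.createQName("http://www.example.com/model/content/1.0", "invoice");

            // Binding to the specific type means the behavior fires only for
            // my:invoice nodes, not for every node update in the repository.
            policyComponent.bindClassBehaviour(
                    NodeServicePolicies.OnUpdatePropertiesPolicy.QNAME,
                    TYPE_INVOICE,
                    new JavaBehaviour(this, "onUpdateProperties",
                            Behaviour.NotificationFrequency.TRANSACTION_COMMIT));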
        • Node property/metadata updates via NodeService API:
          • Ask yourself these questions and analyze your source code:
            • Are you adding new properties to a node via NodeService using nodeService.setProperties instead of nodeService.addProperties?
              • Be careful and understand what you are doing when using nodeService.setProperties before creating versions programmatically. It replaces the full property map, so you will lose any property values set on the workspace node (current version) that are not in the new map. Make sure you understand what you are doing, and use nodeService.addProperties otherwise.
            • Are you updating one property of a node but creating a new map of properties and applying it via nodeService.setProperties, instead of using nodeService.setProperty?
            • Are you updating multiple properties on a node, but nodeService.setProperty is being used to set them in multiple steps, instead of using nodeService.addProperties?
          • Consider analyzing these methods and your use cases before choosing which one to use; a quick sketch contrasting them follows.
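          • A quick sketch contrasting the three methods (nodeService and nodeRef are assumed to be in scope):
            import java.io.Serializable;
            import java.util.HashMap;
            import java.util.Map;
            import org.alfresco.model.ContentModel;
            import org.alfresco.service.namespace.QName;

            // One property only -> setProperty
            nodeService.setProperty(nodeRef, ContentModel.PROP_TITLE, "Quarterly report");

            // Several properties, preserving everything else -> addProperties
            Map<QName, Serializable> updates = new HashMap<>();
            updates.put(ContentModel.PROP_TITLE, "Quarterly report");
            updates.put(ContentModel.PROP_DESCRIPTION, "Q3 figures");
            nodeService.addProperties(nodeRef, updates);

            // setProperties REPLACES the node's entire property map; any existing
            // property missing from 'updates' is lost. Use it only when that is the intent.
            nodeService.setProperties(nodeRef, updates);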
        • Deleting a huge amount of content/folders?
          • After reviewing your organization's content retention policy, consider using batch processing and applying the "sys:temporary" aspect before deleting the nodes, as sketched below.
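          • A minimal sketch, assuming the nodes to delete have already been selected into batches (batchOfNodes is an illustrative collection of NodeRefs):
            import org.alfresco.model.ContentModel;
            import org.alfresco.service.cmr.repository.NodeRef;

            // The sys:temporary aspect makes deleteNode bypass the archive store
            // (trashcan), avoiding the per-node archival overhead.
            for (NodeRef nodeRef : batchOfNodes)
            {
                nodeService.addAspect(nodeRef, ContentModel.ASPECT_TEMPORARY, null);
                nodeService.deleteNode(nodeRef);
            }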
        • If you manage some sort of job IDs (content-less nodes) within the repository to track jobs via metadata/properties, make sure you also set the cm:isContentIndexed property to false on those nodes, to indicate that Solr should not try to index their content, especially if you have full-text content indexing enabled.
        • When using the SearchService Java foundation API, if you just need the list of NodeRefs from the result and not the metadata, then make sure you DO NOT set includeMetadata to true. This is considered a best practice.
          • When using the SearchService Java foundation API, fetch the results via pagination (using ResultSetSPI) instead of fetching all results at once. For example: fetch 1,000 results in one batch and keep fetching by iterating the result set until it has no more results (see the sketch after this list).
          • Avoid deep pagination:
            • Some common examples, these type of search parameters should be avoided:
              • skipCount=0&maxItems=1000000
              • skipCount=999000&maxItems=1000
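          • A paging sketch against the foundation SearchService (batch size and query are illustrative):
            import org.alfresco.service.cmr.repository.StoreRef;
            import org.alfresco.service.cmr.search.ResultSet;
            import org.alfresco.service.cmr.search.SearchParameters;
            import org.alfresco.service.cmr.search.SearchService;

            int batchSize = 1000;
            int skipCount = 0;
            boolean hasMore = true;
            while (hasMore)
            {
                SearchParameters sp = new SearchParameters();
                sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
                sp.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);
                sp.setQuery("TYPE:\"cm:content\"");
                sp.setSkipCount(skipCount);
                sp.setMaxItems(batchSize);
                ResultSet results = searchService.query(sp);
                try
                {
                    // ... process results.getNodeRefs() ...
                    hasMore = results.hasMore();
                    skipCount += results.length();
                }
                finally
                {
                    results.close();
                }
            }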
        • Are you using any action that performs transactional operations, such as creating folders (think of an organization scheme such as a timestamped folder structure where a file is moved as soon as it gets created), which is triggered multiple times by an event from a folder rule? It can be a background or a foreground operation. Are you facing challenges handling concurrency, perhaps getting exceptions such as FileExistsException or ConcurrencyFailureException?
          • If your answer is yes, did you jump to the quick solution of marking the create/update method(s) "synchronized"? We are always tempted to use "synchronized" as a quick fix for these problems, but remember that in most cases it is an anti-pattern when used without understanding its consequences (think twice before jumping to this solution): it can be slow and lead to contention between threads. Instead, rely on the repository's retrying transaction mechanism; the RetryingTransactionHelper automatically retries a transaction that fails with one of the following exceptions:
              - org.springframework.dao.ConcurrencyFailureException
              - org.springframework.dao.DeadlockLoserDataAccessException
              - org.springframework.jdbc.JdbcUpdateAffectedIncorrectNumberOfRowsException
              - org.springframework.jdbc.UncategorizedSQLException
              - java.sql.SQLException
              - java.sql.BatchUpdateException
              - org.springframework.dao.DataIntegrityViolationException
              - org.alfresco.service.license.LicenseIntegrityException
              - org.apache.ibatis.exceptions.TooManyResultsException
              - org.alfresco.util.LockHelper.LockTryException

            • ConcurrencyFailureException will automatically be retried.
            • FileExistsException is not covered by this retry semantic by default. It will not trigger the transaction to retry unless it is manually caught and rethrown wrapped in one of the exceptions listed above, which will be retried.
      • There is a property that can be configured via a Spring bean to include additional exceptions (called extraExceptions; an example can be found here). But do not try to include FileExistsException in the list of retry exceptions; that too is considered an anti-pattern (think about whether the exception is being thrown from poorly written code that does not check whether a node exists before trying to create it).
      • I would rather use a try-catch block to catch the FileExistsException and rethrow it as a relevant exception from the already configured list (shown above); the most relevant in this case seems to be DataIntegrityViolationException. This type of exception is handled automatically by retrying the transaction, provided the implementation uses RetryingTransactionHelper and RetryingTransactionCallback. On the next retry, the operation should then find the concurrently created node/folder in the existence check, skip creating it again, and go on to the next step.
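      • A sketch of that pattern using RetryingTransactionHelper (the folder name is illustrative; transactionService, fileFolderService and parentFolder are assumed to be in scope):
        import org.alfresco.model.ContentModel;
        import org.alfresco.repo.transaction.RetryingTransactionHelper;
        import org.alfresco.repo.transaction.RetryingTransactionHelper.RetryingTransactionCallback;
        import org.alfresco.service.cmr.model.FileExistsException;
        import org.alfresco.service.cmr.repository.NodeRef;
        import org.springframework.dao.DataIntegrityViolationException;

        RetryingTransactionHelper txHelper = transactionService.getRetryingTransactionHelper();
        NodeRef folder = txHelper.doInTransaction(new RetryingTransactionCallback<NodeRef>()
        {
            public NodeRef execute() throws Throwable
            {
                // Existence check: on a retry this finds the folder that was
                // created concurrently by another thread/transaction.
                NodeRef existing = fileFolderService.searchSimple(parentFolder, "2023-09-13");
                if (existing != null)
                {
                    return existing;
                }
                try
                {
                    return fileFolderService.create(parentFolder, "2023-09-13",
                            ContentModel.TYPE_FOLDER).getNodeRef();
                }
                catch (FileExistsException e)
                {
                    // Rethrow as one of the retried exceptions so the helper
                    // retries the whole callback instead of failing.
                    throw new DataIntegrityViolationException("Concurrent folder creation", e);
                }
            }
        }, false, true);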

       

      • Assess the archive store content requirements. Based on your organization's retention policy, clean the trashcan or set up the trash-can-cleaner scheduled job so you keep an appropriate amount of deleted content in the archive store. Also clean up the contentstore.deleted folder regularly.
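      • For example, with the alfresco-trashcan-cleaner module the job can be configured in alfresco-global.properties along these lines (property names come from that add-on; the values are illustrative, align them with your retention policy):
        trashcan-cleaner.cronExpression=0 30 2 * * ?
        trashcan-cleaner.keepPeriod=P7D
        trashcan-cleaner.deleteBatchCount=1000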
      • If you are using ACS 5.2 or ACS 6.0/6.1, which use the legacy transformation services, then configure the asynchronous LibreOffice subsystem. This is not applicable to ACS 6.2 and later.
        • Check out the documentation here:
      • Assess which services/subsystems/features are used and unused. Some examples you can consider disabling if not in use:

      cifs.enabled=false
      audit.enabled=false
      audit.alfresco-access.enabled=false
      audit.tagging.enabled=false
      imap.server.enabled=false
      imap.server.attachments.extraction.enabled=false
      googledocs.enabled=false
      system.webdav.servlet.enabled=false
      ftp.enabled=false
      system.usages.enabled=false

      activities.feed.notifier.enabled=false
      activities.feed.notifier.cronExpression=* * * * * ? 2099
      activities.feed.cleaner.enabled=false
      activities.feed.cleaner.cronExpression=* * * * * ? 2099
      activities.feed.generator.enabled=false
      activities.feed.generator.cronExpression=* * * * * ? 2099
      activities.post.cleaner.cronExpression=* * * * * ? 2099
      activities.post.cleaner.enabled=false
      activities.post.lookup.cronExpression=* * * * * ? 2099
      activities.post.lookup.enabled=false

        • If you are not using out-of-process extensions, you can also disable event2 and enable it when you plan to use it. Check more details here:

      repo.event2.enabled=false

        • If you have enabled replication but are not really using it, then disable it (it is disabled by default unless you enabled it).


      Additional Ideas (some of them do not apply in 7.x):





      Now that we have covered the repository part, we should not forget to review the Search Services (especially Solr) part as well; after all, it is the heart of search in Alfresco user interfaces such as Share and the Content App. Let's review some high-level pointers (to give you ideas to proceed with) that should be the starting points for analyzing performance considerations for the Alfresco search service (aka Solr).
      • If you're experiencing slow search results with Alfresco Solr, there could be several factors contributing to the issue. Here are a few reasons and potential solutions:
        • Hardware: Check if your server has sufficient resources (CPU, memory, disk space) to handle the search queries efficiently. Consider scaling up the hardware if necessary.
        • Indexing: Ensure that the Solr indexes are optimized and up-to-date. Regularly monitor and optimize the indexing process to maintain good search performance.
        • Query optimization: Analyze the search queries being executed and identify any potential bottlenecks. Review the Solr query syntax, filters, and sorting parameters to optimize the search queries for better performance.
        • Network latency: Evaluate the network connectivity between your Alfresco server and the Solr instance. Ensure there are no network issues or bottlenecks that could impact the search performance.
      • Make sure you use faster disks, such as SSDs, for better performance.
      • Identify and reduce any network latency. Also consider whether you need SSL; within a local intranet, you can disable SSL to reduce complexity and use "secret"-based communication.
      • Analyze your setup and search requirements to see if full-text indexing is really necessary. Full-text indexes are enabled by default but not used in some cases, so keep them enabled only if you use them.
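      • If you conclude that full-text content indexing is not needed, it can be switched off per core in solrcore.properties (re-test your search use cases after changing this):
        alfresco.index.transformContent=false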

      • Memory tuning for Solr is also a key consideration depending on your setup and requirements. Solr utilizes RAM for several important reasons:

        • Sorting and Ranking: Solr performs sorting and ranking operations on search results. These operations involve loading relevant data into memory for efficient computation and faster response times.
        • Document and field caches: Solr maintains caches for individual documents and fields to speed up retrieval. These caches store frequently accessed data in memory, enabling rapid access without having to fetch data from disk.
        • Caching: Solr employs caching mechanisms to improve search performance. RAM is utilized to store frequently accessed data structures, such as filters, query results, and facets. Caching in RAM significantly reduces the disk I/O operations, resulting in faster search responses.
        • Indexing: Solr uses RAM to load portions of the index into memory during the indexing process. This allows for faster document indexing and enables efficient updates to the index.
        • Heap size -> should it be too large or too small?
          • A popular saying goes, "one size does not fit all," and the same applies to the Solr heap. Keep the heap large enough that you don't run into OutOfMemory errors and constant garbage collection, but small enough that memory is not wasted and you don't run into huge GC pauses. An appropriate heap size can be determined by analyzing GC logs. There are tools like GCViewer, GCEasy, GCPlot, IBM Garbage Collection and Memory Visualizer, garbagecat, etc. that help analyze GC logs; they can provide information such as the amount of memory used after GC has completed.
          • Refer to the following post to learn more:

        • Review your repository and Solr sizing needs. Avoid running the repository and Solr on the same node. Refer to the following post for a deep dive:
        • Review your indexing requirements; visit the Alfresco documentation here to learn more.
          • Know how many content items and paths remain in the index:
          • You can consider using this tool for checking indexes (caution: use it with care, especially in PROD-like environments):
        • Review FINGERPRINT-based document search requirements and disable document fingerprinting if it is not being used. You will save a lot of disk space and increase indexing performance. Note that you will have to do a full reindex after disabling the document FINGERPRINT. Set alfresco.fingerprint=false in solrcore.properties.
          • The good news is that from Search Services 2.0.2 it is disabled by default.
        • Enabling the document cache provides performance benefits at the cost of additional memory. The Solr document cache feature is disabled by default since Search Services 2.0.2. The following properties in the solrcore.properties file are the defaults; you can tune the values as per your requirements through tests and trials:
        solr.documentCache.size=0
        solr.documentCache.initialSize=0
        solr.documentCache.autowarmCount=0
          • If you wish to take advantage of caching and are willing to provide additional memory, it will have some performance benefits, but excessive caching can be problematic too. Do not configure excessive caching. Disable the Solr document cache if you intend to reduce memory requirements. If you let a lot of objects accumulate in the caches, the JVM's GC will eventually have to clean them all up; having lots of garbage increases the duration of garbage collections and may slow down responses.
        • Optimize the Solr index (alfresco and archive cores) in a timely manner during off-business hours to improve search performance. Solr index optimization is a process that compacts the index and merges segments in order to improve query performance. Whenever documents are deleted/updated during the indexing process, they are marked as deleted in their original segments, which consumes additional storage for deleted document indexes. After a bulk import/upload operation, the percentage of deleted documents can be up to 50%; this percentage is determined by the ratio of numDocs to maxDocs. The greater the ratio of deleted documents the Solr index contains, the slower Search Services will be at searching and indexing.
          • If you are not optimizing the indexes periodically, that may have implications in some use cases, especially with large repository indexes; review your setup before starting the optimization. After the initial optimization, ensure that a periodic execution of the optimization process is carried out in order to preserve the performance benefits.
          • By now you must be thinking: does the Solr optimize action require more memory and disk? The answer is yes; it is usually a quite expensive operation, particularly for large indexes. Optimization can take hours in some cases on large indexes. In later versions of Solr, i.e. Solr 7.5 or later, there were some improvements and optimize/forced merge behaves better, but in the case of Alfresco, Solr 6 is the only option.
            • Optimizing a Solr index can require additional memory and disk space. When you optimize the indexes, Solr merges smaller index segments into larger ones, which can reduce disk space usage and improve query performance. But during this process, Solr may temporarily require extra disk space for the new segments being created. Optimizing can be a resource-intensive operation, so it may require more memory and CPU capacity while it is running.
            • Monitor your system's resource usage during the optimization process to make sure it doesn't impact your server's performance. Hence it is recommended you perform the optimization during off hours.
            • You may look at this video to understand more about Solr index optimization.
            • It is also a good idea to restart the Solr server once optimization has fully completed, to allow physical memory and cache buffers to be reclaimed. Again, make sure you perform the optimization during off hours.
          • You can optimize the indexes by selecting the respective cores (alfresco/archive) from the Solr Admin UI, or you can use Solr's update handler REST API to do so. For example (assuming Solr listens on localhost:8983 and the default alfresco core; adjust host, port, core and SSL settings to your deployment):
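        curl "http://localhost:8983/solr/alfresco/update?optimize=true&maxSegments=N"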
        Where N (the target number of segments) is >= 1. To determine an appropriate value for N, you can divide the index size in GB by 10. For example, if your index size is 20 GB, then N = 20/10 -> 2.

        • Increase the batch size (alfresco.batch.count=5000 in solrcore.properties) to get more results per call from the indexing webscript on the repository side.
        • During indexing, plug in a monitoring tool (e.g. YourKit) to check the repository health. Sometimes, during the indexing process, the repository layer executes heavy and IO/CPU/memory-intensive operations, like the transformation of content to text in order to send it to Solr for indexing. This can become a bottleneck when, for example, the transformations are not working properly or the GC cycles are taking a lot of time.
        • Monitor closely the JVM health of both Solr and Alfresco (GC, Heap usage).
        • Review commits and transaction logs (TLOG, .tlog files). A tlog is a file that contains the raw documents, kept for recovery purposes. In large repos, especially when secondary nodes/associations (think folder links) get created, the .tlog file can become huge during indexing. Tlogs can be huge in bulk loading scenarios too (say you are loading 1,000 documents every second: the tlog can accumulate 1000x60x60 = 3,600,000 raw documents per hour). You can also review this documentation. This is a really broad topic; here is what I have found based on all my reading.
          • The entire document gets written to the tlog on update; this holds true for atomic updates as well, including data read from the old version of the document. The document written to the tlog is not the "delta" for atomic updates.
          • The tlog is most important for consistency purposes. These files are used to bring an index up to date if the JVM is stopped or crashes before segments get closed. If the JVM crashes, the documents are still safely written to disk; if the operating system crashes, they are not. The transaction log will be replayed on server restart if the server was not gracefully shut down!
            • If the tlog is huge, then restarting the server can be slow; depending on the situation and the size of the indexes, it may take more time than usual.
          • TLOGs are only rolled over when you tell Solr to do so, i.e. when a hard commit is issued (or autoCommit happens, as configured in solrconfig.xml). Very large tlogs are not a good sign, and you should change your hard commit settings. You can configure autoCommit to avoid filling up your disk with tlogs, e.g.:
            • Note that soft commits make indexed documents visible but do not truncate the tlog; the tlog continues to grow until a hard commit.
            • -Dsolr.autoCommit.maxTime=300000 (Hard commit, time in ms)
              • The tlog is truncated on a hard commit and a new tlog is started. Old tlogs are deleted if there are more than 100 documents in newer, closed tlogs.
              • The current index segment is closed and flushed.
              • Background segment merges may be initiated.
          • tlog files can be found under the index cores; for example, in Alfresco Search Services 2.x they can be found here (depending on your deployment): /opt/alfresco-search-services/data/alfresco/tlog
        • Avoid using "*" (wildcard) based searches and avoid PATH-based search queries; those are usually slow.
        • Do not enable excessive logging; trying to log everything (sometimes needed if you are debugging an issue) can have an impact on performance.
        • Thinking of implementing Solr sharding to increase Solr performance? --> This is another very broad topic. Understand your requirements, the size of the repository, etc. before sharding. Solr sharding is recommended only if you have a really large repository, containing over 50 million documents. Learn more on Solr sharding and best practices here.

        Additional resources (do check out the ASS 2.x release notes):



        Want to learn how to control the indexing behavior? Check out the following posts as well:


        Acknowledgments

        Thank you Angel Borroy for your valuable inputs and knowledge sharing. 

        I want to also thank Luis Colorado and Luis Cabaceira for their excellent articles on Solr and Repository performance tuning.


        References:

        • https://docs.alfresco.com
        • https://hub.alfresco.com
        • https://www.ziaconsulting.com
        • https://solr.apache.org/guide/6_6/
        • https://www.slideshare.net
        • https://texter.ai/technical-articles/
        • https://www.baeldung.com
        • https://sematext.com/blog/
        • https://blog.cloudera.com
        • https://clouderatemp.wpengine.com/blog
        • https://www.ibm.com/docs
        • https://www.github.com
        • https://blog.gceasy.io




        8 comments:

        1. Great job! Thanks a lot.

        2. Congratulations, this is great work, very helpful. Love your writing style, have been following your blogs for 1 year.

        3. Thanks for the detailed checklist. You are right, I had a login slowness and it was happening because of external authentication going through our HRMS. Network rerouting fixed it. Thanks, I got this path to check after reading your post.

        4. Thanks a lot Abhinav. But is it possible to collect such topics specific to the 7.x version, which we can apply?

          Reply: These checklists are not specific to Alfresco versions; they are the same irrespective of version. Anything version-specific is already called out in the list. If you have anything specific, please elaborate.
        Thanks for your comments/Suggestions.