Wednesday, September 13, 2023

Alfresco repository performance tuning checklist

 

As an Alfresco developer/admin, you often get into situations where you need to optimize the Alfresco repository for performance and stability. I often get messages in this regard, so I decided to put together a checklist that you should consider whenever you are dealing with Alfresco repository performance.

Part of this also comes back to how the system was sized in the first place. Not every system is sized with a pre-defined number of named users, content size, number of concurrent users, etc. In some cases we have to revisit the sizing and tune the repository to match growing user and content requirements. In most cases this is an incremental process.

Performance tuning and improvement is not a one-time job; it is done through evaluation, trials and updates in an incremental manner.

"Performance tuning is always an open ended process."

Having said that, here are some high-level pointers (to give you ideas to start with) that you can use as a starting point:

  • Analyze Thread dump, GC Log and Heap dump.
    • Analyze the thread dump and hot threads; this may help troubleshoot performance problems, CPU usage and deadlocks. You can use Support Tools for the Enterprise version and OOTBee Support Tools for the Community version. You can also use FastThread to analyze thread dumps.
      • To export a thread dump, follow these steps:
        • Find the java process id (pid). Use this command: 
          • pgrep java
        • Export the thread dump using this command with the process id (e.g. 1):
          • jstack [pid] > filepathtosave
            • jstack 1 > /usr/local/alfresco-community70/tomcat/threaddump.txt
      • To enable GC logging, use the following JVM parameter (Java 9 or later):
        • -Xlog:gc*:file=/usr/local/alfresco-community70/tomcat/repo_gc.log
      • Capture heap dumps automatically on OutOfMemoryError using the following JVM parameters:
        • -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/alfresco-community70/tomcat/repoheapdump.bin
      • To export a heap dump, follow these steps:
        • Find the java process id (pid). Use this command:
          • pgrep java
        • Export the heap dump using this command with the process id (e.g. 1):
          • jmap -dump:[live],format=b,file=<filepathtosave> <pid>
            • jmap -dump:[live],format=b,file=/usr/local/alfresco-community70/tomcat/repoheapdump.bin 1
      • It is sometimes also helpful to review the Java system properties, VM flags and VM arguments. You can use the command below to review them:
        • jinfo <pid>
      • Learn more about these terms here.
    • Analyze the number of users you want to support concurrently. Some of these inputs can be obtained by doing a load test in your target environment:
      • How many concurrent users is your system currently handling?
        • What is the target number of concurrent users down the line?
        • You can consider creating a client program (using the REST or CMIS API) to verify the system and analyze whether it can support the estimated number of users with the expected load before moving to PROD (see the sketch after the configuration example below).
      • What is the total number of DB connections supported? (This number relates to the total number of concurrent users as well, so to support the concurrent users the allowed DB connections must be optimized and configured to an appropriate value.) The default db.pool.max limit is 275.
        • Here is an example of how to increase the Tomcat thread pool and max DB connections to support your requirement. In this example we need to support a maximum of 400 connections, and the DB instance type in AWS is db.t4g.medium. Via a Dockerfile:
    RUN sed -i "s/port=\"8080\"\ protocol=\"HTTP\/1.1\"/port=\"8080\"\ protocol=\"HTTP\/1.1\"\ maxHttpHeaderSize=\"32768\"\ maxThreads=\"325\"/g" $TOMCAT_DIR/conf/server.xml ;
    
    OR (server.xml)

    <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" URIEncoding="UTF-8" 
          redirectPort="8443" maxThreads="325" maxHttpHeaderSize="32768"/>

    ######### db.pool.max = maxThreads + 75 (325 + 75 = 400), set in JAVA_OPTS or alfresco-global.properties #########
    -Ddb.pool.max=400
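
    To sanity-check such settings before PROD, here is a minimal, hypothetical sketch of the load-test client mentioned above, using the Apache Chemistry OpenCMIS library. The URL, credentials and user count are illustrative assumptions; adjust them for your environment.

      // Hypothetical concurrency smoke test using Apache Chemistry OpenCMIS.
      // URL, credentials and CONCURRENT_USERS are assumptions - tune for your target load.
      import java.util.HashMap;
      import java.util.Map;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.TimeUnit;
      import org.apache.chemistry.opencmis.client.api.Session;
      import org.apache.chemistry.opencmis.client.runtime.SessionFactoryImpl;
      import org.apache.chemistry.opencmis.commons.SessionParameter;
      import org.apache.chemistry.opencmis.commons.enums.BindingType;

      public class ConcurrencySmokeTest {

          private static final int CONCURRENT_USERS = 50; // assumption

          public static void main(String[] args) throws InterruptedException {
              ExecutorService pool = Executors.newFixedThreadPool(CONCURRENT_USERS);
              for (int i = 0; i < CONCURRENT_USERS; i++) {
                  pool.submit(() -> {
                      long start = System.currentTimeMillis();
                      Session session = openSession();
                      // A cheap call that still exercises the repository, cache and DB.
                      session.getRootFolder().getChildren().iterator().hasNext();
                      System.out.println("Round trip: " + (System.currentTimeMillis() - start) + " ms");
                  });
              }
              pool.shutdown();
              pool.awaitTermination(5, TimeUnit.MINUTES);
          }

          private static Session openSession() {
              Map<String, String> params = new HashMap<>();
              params.put(SessionParameter.USER, "admin");     // assumption
              params.put(SessionParameter.PASSWORD, "admin"); // assumption
              params.put(SessionParameter.ATOMPUB_URL,        // assumption
                      "http://localhost:8080/alfresco/api/-default-/public/cmis/versions/1.1/atom");
              params.put(SessionParameter.BINDING_TYPE, BindingType.ATOMPUB.value());
              return SessionFactoryImpl.newInstance().getRepositories(params).get(0).createSession();
          }
      }

    Run it while watching Tomcat thread usage, DB connections and response times to estimate your real concurrency headroom.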

    • Analyze the sources of content and current load:
      • What is the current size of repository?
      • What will be the target size based on estimates? If applicable, collect information about the document types to be managed, including their size, number of versions, daily volumes, number of properties, distribution ratios, etc. This will yield information about the database and storage capacity that may need to be upgraded to support the target state.
        • Consider creating a plan to increase resources over time. When the repository grows, apart from disk storage, resources like RAM and CPU should also be assessed and increased.
    • Revisit the JVM settings based on the above assessment. You can also refer to these docs for an overview:
      • Configure your Java heap memory such that there is enough room for the OS and non-heap processes in your application, to avoid system crashes (a hypothetical sizing example follows this list).
      • Some useful resources around JVM:
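      • As a hypothetical sizing example (an assumption to adapt, not a rule): on a dedicated 16 GB repository host you might give the JVM roughly half the RAM and leave the rest for the OS, page cache and non-heap memory:

      JAVA_OPTS="$JAVA_OPTS -Xms8g -Xmx8g -XX:+UseG1GC"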
    • Re-validate whether there is any latency between the repository and the DB.
      • Is your DB undergoing regular maintenance and tuning? (Think DB Vacuuming)
        • Regular maintenance and tuning of the database is necessary. Specifically, all of the supported database servers require, at the very least, some form of index statistics maintenance at frequent, regular intervals to maintain optimal performance. Index maintenance can have a severe impact on performance while in progress, so it needs to be discussed with your project team and scheduled appropriately.
      • The round trip time should be less than 1ms (recommended).
      • Round trip times greater than 1ms indicate an unoptimized network that will adversely impact performance. A highly variable round trip time is also of concern, as it may increase the variability of Alfresco Content Services performance.
    • Revisit the repository cache settings if you have modified them (check your alfresco-global.properties or JMX settings to confirm).
      • Have you modified the default cache limits? (Modifying the limits without understanding them can lead to out-of-memory issues, as the caches consume heap memory.)
        • Check out these posts/docs to deep dive into the repo cache:
        • There is a critical connection between repository instances in a cluster. A cache layer sits just above the object-relational mapping layer, which in turn sits above the database tier within the repository. This cache is synced across all members of a repository cluster, and that syncing mechanism is sensitive to latency.
        • If you have a DR environment (active-active setup), the cloud regions are far from each other and clustering is enabled, you will see slower performance as both environments try to connect to each other. So consider this implication and a proper remedy before setting "alfresco.cluster.enabled=true" for the DR environment.
          • If the DR environment is not set up as active, it should be OK to have clustering enabled, as you will only bring up the DR environment when the primary region is down.
      • As a general rule:
        • Do not keep more than 1000 nodes (files/folders) in the same parent node (folder) if the Share UI is your primary interface for users.
        • Overall, do not keep more than 3000 nodes (files/folders) in the same parent node in the repository.
      • However, it depends on the resources of the server. I've seen systems with 10k nodes per folder working fine. A nice addition would be an organization scheme that avoids uploading "all" the files into the same folder, such as a year/month/day structure.
    • A folder with linked nodes loads slowly in the Share UI; it usually takes longer than usual to display because the metadata of all the linked nodes is retrieved in the response as well. If the linked nodes have large custom properties, the JSON response will be huge and can have a considerable impact on the time taken to display the folder in Share. You can configure these properties to be loaded lazily instead by adding the following to JAVA_OPTS:
    -Dstrip.linked-node.properties=true
    • Slower login response? -> We often hear about slow login responses and quickly end up concluding that the issue is with Alfresco. But that is not always the cause of slower login responses.
      • There may be an issue on the Alfresco side in terms of network delays between the repository and the database, which can sometimes cause slower login response times. As indicated above, the round trip time should be less than 1ms (recommended).
      • Consider reviewing the authentication chain configured for your repository. We often miss this critical part of the flow, where the problem lies in slow connectivity/network delays with, for example, LDAP or external authentication systems configured in the authentication chain. Learn more on the authentication subsystem here.

      • Assess and analyze your custom code as well. Many times the problem lies within custom code, which we often don't consider reviewing.
        • Do you have excessive logging in your code? 
          • Never run your production environment with debug enabled. Set the appropriate log level (e.g. error) for production. You can use Support Tools for the Enterprise version and OOTBee Support Tools for the Community version to enable/disable debug logs on an as-needed basis.
        • Are there unclosed Streams/IO/ResultSet/HttpClient/HttpResponse? 
          • Always close the ResultSet (org.alfresco.service.cmr.search.ResultSet) after extracting the search results.
          • Use proper exception handling and close resources via a finally block (see the sketch below). Rule of thumb for exception handling: "Services should throw, and callers of services should catch. A handled exception must depict WWW (what, where and why)."
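
          A minimal sketch of this pattern, assuming an injected SearchService and an illustrative FTS query:

      // Minimal sketch: always release the search ResultSet, even on failure.
      // The query and service wiring are illustrative assumptions.
      import java.util.List;
      import org.alfresco.service.cmr.repository.NodeRef;
      import org.alfresco.service.cmr.repository.StoreRef;
      import org.alfresco.service.cmr.search.ResultSet;
      import org.alfresco.service.cmr.search.SearchService;

      public class SearchResultExample {

          public static List<NodeRef> findInvoices(SearchService searchService) {
              ResultSet results = null;
              try {
                  results = searchService.query(
                          StoreRef.STORE_REF_WORKSPACE_SPACESSTORE,
                          SearchService.LANGUAGE_FTS_ALFRESCO,
                          "TYPE:\"cm:content\" AND cm:name:\"invoice*\""); // hypothetical query
                  return results.getNodeRefs();
              } finally {
                  if (results != null) {
                      results.close(); // releases the resources held by the result set
                  }
              }
          }
      }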
        • Do you have an implementation of behaviors wide open? (Hint: bind behaviors to the specific CLASS/ASSOCIATION/PROPERTIES/SERVICE that you need; see the sketch after this list.)
          • Class - Most commonly used, fired for a particular TYPE/ASPECT
          • Association - Fired for a particular association or association of a particular type.
          • Properties - Fired for properties of a particular type or with a particular name.
          • Service - Fired every time the service generates an event.
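
          A minimal sketch of a narrowly bound behavior; the custom type ("ex:invoice") and Spring wiring (setter plus init-method) are hypothetical assumptions:

      // Minimal sketch of a CLASS-level behavior bound to one type only.
      import org.alfresco.repo.node.NodeServicePolicies;
      import org.alfresco.repo.policy.Behaviour;
      import org.alfresco.repo.policy.JavaBehaviour;
      import org.alfresco.repo.policy.PolicyComponent;
      import org.alfresco.service.cmr.repository.ChildAssociationRef;
      import org.alfresco.service.namespace.QName;

      public class InvoiceCreatedBehaviour implements NodeServicePolicies.OnCreateNodePolicy {

          // Hypothetical custom type - replace with a type from your own model.
          private static final QName TYPE_INVOICE =
                  QName.createQName("http://example.com/model/1.0", "invoice");

          private PolicyComponent policyComponent;

          public void setPolicyComponent(PolicyComponent policyComponent) {
              this.policyComponent = policyComponent;
          }

          public void init() {
              // CLASS-level binding, scoped to one type and fired once per transaction.
              policyComponent.bindClassBehaviour(
                      NodeServicePolicies.OnCreateNodePolicy.QNAME,
                      TYPE_INVOICE,
                      new JavaBehaviour(this, "onCreateNode",
                              Behaviour.NotificationFrequency.TRANSACTION_COMMIT));
          }

          @Override
          public void onCreateNode(ChildAssociationRef childAssocRef) {
              // Fires only for ex:invoice nodes - keep the work here minimal.
          }
      }

          Binding to a specific QName (instead of a broad type like cm:content) keeps the behavior from firing on every node in the repository.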
        • Node property/metadata updates via NodeService API:
          • Ask yourself these questions and analyze your source code:
            • Adding new properties to a node via NodeService, but using nodeService.setProperties instead of nodeService.addProperties?
              • Be careful and understand what you are doing when using nodeService.setProperties before creating versions programmatically: it replaces the whole property map, so you will lose any property values already set on the workspace node (current version). Make sure you understand what you are doing, and use nodeService.addProperties otherwise.
            • Updating one property of a node, but creating a new map of properties and applying it via nodeService.setProperties instead of using nodeService.setProperty?
            • Updating multiple properties on a node, but nodeService.setProperty is being used to set them in multiple steps instead of using nodeService.addProperties?
          • Consider analyzing these methods and your use cases before choosing which to use (a short sketch follows).
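
            A short sketch contrasting the three update styles; the service wiring and property values are illustrative assumptions:

      // Minimal sketch of the three NodeService update styles discussed above.
      import java.io.Serializable;
      import java.util.HashMap;
      import java.util.Map;
      import org.alfresco.model.ContentModel;
      import org.alfresco.service.cmr.repository.NodeRef;
      import org.alfresco.service.cmr.repository.NodeService;
      import org.alfresco.service.namespace.QName;

      public class PropertyUpdateExamples {

          public void update(NodeService nodeService, NodeRef nodeRef) {
              // One property changed -> setProperty, not a new map + setProperties.
              nodeService.setProperty(nodeRef, ContentModel.PROP_TITLE, "Q3 report");

              // Several properties changed -> addProperties merges into existing ones.
              Map<QName, Serializable> updates = new HashMap<>();
              updates.put(ContentModel.PROP_TITLE, "Q3 report");
              updates.put(ContentModel.PROP_DESCRIPTION, "Reviewed draft");
              nodeService.addProperties(nodeRef, updates);

              // setProperties REPLACES the whole property map - anything not in
              // "updates" is lost. Only use it when that is really what you want.
              // nodeService.setProperties(nodeRef, updates);
          }
      }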
        • Deleting a huge amount of content/folders?
          • Consider using batch processing and applying the "sys:temporary" aspect before deleting, after checking your organization's content retention policy (see the sketch below).
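
            A minimal sketch of this, assuming the nodes to delete have already been collected and the services are wired in; the batch size is an assumption to tune:

      // Minimal sketch: mark nodes sys:temporary so the delete bypasses the
      // archive store, and delete in small transactional batches.
      import java.util.List;
      import org.alfresco.model.ContentModel;
      import org.alfresco.repo.transaction.RetryingTransactionHelper;
      import org.alfresco.service.cmr.repository.NodeRef;
      import org.alfresco.service.cmr.repository.NodeService;
      import org.alfresco.service.transaction.TransactionService;

      public class BulkDeleteHelper {

          private static final int BATCH_SIZE = 100; // assumption: tune to your system

          public void deleteAll(TransactionService transactionService,
                                NodeService nodeService, List<NodeRef> nodes) {
              RetryingTransactionHelper txHelper = transactionService.getRetryingTransactionHelper();
              for (int from = 0; from < nodes.size(); from += BATCH_SIZE) {
                  List<NodeRef> batch = nodes.subList(from, Math.min(from + BATCH_SIZE, nodes.size()));
                  txHelper.doInTransaction(() -> {
                      for (NodeRef nodeRef : batch) {
                          // sys:temporary means the node is deleted outright, not archived.
                          nodeService.addAspect(nodeRef, ContentModel.ASPECT_TEMPORARY, null);
                          nodeService.deleteNode(nodeRef);
                      }
                      return null;
                  }, false, true);
              }
          }
      }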
        • If you manage some sort of job IDs (content-less nodes) within the repository to track jobs via metadata/properties, make sure you also set the cm:isContentIndexed property (part of the cm:indexControl aspect) on those nodes to indicate that Solr should not try to index their content, especially if you have full text content indexing enabled.
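
          A minimal sketch, assuming the NodeService is wired in; cm:isContentIndexed lives on the cm:indexControl aspect:

      // Minimal sketch: keep metadata indexed but skip full text extraction.
      import java.io.Serializable;
      import java.util.HashMap;
      import java.util.Map;
      import org.alfresco.model.ContentModel;
      import org.alfresco.service.cmr.repository.NodeRef;
      import org.alfresco.service.cmr.repository.NodeService;
      import org.alfresco.service.namespace.QName;

      public class IndexControlHelper {

          public static void disableContentIndexing(NodeService nodeService, NodeRef nodeRef) {
              Map<QName, Serializable> props = new HashMap<>();
              props.put(ContentModel.PROP_IS_INDEXED, Boolean.TRUE);          // keep metadata indexed
              props.put(ContentModel.PROP_IS_CONTENT_INDEXED, Boolean.FALSE); // skip content indexing
              nodeService.addAspect(nodeRef, ContentModel.ASPECT_INDEX_CONTROL, props);
          }
      }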
        • When using the SearchService Java foundation API, if you just need the list of NodeRefs from the result and not the metadata, make sure you DO NOT set includeMetadata to true. This is considered a best practice.
          • When using the SearchService Java foundation API, fetch the results via pagination (using ResultSetSPI) instead of fetching all results at once. For example, fetch 1000 results in one batch and keep fetching by iterating the result set while it hasMore results (see the sketch after this list).
          • Avoid deep pagination:
            • Some common examples; these types of search parameters should be avoided:
              • skipCount=0&maxItems=1000000
              • skipCount=999000&maxItems=1000
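
            A minimal sketch of page-by-page fetching with SearchParameters and ResultSet.hasMore(); the page size and query are illustrative assumptions:

      // Minimal sketch: fetch results page by page instead of all at once.
      import java.util.ArrayList;
      import java.util.List;
      import org.alfresco.service.cmr.repository.NodeRef;
      import org.alfresco.service.cmr.repository.StoreRef;
      import org.alfresco.service.cmr.search.ResultSet;
      import org.alfresco.service.cmr.search.SearchParameters;
      import org.alfresco.service.cmr.search.SearchService;

      public class PagedSearchHelper {

          private static final int PAGE_SIZE = 1000; // assumption: a moderate page size

          public static List<NodeRef> fetchAllPaged(SearchService searchService, String ftsQuery) {
              List<NodeRef> allResults = new ArrayList<>();
              int skip = 0;
              boolean hasMore = true;
              while (hasMore) {
                  SearchParameters sp = new SearchParameters();
                  sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
                  sp.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);
                  sp.setQuery(ftsQuery);
                  sp.setSkipCount(skip);
                  sp.setMaxItems(PAGE_SIZE);
                  ResultSet page = searchService.query(sp);
                  try {
                      allResults.addAll(page.getNodeRefs());
                      hasMore = page.hasMore(); // ResultSetSPI tells us if another page exists
                  } finally {
                      page.close();             // always release the result set
                  }
                  skip += PAGE_SIZE;
              }
              return allResults;
          }
      }

            Note that for very large result sets even this loop eventually reaches high skip counts, so prefer narrowing the query where possible.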
        • Are you using an action that performs transactional operations such as creating folders (think of an organization scheme such as a timestamped folder structure where a file is moved as soon as it is created) which is triggered multiple times by an event from a folder rule? It can be a background or a foreground operation. Are you facing challenges handling concurrency, and maybe getting exceptions (e.g. FileExistsException or ConcurrencyFailureException)?
          • If your answer is yes, did you jump to a quick solution of marking the create/update method(s) "synchronized"? -> We are always tempted to use "synchronized" as a quick fix for these problems, but remember that it is considered an anti-pattern in most cases if used without understanding its consequences (think twice before jumping to this solution). It can be slow and lead to contention between threads. Instead, rely on Alfresco's retrying transaction mechanism: the RetryingTransactionHelper automatically retries a transaction that fails with one of the following exceptions:
              - org.springframework.dao.ConcurrencyFailureException
              - org.springframework.dao.DeadlockLoserDataAccessException
              - org.springframework.jdbc.JdbcUpdateAffectedIncorrectNumberOfRowsException
              - org.springframework.jdbc.UncategorizedSQLException
              - java.sql.SQLException
              - java.sql.BatchUpdateException
              - org.springframework.dao.DataIntegrityViolationException
              - org.alfresco.service.license.LicenseIntegrityException
              - org.apache.ibatis.exceptions.TooManyResultsException
              - org.alfresco.util.LockHelper.LockTryException

            • ConcurrencyFailureException will be automatically retried.
            • FileExistsException is not covered by default by this retry mechanism. It will not trigger a retry of the transaction unless it is manually caught and rethrown wrapped in one of the exceptions listed above, which will be retried.
      • There is a property that can be configured via a Spring bean to include additional exceptions (called extraExceptions; an example can be found here). But do not try to include FileExistsException in the list of retry exceptions; that is also considered an anti-pattern (think about whether this exception is being thrown from poorly written code that does not check if a node exists before trying to create it).
      • I would rather use a try-catch block to catch the FileExistsException and re-throw it as a relevant exception from the already configured list (shown above); the most relevant in this case seems to be "DataIntegrityViolationException". This type of exception is handled automatically by retrying the transaction, provided the implementation uses "RetryingTransactionHelper" and "RetryingTransactionCallback". On the next retry, the operation should then find the concurrently created node/folder in the existence check (nodeService.exists(nodeRef)), skip creating it again, and go to the next step. A sketch of this approach follows.
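
      A minimal sketch of this approach; the helper name and the name-based existence check (searchSimple) are illustrative assumptions:

      // Minimal sketch: catch FileExistsException inside a retrying transaction
      // and rethrow it as a retryable exception from the configured list.
      import org.alfresco.model.ContentModel;
      import org.alfresco.repo.transaction.RetryingTransactionHelper;
      import org.alfresco.service.cmr.model.FileExistsException;
      import org.alfresco.service.cmr.model.FileFolderService;
      import org.alfresco.service.cmr.repository.NodeRef;
      import org.alfresco.service.transaction.TransactionService;
      import org.springframework.dao.DataIntegrityViolationException;

      public class FolderCreationHelper {

          public static NodeRef ensureFolder(TransactionService transactionService,
                                             FileFolderService fileFolderService,
                                             NodeRef parent, String name) {
              RetryingTransactionHelper txHelper = transactionService.getRetryingTransactionHelper();
              return txHelper.doInTransaction(() -> {
                  NodeRef existing = fileFolderService.searchSimple(parent, name);
                  if (existing != null) {
                      return existing; // a concurrent transaction already created it
                  }
                  try {
                      return fileFolderService.create(parent, name, ContentModel.TYPE_FOLDER).getNodeRef();
                  } catch (FileExistsException e) {
                      // Rethrow as a retryable exception; the next attempt will find the folder.
                      throw new DataIntegrityViolationException("Folder created concurrently: " + name, e);
                  }
              }, false, true);
          }
      }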

       

      • Assess the archive store content requirements. Based on your organization's retention policy, clean the trashcan or set up a trashcan cleaner scheduled job so you keep an appropriate amount of deleted content in the archive store. Also clean up the contentstore.deleted folder regularly.
      • If you are using ACS 5.2 or ACS 6.0/6.1, which use the legacy transformation services, then configure the async LibreOffice subsystem. This does not apply to ACS 6.2 and later.
        • Check out the documentation here:
      • Assess which services/subsystems/features are used and unused. Some examples you can consider disabling if not in use:

      cifs.enabled=false
      audit.enabled=false
      audit.alfresco-access.enabled=false
      audit.tagging.enabled=false
      imap.server.enabled=false
      imap.server.attachments.extraction.enabled=false
      googledocs.enabled=false
      system.webdav.servlet.enabled=false
      ftp.enabled=false
      system.usages.enabled=false

      activities.feed.notifier.enabled=false
      activities.feed.notifier.cronExpression=* * * * * ? 2099
      activities.feed.cleaner.enabled=false
      activities.feed.cleaner.cronExpression=* * * * * ? 2099
      activities.feed.generator.enabled=false
      activities.feed.generator.cronExpression=* * * * * ? 2099
      activities.post.cleaner.cronExpression=* * * * * ? 2099
      activities.post.cleaner.enabled=false
      activities.post.lookup.cronExpression=* * * * * ? 2099
      activities.post.lookup.enabled=false

        • If you are not using out-of-process extensions, you can also disable event2. Enable it when you plan to use it. Check more details here:

      repo.event2.enabled=false

        • If you have enabled replication but are not really using it, disable it (it is disabled by default unless you enable it).


      Additional Ideas (some of them do not apply in 7.x):