Monday, October 10, 2022

Control indexing in Alfresco with Alfresco Search Services

I recently came across a query from a friend about disabling content indexing and allowing only metadata indexing. The last time I tried this, I was using Alfresco 5.0 with Solr4. This time I tried it with Alfresco Content Services 7.2.0.1 and Alfresco Search Services 2.0.3.5, and the good news is that it still works.

If your application does not require full-text content search, disabling content indexing comes in handy and improves performance as well.

If you are curious to try it out, follow along.

Pre-requisites
  • You have an environment up and running with Alfresco Content Services 7.x and Alfresco Search Services 2.x (Solr6)
  • You have administrative privileges

Looking for Alfresco Content Services 7.x with Alfresco Search Services 2.x installation steps? Check out these posts:




Content and metadata are indexed by default; this is the out-of-the-box behavior. There are two ways to control content/metadata indexing in order to fulfil your search and indexing requirements. We will go over both options.

Control indexing behavior with the help of a content model aspect:


To control the indexing behavior, you can use a content model aspect named "cm:indexControl", which has two properties. These properties indicate whether content/metadata should be indexed.

Both properties are set to true by default.

<aspect name="cm:indexControl">
	<title>Index Control</title>
	<properties>
		<property name="cm:isIndexed">
			<title>Is indexed</title>
			<type>d:boolean</type>
			<default>true</default>
		</property>
		<property name="cm:isContentIndexed">
			<title>Is content indexed</title>
			<type>d:boolean</type>
			<default>true</default>
		</property>
	</properties>
</aspect>

You can apply the cm:indexControl aspect to nodes and set the appropriate properties to control the indexing behavior. Note that this approach works only for certain types, such as cm:folder, cm:content and their sub-types. Keep in mind that if you have a large number of nodes that need to be excluded from content/metadata indexing, this option is not the right choice, as you would have to apply the aspect with "cm:isContentIndexed" set to "false" on all of those nodes.
In that situation, the second option (which we will see next) comes in handy.

To learn more about content models, aspects and their application, refer to: Content Model Extension Point

If you wish to bulk apply the aspect with updated values, this post may be useful as a reference: Applying the aspects in bulk
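
For a single node, applying the aspect through the public REST API could look roughly like the sketch below. This is a minimal illustration under assumptions, not the method used in this post: the base URL, the helper names build_index_control_update and apply_index_control, and the auth_header argument are all hypothetical and need adjusting for your environment. The existing aspect names are passed in because sending aspectNames in a node update replaces the node's current aspect list.

```python
import json
import urllib.request

# Assumed base URL of the public REST API; adjust host and port for your environment.
API_BASE = "http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1"


def build_index_control_update(existing_aspects, is_indexed=True, is_content_indexed=False):
    """Build the body of a node update that adds cm:indexControl.

    existing_aspects is the node's current aspect list (for example from a
    prior GET on the node); it is kept so the update does not drop aspects.
    """
    aspects = list(existing_aspects)
    if "cm:indexControl" not in aspects:
        aspects.append("cm:indexControl")
    return {
        "aspectNames": aspects,
        "properties": {
            "cm:isIndexed": is_indexed,
            "cm:isContentIndexed": is_content_indexed,
        },
    }


def apply_index_control(node_id, existing_aspects, auth_header):
    """Send the update with PUT /nodes/{id} (hypothetical helper, not run here)."""
    body = json.dumps(build_index_control_update(existing_aspects)).encode("utf-8")
    request = urllib.request.Request(
        f"{API_BASE}/nodes/{node_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "Authorization": auth_header},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the defaults above, the node keeps metadata indexing (cm:isIndexed stays true) and only content indexing is switched off.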


Control indexing behavior from Solr:


This approach comes in handy for controlling indexing for all document types across the repository. We can configure the Solr indexing behavior by setting these properties (alfresco.index.transformContent and/or alfresco.ignore.datatype.1) in the solrcore.properties file.

Keep in mind that if you need to exclude only specific nodes, this option is not the right choice. This approach controls the indexing behavior globally: indexing is either enabled or disabled for the whole repository.

If alfresco.index.transformContent is set to 'false' in the solrcore.properties file, content indexing will be disabled. The index tracker will not transform any content and only the metadata will be indexed.

If alfresco.ignore.datatype.1 is set in the solrcore.properties file, metadata indexing will be disabled. The index tracker will not index metadata. This is not an ideal setting in most cases, as you usually want to be able to search your documents by their metadata at least, so be mindful when setting this property.

If both of the properties given below are set in the solrcore.properties file, both content and metadata indexing will be disabled.

alfresco.index.transformContent=false
alfresco.ignore.datatype.1=d:content
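
For a distribution-package installation, the same change can be scripted. The following is a minimal sketch rather than an official tool: set_property is a hypothetical helper that works on the text of a solrcore.properties file, uncommenting or overriding a key if a line for it exists and appending the key otherwise.

```python
import re


def set_property(text, key, value):
    """Return the properties text with key=value set.

    A commented-out or existing line for the key is replaced in place
    (first match only); if there is none, the line is appended.
    """
    pattern = re.compile(
        r"^[ \t]*#?[ \t]*" + re.escape(key) + r"[ \t]*=.*$", re.MULTILINE
    )
    line = f"{key}={value}"
    if pattern.search(text):
        # A function replacement avoids backslash interpretation in the value.
        return pattern.sub(lambda _match: line, text, count=1)
    separator = "" if text == "" or text.endswith("\n") else "\n"
    return text + separator + line + "\n"
```

To use it, read your core's solrcore.properties into a string, pass it through set_property(text, "alfresco.index.transformContent", "false"), and write the result back before restarting Solr. The location of the file depends on where your Solr home is installed.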

To learn more about how to configure these properties, refer to the installation guides if your environment is set up using the distribution package:


If you wish to set the properties for a Docker-based environment, you will have to use a Dockerfile and docker-compose.yml in combination. You can take a look at this repo and this post to understand how a Dockerfile and docker-compose.yml can be used together to build/update base Docker images.

Follow the steps given below to update the properties for a Docker-based environment:
  • Create a folder named configs-to-override in the same folder where you keep your docker-compose.yml file
  • Create a folder named solr within the configs-to-override folder
  • Create a file named Dockerfile under configs-to-override/solr, which will be used for building the alfresco-search-services image with additional instructions to disable indexing behavior
  • Copy the following instructions into the Dockerfile (configs-to-override/solr)

FROM alfresco/alfresco-search-services:2.0.3.5

#To disable content indexing
RUN sed -i '/^bash.*/i sed -i "'"/alfresco.index.transformContent/s/^#//g"'" ${DIST_DIR}/solrhome/templates/rerank/conf/solrcore.properties\n' \
${DIST_DIR}/solr/bin/search_config_setup.sh;

#To disable metadata indexing
#RUN sed -i '/^bash.*/i sed -i "'"/alfresco.ignore.datatype.1/s/^#//g"'" ${DIST_DIR}/solrhome/templates/rerank/conf/solrcore.properties\n' \
#${DIST_DIR}/solr/bin/search_config_setup.sh;

#TODO:: Add more steps as needed

  • Update the solr6 service in docker-compose.yml so that it builds the image from this Dockerfile:

solr6:
    build:
      dockerfile: ./Dockerfile
      context: ./configs-to-override/solr
    mem_limit: 2g
    environment:
      # Solr needs to know how to register itself with Alfresco
      SOLR_ALFRESCO_HOST: "alfresco"
      SOLR_ALFRESCO_PORT: "8080"
      # Alfresco needs to know how to call solr
      SOLR_SOLR_HOST: "solr6"
      SOLR_SOLR_PORT: "8983"
      # Create the default alfresco and archive cores
      SOLR_CREATE_ALFRESCO_DEFAULTS: "alfresco,archive"
      # HTTPS or SECRET
      ALFRESCO_SECURE_COMMS: "secret"
      # SHARED SECRET VALUE
      JAVA_TOOL_OPTIONS: "
          -Dalfresco.secureComms.secret=secret
      "
    ports:
      - "8083:8983" # Browser port

  • Launch the containers again using the following command. This builds the alfresco-search-services image with the updated property:
docker-compose -f ./docker-compose.yml up --build


Note: Disabling or enabling indexing behaviors requires re-indexing the repository.

On a side note, if you want archive or zip files to be unzipped and the files they contain to be included in the index, set the following property:

transformer.Archive.includeContents=true

The default setting is false.


Validation:


To validate whether indexing is actually disabled, I first uploaded a text document containing a line of text and confirmed that, with the default settings, both content and metadata are indexed and the document (text file) is returned in search results for content and metadata queries. I had not applied any of the aforementioned methods to disable indexing at this point.

See the details below:

  • Uploaded a text file containing a line of text "Control indexing in Alfresco with Solr6"
  • Run a content search query like: "indexing in Alfresco"
  • Run a query via Solr admin to see if content is being indexed
  • Run a metadata search query on the name of the file (cm:name), like: "test-indexing.txt"
  • Run a metadata search query on the title of the file (cm:title), like: "IndexControl"


As the results above show, the search queries for content and metadata both return results. Now, let's try disabling content indexing. We expect the metadata queries to return results, but the content query should return 0 results. We will re-run all the tests that we executed above.


  • Following the steps given above (Option 2), I disabled content indexing
  • Deleted the Solr indexes for a full re-index and restarted the servers/containers
  • Verified that the property value is set to false for disabling content indexing

  • Run a content search query like: "indexing in Alfresco" (0 results expected)

  • Run a query via Solr admin to see if content is being indexed (no response expected)

  • Run a metadata search query on the name of the file (cm:name), like: "test-indexing.txt". The result should be returned based on cm:name metadata

  • Run a metadata search query on the title of the file (cm:title), like: "IndexControl". The result should be returned based on cm:title metadata
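
The checks above can also be scripted against the Search REST API. The sketch below is only an illustration under assumptions: the endpoint URL, the helper names and the auth_header argument are hypothetical, and using the AFTS field TEXT to target document content (with cm:name and cm:title for metadata) is my choice for the example, not necessarily how the searches in this post were run.

```python
import json
import urllib.request

# Assumed endpoint of the Search REST API; adjust host and port for your environment.
SEARCH_URL = "http://localhost:8080/alfresco/api/-default-/public/search/versions/1/search"


def build_search_body(field, phrase):
    """Build an AFTS phrase query body for one field, e.g. TEXT or cm:name."""
    escaped = phrase.replace("\\", "\\\\").replace('"', '\\"')
    return {"query": {"language": "afts", "query": f'{field}:"{escaped}"'}}


def count_entries(response):
    """Count the entries in a parsed search response."""
    return len(response.get("list", {}).get("entries", []))


def run_search(field, phrase, auth_header):
    """POST the query and return the number of hits (hypothetical helper, not run here)."""
    body = json.dumps(build_search_body(field, phrase)).encode("utf-8")
    request = urllib.request.Request(
        SEARCH_URL,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "Authorization": auth_header},
    )
    with urllib.request.urlopen(request) as response:
        return count_entries(json.load(response))
```

With content indexing disabled and the repository re-indexed, run_search("TEXT", "indexing in Alfresco", ...) would be expected to return 0, while the cm:name and cm:title queries should still find the file.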



Thank you Angel Borroy for tips :) 












5 comments:

  1. Thanks for sharing. Your articles are very informative, you cover even minor details as well which is really appreciated

  2. Thanks for the nice explanation. Was trying to get understanding on control indexing behavior in alfresco. Very helpful

  3. Thanks Abhinav for informative article. Can we reindex failed documents or a particular document instead of running full reindex????

    Replies
    1. Yes you can, checkout these docs:

      https://docs.alfresco.com/search-services/latest/admin/monitor/#unindexed-transactions

      https://docs.alfresco.com/search-services/latest/admin/restapi/

