Friday, April 28, 2023

Control indexing in Alfresco with Search Enterprise


Following my previous post on controlling indexing behavior, I tested the indexing behavior with Alfresco Content Services 7.2.1.3 enterprise and Search Enterprise 3.1.1.

Pre-requisites
  • You have an environment up and running with Alfresco Content Services 7.x and Search Enterprise 3.x
  • You have administrative privileges

Looking for Alfresco Content Services 7.x with Search Enterprise 3.x installation steps? Check out this post:


Content and metadata are indexed by default; this is the out-of-the-box behavior. Indexing with the Elasticsearch connector is driven by events. There are two ways to control content/metadata indexing behavior to fulfil your search and indexing requirements. We will go over both options.

Control indexing behavior with help of content model aspect:


To control the indexing behavior, you can make use of a content model aspect named "cm:indexControl" which has two properties. These properties indicate whether content/metadata should be indexed.

The values of these properties are set to true by default.

<aspect name="cm:indexControl">
	<title>Index Control</title>
	<properties>
		<property name="cm:isIndexed">
			<title>Is indexed</title>
			<type>d:boolean</type>
			<default>true</default>
		</property>
		<property name="cm:isContentIndexed">
			<title>Is content indexed</title>
			<type>d:boolean</type>
			<default>true</default>
		</property>
	</properties>
</aspect>

You can apply the cm:indexControl aspect to nodes and control the indexing behavior by setting the appropriate properties. Note that this approach works only for certain types, such as cm:folder, cm:content and their sub-types. Keep in mind that if a large number of nodes need to be excluded from indexing, this option is not the right choice, because you have to apply the aspect with "cm:isContentIndexed" set to "false" on every one of those nodes. The approach is exactly the same if you are using Alfresco Search Services (Solr6).

In that situation the second option (which we will see next) comes in handy.
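
As an illustration of applying the aspect to an individual node, here is a minimal sketch that creates a node with cm:indexControl already set, using the public REST API (POST /nodes/{parentId}/children). The base URL (localhost:8080), basic authentication and the function names build_create_body and create_node are assumptions made for this example, not part of the original setup. Setting the aspect at creation time also avoids relying on a later update of the node.

```python
import base64
import json
import urllib.request

# Assumed base URL of the ACS public REST API; adjust host and port for your environment.
BASE_URL = "http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1"


def build_create_body(name, index_metadata=True, index_content=False):
    """Build the JSON body that creates a cm:content node carrying cm:indexControl."""
    return {
        "name": name,
        "nodeType": "cm:content",
        "aspectNames": ["cm:indexControl"],
        "properties": {
            "cm:isIndexed": index_metadata,
            "cm:isContentIndexed": index_content,
        },
    }


def create_node(parent_id, name, user, password):
    """Create the node under parent_id and return the id of the new node."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{BASE_URL}/nodes/{parent_id}/children",
        data=json.dumps(build_create_body(name)).encode(),
        method="POST",
    )
    request.add_header("Authorization", f"Basic {token}")
    request.add_header("Content-Type", "application/json")
    with urllib.request.urlopen(request) as response:
        return json.load(response)["entry"]["id"]
```

Calling create_node with a folder id, a file name and valid credentials would create an empty node whose content is flagged as not indexed; the content itself can then be uploaded to that node.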


A known issue with UPDATE event:
  • If you use a folder rule to apply the cm:indexControl aspect
  • Or if you use a script that updates the node to apply the cm:indexControl aspect with cm:isContentIndexed set to false

Indexing with Elasticsearch is driven by events, so when you create or upload content and set the cm:indexControl aspect through a folder rule, the CREATE and UPDATE events occur one after the other.

At the moment the Live Indexing app does not remove the document from the index when the UPDATE event carries the cm:indexControl settings. It appears to only skip indexing the changes.

This is a known issue and is being tracked in this ticket: https://alfresco.atlassian.net/browse/MNT-23347

[EDIT] : As of 01/29/2024 the above known issue is not yet fixed. It seems to be on the 2024-Q2 roadmap. We will have to wait for it.

The following alternatives apply the cm:indexControl aspect at the CREATE event:
  • Create a behavior that uses the "org.alfresco.repo.node.NodeServicePolicies" policies, implement the "onCreateNode" method and set the cm:indexControl aspect there. Learn more here on implementing Behavior Policies.
  • Create a custom aspect that extends cm:indexControl and overrides the default values stated above.
    • For example, if you want to disable content indexing, override cm:isContentIndexed with a default of false.
    • To learn more about content models, aspects and how to apply them, refer to: Content Model Extension Point
<aspect name="demo:customIndexControl">
	<title>Override Index Control to disable content indexing by default</title>
	<parent>cm:indexControl</parent>
	<overrides>
		<property name="cm:isIndexed">
			<default>true</default>
		</property>
		<property name="cm:isContentIndexed">
			<default>false</default>
		</property>
	</overrides>
</aspect>

Thank you Angel Borroy for help on this.

Control indexing behavior via LiveIndexingApp mediation-filter:

Live Indexing App is a component of the Elasticsearch connector that is responsible for indexing nodes (content/metadata). There is a component called Mediation (alfresco-elasticsearch-live-indexing-mediation) which subscribes to alfresco.event.topic (activemq:topic:alfresco.repo.event2) and processes the incoming node events. Its configuration lets you declare four blacklist sets that filter out nodes or attributes from indexing. These blacklists are defined in the file referenced by the alfresco.mediation.filter-file property. The default file is called mediation-filter.yml and must be on the module classpath.

Keep in mind that this option is not the right choice if you need to exclude only specific nodes. It controls indexing behavior globally: a type, aspect or field is either blacklisted for every node or for none.

mediation-filter.yml out of the box (showing blacklisted aspects and fields):

mediation:
  nodeTypes:
  contentNodeTypes:
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload


Where:

nodeTypes: if the node wrapped in the incoming event has a type which is included in this set, the node processing is skipped.

contentNodeTypes: if the node wrapped in the incoming event has a content change associated with it and its type is included in this set, the corresponding content processing won’t be executed. This means nodes of any type in this set won’t have their content indexed in Elasticsearch.

nodeAspects: if the node wrapped in the incoming event has an aspect which is included in this set, the node processing is skipped.

fields: fields listed in this set are removed from the incoming node's metadata. This means fields in this set won’t be sent to Elasticsearch for indexing, and therefore they won’t be searchable.
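
To make the four rules concrete, here is a small Python model of the decision described above. It is only an illustration of the documented semantics, not the connector's actual code: the function name apply_mediation_filter, the dictionary layout of the node and the "contentChanged" flag are all made up for this sketch, and it matches exact type names only (how sub-types are treated is not covered here).

```python
def apply_mediation_filter(node, filters):
    """Return what would be indexed for a node, or None if the node is skipped.

    node: dict with "type", "aspects", "properties" and optionally "contentChanged".
    filters: dict holding the four sets from mediation-filter.yml.
    """
    # nodeTypes: skip the node entirely when its type is blacklisted.
    if node["type"] in filters.get("nodeTypes", set()):
        return None
    # nodeAspects: skip the node entirely when it carries a blacklisted aspect.
    if set(node["aspects"]) & set(filters.get("nodeAspects", set())):
        return None
    # fields: drop blacklisted properties from the metadata.
    blocked = set(filters.get("fields", set()))
    metadata = {k: v for k, v in node["properties"].items() if k not in blocked}
    # contentNodeTypes: metadata is still indexed, but content is not processed.
    index_content = (
        node.get("contentChanged", False)
        and node["type"] not in filters.get("contentNodeTypes", set())
    )
    return {"metadata": metadata, "index_content": index_content}


filters = {
    "nodeTypes": {"demo:invoice"},
    "contentNodeTypes": {"cm:content"},
    "nodeAspects": {"sys:hidden"},
    "fields": {"trx:password"},
}
doc = {
    "type": "cm:content",
    "aspects": ["cm:titled"],
    "properties": {"cm:name": "a.txt", "trx:password": "x"},
    "contentChanged": True,
}
print(apply_mediation_filter(doc, filters))
# {'metadata': {'cm:name': 'a.txt'}, 'index_content': False}
```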

For more details on setting up Elastic Connector and its components visit here.

Disable/Blacklist content indexing:

To disable content indexing we need to blacklist the "cm:content" type under "contentNodeTypes". With this set, node content won't be transformed and content will not be added to Elasticsearch Index. You won't be able to search the document by content.

The updated 'mediation-filter.yml' file looks as follows:

mediation:
  nodeTypes:
  contentNodeTypes:
    - cm:content
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload

Blacklist specific properties from being indexed:

To blacklist one or more specific metadata properties from being indexed, add them under "fields", e.g. "demo:documentIDInternal" and "demo:publisherInternal" (content model properties). These properties won’t be sent to Elasticsearch for indexing, and therefore they won’t be searchable.

The updated 'mediation-filter.yml' file looks as follows:

mediation:
  nodeTypes:
  contentNodeTypes:
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload
    - demo:documentIDInternal
    - demo:publisherInternal


Blacklist specific types from being indexed:

To blacklist one or more types from being indexed, add them under "nodeTypes", e.g. "demo:invoice" (a content model type). Nodes of type "demo:invoice" are then excluded from indexing and can't be searched.

The updated 'mediation-filter.yml' file looks as follows:

mediation:
  nodeTypes:
    - demo:invoice
  contentNodeTypes:
  nodeAspects:
    - sys:hidden
  fields:
    - cmis:changeToken
    - alfcmis:nodeRef
    - cmis:isImmutable
    - cmis:isLatestVersion
    - cmis:isMajorVersion
    - cmis:isLatestMajorVersion
    - cmis:isVersionSeriesCheckedOut
    - cmis:versionSeriesCheckedOutBy
    - cmis:versionSeriesCheckedOutId
    - cmis:checkinComment
    - cmis:contentStreamId
    - cmis:isPrivateWorkingCopy
    - cmis:allowedChildObjectTypeIds
    - cmis:sourceId
    - cmis:targetId
    - cmis:policyText
    - trx:password
    - pub:publishingEventPayload
 

The following property needs to be set to control indexing behavior via mediation-filter:

alfresco.mediation.filter-file or ALFRESCO_MEDIATION_FILTER-FILE => The configuration file which contains fields and node types blacklists. The default value is classpath:mediation-filter.yml

Note: Default mediation-filter config resides in 'alfresco-elasticsearch-live-indexing-shared-xxx.jar' which is a dependency of alfresco-elasticsearch-live-indexing-xxx-app.
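
The examples above all repeat the default entries, which suggests that a custom file replaces the bundled one rather than being merged with it (treat this as an assumption). Under that assumption, a small helper can generate the file from the defaults plus your own entries so the default list is never dropped by accident. The helper name render_mediation_filter and its parameters are hypothetical:

```python
# Default blacklists as shipped in the out-of-the-box mediation-filter.yml shown earlier.
DEFAULT_ASPECTS = ["sys:hidden"]
DEFAULT_FIELDS = [
    "cmis:changeToken", "alfcmis:nodeRef", "cmis:isImmutable",
    "cmis:isLatestVersion", "cmis:isMajorVersion", "cmis:isLatestMajorVersion",
    "cmis:isVersionSeriesCheckedOut", "cmis:versionSeriesCheckedOutBy",
    "cmis:versionSeriesCheckedOutId", "cmis:checkinComment",
    "cmis:contentStreamId", "cmis:isPrivateWorkingCopy",
    "cmis:allowedChildObjectTypeIds", "cmis:sourceId", "cmis:targetId",
    "cmis:policyText", "trx:password", "pub:publishingEventPayload",
]


def render_mediation_filter(node_types=(), content_node_types=(),
                            extra_aspects=(), extra_fields=()):
    """Render mediation-filter.yml text: defaults first, custom entries appended."""
    sections = [
        ("nodeTypes", list(node_types)),
        ("contentNodeTypes", list(content_node_types)),
        ("nodeAspects", DEFAULT_ASPECTS + list(extra_aspects)),
        ("fields", DEFAULT_FIELDS + list(extra_fields)),
    ]
    lines = ["mediation:"]
    for key, values in sections:
        lines.append(f"  {key}:")
        lines.extend(f"    - {value}" for value in values)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write a filter that blacklists the demo:invoice type and one custom property.
    with open("mediation-filter.yml", "w", encoding="utf-8") as handle:
        handle.write(render_mediation_filter(
            node_types=["demo:invoice"],
            extra_fields=["demo:documentIDInternal"],
        ))
```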


Mediation filter can be provided via either of the following:

    • If you are using separate services for each component 

live-indexing-mediation:
        image: quay.io/alfresco/alfresco-elasticsearch-live-indexing-mediation:3.1.1
        depends_on:
            - elasticsearch
            - alfresco
        environment:
            SPRING_ELASTICSEARCH_REST_URIS: http://elasticsearch:9200
            SPRING_ACTIVEMQ_BROKERURL: nio://activemq:61616
            ALFRESCO_MEDIATION_FILTER-FILE: file:/usr/tmp/mediation-filter.yml
        volumes:
            - ./mediation-filter.yml:/usr/tmp/mediation-filter.yml
    • If you are using AIO live indexing service (alfresco-elasticsearch-live-indexing)

live-indexing:
        image: quay.io/alfresco/alfresco-elasticsearch-live-indexing:3.1.1
        depends_on:
            - elasticsearch
            - alfresco
        environment:
            SPRING_ELASTICSEARCH_REST_URIS: http://elasticsearch:9200
            SPRING_ACTIVEMQ_BROKERURL: nio://activemq:61616
            ALFRESCO_MEDIATION_FILTER-FILE: file:/usr/tmp/mediation-filter.yml
            ALFRESCO_ACCEPTED_CONTENT_MEDIA_TYPES_CACHE_BASE_URL: http://transform-core-aio:8090/transform/config
            ALFRESCO_SHAREDFILESTORE_BASEURL: http://shared-file-store:8099/alfresco/api/-default-/private/sfs/versions/1/file/
        volumes:
            - ./mediation-filter.yml:/usr/tmp/mediation-filter.yml

Bind mount syntax -> [SOURCE:]TARGET[:MODE]

- SOURCE can be a named volume or a (relative or absolute) path on the host system. 

- TARGET is an absolute path in the container where the volume is mounted. 

- MODE is a mount option which can be read-only (ro) or read-write (rw) (default).

For more details read docker volumes documentation here.

    • Launch the containers again with the following command so that they pick up the updated configuration:
docker-compose -f ./docker-compose.yml up

  • If your installation is based on the distribution package, pass the following parameter to the live indexing Spring Boot app and start it:

java -jar C:\alfresco-elastic-search-services\alfresco-elasticsearch-live-indexing-3.1.1-app.jar ^
	--alfresco.mediation.filter-file=file:C:\\alfresco-elastic-search-services\\mediation-filter.yml

OR

java -Dalfresco.mediation.filter-file=file:C:\\alfresco-elastic-search-services\\mediation-filter.yml ^
	-jar C:\alfresco-elastic-search-services\alfresco-elasticsearch-live-indexing-3.1.1-app.jar


Note: Any newly created/uploaded content will be handled by the live indexing app. Existing content needs to be re-indexed.


More on Elastic Search connector can be found here.



