Configuring Solr to control the search results in Enterprise 10

Solr's search mechanism is powerful but complex, and it can take some time to configure it so that users can easily find the files they are looking for.

This article highlights the main aspects to be aware of and provides examples and links for further reading.

Info: This article applies to Solr 4, Solr 6, and Solr 7.

Tokens

As explained in Understanding the Solr Search functionality in Enterprise Server, Solr handles data by using 'tokens'.

Example: The following sentence:

"Please, email john.doe@foo.com by 03-09, re: m37-xq."

is split into the following tokens:

"Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq"

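The split shown above can be approximated with a few lines of Python. This is only a minimal sketch to make the idea of tokens concrete: the function name simple_tokens is hypothetical, and the logic (split on whitespace, then trim punctuation from both ends) is an assumption chosen to reproduce this example, not Solr's actual tokenizer.

```python
import string

def simple_tokens(text):
    """Split on whitespace, then trim punctuation from both ends of each piece.

    Illustrative approximation only; Solr's tokenizers follow their own rules.
    """
    tokens = []
    for piece in text.split():
        token = piece.strip(string.punctuation)
        if token:  # skip pieces that consisted of punctuation only
            tokens.append(token)
    return tokens

sentence = "Please, email john.doe@foo.com by 03-09, re: m37-xq."
print(simple_tokens(sentence))
```

Punctuation inside a token (such as the dots in the email address or the dash in "03-09") is kept at this stage; only the leading and trailing punctuation is removed.
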
The default token length varies between 4 and 15 characters.

Also, Solr by default removes characters such as underscores or dashes from the words that are indexed.

Example: "wi-fi" is indexed as "wi" and "fi".

From this, we can conclude that:

The more tokens exist, the more search results can be returned (potentially too many to be practical).
Terms shorter than 4 characters will be ignored. For terms that are longer than 15 characters, only the first 15 characters are included.
When a user enters a search phrase that contains underscores or dashes, no results are displayed.
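
To see why short terms such as "wi" and "fi" cannot be found, it can help to model how n-grams are built. The following is a simplified sketch, assuming that every contiguous substring with a length within the configured range is generated; the function name ngrams is hypothetical and this is not Solr's own code.

```python
def ngrams(token, min_size=4, max_size=15):
    """Return every contiguous substring of token with a length of min_size..max_size.

    Simplified model of n-gram generation; the defaults mirror the 4 to 15 range.
    """
    grams = []
    for size in range(min_size, max_size + 1):
        # For sizes larger than the token, this inner range is empty.
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams

print(ngrams("wi"))      # a term shorter than 4 characters yields no n-grams
print(ngrams("layout"))  # n-grams of 4, 5 and 6 characters
```

In this model, a term that is shorter than the minimum size produces no n-grams at all, so there is nothing in the index for a search to match.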

To resolve these issues, we can:

Stop Solr from tokenizing on subwords

Do this by disabling (commenting out) the WordDelimiterFilterFactory class in the schema.xml file: wrap the filter elements between the <!-- and --> comment markers as follows:

Step 1. Open the file <Solr installation directory>/schema.xml.

Step 2. Disable (comment-out) the WordDelimiterFilterFactory class.

<fieldType name="textNGram" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <!--
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" splitOnNumerics="0"/>
  -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>

 <analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!--
  <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"/>
  -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
 </analyzer>
</fieldType>

Step 3. Save and close the file.

Step 4. (For Solr 4 only) Restart Tomcat.

Step 5. Re-index Solr from the Search Server page in Enterprise Server.
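
Before re-indexing, it can be useful to confirm that the filter is really commented out in both analyzers. The sketch below is one possible way to check this, assuming the schema is well-formed XML; the function name active_filters and the shortened sample schema are hypothetical. It relies on the fact that Python's default XML parser skips comments, so only the filters that are still active are listed.

```python
import xml.etree.ElementTree as ET

def active_filters(schema_xml, field_type="textNGram"):
    """Map each analyzer type to the filter classes that are not commented out."""
    root = ET.fromstring(schema_xml)
    result = {}
    for ft in root.iter("fieldType"):
        if ft.get("name") != field_type:
            continue
        for analyzer in ft.findall("analyzer"):
            # Commented-out elements are not part of the parsed tree.
            result[analyzer.get("type")] = [f.get("class") for f in analyzer.findall("filter")]
    return result

# Shortened, hypothetical schema used for illustration only.
sample = """
<schema>
  <fieldType name="textNGram" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!--
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
      -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</schema>
"""

print(active_filters(sample))
```

To run the check against a real installation, read the contents of the schema.xml file into a string and pass it to the function. If solr.WordDelimiterFilterFactory still appears in the result, the comment markers are not placed correctly.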

Customize the character range of the search token

By default, n-gram tokenization is enabled in Solr (through the solr.NGramFilterFactory filter) and generates n-gram tokens of 4 to 15 characters. This is configured in the schema.xml file.

If you want to adjust the default range or disable n-gram tokenization, make sure that the change is also reflected in the config_solr.php file by following the steps below.

Step 1. Open the file <Solr installation directory>/schema.xml.

Step 2. Locate any reference to 'solr.NGramFilterFactory':

<filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>

Step 3. To adjust the default range, simply change the minGramSize and/or maxGramSize attributes of the filter, for example:

<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="18"/>

Or, to disable the n-gram tokenizer, comment-out the filter option as follows:

<!--
<filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
-->

Step 4. Save and close the file.

Step 5. Open the config_solr.php file.

Tip: (For Enterprise Server 10.1 or higher only) Easily manage and configure settings of all configuration files by adding them to a single configuration file.

Step 6. Locate the SOLR_NGRAM_SIZE option:

//Defines the range used for NGRAM size
define ('SOLR_NGRAM_SIZE', serialize(array(
   4,  // MinGramSize
   15, // MaxGramSize
)));

Step 7. To adjust the default range, simply change the values configured for the MinGramSize and/or MaxGramSize options, for example:

//Defines the range used for NGRAM size
define ('SOLR_NGRAM_SIZE', serialize(array(
   3, // MinGramSize
   18 // MaxGramSize
)));

Or, to disable the n-gram tokenizer, comment-out the option as follows:

// Defines the range used for NGRAM size
//define ('SOLR_NGRAM_SIZE', serialize(array(
//   4, // MinGramSize
//   15 // MaxGramSize
//)));

Step 8. Save and close the file.

Step 9. (For Solr 4 only) Restart Tomcat.

Step 10. Restart Solr.

Step 11. Re-index Solr from the Search Server page in Enterprise Server.
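
Because the range has to be changed in two files, it is easy for schema.xml and config_solr.php to drift apart. The following sketch shows one way to compare them. All function names are hypothetical, and the handling of the PHP file is a deliberately naive assumption: it strips // comments and looks for the first two numbers in the SOLR_NGRAM_SIZE array.

```python
import re
import xml.etree.ElementTree as ET

def schema_ngram_ranges(schema_xml):
    """Collect (minGramSize, maxGramSize) from every active NGramFilterFactory."""
    root = ET.fromstring(schema_xml)  # commented-out filters are skipped by the parser
    return {
        (int(f.get("minGramSize")), int(f.get("maxGramSize")))
        for f in root.iter("filter")
        if f.get("class") == "solr.NGramFilterFactory"
    }

def php_ngram_range(php_source):
    """Return (min, max) from SOLR_NGRAM_SIZE, or None if it is missing or commented out."""
    # Naive: drop everything after // on each line so a disabled define is not matched.
    active = "\n".join(line.split("//", 1)[0] for line in php_source.splitlines())
    match = re.search(
        r"define\s*\(\s*'SOLR_NGRAM_SIZE'.*?array\(\s*(\d+)\s*,\s*(\d+)", active, re.S
    )
    return (int(match.group(1)), int(match.group(2))) if match else None

def in_sync(schema_xml, php_source):
    """True when both files use the same range, or when both have n-grams disabled."""
    ranges = schema_ngram_ranges(schema_xml)
    configured = php_ngram_range(php_source)
    if configured is None:
        return not ranges
    return ranges == {configured}

schema = '<schema><filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="18"/></schema>'
php = """//Defines the range used for NGRAM size
define ('SOLR_NGRAM_SIZE', serialize(array(
   3, // MinGramSize
   18 // MaxGramSize
)));"""

print(in_sync(schema, php))
```

A result of False means that the two files do not agree and one of them still needs to be updated before re-indexing.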

Configure Solr search to find objects with underscores or dashes

Step 1. Open the file <Solr installation directory>/schema.xml.

Step 2. Look up the fieldType block starting with <fieldType name="textNGram".

This block contains 2 definitions for 'solr.WordDelimiterFilterFactory' (in analyzer type="index" and "query").

Step 3. Have the original token indexed without modifications by adding the attribute preserveOriginal="1" to the filter, for example:

<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"/>

Step 4. Save and close the file.

Step 5. (For Solr 4 only) Restart Tomcat (you might also need to restart Apache).

The changes made to schema.xml now need to be applied to the database objects by re-indexing them.

Step 6. Access the Search Server Maintenance page in Enterprise Server.

Step 7. In the Indexing section, click Clear, followed by Start.
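
The effect of preserveOriginal can be illustrated with a small model of the word delimiter step. This is a rough sketch under stated assumptions: the function word_delimiter is hypothetical, it only splits on dashes and underscores, and it ignores the other options of the real filter (such as number parts and case changes).

```python
import re

def word_delimiter(token, preserve_original=False, generate_word_parts=True,
                   catenate_words=False):
    """Simplified model of splitting a token on dashes and underscores."""
    parts = [p for p in re.split(r"[-_]", token) if p]
    if len(parts) <= 1:
        return [token]  # nothing to split
    out = []
    if preserve_original:
        out.append(token)           # keep the unmodified token, e.g. "wi-fi"
    if generate_word_parts:
        out.extend(parts)           # e.g. "wi", "fi"
    if catenate_words:
        out.append("".join(parts))  # e.g. "wifi"
    return out

# Options modeled on the filter line above, without and with preserveOriginal.
print(word_delimiter("wi-fi", generate_word_parts=False, catenate_words=True))
print(word_delimiter("wi-fi", preserve_original=True,
                     generate_word_parts=False, catenate_words=True))
```

In this model, the original token "wi-fi" only ends up in the output when preserve_original is enabled, which is why a search phrase containing a dash or underscore can then be matched.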

Additional configuration settings

The examples provided above are just a few of the many possible solutions for getting better search results. Which of these solutions you need for your scenario depends on many factors (such as the type of characters used in file names and the length of file names, both typically controlled by file naming conventions).

Correctly configuring Solr for your environment requires a good understanding of the concepts used by Solr and an awareness of the available settings that can be configured.

We therefore advise reading the Solr documentation, such as Analyzers, Tokenizers, and Token Filters.
