The search mechanism of Solr is powerful but complex. It can take some time to configure it in such a way that users can easily find the files they are after.
In this article we will highlight some of the main aspects to take note of and provide examples and links for further reading.
Info: This article can be used for Solr 4, Solr 6, and Solr 7.
Tokens
As explained in Understanding the Solr Search functionality in Enterprise Server, Solr handles data by using 'tokens'.
Example: The sentence "Please, email john.doe@foo.com by 03-09, re: m37-xq." is split into the following tokens: "Please", "email", "john.doe@foo.com", "by", "03-09", "re", "m37-xq".
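As a rough illustration of the example above, the following Python sketch splits a sentence on whitespace and strips punctuation from the start and end of each piece. This is not Solr's tokenizer, and the function name simple_tokens is hypothetical; the sketch only mimics the outcome for this particular sentence.

```python
import string


def simple_tokens(text):
    """Split on whitespace and strip leading/trailing punctuation from each piece."""
    tokens = []
    for chunk in text.split():
        token = chunk.strip(string.punctuation)
        if token:  # skip pieces that consisted of punctuation only
            tokens.append(token)
    return tokens


if __name__ == "__main__":
    print(simple_tokens("Please, email john.doe@foo.com by 03-09, re: m37-xq."))
```

Punctuation inside a piece (such as the dots in the email address or the dash in "03-09") is left alone; only the edges are trimmed.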
The default token length varies between 4 and 15 characters.
Also, Solr by default removes characters such as underscores or dashes from the words that are indexed.
Example: "wi-fi" is indexed as "wi" and "fi".
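The splitting behavior in this example can be approximated with a short Python sketch. It assumes that every character other than a letter or digit acts as a delimiter; the real filter has more options, and the function name split_on_delimiters is hypothetical.

```python
import re


def split_on_delimiters(token):
    """Split a token on any run of characters that are not letters or digits."""
    return [part for part in re.split(r"[^A-Za-z0-9]+", token) if part]


if __name__ == "__main__":
    print(split_on_delimiters("wi-fi"))
    print(split_on_delimiters("summer_2019-final"))
```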
From this, we can conclude that:
- The more tokens exist, the more search results can be returned (potentially too many to be practical).
- Terms shorter than 4 characters will be ignored. For terms that are longer than 15 characters, only the first 15 characters are included.
- When a user enters a search phrase that contains underscores or dashes, no results are displayed.
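To make the effect of the 4 to 15 character range more concrete, here is a minimal Python sketch of n-gram generation. It is a simplified model and not Solr's implementation: the function name ngrams is hypothetical and the order of the output is illustrative only.

```python
def ngrams(term, min_size=4, max_size=15):
    """Return all substrings of term whose length lies between min_size and max_size."""
    grams = []
    for size in range(min_size, max_size + 1):
        # For sizes larger than the term itself, this inner range is empty.
        for start in range(0, len(term) - size + 1):
            grams.append(term[start:start + size])
    return grams


if __name__ == "__main__":
    print(ngrams("map"))    # shorter than min_size: no grams at all
    print(ngrams("photo"))  # grams of 4 and 5 characters
```

In this model, a 3-character term produces no grams, and no gram is ever longer than 15 characters, whatever the length of the original term.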
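To resolve these issues, we can do one or more of the following:
- Stop Solr from splitting words on underscores and dashes by disabling the WordDelimiterFilterFactory class.
- Adjust the n-gram size range, or disable the n-gram filter.
- Keep the original, unsplit token in the index by setting the preserveOriginal option.
Disabling the WordDelimiterFilterFactory class
Do this by commenting out the WordDelimiterFilterFactory filter elements in the schema.xml file, wrapping them between <!-- and --> as follows: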
Step 1. Open the file <Solr installation directory>/schema.xml.
Step 2. Disable (comment-out) the WordDelimiterFilterFactory class.
<fieldType name="textNGram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!--
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="0" splitOnNumerics="0"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!--
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Step 3. Save and close the file.
Step 4. (For Solr 4 only) Re-start Tomcat.
Step 5. Re-index Solr from the Search Server page in Enterprise Server.
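Before re-indexing, it can be useful to confirm that the filter really is commented out in both analyzers. The Python sketch below is one possible check, not part of Enterprise Server or Solr: it parses a schema fragment with the standard library XML parser, which skips comments by default, and lists the filters that remain active. The function name active_filters and the shortened sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical fragment of a schema.xml fieldType definition.
SAMPLE_SCHEMA = """
<fieldType name="textNGram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
    -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
"""


def active_filters(schema_xml, field_type="textNGram"):
    """Map each analyzer type of the given fieldType to its active filter classes."""
    root = ET.fromstring(schema_xml)
    result = {}
    for ft in root.iter("fieldType"):
        if ft.get("name") != field_type:
            continue
        for analyzer in ft.findall("analyzer"):
            # Commented-out elements are not part of the parsed tree.
            result[analyzer.get("type")] = [f.get("class") for f in analyzer.findall("filter")]
    return result


if __name__ == "__main__":
    for analyzer_type, filters in active_filters(SAMPLE_SCHEMA).items():
        print(analyzer_type, filters)
```

To run the check against a real file, read its contents into a string and pass that string in place of SAMPLE_SCHEMA.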
Adjusting or disabling the n-gram filter
By default, Solr's n-gram filter (NGramFilterFactory) is enabled and generates n-gram tokens of 4 to 15 characters. This range is configured in the schema.xml file.
If you adjust the default range or disable the filter, make sure that the change is also reflected in the config_solr.php file by following the steps below.
Step 1. Open the file <Solr installation directory>/schema.xml.
Step 2. Locate any reference to 'solr.NGramFilterFactory':
<filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/>
Step 3. To adjust the default range simply change the minGramSize and/or maxGramSize attributes of the filter, for example:
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="18"/>
Or, to disable the n-gram filter, comment out the filter element as follows:
<!-- <filter class="solr.NGramFilterFactory" minGramSize="4" maxGramSize="15"/> -->
Step 4. Save and close the file.
Step 5. Open the config_solr.php file. It is located in the following folder:
- <Enterprise installation directory>/config/
Tip: (For Enterprise Server 10.1 or higher only) Easily manage and configure settings of all configuration files by adding them to a single configuration file.
Step 6. Locate the SOLR_NGRAM_SIZE option:
//Defines the range used for NGRAM size
define ('SOLR_NGRAM_SIZE', serialize(array(
4, // MinGramSize
15, // MaxGramSize
)));
Step 7. To adjust the default range, simply change the values configured for the MinGramSize and/or MaxGramSize options, for example:
//Defines the range used for NGRAM size
define ('SOLR_NGRAM_SIZE', serialize(array(
3, // MinGramSize
18 // MaxGramSize
)));
Or, to disable the n-gram filter, comment out the option as follows:
// Defines the range used for NGRAM size
//define ('SOLR_NGRAM_SIZE', serialize(array(
// 4, // MinGramSize
// 15 // MaxGramSize
//)));
Step 8. Save and close the file.
Step 9. (For Solr 4 only) Re-start Tomcat.
Step 10. Re-start Solr.
Step 11. Re-index Solr from the Search Server page in Enterprise Server.
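Because the range has to be kept identical in two files, a small consistency check can help catch mismatches. The Python sketch below is only an illustration under simplifying assumptions: it strips // comments naively (which would also cut through a // inside a string), expects the define statement in the form shown above, and uses hypothetical function names and sample strings.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical samples that mirror the adjusted examples in the steps above.
SAMPLE_SCHEMA = '<analyzer type="index"><filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="18"/></analyzer>'
SAMPLE_PHP = """
//Defines the range used for NGRAM size
define ('SOLR_NGRAM_SIZE', serialize(array(
3, // MinGramSize
18 // MaxGramSize
)));
"""


def schema_ngram_range(schema_xml):
    """Return (min, max) of the first active NGramFilterFactory, or None if there is none."""
    root = ET.fromstring(schema_xml)
    for flt in root.iter("filter"):
        if flt.get("class") == "solr.NGramFilterFactory":
            return int(flt.get("minGramSize")), int(flt.get("maxGramSize"))
    return None


def php_ngram_range(php_source):
    """Return (min, max) from an active SOLR_NGRAM_SIZE define, or None if it is commented out."""
    without_comments = re.sub(r"//[^\n]*", "", php_source)  # naive comment removal
    match = re.search(
        r"define\s*\(\s*'SOLR_NGRAM_SIZE'\s*,\s*serialize\s*\(\s*array\s*\(([^)]*)\)",
        without_comments,
    )
    if match is None:
        return None
    numbers = [int(n) for n in re.findall(r"\d+", match.group(1))]
    return tuple(numbers[:2]) if len(numbers) >= 2 else None


if __name__ == "__main__":
    schema_range = schema_ngram_range(SAMPLE_SCHEMA)
    php_range = php_ngram_range(SAMPLE_PHP)
    print("schema.xml:", schema_range, "config_solr.php:", php_range)
    print("consistent" if schema_range == php_range else "MISMATCH")
```

Both functions return None when the setting is disabled, so a filter that is commented out in one file but not in the other also shows up as a mismatch.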
Preserving the original token
Step 1. Open the file <Solr installation directory>/schema.xml.
Step 2. Look up the fieldType block starting with <fieldType name="textNGram".
This block contains 2 definitions for 'solr.WordDelimiterFilterFactory' (one in analyzer type="index" and one in analyzer type="query").
Step 3. Have the original token indexed without modifications by adding preserveOriginal="1" to these definitions, for example:
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="0" generateNumberParts="0" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0"/>
Step 4. Save and close the file.
Step 5. (For Solr 4 only) Restart Tomcat (it might also be necessary to restart Apache).
The changes made to schema.xml now need to be applied to the objects that are already indexed. This is done by re-indexing them.
Step 6. Access the Search Server Maintenance page in Enterprise Server.
Step 6a. Click Integrations in the Maintenance menu or on the Home page.
Step 6b. Click Search Server.
Step 7. In the Indexing section, click Clear, followed by Start.
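The effect of preserveOriginal can be pictured with a simplified Python model of the word delimiter filter. This is a sketch under assumptions, not Solr's implementation: the function word_delimiter and its keyword arguments are hypothetical stand-ins for the preserveOriginal, generateWordParts, and catenateWords options, and only letters and digits are treated as part of a word.

```python
import re


def word_delimiter(token, preserve_original=False, generate_word_parts=True, catenate_words=False):
    """Simplified model: return the tokens produced for one input token."""
    parts = [part for part in re.split(r"[^A-Za-z0-9]+", token) if part]
    produced = []
    if preserve_original:
        produced.append(token)           # keep the token exactly as entered
    if generate_word_parts:
        produced.extend(parts)           # the separate parts, such as "wi" and "fi"
    if catenate_words and len(parts) > 1:
        produced.append("".join(parts))  # the parts joined together, such as "wifi"
    return list(dict.fromkeys(produced))  # remove duplicates, keep order


if __name__ == "__main__":
    print(word_delimiter("wi-fi"))
    print(word_delimiter("wi-fi", preserve_original=True,
                         generate_word_parts=False, catenate_words=True))
```

In this model, a token containing a dash remains available in its original form once the original is preserved, so a search phrase that contains the dash has something to match against.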
Additional configuration settings
The examples provided above are just a few of the many possible solutions for getting better search results. Which of these solutions you need for your scenario depends on many factors (such as the type of characters used in file names and the length of file names, both typically controlled by file naming conventions).
Correctly configuring Solr for your environment requires a good understanding of the concepts used by Solr and an awareness of the available settings that can be configured.
We therefore advise going through the Solr documentation, such as Analyzers, Tokenizers, and Token Filters.