These days, hardly anyone searches an online store by browsing through categories or scrolling down long product lists.
There are plenty of onsite search tools available that can make internal site search fast, intuitive, and tailored to any customer's needs.
In this series of articles, we are going to review the functionality of the most popular eCommerce onsite search solutions. The first search toolkit on the list is Sphinx.
What is Sphinx?
Sphinx is an open source search engine with fast full-text search capabilities.
High indexing speed, flexible search capabilities, integration with the most popular database management systems (e.g. MySQL, PostgreSQL), and APIs for various programming languages (PHP, Python, Java, Perl, Ruby, .NET, C++, etc.) all make the search engine popular with thousands of eCommerce developers and merchants.
This is what makes Sphinx stand out:
- high indexing performance (up to 10-15 MB/s on a single core)
- high search performance (up to 150-250 queries/sec on a single core against 1,000,000 documents)
- high scalability (the biggest known cluster indexes up to 3,000,000,000 documents and handles more than 50 million queries per day)
- support for distributed real-time search
- simultaneous support of several fields per document (up to 32 by default) for full-text search
- support for a number of extra attributes on every document (e.g. groups, timestamps, etc.)
- support for stop words
- the ability to handle both single-byte encodings and UTF-8
- support for morphological search
- and dozens more
All in all, Sphinx has more than 50 different features (and this number is constantly growing); the official documentation provides a full overview of the search engine's functionality.
How Sphinx Works
The whole working pattern of the search engine can be summed up in 2 key points:
- using the source table, Sphinx creates its own index database
- next, when you send an API query, Sphinx returns an array of IDs that correspond to those in the source table.
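The two points above can be sketched in a few lines of Python. This is not Sphinx code, just a toy illustration of the pattern; the table contents are made up:

```python
# Simplified sketch of the Sphinx pattern: build an index from the
# source table, then resolve a query to a list of IDs that the
# application maps back to the source rows.

source_table = {
    1: "red leather wallet",
    2: "blue leather belt",
    3: "red cotton shirt",
}

# 1. Build an inverted index: word -> set of document IDs
index = {}
for doc_id, text in source_table.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

# 2. A query returns only IDs; the application then fetches
#    the full rows from the source table
def search(query):
    sets = [index.get(w, set()) for w in query.split()]
    return sorted(set.intersection(*sets)) if sets else []

ids = search("red leather")
rows = [source_table[i] for i in ids]
```

The real index lives in Sphinx's own files and is far more sophisticated, but the two-step division of labor is the same.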
Installing Sphinx on a Server
The installation procedure is pretty easy. Follow the links below for step-by-step installation instructions for your platform:
Here is an example of installing the search engine on CentOS:
wget http://sphinxsearch.com/files/sphinx-2.1.6-1.rhel6.x86_64.rpm
yum localinstall sphinx-2.1.6-1.rhel6.x86_64.rpm
When the installation is complete, Sphinx creates its config file. In the standard scenario, the path is:
/etc/sphinx/sphinx.conf
If you are going to use Sphinx for several projects simultaneously, it's generally advised to create separate folders for the config file, index, and logs.
E.g.
Config path – /etc/sphinx/searchsuite.yasha.web.ra/
Index path – /var/lib/sphinx/searchsuite.yasha.web.ra/
Logs path – /var/log/sphinx/searchsuite.yasha.web.ra/
Configuring Sphinx.conf File
The Sphinx config file consists of 4 sections:
- Data Source
- Index
- Indexer
- Search Daemon
Here is how you can configure each of them:
1. Data Source
source catalogsearch_fulltext # catalogsearch_fulltext - the name of the source
{
    type     = mysql                      # the type of database Sphinx connects to
    sql_host =                            # the host where the remote database is located
    sql_user =                            # a remote database user
    sql_pass =                            # a remote database password
    sql_db   = yasha_searchsuite          # the name of the remote database
    sql_port = 3306                       # optional, default is 3306; the port used to connect to the remote database
    sql_sock = /var/lib/mysql/mysql.sock  # the socket used to connect to the remote database (if necessary)

    sql_query = SELECT fulltext_id, data_index1, data_index2, data_index3, data_index4, data_index5 FROM catalogsearch_fulltext

    sql_attr_uint = fulltext_id           # sql_attr_*: the attributes returned during the search process
    sql_attr_uint = product_id
    sql_attr_uint = store_id

    sql_field_string = data_index1        # sql_field_*: the fields that should be indexed
    ...
    sql_field_string = data_index5

    sql_query_info = SELECT * FROM catalogsearch_fulltext WHERE fulltext_id=$id # additional query
}
2. Index
index catalogsearch_fulltext
{
    source = catalogsearch_fulltext  # the data source
    path   = /var/lib/sphinx/searchsuite.yasha.web.ra/catalogsearch_fulltext  # where the index is stored
    docinfo = extern
    charset_type = utf-8
    min_word_len   = 3  # the minimum number of characters a word needs to be indexed and searched
    min_prefix_len = 0  # 0 - off; > 0 - the minimum prefix length indexed for wildcard searches
    min_infix_len  = 3  # 0 - off; > 0 - the minimum infix length indexed for wildcard searches
}
And here is what some of the settings listed above mean:
Prefixes — indexing prefixes lets you run wildcard searches like 'wordstart*'. If the minimum prefix length is set to a value greater than 0, the Indexer will include all possible keyword prefixes (word beginnings) in addition to the keyword itself.
Thus, in addition to the keyword itself, e.g. ‘example’, Sphinx will add extra ‘exa’, ‘exam’, ‘examp’, ‘exampl’ prefixes to its index.
Note that prefixes shorter than the minimum allowed length will not be indexed.
Infixes — Sphinx is also capable of including infixes (arbitrary word parts) in its index. E.g. with min_infix_len = 2, indexing the keyword "test" would add its parts "te", "es", "st", "tes" and "est" in addition to the main word (with min_infix_len = 3, as in the config above, only "tes" and "est" would be added).
IMPORTANT! It's not possible to enable both of these settings at the same time; if you do, you'll get a fatal error during indexing.
Also, enabling either setting can significantly slow down both indexing and search performance, especially when working with big data volumes.
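To make the difference concrete, here is a small Python sketch (not Sphinx code) of the substring logic these two settings control:

```python
# Illustration of what the Indexer adds for prefixes vs. infixes.

def prefixes(word, min_prefix_len):
    # Every word beginning from min_prefix_len up to (but not
    # including) the full word, which is indexed anyway.
    return [word[:i] for i in range(min_prefix_len, len(word))]

def infixes(word, min_infix_len):
    # Every substring of at least min_infix_len characters.
    out = set()
    for i in range(len(word)):
        for j in range(i + min_infix_len, len(word) + 1):
            out.add(word[i:j])
    out.discard(word)  # the full word is indexed anyway
    return sorted(out)
```

`prefixes("example", 3)` yields exactly the extra entries from the 'example' illustration above, and infix indexing generates strictly more entries than prefix indexing, which is why it is the more expensive of the two.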
3. Indexer
To configure the Indexer, you just need to set an appropriate memory limit for the indexing process.
indexer
{
    mem_limit = 128M  # the memory limit for the indexing process
}
4. Search Daemon
Here are the general Sphinx search daemon settings (with explanatory comments).
searchd
{
    listen = 9312          # the port used to connect to Sphinx
    listen = 9306:mysql41
    log       = /var/log/sphinx/searchsuite.yasha.web.ra/searchd.log  # daemon log file
    query_log = /var/log/sphinx/searchsuite.yasha.web.ra/query.log    # search query log
    read_timeout = 5       # time (in seconds) the daemon waits on a stalled client connection
    max_children = 30      # the maximum number of simultaneously processed queries; 0 - no limit
    pid_file = /var/run/sphinx/searchd.pid  # the file the daemon PID is stored in
    max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 1
    unlink_old = 1
    workers = threads      # required for RT indexes to work
    binlog_path = /var/lib/sphinx/  # binlog location for crash recovery
}
Morphology
After splitting the text into separate words, the morphology preprocessors kick in.
These mechanisms replace different forms of the same word with its base, aka 'normal', form. This lets the search engine match a search query against all forms of the same word in the index.
When Sphinx's morphology algorithms are enabled, the search engine returns the same search results for different forms of a word. E.g. the results may be totally identical for both 'laptop' and 'laptops'.
Sphinx supports 3 types of morphology preprocessors:
- Stemmer
- Lemmatizer
- Phonetic algorithms
1. Stemmer
It’s the simplest and fastest morphology preprocessor. It lets the search engine find a word’s stem (the part of the word that remains unchanged across all its forms) without using any morphological dictionaries.
Basically, the Stemmer removes or replaces certain word suffixes and/or endings.
This morphology preprocessor works fine for most search queries, but there are exceptions. For instance, depending on the stemming rules, 'set' and 'setting' may be treated as 2 unrelated words.
The preprocessor can also treat words that have different meanings but the same stem as identical.
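A toy Python sketch of suffix stripping. The real stemmers in Sphinx are Porter/Snowball based, so this suffix list is purely illustrative:

```python
# A toy suffix-stripping stemmer: remove the longest matching
# suffix, but only if a reasonably long stem remains.

SUFFIXES = ["ings", "ing", "ies", "ed", "es", "s"]

def stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

Even this crude version maps 'laptops' and 'laptop', or 'walking' and 'walked', to the same stem; it also shows the downside, since any words sharing a stem collapse together regardless of meaning.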
To enable the Stemmer, add the following line to the Index section:
morphology = stem_enru
2. Lemmatizer
Unlike the Stemmer, this morphology preprocessor uses morphological dictionaries, which lets the search engine strip the keyword down to its lemma: a proper, natural-language root word.
E.g. the irregular form 'mice' will be reduced to its dictionary form (lemma) 'mouse', something a plain stemmer cannot derive.
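In contrast to suffix stripping, lemmatization is essentially a dictionary lookup. A minimal Python sketch, with a made-up three-entry dictionary standing in for the real morphological dictionaries:

```python
# A lemmatizer looks words up in a dictionary instead of
# stripping suffixes, so it handles irregular forms.

LEMMAS = {
    "mice": "mouse",
    "better": "good",
    "settings": "setting",
}

def lemmatize(word):
    # Unknown words pass through unchanged.
    return LEMMAS.get(word, word)
```

The price of this accuracy is the dictionary itself: it has to be downloaded, stored, and loaded, which is why the Stemmer remains the faster option.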
To use the Lemmatizer, you need to download the morphological dictionaries from the official website at sphinxsearch.com.
The indexer block of the config file has a lemmatizer_base option, which lets you specify the path to the folder where you store the dictionaries.
indexer
{
    ...
    lemmatizer_base = /var/lib/sphinx/data/dict/
}
When done, set the morphology option to either the lemmatize_en or lemmatize_en_all built-in value. In the latter case, Sphinx will apply the Lemmatizer and index all possible root forms of each word.
3. Phonetic Algorithms
At the moment, Sphinx supports 2 phonetic algorithms: Soundex and Metaphone.
Currently, both work for the English language only.
Basically, these algorithms substitute the words of a search query with specially crafted phonetic codes. This lets the search engine treat words that differ in meaning but sound alike as the same.
This kind of search can be of great help when searching by a customer's name or surname.
To enable a phonetic algorithm, set the morphology option to soundex or metaphone.
morphology = metaphone
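For a feel of how such phonetic codes work, here is a compact Python implementation of the classic Soundex algorithm (the standard letter-to-digit table; Sphinx's internal implementation may differ in details):

```python
# Classic Soundex: keep the first letter, map the rest to digits,
# drop vowels and h/w/y, collapse repeated codes, pad to 4 chars.

def soundex(name):
    codes = {**dict.fromkeys("bfpv", "1"),
             **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]
```

Similar-sounding names collapse to the same code, e.g. 'Smith' and 'Smyth' both become S530, which is exactly what makes this useful for name searches.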
Stop Words
The stopwords feature in Sphinx lets the search engine ignore certain words when creating an index and running searches.
All you need to do is create a file with your stop words, upload it to the server, and set the path so Sphinx can find it.
When creating a list of stop words, it's generally recommended to include words that occur so frequently in the text that they have no influence on search results. As a rule, these are articles, prepositions, conjunctions, etc.
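Conceptually, stopword handling boils down to filtering tokens against the list before indexing and searching. A minimal Python sketch with an illustrative word list:

```python
# Tokens found in the stopword set are dropped before they
# ever reach the index or the query matcher.

STOPWORDS = {"a", "an", "the", "of", "in", "and"}

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]
```

A query like "the history of the world" is then matched purely on 'history' and 'world', which keeps the index smaller and the results more relevant.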
With the help of the Indexer, it's possible to build a keyword frequency dictionary, in which all indexed keywords are sorted by how often they occur. You can do that using the --buildstops and --buildfreqs options.
stopwords = /var/lib/sphinx/data/stopwords.txt
stopwords = stopwords-ru.txt stopwords-en.txt
Word Forms
The wordforms feature in Sphinx enables the search engine to deliver the same search results no matter which form of the search query is used. E.g. customers looking for 'iphone 6' or 'i phone 6' will get the same results.
This functionality comes in really handy when you need to define the normal word form in cases where the Stemmer can't. Also, with a file of word forms, you can easily set up a dictionary of search synonyms.
These dictionaries are used to normalize search queries both during indexing and at search time. Hence, to apply changes made to the wordforms file, you need to re-index.
An example of the file:
walks > walk
walked > walk
walking > walk
Note that starting with version 2.1.1, it's possible to use "=>" instead of ">". Starting with version 2.2.4, you can also use multiple destination tokens:
s02e02 => season 2 episode 2
s3 e3 => season 3 episode 3
wordforms = /var/lib/sphinx/data/wordforms.txt
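Conceptually, the wordforms file acts as a rewrite table applied to both document and query tokens. A minimal Python sketch using the mappings from the example above:

```python
# The wordforms rules as a lookup table: any token with an
# entry is rewritten, everything else passes through.

WORDFORMS = {
    "walks":   "walk",
    "walked":  "walk",
    "walking": "walk",
}

def normalize(tokens):
    return [WORDFORMS.get(t, t) for t in tokens]
```

Because the same table is applied when indexing documents and when parsing queries, both sides always agree on the normal form, which is why a re-index is required after editing the file.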
Main Sphinx Commands
And finally, below you can find the list of the commands used for different operations with the search engine:
1. Editing the Sphinx config file:
vi /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf
2. Indexing data from the targeted config sources:
sudo -u sphinx indexer --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf --all --rotate
3. Launching the Search Daemon:
sudo -u sphinx searchd --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf
4. Disabling the Search Daemon:
sudo -u sphinx searchd --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf --stop
5. Checking whether the search engine is functioning correctly (by querying the already created indexes):
sudo -u sphinx search --config /etc/sphinx/searchsuite.yasha.web.ra/sphinx.conf aviator
(instead of 'aviator' you can use any other word)
Working with API
include_once Mage::getBaseDir('lib') . DS . 'Sphinx' . DS . 'sphinxapi.php';

$instance = new SphinxClient();
$instance->SetServer('localhost', 9312);
$instance->SetConnectTimeout(30);
$instance->SetArrayResult(true);
$instance->SetFieldWeights(array(
    'data_index1' => 5,
    'data_index2' => 4,
    'data_index3' => 3,
    'data_index4' => 2,
    'data_index5' => 1,
));
$instance->SetLimits(0, 1000, 1000);
$instance->SetFilter('store_id', array(1, 0));

$result = $instance->Query('*' . $queryText . '*', 'catalogsearch_fulltext');
Bottom Line
In this tutorial, I’ve tried to outline the main aspects of setting up and configuring Sphinx.
As you can see, by using this search engine, you can easily add a custom search to your Magento website.
Questions?
Feel free to leave a comment and I’ll get back to you. 🙂