FullText index in MySQL


MySQL is one of the most used relational database management systems. We all know that it is fast and easy to use. In MySQL, storage engines are the components that handle the SQL operations. InnoDB is that widely used storage engine in MySQL. One of the reasons that make SQL operations fast is the index. If you create the index correctly, your query can be very fast. We can create an index on int, char type(one word) columns, but we can also create a FULLTEXT index on the text-based columns where we save multi-word strings. In this blog, I will explain how the FULLTEXT index works.
We can create a full-text index while creating a table or altering the table and creating the index. InnoDB full-text indexes have an inverted index design. What is the inverted index? Inverted index stores a list of words and mapping of that word to the document in which that word appears. When we create the FULLTEXT index, InnoDB creates a hidden column FTS_DOC_ID. This FTS_DOC_ID is used in the index table to map the word to a document. What is an index table? When we create a FULL-TEXT index, then a set of index tables also get created automatically.

Image description
The index table is associated with the indexed table by the hex value. For example, in the above image, the indexed table table_id is 2243 and, in hex, it is 8c3. So index table FTS_00000000000008c3
The first six index table saves the mapping of the word and document id(FTS_DOC_ID) in which that word appears. When any new document(data in a column that has a FULLTEXT index) is inserted, it is tokenized and an individual word(known as a token) is inserted in the index table along with the doc_id of that document and the position of the word in the document. We don’t insert the stop words. StopWords are the list of commonly-used words that we can ignore for the purpose of the FULLTEXT index. YOu can get the default stop words from the table INNODB_FT_DEFAULT_STOPWORD. The words are partitioned among six index tables based on the character set sort weight of each word. If any word is already present in the index table then we update and include the doc_id of the new document. If we do this full process for each document, it may result in numerous insertions in the index table that can make concurrent access to these tables a point of contention. To avoid this problem InnoDB uses the cache. InnoDB temporarily caches index table insertion for recent rows, and holds the insertion until the cache is full. Once the cache is full, InnoDB does batch flushes to disk. The caching and batch flushing behaviour avoids frequent updates to index tables that could result in concurrent access issues during busy insert and update times. The batching technique also, avoids multiple insertions for the same word and minimizes duplicate entries. Instead of flushing each word individually, insertions for the same word are merged and flushed to the disk as a single entry, improving insertion efficiency while keeping index tables as small as possible. INNODB_FT_CACHE_INDEX_TABLE holds the newly inserted rows in a FULLTEXT index.

Index deletion
Deleting a record that has a full-text index column could result in numerous small deletions in the index tables, making concurrent access to these tables a point of contention. To avoid this problem, the DOC_ID of a deleted document is logged in a special FTS_DELETED[1] table whenever a record is deleted from an indexed table, and the indexed record remains in the full-text index. Before returning query results, information in the FTS_DELETED table is used to filter out deleted DOC_ID. The benefit of this design is that deletions are fast and inexpensive. The drawback is that the size of the index is not immediately reduced after deleting records. To remove full-text index entries for deleted records, run OPTIMIZE_TABLE on the indexed table with innodb_optimize_fulltext_only=ONto rebuild the full-text index.

[1]: FTS_00000000000008c3_DELETED in above image contain the document IDs (DOC_ID) for documents that are deleted but whose data is not yet removed from the full-text index. FTS_00000000000008c3_DELETED_CACHE is in-memory version of FTS_00000000000008c3_DELETED. We first save it in cache and once cache is full, flush it to FTS_*_DELETED table.

Leave a Reply

Your email address will not be published. Required fields are marked *

Previous Post

Write typesafe APIs with zodios

Next Post

Setting up the medusa server

Related Posts