Friday, November 09, 2007 3:26 PM
Full Text Indexing Overhead
There's a tendency among DBAs and SQL Server wonks to be a bit dismissive of Full Text Indexing. In a lot of cases, I think people assume it adds significant overhead.
For a long time I looked at it as a more powerful, but more EXPENSIVE, form of searching. Then I had the chance to work with it a good deal (in a past life), and found that in a lot of cases it didn't really add that much overhead. In fact, I found that it was pretty cool technology that afforded a lot of functionality with no noticeable performance hits. I did find, though, that the indexes crashed frequently and needed an extra bit of 'love' to keep them functioning correctly. (This was all on SQL Server 2000.)
But in some cases it can yield insane performance benefits
Fast-forward to today. I'm currently working with a client that has a decent-sized database (about 20GB). More than 5GB of that DB is made up of a 'Books' table - which has about 8M rows. The table has both an author and a title domain, and each of them is pretty wide (nvarchar(300) and above). This table provides the backing for key functionality, and users frequently search for authors or titles using LIKE and wildcards. We all know that's nasty - a pattern with a leading wildcard can't seek on an index, so it forces a scan.
These searches were causing a table scan, which was generating excessive reads and massive amounts of CPU utilization - to the point where you could literally SEE each of these queries show up in Task Manager as the CPU spiked to handle the underlying scan. The image below helps highlight the severity of the problem - each spike you can see represents one of the scanning queries:
I threw an index against each of the columns, figuring that a search on something like '%JRR%Tolkien%' against such a big table would likely prefer an index scan over a table scan. Those two indexes required 533MB of disk to create. And, sure enough, SQL Server used them as soon as they appeared, but that only cut the reads to about 1/4th of what they used to be, and CPU utilization stayed about where it had been before - meaning that you could still SEE these queries being executed by their tell-tale CPU spikes.
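To make the "forces a scan" point concrete, here is a minimal Python sketch of the idea. It is only an analogy, not SQL Server's actual implementation: a sorted list stands in for an ordered index, the sample authors are made up, and the function names `prefix_search` and `substring_search` are hypothetical. A prefix pattern can jump straight to the right spot in sorted data, while a pattern with a leading wildcard has to look at every value.

```python
import bisect

# Made-up sample data standing in for an indexed 'author' column.
authors = sorted([
    "Austen, Jane",
    "Herbert, Frank",
    "Le Guin, Ursula K.",
    "Tolkien, Christopher",
    "Tolkien, J.R.R.",
])

def prefix_search(sorted_values, prefix):
    """Analogue of LIKE 'prefix%': seek to the first candidate, read a contiguous run."""
    start = bisect.bisect_left(sorted_values, prefix)
    matches = []
    for value in sorted_values[start:]:
        if not value.startswith(prefix):
            break  # sorted order guarantees no later value can match
        matches.append(value)
    return matches

def substring_search(values, fragment):
    """Analogue of LIKE '%fragment%': sort order is no help, so every value is examined."""
    examined = 0
    matches = []
    for value in values:
        examined += 1
        if fragment in value:
            matches.append(value)
    return matches, examined

tolkiens = prefix_search(authors, "Tolkien")
jrr, rows_examined = substring_search(authors, "J.R.R.")
```

On five rows the difference is invisible; on 8M wide rows, examining every value is exactly the kind of work that shows up as a CPU spike.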
Enter Full Text Indexing
I figured that since Full Text Indexing actually tokenizes index data, instead of merely storing it in a B-Tree, we'd end up with a much smaller index structure. I also hoped that if we could get a smaller index to work from, we'd see a commensurate decrease in the amount of reads/churn going on on the server.
Creating a Full Text Index on 8M+ rows took a fair while. But I was happy to see that it was only 133MB (compared to the 533MB of the 'traditional' indexes). Since that's about 1/4th the size, I was hopeful that we'd at least see a commensurate drop in reads, to around 1/4th.
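As a rough illustration of what "tokenizes index data" means, here is a small Python sketch of an inverted index: each distinct word is stored once and points at the rows that contain it. This is a simplified model and not how SQL Server's full text engine is actually built; the book rows and the names `tokenize`, `build_inverted_index`, and `search_all_terms` are all assumptions made for the example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(rows):
    """Map each distinct token to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for token in tokenize(text):
            index[token].add(row_id)
    return index

def search_all_terms(index, query):
    """Return the row ids containing every token in the query (set intersection)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Made-up rows standing in for the 'Books' table.
books = {
    1: "The Hobbit by J.R.R. Tolkien",
    2: "The Silmarillion by J.R.R. Tolkien",
    3: "Dune by Frank Herbert",
}
index = build_inverted_index(books)
tolkien_rows = search_all_terms(index, "Tolkien")
hobbit_rows = search_all_terms(index, "tolkien hobbit")
```

A lookup touches only the entries for the words being searched, never the full text of every row. Note that word-based matching is not identical to substring matching: this sketch would not find 'olkie' inside 'Tolkien', whereas LIKE '%olkie%' would.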
Turns out I got way more than I had anticipated. The actual execution plan for the query using LIKE and % was consistently coming in at a cost of around 27 or 28, and each query was taking 5-7 seconds. (Yeah, that puts it almost into the realm of being a 'query from hell' - and it's definitely not something you want people firing off in an 'ad-hoc' fashion, which is exactly what they NEED to do with the query in question.)
With the full text index, the cost of the actual execution plan drops down to .27 - fully 100 times lower than its traditional counterpart. Execution time dropped down to 0 seconds (as reported in Management Studio) as well. In past cases where I had used Full Text Indexing, I was only dealing with tables with fewer than 300K rows. Suffice it to say I was pretty tickled at the insane performance boost provided on an 8M row table with semi-wide domains (author and title) being indexed.
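For anyone who wants to check the arithmetic, the ratios quoted in this post work out as follows. The variable names are just labels for this sketch, and 27 is taken as the low end of the "27 or 28" plan cost.

```python
# Plan costs quoted in the post (27 is the low end of "27 or 28").
like_plan_cost = 27.0
full_text_plan_cost = 0.27

# Index sizes quoted in the post, in MB.
btree_indexes_mb = 533
full_text_index_mb = 133

# How many times cheaper the full text plan is, by plan cost.
cost_ratio = like_plan_cost / full_text_plan_cost

# Full text index size as a fraction of the traditional indexes.
size_fraction = full_text_index_mb / btree_indexes_mb

print(f"plan cost ratio: about {round(cost_ratio)}x")
print(f"index size: about {size_fraction:.0%} of the traditional indexes")
```

Keep in mind that plan cost is the optimizer's own unit rather than a direct measure of elapsed time, so the 100x figure describes the plans. The drop from 5-7 seconds to roughly 0 is the separate, observed timing.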
So, keep those insane performance benefits in mind the next time you're tempted to be a bit dismissive of the 'overhead' that Full Text Indexing might add to your environment.