Exchange Search Indexing Service Pegs CPU and Memory
“The new information technology… Internet and e-mail… have practically eliminated the physical costs of communications.” -Peter Drucker
If only that were true.
A few weeks back CPU and memory usage on all of my mailbox servers pegged almost simultaneously. Customers started calling within minutes complaining of slow response times. Miraculously, mail was still flowing, but RPC, Outlook Anywhere, OWA, and ActiveSync connections all slowed to a crawl.
When Task Manager finally opened after several minutes, it showed msftefd.exe (the filter daemon process that Exchange Search uses for indexing) consuming more than 20 GB of RAM. After what seemed like another eternity waiting for the Services applet, I stopped the search service on all of the servers, and Perfmon showed all counters returning to normal almost immediately.
After I restarted all of the mailbox servers, they seemed to run fine for a while, but not for a long while. Within a couple of hours, it happened again. This time I reset the index for all databases. To do that:
- Stop the Microsoft Exchange Search Indexer service.
- Delete the CatalogData-… folder for each database.
- Start the Microsoft Exchange Search Indexer service again.
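The same reset can be scripted from an elevated Exchange Management Shell. This is a minimal sketch, assuming the service's short name is MSExchangeSearch. $catalogPath is a hypothetical placeholder for the CatalogData-… folder, whose full name is not given above. -WhatIf is left on the delete so that nothing is removed until the path has been checked.

```powershell
# Sketch only: run on each mailbox server, in an elevated shell.
# $catalogPath is a hypothetical placeholder; set it to the database's
# CatalogData-... folder (it normally sits next to the .edb file).
$catalogPath = '<path to CatalogData-... folder>'

# Stop the Microsoft Exchange Search Indexer service.
Stop-Service -Name MSExchangeSearch

# Preview the delete first; remove -WhatIf once the path is confirmed.
Remove-Item -Path $catalogPath -Recurse -Force -WhatIf

# Start the service again; it rebuilds the catalog with a full crawl.
Start-Service -Name MSExchangeSearch
```

Exchange 2010 also ships a ResetSearchIndex.ps1 script in its Scripts folder that covers the same ground, which may be preferable to deleting folders by hand.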
That wasn’t the best plan. I didn’t realize how looong it would take to rebuild the indexes. And, unfortunately, it didn’t do any good. A couple of days later, everything went balmy again. By this time it was clear that there was something wrong with one or more of the databases. The event log showed some ambiguous errors related to several mailboxes on one database, so I disabled indexing on just that one database (Set-MailboxDatabase Database01 -IndexEnabled $false) and then restarted the Indexing service.
Everything went great for about 15 minutes.
Next I checked for corruption in all mailboxes using this PowerShell cmdlet against each database:
New-MailboxRepairRequest -Database Database01 -CorruptionType AggregateCounts,ProvisionedFolder,SearchFolder,FolderView -DetectOnly
That scan completed with no corruption detected, which wasn’t very surprising. New-MailboxRepairRequest only checks for folder-level corruption, not item-level corruption, which is where the problem almost certainly was. As far as I have been able to find, there is no way to scan a mailbox for item-level corruption without exporting it to a PST file and running scanpst.exe against it. That was the last thing I wanted to do.
So I disabled indexing on all databases, restarted the Indexing service on all mailbox servers, and then re-enabled indexing on one database at a time. After enabling indexing on a database, I watched the MSExchange Search Indices\Number of Mailboxes Left to Crawl performance counter to see when indexing was complete. I continued with this until one of the databases pegged the CPU and memory usage about two minutes after I enabled indexing on it. I immediately disabled indexing on that database, waited for things to calm down again, and then continued with the next database. After all the other database indexes were in a Healthy state, I knew that the problem was isolated to the one database.
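The one-database-at-a-time procedure above might look something like this in the Exchange Management Shell. It is a sketch, not a record of the exact commands I ran. Database01 and MailboxServer01 are placeholder names, MSExchangeSearch is assumed to be the service's short name, and the counter path assumes one counter instance per database.

```powershell
# Sketch only. Database01 and MailboxServer01 are placeholder names.

# 1. Turn indexing off everywhere, then restart the search service
#    (repeat the restart on every mailbox server).
Get-MailboxDatabase | Set-MailboxDatabase -IndexEnabled $false
Restart-Service -Name MSExchangeSearch

# 2. Turn indexing back on for a single database.
Set-MailboxDatabase -Identity Database01 -IndexEnabled $true

# 3. Sample the crawl counter every 30 seconds (Ctrl+C to stop).
#    The (*) wildcard asks for every instance of the counter.
Get-Counter -Counter '\MSExchange Search Indices(*)\Number of Mailboxes Left to Crawl' `
    -SampleInterval 30 -Continuous

# 4. Check the index state once the counter reaches zero.
Get-MailboxDatabaseCopyStatus -Server MailboxServer01 |
    Format-Table Name, ContentIndexState -AutoSize
```

If CPU and memory spike shortly after step 2, set -IndexEnabled back to $false on that database and move on to the next one.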
But which mailbox?
My boss and I were both looking at the problem when he noticed that two mailboxes had about ten thousand more items than any other mailbox in the DB. I created a new database (NewDB), indexing disabled, then moved both of these mailboxes to it. Once the move requests were completed, I enabled indexing on NewDB, and msftefd blew up within 10 seconds!
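For reference, the quarantine steps above can be sketched as follows. OldDB, NewDB, MailboxServer01, and <username> are placeholders, and the database is created in the default path because no other location is given here.

```powershell
# Sketch only. OldDB, NewDB, MailboxServer01 and <username> are placeholders.

# List the ten mailboxes with the most items in the suspect database.
Get-MailboxStatistics -Database OldDB |
    Sort-Object ItemCount -Descending |
    Select-Object -First 10 DisplayName, ItemCount, TotalItemSize

# Create and mount a quarantine database, then switch indexing off
# before any mailbox lands in it.
New-MailboxDatabase -Name NewDB -Server MailboxServer01
Mount-Database -Identity NewDB
Set-MailboxDatabase -Identity NewDB -IndexEnabled $false

# Move a suspect mailbox and check on the move.
New-MoveRequest -Identity '<username>' -TargetDatabase NewDB
Get-MoveRequest -TargetDatabase NewDB | Get-MoveRequestStatistics
```

Disabling indexing before the moves matters: otherwise the indexer can start crawling the suspect mailbox as soon as it arrives.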
One of these mailboxes was much larger (size, not number of messages), so I thought it was the more likely of the two to have corruption. More stuff means more stuff to corrupt, right? No, not really. I moved the smaller mailbox out again and reset the index on NewDB. It took a while, but indexing completed successfully. Next, I swapped. I moved the smaller mailbox back to NewDB and the larger to OldDB, then reset the index again.
Instant gratification! This was the guilty mailbox, and we had discovered it half by luck after days of trial and error. But now what? The only way I could think to scan it for corruption was to export it to a PST file (New-MailboxExportRequest -Mailbox <username> -FilePath <unc pathname>), purge the mailbox, and then reimport the PST file with the -BadItemLimit parameter (New-MailboxImportRequest -Mailbox <username> -FilePath <unc pathname> -BadItemLimit <xx>). This would remove corrupted items during the import process, but there was no way to guarantee that it would fix the problem with the index service.
While I was trying to decide what to do next, I re-enabled indexing on OldDB. Some 15 hours later, it completed indexing successfully, so I knew this was the only mailbox having a problem. NewDB was the only database left to index, and it contained only one mailbox.
One mystery was the large number of items in the mailbox in relation to the normal size of the contents. It couldn’t just be a coincidence. What would cause this one mailbox to contain ten thousand more items without taking up any more space than most other mailboxes? So I gave myself full control and opened it up.[1] I didn’t have to look long. The first thing I saw was a long list of NDRs. Thousands of them, in fact. Looking closer, it appeared that the mailbox owner was experiencing an NDR loop caused by monitoring software installed on a non-domain server. The Exchange server didn’t recognize the loop because it was being obfuscated by routing through a mail relay. I don’t know why this would be a problem for the indexing service, but it makes sense to eliminate the obvious potential problems first.
I blocked all incoming mail from email@example.com and contacted the mailbox owner to remove all of the NDRs. When he was done, I purged his Recoverable Items (Search-Mailbox -Identity <username> -SearchDumpsterOnly -DeleteContent). Then I re-enabled search indexing on NewDB and held my breath… Nothing happened. I took one breath and checked the index status (Get-MailboxDatabaseCopyStatus NewDB\MailboxServer01). The status was “Crawling” and the CPU and memory usage were remaining stable. I opened PerfMon and watched the “Number of Mailboxes Left to Crawl” counter. It said 1, of course, and stayed there for about 30 minutes. I checked the index status again. This time, it said “Healthy”!
All databases were indexed. Problem finally solved.
[1] Don’t do this unless the law and your organization’s rules allow for it. In most cases, a mail system administrator will be able to access anything required to do his job as long as he doesn’t snoop, but don’t assume anything.
Where did you look to find the faulty mailbox?
Hi John. I searched for mailboxes with a high item count.
Get-Mailbox -ResultSize Unlimited | Get-MailboxStatistics | Select DisplayName, ItemCount, TotalItemSize | Export-CSV C:\temp\results.csv
I compared any unusually high item counts with the mailbox’s size. Any mailbox with a very high ratio of item count to size would be suspect.
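If you would rather not eyeball the spreadsheet, the ratio can be computed from the exported CSV. This is a rough sketch: Get-ItemDensity is a hypothetical helper I made up for illustration, and it assumes TotalItemSize was exported as text ending in a byte count in parentheses, with commas as the thousands separator. It sticks to syntax that works in PowerShell 2.0.

```powershell
# Sketch only: rank mailboxes by items per MB using the CSV exported above.
# Get-ItemDensity is a hypothetical helper, not an Exchange cmdlet.
function Get-ItemDensity {
    param([Parameter(ValueFromPipeline = $true)] $Row)
    process {
        # Assumes TotalItemSize looks like "1.2 GB (1,288,490,188 bytes)";
        # rows in any other format are skipped.
        if ("$($Row.TotalItemSize)" -match '\(([\d,]+) bytes\)') {
            $bytes  = [double]($Matches[1] -replace ',', '')
            # Floor at 1 MB so empty mailboxes do not divide by zero.
            $sizeMB = [math]::Max($bytes / 1MB, 1)
            New-Object PSObject -Property @{
                DisplayName = $Row.DisplayName
                ItemCount   = [int]$Row.ItemCount
                SizeMB      = [math]::Round($sizeMB, 1)
                ItemsPerMB  = [math]::Round([int]$Row.ItemCount / $sizeMB, 1)
            }
        }
    }
}

# Show the 20 mailboxes with the most items per MB.
Import-Csv C:\temp\results.csv |
    Get-ItemDensity |
    Sort-Object ItemsPerMB -Descending |
    Select-Object -First 20 DisplayName, ItemCount, SizeMB, ItemsPerMB
```

A mailbox full of small, near-identical messages such as NDRs should float toward the top of that list.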
There may be a better method, but I suspect anything that involves searching for message or subject contents might not work since the search indexes are broken.
You could probably also scan the message tracking logs for large numbers of NDR-ish messages directed to a single mailbox.
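A rough sketch of that tracking-log idea follows. The 7-day window and the 'Undeliverable*' subject pattern are my assumptions about what the NDRs look like. Get-MessageTrackingLog only reads the logs of one server at a time, so you may need to add -Server and repeat it for each server that keeps tracking logs.

```powershell
# Sketch only. The 7-day window and the 'Undeliverable*' subject pattern
# are assumptions; adjust both to match the NDRs you actually see.
Get-MessageTrackingLog -EventId DELIVER -Start (Get-Date).AddDays(-7) -ResultSize Unlimited |
    Where-Object { $_.MessageSubject -like 'Undeliverable*' } |
    ForEach-Object { $_.Recipients } |      # emit each recipient address separately
    Group-Object |
    Sort-Object Count -Descending |
    Select-Object -First 10 Count, Name
```

Any recipient with a count far above the rest would be worth a closer look.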
This was a great post. Awesome details. I love reading real life stories with a good outcome. Very nice job!
I realise this is a bit old, but two tips:
1. New-MailboxRepairRequest will scan (and repair) a mounted mailbox that’s corrupted: https://technet.microsoft.com/en-us/library/ff625226(v=exchg.160).aspx
2. Rather than waiting around for the user to delete all those NDRs once you crack it open and see what was going on, use Search-Mailbox with the -DeleteContent switch to get rid of the garbage. With Search-Mailbox, you can filter by something like subject line or sender to get the items you want to get rid of. Do the -LogOnly without -DeleteContent to check the filter syntax first! https://technet.microsoft.com/en-us/library/dd298173(v=exchg.160).aspx
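As a sketch of tip 2, the two passes might look like this. <username>, <admin mailbox>, the NDR-Cleanup-Log folder name, and the subject filter are all placeholders to adapt. As far as I know, -DeleteContent also requires the Mailbox Import Export role, and the earlier caveat about content searches while the indexes are broken may apply here too.

```powershell
# Sketch only. <username>, <admin mailbox>, the folder name and the
# subject filter are placeholders.

# Dry run: log what the query matches without touching the mailbox.
Search-Mailbox -Identity '<username>' -SearchQuery 'subject:"Undeliverable"' `
    -TargetMailbox '<admin mailbox>' -TargetFolder 'NDR-Cleanup-Log' `
    -LogOnly -LogLevel Full

# After reviewing the log, delete the matching items.
Search-Mailbox -Identity '<username>' -SearchQuery 'subject:"Undeliverable"' -DeleteContent
```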
Both of these commands are available in Exchange 2010 and later versions.