I recently got into a situation where customer reported that the CPU usage on the BizTalk SQL Server in production environment is shooting to 100% and it’s impacting the processing.
Customer had also enabled the performance counters on BizTalk server and from their perfmon log they came to know that there is Message Publishing throttling state = 6
I started looking a bit deeper into the BizTalkMsgBoxDb with the help of MsgBoxViewer and found that the tracking data from the trackingdata_0_X tables was not being moved to the BAMPrimaryImport. I then checked to make sure that the tracking host was running and it was, it then became a bit more puzzling, the trackingdata_1_X tables were being move to the DTA database, so part of the tracking was working.
TrackingData_0_x: These four tables store the Business Activity Monitoring (BAM) tracked events in the BizTalkMsgBoxDb database for TDDS to move the events to the BAMPrimaryImport database.
TrackingData_1_x: These four tables store the tracked events in the BizTalkMsgBoxDb database for TDDS to move the events to the BizTalkDTADb database.
So basically both the MessageBox database and Tracking database have these sets of table with same structure.
So when you first start the tracking host, it looks at the sequence number in the TDDS_StreamStatus table of DTA database.
Let’s say for example the sequence number is 10039.
So now the TDDS will look for the sequence number greater than 10039 in the MessageBox tracking tables and if gets it will move the data from MessageBox tracking table to DTA or BAM tracking tables. So that’s the way it maintains the order.
But there are some issue here. For example let’s say if you change the sequence number to less than 10039 i.e. 10038 in MessageBox tracking table, then TDDS will never see this row because it always look for the sequence number greater than the sequence number in the TDDS_StreamStatus table of DTA.
There are couple of common reasons for the mismatch sequence numbers
- If you run the clean up script(Msgbox_cleanup_logic.sql) on MessageBox database. This cleanup will truncate the tracking tables in MessageBox Database and hence the sequence number will go out of sequence.
- When there is a burst of heavy load that might create a huge backlog of tracking data in the MsgBox tables.Consequently, TDDS and the jobs will struggle to catch up with the back logs and that can cause the CPU spike.
Once you have the mismatch of sequence number, then data will not move from MessageBox and hence its size will start increasing which can cause database throttling.
Also the query in the sp [dbo].[MsgBoxPerfCounters_GetTrackingDataSize] runs in a continuous loop and takes the high CPU cycles because of large records in TrackingData table.
So when I looked at the current sequence number in the table TDDS_StreamStatus they were much smaller than the Sequence Numbers in the trackingdata_0_X table in the BizTalkMsgBoxDb. So this was the case where a burst of heavy load was put on the system that might have created a huge backlog of Tracking data in the MessageBox tables.
Consequently, TDDS and the jobs were struggling to catch up with the back logs and that should have caused the CPU spike.
After some hours of slow processing the difference in the sequence number went down and hence the CPU usage.
Instead of waiting, I could have tried mitigation strategies, increasing the Message count in DB or multiplier value for Tracking data to stop the Publishing throttling state.
Also in this scenario one should look into the TDDS_FailedTrackingData table where error gets populated whenever there is a tracking failure.