So…today I had a very interesting issue. All of a sudden Nagios went red with the message: CRITICAL : – total: 232.82 Gb – used: 212.67 Gb (91%) – free 20.15 Gb (9%) on one of our SQL servers. I was like…”NO! It cannot be. I clearly remember that this machine had over 150 GB of free space yesterday!”
I log in to the machine and head straight for the SQL Server data folder, thinking that some customer has a HUGE log file on their database; so, right click – Select All -> Properties… only 50 GB of data…hmm…not here!
At that moment I notice that the server isn’t as “jumpy” as usual, so I start Performance Monitor and see that the Avg. Disk Queue Length is off the chart! That definitely isn’t right…
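For reference, the same counter can be sampled straight from PowerShell on newer systems (Get-Counter needs PowerShell 2.0+, so it wasn’t an option on that Windows 2003 box); a minimal sketch:

```powershell
# Sample the physical-disk queue length five times, one second apart,
# and print the average. '_Total' aggregates all disks; substitute a
# specific instance (e.g. '0 C:') to watch a single volume.
$samples = Get-Counter -Counter '\PhysicalDisk(_Total)\Avg. Disk Queue Length' `
                       -SampleInterval 1 -MaxSamples 5
$samples.CounterSamples |
    Measure-Object -Property CookedValue -Average |
    Select-Object -ExpandProperty Average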
Since that machine runs Windows 2003, there’s no really good way to see which application is generating so much disk I/O (besides FileMon), so I start FileMon and see that the SQL Server service is writing a HUGE amount of data to its ERRORLOG file.
Navigating to the folder, I see that the ERRORLOG file is over 120 GB, so opening it in Notepad, ConTEXT or any other editor was out of the question, since it would kill the machine completely.
So, let’s turn to PowerShell once more. To be sure not to kill the machine – it’s in production – I moved the file to an idle machine and opened it there.
PowerShell: cat errorlog + ENTER 🙂 After that, PowerShell started listing the entire content of the file, and it took about 15 minutes until it reached the end. AMAZING!
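In hindsight, listing the whole file wasn’t even necessary: Select-String streams a file line by line, so you can count the interesting entries without ever holding 120 GB in memory. A sketch, assuming the default ERRORLOG file name and the error text shown below:

```powershell
# Stream the log and count lines containing the error, without loading
# the file into memory. Measure-Object keeps the pipeline streaming,
# so memory use stays flat no matter how big the file is.
Select-String -Path .\ERRORLOG -SimpleMatch 'filegroup is full' |
    Measure-Object |
    Select-Object -ExpandProperty Count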
Examining the log, I notice (it was hard NOT to notice it) literally millions of these entries:
Could not allocate space for object ‘dbo.tblComments’.’tblComments’ in database ‘DATABASEname’ because the ‘PRIMARY’ filegroup is full. Create disk space by deleting unneeded files, dropping objects in the filegroup, adding additional files to the filegroup, or setting autogrowth on for existing files in the filegroup.
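The message itself lists the remedies: a data file in the PRIMARY filegroup had hit its size cap. As a hedged sketch (assuming SQL Server 2005 or later, and using the placeholder database name from the error), you can check each file’s current size against its cap with sqlcmd:

```powershell
# List the data files of the affected database with their current size
# and the size cap that made the PRIMARY filegroup "full".
# size and max_size are in 8 KB pages, so divide by 128 to get MB;
# max_size = -1 means unlimited growth.
sqlcmd -S . -E -Q "SELECT name, size/128 AS size_mb,
       CASE max_size WHEN -1 THEN NULL ELSE max_size/128 END AS max_size_mb
       FROM DATABASEname.sys.database_files"
```

In this case the cap was there on purpose (a per-customer size limit), so the real fix was on the application side, not raising the limit.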
What happened here?
The site using that database is poorly written and allowed everyone to post comments on the articles without any security check, so it soon reached the size limit set on the database. Digging further into the issue, I saw that bots were posting about 4-5 comments per second, which comes to about 18000 comments/hour…multiply that by the 15-16 hours it took to fill the error log to that size, and we roughly come to 280000 comments…that’s A LOT of comments! 🙂
After disabling the site and sending an email to the customer, the situation on the machine looked like this:
The big drop on the green line is the moment when I disabled the problematic web site.