Andrew Alumkal

SQL Server Engineering

Fixing highly fragmented VLFs in OLTP with zero downtime

Plagued by poor recovery times, we recently decided to follow some ‘best practices’ and keep VLF counts low on our core OLTP databases. Most of the recovery time is spent in the redo phase, but we also see a delay between ‘Starting up database’ and the analysis phase (pre-discovery). This isn’t a silver bullet for all our problems, but we decided to fix the issue anyway.

Google “Fix SQL Server VLF” and every hit tells you to simply shrink your log as small as possible and regrow it; maybe even to do it during a downtime, since growing the log (when there’s no free space left) is a blocking operation. But what if your log looks like this?

35% of our VLFs were concentrated in the first 1.4% of the log (under 2GB), and 20% were in the first 300MB. Many of our initial VLFs are around 250KB in size (the log was grown in cringe-worthy 1MB increments for its first several megabytes).

I’m not going to spend time going over transaction log architecture and internals, but you can find most of that information here. If you’re interested in the query used to pull the results above, you can also find it here.
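For reference, a minimal version of that kind of query can be built on `sys.dm_db_log_info` (available in SQL Server 2016 SP2 and later). This is a sketch, not the exact query linked above:

```sql
-- Summarize VLFs per log file to see where fragmentation is concentrated.
-- Requires SQL Server 2016 SP2+ for sys.dm_db_log_info.
SELECT  li.file_id,
        COUNT(*)                                          AS vlf_count,
        MIN(li.vlf_size_mb)                               AS min_vlf_mb,
        MAX(li.vlf_size_mb)                               AS max_vlf_mb,
        SUM(li.vlf_size_mb)                               AS total_log_mb,
        SUM(CASE li.vlf_active WHEN 1 THEN 1 ELSE 0 END)  AS active_vlfs
FROM    sys.dm_db_log_info(DB_ID()) AS li
GROUP BY li.file_id;
```

Ordering the raw rows by `vlf_begin_offset` also makes the skew obvious: you can see the run of tiny VLFs packed into the first few hundred megabytes of the file.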

Won’t shrinking just fix the issue?

In most cases, yes. However, it doesn’t work when your fragmentation is concentrated at the beginning of your log file. Let’s start by taking a look at this super high tech diagram:

This log file has a ton of tiny VLFs making up the first 10GB of log, followed by medium/larger VLFs – highly skewed log fragmentation in a small percentage of the log. This log is 50GB in size with around 1GB of currently used space (keep in mind that we don’t really have 5-10 GB sized VLFs, it’s just for the sake of the diagram). Anyway, you would think that shrinking the log would just lop off the inactive VLFs (grey) on both sides, successfully getting rid of the fragmentation and leaving only the active VLF (green) – and maybe a couple extra as a buffer?

Turns out, it doesn’t work that way. If you run a shrink command, attempting to shrink this 50GB file to 1GB (the current used space for this log) would end up looking like this:

VLFs above 40GB removed

The log would still end up being close to 40GB, although only 1GB is actually used. This is due to the placement of the current active VLF. The next few transactions will now start wrapping around to the beginning of the log, filling up the tiny VLFs almost immediately:

Active VLF wrap around

Once all active transactions on the previously larger VLF are complete, and a checkpoint + log backup is performed, the 30GB VLF section is marked for reuse and is effectively inactive. During this time, new transactions continue filling up the log going forward:

After checkpoint and log backup

Run the shrink again, and you would finally be able to shrink the log (up to the last used VLF), still leaving you with a highly fragmented log. It would end up looking like this:
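To try this yourself, the shrink attempt is just the following (target size is in MB; the `TestDB_log` name matches the examples later in this post, and the backup path is hypothetical):

```sql
USE TestDB;

-- Flush dirty pages and back up the log first, otherwise
-- the reusable VLFs are still marked active and can't be released.
CHECKPOINT;
BACKUP LOG TestDB TO DISK = N'\\backupshare\TestDB_log.trn';  -- hypothetical path

-- Attempt to shrink the log file to 1GB (1024 MB).
-- Only inactive VLFs *after* the active VLF can be released, so the
-- first run typically leaves the file far larger than requested.
DBCC SHRINKFILE (N'TestDB_log', 1024);

-- Check how many VLFs remain (SQL Server 2016 SP2+).
SELECT COUNT(*) AS vlf_count
FROM sys.dm_db_log_info(DB_ID());
```

Running the shrink, generating some log activity, taking another checkpoint + log backup, and shrinking again reproduces the wrap-around behavior described above.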

Let’s take a look at this in SQL Server:

Prior to shrink – inactive VLFs present after the active VLFs
After shrink – only inactive VLFs after the active VLFs are removed. Log will start wrapping around from the start of the log after this.

Quick recap of the problem

When you’re stuck in the case of highly fragmented VLFs in the beginning of your log file, it’s almost impossible to fix by shrinking the log in an active OLTP DB.

  • You would need to shrink it extremely small (under 50-100MB) to get rid of most of your VLFs. Even then, you may still have a bunch in the first 100MB (we had close to 100 VLFs)
  • Getting your log as small as possible means active transactions may trigger log growth, which causes blocking.
  • Dropping your log file (after creating a second file) is also not an option – you’ll be greeted with this error: “The primary data or log file cannot be removed from a database.”

The solution

  1. Create a second log file with ample room to handle several minutes of transactions
  2. Run shrinkfile with emptyfile option on the old log file to force all new transactions to move to the new log
    • DBCC SHRINKFILE (N'TestDB_log', EMPTYFILE)
  3. Monitor the log to make sure that the active portion of the log moves to the new fileid – monitoring scripts can be found here. You will need to run a couple checkpoints followed by log backups during this time
  4. Once all the active portion of log moves to the new file, and a log backup is taken, run the exact same shrinkfile command again in step #2
    • The log file should now be empty, and shrink down to 4-8 VLFs
  5. Immediately after the shrink, grow back the first log:
    • ALTER DATABASE [TestDB] MODIFY FILE ( NAME = N'TestDB_log', SIZE = 8192MB, FILEGROWTH = 1024MB );
  6. Grow back the file in increments back to its original size (more on this later)
  7. Run the empty file command on the new log file to direct transactions back to the old log
    • DBCC SHRINKFILE (N'TestDB_log2_TEMP', EMPTYFILE)
  8. Repeat step #3 – wait until transactions move back to the old log file. This may take several minutes / MBs of transactions to move over. I generated some log by running a few, small online index rebuilds to speed up the process. Remember to run a couple checkpoints + log backups
  9. Once the active portion of the log is moved over back to the old log and log backup is taken, you can now drop the new file
    • ALTER DATABASE TestDB REMOVE FILE [TestDB_log2_TEMP]
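
Put together, the steps above look roughly like the sketch below. File paths and sizes are illustrative, and the monitoring loops in steps 3 and 8 are shown as a single checkpoint + log backup pass here; in practice you repeat them until the active portion of the log actually moves:

```sql
-- 1. Create a temporary second log file with room for several minutes of transactions.
ALTER DATABASE TestDB ADD LOG FILE
    (NAME = N'TestDB_log2_TEMP', FILENAME = N'L:\TestDB_log2_TEMP.ldf', SIZE = 8192MB);

-- 2. Push all new log records to the new file.
DBCC SHRINKFILE (N'TestDB_log', EMPTYFILE);

-- 3. Help the active portion roll over to the new file (repeat as needed).
CHECKPOINT;
BACKUP LOG TestDB TO DISK = N'\\backupshare\TestDB_log.trn';  -- hypothetical path

-- 4. Once the active VLFs are in the new file, run the same command again;
--    the old log should now empty and shrink to a handful of VLFs.
DBCC SHRINKFILE (N'TestDB_log', EMPTYFILE);

-- 5./6. Grow the old log back, in increments (see the note on VLF sizing).
ALTER DATABASE TestDB MODIFY FILE
    (NAME = N'TestDB_log', SIZE = 8192MB, FILEGROWTH = 1024MB);

-- 7. Route transactions back to the original log file.
DBCC SHRINKFILE (N'TestDB_log2_TEMP', EMPTYFILE);

-- 8. Again, checkpoint + log backup until the active portion moves back.
CHECKPOINT;
BACKUP LOG TestDB TO DISK = N'\\backupshare\TestDB_log.trn';

-- 9. Drop the temporary file.
ALTER DATABASE TestDB REMOVE FILE [TestDB_log2_TEMP];
```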

This process should be completely online, and there’s more than enough room in the log to avoid blocking growth operations.

Note on step 6 – you’ll want to grow your log in proper increments so that you create the recommended VLFs. This script by the SQL Tiger team is a great starting point, as it does most of the work for you by generating the log growth scripts. However, it doesn’t appear to have been updated for the VLF growth algorithm changes in SQL Server 2014+: if a growth is less than 1/8th the current size of the log, only one VLF is created for that growth. So if your log is 200GB in size and you grow by 8GB, you end up with a single 8GB VLF. To get around this, I’ve created a script that increases the growth increments proportionally to the size of your log so that you have more consistent VLF sizes. See sample below:
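As an illustration of the idea (not the exact script from this post – names and target sizes are placeholders), the loop below grows the log in chunks of roughly 1/8th of the current size, so on SQL Server 2014+ each growth produces a single VLF whose size scales with the log:

```sql
-- Grow TestDB_log to a target size in proportional increments, so each
-- growth step creates one VLF on SQL Server 2014+. Illustrative only.
DECLARE @target_mb  int = 281600,   -- ~275GB target
        @current_mb int,
        @grow_mb    int,
        @sql        nvarchar(400);

SELECT @current_mb = size / 128     -- sys.master_files.size is in 8KB pages
FROM   sys.master_files
WHERE  database_id = DB_ID(N'TestDB')
  AND  name = N'TestDB_log';

WHILE @current_mb < @target_mb
BEGIN
    -- Integer division keeps each growth at or just under 1/8th of the
    -- current size; cap the last step so we never overshoot the target.
    SET @grow_mb    = @current_mb / 8;
    SET @current_mb = CASE WHEN @current_mb + @grow_mb > @target_mb
                           THEN @target_mb
                           ELSE @current_mb + @grow_mb END;

    -- ALTER DATABASE ... SIZE requires a literal, hence dynamic SQL.
    SET @sql = N'ALTER DATABASE TestDB MODIFY FILE '
             + N'(NAME = N''TestDB_log'', SIZE = '
             + CAST(@current_mb AS nvarchar(12)) + N'MB);';
    EXEC sys.sp_executesql @sql;
END;
```

Because each increment is proportional to the current size, early growths are small (matching the small VLFs at the head of the log) and later growths get larger, instead of one fixed increment producing wildly inconsistent VLF sizes.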

Proportionally increase growth increments to avoid very large VLFs. Growing to ~275GB log


© 2024 Andrew Alumkal
