Andrew Alumkal

SQL Server Engineering

Accelerated Database Recovery – Deep Dive Part 2

Several rounds of testing were performed to measure the impact of ADR on a database – mainly around tempdb, log generation, version cleanup, and interactions with other features such as CDC, CheckDB, and index rebuilds. Some initial testing around log truncation and rollback times was already covered earlier and was not repeated this time.

Test setup:

  • SQL 2019 CU4
  • Azure Standard D8s v3 machine (8 vCPUs, 32 GB memory)
  • Always On database with 1 read replica
  • RCSI / Snapshot enabled
  • ADR enabled on a separate filegroup on the database (a setup sketch follows this list)
  • Most tests were performed with log backups disabled to see the true amount of log generated
  • Most tests had an open snapshot isolation transaction to ensure versions were not cleaned up immediately after data modifications committed
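
For reference, here is a minimal setup sketch. The database name, filegroup name, and file path are placeholders, not the actual test environment:

-- Hypothetical names/paths for illustration
ALTER DATABASE TestDB ADD FILEGROUP PVS_FG;

ALTER DATABASE TestDB ADD FILE
    (NAME = 'TestDB_PVS', FILENAME = 'F:\Data\TestDB_PVS.ndf', SIZE = 10GB)
    TO FILEGROUP PVS_FG;

-- Enable ADR and place the persisted version store on the dedicated filegroup
ALTER DATABASE TestDB
    SET ACCELERATED_DATABASE_RECOVERY = ON
    (PERSISTENT_VERSION_STORE_FILEGROUP = PVS_FG);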

Transaction log impact

The test was performed by updating 100k rows in a large table.

Since PVS is persisted and needs to be recovered in case of a failure, all operations against the PVS are logged, increasing the amount of log used.
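
A rough sketch of how the log impact can be measured – the table and column names below are placeholders for my test objects:

-- Baseline log usage (log backups disabled for the test)
SELECT used_log_space_in_bytes FROM sys.dm_db_log_space_usage;

-- The 100k-row update under test (dbo.BigTable / SomeInt are placeholder names)
BEGIN TRAN;
UPDATE TOP (100000) dbo.BigTable SET SomeInt = SomeInt + 1;
COMMIT;

-- Compare against the baseline to see the log generated, including PVS logging
SELECT used_log_space_in_bytes FROM sys.dm_db_log_space_usage;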

Version generation comparison – PVS vs. Tempdb version store

The volumes of versions generated by ADR and tempdb are nearly identical, so neither approach appears to add overhead in the amount of versions generated.
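
Both sides of the comparison can be pulled from DMVs – a minimal sketch:

-- PVS size for the current database (ADR)
SELECT DB_NAME(database_id) AS database_name, persistent_version_store_size_kb
FROM sys.dm_tran_persistent_version_store_stats
WHERE database_id = DB_ID();

-- tempdb version store usage per database (traditional RCSI/Snapshot versioning)
SELECT DB_NAME(database_id) AS database_name, reserved_space_kb
FROM sys.dm_tran_version_store_space_usage;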

Time to clear version store

This shows the cleanup time of 17GB of version store. The cleanup was measured from the commit of the last open snapshot transaction to the point where the version store dropped to zero. As you can see, the PVS cleanup is definitely slower than tempdb.
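
The timing itself was manual; a simple polling sketch along these lines can capture when the PVS reaches zero (the tempdb side can be timed the same way against sys.dm_tran_version_store_space_usage):

-- Poll the PVS size until it drops to zero, then print the time
WHILE EXISTS (SELECT 1 FROM sys.dm_tran_persistent_version_store_stats
              WHERE database_id = DB_ID()
              AND persistent_version_store_size_kb > 0)
BEGIN
    WAITFOR DELAY '00:00:10';
END;

PRINT CONCAT('PVS cleared at ', SYSDATETIME());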

Time to clear – Committed vs. aborted transactions in PVS

This shows the cleanup time of 8GB of version store. As you can see, the impact of having a long running aborted transaction on cleanup time is massive. ADR provides the benefit of instant rollback, but will have the overhead of cleanup since all aborted data needs to be overwritten with the last committed data from the version store. Only then can the versions be cleaned up from the PVS.

As seen above, aborted transactions also add transaction log growth while the clean-up process runs, since updating aborted data back to its original state is a fully logged operation.
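
The aborted-transaction backlog and the cleaner's progress are both visible in DMVs – a quick sketch:

-- Aborted transactions whose versions still need to be applied back
SELECT transaction_id, database_id, begin_time
FROM sys.dm_tran_aborted_transactions;

-- Aborted transaction count and cleaner start/end times for this database
SELECT current_aborted_transaction_count,
       aborted_version_cleaner_start_time,
       aborted_version_cleaner_end_time
FROM sys.dm_tran_persistent_version_store_stats
WHERE database_id = DB_ID();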

Version store behavior on secondary replica – TempDB vs PVS

In a traditional set-up, there is usually an impact on TempDB when running queries on read-only AG replicas.

When data is updated on the primary replica:

  1. If Snapshot/RCSI is enabled on the primary, versions are generated for the updated data in tempdb on the primary replica server.
  2. If a read-only replica is present, versions are also automatically generated in the tempdb of the replica as the redo thread applies changes to data pages. Read-only replicas use snapshot isolation for read workloads.
  3. Large, long running data modifications can drastically increase tempdb size on both primary and secondary replicas, since versions need to be maintained for reads on their respective servers.
  4. The time to clear each of these tempdbs depends on the activity of that server. For example, if no query requires the versions on the primary server, the version store is cleared immediately on the primary replica.
  5. If a query on the secondary replica is still running, the version store remains as long as the read-only transaction is open – with no impact on the primary server.

When ADR is enabled, tempdb is not used for versioning on either the primary or secondary replicas. The PVS stays active while queries on either replica are using those versions; once the versions are no longer needed for reads on any replica, they can be cleared from the PVS. So keep in mind that long running queries on any replica now directly impact the size of the database on the primary.

As of now, I don't have details on how this is managed between replicas. I had assumed it might be based on the watermarks reported by the PVS and AG DMVs as mentioned in the troubleshooting section for ADR; however, I saw that the PVS was not being cleared even when the watermarks were the same. A force cleanup was also performed using EXEC sp_persistent_version_cleanup, but did not work. As soon as the snapshot transaction on the secondary was committed, the PVS automatically cleared.
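
For reference, the watermark comparison and forced cleanup looked roughly like this (a sketch, assuming it runs in the context of the ADR database on the primary):

-- PVS watermark for this database
SELECT secondary_low_water_mark
FROM sys.dm_tran_persistent_version_store_stats
WHERE database_id = DB_ID();

-- AG replica watermarks and redo queue
SELECT replica_id, is_primary_replica, low_water_mark_for_ghosts, redo_queue_size
FROM sys.dm_hadr_database_replica_states
WHERE database_id = DB_ID();

-- Forced cleanup attempt (did not clear the PVS in this case)
EXEC sp_persistent_version_cleanup;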

In-Row / Off-Row version store

This is a topic I had no luck on 🙁 I have not been able to identify how SQL Server decides between the in-row and off-row version store, and what features, table sizes, or settings may cause it to fall back to the off-row version store.

The test database I used has consistently used only the off-row version store in every test. The in-row version store is an optimisation that stores the version diff along with the row itself, instead of in the PVS (as long as the updates are small – though what 'small' means is unclear).

According to the whitepaper, deletes are also optimised to no longer be versioned – rows are just marked as deleted and the data stays on the page until it's no longer needed. This was not true for me either, as the PVS size grew with every delete test performed.

Update and delete tests were performed on very small tables with two integer columns – the updates just incrementing one int value by 1. Tests were performed on large tables, small tables, batch deletes, and single deletes, and every test resulted in off-row versions. A sketch of the repro is below.
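
A minimal version of the repro – the table and column names here are placeholders:

-- Small table with two integer columns
CREATE TABLE dbo.InRowTest (id int PRIMARY KEY, val int NOT NULL);

INSERT INTO dbo.InRowTest (id, val)
SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 0
FROM sys.all_objects;

-- The update under test: increment one int value by 1
UPDATE dbo.InRowTest SET val = val + 1;

-- If the diffs went off-row, the PVS size grows; in-row diffs stay on the data page
SELECT persistent_version_store_size_kb
FROM sys.dm_tran_persistent_version_store_stats
WHERE database_id = DB_ID();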

However, after multiple tests, I created a standalone database, ran the exact same workload and data, and was able to produce in-row diffs. I then added every feature to make it match my previous test database (full recovery, RCSI+Snapshot enabled, added to an AG), and it consistently kept using in-row diffs. To narrow down the feature that might be affecting my test database, I created two more test databases from scratch the next day – which again only generated off-row diffs for the exact same data and workload. So I have been getting very inconsistent results and have not yet been able to pinpoint what's causing this.

I was able to learn something from the database that did have the in-row diffs working. In-row diffs are created as long as there are no RCSI/Snapshot queries concurrently reading the data. As soon as there is an active query, SQL uses an off-row diff.

Unfortunately, even when I had no other queries on my other test databases, they still consistently used the off-row version store.

A note from Bob Ward in his SQL Server 2019 Revealed book suggests that off-row versions are only expected for more 'substantial' changes.

This is concerning, since off-row versions apparently are not the ideal situation. My test of incrementing an integer is definitely not a 'substantial' change, so I'm not sure why the engine chooses off-row.

Change Data Capture + ADR

Unfortunately, enabling CDC on a database prevents aggressive log truncation from working. From my testing, the PVS still creates versions and instant rollback still works; however, log truncation reverts to the normal behaviour.

With ADR, during a long running transaction – as long as there are no other impediments to log truncation (AG redo, etc.) – log_reuse_wait_desc in sys.databases usually states 'LOG_BACKUP'. When CDC is enabled, it reverts to 'ACTIVE_TRANSACTION'.
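
The check is straightforward:

-- Usually 'LOG_BACKUP' with ADR; reverts to 'ACTIVE_TRANSACTION' once CDC is enabled
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = DB_NAME();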

How to break aggressive log truncation with ADR:

EXEC sys.sp_cdc_enable_db 

GO

CheckDB / Index Rebuilds

From my testing, there were no noticeable effects on CheckDB or resumable index rebuild operations when enabling ADR on the database, and these processes have no impact on the PVS either. CheckDB and index rebuilds were performed under similar load with ADR on and off.

I did test the new resumable index create feature in SQL 2019, and it turned out to be very disappointing. I aborted an index create after 2 minutes of runtime (90% completion). After resuming from 90%, the index create continued for more than 20 minutes – this happened several times, at which point I gave up and moved on. Keep in mind this happened even with no other workload running on the database. ADR had no impact on this behaviour.

The same index with a resumable index rebuild worked just as expected.
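
For reference, the resumable syntax used in these tests looks like this – the index and table names are placeholders:

-- Resumable online index create (new in SQL 2019)
CREATE INDEX IX_BigTable_SomeInt ON dbo.BigTable (SomeInt)
WITH (ONLINE = ON, RESUMABLE = ON);

-- After aborting/pausing, check reported progress
SELECT name, state_desc, percent_complete
FROM sys.index_resumable_operations;

-- Resume the paused operation
ALTER INDEX IX_BigTable_SomeInt ON dbo.BigTable RESUME;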

Troubleshooting notes

It would be great if there were a 'PVS_reuse_wait_desc' like we have for the T-log to troubleshoot cleanup issues with the PVS; however, it doesn't exist ☹

Microsoft provides a short troubleshooting section, but it remains a fairly manual approach for now. Basic steps are to:

  • Check for long running active transactions on the primary / secondary replicas – the transaction id is provided in sys.dm_tran_persistent_version_store_stats
  • Check the watermark in sys.dm_hadr_database_replica_states and compare it to the PVS DMV. If the watermark is lower for one of the replicas, it could be holding PVS cleanup back. Also check the AG redo queue.
  • Aborted versions from the PVS need to be applied back to their original data pages – this can be tracked by looking at the start and end times of the cleaner in the PVS DMV.
  • If none of these work, check the errorlog for 'VersionCleaner' issues

Some of these checks can be rolled into a single proc to answer "Why is PVS not clearing?" I used a poorly written version of this during testing to identify the sessions holding back the PVS / AG watermarks, and it worked fairly well.
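
A cleaner sketch of those checks, combined into one pass (assuming it runs in the context of the ADR database):

-- PVS size, watermarks, aborted backlog, and cleaner times in one shot
SELECT persistent_version_store_size_kb,
       oldest_active_transaction_id,
       current_aborted_transaction_count,
       secondary_low_water_mark,
       aborted_version_cleaner_start_time,
       aborted_version_cleaner_end_time
FROM sys.dm_tran_persistent_version_store_stats
WHERE database_id = DB_ID();

-- Long running transactions that may be holding versions back
SELECT st.session_id, at.transaction_id, at.transaction_begin_time
FROM sys.dm_tran_active_transactions AS at
JOIN sys.dm_tran_session_transactions AS st
    ON st.transaction_id = at.transaction_id
ORDER BY at.transaction_begin_time;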

Perf counters can be used to track the in-row / off-row versions generated.
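
I won't hard-code counter names here since they vary; a broad filter on the counter DMV surfaces the version-related ones on a given build:

-- List version-related perf counters exposed by this instance
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE counter_name LIKE '%version%';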
