Problem reading data from secondary when reorganizing clustered index

https://dba.stackexchange.com/questions/192993

10-10-2020

Problem

We have an Always On availability group (AOAG) on SQL Server 2014 SP2 CU5 with 3 nodes. One database has Read Committed Snapshot Isolation turned ON. It contains a large, compressed table, and some of our bigger queries against this table run on a secondary replica.

A nightly job on the primary node reorganizes indexes on several tables. When it reaches the clustered index of this table, we get the following error:

Transaction aborted when accessing versioned row in table 'xxxx' in database 'yyyy'. Requested versioned row was not found because the readable secondary access is not allowed for the operation that attempted to create the version.

At some point, the big queries performed their reads with the READUNCOMMITTED hint. I thought that was the cause of this error, so I removed the hint, but the error is still there.

Any ideas?

Current setup:

The 02 secondary is in synchronous mode
The 03 secondary is in asynchronous mode

Table details

RowCounts: 122.567.668
TotalSpaceMB: 18.460
UsedSpaceMB: 18.238

Definitions:

-- Column names are anonymized as numbers; they are bracketed so the script parses.
CREATE TABLE [dbo].[big_table](
    [ID] [int] NOT NULL IDENTITY(1, 1),
    [1] [int] NULL,
    [2] [datetime] NULL,
    [3] [int] NULL,
    [4] [int] NULL CONSTRAINT [DF_ccc_bUnits] DEFAULT ((0)),
    [5] [money] NULL,
    [6] [money] NULL,
    [7] [int] NULL,
    [8] [int] NULL CONSTRAINT [DF_ccc_MinDays] DEFAULT ((0)),
    [9] [int] NULL,
    [10] [int] NULL,
    [11] [float] NULL,
    [12] [money] NULL,
    [13] [int] NULL,
    [14] [int] NULL,
    [15] [nvarchar] (200) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [16] [money] NULL,
    [17] [money] NULL,
    [18] [int] NULL,
    [19] [int] NULL,
    [20] [money] NULL,
    [21] [money] NULL,
    [22] [money] NULL,
    [23] [money] NULL,
    [24] [money] NULL,
    [25] [datetime] NOT NULL CONSTRAINT [DFcccadded] DEFAULT (getdate()),
    [26] [nvarchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
    [27] [money] NOT NULL CONSTRAINT [DFcccBrf] DEFAULT ((0)),
    [29] [money] NOT NULL CONSTRAINT [DFcccHB] DEFAULT ((0)),
    [30] [money] NOT NULL CONSTRAINT [DFcccFB] DEFAULT ((0)),
    [31] [money] NOT NULL CONSTRAINT [DFcccAllBoards] DEFAULT ((0)),
    [32] [money] NOT NULL CONSTRAINT [DFcccChildBrf] DEFAULT ((0)),
    [33] [money] NOT NULL CONSTRAINT [DFcccChildHB] DEFAULT ((0)),
    [34] [money] NOT NULL CONSTRAINT [DFcccChildFB] DEFAULT ((0)),
    [35] [money] NOT NULL CONSTRAINT [DFcccChildAllBoards] DEFAULT ((0)),
    [36] [int] NULL CONSTRAINT [DFcccShow_1] DEFAULT ((0)),
    [37] [timestamp] NOT NULL,
    [38] [money] NULL,
    [39] [money] NULL,
    [40] [money] NULL,
    [41] [money] NULL,
    [42] [money] NULL,
    [43] [money] NULL,
    [44] [money] NULL,
    [45] [money] NULL,
    [46] [int] NOT NULL CONSTRAINT [DFcccReleaseHour] DEFAULT ((0)),
    [47] [int] NULL,
    [48] [int] NULL,
    [49] [money] NULL,
    [50] [money] NULL,
    [51] [float] NULL
) ON [PRIMARY]
WITH (DATA_COMPRESSION = PAGE)
GO
CREATE UNIQUE CLUSTERED INDEX [IXccc] ON [dbo].[big_table] ([1], [2]) WITH (FILLFACTOR=90, DATA_COMPRESSION = PAGE) ON [PRIMARY]
GO
ALTER TABLE [dbo].[big_table] ADD CONSTRAINT [PKccc] PRIMARY KEY NONCLUSTERED ([ID]) WITH (DATA_COMPRESSION = PAGE) ON [secondary]
GO
CREATE UNIQUE NONCLUSTERED INDEX [IXcccstamp] ON [dbo].[big_table] ([36]) INCLUDE ([1], [2]) WITH (FILLFACTOR=100) ON [PRIMARY]
GO
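
The nightly maintenance step itself is not shown in the post. As a rough sketch, assuming the job issues a plain ALTER INDEX ... REORGANIZE against the clustered index (rather than going through a maintenance plan or a third-party script), the statement that triggers the error would look something like this:

```sql
-- Assumed form of the nightly step; the actual job definition is not in the post.
-- Run on the primary replica; the changes are then replayed (redone) on each secondary.
ALTER INDEX [IXccc] ON [dbo].[big_table] REORGANIZE;
GO
```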

Solution

So, after running out of possible solutions, we opened a support case with Microsoft. They asked us to run a tool to gather some information while the process was running, and they analyzed it afterwards. Here is their answer:

Your SELECT command runs fine if it starts before the reorganize index job is initiated.
The SELECT command fails if it starts after the reorganize job is initiated.
We found that this is expected behavior in an AG.
- Even though read operations do not take shared locks because of row versioning, they do take schema stability (Sch-S) locks, which can block redo operations that are applying DDL changes. DDL operations include ALTER/DROP of tables and views, but not DROP or ALTER of stored procedures.
- In our case, while the reorganize index runs on the primary, the corresponding redo operation runs on the secondary replica and acquires Sch-M (schema modification) locks. When the SELECT command tries to access the same replica, it cannot acquire Sch-S (schema stability) locks because the redo thread already holds the Sch-M locks.
- In this scenario your application generates errors, including timeout errors.
To avoid this kind of situation, it is recommended to schedule the reorganize index task outside business hours.
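
The lock interaction described in that answer can be observed on the readable secondary while the reorganize is being redone. The queries below are only a diagnostic sketch: 'yyyy' is the placeholder database name from the error message, and the filter on the 'DB STARTUP' command assumes the redo thread is reported under that command in sys.dm_exec_requests on your build.

```sql
-- Run on the readable secondary while the reorganize is being redone.

-- 1) Requests that are blocked, plus the (assumed) redo thread.
SELECT r.session_id,
       r.blocking_session_id,
       r.command,
       r.wait_type,
       r.wait_time AS wait_time_ms,
       DB_NAME(r.database_id) AS database_name
FROM sys.dm_exec_requests AS r
WHERE r.blocking_session_id <> 0
   OR r.command = N'DB STARTUP';   -- assumption: redo appears under this command

-- 2) Schema locks held or requested on objects in the database.
SELECT l.request_session_id,
       l.request_mode,             -- Sch-S for readers, Sch-M for redo of DDL
       l.request_status,           -- GRANT, WAIT or CONVERT
       OBJECT_NAME(CAST(l.resource_associated_entity_id AS int),
                   l.resource_database_id) AS locked_object
FROM sys.dm_tran_locks AS l
WHERE l.resource_type = N'OBJECT'
  AND l.resource_database_id = DB_ID(N'yyyy')   -- placeholder database name
  AND l.request_mode IN (N'Sch-S', N'Sch-M');
```

If the second query shows a Sch-M lock on big_table alongside waiting Sch-S requests, that would be consistent with the explanation from support.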

We don't have "off business hours"; we run 24/7/365. It is not a definitive answer, but at least we know the root cause of this issue. So the approach will be to temporarily change the connection string so that the failing task reads from the primary AG node instead of the secondary AG node on the days the reindex runs.
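
After switching the connection string, it may be worth confirming that the task really lands on the primary. A minimal check, assuming the placeholder database name 'yyyy' from the error message and SQL Server 2014 or later (where sys.fn_hadr_is_primary_replica is available), could be run over the task's connection:

```sql
-- Run over the same connection the reporting task uses.
SELECT @@SERVERNAME AS connected_server,
       -- READ_WRITE is expected on the primary, READ_ONLY on a readable secondary
       DATABASEPROPERTYEX(N'yyyy', 'Updateability') AS updateability,
       -- 1 if this instance currently hosts the primary replica of the database
       sys.fn_hadr_is_primary_replica(N'yyyy') AS is_primary_replica;
```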

License: CC-BY-SA with attribution

Not affiliated with dba.stackexchange