PostgreSQL – Monitoring Replication

By Enrique 21/09/2023 21/03/2024 PostgreSQL, Database Administration, High Availability / Disaster Recovery

What is replication lag? How can we monitor Replication lag? Replication lag does not occur in most setups; however, it is important to monitor the entire endpoints of replication to ensure that our data is safe.

What is Replication Lag? Read More

The data can be replicated from the primary/master node to the replica/slave node after the transaction has been committed on the primary node. The replica is never ahead of the primary. It is usually a little behind. This delay is called lag.

Monitoring Primary Server

The system view to monitor Replication on the Primary node is pg_stat_replication.

\d pg_stat_replication
                    View "pg_catalog.pg_stat_replication"
      Column      |           Type           | Collation | Nullable | Default 
------------------+--------------------------+-----------+----------+---------
 pid              | integer                  |           |          | 
 usesysid         | oid                      |           |          | 
 usename          | name                     |           |          | 
 application_name | text                     |           |          | 
 client_addr      | inet                     |           |          | 
 client_hostname  | text                     |           |          | 
 client_port      | integer                  |           |          | 
 backend_start    | timestamp with time zone |           |          | 
 backend_xmin     | xid                      |           |          | 
 state            | text                     |           |          | 
 sent_lsn         | pg_lsn                   |           |          | 
 write_lsn        | pg_lsn                   |           |          | 
 flush_lsn        | pg_lsn                   |           |          | 
 replay_lsn       | pg_lsn                   |           |          | 
 write_lag        | interval                 |           |          | 
 flush_lag        | interval                 |           |          | 
 replay_lag       | interval                 |           |          | 
 sync_priority    | integer                  |           |          | 
 sync_state       | text                     |           |          | 
 reply_time       | timestamp with time zone |           |          |

Important Fields in pg_stat_replication

Let’s talk about the fields in detail.

pid: This represents the process ID of the wal_receiver process in charge of this streaming connection. If you check the process table in your operating system, you should find a PostgreSQL process with exactly this number
usesysid: OID of user which is used for Streaming replication.
usename: This (not username; mind the missing “r”) stores the name of the user related to usesysid. It is what the client has put into the connection string
application_name: Application name connected to master
client_addr: This will tell you where the streaming connection comes from. It holds the IP address of the client.
client_hostname: Hostname of standby.
client_port: This is the TCP port number the client is using for communication with the particular WAL sender. -1 will be shown if local UNIX sockets are used.
backend_start: Start time when SR connected to Master.
backend_xmin: The transaction ID reported by hot_standby_feedback (which is the oldest transaction ID on a slave). It can make sense to monitor the difference between the current transaction ID on the master and the oldest one reported by the slave as well, to check whether there are unusually high differences
state: Current WAL sender state i.e streaming
sent_lsn: This represents the last transaction log position sent to the connection.
write_lsn: Last transaction written on disk at standby.
flush_lsn: This is the last location that was flushed to the standby system. Mind the difference between writing and flushing here. Writing does not imply flushing.
replay_lsn: Last transaction flush on disk at standby.
write_lag: Elapsed time during committed WALs from primary to the standby (but not yet committed in the standby)
flush_lag: Elapsed time during committed WALs from primary to the standby (WAL’s has already been flushed but not yet applied)
replay_lag: Elapsed time during committed WALs from primary to the standby (fully committed in standby node)
sync_priority: Priority of standby server being chosen as synchronous standby
sync_state: Sync State of standby (is it async or synchronous).

Below is a sample query output with a single replica:

When you see an entry in the system view, that means that there is one ACTIVE stream. Otherwise, there is none. You won’t see any entry.

If the values of all *_lsn columns are the same, that means that the Replica has caught up 100%.

sent_lsn: Last write-ahead log location sent on this connection (How much WAL has been sent over the network already?).

write_lsn: Last write-ahead log location written to the operating system (without flushing).

flush_lsn: Last write-ahead log location flushed to disk by this standby server.

replay_lsn: Last write-ahead log location replayed into the database on this standby server (Data that is already visible to end-users).

The following is an illustration of the data flow.

Monitoring Replica Servers

Replica server has incoming replication and remains in recovery mode so that it can replay the WALs/records as they come in. There is a view called pg_stat_wal_receiver.

\d pg_stat_wal_receiver
                      View "pg_catalog.pg_stat_wal_receiver"
        Column         |           Type           | Collation | Nullable | Default 
-----------------------+--------------------------+-----------+----------+---------
 pid                   | integer                  |           |          | 
 status                | text                     |           |          | 
 receive_start_lsn     | pg_lsn                   |           |          | 
 receive_start_tli     | integer                  |           |          | 
 written_lsn           | pg_lsn                   |           |          | 
 flushed_lsn           | pg_lsn                   |           |          | 
 received_tli          | integer                  |           |          | 
 last_msg_send_time    | timestamp with time zone |           |          | 
 last_msg_receipt_time | timestamp with time zone |           |          | 
 latest_end_lsn        | pg_lsn                   |           |          | 
 latest_end_time       | timestamp with time zone |           |          | 
 slot_name             | text                     |           |          | 
 sender_host           | text                     |           |          | 
 sender_port           | integer                  |           |          | 
 conninfo              | text                     |           |          |

Monitoring Replication Slots

The view pg_replication_slots gives one row for each replication slot in the primary. A replication slot guarantees that the WAL will not vanish if the replica is lagging behind. A primary will recycle its WAL as soon as it doesn’t need it on its own anymore if there are no replication slots.

\d pg_replication_slots
             View "pg_catalog.pg_replication_slots"
       Column        |  Type   | Collation | Nullable | Default 
---------------------+---------+-----------+----------+---------
 slot_name           | name    |           |          | 
 plugin              | name    |           |          | 
 slot_type           | text    |           |          | 
 datoid              | oid     |           |          | 
 database            | name    |           |          | 
 temporary           | boolean |           |          | 
 active              | boolean |           |          | 
 active_pid          | integer |           |          | 
 xmin                | xid     |           |          | 
 catalog_xmin        | xid     |           |          | 
 restart_lsn         | pg_lsn  |           |          | 
 confirmed_flush_lsn | pg_lsn  |           |          | 
 wal_status          | text    |           |          | 
 safe_wal_size       | bigint  |           |          | 
 two_phase           | boolean |           |          |

Cheers!

Deploy TiDB on AWS EKS (Elastic Kubernetes Service)

By Enrique 05/10/2021 06/11/2021 MySQL, High Availability / Disaster Recovery

This blog post describes how to deploy a TiDB cluster on AWS Elastic Kubernetes Service (EKS). TiDB on Kubernetes is the standard way to deploy TiDB on public clouds

Install AWS, kubectl & eksctl CLI’s

Install AWS CLI

MAC – Install and configure AWS CLI

# Download Binary
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"

# Install the binary
sudo installer -pkg ./AWSCLIV2.pkg -target /

# Verify the installation
aws --version

Reference: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-mac.html

Windows 10 – Install and configure AWS CLI

The AWS CLI version 2 is supported on Windows XP or later.
The AWS CLI version 2 supports only 64-bit versions of Windows.
Download Binary: https://awscli.amazonaws.com/AWSCLIV2.msi
Install the downloaded binary (standard windows install)

# Verify the installation
aws --version

Reference: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2-windows.html

Configure AWS Command Line using Security Credentials

Go to AWS Management Console –> Services –> IAM
Select the IAM User: <user>
**Important Note:** Use only IAM user to generate **Security Credentials**. Never ever use Root User. (Highly not recommended)
Click on **Security credentials** tab
Click on **Create access key**
Copy Access ID and Secret access key
Go to command line and provide the required details

aws configure

Test if AWS CLI is working after configuring the above:

aws ec2 describe-vpcs

Install kubectl CLI

Kubectl binaries for EKS please prefer to use from Amazon
This will help us to get the exact Kubectl client version based on our EKS Cluster version. You can use the below documentation link to download the binary.
Reference: https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html

MAC – Install and configure kubectl

# Download the Package
mkdir kubectlbinary
cd kubectlbinary
curl -o kubectl https://amazon-eks.s3.us-west-2.amazonaws.com/1.16.8/2020-04-16/bin/darwin/amd64/kubectl

# Provide execute permissions
chmod +x ./kubectl

# Set the Path by copying to user Home Directory
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$PATH:$HOME/bin
echo 'export PATH=$PATH:$HOME/bin' >> ~/.bash_profile

# Verify the kubectl version
kubectl version --short --client
Output: Client Version: v1.16.8-eks-e16311

mkdir kubectlbinary
cd kubectlbinary
curl -o kubectl.exe https://amazon-eks.s3.us-west-2.amazonaws.com/1.16.8/2020-04-16/bin/windows/amd64/kubectl.exe

```
- Update the system **Path** environment variable
```

# Verify the kubectl client version
kubectl version --short --client
kubectl version --client

Install eksctl CLI

eksctl on Mac

# Install Homebrew on MacOs
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

# Install the Weaveworks Homebrew tap.
brew tap weaveworks/tap

# Install the Weaveworks Homebrew tap.
brew install weaveworks/tap/eksctl

# Verify eksctl version
eksctl version

eksctl on windows or linux

For windows and linux OS, you can refer below documentation link.
Reference: https://docs.aws.amazon.com/eks/latest/userguide/eksctl.html#installing-eksctl

Getting started with Amazon EKS – eksctl

https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html

TiDB Architecture

TiDB is designed to consist of multiple components. These components communicate with each other and form a complete TiDB system. The architecture is as follows:

TiDB server

The TiDB server is a stateless SQL layer that exposes the connection endpoint of the MySQL protocol to the outside. The TiDB server receives SQL requests, performs SQL parsing and optimization, and ultimately generates a distributed execution plan. It is horizontally scalable and provides the unified interface to the outside through the load balancing components such as Linux Virtual Server (LVS), HAProxy, or F5. It does not store data and is only for computing and SQL analyzing, transmitting actual data read request to TiKV nodes (or TiFlash nodes).

Placement Driver (PD) server

The PD server is the metadata managing component of the entire cluster. It stores metadata of real-time data distribution of every single TiKV node and the topology structure of the entire TiDB cluster, provides the TiDB Dashboard management UI, and allocates transaction IDs to distributed transactions. The PD server is “the brain” of the entire TiDB cluster because it not only stores metadata of the cluster, but also sends data scheduling command to specific TiKV nodes according to the data distribution state reported by TiKV nodes in real time. In addition, the PD server consists of three nodes at least and has high availability. It is recommended to deploy an odd number of PD nodes.

Storage servers

TiKV server

The TiKV server is responsible for storing data. TiKV is a distributed transactional key-value storage engine. Region is the basic unit to store data. Each Region stores the data for a particular Key Range which is a left-closed and right-open interval from StartKey to EndKey. Multiple Regions exist in each TiKV node. TiKV APIs provide native support to distributed transactions at the key-value pair level and supports the Snapshot Isolation level isolation by default. This is the core of how TiDB supports distributed transactions at the SQL level. After processing SQL statements, the TiDB server converts the SQL execution plan to an actual call to the TiKV API. Therefore, data is stored in TiKV. All the data in TiKV is automatically maintained in multiple replicas (three replicas by default), so TiKV has native high availability and supports automatic failover.

TiFlash server

The TiFlash Server is a special type of storage server. Unlike ordinary TiKV nodes, TiFlash stores data by column, mainly designed to accelerate analytical processing.

Create a EKS cluster and a node pool

It is recommended to create a node pool in each availability zone (at least 3 in total) for each component when creating an EKS.

References:

https://aws.amazon.com/blogs/containers/amazon-eks-cluster-multi-zone-auto-scaling-groups/

https://aws.github.io/aws-eks-best-practices/reliability/docs/dataplane/#ensure-capacity-in-each-az-when-using-ebs-volumes

Save the following configuration as the cluster.yaml file. Replace ${clusterame} with your preferred cluster name, and specify your preferred region.

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ${clusterName}
  region: ap-northeast-1

nodeGroups:
  - name: admin
    desiredCapacity: 1
    privateNetworking: true
    labels:
      dedicated: admin

  - name: tidb-1a
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1a"]
    labels:
      dedicated: tidb
    taints:
      dedicated: tidb:NoSchedule
  - name: tidb-1d
    desiredCapacity: 0
    privateNetworking: true
    availabilityZones: ["ap-northeast-1d"]
    labels:
      dedicated: tidb
    taints:
      dedicated: tidb:NoSchedule
  - name: tidb-1c
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1c"]
    labels:
      dedicated: tidb
    taints:
      dedicated: tidb:NoSchedule

  - name: pd-1a
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1a"]
    labels:
      dedicated: pd
    taints:
      dedicated: pd:NoSchedule
  - name: pd-1d
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1d"]
    labels:
      dedicated: pd
    taints:
      dedicated: pd:NoSchedule
  - name: pd-1c
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1c"]
    labels:
      dedicated: pd
    taints:
      dedicated: pd:NoSchedule

  - name: tikv-1a
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1a"]
    labels:
      dedicated: tikv
    taints:
      dedicated: tikv:NoSchedule
  - name: tikv-1d
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1d"]
    labels:
      dedicated: tikv
    taints:
      dedicated: tikv:NoSchedule
  - name: tikv-1c
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1c"]
    labels:
      dedicated: tikv
    taints:
      dedicated: tikv:NoSchedule

Create Cluster

eksctl create cluster -f cluster.yaml

Deploy TiDB Operator

This section describes how to deploy a TiDB Operator on AWS EKS.

Install Helm (Prerequisite)

MAC – Install Helm

brew install helm

Windows 10 – Install Helm

choco install kubernetes-helm

Create CRD

TiDB Operator uses Custom Resource Definition (CRD) to extend Kubernetes. Therefore, to use TiDB Operator, you must first create the TidbCluster CRD, which is a one-time job in your Kubernetes cluster.

Create a file called crd.yaml. Copy the configuration from the link below.

https://raw.githubusercontent.com/pingcap/tidb-operator/master/manifests/crd.yaml

Build the TiDBCluster CRD by executing the command below.

kubectl apply -f crd.yaml

Add the PingCAP repository

helm repo add pingcap https://charts.pingcap.org/

Expected output:

“pingcap” has been added to your repositories

Create a namespace for TiDB Operator

kubectl create namespace tidb-admin

Expected output:

namespace/tidb-admin created

Install TiDB Operator

helm install --namespace tidb-admin tidb-operator pingcap/tidb-operator --version v1.2.1

To confirm that the TiDB Operator components are running, execute the following command:

kubectl get pods --namespace tidb-admin -l app.kubernetes.io/instance=tidb-operator

Deploy a TiDB cluster and the Monitoring Component

This section describes how to deploy a TiDB cluster and its monitoring component in AWS EKS.

Create namespace

kubectl create namespace tidb-cluster

Note: A namespace is a virtual cluster backed by the same physical cluster. This document takes tidb-cluster as an example. If you want to use a different namespace, modify the corresponding arguments of -n or –namespace.

Deploy

Download the sample TidbCluster and TidbMonitor configuration files:

curl -O https://raw.githubusercontent.com/pingcap/tidb-operator/master/examples/aws/tidb-cluster.yaml && \
curl -O https://raw.githubusercontent.com/pingcap/tidb-operator/master/examples/aws/tidb-monitor.yaml

Execute the command below the deploy TiDB cluster and its monitoring component.

kubectl apply -f tidb-cluster.yaml -n tidb-cluster

kubectl apply -f tidb-monitor.yaml -n tidb-cluster

After the yaml file above is applied to the Kubernetes cluster, TiDB Operator creates the desired TiDB cluster and its monitoring component according to the yaml file.

Verify Cluster & Nodes

View cluster status

kubectl get pods -n tidb-cluster

When all the Pods are in the Running or Ready state, the TiDB cluster is successfully started.

List worker nodes

List Nodes in current kubernetes cluster

kubectl get nodes -o wide

Verify Cluster, NodeGroup in EKS Management Console

Go to Services -> Elastic Kubernetes Service -> ${clustername}

Verify Worker Node IAM Role and list of Policies

Go to Services -> EC2 -> Worker Nodes

Verify CloudFormation Stacks

Verify Control Plane Stack & Events

Verify NodeGroup Stack & Events

Below are the associated NodeGroup Events

Access the Database

You can access the TiDB database to test or develop your application after you have deployed a TiDB cluster.

Prepare a bastion host

The LoadBalancer created for your TiDB cluster is an intranet LoadBalancer. You can create a bastion host in the cluster VPC to access the database.

Select the cluster’s VPC and Subnet and verify whether the cluster name is correct in the dropdown box.

You can view the cluster’s VPC and Subnet by running the following command:

eksctl get cluster -n tidbcluster -r ap-northeast-1

Allow the bastion host to access the Internet. Select the correct key pair so that you can log in to the host via SSH.

Install the MySQL client and connect

sudo yum install mysql -y

Connect the client to the TiDB cluster

mysql -h ${tidb-nlb-dnsname} -P 4000 -u root

kubectl get svc basic-tidb -n tidb-cluster

${tidb-nlb-dnsname} is the LoadBalancer domain name of the TiDB service. You can view the domain name in the EXTERNAL-IP field by executing kubectl get svc basic-tidb -n tidb-cluster.

kubectl get svc basic-tidb -n tidb-cluster

Check TiDB version

select tidb_version()\G

Create test table

use test;

create table test_table (id int unsigned not null auto_increment primary key, v varchar(32));

select * from information_schema.tikv_region_status where db_name=database() and table_name='test_table'\G

Query the TiKV store status

select * from information_schema.tikv_store_status\G

Query the TiDB cluster information

select * from information_schema.cluster_info\G

Access the Grafana Monitoring Dashboard

Obtain the LoadBalancer domain name of Grafana

kubectl -n tidb-cluster get svc basic-grafana

In the output below, the EXTERNAL-IP column is the LoadBalancer domain name.

You can access the ${grafana-lb}:3000 address using your web browser to view monitoring metrics. Replace ${grafana-lb} with the LoadBalancer domain name.

Upgrade

To upgrade the TiDB cluster, edit the spec.version by executing the command below.

kubectl edit tc basic -n tidb-cluster

Scale out

Before scaling out the cluster, you need to scale out the corresponding node group so that the new instances have enough resources for operation.

This section describes how to scale out the EKS node group and TiDB components.

Scale out EKS node group

When scaling out TiKV, the node groups must be scaled out evenly among the different availability zones. The following example shows how to scale out the tikv-1a, tikv-1c, and tikv-1d groups of the ${clusterName} cluster to 2 nodes.

eksctl scale nodegroup --cluster ${clusterName} --name tikv-1a --nodes 2 --nodes-min 2 --nodes-max 2

Scale out TiDB components

After scaling out the EKS node group, execute kubectl edit tc basic -n tidb-cluster, and modify each component’s replicas to the desired number of replicas. The scaling-out process is then completed.

Deploy TiFlash/TiCDC

TiFlash is the columnar storage extension of TiKV.

TiCDC is a tool for replicating the incremental data of TiDB by pulling TiKV change logs.

In the configuration file of eksctl (cluster.yaml), add the following two items to add a node group for TiFlash/TiCDC respectively. desiredCapacity is the number of nodes you desire.

  - name: tiflash-1a
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1a"]
    labels:
      dedicated: tiflash
    taints:
      dedicated: tiflash:NoSchedule
  - name: tiflash-1d
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1d"]
    labels:
      dedicated: tiflash
    taints:
      dedicated: tiflash:NoSchedule
  - name: tiflash-1c
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1c"]
    labels:
      dedicated: tiflash
    taints:
      dedicated: tiflash:NoSchedule

  - name: ticdc-1a
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1a"]
    labels:
      dedicated: ticdc
    taints:
      dedicated: ticdc:NoSchedule
  - name: ticdc-1d
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1d"]
    labels:
      dedicated: ticdc
    taints:
      dedicated: ticdc:NoSchedule
  - name: ticdc-1c
    desiredCapacity: 1
    privateNetworking: true
    availabilityZones: ["ap-northeast-1c"]
    labels:
      dedicated: ticdc
    taints:
      dedicated: ticdc:NoSchedule

Depending on the EKS cluster status, use different commands:

If the cluster is not created, execute eksctl create cluster -f cluster.yaml to create the cluster and node groups.

If the cluster is already created, execute eksctl create nodegroup -f cluster.yaml to create the node groups. The existing node groups are ignored and will not be created again.

Deploy TiFlash/TiCDC

To deploy TiFlash, configure spec.tiflash in tidb-cluster.yaml:

spec:
  ...
  tiflash:
    baseImage: pingcap/tiflash
    replicas: 1
    storageClaims:
    - resources:
        requests:
          storage: 100Gi
    tolerations:
    - effect: NoSchedule
      key: dedicated
      operator: Equal
      value: tiflash

Deploy TiFlash/TiCDC

To deploy TiCDC, configure spec.ticdc in tidb-cluster.yaml

kubectl -n tidb-cluster apply -f tidb-cluster.yaml

Finally, execute kubectl -n tidb-cluster apply -f tidb-cluster.yaml to update the TiDB cluster configuration.

View Cluster Status

kubectl get pods -n tidb-cluster

Delete EKS Cluster & Node Groups

This section describes how to delete EKS cluster and Node Groups.

List EKS Clusters

eksctl get clusters -r ap-northeast-1

Delete Clusters

eksctl get clusters -r ap-northeast-1

Delete Clusters

eksctl delete cluster tidbcluster -r ap-northeast-1

OR;

eksctl delete cluster --region=ap-northeast-1 --name=tidbcluster

Cheers!

MySQL/MariaDB – Identifying and Avoiding Deadlocks

By Enrique 02/06/2021 03/06/2021 MariaDB, MySQL, Performance Optimization, High Availability / Disaster Recovery

A deadlock is a special blocking scenario when two or more competing transactions are waiting for each other to free locks. Each process, while holding its own resources, attempts to access a resource that is locked by the other process..

Simulating a Deadlock Scenario

Transaction 1

START TRANSACTION;
SELECT * FROM departments WHERE dept_no = 'd008' LOCK IN SHARE MODE;

Transaction 2 (wait)

START TRANSACTION;
UPDATE departments
SET dept_name = 'Research & Development'
WHERE dept_no = 'd008';

Transaction 1 (deadlock)

UPDATE departments
SET dept_name = 'R&D'
WHERE dept_no = 'd008';

Identify and Analyze Deadlocks

Execute the command below in MySQL/MariaDB.

SHOW ENGINE INNODB STATUS \G

mysql> SHOW ENGINE INNODB STATUS \G
*************************** 1. row ***************************
  Type: InnoDB
  Name:
Status:
=====================================
2021-06-02 00:40:29 0x7f99d005e700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 56 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 5 srv_active, 0 srv_shutdown, 4498 srv_idle
srv_master_thread log flush and writes: 0
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 2
OS WAIT ARRAY INFO: signal count 3
RW-shared spins 9, rounds 9, OS waits 0
RW-excl spins 0, rounds 0, OS waits 0
RW-sx spins 0, rounds 0, OS waits 0
Spin rounds per wait: 1.00 RW-shared, 0.00 RW-excl, 0.00 RW-sx
------------------------
LATEST DETECTED DEADLOCK
------------------------
2021-06-02 00:40:08 0x7f99b74f8700
*** (1) TRANSACTION:
TRANSACTION 51038, ACTIVE 13 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 11, OS thread handle 140298596771584, query id 180 localhost instadm updating
UPDATE departments
SET dept_name = 'Research & Development'
WHERE dept_no = 'd008'

*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 3 page no 4 n bits 80 index PRIMARY of table `employees`.`departments` trx id 51038 lock_mode X locks rec but not gap waiting
Record lock, heap no 13 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 64303038; asc d008;;
 1: len 6; hex 00000000c75a; asc      Z;;
 2: len 7; hex 02000000fc0151; asc       Q;;
 3: len 22; hex 5265736561726368202620446576656c6f706d656e74; asc Research & Development;;


*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 3 page no 4 n bits 80 index PRIMARY of table `employees`.`departments` trx id 51038 lock_mode X locks rec but not gap waiting
Record lock, heap no 13 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 64303038; asc d008;;
 1: len 6; hex 00000000c75a; asc      Z;;
 2: len 7; hex 02000000fc0151; asc       Q;;
 3: len 22; hex 5265736561726368202620446576656c6f706d656e74; asc Research & Development;;


*** (2) TRANSACTION:
TRANSACTION 51039, ACTIVE 28 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 2 row lock(s)
MySQL thread id 9, OS thread handle 140298597062400, query id 181 localhost instadm updating
UPDATE departments
SET dept_name = 'R&D'
WHERE dept_no = 'd008'

*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 3 page no 4 n bits 80 index PRIMARY of table `employees`.`departments` trx id 51039 lock mode S locks rec but not gap
Record lock, heap no 13 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 64303038; asc d008;;
 1: len 6; hex 00000000c75a; asc      Z;;
 2: len 7; hex 02000000fc0151; asc       Q;;
 3: len 22; hex 5265736561726368202620446576656c6f706d656e74; asc Research & Development;;


*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 3 page no 4 n bits 80 index PRIMARY of table `employees`.`departments` trx id 51039 lock_mode X locks rec but not gap waiting
Record lock, heap no 13 PHYSICAL RECORD: n_fields 4; compact format; info bits 0
 0: len 4; hex 64303038; asc d008;;
 1: len 6; hex 00000000c75a; asc      Z;;
 2: len 7; hex 02000000fc0151; asc       Q;;
 3: len 22; hex 5265736561726368202620446576656c6f706d656e74; asc Research & Development;;

*** WE ROLL BACK TRANSACTION (1)

The output will show many info about the latest deadlock, and why it occurred. Take a close look at the portion where it indicates WAITING FOR THIS LOCK TO BE GRANTED (shows which lock the transaction is waiting for) and HOLD THE LOCK(S) (shows the locks that are holding up this transaction).

Preventing Deadlocks

Keep transactions small and quick to avoid clashing.
Commit transactions right after making a set of related changes to make them less prone to clashes.
Accessing resources in the same physical sequence.
- For example, two transactions need to access two resources. If each transaction accesses the resources in the same physical sequence, then the first transaction will successfully obtain locks on the resources without being blocked by the second transaction. The second transaction will be blocked by the first while trying to obtain a lock on the first resource. The outcome will just be a typical blocking scenario instead of a deadlock.

Cheers!

MariaDB – Setup GTID Replication using MariaBackup Step-by-Step

By Enrique 16/11/2020 14/07/2021 MariaDB, High Availability / Disaster Recovery, Database Design

In this blog post, I’m going to show you how to setup GTID replication using Mariabackup.

There are 2 main benefits of using global transaction:

Failover is easier than with file-based replication.
the state of the slave is recorded in a crash-safe way.

Here are the general steps:

Enable binary logging on the master
Enable GTID
Create a replication user on the master
Set a unique server_id on the slave
Take backup from the master
Restore on the slave
Execute the CHANGE MASTER TO command
Start the replication

1. Ensure that the server_id value and bind_address are configured differently in my.cnf in each of the server that will be part of the replication.

In this example, we will configure a 2 node master-slave setup. The bind-address is the hostname IP.

vi my.cnf

In Master:

server_id=1
bind-address=192.168.1.115

In Replica:

server_id=2
bind-address=192.168.1.131

2. Enable binary logging and GTID strict mode in both servers

show global variables like 'log_bin';

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| log_bin       | ON    |
+---------------+-------+

show global variables like '%gtid_strict_mode%';

+------------------+-------+
| Variable_name    | Value |
+------------------+-------+
| gtid_strict_mode | ON    |
+------------------+-------+

If bin logging is not enabled, you may do so by adding the line below in my.cnf


log-bin = db1-bin

# Or specify an different path

log-bin = /mariadb/bin/logs/bin_logs/bin_log

Restart DB service for the change to take effect.

sudo service mysql stop

sudo serivice mysql start

If gtid is not enabled, add the line below in my.cnf, then enable it globally.

Add this line in cnf

gtid_strict_mode=1

set global gtid_strict_mode=1;

3. Create a user in Master.

The replica is going to use this user connection to read the binary logs on the master and then put those into the relay logs on the replica.

CREATE USER 'repl'@'%' IDENTIFIED BY 'P@$$w0rd';
GRANT RELOAD, SUPER, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'repl'@'%';

4. Install qpress (As root) in both Master and Slave for compression/decompression.

sudo yum install https://repo.percona.com/yum/percona-release-latest.noarch.rpm -y

yum install qpress -y

5. In Master, Take a full backup of the database using Mariabackup.

mariabackup --defaults-file=/etc/my.cnf --backup --compress \
--target-dir=/mariadb/backup/full_backup --user=username \
--password=pass --backup --compress --parallel=4

6. In Replica, create a directory where we will place the backup from Master

mkdir -p /mariadb/backup/rep

7. In Master, use scp to transfer the entire backup image to the replica.

# Go to the directory where you placed the backup 
cd /mariadb/backup

scp -rp full_backup mysql@192.168.1.131:/mariadb/backup/rep

8. In Replica, Stop DB Service


sudo service mysql stop

# Verify that DB service has been stopped
ps -ef| grep mysqld

9. In Replica, Remove all contents in Data Directory.

mkdir -p /mariadb/data/old_data
mv /mariadb/data/* /mariadb/data/old_data/
rm -rf /mariadb/data/old_data/

# Ensure that the data directory is empty
cd /mariadb/data
ls -la

10. In Replica, copy the backup image to data directory.

cp -rp /mariadb/backup/rep/full_backup /mariadb/data

11. In Replica, decompress and prepare backup image.


MEMORY=`grep -w innodb_buffer_pool_size /mariadb/bin/etc/my.cnf | cut -d'=' -f2`

mariabackup --decompress --parallel=4 --remove-original --use-memory=$MEMORY --target-dir=full_backup

12. In Replica, prepare backup.


MARIADB_VERSION=`rpm -qa | grep MariaDB-server | grep -v debuginfo | cut -d'-' -f3 | cut -d'.' -f2`

if [ $MARIADB_VERSION -le 3 ] ; then mariabackup --prepare --apply-log-only --use-memory=$MEMORY --target-dir=full_backup; fi

mariabackup --prepare --use-memory=$MEMORY --target-dir=full_backup

13. In Replica, cleanup data directory, and move all files from full_backup to data directory.

ls -1v /mariadb/data | grep -v $FULL | xargs rm -rf
mv /mariadb/data/$FULL/* /mariadb/data

14. In Replica, Delete the full_backup directory.


cd /mariadb/data
rm -rf /mariadb/data/full_backup

15. In Replica, Rotate error logs.

cd /mariadb/bin/logs/db_logs
mv error_mariadb.log error_mariadb_`date +%Y%m%d_%H%M%S`.log
touch error_mariadb.log

16. In Replica, Start DB Service.


sudo service mysql start

# Check if there are any errors
cat /mariadb/bin/logs/db_logs/error_mariadb.log

17. In Replica, Check GTID. Take note of the GTID, because we need to use it for the next step.

cat /mariadb/data/xtrabackup_info | grep -i GTID

18. In Replica, Login in to the DB, and set the global gtid_slave_pos.

stop slave; reset slave; reset slave all; reset master; set global gtid_slave_pos='above_GTID_number_from_step18';

19. In Replica, still logged in to the DB, execute the change master to command.

change master to master_host='10.92.146.102', master_port=6603, master_user='repl', master_password='P@$$w0rd', master_connect_retry=10, master_use_gtid=slave_pos;

20. In Replica, set enable read_only.

set global read_only=1;

21. Start slave.

start slave;

22. Check replication status.

show slave status\G

Cheers!

GTID Replication – Differences between MySQL and MariaDB

By Enrique 03/11/2020 04/11/2020 MariaDB, High Availability / Disaster Recovery, Database Design

In this blog post, I’m gonna highlight the differences in GTID replication between MySQL and MariaDB

I will not go thru step-by-step setup, because they are many resources out there that do.

The main reason to use GTID is that it makes it easier to track and compare replicated transaction between master and replica; hence, allowing simpler failover and recovery.

Here are the differences between MySQL and MariaDB

1. Composition of GTID

In MariaDB, it is composed of three separated dashed numbers like x-y-z

x: first number – domain ID
y: second number – server ID
z: third number – sequence number

In MySQL, there are 2 parts:

source_id
transaction_id

2. Enabling GTIDs

To enable GTID, we have to set the following parameters in my.cnf

In MySQL:

gtid_mode
enforce_gtid_consistency

gtid-mode=ON
enforce-gtid-consistency

In MariaDB:

gtid_strict_mode

gtid_strict_mode=1

enforce_gtid_consistency does not exist in MariaDB

3. CHANGE MASTER TO statement

In order for the Replica to identify its Master (data source), and to use GTID-based auto-positioning, we need execute the CHANGE MASTER TO statement. In MySQL, we use the MASTER_AUTO_POSITION option to tell the replica that transactions will be identified by GTIDs.

Example In MySQL:

change master to
master_host = '192.168.1.120',
master_port=3306, 
master_user = 'repl',
master_password = 'password',
master_auto_position=1;

In MariaDB, A slave is configured to use GTID by CHANGE MASTER TO master_use_gtid=slave_pos. The replication will start at the position of the last GTID replicated to slave when the slave connects to the master. Refer to the official documentation for more info https://mariadb.com/kb/en/gtid/

Example In MariaDB:

change master to 
master_host='192.168.1.120', 
master_port=3306, 
master_user='repl', 
master_password='password', 
master_use_gtid=slave_pos;

Cheers!

MariaDB – MaxScale Set up, Failover, Switchover, Load Balancing

By Enrique 05/07/2020 07/10/2021 MariaDB, High Availability / Disaster Recovery

In this blog post, I will show you how to:

Setup MariaDB MaxScale
Perform Administration and Monitoring
Perform a Failover
Perform a Switchover

SETUP ENVIRONMENT

First, we will set up the environment. We will be having 1 master and 2 slaves. On top of those, we will have our MariaDB MaxScale server.

To make it easier to setup the environment, you may use the vagrant file below.

# -*- mode: ruby maxscale222
# vi: set ft=ruby :

# Vagrantfile API/syntax version. Don't touch unless you know what you're doing!
VAGRANTFILE_API_VERSION = "2"

Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|

 #: adding ansible stuff
 config.ssh.insert_key = false

 #: maxscale box
 config.vm.define :maxscale do |maxscale|
 maxscale.vm.hostname="maxscale"
 maxscale.vm.box = "domflannery/centos7"
 maxscale.vm.network "public_network"
 end
 
  ######: MASTER / SLAVE SERVERS :######
 #: master, async && semisync replication
 
    config.vm.define :mariadb1, autostart: true do |mariadb1|
	mariadb1.vm.hostname="mariadb1"
	mariadb1.vm.box = "domflannery/centos7"
	mariadb1.vm.network "public_network"
 end

 #: slave01, async && semisync replication

   config.vm.define :mariadb2, autostart: true do |mariadb2|
   mariadb2.vm.hostname="mariadb2"
   mariadb2.vm.box = "domflannery/centos7"
   mariadb2.vm.network "public_network"
 end
 
  #: slave02, async && semisync replication

   config.vm.define :mariadb3, autostart: true do |mariadb3|
   mariadb3.vm.hostname="mariadb3"
   mariadb3.vm.box = "domflannery/centos7"
   mariadb3.vm.network "public_network"
 end
 
end

I chose public networks that is set automatically (through DHCP) for demo purposes. The VMs IP addresses will be in the same range as the host’s IP address.

Once all the virtual boxes are up, we will login to each of them to get the assigned public IP address. First, login to each server using local IP (127.0.0.1) and the respective port numbers (refer to the chart below). The login and password will be both vagrant

Hostname	IP address	Port
maxscale	127.0.0.1	2222
mariadb1	127.0.0.1	2200
mariadb2	127.0.0.1	2201
mariadb3	127.0.0.1	2202

I got the port numbers above by looking at the logs after I issued vagrant up. Below is an example.

Get the assigned IP address using the command below (ifconfig will not work in this version of centos)

ip a

After getting the public IP addresses of each server, configure them in your putty. Give each connection sensible names. (Also, don’t include the /24 of the IP address. )

Preview(opens in a new tab)

INSTALL MARIADB

Install MariaDB in mariadb1, mariadb2, and mariadb3. I will be using MariaDB 10.4 on Centos7. If you want to use a different flavor of Linux, then refer to this link to get the correct repositories info.

Let’s create a repo file.

vim MariaDB.repo

[mariadb]
name = MariaDB
baseurl = http://yum.mariadb.org/10.4/centos7-amd64
gpgkey=https://yum.mariadb.org/RPM-GPG-KEY-MariaDB
gpgcheck=1

Move the file to /etc/yum.repos.d/

sudo mv -vi MariaDB.repo /etc/yum.repos.d/

Once the file is in place, we are ready to install MariaDB (The capitalization of the package names is important).

sudo yum install MariaDB-server MariaDB-client -y

SETUP MARIADB REPLICATION (USING GTID)

Get the master and replicas ready.

vi /etc/my.cnf

Delete the existing lines; then enter the following lines below (or just copy paste them).

[client-server]
!includedir /etc/my.cnf.d

[client]
port = 3306
socket = /var/lib/mysql/mysql.sock

[mysqld]
user = mysql
# pid-file = /var/run/mysqld/mysqld.pid

# Network
port = 3306
socket = /var/lib/mysql/mysql.sock
bind-address = 0.0.0.0

# Data directory
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp

# Connections
max_connections = 100
connect_timeout = 5
wait_timeout = 600
max_allowed_packet = 16M
thread_cache_size = 128
sort_buffer_size = 4M
bulk_insert_buffer_size = 16M
tmp_table_size = 32M

# Query Cache
query_cache_limit = 128K
query_cache_size = 64M
query_cache_type = 1
log_warnings = 2

# Slow log
slow_query_log
slow_query_log_file = /var/lib/mysql/mariadb-slow.log
long_query_time = 1
log_slow_verbosity = query_plan,explain

# Replication
server-id=1
gtid_strict_mode=1
log-bin = /var/lib/mysql/bin-mariadb.log
expire-logs-days=8
sync_binlog = 1
slave_compressed_protocol
transaction-isolation = READ-COMMITTED
binlog_format = row


# Error log
log_error = mariadb_error.log

# Deadlocks
innodb_print_all_deadlocks = ON

# Performance Schema
performance_schema=on

# User Statistics
userstat = 1

# Denies naming the connection
skip_name_resolve

Do the same (edit config file) in mariadb2 and mariadb3, but change the server_id to a different value for each server.

After editing the my.cnf file on all 3 servers, start mariadb service

systemctl start mariadb

STARTING UP REPLICATION

Create a user in the master, and grant “replication slave” to the user. The replica is going to use this user connection to read the binary logs on the master and then put those into the relay logs on the replica.

CREATE USER 'repl'@'%' IDENTIFIED BY 'password';
GRANT REPLICATION SLAVE ON *.* to 'repl'@'%';

In a real-world production environment, you would create a user that can only connect from the IP address of your slaves, but for the sake of demonstration, we’ll just use the wild card character (%)

Point your slaves to the master by executed the commands below.

change master to
master_host='192.168.1.220',
master_user='repl',
master_password='password',
master_use_gtid=slave_pos;

Run the command below to start the slave

start slave;

Check slave status

show slave status \G

Both Slave_IO_Running and Slave_SQL_Running parameters should be Yes.

Execute on Master (only) to grant privileges to maxuser. This user will be used by maxscale.

grant reload, show databases, super, replication slave, replication client on *.* to 'maxuser'@'localhost' identified by 'P@$$w0rd';
grant reload, super, replication slave, replication client ON *.* TO 'repl'@'%';
grant select on mysql.user to 'maxuser'@'localhost';
grant select on mysql.db to 'maxuser'@'localhost';
grant select on mysql.tables_priv to 'maxuser'@'localhost';
grant select on mysql.roles_mapping to 'maxuser'@'localhost';
grant select on mysql.columns_priv to 'maxuser'@'localhost';
grant select on mysql.proxies_priv to 'maxuser'@'localhost';
flush privileges;"

grant reload, show databases, super, replication slave, replication client on *.* to 'maxuser'@'' identified by 'P@$$w0rd';
grant reload, super, replication slave, replication client ON *.* TO 'repuser'@'%';
grant select on mysql.user to 'maxuser'@'';
grant select on mysql.db to 'maxuser'@'';
grant select on mysql.tables_priv to 'maxuser'@'';
grant select on mysql.roles_mapping to 'maxuser'@'';
grant select on mysql.columns_priv to 'maxuser'@'';
grant select on mysql.proxies_priv to 'maxuser'@'';
flush privileges;"

SET UP MAXSCALE

Create maxscale ID

useradd -m -d /home/maxscale -u 6603 -s /bin/bash -c "MaxScale Admin Functional ID" -U maxscale

Set password for maxscale ID

echo 'password' | passwd --stdin maxscale

Execute MariaDB Package Repository Setup

The command below will setup 3 different repositories in a single repository configuration file. The repositories are the following:

MariaDB Repository
MariaDB MaxScale Repository
MariaDB Tools Repository

You may read about the setup and usage of the command from this link.

curl -LsS https://downloads.mariadb.com/MariaDB/mariadb_repo_setup | sudo bash

Install MariaDB Client

yum install MariaDB-common MariaDB-compat MariaDB-client -y

Install MaxScale

yum install maxscale -y

Generate Encrypted Password.

You can read about encrypting passwords from this link.

# Generate .secrets file
rm -f /var/lib/maxscale/.secrets
maxkeys /var/lib/maxscale

# Generate encrypted password
ls -altr /var/lib/maxscale/.secrets
chown maxscale:maxscale /var/lib/maxscale/.secrets

maxpasswd  /var/lib/maxscale/ password
42535F5E5FBC62D068623CD69F62CC00E231737DFC7D51FBEEC0D81D33D7E99A

maxpasswd  /var/lib/maxscale/ password2
A650AB2DC74FE3F609FE6D16BED6A38D757184B532B2A54FA9FCF63C36627B73

Configure maxscale.cnf file

vim /etc/maxscale.cnf.d/maxscale.cnf

[mariadb1]
type=server
address=192.168.1.220
port=3306
protocol=mariadbbackend

[mariadb2]
type=server
address=192.168.1.133
port=3306
protocol=mariadbbackend

[mariadb3]
type=server
address=192.168.1.208
port=3306
protocol=mariadbbackend

[MySQLMonitor]
type=monitor
module=mariadbmon
servers=mariadb1,mariadb2,mariadb3
user=repl
password=42535F5E5FBC62D068623CD69F62CC00E231737DFC7D51FBEEC0D81D33D7E99A
replication_user=repl
replication_password=42535F5E5FBC62D068623CD69F62CC00E231737DFC7D51FBEEC0D81D33D7E99A
monitor_interval=5000
auto_failover=true
auto_rejoin=true
enforce_read_only_slaves=1
backend_connect_timeout=10
backend_read_timeout=10
backend_write_timeout=10

[ReadWriteSplitService]
type=service
router=readwritesplit
servers=mariadb1,mariadb2,mariadb3
user=maxuser
password=6C620E5D01B63D33023313BCD5DD6EC4F60B619F2B7F4B9D18E034428BEDA178
master_failure_mode=fail_instantly
max_slave_replication_lag=1

[ReadWriteMasterService]
type=service
router=readconnroute
servers=mariadb1,mariadb2,mariadb3
user=maxuser
password=A650AB2DC74FE3F609FE6D16BED6A38D757184B532B2A54FA9FCF63C36627B73
router_options=master

[ReadOnlySlaveService]
type=service
router=readconnroute
servers=mariadb1,mariadb2,mariadb3
user=maxuser
password=A650AB2DC74FE3F609FE6D16BED6A38D757184B532B2A54FA9FCF63C36627B73
router_options=slave

[ReadWriteSplitListener]
type=listener
service=ReadWriteSplitService
protocol=mariadbclient
address=0.0.0.0
port=4006

[ReadWriteMasterListener]
type=listener
service=ReadWriteMasterService
protocol=mariadbclient
address=0.0.0.0
port=4007

[ReadOnlySlaveListener]
type=listener
service=ReadOnlySlaveService
protocol=mariadbclient
address=0.0.0.0
port=4008

Change permissions and owner of config files

chown maxscale:maxscale /etc/maxscale.cnf.d/maxscale.cnf
chmod 640 /etc/maxscale.cnf.d/maxscale.cnf

You may read about the MariaDB Monitor, ReadConnRoute, and ReadWriteSplit from MariaDB’s website.

Verify that MaxScale service is enabled.

systemctl is-enabled maxscale.service

Start MaxScale service.

systemctl start maxscale.service

You check the maxscale log at the path below.

tail -100 /var/log/maxscale/maxscale.log

MONITORING MAXSCALE

In your maxscale server, run the command below to check list of servers being monitored.

maxctrl list servers

FAILOVER TEST

Let’s monitor the servers continuously every second. I suggest you open a new session of your maxscale server, and run the command below.

watch -n1 maxctrl list servers

Go to mariadb1, then stop MariaDB service.

systemctl stop mariadb

After a few seconds, you will see that mariadb2 has been promoted to master.

Let’s bring up mariadb1.

systemctl start mariadb

Monitor the servers, and you will see that they are all up and running.

SWITCHOVER

Let’s make mariadb1 server the master again. We will be doing a switchover, but before we proceed, we have to verify the monitor that we are going to use.

maxctrl list monitors

In our case, it is MariaDB-Monitor.

Now, we are ready to do the switchover. Here is the command to perform it.

maxctrl call command mariadbmon switchover MariaDB-Monitor mariadb1 mariadb2

Here is the explanation of the command above.

call command indicates that it is a module command that is to be invoked.
mariadbmon indicates the module whose command we want to invoke.
switchover is the command we want to invoke.
MariaDB-Monitor is the first argument to the command, the name of the monitor as specified in the configuration file.
mariadb1 the name of the server we want to make into master
mariadb2 current master.

We now have successfully switched over the master role back to mariadb1

LOAD BALANCING TEST

In this demo, we will be using the maxscale server as our client.

In the previous section, we’ve already installed MariaDB client. Verify that is installed successfully.

rpm -qa | grep -i MariaDB

In mariadb1 (master), create an app user.

CREATE USER 'app_user'@'%' IDENTIFIED BY 'password';
GRANT ALL PRIVILEGES ON *.* TO 'app_user'@'%' WITH GRANT OPTION;

To connect to our servers via maxscale, we use the connection string below. Essentially, we have to indicate the maxscale IP address. My maxscale IP address happens to be 192.168.1.87. We also have to include the Read/Write listener port that is 4006 as indicated in the maxscale config file (/etc/maxscale.cnf)

Execute the command below in your maxscale server.

mysql -u app_user -p -h192.168.1.87 -P4006

To know which server you are connect to, run the following statement.

SELECT @@hostname;

If we keep on executing this statement repeatedly, we will notice that we will be connected to our slaves: mariadb2 and mariadb3. This is load balancing at work. MaxScale knows that we are executing a read statement; hence it is redirecting the traffic to our read-only replicas.

Once we start a transaction, MaxScale now recognizes that we are about to update a record; hence the connection will be redirected to the master node only.

START TRANSACTION;

The connection will remain in master until we commit the transaction.

COMMIT;

After we commit the transaction, the connection will again be redirected to the read-only replicas.

This is also called the Read Write Splitting.

MaxScale analyze statements, and send where appropriate. Write statements to master; and Read statements to some slaves.

That is all for this blog post.

Cheers!

SQL Server – Adding a Database to AlwaysOn Availability Group

By Enrique 06/04/2016 03/05/2020 SQL SERVER, High Availability / Disaster Recovery

I usually restore the backup file to the secondary replica WITH NORECOVERY state before I add database to AlwaysOn Availability Group (AAG).

The following are the steps to add a database to AAG.

Expand Availability Groups. Right-click Availability Database. Choose Add Database.

Click Next.

Tick the database that you want to add. You will notice that under status columm, the message is “Meets prerequisites”. This is because I have already restored the database with norecovery status in the secondary replica.

Choose Join Only.

Connect to existing secondary replicas.

Click Next.

Click Finish.

Click Close.

]

SQL Server – AlwaysOn Availability Group Errors

By Enrique 27/03/2016 03/05/2020 SQL SERVER, High Availability / Disaster Recovery

error	severity	description
957	17	Database '%.*ls' is enabled for database mirroring or has joined an availability group. The name of the database cannot be changed.
976	14	The target database, '%.*ls', is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other	If dm_hadr_availability_replica_states.role_desc=3(INVALID), then check service broker. Because Availability group is using Service Broker to communicate between SQL Server then need to check Service Broker first. Open SSMS>“Server Objects” > “Endpoints”, there are two “Service Broker” folder>Expand both, and if the state is Stopped or Disabled, then can change the state to Started.
978	14	The target database ('%.*ls') is in an availability group and is currently accessible for connections when the application intent is set to read only. For more information about application intent, see SQL Server Books Online.
979	14	The target database ('%.*ls') is in an availability group and currently does not allow read only connections. For more information about application intent, see SQL Server Books Online.
982	14	Unable to access the '%.*ls' database because no online secondary replicas are enabled for read-only access. Check the availability group configuration to verify that at least one secondary replica is configured for read-only access. Wait for an enabled r
1408	16	The remote copy of database "%.*ls" is not recovered far enough to enable database mirroring or to join it to the availability group. You need to apply missing log records to the remote database by restoring the current log backups from the principal/prim
1409	16	Database "%.*ls" requires database logs to be restored either on the future mirror database before you can enable database mirroring or on a secondary availability database before you can join it to the availability group. Restore current log backups from
1465	16	Database "%.*ls" database is not in full recovery mode on each of the server instances. The full recovery model is required for a database to participate in database mirroring or in an availability group.
1466	16	Database "%.*ls" is read-only on one of the server instances which is incompatible with participating in database mirroring or in an availability group. Set the database to read-write mode, and retry the operation.
1467	16	Database "%.*ls" database is in emergency or suspect mode on one of the partners which is incompatible with participating in database mirroring or in an availability group.
1468	16	The operation cannot be performed on database "%.*ls" because it is involved in a database mirroring session or an availability group. Some operations are not allowed on a database that is participating in a database mirroring session or in an availabilit
1469	16	Database "%.*ls" is an auto-close database on one of the partnerswhich is incompatible with participating in database mirroring or in an availability group.
1480	10	The %S_MSG database "%.*ls" is changing roles from "%ls" to "%ls" because the mirroring session or availability group failed over due to %S_MSG. This is an informational message only. No user action is required.
1488	16	Database "%.*ls" database is in single user mode which is incompatible with participating in database mirroring or in an availability group. Set database to multi-user mode, and retry the operation.
1833	16	File '%ls' cannot be reused until after the next BACKUP LOG operation. If the database is participating in an availability group, a dropped file can be reused only after the truncation LSN of the primary availability replica has passed the drop LSN of the
3104	16	RESTORE cannot operate on database '%ls' because it is configured for database mirroring or has joined an availability group. If you intend to restore the database, use ALTER DATABASE to remove mirroring or to remove the database from its availability gro
3752	16	The database '%.*ls' is currently joined to an availability group. Before you can drop the database, you need to remove it from the availatility group.	Before you can drop the database, you need to remove it from the availability group. If AG exists, try ALTER AVAILABILITY GROUP [AGname] REMOVE DATABASE [DbName]. If AG does not exist, try removing with ALTER DATABASE [DbName] SET HADR OFF. If replica_id and group_database_id in sys.databases show as null, then this may be due to earlier meta-data when database was part of an AG so stop SQL Server, rename the database files and start SQL Server. Then you should be able to remove the database.
5529	16	Failed to remove a FILESTREAM file. The database is a primary database in an availability group. Wait for the FILESTREAM data files to be hardened on every secondary availability replica. Then retry the drop file operation.
10786	16	The ALTER AVAILABILITY GROUP command failed because it contained multiple MODIFY REPLICA options: %ls. Enter a separate ALTER AVAILABILITY GROUP ... MODIFY REPLICA command for each replica option that you want to modify.
13251	10	availability group
19401	16	The READ_ONLY_ROUTING_URL '%.ls' specified for availability replica '%.ls' is not valid. It does not follow the required format of 'TCP://system-address:port'. For information about the correct routing URL format, see the CREATE AVAILABILITY GROUP docum
19403	16	The availability replica '%.ls' specified in the READ_ONLY_ROUTING_LIST for availability replica '%.ls' does not exist. Only availability replicas that belong to the specified availability group '%.*ls' can be added to this list. To get the names of ava
19405	16	Failed to create, join or add replica to availability group '%.ls', because node '%.ls' is a possible owner for both replica '%.ls' and '%.ls'. If one replica is failover cluster instance, remove the overlapped node from its possible owners and try ag
19406	10	The state of the local availability replica in availability group '%.*ls' has changed from '%ls' to '%ls'. The replica state changed because of either a startup, a failover, a communication issue, or a cluster error. For more information, see the availabi
19407	16	The lease between availability group '%.*ls' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is fail
19452	16	The availability group listener (network name) with Windows Server Failover Clustering resource ID '%s', DNS name '%s', port %hu failed to start with a permanent error: %u. Verify port numbers, DNS names and other related network configuration, then retry
19453	16	The availability group listener (network name) with Windows Server Failover Clustering resource ID '%s', DNS name '%s', port %hu failed to start with this error: %u. Verify network and cluster configuration and logs.
19454	16	The availability group listener (network name) with Windows Server Failover Clustering resource ID '%s', DNS name '%s', port %hu failed to stop with this error: %u. Verify network and cluster configuration and logs.
19455	16	The WSFC cluster does not have a public cluster network with an IPv4 subnet. This is a requirement to create an availability group DHCP listener. Configure a public network for the cluster with an IPv4 subnet, and try to create the listener.
19456	16	None of the IP addresses configured for the availability group listener can be hosted by the server '%.*ls'. Either configure a public cluster network on which one of the specified IP addresses can be hosted, or add another listener IP address which can b	Added a new IP address (with different subnet) manually to the existing Listener [to be tested].
19458	16	The WSFC nodes that host the primary and secondary replicas belong to different subnets. DHCP across multiple subnets is not supported for availability replicas. Use the static IP option to configure the availability group listener.
19460	16	The availability group listener with DNS name '%.*ls' is configured to use DHCP. For listeners with this configuration, IP addresses cannot be added through SQL Server. To add IP addresses to the listener, drop the DHCP listener and create it again config
19468	16	The listener with DNS name '%.ls' for the availability group '%.ls' is already listening on the TCP port %u. Please choose a different TCP port for the listener. If there is a problem with the listener, try restarting the listener to correct the problem
19469	16	The specified listener with DNS name, '%.ls', does not exist for the Availability Group '%.ls'. Use an existing listener, or create a new listener.
19477	16	The availability group '%.ls' already has a listener with DNS name '%.ls'. Availability groups can have only one listener. Use the existing listener, or drop the existing listener and create a new one.
19486	16	The configuration changes to the availability group listener were completed, but the TCP provider of the instance of SQL Server failed to listen on the specified port [%.*ls:%d]. This TCP port is already in use. Reconfigure the availability group listener
21872	16	The availability group associated with Virtual Network Name '%s' has no replicas.
21880	16	The virtual network name '%s' has been used to identify the redirected publisher for original publisher '%s' and database '%s'. The availability group associated with this virtual network name, however, does not include the publisher database.
21882	16	The database '%s' at the redirected publisher '%s' for original publisher '%s' and database '%s' belongs to a HADRon availability group and must be redirected to its associated HADRon Virtual Network Name.
21883	16	The query at the redirected publisher '%s' to determine whether the publisher database '%s' belonged to an availability group failed with error '%d', error message '%s'.
21884	16	The query at the redirected publisher '%s' to determine the health of the availability group associated with publisher database '%s' failed with error '%d', error message '%s'.
21887	16	The query at the redirected publisher '%s' to determine whether the publisher database '%s' belonged to an availability group failed with error '%d', error message '%s'.
21892	16	Unable to query sys.availability_replicas at the availability group primary associated with virtual network name '%s' for the server names of the member replicas: error = %d, error message = %s.',
33445	16	The database '%.*s' is a readable secondary database in an availability group and cannot be enabled for FILESTREAM non-transacted access.
33446	16	The FILESTREAM database configuration cannot be changed for database '%.*s'. The database is either a mirror database in Database Mirroring, or is in a secondary replica of an AlwaysOn availability group. Connect to the server instance that hosts the pr
33449	10	FILESTREAM File I/O access is enabled, but no listener for the availability group is created. A FILESTREAM PathName will be unable to refer to a virtual network name (VNN) and, instead, will need to refer to a physical Windows Server Failover Clustering (
33450	10	FILESTREAM File I/O access is enabled. One or more availability groups ('%ls') currently do not have a listener. A FILESTREAM PathName will be unable to refer to a virtual network name (VNN) and, instead, will need to refer to a physical Windows Server F
35202	10	A connection for availability group '%ls' from availability replica '%ls' with id [%ls] to '%ls' with id [%ls] has been successfully established. This is an informational message only. No user action is required.
35205	16	Could not start the AlwaysOn Availability Groups transport manager. This failure probably occurred because a low memory condition existed when the message dispatcher started up. If so, other internal tasks might also have experienced errors. Check the SQL
35206	10	A connection timeout has occurred on a previously established connection to availability replica '%ls' with id [%ls]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.	If there is a login failed for domain\node$ account around same time, then it may be because NT AUTHORITY\SYSTEM account lacked the required permissions to failover the availability group. To failover the group to the other node, NT AUTHORITY\SYSTEM must have permission to connect to SQL (CONNECT SQL), failover the availability group (ALTER ANY AVAILABILITY GROUP) and execute sp_server_diagnostics (VIEW SERVER STATE). The NT AUTHORITY\SYSTEM account is used to automatically execute sp_server_diagnostics. Another option is to add the account to sysadmin role in SQL Server. CREATE LOGIN [NT AUTHORITY\SYSTEM] FROM WINDOWS WITH DEFAULT_DATABASE=[master], DEFAULT_LANGUAGE=[us_english]; ALTER SERVER ROLE [sysadmin] ADD MEMBER [NT AUTHORITY\system];
35207	16	Connection attempt on availability group id '%ls' from replica id '%ls' to replica id '%ls' failed because of error %d, severity %d, state %d.
35210	16	Failed to modify options for availability replica '%.ls' in availability group '%.ls'. The specified availability group does not contain an availability replica with specified name. Verify that availability group name and availability replica name are c
35212	16	The %ls operation is not allowed by the current availability-group configuration. This operation would exceed the maximum number of %d synchronous-commit availability replicas in availability group '%.*ls'. Change one of the existing synchronous-commit
35213	16	The %ls operation is not allowed by the current availability-group configuration. This operation would exceed the maximum number of %d automatic failover targets in availability group '%.*ls'. Change one of the existing synchronous-commit replicas to th
35217	16	The thread pool for AlwaysOn Availability Groups was unable to start a new worker thread because there are not enough available worker threads. This may degrade AlwaysOn Availability Groups performance. Use the "max worker threads" configuration option
35220	16	Could not process the operation. AlwaysOn Availability Groups replica manager is waiting for the host computer to start a Windows Server Failover Clustering (WSFC) cluster and join it. Either the local computer is not a cluster node, or the local cluster
35221	16	Could not process the operation. AlwaysOn Availability Groups replica manager is disabled on this instance of SQL Server. Enable AlwaysOn Availability Groups, by using the SQL Server Configuration Manager. Then, restart the SQL Server service, and retry t
35222	16	Could not process the operation. AlwaysOn Availability Groups does not have permissions to access the Windows Server Failover Clustering (WSFC) cluster. Disable and re-enable AlwaysOn Availability Groups by using the SQL Server Configuration Manager. The
35223	16	Cannot add %d availability replica(s) to availability group '%.*ls'. The availability group already contains %d replica(s), and the maximum number of replicas supported in an availability group is %d.
35224	16	Could not process the operation. AlwaysOn Availability Groups failed to load the required Windows Server Failover Clustering (WSFC) library. Verify that the computer is a node in a WSFC cluster. You will need to restart the SQL Server instance to reload
35225	16	Could not process the operation. The instance of SQL Server is running under WOW64 (Windows 32-bit on Windows 64-bit), which does not support AlwaysOn Availability Groups. Reinstall SQL Server in the native 64-bit edition, and re-enable AlwaysOn Availab
35226	16	Could not process the operation. AlwaysOn Availability Groups has not started because the instance of SQL Server is not running as a service. Restart the server instance as a service, and retry the operation.
35228	16	The attempt to set the failure condition level for availability group '%.*ls' failed. The specified level value is out of the valid range [%u, %u]. Reenter the command specifying a valid failure condition level value.
35229	16	The attempt to set the health check timeout value for availability group '%.*ls' failed. The specified timeout value is less than %u milliseconds. Reenter the command specifying a valid health check timeout value.
35233	16	Cannot create an availability group containing %d availability replica(s).
35237	16	None of the specified replicas for availability group %.*ls maps to the instance of SQL Server to which you are connected. Reenter the command, specifying this server instance to host one of the replicas. This replica will be the initial primary replica.
35238	16	Database '%.ls' cannot be added to availability group '%.ls'. The database does not exist on this SQL Server instance. Verify that the database name is correct, then retry the operation.
35239	16	The ALTER DATABASE <database-name> SET HADR SUSPEND (or SET HADR RESUME) statement failed on database '%.ls' of availability group '%.ls''. Either the availability group does not contain the specified database, or the database has not joined the availab
35240	16	Database '%.ls' cannot be joined to or unjoined from availability group '%.ls'. This operation is not supported on the primary replica of the availability group.	TSQL to unjoin is “ALTER DATABASE db SET HADR OFF”. Note that failing over thus making it a secondary, and then retrying ALTER DATABASE HADR is not an option, because the alter database will then give error 921 that database has not been recovered yet. If this replica can be made a secondary, then fail over thus making it a secondary, and then drop replica when in this secondary role. TSQL is “ALTER AVAILABILITY GROUP AgName REMOVE REPLICA ON ‘ReplicaName’;”. If no secondary replica for failover and getting this error on primary, then “DROP AVAILABILITY GROUP AgName”. This will automatically drop listener. If do not want listener dropped, then create a temporary AG adding same name, then try the “DROP AVAILABILITY GROUP”.
35242	16	Cannot complete this ALTER DATABASE <database-name> SET HADR operation on database '%.*ls'. The database is not joined to an availability group. After the database has joined the availability group, retry the command.
35243	16	Failed to set resource property '%.ls' for availability group '%.ls'. The operation encountered SQL Server error %d. When the cause of the error has been resolved, retry the ALTER AVAILABILITY GROUP command later.
35244	16	Database '%.ls' cannot be added to availability group '%.ls'. The database is currently joined to another availability group. Verify that the database name is correct and that the database is not joined to an availability group, then retry the operati
35246	16	Failed to create the availability group. A SQL Server instance name could not be validated because the dynamic link library (DLL) file '%ls' could not be located (Windows System Error %d). Verify that the specified server instance exists. If it exists, th
35247	16	Failed to create the availability group. A SQL Server instance name could not be validated because the dynamic link library (DLL) file '%ls' could not be loaded (Windows System Error %d).
35249	16	An attempt to add or join a system database, '%.*ls', to an availability group failed. Specify only user databases for this operation.
35254	16	An error occurred while accessing the availability group metadata. Remove this database or replica from the availability group, and reconfigure the availability group to add the database or replica again. For more information, see the ALTER AVAILABILITY G
35257	16	AlwaysOn Availability Groups Send Error (Error code 0x%X, "NOT OK") was returned when sending a message for database ID %d. If the partner is running and visible over the network, retry the command using correctly configured partner-connection parameters
35259	16	Database '%.*ls' is already participating in a different availability group.
35260	16	During an attempted database recovery, an availability database manager was not found for database id %d with availability group ID %d and group database ID %ls. Recovery was terminated. The most likely cause of this error is that the availability group
35261	16	Attempt to perform an AlwaysOn Availability Groups operation on a system database, '%ls', failed. System databases are not supported by AlwaysOn Availability Groups.
35262	17	Skipping the default startup of database '%.*ls' because the database belongs to an availability group (Group ID: %d). The database will be started by the availability group. This is an informational message only. No user action is required.	Check SQL and Windows logs for additional messages if expecting database to start.
35264	10	AlwaysOn Availability Groups data movement for database '%.ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.ls'). To resume data movement on the database, you will need to resume the database manually. For infor	Link redirects to http://msdn.microsoft.com/en-us/library/ff877956.aspx and the mention steps should be followed to resume data movement.
35265	10	AlwaysOn Availability Groups data movement for database '%.*ls' has been resumed. This is an informational message only. No user action is required.
35266	10	AlwaysOn Availability Groups connection with %S_MSG database established for %S_MSG database '%.*ls' on the availability replica with Replica ID: {%.8x-%.4x-%.4x-%.2x%.2x-%.2x%.2x%.2x%.2x%.2x%.2x}. This is an informational message only. No user action is
35267	10	AlwaysOn Availability Groups connection with %S_MSG database terminated for %S_MSG database '%.*ls' on the availability replica with Replica ID: {%.8x-%.4x-%.4x-%.2x%.2x-%.2x%.2x%.2x%.2x%.2x%.2x}. This is an informational message only. No user action is r
35275	16	A previous RESTORE WITH CONTINUE_AFTER_ERROR operation or being removed while in the SUSPECT state from an availability group left the '%.*ls' database in a potentially damaged state. The database cannot be joined while in this state. Restore the database
35276	17	Failed to allocate and schedule an AlwaysOn Availability Groups task for database '%ls'. Manual intervention may be required to resume synchronization of the database. If the problem persists, you might need to restart the local instance of SQL Server.
35279	16	The attempt to join database '%.*ls' to the availability group was rejected by the primary database with error '%d'. For more information, see the SQL Server error log for the primary replica.
35280	16	Database '%.ls' cannot be added to availability group '%.ls'. The database is already joined to the specified availability group. Verify that the database name is correct and that the database is not joined to an availability group, then retry the ope
35281	16	Database '%.ls' cannot be removed from availability group '%.ls'. The database is not joined to the specified availability group. Verify that the database name and the availability group name are correct, then retry the operation.
35282	16	Availability replica '%.ls' cannot be added to availability group '%.ls'. The availability group already contains an availability replica with the specified name. Verify that the availability replica name and the availability group name are correct, t
35283	16	Availability replica '%.ls' cannot be removed from availability group '%.ls'. The availability group does not contain an availability replica with the specified name. Verify that the availability replica name is correct, then retry the operation.
35284	16	Availability replica '%.ls' cannot be removed from availability group '%.ls', because this replica is on the local instance of SQL Server. If the local availability replica is a secondary replica, connect to the server instance that is currently hostin
35287	16	AlwaysOn Availability Groups transport for availability database "%.*ls" failed to decompress the log block whose LSN is %S_LSN. This error can be caused by a corrupt network packet or a compression version mismatch. The database replica has been put in
35288	16	AlwaysOn Availability Groups log apply for availability database "%.*ls" has received an out-of-order log block. The expected LSN was %S_LSN. The received LSN was %S_LSN. The database replica has been put into the SUSPENDED state. Resume the availabil
35299	10	Nonqualified transactions are being rolled back in database %.*ls for an AlwaysOn Availability Groups state change. Estimated rollback completion: %d%%. This is an informational message only. No user action is required.
41039	16	An availability group replica already exists on the Windows Server Failover Clustering (WSFC) node '%.*ls'. Each WSFC node can contain only one replica of an availability group. Please choose another WSFC node to host the new replica.
41040	16	Failed to remove the availability group replica '%.ls' from availability group '%.ls'. The availability group does not contain a replica with the specified name. Verify the availability group and replica names and then retry the operation.
41042	16	The availability group '%.*ls' already exists. This error could be caused by a previous failed CREATE AVAILABILITY GROUP or DROP AVAILABILITY GROUP operation. If the availability group name you specified is correct, try dropping the availability group a	If AG still exists, DROP AVAILABILITY GROUP. Open regedit, backup the entry from the old Availability Group that starts with HADR_Ag_???? key, then delete this key [to be tested]. Try with new name for Availability Group.
41043	16	For availability group '%.*ls', the value of the name-to-ID map entry is invalid. The binary value should contain a Windows Server Failover Clustering (WSFC) resource ID, a WSFC group ID, and their corresponding lengths in characters. The availability g
41044	16	Availability group name to ID map entry for availability group '%.*ls' cannot be found in the Windows Server Failover Clustering (WSFC) store. The availability group name may be incorrect, or the availability group may not exist in this Windows Server Fa
41045	16	Cannot add database '%.ls' to the availability group '%.ls', because there is already a database with the same name in the availability group. Please verify that the database and availability group names specified are correct.
41046	16	Cannot add replica '%.ls' to the availability group '%.ls', because there is already a replica with the same name in the availability group. Please verify the replica and availability group names specified are correct.
41048	10	AlwaysOn Availability Groups: Local Windows Server Failover Clustering service has become unavailable. This is an informational message only. No user action is required.
41049	10	AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.
41050	10	AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
41051	10	AlwaysOn Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required.
41052	10	AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required.
41053	10	AlwaysOn Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required.
41054	10	AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to come online. This is an informational message only. No user action is required.
41055	10	AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required.
41056	16	Availability replica '%.ls' of availability group '%.ls' cannot be brought online on this SQL Server instance. Another replica of the same availability group is already online on the local Windows Server Failover Clustering (WSFC) node. Each WSFC node
41058	10	AlwaysOn: The local replica of availability group '%.*ls' is starting. This is an informational message only. No user action is required.
41059	10	AlwaysOn: Availability group '%.*ls' was removed while the availability replica on this instance of SQL Server was offline. The local replica will be removed now. This is an informational message only. No user action is required.
41060	16	The Cyclic Redundancy Check (CRC) value generated for the retrieved availability group configuration data from the Windows Server Failover Clustering (WSFC) store does not match that stored with the data for the availability group with ID '%.*ls'. The av
41061	10	AlwaysOn: The local replica of availability group '%.*ls' is stopping. This is an informational message only. No user action is required.
41062	16	The ID of availability group '%.*ls' in local data store is inconsistent with that in the Windows Server Failover Clustering (WSFC) data store. The availability group may have been dropped and recreated while the SQL Server instance was offline, or while
41063	16	Windows Server Failover Clustering (WSFC) detected that the availability group resource with ID '%.*ls' was online when the availability group was not actually online. The attempt to synchronize the WSFC resource state with the availability group state f
41066	16	Cannot bring the Windows Server Failover Clustering (WSFC) resource (ID '%.*ls') online (Error code %d). The WSFC service may not be running or may not be accessible in its current state, or the WSFC resource may not be in a state that could accept the r	In DOS prompt, run “Net helpmsg” on the %d value, and proceed accordingly.
41069	16	The existence of availability group data for the availability group '%.*ls' in the Windows Server Failover Clustering (WSFC) store could not be determined. The local WSFC node may be down, or a previous CREATE AVAILABILITY GROUP or DROP AVAILABILITY GROU
41070	16	Configuration data for the availability group with Windows Server Failover Clustering (WSFC) resource ID '%.*ls' is not found in the WSFC data store. The availability group may have been dropped, or a previous CREATE AVAILABILITY GROUP or DROP AVAILABILI
41071	16	Cannot read the persisted configuration of AlwaysOn availability group with corresponding Windows Server Failover Clustering (WSFC) resource ID '%.*ls'. The persisted configuration is written by a higher-version SQL Server that hosts the primary availabi
41072	16	The ID of availability group '%.*ls' in local data store does not exist in the Windows Server Failover Clustering (WSFC) data store. The availability group may have been dropped but the current WSFC node was not notified. To resolve this error, try to r
41073	16	The database '%.ls' cannot be removed from availability group '%.ls'. This database does not belong to the availability group.
41074	10	AlwaysOn: The local replica of availability group '%.*ls' is preparing to transition to the primary role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is required	This is an informational message only. No user action is required
41075	10	AlwaysOn: The local replica of availability group '%.*ls' is preparing to transition to the resolving role in response to a request from the Windows Server Failover Clustering (WSFC) cluster. This is an informational message only. No user action is requir
41076	10	AlwaysOn: Availability group '%.*ls' is going offline because it is being removed. This is an informational message only. No user action is required.
41080	16	Failed to delete SQL Server instance name to Windows Server Failover Clustering node name map entry for the local availability replica of availability group '%.*ls'. The operation encountered SQL Server error %d and has been terminated. Refer to the SQL
41081	16	Failed to destroy the Windows Server Failover Clustering group corresponding to availability group '%.*ls'. The operation encountered SQL Server error %d and has been terminated. Refer to the SQL Server error log for details about this SQL Server error
41089	10	AlwaysOn Availability Groups startup has been cancelled, because SQL Server is shutting down. This is an informational message only. No user action is required.
41091	10	AlwaysOn: The local replica of availability group '%.*ls' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
41092	10	AlwaysOn: The availability replica manager is going offline because %ls. This is an informational message only. No user action is required.
41093	10	AlwaysOn: The local replica of availability group '%.*ls' is going offline because the corresponding resource in the Windows Server Failover Clustering (WSFC) cluster is no longer online. This is an informational message only. No user action is required.
41094	10	AlwaysOn: The local replica of availability group '%.*ls' is restarting because the existing primary replica restarted or the availability group failed over to a new primary replica. This is an informational message only. No user action is required.
41095	10	AlwaysOn: Explicitly transitioning the state of the Windows Server Failover Clustering (WSFC) resource that corresponds to availability group '%.*ls' to Failed. The resource state is not consistent with the availability group state in the instance of SQL
41096	10	AlwaysOn: The local replica of availability group '%.*ls' is being removed. The instance of SQL Server failed to validate the integrity of the availability group configuration in the Windows Server Failover Clustering (WSFC) store. This is expected if th
41097	10	AlwaysOn: The local replica of availability group '%.*ls' is going offline. This replica failed to read the persisted configuration because of a version mismatch. This is an informational message only. No user action is required.
41098	10	AlwaysOn: The local replica of availability group '%.*ls' is restarting, because it failed to read the persisted configuration. This is an informational message only. No user action is required.
41099	10	AlwaysOn: The local replica of availability group '%.*ls' is going offline. This replica failed to read the persisted configuration, and it has exhausted the maximum for restart attempts. This is an informational message only. No user action is required.
41100	16	The availability group '%.*ls' and/or its local availability replica does not exist. Verify that the specified availability group name is correct, and that the local availability replica has joined the availability group, then retry the operation.
41101	16	The availability group with Windows Server Failover Clustering resource ID '%.*ls' and/or its local availability replica does not exist. Verify that the specified availability resource ID is correct, and that the local availability replica has joined the
41102	10	Failed to persist configuration data of availability group '%.*ls' in the Windows Server Failover Clustering (WSFC) cluster. The local availability replica either is not the primary replica or is shutting down.
41103	10	Startup of the AlwaysOn Availability Replica Manager has been terminated, because the 'FixQuorum' property of Windows Server Failover Clustering (WSFC) is not present. The prerequisite QFE hotfix, KB 2494036, might not yet be installed on your Windows Ser
41104	16	Failover of the availability group '%.*ls' to the local replica failed because the availability group resource did not come online due to a previous error. To identify that error, check the SQL Server error log and the Windows Server Failover Cluster logs
41106	16	Cannot create an availability replica for availability group '%.*ls'. An availability replica of the specified availability group already exists on this instance of SQL Server. Verify that the specified availability group name is correct and unique, the
41107	16	Availability group '%.*ls' failed to create necessary events for the WSFC Lease mechanism. Windows returned error code (%d) when obtaining handles for Lease events. Resolve the windows error and retry the availability group operation.
41108	16	An error occurred while removing availability group '%.*ls'. The DROP AVAILABILITY GROUP command removed the availability group configuration from the local metadata. However, the attempt to remove this configuration from the Windows Server Failover Clust
41109	17	Could not enqueue a task (SQL OS error: 0x%x) for process actions for the availability group '%.*ls'. Most likely, the instance of SQL Server is low on resources. Check the SQL Server error log to determine the cause of the failure. Retry the operation l
41110	10	AlwaysOn: The availability replica manager is starting. This is an informational message only. No user action is required.
41111	10	AlwaysOn: The availability replica manager is waiting for the instance of SQL Server to allow client connections. This is an informational message only. No user action is required.
41112	16	A Windows Server Failover Clustering (WSFC) API required by availability groups has not been loaded. AlwaysOn Availability Groups replica manager is not enabled on the local instance SQL Server. If the server instance is running an edition of SQL Server t
41113	16	Cannot failover availability group '%.*ls' to this instance of SQL Server because a failover command is already pending on the local replica of this availability group. Wait for the pending failover command to complete before issuing another command on t
41114	16	Cannot create an availability group named '%.*ls' because it already exists in a system table.
41115	16	Cannot create the availability group named '%.ls' because its availability group ID (ID: '%.ls') already exists in a system table.
41116	16	Cannot create an availability group named '%.ls' with replica ID '%.ls' because this ID already exists in a system table.
41117	16	Cannot map local database ID %d to the availability database ID '%.ls' within availability group '%.ls'. This database is already mapped to an availability group.
41118	16	Cannot map database ID %d to the availability database ID '%.ls' within availability group '%.ls'. Another local database, (ID %d). is already mapped to this availability database.
41119	16	Could not find the availability group ID %d in the system table.
41121	10	The local availability replica of availability group '%.*ls' cannot accept signal '%s' in its current replica role, '%s', and state (configuration is %s in Windows Server Failover Clustering store, local availability replica has %s joined). The availabil
41122	16	Cannot failover availability group '%.*ls' to this instance of SQL Server. The local availability replica is already the primary replica of the availability group. To failover this availability group to another instance of SQL Server, run the failover c
41126	16	Operation on the local availability replica of availability group '%.*ls' failed. The local copy of the availability group configuration does not exist or has not been initialized. Verify that the availability group exists and that the local copy of the
41127	16	Attempt to set database mapping state where the local database id %d is not mapped to any availability group.
41128	16	Failed to perform database operation '%s' on database '%.ls' (ID %d) in availability group '%.ls'. The database might be in an incorrect state for the operation. If the problem persists, you may need to restart the SQL Server instance.
41129	16	Failed to schedule or execute database operation '%s' on database '%.ls' (Database ID: %d) in availability group '%.ls' (SQL OS error: %d). The instance of SQL Server may have insufficient resources to carry out the database operation. If the problem p
41130	16	Operation '%s' on a database '%.ls' (Database ID: %d) in availability group '%.ls' failed with SQL Server error %d (Error details: "%.*ls"). The operation has been rolled back. See previous error messages in the SQL Server error log for more details.
41131	10	Failed to bring availability group '%.*ls' online. The operation timed out. Verify that the local Windows Server Failover Clustering (WSFC) node is online. Then verify that the availability group resource exists in the WSFC cluster. If the problem persis
41132	16	Cannot join database '%.ls' to availability group '%.ls'. The specified database does not belong to the availability group. Verify the names of the database and the availability group, and retry the command specifying the correct names.
41133	10	Cannot remove database '%.ls' from availability group '%.ls'. Either the database does not belong to the availability group, or the database has not joined the group. Verify the database and availability group names, and retry the command.
41134	16	Cannot bring the availability group '%.*ls' online. The local instance was not the previous primary replica when the availability group went offline, not all databases are synchronized, and no force failover command was issued on the local availability r
41135	10	Startup of AlwaysOn Availability Groups replica manager failed due to SQL Server error %d. To determine the cause of this error, check the SQL Server error log for the preceding error.
41136	16	Failed to join the availability replica to availability group '%.*ls' because the group is not online. Either bring the availability group online, or drop and recreate it. Then retry the join operation.
41137	10	Abandoning a database operation '%ls' on availability database '%.ls' of availability group '%.ls'. The sequence number of local availability replica has changed (Previous sequence number: %u, current sequence number: %u). This is an informational mes
41138	17	Cannot accept AlwaysOn Availability Groups operation operation on database '%.ls' of availability group '%.ls'. The database is currently processing another operation that might change the database state. Retry the operation later. If the condition per
41139	10	Failed to set database information for availability group %.*ls. The local availability replica is not the primary replica, or it is shutting down. This is an informational message only. No user action is required.
41140	16	Availability group '%.*ls' cannot process the ALTER AVAILABILITY GROUP command, because the local availability replica is not the primary replica. Connect to the server instance that is currently hosting the primary replica of this availability group, an
41141	16	Failed to set availability group database information for availability group %.*ls. The local availability replica is not the primary, or is shutting down. This is an informational message only. No user action is required.
41142	16	The availability replica for availability group '%.*ls' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group, or the WSFC cluster was started in Force Quorum	Check if *_FailoverCluster_health_XeLogs confirm that the cluster was started in forced_quorum mode (forced_quorum True in availability_replica_automatic_failover_validation event). When a cluster is started in forced_quorum mode (net.exe start clussvc /forcequorum, OR Failover Cluster Manager>left pane>right-click Windows cluster>choose Force Cluster Start>Confirm by clicking Yes.) on primary replica, the availability group is required to be started by issuing command ALTER AVAILABILITY GROUP <AGNAME> FORCE_FAILOVER_ALLOW_DATA_LOSS;. Because this is the original primary, despite issuing force allow data loss, NO data loss occurs When a cluster is started in forced_quorum mode on secondary replica, first check sys.dm_hadr_database_replica_cluster_states.is_failover_ready on secondary. If this value is 1, and be assured no data loss will occur. If this value is not 1 (for example because it’s asynchronous commit), then can check sys.dm_hadr_database_replica_states.last_commit_time, so as to know estimate start time of data loss. The sys.dm_hadr_database_replica_states.last_hardened_lsn value may be used to consider other secondaries for failover. Then can initiate failover with ALTER AVAILABILITY GROUP <AGNAME> FORCE_FAILOVER_ALLOW_DATA_LOSS; on secondary.
41143	16	Cannot process the operation. The local replica of availability Group '%.*ls' is in a failed state. A previous operation to read or update persisted configuration data for the availability group has failed. To recover from this failure, either restart
41144	16	The local availability replica of availability group '%.*ls' is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: %d). To recover from this failure, either restart the local Windows Server Failov
41145	10	Cannot join database '%.ls' to availability group '%.ls'. The database has already joined the availability group. This is an informational message. No user action is required.
41146	16	Failed to bring Availability Group '%.*ls' online. The Windows Server Failover Clustering (WSFC) service may not be running, or it may not be accessible in its current state. Please verify the local WSFC node is up and then retry the operation.
41147	10	AlwaysOn Availability Groups was not started because %ls. This is an informational message. No user action is required.
41148	16	Cannot add or join database '%.ls' to availability group '%.ls'. The database does not exist on this instance of SQL Server. Verify the database name and that the database exists on the server instance. Then retry the operation, specifying the correc
41149	16	Operation on the availability group '%.*ls' has been cancelled or terminated, either because of a connection timeout or cancellation by user. This is an informational message. No user action is required.
41150	16	Failed to take availability group '%.*ls' offline. The Windows Server Failover Clustering (WSFC) service may not be running, or it may not be accessible in its current state. Verify the local WSFC node is up and then retry the operation.
41151	16	Error accessing the Availability Groups manager. The local Availability Groups manager has not been initialized. Wait until the Availability Groups manager is in a state that allows access, and retry the operation.
41152	16	Failed to create availability group '%.*ls'. The operation encountered SQL Server error %d and has been rolled back. Check the SQL Server error log for more details. When the cause of the error has been resolved, retry CREATE AVAILABILITY GROUP command	Check this document for error number mentioned in %d.
41153	16	Failed to create availability group '%.*ls'. The operation encountered SQL Server error %d. An attempt to roll back the operation failed. Check the SQL Server error log for more details. Then execute the DROP AVAILABILITY GROUP command to clean up any
41154	16	Cannot failover availability group '%.*ls' to this SQL Server instance. The availability group is still being created. Verify that the specified availability group name is correct. Wait for CREATE AVAILABILITY GROUP command to finish, then retry the op
41155	16	Cannot failover availability group '%.*ls' to this instance of SQL Server. The availability group is being dropped. Verify that the specified availability group name is correct. The availability group may need to be recreated if the drop operation was
41156	16	Cannot drop availability group '%.*ls' from this instance of SQL Server. The availability group is either being dropped, or the local availability replica is being removed from the availability group. Verify that the specified availability group name is
41157	16	Cannot remove the local availability replica from availability group '%.*ls' from this instance of SQL Server. The availability group is either being dropped, or the local availability replica is being disjoined. Verify that the specified availability g
41158	16	Failed to join local availability replica to availability group '%.*ls'. The operation encountered SQL Server error %d and has been rolled back. Check the SQL Server error log for more details. When the cause of the error has been resolved, retry the A	Check for SQL/Windows messages around same time.
41159	16	Failed to join local availability replica to availability group '%.*ls'. The operation encountered SQL Server error %d. An attempt to rollback the operation failed. Check SQL Server error log for more details. Run DROP AVAILABILITY GROUP command to cl
41160	16	Failed to designate the local availability replica of availability group '%.*ls' as the primary replica. The operation encountered SQL Server error %d and has been terminated. Check the preceding error and the SQL Server error log for more details about
41161	16	Failed to validate the Cyclic Redundancy Check (CRC) of the configuration of availability group '%.*ls'. The operation encountered SQL Server error %d, and the availability group has been taken offline to protect its configuration and the consistency of
41162	16	Failed to validate sequence number of the configuration of availability group '%.*ls'. The in-memory sequence number does not match the persisted sequence number. The availability group and/or the local availability replica will be restarted automatical
41163	16	An error occurred while waiting for the local availability replica of availability group '%.*ls' to transition to the primary role. The operation encountered SQL OS error %d and has been terminated. Verify that the Windows Server Failover Clustering (WS
41164	16	An error occurred while waiting for the local availability replica of availability group '%.*ls' to transition to the resolving role. The operation encountered SQL OS error %d and has been terminated. Verify that the Windows Server Failover Clustering (
41165	16	A timeout error occurred while waiting to access the local availability replica of availability group '%.*ls'. The availability replica is currently being accessed by another operation. Wait for the in-progress operation to complete, and then retry the
41166	16	An error occurred while waiting to access the local availability replica of availability group '%.*ls'. The operation encountered SQL OS error %d, and has been terminated. Verify that the local availability replica is in the correct state, and then retr
41167	16	An error occurred while attempting to access availability replica '%.ls' in availability group '%.ls'. The availability replica is not found in the availability group configuration. Verify that the availability group and availability replica names are
41168	16	An error occurred while attempting to access availability replica with ID '%.ls' in availability group '%.ls'. The availability replica is not found in the availability group configuration. Verify that the availability group name and availability repl
41169	16	An error occurred while attempting to access the availability group database with ID '%.ls' in availability group '%.ls'. The availability database is not found in the availability group configuration. Verify that the availability group name and avail
41170	10	Post-online processing for availability group '%.*ls' has been terminated. Either post-online processing has already completed, the local availability replica is no longer the primary replica, or the availability group is being dropped. This is an infor
41171	16	Failed to create availability group '%.*ls', because a Windows Server Failover Cluster (WSFC) group with the specified name already exists. The operation has been rolled back successfully. To retry creating an availability group, either remove or rename
41172	16	An error occurred while dropping availability group '%.*ls' from Windows Server Failover Clustering (WSFC) cluster and from the local metadata. The operation encountered SQL OS error %d, and has been terminated. Verify that the specified availability gr	In DOS prompt, run “Net helpmsg” on the %d value, and proceed accordingly.
41173	16	An error occurred while removing the local availability replica from availability group '%.*ls'. The operation encountered SQL OS error %d, and has been terminated. Verify that the specified availability group name is correct, and then retry the command
41176	10	Failed to acquire exclusive access to local availability group configuration data (SQL OS error: %d). If the problem persists, you might need to restart the instance of SQL Server.
41177	16	The availability replica of the specified availability group '%.*ls' is being dropped. Wait for the completion of the drop command and retry the operation later.
41178	16	Cannot drop availability group '%.*ls' from this SQL Server instance. The availability group is currently being created. Verify that the specified availability group name is correct. Wait for the current operation to complete, then retry the command if
41179	16	Cannot remove the local availability replica from availability group '%.*ls' from this instance of SQL Server. The availability group is currently being created. Verify that the specified availability group name is correct. Wait for the current operati
41180	16	Attempt to access non-existent or uninitialized availability group with ID '%.*ls'. This is usually an internal condition, such as the availability group is being dropped or the local WSFC node has lost quorum. In such cases, and no user action is requir
41181	16	The local availability replica of availability group '%.*ls' did not become primary. A concurrent operation may have changed the state of the availability group in Windows Server Failover Cluster. Verify that the availability group state in Windows Serv
41182	16	Failed to set the local availability replica of availability group '%.*ls' as joined in Windows Server Failover Clustering (WSFC) database. Either the local availability replica is no longer the primary, or the WSFC service is not accessible. Verify tha
41183	16	Failed to modify availability replica options for availability group '%.*ls'. Before the availability group configuration could be updated, the operation encountered SQL Server error %d. The operation has been rolled back. Refer to the SQL Server error
41184	16	Failed to modify availability replica options for availability group '%.*ls'. The availability group configuration has been updated. However, the operation encountered SQL Server error %d while applying the new configuration to the local availability re
41185	10	Replica option specified in ALTER AVAILABILITY GROUP '%.*ls' MODIFY DDL is same is cached availability group configuration.
41186	16	Availability group '%.*ls' cannot process an ALTER AVAILABILITY GROUP command at this time. The availability group is still being created. Verify that the specified availability group name is correct. Wait for CREATE AVAILABILITY GROUP command to finis
41187	16	Availability group '%.*ls' cannot process an ALTER AVAILABILITY GROUP command at this time. The availability group is being dropped. Verify that the specified availability group name is correct. The availability group may need to be recreated if it was
41188	16	Availability group '%.*ls' failed to process %s-%s command. The operation encountered SQL Server error %d before the availability group configuration could be updated, and has been rolled back. Refer to the SQL Server error log for details. Verify that
41189	16	Availability group '%.*ls' failed to process the %s-%s command. The availability group configuration has been updated. However, the operation encountered SQL Server error %d while applying the new configuration to the local availability replica, and has
41190	16	Availability group '%.*ls' failed to process %s-%s command. The local availability replica is not in a state that could process the command. Verify that the availability group is online and that the local availability replica is the primary replica, the
41191	16	The local availability replica of availability group '%.*ls' cannot become the primary replica. The last-known primary availability replica is of a higher version than the local availability replica. Upgrade the local instance of SQL Server to the same
41192	17	Creating and scheduling a worker task for AlwaysOn Availability Groups failed due to lack of resources (SQL OS error %d). Processing of new actions might be delayed or stalled until the resource limits are resolved. Reduce the memory or thread count on
41193	10	Cannot join database '%.ls' to availability group '%.ls'. The database is in the process of being removed from the availability group. When the remove-database operation completes, the database will no longer be joined to the availability group. Then
41194	16	An error occurred while waiting for the local availability replica for availability group '%.*ls' to complet post-online work. The operation encountered SQL OS error %d and has been terminated. Verify that the Windows Server Failover Clustering (WSFC) c
41195	16	Availability group '%.*ls' failed to process the WSFC lease-renewal command. The local availability replica lease is no longer valid to process the lease renewal command. Availability replica lease expired. This is an informational message only. No user a
41196	16	Failed to create availability group '%.*ls', because a Windows Server Failover Cluster (WSFC) group with the specified name already exists. An attempt to rollback the operation failed. Check the SQL Server error log for more details. To manually clean
41199	16	The specified command is invalid because the AlwaysOn Availability Groups feature is not supported by this edition of SQL Server. For information about features supported by the editions of SQL Server, see SQL Server Books Online.
41402	16	The WSFC cluster is offline, and this availability group is not available. This issue can be caused by a cluster service issue or by the loss of quorum in the cluster.
41403	16	Availability group is offline.
41404	16	The availability group is offline, and is unavailable. This issue can be caused by a failure in the server instance that hosts the primary replica or by the WSFC availability group resource going offline.
41405	16	Availability group is not ready for automatic failover.
41406	16	The availability group is not ready for automatic failover. The primary replica and a secondary replica are configured for automatic failover, however, the secondary replica is not ready for an automatic failover. Possibly the secondary replica is unavail
41408	16	In this availability group, at least one secondary replica has a NOT SYNCHRONIZING synchronization state and is not receiving data from the primary replica.
41410	16	In this availability group, at least one synchronous replica is not currently synchronized. The replica synchronization state could be either SYNCHONIZING or NOT SYNCHRONIZING.
41412	16	In this availability group, at least one availability replica does not currently have the primary or secondary role.
41414	16	In this availability group, at least one secondary replica is not connected to the primary replica. The connected state is DISCONNECTED.
41424	16	This secondary database is not joined to the availability group. The configuration of this secondary database is incomplete. For information about how to join a secondary database to an availability group, see SQL Server Books Online.
41428	16	This secondary replica is not joined to the availability group. For an availability replica to be successfully joined to the availability group, the join state must be Joined Standalone Instance (1) or Joined Failover Cluster (2). For information about ho

SQL Server AlwaysOn (SQL Server Configuration Manager)

An error occurred while loading the AlwaysOn High Availability properties [return code: 0x80070005].

You may receive this error if you don't have the appropriate permission. Try right click ‘SQL Server Configuration Manager’ and select ‘Run as Administrator’.

The AlwaysOn Availability Groups feature requires the x86(non-WOW) or x64 Enterprise Edition of SQL Server 2012 (or later version) running on Windows Server 2008 (or later version) with WSFC hotfix KB 2494036 installed. This SQL Server edition and/or Windows Server System does not meet one or more of the requirements

Verify OS is Windows 2008 or later version.
If OS is Windows 2008 or Windows 2008 R2, install Windows hotfix KB 2494036, if not already installed.

SQL Server AlwaysOn (New Availability Group wizard errors)

The current WSFC cluster quorum vote configuration is not recommended for this availability group.

For more information, see the following topic in SQL Server Books Online: http://go.microsoft.com/fwlink/?LinkId=224159.

When validating WSFC quorum vote configuration, the AlwaysOn Availability Group Wizard shows a warning if any of the following conditions are true:
- The cluster node that hosts the primary replica does not have a vote. If this vote is not required for that environment, then this message may be ignored. Powershell command to check vote is “Get-ClusterNode | f1 Name, Nodeweight”.
- A secondary replica is configured for automatic failover and its cluster node does not have a vote. If this vote is not required for that environment, then this message may be ignored.
- KB2494036 is not installed on all cluster nodes that host availability replicas. This patch is required to add or remove votes for cluster nodes in multi-site deployments. However, in single-site deployments, it is usually not required and you may safely ignore the warning.
- While setting up the availability group in the wizard, you configured a replica to asynchronous availability mode. This warning is only reported in SQL 2012 RTM, not in SP1 of SQL 2012 RTM.
- While setting up the availability group in the wizard, you configured a replica for manual failover mode. This warning is only reported in SQL 2012 RTM, not in SP1 of SQL 2012 RTM.
Check the link. Redirects to http://msdn.microsoft.com/en-us/library/hh270280.aspx

Checking for the database files on the secondary replica resulted in an error. (Microsoft.SqlServer.Management.HadrTasks)

------------------------------

Program Location:

at Microsoft.SqlServer.Management.Hadr.TestDatabaseFileExisting.DoWork()

at Microsoft.SqlServer.Management.TaskForms.SimpleWorkItem.Run()

===================================

The following database files already exist on the server instance that hosts secondary replica HA-VM3\TEST1:

\\path\file.mdf

\\path\file_log.ldf

(Microsoft.SqlServer.Management.HadrTasks)

------------------------------

Program Location:

at Microsoft.SqlServer.Management.Hadr.TestDatabaseFileExisting.DoWork()

Check if database-file paths are identical on source and on destination servers. If different, manual (rather than using AG wizard) restore may be required with WITH MOVE syntax.

SQL Server AlwaysOn (performance issues)

983

Unable to access database '%.*ls' because its replica role is RESOLVING which does not allow connections. Try the operation again later.

Check SQL ERRORLOGs, event logs, etc. for network, storage related messages etc.

3402

The database '%ls' is marked %ls and is in a state that does not allow recovery to be run.

If the second %ls is RESTORING, then one option is below.

1) Asked the customer to stop the SQL Server Service.

2) Had the customer rename the actual files on disk with a "_old" suffix so that the sql server would not find them and have the db come up suspect.

3) Asked the customer to start the SQL Server Service.

Once the server came back up we were able to delete the database.

4) Had the customer create a new database with the same name and location as the original database. Used the output from sysaltfiles to verify this.

5) Had the customer stop the SQL Server Service.

6) Had the customer rename these new database files by appending a "_new" on the end of the file.

7) Had the customer take the old database files with the "_old" suffix and changed them back to the original name.

8) Had the customer start the SQL Server Service.

Slow synchronization. Waittime for HADR_SYNC_COMMIT grows anywhere from 500ms to 900ms (compared to less than 15-20 ms).

If KB2723814 not applied, then try the KB workaround of suspend secondary replica and then resume, so that AlwaysOn knows that availability mode has changed back to synchronous commit.

SQL Server AlwaysOn errors (other)

19471	16	The WSFC cluster could not bring the Network Name resource with DNS name '%ls' online. The DNS name may have been taken or have a conflict with existing name services, or the WSFC cluster service may not be running or may be inaccessible. Use a different	Check for associated errors like error 19476.
19476	16	The attempt to create the network name and IP address for the listener failed. The WSFC service may not be running or may be inaccessible in its current state, or the values provided for the network name and IP address may be incorrect. Check the state of the WFSC cluster and validate the network name and IP address with the network administrator.	Check for associated errors like error 41009.
41009	16	The Windows Server Failover Clustering (WSFC) resource control API returned error code %d. The WSFC service may not be running or may not be accessible in its current state, or the specified arguments are invalid. For information about this error code,	Net helpmsg on the %d value, and proceed accordingly. May also be accompanied by Microsoft-Windows-FailoverClustering messages in event log. Check cluster log. If there error is in cluster in Windows Azure, then it may be because Windows Azure currently does not support more than one IP address per VM, which is required by the availability group listener to resolve to the availability group owner node. This is documented in http://msdn.microsoft.com/en-us/library/jj870962.aspx (High Availability and Disaster Recovery for SQL Server in Windows Azure Virtual Machines).
41015	16	Failed to obtain the Windows Server Failover Clustering (WSFC) node handle (Error code %d). The WSFC service may not be running or may not be accessible in its current state, or the specified cluster node name is invalid. For information about this erro	Check if Windows cluster service is running. Net helpmsg on the %d value, and proceed accordingly.

SQL Server AlwaysOn errors (network)

TCP Provider, error: 0 - An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.

NetStat output may show hundreds of entries in TIME_WAIT state leading to buffer/port exhaustion

Add registry setting for MaxUserPort http://support.microsoft.com/kb/196271
Add registry setting for TCPTIMEWAITDELAY.
App/IIS restart, machine reboot are additional options.

SQL Server AlwaysOn errors (Windows)

5057		The cluster IP address is already in use.	Checked if IP is already in use Check permission for cluster network object CNO.
5942		The resource failed to come online due to the failure of one or more provider resources.	Depending on the resource type, this generally requires involvement of Windows team. If this is a network name or an IP address resource, then try creating a Client Access Point with same name and IP if possible, and see it comes online. If it does not come online, then this requires investigation by Windows team.

SQL Server AlwaysOn errors (cluster log)

[RES] Network Name <SPEPD_Tier2>: Unable to update password for computer account DCR-SPEPD on DC \\ DC-04.xxxx.org, status 5.

[RHS] Online for resource SPEPD_Tier2 failed.

Status 5 is Access Denied
In Windows AD, check for CNO (cluster network object) permissions.
For example, below to grant the "create computer objects" to the cluster virtual host (virtual cluster name) for the OU in Active Directory on "Computers" and "DomainServer" OU that was created. [To be tested]

Instructions [to be tested]

1) Log into the Domain Controller machine.

2) Select Start - Administrative Tools - Active Directory users and computer.

--> Active Directory users and computer screen displays.

3) Display-select advanced features.

4) Select the appropriate domain - Computers.

5) Verify that the cluster virtual hosts.

6) Select the appropriate domain - Computers - right click - Properties.

--> Computers Properties screen appears.

7) Security tab-click the Advanced button.

--> Advanced Security for Computers "screen appears.

8) If you have access permissions tab - cluster virtual host registration registration no "Edit" button if you choose the "add" button.

+ Case "add": conducted a check in the "computer", click the "object type" button, search for cluster virtual hosts.

--> [Permission entries on Computers] screen is displayed.

9) Target: Select this object and all child objects check the permission "create computer objects" and click the "OK" button.

--> Returns to the advanced security for Computers "screen.

10) Click "OK" button after clicking "the apply" button.

--> Returns to the Computers Properties screen.

11) And click the "OK" button and exit the Active Directory users and computer screen.

INFO [RES] Network Name: [NNLIB] NetUserAdd object aglisten11 on DC: \\DC.contoso.com, result: 8557

ERR [RES] Network Name: [NNLIB] Failed to create Computer Object aglisten11 in the Active Directory, error 8557

net helpmsg 8557 says "Your computer could not be joined to the domain. You have exceeded the maximum number of computer accounts you are allowed to create in this domain. Contact your system administrator to have this limit reset or increased."

SQL Server AlwaysOn (patching/updates in SQL 2012)

AG on Standalone Instance

Patching steps:

Patch the secondary replica (B)
Bring the secondary replica online (will be new version)
Log the original synchronization configurations for each replica. Change the secondary replica and primary replica to “Synchronous Commit” mode, waiting for the secondary replica (B) to be “synchronized.”
- This will ensure there is no data loss during failover.
- You can check dashboard or dmv: dm_hadr_database_replica_states for the status
Issue a failover through SSMS or T-SQL to failover the AG to the secondary replica. So now the new primary is B, the new secondary is A
Patch the original primary replica (A)
Bring the original primary replica (A) online (will be new version )
Wait for A to become “Synchronized”
Failover the AG back to A
Change each replica’s synchronization mode to the original configurations you logged in step3.

Caveat list:

Before you patch, you can still keep automatic/manual failover setting no change. Just a reminder: if during patching time, primary is down, automatic failover may fail if the secondary hasn’t completed the patching.

If primary and secondary replicas are in multisubnet, your client may experience a little bit longer time of disconnection or timeout during failover.
Please do remember to switch back to your original synchronization mode.

AG on FCIs (Failover Cluster Instances)

FCI on AG deployment Example

Primary replica(A): FCI1
- Node1 – active
- Node2 - passive
Secondary Replica(B): FCI2
- Node3 – active
- Node4 - passive

Choose between one of the following two options:

Longer downtime, less steps

(FCI rolling patching).

Patching steps (basic):

Patch Node4
Move FCI2 from Node3 to Node4
Patch Node3
Move FCI2 from Node4 to Node3
Patch Node2
Move FCI1 from Node1 to Node2
Patch Node1
Move FCI1 from Node2 to Node1

Optimized downtime, more steps

(Leverage AG failover patching).

Patching steps (optimized):

Patch Node4
Move FCI2 from Node3 to Node4
Patch Node3
Move FCI2 from Node4 to Node3
Log the original synchronization configurations for each replica. Change the secondary replica and primary replica to “Synchronous Commit” mode, waiting for the secondary replica (B) to be “synchronized”.
Manually failover AG from FCI1 to FCI2 (now the new primary is on FCI2, the new secondary is on FCI1)
Patch Node2
Move FCI1 from Node1 to Node2
Patch Node1
Move FCI1 from Node2 to Node1
Manually failover AG from FCI2 back to FCI1
Change each replica’s synchronization mode to the original configurations you logged in step #5.

Caveat list:

** Please use basic patching steps if primary and secondary are on different data centers/subnets **

FCI rolling patch guarantees zero data lost
Use synchronous commit secondary on a high latency network would impact OLTP performance
AG failover cross subnets might cause up to 20 seconds delay on client first connections.

SQL Server – Add New Replica in AlwaysOn Availability Group

By Enrique 23/03/2016 03/05/2020 SQL SERVER, High Availability / Disaster Recovery

Expand AlwaysOn High Availability > Expand Availability Groups >Expand Availability Groups Name. Right click Availability Replica.

Click Next.

Click Add Replica…

Click Connect All…

Input the instance name of your new replica. Click Connect.

Change ‘Readable Secondary’ to Yes. But if you don’t want your secondary replica to be readable, then just accept the default ‘No’.

Click Endpoints Tab just to verify that the new endpoint has been created.

Go to Backup Preference tab. I just indicated ‘20’ under Backup Priority for the new Replica. We already have an existing secondary Replica already taking the DB backups.

Choose Join Only because I already took a full backup of the primary database and restored it to the new replica WITH NORECOVERY option.

Validation

Verify Database Synchronization.

SQL Server – AlwaysON Availability Group Replica is Not Synchronizing (Error: 976, Severity: 14, State: 1.)

By Enrique 15/02/2016 03/05/2020 SQL SERVER, High Availability / Disaster Recovery

You can lose communication between the primary and secondary replicas for some reason.

There is a policy on SQL Server to check the data synchronization state of the database replica. The policy is in an unhealthy state when the data synchronization state is NOT SYNCHRONIZING or the state is not SYNCHRONIZED for the synchronous-commit database replica.

Based on SQL Server logs, I observed the following messages:

The target database, ‘YourDatabase’, is participating in an availability group and is currently not accessible for queries. Either data movement is suspended or the availability replica is not enabled for read access. To allow read-only access to this and other databases in the availability group, enable read access to one or more secondary availability replicas in the group. For more information, see the ALTER AVAILABILITY GROUP statement in SQL Server Books Online. (Microsoft SQL Server, Error: 976)

According to Technet, this issue can be caused by the following:

The availability replica might be disconnected.
The data movement might be suspended.
The database might not be accessible.
There might be a temporary delay issue due to network latency or the load on the primary or secondary replica.

Further information: https://technet.microsoft.com/en-us/library/hh245199.aspx

Resolution:
You can fix it, using the following T-SQL command to force the resume synchronization:

ALTER DATABASE YourDatabase SET HADR RESUME;

Using SQL Server Management Studio

In Object Explorer, connect to the server instance that hosts the availability replica on which you want to resume a database, and expand the server tree.
Expand the AlwaysOn High Availability node and the Availability Groups node.
Expand the availability group.
Expand the Availability Databases node, right-click the database, and click Resume Data Movement.
In the Resume Data Movement dialog box, click OK.

Facebook

Google

Twitter

Category: High Availability / Disaster Recovery

PostgreSQL – Monitoring Replication

Deploy TiDB on AWS EKS (Elastic Kubernetes Service)

MySQL/MariaDB – Identifying and Avoiding Deadlocks

MariaDB – Setup GTID Replication using MariaBackup Step-by-Step

GTID Replication – Differences between MySQL and MariaDB

MariaDB – MaxScale Set up, Failover, Switchover, Load Balancing

SQL Server – Adding a Database to AlwaysOn Availability Group

SQL Server – AlwaysOn Availability Group Errors

SQL Server AlwaysOn (SQL Server Configuration Manager)

SQL Server AlwaysOn (New Availability Group wizard errors)

SQL Server AlwaysOn (performance issues)

SQL Server AlwaysOn errors (other)

SQL Server AlwaysOn errors (network)

SQL Server AlwaysOn errors (Windows)

SQL Server AlwaysOn errors (cluster log)

SQL Server AlwaysOn (patching/updates in SQL 2012)

AG on Standalone Instance

AG on FCIs (Failover Cluster Instances)

SQL Server – Add New Replica in AlwaysOn Availability Group

SQL Server – AlwaysON Availability Group Replica is Not Synchronizing (Error: 976, Severity: 14, State: 1.)

Recent Posts

Archives

Categories

SQL Server AlwaysOn (SQL Server Configuration Manager)

SQL Server​​ AlwaysOn (New Availability Group wizard errors)

SQL​​ Server AlwaysOn (performance issues)

SQL​​ Server AlwaysOn errors (other)

SQL Server AlwaysOn​​ errors (network)

SQL Server AlwaysOn errors (Windows)

SQL Server AlwaysOn errors (cluster log)

SQL Server AlwaysOn (patching/updates in SQL 2012)

AG on Standalone Instance

AG on FCIs (Failover Cluster Instances)

Recent Posts

Archives

Categories

Tag Cloud

SQL Server AlwaysOn (New Availability Group wizard errors)

SQL Server AlwaysOn (performance issues)

SQL Server AlwaysOn errors (other)

SQL Server AlwaysOn errors (network)