How to stop a deadlock on one node from crashing the entire cluster?

Problem description

I'm running a three-node Galera Cluster under MariaDB. The application is written in PHP and uses the mysqli extension.

Very occasionally I get a deadlock on write. I'm working on improving my application to handle or avoid that kind of failure, but in the meantime I need the cluster to stay up when this happens.

The problem is that as soon as the deadlock occurs, not just one but all three nodes in the cluster crash. The node where the deadlock originates suffers the "MySQL server has gone away" error and, once max_connect_errors is reached, starts refusing connections permanently, requiring a manual server restart.

What I don't get is why the other nodes go down too. They both start erroring with "WSREP has not yet prepared node for application use" which means the entire application crashes with no database nodes accepting connections.

How can I ensure that the rest of the cluster stays up when one node suffers a deadlock, however rare?

Update:

A month later, another deadlock caused a similar problem. Again, one node brought down everything.

The first connection gets a deadlock (at commit phase) so the application tries to replay the transaction. This hangs for almost a minute and fails again.
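
For illustration, the replay step described above could be bounded so that it gives up quickly instead of hanging. This is only a minimal sketch, written in Python rather than PHP for brevity, and it is not the application's actual code. `DbError` and `run_with_retry` are hypothetical names, and the sketch assumes the driver exposes the server error code and that the `transaction` callable begins, rolls back, and commits its own work. 1213 is the MySQL/MariaDB deadlock error code.

```python
import random
import time

# MySQL/MariaDB server error code for "Deadlock found when trying to get lock".
DEADLOCK_CODE = 1213


class DbError(Exception):
    """Hypothetical stand-in for a driver exception carrying a server error code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def run_with_retry(transaction, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Run `transaction` (a callable), replaying it only after a deadlock error.

    The number of attempts is capped so that a transaction which keeps
    failing is reported to the caller instead of being replayed forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DbError as exc:
            # Anything other than a deadlock, or the final attempt, is re-raised.
            if exc.code != DEADLOCK_CODE or attempt == max_attempts:
                raise
            # Exponential backoff with random jitter before the next attempt.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

The same shape should carry over to mysqli: catch the exception, check the error code, roll back, and retry a limited number of times. Whether a bounded replay would avoid the cluster-wide failure described here is not something this sketch can confirm.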

After the first connection fails to recover, all other connections start failing with (1205) "Lock wait timeout exceeded" rendering the entire cluster useless.

I should add that the application does not use explicit locks. However it got itself tied in a knot, it did so with nothing more than regular transactional queries.