Since last year 6 Since the month , We noticed that GitLab.com The database on will mysteriously stop for a few minutes , This will cause the user to see 500 A mistake . After weeks of hard investigation , We finally found the cause of this situation ： When a long transaction is in progress, pass SAVEPOINTSQL Inquire about Starting a sub transaction can cause serious damage to the database copy . Here's what happened 、 How do we find problems and what measures we have taken to solve them .
What is? SAVEPOINT？
To understand SubtransControlLock（PostgreSQL 13 Put it Rename it to SubtransSLRU）, We must first understand that sub transactions are PostgreSQL How does China work . stay PostgreSQL in , Transactions can be through BEGIN Statement start , Sub transactions can be processed through subsequent transactions SAVEPOINT Query start . When a transaction or sub transaction requires a transaction ID（ abbreviation XID） when , Usually before the client modifies the data , PostgreSQL Each transaction is assigned a transaction ID（ abbreviation XID）.
for example , Suppose you run an online store and the customer places an order . Before the order is completed , The system needs to ensure that the user's credit card account exists . stay Rails in , A common pattern is to start a transaction for the order and call find_or_create_by. for example ：
Order.transaction do CreditAccount.transaction(requires_new: true) do CreditAccount.find_or_create_by(customer_id: customer.id) end # Fulfill the order # ... rescue ActiveRecord::RecordNotUnique retry end
If two orders are placed at about the same time , You don't want to create duplicate accounts that cause one of the orders to fail . contrary , You want the system to say ：“ Oh , Just created an account ; Let me use it .”
This is where the sub business comes in handy ：requires_new: true If the application is already in a transaction , It tells Rails Start a new sub transaction . The above code is converted into several SQL call , As shown below ：
--- Start a transaction BEGIN SAVEPOINT active_record_1 --- Look up the account SELECT * FROM credit_accounts WHERE customer_id = 1 --- Insert the account; this may fail due to a duplicate constraint INSERT INTO credit_accounts (customer_id) VALUES (1) --- Abort this by rolling back ROLLBACK TO active_record_1 --- Retry here: Start a new subtransaction SAVEPOINT active_record_2 --- Find the newly-created account SELECT * FROM credit_accounts WHERE customer_id = 1 --- Save the data RELEASE SAVEPOINT active_record_2 COMMIT
On the top of the 7 That's ok INSERT sentence , If the customer account has been created , You may fail , And the database uniqueness constraint will prevent duplicate entries . If there is no first SAVEPOINT and ROLLBACK block , The whole transaction will fail . Through this sub transaction , The transaction can retry normally and find an existing account .
What does this have to do with long Affairs ？
Long affairs are usually bad , Because they occupy the connection , But they lead to a subtly different problem on the copy . On the copy ,SAVEPOINT A single snapshot during a long transaction can cause sub snapshots to overflow . please remember , We have more than 64 In the case of sub transactions , This reduces performance .
There are three options for the solution ：
- Complete elimination SAVEPOINT.
- Eliminate all long-running transactions .
- take Andrey Borodin Patch for application To PostgreSQL And increase the sub transaction cache .
We chose the first option because we can easily remove most of the uses of sub transactions . We took many-ways .
- Why not eliminate all long running transactions ？
In our database , It is impractical to eliminate all long-running transactions , Because we think many of them are through Automatic database cleanup the , but We can't reproduce this yet . We are partitioning the table and the database , But this is a more time-consuming problem than deleting all sub transactions .
- PostgreSQL How about the patch ？
Although we tested Andrey Of PostgreSQL Patch , But we are concerned about deviation from the official PostgreSQL The version feels uncomfortable . Besides , Maintaining the customized patch version during the upgrade will give us the foundation framework The team adds a significant maintenance burden . Our self-management customers will not benefit , Unless they use a patched database .
Since deleting all SAVEPOINT After query , We never see the previous problems again . If you run with a read-only copy PostgreSQL, We strongly recommend that you also delete all sub transactions , Until further notice .
PostgreSQL It's a great database , Its well annotated code makes it possible to understand its limitations under different configurations .