Why did gitlab spend a month eliminating PostgreSQL sub transactions?

Jiedao jdon 2021-10-14 06:39:33

Since last year 6 Since the month , We noticed that GitLab.com The database on will mysteriously stop for a few minutes , This will cause the user to see 500 A mistake . After weeks of hard investigation , We finally found the cause of this situation : When a long transaction is in progress, pass SAVEPOINTSQL Inquire about Starting a sub transaction can cause serious damage to the database copy . Here's what happened 、 How do we find problems and what measures we have taken to solve them .



To understand SubtransControlLock(PostgreSQL 13 Put it   Rename it to SubtransSLRU), We must first understand that sub transactions are PostgreSQL How does China work . stay PostgreSQL in , Transactions can be through BEGIN Statement start , Sub transactions can be processed through subsequent transactions SAVEPOINT Query start . When a transaction or sub transaction requires a transaction ID( abbreviation XID) when , Usually before the client modifies the data , PostgreSQL Each transaction is assigned a transaction ID( abbreviation XID).

for example , Suppose you run an online store and the customer places an order . Before the order is completed , The system needs to ensure that the user's credit card account exists . stay Rails in , A common pattern is to start a transaction for the order and call  find_or_create_by. for example :

Order.transaction do
  CreditAccount.transaction(requires_new: true) do
    CreditAccount.find_or_create_by(customer_id: customer.id)
  # Fulfill the order
  # ...
rescue ActiveRecord::RecordNotUnique

If two orders are placed at about the same time , You don't want to create duplicate accounts that cause one of the orders to fail . contrary , You want the system to say :“ Oh , Just created an account ; Let me use it .”

This is where the sub business comes in handy :requires_new: true  If the application is already in a transaction , It tells Rails Start a new sub transaction . The above code is converted into several SQL call , As shown below :

--- Start a transaction
SAVEPOINT active_record_1
--- Look up the account
SELECT * FROM credit_accounts WHERE customer_id = 1
--- Insert the account; this may fail due to a duplicate constraint
INSERT INTO credit_accounts (customer_id) VALUES (1)
--- Abort this by rolling back
ROLLBACK TO active_record_1
--- Retry here: Start a new subtransaction
SAVEPOINT active_record_2
--- Find the newly-created account
SELECT * FROM credit_accounts WHERE customer_id = 1
--- Save the data
RELEASE SAVEPOINT active_record_2

On the top of the 7 That's ok INSERT sentence , If the customer account has been created , You may fail , And the database uniqueness constraint will prevent duplicate entries . If there is no first SAVEPOINT and ROLLBACK block , The whole transaction will fail . Through this sub transaction , The transaction can retry normally and find an existing account .

What does this have to do with long Affairs ?

Long affairs are usually bad , Because they occupy the connection , But they lead to a subtly different problem on the copy . On the copy ,SAVEPOINT A single snapshot during a long transaction can cause sub snapshots to overflow . please remember , We have more than 64 In the case of sub transactions , This reduces performance .


There are three options for the solution :

  1. Complete elimination SAVEPOINT.
  2. Eliminate all long-running transactions .
  3. take Andrey Borodin Patch for application To PostgreSQL And increase the sub transaction cache .

We chose the first option because we can easily remove most of the uses of sub transactions . We took many-ways .

  • Why not eliminate all long running transactions ?

In our database , It is impractical to eliminate all long-running transactions , Because we think many of them are through Automatic database cleanup the , but We can't reproduce this yet . We are partitioning the table and the database , But this is a more time-consuming problem than deleting all sub transactions .

  • PostgreSQL How about the patch ?

Although we tested Andrey Of PostgreSQL Patch , But we are concerned about deviation from the official PostgreSQL The version feels uncomfortable . Besides , Maintaining the customized patch version during the upgrade will give us the foundation framework The team adds a significant maintenance burden . Our self-management customers will not benefit , Unless they use a patched database .



Since deleting all SAVEPOINT After query , We never see the previous problems again . If you run with a read-only copy PostgreSQL, We strongly recommend that you also delete all sub transactions , Until further notice .

PostgreSQL It's a great database , Its well annotated code makes it possible to understand its limitations under different configurations .


Please bring the original link to reprint ,thank
Similar articles