- Brief introduction
Database systems that support transactions, such as SQLite, provide an important property called atomic commit: either all of the write operations within a transaction are applied to the database, or none of them are. It appears as though writes to many different parts of the database file happen instantaneously.
In reality, changing the contents of a disk takes time, so writes cannot be instantaneous. SQLite therefore contains internal logic that preserves the atomicity of transactions even across a system crash or power failure.
This article describes the techniques and strategies SQLite uses to guarantee atomic operation. They apply only to rollback mode; when the database runs in WAL mode, a different strategy is used.
- Assumptions about hardware
- The minimum unit of a disk write is the sector.
Data smaller than one sector cannot be modified in place. To change part of a sector, the whole sector must be read, the relevant part modified, and the entire sector written back.
Before version 3.3.14, the sector size was hard-coded as 512 bytes. As hardware evolved, sector sizes grew to 4 KiB, so version 3.3.15 and later provide a function that asks the file system for the sector size. Because Unix and Windows do not report the file-system sector size, this function still returns 512 bytes on those platforms, but it can return accurate values on embedded systems.
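Related to the sector size is the database page size, which is recorded in the database file itself: bytes 16-17 of the 100-byte database header hold the page size as a big-endian 16-bit integer. A minimal sketch using Python's built-in sqlite3 module (the file name is arbitrary):

```python
import os
import sqlite3
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("PRAGMA page_size = 4096")  # must be set before the first write
conn.execute("CREATE TABLE t(x)")        # first write fixes the page size
conn.close()

with open(path, "rb") as f:
    header = f.read(100)                 # the database header is 100 bytes
# Bytes 16-17 hold the page size as a big-endian 16-bit integer.
page_size = struct.unpack(">H", header[16:18])[0]
print(page_size)  # 4096
```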
- Sector writes are not atomic, but they are linear.
Linear means that a write starts at one end of the sector and proceeds byte by byte toward the other end; the write may run from the beginning of the sector to the end, or from the end to the beginning. If power is lost during the write, part of the sector will have changed and part will not.
The key assumption SQLite makes is that if only part of a sector changes, the changed part is at the beginning or the end of the sector.
Version 3.5.0 introduced a new VFS (virtual file system) interface. The VFS is the only interface through which SQLite interacts with the file system. SQLite ships default VFS implementations for Unix and Windows, and users can supply custom ones.
The VFS interface includes a function named xDeviceCharacteristics, which queries the file system and reports its properties, for example whether sector writes are atomic. If sector writes are atomic, SQLite takes advantage of that. However, the default xDeviceCharacteristics implementations on Unix and Windows do not report this information.
- The operating system buffers file writes.
When a write request returns, the data may not yet have reached the database file. SQLite also assumes that the operating system may reorder write operations.
SQLite therefore calls flush or fsync at key points and assumes the call does not return until the data has reached persistent storage.
However, on some Windows and Unix systems the flush or fsync operation does not actually behave this way, so a power failure during a commit can corrupt the database file.
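How aggressively SQLite issues those fsync calls is controlled by the synchronous pragma. A small illustration using Python's sqlite3 module (FULL, the rollback-mode default, is reported as 2):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# OFF (0) skips fsync entirely, NORMAL (1) syncs at the most critical
# moments, FULL (2) syncs the journal and database at every commit.
conn.execute("PRAGMA synchronous = FULL")
level = conn.execute("PRAGMA synchronous").fetchone()[0]
print(level)  # 2
conn.close()
```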
- A file's size changes before its content.
That is, when a file is extended, its size is updated first, so for a moment the file contains garbage data, and only afterwards is the real data written.
If power is lost after the size is updated but before the data is written, the file contains garbage. SQLite does extra work to ensure the database file is not corrupted in this case.
If the VFS's xDeviceCharacteristics method reports that data is written before the file size changes, SQLite takes advantage of that property; the default implementations, however, do not report it.
- File deletion is atomic.
If power is lost while a file is being deleted, then after reboot the file is either fully present or completely gone.
If after a restart the file were only partially deleted, the database could be corrupted.
- Powersafe overwrite: when a program writes data to a file, bytes outside the written range are not changed, even if a crash or power failure occurs.
Without this property: suppose a write touches only the first few bytes of a sector. Because the minimum write unit is a sector, if power is lost after those bytes are written, the drive may detect on restart that the sector's contents are inconsistent and overwrite the whole sector with all zeros or all ones. Data outside the range of the original write would then be modified.
Modern disks can detect power loss and use the residual power to finish writing the current sector.
The single-file commit process
The initial state
The middle part of the diagram is the operating system's disk buffer.
Acquiring a read lock
A read lock must be acquired before any write operation, so that basic database metadata can be read and the SQL statement analyzed.
Note that the shared lock applies to the system's disk cache rather than to the disk file itself. File locks are really just flags inside the operating system kernel; after a system crash or power failure the locks disappear, and usually the lock is also released when the process that created it exits.
Reading data from the database
Data is read into the system's disk cache first and then copied into user space. On a cache hit, user space reads directly from the disk cache.
Note that the entire database is not read into memory .
Acquiring the reserved lock
Before modifying the database, a connection must acquire the reserved lock. Other processes holding shared read locks can continue to operate, but a database file grants only one reserved lock at a time, which ensures that only one process can write to the database at once.
Creating the rollback journal file
The journal file contains the original content of each database page that is about to be changed.
It also includes a header that records the original size of the database file, and each page stored in the journal is written together with its page number.
Note that at this point the journal has not yet been written to disk.
Modifying the database pages in user space
Each database connection has its own user-space copy of the pages it modifies, so at this point other connections can still read the database normally.
Flushing the journal file to disk
On most systems this requires two flush or fsync operations: the first flushes the journal's page data, and the second updates the page count in the journal header and flushes the header.
Acquiring the exclusive lock
Acquiring the exclusive lock happens in two steps. First a pending lock is taken, which ensures no new readers or writers can start; then the connection waits for the existing readers to finish and release their read locks, and finally takes the exclusive lock.
Writing the changes from user space to the database file
At this point it is certain that no other connection is reading the database. This step usually writes only into the disk cache, not yet into the database file on disk.
Flushing the changes to the disk file
This calls fsync or flush again. Together with flushing the journal file, this is where a transaction spends most of its time.
Deleting the journal file
SQLite gives the appearance of having made no changes to the database file or having made the complete set of changes to the database file depending on whether or not the rollback journal file exists.
Deleting the journal file is not atomic at the hardware level, but from the user's point of view it is: when the operating system is asked whether the file exists, the answer is simply yes or no.
On some systems deleting a file is expensive. SQLite can instead be configured to truncate the journal to zero bytes or to overwrite its header with zeros. In either case the journal can no longer be used for a rollback, so SQLite considers the commit complete.
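The journal's lifetime is easy to watch from outside. In the default DELETE journal mode, a `-journal` file appears beside the database while a write transaction is open and disappears once the commit completes. A sketch with Python's sqlite3 module (assumes the default rollback mode on a typical desktop build; file name arbitrary):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("PRAGMA journal_mode = DELETE")
conn.execute("CREATE TABLE t(x)")

journal = path + "-journal"
conn.execute("BEGIN")
conn.execute("INSERT INTO t VALUES (1)")  # first change creates the journal
during = os.path.exists(journal)
conn.execute("COMMIT")                    # commit unlinks the journal
after = os.path.exists(journal)
print(during, after)  # True False
conn.close()
```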
Releasing the locks
In the diagram, the database content in user space has been cleared. Recent versions optimize this: a counter is maintained in the database header and incremented on every write. If the counter has not changed, a connection can reuse the database content it already holds in user space.
Because a commit takes time, a crash or power failure can occur partway through, and a rollback is needed to preserve the illusion that the transaction was instantaneous. The rollback journal is used to restore the database to its state before the transaction began.
Initial conditions: suppose power is lost during step 10 above. The database file has received only one and a half pages of new data, but the complete journal file is on disk.
Hot rollback journal: when a new database connection is opened, it tries to acquire a shared read lock and notices the leftover journal file. It then checks whether that journal is a "hot journal", which is what a commit interrupted by a power failure or crash leaves behind. The journal is hot when all of the following hold:
- the rollback journal file exists;
- the rollback journal file is not empty;
- no reserved lock is held on the database file (locks are lost on power failure);
- the journal file's header is not corrupted;
- the journal does not name a master journal file (used when a transaction commits to multiple files), or it names one and that master journal file exists.
Given a hot journal, the database can be recovered.
- Acquiring an exclusive lock, which prevents other processes from using this journal to roll the database back at the same time.
- Rolling back the unfinished changes: the journal is read from disk into memory, and its pages are written back into the database file. The journal header stores the original size of the database; if the interrupted transaction had grown the database file, that information is used to truncate it. After this step, the database file's size and content are exactly what they were before the transaction.
- Deleting the hot journal, or truncating it to zero bytes, or overwriting its header with zeros. Either way, it is no longer a hot journal.
- Continuing with other operations
At this point the database file is back to normal and can be used as usual.
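Crash recovery itself happens automatically, but its observable effect is the same as an explicit ROLLBACK: the database returns to its pre-transaction state. A small Python illustration (the table and values are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts(name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
    raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    conn.rollback()  # the journal restores the pre-transaction state

balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'a'"
).fetchone()[0]
print(balance)  # 100
conn.close()
```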
Multi-file commit
Important details of the commit process
- Writing the page count into the journal header
The page count stored in the journal header is initially zero. If an incomplete journal is used for a rollback, the header therefore reads zero pages and no rollback actions are taken.
Before the commit, the journal's page contents are flushed to disk first, ensuring the journal contains no garbage; only then is the page count written into the header and flushed in a second step. The journal header and the pages do not share a sector, so even a power failure during the header write cannot corrupt the pages.
The behavior above applies only when the synchronous pragma is FULL. At NORMAL, the page count and the page contents are flushed to disk together. Even if the code writes the page contents first and the page count second, the operating system may reorder the operations, so the page count could reach the disk correctly while the page contents do not.
- Every page carries a checksum
SQLite stores a 32-bit checksum with every page in the journal. If any page's checksum does not match, the entire rollback is abandoned.
In theory the checksum is unnecessary when the synchronous pragma is FULL, but testing showed it does no harm, so journal files contain checksums regardless of the synchronous setting.
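To make the idea concrete, here is a deliberately simplified sketch. SQLite's real journal checksum algorithm is different; this only illustrates how a per-page checksum detects a torn write:

```python
def checksum(page: bytes, nonce: int) -> int:
    # Toy 32-bit additive checksum; NOT SQLite's actual algorithm.
    total = nonce
    for b in page:
        total = (total + b) & 0xFFFFFFFF
    return total

page = b"\x01" * 512
good = checksum(page, nonce=42)

torn = bytearray(page)
torn[-1] = 0xFF  # simulate a partial write at the tail of the page
matches = checksum(bytes(torn), nonce=42) == good
print(matches)  # False
```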
- Always journaling whole sectors
Suppose the page size is 1 KiB and the sector size is 4 KiB. To change a single page, the data of the entire sector must be recorded in the journal file, and when writing back to the database file, the entire sector must be written as well.
- Dealing with garbage when appending to the journal
When appending data to the journal file, SQLite assumes the file's size grows first and the data is written afterwards. If a power failure occurs between these two steps, the journal contains garbage, and if that journal were later used for recovery, the garbage could overwrite correct content in the database.
SQLite uses two strategies to deal with this situation.
- Cache spill before commit
If, before the commit, the modified content exceeds the user-space cache, the completed changes must be written out to the database file early before other work can continue.
A cache spill escalates the reserved lock to an exclusive lock, which reduces concurrency, and it adds extra flush or fsync operations, which are time-consuming. Cache spills should therefore be avoided when possible.
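One way to make spills less likely is to enlarge the page cache with the cache_size pragma. A quick Python illustration (a negative value is a budget in KiB rather than a page count):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Allow roughly 8 MiB of dirty pages before SQLite must spill to disk.
conn.execute("PRAGMA cache_size = -8000")
size = conn.execute("PRAGMA cache_size").fetchone()[0]
print(size)  # -8000
conn.close()
```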
Performance analysis shows that SQLite spends most of its time on disk I/O, so reducing the number of disk I/O operations improves SQLite's performance. The following are some of the methods SQLite uses to improve performance while still guaranteeing atomic transactions.
- The counter in the database header does not have to be updated for the transaction, saving one file write each for the journal file and the main database file.
- The counter in the header does not have to be checked at the start of each transaction, and the user-space cache does not have to be discarded.
- When the transaction ends, the journal file's header can be overwritten instead of deleting the file, avoiding file-system operations such as updating the directory entry of the journal file and freeing the disk sectors it occupies.
- Caching between transactions
In older versions of SQLite, the user-space cache was discarded at the end of each transaction, because another connection might have changed the database contents in the meantime; the next read of the same content would have to fetch it again from the disk cache or the disk.
Since version 3.3.14, the database cache in user space is retained, and a counter in the database header (bytes 24 through 27) is incremented on every write. The next time the process reads the database, it checks whether the counter has changed; if not, the cache can be reused.
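That counter, the file change counter, lives at bytes 24-27 of the database header as a big-endian 32-bit integer, so its behavior can be checked directly. A sketch in Python (file name arbitrary; assumes the default rollback journal mode):

```python
import os
import sqlite3
import struct
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("CREATE TABLE t(x)")

def change_counter(db_path):
    # Bytes 24-27 of the header: big-endian file change counter.
    with open(db_path, "rb") as f:
        return struct.unpack(">I", f.read(28)[24:28])[0]

before = change_counter(path)
conn.execute("INSERT INTO t VALUES (1)")  # autocommit: one write transaction
after = change_counter(path)
print(after > before)  # True
conn.close()
```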
- Exclusive access mode (Exclusive Access Mode)
New in version 3.3.14: SQLite can keep the exclusive lock after a transaction ends, so the database can be accessed by only one database connection (sqlite connection). This mode is off by default. (In my personal opinion, for an iOS application with only one database connection, enabling this mode is quite reliable.) It offers the following advantages.
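The mode the author refers to is enabled with PRAGMA locking_mode=EXCLUSIVE. Once the connection performs its first write, it holds the exclusive lock indefinitely, and even readers on other connections are locked out. A Python sketch (file name arbitrary):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
mode = conn.execute("PRAGMA locking_mode = EXCLUSIVE").fetchone()[0]
print(mode)  # exclusive

conn.execute("CREATE TABLE t(x)")  # first write takes the exclusive lock

other = sqlite3.connect(path, timeout=0)
try:
    other.execute("SELECT * FROM t")  # even a read is now refused
    blocked = False
except sqlite3.OperationalError:
    blocked = True
print(blocked)  # True
other.close()
conn.close()
```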
- Free pages are not written to the journal file (since 3.5.0)
When information is deleted from the database, the pages that held it are added to a free list (freelist). Later operations take pages from the free list instead of extending the database file.
Some free pages contain important information, for example the locations of other free pages, but most free pages contain nothing useful; these are called leaf (leaf) free pages. (My understanding is that the free pages form a tree, and a leaf free page is a leaf node.)
The content of a leaf free page does not matter, so SQLite avoids writing such pages to the journal file, which greatly reduces the number of I/O operations.
- Single-page updates and atomic sector writes
Modern disks generally guarantee that the write of a single sector is atomic: when power fails, the disk can use the energy stored in its capacitors or the angular momentum of the spinning platter to finish writing the current sector.
If sector writes are atomic, the database page size equals the sector size, and the write involves only a single page, then SQLite skips all of the journaling and flushing and writes the changed content directly into the database file.
The change counter on the first page of the database is modified separately; this is safe because even if power fails before the counter is updated, the database is unaffected.
- Safe-append (Safe Append) file systems (since 3.5.0)
SQLite normally assumes that when data is appended to a file, the file's size changes first and the content afterwards, which is why a journal file can contain garbage after a power failure.
If the file system guarantees that a file's content is updated before its size, then a journal file cannot contain garbage after a power failure or system crash.
On such file systems, SQLite stores -1 in the journal header where the page count would normally go, and computes the page count from the size of the journal file.
This saves one flush or fsync at commit. In addition, when the cache spills, the new page count does not have to be written into the journal header.
- Persistent journal files
The journal file is not deleted at the end of the transaction, which saves one file deletion and one file creation.
To enable: PRAGMA journal_mode=PERSIST;
This leaves the journal file in existence permanently. The mode can also be set to TRUNCATE: PRAGMA journal_mode=TRUNCATE;
PERSIST zeroes the journal header, so later writes to the journal overwrite the old content. TRUNCATE sets the journal's size to zero without requiring an fsync, so later writes to the journal are appends. On embedded systems with synchronous file systems, appending is slower than overwriting, so TRUNCATE can be slower than PERSIST.
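The three rollback journal modes can be switched at runtime, and the pragma echoes back the mode now in effect. A Python sketch (file name arbitrary):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.db")
conn = sqlite3.connect(path, isolation_level=None)
conn.execute("CREATE TABLE t(x)")

results = []
for mode in ("PERSIST", "TRUNCATE", "DELETE"):
    # The pragma returns the journal mode actually in effect.
    results.append(conn.execute(f"PRAGMA journal_mode = {mode}").fetchone()[0])
print(results)  # ['persist', 'truncate', 'delete']
conn.close()
```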
Testing atomic commit behavior
Things that can cause database corruption
- A broken lock implementation
Implementing locking correctly on network file systems is very hard, so avoid using SQLite over a network file system.
Errors also occur when two different locking mechanisms are used on the same file and those mechanisms do not exclude each other.
- Incomplete disk flushes
The fsync() system call on Unix or the FlushFileBuffers() call on Windows does not work as documented.
- Partially deleted files
SQLite assumes that file deletion is atomic. If a file SQLite deleted reappears partially after a power failure and restart, the database can be corrupted.
- Garbage written into the file
Other programs can write garbage data into SQLite's files.
- Operating system bugs
- Deleting or renaming a hot journal
Summary and future directions