What is a file system ？
File system is a very important component of computer , Provide consistent access and management for storage devices . In different operating systems , There will be some differences in the file system , But there are some commonalities that haven't changed much for decades ：
Data is in the form of files , Provide Open、Read、Write、Seek、Close etc. API Visit ;
Files are organized in a tree , Provides the renaming of atoms （Rename） Operation to change the location of a file or directory .
File system provides access and management methods to support the vast majority of computer applications ,Unix Of “ Everything is a document ” The concept of the concept is to highlight its important position . The complexity of the file system makes its scalability unable to keep up with the rapid development of the Internet , The greatly simplified object storage fills the vacancy in time and develops rapidly . Because the object store lacks a tree structure and doesn't support atomic renaming , It's very different from the file system , This article will not discuss it for the time being .
Stand alone file system challenges
Most file systems are stand-alone , Provide access and management for one or more storage devices in a stand-alone operating system . With the rapid development of Internet , Stand alone file systems face many challenges ：
share ： Can't provide access to applications distributed in multiple machines at the same time , Hence the NFS agreement , The single file system can be provided to multiple machines simultaneously through the network .
Capacity ： Not enough space to store data , The data has to be distributed in multiple isolated stand-alone file systems .
performance ： It can not meet the very high performance requirements of some applications , The application has to do logical splitting and read and write multiple file systems at the same time .
reliability ： Limited by the reliability of a single machine , Machine failure can cause data loss .
Usability ： Limited by the availability of a single operating system , Operation and maintenance operations such as failure or restart will lead to unavailability .
With the rapid development of Internet , These problems have become increasingly prominent , Distributed file systems have emerged to meet these challenges .
Here are some basic architectures of distributed file system that I have known , The advantages and limitations of different architectures are compared .
GlusterFS It's from the United States Gluster company-developed POSIX distributed file system （ With GPL Open source ）,2007 The first public version was released in ,2011 By the Redhat Acquisition .
Its basic idea is to integrate multiple stand-alone file systems into a unified namespace through a stateless middleware （namespace） For users . This middleware consists of a series of superimposable Converters （Translator） Realization , Each converter solves a problem , Such as data distribution 、 Copy 、 Split 、 cache 、 Locks and so on , Users can flexibly configure according to specific application scenarios . For example, a typical distributed volume is shown in the figure below ：
Server1 and Server2 It consists of 2 Replica Volume0,Server3 and Server4 constitute Volume1, They are then merged into distributed volumes with more space .
advantage ： The data files are finally saved on the stand-alone file system with the same directory structure , Never mind GlusterFS Data loss due to unavailability of .
No obvious single point problem , Can be linearly extended .
Support for a large number of small files is estimated to be good .
Challenge ： This structure is relatively static , Not easy to adjust , Each storage node is also required to have the same configuration , When data or access is unbalanced, there is no way to adjust space or load . The fault recovery ability is also relatively weak , such as Server1 When it breaks down ,Server2 The documents on the can't be in a healthy 3 perhaps 4 Add copy to ensure data reliability .
Because of the lack of independent metadata services , All storage nodes are required to have a complete data directory structure , When traversing the directory or adjusting the directory structure, you need to access all nodes to get the correct results , The scalability of the whole system is limited , It can be extended to dozens of nodes , It is difficult to effectively manage hundreds of nodes .
CephFS Began in Sage Weil My doctoral dissertation , The goal is to implement distributed metadata management to support EB Level data scale .2012 year ,Sage Weil Set up the InkTank Continue to support CephFS Development of , On 2014 By the Redhat Acquisition . until 2016 year ,CephFS To release a stable version that can be used in production environments （CephFS The metadata part of is still stand-alone ）. Now? ,CephFS Distributed metadata is still immature .
Ceph It's a layered architecture , The bottom layer is one based on CRUSH（ Hash ） Distributed object storage , The upper layer provides object storage （RADOSGW）、 Block storage （RDB） And file system （CephFS） Three API, As shown in the figure below ：
Use a set of storage system to meet the storage requirements of multiple different scenarios （ Virtual machine image 、 Massive small files and general file storage ） It's still very attractive , However, due to the complexity of the system, strong operation and maintenance capability is required to support , In fact, at present, only block storage is relatively mature and widely used , Object storage and file systems are not ideal , I heard some use cases and gave up after a period of time .
CephFS The architecture of is shown in the figure below ：
CephFS By MDS（Metadata Daemon) Realized , It is one or more stateless metadata services , From the bottom OSD Load the meta information of the file system , And cache it in memory to improve access speed . because MDS It's stateless , Multiple standby nodes can be configured to achieve HA, Relatively easy . However, the backup node has no cache , It needs to be preheated again , It is possible that the fault recovery time will be long .
Because it is slow to load or write data from the storage layer ,MDS Multithreading must be used to improve throughput , Various concurrent file system operations lead to a significant increase in complexity , Deadlock prone , Or because IO Slow performance leads to a significant decline . In order to get better performance ,MDS Often enough memory is needed to cache most of the metadata , This also limits its actual support capacity .
When there are multiple active MDS when , Part of the directory structure （ subtree ） It can be dynamically assigned to a MDS And it is entirely up to it to handle the relevant requests , To achieve the purpose of horizontal expansion . Before multiple active , Inevitably, individual locking mechanisms are needed to negotiate ownership of the subtree , And atomic renaming across subtrees through distributed transactions , These implementations are very complex . At present, the latest official documents still do not recommend the use of multiple MDS（ As a backup, you can ）.
Google Of GFS It is the pioneer and typical representative of distributed file system , From the early BigFiles Come and go . stay 2003 The design concept and details are described in detail in the paper published in , It has a great impact on the industry , Later, many distributed file systems were designed with reference to it .
seeing the name of a thing one thinks of its function ,BigFiles/GFS It is optimized for large files , Not suitable for average file size of 1MB Within the scene .GFS The architecture of is shown in the figure below ：
GFS There is one Master Node to manage metadata （ All loaded into memory , Snapshots and update logs are written to disk ）, The file is divided into 64MB Of Chunk Store in several ChunkServer On （ Use a stand-alone file system directly ）. The file can only be appended , Never mind Chunk Version and consistency issues （ You can use length as version ）. This uses completely different technologies to solve the design of metadata and data, which greatly simplifies the complexity of the system , It also has sufficient expansion capacity （ If the average file size is greater than 256MB,Master Every node GB Memory can support about 1PB The amount of data ）. Give up support POSIX Some functions of the file system （ For example, write randomly 、 Extended attributes 、 Hard link, etc ） It also further simplifies the system complexity , In exchange for better system performance 、 Robustness and scalability .
because GFS Mature and stable , bring Google It's easier to build upper tier applications （MapReduce、BigTable etc. ）. later ,Google Developed the next generation storage system with stronger scalability Colossus, Completely separate metadata from data storage , The distributed management of metadata is realized （ Automatically Sharding）, And the use of Reed Solomon Coding to reduce storage space and reduce costs .
come from Yahoo Of Hadoop Count as Google Of GFS、MapReduce Open source, etc Java Implementation ,HDFS It is also basically copied GFS The design of the , I won't repeat it here , The picture below is HDFS The architecture of the figure ：
HDFS Its reliability and scalability are very good , There are thousands of nodes and 100PB Level deployment , The performance of supporting big data applications is still very good , There are few cases of losing data （ The data is deleted by mistake because the recycle bin is not configured ）.
HDFS Of HA The plan was added later , It's complicated , That I was the first to do this HA Of the plan Facebook For a long time （ At least 3 year ） Manual failover is done in the （ Distrust automatic failover ）.
because NameNode yes Java Realized , Depends on the pre allocated heap memory size , Insufficient allocation is easy to trigger Full GC And affect the performance of the whole system . Some teams try to use it C++ Rewrote , But we haven't seen a mature open source solution yet .
HDFS There is also a lack of mature non Java client , Make big data （Hadoop Tools such as ） Other scenes （ For example, deep learning ） It's not very convenient to use .
MooseFS It's an open source distributed from Poland POSIX file system , It's also a reference GFS The architecture of , Achieved the vast majority of POSIX Semantics and API, Through a very mature FUSE After the client is mounted, it can be accessed like the local file system .MooseFS The architecture of is shown in the figure below ：
MooseFS Support for snapshots , It is convenient to use it for data backup or backup recovery .
MooseFS By C Realized ,Master Is an asynchronous event driven single thread , Be similar to Redis. But the network part uses poll Not more efficient epoll, Cause concurrency to 1000 Left and right CPU The consumption is very severe .
The open source community edition does not HA, It's through metalogger To realize asynchronous cold standby , The paid version of closed source has HA.
To support random write operations ,MooseFS Medium chunk It can be modified , Ensure data consistency through a set of version management mechanism , This mechanism is complex and prone to strange problems （ For example, after the cluster is restarted, there may be a few chunk The actual number of copies is lower than expected ）.
Above said GFS、HDFS and MooseFS They are all designed for the software and hardware environment of self built computer room , Combine the reliability of data with the availability of nodes, and solve it in the way of multi machine and multi copy . But in a virtual machine in a public or private cloud , The block device is already a virtual block device with three copies reliability design , If we do it in the way of multi machine and multi copy , The cost of data is high （ It's actually 9 A copy ）.
So we focus on the public cloud , improvement HDFS and MooseFS The architecture of , Designed JuiceFS, The architecture is shown in the following figure ：
JuiceFS Replace... With existing object storage in the public cloud DataNode and ChunkServer, Achieve a fully elastic Serverless Storage system . The object storage of public cloud has solved the problem of safe and efficient storage of large-scale data ,JuiceFS Just focus on metadata management , It also greatly reduces the complexity of metadata services （GFS and MooseFS Of master To solve the problem of metadata storage and data block health management at the same time ）. We have also made many improvements to the metadata part , It's been implemented from the beginning based on Raft High availability . To really provide a high availability and high performance service , The management and operation of metadata is still very challenging , Metadata is provided to users in the form of services . because POSIX file system API It's the most widely used API, We are based on FUSE Achieved a high degree of POSIX Compatible clients , Users can use a command line tool to JuiceFS Mount to Linux perhaps macOS in , Access as fast as a local file system .
The dotted line on the right in the figure above is responsible for data storage and access , It's about user data privacy , They are completely in the customer's own account and network environment , No contact with metadata services . We （Juicedata） There's no way to get to the customer's content （ Except metadata , Please don't put sensitive content in the file name ）.
This paper briefly introduces the architecture of several distributed file systems I know , Put them in the picture below in chronological order （ The arrow indicates that it refers to the former or the new generation version ）：
The blue files in the upper part of the figure above are mainly used for big data scenarios , What we achieve is POSIX Subset , And the green ones below are POSIX Compatible file system .
One of them is GFS The system design of metadata and data separation can effectively balance the complexity of the system , Effectively solve the problem of large-scale data storage （ It's usually big files, too ）, Better scalability . This architecture supports distributed storage of metadata Colossus and WarmStorage It has unlimited scalability .
JuiceFS As a latecomer , To study the MooseFS Implement distributed POSIX How the file system works , Also learned Facebook Of WarmStorage And so on , We hope to provide the best distributed storage experience for public cloud or private cloud scenarios .JuiceFS By storing data into object storage , It effectively avoids the double redundancy when using the above distributed file system （ Redundancy of block storage and multi machine redundancy of distributed system ） The resulting cost is too high .JuiceFS It also supports all public clouds , Don't worry about a cloud service lock , It can also smoothly migrate data between public clouds or zones .
Last , If you have a public cloud account , Come on JuiceFS Sign up for ,5 You can give your virtual machine or your own... In minutes Mac Mount a PB Class file system .
Recommended reading ：