[Hadoop 3.x] HDFS storage types and storage policies (5): overview

Manor's big data struggle 2021-10-14 05:02:34


Most Hadoop articles on blogs today still stop at Hadoop 2.x. This series follows the Dark Horse Programmer full Hadoop 3.x big data tutorial and adds the new features that 2.x does not have. Like, favorite, and follow so you can find the series again next time!

Article history

[hadoop3.x series] Using the HDFS REST HTTP API (1): WebHDFS

[hadoop3.x series] Using the HDFS REST HTTP API (2): HttpFS

[hadoop3.x series] Common Hadoop file storage formats and using the BigData File Viewer tool (3)

[hadoop3.x] The next-generation storage format Apache Arrow (4)

HDFS storage types and storage policies


• Archival Storage is a solution that decouples growing storage capacity from compute capacity.

• Data that must be kept but is rarely used in computation can be placed on low-cost storage nodes, which hold the cluster's cold data.

• Based on policy, data can be moved from hot storage to cold storage. Adding more nodes to the cold tier grows storage capacity independently of the cluster's compute capacity.

• The framework provided by heterogeneous storage and archival storage generalizes the HDFS architecture to include other kinds of storage media, including SSD and memory. Users can choose to store data on SSD or in memory for better performance.


Storage types and storage policies

Multiple storage types

Consider a question: what kinds of storage media can we store data on?


  1. Hard disk

  2. Memory

  3. NAS

Speed comparison

RAM is faster than an SSD by an order of magnitude or more. An ordinary disk sustains roughly 30-150 MB/s, and a fast SSD can reach an actual write speed of about 500 MB/s. The theoretical maximum speed of RAM is about 30 times the actual performance of an SSD.

The following is an actual comparison diagram :


Storage types

So far, our hdfs-site.xml configuration has stored data on the local Linux disk:

<property>
<name>dfs.datanode.data.dir</name>
<value>/export/server/hadoop-3.1.4/data/datanode</value>
<description>Path on the local file system where the DataNode stores its block data</description></property>

The above configuration is the same as the following configuration :

<property>
<name>dfs.datanode.data.dir</name>
<value>[DISK]/export/server/hadoop-3.1.4/data/datanode</value>
<description>Path on the local file system where the DataNode stores its block data</description></property>

In HDFS, different storage media can be assigned different storage types:

• DISK: the default storage type; ordinary disk storage.

• ARCHIVE: storage with high density (petabyte scale) but little compute power; used for archival storage.

• SSD: solid-state drive.

• RAM_DISK: memory space in the DataNode.

Introduction to storage policies

HDFS provides the Hot, Warm, Cold, All_SSD, One_SSD, and Lazy_Persist storage policies. The concept of a storage policy was introduced so that files can be stored on different storage types according to the policy assigned to them. HDFS supports the following storage policies:

Hot

• For both storage and compute.

• Data that is used frequently stays in this policy.

• When a block is hot, all replicas are stored on DISK.

Cold

• For storage only; only a very limited part of the data is used in computation.

• Data that is no longer in use, or that needs to be archived, is moved from hot storage to cold storage.

• When a block is cold, all replicas are stored in ARCHIVE.

Warm

• Partially hot and partially cold.

• When a block is warm, some of its replicas are stored on DISK and the remaining replicas are stored in ARCHIVE.

All_SSD

All replicas are stored on SSD.

One_SSD

One replica is stored on SSD; the remaining replicas are stored on DISK.

Lazy_Persist

Used for writing blocks with a single replica in memory. The replica is first written to RAM_DISK and then lazily persisted to DISK.
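The placement rules above can be summarized in a small lookup table. The following is a minimal Python sketch, not HDFS code: the names PREFERRED and preferred_storage_types are hypothetical, and the exact Warm split (one replica on DISK, the rest in ARCHIVE) is an assumption taken from the Apache Hadoop documentation rather than from the text above. It ignores free space entirely.

```python
# Preferred storage types per policy, written as
# (type for the first replica, type for every remaining replica).
# Assumption: Warm keeps exactly one replica on DISK, as in the Hadoop docs.
PREFERRED = {
    "HOT": ("DISK", "DISK"),
    "COLD": ("ARCHIVE", "ARCHIVE"),
    "WARM": ("DISK", "ARCHIVE"),
    "ALL_SSD": ("SSD", "SSD"),
    "ONE_SSD": ("SSD", "DISK"),
    "LAZY_PERSIST": ("RAM_DISK", "DISK"),
}


def preferred_storage_types(policy, replication):
    """Return the preferred storage type of each replica, ignoring free space."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    first, rest = PREFERRED[policy.upper()]
    return [first] + [rest] * (replication - 1)


if __name__ == "__main__":
    # Print the preferred placement of a 3-replica block under each policy.
    for name in PREFERRED:
        print(name, preferred_storage_types(name, 3))
```

With three replicas, for example, One_SSD asks for one SSD replica and two DISK replicas, while Cold asks for three ARCHIVE replicas.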

Storage policies in HDFS

An HDFS storage policy consists of the following fields:

1. Policy ID

2. Policy name

3. List of storage types for block placement

4. List of fallback storage types for file creation

5. List of fallback storage types for replication

When there is enough space, block replicas are stored according to the storage type list specified in #3. When some of the storage types in list #3 run out of space, the fallback storage type lists specified in #4 and #5 are used to replace the out-of-space storage types for file creation and for replication, respectively.
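The fallback rule can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the HDFS implementation: the StoragePolicy class, its choose method, and the "unavailable" set are hypothetical names, and the sketch assumes the last entry of list #3 repeats for any remaining replicas. The two example policies use the values from the Hot and Lazy_Persist rows of the policy table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoragePolicy:
    policy_id: int                # field 1
    name: str                     # field 2
    storage_types: tuple          # field 3: storage types for block placement
    creation_fallbacks: tuple     # field 4: fallbacks for file creation
    replication_fallbacks: tuple  # field 5: fallbacks for replication

    def choose(self, replication, unavailable=frozenset(), creating=True):
        """Pick one storage type per replica, substituting fallbacks as needed."""
        # Expand field 3 to one entry per replica; the last type repeats.
        wanted = [
            self.storage_types[min(i, len(self.storage_types) - 1)]
            for i in range(replication)
        ]
        fallbacks = self.creation_fallbacks if creating else self.replication_fallbacks
        chosen = []
        for storage_type in wanted:
            if storage_type not in unavailable:
                chosen.append(storage_type)
                continue
            # Substitute the first fallback type that still has space, if any.
            substitute = next((f for f in fallbacks if f not in unavailable), None)
            if substitute is not None:
                chosen.append(substitute)
        return chosen


HOT = StoragePolicy(7, "HOT", ("DISK",), (), ("ARCHIVE",))
LAZY_PERSIST = StoragePolicy(15, "LAZY_PERSIST", ("RAM_DISK", "DISK"), ("DISK",), ("DISK",))
```

In this sketch, HOT.choose(3, unavailable={"DISK"}, creating=False) falls back to ARCHIVE for all three replicas, whereas the same call with creating=True returns an empty list because Hot has no creation fallback.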

The following is a typical storage policy table :

Policy ID | Policy Name | Block Placement (n replicas) | Fallback storages for creation | Fallback storages for replication
15 | Lazy_Persist | RAM_DISK: 1, DISK: n-1 | DISK | DISK
7 | Hot (default) | DISK: n | <none> | ARCHIVE
2 | Cold | ARCHIVE: n | <none> | <none>

Notes:

The Lazy_Persist policy is useful only for single-replica blocks. For blocks with more than one replica, all replicas are written to DISK, because writing only one of them to RAM_DISK does not improve overall performance.

For striped erasure-coded files, the suitable storage policies are All_SSD, Hot, and Cold. If a user sets any other policy on a striped EC file, that policy is not followed when blocks are created or moved.
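The two notes above can be turned into a quick sanity check before a policy is assigned. The sketch below is only an illustration in Python: policy_warnings and EC_SUITABLE are hypothetical names, and HDFS itself does not expose such a function.

```python
# Policies that are honored for striped erasure-coded files, per the note above.
EC_SUITABLE = {"ALL_SSD", "HOT", "COLD"}


def policy_warnings(policy, replication=3, erasure_coded=False):
    """Return human-readable warnings for a policy choice (empty list if none)."""
    name = policy.upper()
    warnings = []
    if erasure_coded and name not in EC_SUITABLE:
        warnings.append(
            name + " is not followed for striped EC files; use All_SSD, Hot, or Cold"
        )
    if name == "LAZY_PERSIST" and replication > 1:
        warnings.append(
            "LAZY_PERSIST only helps single-replica blocks; all replicas go to DISK"
        )
    return warnings


if __name__ == "__main__":
    print(policy_warnings("WARM", erasure_coded=True))
    print(policy_warnings("LAZY_PERSIST", replication=3))
```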

Storage policy resolution

• When a file or directory is created, its storage policy is unspecified. A policy can be specified with the command:

hdfs storagepolicies -setStoragePolicy -path <path> -policy <policy>

• The effective storage policy of a file or directory is resolved by the following rules:

If the file or directory has a storage policy specified, that policy is returned.

For a file or directory with no policy specified: if it is the root directory, the default storage policy is returned; otherwise, the effective storage policy of its parent is returned.

• The effective storage policy can be retrieved with the hdfs storagepolicies -getStoragePolicy -path <path> command.
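The resolution rules amount to walking up the directory tree until a policy is found. The following Python sketch illustrates that walk under stated assumptions: effective_policy and the "explicit" dictionary (path to policy name) are hypothetical stand-ins for the NameNode's metadata, and the default policy is taken to be Hot, as in the policy table.

```python
DEFAULT_POLICY = "HOT"  # assumption: Hot is the default policy


def normalize(path):
    """Collapse repeated and trailing slashes: '/a//b/' becomes '/a/b'."""
    return "/" + "/".join(part for part in path.split("/") if part)


def parent(path):
    """Return the parent directory of a normalized absolute path."""
    return path.rsplit("/", 1)[0] or "/"


def effective_policy(path, explicit):
    """Resolve the effective policy of path from explicitly set policies."""
    path = normalize(path)
    while True:
        if path in explicit:       # rule 1: a policy was specified here
            return explicit[path]
        if path == "/":            # rule 2: unspecified root uses the default
            return DEFAULT_POLICY
        path = parent(path)        # rule 3: otherwise inherit from the parent


if __name__ == "__main__":
    explicit = {"/data/cold": "COLD"}
    print(effective_policy("/data/cold/2020/part-0", explicit))
    print(effective_policy("/data/hot/part-0", explicit))
```

A file under /data/cold inherits COLD from the directory, while a file elsewhere falls through to the root and gets the default.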

Configuration

• dfs.storage.policy.enabled

Enables or disables the storage policy feature. The default value is true.

• dfs.datanode.data.dir

On each DataNode, the comma-separated storage locations should be tagged with their storage types. This allows storage policies to place blocks on different storage types according to policy. For example:

A DataNode storage location /grid/dn/disk0 on DISK should be configured as [DISK]file:///grid/dn/disk0

A DataNode storage location /grid/dn/ssd0 on SSD should be configured as [SSD]file:///grid/dn/ssd0

A DataNode storage location /grid/dn/archive0 on ARCHIVE should be configured as [ARCHIVE]file:///grid/dn/archive0

A DataNode storage location /grid/dn/ram0 on RAM_DISK should be configured as [RAM_DISK]file:///grid/dn/ram0

If a DataNode storage location is not explicitly tagged with a storage type, its default storage type is DISK.
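To make the tagging format concrete, here is a minimal Python sketch that splits a dfs.datanode.data.dir value into (storage type, location) pairs and applies the DISK default. It is an illustration only: parse_data_dirs is a hypothetical helper, and the regular expression is an assumption about the "[TYPE]location" shape shown above, not the parser HDFS uses.

```python
import re

# Matches "[TYPE]location", e.g. "[SSD]file:///grid/dn/ssd0".
_TAGGED = re.compile(r"^\[(?P<type>[A-Za-z_]+)\](?P<location>.+)$")


def parse_data_dirs(value):
    """Split a dfs.datanode.data.dir value into (storage_type, location) pairs."""
    pairs = []
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        match = _TAGGED.match(entry)
        if match:
            pairs.append((match.group("type").upper(), match.group("location")))
        else:
            # Untagged locations default to the DISK storage type.
            pairs.append(("DISK", entry))
    return pairs


if __name__ == "__main__":
    value = "[SSD]file:///grid/dn/ssd0,[ARCHIVE]file:///grid/dn/archive0,/grid/dn/disk0"
    for storage_type, location in parse_data_dirs(value):
        print(storage_type, location)
```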


Blog home page :https://manor.blog.csdn.net
Likes, favorites, and comments are welcome. Please point out any mistakes!
This article is an original work by manor, first published on the CSDN blog.
The Hadoop series is updated every day!

Please include the original link when reprinting. Thanks!