Analysis and relationship of various nouns in data warehouse, such as indicator label, dimension measure, natural key agent key, etc

Learn big data in five minutes 2021-09-15 08:26:41

As a data person , Is it often surrounded by various nouns , Do you have a vague understanding of many of these concepts . Some words are only one word away , But they mean completely different , Today, let's learn about some common concepts and their relationships in data warehouse construction and data analysis .

The structure of this paper is shown in the figure below :

One 、 Analysis of common concepts in data warehouse

1. Entity

Entity refers to the subject attached , Is an object we analyze , For example, we analyze the sales of goods , For example, what is the sales volume of Huawei's mobile phones in the past six months , Huawei mobile phone is an entity ; We analyze user activity , A user is an entity . Of course, entities can also exist in reality , For example, virtual business objects , Activities , Members, etc. can be regarded as an entity .

Entities exist for business analysis , As a filtered dimension of analysis , Have properties that describe yourself , It has analyzable value .

2. dimension

Dimension is the angle of view , Analyze business data , From what angle , What kind of dimension is established . Therefore, dimension is a quantity used to analyze data , For example, you need to analyze the sales of products , You can choose to analyze by commodity category , This constitutes a dimension , Put all commodity categories together , It constitutes a dimension table .

3. Measure

A metric is a numeric value on a business process node . Like sales , Price , Cost, etc. .

The measures in the fact table can be divided into three categories : Completely additive , Half can add , Do not add .

  1. Completely additive metrics are the most flexible , The most useful , For example, sales , Sales, etc , You can summarize any dimension ;

  2. Semi additive measures can summarize some dimensions , But you can't aggregate all dimensions , Difference is a common semi additive measure , In addition to the time dimension , You can add across all dimensions ;

  3. The other is completely non additive , for example : ratio . For this kind of non additive metric , A good way is , Store as many fully additive components as possible that are not additive , And before calculating the final non additive fact , Summarize these components into the final result set .

4. Particle size

Granularity is the unit of measurement in a business process , For example, goods are measured by piece record , Or is it measured by batch records .

In the construction of digital warehouse , We say this is a fact table of user granularity , Then each row of data in the table is a user , No duplicate users ; For example, there is a table of sales granularity , Then each row in the table is a sales record .

Choosing an appropriate granularity level is an important key content for the construction of data warehouse , When designing data granularity , Generally, the following factors need to be considered :

  1. Type of analysis to accept 、 The minimum granularity of acceptable data and the amount of data that can be stored ;

  2. The higher the level of granularity is defined , The less you can do a more detailed analysis in the warehouse ;

  3. If storage resources are limited , You can only use higher data granularity ;

  4. The data granularity partition strategy must ensure : The granularity of data can really meet the needs of users' decision analysis , This is the most important criterion in the data granularity partition strategy .

5. CAL

Dimension is data access logic ( How to access data ), such as Number to fetch yes 10 The average height of boys under the age of , This is the statistical caliber .

6. indicators

An indicator is a measure of caliber , That is the final result . For example, the order volume in the last seven days , Purchase conversion rate of a promotion, etc .

An indicator is specific to the calculation and implementation , It mainly consists of the following parts :

  • Index processing logic , such as count ,sum, avg

  • dimension , For example, by Department 、 Regional index statistics , Corresponding sql Medium group by

  • Business limits / Modifiers , For example, different payment channels are used to calculate the corresponding indicators , Order refund rate paid by wechat , Order refund rate paid by Alipay . Corresponding sql Medium where.

besides , Indicators themselves can also be derived 、 Derive more indicators , Based on these characteristics , Indicators can be classified :

  • Atomic index : Basic business facts , No business restrictions 、 There is no dimension . For example, the order quantity in the order table 、 The total amount of the order is an atomic indicator ;

Indicators more concerned by business parties , It has practical business meaning , You can take the indicators of data directly . For example, the store is near 1 The daily order payment amount is a derivative indicator , It will be displayed directly on the products to the merchants .

But this index can't be taken directly from the unified middle layer of the data warehouse ( Because there are no existing fact fields , Generally, they provide wide watches ). There needs to be a bridge between the middle tier of the warehouse and the business side's index requirements , So there are derived indicators

  • Derived indicators dimension + Modifiers + Atomic index . The shop is near 1 The store is the dimension of the payment amount of orders per day , near 1 Day is a modifier of time type , Payment amount is an atomic indicator ;

dimension : Observe the angle of each index ;

Modifiers : One or more values of the dimension , For example, under the dimension of gender , Men and women are 2 A modifier .

  • Derivatives : For example, the conversion rate of a promotion activity is a derivative indicator , Because of the need Promotion number index and Promotion order quantity index To calculate .

7. label

Labels are set artificially 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object . It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、 Bai Fu beauty 、 lolita . For ambiguous labels , We can distinguish labels internally , such as : Apple , We can define apple as fruit , Apple phone means mobile phone .

8. Natural bond

A key consisting of attributes that already exist in reality , It is unique in the business concept , And has certain business implications , For example, commodities ID, staff ID.

From a warehouse Perspective , Identifiers from business systems are natural keys , For example, the employee number in the business library .

9. Persistent key

Remain permanent and will not change . Sometimes called a supernatural persistent bond . For example, ID number is persistent key. .

Difference between natural key and persistent key : For example, I see , For example, the company's employees leave their jobs and rejoin them , His natural key, employee number, has changed , But his persistent key ID number is unchanged .

10. Surrogate key

Is a key that has no business meaning . Proxy keys have many other names : Meaningless key 、 Integer key 、 Unnatural bond 、 Manual key 、 Synthetic bond, etc .

Proxy keys are simply expressed as integers produced in a sequential sequence . Page of product line 1 The row proxy key is 1, Then the proxy key in the next line is 2, Do so . The proxy key is only used to connect the dimension table and the fact table .

11. Degenerate dimension

Degenerate dimension , It's a dimension keyword that looks like a fact table , But in fact, there is no corresponding dimension table , That is, dimension attributes are stored in the fact table , This dimension column stored in the fact table is called degenerate dimension . Like other dimensions stored in dimension tables , The degenerate dimension can also be used to filter and query the fact table 、 Realize aggregation operation, etc .

So how to define the degenerate dimension ? For example, orders id, This dimension is of great magnitude , There is no need to use a dimension table to store , When we query or filter data, we need , So this redundancy is in the fact table , This is called degenerate dimension ,citycode We will also be redundant in the fact table , however It has a corresponding dimension table , So it's not a degenerate dimension .

12. Run in

This is a common concept in data analysis , Drilling down can be understood as increasing the level of dimension , So that we can Observe data from coarse-grained to fine-grained , For example, when analyzing product sales , You can observe more fine-grained data from year to month to day along the time dimension . You can drill down from the year dimension to the month dimension 、 Dimension of day, etc .

13. Scroll up

I see. Run in , The first volume is easy to understand , The two are opposite operations , therefore Volume up can be understood as deleting some layers of dimension , The operation of observing data from fine-grained to coarse-grained or aggregating and summarizing data upward along the dimension level .

Two 、 Relationship between data warehouse concepts

1. Entity table , Fact table , Relationship between dimension tables

stay Kimball There are dimensions and facts in dimension modeling , stay Inmon There are entities and relationships in paradigm modeling , If we separate the two modeling methods, these concepts are easier to understand . However, there are many hybrid modeling methods , The two modeling methods are combined , These concepts are not easy to remember and confused , Especially the fact table and entity table , What are the differences and connections between them , Let's look at their respective concepts first :

  1. Dimension table : The dimension table can be seen as a window for users to analyze a fact , The data in it should be a description of all aspects of the facts , For example, time dimension table , Regional dimension table , Dimension table is an analytical perspective of fact table .

  2. Fact table : In fact, a fact table determines a fact through the combination of various dimensions and some index values , For example, through the time dimension , Regional organization dimension , The fact that the index value can be used to determine the value of some indexes at a certain time and place . Each piece of data in the fact table is obtained by the intersection of the data and indicator values of several dimension tables .

  3. Entity table : An entity table is a table of actual objects , The data in the entity table must be the objective thing data , For example, all kinds of goods , It exists objectively , So you can design it as an entity table . Real time tables only describe things , There are no concrete facts , Therefore, some people call the entity table a fact table without facts .

for instance : For example, there are apple phones in the mobile phone Mall , Huawei mobile phones and other mobile phones of various brands and models , These data can form a Mobile entity table , But there is no measurable data in the table . One day the Apple phone sold 15 platform , Huawei mobile phones sold 20 platform , These mobile phone sales data are facts , Form a Fact table . So you can use it Date dimension table and Regional dimension table Carry out various dimensional analysis of this fact table .

2. Difference between indicators and labels

  • Different concepts

indicators Is used to define 、 A standard or way of evaluating and describing a particular thing . such as : Number of new users 、 Cumulative users 、 User activity rate is an indicator to measure user development ;

label It's artificial 、 According to business scenario requirements , The highly refined feature identification obtained by using a certain algorithm for the target object . It can be seen that the label is the result of artificial reprocessing , Ruwanghong 、 Bai Fu beauty 、 lolita .

  • Different composition

Index name It is the naming of the characteristics of matter matter matter and quantity ; The indicator value is the indicator at a specific time 、 regional 、 Quantitative performance under , Such as human weight , The indicator name is body weight , The value of the indicator is 120 Jin ;

Tag name Usually adjectives or adjectives + The structure of nouns , Labels are generally not quantifiable , Usually isolated , In addition to the basic class label , Labels processed by certain algorithms generally have no units and dimensions . If will exceed 200 Those who weigh five kilograms are called fat people .

  • Different categories

Classification of indicators

Calculation logic according to indicators , Indicators can be divided into atomic indicators 、 Derived indicators 、 There are three types of derivative indicators ;

According to different descriptions of events , It is divided into process indicators and outcome indicators ;

Classification of labels

According to the variability of labels, they are divided into static labels and dynamic labels ;

According to the label reference and evaluation indicators , It can be divided into qualitative label and quantitative label ;

indicators The best application is monitoring 、 analysis 、 Evaluation and modeling .

label The best application is annotation 、 Characterization 、 Classification and feature extraction .

In particular, it should be noted that , Because the annotation of the result is also a label , So in the application scenario of algorithms related to natural language processing and machine learning , Tags are of great value for supervised learning , Just a simple indicator is difficult to achieve . The indicators are in task allocation 、 Role of performance management , The label can't do it .

3. Differences and relationships between dimensions and indicators

Dimension is the observation angle of data , That is, from which angle to analyze the problem , Look at problems .

The indicator is to calculate the value of this result on the basis of dimension .

A dimension is generally a discrete value , For example, each independent date or region in the time dimension , Therefore, in statistics , You can aggregate records with the same dimension , Apply aggregate function to accumulate 、 mean value 、 Maximum 、 Minimum value and other aggregate calculation .

Indicators are aggregated general calculations , That is, the result of aggregation operation , It is generally a continuous value .

4. The difference between the use of natural keys and proxy keys in the data warehouse

The data warehouse toolbox says The unique primary key of a dimension table should be a proxy key, not a natural key . Sometimes modelers are reluctant to give up using natural keys , Because they want to query the fact table with operational code , You don't want to connect with the dimension table . However , Multidimensional keys with business implications should be avoided , Because whatever assumptions we make may eventually become invalid , Because we can't control the change of Business Library .

Therefore, each connection between dimension table and fact table in data warehouse should be based on integer proxy key with no actual meaning . Avoid using natural keys as the primary keys of dimension tables .

Please bring the original link to reprint ,thank
Similar articles