Front-end application performance monitoring: ByteDance's years of experience, all in one place

ByteDance Terminal Technology 2021-10-14 07:10:55

1. Background

ByteDance now runs Web projects at a massive scale, serving hundreds of millions of users.

As the user base keeps growing, the need to measure site experience is becoming more urgent: users compare your product with the best Web sites they use every day. To optimize effectively, you first need the relevant monitoring data; only then can you target the real problems.

Performance is key to retaining users. Many research reports have shown the link between site performance and business results: poor performance costs a site users, conversions, and reputation. Error monitoring lets developers find and fix problems as soon as they occur. Relying only on user reports and feedback is unrealistic: when users hit a white screen or an interface error, most of them retry a few times, lose patience, and simply close the site.

Drawing on the monitoring needs of dozens of products, the development team gradually refined a performance monitoring platform. After continued refinement, it was officially released on Volcano Engine as Application Performance Monitoring, Full-Link Edition. This article describes what kind of monitoring platform it is and which pain points it can help enterprises solve.

2. Product overview

Application Performance Monitoring, Full-Link Edition is an enterprise-grade technical service platform from ByteDance. It provides enterprises with APM services covering application quality, performance, and custom event tracking.

Based on aggregate analysis of massive data, the platform helps customers discover many kinds of abnormal problems, raises alerts in time, and assigns them for handling. It also provides rich attribution capabilities, including but not limited to exception analysis, multidimensional analysis, custom reporting, and single-point log query. Combined with flexible reports, you can follow the trends of various indicators. For more features, see the function module descriptions of each monitoring sub-service.


3. Product highlights

This section covers the highlights of Application Performance Monitoring, Full-Link Edition from the perspective of the product as a whole. More technical highlights and advantages are described in detail under each function module.

3.1 Lower integration cost: a non-intrusive SDK

To integrate the SDK, you only need a few lines of initialization code.

npm install @apm-insight-web/rangers-site-sdk

// Add the following code at the entry point of the project
import vemars from '@apm-insight-web/rangers-site-sdk/private'

vemars('config', {
  app_id: {{your app_id}},
  serverDomain: {{your private-deployment server address}},
})

Alternatively, integrate directly through a CDN with a JavaScript snippet:

<!-- Script -->
<!-- Add the following code at the top of the page's <head> tag -->
<script>
(function(i,s,o,g,r,a,m){i["RangerSiteSDKObject"]=r;(i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)}),(i[r].l=1*new Date());(a=s.createElement(o)),(m=s.getElementsByTagName(o)[0]);a.async=1;a.src=g;a.crossOrigin="anonymous";m.parentNode.insertBefore(a,m);i[r].globalPreCollectError=function(){i[r]("precollect","error",arguments)};if(typeof i.addEventListener==="function"){i.addEventListener("error",i[r].globalPreCollectError,true);i.addEventListener('unhandledrejection', i[r].globalPreCollectError)}if('PerformanceLongTaskTiming'in i){var g=i[r].lt={e:[]};g.o=new PerformanceObserver(function(l){g.e=g.e.concat(l.getEntries())});g.o.observe({entryTypes:['longtask']})}})(window,document,"script","{{your CDN address}}","RangersSiteSDK");
</script>
<script>
  window.RangersSiteSDK('config', {
    app_id: {{your app_id}},
    serverDomain: {{your private-deployment server address}},
  });
</script>

3.2 Richer restoration of the error scene

Application Performance Monitoring, Full-Link Edition not only helps you find all kinds of abnormal problems with no blind spots, but also provides rich capabilities for restoring the scene of an error, including but not limited to stack backtracking and user-interaction replay.



3.3 More flexible sampling to save cost

Application Performance Monitoring, Full-Link Edition provides sampling configuration: you can set sampling by function module or by user, which helps you reduce event volume.
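A minimal sketch of how sampling by module and by user might work on the client. The `SamplingConfig` shape, `hashToUnit`, and `shouldReport` are hypothetical names chosen for illustration, not the SDK's actual API. Hashing the user ID instead of drawing a random number keeps a given user consistently inside or outside the sample.

```typescript
// Hypothetical sampling config: one rate for users, one rate per module.
interface SamplingConfig {
  userRate: number;                    // fraction of users kept, 0..1
  moduleRates: Record<string, number>; // fraction of events kept per module, 0..1
}

// Map a string to a stable number in [0, 1) with the 32-bit FNV-1a hash.
function hashToUnit(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash / 0x100000000;
}

// Decide whether an event from `moduleName` for `userId` should be reported.
function shouldReport(
  config: SamplingConfig,
  userId: string,
  moduleName: string,
  random: () => number = Math.random,
): boolean {
  // User-level sampling is deterministic: same user, same decision.
  if (hashToUnit(userId) >= config.userRate) return false;
  // Module-level sampling is random per event; unlisted modules are kept.
  const rate = config.moduleRates[moduleName] ?? 1;
  return random() < rate;
}
```

With `{ userRate: 0.1, moduleRates: { perf: 0.5 } }`, roughly one user in ten would report at all, and half of those users' performance events would be sent.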


A performance monitoring platform this complete must have a mature methodology behind it. From the start of platform design, we worked out a detailed technical design and a design for the measurement standards. The following sections describe these designs and the principles behind them in more detail.

4. How to measure Web experience

4.1 Site experience

First, for site experience, Web Vitals defines the LCP, FID, and CLS indicators, which have become the mainstream standard in the industry.

Building on long-term experience with optimizing experience metrics, the latest core experience indicators focus on loading, interactivity, and visual stability. Loading speed determines how soon users can see visual content; interaction speed determines how soon users feel that elements on the page can be operated; and visual stability measures the negative impact of visual jitter on users.

In summary, there are the following three indicators:


Largest Contentful Paint (LCP)

Largest Contentful Paint measures loading performance. This indicator reports the render time of the largest image or text block visible in the viewport. To provide a good user experience, LCP should be within 2.5 seconds.

First Input Delay (FID)

First Input Delay measures interactivity. FID measures the time from when a user first interacts with the page (for example, when they click a link, tap a button, or use a custom JavaScript-driven control) to when the browser is actually able to begin responding to that interaction. To provide a good user experience, a site should keep FID within 100 milliseconds.

Cumulative Layout Shift (CLS)

Cumulative Layout Shift measures visual stability. CLS measures the largest burst of layout shift scores across all unexpected layout shifts that occur during the entire lifespan of a page. To provide a good user experience, a site should keep its CLS score at 0.1 or lower.
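The three thresholds above can be turned into a simple check. The sketch below uses only the "good" limits quoted in the text; the names `GOOD_THRESHOLDS`, `isGood`, and `goodRatio` are illustrative and not part of any SDK.

```typescript
type VitalName = "LCP" | "FID" | "CLS";

// "Good" limits from the text: LCP and FID in milliseconds, CLS unitless.
const GOOD_THRESHOLDS: Record<VitalName, number> = {
  LCP: 2500,
  FID: 100,
  CLS: 0.1,
};

// True when a single measurement meets the "good" limit.
function isGood(name: VitalName, value: number): boolean {
  return value <= GOOD_THRESHOLDS[name];
}

// Fraction of samples (one per page view) that meet the limit.
function goodRatio(name: VitalName, samples: number[]): number {
  if (samples.length === 0) return 0;
  return samples.filter((v) => isGood(name, v)).length / samples.length;
}
```

A dashboard could, for instance, plot `goodRatio("LCP", samples)` per day to show how many page views loaded within the limit.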

4.2 Error monitoring

Next, error monitoring. Once a page reaches hundreds of millions of visits, no matter how many rounds of unit testing, integration testing, and manual testing it went through before release, some edge-case paths will inevitably go untested, and sometimes there are baffling faults that are hard to reproduce. Even if such errors occur only 0.1% of the time, a site with 100 million visits would still produce about 100,000 failures for its users.

This is where a complete error monitoring system proves its worth.

For JavaScript errors, static resource errors, and request errors, we provide macro indicators such as error count, error rate, number of affected users, and proportion of affected users, so that existing errors and their impact on users are visible at a glance and developers can fix problems as soon as possible.

For request monitoring, to further safeguard the user's experience when fetching data, we additionally provide indicators for request success rate and slow queries.
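As an illustration of the four macro indicators, here is a small sketch that derives them from a batch of reported errors. It assumes error rate means errors divided by page views and affected-user proportion means users with at least one error divided by all active users; the text does not define the formulas, and `ReportedError` and `summarizeErrors` are hypothetical names.

```typescript
interface ReportedError {
  userId: string;
  message: string;
}

interface ErrorMetrics {
  errorCount: number;
  errorRate: number;         // assumed: errors / page views
  affectedUsers: number;
  affectedUserRatio: number; // assumed: affected users / active users
}

function summarizeErrors(
  errors: ReportedError[],
  pageViews: number,
  activeUsers: number,
): ErrorMetrics {
  // Count each user once, however many errors they hit.
  const affected = new Set(errors.map((e) => e.userId)).size;
  return {
    errorCount: errors.length,
    errorRate: pageViews > 0 ? errors.length / pageViews : 0,
    affectedUsers: affected,
    affectedUserRatio: activeUsers > 0 ? affected / activeUsers : 0,
  };
}
```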

5. SDK collection

With these measurement standards in place, let's look at how the SDK implements them.

5.1 Which indicators need to be collected?

  • RUM (Real User Monitoring) indicators, including FP, TTI, FCP, FMP, FID, and MPFID.

  • Navigation Timing indicators for each stage, including DNS, TCP, DOM parsing, and other stages.

  • JS errors, which after parsing can be subdivided into runtime exceptions and static resource exceptions.

  • Request status codes, which after collection and reporting can be used to analyze request exceptions and similar information.

5.2 How are these indicators collected?

RUM indicators are collected mainly with browser performance APIs such as the Event Timing API.

Take FID as an example. First create a PerformanceObserver object and observe first-input entries. When a first-input entry arrives, use the Event Timing API to subtract the time the event occurred (startTime) from the time its processing began (processingStart); the difference is the FID.

// Create the Performance Observer instance.
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    const FID = entry.processingStart - entry.startTime;
    console.log('FID:', FID);
  }
});

// Start observing first-input entries.
observer.observe({
  type: 'first-input',
  buffered: true,
});





Navigation Timing indicators can be obtained through the PerformanceTiming interface. Take the calculation of load time as an example:

function onLoad() {
  var now = new Date().getTime();
  var page_load_time = now - performance.timing.navigationStart;
  console.log("User-perceived page loading time: " + page_load_time);
}


For JS errors, the window.onerror callback can listen for JavaScript runtime errors:

window.onerror = function (message, source, lineno, colno, error) {
  // Build the exception data and report it
};


Asynchronous errors from Promise rejections can be monitored through the unhandledrejection event:

window.addEventListener("unhandledrejection", event => {
  // Build the exception data and report it
});


Request status codes can be monitored by overriding window.fetch and the XMLHttpRequest object. Taking the fetch override as an example, here is the simplified code:

const _fetch = window.fetch;

window.fetch = (req: RequestInfo, options: RequestInit = {}) => {
  // Some logic omitted ...
  return _fetch(req, options).then(
    // Success
    (res) => {
      // Report successful request information
      return res;
    },
    // Failure
    (res) => {
      // Report failed request information
      return Promise.reject(res);
    },
  );
};
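The same idea can be written as a reusable wrapper that is easy to exercise outside a browser. This is only a sketch: `wrapFetch`, `RequestReport`, and the simplified `FetchLike` type are assumptions made for illustration, not the SDK's implementation. Note that a real `fetch` resolves for HTTP error statuses such as 404 or 500 and rejects only on network-level failures, so the status code has to be checked in the success branch.

```typescript
interface ResponseLike {
  status: number;
}

// Simplified stand-in for fetch: takes a URL, resolves with a response.
type FetchLike<R extends ResponseLike> = (url: string) => Promise<R>;

interface RequestReport {
  url: string;
  status: number | null; // null when the request never got a response
  durationMs: number;
  ok: boolean;
}

function wrapFetch<R extends ResponseLike>(
  fetchImpl: FetchLike<R>,
  report: (r: RequestReport) => void,
  now: () => number = Date.now,
): FetchLike<R> {
  return (url) => {
    const start = now();
    return fetchImpl(url).then(
      (res) => {
        // Resolved: the status code decides whether the request succeeded.
        const ok = res.status >= 200 && res.status < 300;
        report({ url, status: res.status, durationMs: now() - start, ok });
        return res;
      },
      (err) => {
        // Rejected: network failure, so no status code is available.
        report({ url, status: null, durationMs: now() - start, ok: false });
        return Promise.reject(err);
      },
    );
  };
}
```

The wrapper returns the original response or rethrows the original error, so calling code behaves the same with or without monitoring.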



6. Server-side processing

After the SDK collects the data, it hands the data to the server for collection, cleaning, and storage.

After receiving the data, the server performs cleaning tasks such as real-time dimension enrichment and stack de-obfuscation. Depending on the product functions of each platform, the data lands in different types of storage:


  • Data collection layer: a stateless API service with lightweight logic. It only authenticates, validates, and unpacks the data reported by the SDK, then writes it to the Kafka message queue for the data cleaning layer to consume.

  • Data cleaning layer: the logical center of data processing. It handles stack formatting, stack restoration (SourceMap parsing), dimension enrichment (IP -> location, User-Agent -> device information), and so on, providing the data that powers the platform's multidimensional analysis, statistics, and drill-down.

  • Storage layer: the platform chooses different storage schemes for different functional requirements, so that platform queries respond in real time, within seconds.

    • OLAP: we chose ClickHouse as the storage solution for data analysis. ClickHouse's strong performance, together with targeted optimizations made inside ByteDance, lets us query data on the scale of 100 billion records per day with second-level latency.

    • KV: a high-performance KV store developed in-house at ByteDance holds the data index information, combined with HDFS for the detail records. This enables detail-tracing features such as single-point query on the platform.

    • ES: for some custom log analysis and search scenarios, we use Elasticsearch as the storage to provide more flexible log search and analysis.

For alerting, we implemented an abstract alert query engine (its bottom layer can adapt to different data sources) that analyzes data for alerts in real time. We support flexible alert rule configuration and integrate with various third-party notification platforms as message channels.
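One way to picture an alert engine that sits on top of interchangeable data sources is sketched below. The `MetricSource` adapter, the `AlarmRule` shape, and `evaluateRules` are hypothetical and only illustrate the idea; the platform's real rule format is not described in this article.

```typescript
// Hypothetical adapter: each storage backend implements the same query method.
interface MetricSource {
  query(metric: string, windowMinutes: number): number;
}

interface AlarmRule {
  name: string;
  metric: string;
  windowMinutes: number;
  comparator: ">" | "<";
  threshold: number;
}

// Return the names of all rules whose condition currently holds.
function evaluateRules(rules: AlarmRule[], source: MetricSource): string[] {
  return rules
    .filter((rule) => {
      const value = source.query(rule.metric, rule.windowMinutes);
      return rule.comparator === ">"
        ? value > rule.threshold
        : value < rule.threshold;
    })
    .map((rule) => rule.name);
}
```

Because the engine depends only on `MetricSource`, the same rules could be evaluated against an OLAP store, a search index, or an in-memory stub in tests.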

At the SDK configuration level, an SDK settings service provides platform-based sampling rate configuration, giving real-time control over the reported data.

7. Visualization platform

After the collection and reporting, cleaning and storage, and statistical analysis described above, the data needs to be handed to users for consumption, so the visualization platform is just as important. The following sections introduce each function of the platform.

7.1 Performance analysis

The performance analysis module is divided into two submodules: page loading and static resource performance.

Page loading monitoring tracks the performance of front-end pages during user sessions. The indicator categories available are real-user performance indicators and page technical performance indicators. With page loading monitoring, you can see how long users actually wait and gain a comprehensive understanding of page performance.

Real-user performance indicators are the RUM indicators mentioned above plus some additional indicators added by the platform itself, including the following:

  • First paint time (FP): First Paint, the point in time of the first render.

  • First contentful paint time (FCP): First Contentful Paint, the point in time when content is first rendered.

  • First meaningful paint time (FMP): the time from when the user starts loading the page to when the first screen finishes rendering.

  • First input delay (FID): First Input Delay, the delay of the user's first interaction during the page loading phase. FID affects users' first impression of the page's interactivity and responsiveness.

  • Maximum potential first input delay (MPFID): the maximum delay a user interaction might encounter during the page loading phase.

  • Time to interactive (TTI): Time to Interactive, the time from the start of page loading until the page is fully interactive.

  • First-load bounce rate: the rate at which users leave before the first page has fully loaded.

  • Slow-load ratio: the proportion of PVs whose full load takes more than 5 s.
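The last two indicators in the list are simple ratios. A minimal sketch follows, assuming each page view carries its full load time in milliseconds and each first-visit session records whether loading finished before the user left; `slowLoadRatio`, `firstLoadBounceRate`, and `FirstLoadSession` are illustrative names.

```typescript
// Threshold from the list: a page view is slow when full load exceeds 5 s.
const SLOW_LOAD_MS = 5000;

// Proportion of page views whose full load took longer than the threshold.
function slowLoadRatio(loadTimesMs: number[]): number {
  if (loadTimesMs.length === 0) return 0;
  const slow = loadTimesMs.filter((t) => t > SLOW_LOAD_MS).length;
  return slow / loadTimesMs.length;
}

interface FirstLoadSession {
  loaded: boolean; // false when the user left before the page fully loaded
}

// Proportion of first-load sessions that ended before loading completed.
function firstLoadBounceRate(sessions: FirstLoadSession[]): number {
  if (sessions.length === 0) return 0;
  const bounced = sessions.filter((s) => !s.loaded).length;
  return bounced / sessions.length;
}
```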

On the page, you can view the status of these indicators clearly and in full.




Page technical performance indicators

The indicator definitions provided under page technical performance indicators come from the Navigation Timing specification.

Metric                      Description                       Calculation (using the Level 2 specification as an example)
DNS lookup                  Time spent in the DNS phase       domainLookupEnd - domainLookupStart
TCP connection              Time spent in the TCP phase       connectEnd - connectStart
SSL connection              SSL connection time               connectEnd - secureConnectionStart
First-byte network request  First-byte response time (TTFB)   responseStart - requestStart
Content transfer            Time spent in the response phase  responseEnd - responseStart
DOM parsing                 DOM parsing time                  domInteractive - responseEnd
Resource loading            Resource loading time             loadEventStart - domContentLoadedEventEnd
First byte                  Time to first byte                responseStart - fetchStart
DOM Ready                   DOM ready time                    domContentLoadedEventEnd - fetchStart
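The formulas in the table translate directly into code. The sketch below takes a plain object with the relevant Navigation Timing fields, so it can run anywhere; `NavTiming` and `computeStages` are illustrative names, and the guard for non-HTTPS pages is an addition not listed in the table.

```typescript
// Subset of Navigation Timing fields used by the table (all in milliseconds).
interface NavTiming {
  fetchStart: number;
  domainLookupStart: number;
  domainLookupEnd: number;
  connectStart: number;
  secureConnectionStart: number;
  connectEnd: number;
  requestStart: number;
  responseStart: number;
  responseEnd: number;
  domInteractive: number;
  domContentLoadedEventEnd: number;
  loadEventStart: number;
}

function computeStages(t: NavTiming) {
  return {
    dns: t.domainLookupEnd - t.domainLookupStart,
    tcp: t.connectEnd - t.connectStart,
    // secureConnectionStart is 0 when no secure connection was used.
    ssl: t.secureConnectionStart > 0 ? t.connectEnd - t.secureConnectionStart : 0,
    ttfb: t.responseStart - t.requestStart,
    contentTransfer: t.responseEnd - t.responseStart,
    domParse: t.domInteractive - t.responseEnd,
    resourceLoad: t.loadEventStart - t.domContentLoadedEventEnd,
    firstByte: t.responseStart - t.fetchStart,
    domReady: t.domContentLoadedEventEnd - t.fetchStart,
  };
}
```

In a browser, the entry returned by `performance.getEntriesByType("navigation")[0]` exposes these fields and could be passed in once the load event has fired.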


The slow load list shows pages that load slowly, so you can optimize them in a targeted way.


The slow load list shows specific URLs. Click a URL to open the details page and analyze where that URL's time is spent.




With the multidimensional analysis function, you can query the dimension distribution and proportions of all session performance indicators. Dimension analysis helps you find and locate anomalies.


Static resource performance monitoring provides functions similar to the charts above, and also supports a waterfall chart and a duration distribution chart for viewing the data from other angles.


7.2 Exception monitoring

Exception monitoring is divided into JS error monitoring, request error monitoring, and static resource error monitoring. The macro dimensions of these modules center on error count, error rate, number of affected users, and proportion of affected users.

JS error monitoring provides monitoring and analysis of JavaScript errors and also supports reporting custom errors. Overall, it is divided into an overview of top-level indicators and detailed issue analysis.


You can manage the status of an issue and assign a handler.


You can also query the user's device information, version information, and so on for every error event in the issue. Click the UUID or session ID to jump to single-point tracing and query the detailed logs of that user or that single session. The following are also available:

  • Error stack: the obfuscated error stack. If you have uploaded a SourceMap, you can view the original stack.

  • Breadcrumbs: record the user's actions before and after the error occurred. In addition to the request types collected automatically by the system, custom-tracked interaction event types are also supported.

  • Custom dimensions: in addition to the dimensions collected automatically by the system, you can submit custom dimensions.
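Breadcrumbs are commonly kept in a small bounded buffer on the client so that the most recent actions can be attached to an error report. The sketch below shows that general technique; the `Breadcrumb` shape and `BreadcrumbTrail` class are hypothetical, and it only covers actions recorded before the error.

```typescript
interface Breadcrumb {
  type: "request" | "click" | "custom";
  message: string;
  timestamp: number;
}

class BreadcrumbTrail {
  private items: Breadcrumb[] = [];
  private capacity: number;

  constructor(capacity: number = 20) {
    this.capacity = capacity;
  }

  add(crumb: Breadcrumb): void {
    this.items.push(crumb);
    // Drop the oldest entry once the buffer exceeds its capacity.
    if (this.items.length > this.capacity) this.items.shift();
  }

  // Copy of the current trail, oldest first, to attach to an error report.
  snapshot(): Breadcrumb[] {
    return [...this.items];
  }
}
```

An error handler would call `snapshot()` when an exception is caught and send the result together with the stack.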


The static resource error and request error modules are similar to the JS error module, providing an overview, issue management, and detail analysis.

7.3 Single-point tracing

Single-point tracing is used to investigate problems that a specific user encountered while using the product. It currently supports querying the logs of a single user, that is, log query.

Enter a UUID (user ID) or Session_ID (session ID) to query a single user's front-end logs over a period of time. Restoring the user's operation path makes it easier to locate the root cause of an event.


7.4 Alerting

A complete alerting system is also indispensable for a monitoring platform. Creating business-relevant alert mechanisms for the various data and exceptions helps find and solve problems as early as possible.

Alert tasks support creating alert policies and managing existing ones.



The platform connects easily to alert notification bots in Feishu; when an alert fires, it is sent to the Feishu group right away.


You can also view an overview and history list of alerts on the platform.



8. Summary

Application Performance Monitoring, Full-Link Edition is a fully self-developed application performance monitoring product that ByteDance's terminal technology team built up over years of serving apps with very large user bases, such as Douyin, Toutiao, TikTok, and Feishu. It has also been put into practice by multiple external customers, such as Hupu, Zuoyebang, and Zhenyun Technology, and provides enterprises and developers with a one-stop APM service.

Application Performance Monitoring, Full-Link Edition currently offers new users a 30-day limited-time free trial. It covers App monitoring, Web monitoring, Server monitoring, and mini-program monitoring: App monitoring and Web monitoring include 5 million events each, while Server and mini-program monitoring are unlimited for a limited time.
