This article is compiled from a talk given online at the Baidu Developer Salon by Li Ming Lu, video cloud terminal technology architect at Baidu Intelligent Cloud. Starting from the concept of the audio and video terminal engine, it traces the engine's development and technical evolution, focuses on its key technical components, and shares experience and practice from the development process.
Author / Li Ming Lu
Compiled by / Baidu Developer Center
Video replay: https://developer.baidu.com/live.html?id=8
The theme of this talk is optimization practice for the audio and video terminal engine. The content is divided into the following five parts:
1. Introduction to the audio and video terminal engine
2. Development of the video cloud audio and video terminal engine
3. Enterprise-oriented architecture of the video cloud audio and video terminal engine
4. Key technical components of the video cloud audio and video terminal engine
5. How the video cloud audio and video terminal engine serves developers
01 Introduction to the audio and video terminal engine
What is an audio and video terminal engine
Definition: An audio and video terminal engine is integrated into mobile or embedded devices. It uses a software-defined approach to extend the terminal's native audio and video processing capabilities and to solve, efficiently, the client-side problems of normalization, abstraction, and layering across heterogeneous audio and video scenarios.
Normalization: refining the main logic that lies behind different audio and video scenarios.
Abstraction: using virtual mapping to define the external connections between different audio and video entities and components.
Layering: using top-level architecture design to realize the internal relationships between different audio and video services and interfaces.
Basic architecture of the audio and video terminal engine
First, look at the top of the figure. This chain illustrates the normalization mentioned above: the audio and video terminal engine is iterated out of business needs and, through a methodology, feeds back into the business.
The chain lists several common business scenarios: video on demand, live streaming, short video, real-time audio and video, and so on. As these scenarios multiply, the business becomes more and more complex, so refining the main logic behind the different scenarios is very important.
Next, we focus on the architecture of the audio and video terminal engine.
First of all, the audio and video framework must be built on the system capabilities provided by the mobile device, including the system frameworks and the hardware capabilities.
As the figure shows, the system framework layer contains the iOS/Android multimedia frameworks, hardware encoding and decoding, hardware processing modules such as the CPU, GPU, and NPU, and some open-source image libraries.
Above that are third-party frameworks: FFmpeg and OpenSSL (encryption); ijkplayer and ExoPlayer, which are widely used in on-demand scenarios; underlying communication technologies such as WebRTC for two-way, low-latency communication, RTMP for one-way on-demand delivery, and the currently popular low-latency SRT scheme; and image processing frameworks such as GPUImage.
Above these sit the modules that are more closely tied to specific scenarios. The core ones are listed here.
Data leaves the multimedia capture module and passes through one or more stream-mixing steps, depending on the actual scenario. It then moves to the multimedia editing module, whose processing units implement many current short-video capabilities such as transitions and subtitle effects. Next comes the multimedia post-processing module, which handles AR effects and other playful interactive features. Once the content has been produced, it is distributed over different protocols depending on whether the scenario is on demand, live, or short video.
Then comes the consumption side. Data can be read directly, or read indirectly through a cache, which speeds up repeat loads. Once the data is obtained, it is demuxed according to its container format, such as FLV, TS, or MP4, and then decoded and rendered.
Compared with other mobile frameworks, what is most distinctive about an audio and video framework is the pipeline. Audio and video SDKs differ from other products in that they must first of all process data in real time: data streams shuttle constantly between modules, and keeping the transfers between modules efficient is what the pipeline concept addresses.
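The pipeline idea can be illustrated with a minimal sketch. It assumes each module is a plain function that takes a frame and returns a frame; the names Pipeline, beautify, and fake_encode, and the dictionary-based frame, are hypothetical and are not taken from the Baidu SDK.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Frame = Dict[str, object]          # hypothetical frame: {"pts": int, "data": bytes}
Stage = Callable[[Frame], Frame]   # every module maps one frame to one frame


class Pipeline:
    """Passes each frame through the stages in order, one frame at a time."""

    def __init__(self, stages: List[Stage]) -> None:
        self.stages = list(stages)

    def run(self, frames: Iterable[Frame]) -> Iterator[Frame]:
        # Frames are processed lazily, so a live source never has to be buffered whole.
        for frame in frames:
            for stage in self.stages:
                frame = stage(frame)
            yield frame


def beautify(frame: Frame) -> Frame:
    # Stand-in for a post-processing module: only tags the frame.
    return {**frame, "beautified": True}


def fake_encode(frame: Frame) -> Frame:
    # Stand-in for an encoder: keeps the first half of the payload.
    data = frame["data"]
    return {**frame, "data": data[: len(data) // 2]}


captured = ({"pts": i, "data": bytes(8)} for i in range(3))
out = list(Pipeline([beautify, fake_encode]).run(captured))
```

Because the stages only agree on the frame type, a module can be added, removed, or reordered without touching the others, which is the property the pipeline is meant to provide.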
So can this basic architecture handle today's audio and video scenarios? The answer, of course, is no. In practice, the architecture faces challenges at both the software and the hardware level. They fall roughly into the following categories:
1. Platform capabilities are not aligned
Although many platforms now provide a wide range of audio and video processing capabilities, those capabilities are not aligned across platforms. Take the way Apple's iOS multimedia framework and the Android multimedia framework handle 1080p HD video as an example.
iOS multimedia framework:
It offers the AVAsset and AVCapture interfaces, through which the raw data of high-resolution video can be obtained quickly.
Android multimedia framework:
It lacks simple, efficient interfaces like Apple's. Decoding usually takes two steps, and decoding may fail for high-resolution video.
2. Adapting to codec chips is difficult
Codec chips are complex and varied. Many manufacturers customize them for their own platforms and products, so the standards for audio and video codec chips are not aligned. For developers and cloud vendors, the cost of adaptation is very high.
3. CPU-to-GPU performance overhead is hard to handle
Many high-end phones now ship with several powerful coprocessors, but handling performance overhead in audio and video scenarios is still a challenge. Video capture, for example, is usually handled on the CPU, but acceleration has to happen on the GPU. Moving data from the CPU to the GPU requires coordination, and if it is not handled well it leads to very high performance overhead.
4. The OpenGL image processing framework faces frequent challenges
OpenGL is an older graphics framework and now faces major challenges. Apple has introduced its self-developed Metal framework for its own processors and has marked the OpenGL image processing framework as deprecated. If OpenGL is removed in the future, the stability of most mainstream products is bound to be affected.
New challenges for the audio and video terminal engine
Having briefly introduced the audio and video SDK framework, we now turn to the problems and challenges it currently faces.
1. Higher audio and picture standards
Many business scenarios today are pursuing new kinds of experiences: higher resolutions, higher refresh rates, and an altogether more extreme experience. In on-demand or short-video scenarios, for example, 720p resolution no longer meets the need, and in some sports scenarios a 30 fps refresh rate is no longer enough.
But an audio and video SDK is largely constrained by the capabilities of the platform. Platforms differ in many ways and the relevant hardware is not yet widespread, so audio and video SDKs run into many problems as they evolve.
In addition, data interaction among multiple modules is a major challenge: how can data be processed losslessly and transmitted efficiently?
2. Cross-screen audio and video scenarios
Beyond the upgrade of HD standards just mentioned, audio and video terminals are also gradually exploring immersive experiences, which has given rise to new platforms and new ways of interacting.
On smart speakers with screens, for example, users have new requirements: better scene rendering, better call quality, and lower playback latency.
There are also the popular digital humans: how can they be presented more accurately on large outdoor screens, or even on proprietary devices?
These cross-screen scenarios force the audio and video terminal engine to support more hardware platforms, adapt more features, and offer better human-computer interaction. Yet with device compatibility as a constraint, how can efficient computation and stable data transmission be guaranteed? These are the problems the audio and video terminal engine has to face.
02 Development of the video cloud audio and video terminal engine
After the previous chapter, you should have a general understanding of the audio and video terminal engine. This chapter traces how Baidu Video Cloud's audio and video engine has developed and picks out several important milestones to describe the solutions adopted at each.
Development roadmap of the video cloud audio and video terminal engine
Baidu Intelligent Cloud began offering audio and video solutions early on. The following figure shows the main stages.
Stage 1: OneSDK audio and video solution
It provided stream pushing and pulling, and was widely used during the "thousand-platform live streaming war", in live quiz shows, and in similar scenarios.
Stage 2: One-stop solution
As short video, interactive live streaming, and other real-time audio and video scenarios emerged, Baidu provided a more comprehensive one-stop solution that included special effects, editing, and co-hosting (Lianmai).
Stage 3: Pipeline design
As developers' understanding of audio and video scenarios deepened, they raised higher requirements and standards, such as low latency and high definition. Baidu Video Cloud rebuilt its solution around a pipeline and used the pipeline to open up more of the underlying capabilities.
Stage 4: R&D paradigm
How can developers use these pipelines more efficiently? Baidu Video Cloud also provides R&D paradigms that make the pipeline programmable. In addition, it provides some simple application components so that developers can combine the opened capabilities with their business faster and more efficiently.
At present, SDKEngine is still evolving. Work is focused on implementing a general-purpose engine that better meets cross-platform and cross-terminal audio and video requirements.
Introduction to the one-stop solution
The one-stop solution grew out of Baidu's own business: the business is divided into modules, and capabilities are delivered externally module by module.
As the figure above shows, the earliest one-stop solution provided many components and modules, such as a recording module, a live streaming module, an interactive module, and a special effects module, with some dependencies among them.
This solution is very efficient for developers who want to get an audio and video scenario running quickly. It helps them build the scenario from 0 to 1 and offers a lightweight, one-stop set of interfaces.
In practice, however, customer needs keep changing. A business may need to tune stream pushing at a lower level, or to gain access to underlying capabilities. The customer then has to file a work order with the solution vendor, and the vendor has to evaluate the request, iterate, and gradually open up those capabilities. This is not a friendly experience for developers. Decoupling interfaces from the business therefore became the next direction for the engine.
Introduction to the Pipeline design
Seen from the integration side, the Pipeline design integrates heterogeneous business architectures in a pipelined manner, consolidates the underlying audio and video capabilities, reuses audio and video modules, builds on the real-time audio and video engine, and delivers capabilities externally as plug-ins.
As the figure above shows, the Pipeline design removes the earlier high-level business interfaces and opens up lower-level ones. Take the recording module as an example: its business logic is abstracted away, and capture components such as Capture are offered instead. Through the Pipeline design, the underlying capabilities are opened up step by step.
But as capabilities are opened up, how can developers make better use of the underlying technology? And how can the Pipeline be customized more flexibly? These questions gave rise to our new terminal engine.
Introduction to the R&D paradigm
To address the problems developers ran into with the Pipeline design, we wanted the terminal engine to be programmable and controllable.
Baidu Video Cloud therefore took enterprise architecture as its starting point: the design combines a platform middle layer with low-code business development to deliver capabilities externally.
In earlier designs the terminal engine was built to plug into apps; in the R&D paradigm it is built for developers. This is an upgrade of the design concept: SDKEngine is generalized into an R&D paradigm that developers can use as needed or extend through secondary development.
At the application layer, common components such as the camera component, the co-hosting (Lianmai) component, and the stream-pushing component are encapsulated so that developers can plug them in and use them quickly.
At the platform middle layer, governance is achieved at the architecture level: for different audio and video scenarios, solutions can be assembled and the corresponding products packaged flexibly and automatically.
Meanwhile, the underlying engine capabilities, such as weak-network optimization and transmission, are also gradually being opened up.
03 Enterprise-oriented architecture of the video cloud audio and video terminal engine
The previous chapter described how the video cloud audio and video terminal engine developed. Along the way we also accumulated experience; here are some of the lessons learned in practice.
Decentralized design concept
Background: Audio and video scenarios are becoming more complex and more specialized. As audio and video developers mature, the traditional business-driven design concept can no longer accommodate the new scenarios that developers define.
What decentralization means:
>> From business-centered to service-first
That is, cloud vendors should look at the development of the terminal engine from the developer's point of view. Which capabilities should be provided, and how? How should each capability evolve? This is a particularly good feedback mechanism.
>> From closed architecture design to collaboration and mutual benefit
The R&D paradigm described in the previous chapter is one such solution: it opens more interfaces and underlying capabilities to developers. The aim is to collaborate better with developers and to serve new scenarios better.
Principles of decentralized design:
1. Single responsibility principle: modularize business scenarios.
2. Open-closed principle: each module is highly autonomous and does not affect other modules.
Module quality is strictly tested by the vendor, which relieves developers of that worry. Each module is also pluggable, so when something goes wrong it can be isolated and ruled out quickly.
3. Substitutable interfaces: interface design is inclusive.
The complexity of audio and video scenarios today far exceeds what cloud vendors imagined, and developers' requirements for modules and interfaces are highly individual. The modules and interfaces of the audio and video terminal engine therefore need to be deliverable independently and designed inclusively, to reduce the burden on developers.
4. Interface segregation: interfaces are easy to understand.
5. Dependency inversion: program to interfaces, which are easy to use. That is, an interface must be easy and quick to use, and must impose no burden even when it is extended or wrapped further.
From integrable to programmable
After developing the audio and video terminal engine, we concluded that being integrable is only a basic capability of an SDK; only programmability lets it serve developers well. With programmability, client-side business can be decoupled from the terminal engine, which provides standard components and supports development under a programmable paradigm.
As shown in the figure above, the original live streaming demo carried a lot of complex UI and business logic at the design level, and involved many business SDKs at the code level. This design is simple but limited: scenarios cannot be split or refined, which weakens developers' recognition of the SDK as a product.
After a B-side (enterprise-facing) redesign and business refactoring, the original live streaming demo was split into several business scenarios, such as standard live streaming, interactive live streaming, and low-latency live streaming. The R&D paradigm exposes the capabilities of the terminal engine, and each module is highly cohesive, which lowers the cost of use for developers.
04 Introduction to key technical components
Introduction to the data pipeline component
The data pipeline is responsible for data transmission and fully decouples data transfer between components. Taking the live stream-pushing scenario in the figure above as an example, this section explains how it does so.
At a coarse granularity, a live stream-pushing scenario can be divided into the following classes: camera, beauty, and stream pushing.
The camera class uses some underlying components, such as the capture component (which opens the camera, the microphone, and so on). The capture component delivers data to the camera class through interface callbacks. How, then, does camera data get to the beauty class or to other classes?
The first step is to define a data protocol, whose main purpose is efficient and stable delivery of data between modules. We provide an AVOutput module and an AVInput protocol for sending and receiving data. AVOutput simply records and manages the components on the chain, which we call targets.
Then a Dispatcher mechanism distributes the video frames reported by the producer, passing them continuously to every target on the chain. Each target implements the methods of the AVInput protocol, for example the two functions frame and type: frame receives the data passed down from raiseFrame in AVOutput, and type distinguishes audio from video.
We also support distribution of binary data, mainly by upgrading the protocol to meet the distribution requirements of scenarios such as live streaming. At the end of the chain we additionally implement AVControl. Unlike before, we define a control protocol, whose main purpose is to control the inflow and outflow of data.
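The relationship between AVOutput, AVInput, and targets can be sketched as follows. This is only an illustration under assumptions: the source names AVOutput, AVInput, frame, type, and raiseFrame, but the signatures, the MediaType enum, and the Beauty and Streamer classes below are hypothetical.

```python
from enum import Enum
from typing import List, Protocol, Tuple


class MediaType(Enum):
    AUDIO = "audio"
    VIDEO = "video"


class AVInput(Protocol):
    """Receiving side of the data protocol (assumed signature)."""

    def frame(self, data: bytes, media_type: MediaType) -> None: ...


class AVOutput:
    """Sending side: records downstream targets and dispatches frames to them."""

    def __init__(self) -> None:
        self._targets: List[AVInput] = []

    def add_target(self, target: AVInput) -> None:
        self._targets.append(target)

    def raise_frame(self, data: bytes, media_type: MediaType) -> None:
        # Iterate over a copy so a target may detach itself while handling a frame.
        for target in list(self._targets):
            target.frame(data, media_type)


class Beauty(AVOutput):
    """Both a target (implements frame) and a source (inherits raise_frame)."""

    def frame(self, data: bytes, media_type: MediaType) -> None:
        if media_type is MediaType.VIDEO:
            data = b"beauty:" + data   # stand-in for real image processing
        self.raise_frame(data, media_type)


class Streamer:
    """End of the chain: collects what would be sent to the network."""

    def __init__(self) -> None:
        self.sent: List[Tuple[MediaType, bytes]] = []

    def frame(self, data: bytes, media_type: MediaType) -> None:
        self.sent.append((media_type, data))


camera, beauty, streamer = AVOutput(), Beauty(), Streamer()
camera.add_target(beauty)
beauty.add_target(streamer)
camera.raise_frame(b"f0", MediaType.VIDEO)
camera.raise_frame(b"a0", MediaType.AUDIO)
```

In this sketch the camera knows nothing about beauty processing or stream pushing; it only knows that its targets accept frames, which is what allows the modules to be decoupled.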
Why do this? The most important thing in an audio and video SDK is that data keeps flowing. If a module fails and there is no mechanism to protect the data flow, the whole SDK may become unstable. For example, if network jitter is detected while distributing a live stream, the sending rate and speed can be adjusted at that point.
The figure here is a simple sketch of our data flow.
One mode is push: data is captured directly from the camera module and passed straight on to a downstream module.
The other mode is pull: data is read from a local file or obtained indirectly. Reading from the network, for example, means the data has to be read first and then passed on to the modules.
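The difference between the two modes can be shown with a small sketch. The function names push_source and pull_source are hypothetical; an in-memory byte stream stands in for a local file or network source.

```python
import io
from typing import BinaryIO, Callable, Iterable, Iterator, List


def push_source(frames: Iterable[bytes], on_frame: Callable[[bytes], None]) -> None:
    """Push mode: the producer (for example a camera) drives the flow via a callback."""
    for frame in frames:
        on_frame(frame)


def pull_source(stream: BinaryIO, chunk_size: int) -> Iterator[bytes]:
    """Pull mode: the consumer asks for data; nothing is read until it iterates."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:          # empty bytes means end of stream
            break
        yield chunk


received: List[bytes] = []
push_source([b"cam0", b"cam1"], received.append)

pulled = list(pull_source(io.BytesIO(b"abcdefgh"), chunk_size=3))
```

In push mode the pace is set by the capture device, while in pull mode it is set by the reader, which is why the two cases are drawn as separate paths in the data flow.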
Compared with GPUImage, which does some of its processing on asynchronous threads, we have also added compatibility and protection measures to prevent objects from being released during asynchronous processing. So we basically borrow GPUImage's idea of a simple protocol and add mechanisms such as the control protocol on top of it, making the whole chain controllable.
Adding the control protocol at this step is necessary. Audio and video on the terminal is mostly processed in real time, and pre-processing modules keep piling up, which puts heavy pressure on terminal performance. With the control protocol, these controls can be handed to developers, who can adjust parameters in time to keep data transmission efficient.
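What such control hooks might look like is sketched below. The source only states that AVControl governs the inflow and outflow of data; the class ControlledSender, its methods, and the drop policy are assumptions made for illustration.

```python
from collections import deque
from typing import Deque


class ControlledSender:
    """Hypothetical sink exposing control hooks in the spirit of a control protocol."""

    def __init__(self, max_queue: int, bitrate_kbps: int) -> None:
        self.queue: Deque[bytes] = deque()
        self.max_queue = max_queue
        self.bitrate_kbps = bitrate_kbps
        self.paused = False
        self.dropped = 0

    def frame(self, data: bytes, is_keyframe: bool) -> None:
        # Inflow control: drop everything while paused, and drop non-key frames
        # when the queue is full. Key frames are always kept so the decoder can recover.
        if self.paused or (len(self.queue) >= self.max_queue and not is_keyframe):
            self.dropped += 1
            return
        self.queue.append(data)

    def set_bitrate(self, kbps: int) -> None:
        # Outflow control: clamp to an assumed floor of 100 kbps.
        self.bitrate_kbps = max(100, kbps)

    def pause(self) -> None:
        self.paused = True

    def resume(self) -> None:
        self.paused = False


sender = ControlledSender(max_queue=2, bitrate_kbps=2000)
for index in range(4):
    sender.frame(b"frame%d" % index, is_keyframe=(index == 0))

# The application reacts to detected network jitter by lowering the send rate.
sender.set_bitrate(1000)
```

The point of the sketch is that the decision (when to lower the bitrate, when to pause) stays with the developer, while the mechanism lives inside the component.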
Introduction to the special effects component
The special effects component provides many interfaces. A special effects module typically has a PaaS structure: it hosts many models, and the models can be plugged in and out. Its other characteristic is that it consumes a lot of resources. So how can an audio and video SDK make better use of this module and expose its capabilities externally?
Take the advanced beauty interface for face effects as an example. Advanced beauty involves many feature points, such as enlarging the eyes, slimming the face, and reshaping the chin. These cannot be handled by a single iteration or a single model; they may require several iterations and a combination of models. This creates a problem: if these capabilities keep changing, integrating the special effects module becomes unsafe and unstable for its users. How can this problem be solved, or at least shielded?
Here we introduce a concept: the proxy layer.
First, capabilities are not called directly. They are abstracted and encapsulated, and the encapsulated models are then used to associate the different algorithms behind them. Developers using the SDK do not necessarily integrate everything the vendor provides; they may use only part of it, which can leave different special effects SDKs at inconsistent versions.
Without the proxy layer, a version inconsistency could force many interface adjustments and changes in the upper layer, which is time-consuming and laborious. With the proxy in place, the upper-layer interface is shielded and stays stable, and our models and abstract objects can drive different versions of the AR module.
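A minimal sketch of the proxy (broker) layer idea follows. The two backend classes imitate AR modules whose interfaces differ between versions; every name and parameter here is hypothetical and does not describe the actual special effects SDK.

```python
from typing import Dict, Union


class ARBackendV1:
    """Imagined older AR module: takes integer levels through apply()."""

    def apply(self, params: Dict[str, int]) -> Dict[str, object]:
        return {"version": 1, "eye": params.get("big_eyes", 0)}


class ARBackendV2:
    """Imagined newer AR module: renamed entry point, levels as 0.0 to 1.0 scales."""

    def run(self, options: Dict[str, float]) -> Dict[str, object]:
        return {"version": 2, "eye_scale": options.get("big_eyes", 0.0)}


class EffectsProxy:
    """Stable upper-layer interface; hides which backend version is installed."""

    def __init__(self, backend: Union[ARBackendV1, ARBackendV2]) -> None:
        self._backend = backend
        self._features: Dict[str, int] = {}

    def set_feature(self, name: str, level: int) -> None:
        # Levels are always 0 to 100 for callers, whatever the backend expects.
        self._features[name] = max(0, min(100, level))

    def render(self) -> Dict[str, object]:
        if isinstance(self._backend, ARBackendV1):
            return self._backend.apply(dict(self._features))
        scaled = {name: level / 100 for name, level in self._features.items()}
        return self._backend.run(scaled)


old = EffectsProxy(ARBackendV1())
old.set_feature("big_eyes", 50)
new = EffectsProxy(ARBackendV2())
new.set_feature("big_eyes", 50)
old_result, new_result = old.render(), new.render()
```

The calling code is identical for both versions; only the proxy knows how to translate the request, so a backend upgrade does not ripple into the upper layer.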
The figure above makes this more intuitive:
The camera module and the special effects module are connected through the data pipeline. When data enters the special effects module, the component uses the parameters passed in by the developer to associate the relevant portrait algorithms, so that problems can be isolated faster and more safely.
The data is then passed to the device for on-device inference. At present, the Baidu Video Cloud audio and video terminal engine uses Baidu's Paddle Lite for on-device inference. By monitoring core performance and indicators, the graph is optimized, and the final special effect is produced through rendering and data conversion.
Introduction to the rendering component
The rendering component is easy to understand: given point, line, and surface information plus texture information, rendering is the process of generating one or more bitmaps from that information. It is one of the key technologies of the audio and video terminal engine and the last mile that users actually perceive. As scenarios become richer (for example HDR video compositing and playback, or VR), the rendering module becomes the point where data interaction and processing converge.
Baidu Video Cloud mainly uses OpenGL for this component and has done some optimization work on it to make the rendering component easier to use across platforms.
In fact, off-screen rendering on iOS and off-screen rendering on Android differ greatly at the lowest level, and these differences lead to different technical details in encapsulation and adaptation. The following briefly describes how those details are implemented.
1. ES 3.0
OpenGL ES 3.0 gives developers many new features. For example, it adds the concept of the VAO (vertex array object), and VAOs are more efficient when many buffers have to be handled.
On platforms that do not support VAOs, we simulate the VAO concept, so that the cross-platform rendering component is supported more efficiently on every terminal.
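The idea of simulating a VAO can be sketched without a real graphics context. In the sketch below, FakeGL merely logs calls and stands in for a GL binding (a real implementation would issue calls such as glBindVertexArray, glBindBuffer, and glVertexAttribPointer); VertexLayout and its fields are hypothetical.

```python
from typing import List, Tuple

Attrib = Tuple[int, int, int]  # (buffer id, attribute index, component count)


class FakeGL:
    """Logs calls instead of talking to a GPU; stands in for a real GL binding."""

    def __init__(self, has_vao: bool) -> None:
        self.has_vao = has_vao
        self.calls: List[tuple] = []

    def bind_vertex_array(self, vao_id: int) -> None:
        self.calls.append(("bind_vao", vao_id))

    def bind_buffer(self, buffer_id: int) -> None:
        self.calls.append(("bind_buffer", buffer_id))

    def vertex_attrib_pointer(self, index: int, size: int) -> None:
        self.calls.append(("attrib", index, size))


class VertexLayout:
    """Uses a native VAO when available, otherwise replays the state on each bind."""

    def __init__(self, gl: FakeGL, vao_id: int, attribs: List[Attrib]) -> None:
        self.gl, self.vao_id, self.attribs = gl, vao_id, list(attribs)
        if gl.has_vao:
            # Record the attribute state into the VAO once, at setup time.
            gl.bind_vertex_array(vao_id)
            self._apply_attribs()
            gl.bind_vertex_array(0)

    def _apply_attribs(self) -> None:
        for buffer_id, index, size in self.attribs:
            self.gl.bind_buffer(buffer_id)
            self.gl.vertex_attrib_pointer(index, size)

    def bind(self) -> None:
        if self.gl.has_vao:
            self.gl.bind_vertex_array(self.vao_id)   # one call restores everything
        else:
            self._apply_attribs()                    # emulation: replay every binding


layout_attribs = [(10, 0, 3), (11, 1, 2)]  # position (xyz) and texture coordinate (uv)

es3 = FakeGL(has_vao=True)
es3_layout = VertexLayout(es3, vao_id=1, attribs=layout_attribs)
es3.calls.clear()
es3_layout.bind()

es2 = FakeGL(has_vao=False)
es2_layout = VertexLayout(es2, vao_id=1, attribs=layout_attribs)
es2_layout.bind()
```

Callers use the same bind() call on both paths; the emulated path simply costs more calls per draw, which is the efficiency gap that native VAOs close.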
2. Partial framework of the rendering component based on OpenGL ES 3.0
First, traditional VBOs are packaged into the VAO data format through ES 3.0. Then, on ES 2.0, the capabilities are aligned through simulation.
Shader splitting: OpenGL is programmed through a shader language, so we split the shaders apart. For example, position information and vertex information are separated further and exposed through simpler, more abstract interfaces, lowering the cost of use. Over time, some of the lower-level shader development methods will also be opened up so that developers can adapt more advanced techniques.
On this basis, the basic cross-platform demands on the rendering component can largely be met.
Introduction to the RTC component
RTC sets higher technical standards for real-time services: lower end-to-end latency in order to simulate more natural interaction. It is used in a wide range of scenarios, such as interactive co-hosting (Lianmai), live commerce, and ultra-low-latency playback.
The Baidu Video Cloud team currently provides two RTC components. One is the co-hosting component RTCRoom; the other is the RTCPlayer player for ultra-low-latency scenarios. Both have been technically optimized.
1. Optimization of transmission quality
Signaling over UDP: faster access, with ordering handled explicitly.
RTC signaling is bidirectional. Traditional signaling schemes are built on TCP; to achieve lower latency, a faster channel, UDP, is used instead. However, UDP can lose packets in transit and deliver them out of order, so ordering has to be handled and controlled to prevent unexpected content.
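One common way to restore order on top of UDP is to number the signaling messages and hold back anything that arrives early. The sketch below shows that technique in general terms; it is not taken from the RTC component, and the class and method names are hypothetical.

```python
from typing import Dict, List


class OrderedSignalReceiver:
    """Delivers sequence-numbered messages in order, ignoring duplicates."""

    def __init__(self) -> None:
        self.next_seq = 0                      # next sequence number to deliver
        self.pending: Dict[int, str] = {}      # messages that arrived early
        self.delivered: List[str] = []

    def on_packet(self, seq: int, payload: str) -> None:
        if seq < self.next_seq or seq in self.pending:
            return                             # duplicate or already delivered
        self.pending[seq] = payload
        # Flush every message that is now contiguous with what was delivered.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def missing(self) -> List[int]:
        """Sequence numbers to request again (gaps below the highest one seen)."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending)) if s not in self.pending]


receiver = OrderedSignalReceiver()
for seq, payload in [(0, "join"), (2, "publish"), (2, "publish"), (1, "offer")]:
    receiver.on_packet(seq, payload)
```

A real signaling channel would also need acknowledgements and retransmission timers; the missing() helper only indicates which messages such a mechanism would ask for.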
2. Transmission link analysis and optimization
Transmission link tuning: identify different networks and apply an uplink congestion control strategy.
While a user is using the RTC component, the network environment may change in real time. The RTC component needs to recognize such changes quickly and apply strategies that minimize the impact on the downlink receiving side. See the figure above for the specific process.
In general, it includes:
Sender-side congestion control, with part of the bitrate estimation flow; layered uplink transmission; and smooth downlink rendering.
Of these, bitrate estimation is a very important link in the whole RTC transmission chain. If uplink transmission quality cannot be guaranteed, any further optimization on the downlink is futile.
Therefore, many optimizations have been made across the project for different scenarios, such as packet loss and delay.
For packet loss, a tiered strategy is adopted. When ordinary packet loss occurs (bandwidth is not limited and only packets are lost), the strategy is adjusted to transmit at a higher bitrate and to work with the server's retransmission mechanism, so that jittery data transmission recovers as quickly as possible.
When severe packet loss occurs (for example when switching from 4G to 3G or entering the subway), bandwidth stays constrained for a long time. The bitrate estimation step must then make an accurate prediction and avoid obvious swings and fluctuations, so that uplink transmission is preserved.
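The two-tier strategy can be sketched as a single update rule. All constants below (the 5% step, the 85% headroom, the smoothing factor, the 4000 kbps cap) are illustrative assumptions, not values used by the RTC component, and the function name next_bitrate is hypothetical.

```python
def next_bitrate(
    current_kbps: float,
    est_bandwidth_kbps: float,
    bandwidth_limited: bool,
    max_kbps: float = 4000.0,
    smoothing: float = 0.3,
) -> float:
    """Return the bitrate to use for the next interval.

    Ordinary loss (bandwidth not limited): step the bitrate up slightly and rely
    on retransmission to repair the lost packets.
    Severe loss (bandwidth limited): move gradually toward a fraction of the
    estimated bandwidth so the bitrate does not swing sharply.
    """
    if not bandwidth_limited:
        return min(max_kbps, current_kbps * 1.05)
    target = 0.85 * est_bandwidth_kbps          # leave headroom below the estimate
    return current_kbps + smoothing * (target - current_kbps)


# Ordinary loss on an otherwise healthy link.
random_loss_rate = next_bitrate(2000.0, 5000.0, bandwidth_limited=False)

# Bandwidth collapses to about 1000 kbps: converge over several intervals.
rate = 2000.0
history = []
for _ in range(3):
    rate = next_bitrate(rate, 1000.0, bandwidth_limited=True)
    history.append(rate)
```

The smoothing factor controls the trade-off named in the text: a larger value reacts faster to the new bandwidth, a smaller one produces a steadier bitrate.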
Smooth downlink rendering is also key. In downlink packet loss scenarios, how can rendering be fast and latency kept low? This requires a set of transmission quality standards and the streamlining of unnecessary RTC processing, in order to adapt to different networks.
05 Video cloud audio and video terminal engine: serving developers
For industry users
The Baidu Video Cloud audio and video terminal engine currently serves a wide range of industry users, including finance, education, and media. Access mechanisms and requirements differ from scenario to scenario, which ensures that different developers can get what they need when using our products.
For ordinary developers
Ordinary developers can learn about the capabilities of the video cloud terminal engine through the SDK Center, or integrate it into their own products to try it out.
That concludes the speaker's talk. Questions are welcome in the comments.