On May 15, Gao Chun, a senior architect at Agora, took part in "WebAssembly Meetup", the first offline event held by the WebAssembly community, and shared Agora's practical experience of implementing real-time video portrait segmentation on the Web. The following is a summary of the talk.
The RTC industry has developed rapidly in recent years, and scenarios such as online education and video conferencing are booming. The growth of these scenarios also places higher demands on the technology. As a result, machine learning is increasingly applied to real-time audio and video, for example in super resolution, beauty filters, and real-time voice enhancement. The same demand exists on the Web, and it is a challenge for every audio and video developer. Fortunately, WebAssembly makes high-performance computing on the Web possible. Below, we start from our exploration and practice of portrait segmentation on the Web.
In which scenarios is video portrait segmentation used?
When it comes to portrait segmentation, the first application that comes to mind is green-screen matting in film and television production. Video is shot against a green screen, and in post-production the background is replaced with a computer-generated scene.
You have probably seen another application. On Bilibili, the bullet comments on some videos do not cover the people on screen; the text passes behind them. This is also based on portrait segmentation.
Both of these are implemented on the server side, and neither is a real-time audio and video scenario.
The portrait segmentation Agora has built targets real-time audio and video scenarios such as video conferencing and online teaching. With it, we can blur the background of a video or replace it altogether.
Why do these real-time scenarios need this technology?
A recent study found that in an average 38-minute conference call, a full 13 minutes are wasted dealing with distractions and interruptions. Online interviews, lectures, and staff training, as well as brainstorming sessions, sales pitches, IT assistance, customer support, and webinars, all face the same problem. Blurring the background, or choosing one of the many custom and preset virtual backgrounds, can greatly reduce distractions.
Another survey shows that 23% of American employees say video conferencing makes them uncomfortable, and 75% say they still prefer voice conferencing to video conferencing. The reason is that people do not want their living environment and privacy exposed to others. Replacing the video background solves this problem.
At present, portrait segmentation and virtual backgrounds in real-time audio and video mostly run in native clients. Only Google Meet has used WebAssembly to implement portrait segmentation for real-time video on the Web. Agora's implementation combines machine learning, WebAssembly, WebGL, and other technologies.
Implementing a real-time video virtual background on the Web
Technical components and the real-time processing flow of portrait segmentation
To do portrait segmentation on the Web, we use the following components:
WebRTC: audio and video capture and transmission.
TensorFlow: the framework for the portrait segmentation model.
WebAssembly: implementation of the portrait segmentation algorithm.
WebGL: image processing algorithms, implemented in GLSL, that process the segmented portrait image.
Canvas: final rendering of the video and image results.
Agora Web SDK: real-time audio and video transmission.
The real-time processing flow is as follows. First, video is captured through the W3C MediaStream API. The data is then handed to the WebAssembly engine for inference. Because machine learning is computationally expensive, the input must not be too large, so the video image is scaled and normalized before it enters the machine learning framework. The output from WebAssembly needs some post-processing and is then passed to WebGL. WebGL uses this information together with the original video to perform filtering, compositing, and other processing, and produces the final result. The result is rendered to a Canvas and then transmitted in real time through the Agora Web SDK.
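The order of the stages in this flow can be sketched as a single per-frame function. This is only an illustration of the data flow: the `Frame` and `Stages` types and every stage name are hypothetical, not part of the Agora Web SDK or of the code shown in the talk, and the stages are injected so that the sketch runs without a browser.

```typescript
// Hypothetical types for illustration; none of these names come from Agora's code.
interface Frame {
  width: number;
  height: number;
  rgba: Uint8ClampedArray; // interleaved RGBA pixels of the captured frame
}

interface Stages {
  preprocess(frame: Frame): Float32Array;              // scale and normalize
  infer(input: Float32Array): Float32Array;            // WebAssembly model call
  postprocess(raw: Float32Array): Float32Array;        // turn raw output into a mask
  composite(frame: Frame, mask: Float32Array): Frame;  // WebGL filtering and blending
}

// Run one captured frame through the stages in the order described above.
function processFrame(frame: Frame, stages: Stages): Frame {
  const input = stages.preprocess(frame);
  const raw = stages.infer(input);
  const mask = stages.postprocess(raw);
  return stages.composite(frame, mask);
}
```

In a real page, the frame returned by `composite` would be drawn to the Canvas whose stream is handed to the SDK for transmission.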
Choosing a machine learning framework
Before building portrait segmentation ourselves, we naturally considered whether a ready-made machine learning framework would do. Available options include ONNX.js, TensorFlow.js, Keras.js, and MIL WebDNN. They all use WebGL or WebAssembly as the compute backend. But when we tried these frameworks, we found some problems:
1. Model files lack the necessary protection. At run time the browser usually loads the model from the server, so the model is exposed directly to the browser client. This works against intellectual property protection.
2. The I/O design of general-purpose JS frameworks does not take the actual scenario into account. For example, the input to TensorFlow.js is a generic array whose content is wrapped into an InputTensor and then passed to WebAssembly or uploaded as a WebGL texture. The process is relatively complicated, and performance cannot be guaranteed for video data with strict real-time requirements.
3. Operator support is incomplete. General-purpose frameworks lack, to a greater or lesser degree, operators that can process video data.
To address these problems, our solution is as follows:
1. Port a native machine learning framework to Wasm.
2. Implement the missing operators ourselves as custom operators.
3. For performance, optimize with SIMD (single instruction, multiple data) and multithreading.
Data preprocessing requires scaling the image. There are two ways to do this in the front end: one uses Canvas2D, the other uses WebGL.
With Canvas2D, Canvas2D.drawImage() draws the content of the Video element onto a Canvas, and Canvas2D.getImageData() then reads back the image at the scaled size you need.
With WebGL, the Video element itself can be passed as a parameter when uploading a texture. WebGL also provides the ability to read the video data back from a FrameBuffer.
We tested the performance of both methods on x86_64 Windows 10, in two browsers, measuring the preprocessing time of Canvas2D and WebGL for video at three different resolutions. The results indicate which method to choose when preprocessing video at each resolution.
[Figure: preprocessing time of Canvas2D and WebGL at three video resolutions in two browsers, x86_64 Windows 10]
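Whichever method does the scaling, the pixels that come back are interleaved RGBA bytes, and they still have to be normalized before inference. The following is a minimal sketch of that normalization step. The `RgbaImage` type mirrors the fields of a browser ImageData object, while the function name, the [0, 1] value range, and the interleaved RGB layout are assumptions made for illustration; the model used in the talk may expect something different.

```typescript
// Structural stand-in for the width/height/data fields of a browser ImageData.
interface RgbaImage {
  width: number;
  height: number;
  data: Uint8ClampedArray; // interleaved R, G, B, A bytes
}

/**
 * Convert interleaved RGBA bytes into an interleaved RGB float tensor.
 * Assumption: the model wants values in [0, 1] and no alpha channel.
 */
function toModelInput(image: RgbaImage): Float32Array {
  const pixelCount = image.width * image.height;
  if (image.data.length !== pixelCount * 4) {
    throw new Error("RGBA buffer length does not match width * height * 4");
  }
  const out = new Float32Array(pixelCount * 3);
  for (let p = 0; p < pixelCount; p++) {
    out[p * 3] = image.data[p * 4] / 255;         // red
    out[p * 3 + 1] = image.data[p * 4 + 1] / 255; // green
    out[p * 3 + 2] = image.data[p * 4 + 2] / 255; // blue; alpha is dropped
  }
  return out;
}
```

Writing into a preallocated Float32Array keeps the step free of per-pixel allocations, which matters when it runs on every frame.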
Web Workers and multithreading
Because Wasm inference is computationally heavy, it can block the JS main thread. There are also special situations to consider: for example, in a cafe with no power outlet nearby, the device switches to a low-power mode, the CPU is throttled, and video processing may drop frames.
So we need to improve performance, and this is where Web Workers come in. Running the machine learning inference in a Web Worker effectively reduces blocking of the JS main thread.
Usage is simple. The main thread creates a Web Worker, which runs on another thread, and sends it messages through worker.postMessage for the worker to process. The talk illustrated this with a short code example.
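A minimal sketch of this messaging pattern is given below; it is not the code shown in the talk. `WorkerLike` is a simplified stand-in for the browser Worker interface so that the sketch stays self-contained, and the message shapes and the `createSegmenterClient` name are hypothetical. The worker side, which would run the Wasm inference and post back a `SegmentResponse`, is omitted.

```typescript
// Simplified stand-in for the parts of the browser Worker API used here.
interface WorkerLike {
  postMessage(message: unknown, transfer?: ArrayBuffer[]): void;
  onmessage: ((event: { data: unknown }) => void) | null;
}

// Hypothetical message shapes exchanged with the inference worker.
interface SegmentRequest { id: number; pixels: Float32Array; }
interface SegmentResponse { id: number; mask: Float32Array; }

// Wrap a worker so that each frame sent returns a promise for its mask.
function createSegmenterClient(worker: WorkerLike): (pixels: Float32Array) => Promise<Float32Array> {
  const pending = new Map<number, (mask: Float32Array) => void>();
  let nextId = 0;

  // One handler routes every reply to the request that is waiting for it.
  worker.onmessage = (event) => {
    const reply = event.data as SegmentResponse;
    const resolve = pending.get(reply.id);
    if (resolve) {
      pending.delete(reply.id);
      resolve(reply.mask);
    }
  };

  return (pixels) =>
    new Promise<Float32Array>((resolve) => {
      const request: SegmentRequest = { id: nextId++, pixels };
      pending.set(request.id, resolve);
      worker.postMessage(request); // structured-cloned (copied) by default
    });
}
```

In a browser the real Worker type would take the place of `WorkerLike`. Passing a transfer list as the second argument to postMessage moves an ArrayBuffer to the worker instead of copying it, at the cost of the sender losing access to that buffer.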
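But it may also introduce some new problems, notably the overhead of copying data between threads and the complications of sharing memory to avoid that copy.
To solve these two problems, we also did some analysis.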
When the data being transferred is a JS primitive type, an ArrayBuffer, ArrayBufferView, or ImageData, a File / FileList / Blob, or a Boolean / String / Object / Map / Set, postMessage makes a deep copy using the structured clone algorithm.
We tested the performance of data transfer between the JS main thread and Web Workers, and between different pages, on an x86_64 Windows 10 machine.
[Figure: postMessage data transfer time by data size, x86_64 Windows 10]
Our preprocessed data is under 200 KB, so according to these results the transfer takes less than 1.68 ms. This overhead is almost negligible.
To avoid the structured copy, we can use SharedArrayBuffer. As the name suggests, SharedArrayBuffer lets the main thread and the Worker share a region of memory and access the data at the same time.
We also tried SharedArrayBuffer in Agora's portrait segmentation and found that it causes some problems of its own. The first is compatibility: at present, only Chrome 67 and later can use SharedArrayBuffer.
Before 2018, both Chrome and Firefox supported SharedArrayBuffer. In 2018, however, two serious CPU vulnerabilities, Meltdown and Spectre, were disclosed. They can break the data isolation between processes, so both browsers disabled SharedArrayBuffer. Only after Chrome 67 introduced site isolation was SharedArrayBuffer allowed again.
The other problem is that development becomes considerably harder. The Atomics object, introduced to deal with contention over shared resources, makes front-end development as difficult as multithreaded programming in a native language.
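To give a sense of the extra care shared memory demands, here is a minimal sketch of a one-slot frame buffer guarded by an Atomics flag. It assumes exactly one writer (the main thread) and one reader (the worker); the layout, the state constants, and the function names are all invented for illustration and do not describe Agora's implementation.

```typescript
const FREE = 0;  // slot is empty, the main thread may write a frame
const READY = 1; // slot holds a frame that the worker has not read yet

interface SharedFrame {
  buffer: SharedArrayBuffer; // send this to the worker once via postMessage
  state: Int32Array;         // one 32-bit flag at byte offset 0
  pixels: Uint8Array;        // frame bytes, starting at byte offset 4
}

function createSharedFrame(byteLength: number): SharedFrame {
  const buffer = new SharedArrayBuffer(4 + byteLength);
  return {
    buffer,
    state: new Int32Array(buffer, 0, 1),
    pixels: new Uint8Array(buffer, 4, byteLength),
  };
}

// Main thread: write a frame unless the worker is still busy with the last one.
function publishFrame(shared: SharedFrame, frame: Uint8Array): boolean {
  if (Atomics.load(shared.state, 0) !== FREE) return false; // drop this frame
  shared.pixels.set(frame);
  Atomics.store(shared.state, 0, READY); // publish only after the pixels are written
  return true;
}

// Worker: copy the frame out if one is ready, then release the slot.
function consumeFrame(shared: SharedFrame, into: Uint8Array): boolean {
  if (Atomics.load(shared.state, 0) !== READY) return false;
  into.set(shared.pixels);
  Atomics.store(shared.state, 0, FREE);
  return true;
}
```

Even this small example depends on the order of the pixel write and the flag store, and it only holds for a single writer and a single reader, which is the kind of reasoning that structured cloning spares the developer.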
Functions and implementation strategy of the WebAssembly module
The WebAssembly module is mainly responsible for portrait segmentation. Its main functions and implementation strategies are as follows:
[Figure: functions and implementation strategies of the WebAssembly module]
Among these, machine learning frameworks can use different computation backends for vector and matrix operations. TensorFlow, for example, has three: XNNPACK, Eigen, and ruy. Their performance differs from platform to platform, so we tested them. In the x86_64 Windows 10 environment, XNNPACK was clearly the best for our processing scenario, because it is optimized for floating-point operations.
[Figure: performance of XNNPACK, Eigen, and ruy on x86_64 Windows 10]
These are only the results on x86 and do not represent the outcome on every platform. ruy is TensorFlow's default compute backend on mobile platforms and is better optimized for the ARM architecture. We tested other platforms as well, but will not go through them one by one here.
Enabling Wasm multithreading maps pthreads to Web Workers and pthread mutex methods to Atomics methods. With multithreading on, the performance gain for the portrait segmentation scenario peaks at 4 threads, where it reaches 26%. More threads are not always better, because threads carry scheduling overhead.
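One way to act on this measurement is to cap the thread count instead of using every available core. The helper below is a hedged sketch: the `RuntimeCaps` type and the function name are hypothetical, and the default cap of 4 is taken from the result reported above, so it should be re-measured for other models and devices.

```typescript
// Hypothetical description of what the page found out about its environment.
interface RuntimeCaps {
  hardwareConcurrency: number;    // e.g. read from navigator.hardwareConcurrency
  sharedMemoryAvailable: boolean; // e.g. whether SharedArrayBuffer is defined
}

// Pick a thread count for the Wasm module, capped where extra threads stop helping.
function chooseThreadCount(caps: RuntimeCaps, maxUseful: number = 4): number {
  // Wasm threads are built on shared memory, so fall back to one thread without it.
  if (!caps.sharedMemoryAvailable) return 1;
  const cores = Math.floor(caps.hardwareConcurrency) || 1; // guard against 0 or NaN
  return Math.max(1, Math.min(cores, maxUseful));
}
```

For example, a machine reporting 8 logical cores with shared memory available would get 4 threads, while a browser without SharedArrayBuffer would run single-threaded.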
Finally, after segmenting the portrait, we use WebGL to perform image filtering, jitter removal, and picture compositing to produce the final image.
[Figure: final composited result]
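The talk does not spell out the filters used in this stage, so the following is a CPU reference sketch of two plausible steps, not Agora's shaders. It assumes an exponential moving average over consecutive masks as a simple stand-in for jitter removal, and a per-pixel linear blend for compositing; in production both would run as GLSL in WebGL. All function names are hypothetical.

```typescript
/** Blend the new mask with the previous one to damp frame-to-frame flicker. */
function smoothMask(previous: Float32Array, current: Float32Array, alpha: number): Float32Array {
  if (previous.length !== current.length) throw new Error("mask sizes differ");
  const out = new Float32Array(current.length);
  for (let i = 0; i < current.length; i++) {
    // alpha = 1 trusts only the new mask; smaller values smooth more.
    out[i] = alpha * current[i] + (1 - alpha) * previous[i];
  }
  return out;
}

/** Per pixel: out = mask * foreground + (1 - mask) * background, fully opaque. */
function compositeRgba(
  foreground: Uint8ClampedArray,
  background: Uint8ClampedArray,
  mask: Float32Array,
): Uint8ClampedArray {
  if (foreground.length !== mask.length * 4 || background.length !== foreground.length) {
    throw new Error("buffer sizes do not match the mask");
  }
  const out = new Uint8ClampedArray(foreground.length);
  for (let p = 0; p < mask.length; p++) {
    const m = mask[p]; // 1 = person, 0 = background
    for (let c = 0; c < 3; c++) {
      out[p * 4 + c] = m * foreground[p * 4 + c] + (1 - m) * background[p * 4 + c];
    }
    out[p * 4 + 3] = 255;
  }
  return out;
}
```

Passing a blurred copy of the camera frame as `background` gives background blur, while passing a still image gives a virtual background.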
There are still some pain points in using WebAssembly today. First, as mentioned above, we use SIMD instructions to improve computational efficiency. WebAssembly SIMD currently supports only a 128-bit data width, so many in the community have suggested that support for 256-bit AVX2 and 512-bit AVX-512 instructions would further improve parallel computing performance.
Second, WebAssembly currently has no direct access to the GPU. If it offered a more direct way to call OpenGL ES, we could avoid the performance overhead of the JSBridge from OpenGL ES to WebGL.
Third, WebAssembly currently has no direct access to audio and video data. Data captured from the camera and microphone has to go through several processing steps before it reaches Wasm.
Finally, we summarize a few points about portrait segmentation on the Web:
1. WebAssembly is one of the right ways to use machine learning on the Web platform.
2. In specific cases, enabling SIMD and multithreading brings significant performance gains.
3. When the baseline performance and algorithm design are poor, SIMD and multithreading bring little improvement.
4. When WebGL is used for video processing and rendering, the data output by WebAssembly must stay compatible with the WebGL texture sampling format.
5. Real-time video processing with WebAssembly must take into account the key overheads of the whole Web pipeline and optimize them appropriately to improve overall performance.
If you would like to know more about our practical experience with portrait segmentation on the Web, you are welcome to post on rtcdeveloper.com and talk with us. For more technical articles, visit agora.io/cn/community.