U-DiVE - design and evaluation of a distributed photorealistic virtual reality environment

Mobile devices such as smartphones are increasingly being used for immersive content consumption, mostly involving 360° video and 3D audio media delivery. However, these smartphones, especially low-cost ones, still cannot provide the processing and battery power needed for real-time rendering, visualization and interaction with photorealistic virtual reality scenes. In this context, this paper proposes and evaluates U-DiVE (Unity-based Distributed Virtual Reality Environment), a framework that decouples the processing and rendering stages from the delivery, visualization and interaction with realistic VR models. The U-DiVE framework produces a photorealistic scene using a general ray-tracing algorithm and a virtual reality camera configured with barrel shaders to correct the lens distortion, allowing visualization through inexpensive smartphone-based head-mounted displays. The framework also includes a method to obtain the smartphone's spatial orientation to control the user's field of view, which is delivered via real-time WebRTC streaming. The analysis shows that U-DiVE allows real-time visualization and manipulation of realistic, immersive scenes via smartphone-based, low-cost head-mounted displays, with low end-to-end latency, considering the required continuous data processing and delivery.


Motivation
Virtual reality (VR) is a branch of computer graphics defined by an experience in which a user is effectively immersed in a responsive virtual world [3]. An important feature of any VR application is to ensure general stability and a stable temporal resolution (frame rate) so that the user does not lose the feeling of being in a virtual world. Four technologies are crucial for immersive VR: an immersive display, rendering at least 20 frames per second (FPS), a tracking system that continuously reports the user's position and orientation, and a realistic environment [3].
There are several VR displays on the market, known as head-mounted displays (HMDs), which allow users to view and interact with the virtual world. Some technologies embedded in expensive HMDs may be replaced by a smartphone, taking advantage of its built-in sensors and creating an affordable option. However, due to the limited graphical processing capacity of most low-cost mobile devices, many applications cannot be rendered with a stable temporal resolution [20]. Because of these technological gaps, adding photorealistic multimedia content to an immersive space can hinder the widespread development of products and solutions [12]. For example, real-time photorealistic VR rendering [20] is currently unfeasible on mobile devices.
One possible solution is to combine streaming techniques with VR to create a stable and immersive photorealistic environment for low-cost devices. Streaming is characterized in this work as a form of continuous media delivery, commonly supported by well-known standardized protocols. Streaming techniques usually deliver the content continuously, while it is being presented to the end user, instead of transferring the whole media content to the client application beforehand. Streaming may even be a requirement for some applications, due to the amount of media data to be transferred or the real-time characteristics of the media itself [21]. In this work, through the use of real-time streaming with interactive control of the VR camera, we separate VR processing and rendering from its interaction and visualization to achieve low-cost, yet realistic, real-time VR rendering.

Proposal and contribution
This paper presents the Unity-based Distributed VR Environment (U-DiVE) framework, an extended version of a preliminary framework presented in [18]. In this work we present an improved pipeline, several fixes and a complete evaluation of the framework.
The proposed framework allows low-cost mobile devices to present realistic scenes. In U-DiVE, all the expensive processing is performed on the server side, so that the mobile device only runs the video stream player and sends the user's head orientation back to the server. The main contribution of this paper is an extended and improved version of the framework proposed in [18], together with a quantitative analysis assessing its latency on typical home equipment. The analysis shows that U-DiVE produces low-latency, realistic and immersive environments suitable for low-cost smartphones paired with inexpensive smartphone-based head-mounted displays (such as Google Cardboard).
The proposed approach shows that, since the complex tasks are moved to the server side, not only do the required processing power and desired performance become achievable, but battery consumption also remains low enough to make smartphone-based HMDs viable for realistic scenes, as the device handles only regular stream reception and interaction.
The evaluation tests were divided into two points of analysis: (i) checking whether the frame rate of the whole process is greater than 30 FPS, for a fluid experience; and (ii) discussing whether the overall latency is acceptable.
The paper is structured as follows. In Section 2, we present related work. Section 3 describes a theoretical background concerning photorealism and streaming. Section 4 presents the development of the framework. In Section 5, we show the analysis of the results. Finally, Section 6 presents the conclusions and future work.

Related work
In general, the U-DiVE framework combines three main aspects: VR content consumption through low-cost mobile devices, immersive media streaming from a server to the mobile device, and photorealistic VR rendering in real time. This section describes relevant works concerning these aspects.
The authors in [16] presented methods for handling multiple data streams with different latency values in a working AR system. These methods are applied to an AR system for real-time ultrasound visualization and demonstrate improved registration and visualization. The first method is a technique to reduce relative latency by adjusting the moment at which the incoming data stream is sampled; just-in-time acquisition of the data, interleaved with computation, decreases relative latency without increasing maximum latency. The second method stores multiple readings and either interpolates or extrapolates them to simulate new readings. The work measures latency differences, time-stamps data on the host, adjusts the moment of sampling, and interpolates or extrapolates the data streams. Visual tests show that the real-world camera video is the lowest-latency stream. For the external device trackers, relative latency is determined by rendering a model in the tracker's coordinate system that should be aligned with the real-world object. The authors also measured the latency of the camera relative to the real world, using an LED blinking at 5 Hz as a trigger for an oscilloscope.
Millimeter-wave (mmWave) communication, edge computing, and proactive caching can together achieve interconnected VR/AR characterized by smooth and reliable service, minimal latency, and seamless support for different network deployments and application requirements. According to [5], generating a life-like view requires a bit rate of up to 1 Gb/s, a bandwidth not reachable in 4G. To bring end-to-end latency down, one needs to understand the various types of delay involved in the pooled computing and communication latency budget. Contributions to end-to-end wireless/mobile VR latency include sensor sampling delay, image processing or frame rendering delay, network delay, and display refresh delay. Lag spikes and dropouts also need to be kept to a minimum, or users will feel detached. The work proposes an optimization framework to maximize successful high-definition frame delivery subject to reliability and latency constraints. Their case study demonstrates the performance gains and the underlying trade-offs inherent to wireless VR networks.
The authors in [1] perform a rigorous analysis of 1300 VR head traces and propose a multicast DASH-based tiled streaming solution for mobile multicast environments. The paper weighs video tiles based on users' viewports, divides users into subgroups based on their channel conditions and tile weights, and determines each tile's bitrate in each subgroup. They compare the proposed solution against the closest ones in the literature using simulated LTE networks and show that it substantially outperforms them. Three performance metrics are used: 1) average viewport bitrate, 2) the impact of viewport change during the scheduling window, and 3) spectral efficiency. The results show that the solution assigns 46% higher bitrates to the video tiles, allowing users to freely change their view directions while observing much less video quality degradation.
The next work covers different aspects of VR content representation, streaming, and quality assessment that help establish the basic knowledge for building a VR streaming system. According to [6], the VR streaming problem can be described as "the problem of panning around a high-resolution video using head movements". Their study advances over recent studies on streaming VR content that propose multiple models to assess the quality of experience in a VR streaming system. The focus is on 360° videos, for which quality evaluation has received little attention in the literature; it is not apparent how to compare different projections of 360° videos at different bitrates with the original video.
In [30], the throughput required for an eye-like VR experience is computed. A case study running different VR applications on an open-source remote VR display is also presented, characterizing their traffic statistics. The study calculates the number of bits per second needed to represent an image on the Oculus Quest display. VR traffic between client and server is collected by periodically synchronizing the clocks; frames are rendered on the edge server, fragmented into small MAC-layer packet data units, traverse the network, and are reassembled at the HMD. The work uses H.264 and H.265 video coding and compares their performance. It shows that rendering VR applications can be offloaded to an edge server, reducing the energy consumption and production cost of the VR HMD.
The authors in [11] present VRComm, a web-based social VR framework for enabling remote communication via video conferencing. They make three main contributions: a new VR communication framework; a novel approach for transmitting real-time depth data as 2D grayscale for 3D user representation, including a central multi-point control unit approach for this new format; and a technical evaluation of the system with respect to processing delay and CPU and GPU usage. Their evaluation shows that the proposed capture and depth-to-grayscale conversion is suitable for real-time video transmission, but other solutions, as well as pre-encoded content, might yield better visual quality when the bitrate is below 1.5 Mbit/s or above 3 Mbit/s.
The authors in [29] develop a photorealistic algorithm that runs directly on mobile devices. They present a novel distributed illumination approach for AR with consistent illumination: direct light, indirect light, and shadows from primary and strong secondary lights. They split the illumination into two parts: capturing the existing radiance values with HDR video cameras placed at different locations in the scene, and displaying augmentations with consistent illumination at interactive frame rates on the mobile device. This acquisition process reduces the amount of data transferred between a stationary PC and the participating mobile device. Their goal is to achieve consistent illumination of virtual objects on mobile devices in a real environment, together with multi-user interaction between the real world and photorealistic augmentations. To obtain photorealistic images, the authors use multiple HDR video cameras to capture the illumination from multiple directions, resulting in an expensive setup.
The work of [17] presents a middleware streaming engine that ports existing OpenGL-based 3D network games onto heterogeneous platforms. The engine captures the OpenGL command stream, reconstructs the scene graph, simplifies the data, and compresses and transmits it. The system runs over WLAN and consists of a game server and game clients on a PC, plus clients on heterogeneous devices. In the overview given by [17], a client on a mobile device connects to an invocation manager on the PC, the manager launches the 3D game server, the game server and client establish a connection, and every mobile client is connected to a game client through the invocation manager. By adding this middleware, the system extends PC-based 3D games to mobile platforms without modifying the source code. In their results, 3D streaming runs at 4-5 frames per second despite a dynamic game environment under software rendering.
We were unable to find a work putting together all the elements present in our proposal. Our proposal differs from the ones presented in this section precisely in its combination of multiple techniques in a more versatile way, allowing us to reach a vast number of hardware configurations while keeping the solution low-cost. Table 1 compares the requirements of the main works presented in this section with our proposal.
The two closest works to our framework are [4], whose approach is based on ray tracing, performs distributed rendering to address limited mobile GPU capabilities, and uses image-based lighting from a pre-captured panorama to incorporate real-world lighting; and [27], a solution focused on integrating traditional image-based lighting estimation on mobile systems, which also evaluates the effect of illumination estimation methods on human perception and situational quality preference. Both works differ from ours in that they do not offer a complete VR solution focused on low-cost devices.

Theoretical foundation
This section presents the theoretical foundation for the concepts applied in the U-DiVE framework.

Photorealism
In this section, we present the ray-tracing technique used in [18]. The ray-tracing algorithm was designed by Turner Whitted in 1980; he used the ray casting technique to resolve reflections, refractions and shadows and obtain more accurate pixel colors [7]. In modern computing, ray tracing is a technique used to create realistic images [9]. It relies on knowing the amount of light each pixel receives after processing the scene. The global illumination information is stored in a "tree of rays", starting from the viewer to the first surface encountered and bouncing off other surfaces and light sources [33]. Figure 1 shows how primary rays are cast from the camera through each pixel until they hit a 3D object or reach a defined limit. On a hit, the ray bounces in directions that depend on the type of surface. From each hit point, shadow rays are cast toward the light source; if these rays do not reach the light source, they create shadows.
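To make the recursive "tree of rays" concrete, the following minimal sketch (in TypeScript, with hypothetical Scene, Ray and Hit types of our own) traces a primary ray, casts a shadow ray toward the light, and recursively follows a reflection ray. It is a didactic outline of Whitted-style ray tracing, not the HDRP implementation used by U-DiVE.

```typescript
// Didactic Whitted-style ray tracing sketch; Scene/Ray/Hit are hypothetical types.
interface Vec3 { x: number; y: number; z: number; }
interface Ray { origin: Vec3; dir: Vec3; }
interface Hit { point: Vec3; normal: Vec3; albedo: Vec3; reflectivity: number; }
interface Scene {
  intersect(ray: Ray): Hit | null;               // first surface along the ray
  lightDir(point: Vec3): Vec3;                   // direction toward the light source
  occluded(point: Vec3, toLight: Vec3): boolean; // shadow-ray test
}

const MAX_DEPTH = 3; // depth of the "tree of rays"

function trace(scene: Scene, ray: Ray, depth: number): Vec3 {
  const hit = scene.intersect(ray);
  if (!hit || depth > MAX_DEPTH) return { x: 0, y: 0, z: 0 }; // background color

  // Shadow ray: if the light source is blocked, the hit point is in shadow.
  const toLight = scene.lightDir(hit.point);
  const lit = scene.occluded(hit.point, toLight) ? 0 : 1;
  const diffuse = scale(hit.albedo, lit * Math.max(0, dot(hit.normal, toLight)));

  // Secondary ray: bounce off the surface and recurse one level deeper.
  const reflDir = reflect(ray.dir, hit.normal);
  const refl = trace(scene, { origin: hit.point, dir: reflDir }, depth + 1);

  return add(diffuse, scale(refl, hit.reflectivity));
}

// Small vector helpers.
function dot(a: Vec3, b: Vec3): number { return a.x * b.x + a.y * b.y + a.z * b.z; }
function scale(a: Vec3, s: number): Vec3 { return { x: a.x * s, y: a.y * s, z: a.z * s }; }
function add(a: Vec3, b: Vec3): Vec3 { return { x: a.x + b.x, y: a.y + b.y, z: a.z + b.z }; }
function reflect(d: Vec3, n: Vec3): Vec3 { return add(d, scale(n, -2 * dot(d, n))); }
```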

Streaming
Multimedia content, including audio and video, may be delivered over the Internet using push- or pull-mode protocols. Some of these protocols were specifically designed for audio/video streaming, a technique in which the application continuously receives data samples while the previous ones are being timely decoded and presented to the user. Established applications such as IPTV (Internet Protocol Television) and VoIP (Voice over Internet Protocol) have used protocols such as RTP (Real-Time Transport Protocol) [31], which is primarily a push-mode application-level protocol over UDP (User Datagram Protocol) [26]. An RTP packet transfers a given media sample (some milliseconds of audio and/or video) accompanied by relevant information such as the media type and a timestamp, which eases the decoding process. The use of UDP, combined with such small data packets and easier decoding, allows RTP implementations to achieve low latency in data transmission.
However, UDP is unfortunately used in different kinds of attacks against Internet hosts, so firewalls are usually preconfigured to block UDP traffic. This difficulty in using UDP for end-to-end communication, together with the outstanding growth of content delivery networks (CDNs) for Web applications, brought interest in making HTTP (Hypertext Transfer Protocol) a viable multimedia protocol. HTTP is a pull-mode protocol designed for transferring hypertext files from a server to a client upon a client request. Its implementation is simple and robust and, used in an adapted way, it can work for streaming on the Internet.
HTTP adaptive streaming (HAS) techniques like HLS (HTTP Live Streaming) [26] and MPEG-DASH (Dynamic Adaptive Streaming over HTTP) [15] rely on any version of HTTP since 1.1 to transfer segments of a media file, one by one. This means that audio and video content must be not only encoded on the server side, but also segmented into small files (containing 2 s or 10 s of content, for example) so that the client can continuously use HTTP's pull-mode logic to fetch the next segments according to the current play time. Because the client continuously requests media segments, HAS techniques can adapt the data rate during play time. If each media segment is encoded in alternative representations at different bitrates, the client can switch to a lower bitrate for the next segments when it detects poor bandwidth availability; in this way, smaller files are transferred during congestion. Naturally, a description of which audio/video segments, including their alternatives, comprise a given media content must be provided by the server and interpreted by the client before streaming starts.
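To make this pull-mode adaptation concrete, the sketch below shows the essential client loop: fetch the next segment at the current rendition, estimate throughput from the transfer, and choose the rendition for the following request. The Rendition shape, the 0.8 safety factor and the appendToPlaybackBuffer placeholder are our assumptions; real players add buffering, smoothing and error handling.

```typescript
// Sketch of a HAS client loop. Rendition layout and safety margin are assumptions.
interface Rendition { bitrateKbps: number; segmentUrl(index: number): string; }

async function playbackLoop(renditions: Rendition[], segmentCount: number) {
  // renditions are assumed sorted by ascending bitrate; start at the lowest.
  let level = 0;
  for (let i = 0; i < segmentCount; i++) {
    const t0 = performance.now();
    const resp = await fetch(renditions[level].segmentUrl(i)); // one HTTP pull per segment
    const data = await resp.arrayBuffer();
    const seconds = (performance.now() - t0) / 1000;

    // Throughput estimate for this transfer, in kbps.
    const throughputKbps = (data.byteLength * 8) / 1000 / seconds;

    // Pick the highest rendition fitting ~80% of the measured bandwidth.
    level = renditions.reduce(
      (best, r, idx) => (r.bitrateKbps < throughputKbps * 0.8 ? idx : best), 0);

    appendToPlaybackBuffer(data); // hand the segment to the decoder (placeholder)
  }
}

declare function appendToPlaybackBuffer(segment: ArrayBuffer): void; // hypothetical
```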
Considering our goal to support a photorealistic VR environment in low-cost mobile devices, the available mobile operating system APIs, their common support for one or more HAS techniques and their powerful web browsers undoubtedly indicate that converging to a web-based approach could be a good solution. However, HAS techniques introduce a relevant latency (to encode and generate a downloadable file with a media segment in real time) and client processing (to run heuristics, request each media segment and to decode it), which would result in a poor interactive mobile experience. On the one hand, to lower the latency we may choose to reduce the media segment size, but the number of HTTP requests from the client would increase, together with upstream traffic and power consumption. On the other hand, to reduce the number of HTTP requests the media segments must be increased, resulting in longer transfer latency.
The U-DiVE framework employs the WebRTC (Web Real-Time Communication) API [2], which allows the use of RTP/UDP, with its much lower overhead in terms of latency, processing and traffic, in a local network setup or over the Internet [23]. This design decision also promotes Web convergence, since WebRTC is part of HTML5 and is available in all modern browsers, including their mobile versions. WebRTC enables peer-to-peer (P2P), browser-to-browser communication. The end-to-end UDP communication becomes possible via a client-server protocol that previously exchanges metadata to cope with network address translators (NATs) and firewalls. For two endpoints to start communicating with each other, some information must be exchanged, including:
• Session control information used to initialize, close and modify communications, and to report error messages;
• Media metadata commonly supported between endpoints, such as codecs and codec settings, bandwidth, and media types;
• Network data, such as a host's IP address and port.
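The browser side of this exchange follows the standard WebRTC offer/answer pattern, sketched below. The signaling endpoint and the message names ("offer", "answer", "ice") are assumptions for illustration, not U-DiVE's exact protocol; the RTCPeerConnection calls themselves are the standard browser API.

```typescript
// Browser-side WebRTC setup sketch; signaling endpoint and message names are assumed.
const signaling = new WebSocket("wss://server.example/signal"); // hypothetical endpoint
const pc = new RTCPeerConnection({ iceServers: [{ urls: "stun:stun.l.google.com:19302" }] });

// Network data: ICE candidates are relayed to the peer via the signaling server.
pc.onicecandidate = (e) => {
  if (e.candidate) signaling.send(JSON.stringify({ type: "ice", candidate: e.candidate }));
};

// Incoming RTP video (here, the rendered VR stream) is attached to a <video> element.
pc.ontrack = (e) => {
  (document.querySelector("video") as HTMLVideoElement).srcObject = e.streams[0];
};

// Media metadata: the SDP offer/answer carries codecs, settings and media types.
signaling.onopen = async () => {
  pc.addTransceiver("video", { direction: "recvonly" }); // this peer only receives video
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send(JSON.stringify({ type: "offer", sdp: offer.sdp }));
};

signaling.onmessage = async (msg) => {
  const m = JSON.parse(msg.data);
  if (m.type === "answer") await pc.setRemoteDescription({ type: "answer", sdp: m.sdp });
  else if (m.type === "ice") await pc.addIceCandidate(m.candidate);
};
```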

Latency measurement
Latency in interactive real-time graphics simulations comes from various sources. The authors in [25] identify the following:
• sensor reading and computation;
• sensor data communication;
• application computation;
• rendering computation;
• display refresh.
In streaming, additional latency sources arise, such as packet delivery and the distance between computers on the network. To measure these latencies, Steed [32] proposes latency estimation with a regular video camera, computed automatically once the video is captured. The method uses a tracked pendulum with a small light attached to it and records the pendulum together with a screen behind it, which shows a simulated image whose position is driven by the tracking information.
The authors in [22] infer the latency by recording tracking data via a video image at 60 Hz. A tracker is attached to a moving pendulum, and frames are counted when the pendulum passes the vertical axis. This method requires reconfiguration of the tracker space, which is impractical in some situations. The authors in [24] determine the frame offset of a motion automatically using a motion detection algorithm, but the latency is detected only in multiples of the frame interval.
In [8], different methods, such as the Sine-Fitting Method and Di Luca's method, were used to measure the latency of several interactive systems of interest to the virtual environments engineer, with a significant level of confidence. The authors also developed a new latency measurement technique called Automated Frame Counting to assess latency using high-speed video. This technique uses image processing to extract the tracked object's position, resulting in a set of samples that characterizes the motion of the object. The algorithm guides the selection of the threshold for binarizing the frames and identifies salient object locations to track. Once tracking is complete, the user selects the feature scale; the technique then extracts the features and subtracts their locations, providing the number of frames. The average of these frame counts is returned as the latency estimate for that capture.
Like other authors, He et al. [13] describe an end-to-end latency measurement method for virtual environments. In their method, a video camera simultaneously records a physical controller and the corresponding virtual cursor, and the playback is analyzed to determine the lag between the wand's motion and the motion of its virtual image.
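The common idea behind these video-based methods reduces to finding, in the captured footage, the frame where the physical motion starts and the frame where the on-screen image first responds. A simplified sketch follows; the per-frame position tracks are assumed to have been extracted from the video beforehand (hypothetical input), and the motion threshold is arbitrary.

```typescript
// Simplified frame-counting latency estimate in the spirit of the methods above.
// `physical` and `virtual` hold per-frame 1D positions of the tracked object and
// its on-screen counterpart, extracted from the video beforehand (assumed input).
function onsetFrame(track: number[], threshold: number): number {
  for (let i = 1; i < track.length; i++) {
    if (Math.abs(track[i] - track[i - 1]) > threshold) return i; // first visible motion
  }
  return -1;
}

function latencyMs(physical: number[], virtual: number[], fps: number, threshold = 0.5): number {
  const framePeriodMs = 1000 / fps; // e.g. ~4.17 ms at 240 fps
  const dFrames = onsetFrame(virtual, threshold) - onsetFrame(physical, threshold);
  return dFrames * framePeriodMs;
}
```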

U-DiVE framework
The U-DiVE framework was developed using the Unity Engine, version 2019.3.12f, and the ray-tracing algorithm available in its High Definition Render Pipeline (HDRP). This ray-tracing support builds on Microsoft's DirectX Raytracing and is capable of producing realistic real-time images. A graphical view of the framework can be seen in Fig. 2.
U-DiVE's pipeline components are described as follows:

Fig. 2 U-DiVE pipeline
Start
To start U-DiVE, a Node.js server is initiated first. Unity is configured to connect to the address provided by this server and is then started. The user may then open the web browser on a mobile device and type the same server address. The browser thus connects to Unity via the Node.js scripts, enabling the exchange of information, as illustrated by the sketch below.
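As an illustration only: the actual U-DiVE deployment uses the web server that ships with Unity Render Streaming, but the kind of signaling relay such a server performs can be sketched in a few lines of Node.js (TypeScript, using the `ws` WebSocket package; the port is arbitrary).

```typescript
// Illustrative Node.js signaling relay; the real U-DiVE setup uses the web
// server shipped with Unity Render Streaming, not this code.
import { WebSocketServer, WebSocket } from "ws";

const wss = new WebSocketServer({ port: 8080 }); // arbitrary port
const peers = new Set<WebSocket>();

wss.on("connection", (socket) => {
  peers.add(socket);
  // Forward every signaling message (offer/answer/ICE) to the other peer(s).
  socket.on("message", (data) => {
    for (const peer of peers) {
      if (peer !== socket && peer.readyState === WebSocket.OPEN) {
        peer.send(data.toString());
      }
    }
  });
  socket.on("close", () => peers.delete(socket));
});
```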
Extract device orientation
After the U-DiVE start-up process, the next step takes place in the mobile web browser. The client-side script delivered to the browser periodically extracts the mobile device's orientation in quaternion format. The relative orientation is extracted 60 times per second and referenced to the screen frame. Each of the four quaternion components occupies 4 bytes of memory and an input event flag occupies 1 byte, totaling a 17-byte buffer array. This buffer array is sent to the server to be processed internally by Unity.
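The 17-byte layout described above (1 flag byte followed by four 4-byte quaternion components) can be packed in the browser roughly as follows. The flag constant and the quaternion source are assumptions; only the buffer layout comes from the text.

```typescript
// Packing the 17-byte orientation buffer: 1 flag byte + 4 float32 quaternion
// components. Flag values and the quaternion source are assumptions.
function packOrientation(flag: number, q: [number, number, number, number]): ArrayBuffer {
  const buf = new ArrayBuffer(17);
  const view = new DataView(buf);
  view.setUint8(0, flag);                   // input event flag (1 byte)
  for (let i = 0; i < 4; i++) {
    view.setFloat32(1 + i * 4, q[i], true); // x, y, z, w (little-endian)
  }
  return buf;
}

// Sent ~60 times per second over the WebRTC data channel, e.g.:
// dataChannel.send(packOrientation(FLAG_SENSOR, currentQuaternion)); // names hypothetical
```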

Process orientation
The internal processing of the orientation quaternion depends on the event flag sent in the first position of the buffer array. The flag can indicate a click, a key press, device sensor data, and so on; from this information, Unity recognizes the type of data being received. The values are stored by Unity's input system in global variables and can be used by any object in the environment.
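Mirroring the packing sketch above, the receiving side first reads the flag byte and then interprets the payload accordingly. It is shown in TypeScript for consistency with the other sketches; U-DiVE's actual receiver runs as C# inside Unity.

```typescript
// Mirror of the packing sketch: read the flag byte, then the quaternion.
// Illustrative only; U-DiVE's receiver is C# inside Unity.
function unpackOrientation(buf: ArrayBuffer): { flag: number; q: number[] } {
  const view = new DataView(buf);
  const flag = view.getUint8(0); // tells the receiver what kind of input this is
  const q = [0, 1, 2, 3].map((i) => view.getFloat32(1 + i * 4, true));
  return { flag, q };            // stored globally for use by any scene object
}
```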

Set VR camera
The VR camera object is constructed with an empty parent object, beneath which two cameras are placed, one for each eye. The coordinates received by Unity's input system are applied to the parent object, so the internal cameras move accordingly whenever this information is updated.
Scene, Ray Tracing, Rendered image
When Unity starts, the scene is rendered by HDRP using its native ray-tracing algorithm. Ray tracing works as a global post-processing effect, that is, it is applied to all existing cameras in the scene (right and left eyes).

Apply distortion shaders
In addition to ray tracing, image distortion shaders are applied to produce the VR view. The distortion is based on the Brown-Conrady barrel distortion model and is applied after the image has been generated by ray tracing; the result is then stored to be sent to the mobile device's browser.
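For reference, the radial component of the Brown-Conrady model, keeping only the two lowest-order coefficients $k_1, k_2$ as is common for HMD barrel correction, maps an image point $(x, y)$ at radius $r$ from the distortion center to:

```latex
x_d = x \left(1 + k_1 r^2 + k_2 r^4\right), \qquad
y_d = y \left(1 + k_1 r^2 + k_2 r^4\right), \qquad
r^2 = x^2 + y^2
```

The shader pre-distorts the rendered image with such a mapping so that the pincushion distortion introduced by the HMD lenses cancels it, and straight lines appear straight to the viewer. The truncation to $k_1, k_2$ is our assumption; the full model includes higher-order radial and tangential terms.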
Create stereo image
Unity has a render texture object that can be used to store camera images. Since each camera was created occupying half the resolution along the X coordinate, applying the two cameras to the same render texture makes their images fit perfectly side by side. That object then yields a single stereo image containing both cameras' views, producing a VR view.

Receive rendered image, Present VR image
Once the image is rendered and stored in the texture, it is streamed to the client-side script running in the mobile browser. The client-side script receives the image and applies it to its instantiated video player. The flow then continues with the periodic extraction of the mobile device's new orientation.
In [18], three steps were defined as the foundations of U-DiVE: scene processing for VR; management of the orientation captured by the mobile device; and a WebRTC connection for streaming. In that approach, Unity Render Streaming provides Unity's high-definition rendering abilities via a browser. This streaming technology takes advantage of WebRTC and makes it possible to send and receive data between the client and the server. In the current version of the framework, the creation of the VR cameras was refactored and improved, and the pipeline information flow was redesigned as shown in Fig. 2.
Unity Render Streaming consists of three components: a Web server, a Web browser and Unity (Editor or Application). P2P communication is established between Unity and the Web browser, and data is transmitted over UDP/IP. The Web server enables the communication between the Web browser and Unity.
Render Streaming tackles two problems, performance and latency, to provide high-fidelity graphics and a stable streaming frame rate for a high-quality user experience. The NVIDIA Video Codec SDK is used to broadcast applications to the browser and to perform GPU hardware encoding on the frame buffer, reducing latency. In U-DiVE, the system is configured so that the video is streamed to the client at 1280x720 pixels, with a bit rate starting at 16,000 kbps and peaking at 160,000 kbps.
The Unity Render Streaming package we used was not ready to work with VR cameras. To overcome this limitation, our framework modifies the stream function to accept two cameras, one for each eye, and we rely on the Google Cardboard SDK for Unity to build our VR cameras. The two cameras were created 0.06 Unity units apart to create a parallax effect. Finally, to match the output to the hardware used, a fragment shader was applied to distort the image and provide lens correction.

Experimental setup
As the low-cost device (and minimum user requirement) on the client side, we defined a mid-range smartphone (∼200 USD) with a low-cost smartphone-based head-mounted display (∼10 USD). We considered those requirements a typical home setup. On the server side, our setup for the U-DiVE evaluation includes a desktop PC with an Intel i5-9400F processor, a GTX 1660 Super video card and 8 GB of RAM. The client side was deployed on a Xiaomi Redmi Note 4 with 4 GB of RAM, a Qualcomm Snapdragon 625 MSM8953 processor, IEEE 802.11 b/g/n/ac WLAN and the Google Chrome browser. The wireless router in the testbed was an EchoLife HG8145V5 operating in the 5 GHz band, 802.11a/n/ac mode, automatic 20/40/80 MHz channel width, WAN route mode, and four Gigabit 10/100/1000 Mb/s ports. To record the scenes, we used an iPhone SE 2 with slow-motion video at 240 fps.
We were unable to change some of Unity's algorithms in our work, so measuring latency by exchanging messages between client and server was not an option. Instead, we used the approach of recording the movement of the mobile device. As the physical controller and the virtual screen are on the same device, we recorded the mobile device's movements with a 240 frames-per-second slow-motion camera. The measured latency thus has a precision of approximately 4 ms (the inter-frame interval).
The experiments were recorded with the mobile device placed on a table for greater stability. The device running the framework was moved only in the frontal direction and at similar speeds, so there is no significant variation in pixels between recordings of the same scene.

Scenes
To perform our tests, we built three main scenes: a scene with a plane and a cube; a simple Cornell box; and a more complex scene that explores more realistic features (Fig. 3). We also built an additional scene composed of a full-screen grid (Fig. 4). We used the same ray-tracing parameters and VR shaders for all scenes.
Figure 3 shows small differences when comparing images with and without ray tracing. On the shaded region of the cube in the first images (Fig. 3a and b), a shade of red is accentuated in the ray-tracing version (Fig. 3b) due to the light reflected from the surface of the red plane. In the images of the Cornell box (Fig. 3c and d), the green tint on the front cube and the red tint on the back cube are visible in Fig. 3d. In the most complex scene (Fig. 3e and f), there is a difference in the shading of the sofa and the painting on the wall in the ray-tracing version (Fig. 3f).
As shown in Fig. 3b, the first scene has only a white cube centered on a red plane with a directional light. This scene was created to be the simplest and lightest test scene.
A Cornell box was created to be used as an intermediate scene ( Fig. 3c and d). It consists of a box with the left and right sides colored green and red, respectively. The Cornell box has two objects on the floor to reflect the wall colors and a light source positioned in the upper side of the box.
A grid scene was created to allow greater precision when counting the frames of the recordings (Fig. 4), making the exact moment when the screen responds to movement more noticeable. Since the grid is a single object, it is not influenced by the colors around it, so the difference between enabling and disabling ray tracing is almost imperceptible to the eye.
The most complex scene, depicted in Fig. 3f, can be viewed in more detail in Fig. 5. It was built with more advanced ray-tracing capabilities. In Fig. 5c, we highlight hard shadows under the chair and table produced by the lamp. Refractions and shadows are highlighted in Fig. 5d. In Fig. 5e, we show a mirror reflecting parts of the scene, and Fig. 5f shows hard shadows, soft shadows, and reflections on the glass wall.

Evaluation
This section presents a quantitative analysis of a U-DiVE prototype running on the experimental setup. For this analysis, we recorded a testbed in which a smartphone is used to visualize and interact with the four scenes described in the previous subsection. The recording was carried out with a 240 fps video camera, and the captured video was used to measure the latency between the movement of the mobile device and the expected change in the scene's point of view displayed by the same device. The movements were performed by the lab tester with his own hands, directly on the smartphone, so the captured video could track both the movement and the resulting VR visualization on the same device; this would be hard to achieve with the smartphone in a head-mounted apparatus.
Since the wireless network may suffer some kind of instability, five experiments were recorded for each scene, and then we computed the average latency. Each experiment has the same movement direction and speed of the smartphone so that the results are similar between the recordings.
After each test is recorded, an end-to-end latency analysis is made from the moment the smartphone is moved until the player presents the first changes expected from the movement, and the number of recorded frames during this interval is noted. To calculate the latency value, we multiply the number of frames by 4.1 ms, the interval between frames in a 240 fps video. We perform this process for each experiment, then remove the highest and lowest values and average the remaining ones, as in the sketch below. One may notice that a 4.1 ms clock precision may not seem suitable for an evaluation protocol; however, the results show that the measured latency values are two orders of magnitude higher than this precision.
Table 2 shows the resulting average end-to-end latencies. The same scenes and positions were recorded with and without ray tracing. We can observe that in each recording (R1 to R5), a different scene obtained the highest latency: in the first it was the cube scene with ray tracing, in the second the complex scene with ray tracing, in the third and fourth the complex scene without ray tracing, and in the fifth the grid scene with ray tracing. This shows that the latency does not depend solely on the server hardware for VR processing, but on the whole end-to-end pipeline execution time.
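For concreteness, the per-scene latency computation described above reduces to a few lines. The frame counts in the usage comment are illustrative values, not data from Table 2.

```typescript
// Average end-to-end latency from per-recording frame counts, as described:
// drop the highest and lowest values, average the rest, convert frames -> ms.
const FRAME_INTERVAL_MS = 4.1; // inter-frame interval used in the paper for 240 fps video

function averageLatencyMs(frameCounts: number[]): number {
  // Assumes at least three recordings (five per scene in this evaluation).
  const sorted = [...frameCounts].sort((a, b) => a - b);
  const trimmed = sorted.slice(1, -1); // remove lowest and highest values
  const meanFrames = trimmed.reduce((s, v) => s + v, 0) / trimmed.length;
  return meanFrames * FRAME_INTERVAL_MS;
}

// e.g. averageLatencyMs([38, 42, 55, 40, 44]) -> mean of {40, 42, 44} * 4.1 ms = 172.2 ms
// (illustrative frame counts, not values from Table 2)
```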
Looking at the average latency of each scene, we observe no significant changes between the scenes with and without ray tracing (Fig. 6). Across all recordings, the difference between the lowest latency, 147.6 ms, and the highest, 225.5 ms, is 77.9 ms. This shows that latency can vary widely between sessions due to the use of wireless home equipment, a low-cost smartphone and a noisy radio spectrum. Even so, the results show an average latency low enough to corroborate the feasibility of the U-DiVE approach with its first prototype.
Table 2 Each labeled column corresponds to a recording session of each scene and each row represents the value obtained in these recordings

Conclusion
This paper presented the U-DiVE framework, which aims to allow low-cost devices to present complex scenes with a high degree of realism. We also presented a quantitative evaluation of the framework to assess its latency. Since no other framework was found for comparison, we created four scenes to evaluate performance. We verified that the latency with and without ray tracing differs only slightly, showing that even a complex scene does not interfere with the data transfer between client and server. Although ray tracing is a heavy algorithm, the evaluation showed that using it has no relevant impact and that latency does not seem to be linked to its usage.
It is important to note that this paper focuses on the use of ray tracing techniques for low-cost devices. Recently, the industry has been investing heavily in initiatives capable of processing ray tracing algorithms locally. Samsung's Exynos 2200 processor coupled with AMD's RDNA 2 architecture is one of the latest promises in this direction, but it will be available only on premium devices. ARM's new GPU, Immortalis-G715, is another initiative with the same promise and caveat. We believe that in the near future there will be room for the use of ray tracing both locally, on premium devices, and with solutions such as the one proposed in this paper, taking advantage of the expansion of fast mobile networks like 5G.
In future work, we intend to modify the framework's shaders to a vertex-displacement-based solution that eliminates the need to render an intermediate texture, and to run a qualitative test with multiple users to evaluate the user experience. We are also planning to extend U-DiVE with edge computing techniques, allowing VR applications to accompany users throughout mobile networks, wherever they go. We can also investigate better video encoding and streaming techniques to avoid the quality loss caused by data compression and real-time delivery.
Author Contributions All authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.
Funding This research received no external funding.
Data Availability Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.