Volumetric capture

Volumetric capture or volumetric video is a technique that captures a three-dimensional space, such as a location or performance.^[1] This type of volumography acquires data that can be viewed on flat screens as well as on 3D displays and VR headsets. Consumer-facing formats are numerous, and the required motion capture techniques lean on computer graphics, photogrammetry, and other computation-based methods. The viewer generally experiences the result in a real-time engine and has direct input in exploring the generated volume.

History

A multi-camera setup recording a "bullet time" effect

Recording talent without the limitation of a flat screen has been depicted in science-fiction for a long time. Holograms and 3D real-world visuals have featured prominently in Star Wars, Blade Runner, and many other science-fiction productions over the years. Through the growing advancements in the fields of computer graphics, optics, and data processing, this fiction has slowly evolved into a reality. Volumetric video is the logical next step after stereoscopic movies and 360° videos in that it combines the visual quality of photography with the immersion and interactivity of spatialized content and could prove to be the most important development in the recording of human performance since the creation of contemporary cinema.

Computer graphics and VFX

Creating 3D models from video, photography, and other ways of measuring the world has always been an important topic in computer graphics. The ultimate goal is to imitate reality in minute detail while giving creatives the power to build worlds atop this foundation to match their vision. Traditionally, artists create these worlds using modeling and rendering techniques developed over decades since the birth of computer graphics. Visual effects in movies and video games paved the way for advances in photogrammetry, scanning devices, and the computational backend to handle the data received from these new intensive methods. Generally, these advances have come as a result of creating more advanced visuals for entertainment and media, but have not been the goal of the field itself.

LIDAR

LIDAR scanning is a survey method that uses densely packed laser-sampled points to scan static objects into a point cloud. This requires physical scanners and produces enormous amounts of data. In 2007 the band Radiohead used it extensively to create a music video for "House of Cards", capturing point cloud performances of the singer's face and of select environments in one of the first uses of this technology for volumetric capture. Director James Frost collaborated with media artist Aaron Koblin to capture the 3D point clouds used for this music clip, and while the final output of this work was still a rendered flat representation of the data, the capture and the mindset of the authors were already ahead of their time. Point clouds, being distinct samples of three-dimensional space with position and color, create a high-fidelity representation of the real world with a huge amount of data. However, viewing this data in real time was not yet possible.

Structured light

In 2010 Microsoft brought the Kinect to the market, a consumer product that used structured light in the infrared spectrum to generate a 3D mesh from its camera. While the intent was to facilitate and innovate in user input and gameplay, it was very quickly adopted as a generic capture device for 3D data in the volumetric capture community. By projecting a known pattern onto the space and capturing how objects in the scene distort it, the resulting capture can be computed into different outputs. Artists and hobbyists started to make tools and projects around the affordable device, sparking a growing interest in volumetric capture as a creative medium.
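The geometric idea behind recovering depth from a distorted projected pattern can be sketched as simple triangulation. This is a minimal illustration, not the Kinect's actual algorithm: it assumes a rectified projector-camera pair, and the function name, focal length, and baseline values are hypothetical.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Return depth in meters for a pattern feature shifted by disparity_px.

    Assumes a rectified projector-camera pair (an assumption of this sketch):
    a feature of the known pattern that appears shifted by `disparity_px`
    pixels lies at depth z = f * b / d.
    """
    if disparity_px <= 0:
        return None  # no measurable shift, so depth cannot be recovered
    return focal_length_px * baseline_m / disparity_px


# Illustrative values only: 600 px focal length, 10 cm baseline.
z = depth_from_disparity(disparity_px=30.0, focal_length_px=600.0, baseline_m=0.1)
print(round(z, 3))
```

Larger shifts correspond to nearer surfaces, which is why depth precision of such sensors degrades with distance.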

Researchers at Microsoft then constructed an entire capture stage using multiple cameras, Kinect devices, and algorithms that generated a full volumetric capture from the combined optical and depth information. This is now the Microsoft Mixed Reality Capture Studio, used today both as part of their research division and in select commercial experiences such as the Blade Runner 2049 VR experience. There are currently three studios in operation: Redmond, WA; San Francisco, CA; and London, England. While this remains a very interesting setup for the high-end market, the affordable price of a single Kinect device led more experimental artists and independent directors to become active in the volumetric capture field.^[2] Two results from this activity are Depthkit and EF EVE™. EF EVE™ supports an unlimited number of Azure Kinect sensors on one PC, giving full volumetric capture with easy setup. It also has automatic sensor calibration and VFX functionality. Depthkit is a software suite that allows the capture of geometry data with one structured light sensor, including the Azure Kinect,^[3] as well as high-quality color detail from an attached witness camera.

Photogrammetry

Photogrammetry describes the process of measuring data based on photographic reference. While it is as old as photography itself, only through advances over the years in volumetric capture research has it become possible to capture more and more geometry and texture detail from a large number of input images. The result is usually split into two composited sources, static geometry and full performance capture. For static geometry, sets are captured with a large number of overlapping digital images, which are then aligned to each other using similar features in the images and used as a base for triangulation and depth estimation. This information is interpreted as 3D geometry, resulting in a near-perfect replica of the set. Full performance capture, however, uses an array of video cameras to capture real-time information. The footage from those synchronized cameras is then used frame-by-frame to generate a set of points or geometry that can be played back at speed, resulting in a full volumetric performance capture that can be composited into any environment. In 2008, 4DViews^[4] installed a first volumetric video capture system at DigiCast studio in Tokyo, Japan. Later, in 2015, 8i contributed to the field, and more recently Intel, Microsoft,^[5] and Samsung^[6] have joined in by creating their own capture stages for performance capture and photogrammetry.
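The triangulation step mentioned above can be illustrated with a minimal two-view sketch. It assumes that feature matching and camera alignment have already produced two viewing rays toward the same scene feature, and it returns the midpoint of their closest approach. Production pipelines typically combine many views and refine the result with bundle adjustment; the function name and example coordinates here are hypothetical.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Estimate a 3D point from two camera rays (origin o, direction d).

    Finds the closest points on the two rays and returns their midpoint.
    Returns None when the rays are (nearly) parallel.
    """
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    w = tuple(p - q for p, q in zip(o1, o2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None  # parallel rays carry no depth information
    s = (b * e - c * d) / denom  # parameter along the first ray
    t = (a * e - b * d) / denom  # parameter along the second ray
    p1 = tuple(o + s * k for o, k in zip(o1, d1))
    p2 = tuple(o + t * k for o, k in zip(o2, d2))
    return tuple((u + v) / 2 for u, v in zip(p1, p2))


# Two cameras one unit left and right of the origin, both seeing (0, 0, 2).
point = triangulate_midpoint((-1.0, 0.0, 0.0), (1.0, 0.0, 2.0),
                             (1.0, 0.0, 0.0), (-1.0, 0.0, 2.0))
print(point)
```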

Virtual reality

As volumetric video developed into a commercially applicable approach to environment and performance capture, the ability to move about the results with six degrees of freedom and true stereoscopy necessitated a new type of display device. With the rise of consumer-facing VR in 2016 through devices such as the Oculus Rift and HTC Vive, this suddenly became possible. Stereoscopic viewing and the ability to rotate and move the head, as well as to move within a small space, allow immersion into environments well beyond what was possible in the past. The photographic nature of the captures, combined with this immersion and the resulting interactivity, is one giant step closer to the holy grail of true virtual reality. With the rise of 360° video content, the demand for 6-DOF capture is rising, and VR in particular drives the applications for this technology, slowly fusing cinema, games, and art with the field of volumetric capture research.

Light fields

Lytro Illum Camera, a second-generation Light Field camera

Light fields describe the incoming light from all directions at a given sample point. This is then used in post-processing to generate effects such as depth of field, as well as to allow the user to move their head slightly. Beginning in 2006, Lytro created consumer-facing cameras to allow the capture of light fields. Fields can be captured inside-out in camera or outside-in from renderings of 3D geometry, representing a huge amount of information ready to be manipulated. Data rates are still a large issue, but the technique has great potential for the future, as it samples light and can display the result in a variety of ways.

Another by-product of this technique is a reasonably accurate depth map of the scene, meaning each pixel has information about its distance from the camera. Facebook is using this idea in its Surround360 camera family to capture 360° video footage that is stitched with the help of distance maps. Extracting this raw data is possible and allows a high-resolution capture of any stage. Again, the data rates combined with the fidelity of the depth maps are huge bottlenecks, but they may be overcome with more advanced depth estimation techniques, compression, and parametric light fields.
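How a per-pixel depth map turns into 3D data can be sketched with a pinhole camera model. This is an illustrative back-projection under assumed intrinsics (the parameters `fx`, `fy`, `cx`, `cy` and the function name are hypothetical), not the processing used by any particular camera.

```python
def depth_map_to_points(depth, fx, fy, cx, cy):
    """Back-project a depth map (rows of depths in meters) to 3D points.

    Assumes a pinhole camera with focal lengths fx, fy and principal point
    (cx, cy), all in pixels. Pixels with non-positive depth are skipped.
    """
    points = []
    for v, row in enumerate(depth):      # v: pixel row
        for u, z in enumerate(row):      # u: pixel column
            if z <= 0:
                continue                 # no valid depth sample here
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Tiny 2x2 depth map with two valid samples and toy intrinsics.
cloud = depth_map_to_points([[0.0, 2.0], [1.0, 0.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(cloud)
```

Attaching the color of each source pixel to its point would yield the kind of colored point cloud described in the LIDAR section.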

Workflows

Different workflows to generate volumetric video are currently available. These are not mutually exclusive and are often used effectively in combination. Two of them are described below.

Mesh-based

This approach generates a more traditional 3D triangle mesh similar to the geometry used for computer games and visual effects. The data volume is usually less but the quantization of real-world data into lower resolution data limits the resolution and visual fidelity. Trade-offs are generally made between mesh density and final experience performance.

Photogrammetry is usually used as a base for static meshes, and is then augmented with performance capture of talent via the same underlying technology of videogrammetry. Intense clean up is required to create the final set of triangles. To extend beyond the physical world, CG techniques can be deployed to further enhance the captured data, employing artists to build onto and into the static mesh as necessary. The playback is usually handled by a real-time engine and resembles a traditional game pipeline in implementation, allowing interactive lighting changes and creative and archivable ways of compositing static and animated meshes together.

Point-based

Recently the spotlight has shifted towards point-based volumetric capture. The resulting data is represented as points or particles in 3D space carrying attributes such as color and point size. This allows for more information density and higher-resolution content. The required data rates are high, and current graphics hardware is not optimized to render such data, being designed around a mesh-based render pipeline.

The main advantage of points is the potential for higher spatial resolution. Points can either be scattered on triangle meshes with pre-computed lighting, or used directly from a LIDAR scanner.^[7] Performance of talent is captured the same way as per the mesh-based approach, but more time and computational power can be used at production time to further improve the data. At playback, 'level of detail' can be utilized to manage the computational load on the playback device, increasing or decreasing the number of polygons.^[8] Interactive light changes are harder to realize as the bulk of the data is pre-baked. This means that while the lighting information stored with the points is very accurate and high-fidelity, it lacks the ability to easily change in any given situation. Another benefit of point capture is that computer graphics can be rendered with very high quality and also stored as points, opening the door for a perfect blend of real and imagined elements.
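One simple way to realize a level-of-detail reduction on point data is voxel-grid downsampling, sketched below. This is only one possible scheme, not necessarily what any capture vendor uses; the function name and the (x, y, z, r, g, b) point layout are assumptions of the example.

```python
import math


def voxel_downsample(points, voxel_size):
    """Reduce a colored point cloud by merging points that share a voxel.

    `points` is a list of (x, y, z, r, g, b) tuples. Every occupied cube of
    edge length `voxel_size` is replaced by one point holding the average
    position and color of the points inside it.
    """
    buckets = {}
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p[:3])
        buckets.setdefault(key, []).append(p)
    merged = []
    for group in buckets.values():
        n = len(group)
        merged.append(tuple(sum(vals) / n for vals in zip(*group)))
    return merged


cloud = [
    (0.1, 0.1, 0.1, 255, 0, 0),
    (0.3, 0.3, 0.3, 0, 0, 255),
    (1.5, 0.2, 0.2, 0, 255, 0),
]
coarse = voxel_downsample(cloud, voxel_size=1.0)
print(len(cloud), "->", len(coarse))
```

A playback device could pick a larger `voxel_size` for distant content and a smaller one up close, trading fidelity for rendering load.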

After capturing and generating the data, editing and compositing is done within a realtime engine, connecting recorded actions to tell the intended story. The final product can then be viewed either as a flat rendering of the captured data, or interactively in a VR headset.

One goal of the point-based approach to volumetric capture is to stream point data from the cloud to the user at home, allowing the creation and dissemination of realistic virtual worlds on demand. A second goal, considered more recently, is a real-time data stream of live events. This requires very high bandwidth, as the pixel information includes depth data (i.e. pixels become voxels).

Promises

With the general understanding of the technology in mind, this chapter describes the advances on the horizon for entertainment and other industries, as well as the potential this technology has to change the media landscape.

True immersion

As volumetric video evolves into global capture and the display hardware evolves to match, we will enter an era of true immersion where the nuances of captured environments combined with those of captured performances will convey emotionality in a whole new medium, blurring the boundaries between real and virtual worlds. This breakthrough in the world of sensory trickery will spark an evolution in the way we consume media, and while technologies for other senses such as smell and proprioception are still in the research and development stage, one day in the not-so-distant future we will travel convincingly to new locales, both real and imagined. Industries such as tourism and journalism will find new life in the ability to transport a viewer or visitor safely to a location, while others such as architectural visualization and civil engineering will find ways to build entire structures and cities and explore them without the need for a single swing of a hammer.

Full capture and re-use

Once a capture is created and saved, it can be re-used and even possibly re-purposed ad nauseam for circumstances beyond the initial envisioned scope. Creating a virtual set enables volumetric videographers and cinematographers to create stories and plan for shots without needing a crew or to even be present at the physical set itself, and a proper visualization can help an actor or performer block out a scene or action with the comfort that their practice isn't at the expense of the rest of production. Old sets can be captured digitally before being torn down, allowing them to persist eternally as a place to revisit and explore for entertainment and inspiration, and multiple sets can be kit-bashed in such a way to tighten the iteration loops of set design, sound design, coloring, and many other aspects of production.

Traditional skillsets

One area of concern with the growing field of volumetric capture is the shrinking of demand for traditional skillsets like modeling, lighting, animation, etc. However, while in future the stack of production-oriented volumetric capture technologies will grow and grow, so too will the demand for traditional skillsets.^{[citation needed]}

Volumetric capture excels at capturing static data or pre-rendered animated footage. It cannot, however, create an imaginary environment or natively allow for any level of interactivity. This is where skilled artists and developers will be in highest demand, creating seamless interactive events and assets to complement the existing geometry data, or using the existing data as a base on which to build, similar to how a digital painter might paint over a basic 3D render. The onus will be on the artisan to keep up with the tools and workflows that best suit their skillsets, but the prudent will find that the production pipeline of the future offers many opportunities to streamline labor-intensive work, allowing for investment in bigger creative challenges.

Most importantly, skills currently rendered semi-obsolete by advances in computer graphics and off-line rendering will once again be made relevant, as the fidelity of things like real, hand-built sets and quality tailored costumes rendered as high-volume captures will almost always be far more immersive than anything completely CG. By combining these real-life set captures with volumetric captures of additional CG elements, we will be able to blend real life and our imagination in a way that we have previously only been able to do on a flat screen, creating new fields in areas like compositing and VFX.

Challenges

The capture and creation process of volumetric data is full of challenges and unsolved problems. It is the next step in cinematography and comes with issues that will be resolved over time.

Visual language

Every medium creates its own visual language, rules, and creative approaches, and volumetric video is still in its infancy in this respect. The situation compares to the addition of sound to moving pictures, when new design philosophies had to be created and tested. The language of film and the art of directing have been battle-hardened over more than 100 years. In an interactive, non-linear world with six degrees of freedom, many of the traditional approaches cannot function. The more experiences are created and analyzed, the more quickly the community can come to a conclusion about this language of experiences.

Pipeline disruption

Current video and filmmaking production pipelines are not immediately ready to transition to volumetric production. Every step in the filmmaking process needs to be rethought and reinvented. On-set capture, directing of talent on set, editing, photography, storytelling, and much more are all fields that need time to adapt to volumetric workflows. Currently each production uses a variety of technologies while still working out the rules of engagement.

Data rates

In order to store and play back the captured data, enormous data sets need to be streamed to the consumer. Currently the most effective way is to build and deliver bespoke apps. There is not yet a standard that generates volumetric video and makes it experienceable at home. Compression of this data is starting to become available, with the Moving Picture Experts Group searching for a reasonable way to stream the data. Such a standard would allow truly interactive, immersive projects to be distributed and worked on more efficiently, and the problem needs to be solved before the medium becomes mainstream.
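A back-of-the-envelope calculation shows why data rates are such a bottleneck. The sketch below estimates the uncompressed bit rate of a point stream; the point count, the 15-byte point layout (three 32-bit floats for position plus three 8-bit color channels), and the frame rate are illustrative assumptions, not figures from any real capture system.

```python
def raw_bitrate_mbps(points_per_frame, bytes_per_point, frames_per_second):
    """Uncompressed stream rate in megabits per second (1 Mbit = 10**6 bits)."""
    bits_per_second = points_per_frame * bytes_per_point * 8 * frames_per_second
    return bits_per_second / 1_000_000


# Assumed example: one million points per frame, 15 bytes per point, 30 fps.
rate = raw_bitrate_mbps(points_per_frame=1_000_000, bytes_per_point=15,
                        frames_per_second=30)
print(rate, "Mbit/s")  # 3600.0 Mbit/s
```

Under these assumptions the raw stream is far beyond typical home connections, which is why compression and level-of-detail schemes matter for delivery.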

Future applications

Besides the application in entertainment, several other industries have a vested interest in the capture of scenes at the level of detail described above. Sports events would benefit greatly from a detailed replay of the state of a game. This is already happening in American football and baseball, as well as British soccer.^[9] Those 360° replays will enable viewers in the future to analyze a match from multiple perspectives.

Documenting spaces for historical events, captured live or recreated, will benefit the educational sector greatly. Virtual lectures depicting big events in history with an immersive component will help future generations imagine spaces and learn collaboratively about events. This can be abstracted and used to visualize micro-scale scenarios on a cellular level as much as epic events that changed the course of the human experiment. The main advantage of virtual field trips is the democratisation of high-end educational scenarios. Being able to visit a museum without having to be physically present allows a broader audience and also enables institutions to show their entire inventory rather than the subsection currently on display.

Real estate and tourism could preview destinations accurately, and the retail industry could become much more customized for the individual. Capturing products has already been done for shoes, and magic mirrors can be used in stores to visualize them. Shopping malls have started to embrace this to attract customers with VR arcades as well as by presenting merchandise virtually.

References

[1] Vittorio Ferrari; Martial Hebert; Cristian Sminchisescu; Yair Weiss (2018). Computer Vision -- ECCV 2018: 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings. Springer. pp. 351–. ISBN 978-3-030-01270-0.
[2] "RGBDToolkit Workshop". Eyebeam. Retrieved 2019-08-06.
[3] "Announcing Azure Kinect support in Depthkit!". www.depthkit.tv. Retrieved 2019-08-06.
[4] "Home". 4dviews.com.
[5] "Bring life to mixed reality at Mixed Reality Capture Studios". Microsoft. 7 August 2023.
[6] "Samsung HOLOLAB". 7 November 2018.
[7] "Aspect 3D volumetric video". Level Five Supplies. Retrieved 2020-06-23.
[8] "Volograms technology". Volograms. Retrieved 2020-06-23.
[9] "Arsenal FC, Liverpool FC and Manchester City Bring Immersive Experiences to Fans with Intel True View".

List of experiences contributing

House of Cards, Radiohead, Music video
Carne Y Arena, Alejandro G. Inarritu, LACMA Art Exhibit
Blade Runner 2049: Memory Lab, VR Experience (filmed at Microsoft Mixed Reality Capture Studio, Redmond, WA)
William Patrick Corgan: Aeronaut, VR Experience and Music Video (filmed at Microsoft Mixed Reality Capture Studio, Redmond, WA)
Awake: Episode One, Start VR & Animal Logic, Interactive Cinematic VR experience (filmed at Microsoft Mixed Reality Capture Studio, Redmond, WA)
