ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation

Toyota Research Institute
SIGGRAPH 2024

Abstract


State-of-the-art methods for multi-shape representation (a single model "packing" multiple objects) typically trade modeling accuracy against memory and storage. We show how to encode multiple shapes represented as continuous neural fields with a higher degree of precision than previously possible and with low memory usage. Key to our approach is a recursive hierarchical formulation that exploits object self-similarity, leading to a highly compressed and efficient shape latent space. Thanks to the recursive formulation, our method supports spatial and global-to-local latent feature fusion without needing to initialize and maintain auxiliary data structures, while still allowing for continuous field queries to enable applications such as raytracing. In experiments on diverse datasets, we provide compelling qualitative results and demonstrate state-of-the-art multi-scene reconstruction and compression results with a single network per dataset.

Method


Given a single input feature corresponding to LoD 0, ReFiNe recursively expands an octree to the desired LoD using the latent subdivision network 𝜙. Unoccupied voxels at each LoD are pruned based on the output of the occupancy network 𝜔. To obtain a feature value at a specific spatial coordinate, we perform trilinear interpolation within each individual LoD and then aggregate the features via multi-scale feature fusion. Finally, we use 𝜉 and 𝜓 to decode color and geometry, respectively, for the desired coordinate. Since any coordinate within the scene bounds can be queried, various methods including differentiable rendering can be applied for reconstruction. Importantly, ReFiNe optimizes a single LoD 0 feature per 3D asset in the training dataset, enabling multiple assets to be reconstructed from a single trained ReFiNe network.
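The recursive expansion described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the subdivision network 𝜙 and occupancy network 𝜔 are stood in for by a fixed random linear map and a norm threshold, and the feature dimension is an arbitrary choice; in ReFiNe both networks are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

FEAT_DIM = 8  # latent feature size (illustrative choice, not from the paper)

def phi(parent_feat):
    """Stand-in for the latent subdivision network phi: maps one parent
    feature to 8 child features (one per octant). The real phi is a
    learned network; a random linear map keeps this sketch runnable."""
    W = rng.standard_normal((8 * FEAT_DIM, FEAT_DIM)) * 0.1
    return (W @ parent_feat).reshape(8, FEAT_DIM)

def omega(feat):
    """Stand-in for the occupancy network omega: decides whether a voxel
    is kept. Here a simple norm threshold plays the role of the learned
    occupancy prediction."""
    return np.linalg.norm(feat) > 0.25

def expand(root_feat, max_lod):
    """Recursively expand an octree of latent features from a single
    LoD-0 feature, pruning children that omega marks as unoccupied.
    Returns one dict {voxel_index: feature} per LoD."""
    levels = [{(0, 0, 0): root_feat}]
    for _ in range(max_lod):
        children = {}
        for (x, y, z), feat in levels[-1].items():
            child_feats = phi(feat)
            for octant in range(8):
                dx, dy, dz = octant & 1, (octant >> 1) & 1, (octant >> 2) & 1
                child = (2 * x + dx, 2 * y + dy, 2 * z + dz)
                if omega(child_feats[octant]):  # prune unoccupied voxels
                    children[child] = child_feats[octant]
        levels.append(children)
    return levels

levels = expand(rng.standard_normal(FEAT_DIM), max_lod=3)
```

Each level holds only the surviving voxels, which is where the memory savings come from: no dense grid or auxiliary spatial data structure is ever materialized, only the pruned hierarchy of latent features.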


Representing Fields


ReFiNe supports various output 3D geometry and color representations (e.g., SDF, SDF+Color, and NeRF), and its output can be rendered either with sphere tracing or iso-surface projection (SDF) or with volumetric rendering (NeRF). Compared to other multi-scene baselines, ReFiNe demonstrates both better reconstruction quality and a lower storage footprint.
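To make the SDF rendering path concrete, here is a minimal sphere tracer. It uses an analytic sphere SDF as a stand-in for the field ReFiNe decodes with 𝜓; the step sizes, epsilon, and ray setup are illustrative assumptions, not values from the paper.

```python
import numpy as np

def sphere_sdf(p, center=np.zeros(3), radius=1.0):
    """Analytic SDF of a unit sphere; stands in for a decoded ReFiNe field."""
    return np.linalg.norm(p - center) - radius

def sphere_trace(origin, direction, sdf, max_steps=64, eps=1e-4, max_dist=10.0):
    """March along the ray by the SDF value at each step: by definition
    of a signed distance field, no surface lies closer than that value,
    so each step is safe. Returns the hit distance t, or None on a miss."""
    d = direction / np.linalg.norm(direction)
    t = 0.0
    for _ in range(max_steps):
        dist = sdf(origin + t * d)
        if dist < eps:
            return t  # converged onto the surface
        t += dist
        if t > max_dist:
            break  # ray left the scene bounds
    return None

# A ray from z = -3 toward the origin hits the unit sphere at t = 2.
t_hit = sphere_trace(np.array([0.0, 0.0, -3.0]), np.array([0.0, 0.0, 1.0]), sphere_sdf)
# A ray passing well above the sphere never converges.
t_miss = sphere_trace(np.array([0.0, 2.0, -3.0]), np.array([0.0, 0.0, 1.0]), sphere_sdf)
```

Because ReFiNe exposes continuous field queries at arbitrary coordinates, the `sdf` callable here would simply be the composed feature-fusion-and-decode query, making this kind of renderer drop-in compatible.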

Citation


@inproceedings{refine,
    title={ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation},
    author={Zakharov, Sergey and Liu, Katherine and Gaidon, Adrien and Ambrus, Rares},
    booktitle={SIGGRAPH},
    year={2024}
}