Expanding the Viewpoint of Dynamic Scenes beyond Constrained Camera Motions

Shaotong Zhu, Le Jiang and ACLab Associates

📄 Paper 💻 Code

Abstract

In the domain of dynamic Neural Radiance Fields (NeRF) for novel view synthesis, current state-of-the-art (SOTA) techniques struggle when the camera's pose deviates significantly from the primary viewpoint, resulting in unstable and unrealistic outcomes. This paper introduces Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF method that integrates a Gaussian splatting prior to tackle novel view synthesis with large-angle rotations. ExpanDyNeRF employs a pseudo ground truth technique to optimize density and color features, which enables the generation of realistic scene reconstructions from challenging viewpoints. Additionally, we present the Synthetic Dynamic Multiview (SynDM) dataset, the first GTA V-based dynamic multiview dataset designed specifically for evaluating robust dynamic reconstruction from significantly shifted views. We evaluate our method quantitatively and qualitatively on both the SynDM dataset and the widely recognized NVIDIA dataset, comparing it against other SOTA methods for dynamic scene reconstruction. Our evaluation results demonstrate that our method achieves superior performance.

Challenge and Motivation
Existing dynamic NeRF methods are trained on monocular videos whose cameras stay close to a primary viewpoint; once the rendering camera rotates far from that trajectory, their reconstructions become unstable and unrealistic. Our goal is to expand the range of viewpoints from which a dynamic scene can be rendered faithfully.

Pipeline

Foreground-Background Decomposition
In video sequences, the background is largely static while the foreground is dynamic. The model therefore decomposes the scene into two parts: a static background branch and a dynamic foreground branch.

Rendered outputs from both branches are blended to reconstruct the full dynamic scene, supervised by the super-resolution loss (L_sr).
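
As a minimal sketch, assuming each branch renders an RGB image and the foreground branch also predicts an opacity mask, the blending could look like the following (the function names and the L1 form of L_sr are illustrative, not the paper's exact implementation):

```python
import torch
import torch.nn.functional as F

def blend_branches(fg_rgb, fg_alpha, bg_rgb):
    """Composite the dynamic foreground render over the static background.

    fg_rgb, bg_rgb: (H, W, 3) renders from the two branches.
    fg_alpha:       (H, W, 1) foreground opacity from the dynamic branch.
    """
    return fg_alpha * fg_rgb + (1.0 - fg_alpha) * bg_rgb

def super_resolution_loss(blended, sr_target):
    """Illustrative L_sr: compare the blended render against a
    super-resolved reference frame (L1 distance assumed here)."""
    return F.l1_loss(blended, sr_target)
```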

Novel View Feature Optimization
For viewpoints that deviate strongly from the input trajectory, the Gaussian splatting prior supplies pseudo ground truth renders, which are used to optimize the NeRF's density and color features so that large-angle views remain realistic.
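
A schematic of one such supervision step, assuming a callable that renders the NeRF at a given pose and a pseudo ground truth image precomputed by the Gaussian prior; the names and the simple L2 photometric loss are assumptions for illustration:

```python
import torch

def pseudo_gt_step(render_nerf, optimizer, pose, pseudo_gt):
    """One optimization step against a pseudo ground truth render.

    render_nerf: pose -> (H, W, 3) render of the dynamic NeRF (assumed API).
    pose:        camera pose of a deviated novel view.
    pseudo_gt:   (H, W, 3) image produced by the Gaussian splatting prior.
    """
    optimizer.zero_grad()
    pred = render_nerf(pose)
    loss = torch.mean((pred - pseudo_gt) ** 2)  # photometric L2 (assumed)
    loss.backward()  # gradients flow into density and color features
    optimizer.step()
    return loss.item()
```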

Pipeline Diagram

SynDM Dataset

Motivation
Existing dynamic video datasets lack ground truth for side views, making it impossible to quantitatively evaluate novel view synthesis at strongly deviated angles. This limitation arises because recording dynamic multi-view videos in real-world settings is extremely difficult, if not infeasible.

SynDM fills this gap by providing dynamic multi-view videos with side-view ground truth, enabling systematic evaluation of novel view rendering performance.

Dataset Overview

Qualitative and Quantitative Results

We conduct a comprehensive comparison between our ExpanDyNeRF and four SOTA novel view synthesis methods: RoDynRF (Liu et al., 2023), MonoNeRF (Fu et al., 2022), D3DGS (Yang et al., 2024), and D4NeRF (Zhang et al., 2023a), on the SynDM and NVIDIA datasets. Qualitative results are shown in the video below, with novel views deviating from -30° to +30°; quantitative results are reported in Table 1 using FID, PSNR, and LPIPS. Our method achieves the best performance on both datasets.

Table 1: Quantitative comparison results on the SynDM dataset
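
For reference, a minimal sketch of how these three metrics are commonly computed with the lpips and torchmetrics packages (not necessarily the paper's exact evaluation code):

```python
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

lpips_fn = lpips.LPIPS(net='alex')            # expects inputs in [-1, 1]
fid = FrechetInceptionDistance(feature=2048)  # expects uint8 images in [0, 255]

def psnr(pred, gt):
    """PSNR for image batches in [0, 1], shape (N, 3, H, W)."""
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse)

def accumulate(pred, gt):
    """Feed one batch of rendered / ground-truth views into the metrics."""
    fid.update((gt * 255).to(torch.uint8), real=True)
    fid.update((pred * 255).to(torch.uint8), real=False)
    return psnr(pred, gt).item(), lpips_fn(pred * 2 - 1, gt * 2 - 1).mean().item()

# After all test views have been accumulated:
# fid_score = fid.compute().item()
```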

Challenges in ExpanDyNeRF and Improvements via ExpanDyGauss

Limitations of NeRF-based Methods
NeRF-based pipelines such as ExpanDyNeRF rely on per-ray volumetric sampling of an implicit field, which makes training and rendering slow and makes it costly to supervise many deviated viewpoints.

Advantages of Gaussian Splatting
Gaussian splatting represents the scene as an explicit set of 3D Gaussians that can be rasterized in real time, making dense novel-view supervision far cheaper.

Expanded Dynamic Gaussian Splatting (ExpanDyGauss)
To address these issues, we propose ExpanDyGauss, a monocular Gaussian splatting framework for large-angle novel view synthesis. ExpanDyGauss leverages a video-to-video diffusion model to perform spatial-temporal inpainting, generating consistent pseudo ground truth across 360° of viewpoints and providing effective supervision for both static and dynamic components without significant overhead.
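
A high-level sketch of this pseudo ground truth loop; render_fn and diffusion_inpaint stand in for the Gaussian renderer and the video-to-video diffusion model (assumed APIs), and the L1 photometric loss is an assumption:

```python
import torch

def generate_pseudo_gt(render_fn, diffusion_inpaint, poses_360, input_video):
    """Render coarse views along a 360° orbit, then let a video-to-video
    diffusion model inpaint unseen regions into consistent pseudo GT.

    render_fn:         pose -> (H, W, 3) render of the current Gaussians.
    diffusion_inpaint: (frames, condition) -> inpainted frames (assumed API).
    """
    coarse = torch.stack([render_fn(p) for p in poses_360])
    return diffusion_inpaint(coarse, input_video)

def pseudo_gt_supervision(render_fn, params, poses_360, pseudo_gt, lr=1e-3):
    """Optimize Gaussian parameters against the pseudo ground truth views."""
    optimizer = torch.optim.Adam(params, lr=lr)
    for pose, target in zip(poses_360, pseudo_gt):
        optimizer.zero_grad()
        loss = torch.mean(torch.abs(render_fn(pose) - target))  # assumed L1
        loss.backward()
        optimizer.step()
```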

Overall Pipeline

Dense Initialization and Segmentation
Easi3R predicts dense 3D point clouds and camera poses from monocular videos, even when the camera motion is small.

SAM segmentation separates each frame into foreground and background regions, which initialize the foreground Gaussians and background Gaussians.
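
A sketch of this split, assuming per-frame SAM masks aligned with Easi3R's per-pixel point predictions; the array shapes and dictionary layout are illustrative:

```python
import numpy as np

def split_gaussians(points, colors, fg_mask):
    """Partition dense per-pixel points into foreground / background sets.

    points:  (H, W, 3) per-pixel 3D points from Easi3R.
    colors:  (H, W, 3) corresponding RGB values.
    fg_mask: (H, W) boolean SAM mask of the dynamic foreground.
    """
    fg = dict(xyz=points[fg_mask], rgb=colors[fg_mask])    # dynamic Gaussians
    bg = dict(xyz=points[~fg_mask], rgb=colors[~fg_mask])  # static Gaussians
    return fg, bg
```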

Gaussian Reconstruction and Enhancement

Gaussian Diagram
Real Scene Results

Demo Results

Our method generates dynamic Gaussian Splatting models for novel view synthesis on both synthetic datasets and real-world captured videos, demonstrating that the approach handles both controlled and unconstrained scenarios.

Real-world Data
To demonstrate the effectiveness of our method in real-world applications, we apply it to a casually captured monocular video. Our approach generates a dynamic Gaussian Splatting model that supports reasonable novel view synthesis in real-world scenarios.

Synthetic Data
A demo of the results on the SynDM dataset. Given a monocular input video of a dynamic scene, our method generates a dynamic Gaussian Splatting model and synthesizes novel views.

Application

Robotic Perception and Navigation

Human Motion Capture and Sports Analysis

Application Diagram