Input contains images with the same resolution and their corresponding pose matrices. In addition, there are also the near bound and the far bound of the scene.
The images can be seen as a tensor with shape [H, W, 3], whose values fall between 0 and 255. They are divided by 255 so that they fall between 0 and 1.
Sometimes the term "pose matrix" is ambiguous (intrinsic or extrinsic). Here, the pose matrix denotes the camera-to-world (c2w) matrix, i.e., the inverse of the world-to-camera extrinsic matrix.
Each image has a 3×4 pose matrix of the form

$$\begin{bmatrix} \mathbf{x} & \mathbf{y} & \mathbf{z} & \mathbf{t} \end{bmatrix}$$

The columns $[\mathbf{x}\ \mathbf{y}\ \mathbf{z}]$ respectively denote the camera's [right, up, back] axes (an orthonormal basis). The column $\mathbf{t}$ is the translation vector, i.e., the camera position in world coordinates.
In what follows, we focus mostly on the pose matrix.
The rotation matrix given in the LLFF dataset has columns in the order [down, right, back] by default. It should be fixed to [right, up, back]. (The coordinate frame itself doesn't change, only its representation, so the translation vector doesn't need to be modified in this step.)
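A minimal sketch of that column reordering, assuming the poses are stored as an [N, 3, 4] array (the reference loader does the equivalent with a single concatenate):

```python
import numpy as np

def fix_llff_rotation(poses):
    # poses: [N, 3, 4] with rotation columns [down, right, back] (LLFF convention)
    right = poses[:, :, 1:2]
    up = -poses[:, :, 0:1]      # up = -down
    back = poses[:, :, 2:3]
    t = poses[:, :, 3:4]        # translation is left untouched
    return np.concatenate([right, up, back, t], axis=2)
```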
Review the NDC coordinates:
All axes in NDC are bounded to [-1, 1]. When camera coordinates are transformed into NDC, we want the near plane to sit at z = -1. So we rescale the bounds and the translation vectors.
What we do is the following:
bd_factor = 0.75
sc = 1. if bd_factor is None else 1./(bds.min() * bd_factor)
poses[:,:3,3] *= sc # scale the translation vector
bds *= sc
After this rescaling, the near bound sits at roughly 1.333 (bds.min() * sc = 1 / 0.75), not exactly at 1, as a safety margin in case the estimated bound wasn't conservative enough. Since we later hardcode the near plane at z = -1, the scene is still mapped inside NDC even if part of it is actually a bit closer than the estimated near bound.
We also want to normalize the orientation of the scene, so that the identity pose looks at the scene. What we need to do is compute the average camera pose (c2w) and apply its inverse to the dataset.
In the original dataset, for each camera $i$ we have $x_{\text{world}} = \mathrm{c2w}_i \, x_{\text{cam}}$. If we apply a rigid transformation $T$ to redefine the world frame, the pose of camera $i$ in the new frame becomes $T^{-1}\,\mathrm{c2w}_i$. Choosing $T$ to be the average pose, what we need to do in code is apply the inverse of this average c2w to all pose matrices:
# avg_c2w and poses are padded to homogeneous 4x4 matrices before this step
poses = np.linalg.inv(avg_c2w) @ poses
The final situation is that the world frame coincides with the average camera frame.
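For reference, here is a minimal sketch of one way to build that average pose, following the usual normalize / cross-product construction (the function names are my own):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def average_pose(poses):
    # poses: [N, 3, 4] camera-to-world matrices with columns [right, up, back, t]
    center = poses[:, :3, 3].mean(0)           # mean camera position
    back = normalize(poses[:, :3, 2].sum(0))   # mean "back" (viewing) axis
    up = poses[:, :3, 1].sum(0)                # mean "up" axis (not yet orthogonal)
    right = normalize(np.cross(up, back))      # right = up x back
    up = normalize(np.cross(back, right))      # re-orthogonalized up
    return np.stack([right, up, back, center], 1)   # [3, 4]
```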
In this part, the ray positions are still in camera coordinates (not NDC). Since we assume the near plane is at z = -1, the maximum offset along the x axis on that plane is about ±W / (2·focal), and ±H / (2·focal) along the y axis.
Each ray corresponds to a certain pixel in the image.
x =
[[-252., -251., -250., ..., 249., 250., 251.],
[-252., -251., -250., ..., 249., 250., 251.],
[-252., -251., -250., ..., 249., 250., 251.],
...,
[-252., -251., -250., ..., 249., 250., 251.],
[-252., -251., -250., ..., 249., 250., 251.],
[-252., -251., -250., ..., 249., 250., 251.]] / Focal
y =
[[ 189., 189., 189., ..., 189., 189., 189.],
[ 188., 188., 188., ..., 188., 188., 188.],
[ 187., 187., 187., ..., 187., 187., 187.],
...,
[-186., -186., -186., ..., -186., -186., -186.],
[-187., -187., -187., ..., -187., -187., -187.],
[-188., -188., -188., ..., -188., -188., -188.]] / Focal
z = -np.ones_like(x)
The direction vectors above are still in camera coordinates. Training uses rays in world coordinates, so we transform the direction vectors from camera to world coordinates with the rotation part of c2w. Since the camera center is at (0, 0, 0) in camera coordinates, the ray origin in world coordinates is simply the translation column of c2w. Finally we get a tensor of shape [batch, ro+rd, H, W, 3]: every image contributes H*W ray origins and direction vectors.
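Concretely, a sketch of the ray generation for one image (close to the get_rays_np helper in the reference implementation; c2w is the 3×4 camera-to-world matrix):

```python
import numpy as np

def get_rays_np(H, W, focal, c2w):
    # pixel grid: i indexes columns (x), j indexes rows (y)
    i, j = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32), indexing='xy')
    # directions in camera coordinates: x right, y up, camera looking along -z
    dirs = np.stack([(i - W * .5) / focal, -(j - H * .5) / focal, -np.ones_like(i)], -1)
    # rotate the directions into world coordinates
    rays_d = np.sum(dirs[..., None, :] * c2w[:3, :3], -1)
    # every ray starts at the camera center, i.e. the translation column of c2w
    rays_o = np.broadcast_to(c2w[:3, 3], rays_d.shape)
    return rays_o, rays_d
```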
Now that we assume the near plane is at z = -1, the first step is to shift every ray origin onto this plane:
t = -(1. + rays_o[...,2]) / rays_d[...,2]   # distance along the ray to the plane z = -1
rays_o = rays_o + t[...,None] * rays_d
This new origin lies exactly on the near plane (its z component is -1), which is what the NDC projection below assumes.
Note that eye coordinates are defined in a right-handed coordinate system, while NDC uses a left-handed one. That is, the camera at the origin looks along the -Z axis in eye space, but along the +Z axis in NDC. (OpenGL Projection Matrix)
The rays in the viewing frustum are then transformed into NDC, where the near plane is at z = -1 and the far plane is at z = +1.
Note that, as desired, $t' = 0$ when $t = 0$. Additionally, we see that $t' \to 1$ as $t \to \infty$.
The whole transformation process and its proof can be seen at this link.
Now we can conclude: when $t' = 0$ the ray is at the near plane, and when $t' = 1$ the ray is at the far plane (i.e., at infinity in the original space).
We can take samples by choosing values of $t'$ to get the positions of points along the rays in NDC.
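Putting the origin shift and the projection together, a sketch of the NDC conversion (following the projection formulas in the NeRF paper's appendix; near = 1 corresponds to the plane z = -1):

```python
import torch

def ndc_rays(H, W, focal, near, rays_o, rays_d):
    # shift ray origins onto the near plane z = -near
    t = -(near + rays_o[..., 2]) / rays_d[..., 2]
    rays_o = rays_o + t[..., None] * rays_d
    # project origins and directions into NDC
    o0 = -1. / (W / (2. * focal)) * rays_o[..., 0] / rays_o[..., 2]
    o1 = -1. / (H / (2. * focal)) * rays_o[..., 1] / rays_o[..., 2]
    o2 = 1. + 2. * near / rays_o[..., 2]
    d0 = -1. / (W / (2. * focal)) * (rays_d[..., 0] / rays_d[..., 2] - rays_o[..., 0] / rays_o[..., 2])
    d1 = -1. / (H / (2. * focal)) * (rays_d[..., 1] / rays_d[..., 2] - rays_o[..., 1] / rays_o[..., 2])
    d2 = -2. * near / rays_o[..., 2]
    return torch.stack([o0, o1, o2], -1), torch.stack([d0, d1, d2], -1)
```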
NeRF simultaneously optimizes two models, one coarse and one fine. After we feed the sampled points' positions and viewing directions to the two networks, we get two sets of predicted RGB and density values.
In the original NeRF paper, the modeling and derivation of volume rendering are not fully included. This paper gives a partly reasonable introduction.
We assume the light travels through a thin cylinder (the blue cylinder) of length $\Delta s$. In the blue cylinder, the light collides with particles. We make the hypothesis that all particles are balls of the same size, with radius $r$, and that the particle density (particles per unit volume) is $\rho$. If we look from the bottom of the blue cylinder, the area covered by one particle is $\pi r^2$. We also assume that the bottom (cross-sectional) area of the blue cylinder is $E$. The number of particles in the cylinder is:

$$N = \rho \, E \, \Delta s$$

When $\Delta s$ is small enough, the projected particles barely overlap, so their covered areas simply add up. The covered proportion of the cylinder bottom is:

$$\frac{N \pi r^2}{E} = \rho \, \pi r^2 \, \Delta s$$

Light only gets through the uncovered area of the cylinder bottom, so the intensity becomes:

$$I(s + \Delta s) = I(s)\,\bigl(1 - \rho \, \pi r^2 \, \Delta s\bigr)$$
The intensity difference is:

$$\Delta I = I(s + \Delta s) - I(s) = -\rho \, \pi r^2 \, \Delta s \, I(s)$$

Letting $\Delta s \to 0$, it can be transformed into a differential equation:

$$\frac{dI}{ds} = -\rho(s)\, \pi r^2 \, I(s)$$

We define:

$$\sigma(s) = \rho(s)\, \pi r^2$$

We can see that $\sigma(s)$ is exactly the volume density predicted by NeRF: the probability, per unit length, that the light is blocked at $s$. The original equation becomes:

$$\frac{dI}{ds} = -\sigma(s)\, I(s)$$

After we do an integration, we get:

$$I(s) = C \exp\Bigl(-\int_0^{s} \sigma(t)\, dt\Bigr)$$

When $s = 0$, we know $I(0) = I_0$ (the incident intensity), so $C = I_0$. We get:

$$I(s) = I_0 \exp\Bigl(-\int_0^{s} \sigma(t)\, dt\Bigr)$$
We define:

$$T(s) = \exp\Bigl(-\int_0^{s} \sigma(t)\, dt\Bigr) = \frac{I(s)}{I_0}$$

which is the transmittance: the probability that the light travels from $0$ to $s$ without hitting any particle. We define a random variable $S$: the distance at which the light hits its first particle. Then $P(S > s) = T(s)$, so the probability density of the first hit occurring at $s$ is $p(s) = -\frac{dT(s)}{ds} = T(s)\,\sigma(s)$.

The final returned color of the ray is the expectation of the color of the particle which blocks the light. We assume that the particle color at distance $s$ along the ray is $\mathbf{c}(s)$. Let the near and far bounds of the ray be $t_n$ and $t_f$. The expected color is then:

$$C = \mathbb{E}\bigl[\mathbf{c}(S)\bigr] = \int_{t_n}^{t_f} T(t)\, \sigma(t)\, \mathbf{c}(t)\, dt$$

which is the first equation in NeRF, where $T(t) = \exp\Bigl(-\int_{t_n}^{t} \sigma(s)\, ds\Bigr)$.
On a computer, we have to approximate this integral numerically. Assume that the sampled points along the ray are $t_1 < t_2 < \dots < t_N$, and write $\delta_i = t_{i+1} - t_i$. As in the derivation above, we assume that $\sigma$ and $\mathbf{c}$ are constant within each interval $[t_i, t_{i+1}]$, with values $\sigma_i$ and $\mathbf{c}_i$. The contribution of interval $i$ is then $\int_{t_i}^{t_{i+1}} T(t)\,\sigma_i\,\mathbf{c}_i\, dt = T_i\,\bigl(1 - \exp(-\sigma_i \delta_i)\bigr)\,\mathbf{c}_i$, so the calculation for the estimated color becomes

$$\hat{C} = \sum_{i=1}^{N} T_i\,\bigl(1 - \exp(-\sigma_i \delta_i)\bigr)\,\mathbf{c}_i, \qquad T_i = \exp\Bigl(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Bigr),$$
which is the third equation in NeRF.
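As a sketch of how this quadrature is evaluated in code (PyTorch; close to the raw2outputs step in common implementations, assuming rgb is already in [0, 1]):

```python
import torch

def composite(rgb, sigma, z_vals, rays_d):
    # rgb: [N_rays, N_samples, 3]; sigma: [N_rays, N_samples]; z_vals: [N_rays, N_samples]
    dists = z_vals[..., 1:] - z_vals[..., :-1]                        # delta_i
    dists = torch.cat([dists, 1e10 * torch.ones_like(dists[..., :1])], -1)
    dists = dists * torch.norm(rays_d[..., None, :], dim=-1)          # account for non-unit directions
    alpha = 1. - torch.exp(-torch.relu(sigma) * dists)                # 1 - exp(-sigma_i * delta_i)
    # exclusive cumulative product gives T_i = prod_{j < i} (1 - alpha_j)
    T = torch.cumprod(torch.cat([torch.ones_like(alpha[..., :1]),
                                 1. - alpha + 1e-10], -1), -1)[..., :-1]
    weights = T * alpha                                               # w_i in the formula above
    rgb_map = torch.sum(weights[..., None] * rgb, -2)                 # expected color per ray
    return rgb_map, weights
```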
We generate the rays in camera coordinates and transform them to world coordinates. Each ray corresponds to a certain pixel in the image. The rays' origins and directions form the batch; the corresponding pixel RGB values are the targets.
Then comes the rendering.
We transform the rays' origins and directions from the perspective frustum to NDC, and sample several points from each ray ([N_ray, N_sample, 3]), as sketched below.
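A sketch of that stratified point sampling along the rays (names follow common implementations; in NDC, near = 0 and far = 1):

```python
import torch

def sample_points(rays_o, rays_d, near, far, N_samples):
    # rays_o, rays_d: [N_rays, 3]
    N_rays = rays_o.shape[0]
    t_vals = torch.linspace(0., 1., steps=N_samples)
    z_vals = near * (1. - t_vals) + far * t_vals          # evenly spaced depths
    z_vals = z_vals.expand(N_rays, N_samples)
    # stratified sampling: jitter each sample inside its bin
    mids = .5 * (z_vals[..., 1:] + z_vals[..., :-1])
    upper = torch.cat([mids, z_vals[..., -1:]], -1)
    lower = torch.cat([z_vals[..., :1], mids], -1)
    z_vals = lower + (upper - lower) * torch.rand(z_vals.shape)
    # points along each ray: [N_rays, N_samples, 3]
    pts = rays_o[..., None, :] + rays_d[..., None, :] * z_vals[..., :, None]
    return pts, z_vals
```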
Then we feed the xyz positions in NDC and the viewing directions to the network, and get the predicted density $\sigma$ and color $\mathbf{c}$ for each sampled point. We use volume rendering: given the predicted density $\sigma_i$ and color $\mathbf{c}_i$ of every sampled point on a ray, we composite them with the quadrature formula above to obtain the pixel color.
We use N_samples points to train the coarse network and N_samples + N_importance points to train the fine network.
The two networks output two images, whose losses are added together to optimize both networks. The final result is taken solely from the fine network, but the coarse network also needs to be trained.
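A sketch of how the two losses might be combined (function and variable names are my own; both terms compare against the same ground-truth pixels):

```python
import torch

def nerf_loss(rgb_coarse, rgb_fine, target):
    # rgb_coarse, rgb_fine, target: [N_rays, 3]
    # both renderings are penalized against the same ground truth, so the coarse
    # network keeps training even though only the fine rendering is the output
    img2mse = lambda x, y: torch.mean((x - y) ** 2)
    return img2mse(rgb_fine, target) + img2mse(rgb_coarse, target)
```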
The term $w_i = T_i\,\bigl(1 - \exp(-\sigma_i \delta_i)\bigr)$ is interpreted as the weight of sample $i$.
After volume rendering of the coarse image finishes, another batch of important points is sampled based on the CDF built from the weights predicted by the coarse network. This is where the N_importance additional sampled points come from. In other words, the coarse network is trained to guide the hierarchical sampling for the fine network.
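A sketch of that inverse-CDF sampling (close to the sample_pdf helper found in common PyTorch implementations; bins are bin edges along the ray, weights are the per-bin coarse weights):

```python
import torch

def sample_pdf(bins, weights, N_importance):
    # bins: [N_rays, M+1] bin edges; weights: [N_rays, M] coarse weights per bin
    weights = weights + 1e-5                                    # avoid an all-zero pdf
    pdf = weights / torch.sum(weights, -1, keepdim=True)
    cdf = torch.cumsum(pdf, -1)
    cdf = torch.cat([torch.zeros_like(cdf[..., :1]), cdf], -1)  # [N_rays, M+1], from 0 to 1
    # draw uniform samples and invert the CDF
    u = torch.rand(list(cdf.shape[:-1]) + [N_importance])
    inds = torch.searchsorted(cdf, u, right=True)
    below = torch.clamp(inds - 1, min=0)
    above = torch.clamp(inds, max=cdf.shape[-1] - 1)
    cdf_below = torch.gather(cdf, -1, below)
    cdf_above = torch.gather(cdf, -1, above)
    bins_below = torch.gather(bins, -1, below)
    bins_above = torch.gather(bins, -1, above)
    denom = torch.where(cdf_above - cdf_below < 1e-5,
                        torch.ones_like(cdf_above), cdf_above - cdf_below)
    t = (u - cdf_below) / denom
    return bins_below + t * (bins_above - bins_below)           # [N_rays, N_importance]
```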
In the volume rendering process, considering the samples along a single ray, a point must have a high density (and little occlusion in front of it) if its weight takes a large value. So the weights of the sampled points can be used to estimate the pixel depth.
depth_map = torch.sum(weights * z_vals, -1)
Every sampled point on a single ray makes a weighted contribution to the pixel depth. Disparity is just the reciprocal of the (weight-normalized) depth.
disp_map = 1./torch.max(1e-10 * torch.ones_like(depth_map), depth_map / torch.sum(weights, -1))
Rendering a scene with a white background only needs a small modification to rgb_map.
# a ray with low accumulated weight (acc_map) means there is no high-opacity
# content along it, so we add the missing weight as white background:
# rgb_map approaches 1 (white) wherever the scene is transparent
rgb_map = rgb_map + (1.-acc_map[...,None])
The position of the camera is the key quantity for pose generation, because once the position is known, the viewing direction (and hence the full rotation) can be derived by making the camera look at a fixed focus point.
c = np.dot(c2w[:3,:4], np.array([np.cos(theta), -np.sin(theta), -np.sin(theta*zrate), 1.]) * rads)
The code uses the parametric equation of a spiral: the camera position traces $\bigl(\cos\theta,\ -\sin\theta,\ -\sin(\theta \cdot zrate)\bigr)$, scaled per-axis by rads and expressed in the average camera frame c2w.
The path can be seen in this url.
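For context, a sketch of the surrounding spiral-path loop (close to the render_path_spiral helper in the reference code; viewmatrix builds a c2w whose back axis is the given viewing direction):

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def viewmatrix(z, up, pos):
    vec2 = normalize(z)                       # back axis
    vec0 = normalize(np.cross(up, vec2))      # right axis
    vec1 = normalize(np.cross(vec2, vec0))    # orthogonal up axis
    return np.stack([vec0, vec1, vec2, pos], 1)

def spiral_path(c2w, up, rads, focal, zrate=0.5, rots=2, N=120):
    rads = np.array(list(rads) + [1.])
    poses = []
    for theta in np.linspace(0., 2. * np.pi * rots, N + 1)[:-1]:
        # camera position on the spiral, expressed in the average camera frame
        c = np.dot(c2w[:3, :4], np.array([np.cos(theta), -np.sin(theta), -np.sin(theta * zrate), 1.]) * rads)
        # keep every generated camera looking at a fixed focus point in front of the average camera
        z = normalize(c - np.dot(c2w[:3, :4], np.array([0, 0, -focal, 1.])))
        poses.append(viewmatrix(z, up, c))
    return poses
```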