LatentLog

[ Deep Learning Paper Review ] - GLASS Flows: Transition Sampling For Alignment (ICLR 2026)

Lee현서 — Sat, 21 Mar 2026 16:34:28 +0900

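Hello. My final undergraduate semester has already started, and I am also working hard as an undergraduate research student. Recently I have become very interested in inference-time scaling for diffusion and flow models, so I have been reading papers on the topic. Along the way I found an ICLR 2026 paper called GLASS Flows. I learned that it comes from Lipman and Ricky Chen, the authors of Flow Matching, and it looked fun, so I started reading, and nearly passed out from the sheer number of equations. I ended up spending about two days on the paper, and since it turned out to be an interesting one, I decided to review it. Graduate school admissions are coming up soon too, so I need to keep pushing!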

paper: https://arxiv.org/pdf/2509.25170

github: https://github.com/PeterHolderrieth/glass_flows_tutorial


1 Introduction

A lot of recent research has focused on improving models at inference time, alongside approaches such as finetuning against an additional reward. These reward alignment methods have been applied to text-to-image alignment, inverse problems, and molecular design. However, these algorithms need a very large amount of compute time to perform well, so efficiency is the biggest problem in this area.

Inference in diffusion and flow models is usually done in one of the following two ways.

ODE sampling: used in flow matching, and when solving the probability flow ODE of a diffusion model.
SDE sampling: the standard sampler for diffusion models.

Empirically, ODE sampling generates samples more efficiently. With SDE sampling, however, randomness is injected along the reverse-time process, so the process is characterized by transition probabilities such as $$ p_{t' \mid t} (x_{t'} \mid x_t) = \mathbb{P} [X_{t'} = x_{t'} \mid X_t = x_t] $$ Here $p_{t'|t}(x_{t'}|x_{t})$ is called the transition kernel, and many reward alignment algorithms rely on it. This randomness is the biggest advantage and the defining feature of SDE sampling. But moving from ODE to SDE sampling gives up the efficiency of ODE sampling.

So the authors propose Gaussian Latent Sufficient Statistic (GLASS) Flows, which sample from $p_{t'|t}$ using an ODE. GLASS Flows have the following properties.

They retain the high efficiency of ODEs
They retain the controllable stochastic evolution of SDEs

Inner model of GLASS Flows

To do this, the authors sample $p_{t'|t}$ through an inner flow matching model, designed so that it can be obtained easily from a pre-trained flow matching model. This inner model, which needs no additional training, relies heavily on sufficient statistics, a concept from theoretical statistics. GLASS Flows can also be dropped into any algorithm based on SDE sampling as a plug-in. The authors report that reward alignment with GLASS Flows achieves SOTA results in text-to-image generation.

2 Background and Motivation

This section explains the background mainly in the flow matching (FM) framework; the same applies to diffusion. The paper writes $z \in \mathbb{R}^d$ for a data point drawn from the true data distribution $p_{data}$. Following the flow matching convention, data corresponds to time $t=1$ and noise to $t=0$. Noisy data is described by the Gaussian conditional probability path $p_t(x_t|z)$:

$$x_t = \alpha_t z + \sigma_t \epsilon,\ \ \ \ \epsilon \sim \mathcal{N}(0,I_d)\ \ \Leftrightarrow\ \ p_t(x_t|z)=\mathcal{N}(x_t;\alpha_t z, \sigma_t^2I_d)$$

Taking $z\sim p_{data}$ gives the marginal probability path $p_t(x_t) = \mathbb{E}_{z\sim p_{data}}[p_t(x_t|z)]$. This path interpolates between Gaussian noise $p_0 = \mathcal{N}(0, I_d)$ and $p_1 = p_{data}$, and it can also be viewed as a Gaussian mixture. An FM model then learns the marginal vector field below.

$$u_t(x_t) = \int u_t(x_t|z)p_{1|t}(z|x_t)dz,\ \ \ p_{1|t}(z|x_t) = \frac{p_t(x_t|z)p_{data}(z)}{p_t(x_t)}$$

$u_t(x_t|z)$ denotes the conditional vector field. Simulating the ODE defined by this marginal vector field from initial Gaussian noise yields a trajectory whose marginals are $p_t$.

$$X_0 \sim p_0,\ \ \ \frac{d}{dt} X_t = u_t(X_t)\ \ \ \Rightarrow \ \ \ X_t \sim p_t$$

In diffusion, this ODE sampling scheme is called the probability flow ODE. Data can also be sampled with the time-reversal SDE known from the diffusion literature.

$$X_0 \sim \mathcal{N}(0, I_d),\ \ \ dX_t = \Big[u_t(X_t)+\frac{v_t^2}{2}\nabla\log p_t(X_t)\Big]dt + v_t\,dW_t,\ \ \ v_t^2=2\frac{\dot{\alpha}_t}{\alpha_t}\sigma_t^2-2\sigma_t\dot{\sigma}_t$$

Here $\nabla\log p_t(x_t)$ is the score function, and a dot denotes the time derivative of a scheduler. This is called DDPM sampling, and the score function can be reparameterized in terms of the $u_t$ above. In other words, the same neural network trained for $u_t$ can be used to simulate the SDE.

3 Motivation: Efficient Transitions for Reward Alignment

In inference-time reward alignment, the target is not $p_{data}$ itself. As in post-training, $p_{data}$ is treated as a prior distribution, and the goal is to steer it toward a user-specified objective. The objective $r: \mathbb{R}^d \rightarrow \mathbb{R}$ is called the reward function. The goal is to sample from the reward-tilted distribution below.

$$p^r(z) = \frac{1}{Z_r}p_{data}(z)\exp(r(z))\ \ \ (Z_r > 0)$$

Here $p^r(z)$ assigns a higher likelihood than $p_{data}(z)$ to points $z$ where $r(z)$ is high. Next, I introduce three representative families of reward alignment algorithms and show how each of them depends on $p_{t'|t}$.

3.1 Sequential Monte Carlo (SMC) methods

SMC in FK-steering

This approach uses $p_{t'|t}$ as the proposal distribution. It is easiest to think of a particle filter: $K$ particles $x_t^k$ are propagated as follows.

$$x_{t'}^k \sim p_{t'|t}(\cdot|x_t^k)\ \ \ (0 \leq t < t' \leq 1,\ k=1,\cdots,K)$$

The particles are guided toward the desired tilted distribution through a potential function $G(x_t, x_{t'})$. The three potentials proposed in the FK-steering paper are examples. The particles are then resampled from a multinomial distribution as follows.

$$\underbrace{a_{t'}^k \sim \text{Multinomial}(G(x_t^1,x_{t'}^1),...,G(x_t^K,x_{t'}^K))}_{\text{sample indices}},\ \ \ \ \underbrace{x_{t'}^k=x_{t'}^{a_{t'}^k}}_{\text{reassign particles}}\ \ \ (k=1,\cdots,K)$$

In short, SMC sequentially replaces unpromising particles with promising ones.
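The resampling step above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the FK-steering implementation: the function name `resample_particles` and the toy potentials are assumptions made here, and the potentials are taken as already evaluated.

```python
import numpy as np

def resample_particles(particles, potentials, rng):
    """Multinomial resampling: draw K ancestor indices with probability
    proportional to the potentials, then copy the selected particles."""
    potentials = np.asarray(potentials, dtype=np.float64)
    probs = potentials / potentials.sum()          # normalize G into a distribution
    K = len(particles)
    ancestors = rng.choice(K, size=K, p=probs)     # a^k ~ Multinomial(G^1, ..., G^K)
    return particles[ancestors], ancestors         # x^k <- x^{a^k}

# Toy setup (hypothetical): K = 4 particles in R^3, only particle 2 is "promising".
rng = np.random.default_rng(0)
particles = rng.standard_normal((4, 3))
potentials = np.array([0.0, 0.0, 1.0, 0.0])
new_particles, ancestors = resample_particles(particles, potentials, rng)
```

With all the potential mass on one particle, every resampled particle is a copy of it, which is the extreme case of replacing unpromising particles with promising ones.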

3.2 Search methods

Search methods treat DDPM sampling as rollouts from $p_{t'|t}$ along the branches of a search tree. These search-based methods use a value function, defined as follows.

$$V_t(x_t)=\log \mathbb{E}_{z\sim p_{1|t}(\cdot|x_t)}[\exp(r(z))],\ \ \ \ \text{where}\ p_{1|t}(z|x_t)=\frac{p_t(x_t|z)\,p_{data}(z)}{p_t(x_t)}$$

$V_t$ is used to evaluate each node of the tree, and it depends on the flow matching posterior $p_{1|t}$. So far the only way to sample from a posterior like $p_{1|t}(z|x_t)$ has been to solve an SDE, so this computation is very inefficient. Most papers therefore approximate $V_t(x_t)$, and the approximation can also serve as a potential in the SMC above.
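Given posterior samples $z_i \sim p_{1|t}(\cdot|x_t)$, the value function reduces to a log-mean-exp of their rewards. The sketch below assumes the rewards $r(z_i)$ have already been computed; the name `estimate_value` is made up for this example, and the max-shift is a standard numerical trick rather than something taken from the paper.

```python
import numpy as np

def estimate_value(rewards):
    """Monte Carlo estimate of V_t(x_t) = log E[exp(r(z))] from rewards r(z_i)
    of posterior samples z_i, computed stably by subtracting the maximum."""
    r = np.asarray(rewards, dtype=np.float64)
    m = r.max()
    return m + np.log(np.mean(np.exp(r - m)))

v_const = estimate_value([2.0, 2.0, 2.0])        # constant reward c gives V = c
v_mixed = estimate_value([0.0, np.log(3.0)])     # log((1 + 3) / 2) = log 2
v_large = estimate_value([1000.0, 1000.0])       # no overflow thanks to the shift
```

The quality of this estimate depends entirely on how well the samples follow the posterior, which is why cheap and accurate posterior sampling matters for search methods.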

3.3 Guidance methods

Guidance methods modify the marginal vector field $u_t$ using an intermediate reward function $r_t$.

$$u_t^r(x)=u_t(x)+c_t\nabla r_t(x)\ \ \ (c_t \geq 0)$$

Ideally $r_t(x)$ would be $V_t(x)$, but as mentioned it is very expensive to compute, so it is approximated. Guidance is also used to correct an SMC procedure, or together with SDE sampling. Concretely, a $\nabla r_t(x)$ term is added to the SMC proposal to refine the drift. This idea often appears under names like guided proposal + SMC correction.

3.4 GLASS Flows motivation

The core motivation of this paper is not to propose a novel reward alignment algorithm like the ones above. Instead, the authors optimize the transition that those algorithms use. Their goal is the following.

goal of GLASS Flows

Let's now go through what this means step by step.

4 GLASS Flows

Our goal is to draw $X_{t'}\sim p_{t'|t}(x_{t'}|x_t)$ from the transition kernel. To design this transition kernel, the authors define an inner flow matching model $u_s(\bar{x}_s|x_t,t)$ with a new time variable $s$ ($0 \leq s,t \leq 1$, $\bar{x}_s, x_t \in \mathbb{R}^d$). Written as a formula, the goal is:

$$\bar{X}_0 \sim p_{init},\ \ \ \ \frac{d}{ds}\bar{X}_s = u_s(\bar{X}_s|x_t,t)\ \ \ \ \Rightarrow \bar{X}_1 \sim p_{t'|t}(\cdot|X_t=x_t)$$

We must start exactly from $p_{init}$ and arrive at $s=1$ with a sample from the transition kernel. The authors obtain stochasticity by sampling the initial condition $\bar{X}_0$. The rest of the transition follows an ODE and is therefore deterministic. The pseudocode is shown below.

GLASS Flows pseudo code

Let's look at what this pseudocode means in detail.

4.1 GLASS Transitions

First we need to define the transition family. $X_t$ and $X_{t'}$ are defined so that they have the marginals prescribed by the probability path:

$$X_t \sim \mathcal{N}(\alpha_tz,\sigma_t^2I_d),\ \ \ X_{t'}\sim\mathcal{N}(\alpha_{t'}z,\sigma_{t'}^2I_d)\ \ \ (z\sim p_{data})$$

Given a data point $z \in \mathbb{R}^d$, the mean and variance of each coordinate are fixed. The transition family is left with one degree of freedom, the correlation $-1 \leq \rho \leq 1$ between $X_t$ and $X_{t'}$. Concretely, the mean scale $\mu$ and the covariance $\Sigma$ are defined as follows.

$$\mu= \begin{pmatrix} \mu_1\\ \mu_2 \end{pmatrix} = \begin{pmatrix} \alpha_t\\ \alpha_{t'} \end{pmatrix}, \qquad \Sigma= \begin{pmatrix} \Sigma_{11} & \Sigma_{12}\\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} = \begin{pmatrix} \sigma_t^2 & \rho \sigma_t\sigma_{t'}\\ \rho \sigma_t\sigma_{t'} & \sigma_{t'}^2 \end{pmatrix}$$

The joint distribution of $X = (X_t, X_{t'})^T$ is then defined as follows.

$$\begin{equation} X \sim p_{t,t'}(X\mid z) = \prod_{j=1}^d \mathcal{N}\!\left( (X_t^{j}, X_{t'}^{j});\, z^{j}\mu,\, \Sigma \right), \qquad z=(z^1,\ldots,z^d)^\top \sim p_{\text{data}}. \tag{12} \end{equation}$$

Each coordinate is noised independently and identically. The correlation is defined only across time, not across coordinates. By Bayes' theorem, the transition family $p_{t,t'}(X_t,X_{t'})$ then gives the following conditional distribution.

$$\begin{equation} p_{t'\mid t}(X_{t'}\mid X_t) = \frac{p_{t,t'}(X_t, X_{t'})}{p_t(X_t)}. \tag{13} \end{equation}$$

The paper calls this the GLASS transition. Importantly, setting $\rho = \frac{\alpha_t \sigma_{t'}}{\sigma_t \alpha_{t'}}$ gives $p_{t'|t}^{\text{DDPM}}(X_{t'}|X_t)=p_{t'|t}(X_{t'}|X_t)$, which means the DDPM transition is a special case of the GLASS transition. Here $\rho$ controls the similarity between $X_t$ and $X_{t'}$.

4.2 Constructing the Velocity Field

The denoiser $D_t$ is the posterior expectation of $z$ and can be reparameterized as follows. $$\begin{equation} D_t(x) = \int z\, p_{1\mid t}(z\mid x)\,dz = \frac{1}{\dot{\alpha}_t \sigma_t - \alpha_t \dot{\sigma}_t}\Big(\sigma_t\,u_t(x) - \dot{\sigma}_t\,x\Big). \tag{14} \end{equation}$$

The second equality shows that the denoiser can be obtained by reparameterizing the velocity field $u_t$. Conversely, $u_t$ can be recovered from the denoiser output. From here on, we use $D_t$ to construct the velocity field we want, $u_s(\bar{x}_s|x_t,t)$.

4.2.1 GLASS Denoiser

The GLASS denoiser is defined as the posterior expectation given $x_t$ and $x_{t'}$.

$$\begin{equation} D_{\mu,\Sigma}(x) = \int z \, p(Z=z \mid X=x)\, dz, \qquad x=(x_t,x_{t'}),\;\; x_t,x_{t'}\in\mathbb{R}^d . \tag{15} \end{equation}$$

A noisy $x_t$ is a measurement of $z$. As in DDPM, the standard denoiser $D_t$ is the posterior mean of $z$ given a single Gaussian measurement $x_t$. For the inner model, however, we need the posterior mean of $z$ given two Gaussian measurements $(x_t, x_{t'})$, computed by the denoiser $D_{\mu, \Sigma}$. This is the GLASS denoiser. Since we also want to use only a single denoiser evaluation, the two measurements $(x_t, x_{t'})$ must be summarized efficiently and without losing information. This is where the sufficient statistic comes in, given by:

$$S(x)= \frac{\mu^\top \Sigma^{-1} x}{\mu^\top \Sigma^{-1} \mu}, \qquad x=(x_t,x_{t'})^\top \in \mathbb{R}^{2\times d}$$

In theoretical statistics, $S(x)$ is called a sufficient statistic. It is quite intuitive: written out, $S(x)$ is a weighted sum of $x_t$ and $x_{t'}$, where a measurement that is more informative about $z$ (lower variance, higher scale) gets a larger weight. Now we need to work out how to evaluate the GLASS denoiser. First, the distribution of $S(X) \mid Z$ is $$S(X)\mid Z=z \sim \mathcal{N}\!\left(z,\ \frac{1}{\mu^\top\Sigma^{-1}\mu} I_d\right)$$ The FM denoiser handles measurements of the form $X_t = \alpha_t Z + \sigma_t \epsilon$. To match the input distribution the FM denoiser expects, we first rescale, $Y := \alpha_{t^*}S(X)$, for a time $t^*$ still to be chosen. Then $Y\mid Z=z \sim \mathcal{N}\!\left(\alpha_{t^*}z,\ \alpha_{t^*}^2(\mu^\top\Sigma^{-1}\mu)^{-1} I_d\right)$, and choosing $t^*$ so that the variance of $Y \mid Z$ equals $\sigma_{t^*}^2$ gives $$\alpha_{t^*}^2(\mu^\top\Sigma^{-1}\mu)^{-1}=\sigma_{t^*}^2 \quad\Longleftrightarrow\quad \frac{\sigma_{t^*}^2}{\alpha_{t^*}^2}=(\mu^\top\Sigma^{-1}\mu)^{-1}$$

Therefore, with the invertible function $g(t) = \sigma_t^2 / \alpha_t^2$ and $t^*=g^{-1}\!\left((\mu^\top\Sigma^{-1}\mu)^{-1}\right)$, the GLASS denoiser can be evaluated with the existing denoiser, $D_{\mu,\Sigma}(x) = D_{t^*}(\alpha_{t^*}S(x))$, just as in ordinary DDPM sampling.
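To make the sufficient statistic and the matching time $t^*$ concrete, here is a small NumPy sketch. It assumes the CondOT scheduler $\alpha_t = t$, $\sigma_t = 1 - t$ for the outer model (an assumption for this example, under which $g^{-1}(v) = 1/(1+\sqrt{v})$), and the function names are hypothetical.

```python
import numpy as np

def sufficient_statistic(x_t, x_tp, mu, Sigma):
    """S(x) = mu^T Sigma^{-1} x / (mu^T Sigma^{-1} mu) for x = (x_t, x_t').
    Also returns the variance of S(X) | Z, i.e. 1 / (mu^T Sigma^{-1} mu)."""
    w = np.linalg.solve(Sigma, mu)          # Sigma^{-1} mu, shape (2,)
    precision = mu @ w                      # mu^T Sigma^{-1} mu
    x = np.stack([x_t, x_tp])               # shape (2, d)
    return (w @ x) / precision, 1.0 / precision

def t_star_condot(var):
    """Invert g(t) = sigma_t^2 / alpha_t^2 = (1 - t)^2 / t^2 (CondOT assumption)."""
    return 1.0 / (1.0 + np.sqrt(var))

t, tp, rho = 0.3, 0.6, 0.5
alpha_t, sigma_t, alpha_tp, sigma_tp = t, 1.0 - t, tp, 1.0 - tp
mu = np.array([alpha_t, alpha_tp])
Sigma = np.array([[sigma_t**2, rho * sigma_t * sigma_tp],
                  [rho * sigma_t * sigma_tp, sigma_tp**2]])

# Noise-free measurements of the same z: the statistic must return z exactly.
rng = np.random.default_rng(0)
z = rng.standard_normal(5)
s_clean, var = sufficient_statistic(alpha_t * z, alpha_tp * z, mu, Sigma)
t_star = t_star_condot(var)
```

In this setting $t^*$ comes out slightly above $t'$: two measurements carry at least as much information about $z$ as either one alone, so the combined measurement looks like a less noisy time point.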

4.2.2 GLASS Velocity Field

Next we show that the GLASS velocity field $u_s(\bar{x}_s|x_t,t)$ is a reparameterization of the GLASS denoiser. Recall that $p_{t,t'}(x_t,x_{t'}|z)$ is Gaussian, so its conditional distribution is also Gaussian:

$$\begin{equation}
p_{t'\mid t}(x_{t'} \mid x_t, z)
=
\mathcal{N}\!\left(x_{t'};\, \bar{\alpha} z + \bar{\gamma} x_t,\ \bar{\sigma}^2 I_d\right).
\tag{16}
\end{equation}$$

$$\begin{equation}
\bar{\gamma} = \rho \frac{\sigma_{t'}}{\sigma_t},
\qquad
\bar{\alpha} = \alpha_{t'} - \bar{\gamma}\alpha_t,
\qquad
\bar{\sigma}^2 = \sigma_{t'}^2(1-\rho^2).
\tag{17}
\end{equation}$$

Building on this and marginalizing over $z$ yields the following Gaussian probability path and its marginal.

$$\begin{equation}
p_s(\bar{x}_s \mid x_t, z)
=
\mathcal{N}\!\left(\bar{x}_s;\, \bar{\alpha}_s z + \bar{\gamma} x_t,\ \bar{\sigma}_s^2 I_d\right),
\qquad
p_s(\bar{x}_s \mid x_t)
=
\int p_s(\bar{x}_s \mid x_t, z)\, p_{1\mid t}(z\mid x_t)\, dz.
\tag{18}
\end{equation}$$

The schedulers $\bar{\alpha}_s, \bar{\sigma}_s$ here satisfy $\bar{\alpha}_0 = 0, \bar{\alpha}_1 = \bar{\alpha}, \bar{\sigma}_1 = \bar{\sigma}, \bar{\sigma}_0^2 > 0$. Under these conditions the marginal probability path interpolates between noise and the GLASS transition. The authors note that a natural choice makes $p_s(\bar{x}_s|x_t,z)$ an optimal transport path (CondOT scheduler), i.e. $\bar{\alpha}_s=s\bar{\alpha}, \bar{\sigma}_s = (1-s)\bar{\sigma}_0 + s\bar{\sigma}$.

GLASS Flows. Theorem 1.

Now for Theorem 1 of the paper. It shows that any flow matching or diffusion model can be sampled with GLASS transitions through the inner flow matching model $u_s(\bar{x}_s|x_t,t)$. Let's go through it. From equation 18, the intermediate probability path is $p_s(\bar{x}_s|x_t,z) = \mathcal{N}(\bar{x}_s; \bar{\alpha}_s z + \bar{\gamma}x_t, \bar{\sigma}_s^2 I)$, which can be reparameterized for sampling as $\bar{X}_s = \bar{\alpha}_s Z + \bar{\gamma}x_t + \bar{\sigma}_s\epsilon,\ \ \ \epsilon \sim \mathcal{N}(0, I)$. Differentiating with respect to $s$ gives:

$$\partial_s \bar{X}_s
=
(\partial_s \bar{\alpha}_s)\, Z
+
(\partial_s \bar{\sigma}_s)\, \epsilon$$

An ODE, however, must be expressed as a function of $\bar{X}_s$ itself rather than of the implicit noise $\epsilon$, so we rewrite $\epsilon$ in terms of $\bar{X}_s$.

$$\epsilon = \frac{\bar X_s - \bar\alpha_s Z - \bar\gamma x_t}{\bar\sigma_s}$$

Substituting this back into the derivative gives

$$\partial_s \bar X_s
= (\partial_s \bar\alpha_s)\, Z
+ (\partial_s \bar\sigma_s)\,
\frac{\bar X_s - \bar\alpha_s Z - \bar\gamma x_t}{\bar\sigma_s}$$

Rearranging, we finally get:

$$\partial_s \bar X_s
=
\left(\frac{\partial_s \bar\sigma_s}{\bar\sigma_s}\right)\bar X_s
+
\left(\partial_s \bar\alpha_s
- \bar\alpha_s \frac{\partial_s \bar\sigma_s}{\bar\sigma_s}\right) Z
+
\left(-\bar\gamma \frac{\partial_s \bar\sigma_s}{\bar\sigma_s}\right) x_t$$

That is,

$$\partial_s \bar X_s
= w_1(s)\bar X_s + w_2(s) Z + w_3(s) x_t$$

which has the same form as equation 21 in the paper. In actual sampling the latent $Z$ is not known, so we take a conditional expectation to obtain the average velocity at the current state.

$$u_s(\bar x_s \mid x_t, t)
:= \mathbb E\left[
\partial_s \bar X_s \mid \bar X_s = \bar x_s, X_t = x_t
\right]$$

By linearity of expectation,

$$u_s(\bar x_s \mid x_t, t)
= w_1(s)\bar x_s+w_2(s) \mathbb E[Z \mid X_t = x_t, \bar X_s = \bar x_s]+w_3(s) x_t$$

Here

$$D_{\mu(s),\Sigma(s)}(x_t,\bar x_s)
:= \mathbb E[Z \mid X_t = x_t, \bar X_s = \bar x_s]$$

is exactly the GLASS denoiser discussed above. So in the end we obtain

$$u_s(\bar x_s \mid x_t, t)
= w_1(s)\bar x_s+w_2(s) D_{\mu(s),\Sigma(s)}(x_t,\bar x_s)+w_3(s) x_t$$

which is equation 19 of Theorem 1. Computing the scale and variance from the conditional probability path defined earlier gives equation 20, and equation 21 lists the coefficients of each term just above. Therefore, as long as the schedulers $\bar{\alpha}_s, \bar{\sigma}_s, \bar{\gamma}$ are well defined, the final sample $\bar{X}_1$ and the inner probability path of $\bar{X}_s$ can be obtained from the following ODE.

$$\bar{X}_0 \sim \mathcal{N}(\bar{\gamma}x_t, \bar{\sigma}_0^2 I_d),\ \ \ \frac{d}{ds}\bar{X}_s = u_s(\bar{X}_s|x_t,t)$$

Solving the ODE with this velocity gives the inner flow matching trajectory $\bar{X}_s \sim p_s(\cdot|x_t)$ for $0 \leq s \leq 1$. Because no additional training is needed and the result rests on the sufficient statistic of Gaussian measurements, the method is named Gaussian Latent Sufficient Statistic (GLASS) Flows.
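The derivation can be sanity-checked on a toy case where the GLASS denoiser is known in closed form. The sketch below assumes the data distribution is a point mass at `z0`, so that $\mathbb{E}[Z \mid \cdot] = z_0$ stands in for the network; it also assumes the CondOT inner schedule from above and an arbitrary $\bar{\sigma}_0 = 1$. All function names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

def inner_velocity(x_bar, s, x_t, denoised, alpha_bar, gamma_bar, sigma_bar, sigma_bar0):
    """u_s = w1 * x_bar + w2 * D + w3 * x_t with the CondOT inner schedule
    alpha_bar_s = s * alpha_bar, sigma_bar_s = (1 - s) * sigma_bar0 + s * sigma_bar."""
    alpha_s, d_alpha_s = s * alpha_bar, alpha_bar
    sigma_s = (1.0 - s) * sigma_bar0 + s * sigma_bar
    d_sigma_s = sigma_bar - sigma_bar0
    w1 = d_sigma_s / sigma_s
    w2 = d_alpha_s - alpha_s * w1
    w3 = -gamma_bar * w1
    return w1 * x_bar + w2 * denoised + w3 * x_t

def glass_transition_toy(x_t, z0, alpha_bar, gamma_bar, sigma_bar, sigma_bar0, M, rng):
    """Sample X_bar_0 ~ N(gamma_bar * x_t, sigma_bar0^2 I), then take M Euler steps."""
    eps = rng.standard_normal(x_t.shape)
    x_bar = gamma_bar * x_t + sigma_bar0 * eps      # the only source of randomness
    h = 1.0 / M
    for k in range(M):                              # evaluate at s = 0, h, ..., 1 - h
        x_bar = x_bar + h * inner_velocity(x_bar, k * h, x_t, z0,
                                           alpha_bar, gamma_bar, sigma_bar, sigma_bar0)
    return x_bar, eps

# Outer CondOT scheduler (assumption): alpha_t = t, sigma_t = 1 - t.
t, tp, rho = 0.3, 0.6, 0.4
gamma_bar = rho * (1.0 - tp) / (1.0 - t)
alpha_bar = tp - gamma_bar * t
sigma_bar = (1.0 - tp) * np.sqrt(1.0 - rho**2)

rng = np.random.default_rng(0)
z0 = np.array([1.0, -2.0, 0.5])
x_t = t * z0 + (1.0 - t) * rng.standard_normal(3)
x_next, eps = glass_transition_toy(x_t, z0, alpha_bar, gamma_bar, sigma_bar, 1.0, M=8, rng=rng)
expected = alpha_bar * z0 + gamma_bar * x_t + sigma_bar * eps
```

In this toy case the endpoint equals $\bar{\alpha} z_0 + \bar{\gamma} x_t + \bar{\sigma}\epsilon$, a sample from the Gaussian in equation 16. Euler is exact here only because the schedule is linear in $s$ and the denoiser is constant; with a real network the number of steps $M$ controls the discretization error.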

Sampling with GLASS Flows

Instead of generating in one ODE integration, the authors propose a new sampler that splits the time interval into $K$ segments and chains together a stochastic Markov transition sampled on each segment. The gist of Theorem 1 is that solving the inner ODE built by GLASS exactly makes $\bar{X}_1$ a sample from the desired transition $p_{t'|t}(\cdot|x_t)$. So if $X_{t_k}\sim p_{t_k}$ at some step, then $X_{t_{k+1}}\sim p_{t_{k+1}}$ at the next.

Another key property is that the marginals are preserved for any choice of $\rho$. The GLASS transition builds the joint Gaussian of $(X_t, X_{t'})$ with variance $\sigma_t^2$ for $X_t$, variance $\sigma_{t'}^2$ for $X_{t'}$, and only the correlation between them controlled by $\rho$. So $\rho$ changes how strongly the two time points move together, but the marginals $p_t, p_{t'}$ themselves are fixed by the schedulers $\alpha_t, \sigma_t$. Since GLASS only varies $\rho$ within the family of joint Gaussians with those fixed marginals, every step stays on the marginal path $p_{t_k}$ regardless of $\rho$.
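The claim that $\rho$ does not change the marginals can be checked directly from the coefficients in equation 17. The sketch below assumes the CondOT scheduler $\alpha_t = t$, $\sigma_t = 1 - t$ purely for concreteness; the helper names are made up for this example.

```python
import numpy as np

def glass_coefficients(alpha_t, sigma_t, alpha_tp, sigma_tp, rho):
    """Coefficients of p(x_t' | x_t, z) = N(alpha_bar z + gamma_bar x_t, sigma_bar^2 I)."""
    gamma_bar = rho * sigma_tp / sigma_t
    alpha_bar = alpha_tp - gamma_bar * alpha_t
    sigma_bar_sq = sigma_tp**2 * (1.0 - rho**2)
    return gamma_bar, alpha_bar, sigma_bar_sq

def implied_marginal(alpha_t, sigma_t, gamma_bar, alpha_bar, sigma_bar_sq):
    """Mean scale and variance of X_t' | z after plugging in X_t = alpha_t z + sigma_t eps."""
    mean_scale = alpha_bar + gamma_bar * alpha_t
    variance = gamma_bar**2 * sigma_t**2 + sigma_bar_sq
    return mean_scale, variance

t, tp = 0.3, 0.6
alpha_t, sigma_t, alpha_tp, sigma_tp = t, 1.0 - t, tp, 1.0 - tp
rho_ddpm = alpha_t * sigma_tp / (sigma_t * alpha_tp)   # the DDPM special case

results = {}
for rho in (-0.5, 0.0, rho_ddpm, 0.4, 0.9):
    coeffs = glass_coefficients(alpha_t, sigma_t, alpha_tp, sigma_tp, rho)
    results[rho] = implied_marginal(alpha_t, sigma_t, *coeffs)
```

For every $\rho$ the implied mean scale is $\alpha_{t'}$ and the implied variance is $\sigma_{t'}^2$; only the dependence on $x_t$ changes.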

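Standard ODE sampling is fully deterministic once it starts, whereas GLASS samples a transition on every segment, so the whole process is a Markov chain. This Markov chain keeps the marginal $p_t$ that the pre-trained model has already learned at every time point. The authors therefore argue that it provides the stochasticity needed by reward-guiding methods such as stochastic exploration, SMC, and search, without breaking the distribution path the model originally learned.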

The cost is $K \cdot M$ function evaluations. This is easy to see from Algorithm 1 above: sampling one transition integrates the inner ODE over $s \in [0,1]$ in $M$ steps, each of which essentially evaluates the pre-trained network once, so $K$ transitions cost roughly $K \cdot M$ NFE in total. The appendix also notes that with $M = 1$, where each transition is a single inner ODE step, one update reduces to a DDIM-style Gaussian update. Of course, this marginal preservation holds exactly only under the idealized assumptions that there is no discretization error and that the model knows the exact $u_t$!

Implementation

This section covers where GLASS Flows can become numerically unstable in practice, and how CFG, other parameterizations, and discrete-time models are handled. At $s=0$, $\bar{X}_0 = \bar{\gamma}x_t + \bar{\sigma}_0\epsilon$. Because $\bar{X}_0$ is independent of $z$ given $x_t$, the posterior collapses to $p(z|x_t,\bar{x}_0) = p(z|x_t)$. Forcing the computation through the sufficient statistic $S(x)$ can amplify instabilities in the denominator, the matrix inverse, and the scheduler ratios. In practice, the implementation therefore needs a branch that skips the GLASS denoiser at $s=0$ and simply returns $D_t(x_t)$, i.e. $D_{\mu(0), \Sigma(0)}(x_t,\bar{x}_0) = D_t(x_t)$.

Numerical stability of inverting the $2 \times 2$ covariance $\Sigma(s)$

By construction $\det\Sigma (s) = \sigma_t^2\bar{\sigma}_s^2$, so $\Sigma(s)$ is in theory always invertible when $\sigma_t > 0$ and $\bar{\sigma}_s > 0$. In practice, however, $\sigma_t \rightarrow 0$ as $t \rightarrow 1$, and $\bar{\sigma}_s \rightarrow 0$ as $s \rightarrow 1$ when $t' = 1$ or $\rho \rightarrow \pm 1$, so the computation can become unstable in float32.

For this reason, it is recommended to do everything except the network forward pass in float64. Other remedies are adding a small diagonal jitter, $\Sigma (s) \leftarrow \Sigma (s) + \epsilon I$, or evaluating the ODE only up to $s \leq 1 - \frac{1}{M}$ rather than exactly at $s=1$, which avoids the $\bar{\sigma}_s = 0$ singularity (reflected in Algorithm 1).

Most large-scale T2I FM/diffusion models use CFG, so GLASS uses it as well, with one important implementation principle. $$u_t^w (x|c) = (1+w)u_t(x|c) - w u_t(x)$$

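This expression is simply treated as a new ground-truth vector field, and every GLASS computation is carried out consistently with $u_t^w$. The reason is that GLASS repeatedly uses the $u_t \leftrightarrow D_t$ reparameterization, as seen above; if CFG were mixed in only at the end, the relationship between $D_t$ and $u_t$ would break and the inner model could drift. All computations are therefore done as follows.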

$$D_t^{(w)}(x\mid c)
=
\frac{1}{\dot{\alpha}_t\sigma_t-\alpha_t\dot{\sigma}_t}\big(\sigma_t u_t^{(w)}(x\mid c)-\dot{\sigma}_t x\big)$$
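A minimal sketch of this order of operations: first form the CFG velocity, then convert it to a denoiser with equation 14. The CondOT scheduler ($\alpha_t = t$, $\sigma_t = 1 - t$, so $\dot{\alpha}_t = 1$, $\dot{\sigma}_t = -1$) and the function names are assumptions made for this example, and the "network outputs" are replaced by the known conditional velocity $z - \epsilon$ of a single data point.

```python
import numpy as np

def cfg_velocity(u_cond, u_uncond, w):
    """Classifier-free guidance: u^w = (1 + w) * u(x | c) - w * u(x)."""
    return (1.0 + w) * u_cond - w * u_uncond

def denoiser_from_velocity(u, x, alpha, sigma, d_alpha, d_sigma):
    """Equation 14: D = (sigma * u - d_sigma * x) / (d_alpha * sigma - alpha * d_sigma)."""
    return (sigma * u - d_sigma * x) / (d_alpha * sigma - alpha * d_sigma)

# Toy check under CondOT: x_t = t z + (1 - t) eps has conditional velocity z - eps,
# so converting that velocity back must return z.
rng = np.random.default_rng(0)
t = 0.35
z, eps = rng.standard_normal(4), rng.standard_normal(4)
x_t = t * z + (1.0 - t) * eps
u = z - eps
d_hat = denoiser_from_velocity(u, x_t, alpha=t, sigma=1.0 - t, d_alpha=1.0, d_sigma=-1.0)

# With guidance, the conversion is applied to the guided velocity as a whole.
u_guided = cfg_velocity(u_cond=u, u_uncond=np.zeros(4), w=2.0)
d_guided = denoiser_from_velocity(u_guided, x_t, t, 1.0 - t, 1.0, -1.0)
```

Because equation 14 is affine in $u$, converting the guided velocity yields one denoiser that stays consistent with $u_t^w$ everywhere it is used inside GLASS.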

What about other parameterizations (denoiser, score, $\epsilon$-prediction) and discrete-time diffusion? The GLASS algorithm assumes a velocity field of the form $u_t(x)$, but actual diffusion implementations come in several forms: the score $\nabla \log p_t(x)$, the denoiser $D_t(x)$, the noise predictor $\epsilon_{\theta}$, and so on. The authors say that whatever the form, one first reparameterizes it into $D_t(x)$ and then builds the GLASS denoiser and velocity (this is a general property). In the discrete-time case, the time at which GLASS queries the network, $t^* = g^{-1}((\mu^T\Sigma^{-1}\mu)^{-1})$, is continuous-valued, so feeding the network a $t$ outside the discrete grid it was trained on can be out-of-domain. To address this, the inner $s$ steps are restricted or chosen so that $t^*$ lands on the grid.

4.3 Inference-time Reward Alignment With GLASS

Sequential Monte Carlo (SMC)

$$x_{t'}^{(k)} \sim p_{t'|t}(\cdot|x_t^{(k)})$$

Existing SMC-based methods were slow because they used SDE sampling to propagate particles. With GLASS Flows, each particle transition can be sampled with an ODE, so the same compute budget yields more, or more accurate, transition samples. Discretization error also decreases, which in turn helps the resampling/reweighting in SMC work properly.

Value function estimation

$$V_t(x_t) = \log \mathbb{E}_{z \sim p_{1|t}(\cdot | x_t)}[\text{exp}(r(z))]$$

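As mentioned, search-type methods use a value function to evaluate nodes. The key point is that a good estimate requires samples from the posterior $p_{1|t}(z|x_t)$, and this posterior sampling used to be SDE-based and therefore very inefficient. The value is estimated approximately by Monte Carlo, using the following identity.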

$$p_{1|t}(z\mid x_t)=p^{\mathrm{DDPM}}_{t'=1\mid t}(X_1=z\mid X_t=x_t)$$

GLASS Flows, however, make this posterior/transition sampling fast with an ODE, so $V_t$ can be estimated more accurately by Monte Carlo, which leaves more room to improve node selection and pruning in search.

Reward guidance

$$u_t^r (x) = u_t (x) + c_t \nabla r_t (x)$$

The authors explain that guidance can be done analogously by adding a reward term to the inner GLASS velocity field $u_s (\bar{x}_s|x_t,t)$.

5 Related Work

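Classical discrete-time diffusion (e.g., DDPM) can usually be understood as a first-order approximation of a continuous ODE/SDE. GLASS differs in that, instead of such small-step approximations, it directly targets and samples the transition $p_{t'|t}$ between far-apart times $t < t'$.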

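As the paper notes, one form of supervision in Transition Matching (TM) can be interpreted as the special case $\rho = 1$ of the GLASS transition. However, TM is centered on pre-training and architecture changes, whereas GLASS is about transforming a pre-trained model at inference time, so the authors argue that the problem settings are different.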

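The relation to TADA is also discussed. Both share the core mathematical idea of using Gaussian conjugacy / sufficient statistics to recover new dynamics or samplers from a pre-trained model without retraining. TADA mainly uses state augmentation to accelerate sampling, while GLASS focuses on providing the stochastic transitions needed for reward alignment at ODE efficiency.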

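Reward finetuning (GRPO, SOC, some RL methods) is also mentioned. These methods draw diverse samples during training, evaluate their rewards, and update with that signal. DDPM/SDE sampling is commonly used for this, and the authors note that GLASS Flows could make the parts that had to rely on SDEs for exploration during training more efficient.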

6 Experiments

6.1 Efficient Posterior Sampling and Value Function Estimation

This section shows experimentally whether GLASS Flows actually make the following two tasks, both important for reward alignment, more efficient.

posterior sampling: sample $z$ from $p_{1|t}(z|x_t)$
value function estimation: estimate $V_t(x)$ accurately

The methods compared are as follows.

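DDPM sampling (SDE-based): random noise is injected at every step when sampling a transition
GLASS Flows (ODE-based transition sampling): a random initial value is drawn only at the start and the rest proceeds deterministically via an ODE, yet the result is designed to follow the target transition distribution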

The benchmark is to take a noisy image $x_t$ and resample the original $z$.

posterior sampling experiments

First an image is sampled from the ImageNet model, $z \sim p_{data}$, noise is added via $x \sim p_t(\cdot|z)$, and the posterior is estimated with multiple samples $z' \sim p_{1|t}(\cdot|x)$. In the posterior sampling examples on the left, the number of GLASS Flow inner steps is $M=6$; the paper does not state it explicitly, but $K = 1$ is used, together with 200 Monte Carlo samples. A small $M$ introduces ODE discretization error, but empirically the error is much smaller than that of the SDE (DDPM) at the same $M$, so GLASS Flows recover the posterior better, as the left figure shows.

In theory both methods can match the posterior with enough steps ($M=30$). At a small $M = 10$ and $t=0.2$, however, GLASS Flows achieve a much lower FID and a higher (Pearson) correlation than DDPM on posterior recovery and value function estimation.

6.2 Novel Sampling Method

This section goes beyond posterior sampling and compares the samplers on a T2I model.

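ODE sampling: once the initial noise is drawn, the rest of the path is deterministic
DDPM sampling: stochastic sampling with noise injected at every step. It provides transitions naturally, but empirically it gives lower quality than the ODE at equal NFE, or needs more steps.
GLASS Flows: integrates like an ODE on the outside, but creates stochasticity from the inner random initialization of each transition, providing transition samples that can branch like SDE transitions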

novel sampling methods experiments

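With the default FLUX parameters there is a large performance gap between ODE and DDPM sampling: at the same NFE (e.g. 50), DDPM tends to produce blurrier or lower-quality images. GLASS Flows ($\rho=0.4$) close that gap, keeping (or replacing) the transition stochasticity of DDPM while restoring quality to the ODE level. Despite using stochastic transitions, they perform on par with ODE sampling. In other words, GLASS Flows remove the efficiency <-> stochasticity tradeoff!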

6.3 Sequential Monte Carlo Experiments

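Section 6.2 picked $\rho = 0.4$ as the best-fitting setting for the FLUX model, and Section 6.3 uses that setting to plug GLASS Flows into SMC-based alignment, specifically Feynman-Kac Steering (FKS), and evaluate performance and efficiency. Before that, let's briefly look at the meaning and intuition of $\rho$ in SMC/FKS.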

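Large $\rho$: the next state of a particle stays closer to its current state, so branches spread less (less exploration diversity), but the process can be more stable locally.
Small $\rho$: particles spread more widely (more exploration diversity), but if it is too small, quality and stability can suffer.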

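Based on the observation that $\rho = 0.4$ gives a good balance between quality (GenEval, etc.) and exploration (alignment performance), the authors use it as the base setting in the subsequent SMC experiments. SMC/FKS performance, however, depends not only on the accuracy of the sampler but also on factors such as the following.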

The probability path and error characteristics of the base generative model
The form of the reward model
The number of particles and the resampling frequency
The time gap of one transition ($t' - t$)

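So the authors note that although $\rho = 0.4$ was best for FLUX, a different $\rho$ may be optimal for other backbones, schedulers, resolutions, or domains. Section 6.3 then examines how the efficiency-performance tradeoff changes, in reward score and in external quality metrics such as GenEval, when the same FKS algorithm samples its transitions with the SDE (DDPM) versus GLASS Flows.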

SMC via FKS experiments

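As expected, FKS scores better than BoN on the quantitative metrics. What deserves a closer look is the SMC experiment comparing DDPM, ODE, and GLASS. Here are the key takeaways. First, FKS with the SDE does not beat BoN-ODE: even with FKS, a better alignment algorithm, sampling transitions with the SDE (DDPM) can cut GenEval substantially or ruin the overall efficiency-quality balance. Second, comparing FKS-GLASS (CLIP 39.8, GenEval 72.6) with BoN-ODE (GenEval 70.6) and FKS-SDE (GenEval 64.1) shows that GLASS keeps the stochastic transition sampling FKS needs while raising sample quality to the ODE level, so that both reward and GenEval improve. Finally, GLASS is not merely a swap of DDPM for an ODE; it provides the foundation that turns the advantages of SMC (FKS) into actual performance.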

GLASS-FKS with gradient guidance

There are also results for FKS + GLASS + $\nabla$, which is FKS-GLASS with an added gradient guidance term that nudges each FKS step slightly in the direction of increasing reward. Note that this requires the reward to be differentiable.

7 Conclusion

GLASS Flows are a method for sampling the Markov transition $p_{t'|t}(x_{t'}|x_t)$ of flow/diffusion models with an ODE instead of an SDE. To do so, they construct an inner flow matching model $u_s(\bar{x}_s|x_t,t)$, draw only the initial value at random, and then evolve quickly with a deterministic ODE.

Going forward, I should read the papers mentioned here, such as Transition Matching and TADA, and also read more learning-based reward alignment papers.

[ Deep Learning Paper Review - PRMI Lab ] - TRPO & PPO Explained with a Code Implementation

Lee현서 — Sun, 8 Feb 2026 14:42:20 +0900

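Lately I have seen many generative modeling papers that incorporate reinforcement learning. So this break I set out to learn what RL methods such as PPO, TRPO, and GRPO are, and to go through the code in detail. I kept coming home exhausted from my ETRI internship and collapsing, so this post is only getting written now. This time I worked carefully through Prof. Seungsang Oh's reinforcement learning lectures at Korea University, and I will look at TRPO, which I was most interested in, how PPO evolved from TRPO, and how PPO is implemented in code and what results it gives.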

Prof. Seungsang Oh's reinforcement learning lectures: https://www.youtube.com/watch?v=c15b9AjHxBA&list=PLvbUC2Zh5oJtYXow4jawpZJ2xBel6vGhC&index=27

Trust Region Policy Optimization (TRPO)

TRPO (UC Berkeley 2015): https://arxiv.org/abs/1502.05477


Policy-gradient methods such as DDPG have the drawback that the model does not converge well when the update step size is too large. TRPO addresses this by introducing the notion of a trust region ($\delta$).

We update $\eta(\pi)$ using the advantage $A_{\pi_{old}}$ and $\eta(\pi_{old})$. This gives $$\eta(\pi) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)$$ an update expressed through the state visitation frequency ($\rho$). If $\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a) \geq 0$ for every state $s$, then $\eta$ increases monotonically. But once we approximate with a neural network, this term can become negative for some states and the condition breaks. So at this point it is unclear whether monotonic improvement holds. Moreover, computing $\rho_{\pi}(s)$ requires many samples from the new policy, which is impossible here because the new policy is exactly what we are looking for.

So we make a local approximation, replacing $\rho_{\pi}$ with $\rho_{\pi_{old}}$, which defines $L_{\pi_{old}}(\pi) = \eta(\pi_{old}) + \sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)$. The approximation and the original have an important relationship: at $\pi_{\theta_0}$ they have the same value and the same first derivative. Both facts follow from the observation that the expectation of $A_{\pi_{old}}(s,a)$ over $a \sim \pi_{old}$ in $\sum_{s}\rho_{\pi_{old}}(s)\sum_{a}\pi(a|s)A_{\pi_{old}}(s,a)$ is 0. Consequently, for a sufficiently small step $\pi_{old} \rightarrow \pi$, an improvement in $L_{\pi_{old}}$ implies an improvement in $\eta$. We still do not know, however, how large a step is allowed.

Next comes the idea of the conservative policy iteration update. To handle this issue, instead of using the new policy directly, a mixture policy $\pi = (1-\alpha)\pi_{old} + \alpha\pi^{'}$ is introduced, where $\pi^{'}$ is the policy that maximizes $L_{\pi_{old}}$. This gives the lower bound $$\eta(\pi) \geq L_{\pi_{old}}(\pi) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad \epsilon = \max_{s}|\mathbb{E}_{a\sim\pi^{'}(a|s)}[A_{\pi_{old}}(s,a)]|$$ In practice, however, the mixture policy is hard to use. The reason is that $\epsilon$ involves a maximum over all states, which is very hard to compute when the state space is large. So another way to approximate it is needed.

Using the total variation distance, $\alpha = \max_{s}D_{TV}(\pi_{old}(\cdot|s)||\pi(\cdot|s))$, the lower bound is redefined as $$\eta(\pi) \geq L_{\pi_{old}}(\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\alpha^2, \quad \epsilon = \max_{s,a}|A_{\pi_{old}}(s,a)|$$ and then rewritten in terms of the KL divergence: $$\eta(\pi) \geq L_{\pi_{old}}(\pi) - CD_{\text{KL}}^{\text{max}}(\pi_{old},\pi), \quad C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$ This follows directly from $D_{TV}(p||q)^2 \leq D_{KL}(p||q)$.

Next, the Minorization-Maximization (MM) algorithm uses the lower bound as a surrogate objective, which guarantees that the policy improves monotonically. I will not go into the details of the MM algorithm.

Concretely, taking the lower bound $L_{\pi_{old}}(\pi) - CD_{\text{KL}}^{\text{max}}(\pi_{old},\pi)$ as the surrogate objective function $M_{i}(\pi)$ and running the MM algorithm as in the lecture slides yields monotonic improvement. In practice, however, finding the optimal policy takes a great many iterations, so TRPO in this form is computationally heavy. In addition, as the discount factor $\gamma \rightarrow 1$, $C$ grows, so $D_{\text{KL}}^{\text{max}}(\theta_{old}, \theta)$ must become small. This means the new and old policies must stay close, so the gradient step size must shrink, which again costs a lot of computation.

By Lagrangian duality, the surrogate objective can be expressed as a KL-constrained objective: $$\max_{\theta}L_{\theta_{old}}(\theta)\ \ \ \text{subject to }\ \ D_{\text{KL}}^{\text{max}}(\theta_{old},\theta)\leq \delta$$ With unlimited computation the two forms are exactly equivalent. Since $\delta$ is easier to tune as a hyperparameter than $C$, we train while adjusting $\delta$, and this is what is called the trust region. However, $D_{\text{KL}}^{\text{max}}(\theta_{old},\theta)$ must be evaluated over all states, so the constraint is hard to compute accurately. We therefore use a heuristic approximation, replacing the max with the expectation $\mathbb{E}_{s\sim\rho_{\theta_{old}}}$. Because reinforcement learning relies on Monte Carlo simulation, this expectation can be replaced by sampling.

A sample-based estimate is then made via Monte Carlo simulation, using the single path method. In the objective, $\eta(\theta_{old})$ is dropped because it does not affect the maximizing $\theta$. A factor $\frac{1}{1-\gamma}$ is used to turn the state visitation frequency into a probability distribution. The advantage would have to be evaluated with $a \sim \pi_{\theta}$, but the new policy $\pi_{\theta}$ is unknown, so Monte Carlo simulation with it is impossible. Importance sampling is therefore used to rewrite the expression so that it can be evaluated under the old policy. The paper also replaces $A_{\theta_{old}}$ with the Q-value $Q_{\theta_{old}}$ in the implementation. The final optimization problem is: $$\max_{\theta}\mathbb{E}_{s\sim\rho_{\theta_{old}}, a\sim\pi_{\theta_{old}}}\big[ \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}Q_{\theta_{old}}(s,a)\big]\ \ \ \text{subject to}\ \ \ \mathbb{E}_{s\sim\rho_{\theta_{old}}}[D_{\text{KL}}(\pi_{\theta_{old}}(\cdot|s)||\pi_{\theta}(\cdot|s))]\leq\delta$$

TRPO updates the policy with the natural gradient, so let's see how this natural policy gradient (NPG) works. In short, the $L_{\theta_{old}}$ term is approximated to first order and the $D_{\text{KL}}$ term to second order. The second-order term involves the Hessian $H$: $$H = \nabla_{\theta}^2\bar{D}_{\text{KL}}(\theta_{old}||\theta) = \Big( \frac{\partial^2\bar{D}_{\text{KL}}(\theta_{old}||\theta)}{\partial\theta_i\partial\theta_j} \Big) \Big|_{\theta=\theta_{old}}$$ which is the Fisher Information Matrix. In practice $H$ is estimated as an average over $N$ samples. In the lecture slides both $L_{\theta_{old}}$ and $D_{\text{KL}}$ are expanded to second order, but after dropping the terms that do not affect the update of $\theta$ and the terms that are zero, only the first-order and second-order terms, respectively, remain.

With $H$, the natural gradient $H^{-1}\nabla_{\theta}L_{\theta_{old}}(\theta)\big|_{\theta=\theta_{old}}$ is the steepest update direction. Because $H$ reflects curvature, it lets $\theta$ be updated in a better direction. The update is $\theta = \theta_{old} + \beta\cdot H^{-1}\cdot g$, where $g = \nabla_{\theta}L_{\theta_{old}}(\theta)\big|_{\theta=\theta_{old}}$. Rewriting the constraint with this update gives $\frac{1}{2}(\beta\cdot H^{-1}\cdot g)^T H (\beta\cdot H^{-1}\cdot g) \leq \delta$. So the largest learning rate $\beta$ that satisfies the constraint is the one for which $\frac{1}{2}(\beta\cdot H^{-1}\cdot g)^T H (\beta\cdot H^{-1}\cdot g) = \delta$.

That is, setting the learning rate to $\beta = \sqrt{\frac{2\delta}{g^TH^{-1}g}}$ satisfies the trust region. NPG converges quickly, but computing and inverting $H$ is too expensive. Instead, we solve $\min_{x}f(x) = \frac{1}{2}x^T H x - g^T x$, whose minimizer is $x = H^{-1}g$. Since $H$ is a positive-definite matrix (so $f$ is convex), conjugate gradient finds the minimizer of the quadratic $f(x)$ faster than plain gradient descent.
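The two ingredients above, conjugate gradient for $H^{-1}g$ and the maximal step size $\beta$, can be sketched with NumPy. This is a textbook conjugate gradient on a small explicit matrix, written for illustration only: real TRPO code uses Hessian-vector products instead of forming $H$, and the random $H$, $g$, and function names here are assumptions.

```python
import numpy as np

def conjugate_gradient(hvp, g, iters=50, tol=1e-10):
    """Solve H x = g using only matrix-vector products hvp(v) = H v (H must be SPD)."""
    x = np.zeros_like(g)
    r = g.copy()                  # residual g - H x, with x = 0
    p = r.copy()                  # search direction
    rs = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        a = rs / (p @ Hp)
        x = x + a * p
        r = r - a * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def max_step_size(g, Hinv_g, delta):
    """Largest beta with 0.5 * (beta H^{-1} g)^T H (beta H^{-1} g) = delta."""
    return np.sqrt(2.0 * delta / (g @ Hinv_g))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A.T @ A + np.eye(5)           # toy symmetric positive-definite "Fisher matrix"
g = rng.standard_normal(5)        # toy policy gradient
delta = 0.01

Hinv_g = conjugate_gradient(lambda v: H @ v, g)
beta = max_step_size(g, Hinv_g, delta)
step = beta * Hinv_g              # theta = theta_old + step
quadratic_kl = 0.5 * step @ H @ step
```

The quadratic KL approximation of the resulting step sits exactly on the trust-region boundary $\delta$.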

Finally, since the update went through many approximations, updating the policy with this $\beta$ may not satisfy the constraint. A line search is therefore performed: $\beta$ is shrunk by repeatedly multiplying by $\alpha$ ($0 < \alpha < 1$) while checking the KL constraint, and the first $\beta$ that satisfies it is used.
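A backtracking line search of this kind might look like the sketch below. The quadratic `kl_fn` is a stand-in for the sampled KL divergence between the old and new policies, and all names are hypothetical. Only the KL check described above is implemented; many practical implementations additionally require the surrogate objective to improve.

```python
import numpy as np

def backtracking_line_search(kl_fn, theta_old, full_step, delta, shrink=0.5, max_backtracks=10):
    """Shrink the step by `shrink` until the KL constraint kl_fn(theta) <= delta holds.
    Returns the accepted parameters and the number of shrinks (theta_old if none pass)."""
    for i in range(max_backtracks):
        theta = theta_old + (shrink ** i) * full_step
        if kl_fn(theta) <= delta:
            return theta, i
    return theta_old, max_backtracks

# Toy stand-in for the KL divergence: a quadratic around theta_old.
theta_old = np.zeros(2)
kl_fn = lambda theta: 0.5 * np.sum((theta - theta_old) ** 2)
full_step = np.array([4.0, 0.0])          # deliberately too large for delta = 1

theta_new, n_shrinks = backtracking_line_search(kl_fn, theta_old, full_step, delta=1.0)
```

Here the full step (KL 8) and the half step (KL 2) are rejected, and the quarter step (KL 0.5) is accepted.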

Proximal Policy Optimization (PPO)

PPO (OpenAI 2017) paper: https://arxiv.org/abs/1707.06347

In TRPO, the KL constraint is there to keep $\pi_{\theta_{old}}$ and $\pi_{\theta}$ from drifting too far apart. Without such a constraint, the policy ratio $\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$ can grow or shrink sharply and the policy update becomes very unstable. PPO therefore introduces the clipped surrogate objective function.

$$\max_{\theta}L^{\text{CLIP}}(\theta) = \mathbb{E}\big[\min\big(r(\theta)A_{\theta_{old}}(s,a),\ \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon)A_{\theta_{old}}(s,a)\big)\big]$$ Here $r(\theta)$ is the policy ratio above. Replacing the KL constraint with clipping in this way prevents the policy from changing too abruptly. And because $L^{\text{CLIP}}(\theta)$ should be a lower bound on $L^{\text{TRPO}}(\theta)$, the min is taken against the original (unclipped) surrogate value.
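To see what the min and the clip do per sample, here is a small NumPy example with hand-picked ratios and advantages (the numbers are mine, chosen only to hit each case).

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """Per-sample PPO-Clip objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(unclipped, clipped)

ratio = np.array([0.5, 1.0, 1.5, 0.5, 1.5])
adv = np.array([1.0, 1.0, 1.0, -1.0, -1.0])
obj = clipped_surrogate(ratio, adv)
# obj is [0.5, 1.0, 1.2, -0.8, -1.5]
```

With a positive advantage the gain is capped once the ratio exceeds $1+\epsilon$, and with a negative advantage the penalty is kept in full when the ratio grows, so the result never exceeds the unclipped surrogate.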

PPO additionally adds the terms $L^{\text{VF}}(\theta)$ and $S[\pi_{\theta}](s)$. $L^{\text{VF}}$ is the term for training the value estimate (critic); when the policy and the value function are optimized in a shared network, it helps the training converge more stably. $S[\pi_{\theta}](s)$ is an entropy bonus: by increasing the entropy of the policy (more noise), it encourages a bit more exploration.

PPO code implementation

The TRPO code is honestly too complicated, so I will skip it. I implemented and ran a very simple PPO, following HDBG's blog. The full code is below.

import random

from tqdm import tqdm
from collections import deque

import numpy as np
import pandas as pd
import gymnasium as gym
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal
from torch.utils.data import TensorDataset, DataLoader


"""
    Policy for a continuous action space: outputs mu (the action mean) and std (exp of a tanh-bounded log_std)
"""
class MLPGaussianPolicy(nn.Module):
    def __init__(self, dim_state, dim_action, dim_hiddens=(512, ), activation_fn=F.relu):
        super(MLPGaussianPolicy, self).__init__()
        self.input_layer = nn.Linear(dim_state, dim_hiddens[0])
        self.hidden_layers = nn.ModuleList()
        for i in range(len(dim_hiddens) - 1):
            hidden_layer = nn.Linear(dim_hiddens[i], dim_hiddens[i+1])
            self.hidden_layers.append(hidden_layer)
        self.mu_layer = nn.Linear(dim_hiddens[-1], dim_action)
        self.log_std_layer = nn.Linear(dim_hiddens[-1], dim_action)
        self.activation_fn = activation_fn
        
    def forward(self, s):
        s = self.activation_fn(self.input_layer(s))
        for hidden_layer in self.hidden_layers:
            s = self.activation_fn(hidden_layer(s))
            
        mu = self.mu_layer(s)
        log_std = torch.tanh(self.log_std_layer(s))
        
        return mu, log_std.exp()


"""
    Critic: state-value network V(s)
"""
class MLPStateValue(nn.Module):
    def __init__(self, state_dim, hidden_dims=(512, ), activation_fn=F.relu):
        super(MLPStateValue, self).__init__()
        self.input_layer = nn.Linear(state_dim, hidden_dims[0])
        self.hidden_layers = nn.ModuleList()
        for i in range(len(hidden_dims) - 1):
            hidden_layer = nn.Linear(hidden_dims[i], hidden_dims[i + 1])
            self.hidden_layers.append(hidden_layer)
        self.output_layer = nn.Linear(hidden_dims[-1], 1)
        self.activation_fn = activation_fn

    def forward(self, x):
        x = self.activation_fn(self.input_layer(x))
        for hidden_layer in self.hidden_layers:
            x = self.activation_fn(hidden_layer(x))
        x = self.output_layer(x)

        return x


"""
    PPO is an on-policy algorithm, so experience is discarded right after it has been used for learning.
    This buffer temporarily holds the transitions (s_t, a_t, r_t, s_{t+1}, done).
"""
class RolloutBuffer:
    def __init__(self):
        self.buffer = list()

    def store(self, transition):
        self.buffer.append(transition)

    """
        Return all stored transitions as tensors and clear the list.
    """
    def sample(self):
        s, a, r, s_prime, done = map(np.array, zip(*self.buffer))
        self.buffer.clear()
        return (
            torch.FloatTensor(s),
            torch.FloatTensor(a),
            torch.FloatTensor(r).unsqueeze(1),
            torch.FloatTensor(s_prime),
            torch.FloatTensor(done).unsqueeze(1)
        )

    @property
    def size(self):
        return len(self.buffer)
    
class PPO:
    def __init__(
        self,
        state_dim,
        action_dim,
        hidden_dims=(64, 64 ),
        activation_fn=torch.tanh,
        n_steps=2048,
        n_epochs=10,
        batch_size=64,
        policy_lr=0.0003,
        value_lr=0.0003,
        gamma=0.99,
        lmda=0.95,
        clip_ratio=0.2,
        vf_coef=1.0,
        ent_coef=0.01,
    ):
        self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.policy = MLPGaussianPolicy(state_dim, action_dim, hidden_dims, activation_fn).to(self.device)
        self.value = MLPStateValue(state_dim, hidden_dims, activation_fn).to(self.device)        
        self.n_steps = n_steps
        self.n_epochs = n_epochs
        self.batch_size = batch_size
        self.lmda = lmda
        self.gamma = gamma
        self.clip_ratio = clip_ratio
        self.vf_coef = vf_coef
        self.ent_coef = ent_coef

        self.policy_optimizer = torch.optim.Adam(self.policy.parameters(), lr=policy_lr)
        self.value_optimizer = torch.optim.Adam(self.value.parameters(), lr=value_lr)
        
        self.buffer = RolloutBuffer()

    """
        Sample an action in [-1, 1] (tanh of a Gaussian sample; the mean is used when training=False)
    """
    @torch.no_grad()
    def act(self, s, training=True):
        self.policy.train(training)

        s = torch.as_tensor(s, dtype=torch.float, device=self.device)
        mu, std = self.policy(s)
        z = torch.normal(mu, std) if training else mu
        action = torch.tanh(z)

        return action.cpu().numpy()
    
    """
        Core learning logic: compute GAE, then run n_epochs of minibatch PPO updates

    """
    def learn(self):
        self.policy.train()
        self.value.train()

        # Take the whole rollout out of the buffer
        s, a, r, s_prime, done = self.buffer.sample()
        s, a, r, s_prime, done = map(lambda x: x.to(self.device), [s, a, r, s_prime, done])
        
        # Compute GAE and log_prob_old
        with torch.no_grad():
            delta = r + (1 - done) * self.gamma * self.value(s_prime) - self.value(s)  # TD residuals \delta_t
            adv = torch.clone(delta)  # will hold the GAE advantages
            ret = torch.clone(r) # will hold the discounted returns
            # Backward recursion; the training loop stores only `terminated` as done, so truncated episodes are not cut here
            for t in reversed(range(len(r) - 1)):
                adv[t] += (1 - done[t]) * self.gamma * self.lmda * adv[t + 1]
                ret[t] += (1 - done[t]) * self.gamma * ret[t + 1]

            # Log-probability under \pi_{old}(a|s)
            mu, std = self.policy(s)
            m = Normal(mu, std)
            z = torch.atanh(torch.clamp(a, -1.0 + 1e-7, 1.0 - 1e-7)) # act() applied tanh, so undo it with atanh to evaluate the Gaussian density
            log_prob_old = m.log_prob(z).sum(dim=-1, keepdims=True) # log-probability of the taken action under the pre-update policy
        
        # Training the policy and value network ``n_epochs`` time
        dts = TensorDataset(s, a, ret, adv, log_prob_old)
        loader = DataLoader(dts, batch_size=self.batch_size, shuffle=True)

        # Reuse the collected data for n_epochs passes
        for e in range(self.n_epochs):
            value_losses, policy_losses, entropy_bonuses = [], [], []
            for batch in loader:
                s_, a_, ret_, adv_, log_prob_old_ = batch
                # Value-network loss
                value = self.value(s_)
                value_loss = F.mse_loss(value, ret_) # pull the predicted V(s) toward the observed discounted return

                # Policy-network loss
                mu, std = self.policy(s_)
                m = Normal(mu, std)
                z = torch.atanh(torch.clamp(a_, -1.0 + 1e-7, 1.0 - 1e-7))
                log_prob = m.log_prob(z).sum(dim=-1, keepdims=True)
                
                ratio = (log_prob - log_prob_old_).exp() # probability ratio between the new and old policies
                surr1 = adv_ * ratio
                surr2 = adv_ * torch.clamp(ratio, 1.0 - self.clip_ratio, 1.0 + self.clip_ratio) # clipped objective (PPO)

                policy_loss = -torch.min(surr1, surr2).mean()
                entropy_bonus = -m.entropy().mean()

                loss = policy_loss + self.vf_coef * value_loss + self.ent_coef * entropy_bonus
                self.value_optimizer.zero_grad()
                self.policy_optimizer.zero_grad()
                loss.backward()
                self.value_optimizer.step()
                self.policy_optimizer.step()

                value_losses.append(value_loss.item())
                policy_losses.append(policy_loss.item())
                entropy_bonuses.append(-entropy_bonus.item())

        result = {'policy_loss': np.mean(policy_losses),
                  'value_loss': np.mean(value_losses),
                  'entropy_bonus': np.mean(entropy_bonuses)}

        return result
    

    def step(self, transition):
        result = None
        self.buffer.store(transition)
        # Once n_steps transitions have been collected, call learn() to start training.
        if self.buffer.size >= self.n_steps:
            result = self.learn()

        return result
        

def evaluate(env_name, agent, seed, eval_iterations):
    env = gym.make(env_name)
    scores = []
    for i in range(eval_iterations):
        (s, _), terminated, truncated, score = env.reset(seed=seed + 100 + i), False, False, 0
        while not (terminated or truncated):
            a = agent.act(s, training=False)
            s_prime, r, terminated, truncated, _ = env.step(a)
            score += r
            s = s_prime
        scores.append(score)
    env.close()
    return round(np.mean(scores), 4)

def seed_all(seed):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

env_name = 'Hopper-v5'

seed = 0
seed_all(seed)
max_iterations = 1000000
eval_intervals = 10000
eval_iterations = 10
# activation_fn = F.relu
activation_fn = F.tanh

env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
agent = PPO(
    state_dim,
    action_dim,
    activation_fn=activation_fn,
)

logger = []
(s, _), terminated, truncated = env.reset(seed=seed), False, False
for t in tqdm(range(1, max_iterations + 1)):
    a = agent.act(s)
    s_prime, r, terminated, truncated, _ = env.step(a)
    result = agent.step((s, a, r, s_prime, terminated))
    s = s_prime
    
    if result is not None:
        logger.append([t, 'policy_loss', result['policy_loss']])
        logger.append([t, 'value_loss', result['value_loss']])
        logger.append([t, 'entropy_bonus', result['entropy_bonus']])
    
    if terminated or truncated:
        (s, _), terminated, truncated = env.reset(), False, False
        
    if t % eval_intervals == 0:
        score = evaluate(env_name, agent, seed, eval_iterations)
        logger.append([t, 'Avg return', score])


logger = pd.DataFrame(logger)
logger.columns = ['step', 'key', 'value']

fig = plt.figure(figsize=(12, 4))

ax = fig.add_subplot(1, 4, 1)
key = 'Avg return'
ax.plot(logger.loc[logger['key'] == key, 'step'], logger.loc[logger['key'] == key, 'value'], 'b-')
ax.grid(axis='y')
ax.set_title("Average return over 10 episodes")
ax.set_xlabel("Step")
ax.set_ylabel("Avg return")

ax = fig.add_subplot(1, 4, 2)
key = 'policy_loss'
ax.plot(logger.loc[logger['key'] == key, 'step'], logger.loc[logger['key'] == key, 'value'], 'b-')
ax.grid(axis='y')
ax.set_title("Policy loss")
ax.set_xlabel("Step")
ax.set_ylabel("Policy loss")

ax = fig.add_subplot(1, 4, 3)
key = 'value_loss'
ax.plot(logger.loc[logger['key'] == key, 'step'], logger.loc[logger['key'] == key, 'value'], 'b-')
ax.grid(axis='y')
ax.set_title("Value loss")
ax.set_xlabel("Step")
ax.set_ylabel("Value loss")

ax = fig.add_subplot(1, 4, 4)
key = 'entropy_bonus'
ax.plot(logger.loc[logger['key'] == key, 'step'], logger.loc[logger['key'] == key, 'value'], 'b-')
ax.grid(axis='y')
ax.set_title("Entropy bonus")
ax.set_xlabel("Step")
ax.set_ylabel("Entropy bonus")

fig.tight_layout()
# plt.show()
import os
os.makedirs('./output', exist_ok=True)  # savefig fails if the directory does not exist
plt.savefig('./output/output.png')

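PPO interacts with the environment for n_steps (2,048 here) to collect data, and then updates the network parameters several times on that data.
vf_coef, ent_coef, clip_ratio, and so on in the PPO class are the ingredients of the PPO objective function.
The learn method works like a GAE actor-critic: it uses the computed value, advantage, and importance sampling coefficient to update the policy several times on the collected data.
- While the networks are updated for n_epochs, $\pi_{\theta_{old}}$ stays fixed.
RolloutBuffer implements on-policy data collection and disposal (PPO is on-policy).
- Because of the importance sampling step it may look off-policy, but it is on-policy (only slightly off-policy).
The MLPGaussianPolicy and MLPStateValue classes implement the policy network and the state-value network separately (not a shared network).
The PPO agent controls Hopper-v5, one of the Gymnasium MuJoCo environments.
- I experiment with both ReLU and tanh as the PPO activation.
In practice, PPO runs $N$ policies in parallel, each interacting with the environment $T$ times, to obtain $NT$ experience samples, and optimizes the objective on those samples.
- This code is the $N=1$ case, and PPO tends not to work well when $N=1$. The reason is that the $T$ samples obtained from a single environment are strongly correlated with each other, which raises the risk of overfitting.
  - Parallel agents are almost essential for good performance of on-policy algorithms such as PPO.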
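Since learn relies on GAE, here is a stand-alone NumPy sketch of the backward recursion, separated from the networks. The function name `compute_gae` and the toy inputs are mine; it follows the same recursion as the code above but writes the advantage into a fresh array.

```python
import numpy as np

def compute_gae(rewards, values, next_values, dones, gamma=0.99, lmda=0.95):
    """Generalized Advantage Estimation over one rollout (1-D arrays of equal length)."""
    # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
    delta = rewards + (1.0 - dones) * gamma * next_values - values
    adv = np.zeros_like(delta)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # A_t = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}
        running = delta[t] + (1.0 - dones[t]) * gamma * lmda * running
        adv[t] = running
    return adv

# Toy rollout: three steps, reward 1 each, zero value estimates, episode ends at the last step.
rewards = np.array([1.0, 1.0, 1.0])
zeros = np.zeros(3)
dones = np.array([0.0, 0.0, 1.0])

adv_mc = compute_gae(rewards, zeros, zeros, dones, gamma=1.0, lmda=1.0)  # Monte Carlo-like: [3, 2, 1]
adv_td = compute_gae(rewards, zeros, zeros, dones, gamma=1.0, lmda=0.0)  # one-step TD: [1, 1, 1]
```

$\lambda = 1$ recovers the full return minus the baseline, and $\lambda = 0$ recovers the one-step TD residual, which is the bias/variance knob that lmda controls.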

Tanh (good!)

RELU (not good..)

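With tanh as the activation, the Avg return was about 1.5 times higher than with ReLU. This is reportedly a fairly common phenomenon for on-policy methods. The resulting performance is similar to TRPO, which is far more complex to implement.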

[ Deep Learning Paper Review - PRMI Lab ] - DreamFusion: Text-to-3D using 2D Diffusion (ICLR 2023)

Lee현서 — Tue, 6 May 2025 19:01:41 +0900

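While taking the CS492(D) course I looked up some papers that had impressed me, and the idea of the SDS loss stood out, so here is a brief review and summary of the paper. I will cover the SDS loss, which is the core of DreamFusion, and then briefly look at the shading technique used in DreamFusion.

Paper: https://arxiv.org/pdf/2209.14988

Reference: https://xoft.tistory.com/39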


Prof. Minhyuk Sung's lecture: CS492(D)

SDS Loss

The standard DDPM loss has the form shown above: given $x_t$ and $t$, a U-Net predicts the noise. DDIM is similar. Let's briefly look at the SDS loss used in DreamFusion.

DreamFusion

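The right-hand part of the DreamFusion figure is the SDS part. In DreamFusion the input is an image generated by NeRF. As in the formula above, $g(\theta ; c)$ can be seen as an image sampled from a particular angle. This plays the role of $x_0$ in DDPM.

$x_t$ is obtained by adding noise to it. Here $y$ is the embedding of the text. These are fed into $\hat{\epsilon}_{\theta}$, the noise predictor, and the loss is built from the difference between its output and the actual noise.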

Taking the gradient of this loss gives the final gradient expression shown above. At this point the DreamFusion authors use a trick to improve training speed.

Decomposing $\nabla_{\theta}\mathcal{L}_{DF}(\theta)$ with the chain rule leaves the Jacobian of the noise predictor in the middle. Because $\hat{\epsilon}_{\theta}$ is a model with many parameters, computing this term requires a lot of memory. The authors found that SDS still works well when this term is simply dropped, so they omit it in practice. This is also why the U-Net is shown as locked in the figure above.

The SDS loss can then be written as above. The appendix of DreamFusion also contains a derivation of why this expression works.
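To make the "skip the U-Net Jacobian" step concrete, here is a toy NumPy sketch of the SDS gradient $w(t)\,(\hat{\epsilon} - \epsilon)\,\partial x / \partial\theta$. Everything here is a stand-in I made up: the "renderer" is a linear map `W @ theta` instead of NeRF, and `eps_hat_fn` is a hypothetical frozen predictor that believes the clean image is a fixed target `x_star`. It only illustrates the shape of the computation, not DreamFusion's implementation.

```python
import numpy as np

def sds_grad(theta, W, eps_hat_fn, alpha_t, sigma_t, w_t, rng):
    """SDS gradient for a toy linear renderer x = W @ theta (so dx/dtheta = W)."""
    x = W @ theta                             # "rendered image"
    eps = rng.standard_normal(x.shape)        # sampled noise
    z_t = alpha_t * x + sigma_t * eps         # noised render
    eps_hat = eps_hat_fn(z_t)                 # frozen noise predictor, treated as a constant
    # No Jacobian of eps_hat_fn appears here: grad = w(t) * (eps_hat - eps) * dx/dtheta
    return w_t * (W.T @ (eps_hat - eps))

rng = np.random.default_rng(0)
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_star = np.array([1.0, 2.0, 3.0])            # what the toy predictor "wants" to see
alpha_t, sigma_t, w_t = 0.8, 0.6, 1.0
eps_hat_fn = lambda z: (z - alpha_t * x_star) / sigma_t

theta = np.array([0.5, -0.5])
grad = sds_grad(theta, W, eps_hat_fn, alpha_t, sigma_t, w_t, rng)

# A few plain gradient-descent steps move the render toward x_star.
res_before = np.linalg.norm(W @ theta - x_star)
for _ in range(50):
    theta = theta - 0.1 * sds_grad(theta, W, eps_hat_fn, alpha_t, sigma_t, w_t, rng)
res_after = np.linalg.norm(W @ theta - x_star)
```

In this toy setup the sampled noise cancels exactly, so the gradient simply pulls the render toward what the predictor considers likely; with a real diffusion model the same update is noisy and is averaged over many $t$ and $\epsilon$.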

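Given the image $x$ rendered by NeRF, the expression above is set up to minimize the KL divergence between $q$, the distribution that produces the latent variable $z_t$, and $p_{\phi}$, the distribution determined by the U-Net parameters given the text embedding $y$ [weighted density distillation (2018)]. Writing it as an expectation over sampled $\epsilon$ and taking the gradient gives two terms, (A) and (B). First, the gradient of (A) is zero, because $z_t$ in (A) is produced in a way that is fixed with respect to $\theta$.

As shown above, (A) can be decomposed into a parameter score plus a path derivative. According to the Sticking-the-Landing paper, keeping only the path derivative and dropping the parameter score makes the path derivative correlate with the other loss term, which reduces the variance, so the authors choose to use that term.

The (B) term is written as above by decomposing the score function with the chain rule. Since it is determined by the NeRF parameters $\theta$, the chain rule has to be applied starting from the definition of the score function.

The final expression is the one above, which gives a lower-variance gradient and makes the optimization converge stably and quickly.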

Rendering

DreamFusion

In the figure above, everything to the left of the SDS loss part is the process of sampling an image with NeRF. I will briefly explain how the sampling works.

A random camera pose and light position are used to generate a 64x64 shaded image sampled from NeRF. The difference from the original NeRF is that three different rendering modes are chosen at random.

Render with albedo $\rho$ only, without shading (the original NeRF is view-dependent)
Render with shading
Render with shading after replacing albedo $\rho$ with white

Now let's look at how DreamFusion implements shading.

NeRF determines the pixel color by volume rendering a radiance color $\rho$ that depends on the ray direction. DreamFusion adds a light source on top of this and volume renders a surface color $c$ that changes with the lighting.

DreamFusion computes the normal vector $n$ by differentiating the volume density $\tau$. Here $n$ is a vector that describes the local orientation of the geometry. Assuming a light at 3D point $l$ with color $l_p$ and an ambient light color $l_{\alpha}$, the diffuse reflectance is computed as $c$ in the formula above, and the final color is obtained by volume rendering.

In addition, replacing the albedo $\rho$ with white (1, 1, 1) produces a shaded output without texture. Determining the color as shading times albedo in this way models the appearance of the object and the lighting effect separately, which the authors say improved performance.
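The shading described above is ordinary Lambertian (diffuse) shading, so a single-point version is easy to sketch in NumPy. The function `shade` and all the numeric values are my own illustration, and I am assuming the form $c = \rho \circ (l_p \max(0, n \cdot \hat{d}) + l_{\alpha})$ with $\hat{d}$ the unit direction from the surface point to the light; the actual model evaluates this at every sample along a ray before volume rendering.

```python
import numpy as np

def shade(albedo, normal, point, light_pos, light_color, ambient_color):
    """Diffuse shading of one surface point: albedo * (light * max(0, n . d) + ambient)."""
    n = normal / np.linalg.norm(normal)
    d = light_pos - point
    d = d / np.linalg.norm(d)                 # unit direction from the point to the light
    diffuse = max(0.0, float(n @ d))
    return albedo * (light_color * diffuse + ambient_color)

point = np.zeros(3)
normal = np.array([0.0, 0.0, 1.0])
light_color = np.full(3, 0.9)
ambient = np.full(3, 0.1)
red = np.array([0.8, 0.2, 0.2])

lit = shade(red, normal, point, np.array([0.0, 0.0, 2.0]), light_color, ambient)       # light in front
unlit = shade(red, normal, point, np.array([0.0, 0.0, -2.0]), light_color, ambient)    # light behind
textureless = shade(np.ones(3), normal, point, np.array([0.0, 0.0, 2.0]), light_color, ambient)
```

Swapping the albedo for `np.ones(3)` is the "textureless" mode: the output then depends only on the normal and the light, which is why it exposes flat or wrong geometry.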

I will add the remaining optimization techniques later when I have time.

Experiments

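As shown above, a 3D scene can be refined with text prompts.

3D reconstruction is usually evaluated by the Chamfer Distance between the recovered geometry and the ground truth, and view synthesis by PSNR between the rendered view and the ground truth. In the zero-shot setting, however, there is no ground truth, so evaluation uses CLIP R-Precision, which measures how consistent the rendered images are with the input caption. R-Precision is the accuracy with which CLIP picks the correct caption out of a set of distractor captions, given a rendered image. The table compares various zero-shot text-to-3D models on this metric.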
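As a sketch of how R-Precision (with R = 1) could be computed once the image-caption similarities are available: the matrix below is made up, and in a real evaluation `sim[i, j]` would come from CLIP image and text embeddings.

```python
import numpy as np

def clip_r_precision(sim):
    """sim[i, j]: similarity of rendered image i to caption j; caption i is the correct one."""
    top1 = sim.argmax(axis=1)                          # caption retrieved for each image
    return float(np.mean(top1 == np.arange(sim.shape[0])))

# Toy similarity matrix (hypothetical values): images 0 and 2 retrieve their own caption, image 1 does not.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.2, 0.7],
])
score = clip_r_precision(sim)   # 2 of 3 correct
```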

These are the ablation results. The four techniques below are added one at a time.

Widening the range of viewpoints (i)
- Without (ii), the object can end up with two faces.
View-dependent prompt engineering (ii)
- Improves the geometry, but the surface is non-smooth.
Adding a light source (iii)
- Improves the geometry, but dark regions are still non-smooth.
Shading with the albedo set to white (iv)
- Makes the geometry smooth.

Retrospective: Spring Semester 2025

Lee현서 — Tue, 6 May 2025 15:52:31 +0900

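It has already been two months since I finished my long-awaited military service and returned to school. I feel I am at the point where it is becoming clear in my head what I should start and what I should not. So I am writing this to look back and to sharpen my sense of purpose.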

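Military service

Serving as a software developer soldier, I met a wide variety of people. First, without the two mates who enlisted with me, I honestly don't know how I would have managed; they took such good care of me. I am also grateful to all the seniors and juniors who hung out, studied, and talked with me. Above all, living with outstanding people gave me many insights: the path of graduate school, of founding a startup, of employment, of freelancing, and so on. Thanks to them I never burned out in the military and kept working steadily toward graduate school, the path I want.

Before enlisting I had just started as an undergraduate researcher in computer vision, so I was looking at detection models and the like. Building on that, in the military I freely studied what I wanted, teaching myself 3D vision and diffusion models. I also grew through study groups with friends who share my goal. By the time I was discharged, I had been able to build a deep learning server for my own research.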

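Returning to school

Back at school, I decided to take only the courses I want to take. So now I am taking Computer Vision, Artificial Intelligence, Probability and Statistics, Deep Learning, and Introduction to Machine Learning. In the AI course I am learning AI techniques used in robotics that I had not thought about: LiDAR, RL, Kalman filters, and so on. I underestimated it as a basic subject and am paying for it. The class analyzes every technique mathematically and explains the connection to concepts learned earlier, which is really interesting to me as someone who likes math. In Computer Vision, I feel I am finally untangling the 3D vision concepts behind 3DGS, NeRF, and the other things I studied in the military, which is great fun.

As an undergraduate researcher I was finally assigned research projects. There is a pansharpening project with the Korea Aerospace Research Institute, and my professor suggested I join a 3DGS project run at a basic research institute. I plan to submit about two papers to remote sensing conferences based on this work. One reason I look forward to this semester is that I get to do this new research.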

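Goals

Since returning, I have looked into various labs for graduate school. A junior from my unit who attends KAIST recommended KAIRI and also things like the SK Fellowship. I searched the Graduate School of AI and the School of Computing for professors working on what I like. While going through School of Computing labs I came across Prof. Minhyuk Sung's lab; the research area was fascinating, and the Internship menu said a call would be posted in mid-April. That led me to read the homepage closely and think about what I can do now. I first looked at the lab's introduction slides and then watched the professor's lectures.

From the course homepages I tried one lecture each of CS492(D) and CS479. They derived concepts I already knew in more depth, explained the connections to many papers, and clearly answered things I had wondered about. In fact, I am writing this post partly because I have now finished both courses. I think it is the first time I have felt excited while listening to an explanation of a concept in class. I learned more than I can put into words. The reason I had not been posting paper reviews is also that I was focusing on these lectures.

Now I plan to work through the CS492(D) assignments one by one. The most interesting ones are Distillation and DPMSolver: Distillation because DreamFusion was so interesting, and DPMSolver because its derivation involves differential equations and is very hard, and the professor said in the lecture that DPM-Solver-2 gets much more complicated than DPM-Solver-1 and encouraged us to try it. I actually wanted to post my solution process on the blog and was about to send the professor a scheduled email, but cancelled it because it felt premature before even finishing the lectures.

The more I learn about Prof. Minhyuk Sung, the more he becomes my role model. I want to do research as happily as he does. Over the weekend I finished and submitted my internship application. The junior who attends KAIST happened to apply to the same place, and I really hope we can do vision research together in Daejeon during the break.

From now on, alongside my undergraduate research project, the CS492(D) and CS479 assignments, and my coursework, I will also study spoken English really hard. Listening to the CS479 project pitches, everyone's English pronunciation and level were impressive. I can listen well, but I cannot speak fluently.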

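Closing

I am really happy because I feel I am doing what I love. I will solve the assignments mentioned above one by one and post them as private posts. I have not listened to the Guest Lectures yet, so I should go through those one by one as well! (One of the DDIM authors was among them.) It looks like a fun semester.

CS492d assignments completed: https://github.com/eunoiahyunseo/CS492-D-

CS479 assignments completed: https://github.com/eunoiahyunseo/CS479

Currently set to private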

[ Deep Learning Paper Review - PRMI Lab ] - Asyrp: DIFFUSION MODELS ALREADY HAVE A SEMANTIC LATENT SPACE (ICLR 2023)

Lee현서 — Sun, 23 Mar 2025 15:42:00 +0900

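My professor asked me to look for style transfer papers, and while searching for diffusion + style transfer work I came across DiffStyle, which builds on a paper called Asyrp (Asymmetric reverse process). I now have to write an SCI paper in earnest in order to graduate, and reading good papers like this one taught me a lot about how to write one.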

git: https://github.com/kwonminki/Asyrp_official


paper: https://arxiv.org/pdf/2210.10960

Abstract

Diffusion models have achieved strong performance in many domains. However, they have lacked a semantic latent space for controlling the generative process. The authors propose an asymmetric reverse process (Asyrp) that works on a pretrained, frozen diffusion model. The latent space they discover, called h-space, makes semantic image manipulation easy. They also introduce a design of the generative process for versatile editing and quality boosting, choosing one interval by its editing strength and another by its quality deficiency. The method applies to many architectures (DDPM++, iDDPM, ADM, etc.) and many datasets (CelebA-HQ, AFHQ-dog, LSUN-church, LSUN-bedroom, METFACES, etc.). The project page is https://kwonminki.github.io/Asyrp/.

Background

DDIM

The paper mainly builds on DDIM. DDIM samples the reverse process from the distribution shown above. When $\eta$ is 1 the process is DDPM, and when it is 0 the process is (deterministic) DDIM.
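For reference, a single reverse step with the $\eta$ knob can be sketched as below. I am assuming the DDIM paper's notation in which $\alpha_t$ is the cumulative product, and $\sigma_t = \eta\sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\sqrt{1-\alpha_t/\alpha_{t-1}}$; the function name and toy numbers are mine, and `eps` stands in for the network prediction.

```python
import numpy as np

def ddim_step(x_t, eps, alpha_t, alpha_prev, eta, z):
    """One reverse step x_t -> x_{t-1}; eta = 0 is deterministic DDIM, eta = 1 matches DDPM."""
    sigma = eta * np.sqrt((1 - alpha_prev) / (1 - alpha_t)) * np.sqrt(1 - alpha_t / alpha_prev)
    pred_x0 = (x_t - np.sqrt(1 - alpha_t) * eps) / np.sqrt(alpha_t)    # P_t: predicted x_0
    dir_xt = np.sqrt(1 - alpha_prev - sigma ** 2) * eps                # D_t: direction pointing to x_t
    return np.sqrt(alpha_prev) * pred_x0 + dir_xt + sigma * z, sigma

rng = np.random.default_rng(0)
x0, eps = rng.standard_normal(4), rng.standard_normal(4)
alpha_t, alpha_prev = 0.5, 0.6
x_t = np.sqrt(alpha_t) * x0 + np.sqrt(1 - alpha_t) * eps               # forward-noised sample

z1, z2 = rng.standard_normal(4), rng.standard_normal(4)
det_a, sigma0 = ddim_step(x_t, eps, alpha_t, alpha_prev, eta=0.0, z=z1)
det_b, _ = ddim_step(x_t, eps, alpha_t, alpha_prev, eta=0.0, z=z2)     # same result: noise is ignored
sto, sigma1 = ddim_step(x_t, eps, alpha_t, alpha_prev, eta=1.0, z=z1)  # stochastic step
```

With $\eta = 0$ and the true noise, the step lands exactly on the less-noisy version of the same $x_0$, which is what makes inversion possible.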

CLIP

I have covered CLIP on this blog before: it learns a multi-modal embedding based on the similarity between the outputs of an image encoder $E_I$ and a text encoder $E_T$. With a loss based on cosine distance, it enables homogeneous editing without mode collapse.
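The cosine-distance loss mentioned here is, as far as I understand, the directional CLIP loss: one minus the cosine similarity between the change in image embedding and the change in text embedding. Below is a NumPy sketch with made-up 3-D vectors in place of real $E_I$ and $E_T$ outputs.

```python
import numpy as np

def directional_clip_loss(e_img_edit, e_img_src, e_txt_tgt, e_txt_src):
    """1 - cos(delta_image, delta_text), computed on precomputed embeddings."""
    d_img = e_img_edit - e_img_src
    d_txt = e_txt_tgt - e_txt_src
    cos = d_img @ d_txt / (np.linalg.norm(d_img) * np.linalg.norm(d_txt))
    return 1.0 - cos

# Hypothetical embeddings: the text direction is along the first axis.
e_txt_src, e_txt_tgt = np.array([0.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])
e_img_src = np.array([0.2, 0.5, 0.1])

aligned = directional_clip_loss(e_img_src + np.array([0.3, 0.0, 0.0]), e_img_src, e_txt_tgt, e_txt_src)
orthogonal = directional_clip_loss(e_img_src + np.array([0.0, 0.3, 0.0]), e_img_src, e_txt_tgt, e_txt_src)
```

The loss is 0 when the image moved in the same direction as the text and 1 when it moved in an unrelated direction, regardless of how far it moved.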

Introduction

Image guidance mixes the latent variables of a guiding image with the unconditional latent variables. It provides some control, but it is ambiguous which attribute of the guiding image should be reflected, and there is no intuitive way to control the magnitude of the change.

Classifier guidance, which I also covered in my LDM post, uses the gradient of a separately trained classifier in the reverse process to steer toward a target class. Having to train an extra classifier is inefficient, and computing the classifier gradient during sampling is costly.

DiffusionCLIP sends the image to the latent and then fine-tunes the model with a CLIP loss. Unlike the previous approaches it blends the target attribute into the source well, but it is still inefficient because it needs a separate fine-tuned model per attribute.

GANs (Goodfellow et al., 2020) offered an easy intuition for image editing in latent space. Given the latent vector of the original image, one can find the direction in which to move it so that the generated image gets closest to the desired text description in CLIP's embedding space. Such a latent direction can be applied to other images as well. However, finding the latent vector for a given real image is very hard. There has been a lot of diffusion work on this too, but most of it required an additional classifier.

This paper proposes Asyrp, which enables editing in the latent space of the original image even with a frozen diffusion model. The authors call their latent space h-space, and it has all the nice properties described later. Panel (d) of Figure 1 corresponds to it.

Discovering semantic latent space in diffusion models

This section explains why the naive approach does not work and proposes a new controllable reverse process. From here on the paper uses the abbreviated notation below.

For brevity $\sigma_{t}z_t$ is omitted, except when $\eta \neq 0$.

Problem

The goal is semantic latent manipulation through $x_T$ on a pretrained and frozen diffusion model. The first idea is to move $x_T$ in the direction of the CLIP loss. However, this can produce wrong manipulations and distort the image.

An alternative is to shift the $\epsilon_{t}^{\theta}$ predicted by the network. But this fails to manipulate $x_0$ properly: the shifts in $\mathsf{P}_t$ and $\mathsf{D}_t$ cancel each other, so the result stays on the original $p_{\theta}(x_{0:T})$. The authors state this as Theorem 1.

The proof is in Appendix C, and panels (a-b) of the figure above confirm that $x_0$ and $\tilde{x}_0$ are almost identical.

Asymmetric reverse process

The authors define a new asymmetric reverse process as follows.

asyrp

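The only change is that $\mathsf{P}_t$ uses $\tilde{\epsilon}_t^\theta$ instead of $\epsilon_t^\theta$. In other words, the predicted $x_0$ is modified, while the direction pointing to $x_t$, $\mathsf{D}_t$, is left untouched and keeps following the original flow. The figure below gives an intuitive picture of the Asyrp process.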
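A toy single-step experiment shows why the asymmetry matters. Below, `P` and `D` follow the abbreviated DDIM notation with $\eta = 0$, and the shift `d_eps` and the schedule values are numbers I made up. This is only an illustration of the cancellation, not the proof in Appendix C.

```python
import numpy as np

def P(x_t, eps, a_t):
    """Predicted x_0 from x_t and a noise estimate."""
    return (x_t - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)

def D(eps, a_prev):
    """Direction pointing to x_t (eta = 0)."""
    return np.sqrt(1 - a_prev) * eps

def step(x_t, eps_for_P, eps_for_D, a_t, a_prev):
    return np.sqrt(a_prev) * P(x_t, eps_for_P, a_t) + D(eps_for_D, a_prev)

rng = np.random.default_rng(0)
x_t, eps = rng.standard_normal(8), rng.standard_normal(8)
d_eps = 0.1 * rng.standard_normal(8)          # hypothetical shift of the predicted noise
a_t, a_prev = 0.5, 0.52                       # adjacent timesteps, so the alphas are close

x_orig = step(x_t, eps, eps, a_t, a_prev)
x_sym = step(x_t, eps + d_eps, eps + d_eps, a_t, a_prev)   # shift both P_t and D_t
x_asym = step(x_t, eps + d_eps, eps, a_t, a_prev)          # Asyrp: shift only P_t

change_sym = np.linalg.norm(x_sym - x_orig)
change_asym = np.linalg.norm(x_asym - x_orig)
```

In the symmetric case the two contributions of the shift nearly cancel, so `change_sym` is tiny compared to `change_asym`.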

The authors feed $\mathsf{P}_{t}^{edit}$ and $\mathsf{P}_{t}^{source}$ into the directional CLIP loss introduced above and regularize the difference between the two. They then look for $\Delta\epsilon = \arg\min_{\Delta\epsilon}\mathbb{E}_{t}\mathcal{L}^{(t)}$, where $\mathcal{L}^{(t)}$ is defined as below.

As discussed later, such a $\Delta\epsilon$ does render the attribute in $x_0^{edit}$, but $\epsilon$-space lacks the properties one expects from a semantic latent space in a diffusion model.

h-space

In current state-of-the-art diffusion models, $\epsilon_t^{\theta}$ comes out of a U-Net. The authors therefore look at its bottleneck and choose $h_t$, the feature map at the deepest part of the U-Net, as the handle for controlling $\epsilon_t^{\theta}$. $h_t$ also has a smaller spatial resolution and higher-level semantics than $\epsilon_t^{\theta}$. The sampling equation is changed accordingly, as below.

sampling equation

$\epsilon_t^{\theta}(x_t|\Delta h_t)$ is the network output when $\Delta h_t$ is added to the original feature map $h_t$. $\Delta h_t$ is obtained by minimizing Eq. (7) above, using $\mathsf{P}_t(\epsilon_t^{\theta}(x_t|\Delta h_t))$ in place of $\mathsf{P}_t(\tilde{\epsilon}_t^{\theta}(x_t))$, and it controls the attribute.

The authors report that h-space has the following properties, which set it apart from earlier approaches.

The same $\Delta h$ has the same effect on different samples.
Linearly scaling $\Delta h$ controls the magnitude of the attribute change, and this even works with negative scales.
Two or more $\Delta h$ can be added together to change several attributes at once.
$\Delta h$ produces the edit without degrading image quality.
The attribute change induced by $\Delta h$ is consistent across different timesteps $t$.

Implicit neural directions

Optimizing $\Delta h$ directly for many timesteps requires too many iterations, and choosing the learning rate and its schedule is difficult. Instead, the authors define an implicit function $f_t(h_t)$ that maps $h_t$ to $\Delta h$. $f_t(h_t)$ is a small network made of two 1x1 convolutions, and it is conditioned on the timestep by concatenation, as in the U-Net (the dimensions seem to differ somewhat from Figure 14). It is also trained by optimizing Eq. (7), using $\mathsf{P}_t^{edit}=\mathsf{P}_t(\epsilon_t^{\theta}(x_t|f_t))$.

Learning through $f_t$ is more robust to the learning rate and converges faster than learning $\Delta h_t$ for every timestep. Learning an implicit function also generalizes to unseen timesteps and bottleneck features. Thanks to this generalization, training can be accelerated on a subsequence $\{x_{\tau_i}\}_{\forall i \in [1, S]}$, $S < T$, as defined in DDIM. Generation can then use a custom subsequence $\{\tilde{\tau}_i\}$ of length $\tilde{S} < T$, implemented by the normalization $\Delta \tilde{h}_{\tilde{\tau}} = f_{\tilde{\tau}}(h_{\tilde{\tau}})S/\tilde{S}$. This preserves the total amount $\sum \Delta h_t$ ($\Delta \tilde{h}_{\tilde{\tau}}\tilde{S}=\Delta h_t S$). As a result $f_t$ can be trained regardless of the subsequence or the number of timesteps. Except for Figure 6, discussed later, all experiments use $f_t$ to obtain $\Delta h_t$.

Generative process design

So far I have covered Asyrp and h-space; this part covers the overall three-phase editing process. It consists of editing with Asyrp -> traditional denoising -> quality boosting, and we will look at the quantifiable measures used to decide the length of each phase.

Editing process with asyrp

Diffusion models generate high-level context in the early stage of the generative process and imperceptible fine details in the later stage (Choi et al., 2022). In the same spirit, the authors modify the early stage of the generative process so that it carries the semantic change. They call this stage the editing interval $[T, t_{edit}]$.

LPIPS is one of the metrics used to evaluate the similarity of two images.

The two images are each fed into VGG, and the similarity between their intermediate-layer features is used as the metric.

LPIPS($x$, $\mathsf{P}_T$) and LPIPS($x$, $\mathsf{P}_t$) are the perceptual distances between the original image and the predicted image at timesteps $T$ and $t$, respectively. This quantity represents how much is still left to be predicted in the reverse process. The editing strength of the interval $[T, t]$ is then defined as below.

editing strength

Figure 3 plots LPIPS($x$, $\cdot$) for $\mathsf{P}_t$ and $x_t$, and the gap indicated in the plot is the editing strength. The authors look for the shortest editing interval that still brings a sufficiently distinguishable change. Empirically, they find that choosing $t_{edit}$ such that LPIPS($x$, $\mathsf{P}_{t_{edit}}$)$=0.33$ gives the most reasonable interval.

Some attributes, however, need more visual change than others (e.g., pixar > smile). For such attributes the editing strength is increased by $\delta = 0.33d(E_T(y_{source}), E_T(y_{target}))$, where $E_T(\cdot)$ is the CLIP text embedding of $y_{(\cdot)}$ and $d(\cdot , \cdot)$ is the cosine distance between its arguments. The rule for $t_{edit}$ is thus extended to LPIPS($x$, $\mathsf{P}_{t_{edit}}$)$=0.33 - \delta$.
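The rule for picking $t_{edit}$ can be sketched as a lookup on a precomputed LPIPS curve. Everything concrete below is an assumption of mine: `lpips_x_P[t]` is a made-up, monotonically increasing stand-in for LPIPS($x$, $\mathsf{P}_t$), the 2-D vectors stand in for CLIP text embeddings, and choosing the nearest timestep is just one simple way to solve LPIPS $= 0.33 - \delta$.

```python
import numpy as np

def cosine_distance(u, v):
    return 1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def choose_t_edit(lpips_x_P, e_txt_src, e_txt_tgt, base=0.33):
    """Pick the timestep whose LPIPS(x, P_t) is closest to base - delta."""
    delta = base * cosine_distance(e_txt_src, e_txt_tgt)
    target = base - delta
    t_edit = int(np.argmin(np.abs(lpips_x_P - target)))
    return t_edit, target

T = 1000
lpips_x_P = np.linspace(0.0, 0.6, T + 1)       # hypothetical curve indexed by t = 0..T

src = np.array([1.0, 0.0])
near = np.array([1.0, 0.0])                    # target text identical to source: delta = 0
far = np.array([0.5, np.sqrt(3) / 2])          # cosine distance 0.5 from the source

t_small_change, _ = choose_t_edit(lpips_x_P, src, near)
t_large_change, _ = choose_t_edit(lpips_x_P, src, far)
```

A target text that is further from the source lowers the threshold, which pushes $t_{edit}$ closer to 0 and therefore makes the editing interval $[T, t_{edit}]$ longer.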

Quality boosting with stochastic noise injection

DDIM achieves nearly perfect inversion by removing stochasticity ($\eta = 0$). However, Karras et al. (2022) report that stochasticity improves image quality. Accordingly, the authors inject stochastic noise in $[t_{boost}, 0]$ and call it the boosting interval.

A longer boosting interval leads to higher quality, but an excessively long one alters the content. So, as with the editing interval, the authors look for the shortest interval that gives enough quality boosting with minimal content change. They regard the noise in the image as the capacity for quality boosting and call it quality deficiency, $\gamma_t=LPIPS(x, x_t)$, which measures the amount of noise in $x_t$ relative to the original image. $x_t$ is used here instead of $\mathsf{P}_t$ because this step is about the actual image rather than the semantics. Empirically, choosing $t_{boost}$ such that $\gamma_{t_{boost}}=1.2$ gave the best quality boosting with the least content change. This also guarantees that the editing strength of $[t_{boost}, 0]$ is below 0.25. In Figure 3, $LPIPS(x, x_t)$ drops sharply after $t_{boost}$, while $LPIPS(x, \mathsf{P}_t)$ changes only slightly. The authors note that most of the quality degradation in the experiments comes from DDIM, not from Asyrp.

overall process of image editing

Experiments

The first part shows the effectiveness of h-space and Asyrp across various attributes, datasets, and architectures. The second part gives quantitative results, and the third analyzes the properties of the semantic latent space in h-space and $\epsilon$-space.

implement details

The details are as described above.

Versatility of h-space with Asyrp

Asyrp on various dataset

The results show the effectiveness of Asyrp on existing U-Net based architectures and various datasets. Asyrp can synthesize attributes that were not seen in training (church -> {department, factory, and temple}). It even synthesizes smiling Poodles and Yorkshires, which are hard to find in the dog dataset.

Figure 5 shows human face images changed into various identities. The versatility of Asyrp is surprising, because these results come from merely shifting the bottleneck feature map in h-space at inference time, without changing the model at all.

Quantitative comparison

The method can be combined with various diffusion models without any fine-tuning. Even so, the authors compare Asyrp against DiffusionCLIP, which fine-tunes the whole model.

compare with DiffusionCLIP

This result comes from asking 80 participants, on 40 sets of original images, which method synthesized the edit with better quality and more naturally. Table 1 shows that Asyrp clearly outperforms DiffusionCLIP in every aspect. Appendix K has comparisons on more metrics.

Analysis on h-space

The authors analyze several properties of the semantic latent space (homogeneity, linearity, robustness, consistency across timesteps).

Homogeneity

Figure 6 compares the homogeneity of h-space and $\epsilon$-space. A $\Delta h_t$ optimized for one image gives the same attribute change when applied to other input images. In contrast, a $\Delta \epsilon_t$ optimized for one image distorts other input images when applied to them.

Figure 10 shows that the mean direction over $N=20$ random samples, $\Delta h_t^{mean}=\frac{1}{N}\sum_i \Delta h_t^i$, gives almost the same results.

Linearity

Figure 7 shows that linearly scaling $\Delta h$ changes the visual attribute by a corresponding amount. Surprisingly, this even holds for negative scaling.

Going further, Figure 8 shows that adding different $\Delta h$ together expresses both semantic changes at once.

Robustness

Figure 9 shows the result of adding random noise in h-space and in $\epsilon$-space. The random noise was drawn with a random direction and a random magnitude in each space. Perturbation in h-space led to realistic images with almost no semantic change, while perturbation in $\epsilon$-space produced distorted images.

Consistency across timesteps

As mentioned earlier, using $\Delta h_t^{mean}$ instead of the per-sample $\Delta h_t$ gives similar results. Here the authors further show that adding a time-invariant $\Delta h^{global}=\frac{1}{T_e}\sum_{t}\Delta h_t^{mean}$ in place of $\Delta h_t$ also gives similar results ($T_e$ is the length of the editing interval $[T, t_{edit}]$). In short, the authors recommend $\Delta h_t$ for the best quality, while $\Delta h_t^{mean}$ or $\Delta h^{global}$ is a reasonable compromise when a simpler procedure is preferred.
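In array terms, the two averages are just means over different axes. The sketch below uses random numbers in place of real bottleneck directions, and the shapes (`N` samples, `T_e` editing timesteps, `C` feature dimensions) are my own toy choice.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T_e, C = 20, 5, 4
dh = rng.standard_normal((N, T_e, C))   # dh[i, t]: direction for sample i at editing step t (toy data)

dh_mean = dh.mean(axis=0)               # Delta h_t^mean: average over samples, one vector per timestep
dh_global = dh_mean.mean(axis=0)        # Delta h^global: additionally averaged over the editing interval

# Linear scaling and composition are then plain vector arithmetic on these directions.
stronger = 2.0 * dh_global
reverse = -1.0 * dh_global
```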

[ Deep Learning Code Review - PRMI Lab ] - DDPM code review and run

Lee현서 — Mon, 17 Mar 2025 13:02:52 +0900

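This time I will analyze the official DDPM repo code and look at the implementation details and recent techniques in it. At the end I will run it myself, training on the celeba dataset and sampling from the trained model.

Paper: https://arxiv.org/pdf/2006.11239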

U-Net

model

class Unet(Module):
    def __init__(
        self,
        dim,
        init_dim = None,
        out_dim = None,
        dim_mults = (1, 2, 4, 8),
        channels = 3,
        self_condition = False,
        learned_variance = False,
        learned_sinusoidal_cond = False,
        random_fourier_features = False,
        learned_sinusoidal_dim = 16,
        sinusoidal_pos_emb_theta = 10000,
        dropout = 0.,
        attn_dim_head = 32,
        attn_heads = 4,
        full_attn = None,    # defaults to full attention only for inner most layer
        flash_attn = False
    ):
        super().__init__()

        # determine dimensions

        self.channels = channels
        self.self_condition = self_condition
        input_channels = channels * (2 if self_condition else 1)

        # fall back to dim when init_dim is not given
        init_dim = default(init_dim, dim)
        self.init_conv = nn.Conv2d(input_channels, init_dim, 7, padding = 3)

        # dims = [init_dim, dim * dim_mults[0], dim * dim_mults[1], ..., dim * dim_mults[-1]]
        # e.g. dim = 64 --> [64, 64, 128, 256, 512]
        dims = [init_dim, *map(lambda m: dim * m, dim_mults)]
        in_out = list(zip(dims[:-1], dims[1:]))

        # time embeddings

        time_dim = dim * 4

        self.random_or_learned_sinusoidal_cond = learned_sinusoidal_cond or random_fourier_features

        # build the sinusoidal embedding used for the time embedding (positional encoding)
        if self.random_or_learned_sinusoidal_cond:
            sinu_pos_emb = RandomOrLearnedSinusoidalPosEmb(learned_sinusoidal_dim, random_fourier_features)
            fourier_dim = learned_sinusoidal_dim + 1
        else:
            sinu_pos_emb = SinusoidalPosEmb(dim, theta = sinusoidal_pos_emb_theta)
            fourier_dim = dim

        # time_mlp: sinu_pos_emb --> Linear --> GELU --> Linear
        self.time_mlp = nn.Sequential(
            sinu_pos_emb,
            nn.Linear(fourier_dim, time_dim),
            nn.GELU(),
            nn.Linear(time_dim, time_dim)
        )

        # attention

        if not full_attn:
            full_attn = (*((False,) * (len(dim_mults) - 1)), True)

        num_stages = len(dim_mults)
        full_attn  = cast_tuple(full_attn, num_stages)
        attn_heads = cast_tuple(attn_heads, num_stages)
        attn_dim_head = cast_tuple(attn_dim_head, num_stages)

        assert len(full_attn) == len(dim_mults)

        # prepare blocks

        FullAttention = partial(Attention, flash = flash_attn)
        resnet_block = partial(ResnetBlock, time_emb_dim = time_dim, dropout = dropout)

        # layers

        self.downs = ModuleList([])
        self.ups = ModuleList([])
        num_resolutions = len(in_out)


        # Down-sampling (downs)
        for ind, ((dim_in, dim_out), layer_full_attn, layer_attn_heads, layer_attn_dim_head) in enumerate(zip(in_out, full_attn, attn_heads, attn_dim_head)):
            is_last = ind >= (num_resolutions - 1)

            attn_klass = FullAttention if layer_full_attn else LinearAttention

            self.downs.append(ModuleList([
                resnet_block(dim_in, dim_in),
                resnet_block(dim_in, dim_in),
                attn_klass(dim_in, dim_head = layer_attn_dim_head, heads = layer_attn_heads),
                Downsample(dim_in, dim_out) if not is_last else nn.Conv2d(dim_in, dim_out, 3, padding = 1)
            ]))

        # Mid block
        mid_dim = dims[-1]
        self.mid_block1 = resnet_block(mid_dim, mid_dim)
        self.mid_attn = FullAttention(mid_dim, heads = attn_heads[-1], dim_head = attn_dim_head[-1])
        self.mid_block2 = resnet_block(mid_dim, mid_dim)

        # Up-sampling (ups)
        for ind, ((dim_in, dim_out), layer_full_attn, layer_attn_heads, layer_attn_dim_head) in enumerate(zip(*map(reversed, (in_out, full_attn, attn_heads, attn_dim_head)))):
            is_last = ind == (len(in_out) - 1)

            attn_klass = FullAttention if layer_full_attn else LinearAttention

            self.ups.append(ModuleList([
                resnet_block(dim_out + dim_in, dim_out),
                resnet_block(dim_out + dim_in, dim_out),
                attn_klass(dim_out, dim_head = layer_attn_dim_head, heads = layer_attn_heads),
                Upsample(dim_out, dim_in) if not is_last else  nn.Conv2d(dim_out, dim_in, 3, padding = 1)
            ]))

        default_out_dim = channels * (1 if not learned_variance else 2)
        self.out_dim = default(out_dim, default_out_dim)

        self.final_res_block = resnet_block(init_dim * 2, init_dim)
        self.final_conv = nn.Conv2d(init_dim, self.out_dim, 1)

    @property
    def downsample_factor(self):
        return 2 ** (len(self.downs) - 1)
    
    # forward pass of the denoising network used in the DDPM reverse process
    def forward(self, x, time, x_self_cond = None):
        assert all([divisible_by(d, self.downsample_factor) for d in x.shape[-2:]]), f'your input dimensions {x.shape[-2:]} need to be divisible by {self.downsample_factor}, given the unet'

        if self.self_condition:
            x_self_cond = default(x_self_cond, lambda: torch.zeros_like(x))
            x = torch.cat((x_self_cond, x), dim = 1)

        x = self.init_conv(x)
        r = x.clone()

        t = self.time_mlp(time)

        h = []

        for block1, block2, attn, downsample in self.downs:
            x = block1(x, t)
            h.append(x)

            x = block2(x, t)
            x = attn(x) + x
            h.append(x)

            x = downsample(x)

        x = self.mid_block1(x, t)
        x = self.mid_attn(x) + x
        x = self.mid_block2(x, t)

        for block1, block2, attn, upsample in self.ups:
            x = torch.cat((x, h.pop()), dim = 1)
            x = block1(x, t)

            x = torch.cat((x, h.pop()), dim = 1)
            x = block2(x, t)
            x = attn(x) + x

            x = upsample(x)

        x = torch.cat((x, r), dim = 1)

        x = self.final_res_block(x, t)

        # this is the model output (the predicted noise under the pred_noise objective)
        return self.final_conv(x)

I'll assume everyone is familiar with U-Net and keep the summary brief.

x_self_cond
- An argument for feeding the previous output back in as an additional input (self-conditioning).
init_conv
- The initial convolution that embeds input_channels -> init_dim.
dims
- dim_mults controls the channel width at each downsampling stage of the U-Net.
- With dim_mults = (1, 2, 4, 8), dims is [init_dim, dim*1, dim*2, dim*4, dim*8].
time_dim
- In DDPM the current timestep t is given to the U-Net as an input; this is the size of its embedding.
random_or_learned_sinusoidal_cond
- A flag for whether a learned (or random Fourier feature) sinusoidal embedding is used instead of the fixed one.
time_mlp
- Linear -> GELU -> Linear, producing a time embedding of the size we want.
FullAttention
- The attention used inside the U-Net (class Attention).
  - Passing flash = flash_attn enables the faster flash attention. (I couldn't run it because flash attention wasn't supported on my 4090.)
downs, ups
- Lists of ModuleLists; each entry of downs stores resnet_block -> resnet_block -> attn_klass -> downsample.
- forward() loops over downs, runs the mid blocks, then loops over ups and finishes.
- Note that the blocks in ups take dim_out + dim_in input channels, because the skip features saved in downs are concatenated before each block. The time embedding t is passed to every block as a second argument.
  - See class ResnetBlock for the details.
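
To make the channel bookkeeping concrete, here is a small pure-Python sketch that repeats the same expressions used in `__init__`. The value dim = 64 is just an example I picked; everything else mirrors the code above.

```python
dim = 64
init_dim = dim                      # what default(init_dim, dim) returns when init_dim is None
dim_mults = (1, 2, 4, 8)

dims = [init_dim, *map(lambda m: dim * m, dim_mults)]
in_out = list(zip(dims[:-1], dims[1:]))

# default attention layout: full attention only at the innermost stage
full_attn = (*((False,) * (len(dim_mults) - 1)), True)

# one ModuleList is appended to downs per (dim_in, dim_out) pair
downsample_factor = 2 ** (len(in_out) - 1)

print(dims)               # [64, 64, 128, 256, 512]
print(in_out)             # [(64, 64), (64, 128), (128, 256), (256, 512)]
print(full_attn)          # (False, False, False, True)
print(downsample_factor)  # 8
```

So with these settings the input height and width must be divisible by 8, which is what the assert at the top of forward() checks.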

I won't go through the attention and positional encoding parts. The attention code is written mostly with the einops library and additionally uses memory key/values (mem_kv), i.e. extra learned key/value entries. If you are curious, see class Attention (used for full_attn) and LinearAttention. For positional encoding, see my earlier NeRF code analysis post.

GaussianDiffusion

class GaussianDiffusion(Module):
    def __init__(
        self,
        model,
        *,
        image_size,
        timesteps = 1000,
        sampling_timesteps = None,
        objective = 'pred_v',
        beta_schedule = 'sigmoid',
        schedule_fn_kwargs = dict(),
        ddim_sampling_eta = 0.,
        auto_normalize = True,
        offset_noise_strength = 0.,  # https://www.crosslabs.org/blog/diffusion-with-offset-noise
        min_snr_loss_weight = False, # https://arxiv.org/abs/2303.09556
        min_snr_gamma = 5,
        immiscible = False
    ):
        super().__init__()
        assert not (type(self) == GaussianDiffusion and model.channels != model.out_dim)
        assert not hasattr(model, 'random_or_learned_sinusoidal_cond') or not model.random_or_learned_sinusoidal_cond

        self.model = model

        self.channels = self.model.channels
        self.self_condition = self.model.self_condition

        if isinstance(image_size, int):
            image_size = (image_size, image_size)
        assert isinstance(image_size, (tuple, list)) and len(image_size) == 2, 'image size must be a integer or a tuple/list of two integers'
        self.image_size = image_size

        self.objective = objective

        assert objective in {'pred_noise', 'pred_x0', 'pred_v'}, 'objective must be either pred_noise (predict noise) or pred_x0 (predict image start) or pred_v (predict v [v-parameterization as defined in appendix D of progressive distillation paper, used in imagen-video successfully])'


        # choose the beta schedule
        if beta_schedule == 'linear':
            beta_schedule_fn = linear_beta_schedule
        elif beta_schedule == 'cosine':
            beta_schedule_fn = cosine_beta_schedule
        elif beta_schedule == 'sigmoid':
            beta_schedule_fn = sigmoid_beta_schedule
        else:
            raise ValueError(f'unknown beta schedule {beta_schedule}')

        betas = beta_schedule_fn(timesteps, **schedule_fn_kwargs)
        

### beta schedules ###

def linear_beta_schedule(timesteps):
    """
    linear schedule, proposed in original ddpm paper
    """
    scale = 1000 / timesteps
    beta_start = scale * 0.0001
    beta_end = scale * 0.02
    return torch.linspace(beta_start, beta_end, timesteps, dtype = torch.float64)

def cosine_beta_schedule(timesteps, s = 0.008):
    """
    cosine schedule
    as proposed in https://openreview.net/forum?id=-NEXDKk8gZ
    """
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps, dtype = torch.float64) / timesteps
    alphas_cumprod = torch.cos((t + s) / (1 + s) * math.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0, 0.999)


# sigmoid beta schedule (the default in this code; the DDPM paper itself uses the linear schedule)
def sigmoid_beta_schedule(timesteps, start = -3, end = 3, tau = 1, clamp_min = 1e-5):
    """
    sigmoid schedule
    proposed in https://arxiv.org/abs/2212.11972 - Figure 8
    better for images > 64x64, when used during training
    """

    # a beta schedule that makes alphas_cumprod change smoothly
    # e.g. it changes slowly in the later steps, which makes the reverse reconstruction easier
    steps = timesteps + 1
    t = torch.linspace(0, timesteps, steps, dtype = torch.float64) / timesteps

    # sigmoid of the start and end values
    v_start = torch.tensor(start / tau).sigmoid()
    v_end = torch.tensor(end / tau).sigmoid()

    alphas_cumprod = (-(( t * (end - start) + start ) / tau).sigmoid() + v_end) / (v_end - v_start)
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0, 0.999)

ddpm

The first part sets the beta schedule. The paper uses a simple linear schedule, but the default in this DDPM code is sigmoid_beta_schedule, which schedules beta so that alphas_cumprod changes smoothly. The end result is the betas tensor.
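
To compare the two schedules without PyTorch, here is a NumPy re-implementation of the linear and sigmoid schedules shown above. It is a sketch for inspection only (the function names `linear_betas` and `sigmoid_betas` are mine), and it prints $\bar{\alpha}_t$ at a few timesteps so you can see how differently the signal decays.

```python
import numpy as np

def linear_betas(timesteps):
    scale = 1000 / timesteps
    return np.linspace(scale * 0.0001, scale * 0.02, timesteps)

def sigmoid_betas(timesteps, start=-3, end=3, tau=1):
    sigmoid = lambda z: 1 / (1 + np.exp(-z))
    t = np.linspace(0, timesteps, timesteps + 1) / timesteps
    v_start, v_end = sigmoid(start / tau), sigmoid(end / tau)
    alphas_cumprod = (-sigmoid((t * (end - start) + start) / tau) + v_end) / (v_end - v_start)
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - alphas_cumprod[1:] / alphas_cumprod[:-1]
    return np.clip(betas, 0, 0.999)

for name, schedule in [("linear", linear_betas), ("sigmoid", sigmoid_betas)]:
    alpha_bar = np.cumprod(1 - schedule(1000))
    # alpha_bar at the first, middle, and last timestep
    print(name, alpha_bar[[0, 499, 999]])
```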

# define alpha
alphas = 1. - betas

# define alphas_cumprod (cumulative product of the alphas)
alphas_cumprod = torch.cumprod(alphas, dim=0)
# prepend 1 and drop the last entry --> alphas_cumprod_prev = [1, alphas_cumprod[0], alphas_cumprod[1], ..., alphas_cumprod[T-2]]
alphas_cumprod_prev = F.pad(alphas_cumprod[:-1], (1, 0), value = 1.)

timesteps, = betas.shape
self.num_timesteps = int(timesteps)

# sampling related parameters

self.sampling_timesteps = default(sampling_timesteps, timesteps) # default num sampling timesteps to number of timesteps at training

assert self.sampling_timesteps <= timesteps
self.is_ddim_sampling = self.sampling_timesteps < timesteps
self.ddim_sampling_eta = ddim_sampling_eta

# helper function to register buffer from float64 to float32

register_buffer = lambda name, val: self.register_buffer(name, val.to(torch.float32))

register_buffer('betas', betas)
register_buffer('alphas_cumprod', alphas_cumprod)
register_buffer('alphas_cumprod_prev', alphas_cumprod_prev)

# calculations for diffusion q(x_t | x_{t-1}) and others

register_buffer('sqrt_alphas_cumprod', torch.sqrt(alphas_cumprod))
register_buffer('sqrt_one_minus_alphas_cumprod', torch.sqrt(1. - alphas_cumprod))
register_buffer('log_one_minus_alphas_cumprod', torch.log(1. - alphas_cumprod))
register_buffer('sqrt_recip_alphas_cumprod', torch.sqrt(1. / alphas_cumprod))
register_buffer('sqrt_recipm1_alphas_cumprod', torch.sqrt(1. / alphas_cumprod - 1))

# calculations for posterior q(x_{t-1} | x_t, x_0)

posterior_variance = betas * (1. - alphas_cumprod_prev) / (1. - alphas_cumprod)

# above: equal to 1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)

register_buffer('posterior_variance', posterior_variance)

# below: log calculation clipped because the posterior variance is 0 at the beginning of the diffusion chain

register_buffer('posterior_log_variance_clipped', torch.log(posterior_variance.clamp(min =1e-20)))
register_buffer('posterior_mean_coef1', betas * torch.sqrt(alphas_cumprod_prev) / (1. - alphas_cumprod))
register_buffer('posterior_mean_coef2', (1. - alphas_cumprod_prev) * torch.sqrt(alphas) / (1. - alphas_cumprod))

# immiscible diffusion

self.immiscible = immiscible

# offset noise strength - in blogpost, they claimed 0.1 was ideal

self.offset_noise_strength = offset_noise_strength

# derive loss weight
# snr - signal noise ratio

snr = alphas_cumprod / (1 - alphas_cumprod)

# https://arxiv.org/abs/2303.09556

maybe_clipped_snr = snr.clone()
if min_snr_loss_weight:
    maybe_clipped_snr.clamp_(max = min_snr_gamma)

if objective == 'pred_noise':
    register_buffer('loss_weight', maybe_clipped_snr / snr)
elif objective == 'pred_x0':
    register_buffer('loss_weight', maybe_clipped_snr)
elif objective == 'pred_v':
    register_buffer('loss_weight', maybe_clipped_snr / (snr + 1))

# auto-normalization of data [0, 1] -> [-1, 1] - can turn off by setting it to be False

self.normalize = normalize_to_neg_one_to_one if auto_normalize else identity
self.unnormalize = unnormalize_to_zero_to_one if auto_normalize else identity

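alphas
- alphas is derived from betas, and from it $\bar{\alpha}_{t}$ (alphas_cumprod) and $\bar{\alpha}_{t-1}$ (alphas_cumprod_prev).
sampling_timesteps
- The number of timesteps used at sampling time; it defaults to the number of training timesteps.
is_ddim_sampling
- A flag for training with DDPM and sampling with DDIM; it is set when sampling_timesteps < timesteps.
ddim_sampling_eta
- The eta that controls how deterministic DDIM sampling is.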

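The variables declared above are stored in the model's state with register_buffer. A tensor registered this way is saved as part of the model, but it is not a parameter, so it is never updated by the optimizer during training. The forms that show up again and again in the DDPM equations are precomputed from betas, alphas_cumprod, alphas_cumprod_prev, and alphas and stored the same way.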

posterior_variance
- $\tilde{\beta}_{t} = \beta_{t}\times\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}$
- The variance of the posterior $q(x_{t-1}|x_t, x_0)$, which is used as the variance when modeling $p(x_{t-1}|x_t)$ in the reverse process.
posterior_mean_coef1, posterior_mean_coef2
- $\tilde{\mu}(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}x_t$
- The two coefficients of the mean needed to model the reverse process: coef1 multiplies $x_0$ and coef2 multiplies $x_t$.
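
As a sanity check on these formulas, the sketch below recomputes the posterior buffers with NumPy and verifies the identity from the code comment, `1. / (1. / (1. - alpha_cumprod_tm1) + alpha_t / beta_t)`. The linear schedule is used only as a convenient example; the variable names `coef_x0` and `coef_xt` are mine.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule, for illustration
alphas = 1.0 - betas
alphas_cumprod = np.cumprod(alphas)
alphas_cumprod_prev = np.concatenate([[1.0], alphas_cumprod[:-1]])

posterior_variance = betas * (1 - alphas_cumprod_prev) / (1 - alphas_cumprod)
coef_x0 = betas * np.sqrt(alphas_cumprod_prev) / (1 - alphas_cumprod)         # posterior_mean_coef1
coef_xt = (1 - alphas_cumprod_prev) * np.sqrt(alphas) / (1 - alphas_cumprod)  # posterior_mean_coef2

# alternative form quoted in the code comment (undefined at index 0, where alpha_bar_prev = 1)
alt = 1 / (1 / (1 - alphas_cumprod_prev[1:]) + alphas[1:] / betas[1:])
print(np.allclose(posterior_variance[1:], alt))

# at the first step the posterior collapses onto x_0: zero variance, weight ~1 on x_0
print(posterior_variance[0], coef_x0[0], coef_xt[0])
```

The zero variance at the first step is the reason the code clamps before taking the log in posterior_log_variance_clipped.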

The code also defines three objectives for DDPM. The branch at the very end sets loss_weight according to the objective.

pred_noise
- The objective is to predict the noise $\epsilon$, as in the original paper.
pred_x0
- The objective is to predict $x_0$ directly, and training uses that as the target.
pred_v
- The objective is to predict the newly defined velocity v.
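
For pred_v, the sketch below assumes the v-parameterization from the progressive distillation paper, $v = \sqrt{\bar{\alpha}_t}\,\epsilon - \sqrt{1-\bar{\alpha}_t}\,x_0$. I believe this is what predict_v and predict_start_from_v compute, but those functions are not shown in this post, so treat the formulas as an assumption. The point is that $x_0$ and $\epsilon$ can both be recovered from $(x_t, v)$.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar = 0.3                      # an arbitrary alpha_bar_t in (0, 1)
x0 = rng.normal(size=(2, 3))
eps = rng.normal(size=(2, 3))

# forward noising
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# v-parameterization target (assumed definition, see the note above)
v = np.sqrt(alpha_bar) * eps - np.sqrt(1 - alpha_bar) * x0

# both x_0 and eps are linear functions of (x_t, v)
x0_rec = np.sqrt(alpha_bar) * x_t - np.sqrt(1 - alpha_bar) * v
eps_rec = np.sqrt(1 - alpha_bar) * x_t + np.sqrt(alpha_bar) * v

print(np.allclose(x0_rec, x0), np.allclose(eps_rec, eps))
```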

The last part sets up image normalization and unnormalization.

normalize
- Converts the input image to [-1, 1]. DDPM is trained in this range.
unnormalize
- Maps a normalized image back to [0, 1] so that it can be rendered.

DDPM Train

    def forward(self, img, *args, **kwargs):
        b, c, h, w, device, img_size, = *img.shape, img.device, self.image_size
        assert h == img_size[0] and w == img_size[1], f'height and width of image must be {img_size}'

        # draw one timestep t per image in the batch
        t = torch.randint(0, self.num_timesteps, (b,), device=device).long()

        # normalize img to the [-1, 1] range used for training
        img = self.normalize(img)

        # pass t together with the normalized image to p_losses
        return self.p_losses(img, t, *args, **kwargs)
        
        
    def p_losses(self, x_start, t, noise = None, offset_noise_strength = None):
        b, c, h, w = x_start.shape

        noise = default(noise, lambda: torch.randn_like(x_start))

        # offset noise - https://www.crosslabs.org/blog/diffusion-with-offset-noise

        offset_noise_strength = default(offset_noise_strength, self.offset_noise_strength)

        if offset_noise_strength > 0.:
            offset_noise = torch.randn(x_start.shape[:2], device = self.device)
            noise += offset_noise_strength * rearrange(offset_noise, 'b c -> b c 1 1')

        # noise sample
        # build the noised x_t from the sampled t
        x = self.q_sample(x_start = x_start, t = t, noise = noise)

        # if doing self-conditioning, 50% of the time, predict x_start from current set of times
        # and condition with unet with that
        # this technique will slow down training by 25%, but seems to lower FID significantly

        x_self_cond = None
        # used to feed the previous prediction back in (self-conditioning)
        if self.self_condition and random() < 0.5:
            with torch.no_grad():
                x_self_cond = self.model_predictions(x, t).pred_x_start
                x_self_cond.detach_()

        # predict and take gradient step

        model_out = self.model(x, t, x_self_cond)

        if self.objective == 'pred_noise':
            target = noise # predict the added noise
        elif self.objective == 'pred_x0':
            target = x_start # predict x_0 (the original image) directly
        elif self.objective == 'pred_v': # pred_v is the default objective
            v = self.predict_v(x_start, t, noise)
            target = v
        else:
            raise ValueError(f'unknown objective {self.objective}')

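        # compare target and model_out with MSE
        # (the model is trained to output whatever the chosen objective's target is)
        loss = F.mse_loss(model_out, target, reduction = 'none')
        # average over everything except the batch dimension
        loss = reduce(loss, 'b ... -> b', 'mean')

        # per-timestep weight derived from the SNR: for pred_x0 / pred_v, timesteps with a strong
        # signal (little noise) get a larger weight so the model learns the easy region precisely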
        loss = loss * extract(self.loss_weight, t, loss.shape)
        return loss.mean()
        
    @autocast('cuda', enabled = False)
    def q_sample(self, x_start, t, noise = None):
        noise = default(noise, lambda: torch.randn_like(x_start))

        if self.immiscible:
            assign = self.noise_assignment(x_start, noise)
            noise = noise[assign]

        return (
            extract(self.sqrt_alphas_cumprod, t, x_start.shape) * x_start +
            extract(self.sqrt_one_minus_alphas_cumprod, t, x_start.shape) * noise
        )

This is the forward pass and the loss computation. forward first draws a random integer timestep for each item in the batch, uniformly from [0, timesteps - 1]. It then normalizes the image and passes img and t to the p_losses function.

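p_losses first samples noise from a standard Gaussian distribution. Then, when offset_noise_strength > 0, it applies the offset noise technique via $ \epsilon_{offset}=\epsilon+\sigma \cdot\mathcal{N}(0, I) $, where the extra Gaussian term is drawn once per (sample, channel) and broadcast over the spatial dimensions. Adding offset noise is meant to make noise prediction more stable during training. After that, q_sample is called with the image (x_start), the previously sampled t, and the noise to build $x_t$.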
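
The broadcasting in the offset noise step is easy to miss, so here is a NumPy sketch of just that part. The shapes are made up, and 0.1 is simply the value the code comment says the blog post claimed was ideal.

```python
import numpy as np

rng = np.random.default_rng(0)
b, c, h, w = 2, 3, 4, 4
offset_noise_strength = 0.1

noise = rng.normal(size=(b, c, h, w))
offset_noise = rng.normal(size=(b, c))   # one scalar per (sample, channel)

# equivalent of rearrange(offset_noise, 'b c -> b c 1 1'): broadcast over h and w
noise_with_offset = noise + offset_noise_strength * offset_noise[:, :, None, None]

# the added term is constant within each channel of each sample
added = noise_with_offset - noise
print(added.std(axis=(2, 3)).max())
```

In other words the offset shifts the mean of each channel rather than adding more per-pixel noise.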

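The U-Net (model) from earlier is given $x_t$ and $t$ and produces model_out. In the paper model_out is the noise $\epsilon_{\theta}$, but as mentioned above it can also be $x_0$ or the velocity depending on the objective. That is why a branch picks the target for the MSE according to the objective. The MSE between the model output and the target is averaged over every dimension except the batch, multiplied by the loss_weight defined earlier, and the resulting loss is returned.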
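
Here is a NumPy sketch of the loss_weight branch from `__init__`, to show what the weight looks like for each objective. The function name `loss_weight` and the linear schedule are my own choices for illustration; the three formulas are the ones in the code above.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1 - betas)
snr = alphas_cumprod / (1 - alphas_cumprod)

def loss_weight(objective, min_snr_loss_weight=False, min_snr_gamma=5):
    # optional min-SNR clipping, as in maybe_clipped_snr.clamp_(max = min_snr_gamma)
    clipped = np.minimum(snr, min_snr_gamma) if min_snr_loss_weight else snr
    if objective == "pred_noise":
        return clipped / snr
    if objective == "pred_x0":
        return clipped
    if objective == "pred_v":
        return clipped / (snr + 1)
    raise ValueError(f"unknown objective {objective}")

# without clipping: pred_noise is weighted 1 everywhere, pred_v is weighted alpha_bar_t
print(np.allclose(loss_weight("pred_noise"), 1.0))
print(np.allclose(loss_weight("pred_v"), alphas_cumprod))
```

So with the default pred_v objective and no min-SNR clipping, the weight is just $\bar{\alpha}_t$, which is large for lightly noised timesteps and small near pure noise.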

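q_sample, which p_losses uses to compute $x_t$, evaluates $\sqrt{\bar{\alpha}_t}x_0+\sqrt{1-\bar{\alpha}_t}\epsilon$. The immiscible option in q_sample reassigns the noise within the batch (noise = noise[assign]) before it is mixed in; it is a method that can be applied when some data receive excessively strong noise or when certain noise patterns are not learned.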
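
A NumPy sketch of q_sample follows. The `extract` helper here is my own stand-in for the repo's helper of the same name (which is not shown in this post): I'm assuming it gathers one schedule value per batch element and reshapes it so it broadcasts against the image.

```python
import numpy as np

def extract(a, t, x_shape):
    # pick a[t] for each batch element and reshape to (b, 1, 1, ...) for broadcasting
    return a[t].reshape(t.shape[0], *((1,) * (len(x_shape) - 1)))

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1 - betas)

x_start = rng.uniform(-1, 1, size=(4, 3, 8, 8))   # normalized images in [-1, 1]
t = rng.integers(0, 1000, size=(4,))              # one timestep per image
noise = rng.normal(size=x_start.shape)

x_t = (
    extract(np.sqrt(alphas_cumprod), t, x_start.shape) * x_start
    + extract(np.sqrt(1 - alphas_cumprod), t, x_start.shape) * noise
)
print(x_t.shape)
```

Because $x_t$ has a closed form in terms of $x_0$, training can jump to any timestep directly without simulating the chain step by step.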

    def model_predictions(self, x, t, x_self_cond = None, clip_x_start = False, rederive_pred_noise = False):
        model_output = self.model(x, t, x_self_cond)
        maybe_clip = partial(torch.clamp, min = -1., max = 1.) if clip_x_start else identity

        if self.objective == 'pred_noise':
            pred_noise = model_output
            x_start = self.predict_start_from_noise(x, t, pred_noise)
            x_start = maybe_clip(x_start)

            if clip_x_start and rederive_pred_noise:
                pred_noise = self.predict_noise_from_start(x, t, x_start)

        elif self.objective == 'pred_x0':
            x_start = model_output
            x_start = maybe_clip(x_start)
            pred_noise = self.predict_noise_from_start(x, t, x_start)

        elif self.objective == 'pred_v':
            v = model_output
            x_start = self.predict_start_from_v(x, t, v)
            x_start = maybe_clip(x_start)
            pred_noise = self.predict_noise_from_start(x, t, x_start)

        return ModelPrediction(pred_noise, x_start)
        

    def predict_start_from_noise(self, x_t, t, noise):
        return (
            extract(self.sqrt_recip_alphas_cumprod, t, x_t.shape) * x_t -
            extract(self.sqrt_recipm1_alphas_cumprod, t, x_t.shape) * noise
        )

This is model_predictions, which p_losses uses to get the self-conditioning input and which the sampling code also calls. It runs the U-Net to get model_output and then computes x_start and pred_noise according to the objective; when clip_x_start is set, x_start is clamped to [-1, 1]. For example, with objective == pred_noise, pred_noise is model_output itself, and x_start is obtained from predict_start_from_noise as $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(x_{t}-\sqrt{1-\bar{\alpha}_t}\cdot\epsilon_{\theta})$. The other objectives are handled the same way, and the result is returned as ModelPrediction, a tuple-like structure.
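
The two buffers used by predict_start_from_noise are just this formula rearranged, which the scalar sketch below checks. The expression for `eps_from_x0` is my guess at what predict_noise_from_start does (that function is not shown in this post); it is simply the algebraic inverse.

```python
import math
import random

random.seed(0)
alpha_bar = 0.4                      # an arbitrary alpha_bar_t
x0 = 0.25
eps = random.gauss(0, 1)

x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1 - alpha_bar) * eps

sqrt_recip = math.sqrt(1 / alpha_bar)         # sqrt_recip_alphas_cumprod
sqrt_recipm1 = math.sqrt(1 / alpha_bar - 1)   # sqrt_recipm1_alphas_cumprod

# predict_start_from_noise: recover x_0 from x_t and the (true) noise
x0_from_noise = sqrt_recip * x_t - sqrt_recipm1 * eps

# assumed inverse: recover the noise from x_t and x_0
eps_from_x0 = (sqrt_recip * x_t - x0) / sqrt_recipm1

print(abs(x0_from_noise - x0) < 1e-9, abs(eps_from_x0 - eps) < 1e-9)
```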

DDPM Sampling

    @torch.inference_mode()
    def sample(self, batch_size = 16, return_all_timesteps = False):
        (h, w), channels = self.image_size, self.channels
        sample_fn = self.p_sample_loop if not self.is_ddim_sampling else self.ddim_sample
        return sample_fn((batch_size, channels, h, w), return_all_timesteps = return_all_timesteps)
        
        
    # sampling
    @torch.inference_mode()
    def p_sample_loop(self, shape, return_all_timesteps = False):
        batch, device = shape[0], self.device

        # x_T is random Gaussian noise
        img = torch.randn(shape, device = device)
        imgs = [img]

        x_start = None
        # iterate over t in reverse, gradually restoring the sample down to x_0
        for t in tqdm(reversed(range(0, self.num_timesteps)), desc = 'sampling loop time step', total = self.num_timesteps):
            
            # with self-conditioning on, the previous x_start is used as an extra input for the next step
            self_cond = x_start if self.self_condition else None
            img, x_start = self.p_sample(img, t, self_cond)
            imgs.append(img)

        ret = img if not return_all_timesteps else torch.stack(imgs, dim = 1)

        # convert the data back to [0, 1] (DDPM was trained in the [-1, 1] range)
        ret = self.unnormalize(ret)
        return ret
        
    # sample x_{t-1} from x_t at the current time t
    @torch.inference_mode()
    def p_sample(self, x, t: int, x_self_cond = None):
        # get the current batch size and device
        b, *_, device = *x.shape, self.device

        # broadcast the current t to one value per batch element
        batched_times = torch.full((b,), t, device = device, dtype = torch.long)

        # use the mean and (log) variance of x_{t-1} predicted by the model
        model_mean, _, model_log_variance, x_start = self.p_mean_variance(
            x = x, t = batched_times, x_self_cond = x_self_cond, clip_denoised = True)
        noise = torch.randn_like(x) if t > 0 else 0. # no noise if t == 0

        # sample x_{t-1} as mu + sigma * noise
        pred_img = model_mean + (0.5 * model_log_variance).exp() * noise
        return pred_img, x_start
        
        
    def p_mean_variance(self, x, t, x_self_cond = None, clip_denoised = True):
        preds = self.model_predictions(x, t, x_self_cond)
        x_start = preds.pred_x_start

        if clip_denoised:
            x_start.clamp_(-1., 1.)

        model_mean, posterior_variance, posterior_log_variance = self.q_posterior(x_start = x_start, x_t = x, t = t)
        return model_mean, posterior_variance, posterior_log_variance, x_start
        
        
    def q_posterior(self, x_start, x_t, t):
        posterior_mean = (
            extract(self.posterior_mean_coef1, t, x_t.shape) * x_start +
            extract(self.posterior_mean_coef2, t, x_t.shape) * x_t
        )
        posterior_variance = extract(self.posterior_variance, t, x_t.shape)
        posterior_log_variance_clipped = extract(self.posterior_log_variance_clipped, t, x_t.shape)

        
        return posterior_mean, posterior_variance, posterior_log_variance_clipped

The sample function picks sample_fn based on is_ddim_sampling, choosing between DDPM sampling and DDIM sampling. In the DDPM case the result comes from p_sample_loop.

p_sample_loop first sets img to random Gaussian noise. It then iterates over t in reverse order, sampling step by step for the full number of timesteps. Inside the loop, p_sample returns the estimated $x_{t-1}$ and $x_{0}$, and the new image is appended to the imgs list. Finally the reconstructed sample is unnormalized and returned.

p_sample performs the step $x_t \rightarrow x_{t-1}$. It first builds a tensor filled with the current timestep t, one entry per batch element. With that, p_mean_variance computes model_mean, model_log_variance, and x_start for the whole batch. Then noise is sampled and pred_img for $x_{t-1}$ is drawn via $x_{t-1} = \mu + \exp(\frac{1}{2}\log\sigma^{2})\,\epsilon$.

p_mean_variance, used by p_sample, builds preds from $x_t$ and t, then feeds x_start (= $x_0$) together with $x_t$ into q_posterior to get model_mean, posterior_variance, and posterior_log_variance. q_posterior relies on the buffers registered earlier: posterior_variance, posterior_log_variance_clipped, posterior_mean_coef1, and posterior_mean_coef2.
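
The whole ancestral sampling loop fits in a few lines of NumPy if the U-Net is replaced by something trivial. In the sketch below the "model" is a hypothetical oracle that always predicts the same $x_0$; that is purely an assumption to keep the example runnable, and the schedule is the linear one for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1 - betas
alpha_bar = np.cumprod(alphas)
alpha_bar_prev = np.concatenate([[1.0], alpha_bar[:-1]])

posterior_variance = betas * (1 - alpha_bar_prev) / (1 - alpha_bar)
posterior_log_variance = np.log(np.clip(posterior_variance, 1e-20, None))
coef1 = betas * np.sqrt(alpha_bar_prev) / (1 - alpha_bar)
coef2 = (1 - alpha_bar_prev) * np.sqrt(alphas) / (1 - alpha_bar)

def p_sample(x_t, t, x_start_pred):
    # q_posterior mean from the predicted x_0, then mu + sigma * noise
    x_start_pred = np.clip(x_start_pred, -1.0, 1.0)
    mean = coef1[t] * x_start_pred + coef2[t] * x_t
    noise = rng.normal(size=x_t.shape) if t > 0 else 0.0   # no noise at t == 0
    return mean + np.exp(0.5 * posterior_log_variance[t]) * noise

x0_oracle = np.full((2, 2), 0.5)          # hypothetical perfect x_0 prediction
x = rng.normal(size=x0_oracle.shape)      # x_T: pure Gaussian noise
for t in reversed(range(1000)):
    x = p_sample(x, t, x0_oracle)

print(np.allclose(x, x0_oracle, atol=1e-6))
```

With a perfect $x_0$ prediction the chain lands on $x_0$, because at t = 0 the posterior mean puts all of its weight on x_start and no noise is added.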

DDIM Sampling

    # DDIM can sample deterministically
    @torch.inference_mode()
    def ddim_sample(self, shape, return_all_timesteps = False):
        # eta: controls stochasticity (0 gives deterministic DDIM, 1 gives DDPM-style noise), objective: what the model predicts (pred_x0, pred_noise, pred_v)
        batch, device, total_timesteps, sampling_timesteps, eta, objective = shape[0], self.device, self.num_timesteps, self.sampling_timesteps, self.ddim_sampling_eta, self.objective

        
        # ex) T=1000, sampling_timesteps=50 --> time_pairs [(999, 979), (979, 959), ..., (39, 19), (19, -1)]
        times = torch.linspace(-1, total_timesteps - 1, steps = sampling_timesteps + 1)   # [-1, 0, 1, 2, ..., T-1] when sampling_timesteps == total_timesteps
        times = list(reversed(times.int().tolist()))
        time_pairs = list(zip(times[:-1], times[1:])) # [(T-1, T-2), (T-2, T-3), ..., (1, 0), (0, -1)]

        # img starts as Gaussian noise
        img = torch.randn(shape, device = device)
        imgs = [img]

        # self-conditioning is supported here as well
        x_start = None

        for time, time_next in tqdm(time_pairs, desc = 'sampling loop time step'):
            time_cond = torch.full((batch,), time, device = device, dtype = torch.long)
            # with self-conditioning on, the previously predicted x_0 is used as an input at the current step
            self_cond = x_start if self.self_condition else None

            # the model predicts x_0 and the noise from x_t
            pred_noise, x_start, *_ = self.model_predictions(img, time_cond, self_cond, clip_x_start = True, rederive_pred_noise = True)

            # at the last timestep, return x_0 directly
            if time_next < 0:
                img = x_start
                imgs.append(img)
                continue
            
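            # alpha_bar_t
            alpha = self.alphas_cumprod[time]
            # alpha_bar_{t-1} (at the next, less noisy timestep)
            alpha_next = self.alphas_cumprod[time_next]

            # with eta = 1, sigma takes the value that makes the forward process Markovian, so the generative (reverse) process becomes DDPM.
            # with sigma = 0, the forward process becomes deterministic given x_{t-1} and x_0, which gives deterministic DDIM.
            # note: it is called DDIM because it is an implicit probabilistic model trained with the DDPM objective.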
            sigma = eta * ((1 - alpha / alpha_next) * (1 - alpha_next) / (1 - alpha)).sqrt()

            # scaling coefficient for the predicted-noise term
            c = (1 - alpha_next - sigma ** 2).sqrt()

            noise = torch.randn_like(img)

            img = x_start * alpha_next.sqrt() + \
                  c * pred_noise + \
                  sigma * noise

            imgs.append(img)

        ret = img if not return_all_timesteps else torch.stack(imgs, dim = 1)

        ret = self.unnormalize(ret)
        return ret

ddim sampling

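DDIM was shown to share the same optimum of the training objective as DDPM, and it replaces the forward process with a non-Markovian one so that sampling can be accelerated; the paper was presented at ICLR 2021. Typically a model trained with DDPM is sampled with DDIM, which speeds up sampling alone without retraining and can improve sample quality when only a few steps are used.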

meaning of ddim sigma value

In DDIM, eta is the factor that scales $\sigma_t$, and by adjusting it DDIM can be turned into DDPM. DDIM sampling also starts from img set to Gaussian noise, just like DDPM. time_pairs is defined separately: it is the list of timesteps split in advance into sampling steps, which is what makes accelerated sampling possible. Next, model_predictions produces pred_noise and x_start in the same way as for DDPM. Then alpha and alpha_next are looked up for the current and next timestep.

According to the DDIM paper excerpt above, setting $\sigma_t = \sqrt{(1-\alpha_{t-1})/(1-\alpha_t)}\sqrt{1-\alpha_{t}/\alpha_{t-1}}$ for all t makes the forward process Markovian, and the generative process becomes DDPM (the DDIM paper's $\alpha$ corresponds to DDPM's $\bar{\alpha}$). Following this, the code sets sigma to that value multiplied by eta. The coefficient that scales the predicted-noise term is set to $c = \sqrt{1-\alpha_{t-1}-\sigma^{2}}$ (the "direction pointing to $x_t$" term in Eq. 12). img is then assembled from these pieces, and the final result (ret) is unnormalized and returned.
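
To see the timestep pairs and the role of eta concretely, here is a NumPy sketch of those two pieces of ddim_sample. The linear schedule and the helper name `ddim_sigma` are my own choices for illustration.

```python
import numpy as np

total_timesteps, sampling_timesteps = 1000, 50

# same construction as in ddim_sample, with NumPy instead of torch
times = np.linspace(-1, total_timesteps - 1, sampling_timesteps + 1)
times = list(reversed(times.astype(int).tolist()))
time_pairs = list(zip(times[:-1], times[1:]))
print(time_pairs[:2], time_pairs[-2:])

betas = np.linspace(1e-4, 0.02, total_timesteps)
alphas_cumprod = np.cumprod(1 - betas)

def ddim_sigma(time, time_next, eta):
    alpha, alpha_next = alphas_cumprod[time], alphas_cumprod[time_next]
    return eta * np.sqrt((1 - alpha / alpha_next) * (1 - alpha_next) / (1 - alpha))

# eta = 0: no noise is injected, so sampling is deterministic
print(ddim_sigma(500, 499, eta=0.0))

# eta = 1 on adjacent steps: sigma^2 equals the DDPM posterior variance
ddpm_posterior_variance = betas[500] * (1 - alphas_cumprod[499]) / (1 - alphas_cumprod[500])
print(np.isclose(ddim_sigma(500, 499, eta=1.0) ** 2, ddpm_posterior_variance))
```

Values of eta between 0 and 1 interpolate between the two samplers, and the last pair ends at -1, which is the case where the loop returns x_start directly.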

Trainer

# trainer class

class Trainer:
    def __init__(
        self,
        diffusion_model,
        folder,
        *,
        train_batch_size = 16,
        gradient_accumulate_every = 1,
        augment_horizontal_flip = True, # images are randomly flipped horizontally
        train_lr = 1e-4,
        train_num_steps = 100000,
        ema_update_every = 10,
        ema_decay = 0.995,
        adam_betas = (0.9, 0.99),
        save_and_sample_every = 1000,
        num_samples = 25,
        results_folder = './results',
        amp = False,
        mixed_precision_type = 'fp16',
        split_batches = True,
        convert_image_to = None,
        calculate_fid = True,
        inception_block_idx = 2048,
        max_grad_norm = 1.,
        num_fid_samples = 50000,
        save_best_and_latest_only = False
    ):
        super().__init__()

        # accelerator

        self.accelerator = Accelerator(
            split_batches = split_batches,
            mixed_precision = mixed_precision_type if amp else 'no'
        )

        # model

        self.model = diffusion_model
        self.channels = diffusion_model.channels
        is_ddim_sampling = diffusion_model.is_ddim_sampling

        # default convert_image_to depending on channels

        if not exists(convert_image_to):
            convert_image_to = {1: 'L', 3: 'RGB', 4: 'RGBA'}.get(self.channels)

        # sampling and training hyperparameters

        assert has_int_squareroot(num_samples), 'number of samples must have an integer square root'
        self.num_samples = num_samples
        self.save_and_sample_every = save_and_sample_every

        self.batch_size = train_batch_size
        self.gradient_accumulate_every = gradient_accumulate_every
        assert (train_batch_size * gradient_accumulate_every) >= 16, f'your effective batch size (train_batch_size x gradient_accumulate_every) should be at least 16 or above'

        self.train_num_steps = train_num_steps
        self.image_size = diffusion_model.image_size

        self.max_grad_norm = max_grad_norm

        # dataset and dataloader

        self.ds = Dataset(folder, self.image_size, augment_horizontal_flip = augment_horizontal_flip, convert_image_to = convert_image_to)

        assert len(self.ds) >= 100, 'you should have at least 100 images in your folder. at least 10k images recommended'

        dl = DataLoader(self.ds, batch_size = train_batch_size, shuffle = True, pin_memory = True, num_workers = cpu_count())

        dl = self.accelerator.prepare(dl)
        self.dl = cycle(dl)

        # optimizer

        self.opt = Adam(diffusion_model.parameters(), lr = train_lr, betas = adam_betas)

        # for logging results in a folder periodically

        if self.accelerator.is_main_process:
            # EMA keeps a smoothed (exponential moving average) copy of diffusion_model's weights
            self.ema = EMA(diffusion_model, beta = ema_decay, update_every = ema_update_every)
            self.ema.to(self.device)

        self.results_folder = Path(results_folder)
        self.results_folder.mkdir(exist_ok = True)

        # step counter state

        self.step = 0

        # prepare model, dataloader, optimizer with accelerator

        self.model, self.opt = self.accelerator.prepare(self.model, self.opt)

        # FID-score computation

        self.calculate_fid = calculate_fid and self.accelerator.is_main_process

        if self.calculate_fid:
            from denoising_diffusion_pytorch.fid_evaluation import FIDEvaluation

            if not is_ddim_sampling:
                self.accelerator.print(
                    "WARNING: Robust FID computation requires a lot of generated samples and can therefore be very time consuming."\
                    "Consider using DDIM sampling to save time."
                )

            self.fid_scorer = FIDEvaluation(
                batch_size=self.batch_size,
                dl=self.dl,
                sampler=self.ema.ema_model,
                channels=self.channels,
                accelerator=self.accelerator,
                stats_dir=results_folder,
                device=self.device,
                num_fid_samples=num_fid_samples,
                inception_block_idx=inception_block_idx
            )

        if save_best_and_latest_only:
            assert calculate_fid, "`calculate_fid` must be True to provide a means for model evaluation for `save_best_and_latest_only`."
            self.best_fid = 1e10 # infinite

        self.save_best_and_latest_only = save_best_and_latest_only
        
        

# dataset classes

class Dataset(Dataset):
    def __init__(
        self,
        folder,
        image_size,
        exts = ['jpg', 'jpeg', 'png', 'tiff'],
        augment_horizontal_flip = False,
        convert_image_to = None
    ):
        super().__init__()
        self.folder = folder
        self.image_size = image_size
        self.paths = [p for ext in exts for p in Path(f'{folder}').glob(f'**/*.{ext}')]

        maybe_convert_fn = partial(convert_image_to_fn, convert_image_to) if exists(convert_image_to) else nn.Identity()

        self.transform = T.Compose([
            T.Lambda(maybe_convert_fn),
            T.Resize(image_size),
            T.RandomHorizontalFlip() if augment_horizontal_flip else nn.Identity(),
            T.CenterCrop(image_size),
            T.ToTensor() # also rescales pixel values to [0, 1]; the diffusion model normalizes to [-1, 1] for training and unnormalizes back to [0, 1] when sampling
        ])

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        path = self.paths[index]
        img = Image.open(path)
        return self.transform(img)

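This is the part of the Trainer that sets up its attributes. A DataLoader is built on top of the Dataset. Looking briefly at the Dataset class, the transforms run in the order optional mode conversion (T.Lambda) -> Resize -> RandomHorizontalFlip -> CenterCrop -> ToTensor. The dataloader is also registered with the accelerator so that it works in a distributed multi-GPU setup. The optimizer is Adam. Next, EMA is defined inside accelerator.is_main_process (so that it runs only once in a distributed environment). It keeps an exponential moving average of diffusion_model's weights so that the model used for sampling changes smoothly. With a decay close to 1, such as the ema_decay = 0.995 used in the experiment, each update keeps most of the previous average and mixes in only a small fraction of the current weights.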
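
To make the EMA update concrete, here is a minimal sketch of the rule on plain Python floats. The helper name `ema_update` is made up for this post, and the EMA class used by the Trainer does more than this (as far as I know it also warms up the decay), so treat it as an illustration of the idea only.

```python
def ema_update(ema_weights, current_weights, beta=0.995):
    """Illustrative helper: beta * ema + (1 - beta) * current, per weight."""
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema_weights, current_weights)]


ema = [0.0, 0.0]        # running average of the weights
weights = [1.0, -2.0]   # weights of the model being trained (held fixed here)

first_step = ema_update(ema, weights)   # moves only 0.5% of the way toward the weights

for _ in range(1000):
    ema = ema_update(ema, weights)      # after many steps the average approaches the weights

print(first_step, ema)
```

A beta close to 1 means a single noisy training step barely moves the averaged weights, which is why the EMA copy is the one used for sampling.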

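If calculate_fid is true, an FIDEvaluation object is created. It generates num_fid_samples samples with the EMA model and computes the FID between their Inception features and those of real images from the dataset. As a side note, when training I pass num_fid_samples = 32 * 10 to the Trainer. This means the evaluator builds a list of batch sizes [32, 32, ..., 32], generates the samples batch by batch, extracts Inception features from them, and returns the FID computed from those 32 * 10 features.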
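
The [32, 32, ..., 32] list can be reproduced with a few lines. This is a sketch of the splitting logic only. `split_into_groups` is a name I made up, and it is meant to mirror what the num_to_groups helper called in the training loop does.

```python
def split_into_groups(num, divisor):
    """Split `num` samples into batches of at most `divisor` (illustrative helper)."""
    groups, remainder = divmod(num, divisor)
    sizes = [divisor] * groups
    if remainder > 0:
        sizes.append(remainder)   # last, smaller batch
    return sizes


fid_batches = split_into_groups(32 * 10, 32)   # ten batches of 32
grid_batches = split_into_groups(25, 32)       # a single batch of 25
print(fid_batches, grid_batches)
```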

    def train(self):
        accelerator = self.accelerator
        device = accelerator.device

        with tqdm(initial = self.step, total = self.train_num_steps, disable = not accelerator.is_main_process) as pbar:

            while self.step < self.train_num_steps:
                self.model.train()

                total_loss = 0.

                # gradient accumulation simulates a larger batch size
                for _ in range(self.gradient_accumulate_every):
                    # fetch one batch from the dataloader and move it to the device
                    data = next(self.dl).to(device)

                    # run the forward pass under automatic mixed precision (AMP)
                    with self.accelerator.autocast():
                        loss = self.model(data)
                        # scale the loss so the accumulated gradient is the average over the micro-batches
                        loss = loss / self.gradient_accumulate_every
                        total_loss += loss.item()

                    # backward pass; accelerate takes care of mixed precision and multi-GPU
                    self.accelerator.backward(loss)

                # show the loss of the current step on the progress bar
                pbar.set_description(f'loss: {total_loss:.4f}')

                accelerator.wait_for_everyone()
                # clip the gradient norm to max_grad_norm to stabilize training
                accelerator.clip_grad_norm_(self.model.parameters(), self.max_grad_norm)

                self.opt.step()
                # reset the accumulated gradients
                self.opt.zero_grad()

                accelerator.wait_for_everyone()

                self.step += 1


                if accelerator.is_main_process:
                    self.ema.update()

                    # periodically sample images and save the model
                    if self.step != 0 and divisible_by(self.step, self.save_and_sample_every):
                        # put the EMA model in eval mode
                        self.ema.ema_model.eval()

                        # turn off autograd (saves memory)
                        with torch.inference_mode():
                            milestone = self.step // self.save_and_sample_every
                            # split num_samples into batches of at most batch_size
                            batches = num_to_groups(self.num_samples, self.batch_size)
                            
                            # run sampling --> p_sample_loop()
                            all_images_list = list(map(lambda n: self.ema.ema_model.sample(batch_size=n), batches))

                        all_images = torch.cat(all_images_list, dim = 0)

                        # save the num_samples sampled images (25 here) as a square grid
                        utils.save_image(all_images, str(self.results_folder / f'sample-{milestone}.png'), nrow = int(math.sqrt(self.num_samples)))

                        # whether to calculate fid

                        # compute FID -> measures how close the generated images are to the real ones
                        if self.calculate_fid:
                            fid_score = self.fid_scorer.fid_score() # ********* bottleneck *********
                            accelerator.print(f'fid_score: {fid_score}')

                        # save as the best model when the FID improves on the previous best
                        if self.save_best_and_latest_only:
                            if self.best_fid > fid_score:
                                self.best_fid = fid_score
                                self.save("best")
                            self.save("latest")
                        else:
                            self.save(milestone)

                pbar.update(1)

        accelerator.print('training complete')

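Most of the code is annotated with comments, so I will only go over a few important parts. At the start, the inner loop runs gradient_accumulate_every times: each iteration fetches a batch, divides the loss by gradient_accumulate_every, and calls backward so that the gradients accumulate. Once the loop finishes, opt.step() updates the weights a single time. This gives the effect of training with a larger batch.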
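
The reason the loss is divided by gradient_accumulate_every can be checked by hand. The sketch below uses a toy one-parameter model with a mean-squared-error loss (my own example, not part of the library) and compares the gradient of one full batch with the gradient accumulated over two equal-sized micro-batches.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

full_batch_grad = grad_mse(w, xs, ys)

accumulate_every = 2
micro = len(xs) // accumulate_every
accumulated_grad = 0.0
for k in range(accumulate_every):
    xb = xs[k * micro:(k + 1) * micro]
    yb = ys[k * micro:(k + 1) * micro]
    # dividing each micro-batch loss by accumulate_every divides its gradient too
    accumulated_grad += grad_mse(w, xb, yb) / accumulate_every

print(full_batch_grad, accumulated_grad)
```

Both numbers agree, so two micro-batches of 32 behave like one batch of 64 as far as the gradient is concerned.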

At the end, whenever step is divisible by save_and_sample_every, num_samples samples are generated and concatenated into all_images, which is saved as an image named after the milestone. Then, if calculate_fid is True, the FID is computed from num_fid_samples samples through the fid_scorer.fid_score() defined above. If save_best_and_latest_only is True, this score decides whether the current model is saved as the one with the best (lowest) FID.

Experiment

import kagglehub
import os
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import torch
from denoising_diffusion_pytorch import Unet, GaussianDiffusion, Trainer
from torchvision import transforms as T, utils
import math

# Download latest version
path = kagglehub.dataset_download("jessicali9530/celeba-dataset")

print("Path to dataset files:", path)

model = Unet(
    dim = 64,
    dim_mults = (1, 2, 2), # [64, 64, 128, 128]
    flash_attn = False
)

diffusion = GaussianDiffusion(
    model,
    image_size = 64,
    timesteps = 1000,    # T --> time step
    sampling_timesteps=500, # fewer steps than timesteps, so DDIM sampling is used
    beta_schedule = 'linear',
)

trainer = Trainer(
    diffusion,
    os.path.join(path, "img_align_celeba/img_align_celeba"),
    train_batch_size = 32,
    train_lr = 1e-4,
    train_num_steps = 100000,         # total training steps
    gradient_accumulate_every = 2,    # gradient accumulation steps
    ema_decay = 0.995,                # exponential moving average decay
    amp = True,                       # turn on mixed precision
    calculate_fid = True,              # whether to calculate fid during training
    num_fid_samples = 32 * 10,
    save_best_and_latest_only=True # keep only the checkpoint with the best FID (plus the latest one)
)

# trainer.train()

trainer.load(100)

num_samples = 64
all_images_list = list(map(lambda n: trainer.ema.ema_model.sample(batch_size=n), [32, 32]))
all_images = torch.cat(all_images_list, dim = 0)

utils.save_image(all_images, "./results/sample-100-test.png", nrow = int(math.sqrt(num_samples)))

dataset: celeba
GaussianDiffusion
- image_size: 64
- timesteps: 1000
- sampling_timesteps: 500 (ddim)
  - half as many sampling steps as DDPM, i.e. roughly 2x faster sampling
- beta_schedule: linear (default = sigmoid)
Trainer
- train_batch_size: 32
- train_lr: 1e-4
- train_num_steps: 100K
- gradient_accumulate_every: 2 (effectively simulates batch_size = 64)
- ema_decay: 0.995
- num_fid_samples: 320

sampling result

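These are the results of sampling from 64 random noise vectors with the best checkpoint, after training for roughly 10 hours on a single 4090.

I also wanted to run a few more experiments, such as comparing results for the same latent vector (noise) and semantic interpolation, but for lack of time I will leave them for later.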

[ Deep Learning Paper Review - PRMI Lab ] - COLMAP about SfM (Structure from Motion)

Lee현서 — Thu, 13 Mar 2025 11:40:55 +0900

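When studying 3D-GS, NeRF, and similar methods, COLMAP is used to extract both the 3D point cloud and the camera poses. I had a rough understanding of the SfM used in COLMAP, but I thought this was a good chance to learn the underlying principles properly, so I am writing them down.

COLMAP is a library that wraps SfM, which I cover today, and MVS so that they are easy to use. SfM takes images as input and produces camera parameters and a 3D point cloud as its main outputs, while MVS uses the SfM results to perform 3D reconstruction.

First, I will cover the SfM paper, Structure-from-Motion Revisited (CVPR 2016), which improves on SfM. SfM research has developed along incremental, hierarchical, and global approaches. Incremental SfM processes images sequentially and iteratively, but building a general-purpose SfM system was difficult in terms of robustness, completeness, scalability, and accuracy, and the paper sets out to solve this.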

Feature Extraction, Matching

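Feature extraction treats a small image region whose gradient is large in every direction as a corner and extracts it as a feature point. This simple approach is not robust to image transformations such as scale changes, so methods that make up for this are used instead: SIFT, which builds on the Difference of Gaussians, as well as SURF, BRISK, ORB, and FAST.

Once features have been extracted, matching pairs up features whose descriptors are highly similar. These matches are later verified with epipolar geometry.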

Geometric Verification

The results of feature extraction and matching need to be verified. The matched feature points are likely to contain outliers, so the outliers are removed with RANSAC.

✅ RANSAC-based Geometric Verification

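A small subset of the matched feature points is selected at random (e.g., 8 points).
The selected points are used to estimate an F-matrix or an H-matrix. (epipolar geometry)
The error of every other match with respect to the estimated model is computed. (epipolar constraint)
A match whose error is below the threshold is considered an inlier.
After repeating this process many times, the model with the largest number of inliers is selected.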
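
The loop above has the same shape for any model. To keep the sketch short and dependency-free I swap the fundamental-matrix estimation for a 2D line fit (minimal sample of two points), so this shows only the RANSAC skeleton and is not what COLMAP runs. All names are my own.

```python
import random


def fit_line(p, q):
    """Line y = a * x + b through two points (the minimal sample for a line)."""
    (x1, y1), (x2, y2) = p, q
    a = (y2 - y1) / (x2 - x1)
    return a, y1 - a * x1


def ransac_line(points, threshold=0.1, iterations=100, seed=0):
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        p, q = rng.sample(points, 2)               # random minimal sample
        if p[0] == q[0]:
            continue                                # vertical pair cannot be written as y = a x + b
        a, b = fit_line(p, q)                       # estimate the model
        inliers = [(x, y) for x, y in points        # error below threshold -> inlier
                   if abs(a * x + b - y) < threshold]
        if len(inliers) > len(best_inliers):        # keep the model with the most inliers
            best_model, best_inliers = (a, b), inliers
    return best_model, best_inliers


inlier_points = [(float(x), 2.0 * x + 1.0) for x in range(10)]   # points on y = 2x + 1
outlier_points = [(1.0, 9.0), (4.0, -3.0), (7.0, 30.0)]          # wrong "matches"
model, inliers = ransac_line(inlier_points + outlier_points)
print(model, len(inliers))
```

For geometric verification, the minimal sample becomes a set of point matches, fit_line becomes the F or H estimation, and the residual becomes the epipolar (or transfer) error.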

✅ Epipolar Constraint Check

The fundamental matrix $F$ is used to check whether a match is geometrically consistent.
A matched pair $ p, p^{'}$ must satisfy ${p^{'}}^{T}Fp=0$, that is, $p^{'}$ must lie on the epipolar line $Fp$ of $p$.
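
A small numeric check of the constraint, as a sketch under a toy assumption of my own: identity intrinsics and a pure translation along x between the two cameras, in which case $F$ is the cross-product matrix of $t = (1, 0, 0)$ and matching points must share the same image row.

```python
def epipolar_residual(F, p, p_prime):
    """Return p'^T F p for homogeneous image points p and p'."""
    Fp = [sum(F[r][c] * p[c] for c in range(3)) for r in range(3)]   # epipolar line of p
    return sum(p_prime[r] * Fp[r] for r in range(3))


# Assumed toy setup: K = I, R = I, t = (1, 0, 0), so F = [t]_x
F = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

p = (0.3, 0.2, 1.0)
good_match = (0.1, 0.2, 1.0)   # same row as p: lies on the epipolar line
bad_match = (0.1, 0.5, 1.0)    # different row: violates the constraint

good_residual = epipolar_residual(F, p, good_match)
bad_residual = epipolar_residual(F, p, bad_match)
print(good_residual, bad_residual)
```

In practice the residual is never exactly zero, so it is compared against a threshold, which is exactly the inlier test in the RANSAC loop.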

✅ Final Refinement

The inliers selected by RANSAC are used to recompute the final Fundamental Matrix or Essential Matrix.
If needed, the non-linear optimization described later (Bundle Adjustment) is run to obtain a more precise alignment.

Initialization

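First, two images are registered. This step matters for both robustness and performance. If the initial pair is chosen from a region where images from many cameras overlap, the overlapping area is optimized repeatedly, which yields a robust and accurate reconstruction. Conversely, if the initial pair is chosen from a sparsely covered region, there are fewer features to process repeatedly in the Bundle Adjustment stage, so the reconstruction quality drops but the runtime drops as well. This is the trade-off.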

Image Registration

The fundamental matrix was computed in the Geometric Verification stage above. The intrinsic matrix is obtained through (self-)calibration, the essential matrix is built from the two, and the extrinsics are then estimated from it.

Triangulation

Triangulation estimates a 3D point from feature points matched across two or more images. The 3D points obtained this way make up the SfM 3D point cloud. It relies on the assumption that previously registered 3D points are observed well when projected into the new image.
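
As a rough illustration, the sketch below triangulates one point from two viewing rays with the midpoint method: it finds the closest points on the two rays and averages them. This is a simple stand-in that I chose because it fits in a few lines, not the triangulation COLMAP implements, and the camera centers and directions are made-up values.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + u*d2."""
    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b                      # zero only for parallel rays
    s = (b * e - c * d) / denom
    u = (a * e - b * d) / denom
    p1 = [o + s * k for o, k in zip(o1, d1)]   # closest point on ray 1
    p2 = [o + u * k for o, k in zip(o2, d2)]   # closest point on ray 2
    return [(x + y) / 2.0 for x, y in zip(p1, p2)]


# two cameras one unit left and right of the origin, both looking at (0, 0, 5)
o1, d1 = [-1.0, 0.0, 0.0], [1.0, 0.0, 5.0]
o2, d2 = [1.0, 0.0, 0.0], [-1.0, 0.0, 5.0]
point = triangulate_midpoint(o1, d1, o2, d2)
print(point)
```

With noisy matches the two rays no longer intersect, and the midpoint is then only an initial estimate that Bundle Adjustment refines.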

triangulation

⭐️Bundle Adjustment

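(i, j): example with 3 cameras and 4 points

Non-linear optimization is performed to correct the errors of the preceding steps. Uncertainty in the camera poses registered during image registration propagates into triangulation and can lead to uncertain 3D points. Bundle Adjustment (BA) is performed to deal with this. BA jointly refines the camera parameters $P_c$ and the 3D points $X_k$ by minimizing the reprojection error $$ E = \sum_{j} \rho_{j} \left( \left\| \pi(P_c, X_k) - x_j \right\|_2^2 \right) $$ where $\pi$ is the projection function, $x_j$ is the observed 2D feature, and $\rho_{j}$ is a loss function that down-weights outliers.

The objective above is minimized with Levenberg-Marquardt, a non-linear least-squares algorithm. It can be seen as Gradient-Descent + Gauss-Newton: both use the Jacobian $J$ of the residuals, Gauss-Newton approximates the Hessian with $J^{T}J$, and a damping factor moves the update between a gradient-descent-like step (large damping) and a Gauss-Newton step (small damping). Without this step the 3D points can differ from the real structure, so it is a very important part of the pipeline.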
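
To show how the damping factor switches between the two behaviours, here is a minimal Levenberg-Marquardt sketch on a one-parameter curve fit, $y = e^{ax}$. The problem, the damping schedule (divide or multiply by 10), and all names are my own choices for illustration. A real BA solver works on thousands of parameters with sparse linear algebra.

```python
import math


def levenberg_marquardt(xs, ys, a=0.0, damping=1e-3, iterations=100):
    """Fit y = exp(a * x) by minimizing the sum of squared residuals."""
    def residuals(a):
        return [y - math.exp(a * x) for x, y in zip(xs, ys)]

    def cost(a):
        return sum(r * r for r in residuals(a))

    for _ in range(iterations):
        r = residuals(a)
        J = [-x * math.exp(a * x) for x in xs]       # d r_i / d a
        JtJ = sum(j * j for j in J)                  # Gauss-Newton approximation of the Hessian
        Jtr = sum(j * ri for j, ri in zip(J, r))     # gradient of the cost (up to a factor of 2)
        step = -Jtr / (JtJ + damping)
        if cost(a + step) < cost(a):
            a += step
            damping /= 10.0    # step helped: behave more like Gauss-Newton
        else:
            damping *= 10.0    # step hurt: shrink it toward a small gradient-descent step
    return a


xs = [0.0, 1.0, 2.0, 3.0]
ys = [math.exp(0.5 * x) for x in xs]   # data generated with a = 0.5
a_hat = levenberg_marquardt(xs, ys)
print(a_hat)
```

Starting from a = 0 the first undamped steps overshoot and are rejected, the damping grows until a step lowers the cost, and near the solution the damping shrinks again so the method converges like Gauss-Newton.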

Outlier Filtering

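PnP (Perspective-n-Point) estimates the camera pose of a new image. It is an algorithm that computes R, t from correspondences between 3D points and 2D image features. When the pose of a new image is estimated with PnP from its correspondences to the existing 3D points, RANSAC is used to discard outliers, and this is the Outlier Filtering step. The main steps of the RANSAC-PnP algorithm are as follows.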

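Four 2D-3D correspondences are selected at random. (Minimal Sample Set)
The camera pose (R, t) is computed from these four points with PnP.
The other 3D points are projected with the computed (R, t) and their reprojection error is calculated.
- The distance between the originally matched 2D point and the 3D->2D projected point is measured.
- If the reprojection error is below the threshold, the point is classified as an inlier.
This process is repeated and the model with the most inliers is selected.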
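
The scoring part of these steps (project, measure the reprojection error, threshold) can be sketched as below. The PnP solve itself is left out, the pose is simply given, and the focal length of 1 and the point values are toy assumptions of mine.

```python
def project(R, t, X, focal=1.0):
    """Pinhole projection of a 3D point X with pose (R, t); focal is an assumed toy intrinsic."""
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    return (focal * Xc[0] / Xc[2], focal * Xc[1] / Xc[2])


def find_inliers(R, t, points_3d, points_2d, threshold=0.01):
    """Keep the 2D-3D matches whose reprojection error is below the threshold."""
    inliers = []
    for X, x in zip(points_3d, points_2d):
        u, v = project(R, t, X)
        error = ((u - x[0]) ** 2 + (v - x[1]) ** 2) ** 0.5
        if error < threshold:
            inliers.append((X, x))
    return inliers


R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # candidate pose from a PnP solve
t = [0.0, 0.0, 0.0]
points_3d = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0), (2.0, 0.0, 4.0)]
points_2d = [(0.0, 0.0), (0.5, 0.5), (0.9, 0.0)]          # the last match is wrong

inliers = find_inliers(R, t, points_3d, points_2d)
print(len(inliers))
```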

Summary

When unordered input images come in, features are extracted with an algorithm such as SIFT. Feature matching is then performed, and the correspondences are verified with epipolar geometry. After that, an initial image pair is chosen in Initialization, and in Image Registration the pose is recovered as Fundamental Matrix --> Essential matrix --> R, t. These poses are stored and used for Triangulation to obtain the 3D points. Bundle Adjustment and outlier filtering then reduce the error, and this whole sequence is repeated every time a new image is added, refining the camera poses and 3D points iteratively.

[ Deep Learning Code Review - PRMI Lab] - Analyzing the NeRF Code

Lee현서 — Tue, 11 Mar 2025 18:38:10 +0900

https://github.com/yenchenlin/nerf-pytorch

GitHub - yenchenlin/nerf-pytorch: A PyTorch implementation of NeRF (Neural Radiance Fields) that reproduces the results.


https://arxiv.org/abs/2003.08934

NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis


[ 3D Vision - Study ] - Nerual Fields and 3D Representations — 현서의 개발 일지


The datasets used with NeRF include llff, deepvoxels, and blender ("Realistic Synthetic 360"). This post analyzes the code for llff. The datasets differ in properties such as Lambertian reflectance and in how the camera views are distributed.

DataLoader

def train():

    parser = config_parser()
    args = parser.parse_args()

    # Load data
    K = None
    if args.dataset_type == 'llff':
        images, poses, bds, render_poses, i_test = load_llff_data(args.datadir, args.factor,
                                                                  recenter=True, bd_factor=.75,
                                                                  spherify=args.spherify)
        hwf = poses[0,:3,-1]
        poses = poses[:,:3,:4]
        # poses [20, 3, 4] -> extrinsic matrix
        print("poses: llff", poses.shape)

        # images [20, 378, 504, 3], render_poses [120, 3, 5], hwf, 1dim vector [378. 504. 407.5658], ./data/nerf_llff_data/fern
        print('Loaded llff', images.shape, render_poses.shape, hwf, args.datadir)
        # render_poses are c2w matrices

        if not isinstance(i_test, list):
            i_test = [i_test]

        if args.llffhold > 0:
            # automatically hold out test images for llff
            print('Auto LLFF holdout,', args.llffhold)
            i_test = np.arange(images.shape[0])[::args.llffhold]

        i_val = i_test
        i_train = np.array([i for i in np.arange(int(images.shape[0])) if
                        (i not in i_test and i not in i_val)])

        print('DEFINING BOUNDS')
        if args.no_ndc:
            near = np.ndarray.min(bds) * .9
            far = np.ndarray.max(bds) * 1.
            
        else:
            near = 0.
            far = 1.
        print('NEAR FAR', near, far)

For the llff dataset, load_llff_data returns images, poses, bds, render_poses, and i_test.

images: the image data.
- images has shape [20, 378, 504, 3]: 20 images, each of size 378x504x3.
poses
- poses is split into hwf (height, width, focal_length) and poses (extrinsic matrix). poses holds 3x4 extrinsic matrices. (c2w matrix)
render_poses: the camera poses used for rendering.
- render_poses has shape [120, 3, 5], i.e. 120 matrices of size 3x5, whose leading 3x4 part is the extrinsic matrix needed for rendering.
near, far
- Without NDC (no_ndc), bds (the depth bounds) are used to set near = 0.9 * min(bds) and far = 1.0 * max(bds), the minimum and maximum depths for sampling points along a ray.
- With NDC, near = 0 and far = 1 are used.
i_train, i_val, i_test
- The index lists for train, validation, and test.
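
The hold-out split and the bounds are simple enough to replay in plain Python. This is a sketch of the same logic without numpy. The function names are mine, bds is flattened to a plain list, and llffhold = 8 and the bound values are assumed purely for illustration.

```python
def holdout_split(num_images, llffhold):
    """Every llffhold-th image becomes a test (and validation) image."""
    i_test = list(range(num_images))[::llffhold]
    i_train = [i for i in range(num_images) if i not in i_test]
    return i_train, i_test


def ray_bounds(bds, no_ndc):
    """near/far as in the snippet above: scaled depth bounds without NDC, [0, 1] with NDC."""
    if no_ndc:
        return min(bds) * 0.9, max(bds) * 1.0
    return 0.0, 1.0


i_train, i_test = holdout_split(20, 8)
near, far = ray_bounds([1.2, 3.0, 8.5], no_ndc=True)
print(i_test, len(i_train), near, far)
```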

Ray

# Ray helpers
def get_rays(H, W, K, c2w):
    print('c2w.shape ', c2w.shape)
    print('c2w: ', c2w)
    i, j = torch.meshgrid(torch.linspace(0, W-1, W), torch.linspace(0, H-1, H))  # pytorch's meshgrid has indexing='ij'
    i = i.t()
    j = j.t()
    # i [378, 504], j [378, 504]
    print('check i, j', i.shape, j.shape)
    print(i, '\n', j)

    # dirs [378, 504, 3]

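    # move the pixels from pixel coordinates onto the normalized image plane --> 3D vectors on the plane z = -1
    # [X, Y, Z] -> [X, -Y, -Z]: the OpenCV/COLMAP camera convention (y down, z forward) differs from the OpenGL-style convention used here (y up, z backward)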
    dirs = torch.stack([(i-K[0][2])/K[0][0], -(j-K[1][2])/K[1][1], -torch.ones_like(i)], -1)
    
    print('dirs.shape', dirs.shape)
    print(dirs)
    # Rotate ray directions from camera frame to the world frame

    # rotate each dir with the c2w rotation (dot product with every row of R) to express it in world coordinates
    rays_d = torch.sum(dirs[..., np.newaxis, :] * c2w[:3,:3], -1)
    rays_o = c2w[:3,-1].expand(rays_d.shape)
    print('check ray', rays_d.shape, rays_o.shape)
    print(rays_d, '\n', rays_o)
    return rays_o, rays_d

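The get_rays function builds the rays: vectors on the normalized image plane (z = -1 in this camera convention) expressed in the world coordinate frame.

dirs [378, 504, 3]
- The normalized image plane, in the OpenGL-style camera convention used by NeRF, is computed from the meshgrid and K (intrinsic matrix).
rays_d
- The dot product of dirs [378, 504, 3] with c2w[:3, :3] (the 3x3 rotation matrix R) gives a [378, 504, 3] direction tensor.
rays_o
- c2w[:3, -1] (the 3x1 translation vector t) is extracted and expanded to the shape of rays_d.

In addition, get_rays_np is the same function rewritten for numpy. Later, rays_d + rays_o gives the world coordinates of the points on the camera's normalized image plane, and supplying a depth as in rays_d * depth + rays_o gives the sample points along each ray, which is how the Stratified Sampling from the NeRF paper is implemented.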
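
For a single pixel, the same computation can be written without tensors. The sketch below follows the formulas in get_rays. The helper names and the toy K and c2w values are mine.

```python
def pixel_to_ray(i, j, K, c2w):
    """Ray origin and direction in world coordinates for pixel column i, row j."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    d_cam = [(i - cx) / fx, -(j - cy) / fy, -1.0]                       # direction in the camera frame
    d_world = [sum(c2w[r][c] * d_cam[c] for c in range(3)) for r in range(3)]   # R @ d_cam
    origin = [c2w[r][3] for r in range(3)]                              # camera center t
    return origin, d_world


def point_on_ray(origin, direction, depth):
    """o + depth * d, the point sampled at a given depth along the ray."""
    return [o + depth * d for o, d in zip(origin, direction)]


K = [[100.0, 0.0, 2.0], [0.0, 100.0, 1.5], [0.0, 0.0, 1.0]]
c2w = [[1.0, 0.0, 0.0, 1.0],
       [0.0, 1.0, 0.0, 2.0],
       [0.0, 0.0, 1.0, 3.0]]   # identity rotation, camera at (1, 2, 3)

origin, direction = pixel_to_ray(2.0, 1.5, K, c2w)   # the principal point
sample = point_on_ray(origin, direction, 2.0)
print(origin, direction, sample)
```

The ray through the principal point looks straight down the camera's -z axis, so moving 2 units along it from (1, 2, 3) lands at (1, 2, 1).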

# render calls batchify_rays, which in turn calls render_rays for each chunk.
def render(H, W, K, chunk=1024*32, rays=None, c2w=None, ndc=True,
                  near=0., far=1.,
                  use_viewdirs=False, c2w_staticcam=None,
                  **kwargs):
    """Render rays
    Args:
      H: int. Height of image in pixels.
      W: int. Width of image in pixels.
      focal: float. Focal length of pinhole camera.
      chunk: int. Maximum number of rays to process simultaneously. Used to
        control maximum memory usage. Does not affect final results.
      rays: array of shape [2, batch_size, 3]. Ray origin and direction for
        each example in batch.
      c2w: array of shape [3, 4]. Camera-to-world transformation matrix.
      ndc: bool. If True, represent ray origin, direction in NDC coordinates.
      near: float or array of shape [batch_size]. Nearest distance for a ray.
      far: float or array of shape [batch_size]. Farthest distance for a ray.
      use_viewdirs: bool. If True, use viewing direction of a point in space in model.
      c2w_staticcam: array of shape [3, 4]. If not None, use this transformation matrix for 
       camera while using other c2w argument for viewing directions.
    Returns:
      rgb_map: [batch_size, 3]. Predicted RGB values for rays.
      disp_map: [batch_size]. Disparity map. Inverse of depth.
      acc_map: [batch_size]. Accumulated opacity (alpha) along a ray.
      extras: dict with everything returned by render_rays().
    """
    if c2w is not None:
        # special case to render full image
        rays_o, rays_d = get_rays(H, W, K, c2w)
    else:
        # use provided ray batch
        rays_o, rays_d = rays

    # viewdirs holds the viewing direction of each ray, fed to the MLP as the view direction input
    # the paper uses (phi, theta), but here rays_d is normalized and passed as a unit vector with 3 components
    if use_viewdirs:
        # provide ray directions as input
        viewdirs = rays_d
        if c2w_staticcam is not None:
            # special case to visualize effect of viewdirs
            rays_o, rays_d = get_rays(H, W, K, c2w_staticcam)
        viewdirs = viewdirs / torch.norm(viewdirs, dim=-1, keepdim=True)
        viewdirs = torch.reshape(viewdirs, [-1,3]).float()

        print('viewdirs: ', viewdirs)

    sh = rays_d.shape # [..., 3]
    if ndc:
        # for forward facing scenes
        rays_o, rays_d = ndc_rays(H, W, K[0][0], 1., rays_o, rays_d)

    # Create ray batch
    rays_o = torch.reshape(rays_o, [-1,3]).float() # [378, 504, 3] -> [190512, 3]
    rays_d = torch.reshape(rays_d, [-1,3]).float()

    # broadcast near and far to one value per ray, shape [N, 1]
    near, far = near * torch.ones_like(rays_d[...,:1]), far * torch.ones_like(rays_d[...,:1])

    # [190512, 8] -> 3(rays_o) + 3(rays_d) + 2(near + far)
    rays = torch.cat([rays_o, rays_d, near, far], -1)

    # use_viewdirs=True -> the view direction is used as an input.
    if use_viewdirs:
        rays = torch.cat([rays, viewdirs], -1)

    # rays: [190512, 11]
    print('rays shape: ', rays.shape)

    # Render and reshape -> for OOM
    # all_ret: dict holding the rendering results
    all_ret = batchify_rays(rays, chunk, **kwargs)
    for k in all_ret:
        k_sh = list(sh[:-1]) + list(all_ret[k].shape[1:])
        all_ret[k] = torch.reshape(all_ret[k], k_sh)

    k_extract = ['rgb_map', 'disp_map', 'acc_map']
    ret_list = [all_ret[k] for k in k_extract]
    ret_dict = {k : all_ret[k] for k in all_ret if k not in k_extract}
    return ret_list + [ret_dict]

Now the rays have to be rendered. For this, the render function calls batchify_rays(), whose output is a dict holding the rendering results.

r(d) -> view direction

use_viewdirs
- Means the view direction is used as a network input.
view_dirs
- The view direction is the direction from which each ray looks at the scene. The paper expresses it in spherical coordinates as $(\theta, \phi)$, but the code uses the normalized rays_d, a vector with 3 components.
chunk
- chunk is the number of rays per minibatch in batchify_rays.

def batchify_rays(rays_flat, chunk=1024*32, **kwargs):
    """Render rays in smaller minibatches to avoid OOM.
    """
    all_ret = {}
    for i in range(0, rays_flat.shape[0], chunk):
        # render_rays takes the flattened rays and returns the rendered outputs (colors, disparity, opacity, ...) for them.
        ret = render_rays(rays_flat[i:i+chunk], **kwargs)

        # collect the outputs of every chunk in the all_ret dict, one list per key
        for k in ret:
            if k not in all_ret:
                all_ret[k] = []
            all_ret[k].append(ret[k])

    all_ret = {k : torch.cat(all_ret[k], 0) for k in all_ret}
    return all_ret

batchify_rays splits the rays into minibatches to avoid OOM while rendering, and collects the results in a dict.
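
The same chunk-and-merge pattern works for any function that returns a dict of per-item results. Below is a sketch on plain lists, where `toy_render` is a made-up stand-in for render_rays and `process_in_chunks` is my own name for the pattern.

```python
def process_in_chunks(items, fn, chunk):
    """Apply fn to slices of at most `chunk` items and merge the dict outputs key by key."""
    all_ret = {}
    for start in range(0, len(items), chunk):
        ret = fn(items[start:start + chunk])
        for key, values in ret.items():
            all_ret.setdefault(key, []).extend(values)
    return all_ret


def toy_render(rays):
    """Made-up stand-in for render_rays: one output value per ray and per key."""
    return {"rgb": [r * 2 for r in rays], "depth": [r + 1 for r in rays]}


rays = list(range(5))
chunked = process_in_chunks(rays, toy_render, chunk=2)
print(chunked)
```

Because every ray is rendered independently, the chunked result equals the result of one big call. The chunk size only trades memory for the number of calls.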

Coarse Sampling

def render_rays(ray_batch,
                network_fn,
                network_query_fn,
                N_samples,
                retraw=False,
                lindisp=False,
                perturb=0.,
                N_importance=0,
                network_fine=None,
                white_bkgd=False,
                raw_noise_std=0.,
                verbose=False,
                pytest=False):
    """Volumetric rendering.
    Args:
      ray_batch: array of shape [batch_size, ...]. All information necessary
        for sampling along a ray, including: ray origin, ray direction, min
        dist, max dist, and unit-magnitude viewing direction.
      network_fn: function. Model for predicting RGB and density at each point
        in space.
      network_query_fn: function used for passing queries to network_fn.
      N_samples: int. Number of different times to sample along each ray.
      retraw: bool. If True, include model's raw, unprocessed predictions.
      lindisp: bool. If True, sample linearly in inverse depth rather than in depth.
      perturb: float, 0 or 1. If non-zero, each ray is sampled at stratified
        random points in time.
      N_importance: int. Number of additional times to sample along each ray.
        These samples are only passed to network_fine.
      network_fine: "fine" network with same spec as network_fn.
      white_bkgd: bool. If True, assume a white background.
      raw_noise_std: ...
      verbose: bool. If True, print more debugging info.
    Returns:
      rgb_map: [num_rays, 3]. Estimated RGB color of a ray. Comes from fine model.
      disp_map: [num_rays]. Disparity map. 1 / depth.
      acc_map: [num_rays]. Accumulated opacity along each ray. Comes from fine model.
      raw: [num_rays, num_samples, 4]. Raw predictions from model.
      rgb0: See rgb_map. Output for coarse model.
      disp0: See disp_map. Output for coarse model.
      acc0: See acc_map. Output for coarse model.
      z_std: [num_rays]. Standard deviation of distances along ray for each
        sample.
    """
    N_rays = ray_batch.shape[0] # number of rays
    rays_o, rays_d = ray_batch[:,0:3], ray_batch[:,3:6] # [N_rays, 3] each
    viewdirs = ray_batch[:,-3:] if ray_batch.shape[-1] > 8 else None

    bounds = torch.reshape(ray_batch[...,6:8], [-1,1,2])
    near, far = bounds[...,0], bounds[...,1] # [-1,1]

    # N_samples: 64
    # N_samples evenly spaced values between 0 and 1
    t_vals = torch.linspace(0., 1., steps=N_samples)
    if not lindisp: # lindisp -> inverse depth
        z_vals = near * (1.-t_vals) + far * (t_vals)
    else:
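        # z_vals stores the depth of each sample point from the camera -> with lindisp=False (branch above) they are evenly spaced in depth
        # here they are evenly spaced in inverse depth between near and far (linear interpolation of 1/near and 1/far)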
        z_vals = 1./(1./near * (1.-t_vals) + 1./far * (t_vals))

    z_vals = z_vals.expand([N_rays, N_samples])

    # perturb enables stratified sampling.
    # a random position is drawn inside the interval around each sample (between the midpoints to its neighbours)
    if perturb > 0.:
        # get intervals between samples
        mids = .5 * (z_vals[...,1:] + z_vals[...,:-1])
        upper = torch.cat([mids, z_vals[...,-1:]], -1)
        lower = torch.cat([z_vals[...,:1], mids], -1)
        # stratified samples in those intervals
        t_rand = torch.rand(z_vals.shape)

        # Pytest, overwrite u with numpy's fixed random numbers
        if pytest:
            np.random.seed(0)
            t_rand = np.random.rand(*list(z_vals.shape))
            t_rand = torch.Tensor(t_rand)

        z_vals = lower + (upper - lower) * t_rand

This is the first part of render_rays, the function that actually performs volumetric rendering. It starts by unpacking the array that was packed into rays of shape [.., 11] earlier.

N_rays
- The number of rays, usually 1024.
N_samples
- The number of samples drawn per ray (Coarse Sampling). config/fern.txt sets it to 64.
t_vals
- N_samples evenly spaced values in [0, 1].
z_vals
- The distance from the camera to each sampled point. If lindisp (= inverse depth) is True, the values are evenly spaced in inverse depth, and if it is False they are evenly spaced in depth. In both cases they are depths within [near, far].
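
The two spacings are easy to compare on a short ray. This is a sketch of the two z_vals formulas on plain lists. The function name and the near/far values are mine, and near has to be positive for the inverse-depth branch.

```python
def depth_samples(near, far, n_samples, lindisp=False):
    """Depths along a ray, evenly spaced in depth or in inverse depth."""
    t_vals = [k / (n_samples - 1) for k in range(n_samples)]
    if not lindisp:
        return [near * (1.0 - t) + far * t for t in t_vals]
    return [1.0 / ((1.0 - t) / near + t / far) for t in t_vals]


linear = depth_samples(1.0, 4.0, 4)                    # even steps in depth
inverse = depth_samples(1.0, 4.0, 4, lindisp=True)     # even steps in 1/depth
print(linear, inverse)
```

Sampling in inverse depth places more samples near the camera and fewer far away, while both variants start at near and end at far.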

The branch that runs when perturb > 0. at the end performs NeRF's stratified sampling. It builds mids, upper, and lower, draws t_rand (random numbers in [0, 1)), and computes lower + (upper - lower) * t_rand.
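
Here is the same jittering on a single ray with plain lists, as a sketch. The function name and the z_vals are made up, and Python's random module stands in for torch.rand.

```python
import random


def stratified_samples(z_vals, rng):
    """Draw one random depth inside the bin around each evenly spaced depth."""
    mids = [0.5 * (a + b) for a, b in zip(z_vals[:-1], z_vals[1:])]
    upper = mids + [z_vals[-1]]
    lower = [z_vals[0]] + mids
    return [lo + (hi - lo) * rng.random() for lo, hi in zip(lower, upper)]


z_vals = [2.0, 3.0, 4.0, 5.0, 6.0]
jittered = stratified_samples(z_vals, random.Random(0))
print(jittered)
```

Every bin still contributes exactly one sample, so the ray stays covered from near to far, but over many training iterations the network is queried at continuous positions instead of a fixed grid.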

    # casting rays from the image plane with the given camera pose gives the world coordinates of the sample points
    # in the paper: o(rays_o) + t(z_vals)d(rays_d)
    pts = rays_o[...,None,:] + rays_d[...,None,:] * z_vals[...,:,None] # [N_rays, N_samples, 3]

From the z_vals computed above, rays_o + rays_d * (depth = z_vals) gives the coordinates of the points to sample, which are stored in pts. [N_rays, N_samples, 3]

Fine Sampling

    raw = network_query_fn(pts, viewdirs, network_fn)
    # raw: prediction from model
    rgb_map, disp_map, acc_map, weights, depth_map = raw2outputs(raw, z_vals, rays_d, raw_noise_std, white_bkgd, pytest=pytest)

    # number of fine samples
    if N_importance > 0:

        rgb_map_0, disp_map_0, acc_map_0 = rgb_map, disp_map, acc_map  

        # midpoints of the depths in the z_vals from coarse sampling
        z_vals_mid = .5 * (z_vals[...,1:] + z_vals[...,:-1])

        # sample_pdf uses the coarse weights to perform inverse transform sampling.
        z_samples = sample_pdf(z_vals_mid, weights[...,1:-1], N_importance, det=(perturb==0.), pytest=pytest)
        z_samples = z_samples.detach()

        z_vals, _ = torch.sort(torch.cat([z_vals, z_samples], -1), -1)
        pts = rays_o[...,None,:] + rays_d[...,None,:] * z_vals[...,:,None] # [N_rays, N_samples + N_importance, 3]

        run_fn = network_fn if network_fine is None else network_fine
#         raw = run_network(pts, fn=run_fn)

        # network_query_fn feeds pts and the view direction to the MLP and returns raw
        # it is called raw because it is the network output without any post-processing
        # second network_query_fn call: fine network

        raw = network_query_fn(pts, viewdirs, run_fn)

        # raw2outputs is a post-processing function that converts raw into rgb_map, disp_map, acc_map, weights, depth_map.
        # this is where the volume rendering equation from the NeRF paper is applied
        rgb_map, disp_map, acc_map, weights, depth_map = raw2outputs(raw, z_vals, rays_d, raw_noise_std, white_bkgd, pytest=pytest)

    ret = {'rgb_map' : rgb_map, 'disp_map' : disp_map, 'acc_map' : acc_map}
    if retraw:
        ret['raw'] = raw
    if N_importance > 0:
        ret['rgb0'] = rgb_map_0
        ret['disp0'] = disp_map_0
        ret['acc0'] = acc_map_0
        ret['z_std'] = torch.std(z_samples, dim=-1, unbiased=False)  # [N_rays]

    for k in ret:
        if (torch.isnan(ret[k]).any() or torch.isinf(ret[k]).any()) and DEBUG:
            print(f"! [Numerical Error] {k} contains nan or inf.")

    return ret

The points from coarse sampling are fed to the NeRF model to obtain weights, and inverse transform sampling then draws additional samples where the weights are significant.

network_query_fn produces the raw outputs, and raw2outputs turns them into meaningful results.

N_importance
- Like N_samples, the number of samples drawn on one ray during fine sampling. (For llff it is set to 64, the same as N_samples.)
z_vals_mid
- The midpoints of the z_vals produced by coarse sampling.
sample_pdf (inverse transform sampling)
- Takes the weights from coarse sampling (how much each sample contributes to the rendered color), normalizes them into a PDF, and performs inverse transform sampling through the cumulative distribution (CDF) to do the fine sampling.

network_query_fn
- Feeds pts and the view direction to the MLP and returns raw (rgb, density).
raw2outputs
- Post-processes raw into rgb_map, disp_map, acc_map, weights, and depth_map. This is where the volume rendering equation is applied.
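
Inverse transform sampling itself fits in a few lines. The sketch below is my own simplified version on plain lists (one ray, fixed u values instead of random ones) and is not the repository's sample_pdf: it builds the CDF from the weights, finds the bin each u falls into, and interpolates linearly inside that bin.

```python
import bisect


def sample_from_weights(bins, weights, us):
    """bins: bin edges (len(weights) + 1 values); us: values in [0, 1)."""
    total = sum(weights)
    cdf = [0.0]
    for w in weights:
        cdf.append(cdf[-1] + w / total)              # cumulative distribution over the bins
    samples = []
    for u in us:
        k = bisect.bisect_right(cdf, u) - 1          # bin whose CDF range contains u
        k = min(max(k, 0), len(weights) - 1)
        span = cdf[k + 1] - cdf[k]
        frac = (u - cdf[k]) / span if span > 0 else 0.5
        samples.append(bins[k] + frac * (bins[k + 1] - bins[k]))
    return samples


bins = [0.0, 1.0, 2.0, 3.0]      # depth bin edges
weights = [1.0, 8.0, 1.0]        # the middle bin carries most of the weight
fine = sample_from_weights(bins, weights, [0.05, 0.3, 0.5, 0.7, 0.95])
print(fine)
```

Evenly spread u values land mostly in the middle bin because it owns 80% of the CDF, which is how the fine samples concentrate around the surface found by the coarse pass.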

The fine-sampled z_samples are merged with the coarse-sampled z_vals and sorted to form the new z_vals. The points to sample are then computed from them as rays_o + rays_d * z_vals.

network_query_fn and raw2outputs are called once more to obtain the rgb values along with the other outputs. Each of them is then stored in ret as a dictionary.
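
Since raw2outputs is only named here, a rough sketch of the volume rendering step may help: $\alpha_i = 1 - e^{-\sigma_i \delta_i}$, $T_i = \prod_{j<i}(1 - \alpha_j)$, and $w_i = T_i \alpha_i$. This is my simplified single-ray version on plain lists. It leaves out details of the real function such as the noise on the density and the scaling of the intervals by the ray length, and the densities and colors are made up.

```python
import math


def render_weights(sigmas, z_vals):
    """Weights w_i = T_i * alpha_i for the samples along one ray."""
    # distance to the next sample; the last interval is treated as unbounded
    deltas = [b - a for a, b in zip(z_vals[:-1], z_vals[1:])] + [1e10]
    weights, transmittance = [], 1.0
    for sigma, delta in zip(sigmas, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this segment
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha             # light that survives past this segment
    return weights


def composite(weights, colors):
    """Weighted sum of the per-sample colors: the rendered pixel."""
    return [sum(w * c[k] for w, c in zip(weights, colors)) for k in range(3)]


sigmas = [0.0, 50.0, 1.0]                        # empty space, then a dense surface
z_vals = [1.0, 2.0, 3.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

weights = render_weights(sigmas, z_vals)
rgb = composite(weights, colors)
print(weights, rgb)
```

The dense second sample absorbs almost all of the weight, so the pixel takes its color and the sample behind it is hidden. The same weights feed sample_pdf for the fine pass.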

volumetric-rendering

Render Path (inference)

# rendering code used for inference
def render_path(render_poses, hwf, K, chunk, render_kwargs, gt_imgs=None, savedir=None, render_factor=0):

    H, W, focal = hwf

    if render_factor!=0:
        # Render downsampled for speed
        H = H//render_factor
        W = W//render_factor
        focal = focal/render_factor

    rgbs = []
    disps = []

    t = time.time()
    for i, c2w in enumerate(tqdm(render_poses)):
        print(i, time.time() - t)
        t = time.time()

        # render() returns rgb, disp, acc.
        # rgb: final rendered image
        # disp: disparity map, the inverse of depth
        # acc: accumulated opacity(alpha)
        rgb, disp, acc, _ = render(H, W, K, chunk=chunk, c2w=c2w[:3,:4], **render_kwargs)
        rgbs.append(rgb.cpu().numpy())
        disps.append(disp.cpu().numpy())
        if i==0:
            print(rgb.shape, disp.shape)

        """
        if gt_imgs is not None and render_factor==0:
            p = -10. * np.log10(np.mean(np.square(rgb.cpu().numpy() - gt_imgs[i])))
            print(p)
        """

        if savedir is not None:
            rgb8 = to8b(rgbs[-1])
            filename = os.path.join(savedir, '{:03d}.png'.format(i))
            imageio.imwrite(filename, rgb8)


    rgbs = np.stack(rgbs, 0)
    disps = np.stack(disps, 0)

    return rgbs, disps

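render_path is the inference code that runs when the --render_only flag is given. It obtains the rgb and disp outputs through the render() function described above. I did not look at it closely, but it seems that if a pre-trained model is in the designated folder, its weights are loaded and used directly.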

MLP

def create_nerf(args):
    """Instantiate NeRF's MLP model.
    """
    embed_fn, input_ch = get_embedder(args.multires, args.i_embed)

    input_ch_views = 0
    embeddirs_fn = None
    if args.use_viewdirs:
        embeddirs_fn, input_ch_views = get_embedder(args.multires_views, args.i_embed)
    output_ch = 5 if args.N_importance > 0 else 4
    skips = [4]
    model = NeRF(D=args.netdepth, W=args.netwidth,
                 input_ch=input_ch, output_ch=output_ch, skips=skips,
                 input_ch_views=input_ch_views, use_viewdirs=args.use_viewdirs).to(device)
    grad_vars = list(model.parameters())

    model_fine = None
    if args.N_importance > 0:
        model_fine = NeRF(D=args.netdepth_fine, W=args.netwidth_fine,
                          input_ch=input_ch, output_ch=output_ch, skips=skips,
                          input_ch_views=input_ch_views, use_viewdirs=args.use_viewdirs).to(device)
        grad_vars += list(model_fine.parameters())

    network_query_fn = lambda inputs, viewdirs, network_fn : run_network(inputs, viewdirs, network_fn,
                                                                embed_fn=embed_fn,
                                                                embeddirs_fn=embeddirs_fn,
                                                                netchunk=args.netchunk)

    # Create optimizer
    optimizer = torch.optim.Adam(params=grad_vars, lr=args.lrate, betas=(0.9, 0.999))

    start = 0
    basedir = args.basedir
    expname = args.expname

    ##########################

    # Load checkpoints
    if args.ft_path is not None and args.ft_path!='None':
        ckpts = [args.ft_path]
    else:
        ckpts = [os.path.join(basedir, expname, f) for f in sorted(os.listdir(os.path.join(basedir, expname))) if 'tar' in f]

    print('Found ckpts', ckpts)
    if len(ckpts) > 0 and not args.no_reload:
        ckpt_path = ckpts[-1]
        print('Reloading from', ckpt_path)
        ckpt = torch.load(ckpt_path)

        start = ckpt['global_step']
        optimizer.load_state_dict(ckpt['optimizer_state_dict'])

        # Load model
        model.load_state_dict(ckpt['network_fn_state_dict'])
        if model_fine is not None:
            model_fine.load_state_dict(ckpt['network_fine_state_dict'])

    ##########################

    render_kwargs_train = {
        'network_query_fn' : network_query_fn,
        'perturb' : args.perturb,
        'N_importance' : args.N_importance,
        'network_fine' : model_fine,
        'N_samples' : args.N_samples,
        'network_fn' : model,
        'use_viewdirs' : args.use_viewdirs,
        'white_bkgd' : args.white_bkgd,
        'raw_noise_std' : args.raw_noise_std,
    }

    # NDC only good for LLFF-style forward facing data
    if args.dataset_type != 'llff' or args.no_ndc:
        print('Not ndc!')
        render_kwargs_train['ndc'] = False
        render_kwargs_train['lindisp'] = args.lindisp

    render_kwargs_test = {k : render_kwargs_train[k] for k in render_kwargs_train}
    render_kwargs_test['perturb'] = False
    render_kwargs_test['raw_noise_std'] = 0.

    return render_kwargs_train, render_kwargs_test, start, grad_vars, optimizer

create_nerf looks complicated, but it is simply where NeRF is defined. Along with the model definition, it sets up the optimizer and the arguments that will be passed to render.

model
- coarse network
model_fine
- fine network

# The fn argument corresponds to the network's forward function.
def batchify(fn, chunk):
    """Constructs a version of 'fn' that applies to smaller batches.
    """
    if chunk is None:
        return fn
    def ret(inputs):
        return torch.cat([fn(inputs[i:i+chunk]) for i in range(0, inputs.shape[0], chunk)], 0)
    return ret


# embeddirs_fn is the positional encoding for the view directions
def run_network(inputs, viewdirs, fn, embed_fn, embeddirs_fn, netchunk=1024*64):
    """Prepares inputs and applies network 'fn'.
    """
    inputs_flat = torch.reshape(inputs, [-1, inputs.shape[-1]])
    embedded = embed_fn(inputs_flat)

    if viewdirs is not None:
        input_dirs = viewdirs[:,None].expand(inputs.shape)
        input_dirs_flat = torch.reshape(input_dirs, [-1, input_dirs.shape[-1]])
        embedded_dirs = embeddirs_fn(input_dirs_flat)
        embedded = torch.cat([embedded, embedded_dirs], -1)

    outputs_flat = batchify(fn, netchunk)(embedded)
    outputs = torch.reshape(outputs_flat, list(inputs.shape[:-1]) + [outputs_flat.shape[-1]])
    return outputs

create_nerf above uses run_network to build network_query_fn. Inside run_network, batchify receives NeRF's forward function as its fn argument, applies it chunk by chunk, and concatenates the outputs. ~~I think the structure is quite convoluted...~~
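To make the chunking idea concrete, here is a minimal sketch of what batchify does, written with plain Python lists instead of tensors. The toy `double` function and the chunk size of 4 are my own assumptions for illustration, not values from the repository.

```python
def batchify(fn, chunk):
    """Return a version of fn that is applied to at most `chunk` items at a time."""
    if chunk is None:
        return fn

    def ret(inputs):
        out = []
        for i in range(0, len(inputs), chunk):
            # Each call only sees a slice, which bounds peak memory use.
            out.extend(fn(inputs[i:i + chunk]))
        return out

    return ret


call_sizes = []  # records how many items each call received


def double(batch):
    # Hypothetical stand-in for the network's forward function.
    call_sizes.append(len(batch))
    return [2 * x for x in batch]


result = batchify(double, 4)(list(range(10)))
```

With 10 inputs and a chunk size of 4, the function is called three times and the outputs are stitched back together in order.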

def raw2outputs(raw, z_vals, rays_d, raw_noise_std=0, white_bkgd=False, pytest=False):
    """Transforms model's predictions to semantically meaningful values.
    Args:
        raw: [num_rays, num_samples along ray, 4]. Prediction from model.
        z_vals: [num_rays, num_samples along ray]. Integration time.
        rays_d: [num_rays, 3]. Direction of each ray.
    Returns:
        rgb_map: [num_rays, 3]. Estimated RGB color of a ray.
        disp_map: [num_rays]. Disparity map. Inverse of depth map.
        acc_map: [num_rays]. Sum of weights along each ray.
        weights: [num_rays, num_samples]. Weights assigned to each sampled color.
        depth_map: [num_rays]. Estimated distance to object.
    """
    raw2alpha = lambda raw, dists, act_fn=F.relu: 1.-torch.exp(-act_fn(raw)*dists)

    # dists is the distance between adjacent samples along each ray
    dists = z_vals[...,1:] - z_vals[...,:-1]
    dists = torch.cat([dists, torch.Tensor([1e10]).expand(dists[...,:1].shape)], -1)  # [N_rays, N_samples]
    
    # Convert the intervals measured along the z-axis into actual 3D distances
    # by scaling with the norm of the ray direction
    dists = dists * torch.norm(rays_d[...,None,:], dim=-1)

    # rgb is the first 3 channels of raw, the MLP output
    rgb = torch.sigmoid(raw[...,:3])  # [N_rays, N_samples, 3]

    # Gaussian noise added to the raw density
    # The Appendix notes that adding Gaussian noise can improve quality
    noise = 0.
    if raw_noise_std > 0.:
        noise = torch.randn(raw[...,3].shape) * raw_noise_std

        # Overwrite randomly sampled data if pytest
        if pytest:
            np.random.seed(0)
            noise = np.random.rand(*list(raw[...,3].shape)) * raw_noise_std
            noise = torch.Tensor(noise)
    
    # alpha corresponds to (1-exp(-sigma_i * delta_i))
    # It is computed from the volume density (sigma) in raw and dists (delta)
    # It represents the opacity at each sample point.
    alpha = raw2alpha(raw[...,3] + noise, dists)  # [N_rays, N_samples]
    # weights = alpha * tf.math.cumprod(1.-alpha + 1e-10, -1, exclusive=True)
    
    # weights corresponds to T_i(1-exp(-sigma_i * delta_i))
    weights = alpha * torch.cumprod(torch.cat([torch.ones((alpha.shape[0], 1)), 1.-alpha + 1e-10], -1), -1)[:, :-1]

    # rgb_map corresponds to C(r): a summation over all N points on the ray.
    # It can be written as sum(weights * rgb)
    rgb_map = torch.sum(weights[...,None] * rgb, -2)  # [N_rays, 3]

    # Multiply weights by z_vals and sum to form the depth map (expected depth along the ray)
    # Think of it as C(r) with z_vals in place of c_i: the value grows with distance from the camera and with larger weights
    depth_map = torch.sum(weights * z_vals, -1)

    # Disparity map: the inverse of the (weight-normalized) depth map
    disp_map = 1./torch.max(1e-10 * torch.ones_like(depth_map), depth_map / torch.sum(weights, -1))

    # acc_map is the sum of weights along each ray (accumulated opacity). The per-sample weights are what the fine network's sampling relies on.
    acc_map = torch.sum(weights, -1)

    if white_bkgd:
        rgb_map = rgb_map + (1.-acc_map[...,None])

    return rgb_map, disp_map, acc_map, weights, depth_map

As mentioned above, raw2outputs extracts meaningful results from the MLP output. The explanations are in the comments.

dists = dists * torch.norm(rays_d[..., None,:], dim=-1)
- This rescales dists, the plain depth intervals computed from z_vals, to the length of the rays_d vector.
  - Depth intervals (dists): tensor([[1., 1.], [1., 1.]])
  - Ray direction norms (ray_norms): tensor([[1.], [1.4142]])
  - 3D distances (dists_3D): tensor([[1., 1.], [1.4142, 1.4142]])
    - In this example, the actual distances along the second ray become 1.4142 times larger.

weights = alpha * torch.cumprod(torch.cat([torch.ones((alpha.shape[0], 1)), 1.-alpha + 1e-10], -1), -1)[:, :-1]
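- torch.ones prepends a column of ones so that the cumprod becomes exclusive (the first transmittance is 1).
- 1e-10 keeps the product from collapsing to exactly 0.
- After cumprod([1, 1-alpha]) is sliced with [:, :-1], the result has shape [num_rays, num_samples], so weights holds one weight per sample.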
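The weight computation above can be traced by hand for a single ray. This is a rough sketch in plain Python, assuming three samples with made-up densities; `render_weights` is a hypothetical helper name, not a function from the repository.

```python
import math


def render_weights(sigmas, dists):
    """Return the compositing weights T_i * alpha_i for the samples of one ray."""
    weights = []
    transmittance = 1.0  # T_1 = 1: nothing blocks the first sample
    for sigma, delta in zip(sigmas, dists):
        # alpha_i = 1 - exp(-relu(sigma_i) * delta_i)
        alpha = 1.0 - math.exp(-max(sigma, 0.0) * delta)
        weights.append(transmittance * alpha)
        # Exclusive cumulative product: update T only after using it.
        transmittance *= 1.0 - alpha + 1e-10
    return weights


# Empty space, a thin medium, then an almost solid surface (assumed values).
weights = render_weights([0.0, 1.0, 50.0], [1.0, 1.0, 1.0])
```

The empty first sample gets zero weight, and the remaining weight is split between the second and third samples, so the weights sum to roughly 1 once the ray hits something dense.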

# Model
class NeRF(nn.Module):
    def __init__(self, D=8, W=256, input_ch=3, input_ch_views=3, output_ch=4, skips=[4], use_viewdirs=False):
        """ 
        """
        super(NeRF, self).__init__()
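        # D: depth of the network (number of layers)
        self.D = D
        # W: width of the network
        self.W = W
        # input_ch: number of input (position) channels
        self.input_ch = input_ch
        # input_ch_views: number of view-direction channels
        self.input_ch_views = input_ch_views
        # skips: layer indices where a skip connection is applied
        self.skips = skips
        # use_viewdirs: whether to use the view-direction input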
        self.use_viewdirs = use_viewdirs
        
        self.pts_linears = nn.ModuleList(
            [nn.Linear(input_ch, W)] + [nn.Linear(W, W) if i not in self.skips else nn.Linear(W + input_ch, W) for i in range(D-1)])
        
        ### Implementation according to the official code release (https://github.com/bmild/nerf/blob/master/run_nerf_helpers.py#L104-L105)
        self.views_linears = nn.ModuleList([nn.Linear(input_ch_views + W, W//2)])

        ### Implementation according to the paper
        # self.views_linears = nn.ModuleList(
        #     [nn.Linear(input_ch_views + W, W//2)] + [nn.Linear(W//2, W//2) for i in range(D//2)])
        
        if use_viewdirs:
            # Layer that transforms the intermediate feature before it is concatenated with the encoded view direction
            self.feature_linear = nn.Linear(W, W)
            # Predicts density (alpha)
            self.alpha_linear = nn.Linear(W, 1)
            # Predicts rgb
            self.rgb_linear = nn.Linear(W//2, 3)
        else:
            self.output_linear = nn.Linear(W, output_ch)

    def forward(self, x):
        input_pts, input_views = torch.split(x, [self.input_ch, self.input_ch_views], dim=-1)
        h = input_pts

        # MLP forward pass
        for i, l in enumerate(self.pts_linears):
            h = self.pts_linears[i](h)
            h = F.relu(h)
            if i in self.skips:
                # Apply the skip connection.
                h = torch.cat([input_pts, h], -1)

        if self.use_viewdirs:
            # Predict alpha (density) directly from h
            alpha = self.alpha_linear(h)
            # Extract the feature
            feature = self.feature_linear(h)
            h = torch.cat([feature, input_views], -1)
        
            for i, l in enumerate(self.views_linears):
                h = self.views_linears[i](h)
                h = F.relu(h)

            rgb = self.rgb_linear(h)
            outputs = torch.cat([rgb, alpha], -1)
        else:
            outputs = self.output_linear(h)

        # The output contains both rgb and alpha
        return outputs

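This is where NeRF is defined. Most of the details are in the comments. You can see the skip connection, the module that adds the view direction, the part that returns density (alpha), and the ReLUs applied in between.

Note that when predicting RGB, feature_linear(h) is concatenated with input_views and fed to views_linears, while density (alpha) is predicted directly from h by alpha_linear. Finally, forward returns outputs as the concatenation of rgb and alpha. There are other methods such as load_weights_from_keras, which I will skip.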

Loss Function

# Misc
img2mse = lambda x, y : torch.mean((x - y) ** 2)
mse2psnr = lambda x : -10. * torch.log(x) / torch.log(torch.Tensor([10.]))
to8b = lambda x : (255*np.clip(x,0,1)).astype(np.uint8)

        img_loss = img2mse(rgb, target_s)
        trans = extras['raw'][...,-1]
        loss = img_loss
        psnr = mse2psnr(img_loss)

        if 'rgb0' in extras:
            img_loss0 = img2mse(extras['rgb0'], target_s)
            loss = loss + img_loss0
            psnr0 = mse2psnr(img_loss0)

        loss.backward()
        optimizer.step()

        # NOTE: IMPORTANT!
        ###   update learning rate   ###
        decay_rate = 0.1
        decay_steps = args.lrate_decay * 1000
        new_lrate = args.lrate * (decay_rate ** (global_step / decay_steps))
        for param_group in optimizer.param_groups:
            param_group['lr'] = new_lrate

Back in train(), we now have rgb, disp, acc, and extras from render(). To optimize, we need a loss function.

nerf loss function

rgb is the output of the fine network, and extras['rgb0'] is the output of the coarse network. As in the equation above, the loss is computed for the coarse and fine networks separately, the two are summed, and the result is backpropagated.
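As a quick sanity check of the loss, the PSNR conversion, and the learning-rate schedule in the training snippet, here is a small sketch on plain Python floats. The pixel values, the base learning rate of 5e-4, and lrate_decay = 250 are assumptions chosen for illustration.

```python
import math


def mse(pred, target):
    """Mean squared error, the same quantity img2mse computes."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mse_to_psnr(m):
    """PSNR in dB for values in [0, 1]: -10 * log10(mse)."""
    return -10.0 * math.log10(m)


def decayed_lr(base_lr, global_step, lrate_decay, decay_rate=0.1):
    """Exponential decay: the rate shrinks by decay_rate every lrate_decay * 1000 steps."""
    decay_steps = lrate_decay * 1000
    return base_lr * decay_rate ** (global_step / decay_steps)


target = [0.5, 0.5, 0.5]      # assumed ground-truth pixel
rgb_coarse = [0.7, 0.5, 0.3]  # assumed coarse network output
rgb_fine = [0.6, 0.5, 0.4]    # assumed fine network output

# Total loss = fine loss + coarse loss, as in the training loop.
loss = mse(rgb_fine, target) + mse(rgb_coarse, target)
psnr_fine = mse_to_psnr(mse(rgb_fine, target))
psnr_coarse = mse_to_psnr(mse(rgb_coarse, target))
lr_after_decay = decayed_lr(5e-4, 250000, 250)
```

Since the fine prediction is closer to the target here, its PSNR comes out higher than the coarse one, and after exactly decay_steps iterations the learning rate has dropped by a factor of 10.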

Positional Encoding

# Positional encoding (section 5.1)
class Embedder:
    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self.create_embedding_fn()
        
    def create_embedding_fn(self):
        embed_fns = [] # list of encoding functions
        d = self.kwargs['input_dims'] # input dimension (usually 3)
        out_dim = 0 # output dimension (total size of the encoding)

        # Whether to include the raw input itself
        if self.kwargs['include_input']:
            embed_fns.append(lambda x : x)
            out_dim += d
        

        max_freq = self.kwargs['max_freq_log2']
        N_freqs = self.kwargs['num_freqs']
        
        # freq_bands holds N_freqs frequency values
        if self.kwargs['log_sampling']: # log scale or linear scale
            freq_bands = 2.**torch.linspace(0., max_freq, steps=N_freqs)
        else:
            freq_bands = torch.linspace(2.**0., 2.**max_freq, steps=N_freqs)
        
        # Apply sin and cos at each frequency freq
        # freq -> [2^0, 2^1, ... ,2^(L - 1)]
        for freq in freq_bands:
            # p_fn -> [sin, cos]
            for p_fn in self.kwargs['periodic_fns']:
                # The pi factor in the paper's formula does not make a big difference
                embed_fns.append(lambda x, p_fn=p_fn, freq=freq : p_fn(x * freq))
                out_dim += d
                    
        self.embed_fns = embed_fns
        self.out_dim = out_dim
        
    def embed(self, inputs):
        return torch.cat([fn(inputs) for fn in self.embed_fns], -1)


def get_embedder(multires, i=0):
    if i == -1:
        return nn.Identity(), 3
    
    embed_kwargs = {
                'include_input' : True,
                'input_dims' : 3,
                'max_freq_log2' : multires-1,
                'num_freqs' : multires,
                'log_sampling' : True,
                'periodic_fns' : [torch.sin, torch.cos],
    }
    
    embedder_obj = Embedder(**embed_kwargs)
    embed = lambda x, eo=embedder_obj : eo.embed(x)
    return embed, embedder_obj.out_dim

Embedder builds high-frequency features through positional encoding. The double for loop over freq_bands (log scale) and p_fn (sin, cos) produces the embedding given by the formula below, and N_freqs determines the embedding dimension. See the paper for details.
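To see where the output dimension comes from, here is a minimal sketch of the same encoding for a single 3D point, using only the standard library. It follows the loop order of Embedder (all dimensions per function, sin then cos, per frequency); the sample point and `num_freqs=10` are assumed values.

```python
import math


def positional_encoding(p, num_freqs, include_input=True):
    """Encode point p with sin/cos at frequencies 2^0 ... 2^(num_freqs - 1)."""
    out = list(p) if include_input else []
    for k in range(num_freqs):
        freq = 2.0 ** k
        for fn in (math.sin, math.cos):
            # Each (frequency, function) pair contributes len(p) values.
            out.extend(fn(x * freq) for x in p)
    return out


point = [0.1, -0.2, 0.3]
encoded = positional_encoding(point, num_freqs=10)
```

With 3 input dimensions and 10 frequencies, the encoding has 3 + 3 * 2 * 10 = 63 values, which is the input_ch that the MLP ends up receiving in this configuration.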

positional encoding

[Deep Learning Paper Review - PRML Lab] - 3D Gaussian Splatting (3D-GS) & code (tile rasterize)

Lee현서 — Sat, 22 Feb 2025 17:11:40 +0900

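The paper for this post is 3D Gaussian Splatting, presented at SIGGRAPH 2023. While preparing to return to school and choosing a 3D vision project and research area, I came across this paper, a more efficient successor to NeRF. Seeing how it has since been extended at CVPR and other top-tier venues, I decided to review it. Later I plan to review related papers such as InstantNGP.

Paper link: https://arxiv.org/pdf/2308.04079

Reference blog link: https://xoft.tistory.com/51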


xoft's 3D-GS post was the most helpful. It let me go through everything in detail, from the unfamiliar concept of 3D Gaussians to the concrete algorithm.

Radiance field methods have evolved toward synthesizing novel views from multiple photos or videos. However, training and rendering networks that produce high-quality results is still expensive. To address part of this problem, the paper introduces three key elements that drastically reduce training time and enable real-time (>= 30 fps) rendering at 1080p.

First, as in NeRF, it starts from cameras calibrated with SfM (Structure-from-Motion). The sparse point cloud produced by the SfM process is then used to initialize the 3D Gaussians. Earlier point-based solutions required Multi-View Stereo (MVS) data, but 3D-GS achieves high-quality results from the SfM points alone. 3D Gaussians are a differentiable volumetric representation and can be rasterized efficiently through $\alpha$-blending.

Second, the properties of the 3D Gaussians are optimized in an interleaved manner: 3D position, $\alpha$ (opacity), anisotropic covariance, and the SH (spherical harmonic) coefficients discussed later. SH (Spherical Harmonics) is a concept from graphics used to account for view-dependent effects when computing color.

Third, tile-based rasterization driven by a fast GPU sorting algorithm. This enables anisotropic splatting that respects visibility order, and keeping track of the values needed for the backward computation makes a fast backward pass possible.

3D-GS thus matches the quality of previous SOTA methods while running in real time. The details are covered below.

Differentiable 3D Gaussian Splatting

3D-GS process

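Starting from the sparse SfM points, the paper says it needs a primitive that inherits the properties of a differentiable volumetric representation while being unstructured and explicit, so that rendering is fast. For this, it uses 3D Gaussians, which can easily be splatted to 2D, enabling fast $\alpha$-blending.

A 3D Gaussian centered at the point (mean) $\mu$ with covariance matrix $\Sigma$ is defined as below. In the paper's notation this is $G(x) = e^{-\frac{1}{2}x^T\Sigma^{-1}x}$, with $x$ taken relative to $\mu$.

Then, given a viewing transformation $W$ such as w2c (world2cam), the covariance matrix $\Sigma^{'}$ in camera space is defined as below: $\Sigma^{'} = JW\Sigma W^TJ^T$.

Here $J$ is the Jacobian of the affine approximation of the c2i (camera2image) projective transformation (derived via Taylor expansion). For a detailed explanation, see xoft's blog linked at the top of the post. An interesting property is that dropping the third row and column of $\Sigma^{'}$ yields a 2x2 covariance matrix with the same structure and properties as if we had started from planar points with normals.

A brief note on the quadratic form in Equation (5): $J$ is the Jacobian (affine approximation) of the perspective projection. Since this is a first-order Taylor approximation, the approximation error clearly grows as we move away from the Gaussian's mean. Recent work identifies this perspective error as a limitation of 3D-GS and tries to address it. Because the 2x2 covariance matrix used in pixel space does not use the z-axis, it is implemented in the actual CUDA code as shown below.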

Jacobian (z-axis = 0)

The obvious approach would be to optimize $\Sigma$ directly. However, a covariance matrix only has physical meaning when it is positive semi-definite. Gradient descent can easily produce invalid covariance matrices, which makes the optimization difficult.

To solve this, the authors chose a more intuitive yet equally expressive representation. $\Sigma$ is analogous to describing the configuration of an ellipsoid, so it can be constructed from a scaling matrix $S$ and a rotation matrix $R$.

In the implementation, the rotation is stored separately as a quaternion $q$. The factors are optimized independently, and building the 3D Gaussians from an anisotropic covariance matrix suited to this optimization lets them adapt to the varied geometry of a scene, giving a very compact representation as shown below.

The covariance matrix can be rewritten slightly as follows.

$$
\begin{align}
\Sigma &= RSS^TR^T \\
     &= R\begin{bmatrix}
        s_1^2 & 0 & 0 \\
        0 & s_2^2 & 0 \\
        0 & 0 & s_3^2
    \end{bmatrix} R^T
\end{align}
$$

The covariance matrix has the form of a 3D (anisotropic) ellipsoid matrix. In other words, 3D-GS uses ellipsoids in 3D, each with an opacity, as its primitive kernel.

ellipsoid shape

When dealing with the quadratic form $A^{-1}MA$, it helps to read it as the transformation $M$ expressed in the coordinate system $A$.

This is similar to interpreting the eigendecomposition of a square matrix: eigendecomposition analyzes, in a coordinate system made of the axes whose directions are preserved under the linear transformation (eigenvectors), how much weight each axis carries (eigenvalues). This is also related to PCA.

Source: https://velog.io/@gjghks950/3D-Gaussian-Splatting-%EC%99%84%EB%B2%BD-%EB%B6%84%EC%84%9D-feat.-CUDA-Rasterizer
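The factorization into a rotation and a scale can be sketched in a few lines of plain Python. This is only an illustration, assuming a unit quaternion in (w, x, y, z) order; the function names are hypothetical and this is not the CUDA computeCov3D code.

```python
import math


def quat_to_rotmat(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(q, s):
    """Sigma = R S S^T R^T with S = diag(s), i.e. R diag(s^2) R^T."""
    R = quat_to_rotmat(q)
    return [
        [sum(R[i][k] * s[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


scales = (1.0, 2.0, 3.0)
sigma_identity = covariance((1.0, 0.0, 0.0, 0.0), scales)

# 90 degree rotation about the z-axis swaps the x and y extents.
half = math.pi / 4
sigma_rotated = covariance((math.cos(half), 0.0, 0.0, math.sin(half)), scales)
```

Whatever values the optimizer picks for the quaternion and the scales, the product stays symmetric and positive semi-definite by construction, which is the point of this parameterization.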


Density of the Projected Gaussian

At render time, 3D Gaussian splatting multiplies the Gaussian density (intensity) by the opacity. The density $f_{i}(p)$ of the $i$th Gaussian at a point $p$ in 3D space, as used in the actual CUDA code, is defined as below.

$$f_i(p)=\sigma(\alpha_i)\exp\bigg(-\frac{1}{2}(\mu_i-p)^T\Sigma_i^{-1}(\mu_i-p)\bigg)$$

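The value inside the exponential is the Mahalanobis distance term (= power), which can be seen as the effective distance within the ellipsoid once the distribution is taken into account. This matches the intuition that when a 3D point is projected to 2D, the closer it lands to the pixel, the closer power gets to 0, so exp(power) approaches 1 and the Gaussian becomes more opaque at that pixel.

Also, since the projected covariance is only guaranteed to be positive semi-definite, the actual code adds a small $\lambda$ to its diagonal so that the covariance matrix becomes positive definite, as below.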

$$x^TA^TAx + \lambda x^Tx>0$$

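In the actual code, the inverse $\Sigma^{-1}$ is named conic.

Finally, the radius of the 2D splat is defined so that it covers more than 99.7% of the mass:

$$r = 3 \times \max_i \sqrt{\lambda_i}$$

This radius is used for Gaussian culling. The variances of the projected Gaussian along its principal axes are the eigenvalues $\lambda_i$ of its covariance, so the standard deviations are their square roots. The eigenvalues come from the characteristic equation, which has a closed form for a 2x2 matrix, so the actual code solves it with the quadratic formula.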
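The closed-form eigenvalue computation for a 2x2 covariance is short enough to write out. This is a minimal sketch under my own assumptions: the function name is hypothetical, and the 0.1 floor under the square root is a numerical guard I assume here rather than something taken from the text above.

```python
import math


def splat_radius(a, b, c):
    """Radius for cov2D = [[a, b], [b, c]]: ceil(3 * sqrt(largest eigenvalue))."""
    mid = 0.5 * (a + c)          # half the trace
    det = a * c - b * b          # determinant
    # Roots of lambda^2 - trace * lambda + det = 0 (quadratic formula).
    disc = math.sqrt(max(0.1, mid * mid - det))  # assumed floor for stability
    lam1 = mid + disc
    lam2 = mid - disc
    return math.ceil(3.0 * math.sqrt(max(lam1, lam2)))


radius_elongated = splat_radius(4.0, 0.0, 1.0)  # std devs 2 and 1 along the axes
radius_round = splat_radius(1.0, 0.0, 1.0)
```

For the elongated splat the largest standard deviation is 2, so the radius is 6 pixels. The round splat comes out slightly larger than 3 because of the assumed floor.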

Optimization

First, because 3D-to-2D projection is ambiguous, 3D Gaussians can end up misplaced. So the optimization must be able to remove misplaced Gaussians and create new ones. The authors use a sigmoid for $\alpha$ and an exponential function for the covariance scale.

The authors initialize each Gaussian as an isotropic Gaussian whose axes equal the mean distance to the three nearest points. They use exponential decay scheduling similar to Plenoxels [Fridovich-Keil and Yu et al. 2022], but for positions only. The loss function combines L1 with a D-SSIM term for structural similarity: $\mathcal{L} = (1-\lambda)\mathcal{L}_1 + \lambda\mathcal{L}_{D-SSIM}$, with $\lambda = 0.2$.

Adaptive Control of Gaussians

Starting from the initial sparse SfM points, the method adaptively controls the number of Gaussians and their density over unit volume, moving toward a denser set that represents the scene better. After an initial warm-up, the authors densify every 100 iterations and remove Gaussians whose $\alpha$ is below the threshold $\epsilon_{\alpha}$.

Adaptive control of the 3D Gaussians needs to populate empty areas. There are two cases to focus on: regions with missing geometric features (under-reconstruction), and regions where Gaussians cover large areas of the scene (over-reconstruction). Both exhibit large view-space positional gradients, because the optimization is trying to move the Gaussians there.

Both cases are good candidates for densification, so Gaussians whose average view-space position gradient magnitude is at or above $\tau_{pos} = 0.0002$ are densified.

densify details

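For under-reconstruction, new Gaussians must be added to the empty space. This is done by cloning the Gaussian at the same size and moving the copy in the direction of the positional gradient.

For over-reconstruction, large Gaussians must be split into smaller ones. The authors create the new Gaussians by dividing the scale by an experimentally chosen factor $\phi = 1.6$, and place them by using the original large Gaussian as a sampling PDF.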
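The clone-or-split decision can be summarized as a small function. This is a rough sketch: `densify_action` and its `scale_threshold` parameter (the size above which a Gaussian counts as "large") are hypothetical names and values I introduce for illustration; only $\tau_{pos}$ and $\phi$ come from the text above.

```python
def densify_action(grad_norm, scale_max, tau_pos=0.0002, scale_threshold=0.01, phi=1.6):
    """Decide what to do with one Gaussian and return (action, new_scale)."""
    if grad_norm < tau_pos:
        # Small positional gradient: the region is already well reconstructed.
        return ("keep", scale_max)
    if scale_max <= scale_threshold:
        # Under-reconstruction: small Gaussian, so clone it at the same size.
        return ("clone", scale_max)
    # Over-reconstruction: large Gaussian, so split into smaller ones.
    return ("split", scale_max / phi)


well_fit = densify_action(grad_norm=0.0001, scale_max=0.5)
small_moving = densify_action(grad_norm=0.001, scale_max=0.005)
large_moving = densify_action(grad_norm=0.001, scale_max=1.6)
```

In a real implementation the clone would also be shifted along the gradient direction and the split children would be sampled from the parent Gaussian, which this sketch leaves out.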

https://xoft.tistory.com/51

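A note on the pseudo code: parameters such as M, S, C, A are updated every iteration, while the green part is executed every 100 iterations.

In the green part, clone increases both the number of Gaussians and the total volume, while split keeps the volume and increases the number of Gaussians. As a result, floaters can appear close to the cameras and the number of Gaussians can grow without justification.

The optimization can get stuck with floaters close to the input cameras. To handle this, the authors set $\alpha$ to a value close to 0 every $N=3000$ iterations (M, S, C, A are then updated to non-zero values over the next 100 iterations, after which RemoveGaussian in the densify step deletes unneeded Gaussians). The optimization raises $\alpha$ again for the Gaussians that are needed, while those whose $\alpha$ stays below the threshold are removed. In addition, Gaussians that occupy a very large region in world space are removed periodically, which effectively controls the total number of Gaussians (and prevents large Gaussians from overlapping). All 3D Gaussians remain primitives in Euclidean space, and unlike other methods, no space compaction, warping, or projection strategy is needed.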

https://xoft.tistory.com/51

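This is the pseudo code from the paper's Appendix. The red part initializes variables, the blue part runs inference, computes the loss, and optimizes, and the green part handles the Gaussians as described above. The blue part contains the Rasterize step, so let's now look at how the authors implemented it.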

Fast Differentiable Rasterizer for Gaussians

tile-based rasterization

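To avoid the per-pixel sorting cost of $\alpha$-blending that hindered previous solutions, the authors introduce a tile-based rasterizer for Gaussian splats. It allows efficient backpropagation over an arbitrary number of blended Gaussians, with low additional memory cost and constant per-pixel overhead.

It starts by dividing the screen into 16x16 tiles (CreateTiles) and culling the 3D Gaussians against the view frustum and each tile. Only Gaussians whose 99% confidence interval intersects the view frustum are kept, and a guard band is used to trivially reject Gaussians at extreme positions. (CullGaussian. In the pseudo code, CreateTiles comes afterward.)

Each projected 2D Gaussian is instantiated once per tile it overlaps. Each instance is assigned a key that combines the view-space depth and the tile ID. A single GPU radix sort over these keys, run in parallel, then orders the 2D splats by depth within each tile. This lets BlendInOrder in the pseudo code draw the nearer Gaussians first based on the key. Because there is no additional per-pixel ordering, the blending order can be approximate in some configurations, but the authors report that these approximations become negligible as splats approach the size of individual pixels, and that the choice improves training and rendering performance without producing visible artifacts. Whereas previous methods required per-pixel sorting, the tile-based GPU radix sort increases parallelism and amortizes the sorting cost, allowing more 3D Gaussians to be used to represent the scene.

Next, a list is built for each tile. One thread block is launched per tile to rasterize that tile's list. Each thread block first collaboratively loads packets of Gaussians into shared memory. Then, for each pixel in the tile, the list is traversed front to back, accumulating color and $\alpha$ in parallel. When a pixel's $\alpha$ reaches the target saturation, its thread stops. At regular intervals the threads in a tile are queried, and processing of the whole tile terminates once all pixels have saturated. The alpha-blending follows the method described above in the Density of the Projected Gaussian section.

In this method, saturation of $\alpha$ is the only stopping criterion. Also, unlike previous work, the number of blended primitives that receive gradient updates is not limited. This lets the approach handle scenes with arbitrary, varying depth complexity without scene-specific hyperparameter tuning. For the backward pass, the authors could have stored arbitrarily long per-pixel lists of blended points in global memory, but to avoid the dynamic memory management overhead they traverse the per-tile lists again instead. This works because the sorted array of Gaussians and the tile ranges from the forward pass can be reused. In the backward pass (gradient update), the lists are traversed back to front to make gradient computation easier.

The traversal starts from the last point (Gaussian) that affected any pixel in the tile, and loading points into shared memory is again done collaboratively. Each pixel only begins the (expensive) overlap testing and processing of points whose depth is lower than or equal to the depth of the last point that contributed to its color during the forward pass. The backward pass needs the accumulated opacity values at each step of the original blending process. Rather than traversing an explicit list of progressively shrinking opacities in the backward pass, the authors store only the total accumulated opacity at the end of the forward pass and recover the intermediate opacities from it. Specifically, each point stores the final accumulated opacity of the forward process, and dividing this by each point's $\alpha$ during the back-to-front traversal yields the coefficients needed for gradient computation.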

Fast Differentiable Rasterizer for Gaussians With Code

preprocess -> duplicatewithkeys -> sortpairs

Let me briefly explain the fast rasterization described above with figures and code.

Reference: https://clean-dragon.tistory.com/16

Forward

Forward consists of PreProcess -> duplicateWithKeys -> SortPairs -> identifyTileRanges -> Render, in that order.

First, before PreProcess, memory is allocated to store the per-pixel information used in render.

	struct ImageState
	{
		uint2* ranges;
		uint32_t* n_contrib;
		float* accum_alpha;

		static ImageState fromChunk(char*& chunk, size_t N);
	};

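ranges: range of Gaussians to be accumulated at the pixel
n_contrib: number of Gaussians accumulated at the pixel
accum_alpha: alpha value accumulated at the pixel

As described above, the Gaussian range, count, and accumulated alpha are used in per-pixel processing, and here you can see that they are stored per pixel.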

PreProcess

	const float* cov3D;
	if (cov3D_precomp != nullptr)
	{
		cov3D = cov3D_precomp + idx * 6;
	}
	else
	{
		computeCov3D(scales[idx], scale_modifier, rotations[idx], cov3Ds + idx * 6);
		cov3D = cov3Ds + idx * 6;
	}

	// Compute 2D screen-space covariance matrix
	float3 cov = computeCov2D(p_orig, focal_x, focal_y, tan_fovx, tan_fovy, cov3D, viewmatrix);

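This step computes the 3D and 2D covariance matrices of each Gaussian, its projected coordinates, and which tiles it overlaps, and prepares how much of the Gaussian's opacity applies at each pixel. Only Gaussians inside the view frustum are processed.

The Gaussian's radius is then obtained via EWA, and this radius determines which tiles the Gaussian overlaps. The step also computes a matrix called conic, which makes the applied opacity vary with the distance between the Gaussian's mean and the pixel.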

DuplicateWithKeys

	duplicateWithKeys << <(P + 255) / 256, 256 >> > (
		P,
		geomState.means2D,
		geomState.depths,
		geomState.point_offsets,
		binningState.point_list_keys_unsorted,
		binningState.point_list_unsorted,
		radii,
		tile_grid)
	CHECK_CUDA(, debug)

Using the radius computed in PreProcess, the ID of each tile the Gaussian overlaps (32 bits) and the depth (32 bits) are stored in the key, and the Gaussian's index is stored in the value (once per overlapped tile).
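The key layout is easy to try out in Python. This is a minimal sketch, assuming positive depths and a float32 bit pattern in the low 32 bits; `float_bits` and `make_key` are hypothetical helper names, not functions from the CUDA code.

```python
import struct


def float_bits(x):
    """Reinterpret a float32 as an unsigned 32-bit integer."""
    return struct.unpack("!I", struct.pack("!f", x))[0]


def make_key(tile_id, depth):
    """64-bit key: tile ID in the upper 32 bits, depth bits in the lower 32."""
    return (tile_id << 32) | float_bits(depth)


near_in_tile_3 = make_key(3, 1.0)
far_in_tile_3 = make_key(3, 2.0)
far_in_tile_0 = make_key(0, 100.0)
near_in_tile_1 = make_key(1, 5.0)
```

For positive floats the bit pattern grows with the value, so sorting these integers orders the entries by tile first and by depth within each tile, which is why a single sort is enough.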

SortPairs

The key/value pairs from above are sorted with radix sort. As a result, the Gaussians are stored sorted by tile and, within each tile, by depth.

identifytileRanges -> Render

identifyTileRanges

Using the sorted key/value pairs, the start and end positions of each tile's Gaussians in the sorted list are stored. With these ranges, rendering can later index the Gaussians that belong to a tile.
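Sorting and range identification together can be sketched as follows. The function below is a sequential stand-in for the parallel kernel, and for readability it assumes small integer depths in the low 32 bits of the key; the Gaussian IDs are made up.

```python
def identify_tile_ranges(sorted_keys, num_tiles):
    """Return a (start, end) index pair into the sorted list for each tile."""
    ranges = [(0, 0)] * num_tiles  # tiles with no Gaussians keep an empty range
    start = 0
    for i in range(1, len(sorted_keys) + 1):
        at_end = i == len(sorted_keys)
        # A range closes when the tile ID (upper 32 bits) changes.
        if at_end or (sorted_keys[i] >> 32) != (sorted_keys[start] >> 32):
            ranges[sorted_keys[start] >> 32] = (start, i)
            start = i
    return ranges


# (key, gaussian_id) pairs; key = (tile_id << 32) | depth
pairs = sorted([
    ((1 << 32) | 7, 10),
    ((0 << 32) | 9, 11),
    ((1 << 32) | 2, 12),
    ((0 << 32) | 3, 13),
])
keys = [k for k, _ in pairs]
point_list = [v for _, v in pairs]
ranges = identify_tile_ranges(keys, num_tiles=3)
```

After sorting, tile 0 owns positions 0 to 2 and tile 1 owns positions 2 to 4 of point_list, each in front-to-back order, while tile 2 stays empty.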

Render

As mentioned, when rendering an image with the Gaussians, CUDA launches one thread block per tile, and one thread is assigned to each pixel in the tile.

	renderCUDA<NUM_CHANNELS> << <grid, block >> > (
		ranges,
		point_list,
		W, H,
		means2D,
		colors,
		conic_opacity,
		final_T,
		n_contrib,
		bg_color,
		out_color);

Let's first look at the call to renderCUDA() above. It takes the previously computed information and the Gaussian parameters as arguments.

	auto block = cg::this_thread_block();
	uint32_t horizontal_blocks = (W + BLOCK_X - 1) / BLOCK_X;
	uint2 pix_min = { block.group_index().x * BLOCK_X, block.group_index().y * BLOCK_Y };
	uint2 pix_max = { min(pix_min.x + BLOCK_X, W), min(pix_min.y + BLOCK_Y , H) };
	uint2 pix = { pix_min.x + block.thread_index().x, pix_min.y + block.thread_index().y };
	uint32_t pix_id = W * pix.y + pix.x;
	float2 pixf = { (float)pix.x, (float)pix.y };

	// Check if this thread is associated with a valid pixel or outside.
	bool inside = pix.x < W&& pix.y < H;
	// Done threads can help with fetching, but don't rasterize
	bool done = !inside;

	// Load start/end range of IDs to process in bit sorted list.
	uint2 range = ranges[block.group_index().y * horizontal_blocks + block.group_index().x];
	const int rounds = ((range.y - range.x + BLOCK_SIZE - 1) / BLOCK_SIZE);
	int toDo = range.y - range.x;

The block ID and thread ID are computed, and the pixel each thread is responsible for is determined. Then the ranges computed in identifyTileRanges give the range of Gaussians in the tile. The threads of the block then fetch these Gaussians in parallel.

Here 16x16 threads are used, so at most 256 Gaussians can be fetched at once. rounds is therefore computed so that if a tile contains more than 256 Gaussians, fetching continues in further passes. For example, with toDo=300 and BLOCK_SIZE=256, the for loop runs just 2 times.

	__shared__ int collected_id[BLOCK_SIZE];
	__shared__ float2 collected_xy[BLOCK_SIZE];
	__shared__ float4 collected_conic_opacity[BLOCK_SIZE];

The __shared__ declarations above allocate shared memory that is shared among the threads of a block.

	for (int i = 0; i < rounds; i++, toDo -= BLOCK_SIZE)
	{
		// End if entire block votes that it is done rasterizing
		int num_done = __syncthreads_count(done);
		if (num_done == BLOCK_SIZE)
			break;

		// Collectively fetch per-Gaussian data from global to shared
		int progress = i * BLOCK_SIZE + block.thread_rank();
		if (range.x + progress < range.y)
		{
			int coll_id = point_list[range.x + progress];
			collected_id[block.thread_rank()] = coll_id;
			collected_xy[block.thread_rank()] = points_xy_image[coll_id];
			collected_conic_opacity[block.thread_rank()] = conic_opacity[coll_id];
		}
		block.sync();

		// Iterate over current batch
		for (int j = 0; !done && j < min(BLOCK_SIZE, toDo); j++)
		{
			// Keep track of current position in range
			contributor++;

			// Resample using conic matrix (cf. "Surface 
			// Splatting" by Zwicker et al., 2001)
			float2 xy = collected_xy[j];
			float2 d = { xy.x - pixf.x, xy.y - pixf.y };
			float4 con_o = collected_conic_opacity[j];
			float power = -0.5f * (con_o.x * d.x * d.x + con_o.z * d.y * d.y) - con_o.y * d.x * d.y;
			if (power > 0.0f)
				continue;

			// Eq. (2) from 3D Gaussian splatting paper.
			// Obtain alpha by multiplying with Gaussian opacity
			// and its exponential falloff from mean.
			// Avoid numerical instabilities (see paper appendix). 
			float alpha = min(0.99f, con_o.w * exp(power));
			if (alpha < 1.0f / 255.0f)
				continue;
			float test_T = T * (1 - alpha);
			if (test_T < 0.0001f)
			{
				done = true;
				continue;
			}

			// Eq. (3) from 3D Gaussian splatting paper.
			for (int ch = 0; ch < CHANNELS; ch++)
				C[ch] += features[collected_id[j] * CHANNELS + ch] * alpha * T;

			T = test_T;

			// Keep track of last range entry to update this
			// pixel.
			last_contributor = contributor;
		}
	}

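This is the part processed in parallel. The difference between the current pixel position and the Gaussian's position is computed and combined with conic_opacity to evaluate the Gaussian's falloff at the pixel (power). The density (Gaussian intensity) exp(power) is then multiplied by the opacity to produce alpha, and the pixel's color is computed with the alpha-blending equation.

The Gaussian's color is read from memory using the Gaussian's ID and accumulated into the pixel color as shown above. This process repeats for each Gaussian in the tile.

~~I should study CUDA programming bit by bit when I have time..~~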
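The inner loop of the kernel can be mirrored for a single pixel in plain Python. This is a sequential sketch for reading along, not the CUDA implementation: the dictionary layout of each Gaussian is my own assumption, and the final background compositing step is assumed rather than shown in the excerpt above.

```python
import math


def blend_pixel(pixel, gaussians, background=(0.0, 0.0, 0.0)):
    """Front-to-back alpha blending of depth-sorted 2D Gaussians at one pixel."""
    T = 1.0                # remaining transmittance
    C = [0.0, 0.0, 0.0]    # accumulated color
    for g in gaussians:
        dx = g["mean"][0] - pixel[0]
        dy = g["mean"][1] - pixel[1]
        a, b, c = g["conic"]  # inverse 2D covariance [[a, b], [b, c]]
        power = -0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy
        if power > 0.0:
            continue
        alpha = min(0.99, g["opacity"] * math.exp(power))
        if alpha < 1.0 / 255.0:
            continue  # contribution too small to matter
        test_T = T * (1.0 - alpha)
        if test_T < 0.0001:
            break     # pixel is saturated, stop early
        for ch in range(3):
            C[ch] += g["color"][ch] * alpha * T
        T = test_T
    # Assumed final step: composite the leftover transmittance over the background.
    return [C[ch] + T * background[ch] for ch in range(3)], T


splats = [
    {"mean": (5.0, 5.0), "conic": (1.0, 0.0, 1.0), "opacity": 0.5, "color": (1.0, 0.0, 0.0)},
    {"mean": (5.0, 5.0), "conic": (1.0, 0.0, 1.0), "opacity": 1.0, "color": (0.0, 1.0, 0.0)},
]
color, remaining_T = blend_pixel((5.0, 5.0), splats)
```

The front red splat takes half of the pixel, and the green one behind it fills almost all of what is left, since alpha is capped at 0.99.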

Experiment

quantitative evaluation

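Quantitative results compared with previous work.

A visual comparison with other work.

Ablation Study

This is the ablation study on SfM initialization.

However, 3D-GS has limitations: artifacts are observed in regions of the scene that are not well observed, and in views that have little overlap with the views seen during training.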

[ Deep Learning Paper Review - PRMI Lab ] - DiT (Scalable Diffusion Models with Transformers)

Lee현서 — Sun, 12 Jan 2025 16:03:43 +0900

Today I wanted to study the technology underlying SORA from OpenAI, and I found the DiT (Diffusion Transformer) paper, so I will summarize it here. Rather than introducing a new technique, I think it is a paper about model architecture. I also recently built a desktop with a 4090, so I plan to run NeRF and diffusion models while analyzing their code.

Diffusion Transformers (Preliminaries)

DDPM (Denoising Diffusion Probabilistic Model)

DDPM review: https://hyunseo-fullstackdiary.tistory.com/426


Classifier guidance (Diffusion Models Beat GANs on Image Synthesis)

Paper link: https://arxiv.org/abs/2105.05233


위 그림에서 기존의 $p_{\theta}(x_t|x_{t+1})$을 $y$에 대해 condition한 수식을 $\mathbf{Z}p_{\theta}(x_t|x_{t+1})p_{\phi}(y|t)$로 표현할 수 있음을 보여줍니다. $p_{\theta}(x_t|x_{t+1})$을 Gaussian Distribution으로 표현하고 있습니다.

$p_{\phi}(y|x_t)$또한 Diffusion procedure의 성질을 이용해 Taylor expansion을 통해 정리합니다. 그럼 최종적으로 구하고자 했던 식은 $\sum g$로 mean shifted된 형태로 표현되게 됩니다.

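The paper can be summarized by the figure above. To sample an image conditioned on a label $y$, we train a separate classifier on $(x_t, y)$ and add its gradient $g = \nabla_{x_t} \log p_{\phi}(y|x_t)$, multiplied by the covariance $\Sigma$, to the mean of the DDPM sampling step. The paper also multiplies the classifier gradient by a factor $s$ to control the strength of the classifier guidance.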
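As a small illustration of the mean shift described above, here is a sketch assuming a diagonal covariance so that $\Sigma g$ becomes an element-wise product. The function name `guided_mean` and the list-based vectors are my own choices; the classifier gradient is passed in as a plain list rather than computed from a real classifier.

```python
def guided_mean(mu, sigma_diag, grad_log_p, s=1.0):
    """Shift the reverse-step mean by s * Sigma * g (diagonal Sigma assumed).

    mu         -> mean predicted by the diffusion model
    sigma_diag -> diagonal entries of the predicted covariance
    grad_log_p -> gradient of log p_phi(y | x_t) with respect to x_t
    s          -> guidance scale
    """
    return [m + s * v * g for m, v, g in zip(mu, sigma_diag, grad_log_p)]


# Example: a larger s pushes the mean further along the classifier gradient.
shifted = guided_mean([0.0, 1.0], [0.5, 0.5], [2.0, -2.0], s=2.0)
```

The next sample $x_{t-1}$ would then be drawn from a Gaussian with this shifted mean and the same covariance.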

Classifier-free guidance (Classifier-Free Guidance)

Paper link: https://arxiv.org/pdf/2207.12598

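As the title says, this paper is about conditioning without building a classifier. In the pseudo code above, the label $c$ is dropped (replaced by the null label) with probability $p_{uncond}$, and the model is trained on the forward-pass inputs $z_t,\ t,\ c$ together.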

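The paper trains the unconditional model and the conditional model at the same time, within a single network. The unconditional model corresponds to $c = \emptyset$. $p_{uncond}$ is a hyperparameter; the paper reports experiments with 0.1, 0.2, and 0.5.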

In the pseudo code above, $w = 0$ is defined as the non-guided model. Also, $c$ is typically a text embedding, usually produced by models such as CLIP or T5.
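The two ingredients above (label dropout during training, and mixing the two predictions at sampling time) can be sketched as follows. I am assuming the combination rule $\tilde{\epsilon} = (1 + w)\,\epsilon_{cond} - w\,\epsilon_{uncond}$ from the paper; the function names and the use of `None` for the null label are hypothetical.

```python
import random

def maybe_drop_label(c, p_uncond, rng):
    """Training time: replace the label with the null label with probability p_uncond."""
    return None if rng.random() < p_uncond else c

def cfg_eps(eps_cond, eps_uncond, w):
    """Sampling time: combine conditional and unconditional noise predictions.

    w = 0 returns the conditional prediction unchanged (the non-guided model).
    """
    return [(1.0 + w) * ec - w * eu for ec, eu in zip(eps_cond, eps_uncond)]


rng = random.Random(0)
labels = [maybe_drop_label("cat", 0.2, rng) for _ in range(5)]  # some entries become None
guided = cfg_eps([1.0, 2.0], [0.0, 1.0], w=1.0)
```

Because one network sees both labeled and null-labeled inputs, both `eps_cond` and `eps_uncond` come from the same model, just called with and without the condition.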

LDM (Latent Diffusion Model), VQ-VAE (Vector Quantized-VAE)

Paper link: https://arxiv.org/pdf/2112.10752

LDM Architecture

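LDM introduces an Encoder/Decoder structure to fix the computational inefficiency of running a DM entirely in pixel space. It uses a VQ-VAE structure to perform Semantic Compression: instead of preserving meaningless high-frequency detail, it learns the low-frequency characteristics (= semantics). Before looking at LDM, let's briefly go over VQ-VAE.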

VQ-VAE

VQ-VAE uses VQ (Vector Quantization) to map the encoder output $z_{e}(x)$ onto vectors of an embedding space (the codebook, a kind of dictionary). In other words, it performs discrete sampling using a limited number of codebook vectors.

mapping

As in the equation above, the mapping picks the codebook vector with the smallest L2 distance, which gives a one-hot distribution $q(z|x)$. The vector at the selected index is then taken as $z_{q}(x)$.
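A minimal sketch of this nearest-neighbor lookup for a single latent vector, using plain lists. The function name `quantize` and the tiny codebook are made up for illustration.

```python
def quantize(z_e, codebook):
    """Map an encoder output z_e to its nearest codebook vector (squared L2).

    Returns (index k, one-hot q(z|x), quantized vector z_q).
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    k = min(range(len(codebook)), key=lambda i: sq_dist(z_e, codebook[i]))
    one_hot = [1 if i == k else 0 for i in range(len(codebook))]
    return k, one_hot, codebook[k]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
k, q, z_q = quantize([0.9, 1.2], codebook)  # nearest entry is index 1
```

In a real model the same lookup is applied independently at every spatial position of the encoder feature map.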

loss function

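The VQ-VAE loss contains sg (stop gradient). The reason is that the VQ operation is non-linear and non-differentiable. So the gradient is copied as-is from the decoder input to the encoder output (decoder -> encoder). From left to right, the loss terms are called the Reconstruction, codebook, and commitment loss.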

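Reconstruction Loss
- This term optimizes the decoder and the encoder.
- The final output of the decoder is optimized.
Codebook loss
- This term applies sg to the encoder output and updates the codebook.
  - Since gradients are copied straight through the quantization step, the reconstruction loss $\log p(x|z_{q}(x))$ does not take part in updating $e_i$; the codebook is learned through this term instead.
- The codebook is optimized to be close to the encoder output.
Commitment loss
- The encoder output is optimized to be close to the codebook vectors.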
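For reference, the three terms listed above correspond to the objective as I remember it from the VQ-VAE paper, where $\mathrm{sg}[\cdot]$ is the stop-gradient operator, $e$ is the selected codebook vector, and $\beta$ is the commitment weight:

$$ L = \log p(x \mid z_q(x)) + \left\| \mathrm{sg}[z_e(x)] - e \right\|_2^2 + \beta \left\| z_e(x) - \mathrm{sg}[e] \right\|_2^2 $$

The second term only moves $e$ (the encoder output is frozen by sg), and the third term only moves $z_e(x)$ (the codebook is frozen by sg).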

LDM regularization

Back to LDM: here are the regularizations the authors apply to avoid a high-variance latent space. They come in two kinds, KL-reg and VQ-reg.

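KL-reg
- Like a VAE, it regularizes $z$ toward a given distribution through a KL-divergence term, so that the latent space is compressed efficiently.
VQ-reg
- As we saw with VQ-VAE, it makes the latent space discrete, which enables Semantic Compression.
- For VQ-reg, the quantization layer is reportedly placed inside the decoder.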

The LDM loss is shown above. $\epsilon_{\theta}(z_t,t)$ is a time-conditional U-Net. Here, $z_t$ is obtained by first producing $z$ with the encoder $\mathcal{E}$ and then adding noise through the DM forward process; training proceeds from this $z_t$.
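For reference, the objective in the LDM paper has the following form as far as I recall, with $\mathcal{E}$ the encoder and $z_t$ the noised latent:

$$ L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t} \left[ \left\| \epsilon - \epsilon_{\theta}(z_t, t) \right\|_2^2 \right] $$

It is the usual DDPM noise-prediction loss, only evaluated on latents instead of pixels.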

conditioning mechanism

LDM uses a domain-specific encoder $\tau_{\theta}$ to turn $y$ into an intermediate representation. That representation (= embedding) is then passed to the U-Net through cross-attention.

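Q, K, and V are produced with learnable projection matrices (K and V from the representation, Q from the U-Net features), and cross-attention is computed with the shapes given in the equation. For reference, the $i$ in notation such as $d_{\epsilon}^i$ is the U-Net layer index, and $N$ is the number of flattened positions in the U-Net feature map.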
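The cross-attention step can be sketched without any framework. This is a single-head toy version under my own naming: `phi` stands for the flattened U-Net features (N rows), `tau` for the conditioning tokens from $\tau_{\theta}(y)$ (M rows), and `w_q`, `w_k`, `w_v` for the learnable projections.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    s = sum(e)
    return [v / s for v in e]

def cross_attention(phi, tau, w_q, w_k, w_v):
    """softmax(Q K^T / sqrt(d)) V with Q from image features, K/V from the condition."""
    q = matmul(phi, w_q)  # (N, d)
    k = matmul(tau, w_k)  # (M, d)
    v = matmul(tau, w_v)  # (M, d)
    d = len(q[0])
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]  # each row sums to 1 over the M tokens
    return matmul(weights, v)               # (N, d)


eye = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0]], eye, eye, eye)
```

With a single conditioning token, every image position attends fully to it, so each output row is just that token's value vector.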

DiT (Scalable Diffusion Models with Transformers)

Paper link: https://arxiv.org/pdf/2212.09748

Reference blog: https://kyujinpy.tistory.com/132


The DiT authors say they designed a transformer architecture for diffusion models so that it keeps good scaling properties. DiT therefore builds on the Vision Transformer (ViT) and handles the image task through a sequence of patches.

DiT-1. Patchify

https://kyujinpy.tistory.com/132

First, let's look at Noised Latent and Patchify, the parts marked with the red box.

Noised Latent is VAE Encoder ---> forward process, exactly as in the (upper part of the) LDM figure explained above. The next block, Patchify, splits the $I \times I \times C$ Noised Latent into $p \times p$ patches and turns it into a sequence of length $T = (I/p)^2$ with hidden dimension $d$. This gives a $T \times d$ sequence of patch tokens, and a smaller $p$ means a longer sequence and therefore higher Gflops. The sin-cos positional encoding used in ViT is also applied to the resulting sequence.
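To make the token count concrete, here is a small sketch of the splitting step on nested lists. It only does the reshaping: the linear projection of each flattened patch to dimension $d$ and the positional encoding are left out, and the function name `patchify` is my own.

```python
def patchify(latent, p):
    """Split an I x I x C latent (nested lists) into T = (I/p)^2 flattened patches.

    Each returned token has length p * p * C; projecting it to the hidden
    dimension d would be a separate linear layer, omitted here.
    """
    size = len(latent)
    if size % p != 0:
        raise ValueError("latent size must be divisible by the patch size")
    tokens = []
    for i in range(0, size, p):          # patch row
        for j in range(0, size, p):      # patch column
            token = []
            for di in range(p):
                for dj in range(p):
                    token.extend(latent[i + di][j + dj])
            tokens.append(token)
    return tokens


# Toy 4 x 4 x 1 latent whose single channel stores the flat pixel index.
latent = [[[r * 4 + c] for c in range(4)] for r in range(4)]
tokens = patchify(latent, p=2)  # 4 tokens, each of length 4
```

For a 32 x 32 latent, $p = 2$ gives $(32/2)^2 = 256$ tokens while $p = 8$ gives only 16, which is why smaller patches cost more compute.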

DiT-2. Embed

https://kyujinpy.tistory.com/132

The details of the Embed layer are in the Appendix. To embed the input timestep, it is first turned into a 256-d frequency embedding and then passed through a two-layer MLP with SiLU activation.

In short, the timestep $t$ is converted by a sinusoidal PE (Positional Encoding) into [B, 256 (= frequency dimension)]. The MLP then maps it to [B, $d_{hidden}$], where $d_{hidden}$ is the Transformer hidden size, a hyperparameter of DiT. The MLP consists of two linear layers with SiLU activation. The result is therefore [B, $d_{hidden}$], a vector that encodes the timestep information $t$ in a form the Transformer can process well.
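The sinusoidal part of this pipeline can be sketched for a single timestep as below. The cos-then-sin layout and `max_period=10000` follow the common convention and are assumptions on my part; the MLP that follows is omitted.

```python
import math

def timestep_embedding(t, dim=256, max_period=10000.0):
    """Sinusoidal frequency embedding of a scalar timestep.

    Uses dim/2 geometrically spaced frequencies; the first half of the
    output holds cosines and the second half sines (layout is an assumption).
    """
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]


emb = timestep_embedding(10)  # list of 256 floats for one timestep
```

For a batch of B timesteps this gives the [B, 256] tensor mentioned above, which the MLP then maps to [B, $d_{hidden}$].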

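Also, the timestep and the label each come in as an embedding vector of the same dimension ($d_{hidden}$ in the official implementation, as far as I can tell), and the two vectors are added (+) before being fed into the MLP.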

DiT-3. DiT Block

https://kyujinpy.tistory.com/132

Before going into the DiT Block, let's look at adaLN. Standard Layer Normalization (LN) applies shift and scale at the end through learnable parameters. In DiT, however, the shift and scale are derived from the timestep and label embeddings of the Embed layer. This kind of adaLN is often used for style transfer in GANs such as StyleGAN.

adaLN-Zero Block

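adaLN needs a shift and a scale for each of the two sub-blocks in a DiT block (self-attention and feed-forward), so four embedding vectors come out of the MLP, which takes the sum of the timestep and label embeddings as input. adaLN-Zero adds a scale factor $\alpha$ for each sub-block, for a total of six outputs. This scale factor $\alpha$ starts at 0, which is why the block is named adaLN-Zero.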

Moreover, when $\alpha = 0$ only the input tokens (the patch sequence) survive through the residual connection (see the figure above). So at initialization each DiT block is the identity function.
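The identity-at-initialization property is easy to check with a toy version of one residual branch. This is a sketch on a single token vector; the `(1 + scale)` form of the modulation and the names `shift`, `scale`, `gate` (for $\alpha$) are my assumptions, and `sublayer` stands in for self-attention or the feed-forward network.

```python
import math

def layer_norm(x, eps=1e-6):
    """Layer normalization without learnable affine parameters."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_branch(x, sublayer, shift, scale, gate):
    """One residual branch: x + gate * sublayer(LN(x) * (1 + scale) + shift)."""
    h = layer_norm(x)
    h = [hv * (1.0 + sc) + sh for hv, sc, sh in zip(h, scale, shift)]
    out = sublayer(h)
    return [xv + g * o for xv, g, o in zip(x, gate, out)]


x = [1.0, 2.0, 3.0]
toy_sublayer = lambda h: [10.0 * v for v in h]  # stand-in for attention / FFN
zeros = [0.0, 0.0, 0.0]
y = adaln_zero_branch(x, toy_sublayer, shift=zeros, scale=zeros, gate=zeros)  # equals x
```

With the gate at zero the sublayer output is multiplied away, so stacking many such blocks still starts out as the identity mapping.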

details

In addition, with adaLN-Zero the MLP outputs a vector six times the transformer hidden size. The core transformer (FFNN, SA) reportedly uses GELU.

DiT-4. Transformer Decoder

https://kyujinpy.tistory.com/132

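The Transformer Decoder applies Layer Normalization followed by Linear and Reshape, so that each patch produces an output with twice the original channel size. As shown above, the output is therefore the predicted noise and the predicted covariance; once sampling is finished, the denoised latent is passed to the VAE decoder to synthesize the actual image.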

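In the reverse process of diffusion, we move to the previous step by sampling $x_{t-1}$ from \[x_{t-1} \sim \mathcal{N}(\mu_{\theta}(x_t, t), \Sigma_{\theta}(x_t, t))\]. The original DDPM fixed this variance, but learning the covariance lets the diffusion process better reflect the noise level at timestep $t$. Also, when generating high-resolution images, a small covariance tends to give higher quality while a large covariance gives more creative results, so it can be used as a control.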
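One reverse step with a predicted mean and a predicted diagonal covariance can be sketched like this. The function name `reverse_step` and the diagonal-covariance assumption are mine; `mu` and `var` would come from the model output described above.

```python
import math
import random

def reverse_step(mu, var, rng):
    """Draw x_{t-1} ~ N(mu, diag(var)) via the reparameterization mu + sqrt(var) * noise."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


rng = random.Random(0)
x_prev = reverse_step([0.0, 1.0], [0.04, 0.04], rng)  # small variance keeps the sample near mu
```

A zero variance would return the mean exactly, and a larger variance spreads the samples further from it.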

Result

other conditioning strategies

This table compares the adaLN and adaLN-Zero variants above with Cross-attention and In-Context conditioning inside the DiT Block. In-context conditioning appends the $t, c$ embeddings to the input sequence as additional tokens and treats them much like the cls token in ViT. The Cross-attention block treats $t, c$ as a length-two sequence and adds a Cross-Attention layer after Self-Attention, with the image tokens as Query and $t, c$ as Key/Value. In the end, XL/2 with adaLN-Zero performs best.

256x256, 512x512 image SOTA

As the table above shows, DiT achieves better FID and IS than previous models.