Vector Differentiation Formulas: Complete Derivation


Introduction

When studying machine learning theory, you often encounter the operation of differentiating a scalar with respect to a vector.

This article derives the following formulas for “differentiating a scalar with respect to a vector”.

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\left\{ \begin{array}{ll} #1 \end{array} \right.}
\def\({\left(}
\def\){\right)}
\newcommand{\s}[1]{{\scriptstyle #1}}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\end{align*}

\begin{align*}
\pd{}{\bm{x}}\( \bm{x}^\T \bm{y} \) = \bm{y}, \t \pd{}{\bm{y}}\( \bm{x}^\T \bm{y} \) = \bm{x}.
\end{align*}

\begin{align*}
\pd{}{\bm{x}} \( \bm{x}^\T A \bm{x} \) = \( A + A^\T \) \bm{x}.
\end{align*}

Note: the derivative of a scalar function $f$ with respect to a vector is defined as follows.

\begin{align*}
\pd{f}{\bm{x}} = \( \pd{f}{x_1}, \pd{f}{x_2}, \dots, \pd{f}{x_n} \)^\T
\end{align*}
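
As a quick concrete example of this definition (not part of the derivation itself), take $n = 2$ and $f(\bm{x}) = x_1^2 + 3 x_1 x_2$. Then

\begin{align*}
\pd{f}{\bm{x}} = \mat{\pd{f}{x_1} \\ \pd{f}{x_2}} = \mat{2 x_1 + 3 x_2 \\ 3 x_1}.
\end{align*}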

Review of Vector Inner Products and Matrices

Before deriving the formulas, let us review vectors and matrices.

In what follows, let $\bm{x}, \bm{y}$ be $n$-dimensional column vectors and let $A$ be an $n \times n$ square matrix.

\begin{align*}
\bm{x} = \mat{x_1\\ x_2\\ \vdots \\ x_n}, \t \bm{y} = \mat{y_1\\ y_2\\ \vdots \\ y_n}.
\end{align*}

\begin{align*}
A = \mat{a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nn}}.
\end{align*}

Then, the inner product of vectors $\bm{x}, \bm{y}$ can be expressed as follows.

\begin{align}
\bm{x}^\T \bm{y} &= x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \n
&= \sum_i x_i y_i.
\end{align}

Also, the $i$-th component of the product of matrix $A$ and vector $\bm{x}$ can be expressed as follows.

\begin{align}
(A\bm{x})_i &= \mat{a_{11} & a_{12} & \dots & a_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
\style{color: #C73D2F}{a_{i1}} & \style{color: #C73D2F}{a_{i2}} & \style{color: #C73D2F}{\dots} & \style{color: #C73D2F}{a_{in}} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nn}}_{i:}
\mat{\style{color: #C73D2F}{x_1}\\ \style{color: #C73D2F}{x_2}\\ \style{color: #C73D2F}{\vdots} \\ \style{color: #C73D2F}{x_n}} \n
&= a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \n
&= \sum_j a_{ij}x_j.
\end{align}
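
For a quick numerical sanity check of the two componentwise expressions above, here is a minimal NumPy sketch (assuming NumPy is available; the dimension $n = 4$, the random seed, and the row index `i` are arbitrary choices). It compares the explicit sums with NumPy's built-in inner and matrix-vector products.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed
n = 4                              # arbitrary dimension
x, y = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))

# Inner product as a componentwise sum: x^T y = sum_i x_i y_i
assert np.isclose(x @ y, sum(x[i] * y[i] for i in range(n)))

# i-th component of A x as a row-times-vector sum: (A x)_i = sum_j a_ij x_j
i = 2                              # arbitrary row index
assert np.isclose((A @ x)[i], sum(A[i, j] * x[j] for j in range(n)))
```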

With the above in mind, let us derive the formulas.

Derivation of $\frac{\partial }{\partial \bm{x}}(\bm{x}^\mathsf{T} \bm{y})=\bm{y}$

\begin{align*}
\pd{}{\bm{x}}\( \bm{x}^\T \bm{y} \) = \(\pd{\bm{x}^\T \bm{y}}{x_1}, \dots, \pd{\bm{x}^\T \bm{y}}{x_i}, \dots, \pd{\bm{x}^\T \bm{y}}{x_n} \)^\T
\end{align*}

Therefore, calculating $\pd{\bm{x}^\T \bm{y}}{x_i}$ gives the following.

\begin{align*}
\pd{}{x_i} \( \bm{x}^\T\bm{y} \) &= \pd{}{x_i} \( \sum_j x_j y_j \) \n
&= \sum_j \delta_{ij} y_j \n
&= y_i
\end{align*}

Here, $\delta_{ij}$ is the Kronecker delta: $\delta_{ij} \equiv \case{1 & (i=j) \\ 0 & (i \neq j)}$.

Therefore, the following holds.

\begin{align*}
\pd{}{\bm{x}}\( \bm{x}^\T \bm{y} \) = \bm{y}.
\end{align*}


Similarly, for $\pd{}{\bm{y}} \( \bm{x}^\T \bm{y} \)$, computing the $i$-th component gives

\begin{align*}
\pd{}{y_i} \( \bm{x}^\T\bm{y} \) &= \pd{}{y_i} \( \sum_j x_j y_j \) \n
&= \sum_j x_j \delta_{ij} \n
&= x_i
\end{align*}

Therefore, the following holds.

\begin{align*}
\pd{}{\bm{y}}\( \bm{x}^\T \bm{y} \) = \bm{x}.
\end{align*}
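
Both formulas can also be verified numerically with a central finite-difference approximation. The following is a minimal sketch, assuming NumPy is available; `numerical_grad` is just a helper name introduced here, and the dimension and step size `eps` are arbitrary.

```python
import numpy as np

def numerical_grad(f, v, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at v."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = eps
        grad[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

# d/dx (x^T y) = y   and   d/dy (x^T y) = x
assert np.allclose(numerical_grad(lambda v: v @ y, x), y, atol=1e-6)
assert np.allclose(numerical_grad(lambda v: x @ v, y), x, atol=1e-6)
```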

Derivation of $\frac{\partial }{\partial \bm{x}}(\bm{x}^\mathsf{T} A \bm{x})=(A + A^\mathsf{T})\bm{x}$

First, let us write $\bm{x}^\T A \bm{x}$ in terms of its components.

\begin{align*}
\bm{x}^\T A \bm{x} &= \sum_i x_i \( A \bm{x} \)_i \n
&= \sum_i x_i \( \sum_j a_{ij} x_j \) \n
&= \sum_i \sum_j a_{ij} x_i x_j
\end{align*}

Using this expression, $\pd{}{x_i} (\bm{x}^\T A \bm{x})$ can be calculated as follows (the summation indices are renamed to $\mu, \nu$ so that they do not collide with the differentiation index $i$).

\begin{align*}
\pd{}{x_i} (\bm{x}^\T A \bm{x}) &= \pd{}{x_i} \( \sum_{\mu} \sum_{\nu} a_{\mu \nu} x_{\mu} x_{\nu} \) \n
&= \sum_{\mu} \sum_{\nu} a_{\mu \nu} \pd{}{x_i} \( x_{\mu} x_{\nu} \) \n
&= \sum_{\mu} \sum_{\nu} a_{\mu \nu} \( \delta_{i \mu} x_{\nu} + x_{\mu} \delta_{i \nu} \) \n
&= \( \sum_{\mu} \sum_{\nu} a_{\mu \nu} \delta_{i \mu} x_{\nu} \) + \( \sum_{\mu} \sum_{\nu} a_{\mu \nu} x_{\mu} \delta_{i \nu} \) \n
&= \(\sum_{\nu} a_{i \nu} x_{\nu} \) + \(\sum_{\mu} a_{\mu i} x_{\mu} \) \n
&= \(\sum_{\nu} (A)_{i \nu} x_{\nu} \) + \(\sum_{\mu} (A^\T)_{i \mu} x_{\mu} \) \n
&= ( A\bm{x} )_i + ( A^\T \bm{x} )_i\ .
\end{align*}

Therefore, the following holds.

\begin{align*}
\pd{}{\bm{x}} \( \bm{x}^\T A \bm{x} \) = \( A + A^\T \) \bm{x}.
\end{align*}
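
The same finite-difference check works for the quadratic form. Below is a minimal self-contained sketch, again assuming NumPy, with an arbitrary (generally non-symmetric) random $A$ and an arbitrary step size.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)
A = rng.normal(size=(n, n))        # generally not symmetric

def f(v):
    return v @ A @ v               # the scalar v^T A v

eps = 1e-6
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

assert np.allclose(grad, (A + A.T) @ x, atol=1e-5)
```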

In particular, when $A$ is a symmetric matrix, $A = A^\T$ holds, and the formula simplifies as follows.

\begin{align*}
\pd{}{\bm{x}} \( \bm{x}^\T A \bm{x} \) = 2 A \bm{x}.
\end{align*}
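
As a tiny illustration of this special case, the following sketch (assuming NumPy; the symmetric $A$ is built by symmetrizing a random matrix) confirms that $(A + A^\T)\bm{x}$ coincides with $2A\bm{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B + B.T                        # an arbitrary symmetric matrix
x = rng.normal(size=4)

# For symmetric A, (A + A^T) x reduces to 2 A x
assert np.allclose((A + A.T) @ x, 2 * A @ x)
```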
