Vector Differentiation Formulas: Complete Derivation


Introduction

When studying machine learning theory, you often encounter the operation of differentiating a scalar with respect to a vector.

This article derives the following formulas for “differentiating a scalar with respect to a vector”.

\begin{align*}
\newcommand{\mat}[1]{\begin{pmatrix} #1 \end{pmatrix}}
\newcommand{\f}[2]{\frac{#1}{#2}}
\newcommand{\pd}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\d}[2]{\frac{{\rm d}#1}{{\rm d}#2}}
\newcommand{\T}{\mathsf{T}}
\newcommand{\dis}{\displaystyle}
\newcommand{\eq}[1]{{\rm Eq}(\ref{#1})}
\newcommand{\n}{\notag\\}
\newcommand{\t}{\ \ \ \ }
\newcommand{\argmax}{\mathop{\rm arg\, max}\limits}
\newcommand{\argmin}{\mathop{\rm arg\, min}\limits}
\def\l<#1>{\left\langle #1 \right\rangle}
\def\us#1_#2{\underset{#2}{#1}}
\def\os#1^#2{\overset{#2}{#1}}
\newcommand{\case}[1]{\left\{ \begin{array}{ll} #1 \end{array} \right.}
\def\({\left(}
\def\){\right)}
\newcommand{\s}[1]{{\scriptstyle #1}}
\newcommand{\c}[2]{\textcolor{#1}{#2}}
\end{align*}

\begin{align*}
\pd{}{\bm{x}}\( \bm{x}^\T \bm{y} \) = \bm{y}, \t \pd{}{\bm{y}}\( \bm{x}^\T \bm{y} \) = \bm{x}.
\end{align*}

\begin{align*}
\pd{}{\bm{x}} \( \bm{x}^\T A \bm{x} \) = \( A + A^\T \) \bm{x}.
\end{align*}

Note: the derivative of a scalar function $f$ with respect to a vector is defined as follows.

\begin{align*}
\pd{f}{\bm{x}} = \( \pd{f}{x_1}, \pd{f}{x_2}, \dots, \pd{f}{x_n} \)^\T
\end{align*}
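
As a quick concrete example of this definition (not part of the derivation itself), take $n = 2$ and $f(\bm{x}) = x_1^2 + 3 x_1 x_2$. Then

\begin{align*}
\pd{f}{\bm{x}} = \mat{\pd{f}{x_1} \\ \pd{f}{x_2}} = \mat{2 x_1 + 3 x_2 \\ 3 x_1}.
\end{align*}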

Review of Vector Inner Products and Matrices

Before deriving the formulas, let us review vectors and matrices.

In what follows, let $\bm{x}, \bm{y}$ be $n$-dimensional column vectors and let $A$ be an $n \times n$ square matrix.

\begin{align*}
\bm{x} = \mat{x_1\\ x_2\\ \vdots \\ x_n}, \t \bm{y} = \mat{y_1\\ y_2\\ \vdots \\ y_n}.
\end{align*}

\begin{align*}
A = \mat{a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nn}}.
\end{align*}

Then, the inner product of vectors $\bm{x}, \bm{y}$ can be expressed as follows.

\begin{align}
\bm{x}^\T \bm{y} &= x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \n
&= \sum_i x_i y_i.
\end{align}

Also, the $i$-th component of the product of matrix $A$ and vector $\bm{x}$ can be expressed as follows.

\begin{align}
(A\bm{x})_i &= \mat{a_{11} & a_{12} & \dots & a_{1n} \\
\vdots & \vdots & \ddots & \vdots \\
\style{color: #C73D2F}{a_{i1}} & \style{color: #C73D2F}{a_{i2}} & \style{color: #C73D2F}{\dots} & \style{color: #C73D2F}{a_{in}} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & \dots & a_{nn}}_{i:}
\mat{\style{color: #C73D2F}{x_1}\\ \style{color: #C73D2F}{x_2}\\ \style{color: #C73D2F}{\vdots} \\ \style{color: #C73D2F}{x_n}} \n
&= a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n \n
&= \sum_j a_{ij}x_j.
\end{align}
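
For a quick numerical sanity check of the two componentwise expressions above, here is a minimal NumPy sketch (assuming NumPy is available; the dimension $n = 4$, the random seed, and the row index `i` are arbitrary choices). It compares the explicit sums with NumPy's built-in inner and matrix-vector products.

```python
import numpy as np

rng = np.random.default_rng(0)     # arbitrary seed
n = 4                              # arbitrary dimension
x, y = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n))

# Inner product as a componentwise sum: x^T y = sum_i x_i y_i
assert np.isclose(x @ y, sum(x[i] * y[i] for i in range(n)))

# i-th component of A x as a row-times-vector sum: (A x)_i = sum_j a_ij x_j
i = 2                              # arbitrary row index
assert np.isclose((A @ x)[i], sum(A[i, j] * x[j] for j in range(n)))
```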

With the above in mind, let us derive the formulas.

Derivation of $\frac{\partial }{\partial \bm{x}}(\bm{x}^\mathsf{T} \bm{y})=\bm{y}$

\begin{align*}
\pd{}{\bm{x}}\( \bm{x}^\T \bm{y} \) = \(\pd{\bm{x}^\T \bm{y}}{x_1}, \dots, \pd{\bm{x}^\T \bm{y}}{x_i}, \dots, \pd{\bm{x}^\T \bm{y}}{x_n} \)^\T
\end{align*}

Therefore, calculating $\pd{\bm{x}^\T \bm{y}}{x_i}$ gives the following.

\begin{align*}
\pd{}{x_i} \( \bm{x}^\T\bm{y} \) &= \pd{}{x_i} \( \sum_j x_j y_j \) \n
&= \sum_j \delta_{ij} y_j \n
&= y_i
\end{align*}

Here, $\delta_{ij}$ is the Kronecker delta: $\delta_{ij} \equiv \case{1 & (i=j) \\ 0 & (i \neq j)}$.

Therefore, the following holds.

\begin{align*}
\pd{}{\bm{x}}\( \bm{x}^\T \bm{y} \) = \bm{y}.
\end{align*}


Similarly, for $\pd{}{\bm{y}} \( \bm{x}^\T \bm{y} \)$, computing the $i$-th component gives

\begin{align*}
\pd{}{y_i} \( \bm{x}^\T\bm{y} \) &= \pd{}{y_i} \( \sum_j x_j y_j \) \n
&= \sum_j x_j \delta_{ij} \n
&= x_i
\end{align*}

Therefore, the following holds.

\begin{align*}
\pd{}{\bm{y}}\( \bm{x}^\T \bm{y} \) = \bm{x}.
\end{align*}
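
Both formulas can also be verified numerically with a central finite-difference approximation. The following is a minimal sketch, assuming NumPy is available; `numerical_grad` is just a helper name introduced here, and the dimension and step size `eps` are arbitrary.

```python
import numpy as np

def numerical_grad(f, v, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f at v."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        e = np.zeros_like(v)
        e[i] = eps
        grad[i] = (f(v + e) - f(v - e)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

# d/dx (x^T y) = y   and   d/dy (x^T y) = x
assert np.allclose(numerical_grad(lambda v: v @ y, x), y, atol=1e-6)
assert np.allclose(numerical_grad(lambda v: x @ v, y), x, atol=1e-6)
```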

Derivation of $\frac{\partial }{\partial \bm{x}}(\bm{x}^\mathsf{T} A \bm{x})=(A + A^\mathsf{T})\bm{x}$

First, let us write $\bm{x}^\T A \bm{x}$ in terms of its components.

\begin{align*}
\bm{x}^\T A \bm{x} &= \sum_i x_i \( A \bm{x} \)_i \n
&= \sum_i x_i \( \sum_j a_{ij} x_j \) \n
&= \sum_i \sum_j a_{ij} x_i x_j
\end{align*}

Using this expression, $\pd{}{x_i} (\bm{x}^\T A \bm{x})$ can be calculated as follows (the summation indices are renamed to $\mu, \nu$ so that they do not collide with the differentiation index $i$).

\begin{align*}
\pd{}{x_i} (\bm{x}^\T A \bm{x}) &= \pd{}{x_i} \( \sum_{\mu} \sum_{\nu} a_{\mu \nu} x_{\mu} x_{\nu} \) \n
&= \sum_{\mu} \sum_{\nu} a_{\mu \nu} \pd{}{x_i} \( x_{\mu} x_{\nu} \) \n
&= \sum_{\mu} \sum_{\nu} a_{\mu \nu} \( \delta_{i \mu} x_{\nu} + x_{\mu} \delta_{i \nu} \) \n
&= \( \sum_{\mu} \sum_{\nu} a_{\mu \nu} \delta_{i \mu} x_{\nu} \) + \( \sum_{\mu} \sum_{\nu} a_{\mu \nu} x_{\mu} \delta_{i \nu} \) \n
&= \(\sum_{\nu} a_{i \nu} x_{\nu} \) + \(\sum_{\mu} a_{\mu i} x_{\mu} \) \n
&= \(\sum_{\nu} (A)_{i \nu} x_{\nu} \) + \(\sum_{\mu} (A^\T)_{i \mu} x_{\mu} \) \n
&= ( A\bm{x} )_i + ( A^\T \bm{x} )_i\ .
\end{align*}

Therefore, the following holds.

\begin{align*}
\pd{}{\bm{x}} \( \bm{x}^\T A \bm{x} \) = \( A + A^\T \) \bm{x}.
\end{align*}
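
The same finite-difference check works for the quadratic form. Below is a minimal self-contained sketch, again assuming NumPy, with an arbitrary (generally non-symmetric) random $A$ and an arbitrary step size.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.normal(size=n)
A = rng.normal(size=(n, n))        # generally not symmetric

def f(v):
    return v @ A @ v               # the scalar v^T A v

eps = 1e-6
grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)])

assert np.allclose(grad, (A + A.T) @ x, atol=1e-5)
```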

In particular, when $A$ is a symmetric matrix, $A = A^\T$ holds, and the formula simplifies as follows.

\begin{align*}
\pd{}{\bm{x}} \( \bm{x}^\T A \bm{x} \) = 2 A \bm{x}.
\end{align*}
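
As a tiny illustration of this special case, the following sketch (assuming NumPy; the symmetric $A$ is built by symmetrizing a random matrix) confirms that $(A + A^\T)\bm{x}$ coincides with $2A\bm{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
A = B + B.T                        # an arbitrary symmetric matrix
x = rng.normal(size=4)

# For symmetric A, (A + A^T) x reduces to 2 A x
assert np.allclose((A + A.T) @ x, 2 * A @ x)
```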
