Best Fit Straight Line
Author: Han de Bruijn
Date: 2017 November
Suppose we have a collection of points in the plane and we want to draw a
straight line through these points, in such a way that the line is a "best
fit".
An equation for a straight line can always be set up as follows:
$$
\cos(\theta) (x - p) + \sin(\theta) (y - q) = 0
$$
Here $\theta$ is the angle of the line's normal with the x-axis and $(p,q)$
is a point supporting the line. The distance of an arbitrary point $\vec{r}_k
= (x_k,y_k)$ to the line is given by the length of the projection of the point's
vector $\vec{r}_k$ onto the line. The normal $\vec{n}$ of the line is given by
$\vec{n} = (\cos(\theta),\sin(\theta))$. Hence the length of the projection is:
$$
\left| \frac{(\vec{r}_k \cdot \vec{n})}{(\vec{n} \cdot \vec{n})} \vec{n} \right| =
\left| \cos(\theta) (x_k - p) + \sin(\theta) (y_k - q) \right|
$$
For the straight line to be a "best fit", it will be required that the sum of
the weighted squares of all distances to the line shall be a minimum:
$$
\sum_k w_k \left[ \cos(\theta) (x_k - p) + \sin(\theta) (y_k - q) \right]^2
= \mbox{minimum}(p,q,\theta)
$$
Here the weights $w_k$ are chosen in such a way that $\sum_k w_k = 1$ .
Working out a bit:
$$
\cos^2(\theta) \sum_k w_k (x_k - p)^2 +
\sin^2(\theta) \sum_k w_k (y_k - q)^2 +
$$ $$
2 \sin(\theta) \cos(\theta) \sum_k w_k (x_k - p) (y_k - q)
= \mbox{minimum}(p,q,\theta)
$$
Let us solve just one part of the puzzle, namely: how the points $(p,q)$
must be selected in such a way that a minimum may be reached with respect to
this choice. For certain parts of the above expression this would mean that:
$$
\sum_{k} w_k (x_k - p)^2 = \mbox{minimum} \quad \mbox{and} \quad
\sum_{k} w_k (y_k - q)^2 = \mbox{minimum}
$$
It is sufficient to consider only the expression in $x$. Differentiate to $p$:
$$
\frac{d}{dp} \left[\sum_k w_k x_k^2 - 2 p \sum_k w_k x_k + p^2 \sum_k w_k
\right] = - 2 \sum_k w_k x_k + 2 p \sum_k w_k = 0
$$ $$
\quad \Longrightarrow \quad p = \mu_x = \sum_k w_k x_k
$$
And in very much the same way:
$$
q = \mu_y = \sum_k w_k y_k
$$
Define second order momenta with respect to the midpoint as usual:
$$
\sigma_{xx} = \sum_k w_k (x_k - \mu_x)^2 \quad \mbox{and} \quad
\sigma_{yy} = \sum_k w_k (y_k - \mu_y)^2
$$ $$
\sigma_{xy} = \sum_k w_k (x_k - \mu_x) (y_k - \mu_y)
$$
We can concentrate now on minimalization with respect to the angle $\theta$:
$$
\cos^2(\theta) \sigma_{xx} + \sin^2(\theta) \sigma_{yy}
+ 2 \sin(\theta) \cos(\theta) \sigma_{xy} = \mbox{minimum}(\theta)
$$
Extreme values may be found by differentiation to the independent variable:
$$
- 2 \sin(\theta) \cos(\theta) \sigma_{xx} + 2 \cos(\theta) \sin(\theta) \sigma_{yy}
+ 2 \cos^2(\theta) \sigma_{xy} - 2 \sin^2(\theta) \sigma_{xy} = 0
$$
Which leads to the equation:
$$
- \sin(2 \theta) (\sigma_{xx} - \sigma_{yy}) + \cos(2 \theta) 2 \sigma_{xy} = 0
$$
A solution is:
$$
\tan(2\theta) = \frac{2 \sigma_{xy}}{\sigma_{xx} - \sigma_{yy}}
\quad \Longrightarrow \quad \theta =
\frac{1}{2} \left[ \arctan\left(\frac{2 \sigma_{xy}}{\sigma_{xx} - \sigma_{yy}}\right) + k\cdot\pi \right]
$$
with $k = 0,1,2, \cdots$ . But the extreme value can still be a minimum or a maximum.
In order to decide about this, we have to evaluate the expression:
$$
M(\theta) = \cos^2(\theta) \sigma_{xx} + \sin^2(\theta) \sigma_{yy}
+ 2 \sin(\theta)\cos(\theta) \sigma_{xy}
$$
for two different values of $\,\theta$ , namely with $\,k=0,1$ .
Another way to arrive at the same result (and perhaps somewhat more) goes as follows.
It is noticed that the points cloud $(x_k,y_k)$ can be surrounded by a
Best Fit Ellipse :
$$
\frac{1}{2}\frac{\sigma_{yy}(x-\mu_x)^2-2\sigma_{xy}(x-\mu_x)(y-\mu_y)+\sigma_{xx}(y-\mu_y)^2}
{\sigma_{xx}\sigma_{yy}-\sigma_{xy}^2} = 1
$$
Where the different quantities are defined as has been done above.
That ellipse - in general - has two main axes: a long one and a short one. The lengths of these axes
are calculated by solving an eigenvalue ($\lambda$) problem:
$$
\begin{vmatrix} \sigma_{xx}-\lambda & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy}-\lambda \end{vmatrix} = 0
$$
Giving:
$$
(\sigma_{xx}-\lambda)(\sigma_{yy}-\lambda)-\sigma_{xy}^2=0 \quad \Longleftrightarrow \quad
\lambda^2 - (\sigma_{xx}+\sigma_{yy})\lambda + (\sigma_{xx}\sigma_{yy}-\sigma_{xy}^2) = 0
\\ \Longleftrightarrow \quad
\lambda_{\pm} = \frac{\sigma_{xx}+\sigma_{yy}}{2} \pm
\sqrt{\left(\frac{\sigma_{xx}+\sigma_{yy}}{2}\right)^2 - (\sigma_{xx}\sigma_{yy}-\sigma_{xy}^2)}
$$
The long axis of the best fit ellipse is expected to be coincident with the best fit line to the points cloud.
The short axis is perpendicular to the long axis. Both axes are calculated as eigenvectors:
$$
\begin{bmatrix} \sigma_{xx}-\lambda_{\pm} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy}-\lambda_{\pm} \end{bmatrix}
\begin{bmatrix} \cos(\theta) \\ \sin(\theta) \end{bmatrix} = 0 \quad \Longrightarrow \\
\begin{cases} \left[\sigma_{xx}-\lambda_{\pm}\right]\cos(\theta) + \sigma_{xy}\sin(\theta) = 0 \\
\sigma_{xy}\cos(\theta) + \left[\sigma_{yy}-\lambda_{\pm}\right]\sin(\theta) = 0 \end{cases} \quad \Longrightarrow \\
\tan(\theta) = - \frac{\sigma_{xx}-\lambda_{\pm}}{\sigma_{xy}} = - \frac{\sigma_{xy}}{\sigma_{yy}-\lambda_{\pm}}
$$
These are a much simpler formulas than the one found for $\tan(\theta)$ before.
The long axis corresponds with the smallest eigenvalue and the short axis corresponds with the largest eigenvalue.
Herewith we may also have an estimate for the error
in the angle $\theta$ , which is also the error in the angle of the slope of the straight line:
$$
\Delta A = \tan(\Delta \theta) = \sqrt{\frac{\lambda_{-}}{\lambda_{+}}} =
\frac{\lambda_{-}}{\sqrt{\sigma_{xx}\sigma_{yy}-\sigma_{xy}^2}}
$$