= Mouse Picking Demystified =
April 5, 2005
== Introduction ==
There comes a time in every 3D game where the user needs to click on something in the scene. Maybe he needs to select a unit in an RTS, or open a door in an RPG, or delete some geometry in a level editing tool. This conceptually simple task is easy to screw up since there are so many little steps that can go wrong.
The problem is this: ''given the mouse's position in window coordinates, how can I determine what object in the scene the user has selected with a mouse click?''
One method is to generate a ray using the mouse's location and then intersect it with the world geometry, finding the object nearest to the viewer. Alternatively we can determine the actual 3-D location that the user has clicked on by sampling the depth buffer (giving us (''x,y,z'') in viewport space) and performing an inverse transformation. Technically there is a third approach, using a selection or object ID buffer, but this has numerous limitations that make it impractical for widespread use.
This article describes using the ''inverse transformation'' to derive world space coordinates from the mouse's position on screen.
Before we worry about the inverse transformation, we need to establish how the standard forward transformation works in a typical graphics pipeline.
== The View Transformation ==
The standard view transformation pipeline takes a point in model space and transforms it all the way to viewport space (sometimes known as window coordinates) or, for systems without a window system, screen coordinates. It does this by transforming the original point through a series of coordinate systems:
{{{
#!latex-math-hook
\begin{eqnarray*}
&Model \\ &\downarrow \\
&World \\ &\downarrow \\
&View \\ & \downarrow \\
&Clip \\ & \downarrow \\
&Normalized Device\\ &\downarrow \\
&Viewport
\end{eqnarray*}
}}}
Not every step is discrete. OpenGL has the {{{GL_MODELVIEW}}} matrix, '''{{{M}}}''', that transforms a point from model space to view space, using a right-handed coordinate system with +Y up, +X to the right, and -Z into the screen.
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol M}{\boldsymbol p}&=&{\boldsymbol v} \\
\text{where} \\
{\boldsymbol M}&=&\text{modelview transformation matrix} \\
\boldsymbol{p} &=& \text{point in model space} \\
\boldsymbol{v} &=& \text{point in view space} \\
\end{eqnarray*}
}}}
Another matrix, {{{GL_PROJECTION}}}, then transforms the point from view space to homogeneous clipping space. Clip space is a left-handed coordinate system (+Z into the screen); once the perspective divide has been applied, the visible region is the canonical clipping volume extending from {{{(-1,-1,-1)}}} to {{{(+1,+1,+1)}}}:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol P}{\boldsymbol v}&=&{\boldsymbol c} \\
\text{where} \\
{\boldsymbol P}&=&\text{projection matrix} \\
\boldsymbol{v} &=& \text{point in view space} \\
\boldsymbol{c} &=& \text{point in clip space} \\
\end{eqnarray*}
}}}
After clipping is performed the perspective divide transforms the homogeneous coordinate to a Cartesian point in normalized device space. Normalized device coordinates are left-handed, with ''w'' = 1, and are contained within the canonical view volume from {{{(-1,-1,-1)}}} to {{{(+1,+1,+1)}}}:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol n}&=&\frac{{\boldsymbol c}}{{\boldsymbol c}_w} \\
\text{where} \\
{\boldsymbol n}&=&\text{point in normalized device coordinate} \\
{\boldsymbol c}&=&\text{clipped point in homogeneous clipping space} \\
\end{eqnarray*}
}}}
Finally there is the viewport scale and translation, which transforms the normalized device coordinate into viewport (window) coordinates. Another axis inversion occurs here; this time +Y goes down instead of up (some window systems may place the origin at another location, such as the bottom left of the window, so this isn't always true). Viewport depth values are calculated by rescaling normalized device coordinate Z values from the range {{{(-1,1)}}} to {{{(0,1)}}}, with 0 at the near clip plane and 1 at the far clip plane. Note: any user-specified depth bias may impact our calculations later.
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol V}{\boldsymbol n}&=&{\boldsymbol w} \\
\text{where} \\
{\boldsymbol V}&=&\text{viewport transformation matrix} \\
{\boldsymbol n}&=&\text{point in normalized device coordinates} \\
{\boldsymbol w}&=&\text{point in viewport/window coordinates} \\
\end{eqnarray*}
}}}
This pipeline allows us to take a model space point, apply a series of transformations, and get a window space point at the end.
Our ultimate goal is to transform the mouse position (in window space) all the way ''back'' to world space. Since we're not rendering a model, model space and world space are the same thing.
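The forward pipeline above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: matrices are row-major nested lists (matching the equations, not OpenGL's column-major memory layout), the window origin is at the top left, and the helper names {{{mat_vec}}}, {{{perspective}}}, and {{{project}}} are made up for this example. The projection matrix follows the layout documented for {{{gluPerspective}}}.

```python
import math

def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-element column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def perspective(fovy_deg, aspect, z_near, z_far):
    """Build a projection matrix with the layout gluPerspective documents."""
    f = 1.0 / math.tan(math.radians(fovy_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (z_far + z_near) / (z_near - z_far),
         2.0 * z_far * z_near / (z_near - z_far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def project(point, modelview, projection, width, height):
    """Model space -> window space (assumes a top-left window origin)."""
    view = mat_vec(modelview, [point[0], point[1], point[2], 1.0])  # M p = v
    clip = mat_vec(projection, view)                                # P v = c
    ndc = [clip[i] / clip[3] for i in range(3)]                     # n = c / c_w
    return [
        (ndc[0] + 1.0) / 2.0 * width,    # x: [-1, 1] -> [0, width]
        (1.0 - ndc[1]) / 2.0 * height,   # y: flipped, +Y points down
        (ndc[2] + 1.0) / 2.0,            # depth: [-1, 1] -> [0, 1]
    ]

# Hypothetical setup: identity modelview, 800x600 window, 60 degree field of view.
IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
P = perspective(60.0, 800.0 / 600.0, 1.0, 100.0)

# A point straight ahead of the camera lands in the middle of the window.
window = project([0.0, 0.0, -10.0], IDENTITY, P, 800, 600)
```

Everything that follows in this article is about running these same steps in reverse.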
== The Inverse View Transformation ==
To go from mouse coordinates to world coordinates we have to do the exact opposite of the view transformation:
{{{
#!latex-math-hook
\begin{equation*}
\text{Viewport} \to \text{NDC} \to \text{Clip} \to \text{View} \to \text{World/Model}
\end{equation*}
}}}
That's a lot of steps, and a small mistake in any one of them is enough to blow everything apart.
=== Viewport to NDC to Clip ===
The first step is to transform the viewport coordinates into clip coordinates. The viewport transformation takes a normalized device coordinate and transforms it into a viewport coordinate:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol V}{\boldsymbol n}&=&{\boldsymbol v}\\
&=&
\begin{pmatrix}
\frac{ {\boldsymbol n}_x + 1 }{2}w\\
\frac{ 1-{\boldsymbol n}_y}{2}h\\
\frac{ {\boldsymbol n}_z + 1 }{ 2 }
\end{pmatrix}\\
\end{eqnarray*}
\begin{eqnarray*}
\text{where}\\
\boldsymbol{V}&=&\text{viewport transformation matrix}\\
\boldsymbol{n}&=&\text{normalized device coordinate}\\
\boldsymbol{v}&=&\text{point in viewport/window space}\\
w, h&=&\text{viewport width and height in pixels}\\
\end{eqnarray*}
}}}
So we need to do the inverse of this process by rearranging to solve for '''n''':
{{{
#!latex-math-hook
\begin{equation*}
\label{eqn:vp to ndc}
{\boldsymbol n} = \begin{pmatrix}
\frac{ 2{\boldsymbol v}_x }{w} - 1\\
1 - \frac{ 2{\boldsymbol v}_y }{h}\\
2{\boldsymbol v}_z - 1\\
1
\end{pmatrix}
\end{equation*}
}}}
Okay, not so bad. The only real question is the {{{z}}} component of '''v'''. We can either obtain that value by reading the depth buffer at the mouse position, or ignore it by substituting 0 (the near clip plane), in which case we'll be computing a ray passing through '''v''' that we'll then have to intersect with world geometry to find the corresponding point in 3-space.
From here we need to go to clip coordinates, which, if you recall, are the homogeneous versions of the NDC coordinates (i.e. ''w'' is not necessarily 1.0). The transformation back to clip coordinates is a scale by ''w'', and since our ''w'' is already 1.0 that scale does nothing. This step can be skipped, and we can assume that our NDC coordinates are the same as our clip coordinates.
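As a minimal sketch of this step, assuming a window whose origin is at the top left with +Y pointing down (flip the Y term if your window system puts the origin at the bottom left), the inverse viewport transform might look like the following. The function names are hypothetical; {{{ndc_to_window}}} is included only to check the round trip.

```python
def window_to_ndc(x, y, depth, width, height):
    """Invert the viewport transform; assumes a top-left window origin."""
    return [
        2.0 * x / width - 1.0,
        1.0 - 2.0 * y / height,   # undo the Y flip: window +Y points down
        2.0 * depth - 1.0,        # depth 0 -> near plane, depth 1 -> far plane
        1.0,                      # w = 1, so this doubles as a clip coordinate
    ]

def ndc_to_window(n, width, height):
    """Forward viewport transform, used here only to check the round trip."""
    return [
        (n[0] + 1.0) / 2.0 * width,
        (1.0 - n[1]) / 2.0 * height,
        (n[2] + 1.0) / 2.0,
    ]

center = window_to_ndc(400, 300, 0.0, 800, 600)   # [0.0, 0.0, -1.0, 1.0]
top_left = window_to_ndc(0, 0, 1.0, 800, 600)     # [-1.0, 1.0, 1.0, 1.0]
```

A quick sanity check is that the top-left pixel maps to NDC {{{(-1, +1)}}}; if it comes out as {{{(-1, -1)}}} the Y flip has been forgotten.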
=== Clipping Space to View Space ===
A point in view space is transformed to clipping space with the {{{GL_PROJECTION}}} matrix:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol P}{\boldsymbol v}&=&{\boldsymbol c} \\
\text{where} \\
{\boldsymbol P}&=&\text{projection matrix} \\
\boldsymbol{v} &=& \text{point in view space} \\
\boldsymbol{c} &=& \text{point in clip space} \\
\end{eqnarray*}
}}}
Given this we can do the opposite by multiplying the clipping space coordinate by the inverse of the {{{GL_PROJECTION}}} matrix. This isn't as bad as it sounds since we can avoid computing a true 4x4 matrix inverse if we just construct the inverse projection matrix at the same time we build the projection matrix.
A typical OpenGL projection matrix takes the form:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol P} = \begin{pmatrix}
a & 0 & 0 & 0 \\
0 & b & 0 & 0 \\
0 & 0 & c & d \\
0 & 0 & e & 0 \end{pmatrix}
\end{eqnarray*}
}}}
The specific coefficient values depend on the nature of the perspective projection matrix (for more information I recommend you look at the documentation for [http://www.opengl.org/documentation/specs/man_pages/hardcopy/GL/html/glu/perspective.html gluPerspective]). These coefficients should scale and bias the ''x'', ''y'', and ''z'' components of a point while assigning -''z'' to ''w''.
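For a concrete feel of the coefficients, here is a small sketch that computes ''a'' through ''e'' the way the {{{gluPerspective}}} documentation describes its matrix. The function names and the example parameters are assumptions made for illustration; other projection setups will give different values.

```python
import math

def perspective_coefficients(fovy_deg, aspect, z_near, z_far):
    """Return (a, b, c, d, e) for a gluPerspective-style projection matrix."""
    f = 1.0 / math.tan(math.radians(fovy_deg) / 2.0)  # cotangent of half the FOV
    a = f / aspect
    b = f
    c = (z_far + z_near) / (z_near - z_far)
    d = 2.0 * z_far * z_near / (z_near - z_far)
    e = -1.0                                          # assigns -z to w
    return a, b, c, d, e

def ndc_depth(z_view, c, d, e):
    """NDC z of a view space point with w = 1: (c*z + d) / (e*z)."""
    return (c * z_view + d) / (e * z_view)

# Hypothetical camera: 90 degree FOV, square viewport, near = 1, far = 3.
a, b, c, d, e = perspective_coefficients(90.0, 1.0, 1.0, 3.0)

# The near plane maps to NDC z = -1 and the far plane to NDC z = +1.
near_ndc = ndc_depth(-1.0, c, d, e)
far_ndc = ndc_depth(-3.0, c, d, e)
```

With these parameters ''c'' is -2 and ''d'' is -3, which makes the later inverse easy to check by hand.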
To transform from view coordinates to clip coordinates:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol P}{\boldsymbol v}&=&
{\boldsymbol c}\\
&=&
\begin{pmatrix}
a{\boldsymbol v}_x\\b{\boldsymbol v}_y\\c{\boldsymbol v}_z
+ d{\boldsymbol v}_w\\e{\boldsymbol v}_z
\end{pmatrix}\\
\text{where}\\
{\boldsymbol P}&=&\text{projection matrix as described earlier}\\
{\boldsymbol v}&=&\text{point in view coordinates}\\
{\boldsymbol c}&=&\text{point in clipping coordinates}\\
\end{eqnarray*}
}}}
So solving for '''v''' we get:
{{{
#!latex-math-hook
\begin{equation*}
{\boldsymbol v} =
\begin{pmatrix}
\frac{{\boldsymbol c}_x}{a}\\
\frac{{\boldsymbol c}_y}{b}\\
\frac{{\boldsymbol c}_w}{e}\\
\frac{{\boldsymbol c}_z}{d}-\frac{c{\boldsymbol c}_w}{de}
\end{pmatrix}
\end{equation*}
}}}
Encoding the clipspace to viewspace transformation in a matrix yields the inverse projection matrix:
{{{
#!latex-math-hook
\begin{equation*}
{\boldsymbol P}^{-1} = \begin{pmatrix}
\frac{1}{a} & 0 & 0 & 0 \\
0 & \frac{1}{b} & 0 & 0 \\
0 & 0 & 0 & \frac{1}{e} \\
0 & 0 & \frac{1}{d} & -\frac{c}{de}
\end{pmatrix}
\end{equation*}
}}}
Computing the view coordinate from a clip coordinate is now:
{{{
#!latex-math-hook
\begin{equation*}
{\boldsymbol P}^{-1}{\boldsymbol c}={\boldsymbol v}
\end{equation*}
}}}
There's no guarantee that ''w'' will be 1, so we'll want to rescale appropriately:
{{{
#!latex-math-hook
\begin{equation*}
{\boldsymbol v}' = \frac {{\boldsymbol v}}{{\boldsymbol v}_w}
\end{equation*}
}}}
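The inverse projection and the rescale by ''w'' can be sketched as follows. The matrix is built directly from the coefficients ''a'' through ''e'' rather than by a general 4x4 inverse; the helper names and the example coefficients (a 90 degree field of view with near = 1 and far = 3) are assumptions for illustration.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-element column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def inverse_projection(a, b, c, d, e):
    """Closed-form inverse of the projection matrix with coefficients a..e."""
    return [
        [1.0 / a, 0.0, 0.0, 0.0],
        [0.0, 1.0 / b, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1.0 / e],
        [0.0, 0.0, 1.0 / d, -c / (d * e)],
    ]

def clip_to_view(clip, inv_p):
    """Apply the inverse projection, then rescale so that w = 1."""
    v = mat_vec(inv_p, clip)
    return [v[0] / v[3], v[1] / v[3], v[2] / v[3]]

# Hypothetical coefficients: 90 degree FOV, aspect 1, near = 1, far = 3.
INV_P = inverse_projection(1.0, 1.0, -2.0, -3.0, -1.0)

near_center = clip_to_view([0.0, 0.0, -1.0, 1.0], INV_P)   # about (0, 0, -1)
far_corner = clip_to_view([0.5, -0.5, 1.0, 1.0], INV_P)    # about (1.5, -1.5, -3)
```

Note that the far plane point comes back with ''w'' = 1/3 before the rescale, which is exactly the case the final divide is there to handle.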
=== Viewspace to Modelspace ===
Finally we just need to go from view coordinates to world coordinates by multiplying the view coordinates against the inverse of the modelview matrix. Again we can avoid doing a true inverse if we just logically break down what the modelview transform accomplishes when working with the camera: it is a translation (centering the universe around the camera) and then a rotation (to reflect the camera's orientation). The inverse of this is the reversed rotation (accomplished with a transpose) followed by a translation with the negation of the modelview matrix's translation component after it has been rotated by the inverse rotation.
Given our initial modelview matrix '''M''', consisting of a 3x3 rotation submatrix '''R''' and a 3-element translation vector '''t''':
{{{
#!latex-math-hook
\begin{equation*}
{\boldsymbol M} =
\begin{pmatrix}
{\boldsymbol R}_{11} & {\boldsymbol R}_{12} & {\boldsymbol R}_{13} & {\boldsymbol t}_x\\
{\boldsymbol R}_{21} & {\boldsymbol R}_{22} & {\boldsymbol R}_{23} & {\boldsymbol t}_y\\
{\boldsymbol R}_{31} & {\boldsymbol R}_{32} & {\boldsymbol R}_{33} & {\boldsymbol t}_z\\
0 & 0 & 0 & 1
\end{pmatrix}
\end{equation*}
}}}
Then we can construct the inverse modelview using the transpose of the rotation submatrix and the camera's translation vector:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol R}^T{\boldsymbol t}& = &{\boldsymbol t'}\\
{\boldsymbol M}^{-1} &=&
\begin{pmatrix}
{\boldsymbol R}^T_{11} & {\boldsymbol R}^T_{12} & {\boldsymbol R}^T_{13} & -{\boldsymbol t'}_x\\
{\boldsymbol R}^T_{21} & {\boldsymbol R}^T_{22} & {\boldsymbol R}^T_{23} & -{\boldsymbol t'}_y\\
{\boldsymbol R}^T_{31} & {\boldsymbol R}^T_{32} & {\boldsymbol R}^T_{33} & -{\boldsymbol t'}_z\\
0 & 0 & 0 & 1
\end{pmatrix}
\end{eqnarray*}
}}}
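If you're specifying the modelview matrix directly, for example by calling {{{glLoadMatrix}}}, then you already have it lying around and you can build the inverse as described earlier. If, on the other hand, the modelview matrix is built dynamically using something like {{{gluLookAt}}} or a sequence of {{{glTranslate}}}, {{{glRotate}}}, and {{{glScale}}} calls, you can use {{{glGetFloatv}}} with {{{GL_MODELVIEW_MATRIX}}} to retrieve the current modelview matrix. Keep in mind that the transpose shortcut assumes the matrix holds only rotation and translation; a {{{glScale}}} call breaks that assumption and calls for a general matrix inverse instead.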
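A minimal sketch of the transpose-based inverse, assuming the modelview matrix contains only rotation and translation and is stored as row-major nested lists like the equations above. A matrix fetched with {{{glGetFloatv}}} arrives in OpenGL's column-major order, so it would need to be rearranged first. The helper names and the example matrix are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two 4x4 row-major matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]

def invert_rigid(m):
    """Invert a 4x4 matrix assumed to hold only rotation and translation."""
    # Transposed rotation submatrix: rt[r][c] = R[c][r].
    rt = [[m[c][r] for c in range(3)] for r in range(3)]
    t = [m[0][3], m[1][3], m[2][3]]
    # t' = R^T t, the translation rotated by the inverse rotation.
    tp = [sum(rt[r][k] * t[k] for k in range(3)) for r in range(3)]
    return [rt[r] + [-tp[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

# Hypothetical modelview: a 90 degree rotation about Y plus a translation.
M = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
    [-1.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 1.0],
]
M_inv = invert_rigid(M)

# The last column of the inverse is the camera's position in world space.
camera_position = [M_inv[0][3], M_inv[1][3], M_inv[2][3]]
```

Multiplying {{{M}}} by {{{M_inv}}} should give the identity matrix, which is a cheap check to run whenever picking results look wrong.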
Now that we have the inverse modelview matrix we can use it to transform our view coordinate into world space:
{{{
#!latex-math-hook
\begin{eqnarray*}
{\boldsymbol M}^{-1}{\boldsymbol v}&=&{\boldsymbol w}\\
\text{where}\\
{\boldsymbol M}^{-1}&=&\text{inverse of the modelview matrix}\\
{\boldsymbol v}&=&\text{point in viewspace}\\
{\boldsymbol w}&=&\text{point in worldspace}
\end{eqnarray*}
}}}
If the depth value under the mouse was used to construct the original viewport coordinate, then '''w''' should correspond to the point in 3-space where the user clicked. If the depth value was not read and 0 was substituted instead, then '''w''' is a point on the near clip plane, with which we can construct a ray from the viewer's position:
{{{
#!latex-math-hook
\begin{eqnarray*}
\label{eqn:ray}
\overrightarrow{{\boldsymbol r}}&=&{\boldsymbol a} + t({\boldsymbol w}-{\boldsymbol a})\\
\text{where}\\
\overrightarrow{{\boldsymbol r}}&=&\text{ray}\\
{\boldsymbol a}&=&\text{viewer's position in worldspace}\\
{\boldsymbol w}&=&\text{point in worldspace}\\
t&=&\text{scalar ray parameter},\ t \geq 0\\
\end{eqnarray*}
}}}
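Putting the whole inverse pipeline together, a sketch of the unprojection and the ray construction might look like this. It assumes a top-left window origin, row-major matrices, and inverse matrices that have already been built as described above. All function names and the example camera (positioned at (0, 0, 5), looking down -Z, near = 1, far = 3, 800x600 window) are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-element column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def unproject(mouse_x, mouse_y, depth, width, height, inv_projection, inv_modelview):
    """Window space -> world space (assumes a top-left window origin)."""
    ndc = [
        2.0 * mouse_x / width - 1.0,
        1.0 - 2.0 * mouse_y / height,
        2.0 * depth - 1.0,
        1.0,
    ]
    view = mat_vec(inv_projection, ndc)
    view = [component / view[3] for component in view]  # rescale so w = 1
    return mat_vec(inv_modelview, view)[:3]

def pick_ray(mouse_x, mouse_y, width, height, inv_projection, inv_modelview):
    """Return (a, w - a) for the ray r = a + t (w - a)."""
    eye = mat_vec(inv_modelview, [0.0, 0.0, 0.0, 1.0])[:3]  # viewer's position
    # Depth 0 puts the target point on the near clip plane.
    target = unproject(mouse_x, mouse_y, 0.0, width, height,
                       inv_projection, inv_modelview)
    return eye, [target[i] - eye[i] for i in range(3)]

def point_on_ray(origin, direction, t):
    """Evaluate the ray at parameter t."""
    return [origin[i] + t * direction[i] for i in range(3)]

# Hypothetical inverse matrices: a = 0.75, b = 1, c = -2, d = -3, e = -1,
# and a camera translated to (0, 0, 5) with no rotation.
INV_P = [
    [1.0 / 0.75, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, -1.0],
    [0.0, 0.0, -1.0 / 3.0, 2.0 / 3.0],
]
INV_M = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 5.0],
    [0.0, 0.0, 0.0, 1.0],
]

origin, direction = pick_ray(400, 300, 800, 600, INV_P, INV_M)
far_point = unproject(400, 300, 1.0, 800, 600, INV_P, INV_M)
```

Clicking the centre of the window here gives a ray that starts at the camera and points straight down -Z; evaluating it at ''t'' = 3 reaches the same world space point that a depth of 1.0 unprojects to.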