AR + Lightfield Camera + Facial Keypoints with Neural Network
Author: Liheng Lai, Seong Hyun Park



Facial Keypoints with Neural Network
Overview
In this project, we will explore ways to detect facial keypoints using neural networks.

Part 1: Nose Tip Detection
The second approach I used results in a higher loss, and its predictions are worse than those of the first approach. As the example outputs show, the predicted points align much more closely with the ground truth in approach 1 than in approach 2.
Regardless of the approach, there are still some failure cases. I suspect that the varying orientations of the faces make the nose tip harder to detect, although even some forward-facing faces are not detected reliably.

Part 2: Full Facial Keypoints Detection

The `FaceCNN` model is a CNN for facial keypoint detection that accepts single-channel grayscale images as input. It has six convolutional layers with increasing channel depths, starting at 16 and reaching 64. Each convolutional layer uses a kernel size of 3, with padding values of 3, 2, or 1 to control the output dimensions, and is followed by a max pooling layer with kernel size and stride of 2 that halves the spatial dimensions. The sixth convolutional layer is optional and adds extra capacity. After the convolutional layers, the output is flattened to a 1D tensor and passed through three fully connected layers that reduce the dimension from 256 to 200, 150, and finally 116, which is reshaped into a `(58, 2)` tensor. ReLU activations follow every convolutional and fully connected layer except the output. For training, I used a learning rate of 1e-3 and a batch size of 4.
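A minimal PyTorch sketch of this architecture is shown below. The exact channel progression and per-layer padding values are not fully specified above, so the values in the configuration list are assumptions; only the overall structure (six 3x3 convolutions with pooling, then 256 → 200 → 150 → 116 fully connected layers reshaped to 58 keypoints) follows the description.

```python
import torch
import torch.nn as nn

class FaceCNN(nn.Module):
    """Sketch of the six-conv-layer keypoint network described above.

    The channel progression and per-layer padding (3, 2, or 1) below are
    illustrative assumptions; the overall structure follows the text:
    six 3x3 convolutions (16 -> 64 channels), each followed by ReLU and
    2x2 max pooling, then fully connected layers 256 -> 200 -> 150 -> 116,
    reshaped to (58, 2) keypoints.
    """
    def __init__(self):
        super().__init__()
        # (out_channels, padding) per conv layer -- assumed values.
        cfg = [(16, 3), (32, 2), (32, 1), (64, 1), (64, 1), (64, 1)]
        layers, in_ch = [], 1  # single-channel grayscale input
        for out_ch, pad in cfg:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=pad),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=2, stride=2)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.fc = nn.Sequential(
            nn.Linear(256, 200), nn.ReLU(),   # assumes a flattened size of 256
            nn.Linear(200, 150), nn.ReLU(),
            nn.Linear(150, 116),              # no activation on the output
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        x = self.fc(x)
        return x.view(-1, 58, 2)              # 58 keypoints, (x, y) each
```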

I believe the incorrect predictions occur because of the significant variation in face orientations and the additional obstructions in some images. When the model becomes uncertain, it appears to fall back on predicting an average face as a way to minimize loss.

The learned filters show general patterns of light and dark, suggesting that they respond to patches or edges in the input image. For instance, filter 4 specifically highlights prominent features at corners.

Part 3: Train With Larger Dataset
My architecture is a Wide ResNet-50, which is essentially a ResNet-50 with all the internal 3x3 convolution layers doubled in width (i.e., twice the number of filters). The final linear layer is modified to output 2 × 68 values, with its bias initialized to 0.5 as mentioned earlier, and the model keeps the average pooling layer at the end.
I trained this model with a batch size of 64 and a learning rate of 3e-4 (adjusted from previous runs) for a total of 30 epochs, using MSE loss.
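A sketch of how this setup might be built on top of torchvision is shown below. The choice of Adam as the optimizer and whether pretrained weights are used are assumptions not stated above.

```python
import torch
import torch.nn as nn
import torchvision

# Sketch: adapt torchvision's Wide ResNet-50-2 for 68-point regression.
model = torchvision.models.wide_resnet50_2(weights=None)  # pretrained or not: an assumption
model.fc = nn.Linear(model.fc.in_features, 2 * 68)        # 68 (x, y) pairs
nn.init.constant_(model.fc.bias, 0.5)                     # the 0.5 bias initialization

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4) # optimizer choice is an assumption
criterion = nn.MSELoss()

# Illustrative training loop (data loading and normalization omitted):
# for epoch in range(30):
#     for images, keypoints in train_loader:    # keypoints shaped (B, 68, 2)
#         preds = model(images).view(-1, 68, 2)
#         loss = criterion(preds, keypoints)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```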
Here are some images from the test set:
Here are some images that we chose to test:
The model generally performs well when the face is fully visible and facing directly toward the camera (second image). When the face is tilted or partially occluded (first and third images), the model does not perform as well.
Part 4: Pixelwise Classification
The heatmaps generated by the function are based on a Gaussian distribution, with the mean centered at the specified keypoint coordinates \((x_0, y_0)\). The spread of the Gaussian is determined by the standard deviation (\(\sigma\)), which defaults to 8.0, resulting in a variance of 64. Each heatmap is computed using the formula \(\text{heatmap}(x, y) = \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right)\), ensuring the highest intensity at the keypoint and a smooth decrease with increasing distance. A mesh grid of size \((\text{height}, \text{width})\) is used to calculate the squared Euclidean distance for all grid points relative to the keypoints. The function normalizes each heatmap so that the maximum value is 1 and sets heatmaps corresponding to out-of-bounds keypoints to zero.
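A sketch of this heatmap generator is shown below; the function name is illustrative, but the formula, the default \(\sigma = 8\), the peak normalization, and the zeroing of out-of-bounds keypoints follow the description above.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=8.0):
    """Render one Gaussian heatmap per keypoint.

    keypoints: array of shape (K, 2) holding (x0, y0) pixel coordinates.
    Returns an array of shape (K, height, width) with peak value 1 at each
    keypoint; out-of-bounds keypoints produce an all-zero heatmap.
    """
    ys, xs = np.mgrid[0:height, 0:width]          # mesh grid of pixel coordinates
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, (x0, y0) in enumerate(keypoints):
        if not (0 <= x0 < width and 0 <= y0 < height):
            continue                              # leave out-of-bounds maps at zero
        d2 = (xs - x0) ** 2 + (ys - y0) ** 2      # squared Euclidean distance
        hm = np.exp(-d2 / (2 * sigma ** 2))
        heatmaps[k] = hm / hm.max()               # normalize so the peak is exactly 1
    return heatmaps
```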

This U-Net model consists of four levels, each containing two convolutional layers with batch normalization and a ReLU activation function, along with a max-pooling layer in the encoder. In the decoder, the max-pooling layers are replaced with up-convolutional layers. The convolutional filter counts for the four blocks are 32, 64, 128, and 256, respectively, with the bottleneck layer using 512 filters. Skip connections link each encoding layer to its corresponding decoding layer. The input is a 3-channel image, and the output is a 68-channel probability heatmap of the same spatial size as the input.
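A compact PyTorch sketch of this U-Net is given below. The kernel sizes, padding, and 1x1 output head are assumptions consistent with, but not specified by, the description above.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by batch norm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    """Sketch of the heatmap U-Net described above: 3-channel input,
    68-channel output, encoder widths 32/64/128/256, 512-wide bottleneck."""
    def __init__(self, in_ch=3, out_ch=68, widths=(32, 64, 128, 256)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.encoders = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.encoders.append(double_conv(ch, w))
            ch = w
        self.bottleneck = double_conv(widths[-1], 512)
        self.upconvs, self.decoders = nn.ModuleList(), nn.ModuleList()
        ch = 512
        for w in reversed(widths):
            self.upconvs.append(nn.ConvTranspose2d(ch, w, kernel_size=2, stride=2))
            self.decoders.append(double_conv(2 * w, w))   # 2*w channels after skip concat
            ch = w
        self.head = nn.Conv2d(widths[0], out_ch, kernel_size=1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)          # save for the skip connection
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))
        return self.head(x)          # (B, 68, H, W) heatmaps, same size as the input
```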
Here are some images from the test set:
Here are some images we picked for additional testing:

The model doesn’t really work when the face is not fully present, as shown in the first image. It works better when the face is turned more directly toward the camera, with less shading and contrast, as demonstrated by figures 2 and 3.

Bells & Whistles
For each keypoint, instead of using a 2D Gaussian, I applied a binary mask with a radius of 8. Below is a visual representation of the masks for the first 18 keypoints.

This new approach seems to work better than the original 2D Gaussian approach, achieving a lower validation loss (0.0015, compared to 0.0026 for the Gaussian targets). We can also see the difference around the mouth area, where the model trained with this approach performs noticeably better.
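A sketch of how these binary-mask targets might be generated, mirroring the Gaussian heatmap function above, is:

```python
import numpy as np

def keypoints_to_binary_masks(keypoints, height, width, radius=8):
    """Binary-mask targets: a disc of ones with the given radius around each
    keypoint instead of a 2D Gaussian; out-of-bounds keypoints stay all-zero."""
    ys, xs = np.mgrid[0:height, 0:width]
    masks = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, (x0, y0) in enumerate(keypoints):
        if not (0 <= x0 < width and 0 <= y0 < height):
            continue
        d2 = (xs - x0) ** 2 + (ys - y0) ** 2
        masks[k] = (d2 <= radius ** 2).astype(np.float32)   # 1 inside the disc, 0 outside
    return masks
```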



Poor Man's Augmented Reality
Overview
In this AR project, I will explore ways to capture 2D points in a video, associate them with known 3D coordinates, and project an object on top of them.

Keypoints with known 3D world coordinates
To find a sequence of 2D points in the video, we first need keypoints with defined coordinates, and there are various ways of obtaining them. I defined 20 keypoints on the box, marked their 2D coordinates manually, and set point 0 as the origin of the 3D coordinate system. I then measured the length, width, and height of the box and converted each keypoint into 3D coordinates.

Propagating Keypoints to other Images in the Video
There are several ways to propagate points from the first frame to subsequent frames. I chose to use the Harris corner detector, since I had drawn a checker pattern on the box and expected its corners to be easy to detect. For each frame, even without using the previous frame's keypoints yet, the detector is able to find corners.

Next, as you may have noticed, many of the detected points are unrelated to the box, so I keep only the points with high Harris response values and discard the rest. After this step, the result looks much better, but it still does not use the previous frame's coordinates.

The corners are now captured much better, and they are something we can use. The next step was to compute a cost as the Euclidean distance between each previous keypoint and the current candidate corners, and to discard any match whose distance exceeded a threshold. After this step, the tracker properly uses the previous frame's coordinates and propagates the correct keypoints to the next frame. Points 5 and 14 were lost along the way, but the other coordinates were recovered properly. I also colored the origin point blue.
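A sketch of this per-frame propagation step is below. It uses OpenCV's Harris detector; the response and distance thresholds and the detector parameters are illustrative, not values from the project.

```python
import cv2
import numpy as np

def propagate_keypoints(frame_gray, prev_pts, response_thresh=0.01, dist_thresh=30.0):
    """Propagate keypoints from the previous frame to the current one.

    prev_pts is an (N, 2) array of (x, y) keypoints from the previous frame;
    points that cannot be matched within dist_thresh pixels come back as NaN.
    """
    # 1. Harris response for the current frame; keep only strong corners.
    R = cv2.cornerHarris(np.float32(frame_gray), blockSize=2, ksize=3, k=0.04)
    ys, xs = np.where(R > response_thresh * R.max())
    corners = np.stack([xs, ys], axis=1).astype(np.float32)

    # 2. Match each previous keypoint to its nearest detected corner,
    #    discarding matches that moved farther than dist_thresh pixels.
    new_pts = np.full_like(prev_pts, np.nan, dtype=np.float32)
    if len(corners) == 0:
        return new_pts
    for i, p in enumerate(prev_pts):
        d = np.linalg.norm(corners - p, axis=1)   # Euclidean distance cost
        j = np.argmin(d)
        if d[j] < dist_thresh:
            new_pts[i] = corners[j]
    return new_pts
```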


Calibrating the Camera
Now that we have both the tracked 2D coordinates and the 3D coordinates, we can map the 3D points, written in homogeneous coordinates, to 2D image points. Using the matrix below, we set up a system of equations and solve it with least squares to obtain the projection matrix that calibrates the camera.
\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix} \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]
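One way to set up and solve this least-squares problem is sketched below, fixing \(m_{34} = 1\) as a normalization (the exact normalization used in the project is not stated).

```python
import numpy as np

def compute_projection_matrix(pts3d, pts2d):
    """Least-squares fit of the 3x4 projection matrix from 3D-2D
    correspondences, with m34 fixed to 1. Each correspondence contributes
    two rows, derived from u = (m1.X)/(m3.X) and v = (m2.X)/(m3.X)."""
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        b.append(u)
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        b.append(v)
    m, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(m, 1.0).reshape(3, 4)   # append m34 = 1 and reshape
```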
Projecting a cube in the Scene
We now have 3D coordinates, 2D keypoints for every frame, and a projection matrix. With these, I projected a cube on top of the box. As you can see, points 5 and 14 are missing because their distance values exceeded the threshold, but I was still able to project the cube on top of the box.
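A sketch of the cube projection and drawing step is below; the cube's origin and edge length are placeholders for the actual values used in the project.

```python
import cv2
import numpy as np

def draw_cube(frame, P, origin=(0.0, 0.0, 0.0), size=1.0):
    """Project a cube of edge length `size` anchored at `origin` into the
    frame using the 3x4 projection matrix P, and draw its 12 edges."""
    ox, oy, oz = origin
    s = size
    corners = np.array([
        [ox, oy, oz], [ox + s, oy, oz], [ox + s, oy + s, oz], [ox, oy + s, oz],           # bottom face
        [ox, oy, oz + s], [ox + s, oy, oz + s], [ox + s, oy + s, oz + s], [ox, oy + s, oz + s],  # top face
    ])
    homo = np.hstack([corners, np.ones((8, 1))])        # homogeneous 3D points
    proj = (P @ homo.T).T
    pts = proj[:, :2] / proj[:, 2:3]                    # divide by w to get pixel coordinates
    edges = [(0, 1), (1, 2), (2, 3), (3, 0),            # bottom face
             (4, 5), (5, 6), (6, 7), (7, 4),            # top face
             (0, 4), (1, 5), (2, 6), (3, 7)]            # vertical edges
    for i, j in edges:
        cv2.line(frame, tuple(map(int, pts[i])), tuple(map(int, pts[j])), (0, 255, 0), 2)
    return frame
```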

Lightfield Camera
Overview
In this project, we will explore ways to achieve complex effects, such as refocusing and aperture adjustment, using simple shifting and averaging over real light field data.

Depth Refocusing
We first obtain the data from the Stanford Light Field Archive. The images in a set are taken from different positions at slightly different angles; I used a set that is rectified and cropped. Capturing multiple images over a grid perpendicular to the optical axis is what enables refocusing and aperture adjustment. We will show aperture adjustment later.
For each image, we compute a shift vector based on the desired depth of focus, apply the shift, and average the results to bring the target depth into focus.
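A sketch of this shift-and-average step is shown below. The use of `scipy.ndimage.shift`, the sign convention, and the assumption that the images are RGB arrays on a 17x17 camera grid are illustrative choices, not details from the write-up.

```python
import numpy as np
import scipy.ndimage as ndi

def refocus(images, grid_coords, depth, center=(8, 8)):
    """Shift-and-average refocusing.

    images[i] is a sub-aperture RGB image taken at integer grid position
    grid_coords[i] = (row, col); `depth` scales how far each image is shifted
    toward the grid center before averaging. The sign convention may need
    flipping for a particular dataset.
    """
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img, (r, c) in zip(images, grid_coords):
        dy = depth * (r - center[0])
        dx = depth * (c - center[1])
        # Shift only the two spatial axes; color channels stay in place.
        acc += ndi.shift(img.astype(np.float64), shift=(dy, dx, 0), order=1)
    return (acc / len(images)).astype(np.uint8)
```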

Aperture Adjustment
This adjustment can be done by varying which light field images are used for averaging. The implementation is relatively simple compared to the previous section: we keep only the images whose grid positions fall within a given radius, using (8, 8) as the center of the camera grid. As you can see, as the aperture radius increases, the background blurs, focusing on a specific part of the image.
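A minimal sketch of this selection-and-average step is below; in practice it can be combined with the per-image shifts from the refocusing section so that a chosen depth stays in focus while the aperture varies.

```python
import numpy as np

def adjust_aperture(images, grid_coords, radius, center=(8, 8)):
    """Average only the sub-aperture images whose grid position lies within
    `radius` of the grid center (8, 8); a larger radius mimics a larger
    aperture and therefore a shallower depth of field."""
    selected = [img.astype(np.float64)
                for img, (r, c) in zip(images, grid_coords)
                if (r - center[0]) ** 2 + (c - center[1]) ** 2 <= radius ** 2]
    return (sum(selected) / len(selected)).astype(np.uint8)
```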

What I learned
From this project, I was able to learn about lightfields and their applications in computational photography. It was interesting to see how simply shifting and averaging images can create different depths of focus and adjust the aperture.

Bells and Whistles
For the Lightfield Camera bells & whistles, I implemented interactive refocusing. When you run the code, it prompts you to pick a focus point. It then calculates the appropriate depth level and refocuses the image accordingly. To obtain the correct depth, I take the selected point's coordinates and convert them into the corresponding depth level, then reuse the refocusing implementation from the previous section.
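Below is one illustrative way such an interaction could be wired up: the click is captured with `matplotlib`, and the depth is chosen by searching for the refocused image that is sharpest around the clicked point. The write-up converts the clicked coordinates to a depth level directly, so the sharpness search here is an assumption, as are the depth range and window size.

```python
import numpy as np
import matplotlib.pyplot as plt

def interactive_refocus(images, grid_coords, depths=np.linspace(-0.5, 0.5, 21)):
    """Sketch of interactive refocusing: the user clicks a point, and we pick
    the depth whose refocused image has the highest local variance (i.e., is
    sharpest) in a window around the click. Uses refocus() from the sketch above."""
    base = refocus(images, grid_coords, depth=0.0)
    plt.imshow(base)
    plt.title("Click the point to focus on")
    (x, y), = plt.ginput(1)                 # wait for a single click
    x, y = int(x), int(y)

    best_depth, best_score = None, -np.inf
    for d in depths:
        img = refocus(images, grid_coords, depth=d).astype(np.float64)
        patch = img[max(y - 20, 0):y + 20, max(x - 20, 0):x + 20]
        score = patch.var()                 # sharper patch -> higher variance
        if score > best_score:
            best_depth, best_score = d, score
    return refocus(images, grid_coords, depth=best_depth)
```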

Acknowledgment
I used the Unemployables Portfolio Template for this website.