AR + Lightfield Camera + Facial Keypoints with Neural Network
Author: Liheng Lai, Seong Hyun Park



Facial Keypoints with Neural Network
Overview
In this project, we will explore ways to detect facial keypoints using neural networks.

Part 1: Nose Tip Detection
The second approach I used results in a higher loss, and its predictions are worse than those of the first approach. As the example outputs show, the predicted points align much more closely with the ground truth in approach 1 than in approach 2.
Regardless of the approach, there are still some failure cases. I suspect that the varying orientations of the faces make the nose tip harder to detect, although even some forward-facing faces are not detected reliably.

Part 2: Full Facial Keypoints Detection

The `FaceCNN` model is a CNN for facial keypoint detection that accepts single-channel grayscale images as input. It has six convolutional layers with increasing channel depths, starting at 16 and reaching 64. Each convolutional layer uses a kernel size of 3, with padding values of 3, 2, or 1 to control the output dimensions, and is followed by a max pooling layer with kernel size and stride of 2 that halves the spatial dimensions. The sixth convolutional layer is optional and adds extra capacity. After the convolutional layers, the output is flattened to a 1D tensor and passed through three fully connected layers that reduce the dimension from 256 to 200, 150, and finally 116, which is reshaped into a `(58, 2)` tensor. ReLU activations follow every convolutional and fully connected layer except the output. For training, I used a learning rate of 1e-3 and a batch size of 4.
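A minimal PyTorch sketch of this architecture is shown below. The exact channel progression and per-layer padding values are not fully specified above, so the values in the configuration list are assumptions; only the overall structure (six 3x3 convolutions with pooling, then 256 → 200 → 150 → 116 fully connected layers reshaped to 58 keypoints) follows the description.

```python
import torch
import torch.nn as nn

class FaceCNN(nn.Module):
    """Sketch of the six-conv-layer keypoint network described above.

    The channel progression and per-layer padding (3, 2, or 1) below are
    illustrative assumptions; the overall structure follows the text:
    six 3x3 convolutions (16 -> 64 channels), each followed by ReLU and
    2x2 max pooling, then fully connected layers 256 -> 200 -> 150 -> 116,
    reshaped to (58, 2) keypoints.
    """
    def __init__(self):
        super().__init__()
        # (out_channels, padding) per conv layer -- assumed values.
        cfg = [(16, 3), (32, 2), (32, 1), (64, 1), (64, 1), (64, 1)]
        layers, in_ch = [], 1  # single-channel grayscale input
        for out_ch, pad in cfg:
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=pad),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=2, stride=2)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.fc = nn.Sequential(
            nn.Linear(256, 200), nn.ReLU(),   # assumes a flattened size of 256
            nn.Linear(200, 150), nn.ReLU(),
            nn.Linear(150, 116),              # no activation on the output
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, start_dim=1)
        x = self.fc(x)
        return x.view(-1, 58, 2)              # 58 keypoints, (x, y) each
```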

I believe the incorrect predictions occur because of the significant variation in face orientations and the additional obstructions in some images. When the model becomes uncertain, it appears to fall back on predicting an average face as a way to minimize loss.

The learned filters show general patterns of light and dark, suggesting that they respond to patches or edges in the input image. For instance, filter 4 specifically highlights prominent features at corners.

Part 3: Train With Larger Dataset
My architecture is a Wide ResNet-50, which is essentially a ResNet-50 with all the internal 3x3 convolution layers doubled in width (i.e., twice the number of filters). The final linear layer is modified to output 2 × 68 values, with its bias initialized to 0.5 as mentioned earlier, and the model keeps the average pooling layer at the end.
I trained this model with a batch size of 64 and a learning rate of 3e-4 (adjusted from previous runs) for a total of 30 epochs, using MSE loss.
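A sketch of how this setup might be built on top of torchvision is shown below. The choice of Adam as the optimizer and whether pretrained weights are used are assumptions not stated above.

```python
import torch
import torch.nn as nn
import torchvision

# Sketch: adapt torchvision's Wide ResNet-50-2 for 68-point regression.
model = torchvision.models.wide_resnet50_2(weights=None)  # pretrained or not: an assumption
model.fc = nn.Linear(model.fc.in_features, 2 * 68)        # 68 (x, y) pairs
nn.init.constant_(model.fc.bias, 0.5)                     # the 0.5 bias initialization

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4) # optimizer choice is an assumption
criterion = nn.MSELoss()

# Illustrative training loop (data loading and normalization omitted):
# for epoch in range(30):
#     for images, keypoints in train_loader:    # keypoints shaped (B, 68, 2)
#         preds = model(images).view(-1, 68, 2)
#         loss = criterion(preds, keypoints)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
```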
Here are some images from the test set:
Here are some images that we chose to test:
The model generally performs well when the face is fully visible and facing directly toward the camera (second image). When the face is tilted or partially occluded (first and third images), the model does not perform as well.
Part 4: Pixelwise Classification
The heatmaps generated by the function are based on a Gaussian distribution, with the mean centered at the specified keypoint coordinates \((x_0, y_0)\). The spread of the Gaussian is determined by the standard deviation (\(\sigma\)), which defaults to 8.0, resulting in a variance of 64. Each heatmap is computed using the formula \(\text{heatmap}(x, y) = \exp\left(-\frac{(x - x_0)^2 + (y - y_0)^2}{2\sigma^2}\right)\), ensuring the highest intensity at the keypoint and a smooth decrease with increasing distance. A mesh grid of size \((\text{height}, \text{width})\) is used to calculate the squared Euclidean distance for all grid points relative to the keypoints. The function normalizes each heatmap so that the maximum value is 1 and sets heatmaps corresponding to out-of-bounds keypoints to zero.
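A sketch of this heatmap generator is shown below; the function name is illustrative, but the formula, the default \(\sigma = 8\), the peak normalization, and the zeroing of out-of-bounds keypoints follow the description above.

```python
import numpy as np

def keypoints_to_heatmaps(keypoints, height, width, sigma=8.0):
    """Render one Gaussian heatmap per keypoint.

    keypoints: array of shape (K, 2) holding (x0, y0) pixel coordinates.
    Returns an array of shape (K, height, width) with peak value 1 at each
    keypoint; out-of-bounds keypoints produce an all-zero heatmap.
    """
    ys, xs = np.mgrid[0:height, 0:width]          # mesh grid of pixel coordinates
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, (x0, y0) in enumerate(keypoints):
        if not (0 <= x0 < width and 0 <= y0 < height):
            continue                              # leave out-of-bounds maps at zero
        d2 = (xs - x0) ** 2 + (ys - y0) ** 2      # squared Euclidean distance
        hm = np.exp(-d2 / (2 * sigma ** 2))
        heatmaps[k] = hm / hm.max()               # normalize so the peak is exactly 1
    return heatmaps
```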

This U-Net model consists of four levels, each containing two convolutional layers with batch normalization and a ReLU activation function, along with a max-pooling layer in the encoder. In the decoder, the max-pooling layers are replaced with up-convolutional layers. The convolutional filter counts for the four blocks are 32, 64, 128, and 256, respectively, with the bottleneck layer using 512 filters. Skip connections link each encoding layer to its corresponding decoding layer. The input is a 3-channel image, and the output is a 68-channel probability heatmap of the same spatial size as the input.
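A compact PyTorch sketch of this U-Net is given below. The kernel sizes, padding, and 1x1 output head are assumptions consistent with, but not specified by, the description above.

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    # Two 3x3 convolutions, each followed by batch norm and ReLU.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    """Sketch of the heatmap U-Net described above: 3-channel input,
    68-channel output, encoder widths 32/64/128/256, 512-wide bottleneck."""
    def __init__(self, in_ch=3, out_ch=68, widths=(32, 64, 128, 256)):
        super().__init__()
        self.pool = nn.MaxPool2d(2)
        self.encoders = nn.ModuleList()
        ch = in_ch
        for w in widths:
            self.encoders.append(double_conv(ch, w))
            ch = w
        self.bottleneck = double_conv(widths[-1], 512)
        self.upconvs, self.decoders = nn.ModuleList(), nn.ModuleList()
        ch = 512
        for w in reversed(widths):
            self.upconvs.append(nn.ConvTranspose2d(ch, w, kernel_size=2, stride=2))
            self.decoders.append(double_conv(2 * w, w))   # 2*w channels after skip concat
            ch = w
        self.head = nn.Conv2d(widths[0], out_ch, kernel_size=1)

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)          # save for the skip connection
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.upconvs, self.decoders, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([x, skip], dim=1))
        return self.head(x)          # (B, 68, H, W) heatmaps, same size as the input
```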
Here are some images from the test set:
Here are some images we picked for additional testing:

The model doesn’t really work when the face is not fully present, as shown in the first image. It works better when the face is turned more directly toward the camera, with less shading and contrast, as demonstrated by figures 2 and 3.

Bells & Whistles
For each keypoint, instead of using a 2D Gaussian, I applied a binary mask with a radius of 8. Below is a visual representation of the masks for the first 18 keypoints.

This new approach seems to work better than the original 2D Gaussian approach, achieving a lower validation loss (0.0015, compared to 0.0026 for the Gaussian targets). We can also see the difference around the mouth area, where the model trained with this approach performs noticeably better.
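A sketch of how these binary-mask targets might be generated, mirroring the Gaussian heatmap function above, is:

```python
import numpy as np

def keypoints_to_binary_masks(keypoints, height, width, radius=8):
    """Binary-mask targets: a disc of ones with the given radius around each
    keypoint instead of a 2D Gaussian; out-of-bounds keypoints stay all-zero."""
    ys, xs = np.mgrid[0:height, 0:width]
    masks = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for k, (x0, y0) in enumerate(keypoints):
        if not (0 <= x0 < width and 0 <= y0 < height):
            continue
        d2 = (xs - x0) ** 2 + (ys - y0) ** 2
        masks[k] = (d2 <= radius ** 2).astype(np.float32)   # 1 inside the disc, 0 outside
    return masks
```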



Poor Man's Augmented Reality
Overview
In this AR project, I will explore ways to capture 2D points in a video, associate them with known 3D coordinates, and project an object on top of them.

Keypoints with known 3D world coordinates
To find a sequence of 2D points in the video, we first need keypoints with defined coordinates, and there are various ways of obtaining them. I defined 20 keypoints on the box, marked their 2D coordinates manually, and set point 0 as the origin of the 3D coordinate system. I then measured the length, width, and height of the box and converted each keypoint into 3D coordinates.

Propagating Keypoints to other Images in the Video
There are several ways to propagate points from the first frame to subsequent frames. I chose to use the Harris corner detector, since I had drawn a checker pattern on the box and expected its corners to be easy to detect. For each frame, even without using the previous frame's keypoints yet, the detector is able to find corners.

Next, as you may have noticed, many of the detected points are unrelated to the box, so I keep only the points with high Harris response values and discard the rest. After this step, the result looks much better, but it still does not use the previous frame's coordinates.

The corners are now captured much better, and they are something we can use. The next step was to compute a cost as the Euclidean distance between each previous keypoint and the current candidate corners, and to discard any match whose distance exceeded a threshold. After this step, the tracker properly uses the previous frame's coordinates and propagates the correct keypoints to the next frame. Points 5 and 14 were lost along the way, but the other coordinates were recovered properly. I also colored the origin point blue.
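A sketch of this per-frame propagation step is below. It uses OpenCV's Harris detector; the response and distance thresholds and the detector parameters are illustrative, not values from the project.

```python
import cv2
import numpy as np

def propagate_keypoints(frame_gray, prev_pts, response_thresh=0.01, dist_thresh=30.0):
    """Propagate keypoints from the previous frame to the current one.

    prev_pts is an (N, 2) array of (x, y) keypoints from the previous frame;
    points that cannot be matched within dist_thresh pixels come back as NaN.
    """
    # 1. Harris response for the current frame; keep only strong corners.
    R = cv2.cornerHarris(np.float32(frame_gray), blockSize=2, ksize=3, k=0.04)
    ys, xs = np.where(R > response_thresh * R.max())
    corners = np.stack([xs, ys], axis=1).astype(np.float32)

    # 2. Match each previous keypoint to its nearest detected corner,
    #    discarding matches that moved farther than dist_thresh pixels.
    new_pts = np.full_like(prev_pts, np.nan, dtype=np.float32)
    if len(corners) == 0:
        return new_pts
    for i, p in enumerate(prev_pts):
        d = np.linalg.norm(corners - p, axis=1)   # Euclidean distance cost
        j = np.argmin(d)
        if d[j] < dist_thresh:
            new_pts[i] = corners[j]
    return new_pts
```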


Calibrating the Camera
Now that we have both the tracked 2D coordinates and the 3D coordinates, we can map the 3D points, written in homogeneous coordinates, to 2D image points. Using the matrix below, we set up a system of equations and solve it with least squares to obtain the projection matrix that calibrates the camera.
\[ \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \sim \begin{bmatrix} m_{11} & m_{12} & m_{13} & m_{14} \\ m_{21} & m_{22} & m_{23} & m_{24} \\ m_{31} & m_{32} & m_{33} & m_{34} \end{bmatrix} \cdot \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} \]
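One way to set up and solve this least-squares problem is sketched below, fixing \(m_{34} = 1\) as a normalization (the exact normalization used in the project is not stated).

```python
import numpy as np

def compute_projection_matrix(pts3d, pts2d):
    """Least-squares fit of the 3x4 projection matrix from 3D-2D
    correspondences, with m34 fixed to 1. Each correspondence contributes
    two rows, derived from u = (m1.X)/(m3.X) and v = (m2.X)/(m3.X)."""
    A, b = [], []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z])
        b.append(u)
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z])
        b.append(v)
    m, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(m, 1.0).reshape(3, 4)   # append m34 = 1 and reshape
```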
Projecting a cube in the Scene
We now have 3D coordinates, 2D keypoints for every frame, and a projection matrix. With these, I projected a cube on top of the box. As you can see, points 5 and 14 are missing because their distance values exceeded the threshold, but I was still able to project the cube on top of the box.
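A sketch of the cube projection and drawing step is below; the cube's origin and edge length are placeholders for the actual values used in the project.

```python
import cv2
import numpy as np

def draw_cube(frame, P, origin=(0.0, 0.0, 0.0), size=1.0):
    """Project a cube of edge length `size` anchored at `origin` into the
    frame using the 3x4 projection matrix P, and draw its 12 edges."""
    ox, oy, oz = origin
    s = size
    corners = np.array([
        [ox, oy, oz], [ox + s, oy, oz], [ox + s, oy + s, oz], [ox, oy + s, oz],           # bottom face
        [ox, oy, oz + s], [ox + s, oy, oz + s], [ox + s, oy + s, oz + s], [ox, oy + s, oz + s],  # top face
    ])
    homo = np.hstack([corners, np.ones((8, 1))])        # homogeneous 3D points
    proj = (P @ homo.T).T
    pts = proj[:, :2] / proj[:, 2:3]                    # divide by w to get pixel coordinates
    edges = [(0, 1), (1, 2), (2, 3), (3, 0),            # bottom face
             (4, 5), (5, 6), (6, 7), (7, 4),            # top face
             (0, 4), (1, 5), (2, 6), (3, 7)]            # vertical edges
    for i, j in edges:
        cv2.line(frame, tuple(map(int, pts[i])), tuple(map(int, pts[j])), (0, 255, 0), 2)
    return frame
```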

Lightfield Camera
Overview
In this project, we will explore ways to achieve complex effects, such as refocusing and aperture adjustment, using simple shifting and averaging over real light field data.

Depth Refocusing
We first obtain the data from the Stanford Light Field Archive. The images in a set are taken from different positions at slightly different angles; I used a set that is rectified and cropped. Capturing multiple images over a grid perpendicular to the optical axis is what enables refocusing and aperture adjustment. We will show aperture adjustment later.
For each image, we compute a shift vector based on the desired depth of focus, apply the shift, and average the results to bring the target depth into focus.
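A sketch of this shift-and-average step is shown below. The use of `scipy.ndimage.shift`, the sign convention, and the assumption that the images are RGB arrays on a 17x17 camera grid are illustrative choices, not details from the write-up.

```python
import numpy as np
import scipy.ndimage as ndi

def refocus(images, grid_coords, depth, center=(8, 8)):
    """Shift-and-average refocusing.

    images[i] is a sub-aperture RGB image taken at integer grid position
    grid_coords[i] = (row, col); `depth` scales how far each image is shifted
    toward the grid center before averaging. The sign convention may need
    flipping for a particular dataset.
    """
    acc = np.zeros_like(images[0], dtype=np.float64)
    for img, (r, c) in zip(images, grid_coords):
        dy = depth * (r - center[0])
        dx = depth * (c - center[1])
        # Shift only the two spatial axes; color channels stay in place.
        acc += ndi.shift(img.astype(np.float64), shift=(dy, dx, 0), order=1)
    return (acc / len(images)).astype(np.uint8)
```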

Aperture Adjustment
This adjustment can be done by varying which light field images are used for averaging. The implementation is relatively simple compared to the previous section: we keep only the images whose grid positions fall within a given radius, using (8, 8) as the center of the camera grid. As you can see, as the aperture radius increases, the background blurs, focusing on a specific part of the image.
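A minimal sketch of this selection-and-average step is below; in practice it can be combined with the per-image shifts from the refocusing section so that a chosen depth stays in focus while the aperture varies.

```python
import numpy as np

def adjust_aperture(images, grid_coords, radius, center=(8, 8)):
    """Average only the sub-aperture images whose grid position lies within
    `radius` of the grid center (8, 8); a larger radius mimics a larger
    aperture and therefore a shallower depth of field."""
    selected = [img.astype(np.float64)
                for img, (r, c) in zip(images, grid_coords)
                if (r - center[0]) ** 2 + (c - center[1]) ** 2 <= radius ** 2]
    return (sum(selected) / len(selected)).astype(np.uint8)
```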

What I learned
From this project, I was able to learn about lightfields and their applications in computational photography. It was interesting to see how simply shifting and averaging images can create different depths of focus and adjust the aperture.

Bells and Whistles
For the Lightfield Camera bells & whistles, I implemented interactive refocusing. When you run the code, it prompts you to pick a focus point. It then calculates the appropriate depth level and refocuses the image accordingly. To obtain the correct depth, I take the selected point's coordinates and convert them into the corresponding depth level, then reuse the refocusing implementation from the previous section.
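Below is one illustrative way such an interaction could be wired up: the click is captured with `matplotlib`, and the depth is chosen by searching for the refocused image that is sharpest around the clicked point. The write-up converts the clicked coordinates to a depth level directly, so the sharpness search here is an assumption, as are the depth range and window size.

```python
import numpy as np
import matplotlib.pyplot as plt

def interactive_refocus(images, grid_coords, depths=np.linspace(-0.5, 0.5, 21)):
    """Sketch of interactive refocusing: the user clicks a point, and we pick
    the depth whose refocused image has the highest local variance (i.e., is
    sharpest) in a window around the click. Uses refocus() from the sketch above."""
    base = refocus(images, grid_coords, depth=0.0)
    plt.imshow(base)
    plt.title("Click the point to focus on")
    (x, y), = plt.ginput(1)                 # wait for a single click
    x, y = int(x), int(y)

    best_depth, best_score = None, -np.inf
    for d in depths:
        img = refocus(images, grid_coords, depth=d).astype(np.float64)
        patch = img[max(y - 20, 0):y + 20, max(x - 20, 0):x + 20]
        score = patch.var()                 # sharper patch -> higher variance
        if score > best_score:
            best_depth, best_score = d, score
    return refocus(images, grid_coords, depth=best_depth)
```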

Acknowledgment
I used the Unemployables Portfolio Template for this website.