Fiducial Focus Augmentation for Facial Landmark Detection: Summarised

By Vishal Chudasama, Senior Engineer at Sony Research India

6 December 2023

In this blog, Vishal Chudasama summarizes the paper titled ‘Fiducial Focus Augmentation for Facial Landmark Detection’, co-authored by Purbayan Kar, Naoyuki Onoe, Pankaj Wasnik and Professor Vineeth Balasubramanian of IIT Hyderabad. The paper was accepted at the British Machine Vision Conference (BMVC), held in Aberdeen, Scotland, from 20 to 24 November 2023.


Facial Landmark Detection (FLD) is a classic problem in computer vision that has been studied extensively. While deep learning methods have significantly improved performance on the FLD task, detecting landmarks in challenging settings, such as head pose changes, exaggerated expressions or uneven illumination, remains difficult because of high variability and insufficient training samples. This shortfall can be attributed to the model’s inability to acquire appropriate facial structure information from the input images.


To address this issue, we propose Fiducial Focus Augmentation (FiFA), a novel image augmentation technique designed specifically for the FLD task to enhance the model’s understanding of facial structure. To learn facial structure effectively, we leverage the ground-truth landmark coordinates as an inductive bias. Concretely, we place n×n black patches around the landmark locations in the training images, gradually shrink them over successive epochs and then remove them completely for the rest of training, as illustrated in Figure 1. Since the patches cover key semantic regions of the face (e.g., eyes, nose, lips and jawline), a model that learns to predict the locations of these patches learns the overall facial structure significantly better than an architecture without this inductive bias. The technique can be viewed as a form of Curriculum Learning (CL) [1], a strategy that trains a machine learning model from easier to harder data, mimicking the meaningful order found in human-designed curricula.
Figure 1: In row (a), 5×5 black patches are created around the landmark locations (along with other standard augmentations) in the initial epochs and reduced over subsequent epochs. Rows (b) and (c) show the corresponding GradCAM-based saliency maps of the network’s last layer with and without FiFA, respectively. Activations are clearly more prominent around the desired landmarks when FiFA is used as an additional augmentation.
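The patch augmentation described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors’ implementation: the linear shrinking schedule, the hypothetical helper names `patch_size_for_epoch` and `apply_fiducial_patches`, and the choice to remove the patches at epoch 5 are all assumptions; only the 5×5 starting size comes from Figure 1.

```python
import numpy as np


def patch_size_for_epoch(epoch: int, initial_size: int = 5, removal_epoch: int = 5) -> int:
    """Hypothetical linear schedule: shrink the patch each epoch, then drop it.

    Returns 0 once `epoch >= removal_epoch`, meaning no patches are drawn.
    """
    if epoch >= removal_epoch:
        return 0
    remaining = removal_epoch - epoch
    # Ceiling division keeps the size >= 1 until the removal epoch.
    return (initial_size * remaining + removal_epoch - 1) // removal_epoch


def apply_fiducial_patches(image: np.ndarray, landmarks: np.ndarray, size: int) -> np.ndarray:
    """Black out a size x size square centred on each (x, y) landmark.

    `image` is H x W or H x W x C; `landmarks` is a (K, 2) array of pixel
    coordinates. Returns a copy; the input image is not modified.
    """
    out = image.copy()
    if size <= 0:
        return out
    h, w = out.shape[:2]
    half = size // 2
    for x, y in np.rint(landmarks).astype(int):
        top, left = y - half, x - half
        # Clip the square to the image bounds.
        y0, y1 = max(top, 0), min(top + size, h)
        x0, x1 = max(left, 0), min(left + size, w)
        if y0 < y1 and x0 < x1:
            out[y0:y1, x0:x1] = 0
    return out


if __name__ == "__main__":
    img = np.full((32, 32, 3), 255, dtype=np.uint8)
    lms = np.array([[10.0, 12.0], [20.0, 20.0]])
    for epoch in range(7):
        size = patch_size_for_epoch(epoch)
        masked = apply_fiducial_patches(img, lms, size)
        print(epoch, size, int((masked[..., 0] == 0).sum()))
```

In a training loop one would compute the patch size once per epoch and mask each training image using its landmark coordinates after any geometric augmentation has been applied. The landmark targets themselves are left unchanged, so the network still has to localize the points hidden under the patches.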
Figure 2: Qualitative results on the WFLW test set. Landmarks shown in green are produced by our method, while those in red are produced by the state-of-the-art approach [2].
To make effective use of the proposed augmentation, we employ a Siamese training mechanism with a Deep Canonical Correlation Analysis (DCCA)-based loss, so that high-level feature representations are learned jointly from two different views of the input images. We use a Transformer + CNN network with a custom hourglass module as the backbone of the Siamese framework. Our approach outperforms existing state-of-the-art approaches on various benchmark datasets, both quantitatively and qualitatively; Figure 2 shows qualitative results.
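To illustrate what a DCCA-based loss measures, here is a minimal NumPy sketch of the standard DCCA objective (Andrew et al., ICML 2013): the negative sum of canonical correlations between the feature matrices produced by the two Siamese branches. It is an assumption that the paper’s loss follows this standard form; the function name `dcca_loss` and the ridge term `reg` are hypothetical, and a real training setup would write the same computation in an autodiff framework so that gradients reach both branches.

```python
import numpy as np


def _inv_sqrt(mat: np.ndarray) -> np.ndarray:
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    eigvals, eigvecs = np.linalg.eigh(mat)
    return eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def dcca_loss(h1: np.ndarray, h2: np.ndarray, reg: float = 1e-4) -> float:
    """Negative total canonical correlation between two views.

    `h1`, `h2` are (batch, dim) feature matrices from the two Siamese branches.
    `reg` is a hypothetical ridge term that keeps the covariances invertible.
    """
    m = h1.shape[0]
    # Centre each view over the batch.
    h1c = h1 - h1.mean(axis=0)
    h2c = h2 - h2.mean(axis=0)
    # Regularized within-view covariances and the cross-covariance.
    s11 = h1c.T @ h1c / (m - 1) + reg * np.eye(h1.shape[1])
    s22 = h2c.T @ h2c / (m - 1) + reg * np.eye(h2.shape[1])
    s12 = h1c.T @ h2c / (m - 1)
    t = _inv_sqrt(s11) @ s12 @ _inv_sqrt(s22)
    # Total correlation = sum of singular values of T (its trace norm).
    corr = np.linalg.svd(t, compute_uv=False).sum()
    return -float(corr)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    view_a = rng.standard_normal((500, 4))
    view_b = rng.standard_normal((500, 4))
    print("identical views:", dcca_loss(view_a, view_a))
    print("independent views:", dcca_loss(view_a, view_b))
```

Minimizing this loss pushes the two branches towards features that are strongly correlated across views: identical views approach the maximum total correlation (the feature dimension), while unrelated views score close to zero.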
We performed extensive experiments and ablation studies to validate the effectiveness of the proposed approach. Our method shows significant improvements over prior work on the COFW, AFLW, 300W and WFLW benchmark datasets. A further benefit of the approach is that it is network-independent and can be extended beyond the facial landmark detection task.

To learn more about Sony Research India’s research publications, visit the ‘Publications’ section of our ‘Open Innovation’ page:

Open Innovation with Sony R&D – Sony Research India


[1] Guy Hacohen and Daphna Weinshall. On the power of curriculum learning in training deep networks. In International Conference on Machine Learning, pages 2535–2544. PMLR, 2019.
[2] Yinglin Zheng, Hao Yang, Ting Zhang, Jianmin Bao, Dongdong Chen, Yangyu Huang, Lu Yuan, Dong Chen, Ming Zeng, and Fang Wen. General facial representation learning in a visual-linguistic manner. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18697–18709, 2022.