Landmark Recognition and Captioning on Google Landmark Dataset v2

AYUSHI SINHA
8 min read · Dec 20, 2020

Introduction and Motivation: Click and upload is the trend these days. Often we wish to remember and relive the moment captured in a picture, and the first thing that comes to mind is where it was clicked! Uploading to social sites sends us searching the internet for a good caption. What if we got a caption and a description of the image together? How efficient and easy would that be? For landmarks, identifying objects or components is especially difficult: a mountain is a mountain, but Mt. Fuji is a landmark because of its history and characteristics.

When you search for a particular landmarked image 😒
Captioned Image: Swim time at the lake, the pragserwildsea or lake prags lake braies is a lake in the prags in south tyrol, italy 😎

Working with landmark-oriented descriptions across a wide variety of images on the largest such dataset to date, Google Landmark Dataset v2, and using a range of text-processing and deep-learning techniques, we present our project for recognizing landmarks and captioning them.

Prerequisites: This read assumes familiarity with Keras, TensorFlow, NumPy, Pandas, Seaborn, classical classifiers, deep-learning classifiers such as the multi-layer perceptron, convolutional neural networks and recurrent neural networks, transfer learning, backpropagation, text processing, and Python syntax and data structures.

Data Collection: The Google Landmark Recognition v2 dataset is the largest landmark dataset to date. It is divided into two sets of images that support two different computer vision tasks: recognition and retrieval. We chose the recognition set, which provides image URLs. There are 4,132,914 images in the train set and 117,577 images in the test set. We downloaded the data and resized each image from 512×512 to 128×128, shrinking roughly 256 GB of data to an 8 GB train set and a 300 MB test set. From the 14,915 label folders we kept only those with more than 800 images, so that each could be split into 500 test and 300 train images, reducing variance and balancing the classes. This gives 148,000 train images and 62,512 test images. The Google landmark dataset has no image-to-caption mapping, so we created one ourselves: 8,000 captions for images belonging to 10 labels chosen from the 12K labels.

Data Exploration
Data Exploration😍

Image Preprocessing:

Libraries used for preprocessing: PIL, urllib, os, multiprocessing, tqdm, sys, and csv. The image data was provided as URLs stored in a CSV file. Images are downloaded from the URLs with the code below, which takes the CSV filename and an output directory as arguments. Since the images are large and there are about 5M of them, each image is reduced to 128×128 pixels using the resize function of the Pillow library.
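A minimal sketch of such a downloader (not our exact snippet), assuming the CSV has "id" and "url" columns — hypothetical column names:

```python
# Download images listed in a CSV and resize them to 128x128 with Pillow.
import csv
import os
import sys
from io import BytesIO
from multiprocessing import Pool
from urllib.request import urlopen

from PIL import Image
from tqdm import tqdm

TARGET_SIZE = (128, 128)  # images are shrunk from 512x512 to 128x128

def download_and_resize(args):
    image_id, url, out_dir = args
    out_path = os.path.join(out_dir, f"{image_id}.jpg")
    if os.path.exists(out_path):              # skip images already downloaded
        return
    try:
        raw = urlopen(url, timeout=10).read()
        img = Image.open(BytesIO(raw)).convert("RGB")
        img = img.resize(TARGET_SIZE)          # Pillow resize to 128x128
        img.save(out_path, "JPEG")
    except Exception:
        pass                                   # broken URLs are simply skipped

def main(csv_file, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_file) as f:
        rows = [(r["id"], r["url"], out_dir) for r in csv.DictReader(f)]
    with Pool(processes=8) as pool:            # parallel download via multiprocessing
        list(tqdm(pool.imap_unordered(download_and_resize, rows), total=len(rows)))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])             # arguments: csv filename, output directory
```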

Image Cleaning: Altering the pixel dimensions of an image is called 'resampling': a reverse mapping function is applied to each output pixel so that it maps back to a location in the original input image. To make images more suitable for a given application, contrast enhancement is used; it improves visibility and makes the image easier for the computer to process. Noise Removal: Noise is introduced at the time of image acquisition or during transmission and degrades image quality to varying degrees. In general, image noise can be classified as impulse noise (salt-and-pepper noise), Gaussian noise (amplifier noise), speckle noise (multiplicative noise), and Poisson noise (photon noise).
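A hedged OpenCV sketch of these cleaning steps — a median filter for salt-and-pepper noise and CLAHE for contrast enhancement; the exact filters and parameters are assumptions, not the ones we tuned:

```python
# Denoise and contrast-enhance an image before feature extraction.
import cv2

def clean_image(path):
    img = cv2.imread(path)                                  # BGR image
    img = cv2.medianBlur(img, 3)                            # removes impulse (salt & pepper) noise
    lab = cv2.cvtColor(img, cv2.COLOR_BGR2LAB)              # enhance contrast on the L channel only
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    lab = cv2.merge((clahe.apply(l), a, b))
    return cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)
```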

Text Cleaning: Punctuation, numerical values, contractions (via contraction mapping), stop words, elongated words, emoticons, negations, hashtags, non-ASCII characters, and very short words were all removed or normalized as part of text cleaning.
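A minimal sketch of this caption-cleaning pipeline; the contraction map and stop-word source here are illustrative, not our exact lists:

```python
# Clean raw caption text: contractions, non-ASCII, hashtags, elongation,
# punctuation/numbers, stop words and very short words.
import re
from nltk.corpus import stopwords   # requires nltk.download("stopwords")

CONTRACTIONS = {"can't": "cannot", "won't": "will not", "it's": "it is"}
STOP_WORDS = set(stopwords.words("english"))

def clean_caption(text):
    text = text.lower()
    for short, full in CONTRACTIONS.items():            # contraction mapping
        text = text.replace(short, full)
    text = text.encode("ascii", "ignore").decode()      # drop non-ASCII characters
    text = re.sub(r"#\w+", " ", text)                   # hashtags
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # elongated words: "sooo" -> "soo"
    text = re.sub(r"[^a-z\s]", " ", text)               # punctuation and numbers
    words = [w for w in text.split() if w not in STOP_WORDS and len(w) > 1]
    return " ".join(words)

print(clean_caption("Sooo beautiful!! Swim time at Lake Braies #italy 123"))
```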

Preprocessing Snippet👀

Proposed Model Architecture:

For landmark recognition and image captioning, we first preprocess the data as in the baseline, which involves noise removal; we then extract features through various models such as CNN, VGG16, FAST, and HOG, and feed these features into our image-captioning module to predict captions for the images.

Model Architecture
Working of Model

Feature Extraction:

The reason for selecting these features (CNN, VGG16, ResNet50) is that deep-learning models usually perform well on huge datasets, so extracting features from them and applying those features to a smaller dataset, alongside traditional feature extractors (FAST, HOG, SIFT), proved more beneficial.

SIFT is a fairly involved algorithm. The major advantage of SIFT features over edge features is that they are scale invariant and robust to orientation changes. SIFT features are extracted with the opencv-contrib-python==3.4.2.16 library through the cv2.xfeatures2d module. After extracting keypoints and descriptors for every image in our train data, we observed that SIFT gives a 128-dimensional descriptor per keypoint; in total, (580787, 128) features were extracted.
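A minimal sketch of this SIFT extraction step under opencv-contrib-python==3.4.2.16; the folder layout used for globbing is a hypothetical example:

```python
# Extract SIFT keypoint descriptors (128-dim each) from all train images.
from glob import glob

import cv2
import numpy as np

sift = cv2.xfeatures2d.SIFT_create()                 # SIFT lives in the contrib module

def sift_descriptors(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return descriptors                               # shape: (num_keypoints, 128)

train_image_paths = glob("train_128/*/*.jpg")        # hypothetical folder layout
all_desc = [sift_descriptors(p) for p in train_image_paths]
features = np.vstack([d for d in all_desc if d is not None])   # e.g. (580787, 128)
```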

SIFT Extracted Feature Visualization 😉

We combine FAST, for better edge and corner detection, and HOG, which identifies images on the basis of pixel-gradient histograms, with CNN and VGG16 features (a short extraction sketch for FAST and HOG follows the visualization below). For extracting CNN features, 4 Conv2D layers with 64, 64, 32, and 16 filters respectively are used along with max pooling and dropout, to extract the best features and avoid overfitting.

After that, a Flatten layer is added to obtain the features. To extract image features from the VGG model, a block producing 128×128×64 feature maps with max pooling is followed by a block producing 64×64×128 feature maps with max pooling and a final block producing 32×32×128 feature maps with max pooling. Three dense layers are then added, and features are obtained by flattening the outputs of the different layers. The train set here is 5,000 images and the test set 3,000, from which CNN gives (5K/3K, 784), VGG16 gives (5K/3K, 8192), HOG gives (5K/3K, 3192), and FAST gives (5K/3K, 49152).
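A hedged Keras sketch of the deep feature extractors: the small CNN described above (kernel sizes and dropout rates are assumptions), plus the ImageNet-pretrained VGG16 convolutional base as one way to obtain 8192-dimensional features at a 128×128 input. `train_images` is an assumed (N, 128, 128, 3) array:

```python
# Deep feature extractors: a 4-block Conv2D network and a VGG16 base, both flattened.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_cnn_extractor(input_shape=(128, 128, 3)):
    inp = layers.Input(shape=input_shape)
    x = inp
    for filters in (64, 64, 32, 16):                      # 4 Conv2D blocks: 64, 64, 32, 16 filters
        x = layers.Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Dropout(0.25)(x)                       # dropout to limit overfitting
    x = layers.Flatten()(x)                               # flattened activations = image features
    return models.Model(inp, x)

# VGG16 features: reuse the pretrained convolutional base and flatten its output.
vgg_base = VGG16(weights="imagenet", include_top=False, input_shape=(128, 128, 3))
vgg_extractor = models.Model(vgg_base.input, layers.Flatten()(vgg_base.output))

cnn_features = build_cnn_extractor().predict(train_images)   # train_images is assumed
vgg_features = vgg_extractor.predict(train_images)           # 4*4*512 = 8192 dims per image
```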

HOG and FAST extracted feature visualization😉
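Alongside this visualization, a minimal sketch of the classical FAST and HOG extraction, assuming OpenCV and scikit-image; the threshold and HOG parameters are illustrative, not our tuned settings:

```python
# Classical extractors: FAST corner detection (OpenCV) and HOG descriptors (scikit-image).
import cv2
from skimage.feature import hog

def fast_keypoints(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    fast = cv2.FastFeatureDetector_create(threshold=25)
    return fast.detect(gray, None)                       # list of cv2.KeyPoint corners

def hog_features(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    return hog(gray, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), feature_vector=True)   # 1-D gradient-histogram vector
```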

Image Captioning:

Caption Tokenization: To initiate the captioning task, the RNN model needs to be trained on a relevant dataset. The Tokenizer of the Keras library is used with nb_words = 8000. The RNN must be trained to predict the next word in the sentence, and training the model on raw strings is ineffective without definite numerical values, so every caption is converted to a sequence of integers.
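A minimal sketch of this step; recent Keras versions call the vocabulary-size argument num_words (nb_words is the older name), and the caption list here is illustrative:

```python
# Turn cleaned captions into padded integer sequences for the RNN.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

captions = ["startseq swim time at the lake endseq"]          # illustrative caption list
tokenizer = Tokenizer(num_words=8000, oov_token="<unk>")
tokenizer.fit_on_texts(captions)
sequences = tokenizer.texts_to_sequences(captions)             # words -> integer ids
padded = pad_sequences(sequences, padding="post")              # equal-length input for the RNN
```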

Word2Vec: Word2vec is a two-layer neural net that processes text by "vectorizing" words. Its input is a text corpus and its output is a set of vectors.

Word embedding: A word embedding is a learned representation for text where words that have the same meaning have a similar representation. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
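To illustrate the "vectorizing" idea, a minimal gensim sketch (gensim >= 4.0 uses vector_size); note that the final model uses pretrained GloVe vectors, shown in the next step, rather than this self-trained example:

```python
# Train toy word vectors on tokenized captions with gensim's Word2Vec.
from gensim.models import Word2Vec

sentences = [["swim", "time", "at", "the", "lake"],
             ["lake", "braies", "in", "south", "tyrol"]]   # illustrative tokenized captions
w2v = Word2Vec(sentences, vector_size=200, window=5, min_count=1, workers=4)
vector = w2v.wv["lake"]                                    # 200-dimensional embedding for "lake"
```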

Encoding/Decoding: The encoder-decoder is trained on 135,641 samples and validated on 57,623 samples after GloVe embedding. An Embedding layer with vocab_size = 1896 and dim_embedding = 200 is added; such encoder-decoder models in some cases outperform classical statistical machine-translation methods.
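A hedged sketch of building the GloVe-initialized Embedding layer with vocab_size = 1896 and dim_embedding = 200, assuming a glove.6B.200d.txt file (the file name is an assumption) and the tokenizer fitted in the tokenization step:

```python
# Build an Embedding layer whose weights come from pretrained GloVe vectors.
import numpy as np
from tensorflow.keras.layers import Embedding

vocab_size, dim_embedding = 1896, 200

glove = {}
with open("glove.6B.200d.txt", encoding="utf-8") as f:     # assumed GloVe file
    for line in f:
        parts = line.split()
        glove[parts[0]] = np.asarray(parts[1:], dtype="float32")

embedding_matrix = np.zeros((vocab_size, dim_embedding))
for word, idx in tokenizer.word_index.items():             # tokenizer from the previous step
    if idx < vocab_size and word in glove:
        embedding_matrix[idx] = glove[word]

embedding_layer = Embedding(vocab_size, dim_embedding,
                            weights=[embedding_matrix], trainable=False)
```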

Vectorization for words 👩‍💻

LSTM networks: LSTMs are well suited to classifying, processing, and making predictions on sequential data; here a 256-unit Dense layer and a cross-entropy loss are used, and the network is trained recurrently on top of predefined weights from pretrained models. The LSTM can be described by equations in which LSTM(x_t) returns p_{t+1}, and the tuple (m_t, c_t) is passed as the current hidden state to the next step.
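A hedged Keras sketch of a captioning decoder along these lines: image features and the partial caption are merged, passed through an LSTM and 256-unit Dense layers, and trained with categorical cross-entropy. Sizes not stated in the post (max_length, the 0.5 dropout) are assumptions:

```python
# Merge-style captioning model: image features + partial caption -> next-word probabilities.
from tensorflow.keras.layers import Input, Dense, LSTM, Embedding, Dropout, add
from tensorflow.keras.models import Model

max_length, vocab_size, feature_dim = 34, 1896, 8192   # feature_dim matches the VGG16 features

img_input = Input(shape=(feature_dim,))
img_dense = Dense(256, activation="relu")(Dropout(0.5)(img_input))

cap_input = Input(shape=(max_length,))
cap_embed = Embedding(vocab_size, 200, mask_zero=True)(cap_input)
cap_lstm = LSTM(256)(Dropout(0.5)(cap_embed))

merged = add([img_dense, cap_lstm])                       # combine image and text states
hidden = Dense(256, activation="relu")(merged)
output = Dense(vocab_size, activation="softmax")(hidden)  # probability of the next word

caption_model = Model(inputs=[img_input, cap_input], outputs=output)
caption_model.compile(loss="categorical_crossentropy", optimizer="adam")
```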


Results:

The complete Google Landmark dataset split, 148,000 train images and 62,512 test images, was run on the provided server, and the following accuracy was obtained through feature extraction and classification: accuracy on this dataset ranges from 65–70% for the classical classifiers.

Classification results were also obtained for the 8K-image dataset on which the state of the art is implemented. Various classifiers were applied to the extracted features: LGBM and the stacked classifier give the highest accuracy, 89–90%, among all the classifiers and also outperform the state of the art (71%).

Other classifiers that outperform the state of the art are Logistic Regression, KNN, and Random Forest. The decision tree, on the other hand, performs below the state of the art, and SIFT features with the other classifiers give at most 47–56% accuracy.

Image captioning on a landmark dataset is a difficult task because the components of an image (boy, girl, cat, dog, everyday objects) cannot be classified or differentiated in the same way for landmarks: a lake is a lake and a mountain is a mountain, yet each holds its own importance. With this in mind, our model achieves a 62.11% BLEU score (an approximate bilingual-translation metric) with correct label predictions and descriptions. Though the captions seem vague, they can be improved with more training on captioned data.
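For reference, a minimal sketch of computing BLEU with NLTK's corpus_bleu and smoothing; the reference and predicted captions here are illustrative, not our evaluation set:

```python
# Score predicted captions against references with corpus-level BLEU.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["swim", "time", "at", "the", "lake", "braies"]]]   # one reference list per image
hypotheses = [["swim", "time", "at", "lake", "braies"]]            # predicted caption tokens

bleu = corpus_bleu(references, hypotheses,
                   smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {bleu:.4f}")
```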

Validation and Training Loss Curve for image recognition and captioning together
Baseline ref[1] & [2] comparison with our model 😎
Performance on Classical Classifiers 😎

Stacked Classifier: Using all the classical classifiers at level 0 and Logistic Regression at level 1 of the stacking classifier, we obtained the best result.
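A hedged scikit-learn sketch of this stacking setup, with level-0 classical classifiers and Logistic Regression as the level-1 meta-learner; hyperparameters are defaults rather than our tuned values, lightgbm is assumed installed, and X_train/y_train are the extracted features and labels:

```python
# Stack classical classifiers (level 0) under a Logistic Regression meta-learner (level 1).
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier

estimators = [
    ("lgbm", LGBMClassifier()),
    ("rf", RandomForestClassifier()),
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier()),
]
stack = StackingClassifier(estimators=estimators,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_train, y_train)                   # X_train/y_train: extracted features and labels (assumed)
accuracy = stack.score(X_test, y_test)
```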

Performance Comparison Plot

Epilogue: A whole new perspective on image recognition comes from bringing back past moments and memories, revisiting the place and reliving the moment, or flaunting them on social media platforms. This is achieved by image captioning, and it may soon help you decide your Instagram caption too. In future work we plan to apply this to a much larger set, and not just to landmarks but to all varieties of images. Some predictions follow:

Predictions made on test images 😍

References:

  1. https://www.researchgate.net/publication/328820153_Monument_Recognition_Using_Deep_Neural_Networks
  2. https://ieeexplore.ieee.org/search/searchresult.jsp?newsearch=true&queryText=landmark%20recognition%20in%20cnn%20

Authors:

  1. Ayushi Sinha (M.Tech CSE-AI, IIITD) [LinkedIn]: Feature Extraction (CNN, VGG16, RESNET50, HOG) and Combination, Ensemble Classifier, Feature-Image-Caption Extension, Image Captioning on extracted features using LSTM, Data Acquisition for Captioned Model, Text Preprocessing, Blog Writing.
  2. Ashi Sahu (M.Tech CSE-DE, IIITD) [LinkedIn]: Image Preprocessing, Feature Extraction and Classification (SIFT), Blog GIF, Data Acquisition for Captioned Model, Blog Writing.
  3. Ravi Rathee (M.Tech CSE-IS, IIITD) [LinkedIn]: Baseline Implementation on CNN, Data Preparation and Exploration, Feature Extraction (FAST), Data Acquisition for Captioned Model, Image Captioning on Inception V3.

Special thanks to:

Our Professor- Dr. Tanmoy Chakraborty [LinkedIn] [Website][Facebook] and Our guide, Nirav Diwan. [LinkedIn]
