Doom or Lasagna, an AWS Rekognition Challenge.



Having just resurfaced to the real world after completing a recent AWS Cloud computing course this past fall, a friend of mine shared with me a photo showcasing just how similar pictures of id Software's DOOM video game are to close up images of lasagna. Despite the captivating sequel idea of the doom slayer fighting off endless waves of hell spawn emerging from the depths of an Italian dish, the thought dawned on me that this would actually be a really fun challenge for an image classifier! Better yet, why not tackle two problems at once. Take the recent experience gained in my cloud computing course and put it to the test by familiarizing myself with AWS Rekognition image classification toolkit. Using AWS Rekognition, I'd attempted to classify between pictures of Doom 2016/Eternal’s Martian landscape and close-up photos of lasagna. With that curious spark in place I began my quest.



The classification of images using a convolutional neural network is often a difficult process. Reviewing the AWS Rekognition documentation clearly showcased that much of the complexity of model training is abstracted from the user. However, the analysis and evaluation of completed model performance remains a core task left to the developer. The decision to train a machine learning model on the ability to distinguish between images of Lasagna and a Maritain landscape may seem odd. However, in the context of convolutional image classification this is challenging problem. Both images share a strikingly similar color and texture pallet. My proposed hypothesis was that the classification of visually similar images would often result in miss-classified labels. While the ability to differentiate between these two image groups may appear easy for humans (though my friend and I certainly had trouble at some points), the classification of visually similar images is a classic edge case example used when first introducing students to the world of image classification. Take for instance the legendary image classification example “blueberry muffin VS. chihuahua. Ultimately, image classification is a technically challenging process that can become increasingly difficult in attempting to differentiate between visually similar pictures.

Image Collection and Aggregation


Image classification models rely on an image dataset. An image dataset is a repository of images used for the training and testing of a model. In order to develop an image classification model, I first had to establish an image dataset for the training and testing the AWS classification model. Given this project relied on the comparison between images from the video game Doom, and photos of Lasagna, the image collection process began by obtaining images from these two groups. I opted to leverage Google’s image search tool to obtain additional photos. This tool was selected specifically for its ability to search for visually similar images, as well as the compatibility with the batch image tool I would later use to download multiple images to my computer. To scrap images from the internet, I leveraged the Firefox extension “download all images” by Joe Ertaba



I started the image scraping process by specifying for images from Doom that appeared as orange, Martian landscapes featuring minimal user interfaces obstructing landscape views. Given the nature of Doom as a first-person shooter video game, many image results featured an obstructing weapon model in the middle of the photo. To ensure the classification model would not associate a weapon model as a critical parameter for classifying the mars landscape, I opted to restrict the Doom images to without any obstructing weapons.


Overall, I found the process of finding acceptable Martian images to be particularly difficult. While the stylistic pallet of Doom’s level design is often a marrying of orange and black, most images found online did not feature the video game’s landscape as the image’s primary focus. Also, given the nature of Doom as a fast-paced first-person shooter. A sizable number of images had to be discarded due to large amounts of motion blur resulting from the player’s camera movement in-game. This would later prove to be the primary reason for my relatively small sized dataset.



The process of scraping images of lasagna was significantly easier. Not only are there more publicly available images of this dish, but the range of acceptable images was also significantly larger than the doom landscape. Specifically, the only restricting factor I encountered with scraping images of lasagna, was ensuring to the best of my ability, images were purely focused with lasagna as the center point. Once again, I leveraged the Firefox extension to obtain a similarly large number of photos from google images.

Model Creation


Image classification can be sub-categorized into two distinct model types: Supervised and unsupervised learning. Supervised learning is commonly utilized in the context of classification models, while unsupervised learning is often used in regression-based analysis. Classification models leveraging supervised learning opt to teach a machine learning model how to correctly identify objects and scenes in an image. This is achieved by pre-labelling input data such that a ground truth can be establish and by which the machine learning model can use to evaluate itself. Given the use of AWS Rekognition’s custom labels, this project utilized a supervised learning approach.


In reading further into the nuances of AWS Rekognition, it became clear to me that utilizing a custom label model was the best approach for this project. Not only would I be able to fine tune my own model utilizing custom training tags specialized for the doom/lasagna use case, but also would provide me a more grounded understanding of how AWS Rekognition operates. I began creating an AWS Rekognition custom label model by first establishing an image repository. AWS Rekognition offers four discrete options for image hosting, a user may upload images directly to AWS Rekognition from their local computer, host images through S3, leverage an existing dataset, or utilize AWS SageMaker Ground Truth. In my case I opted to host my image dataset through AWS S3 bucket storage system, more on why later. I began by creating a new bucket with a singular folder inside.


Within the image_folder, I created two additional sub-folders to contain each image group (lasagna and doom_landscape).


Once my images were uploaded to my S3 bucket, I then created a new AWS Rekognition Custom Label project. Looking back at some of the past services covered in my cloud computing course (looking at you CloudFormation), creating a new Rekognition project is quite simple and straight forward. Here a user must simply provide a project name.


Once a project name is provided, users must create a dataset (if a dataset is not already established). Dataset creation follows the following requirements. Users provide a name and select an image location. Per the previous section, I opted to leverage an S3 bucket and provide a URI to the S3 folder location during the creation of this dataset.



Classification models attempt to distinguish images by learning relationship patterns between input data and known ground truths. The process of properly labeling input data is often a tedious process. Human labeling can produce noise or error into a model as some images may be incorrectly labeled, ultimately reducing model effectiveness. By leveraging an S3 storage bucket, AWS Rekognition provides users the option to automatically label images based on their folder directory. While additional labels may be added later, auto image labeling saved myself a significant amount of time given 100+ total images used for this classification project (Side note, a real-world dataset would likely have images in the thousand+ range, 100+ images is tiny. Though I am only one person).

Finally, users must ensure their S3 bucket is correctly configured via a provided bucket policy. Failure to apply the policy to the S3 bucket will prevent users from training their model, this step is critical!

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Sid": "AWSRekognitionS3AclBucketRead20191011",
 "Effect": "Allow",
 "Principal": {
 "Service": "rekognition.amazonaws.com"
 },
 "Action": [
 "s3:GetBucketAcl",
 "s3:GetBucketLocation"
 ],
 "Resource": "NAME-OF-S3-Bucket"
 },
 {
 "Sid": "AWSRekognitionS3GetBucket20191011",
 "Effect": "Allow",
 "Principal": {
 "Service": "rekognition.amazonaws.com"
 },
 "Action": [
 "s3:GetObject",
 "s3:GetObjectAcl",
 "s3:GetObjectVersion",
 "s3:GetObjectTagging"
 ],
 "Resource": "arn:aws:s3:::doom-or-lasagna-small-ds/*"
 },
 {
 "Sid": "AWSRekognitionS3ACLBucketWrite20191011",
 "Effect": "Allow",
 "Principal": {
 "Service": "rekognition.amazonaws.com"
 },
 "Action": "s3:GetBucketAcl",
 "Resource": "S3 BUCKET ARN"
 },
 {
 "Sid": "AWSRekognitionS3PutObject20191011",
 "Effect": "Allow",
 "Principal": {
 "Service": "rekognition.amazonaws.com"
 },
 "Action": "s3:PutObject",
 "Resource": "S3 BUCKET ARN + /*",
 "Condition": {
 "StringEquals": {
 "s3:x-amz-acl": "bucket-owner-full-control"
 }
 }
 }
 ]
}

Once the policy has been attached, we can now now attach custom labels to each image. This process highlights the nature of AWS Rekognition custom label as a supervised learning model. Selecting the “Start Labeling” button will allow users to select images in groups or individually and attach labels accordingly.


Lets take for instance the following image. Here I have attached a label indicating a tower exists in the image, as well as the location mars. Labels should be attached only if the given topic exists in the image. Finally, the image was auto labeled through my s3 bucket as a member of the “doom” parent group. A potential label to be included may be a “cliff” tag, given the top left corner of the image. Additional labels will increase model complexity, resulting in longer training times or the potential for miss classification. Too few labels will likely result in model over fitting.


To begin training a new classification model on our properly labeled dataset, users must navigate back to the project directory and select their project followed by the “train new model” option under the “models” group.

Here the user must choose a parent project for the model, specifying the project ARN from a list of created AWS Rekgonition projects. Users should select the previously created project ARN. Additionally, a training dataset must be selected, here we will select the previously created dataset containing our training/testing images. Lastly, the training model must be informed on how to create a corresponding test dataset.


Image classification relies on multiple datasets to properly train and test a given model. A training dataset (as the name might suggest) is used to train the neural weights of each layer. A classification model (typically a convolutional neural network or CNN) is composed of multi-layered neural weights (each layer referred to as a perceptron). Each neuron weight typically corresponds to specific hyperparameters (a value used to control the learning process) in the input data, whereby neurons from one perceptron are connected to the weights in another perceptron. Should the sum of the weights surpass a confidence activation threshold, the model may or may not make a classification.

Ultimately, the data provided by the training dataset is used to establish the weights of the neural network. Therefore, should future testing input data be parsed through the network and similar weights are activated, then a corresponding classification can be made.


AWS Rekognition offers three different options for providing a test dataset. Users may either create a new test dataset (following the same procedures for creating our initial training dataset), copy an existing test dataset, or utilize a train-test split. For this project, I leveraged the traditional approach to model training and selected the train-test split. Here the model will be trained using a random sample consisting of 80% of the original training dataset, the remaining 20% will be dedicated for testing the model after training has completed. Segmenting the training and testing dataset ensures both datasets are representative of the original data.


The classification model is now ready to be trained. AWS Rekognition charges users $1 per hour of training. Training times vary significantly depending on the complexity of the model. For this project, training times varied between 1-3 hours in length.

Model results


In order to familiarize myself with AWS Rekcognition, as well as validating model performance. I opted to create three different models. The first model was trained purely on a dataset consisting of 72 images of lasagna without any images from DOOM. The training dataset contained only two possible labels (lasagna and Italian food) This model was used to develop an initial understanding of the model creation process, as well as the test results.


Model performance is presented to the user as a suite of evaluation results. These results include an F1 score, an average precision score and an overall recall. The F1 value scores the classification model on the accuracy of the test data and is a calculation of both the precision and recall values. The model precision is the fraction of true positives over all true and false positives found in the model. Finally, the recall value is the computed fraction of test set labels that were correctly predicted.


The following image displays the evaluation results from the first trained model. The following model took approximately 1 hour and 30 minutes of training to produce the following results.



The results from this model are surprising. Per the previous image, all three metric scores received a perfect 100%. While an initial impression might assume this is a fantastic result, a perfect result is often representative of a model overfitting the data provided. This could have occurred from either too few dataset images, repeating images in both the training and testing datasets. Additionally, it is possible that my custom tags are too simplistic.


Upon completion of the first classification model, I created a secondary model incorporating images of both lasagna and landscapes from Doom. Primarily concerned with the surprising amount of time AWS took to train my initial simplistic model, the secondary model contained half the number of images used in the first model. While the risk of utilizing a small dataset remained a possibility, I anticipated the greater diversity in images likely would offset any potential issues encountered (as the model would have a greater degree of input variation and as such more flexibility in image classification). The following image displays the evaluation results from the second model.



Once again, the model evaluation results are surprising. Unlike the first model, the resulting accuracy scores more closely aligned with my initial expectations. While the model is not perfectly accurate, the results suggest a model that more realistically identifies custom tags in the test dataset.


The results from this model clarify the issue with my previous model evaluation. It is likely my previous model’s tags were too broadly applicable to all dataset images. Given these simplified tags were applied to all training images, increasing the number of custom tags, and diversifying the image pool addressed the previous model’s overfitting. Furthermore, the model training time was cut by 30 minutes. This suggests to me that the number of custom tags impacts the computational performance of the model less than the number of dataset images. When looking at the classification results, an additional surprise can be found.



Success! The previous image showcases a false positive classification, whereby the model incorrectly classifies an image of lasagna as an image of a Martian landscape. Understanding why the model classified this image of lasagna as an image of mars is confusing. While AWS Rekognition provides very little information to the user beyond pure classifications, my assumption for this false positive remains the visually similar color pallet. The inclusion of yellow, orange, dark red and brown, combined with the black plate shares many of the same features found in labelled images of mars.


To further test my assumption, I repeated the model training process for a final time. For this last model, I increased the dataset size to include a total sample of 113 images of Doom and Lasagna photos. Again, I increased the number of custom tags, providing more opportunities for the model to produce nuanced classifications. The following custom tags were added:

- Doom_landscape

- Lasagna

- Cliff

- Curvy noodle

- Flat noodle

- Smoke

- Tower



The results from this model showcase a clear ability to distinguish images of lasagna from images from Doom. In fact, there wasn't a single occurrence of miss-classifications between images of DOOM and images of Lasagna. However, the model appears to have difficulty distinguishing between flat and curvy noodle-based lasagna. While both images share a similar color palette, it becomes increasingly clear that the model is having trouble differentiating between degrees of “curvy-ness” found in each image. While some images of lasagna showcase near perfect flat noodles, other images showcase lasagna slices at a slight angle that gives the illusion of curved noodles. This result reinforces the need for clear images with unambiguous traits.


The Take Away


This project offered myself the perfect opportunity to merge my own passion for gaming witha personal interest in advanced machine learning. I anticipated that leveraging this project to implement a classification tool would provide a valuable experience in cloud/cluster based convolutional neural network computations. This proved to be true, as image classification through a convolutional neural network is a typically complex process. However, AWS Rekognition simplifies much of the complexity away from the user. Leveraging the abstraction built into AWS Rekoginition, I was able to quickly build and test a complex machine learning model to classify images of lasagna from visually similar images from the popular video game Doom, in just a matter of a few hours. The project consisted of three machine learning models, a small-scale single image dataset, a small-scale mixed image dataset and a large-scale mixed image dataset. The evaluation results produced by these models were surprising. Contrary to my initial hypothesis, AWS Rekognition was capable for correctly classifying nearly every image presented. While occasional outliers were present, with a singular occurrence of lasagna classified as mars, the tool often correctly classified images with a high probability of confidence.


The decision to train a machine learning model on the ability to distinguish between images of Lasagna and a Maritain landscape may seem odd. Identifying visually similar images using a convolutional image classification tool is a challenging problem. However, AWS Rekognition has demonstrated that it is capable of correctly classifying visually similar images with a relatively small dataset, combined with a limited number of supervised learning labels.

Recent Posts
Archive
Search By Tags