Segmentation

Background

Image segmentation is the process of partitioning an image into meaningful segments by assigning every pixel to one of a set of predefined groups, called classes, in order to differentiate the objects in an image.

Image segmentation can be divided into two main groups: semantic segmentation and instance segmentation.

Semantic segmentation is the approach where pixels are grouped into classes, e.g. people in an image are segmented out from the background. Instance segmentation goes a step further by segmenting each object within each class individually, e.g. each person in an image is segmented separately, which allows every object to be identified even when multiple objects of the same class are present. Our pipeline uses semantic segmentation, since each image contains exactly one character and one shape, eliminating the need for instance segmentation.

Overview

Segmentation is the second step in our computer vision pipeline. It takes the cropped image from the saliency step as its input. The purpose of segmentation is to reduce the amount of noise in the image before passing it on to the classification step. Successful segmentation reduces classification to a simple MNIST-style problem, for which a variety of methods can achieve error rates below 1%.

Segmentation Demo Image

Current Implementation

Segmentation is currently implemented using a Fully Convolutional Network, also called an FCN. An FCN is a deep-learning model built entirely from convolutional layers. It repeatedly downsamples the input image through convolutions, producing feature maps at progressively coarser resolutions, and then upsamples those feature maps back to the original image resolution, yielding a class prediction for every pixel.
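The downsample-then-upsample structure can be sketched as a minimal encoder-decoder in PyTorch. This is an illustrative toy, not our actual network: the class name `TinyFCN`, the channel widths, and the layer counts are all assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    """Minimal FCN-style sketch: stride-2 convolutions downsample,
    transposed convolutions upsample back to the input resolution."""

    def __init__(self, num_classes=3):
        super().__init__()
        # Encoder: each stride-2 convolution halves the spatial resolution.
        self.down1 = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
        # Decoder: each transposed convolution doubles the resolution.
        self.up1 = nn.Sequential(nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU())
        self.up2 = nn.ConvTranspose2d(16, num_classes, 2, stride=2)

    def forward(self, x):
        x = self.down1(x)
        x = self.down2(x)
        x = self.up1(x)
        return self.up2(x)  # one score per class at every pixel

model = TinyFCN(num_classes=3)
scores = model(torch.zeros(1, 3, 64, 64))  # dummy 64x64 RGB image
labels = scores.argmax(dim=1)              # per-pixel class prediction
```

Taking the argmax over the class dimension turns the per-pixel scores into a label map, which is the starting point for the mask extraction described below.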

Classes

In our implementation, the FCN is looking for three classes: shape, character, and background. Using images that have already been segmented, we can train the FCN to create accurate inferences on which class a pixel in an image belongs to. The segmented shape and character are each turned into binary masks that are passed onto the classifier. The background pixels are not used.

Shape post-processing

Because each pixel is segmented into one and only one class, the pixels which belong to the character class are absent from the shape mask. As the character generally lies within the bounds of the shape, the shape mask tends to have an empty hole where the character was. To aid in the classification step, the shape mask is filled in using a simple contour detection algorithm to find the contour of the shape and fill in all the holes within the shape.

Performance

The performance of segmentation is benchmarked using two different metrics: Sørensen–Dice coefficient and Jaccard Index.

Generally, the Sørensen–Dice coefficient is the more useful and trustworthy of the two. The metrics are monotonically related (Dice = 2J / (1 + J), where J is the Jaccard index), so they rank predictions identically, but Dice penalizes small boundary errors less harshly and is the more common benchmark in the segmentation literature.
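Both metrics compare a predicted binary mask against a ground-truth mask in terms of set overlap. A minimal NumPy sketch:

```python
import numpy as np

def jaccard(pred, target):
    """Jaccard index (IoU): |A ∩ B| / |A ∪ B| for binary masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union

def dice(pred, target):
    """Sørensen–Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    return 2 * inter / (pred.sum() + target.sum())

pred   = np.array([[1, 1, 0, 0]])
target = np.array([[1, 0, 1, 0]])
j = jaccard(pred, target)  # 1 overlapping pixel, 3 in the union -> 1/3
d = dice(pred, target)     # 2*1 / (2 + 2) -> 1/2
```

Both scores range from 0 (no overlap) to 1 (perfect agreement), and the example values satisfy the identity Dice = 2J / (1 + J).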

Future Work

  1. In addition to benchmarking the accuracy of segmentation, we need to benchmark its runtime efficiency.