One of the more difficult challenges in robotics is the so-called “kidnapped robot problem.” Imagine you are blindfolded and taken by car to the home of one of your friends but you don’t know which one. When the blindfold is removed, your challenge is to recognize where you are. Chances are you’ll be able to determine your location, although you might have to look around a bit to get your bearings. How is it that you are able to recognize a familiar place so easily?
It’s not hard to imagine that your brain uses visual cues to recognize your surroundings. For example, you might recognize a particular painting on the wall, the sofa in front of the TV, or simply the color of the walls. What’s more, assuming you have some familiarity with the location, a few glances would generally be enough to conjure up a “mental map” of the entire house. You would then know how to get from one room to another or where the bathrooms are located.
Over the past few years, Mathieu Labbé from the University of Sherbrooke in Québec has created a remarkable set of algorithms for automated place learning and SLAM (Simultaneous Localization and Mapping) that depend on visual cues similar to those used by humans and other animals. He also employs a memory management scheme inspired by the concepts of short-term and long-term memory from the field of psychology. His project is called RTAB-Map, for “Real-Time Appearance-Based Mapping,” and the results are very impressive.
Real-Time Appearance-Based Mapping (RTAB-Map)
The picture on the left is the color image seen through the camera. In this case, Pi is using an Asus Xtion Pro depth camera set at a fairly low resolution of 320×240 pixels. On the right is the same image where the key visual features are highlighted with overlapping yellow discs. The visual features used by RTAB-Map can be computed using a number of popular techniques from computer vision including SIFT, SURF, BRIEF, FAST, BRISK, ORB or FREAK. Most of these algorithms look for large changes in intensity in different directions around a point in the image. Notice therefore that there are no yellow discs centered on the homogeneous parts of the image such as the walls, ceiling or floor. Instead, the discs overlap areas where there are abrupt changes in intensity such as the corners of the picture on the far wall. Corner-like features tend to be stable properties of a given location and can be easily detected even under different lighting conditions or when the robot’s view is from a different angle or distance from an object.
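To make the "large changes in intensity in different directions" idea concrete, here is a minimal pure-Python sketch of the classic Harris corner measure. Harris is not one of the detectors listed above and this is not RTAB-Map's code; it is simply a compact stand-in for the same principle. The function name `harris_response`, the synthetic 10×10 image and the constant `k=0.04` are all assumptions made for illustration.

```python
def harris_response(image, k=0.04):
    """Return {(row, col): score} of Harris corner scores for a 2D grayscale list."""
    rows, cols = len(image), len(image[0])

    # Central-difference intensity gradients, defined on interior pixels only.
    ix, iy = {}, {}
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            ix[r, c] = (image[r][c + 1] - image[r][c - 1]) / 2.0
            iy[r, c] = (image[r + 1][c] - image[r - 1][c]) / 2.0

    response = {}
    for r in range(2, rows - 2):
        for c in range(2, cols - 2):
            # Sum gradient products over a 3x3 window around the pixel.
            sxx = syy = sxy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    gx, gy = ix[r + dr, c + dc], iy[r + dr, c + dc]
                    sxx += gx * gx
                    syy += gy * gy
                    sxy += gx * gy
            # Score is positive when intensity changes strongly in two directions.
            response[r, c] = (sxx * syy - sxy * sxy) - k * (sxx + syy) ** 2
    return response


# Synthetic test image (an assumption): dark background with a bright
# square in the lower-right quadrant, so there is one corner at (5, 5).
image = [[100 if r >= 5 and c >= 5 else 0 for c in range(10)] for r in range(10)]

scores = harris_response(image)
best = max(scores, key=scores.get)
print("strongest corner at", best)
print("flat wall score:", scores[2, 2])
print("straight edge score:", scores[7, 5])
```

In this toy image the flat region scores zero, the straight edge scores negative and the corner of the square scores highest, which mirrors why the yellow discs avoid the walls and cluster around corner-like structure.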
RTAB-Map records these collections of visual features in memory as the robot roams about the area. At the same time, a machine learning technique known as the “bag of words model” looks for patterns in the features that can then be used to classify the various images as belonging to one location or another. For example, there may be a hundred different video frames like the one shown above but from slightly different viewpoints that all contain visual features similar enough to assign to the same location. The following image shows two such frames side by side:
Here we see two different views from essentially the same location. The pink discs indicate visual features that both images have in common and, as we would expect from these two views, there are quite a few shared features. Based on the number of shared features and their geometric relations to one another, we can determine whether the two views should be assigned to the same location. In this way, only a subset of the visual features needs to be stored in long-term memory, yet the robot can still recognize a location from many different viewpoints. As a result, RTAB-Map can map out large areas such as an entire building or an outdoor campus without requiring an excessive amount of memory or processing power to create or use the map.
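The decision described above (count the shared features, then check that their geometry agrees) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RTAB-Map's actual verification step: the nearest-neighbour ratio test, the median-shift consistency check, the two-dimensional toy descriptors, and the names `match_features`, `consistent_matches` and `same_location` are all hypothetical choices made to keep the example small.

```python
import math
import statistics


def match_features(desc_a, desc_b, ratio=0.7):
    """Pair each descriptor in A with its nearest neighbour in B (ratio test)."""
    matches = []
    for i, da in enumerate(desc_a):
        dists = sorted((math.dist(da, db), j) for j, db in enumerate(desc_b))
        # Accept only if the best match is clearly better than the runner-up.
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches


def consistent_matches(matches, pts_a, pts_b, tol=5.0):
    """Keep matches whose pixel shift agrees with the median shift."""
    if not matches:
        return []
    shifts = [(pts_b[j][0] - pts_a[i][0], pts_b[j][1] - pts_a[i][1])
              for i, j in matches]
    mx = statistics.median(s[0] for s in shifts)
    my = statistics.median(s[1] for s in shifts)
    return [m for m, s in zip(matches, shifts)
            if abs(s[0] - mx) <= tol and abs(s[1] - my) <= tol]


def same_location(pts_a, desc_a, pts_b, desc_b, min_inliers=3):
    """Decide whether two views share enough geometrically consistent features."""
    matches = match_features(desc_a, desc_b)
    return len(consistent_matches(matches, pts_a, pts_b)) >= min_inliers


# View A, and view B of the same place shifted by about (+4, -2) pixels.
# The last keypoint in B matches by appearance but sits in the wrong spot.
pts_a = [(10, 10), (50, 20), (30, 60), (80, 80)]
desc_a = [(1, 0), (0, 1), (5, 5), (9, 2)]
pts_b = [(14, 8), (54, 18), (34, 58), (20, 90)]
desc_b = [(1.1, 0.1), (0.1, 1.0), (5.1, 4.9), (9.0, 2.1)]

# View C comes from somewhere else: its descriptors resemble nothing in A.
pts_c = [(5, 5), (15, 5), (5, 15), (15, 15)]
desc_c = [(20, 20), (21, 20), (20, 21), (21, 21)]

print(same_location(pts_a, desc_a, pts_b, desc_b))
print(same_location(pts_a, desc_a, pts_c, desc_c))
```

Here views A and B are judged to be the same place because three of their four appearance matches also agree geometrically, while view C produces no confident matches at all. Real systems use much richer descriptors and a proper geometric model, but the two-stage shape of the test is the same.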
Note that even though RTAB-Map uses visual features to recognize a location, it is not storing representations of human-defined categories such as “painting”, “TV”, “sofa”, etc. The features we are discussing here are more like the receptive field responses found in lower levels of the visual cortex in the brain. Nonetheless, when enough of these features have been recorded from a particular view in the past, they can be matched with similar features in a slightly different view as shown above.
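Because the features are just numeric descriptors rather than object labels, the bag-of-words model mentioned earlier boils down to counting which "visual words" (clusters of similar descriptors) show up in a frame. The sketch below assumes a tiny, fixed, hand-made vocabulary and two-dimensional descriptors purely for illustration; how RTAB-Map actually builds and queries its vocabulary is not covered here, and the helper names are hypothetical.

```python
import math
from collections import Counter


def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary word closest to the descriptor."""
    return min(range(len(vocabulary)),
               key=lambda i: math.dist(descriptor, vocabulary[i]))


def bag_of_words(descriptors, vocabulary):
    """Histogram of visual-word counts for one frame."""
    counts = Counter(nearest_word(d, vocabulary) for d in descriptors)
    return [counts.get(i, 0) for i in range(len(vocabulary))]


def cosine_similarity(h1, h2):
    """Similarity of two histograms, from 0 (nothing shared) to 1 (same mix)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm = math.sqrt(sum(a * a for a in h1)) * math.sqrt(sum(b * b for b in h2))
    return dot / norm if norm else 0.0


# Hypothetical four-word vocabulary (an assumption for this example).
vocabulary = [(0, 0), (10, 0), (0, 10), (10, 10)]

# Frames A and B: same place, slightly different viewpoints.
frame_a = [(0.5, 0.2), (9.6, 0.3), (9.9, 0.1), (0.2, 9.8)]
frame_b = [(0.1, 0.4), (10.2, -0.3), (9.7, 0.6), (0.4, 10.1), (9.8, 9.9)]
# Frame C: a different place dominated by a different visual word.
frame_c = [(9.9, 10.2), (10.1, 9.7), (9.5, 9.9), (0.3, 0.1)]

hist_a = bag_of_words(frame_a, vocabulary)
hist_b = bag_of_words(frame_b, vocabulary)
hist_c = bag_of_words(frame_c, vocabulary)

print(hist_a, hist_b, hist_c)
print("A vs B:", round(cosine_similarity(hist_a, hist_b), 3))
print("A vs C:", round(cosine_similarity(hist_a, hist_c), 3))
```

Frames A and B end up with nearly identical word histograms even though none of their raw descriptors are exactly equal, whereas frame C looks quite different. Comparing compact histograms like these is what makes it cheap to ask "have I seen this place before?" across many stored frames.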
RTAB-Map can stitch together a 3-dimensional representation of the robot’s surroundings using these collections of visual features and their geometric relations. The YouTube video below shows the resulting “mental map” of a few rooms in a house:
The next video demonstrates a live RTAB-Map session where Pi Robot has to localize himself after being set down in a random location. Prior to making the video, Pi Robot was driven around a few rooms in a house while RTAB-Map created a 3D map based on the visual features detected. Pi was then turned off (sorry dude!), moved to a random location within one of the rooms, and turned on again. Initially, Pi does not know where he is, so he drives around for a short distance gathering visual cues until, suddenly, the whole layout comes back to him and the full floor plan lights up. At that point we can set navigation goals for Pi and he autonomously makes his way from one goal to another while avoiding obstacles.