Ubicept wants half of the world’s cameras to see things differently

Computer vision could be much faster and better if we ignore the concept of still images and directly analyze the data stream from a camera. At least, that's the theory that Ubicept, the latest brainchild to come out of the MIT Media Lab, is operating under.
Most computer vision applications work the same way: a camera takes an image (or a rapid series of images, in the case of video). These still images are transmitted to a computer, which then performs the analysis to determine what is in the image. It seems simple enough.
But there’s a problem: This paradigm assumes that creating still images is a good idea. As humans accustomed to seeing photography and video, that might seem reasonable. Computers don’t care, though, and Ubicept thinks it can make computer vision much better and more reliable by ignoring the idea of frames.
The company is a collaboration between its two co-founders. Sebastian Bauer is the company's CEO and a postdoctoral fellow at the University of Wisconsin, where he worked on lidar systems. Tristan Swedish is now Ubicept's CTO. Before that, he spent eight years at the MIT Media Lab as a research assistant and M.Sc. and Ph.D. student.
“There are 45 billion cameras in the world, and most of them create images and videos that aren’t really viewed by a human,” Bauer explained. “These cameras are primarily for perception, for systems to make decisions based on that perception. Think of autonomous driving, for example, when it comes to pedestrian recognition. There’s all these studies coming out that show pedestrian detection works great in daylight but particularly poorly in low light. Other examples are cameras for industrial sorting, inspection and quality assurance. All these cameras are used for automated decision making. In sufficiently lit rooms or in daylight, they work well. But in low light conditions, especially when moving fast, problems arise.”
The company’s solution is to bypass the “still image” as the source of truth for computer vision and instead measure the individual photons that hit the imaging sensor directly. This can be done with a single-photon avalanche diode array (or SPAD array, to its friends). This raw data stream can then be fed into a field-programmable gate array (FPGA, a type of super-specialized processor) and further analyzed by computer vision algorithms.
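To make the idea concrete, here is a minimal sketch (in Python, and not Ubicept's actual pipeline) of what it means to treat a sensor's output as a stream of timestamped photon-detection events rather than finished frames. The event fields and the toy detector below are illustrative assumptions only.

```python
# Hypothetical sketch: a SPAD-like sensor emits individual photon detections,
# and downstream logic (standing in for the FPGA stage) reacts to each one
# without waiting for an exposure window to close.
from dataclasses import dataclass
import random

@dataclass
class PhotonEvent:
    x: int        # pixel column where the photon was detected
    y: int        # pixel row
    t_ns: float   # detection timestamp in nanoseconds

def spad_event_stream(width=32, height=32, rate_hz=1e6, duration_s=0.001):
    """Simulate a stream of photon detections with random arrival times."""
    t = 0.0
    while t < duration_s * 1e9:
        t += random.expovariate(rate_hz) * 1e9  # exponential inter-arrival times
        yield PhotonEvent(random.randrange(width), random.randrange(height), t)

for event in spad_event_stream():
    # Per-photon processing goes here, e.g. updating a running estimate of
    # brightness or motion for the affected pixel.
    pass
```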
The newly founded company showcased its technology at CES in Las Vegas in January and has some pretty bold plans for the future of computer vision.
“Our vision is to have the technology on at least 10% of cameras over the next five years and on at least 50% of cameras over the next 10 years,” Bauer projected. “When you detect each individual photon with very high temporal resolution, you are doing the best that nature allows you to do. And you see the benefits, like high-quality videos on our webpage, which just blow everything else away.”
TechCrunch saw the technology in action at a recent demo in Boston and wanted to explore how the technology works and what the implications are for computer vision and AI applications.
A new form of vision
Digital cameras typically work by capturing a single frame exposure, “counting” the number of photons that hit each of the sensor pixels over a period of time. At the end of the time period, all of those photons are added together, and you have a still photograph. If nothing in the image moves, that works just fine, but the “if nothing moves” part is a pretty big caveat, especially when it comes to computer vision. It turns out that when you try to use cameras to make decisions, everything moves all the time.
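A toy illustration (with assumed numbers, not real sensor data) shows what that summation throws away: a point of light moving across eight pixels during one exposure smears into every pixel of the resulting frame, while the raw photon events keep its trajectory intact.

```python
# Toy example: one photon lands on each of pixels 0..7 at successive times
# during a single 800 µs exposure, mimicking a moving point source.
import numpy as np

events = list(zip(range(8), np.linspace(0, 800, 8)))  # (pixel, time in µs)

# Conventional frame: add up all photons detected during the exposure.
frame = np.zeros(8, dtype=int)
for pixel, _ in events:
    frame[pixel] += 1
print("frame:", frame)  # [1 1 1 1 1 1 1 1] -> a motion-blurred streak

# Photon-level view: each detection keeps its timestamp, so the motion is recoverable.
for pixel, t_us in events:
    print(f"photon at pixel {pixel} at t = {t_us:.0f} µs")
```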
Of course, with the raw data, the company is still able to combine the photon stream into images, which creates beautifully crisp video with no motion blur. Perhaps more excitingly, dispensing with the idea of frames means the Ubicept team was able to take the raw data and analyze it directly. Here is a sample video of the dramatic difference this can make in practice: