In 2014, researchers from UC Berkeley published a paper that caught the attention of computer vision scientists around the world. They had built a deep learning model able to detect multiple objects in a video in real time, opening up a new spectrum of opportunities for this technology.
We want to apply this technology to help blind and visually impaired people navigate grocery shops and browse grocery products. This will be accomplished by an object detection AI model embedded in an Android/iOS app. To make the experience more seamless, we will also ship a camera earpiece device that plugs into your phone. The camera records a real-time image of your environment while you are in a shop. This image is analysed on your device by our AI model, and useful audio feedback is generated and played through the earpiece. At the beginning, the model will only be able to tell you the names of the products directly in front of you. Over time, however, we hope to extend the functionality and let you learn more about the products, for example the nutritional values of the products you are browsing, as well as product alternatives you may want to consider instead.
Current state of the project
The project is still in its research phase. We are building a model consisting of four neural networks working in sequence:
- Detects products and product labels, drawing bounding boxes around every product in an image (or a frame of a video), then passes this information to the next network.
- Localises letters and words within each box produced by the previous network.
- Recognises the words localised by the previous network.
- Based on the words detected on a product's label, finds the brand name those words are most likely associated with.
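The four stages above can be sketched as a simple chain of functions. The stage implementations below are placeholder stubs (all names and return values are illustrative, not our actual code); in the real system each stub would wrap a trained network:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Box:
    """Axis-aligned bounding box in pixel coordinates."""
    x1: int
    y1: int
    x2: int
    y2: int

def detect_products(image) -> List[Box]:
    # Stage 1 stub: would run the product detector (SSD300).
    return [Box(0, 0, 100, 50)]

def localise_words(image, box: Box) -> List[Box]:
    # Stage 2 stub: would run text localisation inside the product box.
    return [Box(5, 5, 40, 20)]

def recognise_word(image, word_box: Box) -> str:
    # Stage 3 stub: would run the word-recognition network on the crop.
    return "corn"

def match_brand(words: List[str]) -> Optional[str]:
    # Stage 4 stub: would map recognised words to the likeliest brand.
    return "ACME" if words else None

def pipeline(image) -> List[Tuple[Box, Optional[str]]]:
    """Run all four stages and pair each product box with a brand guess."""
    results = []
    for product_box in detect_products(image):
        words = [recognise_word(image, wb)
                 for wb in localise_words(image, product_box)]
        results.append((product_box, match_brand(words)))
    return results
```

The value of this structure is that each stage can be retrained or swapped independently, which is exactly what the experiments described below require.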
In order to train the first network, we have collected and labelled almost 10,000 images of grocery products. Initially we started with a Faster R-CNN (region-based convolutional neural network) model built on an Inception-v2 backbone. This was later changed to SSD300 (Single Shot MultiBox Detector) built on ResNet50, which is the model we are currently working with.
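A single-shot detector like SSD emits many overlapping candidate boxes per object in one forward pass; the standard final step is non-maximum suppression (NMS), which keeps only the best-scoring box for each object. A minimal, framework-free sketch of greedy NMS (boxes as `(x1, y1, x2, y2)` tuples; the threshold value is illustrative):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: repeatedly keep the highest-scoring box and drop
    every remaining box that overlaps it above the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```

For example, two near-duplicate detections of the same product collapse to the single higher-scoring one, while a box elsewhere in the image is kept.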
With over 20,000 brands in some grocery shops, it is impossible to detect every single brand using an object detection approach alone. The limiting factors are:
- memory constraints of the model,
- accuracy that diminishes as the number of classes grows,
- speed of detection.
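The memory constraint is easy to see with rough arithmetic: each SSD classification head is a convolution whose output channel count grows linearly with the number of classes. The channel and anchor counts below are illustrative defaults, not the exact values of our model:

```python
def ssd_head_params(num_classes, in_channels=512,
                    anchors_per_cell=6, kernel=3):
    """Approximate weight count of one SSD classification head:
    a k x k conv producing anchors * (num_classes + 1) channels
    (+1 for the background class); biases ignored for simplicity.
    in_channels and anchors_per_cell are illustrative assumptions."""
    out_channels = anchors_per_cell * (num_classes + 1)
    return kernel * kernel * in_channels * out_channels

for n in (100, 1000, 20000):
    print(n, ssd_head_params(n))
```

With 100 classes one head holds roughly 2.8 million weights; with 20,000 classes the same head balloons to over 550 million, and SSD has several such heads, one per feature-map scale. That is why detection alone cannot cover every brand.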
We therefore decided to complement our detection network with two more neural networks: one for letter and word localisation, and a second for word recognition. For localisation we use EAST ("An Efficient and Accurate Scene Text Detector"). We are still experimenting with different natural language processing networks for the word-recognition task.
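Because the recognition model is still being chosen, it helps to keep a thin interface between the two text stages: the localiser hands over word boxes, and any candidate recogniser is passed in as a plain function. A minimal sketch (simplified to axis-aligned boxes for illustration; EAST actually predicts rotated quadrilaterals, and `read_words`/`recognise` are hypothetical names):

```python
import numpy as np

def crop(image, box):
    """Cut a word region (x1, y1, x2, y2) out of an H x W image array."""
    x1, y1, x2, y2 = box
    return image[y1:y2, x1:x2]

def read_words(image, word_boxes, recognise):
    """Run an interchangeable recogniser over every localised word box.
    `recognise` is whichever model we are currently experimenting with,
    taking an image patch and returning a string."""
    return [recognise(crop(image, b)) for b in word_boxes]
```

Swapping recognition candidates then means changing only the `recognise` argument, leaving the localisation stage untouched.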