A: Mountain View, California, United States

Google Develops A Neural Image Caption Generator to Translate Images into Words

11/17/2014

In November 2014 Oriol Vinyals and colleagues at Google in Mountain View developed a neural image caption generator to translate images into words, building on the vector space mathematics that Google had previously used to transform machine translation. Technology Review described Google's machine translation approach as:

"essentially to count how often words appear next to, or close to, other words and then define them in an abstract vector space in relation to each other. This allows every word to be represented by a vector in this space and sentences to be represented by combinations of vectors.

"Google goes on to make an important assumption. This is that specific words have the same relationship to each other regardless of the language. For example, the vector “king - man + woman = queen” should hold true in all languages. . . .

"Now Oriol Vinyals and pals at Google are using a similar approach to translate images into words. Their technique is to use a neural network to study a dataset of 100,000 images and their captions and so learn how to classify the content of images.

"But instead of producing a set of words that describe the image, their algorithm produces a vector that represents the relationship between the words. This vector can then be plugged into Google’s existing translation algorithm to produce a caption in English, or indeed in any other language. In effect, Google’s machine learning approach has learnt to “translate” images into words.

"To test the efficacy of this approach, they used human evaluators recruited from Amazon’s Mechanical Turk to rate captions generated automatically in this way along with those generated by other automated approaches and by humans.

"The results show that the new system, which Google calls Neural Image Caption, fares well. Using a well known dataset of images called PASCAL, Neural image Capture clearly outperformed other automated approaches. “NIC yielded a BLEU score of 59, to be compared to the current state-of-the-art of 25, while human performance reaches 69,” says Vinyals and co" (http://www.technologyreview.com/view/532886/how-google-translates-pictures-into-words-using-vector-space-mathematics/, accessed 01-14-2015).
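The word-vector arithmetic described in the passage above can be illustrated with a minimal sketch. The three-dimensional vectors below are invented for illustration (real systems learn vectors with hundreds of dimensions from word co-occurrence statistics), and the function name `analogy` is hypothetical rather than part of any Google system.

```python
import math

# Toy embeddings, invented for illustration only.
# Hypothetical axes: royalty, maleness, femaleness.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors=VECTORS):
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman"))  # prints "queen" for these toy vectors
```

With these toy values, "king - man + woman" lands closest to "queen" by cosine similarity, which is the relationship the quoted article assumes holds across languages.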

Vinyals et al., "Show and Tell: A Neural Image Caption Generator" (2014) http://arxiv.org/pdf/1411.4555v1.pdf

"Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU score improvements on Flickr30k, from 55 to 66, and on SBU, from 19 to 27" (Abstract).
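To give a sense of what a BLEU score measures, here is a simplified sketch. It assumes a unigram-only variant (clipped word precision multiplied by a brevity penalty, scaled to 0-100); the abstract does not say which BLEU variant or tokenization the authors used, so the function `bleu1` and the example captions are illustrative assumptions, not the paper's evaluation code.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Simplified unigram BLEU on a 0-100 scale (illustrative, not the paper's code)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each word, the most times it appears in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, n in Counter(ref).items():
            max_ref[word] = max(max_ref[word], n)

    # Clipped precision: a candidate word is credited at most max_ref[word] times.
    clipped = sum(min(n, max_ref[word]) for word, n in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty, using the reference length closest to the candidate length.
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) >= ref_len else math.exp(1 - ref_len / len(cand))
    return 100.0 * bp * precision


score = bleu1(
    "a dog runs on the beach",
    ["a dog is running on the grass", "the dog runs across the grass"],
)
print(round(score, 1))  # 83.3: five of the six candidate words appear in a reference
```

Higher scores mean the generated caption shares more words with the human-written reference captions, which is why the abstract notes "the higher the better".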
