VinVL: Advancing the state of the art for vision-language models

01/15/2021 | 01:10pm EST

Humans understand the world by perceiving and fusing information from multiple channels, such as images viewed by the eyes, voices heard by the ears, and other forms of sensory input. One of the core aspirations in AI is to develop algorithms that endow computers with a similar ability: to effectively learn from multimodal data like vision-language to make sense of the world around us. For example, vision-language (VL) systems allow searching the relevant images for a text query (or vice versa) and describing the content of an image using natural language.

As illustrated in Figure 1, a typical VL system uses a modular architecture with two modules to achieve VL understanding:

  • An image encoding module, also known as a visual feature extractor, is implemented using convolutional neural network (CNN) models to generate feature maps of the input image. Before our work, the most popular choice was a CNN-based object detection model trained on the Visual Genome (VG) dataset.
  • A vision-language fusion module maps the encoded image and text into vectors in the same semantic space so that their semantic similarity can be computed using the cosine similarity of their vectors. The module is typically implemented using a Transformer-based model, such as OSCAR.
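
To make the second module's output concrete, here is a minimal sketch of cosine-based matching between one text vector and several image vectors. The three-dimensional embeddings, the image names, and the helper names `cosine_similarity` and `rank_images` are illustrative assumptions; a real fusion module would produce much higher-dimensional vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_images(text_vec, image_vecs):
    """Return image ids sorted from most to least similar to the text vector."""
    scores = {
        image_id: cosine_similarity(text_vec, vec)
        for image_id, vec in image_vecs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical embeddings standing in for the output of a fusion module.
text_query = [1.0, 0.0, 1.0]
images = {
    "dog_on_grass": [0.9, 0.1, 0.8],
    "city_skyline": [0.0, 1.0, 0.1],
}
ranking = rank_images(text_query, images)
print(ranking)
```

The same ranking function works in the other direction (retrieving text for an image query) by swapping which side supplies the query vector.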

Recently, vision-language pretraining (VLP) has made great progress in improving the vision-language fusion module by pretraining it on a large-scale paired image-text corpus. The most representative approach is to train large Transformer-based models on massive image-text pair data in a self-supervised manner, for example, by predicting masked elements based on their context. The pretrained vision-language fusion model can then be fine-tuned to adapt to various downstream vision-language tasks. However, existing VLP methods treat the image encoding module as a black box and have left visual features unimproved since the development of the classical bottom-up region features in 2017, even though there has been much research progress on improving image encoding and object detection.
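
As a rough illustration of the masked-element objective mentioned above, the sketch below builds one training example by hiding some tokens and recording what the model would have to predict. The `[MASK]` symbol, the 15% default masking rate, and the function name `mask_tokens` are assumptions borrowed from common masked-language-model practice, not details given in this article.

```python
import random

MASK = "[MASK]"  # assumed placeholder symbol


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly hide tokens; return the masked sequence and the hidden targets.

    `targets` maps each masked position to the original token that a model
    would be trained to recover from the surrounding context.
    """
    rng = random.Random(seed)
    masked = []
    targets = {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


# A caption followed by hypothetical detected object tags for the same image.
tokens = ["a", "dog", "sits", "on", "the", "grass", "[SEP]", "dog", "grass"]
masked, targets = mask_tokens(tokens, mask_prob=0.5, seed=0)
print(masked)
print(targets)
```

In an actual VLP setup the masked sequence would be fed to the Transformer together with the image region features, so the model can use visual context to fill in the hidden words.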

Here, we introduce recent Microsoft work on improving the image encoding module. Researchers from Microsoft have developed a new object-attribute detection model for image encoding, dubbed VinVL (Visual features in Vision-Language), and performed a comprehensive empirical study to show that visual features matter significantly in VL models. Combining VinVL with state-of-the-art VL fusion modules such as OSCAR and VIVO, the Microsoft VL system sets a new state of the art on all seven major VL benchmarks, achieving the top position on the most competitive VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, and Novel Object Captioning (nocaps). Most notably, the Microsoft VL system significantly surpasses human performance on the nocaps leaderboard in terms of CIDEr (92.5 vs. 85.3).

Microsoft will release the VinVL model and the source code to the public. Please refer to the research paper and GitHub repository. In addition, VinVL is being integrated into the Azure Cognitive Services, powering a wide range of multimodal scenarios (such as Seeing AI, Image Captioning in Office and LinkedIn, and others) to benefit millions of users through the Microsoft AI at Scale initiative.

VinVL: A generic object-attribute detection model

As opposed to classical computer vision tasks such as object detection, VL tasks require understanding more diverse visual concepts and aligning them with corresponding concepts in the text modality. On one hand, most popular object detection benchmarks (such as COCO, Open Images, Objects365) contain annotations for up to 600 object classes, mainly focusing on objects with a well-defined shape (such as car, person) but missing visual objects occupying amorphous regions (such as grass, sky), which are typically useful for describing an image. The limited and biased object classes make these object detection datasets insufficient for training very useful VL understanding models for real-world applications. On the other hand, although the VG dataset has annotations for more diverse and unbiased object and attribute classes, it contains only 110,000 images and is statistically too small to learn a reliable image encoding model.

To train our object-attribute detection model for VL tasks, we constructed a large object detection dataset containing 2.49M images for 1,848 object classes and 524 attribute classes by merging four public object detection datasets: COCO, Open Images, Objects365, and VG. As most datasets do not have attribute annotations, we adopted a pretraining and fine-tuning strategy to build our object-attribute detection model. We first pretrained an object detection model on the merged dataset, and then fine-tuned the model with an additional attribute branch on VG, making it capable of detecting both objects and attributes. The resulting object-attribute detection model is a Faster-RCNN model with 152 convolutional layers and 133M parameters, the largest image encoding model reported for VL tasks.
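
One practical step in merging detection datasets is mapping their separate class lists onto a single label vocabulary. The sketch below shows one simple way to do that, assuming classes are matched by normalized name; the article does not describe the actual merging procedure, and the dataset names, class lists, and helper functions here are hypothetical toy examples.

```python
def normalize(name):
    """Lowercase a class name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def build_unified_vocabulary(datasets):
    """Assign one integer id per distinct normalized class name.

    `datasets` maps a dataset name to its list of class names. Ids are
    assigned in first-seen order, so the result is deterministic.
    """
    vocab = {}
    for class_names in datasets.values():
        for name in class_names:
            key = normalize(name)
            if key not in vocab:
                vocab[key] = len(vocab)
    return vocab


def remap_annotations(annotations, vocab):
    """Rewrite (image_id, class_name, box) tuples to use unified class ids."""
    return [
        (image_id, vocab[normalize(class_name)], box)
        for image_id, class_name, box in annotations
    ]


# Toy class lists, not the real label sets of any public dataset.
toy_datasets = {
    "dataset_a": ["person", "car"],
    "dataset_b": ["Person", "grass", "sky"],
}
vocab = build_unified_vocabulary(toy_datasets)
remapped = remap_annotations([("img_1", "Person", (0, 0, 10, 20))], vocab)
print(vocab)
print(remapped)
```

Matching by name alone would miss synonyms and differing label granularity, so a real merge would likely need additional curation.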

Our object-attribute detection model can detect 1,594 object classes and 524 visual attributes. As a result, the model can detect and encode nearly all the semantically meaningful regions in an input image, according to our experiments. As illustrated in Figure 2, compared with detections of a classical object detection model (left), our model (right) can detect more visual objects and attributes in an image and encode them with richer visual features, which are crucial for a wide range of VL tasks.

State-of-the-art performance on vision-language tasks

Since the image encoding module is fundamental to VL systems, as illustrated in Figure 1, our new image encoding can be used together with many existing VL fusion modules to improve the performance of VL tasks. For example, as reported in Table 1, by simply replacing the visual features produced by the popular bottom-up model with the ones produced by our model, while keeping the VL fusion module (for example, OSCAR and VIVO) intact [1], we observe significant improvement on all seven established VL tasks, often outperforming previous SoTA models by a large margin.

[1] Note that we still perform training for the VL fusion module, but use the same model architecture, training data, and training recipe.
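
The modular design is what makes this swap possible: the fusion module only sees region features, so the encoder behind them can change without touching the fusion code. The sketch below illustrates that idea with placeholder components. The class `VLSystem`, both encoder functions, and `toy_fusion` are hypothetical stand-ins, not the actual VinVL, bottom-up, OSCAR, or VIVO implementations.

```python
from typing import Callable, Dict, List

Features = List[List[float]]  # one feature vector per detected region


def baseline_encoder(image_path: str) -> Features:
    """Hypothetical stand-in for an older region feature extractor."""
    return [[0.1, 0.2], [0.3, 0.4]]


def richer_encoder(image_path: str) -> Features:
    """Hypothetical stand-in for an encoder that detects more regions."""
    return [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]


def toy_fusion(regions: Features, tokens: List[str]) -> Dict[str, int]:
    """Placeholder fusion step that only reports the size of its inputs."""
    return {"num_regions": len(regions), "num_tokens": len(tokens)}


class VLSystem:
    """Two-module pipeline: image encoding, then vision-language fusion."""

    def __init__(
        self,
        image_encoder: Callable[[str], Features],
        fusion: Callable[[Features, List[str]], Dict[str, int]],
    ) -> None:
        self.image_encoder = image_encoder
        self.fusion = fusion

    def run(self, image_path: str, text: str) -> Dict[str, int]:
        regions = self.image_encoder(image_path)
        return self.fusion(regions, text.split())


# Same fusion module, different image encoders.
old_system = VLSystem(baseline_encoder, toy_fusion)
new_system = VLSystem(richer_encoder, toy_fusion)
print(old_system.run("example.jpg", "a dog on grass"))
print(new_system.run("example.jpg", "a dog on grass"))
```

As the footnote notes, the fusion module is still retrained on the new features; only its architecture, data, and recipe stay fixed.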

To account for parameter efficiency, we compare models of different sizes in Table 2. Our base model outperforms previous large models on most tasks, indicating that with better image encoding the VL fusion module can be much more parameter efficient.

Our new VL models, which use the new object-attribute detection model as their image encoding module and OSCAR as their VL fusion module, sit comfortably atop several AI benchmarks as of December 31, 2020, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, and Novel Object Captioning (nocaps). Most notably, our VL model's performance on nocaps substantially surpasses human performance in terms of CIDEr (92.5 vs. 85.3). On the GQA benchmark, our model is also the first VL model to outperform NSM, which contains some sophisticated reasoning components deliberately designed for that specific task.

Looking forward

VinVL has demonstrated great potential in improving image encoding for VL understanding. Our newly developed image encoding model can benefit a wide range of VL tasks, as illustrated by examples in this paper. Despite the promising results we obtained, such as surpassing human performance on image captioning benchmarks, our model is by no means reaching human-level intelligence in VL understanding. Interesting directions for future work include: (1) further scaling up the object-attribute detection pretraining by leveraging massive image classification/tagging data, and (2) extending the methods of cross-modal VL representation learning to build perception-grounded language models that can ground visual concepts in natural language and vice versa, as humans do.

Acknowledgments: This research was conducted by Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Additional thanks go to the Microsoft Research Service Engineering Group for providing computer resources for large-scale modeling. The baseline models used in our experiments are based on the open-source code released in the GitHub repository; we thank all the authors who made their code public, which tremendously accelerated our project's progress.


Microsoft Corporation published this content on 15 January 2021 and is solely responsible for the information contained therein. Distributed by Public, unedited and unaltered, on 15 January 2021 18:09:01 UTC

© Publicnow 2021