Vasu Sharma

AI Researcher

Sunnyvale, California, United States · 8 yrs 7 mos experience

Key Highlights

  • Published 80+ papers with over 9000 citations.
  • Expert in multimodal generative AI and self-supervised learning.
  • Advises startups on AI strategies and product development.

Skills

Core Skills

Generative AI · Multimodal Machine Learning · Deep Learning · Machine Learning · AI Strategies

Other Skills

Algorithm Design · Algorithms · Audio Processing · Automatic Speech Recognition Systems · C · C++ · Caffe · Computer Vision · Data Mining · Data Structures · HTML · Image Processing · Large Language Models (LLM) · Matlab · NLP

About

I am presently working as an Applied Research Scientist at Facebook AI Research (FAIR), building multimodal foundational generative AI models. I am also interested in the domain of self-supervised learning. I have published 80+ papers across top AI conferences including NeurIPS, CVPR, ACL, EMNLP, TMLR, ICLR, NAACL, COLM, WACV, and Interspeech, garnering over 9000 citations. I routinely work with billion-scale datasets to train these massive multimodal models.

Previously I worked as a Quantitative Researcher at Citadel, where I leveraged machine learning and statistical methods to model the enigmatic world of the financial markets. I have also worked at Amazon Alexa AI on large-scale multimodal models and Embodied AI applications to bring smart robot intelligence to Alexa devices. I actively advise several early-stage startups and often guest lecture at Stanford and CMU.

I graduated from the Indian Institute of Technology, Kanpur with a Bachelor's in Computer Science and Engineering, and then completed my Master's in Machine Learning and Artificial Intelligence at the Language Technologies Institute at Carnegie Mellon University.

I am deeply passionate about research in my field. My research interests include deep learning and its uses in computer vision, speech and music processing, and natural language processing. My goal in life is to use technology to make this world a better place for everyone to live in, and it is with this goal in mind that I work on several interesting projects which help me realize this dream, one step at a time.

I have had the good fortune of working with some amazing people at some fantastic places and have learnt a lot from them. I hope to continue learning, travelling to new places, and meeting new people. My mantra in life is: "Live life with passion - love what you do, do what you love."

Besides being a technology enthusiast, I am also very passionate about sports. I was part of the IIT Kanpur Aquatics team and love to swim and play water polo, soccer, and cricket. I am also an ardent traveller and love exploring the world, discovering new places and cultures, and making new friends along the way.

Experience

Algoverse

AI Research Director

Jan 2024 – Jan 2025 · 1 yr · Remote

  • Led the development of a cutting-edge AI program to empower students with industry-relevant skills, leveraging top AI research experience from leading labs.
  • Implemented strategies to enhance students' prospects for admission to top universities and successful careers in the tech industry.
  • Collaborated with teams to create unparalleled opportunities in AI education, nurturing future innovators in the field.
  • Published papers:
  • FrontierScience Bench: Evaluating AI Research Capabilities in LLMs (ICML: REALM 2025)
  • Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning: https://arxiv.org/pdf/2505.00001
  • FaceSafe: An Inpainting Pipeline for Privacy-Compliant Scalable Image Datasets (ICML 2025 : DIG-BUGS)
  • COREVQA: A Crowd Observation and Visual Entailment Visual Question Answering Benchmark (ICML 2025 : DIG-BUGS)
  • Pause-Tuning for Long-Context Comprehension: A Lightweight Approach to LLM Attention Recalibration (ICML 2025: LCFM 2025)
  • NovelHopQA: Diagnosing Multi-Hop Reasoning Failures in Long Narrative Contexts (ICML 2025: LCFM 2025)
  • Rewrite-to-Rank: Optimizing Ad Visibility via Retrieval-Aware Text Rewriting (ICML 2025: MoFA 2025)
  • TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models: https://arxiv.org/pdf/2503.11656
  • Deconstructing bias: A multifaceted framework for diagnosing cultural and compositional inequities in text-to-image generative models: https://arxiv.org/pdf/2505.01430
  • Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language: https://aclanthology.org/2025.americasnlp-1.4.pdf

Meta

Applied Research Scientist Lead

Aug 2022 – Present · 3 yrs 7 mos · Menlo Park, California, United States · On-site

  • Working with Facebook AI Research (FAIR) on large-scale multimodal foundational models trained on trillion-scale datasets. Particularly interested in generative AI research and production use cases, and in exploring the realm of self-supervised learning.
  • Published papers:
  • DINOv2: Learning Robust Visual Features without Supervision (Published at TMLR ) https://dinov2.metademolab.com/
  • Chameleon: Mixed-Modal Early-Fusion Foundation Models (https://about.fb.com/news/2024/06/releasing-new-ai-research-models-to-accelerate-innovation-at-scale/)
  • Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning (CM3Leon) (https://ai.meta.com/blog/generative-ai-text-images-cm3leon/)
  • Demystifying CLIP Data (MetaCLIP) (Published at ICLR 2024): (https://github.com/facebookresearch/MetaCLIP)
  • Mavil: Masked audio-video learners (Published at NeurIPS 2023): (https://ar5iv.labs.arxiv.org/html/2212.08071)
  • A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions (Published at CVPR 2024): (https://openaccess.thecvf.com/content/CVPR2024/papers/Urbanek_A_Picture_is_Worth_More_Than_77_Text_Tokens_Evaluating_CVPR_2024_paper.pdf)
  • Seamless Interaction (https://ai.meta.com/research/publications/seamless-interaction-dyadic-audiovisual-motion-modeling-and-large-scale-dataset/)
  • FLAP: Fast Language-Audio Pre-training (Published at ASRU 2023) (https://arxiv.org/abs/2311.01615)
  • An Introduction to Vision-Language Modeling (https://arxiv.org/abs/2405.17247)
  • Text Quality-Based Pruning for Efficient Training of Language Models (https://arxiv.org/pdf/2405.01582) and DMLR 2025
  • Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM (Published at COLM 2024) (https://arxiv.org/abs/2403.07816)
Generative AI · Large Language Models (LLM) · Natural Language Processing (NLP) · Multimodal Machine Learning · Deep Learning · Computer Vision

Freelance

Startup Advisor

Jan 2022 – Present · 4 yrs 2 mos

  • Advised multiple startups and VCs on technical and AI strategies for scaling from ideation to series A/B/C stages.
  • Designed AI infrastructure and built AI native products to drive customer acquisition and growth.
  • Facilitated connections with VCs and assisted in hiring top talent to support company expansion.
AI Strategies · Technical Advising · Startup Growth

Citadel LLC

3 roles

Quantitative Researcher (Non-compete)

Aug 2021 – Aug 2024 · 3 yrs

Quantitative Researcher

Jan 2019 – Aug 2021 · 2 yrs 7 mos

Research Intern

May 2018 – Aug 2018 · 3 mos · Greater Chicago Area

  • Used machine learning and deep learning techniques to better model financial time-series data, ensuring the algorithms scale to an arbitrary number of input features.

Amazon Lab126

Applied Scientist

Aug 2021 – Aug 2022 · 1 yr · Sunnyvale, California, United States · On-site

  • Working with Alexa AI on a variety of problems, including:
  • A dialog-enabled visual-language navigation bot that leverages multimodal data sources to faithfully navigate a virtual environment based on user instructions. Created a new benchmark for visual-language navigation as part of the Alexa Prize SimBot challenge and designed benchmark models for it.
  • Designing efficient multimodal transformers to speed up training and deployment by improving the computational complexity of the self-attention mechanism.
  • Video processing applications such as video action recognition, video question answering, video summarization, and moment retrieval, working directly with compressed video streams.
  • Created a benchmark for a cooperative, heterogeneous multi-agent reinforcement learning platform, including open-sourcing the collected dataset and its benchmark models.
  • Working on a massively multimodal transformer pipeline capable of handling a wide range of input modalities with modality-agnostic transformer blocks that work well across several tasks leveraging a multitude of modalities.
  • Published papers:
  • Alexa Arena: A User-Centric Interactive Platform for Embodied AI (Published at NeurIPS 2023) (https://www.amazon.science/publications/alexa-arena-a-user-centric-interactive-platform-for-embodied-ai)
  • Alexa, play with robot: Introducing the first Alexa Prize SimBot Challenge on embodied AI (https://www.amazon.science/alexa-prize/proceedings/alexa-play-with-robot-introducing-the-first-alexa-prize-simbot-challenge-on-embodied-ai)
  • CHMARL: A Multimodal Benchmark for Cooperative, Heterogeneous Multi-Agent Reinforcement Learning (Published at RSS 2022) (https://www.amazon.science/publications/chmarl-a-multimodal-benchmark-for-cooperative-heterogeneous-multi-agent-reinforcement-learning)
  • ε-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer (Published at WACV 2024) (https://arxiv.org/abs/2311.17267)

Carnegie Mellon University

2 roles

Graduate Research Assistant

Aug 2018 – Aug 2019 · 1 yr · Multicomp Lab, Language Technologies Institute, School of Computer Science

  • I worked with Prof. Louis-Philippe Morency on a wide range of projects related to Multimodal Machine Learning and on building robust, explainable deep learning models.
  • We designed adversarial attack mechanisms on Visual Question Answering models to identify their vulnerabilities, then proposed a variety of robust training mechanisms to fix them.
  • We developed a neural network model which uses a Deep Convolutional Neural Network based pipeline alongside a geometrically conditioned point distribution model for Facial Landmark Detection.
  • We also developed the first fully ecologically validated models of visual perception, combining intracranial EEG (iEEG) recordings captured during long stretches of natural visual behavior with cutting-edge computer vision, machine learning, and statistical analyses to understand the neural basis of natural, real-world visual perception.
  • We also explored facial expression recognition in extreme scenarios such as profile views, occluded faces, and non-centric and rotated faces, alongside recognition across gender-, age-, and racially diverse faces.

Graduate Research Assistant

Aug 2017 – May 2018 · 9 mos · Articulab, Language Technologies Institute, School of Computer Science

  • I worked on the SARA and the Yahoo! InMind projects at the ArticuLab which focus on building a socially aware robotic assistant. My primary focus was on trying to combine the user’s visual, vocal and verbal cues to better gauge the ‘rapport’ between the user and the conversational agent and using it to enable the agent to become socially more aware to the user’s emotional needs.

Localite

Head of AI

May 2017 – May 2018 · 1 yr · Los Angeles Metropolitan Area · Remote

  • Founding team at Localite Inc - a tours and activities marketplace for connecting people with local tour agencies and local tour guides. Raised $200k in pre-seed funding.
  • Implemented the core recommendation systems, feed ranking and search retrieval systems.

EPFL (École Polytechnique Fédérale de Lausanne)

Research Intern

May 2017 – Jul 2017 · 2 mos · Lausanne, Vaud, Switzerland · On-site

  • Worked on learning unsupervised document embeddings using a Continuous Bag of Words model implemented in a deep convolutional neural network framework, and used transfer-learning techniques to apply these generalized embeddings across domains, demonstrating improved performance on a wide array of tasks such as similarity matching and sentiment analysis.

Abzooba

Research Consultant

Aug 2016 – Jul 2017 · 11 mos · Milpitas, California, United States · Remote

  • Worked on building "A Smart E-commerce Virtual Assistant".
  • Implemented features such as cloth parsing from images, similar-image retrieval from a huge fashion catalogue, a state-of-the-art deep recommender system, and a multi-turn conversational voice agent to facilitate user interaction.
  • "Query-based document retrieval": learned rich semantic document embeddings using a deep LSTM pipeline and used these to match queries to relevant documents.
  • "Abstractive summarization using attention-based encoder-decoder networks": built a deep residual LSTM pipeline which used temporal attention over both encoder and decoder networks to generate abstractive summaries of documents.

University of Toronto

Research Intern

May 2016 – Jul 2016 · 2 mos · Greater Toronto Area, Canada · On-site

  • Research intern with the Computer Vision and Machine Learning group, working with Raquel Urtasun and Sanja Fidler in Geoffrey Hinton's lab.
  • Worked on instance and semantic segmentation from videos, with direct applications in autonomous driving and video surveillance.
  • Implemented a two-stream network combining base segmentation masks generated by deep convolutional-deconvolutional neural networks with optical-flow information obtained by implementing FlowNet (based on deep CNNs), achieving improved performance on the video semantic segmentation task.

Xerox

2 roles

Research Intern

Jan 2015 – Jan 2015 · 0 mo

  • Worked as a research intern with the Computer Vision team at Xerox Research Europe, working on building deep learning frameworks for large scale object recognition.

Research Intern

Jan 2015 – Jan 2015 · 0 mo

  • Worked with the Speech and Signal processing team at XRCI to create Deep Neural Network based Speech Recognition systems.
  • Worked on 3 projects during this internship: "Application of Deep Learning for Automatic Speech Recognition", "A Comprehensive Analysis of Activation Functions in Deep Nets", and "A New Hashing Technique to Enhance Deep Net Performance". Received the Best Project award for this work.
  • The projects primarily focused on constructing deep learning frameworks for speech recognition. The internship provided me with extensive research and coding experience in efficiently training deep nets.

Carnegie Mellon University

2 roles

Research Intern

May 2014 – Jul 2014 · 2 mos · Pittsburgh, Pennsylvania, United States · On-site

  • Worked on exploring applications of deep learning to audio and speech signal processing, particularly the use of Gated Recurrent Neural Networks for denoising speech signals.

Winter Intern

Dec 2013 – Dec 2013 · 0 mo

  • Completed 2 projects:
  • 1. Analyzing Newspaper Crime Reports for Identification of Safe Transit Paths
  • 2. Automatic Image Summarization using Topic Modelling

Education

Carnegie Mellon University

Master's in Artificial Intelligence (MLT)

Indian Institute of Technology, Kanpur

Bachelor of Technology (B.Tech.) — Computer Science and Engineering

St. Columba's School, New Delhi

Class XII — Computer Science
