Vishwa Karia

Senior Software Engineer

Seattle, Washington, United States · 7 yrs 8 mos experience
Most Likely To Switch · AI Enabled

Key Highlights

  • Expert in building scalable AI training infrastructures.
  • Proven leadership in optimizing GPU resource management.
  • Passionate mentor empowering women in technology.
Stackforce AI infers this person is a SaaS-focused Software Engineer with expertise in AI and distributed systems.

Skills

Core Skills

Distributed Training · AI · HPC · Software Architecture · Leadership

Other Skills

Deep Learning · Micro-services · Cloud Computing (AWS) · Agile Development · Design Thinking · High Performance Computing (HPC) · Large Language Models (LLM) · C++ · Git · PyTorch · Python · TensorFlow · MPI · RDMA · Generative AI

About

I am a Software Engineer at Meta, working in the AI Infrastructure organization supporting various PyTorch-based frameworks for large-scale distributed training of personalization models. My interests lie at the intersection of Machine Learning and Distributed Systems, with a focus on building scalable distributed training experiences for customers. I am passionate about empowering women in technology and AI, and have been actively mentoring women and high-school students. I thrive on teams looking to leave the world better than we found it, and to do so in innovative, exciting ways. I am always looking to meet new people, so please feel free to reach out to me for a casual chat or for any collaboration opportunities! Skills: Deep Learning, AI, HPC, Distributed Training, Micro-services, Cloud Computing (AWS), Leadership, Software Architecture, Agile Development, Design Thinking, Mentorship

Experience

Total Experience: 7 yrs 8 mos
Average Tenure: 1 yr 2 mos
Current Experience: 2 yrs 4 mos

Meta

Senior Software Engineer

Dec 2023 - Present · 2 yrs 4 mos · Bellevue, Washington, United States · On-site

  • Building PyTorch-based training infrastructure for recommendation system models across Meta products such as Facebook, Instagram, and Ads, to improve the performance and scalability of AI model training. Managing fleets of GPU systems to train large-scale machine learning models efficiently and reach better model quality faster.
  • Pioneering the development of a GPU resource optimization system that identifies and terminates underutilized training jobs, with estimated reductions of 7k+ inefficient training jobs within 3 months and potential cost savings of $XXM.
  • Led the design and development of an automated system to deploy machine learning models to Meta’s version control systems, resolving persistent compatibility issues between model code and infrastructure code. Deployed the system to 100% of eligible training and inference models with zero downtime. Eliminated manual effort for issue resolution and reduced resolution time by 85%, resulting in revenue savings of $XXX M.
  • Spearheaded an organization-wide initiative to enhance pre-production testing processes, directing a team of 25 engineers in a comprehensive analysis and overhaul of existing testing infrastructure.
Deep Learning · AI · HPC · Distributed Training · Micro-services · Cloud Computing (AWS) · +4

Amazon Web Services (AWS)

Software Development Engineer II

Mar 2022 - Dec 2023 · 1 yr 9 mos · Seattle, Washington, United States

  • Spearheaded the design and implementation of an end-to-end automated security patching system for AWS SageMaker HyperPods. Provided technical leadership, defined the customer experience, and managed failure scenarios to minimize downtime, ensuring project success and meeting critical security goals.
  • Led a team of 3 engineers to reduce impact of infrastructure failures on training jobs when scaling to >1000 GPUs and reduce startup time from ~20 mins to <1 min for model parallelism use-cases.
  • As lead SDE, independently revamped the customer experience for launching distributed training jobs on SageMaker. Designed and implemented a feature that lets customers launch torchrun-based jobs on AWS SageMaker using the SageMaker Python SDK with no code changes to their training script.
  • Contributed to the Amazon SageMaker Distributed Data Parallelism (SMDDP) library for large scale data parallel distributed deep learning model training with TensorFlow and PyTorch.
  • Advised Amazon SageMaker customers such as Torc Robotics, Rivian, and LightOn on using SMDDP to scale training workloads for 1B – 175B parameter models on clusters of more than 3K GPUs.
High Performance Computing (HPC) · Large Language Models (LLM) · C++ · Git · PyTorch · Python · +8

AI for Good Foundation

Council Member

Jan 2022 - Jul 2022 · 6 mos · Global

  • The Council for Good is a group of AI innovators, policy makers, and social change-makers dedicated to using technology to advance the United Nations’ SDGs.
Public Speaking · Social Outreach

Amazon

3 roles

Software Development Engineer II

Promoted

Oct 2021 - Mar 2022 · 5 mos · Greater Seattle Area

  • Part of the Kindle Manga team, responsible for designing and building systems that support content ingestion from publishers
  • Built a plagiarism detection system for a new category of Manga books
  • Led collaboration with product managers and 6 partner teams to design and implement a solution that excludes books from certain devices
  • Actively mentored new developers and interns on the team. In addition, served as a Scrum Leader for my team.

Software Development Engineer

Feb 2020 - Oct 2021 · 1 yr 8 mos · Greater Seattle Area

Software Development Intern

Jun 2019 - Sep 2019 · 3 mos · Greater New York City Area

  • Developed a serverless web application for the Kindle Comics Team that supports operational use cases for maintaining and troubleshooting data stored in DynamoDB
  • Technologies used: AWS Lambda, AWS DynamoDB, AWS API Gateway, Java, React Redux

Built By Girls

Mentor

Oct 2020 - Oct 2021 · 1 yr

  • Mentoring young women to prepare the next generation of female leaders to boldly step into careers powered by technology.

University of California, Los Angeles

Graduate Teaching Assistant

Apr 2019 - Jun 2019 · 2 mos · Greater Los Angeles Area

  • Worked as a Graduate Teaching Assistant for the graduate course 'Machine Learning for Economics' (Spring 2019) at UCLA. My responsibilities included:
      • Familiarizing the students with the use of Python for Machine Learning
      • Holding office hours to help them with concepts and questions
      • Helping the Professor create and grade assignments, projects, and exams
Git

Center for SMART Health, UCLA

Graduate Student Researcher

Jan 2019 - Dec 2019 · 11 mos · Greater Los Angeles Area

  • I worked under the guidance of Prof. Ramin Ramezani on the problem of low classification accuracy of the minority class in imbalanced health-related image datasets.
  • Developed a Genetic algorithm that uses a combination of data-driven and algorithmic approaches to effectively handle this imbalance by generating synthetic data
  • Achieved a higher F-score when performing classification on 8 out of 9 real-world datasets, with an improvement of ~1%
Mentoring · Intercultural Communication · Community Outreach

Samsung R&D Institute India - Bangalore Private Limited

Summer Intern

May 2017 - Jul 2017 · 2 mos · Bangalore, India

  • Interned in the Client Group of the Intelligent Services Team
  • Wrote grammar files and Python scripts based on Deep Neural Networks to generate training data for the model behind Bixby, Samsung’s virtual assistant
  • Developed a client-server based Android Application using WebSocket

Technovanza

General Secretary

May 2016 - Dec 2016 · 7 mos · Mumbai Area, India

  • Led a team of over 200 students at Technovanza, VJTI, mainly handling Marketing, Public Relations, Events and Media

Education

UCLA

Master of Science - MS

Jan 2018 - Jan 2020

Veermata Jijabai Technological Institute (VJTI)

Bachelor of Technology - BTech, Computer Engineering

Jan 2014 - Jan 2018

Ramnarain Ruia College

Higher Secondary School

Jan 2012 - Jan 2014
