Vishwa Karia

Senior Software Engineer

Seattle, Washington, United States · 7 yrs 8 mos experience

Most Likely To Switch · AI Enabled

Key Highlights

Expert in building scalable AI training infrastructures.
Proven leadership in optimizing GPU resource management.
Passionate mentor empowering women in technology.

Stackforce AI infers this person is a SaaS-focused Software Engineer with expertise in AI and distributed systems.

Skills

Core Skills

Distributed Training, AI, HPC, Software Architecture, Leadership

Other Skills

Deep Learning, Micro-services, Cloud Computing (AWS), Agile Development, Design Thinking, High Performance Computing (HPC), Large Language Models (LLM), C++, Git, PyTorch, Python, TensorFlow, MPI, RDMA, Generative AI

About

I am a Software Engineer at Meta, working in the AI Infrastructure organization supporting various PyTorch-based frameworks for large-scale distributed training of personalization models. My interests lie at the intersection of Machine Learning and Distributed Systems, with a focus on building scalable distributed training experiences for customers. I am passionate about empowering women in technology and AI, and have been actively mentoring women and high-school students. I thrive on teams looking to leave the world better than we found it, and to do so in innovative, exciting ways. I am always looking to meet new people, so please feel free to reach out to me for a casual chat or for any collaboration opportunities!

Skills: Deep Learning, AI, HPC, Distributed Training, Micro-services, Cloud Computing (AWS), Leadership, Software Architecture, Agile Development, Design Thinking, Mentorship

Experience

Total Experience: 7 yrs 8 mos

Average Tenure: 1 yr 2 mos

Current Experience: 2 yrs 4 mos

Amazon Web Services (AWS)

Software Development Engineer II

Mar 2022 – Dec 2023 · 1 yr 9 mos · Seattle, Washington, United States

Spearheaded the design and implementation of an end-to-end automated security patching system for AWS SageMaker HyperPods. Provided technical leadership, defined the customer experience, and managed failure scenarios to minimize downtime, ensuring project success and meeting critical security goals.
Led a team of 3 engineers to reduce the impact of infrastructure failures on training jobs when scaling to >1000 GPUs and to cut startup time from ~20 mins to <1 min for model parallelism use cases.
As lead SDE, independently revamped the customer experience for launching distributed training jobs on SageMaker. Designed and implemented a feature that lets customers launch torchrun-based jobs on AWS SageMaker using the SageMaker Python SDK with no code changes to their training script.
Contributed to the Amazon SageMaker Distributed Data Parallelism (SMDDP) library for large-scale, data-parallel distributed deep learning model training with TensorFlow and PyTorch.
Counseled Amazon SageMaker customers like Torc Robotics, Rivian and LightOn to help them use SMDDP and scale training workloads for 1B – 175B parameter models on >3K GPU clusters.

High Performance Computing (HPC), Large Language Models (LLM), C++, Git, PyTorch, Python (+8 more)

AI for Good Foundation

Council Member

Jan 2022 – Jul 2022 · 6 mos · Global

The Council for Good is a group of AI innovators, policy makers and social change-makers dedicated to using technology to advance the United Nations' SDGs.

Public Speaking, Social Outreach

Amazon

3 roles

Software Development Engineer II

Promoted

Oct 2021 – Mar 2022 · 5 mos · Greater Seattle Area

Part of the Kindle Manga team, responsible for designing and building systems that support content ingestion from publishers
Built a plagiarism detection system for a new category of Manga books
Led collaboration with product managers and 6 partner teams to design and implement a solution that excludes books from certain devices
Actively mentored new developers and interns on the team. In addition, served as a Scrum Leader for my team.

Software Development Engineer

Feb 2020 – Oct 2021 · 1 yr 8 mos · Greater Seattle Area

Software Development Intern

Jun 2019 – Sep 2019 · 3 mos · Greater New York City Area

Developed a serverless web application for the Kindle Comics Team that supports operational use cases for maintaining and troubleshooting data stored in DynamoDB
Technologies used: AWS Lambda, AWS DynamoDB, AWS API Gateway, Java, React Redux

Built By Girls

Mentor

Oct 2020 – Oct 2021 · 1 yr

Mentoring young women to prepare the next generation of female leaders to boldly step into careers powered by technology.

University of California, Los Angeles

Graduate Teaching Assistant

Apr 2019 – Jun 2019 · 2 mos · Greater Los Angeles Area

Worked as a Graduate Teaching Assistant for the course 'Machine Learning for Economics (Spring 2019)' for graduate students at UCLA.
My responsibilities included:
Familiarizing the students with the use of Python for Machine Learning
Holding office hours to help them with concepts and questions
Helping the Professor in creating and grading assignments, projects and exams

Git

Center for SMART Health, UCLA

Graduate Student Researcher

Jan 2019 – Dec 2019 · 11 mos · Greater Los Angeles Area

I worked under the guidance of Prof. Ramin Ramezani on the problem of low classification accuracy of the minority class in imbalanced health-related image datasets.
Developed a genetic algorithm that combines data-driven and algorithmic approaches to handle this imbalance by generating synthetic data
Achieved a higher F-score when performing classification on 8 out of 9 real-world datasets, with an improvement of ~1%

Mentoring, Intercultural Communication, Community Outreach

Samsung R&D Institute India - Bangalore Private Limited

Summer Intern

May 2017 – Jul 2017 · 2 mos · Bangalore, India

Interned in the Client Group of the Intelligent Services Team
Wrote grammar files and Python scripts based on deep neural networks to generate training data for training the model of Bixby, Samsung’s virtual assistant
Developed a client-server Android application using WebSocket