Sohom Bhattacharjee

DevOps Engineer

Bengaluru, Karnataka, India · 7 yrs 11 mos experience

Key Highlights

  • Reduced stateful cluster scaling time from 10-12 hours to 2 hours.
  • Managed a time-series database ingesting over 1.5 billion time-series per minute.
  • Executed complex data-center migrations with zero downtime.
Stackforce AI infers this person is a DevOps Engineer specializing in high-availability systems and cloud infrastructure.

Skills

Core Skills

Server Architecture · Performance Tuning · Site Reliability Engineering · Kubernetes · Engineering · Linux · Devops

Other Skills

Linux System Administration · Ansible · Grafana · Terraform · Docker Products · Container Orchestration · Prometheus.io · Google Kubernetes Engine (GKE) · golang · API Testing · Python (Programming Language) · Git · GitOps · CI/CD · Kafka

About

I'm a DevOps engineer who began as a sysadmin and never lost that deep appreciation for how systems actually work. I've led infrastructure transformations that significantly reduced deployment times, improved cluster efficiency, and enabled high availability at scale, across both startup and enterprise environments. I spend way too much time staring at logs. I write tools in Python, Bash, and Go, and can read C, C++, and Java (still debating if I should spend more time with Perl).

At Last9, I maintained a highly available, multi-region time-series data platform ingesting over 1.5 billion time-series per minute. I re-engineered the backend infrastructure to support rack-aware deployments, which reduced stateful cluster scaling time from 10–12 hours to a fully automated 2-hour operation, dramatically improving reliability and operational efficiency.

At Druva, I took a detour from SRE/DevOps and spent time working on two research projects involving the Kubernetes control plane and the VMDK/VDDK virtual storage layer.

I genuinely enjoy the challenge of migrations, whether it's orchestrating entire data-centre moves across regions or lifting and shifting legacy applications into the cloud. I've planned and executed complex transitions with zero downtime and a strong focus on reliability and repeatability.

💡 Things I work with regularly: Kubernetes, AWS, GCP, Docker, Prometheus, Terraform, Go, Python, GitOps, CI/CD

📦 Things I've worked with: Kafka, Riak, SolrCloud, ZooKeeper, VictoriaMetrics (cluster mode), Prometheus, GitLab, Postfix + Dovecot (Email), Prosody (XMPP), LVM

🛠️ What I care about: Simplicity in design, observability, reducing operational burden, teaching & mentoring

On the side, I tinker with my home lab (a 5TB NAS on a Raspberry Pi) and keep diving deeper into Linux, networking, and large-scale storage systems: the places where hardware and software shake hands.

Experience

Total Experience: 7 yrs 11 mos
Average Tenure: 1 yr
Current Experience: 4 mos

Cloudflare

Systems Engineer

Dec 2025 - Present · 4 mos · Bengaluru, Karnataka, India · Hybrid

E2E Cloud

Senior Site Reliability Engineer

Jun 2025 - Nov 2025 · 5 mos · Bangalore Urban, Karnataka, India · On-site

  • Building an AI/ML platform on top of k8s
  • Revamping observability and alerting for the entire org
  • Automating bare-metal server provisioning and maintenance (OpenNebula)
Server Architecture · Performance Tuning · Linux System Administration

Career Break

Feb 2025 - Jun 2025 · 4 mos · Mountains

  • Went off to the mountains to complete my Basic Mountaineering Course.
  • Went off to Nepal to hike the Manaslu Circuit

Last9 Inc

Site Reliability Engineer

Jan 2023 - Feb 2025 · 2 yrs 1 mo · Pune, Maharashtra, India · Hybrid

  • Rewrote the backend infrastructure of our product to support rack-aware deployments. This reduced our stateful cluster scaling operations from 10-12 hours to a fully automated 2-hour operation.
  • Ran a TSDB for customers with three nines of availability across multiple regions and deployments.
  • Owning the observability pipeline end-to-end
  • TSDB runs at scale (peak ingestion > 1.5 Billion time-series / min) in our largest deployment
  • Inter-Region Data-Center migration (>30TB data) without any disruption in reads / writes
  • Load-Testing / Stress-Testing / Performance analysis of new software. Capacity planning based on the same.
  • Reduced toil on internal storage operations by writing an automation platform. New deployments now take less than 5 mins, as opposed to 45 mins before.
  • Regular on-call rotations and capacity-planning discussions
  • Wrote tools in Python and Golang to support internal workloads and teams.
  • Managed a cluster with > 900 individual nodes
  • Building incident-response runbooks for the entire team
  • Mentoring and supporting junior engineers during on-call incidents.
Ansible · Grafana · Terraform · Site Reliability Engineering · Docker Products · Container Orchestration · +3

Druva

Software Engineer

Jan 2021 - Dec 2022 · 1 yr 11 mos · Pune, Maharashtra, India

  • Software Engineer at Druva-Labs.
  • Worked on the Kubernetes Protection Project. We built a prototype Operator that could perform backup and restore operations on K8s StatefulSets (on top of AWS).
  • Worked on the VMware VMDK disk format for another project.
  • Performed API testing using Cypress.
  • Performed M&A analysis of other tools/companies from a technical perspective.
Engineering · golang · Terraform · Docker Products · Container Orchestration · Kubernetes

People Interactive

2 roles

Senior DevOps Engineer

Jul 2020 - Jan 2021 · 6 mos

Ansible · Linux · Grafana · Python (Programming Language) · Terraform · Site Reliability Engineering · +4

DevOps Engineer

Aug 2019 - Jul 2020 · 11 mos

  • Owned the entire container platform, including the CI/CD layer and multiple staging and production clusters across multiple business units.
  • Maintained system stability during multiple AWS outages
  • Migrated Redis live in prod without service disruption.
  • Built and maintained infra to run SolrCloud on top of ECS
  • Built and maintained infra to run ZooKeeper on top of ECS
  • Helped maintain multiple Kafka clusters
  • Maintained internal CI/CD and deployment tooling built on top of GitLab
  • Day-to-day operations, tasks, and optimizations.
Ansible · Linux · Grafana · Python (Programming Language) · Terraform · Site Reliability Engineering · +3

Voicereach

DevOps Engineer

Aug 2018 - Aug 2019 · 1 yr · Mumbai, Maharashtra, India

  • I started as a sysadmin / DevOps Engineer where I managed day-to-day operations using Terraform, Ansible, Packer and Jenkins.
  • To better understand how the database we were running (Riak) worked, I grokked the manual and performed benchmarks. This resulted in a significant improvement in the uptime of our Riak cluster. I also set up monitoring for the JVM and Riak.
  • Wrote a couple of Python tools to speed up bulk import/export from Riak. After this, the time taken to export the entire DB was reduced to 2 hours as opposed to 1 day. This enabled us to perform faster database backups and migrations.
  • Volunteered to write a data-cleaning tool for the data team. By automating away repetitive tasks, it reduced the time taken to clean data (before ingestion) from 6 hours to 1 hour.
Ansible · Linux · Python (Programming Language) · Terraform · Docker Products · Git · +2

Azim Premji Foundation

Technical Consultant

Apr 2017 - Sep 2017 · 5 mos · Bengaluru Area, India

  • I was responsible for training the Content Development team at Azim Premji Foundation on Linux and other FOSS tools that can be used in a classroom context for teaching high school children.
Linux

Education

S Nijalingappa College

Bachelor of Computer Applications · Computer Science

Jan 2015 - Jan 2018
