Mihai-Valentin Curelea — DevOps Engineer

Most organisations have a system that works until it doesn't. The people who understood why it was built that way have left, the architecture accumulated decisions that made sense at the time, and nobody has looked at the whole thing end to end in years. That's usually when I get called.

I come in, map what actually exists - not what the diagrams say - find where the real breaking points are, and give leadership a clear picture of what needs to change and why. Sometimes I make those changes myself. Sometimes with the team. Depends on what the situation needs.

Most recently I cut cloud costs by 25% for a Fortune 500 company. At Meta, I scaled the self-healing infrastructure platform for the whole Facebook fleet as it grew from 4 to 18 data centers globally. I've done this type of work at Meta, Datadog, and other Fortune 500 companies. I co-authored a research paper on AI-based root cause analysis presented at ACM Sigmetrics. I've built open-source infrastructure used in 100K+ projects. I can walk into a system nobody fully understands and tell you, with precision, what's holding it together and what's about to break.

If your delivery is slower than it should be, your costs are skyrocketing while growth has stagnated, your AI investments aren't paying off, or you're about to make a significant architecture decision and want someone who has seen how these go wrong - that's the conversation I'm useful for. Remote only. If that's your situation, send me a message.

Stackforce AI infers this person is a SaaS and B2B Infrastructure Specialist with extensive experience in cloud observability and system reliability.

Experience: 15 yrs 6 mos

Skills

Cloud Infrastructure
Solution Architecture
Site Reliability Engineering
Cloud Observability
Service Level Objectives Management
Infrastructure Scalability
Infrastructure Provisioning
Root Cause Analysis
Web Development
Fullstack Development
Javascript Development
Open Source Development

Career Highlights

Reduced cloud costs by 25% for a Fortune 500 company.
Scaled Facebook's self-healing infrastructure from 4 to 18 data centers.
Co-authored a research paper on AI-based root cause analysis.

Work Experience

Remote Work

Principal Software Engineer & AWS Solutions Architect (3 yrs 11 mos)

Datadog

Senior Site Reliability Engineer (1 yr 3 mos)

Facebook

Tech Lead, Production Engineer (Site Reliability Engineer / Cloud Engineer) (8 mos)

Tech Lead, Production Engineer (Site Reliability Engineer / Cloud Engineer) (1 yr)

Tech Lead, Production Engineer (Site Reliability Engineer / Cloud Engineer) (3 yrs 4 mos)

1&1 Internet, Inc.

Fullstack Software Architect (NodeJS) (2 yrs 4 mos)

Senior PHP/JavaScript developer (3 yrs 9 mos)

Adobe

Senior JavaScript & NodeJS Software Engineer (5 mos)

Open Source

Author & Lead Developer (1 mo)

Hippotomate - Supercharge your automated testing development & debugging

Author of Open Source App (1 mo)

Image Pro WordPress Plugin

Author of Open Source WordPress Plugin (3 yrs)

RCS & RDS

Web developer (1 yr 4 mos)

MultiACT Media

Web developer (2 yrs 1 mo)

vWorker

Freelancer on vWorker (ex RentAcoder) (1 yr)

Education

Machine Learning at Stanford University

Bachelor of Engineering at University POLITEHNICA of Bucharest

Highly Stable

Skills

Core Skills

Cloud Infrastructure
Solution Architecture
Site Reliability Engineering
Cloud Observability
Service Level Objectives Management
Infrastructure Scalability
Infrastructure Provisioning
Root Cause Analysis
Web Development
Fullstack Development
Javascript Development
Open Source Development

Other Skills

AWS Lambda
Amazon DynamoDB
Infrastructure as a Service (IaaS)
Node.js
SQL
Machine Learning
Amazon SQS
Amazon Redshift
AWS Cloud Migration
NoSQL
DevOps
Snowflake
Microservices
Large Scale Systems
Technical Leadership

About

Experience

Total Experience: 15 yrs 6 mos

Average Tenure: 2 yrs

Current Experience

Remote Work

Principal Software Engineer & AWS Solutions Architect

Jun 2022 – Present · 3 yrs 11 mos

Currently contracting as a hands-on Principal Software Engineer & AWS Solutions Architect for a large US multinational.
My main responsibilities:
➤ 𝐃𝐞𝐟𝐢𝐧𝐢𝐧𝐠 𝐚𝐧𝐝 𝐃𝐫𝐢𝐯𝐢𝐧𝐠 𝐓𝐞𝐜𝐡𝐧𝐢𝐜𝐚𝐥 𝐒𝐭𝐫𝐚𝐭𝐞𝐠𝐲
Setting the vision and direction for the infrastructure organization to ensure scalability, reliability, and innovation.
➤ 𝐀𝐜𝐜𝐞𝐥𝐞𝐫𝐚𝐭𝐢𝐧𝐠 𝐒𝐨𝐟𝐭𝐰𝐚𝐫𝐞 𝐃𝐞𝐥𝐢𝐯𝐞𝐫𝐲
Enhancing development velocity and prototyping capabilities while maintaining enterprise-grade quality.
➤ 𝐀𝐫𝐜𝐡𝐢𝐭𝐞𝐜𝐭𝐢𝐧𝐠 𝐚𝐧𝐝 𝐢𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐢𝐧𝐠 𝐋𝐚𝐫𝐠𝐞-𝐒𝐜𝐚𝐥𝐞 𝐒𝐲𝐬𝐭𝐞𝐦𝐬
Designing and optimizing distributed, high-performance architectures that power mission-critical applications.
➤ 𝐎𝐩𝐭𝐢𝐦𝐢𝐳𝐢𝐧𝐠 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐚𝐧𝐝 𝐕𝐞𝐧𝐝𝐨𝐫 𝐂𝐨𝐬𝐭𝐬
Leading strategic cost-efficiency initiatives across engineering teams and third-party providers, ensuring maximum ROI.
➤ 𝐄𝐱𝐞𝐜𝐮𝐭𝐢𝐧𝐠 𝐒𝐞𝐚𝐦𝐥𝐞𝐬𝐬 𝐋𝐚𝐫𝐠𝐞-𝐒𝐜𝐚𝐥𝐞 𝐌𝐢𝐠𝐫𝐚𝐭𝐢𝐨𝐧𝐬
Orchestrating complex infrastructure transitions with zero downtime and minimal risk.
➤ 𝐈𝐦𝐩𝐥𝐞𝐦𝐞𝐧𝐭𝐢𝐧𝐠 𝐄𝐧𝐝-𝐭𝐨-𝐄𝐧𝐝 𝐎𝐛𝐬𝐞𝐫𝐯𝐚𝐛𝐢𝐥𝐢𝐭𝐲
Elevating business and infrastructure-level insights with robust monitoring, tracing, and logging strategies.
➤ 𝐌𝐞𝐧𝐭𝐨𝐫𝐢𝐧𝐠 𝐚𝐧𝐝 𝐃𝐞𝐯𝐞𝐥𝐨𝐩𝐢𝐧𝐠 𝐄𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐋𝐞𝐚𝐝𝐞𝐫𝐬
Cultivating high-performance teams by mentoring key engineers and technical leaders to drive innovation and execution at scale.

AWS Lambda, Amazon DynamoDB, Infrastructure as a Service (IaaS), Node.js, SQL, Machine Learning, +19 more

Datadog

2 roles

Senior Site Reliability Engineer

Mar 2021 – Jun 2022 · 1 yr 3 mos · Bucharest Metropolitan Area

Cataloging services in a fast-moving, large-scale environment
Datadog, the market leader in the cloud observability space, was incurring substantial delays and was dealing with low productivity among on-call engineers, due to the painfully slow and inefficient process of gathering essential information about various internal services.
𝗦𝗮𝘃𝗲𝗱 ~𝟭𝟬𝟬 𝗵𝗼𝘂𝗿𝘀 𝘄𝗼𝗿𝘁𝗵 𝗼𝗳 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝘁𝗶𝗺𝗲 𝗲𝘃𝗲𝗿𝘆 𝗺𝗼𝗻𝘁𝗵 by significantly decreasing the time to get relevant information about any service in the company by building an internal Service Catalog to centralize service data across the whole company.

React, Infrastructure as a Service (IaaS), SQL, Machine Learning, Datadog, NoSQL, +14 more

Senior Site Reliability Engineer

Mar 2021 – Jun 2022 · 1 yr 3 mos · Bucharest Metropolitan Area

Driving company-wide adoption of SLOs (Service Level Objectives) in a large-scale environment
Datadog, the market leader in the cloud observability space, was incurring substantial delays during time-sensitive production incidents due to missing or inconsistent high-level observability (SLOs) across internal services.
𝗗𝗲𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝘁𝗵𝗲 𝘁𝗶𝗺𝗲 𝗻𝗲𝗲𝗱𝗲𝗱 𝗯𝘆 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝘀 𝘁𝗼 𝗮𝘀𝘀𝗲𝘀𝘀 𝘁𝗵𝗲 𝗼𝘃𝗲𝗿𝗮𝗹𝗹 𝗵𝗲𝗮𝗹𝘁𝗵 𝗼𝗳 𝘁𝗵𝗲𝗶𝗿 𝘀𝗲𝗿𝘃𝗶𝗰𝗲𝘀 𝗳𝗿𝗼𝗺 𝟱-𝟯𝟬 𝗺𝗶𝗻𝘂𝘁𝗲𝘀 𝘁𝗼 𝗹𝗲𝘀𝘀 𝘁𝗵𝗮𝗻 𝟯𝟬 𝘀𝗲𝗰𝗼𝗻𝗱𝘀 by delivering an easy-to-use system where every team can easily onboard their regular and AI/ML services and workloads to SLOs and get out-of-the-box dashboards and monitors that expose their system health.
𝗘𝗻𝗮𝗯𝗹𝗲𝗱 𝟭𝟬+ 𝗰𝗼𝗺𝗽𝗮𝗻𝗶𝗲𝘀 𝘁𝗼 𝘀𝗲𝘁 𝘂𝗽 𝗦𝗟𝗢𝘀 at scale by presenting my approach to simplifying SLO management at the yearly Dash 2021 tech conference.
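
For readers unfamiliar with SLOs, the idea behind "assessing service health in seconds" can be sketched as a single calculation: compare the share of good events against a target and report how much error budget is left. This is only an illustrative sketch, not the system built at Datadog; the names `SLOStatus` and `evaluate_slo` and the event-count inputs are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SLOStatus:
    availability: float            # fraction of good events, 0.0 to 1.0
    error_budget_remaining: float  # 1.0 = untouched, 0.0 = spent, negative = overspent
    healthy: bool                  # True when availability meets the target


def evaluate_slo(good_events: int, total_events: int, target: float) -> SLOStatus:
    """Evaluate a ratio-based SLO (hypothetical helper, for illustration only)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")

    availability = good_events / total_events
    allowed_bad = (1.0 - target) * total_events  # bad events the target tolerates
    actual_bad = total_events - good_events
    remaining = 1.0 - actual_bad / allowed_bad
    return SLOStatus(availability, remaining, availability >= target)


if __name__ == "__main__":
    # 50 failures out of 100,000 requests against a 99.9% target
    print(evaluate_slo(99_950, 100_000, 0.999))
```

A dashboard or monitor built on a value like `error_budget_remaining` gives an on-call engineer one number to look at instead of a set of raw metrics.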

Infrastructure as a Service (IaaS), Python (Programming Language), SQL, Datadog, NoSQL, MLOps, +11 more

Facebook

3 roles

Tech Lead, Production Engineer (Site Reliability Engineer / Cloud Engineer)

Apr 2020 – Dec 2020 · 8 mos

Migrating legacy provisioning platform to a new platform and adding observability to it
Facebook, the world’s biggest social media company, was incurring substantial operational costs and delays because its infrastructure provisioning platform could not scale in step with the accelerated growth of its server fleet (from 4 to 18 data centers in less than 4 years).
𝗗𝗲𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝘁𝗵𝗲 𝘁𝗶𝗺𝗲 𝘁𝗼 𝗴𝗲𝘁 𝘀𝘆𝘀𝘁𝗲𝗺-𝘄𝗶𝗱𝗲 𝗮𝗻𝗱 𝗶𝘀𝗼𝗹𝗮𝘁𝗲𝗱 𝗳𝗮𝗶𝗹𝘂𝗿𝗲𝘀 𝗳𝗿𝗼𝗺 𝟱-𝟯𝟬 𝗺𝗶𝗻𝘂𝘁𝗲𝘀 𝘁𝗼 𝗹𝗲𝘀𝘀 𝘁𝗵𝗮𝗻 𝟯𝟬 𝘀𝗲𝗰𝗼𝗻𝗱𝘀, preventing significant loss of revenue during time-sensitive production incidents by leading and implementing the multi-level observability efforts consisting of metrics, dashboards, monitors and SLOs across the whole provisioning platform
𝗥𝗲𝗱𝘂𝗰𝗲𝗱 𝗯𝗮𝗱 𝗽𝘂𝘀𝗵𝗲𝘀 𝘁𝗼 𝘃𝗶𝗿𝘁𝘂𝗮𝗹𝗹𝘆 𝟬 as a result of automating error-checking of system-wide metrics in the release pipeline.
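
As a rough illustration of automated metric checking in a release pipeline (not Facebook's actual implementation), a gate can compare metrics from the new build against a baseline and refuse to proceed when any of them regresses beyond a tolerance. The function name `failed_checks`, the metric names, and the 10% tolerance are all assumptions made for this sketch, and it assumes that a higher value is worse for every metric.

```python
from typing import List, Mapping


def failed_checks(
    baseline: Mapping[str, float],
    candidate: Mapping[str, float],
    tolerance: float = 0.10,
) -> List[str]:
    """Return the metrics on which the candidate build regressed.

    A metric fails when it is missing from the candidate or when its value
    exceeds the baseline by more than `tolerance` (relative increase).
    Assumes higher values are worse (error rates, latencies).
    """
    failures = []
    for name, base_value in baseline.items():
        if name not in candidate:
            failures.append(name)  # a missing metric is treated as a failure
            continue
        limit = base_value * (1.0 + tolerance)
        if candidate[name] > limit:
            failures.append(name)
    return failures


if __name__ == "__main__":
    baseline = {"error_rate": 0.01, "p99_latency_ms": 200.0}
    candidate = {"error_rate": 0.02, "p99_latency_ms": 205.0}
    regressions = failed_checks(baseline, candidate)
    if regressions:
        print("Blocking release, regressed metrics:", regressions)
    else:
        print("Release may proceed")
```

In a pipeline, a non-empty result would stop the push before it reaches the wider fleet.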

Infrastructure as a Service (IaaS), Software Architecture, Python (Programming Language), SQL, NoSQL, Large Scale Systems, +12 more

Tech Lead, Production Engineer (Site Reliability Engineer / Cloud Engineer)

Mar 2019 – Mar 2020 · 1 yr

Exposing hidden errors using root cause investigation at scale
Facebook, the world’s biggest social media company, was incurring substantial operational costs and delays due to a painfully slow, inefficient and manual process of identifying small pockets of failures inside their humongous volumes of data
𝗗𝗲𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝗲𝗿𝗿𝗼𝗿 𝗱𝗲𝘁𝗲𝗰𝘁𝗶𝗼𝗻 𝘁𝗶𝗺𝗲 𝗳𝗿𝗼𝗺 𝗺𝗼𝗿𝗲 𝘁𝗵𝗮𝗻 𝟱 𝗺𝗶𝗻𝘂𝘁𝗲𝘀 𝘁𝗼 𝗼𝗻𝗹𝘆 𝟯𝟬 𝘀𝗲𝗰𝗼𝗻𝗱𝘀 and reduced the risk of revenue loss associated with manually detecting root causes to virtually 0 during high-stakes incidents by designing a highly-intuitive real-time hidden error detection tool and embedding it in Facebook’s data platforms ecosystem
Pioneered this approach and 𝗲𝗻𝗮𝗯𝗹𝗲𝗱 𝟭𝟬𝟬𝟬+ 𝗰𝗼𝗺𝗽𝗮𝗻𝗶𝗲𝘀 𝗶𝗻 𝘁𝗵𝗲 𝗶𝗻𝗱𝘂𝘀𝘁𝗿𝘆 𝘁𝗼 𝗹𝗲𝘃𝗲𝗿𝗮𝗴𝗲 𝗺𝘆 𝘄𝗼𝗿𝗸 by co-authoring a scientific paper, presenting it at ACM Sigmetrics conference, and being featured on Facebook’s engineering blog

Infrastructure as a Service (IaaS), Software Architecture, Python (Programming Language), SQL, Machine Learning, NoSQL, +15 more

Tech Lead, Production Engineer (Site Reliability Engineer / Cloud Engineer)

Sep 2017 – Jan 2021 · 3 yrs 4 mos

Scaling the self-healing infrastructure platform for the whole Facebook fleet
Facebook, the world’s biggest social media company, was incurring substantial operational costs and delays due to their self-healing infrastructure platform being unable to scale up on par with the growth from 4 to 18 data centers globally
𝐂𝐮𝐭 𝐞𝐧𝐠𝐢𝐧𝐞𝐞𝐫𝐢𝐧𝐠 𝐰𝐨𝐫𝐤𝐥𝐨𝐚𝐝 𝐛𝐲 𝟐𝟓𝟎+ 𝐡𝐨𝐮𝐫𝐬/𝐦𝐨𝐧𝐭𝐡 𝗮𝗰𝗿𝗼𝘀𝘀 𝘁𝗵𝗲 𝘄𝗵𝗼𝗹𝗲 𝗰𝗼𝗺𝗽𝗮𝗻𝘆 by building highly intuitive self-service tools to automate the most relevant support requests
𝗗𝗿𝗮𝘀𝘁𝗶𝗰𝗮𝗹𝗹𝘆 𝗿𝗲𝗱𝘂𝗰𝗲𝗱 𝘀𝘁𝗮𝗿𝘁𝗶𝗻𝗴 𝘁𝗶𝗺𝗲 𝗼𝗳 𝗮𝘂𝘁𝗼-𝗿𝗲𝗺𝗲𝗱𝗶𝗮𝘁𝗶𝗼𝗻𝘀 𝗯𝘆 𝟰𝘅, 𝗿𝗲𝘀𝘂𝗹𝘁𝗶𝗻𝗴 𝗶𝗻 𝗲𝘃𝗲𝗿𝘆 𝘀𝗲𝗿𝘃𝗲𝗿 𝗯𝗲𝗶𝗻𝗴 𝗯𝗮𝗰𝗸 𝘁𝗼 𝗮 𝗵𝗲𝗮𝗹𝘁𝗵𝘆 𝘀𝘁𝗮𝘁𝗲 𝗳𝗮𝘀𝘁𝗲𝗿 𝗮𝗻𝗱 𝗽𝗿𝗲𝘃𝗲𝗻𝘁𝗶𝗻𝗴 𝘀𝗶𝗴𝗻𝗶𝗳𝗶𝗰𝗮𝗻𝘁 𝗱𝗮𝗶𝗹𝘆 𝗿𝗲𝘃𝗲𝗻𝘂𝗲 𝗹𝗼𝘀𝘀, by leading a series of scalability, reliability, observability, machine learning, security and performance improvement efforts, such as building a fast track to prioritize the most common auto-remediations across the fleet.

Infrastructure as a Service (IaaS), Software Architecture, Python (Programming Language), SQL, Machine Learning, NoSQL, +15 more

1&1 Internet, Inc.

3 roles

Fullstack Software Architect (NodeJS)

May 2015 – Sep 2017 · 2 yrs 4 mos

Generating beautiful one-page websites for customers in less than 30 minutes
Ionos (1&1), a leading international provider of cloud infrastructure, cloud services, and hosting based in Germany, was struggling to acquire new customers and keep existing ones due to the long time it took customers to publish their website online.
𝗗𝗲𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝘁𝗶𝗺𝗲 𝘁𝗼 𝗽𝘂𝗯𝗹𝗶𝘀𝗵 𝗮 𝗻𝗲𝘄 𝘄𝗲𝗯𝘀𝗶𝘁𝗲 𝗳𝗿𝗼𝗺 𝗵𝗼𝘂𝗿𝘀/𝗱𝗮𝘆𝘀/𝗻𝗲𝘃𝗲𝗿 𝘁𝗼 𝗹𝗲𝘀𝘀 𝘁𝗵𝗮𝗻 𝟯𝟬 𝗺𝗶𝗻𝘂𝘁𝗲𝘀 by building a simplified version of the website that the user can publish on the internet in less than 30 minutes.
𝗗𝗲𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝗰𝗵𝘂𝗿𝗻 𝗿𝗮𝘁𝗲 𝗯𝘆 𝗮𝗻 𝗲𝘀𝘁𝗶𝗺𝗮𝘁𝗲𝗱 𝟮𝟬% 𝗳𝗼𝗿 𝘁𝗵𝗲 𝗻𝗲𝘄 𝗰𝘂𝘀𝘁𝗼𝗺𝗲𝗿𝘀 that were set up through this flow by driving the onboarding process to include the simplified website instead of the standard one.

React, Infrastructure as a Service (IaaS), Node.js, Software Architecture, SQL, NoSQL, +15 more

Fullstack Software Architect (NodeJS)

May 2015 – Sep 2017 · 2 yrs 4 mos

Increase flexibility and variety of website designs for a Website builder SaaS
Ionos, a German multinational technology conglomerate in the hosting space, was struggling to retain customers of its online website builder because it offered only outdated-looking templates and final websites.
𝗜𝗻𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝗡𝗣𝗦 (𝗡𝗲𝘁 𝗣𝗿𝗼𝗺𝗼𝘁𝗲𝗿 𝗦𝗰𝗼𝗿𝗲) 𝗯𝘆 𝗮𝗻 𝗲𝘀𝘁𝗶𝗺𝗮𝘁𝗲𝗱 𝟮𝟬% for customers who created their websites using modern-looking, industry-leading features built in-house, such as adaptive color themes, dynamic content and content recommendation.

React, Infrastructure as a Service (IaaS), Node.js, Software Architecture, SQL, NoSQL, +15 more

Senior PHP/JavaScript developer

Mar 2011 – Dec 2014 · 3 yrs 9 mos

I was part of the team that develops "1&1 My Website", 1&1's online website builder, used by customers from various countries to manage their internet presence.
I was responsible for feature design and implementation, leading technical topics, package building and various application improvements.

Adobe

Senior JavaScript & NodeJS Software Engineer

Dec 2014 – May 2015 · 5 mos · Bucharest, Bucharest, Romania

Increasing the number of new users onboarded to Adobe Illustrator by improving the onboarding experience
Adobe, the global leader in digital media and digital marketing solutions, was struggling with new customers dropping out of the onboarding flow for Adobe Illustrator because of the significant time between downloading Illustrator and starting it for the first time.
𝗜𝗻𝗰𝗿𝗲𝗮𝘀𝗲𝗱 𝘁𝗵𝗲 𝗔𝗱𝗼𝗯𝗲 𝗜𝗹𝗹𝘂𝘀𝘁𝗿𝗮𝘁𝗼𝗿 𝗼𝗽𝗲𝗻𝗶𝗻𝗴 𝗮𝗳𝘁𝗲𝗿 𝗱𝗼𝘄𝗻𝗹𝗼𝗮𝗱 𝗿𝗮𝘁𝗲 𝗯𝘆 𝗼𝘃𝗲𝗿 𝟭𝟬% by building an easy-to-use, browser-based version of Adobe Illustrator and connecting it to the onboarding flow for new customers, so they could start using a simplified version of the product before the download had completed.

React, Infrastructure as a Service (IaaS), Node.js, SQL, NoSQL, JavaScript, +4 more

Open Source

Author & Lead Developer

Nov 2014 – Dec 2014 · 1 mo · Bucharest Metropolitan Area

Enabling the whole industry to perform full-text search functionalities across 18 languages in JavaScript
I identified an opportunity to enable multi-language full-text search across product documentation, which was not natively supported in Lunr JS, and I developed a solution for it.
𝗘𝗻𝗮𝗯𝗹𝗲𝗱 𝗼𝘃𝗲𝗿 𝟭𝟬𝟬𝗸 𝘄𝗲𝗯𝘀𝗶𝘁𝗲𝘀 𝗮𝗻𝗱 𝘄𝗲𝗯 𝗮𝗽𝗽𝘀 𝘁𝗼 𝗽𝗲𝗿𝗳𝗼𝗿𝗺 𝗳𝘂𝗹𝗹-𝘁𝗲𝘅𝘁 𝘀𝗲𝗮𝗿𝗰𝗵𝗲𝘀 𝗮𝗰𝗿𝗼𝘀𝘀 𝟭𝟴 𝗹𝗮𝗻𝗴𝘂𝗮𝗴𝗲𝘀 in JavaScript by building Lunr-Languages and building a tech community around it.
🛠Skills: JavaScript, Language processing

Node.js, JavaScript, Open Source Development

Hippotomate - Supercharge your automated testing development & debugging

Author of Open Source App

Oct 2014 – Nov 2014 · 1 mo

Creator of Hippotomate (http://mihaivalentin.com/hippotomate/), an app that helps QA automation engineers develop and debug their tests.
Hippotomate will help you analyze errors, view screenshots, run tests step by step and see what's going on.
I started Hippotomate as a side project, and went from idea, concept, design, architecture, coding to preparing an MVP and promoting it to get early user feedback.
Check out how Hippotomate can help you: https://www.youtube.com/watch?v=zuIJma9LaUE

Image Pro WordPress Plugin

Author of Open Source WordPress Plugin

Jul 2011 – Jul 2014 · 3 yrs

The Image Pro WordPress plugin rethinks image management and editing in WordPress. You can easily add, edit, resize and remove images. It fully replaces the WordPress image upload system with a completely new one, which is faster and more fun to use. Downloaded by over 66,000 WordPress users.
http://www.mihaivalentin.com/image-pro-wordpress-image-management/

RCS & RDS

Web developer

Nov 2009 – Mar 2011 · 1 yr 4 mos

When I joined the company, its online presence was scattered among many service-specific websites (TV, internet, telephony). Together with the Marketing department, we set up a plan to merge this information into a single website that now represents the company as a whole:
www.rcs-rds.ro