Matt Davis

Co-Founder

Fullerton, California, United States · 25 yrs 8 mos experience

Key Highlights

  • Over 20 years of experience in Site Reliability Engineering.
  • Expert in incident management and observability platforms.
  • Co-founded a cross-functional Developer Experience group.
Stackforce AI infers this person is a SaaS Site Reliability Engineer with extensive experience in incident management and cloud infrastructure.

Skills

Core Skills

Golang, Developer Experience, Incident Management, Site Reliability Engineering, Chaos Engineering, Big Data

Other Skills

Audio Mastering, Music Production, Mastering, Electronic Music, Music Performance, Audio Mixing, Platform Engineering, Test-Driven-Development, Production Readiness, GitHub GitOps, Backstage, LocalStack, Codefresh, ArgoCD, DataDog

About

Staff Site Reliability Engineer with over 20 years of experience. Expert in incident management, observability platforms, data systems, and cloud infra.

Experience

Total Experience: 25 yrs 8 mos
Average Tenure: 1 yr 8 mos
Current Experience: --

Weedmaps

Senior Site Reliability Engineer

Feb 2024 - Jun 2025 · 1 yr 4 mos · Fullerton, California, United States · Remote

  • Skills: Golang, Developer Experience, Platform Engineering, Test-Driven-Development, Production Readiness, GitHub GitOps, Backstage, LocalStack, Codefresh, ArgoCD, DataDog, Elixir
  • Developer Platform: Co-founded cross-functional DevEx group Team Diesel (Developer Integrations, Experience, Safety, Environments, and Learning), building home-grown Internal Developer Platform (IDP) serving 100+ engineers across Ruby, Java, Node.js, React Native, and Elixir applications.
  • Continuous Verification Platform: Built and maintained Continuous Verification application in Golang that automatically scores services against testable production readiness requirements, reducing deployment risk and improving service reliability metrics.
  • Production Infrastructure: Supported and managed AWS Cloud resources running polyglot application stack, including automated deployment pipelines, monitoring, and alerting for high-traffic consumer application serving thousands of users.
  • Automation & Integration: Developed Slack bot for interfacing between Jira and Slack during incident response, automating workflow transitions and status updates to reduce time to resolution.
  • Data Engineering: Produced comprehensive 2024 Incident retrospective using custom data gathering tools and analysis pipelines, processing hundreds of incidents to identify patterns and improvement opportunities.
  • Platform Reliability: Designed four-dimensional improvement roadmap for Incident Management platform, supported by executive leadership and implemented using flexible, data-driven prioritization framework.

Form

Site Reliability Architect

Feb 2023 - Aug 2023 · 6 mos · Fullerton, California, United States · Remote

  • Skills: Platform Engineering, Program Management, OpenTelemetry, Machine Learning Operations, Lucidchart, Pingdom, SumoLogic, Jeli, Nobl9, PagerDuty
  • Blogs: Making Music with Others ::: LFI Blog Series (2023).
  • Talks: On-Call Reprised and Rejuvenated ::: Southern California Linux Expo 20x (2023), Human Observability of Incident Response ::: SRECon Americas (2023).
  • SRE Platform Development: Architected and built new Site Reliability Engineering practice and platform serving three product engineering divisions (GoSpotCheck, FieldConnector, ShelfWise), establishing observability, incident management, and reliability systems from the ground up.
  • Observability Engineering: Owned and optimized observability platforms including SumoLogic, Nobl9, and Pingdom, achieving 20% cost reduction through pipeline refactoring and usage optimization while improving signal quality.
  • ML Platform Integration: Consulted with development teams on implementing OpenTelemetry across Machine Learning pipelines for image recognition applications, enabling distributed tracing and end-to-end performance monitoring.
  • Incident Management Platform: Implemented comprehensive incident management platform using Jeli, transforming chaotic multi-channel incident response into structured, automated workflows with clear escalation paths and real-time collaboration.
  • Developer Tooling: Created high-level process automation and decision-making tools that reduced incident response time from hours to minutes.

Blameless

Staff Engineer

Nov 2020 - Jan 2023 · 2 yrs 2 mos · Fullerton, California, United States · Remote

  • Talks: Groove with Ambiguity: the Robust, the Reliable, and the Resilient ::: DeveloperWeek (2021), Atlanta SRE Meetup (2021), LISA21 (2021). Human Observability of Incident Response ::: CMG ObservabilityCon (2022).
  • Blogs: SRE and the Practice of Practice ::: Blameless (2022), SRE and the Art of Improvisation ::: The New Stack (2022).
  • Platform Engineering & Automation: Implemented automation in Golang and TOML for gathering knowledge-base endpoints and data aggregation, creating CLI tools for rapid operational access across distributed systems.
  • Kubernetes Infrastructure: Engineered and rebuilt all GKE clusters with CNCF-compliant observability stack (Prometheus + FluentD + OpenTelemetry), supporting multi-tenant SaaS platform serving 500+ customers.
  • Observability Platform: Architected company's first organized observability approach, building instrumental logging dashboards that enabled rapid visualization of customer and service KPIs across microservices architecture.
  • Infrastructure as Code: Established GitOps patterns for deploying Linux hosts and secure user access in GCP using Terraform, Salt, and version control. Migrated entire Terraform codebase across multiple versions to v1 compatibility.
  • Developer Experience: Bridged SRE and Engineering teams as platform advocate, authored technical blog posts, led educational webinars, and created self-service tooling that reduced operational toil.
  • Incident Management Platform: Restructured the Incident Management program, adding continuous improvement, automated workflows, severity classification systems, and real-time collaboration procedures integrated with Blameless, Slack, and PagerDuty.
  • Reliability Engineering: Served as Incident Commander for critical production issues, designed cheat sheets and decision trees, implemented SLO/SLI frameworks across the platform.
Technical Leadership, Site Reliability Engineering, Golang, Incident Management, Google Cloud Platform (GCP)

Verica

Founding Senior Infrastructure Engineer

Dec 2018 - Jul 2020 · 1 yr 7 mos · Fullerton, California · Remote

  • Skills: Terraform, Packer, Vagrant, Kafka, Kubernetes, Platform Engineering, AWS, chaos engineering, resilience engineering, microservices architecture
  • Talks: Music in Resilience: The Practice of Practice: Southern California Linux Expo (2020), Sparklecon (2020), REdeploy (2019); QCon NYC Chaos Engineering Workshop (2019).
  • Platform Foundation: Designed and built complete IT and network infrastructure for chaos engineering platform, including AWS deployment supporting customers conducting resilience experiments.
  • Distributed Systems: Architected and implemented production Confluent Kafka cluster in AWS using Infrastructure as Code (Terraform), supporting real-time data streaming for chaos experiment telemetry and results aggregation.
  • Microservices Platform: Developed and operated Go microservices for testing Kubernetes modules, implementing container orchestration patterns and service mesh integration for resilience testing platform.
  • Observability Pipeline: Configured and maintained observability infrastructure using SumoLogic and Humio, building custom dashboards and alerting for platform health and customer experiment monitoring.
  • CI/CD Platform: Operated CircleCI pipelines for continuous integration and deployment, implementing automated testing and deployment workflows for platform components.

OpenX

Manager, Site Reliability Engineering

Jan 2013 - Mar 2018 · 5 yrs 2 mos · Pasadena, CA · Hybrid

  • Skills: Distributed Systems, Platform Engineering, Riak, Vertica, Hadoop, HBase, Erlang, Consul, Salt Stack, Mesos, Kubernetes, high-volume traffic systems, datacenter management
  • Blogs: Salted Riak, Making Logs Awesome with SumoLogic
  • Talks: Stepping up to Scale: RICON (2016), SREcon (2016); Measuring and Monitoring Riak Across the Globe: RICON (2015), SCaLE (2015).
  • Large-Scale Platform Engineering: Built and managed globally distributed data platform supporting 10+ billion daily ad requests, including 2000+ node Hadoop clusters with 30PB+ storage and real-time streaming infrastructure.
  • Distributed Database Systems: Architected, deployed, and maintained multiple Riak K/V clusters supporting global advertising user data, implementing eventually-consistent distributed systems patterns for high-availability and partition tolerance.
  • Platform Migration & Modernization: Led technical migration of all RDBMS systems from MySQL to MariaDB with Galera clustering, implementing multi-tenant high-availability platform serving critical business applications.
  • Streaming Data Platform: Operated large-scale Kafka clusters enabling real-time data workflows between Ad Quality stack components, processing 1000+ display ads and their content in near-real-time.
  • Chaos Engineering: Performed large-scale distributed systems chaos engineering experiments, implementing canary releases and network configurations to verify system resilience under failure conditions.
  • Team & Platform Leadership: Led globally distributed team of Data SREs, collaborated on hardware architecture decisions, and managed technical roadmap for platform evolution supporting exponential traffic growth.

Buzzmedia

Sr. Systems Administrator

Jun 2011 - Jan 2013 · 1 yr 7 mos · Hollywood, CA

  • Skills of note: CentOS, MySQL, RAID & Networked Storage, Puppet, Splunk, RPM/Yum, Enterprise IT, server hardware, datacenter management.
  • Managed 200+ CentOS servers in multiple locations operating websites under the Buzz Media publishing umbrella, running a combination of WordPress, Apache, Nginx, Varnish, PHP, memcache, and MySQL.
  • Supported infrastructure: Dell / SuperMicro / Penguin / IBM servers, Isilon networked storage, iPromise & Xyratex RAID arrays, Citrix Netscalers, Foundry & Force10, Nagios, Cacti, Yum Repos, Splunk.
  • Managed datacenter installations and MySQL database operations including DRBD clustering, replication, backups, storage layout, benchmarking and performance testing.
  • Selected and maintained hardware, VoIP systems, and power management in corporate headquarters IT server room plus two floors of offices with Cisco and Apple Server.
  • Documented all aspects of the operation, cross-org collaboration/communication, and effective use of ticketing systems for issue tracking.
Organization Skills, Linux, Technical Leadership, Site Reliability Engineering, Reliability, Unix, +3 more

Cyberdefender

Sr. Systems Engineer

Feb 2010 - Jun 2011 · 1 yr 4 mos

  • Lead operations architect for 50+ CentOS Linux server farm: maintenance of highly available web clusters, MySQL installations including replication and backup, software and website release automation, management and design of Rackspace Cloud based resources, system-level networking and security, ensuring data integrity and availability to customers, installation and 24/7 on-call support, monitoring using both Nagios and BigBrother, system statistic gathering using Cricket and Munin.
Organization Skills, Linux, Technical Leadership, Team Management, Site Reliability Engineering, Team Leadership, +5 more

Oversee.net

Systems Engineer

Jul 2009 - Jan 2010 · 6 mos

  • Provided systems, storage, and network support for CentOS/RedHat Linux, NetApp storage, Netscaler load balancers, and Foundry switches.
Linux, Site Reliability Engineering, Reliability, Unix, Reliability Engineering, Outages, +1 more

Rackable Systems

Sr. Systems Engineer

Oct 2008 - May 2009 · 7 mos

  • Systems engineering support for sales with a focus in Media and Entertainment markets, providing expert consulting on the Rackable suite of x86-based hardware products (servers, containers and racks, power distribution, storage arrays, network integration, out-of-band management), and serving as the technical advocate for the customer with internal engineering groups.
Linux, Site Reliability Engineering, Reliability, Unix, Reliability Engineering, Outages, +1 more

AOL

Principal System Administrator

Oct 2005 - Oct 2008 · 3 yrs

  • Lead operations architect (within AOL’s “Internet Media Operations / Personal Media” department), responsible for the operational integrity and continuous availability of the Xdrive File System and its related clients at xdrive.com and bluestring.com, including over 400TB of user storage and complex multi-tiered networks.
Linux, Site Reliability Engineering, Reliability, Unix, Reliability Engineering, Outages, +1 more

Xdrive

Sr. System Administrator

Oct 2004 - Sep 2005 · 11 mos

  • Solaris, Linux and Network Appliance systems and network administration for an online storage site.
Organization Skills, Linux, Technical Leadership, Team Management, Site Reliability Engineering, Team Leadership, +5 more

Tower Records

Clerk II

Jan 2003 - Jan 2004 · 1 yr

  • Retail responsibilities in addition to experimental music specialist and dance music buyer for Tower in Brea, CA.
Organization Skills, Linux, Site Reliability Engineering, Reliability, Unix, Reliability Engineering, +2 more

About.com

Technical Project Manager

Jan 2001 - Jan 2001 · 0 mo

  • Technical project management and due-diligence for newly acquired Primedia properties being integrated into About.com hosting facilities, with projects spanning multiple publication and business units.
Technical Leadership, Team Leadership, Reliability, Outages

Global Center

Director of Professional Services, East

Jan 1998 - Jan 2001 · 3 yrs

  • Technical Director of system administrators and project managers for Virginia, NYC and Chicago areas.
Organization Skills, Linux, Team Leadership, Reliability, Unix, Reliability Engineering, +1 more

Digex

Sr. Unix Systems Administrator

Jan 1997 - Jan 1998 · 1 yr

  • Solaris administration, top tier technical support and systems design.
Organization Skills, Linux, Technical Leadership, Team Management, Team Leadership, Reliability, +4 more

Husky Labs

Graphic Designer

Jan 1995 - Jan 1997 · 2 yrs

  • Graphic design, web page design, Java programming, Unix hardware installation support.
Organization Skills, Linux, Site Reliability Engineering, Reliability, Unix, Reliability Engineering, +2 more

Craque

Performing Musician

Jan 1990 - Present · 36 yrs 4 mos · Los Angeles Metropolitan Area

  • LA area artist working in freeform electro-acoustic sound sculptures which take on aspects of jazz improvisation and classical music, sometimes including unique experimental dance grooves.
  • Released on Labels: Inpuj, Audiobulb, Deep Listening, Kahvi Collective, Kikapu, Metatron Press, Modsquare, Stadtgruen, Test Tube, Xynthetic. Self-released material through Bandcamp.
  • Professional DJ: Ran weekly and monthly music events: "Synesthesia" at Big Wig and other locations (Chicago) [2000-01], Continental Room (Fullerton, CA) [2003-04], "Loungeometry" at Kettle and the Keg (Fullerton) [2005-08], Commonwealth Lounge (Fullerton), "Drunken Quill Society" at Matador Cantina (Fullerton) [2015-2017], "The Collective" at Front Street Bar and Grill (Fullerton) [2018], "The Glass Show @ Bootleggers Brewery" (Fullerton) [2015-2019].
  • http://craque.bandcamp.com
Audio Mastering, Music Production, Mastering, Electronic Music, Music Performance, Audio Mixing

Education

University of Maryland

n/a — Opera Performance

Jan 1996 - Jan 1998

Virginia Tech

BA — Music Composition

Jan 1990 - Jan 1995
