Andy Howden

Director of Engineering

Berlin, Berlin, Germany · 11 yrs 9 mos experience
Most Likely To Switch · Highly Stable

Key Highlights

  • Led teams to enhance reliability across large organizations.
  • Developed self-service products for thousands of developers.
  • Expert in incident response and operational readiness.
Stackforce AI infers this person is a Senior Engineering Leader in E-commerce with a focus on reliability and operational excellence.

Skills

Core Skills

Architecture · Incident Response · Project Management

Other Skills

Oral Communication · Communication · Video Production · Kubernetes · Ansible · Prometheus · Web Development · Web Services · Virtualization · Software Development · JavaScript · Linux · MySQL · CSS · HTML

About

Staff engineer or engineering manager with a history of delivering critical projects that span hundreds of engineers: collaborating to develop products that improve reliability as efficiently as possible across a large organization, working directly with teams to address a given critical reliability challenge, or working with senior stakeholders to understand and set organizational tradeoffs around reliability, on-call health, velocity and cost.

Experience

Total Experience: 11 yrs 9 mos
Average Tenure: 1 yr 8 mos
Current Experience: 2 yrs 2 mos

Delivery Hero

2 roles

Engineering Director (Acting)

Promoted

Mar 2025 - Present · 1 yr 2 mos · Hybrid

  • Welp, I landed in management again.
  • Leading a team of approximately 30 people, broken down into the "Leadership" team (with Staff+ engineers & management), Incident Detection & Response, Observability, Resilience Engineering and two Embedded SRE teams.
  • Focused on building an organization that enables thousands of developers at Delivery Hero to improve their reliability through self-service products, enablement or collaboration with other leaders in the organization. Doing that by giving some structure to the product & engineering work that is delivered while building fairly autonomous teams that deeply understand the customer challenges, the problem domain and the customer experience, and that find ways to improve that experience within their scope.
Architecture · Incident Response · Project Management · Oral Communication

Principal Engineer

Feb 2024 - Feb 2025 · 1 yr · Hybrid

  • Joining the Site Reliability Engineering team, helping reduce the impact of incidents while supporting developers across the Delivery Hero group.

Career break

Personal goal pursuit

Aug 2023 - Jan 2024 · 5 mos · Berlin, Berlin

  • For the next few months, I will be dedicated to creating a course to help mid-, senior- or principal-level engineers grow their understanding of production systems and their ability to deploy and manage software, understand the social, financial, business or health costs of their operations, and work to improve reliability.
  • You can learn more at h4n.link/course

Zalando

3 roles

Engineering Manager

Dec 2021 - Aug 2023 · 1 yr 8 mos

  • Within my role as an engineering manager, I helped design, hire for and lead an experimental "Embedded SRE" team. This team was tasked with improving the reliability of the "Transactional Experience", a business section comprising the user journey between "add to cart" and the order reaching the warehouse. This crossed 4 business units, tens of teams, over 100 engineers, product colleagues, stakeholders, etc. I drafted multiple strategies, reviewed them yearly and adjusted them based on the organisation's challenges and the general economic environment. I delivered multiple projects around limiting the impact of "sneaker-bots"⁴, improving reliability through service level objectives on critical business operations², and modelling risk as a "risk register" and Golang-based lifecycle tooling.
  • In a parallel role on the "Technical and Operational Readiness" (CyberWeek) preparation project, I helped enable a colleague to lead the "Operational Readiness", or technical risk management, workstream. Toward the end of this program, I wrote guidance designed to prepare engineers for being on-call during this period (as well as more generally; see public version¹) and ran the "situation room" as the leader on duty.
  • Within my role as incident commander, I responded to multiple major production issues, leading the response or functioning as a technical expert in cloud systems, Linux or the performance of different runtimes (e.g. NodeJS, Java). Gave multiple talks on how SRE is implemented at Zalando at SLOConf², Swisscom and SRECon (with Salome Santos³).
  • Most critically, I helped with the personal development of several colleagues to build the "consulting capability" of SRE or placed them where they could be more effective as code contributors.
  • 1. https://www.andrewhowden.com/p/help-im-now-on-call
  • 2. https://www.youtube.com/watch?v=diUOjC109Mw
  • 3. https://www.usenix.org/conference/srecon22emea/presentation/howden
  • 4. https://github.com/zalando/skipper/issues/2004
Architecture · Incident Response · Project Management · Oral Communication

Principal Engineer

Promoted

Aug 2021 - May 2022 · 9 mos

  • Within my role as principal engineer, I drove the "Operational Readiness" workstream for the Technical and Operational Readiness ("TOER") project. This included helping engineers from across the organization review over 1000 systems through a self-assessment questionnaire, and using the results to guide what should be improved. It required using domain experts, "production readiness reviews", incident reviews and load testing to gather over 400 "risks" in a central place where they could be triaged and ranked, and then working with a domain "coordinator" as well as their leadership to ensure priority was given to addressing these risks before CyberWeek.
  • Established mechanisms to scale intervention through domain coordinators by creating a common communication mechanism (email + weekly meeting for discussion) that set clear expectations about where the project needed to be and what coordinators should focus on next, and worked with outlier organizations to help them get back on track and be sufficiently prepared for the week.
  • It was a project with a hard deadline, massive scope and a range of unusual challenges.
Architecture · Communication · Incident Response · Project Management · Oral Communication

SRE

Oct 2019 - Aug 2021 · 1 yr 10 mos

  • Collaborated with a team to improve the reliability of a service that segmented customers based on various factors and was heavily used in the critical path. Produced multiple guides on a range of topics: an on-call video (viewed 540 times internally), guidance on how to review postmortems along with corresponding enablement for the principal engineering community, and a video on "observability" (viewed 1300+ times internally). Worked with colleagues to produce a second video on distributed tracing (viewed 800+ times internally). Wrote guidance on how to assess a given production system for risks, as well as how to structure an intervention designed to improve that system.
Architecture · Communication · Video Production · Incident Response · Oral Communication

Global Fashion Group

Senior Software Engineer

Apr 2019 - Sep 2019 · 5 mos · Berlin Area, Germany

Sitewards GmbH

Software Engineer

Oct 2016 - Apr 2019 · 2 yrs 6 mos · Frankfurt Am Main Area, Germany

  • Sitewards is a provider of primarily PHP eCommerce application services for German middle to large enterprises.
  • As part of my employment there I was responsible for the design, development and release of several eCommerce projects. These projects varied depending on the customer requirements, but included:
  • The migration of a store to AWS, the subsequent redesign of its frontend components and the implementation of several unique business requirements.
  • The design and maintenance of several data processing services (imports)
  • The ongoing maintenance of fairly highly trafficked agricultural services
  • The design of monitoring and introspection infrastructure across a fleet of ~30 machines on various clients
  • The design and implementation of CI/CD infrastructure
  • The design of infrastructure as code definitions in Ansible and Kubernetes

Fontis

PHP Developer

Sep 2014 - Oct 2016 · 2 yrs 1 mo

  • Fontis is a provider of Magento services to larger, multi-million dollar eCommerce services in Australia.
  • My role was the development, implementation and maintenance of all kinds of both user interface and business logic features. Later in my stay I began to be responsible for some parts of infrastructure management, and helped the company do its initial investigation into Kubernetes.
  • My achievements include:
  • Primarily responsible for several major software migrations, site redesigns and feature developments such as the GAZ MAN redesign, Bing Lee checkout redesign and MageAudit.
  • Developed a structure for deploying applications onto Kubernetes continuously, including numerous safety checks.
  • Developed a structure for automatically checking the style and quality of code as part of the standard review process.
  • Developed provisioning tools that create and manage a Kubernetes cluster that tolerates failure of an application, machine or network partition.
  • Developed a template for a JavaScript module that can be loaded by dependency management systems asynchronously.
  • Contributed to open source projects such as Ecomdev_CheckItOut, Boilr and Modd.
Incident Response

Shop at Pty Ltd

Technical Consultant

May 2014 - Sep 2014 · 4 mos

  • Shop@ did not meet its employment contract obligations.

Education

Self-taught

Jan 2008 - Jan 2024
