Andy Howden

Director of Engineering

Berlin, Berlin, Germany · 11 yrs 9 mos experience

Most Likely To Switch: Highly Stable

Key Highlights

Led teams to enhance reliability across large organizations.
Developed self-service products for thousands of developers.
Expert in incident response and operational readiness.

Stackforce AI infers this person is a Senior Engineering Leader in E-commerce with a focus on reliability and operational excellence.

Skills

Core Skills

Architecture, Incident Response, Project Management

Other Skills

Oral Communication, Communication, Video Production, Kubernetes, Ansible, Prometheus, Web Development, Web Services, Virtualization, Software Development, JavaScript, Linux, MySQL, CSS, HTML

About

Staff engineer or engineering manager with a history of delivering critical projects that span hundreds of engineers. I collaborate to develop products that improve reliability as efficiently as possible across a large organization, work directly with teams to address a given critical reliability challenge, and work with senior stakeholders to understand and set organizational tradeoffs around reliability, on-call health, velocity and cost.

Experience

Total Experience: 11 yrs 9 mos

Average Tenure: 1 yr 8 mos

Current Experience: 2 yrs 2 mos

Delivery Hero

2 roles

Engineering Director (Acting)

Promoted

Mar 2025 – Present · 1 yr 2 mos · Hybrid

Welp I landed in Management again.
Leading a team of approximately 30 people, broken down into the "Leadership" team (Staff+ engineers and management), Incident Detection & Response, Observability, Resilience Engineering and two Embedded SRE teams.
Focused on building an organization that enables thousands of developers at Delivery Hero to improve their reliability through self-service products, enablement or collaboration with other leaders in the organization. Doing that by providing some structure to the product and engineering work that is delivered, while building fairly autonomous teams that deeply understand the customer challenges, the problem domain and the customer experience, and that find ways to improve that experience within their scope.

Architecture, Incident Response, Project Management, Oral Communication

Principal Engineer

Feb 2024 – Feb 2025 · 1 yr · Hybrid

Joining the Site Reliability Engineering team, helping reduce the impact of incidents while supporting developers across the Delivery Hero group.

Career break

Personal goal pursuit

Aug 2023 – Jan 2024 · 5 mos · Berlin, Berlin

For the next few months, I will be dedicated to creating a course to help mid-, senior- or principal-level engineers grow their understanding of production systems and their ability to deploy and manage software, understand the social, financial, business or health costs of their operations, and work to improve reliability.
You can learn more at the URL h4n.link/course

Zalando

3 roles

Engineering Manager

Dec 2021 – Aug 2023 · 1 yr 8 mos

Within my role as an engineering manager, I helped design, hire for and lead an experimental "Embedded SRE" team. This team was tasked with improving the reliability of the "Transactional Experience", a business section comprising the user journey between "add to cart" and the order reaching the warehouse. This crossed 4 business units, tens of teams and over 100 engineers, product colleagues, stakeholders, etc. I drafted multiple strategies, which were reviewed yearly and adjusted based on the organisation's challenges and the general economic environment. I delivered multiple projects around limiting the impact of "sneaker-bots"⁴, improving reliability through service level objectives on critical business operations², and modelling risk as a "risk register" with Golang-based lifecycle tooling.
In a parallel role on the "Technical and Operational Readiness" (CyberWeek) preparation project, I helped enable a colleague to lead the "Operational Readiness" (technical risk management) workstream. Toward the end of this program, I wrote guidance designed to prepare engineers for being on-call during this period (as well as more generally; see public version¹) and ran the "situation room" as the leader on duty.
Within my role as incident commander, I responded to multiple major production issues, leading the response or functioning as a technical expert in cloud systems, Linux or the performance of different runtimes (e.g. NodeJS, Java). I gave multiple talks on how SRE is implemented at Zalando at SLOConf², Swisscom and SRECon (with Salome Santos³).
Most critically, I helped with the personal development of several colleagues to build the "consulting capability" of SRE or placed them where they could be more effective as code contributors.
1. https://www.andrewhowden.com/p/help-im-now-on-call
2. https://www.youtube.com/watch?v=diUOjC109Mw
3. https://www.usenix.org/conference/srecon22emea/presentation/howden
4. https://github.com/zalando/skipper/issues/2004

Architecture, Incident Response, Project Management, Oral Communication

Principal Engineer

Promoted

Aug 2021 – May 2022 · 9 mos

Within my role as principal engineer, I drove the "Operational Readiness" workstream for the Technical and Operational Readiness ("TOER") project. This included helping engineers from across the organization review over 1000 systems through a self-assessment questionnaire, and using the results to guide what should be improved. It required using domain experts, "production readiness reviews", incident reviews and load testing to gather over 400 "risks" in a central place where they could be triaged and ranked, and then working with each domain "coordinator" as well as their leadership to ensure these risks were prioritized and addressed before CyberWeek.
I established mechanisms to scale intervention through domain coordinators by creating a common communication mechanism (email plus a weekly meeting for discussion) that set clear expectations about where the project needed to be and what coordinators should focus on next. I also worked with outlier organizations to help them get back on track and be sufficiently prepared for the week.
It was a project with a hard deadline, massive scope and a range of unusual challenges.

Architecture, Communication, Incident Response, Project Management, Oral Communication

SRE

Oct 2019 – Aug 2021 · 1 yr 10 mos

Collaborated with a team to improve the reliability of a service that segmented customers based on various factors and was heavily used in the critical path. Produced multiple guides on a range of topics, including an on-call video (viewed 540 times internally), guidance on how to review postmortems with corresponding enablement for the principal engineering community, and a video on "observability" (viewed 1300+ times internally), and worked with colleagues to produce a second video on distributed tracing (viewed 800+ times internally). Wrote guidance on how to assess a given production system for risks, as well as how to structure an intervention designed to improve that system.

Architecture, Communication, Video Production, Incident Response, Oral Communication

Global Fashion Group

Senior Software Engineer

Apr 2019 – Sep 2019 · 5 mos · Berlin Area, Germany

Sitewards GmbH

Software Engineer

Oct 2016 – Apr 2019 · 2 yrs 6 mos · Frankfurt Am Main Area, Germany

Sitewards is a provider of primarily PHP eCommerce application services for mid-sized to large German enterprises.
As part of my employment there I was responsible for the design, development and release of several eCommerce projects. These projects varied depending on the customer requirements, but included:
The migration of a store to AWS, the subsequent redesign of its frontend components and the implementation of several unique business requirements.
The design and maintenance of several data processing services (imports)
The ongoing maintenance of fairly highly trafficked agricultural services
The design of monitoring and introspection infrastructure across a fleet of ~30 machines on various clients
The design and implementation of CI/CD infrastructure
The design of infrastructure as code definitions in Ansible and Kubernetes

Fontis

PHP Developer

Sep 2014 – Oct 2016 · 2 yrs 1 mo

Fontis is a provider of Magento services to larger, multi-million dollar eCommerce services in Australia.
My role was the development, implementation and maintenance of a wide range of user interface and business logic features. Later in my tenure I took on responsibility for some parts of infrastructure management, and helped the company do its initial investigation into Kubernetes.
My achievements include:
Primarily responsible for several major software migrations, site redesigns and feature developments such as the GAZ MAN redesign, Bing Lee checkout redesign and MageAudit.
Developed a structure for deploying applications onto Kubernetes continuously, including numerous safety checks.
Developed a structure for automatically checking the style and quality of code as part of the standard review process.
Developed provisioning tools that create and manage a Kubernetes cluster that tolerates the failure of an application or machine, or a network partition.
Developed a template for a JavaScript module that can be loaded by dependency management systems asynchronously.
Contributed to open source projects such as Ecomdev_CheckItOut, Boilr and Modd.

Incident Response