Successful DevOps Implementation for Small Teams: A True Story

I have published the slides from my talk about implementing DevOps practices at Cosmose.

The talk outlines some lessons learned.

Transcript:

SUCCESSFUL DEVOPS IMPLEMENTATION FOR SMALL TEAMS: A TRUE STORY

What is DevOps?

Definition

“A set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality”

Bass, Len; Weber, Ingo; Zhu, Liming. DevOps: A Software Architect’s Perspective. ISBN 978-0134049847.

It’s not a new thing: it is already established in the industry, and tons of job offerings confirm that

  • automation, automation, automation

  • containers eeeeverywheeereee

  • The Cloud (i.e. someone else’s computer)

Simple test: “are you DevOps level over 9000?”

  • your answer to “how many servers do you have?” is “I have to check..”

  • you do multiple production deployments each day

  • your dev team can create a new (micro)service along with all supporting components without filing a single ticket for the ops team

  • you can terminate any random instance in your infrastructure and the environment will self-heal

  • .. but let’s not even start with security related topics

How not to do “DevOps” - DON’TS

  • Post a job description for “DevOps Engineer” and hire a few

  • Put them on call

  • Push developers away from interacting directly with the environment

Effect?

Apart from low velocity and quality you will get these:

  • “Hey, can you send me logs from my service?”

  • “Heey, can you purge Redis for me on staging?”

  • “Heeey, I clicked deploy on Jenkins and it’s stuck, HALP”

  • “Heeeeeeeeeeeeeey….”

Let’s try another approach - DO’S

  • Do enable developers

  • Streamline deployment process

  • Streamline infrastructure management

  • Guide, advise, discuss

  • Hide complexity, but not too much

  • Treat yourself as a service provider - deliver products not tickets

BUT HOW ?!

It’s OK to hire a DevOps engineer - brings experience and specialized focus

  • Communication skills are super important here

  • Tech requirements: good *nix skills, good Google skills and a sixth sense for sniffing out bad practices

  • Probably the first person to handle Security in your new startup

Starting point

The Bad
  • Production environment: two servers, dozen microservices

  • Everything spun up manually through the AWS Console

  • Deployment meant ssh’ing to a server, downloading the new Docker image, stop+start (incurring downtime)

  • Monitoring? Just CloudWatch logs

The Good
  • Spring Boot + Spring Cloud (Netflix)

  • Dockerized, built on Jenkins

  • Configured via environment variables

  • Stateless

  • Use of AWS

  • Use of managed services

  • Most important thing: competent development team, eager to innovate 🚀 ✅

Kubernetes - Fixing error prone deployments

  • batteries-included approach

  • documentation
    ○ courses, FAQs, examples

  • popular

  • reasonably sane
    ○ apart from the millicores concept and a few others ;-)

  • Lots of progress in the past ~2 years
    ○ stable
    ○ reliable
    ○ lots of know-how
    ○ lots of lessons learned
    ○ powerful CLI

Kubernetes cont’d

  • Helm

  • Spinnaker

  • Jenkins integrations

  • Operators for complex deployments

  • Monitoring stack

  • Cloud offerings (GKE, EKS, Azure)

  • Tons of tools on top of it

Still, not a silver bullet

YAAAML 😱

  • 200-400 lines of YAML to describe a service..

  • Secrets management..

  • Even with Helm, deployment is a complex command

  • Taints, tolerations, affinity, heap vs total memory, exposing ports, scraping metrics .. and keeping it all consistent across a multitude of services

  • Tooling versioning
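To make the "millicores" and "heap vs total memory" items concrete, here is a minimal Python sketch of the arithmetic involved. The helper names are hypothetical, and the 75% heap fraction is only an assumed rule of thumb, not a value from the talk; the point is that the JVM heap has to be smaller than the container memory limit, and that CPU is expressed in thousandths of a core.

```python
import re

# Binary suffixes used by Kubernetes memory quantities (subset, for illustration).
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "250m", "0.5" or "2" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1Gi" to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)?", quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, unit = match.groups()
    return int(number) * _MEMORY_UNITS.get(unit, 1)


def jvm_heap_mib(memory_limit: str, heap_fraction: float = 0.75) -> int:
    """Heap size in MiB, leaving the rest of the limit for non-heap memory.

    heap_fraction is an assumption made for this sketch; tune it per service.
    """
    return int(memory_to_bytes(memory_limit) * heap_fraction) // 2**20
```

With a single template for all services, a calculation like this can live in one place instead of being repeated (and getting out of sync) in every service's YAML.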

Re: hide complexity, but not too much

  • Jenkins deployment job is nice and all, up until it stops working

  • How can you expect proficiency with Kubernetes / kubectl if all developers ever do is press a Run button?

  • Enable them by making it easy to use CLI tools
    ○ Prepare Helm, helm-secrets, helm-diff, all along with binaries, configs and a ./setup.sh script for easy installation
    ○ Create one template for all services, supporting the most common configuration
    ○ Add yet another abstraction layer for the most common tasks

Demo: qp

~200 lines of Bash script as an abstraction layer on top of Helm

[Slide image: “after” vs. “before” comparison]
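The qp script itself is not reproduced in this transcript, so the following is only a sketch of what such an abstraction layer might do, written in Python rather than Bash. The function name, the values-file layout, the chart path and the `image.tag` key are all hypothetical; the Helm flags (`upgrade --install`, `--namespace`, `--values`, `--set`) are standard, and `helm diff upgrade` assumes the helm-diff plugin is installed.

```python
import shlex
from typing import List


def helm_deploy_command(
    service: str,
    environment: str,
    image_tag: str,
    chart: str = "./charts/service",  # hypothetical shared chart for all services
    diff_only: bool = False,
) -> List[str]:
    """Build the Helm command line for deploying one service.

    With diff_only=True the helm-diff plugin previews the change instead of applying it.
    """
    # Hypothetical convention: one values file per environment and service.
    values_file = f"values/{environment}/{service}.yaml"
    action = ["helm", "diff", "upgrade"] if diff_only else ["helm", "upgrade"]
    return action + [
        "--install", service, chart,
        "--namespace", environment,
        "--values", values_file,
        "--set", f"image.tag={image_tag}",  # assumes the chart reads image.tag
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the developer sees what happens.
    print(shlex.join(helm_deploy_command("orders", "staging", "1.4.2", diff_only=True)))
```

Printing the full command rather than hiding it keeps with "hide complexity, but not too much": the short form is convenient, and the long form stays visible for anyone who needs to debug it.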

DevOps == Collaboration

  • Example: monitor performance of all microservices
    ○ Example stack: Prometheus via Prometheus Operator
    ○ Add ServiceMonitor objects to each deployment

  • New application<->platform contract emerged: just expose Prometheus metrics on port N and you will see your service graphs in Grafana
    ○ Developers are responsible for adjusting their services to obey the new contract and for making domain-specific dashboards

  • Good tools helped here: Kubernetes made it easy to deploy the stack, Spring framework made it easy to expose metrics
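The application side of that contract is small. In the talk the services were Spring applications, which expose metrics through the framework; the sketch below uses only the Python standard library to show what "expose Prometheus metrics on port N" amounts to on the wire. The metric name, the `/metrics` path handling and the port number are assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Dict

# Hypothetical in-process counters; a real service would update these as it works.
COUNTERS: Dict[str, int] = {"http_requests_total": 0}


def render_metrics(counters: Dict[str, int]) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(COUNTERS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8081) -> None:
    """Serve /metrics on the agreed port (8081 is an arbitrary choice here)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the endpoint; the platform side then only needs a ServiceMonitor that points Prometheus at that port.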

Infrastructure as Code - Terraform + Atlantis

  • Git-versioned infrastructure

  • Migrate/Move or import existing resources

  • Set up Atlantis for audited and peer-reviewed infrastructure changes

  • Use the same tools to detect state drift (changes that were made outside of the Atlantis flow)

  • Optionally remove user permissions so that changes must go through Pull Requests
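One possible way to detect drift with the same tooling is to run `terraform plan -detailed-exitcode` on a schedule: Terraform exits with 0 when the plan is empty, 2 when there are changes, and 1 on error. The sketch below is an assumption about how such a check could be wired up, not a description of the setup used at Cosmose; the function names and the directory argument are hypothetical.

```python
from typing import List


def drift_check_command(directory: str) -> List[str]:
    """Build a read-only plan command for one Terraform root module."""
    return [
        "terraform",
        f"-chdir={directory}",
        "plan",
        "-detailed-exitcode",  # 0 = no changes, 1 = error, 2 = changes present
        "-input=false",        # never prompt, this runs unattended
        "-lock=false",         # do not block people working through pull requests
    ]


def classify_plan_exit_code(code: int) -> str:
    """Translate the exit code of `terraform plan -detailed-exitcode`."""
    if code == 0:
        return "in-sync"
    if code == 2:
        return "drift"
    return "error"
```

A scheduled job could pass the command to `subprocess.run` and alert whenever the result is "drift", meaning someone changed the infrastructure outside the pull request flow.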

Terraform - Declarative infrastructure management

  • Define AWS resources
    ○ Readable syntax
    ○ Combine multiple resources into a reusable module

  • Plan
    ○ Compare definition with current state
    ○ Display detailed changeset

  • Apply
    ○ Make changes to infrastructure
    ○ Record state

  • Team-workflow supported
    ○ State in AWS S3
    ○ Locks in DynamoDB
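The plan step is easier to reason about with a toy model. The following Python sketch is not Terraform's actual algorithm (which also handles dependencies, providers and computed attributes); it only illustrates the idea of comparing a desired definition with recorded state and producing a changeset. Resource addresses and attributes are made up for the example.

```python
from typing import Dict, List, Tuple

# Resource address -> attributes, e.g. {"aws_instance.web": {"type": "t3.small"}}
Resources = Dict[str, Dict[str, object]]


def plan(desired: Resources, current: Resources) -> List[Tuple[str, str]]:
    """Return (action, address) pairs needed to move current state to desired."""
    changes = []
    for address in sorted(desired.keys() | current.keys()):
        if address not in current:
            changes.append(("create", address))
        elif address not in desired:
            changes.append(("destroy", address))
        elif desired[address] != current[address]:
            changes.append(("update", address))
    return changes


if __name__ == "__main__":
    desired = {
        "aws_instance.web": {"type": "t3.small"},
        "aws_s3_bucket.logs": {"acl": "private"},
    }
    current = {
        "aws_instance.web": {"type": "t3.micro"},
        "aws_sqs_queue.old": {},
    }
    for action, address in plan(desired, current):
        print(f"{action:8} {address}")
```

Apply would then execute those actions and record the new state, which is why shared state storage and locking matter as soon as more than one person runs it.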

Atlantis - Pull Requests for infrastructure

  1. GitHub hook on each Pull Request to terraform repo
  2. Additional layer of locking so no other PR can touch the same parts of infrastructure
  3. Autoplan: show plan preview in PR comments
  4. Review & Approve Pull Request
  5. Apply changes
  6. Remove locks and merge
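The extra locking layer in step 2 can be pictured as a table keyed by the part of the infrastructure being changed. This Python sketch is a simplified model, assuming locks are held per project directory and workspace and owned by a pull request number; the class and method names are hypothetical and do not mirror Atlantis internals.

```python
from typing import Dict, Tuple


class LockTable:
    """Toy model of per-project locks held by pull requests."""

    def __init__(self) -> None:
        self._owners: Dict[Tuple[str, str], int] = {}

    def acquire(self, project: str, workspace: str, pull_request: int) -> bool:
        """Lock (project, workspace) for a PR; False if another PR holds it."""
        owner = self._owners.setdefault((project, workspace), pull_request)
        return owner == pull_request

    def release_all(self, pull_request: int) -> None:
        """Drop every lock held by a PR, e.g. after it is merged or closed."""
        self._owners = {
            key: owner for key, owner in self._owners.items() if owner != pull_request
        }


if __name__ == "__main__":
    locks = LockTable()
    print(locks.acquire("network", "default", 101))  # first PR plans and takes the lock
    print(locks.acquire("network", "default", 102))  # second PR must wait
    locks.release_all(101)                           # step 6: merge removes the locks
    print(locks.acquire("network", "default", 102))  # now the second PR can plan
```

The effect is that a plan shown in one pull request cannot be invalidated by another pull request applying changes to the same directory in the meantime.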

Demo?

If time permits ;-) If it doesn’t: shout-out to my friend Szymon W., who wrote a nice blog post about introducing Terraform and Atlantis across a whole company: https://lab.getbase.com/terraform-base/

Does it work?

From my own experience

Cosmose
  • One “DevOps engineer”, seven contributors to the Terraform repo within a month, eleven now

  • 10+ production deployments per day

  • 3x more microservices since I joined (~6 months)

  • Infrastructure autoscaled 10x one time, when a dev wanted to “speed up his processing task” ;-)

Base / Zendesk Sell
  • Around 8 Ops and 42 (!) contributors to the Terraform repo

  • 30-50 deployments to prod daily

  • High level of ownership in dev teams, including expertise in running databases (e.g. Elasticsearch, MySQL) and building their own infrastructure stacks (QA Kubernetes)

It works!

Thanks!

Jakub P. Głazik
zytek@nuxi.pl
github.com/zytek

Questions are more than welcome
