Successful DevOps Implementation for Small Teams

I have published the slides from my talk about implementing DevOps practices at Cosmose.

The talk outlines some lessons learned.

Transcript:

SUCCESSFUL DEVOPS IMPLEMENTATION FOR SMALL TEAMS: A TRUE STORY

To make error is human. To propagate error to all server in automatic way is #devops.
— DevOps Borat (@DEVOPS_BORAT) February 26, 2011

What is DevOps?

Definition

“A set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality”

— Bass, Len; Weber, Ingo; Zhu, Liming. DevOps: A Software Architect’s Perspective. ISBN 978-0134049847.

It’s not a new thing: it is already established in the industry, and tons of job offerings confirm that

automation, automation, automation
containers eeeeverywheeereee
The Cloud (i.e. someone else’s computer)

Simple test: “are you DevOps level over 9000?”

your answer for “how many servers do you have?” is “I have to check..”
you do multiple production deployments each day
your dev team can create a new (micro)service along with all supporting components without filing any ticket with the ops team
you can terminate any random instance in your infrastructure and the environment will self-heal
.. but let’s not even start with security related topics

How not to do “DevOps” - DON’TS

Post a job description for “DevOps Engineer” and hire a few
Put them on call
Push developers away from interacting directly with the environment

Effect?

Apart from low velocity and low quality, you will get these:

“Hey, can you send me logs from my service?”
“Heey, can you purge Redis for me on staging?”
“Heeey, I clicked deploy on Jenkins and it’s stuck, HALP”
“Heeeeeeeeeeeeeey….”

Let’s try another approach - DO’S

Do enable developers
Streamline deployment process
Streamline infrastructure management
Guide, advise, discuss
Hide complexity, but not too much
Treat yourself as a service provider - deliver products not tickets

BUT HOW ?!

It’s OK to hire a DevOps engineer - they bring experience and a specialized focus

Communication skills are super important here
Tech requirements: good *nix skills, good Google skills, and a sixth sense for sniffing out bad practices
Probably the first person to handle security in your new startup

Starting point

The Bad

Production environment: two servers, a dozen microservices
Everything spun up manually through the AWS Console
Deployment meant ssh’ing to a server, pulling the new Docker image, then stop + start (incurring downtime)
Monitoring? Just CloudWatch logs

The Good

Spring Boot + Spring Cloud (Netflix)
Dockerized, built on Jenkins
Configured via environment variables
Stateless
Use of AWS
Use of managed services
Most important thing: competent development team, eager to innovate 🚀 ✅
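
Why "configured via environment variables" and "stateless" count as good news: the same image can run anywhere, and the platform only has to inject a handful of variables. The services in this talk are Spring Boot, so the following is only a minimal Python sketch of the idea. The variable names (DB_URL, HTTP_PORT, LOG_LEVEL) and the ServiceConfig type are hypothetical.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    db_url: str
    http_port: int
    log_level: str


def load_config(env=None):
    """Build the service configuration from environment variables only."""
    env = os.environ if env is None else env
    # Hypothetical variable names, chosen for illustration.
    if "DB_URL" not in env:
        raise RuntimeError("DB_URL must be set")
    return ServiceConfig(
        db_url=env["DB_URL"],
        http_port=int(env.get("HTTP_PORT", "8080")),
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Failing fast on a missing required variable means a misconfigured container dies at startup instead of misbehaving later.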

Kubernetes - Fixing error-prone deployments

batteries-included approach
documentation
  ○ courses, FAQs, examples
popular
reasonably sane
  ○ apart from the millicores concept and a few others ;-)
lots of progress in the past ~2 years
  ○ stable
  ○ reliable
  ○ lots of know-how
  ○ lots of lessons learned
  ○ powerful CLI
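
About those millicores: Kubernetes expresses CPU in thousandths of a core, so "500m" means half a CPU and "1000m" means one. A tiny sketch of the conversion, just to demystify the notation (the function name is made up for this example):

```python
def cpu_to_cores(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity such as "250m" or "2" to cores."""
    quantity = quantity.strip()
    if quantity.endswith("m"):
        # The "m" suffix means millicores: 1000m equals one CPU.
        return int(quantity[:-1]) / 1000
    return float(quantity)
```

For example, cpu_to_cores("250m") gives a quarter of a core, and cpu_to_cores("2") gives two full cores.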

Kubernetes cont’d

Helm
Spinnaker
Jenkins integrations
Operators for complex deployments
Monitoring stack
Cloud offerings (GKE, EKS, Azure)
Tons of tools on top of it

Still, not a silver bullet

YAAAML 😱

200-400 lines of YAML to describe a service..
Secrets management..
Even with Helm, deployment is a complex command
Taints, tolerations, affinity, heap vs total memory, exposing ports, scraping metrics .. and keeping it all consistent across a multitude of services
Tooling versioning
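
On "heap vs total memory": the JVM heap is only part of what the process uses, so the container memory limit has to be larger than the heap setting. A rough Python sketch of deriving one from the other follows. The 0.75 fraction is an assumption for illustration, not a figure from the talk, and the function name is hypothetical.

```python
def jvm_heap_flag(container_limit_mib: int, heap_fraction: float = 0.75) -> str:
    """Derive a -Xmx flag that leaves headroom below the container limit."""
    if not 0 < heap_fraction < 1:
        raise ValueError("heap_fraction must be between 0 and 1")
    # The remainder is left for non-heap memory (metaspace, thread stacks, ...).
    heap_mib = int(container_limit_mib * heap_fraction)
    return f"-Xmx{heap_mib}m"
```

Keeping this calculation in one shared place is one way to stay consistent across many services instead of hand-tuning each manifest.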

Re: hide complexity, but not too much

A Jenkins deployment job is nice and all, right up until it stops working
How can you expect proficiency with Kubernetes / kubectl if all developers ever do is press a Run button?
Enable them by making it easy to use CLI tools
  ○ Prepare Helm, helm-secrets, helm-diff, all along with binaries, configs and a ./setup.sh script for easy installation
  ○ Create one template for all services, supporting the most common configuration
  ○ Add yet another abstraction layer for the most common tasks
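
The "one template for all services" idea, sketched in Python rather than as the Helm chart used in the talk: a service supplies a few values and the template fills in the repetitive parts of the manifest. The function and its parameters are hypothetical, and a real template would cover far more (probes, resources, secrets).

```python
def render_deployment(name, image, port, replicas=2, env=None):
    """Return a Kubernetes Deployment manifest as a plain dict."""
    labels = {"app": name}
    container = {
        "name": name,
        "image": image,
        "ports": [{"name": "http", "containerPort": port}],
        # Sorted so the rendered manifest is stable between runs.
        "env": [{"name": k, "value": str(v)}
                for k, v in sorted((env or {}).items())],
    }
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {"containers": [container]},
            },
        },
    }
```

Dumping the result with json.dumps gives something kubectl can apply, since it accepts JSON as well as YAML. The point is that the selector and pod labels can no longer drift apart, because nobody types them by hand.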

Demo: qp, ~200 lines of Bash script acting as an abstraction layer on top of Helm
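
The real qp is a Bash script and is not reproduced here. As a rough idea of what such a layer does, here is a Python stand-in that turns a short request into the full Helm command line. The release naming, the values file layout and the chart path are all assumptions made for this sketch; the "diff" action relies on the helm-diff plugin being installed.

```python
def helm_command(action, service, environment, chart="./chart"):
    """Build the Helm argv for a short `deploy` or `diff` request."""
    # Hypothetical conventions: release "<service>-<env>", one namespace per env.
    release = f"{service}-{environment}"
    values = f"values/{service}/{environment}.yaml"
    common = [release, chart, "--namespace", environment, "--values", values]
    if action == "deploy":
        return ["helm", "upgrade", "--install", *common]
    if action == "diff":
        # Subcommand provided by the helm-diff plugin.
        return ["helm", "diff", "upgrade", *common]
    raise ValueError(f"unknown action: {action}")
```

The list can be passed straight to subprocess.run, or printed with shlex.join so developers still see the underlying Helm command and learn it along the way.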

[Slide image: before / after comparison]

DevOps == Collaboration

Example: monitor performance of all microservices
  ○ Example stack: Prometheus via Prometheus Operator
  ○ Add ServiceMonitor objects to each deployment
A new application<->platform contract emerged: just expose Prometheus metrics on port N and you will see your service graphs in Grafana
  ○ Developers are responsible for adjusting their services to obey the new contract and for making domain-specific dashboards
Good tools helped here: Kubernetes made it easy to deploy the stack, and the Spring framework made it easy to expose metrics
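
What the platform side of that contract can look like: one ServiceMonitor per service, telling the Prometheus Operator which port to scrape. The sketch below builds such an object as a Python dict. The port name "metrics", the 30s interval and the /actuator/prometheus path (the usual Spring Boot Actuator endpoint) are assumptions, since the talk does not state the actual values.

```python
def service_monitor(name, port_name="metrics",
                    path="/actuator/prometheus", interval="30s"):
    """Return a Prometheus Operator ServiceMonitor for one service."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "labels": {"app": name}},
        "spec": {
            # Select the Service object that carries the same app label.
            "selector": {"matchLabels": {"app": name}},
            # "port" refers to a named port on that Service.
            "endpoints": [
                {"port": port_name, "path": path, "interval": interval}
            ],
        },
    }
```

If this lives in the shared template, a developer who exposes metrics on the agreed port gets scraping without writing any monitoring config.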

Infrastructure as Code - Terraform + Atlantis

Git-versioned infrastructure
Migrate/Move or import existing resources
Set up Atlantis for audited and peer-reviewed infrastructure changes
Use the same tools to detect state drift (changes that were made outside of the Atlantis flow)
Optionally remove user permissions so that changes must go through Pull Requests
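
One possible way to detect drift with the same tooling, assuming a scheduled job that runs a plan against each directory: terraform plan with -detailed-exitcode exits with 0 when nothing would change, 2 when there is a diff, and 1 on error. The Python wrapper and its function names are hypothetical; how the talk's setup actually did this is not shown in the slides.

```python
import subprocess


def classify_plan_exit_code(code: int) -> str:
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"   # plan succeeded, no changes
    if code == 2:
        return "drift"     # plan succeeded, changes are pending
    return "error"         # plan itself failed


def check_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report whether state has drifted."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-lock=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

A "drift" result on the main branch means someone changed something by hand, or merged without applying.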

Terraform - Declarative infrastructure management

Define AWS resources
  ○ Readable syntax
  ○ Combine multiple resources into a reusable module
Plan
  ○ Compare definition with current state
  ○ Display detailed changeset
Apply
  ○ Make changes to infrastructure
  ○ Record state
Team workflow supported
  ○ State in AWS S3
  ○ Locks in DynamoDB
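
The plan step is easier to explain with a toy model. The sketch below is not how Terraform is implemented; it only illustrates "compare definition with current state, display changeset" using plain dicts keyed by resource address. The resource addresses in the usage note are made up.

```python
def plan(desired: dict, current: dict) -> dict:
    """Compare desired definitions with recorded state and list the changes."""
    return {
        # Defined in code but missing from state.
        "create": sorted(desired.keys() - current.keys()),
        # Present in both, but with different attributes.
        "update": sorted(
            k for k in desired.keys() & current.keys() if desired[k] != current[k]
        ),
        # Recorded in state but no longer defined.
        "delete": sorted(current.keys() - desired.keys()),
    }
```

Given a definition containing aws_s3_bucket.logs and a resized aws_instance.web, and a state that still holds aws_sqs_queue.old, the changeset lists one create, one update and one delete. Apply then makes those changes and records the new state.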

Atlantis - Pull Requests for infrastructure

GitHub hook on each Pull Request to the Terraform repo
Additional layer of locking so no other PR can touch the same parts of infrastructure
Autoplan: show plan preview in PR comments
Review & Approve Pull Request
Apply changes
Remove locks and merge
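
The extra locking layer can be pictured with a toy model: the first Pull Request to plan a given part of the infrastructure holds it until that PR is merged or closed. This is a simplified Python illustration, not Atlantis code; the class name and the project names are hypothetical.

```python
class ProjectLocks:
    """Toy model: one lock per project directory, held by a PR number."""

    def __init__(self):
        self._holders = {}

    def acquire(self, project: str, pr: int) -> bool:
        """Lock the project for this PR; succeed if it is free or already ours."""
        holder = self._holders.setdefault(project, pr)
        return holder == pr

    def release_all(self, pr: int) -> None:
        """Drop every lock held by a PR, as happens on merge or close."""
        self._holders = {p: h for p, h in self._holders.items() if h != pr}
```

So if PR 12 has planned "network", PR 15 is refused on "network" but can still work on "eks", and it gets "network" once PR 12 is merged.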

Demo?

If time permits ;-)
If time doesn’t permit: a shout-out to my friend Szymon W., who wrote a nice blog post about introducing Terraform and Atlantis across a whole company: https://lab.getbase.com/terraform-base/

Does it work?

From my own experience

Cosmose

One “devops engineer”, seven contributors to the Terraform repo within a month, eleven now
10+ production deployments per day
3x more microservices since I joined (~6 months)
Infrastructure autoscaled 10x one time, when a dev wanted to “speed up his processing task” ;-)

Base / Zendesk Sell

Around 8 Ops and 42 (!) contributors to the Terraform repo
30-50 deployments to prod daily
High level of ownership in dev teams, including expertise in running databases (e.g. Elasticsearch, MySQL) and building their own infrastructure stacks (QA Kubernetes)

It works!

Thanks!

Jakub P. Głazik
zytek@nuxi.pl
github.com/zytek
Questions are more than welcome

@zytek
