I have published the slides from my talk about implementing DevOps practices at Cosmose.
The talk outlines some lessons learned.
Transcript:
SUCCESSFUL DEVOPS IMPLEMENTATION FOR SMALL TEAMS: A TRUE STORY
To make error is human. To propagate error to all server in automatic way is #devops.
— DevOps Borat (@DEVOPS_BORAT) February 26, 2011
What is DevOps?
Definition
“A set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality”
— Bass, Len; Weber, Ingo; Zhu, Liming. DevOps: A Software Architect’s Perspective. ISBN 978-0134049847.
It’s not a new thing: it is already established in the industry, and tons of job offerings confirm that
-
automation, automation, automation
-
containers eeeeverywheeereee
-
The Cloud (i.e. someone else’s computer)
Simple test: “are you DevOps level over 9000?”
-
your answer for “how many servers do you have?” is “I have to check..”
-
you do multiple production deployments each day
-
your dev team can create a new (micro)service along with all supporting components without filing any ticket with the ops team
-
you can terminate any random instance in your infrastructure and the environment will self-heal
-
.. but let’s not even start with security related topics
How not to do “DevOps” - DON’TS
-
Post a job description for “DevOps Engineer” and hire a few
-
Put them on call
-
Push away developers from directly interacting with the environment
Effect?
Apart from low velocity and quality you will get these:
-
“Hey, can you send me logs from my service?”
-
“Heey, can you purge Redis for me on staging?”
-
“Heeey, I clicked deploy on Jenkins and it’s stuck, HALP”
-
“Heeeeeeeeeeeeeey….”
Let’s try another approach - DO’S
-
Do enable developers
-
Streamline deployment process
-
Streamline infrastructure management
-
Guide, advise, discuss
-
Hide complexity, but not too much
-
Treat yourself as a service provider - deliver products not tickets
BUT HOW ?!
It’s OK to hire a DevOps engineer: they bring experience and specialized focus
-
Communication skills are super important here
-
Tech requirements: good *nix skills, good Google skills and a sixth sense for sniffing out bad practices
-
Probably the first person to handle Security in your new startup
Starting point
The Bad
-
Production environment: two servers, dozen microservices
-
Everything spun up manually through the AWS Console
-
Deployment meant ssh’ing to a server, downloading a new Docker image, then stop+start (incurring downtime)
-
Monitoring? Just CloudWatch logs
The Good
-
Spring Boot + Spring Cloud (Netflix)
-
Dockerized, built on Jenkins
-
Configured via environment variables
-
Stateless
-
Use of AWS
-
Use of managed services
-
Most important thing: competent development team, eager to innovate 🚀 ✅
Kubernetes - Fixing error prone deployments
-
batteries-included approach
-
documentation ○ courses, FAQs, examples
-
popular
-
reasonably sane ○ apart from millicores and a few other concepts ;-)
-
Lots of progress in the past ~2 years ○ stable ○ reliable ○ lots of know-how ○ lots of lessons learned ○ powerful CLI
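To contrast with the manual ssh + stop/start routine from the starting point, here is a minimal sketch of what a rolling image update looks like with the CLI. The deployment, container, image and namespace names are made up for illustration, and the function only prints the commands so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Prints the kubectl commands for a rolling image update instead of running them.
# All names passed in below are hypothetical examples.
rolling_update_cmds() {
  local deployment="$1" container="$2" image="$3" namespace="${4:-default}"
  # Swap the container image; a Deployment replaces its pods gradually by default.
  echo "kubectl --namespace ${namespace} set image deployment/${deployment} ${container}=${image}"
  # Wait until the rollout finishes or fails.
  echo "kubectl --namespace ${namespace} rollout status deployment/${deployment}"
}

rolling_update_cmds orders-service app registry.example.com/orders-service:1.4.2 production
```

Because old pods are only removed as new ones become ready, this avoids the downtime that the stop+start approach incurred.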
Kubernetes cont’d
-
Helm
-
Spinnaker
-
Jenkins integrations
-
Operators for complex deployments
-
Monitoring stack
-
Cloud offerings (GKE, EKS, Azure), plus tons of tools on top of it
Still, not a silver bullet
YAAAML 😱
-
200-400 lines of YAML to describe a service..
-
Secrets management..
-
Even with Helm, deployment is a complex command
-
Taints, tolerations, affinity, heap vs total memory, exposing ports, scraping metrics .. and keeping it all consistent across a multitude of services
-
Tooling versioning
Re: hide complexity, but not too much
-
Jenkins deployment job is nice and all, up until it stops working
-
How can you expect proficiency with Kubernetes / kubectl if all developers ever do is push a Run button?
-
Enable them by making it easy to use CLI tools ○ Prepare Helm, helm-secrets, helm-diff, all along with binaries, configs and ./setup.sh script for easy installation ○ Create one template for all services, supporting most common configuration ○ Add yet another abstraction layer for most common tasks
Demo: qp ○ ~200 lines of Bash script as an abstraction layer on top of Helm
(Slide screenshots: after / before)
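The qp script itself is not reproduced in this transcript. As a rough idea of what such an abstraction layer can look like, here is a hypothetical sketch, not the actual qp. It assumes one shared chart in ./chart and per-service values files under services/<name>/<env>.yaml, uses the environment name as the namespace, and leaves out secrets handling. It prints the Helm commands rather than running them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper in the spirit of the demo; NOT the actual qp script.
# Assumed layout: shared chart in ./chart, values in services/<service>/<env>.yaml.

# Arguments shared by every Helm call for a given service and environment.
helm_args() {
  local service="$1" env="$2"
  echo "${service} ./chart --namespace ${env} -f services/${service}/${env}.yaml"
}

# Short verbs for the most common tasks; prints the command it would run.
svc() {
  local action="$1" service="$2" env="$3"
  case "$action" in
    diff)   echo "helm diff upgrade $(helm_args "$service" "$env")" ;;
    deploy) echo "helm upgrade --install $(helm_args "$service" "$env")" ;;
    *)      echo "usage: svc {diff|deploy} <service> <env>" >&2; return 1 ;;
  esac
}

svc diff orders-service staging
svc deploy orders-service staging
```

Printing the underlying command keeps the complexity visible: a developer can copy the Helm invocation, tweak it and run it by hand when the wrapper is not enough.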
DevOps == Collaboration
-
Example: monitor performance of all microservices ○ Example stack: Prometheus via Prometheus Operator ○ Add Service Monitor objects to each deployment
-
New application<->platform contract emerged: just expose Prometheus metrics on port N and you will see your service graphs in Grafana ○ Developers are responsible for adjusting their services to obey the new contract and for making domain-specific dashboards
-
Good tools helped here: Kubernetes made it easy to deploy the stack, Spring framework made it easy to expose metrics
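The platform side of that contract can be sketched as a tiny generator for ServiceMonitor objects. This is an illustration under assumptions, not the setup used in the talk: it assumes each Service carries an `app` label and a port named `metrics`, and uses /actuator/prometheus, the default Spring Boot Actuator path for Prometheus metrics. Whether Prometheus picks the object up also depends on the selectors configured in the Prometheus Operator.

```shell
#!/usr/bin/env bash
# Emits a ServiceMonitor manifest for one service.
# Assumptions (not from the talk): the Service has an "app" label and a port
# named "metrics"; the scrape path defaults to the Spring Boot Actuator endpoint.
service_monitor() {
  local name="$1" port_name="${2:-metrics}" path="${3:-/actuator/prometheus}"
  cat <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${name}
  labels:
    app: ${name}
spec:
  selector:
    matchLabels:
      app: ${name}
  endpoints:
    - port: ${port_name}
      path: ${path}
      interval: 30s
EOF
}

service_monitor orders-service
```

With one shared template like this, a developer only has to expose metrics on the agreed port; the rest is identical for every service.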
Infrastructure as Code - Terraform + Atlantis
-
Git-versioned infrastructure
-
Migrate/Move or import existing resources
-
Set up Atlantis for audited and peer-reviewed infrastructure changes
-
Use the same tools to detect state drift (changes that were made outside of the Atlantis flow)
-
Optionally remove user permissions so that changes must go through Pull Requests
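One possible way to detect drift with the same tools is to run a plan on a schedule and look at its exit code. With `-detailed-exitcode`, `terraform plan` exits with 0 when there is nothing to change, 1 on error and 2 when changes are present. The sketch below assumes it runs against the main branch, where everything has already been applied, so pending changes point to modifications made outside the flow. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Maps the exit code of `terraform plan -detailed-exitcode` to a verdict.
# 0 = no changes, 1 = error, 2 = changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Runs a read-only plan in the given directory; meant for a scheduled job.
# Not invoked here, because it needs Terraform and cloud credentials.
check_drift() {
  local dir="$1" code=0
  (cd "$dir" && terraform plan -detailed-exitcode -input=false -lock=false >/dev/null) || code=$?
  classify_plan_exit "$code"
}

classify_plan_exit 2
```

A scheduled job could call `check_drift` for each directory in the repo and notify the team whenever the result is not `in-sync`.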
Terraform - Declarative infrastructure management
-
Define AWS resources ○ Readable syntax ○ Combine multiple resources into reusable module
-
Plan ○ Compare definition with current state ○ Display detailed changeset
-
Apply ○ Make changes to infrastructure ○ Record state
-
Team-workflow supported ○ State in AWS S3 ○ Locks in DynamoDB
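The define / plan / apply cycle with shared state can be sketched as three commands. The bucket, lock table, state key and region below are placeholders, and the sketch assumes the configuration declares an empty `backend "s3" {}` block that the `-backend-config` settings fill in. The function only prints the commands.

```shell
#!/usr/bin/env bash
# Prints the init/plan/apply sequence for an S3 backend with DynamoDB locking.
# Bucket, table, key and region are placeholders chosen for illustration.
terraform_workflow() {
  local bucket="$1" table="$2" key="$3" region="${4:-eu-west-1}"
  # Point Terraform at the shared state and the lock table.
  echo "terraform init -backend-config=bucket=${bucket} -backend-config=key=${key} -backend-config=region=${region} -backend-config=dynamodb_table=${table}"
  # Compare the definition with the current state and save the changeset.
  echo "terraform plan -out=tfplan"
  # Apply exactly the saved changeset and record the new state.
  echo "terraform apply tfplan"
}

terraform_workflow example-tf-state example-tf-locks network/terraform.tfstate
```

Applying the saved plan file means that what was reviewed is what gets executed, even if the infrastructure changed between plan and apply.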
Atlantis - Pull Requests for infrastructure
- GitHub hook on each Pull Request to terraform repo
- Additional layer of locking so no other PR can touch the same parts of infrastructure
- Autoplan: show plan preview in PR comments
- Review & Approve Pull Request
- Apply changes
- Remove locks and merge
Demo?
If time permits ;-) If it doesn’t: shout-out to my friend Szymon W., who wrote a nice blog post about introducing Terraform and Atlantis across a whole company: https://lab.getbase.com/terraform-base/
Does it work?
From my own experience
Cosmose
-
One “devops engineer”, seven contributors to the Terraform repo within a month, eleven now
-
10+ production deployments per day
-
3x more microservices since I joined (~6 months)
-
Infrastructure autoscaled 10x one time, when a dev wanted to “speed up his processing task” ;-)
Base / Zendesk Sell
-
Around 8 Ops and 42 (!) contributors to the Terraform repo
-
30-50 deployments to prod daily
-
High level of ownership in dev teams, including expertise in running databases (e.g. Elasticsearch, MySQL) and building their own infrastructure stacks (QA Kubernetes)
It works!
Thanks!
Jakub P. Głazik zytek@nuxi.pl github.com/zytek Questions are more than welcome