Beyond DevOps - How Netflix Bridges the Gap

Engineering

josh-evans
Josh Evans - Director of Operations Engineering
November 16, 2015

Technical Debt (Fall 2013): Java 6, Perforce, Single Master Jenkins, Ant, CentOS, Asgard/Mimir

Java 6 – we needed to move forward on Java but struggled to drive adoption
Perforce – many teams were moving to Git, and we had no story for supporting Perforce in the cloud
Jenkins – long queues & build times
Ant – long build times, inefficient dependency management
CentOS – slow delivery of new kernel and userland binaries
Asgard served us well for deployment & cloud management
Mimir gave us a great prototype and we learned a lot
Tech debt kept us from doing our jobs well

How do we drive broad-based change?
Does this sound familiar? Have any of you been on one side or the other of this situation?

The Paved Road: Java 7, Stash, Jenkins Shards, Gradle, Ubuntu

To move forward we defined the concept of the paved road. The paved road promises a well-supported, integrated developer experience.
Java 7 – just to move forward, with Java 8 already on the horizon
Git – organically adopted by many teams
Gradle – build time reduced due to efficient dependency management
Ubuntu – more frequent, well-vetted userland binaries & kernels
Jenkins shards – to fix long build times
We started building our next-generation cloud console & continuous delivery platform, Spinnaker
We staffed up and went for it – big bang

Some said: You’re overloading us. Too many projects. Poor targeting.
Others said: What took you so long? We’ve moved on. Now we need to migrate.
That’s great but… We’re paying a high tax.

Expectations gap: Division of labor, Timing of solutions, Leadership
Affects: Reputation, Relationships, Lost opportunities
Organizational Debt

How do we bridge the gap?
“Remember that TIME is money…”
Read to the audience: He that can earn ten shillings a day by his labour, and goes abroad, or sits idle one half of that day, tho' he spends but sixpence during his diversion or idleness, ought not to reckon that the only expense; he has really spent or rather thrown away five shillings besides. - Advice to a Young Tradesman

Time is a form of currency
Please raise your hand if you know which puritanical workaholic wrote this. In addition to the obvious intent behind this there is a more profound message. Time spent working is related to the money you make, but time is also in and of itself a form of currency. It’s the exchange or giving of time that drives the economics of an engineering organization.

Our time today… Product Engineering, Operations Engineering, Challenges & Strategies

Product Innovation: winning moments of truth; every facet of the product; 1400 A/B tests in the last year & accelerating
Continuous Innovation
But wait, there’s more…

Build It: design, code, build, bake, test, deploy
Run It: configure, monitor, triage, fix
…at scale, globally

You build it, you run it
Netflix has a freedom & responsibility culture. “You build it, you run it” perfectly aligns with our values around autonomy & ownership.

Internet: 1000s of starts per second; 100,000s of requests per second; 100,000,000 hours of content / day; 3 AWS Regions, 3 AZs per region

Relentless product innovation. Building & running micro-services at scale, globally. This leads to a high-pressure situation that creates a shortage of time.

Our time today… Product Engineering, Operations Engineering, Challenges & Strategies

The Gap
DevOps is a software development method that emphasizes the roles of both software developers and other information-technology (IT) professionals with an emphasis on IT Operations. - Wikipedia
Read definition out loud. Out of curiosity – who agrees with this definition? Who disagrees?
Not only is there disagreement, but the general construct isn’t really that helpful.
Why? How?
It doesn’t address how to bridge the gap or why it matters to do so. What are the strategies for success? It’s the practices, tools, culture.

Motivations: the reason for doing DevOps is to achieve operational excellence
Quality, Velocity, Operational Excellence
Operational Excellence is the continuous improvement of the management, design, and function of operational environments to achieve greater quality, velocity, and competitive advantage.

Engineering Tools, Insight & Real-time Analytics, Performance & Reliability
Operations Engineering is the application of software engineering practices to achieve and sustain operational excellence.

Operations Engineering: Service provider; Operational excellence driver; Cross-cutting solutions; Undifferentiated heavy lifting
We do the undifferentiated heavy lifting for our customers. This means we take on the operationally oriented common engineering work across teams so that each team can focus on their core charter.

Our time today… Product Engineering, Operations Engineering, Challenges & Strategies

You’re overloading us. What took you so long?
Remember that feedback?
We made assumptions: Requirements – what & when; Time for non-product work
Move from assumptions to knowledge

How do we… Effect change without imposing a tax? Achieve and sustain operational excellence?

Time is a form of currency
Going back to our Ben Franklin quote – time is a form of currency. In our engineering world time really is currency. We don’t pay each other to do work. We commit time to projects. In other words, we have a time-based economy.

5 strategies for success in time-based economies: software & organizational engineering
Audience – can anyone name one of the strategies?
1. Reach out
What are your biggest operational pain points? How can we help? How well are we meeting your needs today? What would you like to see from us in the future?
Listen. Shower, rinse, repeat.
Talk to your engineering customers

Grease the Squeaky Wheels: low tolerance for tax; more vocal than most
Stop spamming us!

What they wanted: High impact solutions; Clarity on deliverables; Lower operational tax; Leadership, innovation, and partnership
Our commitments: Deliver on solutions; Better road map definition & communication; A more aggressive stance on automation; Deeper investment into leadership, innovation, planning

2. Make an impact
Apply what you’ve learned. Deliver what matters.
Global cloud console; end-to-end delivery automation platform; velocity with confidence
Pipelines - Automated Global Delivery

3. Make it easy to do the right thing
A free chaos monkey for good ones
Supply & Demand: Engineering time is scarce. We must do more heavy lifting.
Provide on-ramps: Spinnaker manual step; Automated migrations – Mimir
Automate proven practices

Production Ready? Alerting and Monitoring; Apache & Tomcat Hardening; Automated Canary Analysis; Autoscaling; Chaos Participation; Consistent Naming; ELB Configuration; Healthcheck Configured; Red-Black Pipeline; Squeeze Testing; Timeout & Fallback Tuning; Workload Reliability
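Automated Canary Analysis, one of the Production Ready practices listed above, can be illustrated with a small scoring routine. This is a minimal sketch and not Netflix's implementation: the metric names, the 20% relative tolerance, the pass/fail rule, and the 0.9 score threshold are all assumptions chosen for illustration. It classifies each metric by comparing the canary against the baseline, computes a score as the fraction of metrics that pass, and makes a decision against a threshold.

```python
from statistics import mean


def classify(baseline, canary, tolerance=0.20):
    """Label a metric 'pass' if the canary mean is within `tolerance`
    (relative) of the baseline mean, else 'fail'. Hypothetical rule."""
    b, c = mean(baseline), mean(canary)
    if b == 0:
        return "pass" if c == 0 else "fail"
    return "pass" if abs(c - b) / abs(b) <= tolerance else "fail"


def canary_score(baseline_metrics, canary_metrics, tolerance=0.20):
    """Classify every metric and return (score, labels), where score is
    the fraction of metrics labelled 'pass'."""
    labels = {
        name: classify(baseline_metrics[name], canary_metrics[name], tolerance)
        for name in baseline_metrics
    }
    passed = sum(1 for label in labels.values() if label == "pass")
    return passed / len(labels), labels


def decide(score, threshold=0.9):
    """Continue the rollout only if the score meets the assumed threshold."""
    return "continue" if score >= threshold else "rollback"


# Illustrative samples only; not real measurements.
baseline = {
    "error_rate": [0.010, 0.012, 0.011],
    "latency_ms": [120, 118, 122],
    "cpu_pct": [40, 42, 41],
}
canary = {
    "error_rate": [0.011, 0.012, 0.010],
    "latency_ms": [180, 175, 190],
    "cpu_pct": [41, 43, 42],
}

score, labels = canary_score(baseline, canary)
decision = decide(score)
print(labels, round(score, 2), decision)
```

In this made-up data the canary's latency is roughly 50% higher than the baseline, so one of three metrics fails and the sketch decides to roll back. A real system would presumably repeat a check like this on an interval while the canary runs and use a more robust statistical comparison than a difference of means.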
Canaries (diagram, during the canary): Customers, Load Balancer, Old Version (v1.0) on 100 servers at 95%, New Version (v1.1) on 5 servers at 5%, Metrics
Canaries (diagram, after rollout): Customers, Load Balancer, Old Version (v1.0) on 0 servers, New Version (v1.1) on 100 servers at 100%, Metrics

Automated Canary Analysis: Define metrics; a threshold; every n minutes: classify metrics, compute score, make a decision

Make it easy to do the right thing: Canary Analysis, Performance, Integration Tests, Chaos, Conformity, Static, Unit Tests
Static & Functional Testing

4. Reduce the cost of change
Continuous, Broad-based Change: Ongoing migrations; Library propagation; 100s of micro-services; Complex dependencies
There are several approaches that you might take to solve this problem. I’ll explore each one.

Change Engineering: Locate, Communicate, Facilitate

Automated forensics: Who last touched x? What team? Who was their manager? Who owns this artifact, repository, service?

Whitepages: Workday wrapper; App & REST API; Organization hierarchy; Metadata; Change log; (###) ###-####

Krieger: REST-based service
Sources: Whitepages, Stash, Edda, Jenkins, Spinnaker, etc…

    {
      "content": {},
      "_links": {
        "employees": { "href": "/api/employees/" },
        "projects": { "href": "/api/projects/" },
        "teams": { "href": "/api/teams/" },
        "applications": { "href": "/api/applications/" },
        "jobs": { "href": "/api/build/jobs" },
        "masters": { "href": "/api/build/masters" },
        "projectDistribution": { "href": "/api/teams/projectDistribution" }
      }
    }

/api/employees?q=jevans

    {
      "employees": [
        {
          "id": "241",
          "firstName": "Josh",
          "lastName": "Evans",
          "username": "jevans",
          "email": "jevans@netflix.com",
          "jobTitle": "Director of Operations Engineering",
          "isManager": true,
          "isCurrent": true,
          "title": "Josh Evans (jevans) - Operations Engineering",
          "_links": {
            "self": { "href": "/api/employees/241" },
            "manager": { "href": "/api/employees/117890" },
            "team": { "href": "/api/teams/f9134a81" },
            "projects": { "href": "/api/teams/f9134a81/projects" }
          }
        }
      ]
    }

Security vulnerabilities: Who owns this service?
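Ownership questions like this one can be answered by following the `_links` entries in a Krieger-style response. The following is a minimal sketch under stated assumptions: the helper name `link_for` is hypothetical, the sample is a trimmed copy of the employee record shown earlier, and no HTTP request is made because the service's host and authentication are not given in the slides.

```python
import json

# Trimmed copy of the /api/employees?q=jevans response shown in the slides.
SAMPLE = """
{
  "employees": [
    {
      "id": "241",
      "username": "jevans",
      "jobTitle": "Director of Operations Engineering",
      "_links": {
        "self": { "href": "/api/employees/241" },
        "manager": { "href": "/api/employees/117890" },
        "team": { "href": "/api/teams/f9134a81" },
        "projects": { "href": "/api/teams/f9134a81/projects" }
      }
    }
  ]
}
"""


def link_for(resource, rel):
    """Return the href for link relation `rel`, or None if it is absent.
    Hypothetical helper; not part of any Krieger client."""
    return resource.get("_links", {}).get(rel, {}).get("href")


data = json.loads(SAMPLE)
employee = data["employees"][0]

# Relative paths a client would request next to find the team and manager.
team_href = link_for(employee, "team")
manager_href = link_for(employee, "manager")
print(employee["username"], team_href, manager_href)
```

Following links rather than building URLs by hand keeps a client independent of the service's path layout, which is the usual reason for exposing `_links` in a response.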
Platform updates: Who is using this version of this library?

Today – Targeted Coordination

Future – Change Campaigns: Automated, efficient technical project management; Communication; Guidance; Tracking; Low tax for TPMs & engineers
Examples: Security Fix, Java 9, Guava

5. Develop Partnerships
Beyond supply & demand
Once you’ve proven that you can deliver, you have some money in the bank. You have earned a seat at the table. Now you’re ready to build strong partnerships.

Spinnaker 1.0 – 1H 2015: Nearing completion; Aggressive schedule; Unexpected delays; Commitment to June delivery

Edge Engineering: Built their own continuous delivery solution; Not positioned for engineering-wide support; Believes common solutions

Partnership in Action: Strong relationship; Open discussions about concerns; Decision - leaned forward; +2 engineers on Spinnaker; Successful 1.0 launch

Moving Forward Together: Containers? Achieving alignment; Collaborative exploration; Edge, Platform, Operations; A new paved road?

Payoffs: Paved Road adopted, adding new ones; Production Ready ongoing; Migrations easier; Reputation improving; Improved service uptime and rate of change

Putting it to the test in 2016: Streaming production & test - EC2 Classic to VPC; Highly cross-functional; Complex dependencies; Zero downtime
Stay tuned…

Five Strategies: Reach out; Make an impact; Make it easy to do the right thing; Reduce the cost of change; Develop partnerships

Open Sourced! https://netflix.github.io/

Josh Evans, jevans@netflix.com, @ops_engineering
Questions?