Thursday, January 30, 2014

Small Projects and Big Programs

The Standish Group’s CHAOS 2013 Report has some interesting things to say about what is driving software development project success today. More projects are succeeding (39% in 2012, up from 29% in 2004), mostly because projects are getting smaller (which also means more projects done using Agile, since small projects are the sweet spot for Agile):

“Very few large projects perform well to the project management triple constraints of cost, time, and scope. In contrast to small projects, which have more than a 70% chance of success, a large project has virtually no chance of coming in on time, on budget, and within scope… Large projects have twice the chance of being late, over budget, and missing critical features…. A large project is more than 10 times more likely to fail outright, meaning it will be cancelled or will not be used because it outlived its useful life prior to implementation.”

So don’t run large projects. Of course it’s not that simple: many problems, especially in enterprises, are much too big to be solved by small teams in small projects. Standish Group says that you can get around this if you “Think Big, Act Small”:
“…there is no need for large projects… any IT project can be broken into a series of small projects that could also be done in parallel if necessary.”

Anything that can be done in one big project can obviously be done in a bunch of smaller projects. You can make project management simple – by pushing the hard management problems and risks up to the program level.

Program Management isn't Project Management

Understanding and orchestrating work across multiple projects isn’t as simple as breaking a big project down into small projects and running them independently. Managing programs, managing horizontally across projects, is different from managing projects. There are different risks and different problems to be solved. It requires different skills and strengths, and a different approach.

PMI recognizes this, and has a separate certification for Program Managers (PgMP). Program management is more strategic than project management. Program Managers are not just concerned with horizontal and cross-project issues, coordinating and managing interdependencies between projects – managing at scale. They are also responsible for understanding and helping to achieve business goals, for managing organizational risks and political risks, and they have to take care of financing and contracts and governance: things that Agile coaches running small projects don’t have to worry much about (and that Agile methods don’t help with).

Agile and Program Management

The lightweight tools and practices that you use to successfully coach an Agile team won’t scale to managing a program. Program management needs all of those things that traditional, plan-driven project management is built on. More upfront planning to build a top-down roadmap for all of the teams to share: project teams can’t be completely free to prioritize work and come up with new ideas on the fly, because they have to coordinate handoffs and dependencies. Architecture and technology strategy. More reporting. Working with the PMO. More management roles and more management. More people to manage. Which = more politics.

Johanna Rothman talks a little bit about program management in her book Managing Your Project Portfolio, and has put up a series of blog posts on Agile and Lean Program Management as work in progress for another book she is writing on program management and Agile.

Rothman looks at how to solve problems of organization in programs using the Scrum of Scrums hierarchy (teams hold standups, then Scrum Masters have their own standups together every day to deal with cross-program issues). Because this approach doesn't scale to handle coordination and decision making and problem solving in larger programs, she recommends building loose networks between projects and teams using Communities of Practice (a simple functional matrix in which architects, testers, analysts, and especially the Product Owners in each team coordinate continuously with each other across teams).

Rothman also looks at the problems of coordinating work on the backlog between teams, and evolving architecture, and how Program Managers need to be Servant Leaders and not care what teams do or how they do it, only about the results.

Rothman believes that Program Managers should establish and maintain momentum from the beginning of the program. Rather than taking time upfront to initiate and plan (because, who actually needs to plan a large, complex program?!), get people learning how to work together from the start. Release early, release often, and keep work in progress to a minimum – the larger the program, the less work in progress you should have. Finally, she describes some tools that you could use to track and report progress and provide insight into a program’s overall status, and explains how and why they need to be different from the tools used for projects.

There are some ideas here that make sense and would probably work, and some others that don’t - like skipping planning.

Get Serious about Program Management

A more credible and much more comprehensive approach for managing large programs in large organizations would be one of the heavyweight enterprise Agile hybrids: the Scaled Agile Framework (SAFe) or Disciplined Agile Delivery, which take Agile ideas and practices and wrap them inside a structured, top-down, governance-heavy process/project/program/portfolio management framework based on the Rational Unified Process. But now you’re not trying to manage and coordinate small, simple Agile projects any more; you’re doing something quite different, and much more expensive.

The most coherent and practical framework I have seen for managing programs is laid out in Executing Complex Programs, a course offered by Stanford University, as part of its Advanced Project Management professional development certificate.

This framework covers how to manage distributed cross-functional and cross-organizational teams in global environments; managing organizational and political and logistical and financial risks; and modeling and understanding and coordinating the different kinds of interdependencies and interfaces between projects and teams (shared constraints and resources, APIs and shared data, hand-offs and milestones and drop-dead dates, experts and specialists…) in large, complex programs. The course explores case studies using different approaches, some Agile, some not, some in high reliability / safety critical and regulated environments. This should give you everything that you need to manage a program effectively.

You can and should make projects simpler and smaller – which means that you’ll have to do more program management. But don’t try to get by at the program level with improvising and iterating and leveraging the same simple techniques that have worked with your teams. Nothing serious gets done outside of programs. So take program management seriously.

Thursday, January 23, 2014

Can you Learn and Improve without Agile Retrospectives? Of course you can…

Retrospectives – bringing the team together on a regular basis to examine how they are working and identify where and how they can improve – are an important part of Agile development.

Scrum and “Inspect and Adapt”

So important that Schwaber and Sutherland burned retrospectives into Scrum at the end of every Sprint, to make sure that teams will continuously Inspect and Adapt their way to more effective and efficient ways of working.

End-of-Sprint retrospectives are now commonly accepted as the right way to do things, and are one of the more commonly followed practices in Agile development. VersionOne’s latest State of Agile Development survey says that 72% of Agile teams are doing retrospectives.

Good Retrospectives are Hard Work

Good retrospectives are a lot of work.

For the leader/Coach/Scrum Master who needs to sell them to the team – and to management – and build a safe and respectful environment to hold the meetings and guide everyone through the process properly.

For the team, who need to take the time to learn and understand together and act on what they've learned and then follow-up and actually get better at how they work.

So hard that there are several books written just on how to do retrospectives (Agile Retrospectives: Making Good Teams Great, The Retrospective Handbook, Getting Value out of Agile Retrospectives), as well as chapters on retrospectives in other books on Agile, and retrospective websites (including one just on how to make retrospectives fun) and a wiki and at least one prime directive for running retrospectives, and dozens of blog posts with suggestions and coaching tips and alternative meeting formats and collaborative games and tools and techniques to help teams and coaches through the process, to energize retrospectives or re-energize them when teams lose momentum and focus.

Questioning the need for Retrospectives

Because retrospectives are so much work, some people have questioned how useful running retrospectives each Sprint really is, whether they can get by without a retrospective every time, or maybe without doing them at all.

There are good and bad reasons for teams to skip – or at least want to skip – retrospectives.

Because not everyone works in a safe environment where people trust and respect each other, so retrospectives can be dangerous and alienating, a forum for finger pointing and blame and egoism.

Because they don’t result in meaningful change, because the team doesn’t act on what they find – or aren’t given a chance to – and so the meetings become a frustrating and pointless waste of time, rehashing the same problems again and again.

Because the real problems that they need to solve in order to succeed are larger problems that they don’t have the authority or ability to do anything about, and so the meetings become a frustrating and pointless waste of time….

Because the team is under severe time pressure, they have to deliver now or there may not be a chance to get better in the future.

Because the team is working well together, they've “inspected and adapted” their way to good practices and don’t have any serious problems that have to be fixed or initiatives that are worth spending a lot of extra time and energy on, at least for now. They could keep on trying to look for ways to get even better, or they could spend that time getting more work done.

Inspecting and Adapting – without Regular Retrospectives

Regular, frequent retrospectives can be useful – especially when you are first starting off in a new team on a new project. But once the team has learned how to learn, the value that they can get from retrospectives will decline.

This is especially the case for teams working in rapid cycles, short Sprints every 2 weeks or every week or sometimes every few days. As the Sprints get shorter, the meetings need to be shorter too, which doesn’t leave enough time to really review and reflect. And there’s not enough time to make any meaningful changes before the next retrospective comes up again.

At some point it makes good sense to stop and try something different. Are there other ways to learn and improve that work as well, or better than regular team retrospective meetings?

XP and Continuous Feedback

Retrospectives were not part of Extreme Programming as Kent Beck et al defined it (in either the first or second edition).

XP teams are supposed to follow good engineering (at least coding and testing) practices and work together in an intelligent way from the beginning – it should be enough to follow the rules of XP, and fix things when they are broken.

XP relies on built-in feedback loops: TDD, Continuous Integration and continuous testing, pair programming, frequently delivering small releases of software for review. The team is expected to learn from all of this feedback, and improve as they go. If tests fail, or they get negative feedback from the Customer, or find other problems, they need to understand what went wrong, why, and correct it.

Devops and Continuous Delivery/Deployment

Delivering software frequently, or continuously, to production pushes this one step further. If you are delivering working software to real customers on a regular basis, you don’t need to ask the team to reflect internally, to introspect – your customers will tell you if you are doing a good job, and where you need to improve:

Are you delivering what customers need and want? Is it usable? Do they like it?

Is the software quality good – or at least good enough?

Are you delivering fast enough?

By understanding and acting on this feedback, the team will improve in ways that make a real difference.

Root Cause Analysis

If and when something goes seriously wrong in testing or production or within the team, call everyone together for an in-depth review and carefully step through Root Cause Analysis to understand what happened and why, decide what you need to change to prevent problems like this from happening again, and put together a realistic plan to get better.

Reviews like this, where the team works together to confront serious problems in a serious way and genuinely understand them and commit to fixing them, are much more important than a superficial 2-hour meeting every couple of weeks. These can be – and often are – make or break situations. Handled properly, this can pull teams together and make them much stronger. Never waste a crisis.

Kanban and Micro-Optimization

Teams following Kanban are constantly learning and improving.

By making work visible and setting work limits, they can immediately detect delays and bottlenecks, then get together and correct them. This micro-optimization at the task level, always tuning and fixing problems as they come up, might seem superficial, but the results are immediate (recognizing and correcting problems as soon as they come up makes more sense than waiting until the next scheduled meeting), and small improvements are all that many teams are actually able to make anyway.
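
The WIP-limit mechanics can be sketched in a few lines of code. Everything here (the KanbanBoard class, the column names, the limits) is hypothetical, purely to illustrate how a blocked pull surfaces a bottleneck the moment it appears, rather than at the next scheduled meeting:

```python
# A minimal sketch of WIP limits on a Kanban board. All names here
# (KanbanBoard, pull, finish) are hypothetical, for illustration only.

class WipLimitExceeded(Exception):
    pass

class KanbanBoard:
    def __init__(self, wip_limits):
        # wip_limits: e.g. {"dev": 3, "test": 2}
        self.wip_limits = wip_limits
        self.columns = {name: [] for name in wip_limits}

    def pull(self, column, card):
        # Refuse to pull new work into a column that is already at its
        # limit - the blocked pull makes the bottleneck visible
        # immediately, forcing the team to swarm on it.
        if len(self.columns[column]) >= self.wip_limits[column]:
            raise WipLimitExceeded(f"{column} is at its WIP limit")
        self.columns[column].append(card)

    def finish(self, column, card):
        self.columns[column].remove(card)

board = KanbanBoard({"dev": 2, "test": 1})
board.pull("dev", "card-1")
board.pull("dev", "card-2")
try:
    board.pull("dev", "card-3")   # third card hits the limit
except WipLimitExceeded:
    bottleneck = "dev"            # the team now knows where to swarm
```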

Take advantage of audits and reviews

In large organizations and highly regulated environments, audits and other reviews (for example security penetration tests) are a fact of life. Instead of trying to get through them with the least amount of effort and time wasted, use them as valuable learning opportunities. Build on what the auditors or reviewers ask for and what they find. If they find something seriously missing or wrong, treat it as a serious problem, understand it and correct it at the source.

Moving Beyond Retrospectives

There are other ways to keep learning and improving, other ways to get useful feedback, ways that can be as effective or more effective and less expensive than frequent retrospectives, from continuous checking and tuning to deep dives if something goes wrong.

You can always schedule regular retrospective meetings if the circumstances demand it: if quality or velocity start to slide noticeably, or conflicts arise in the team, or if key people leave, or there’s been some other kind of shock, a sudden change in direction or priorities that requires everyone to work in a much different way, and start learning all over again.

But don’t tie people down and force them to go through a boring, time-wasting exercise because it’s the “right way to do Agile”, or turn retrospectives into a circus because it’s the only way you can keep people engaged. Find other, better ways to keep learning and improving.

Thursday, January 16, 2014

How much can Testers help in Appsec?

It’s not clear how much of a role QA – which in most organizations means black box testers who do manual functional testing or write automated functional acceptance tests – can or should play in an Application Security program.

Train QA, not Developers, on Security

At RSA 2011, Caleb Sima asserted that training developers in Appsec is mostly a waste of time (“Don’t Teach Developers Security”).

Because most developers won’t get it; they don’t have the time to worry about security even if they do get it; and the rate of turnover in most development teams is so high that if you train them, they are not likely to be around long enough to make much of a difference.

Sima suggests starting with QA instead, because testers are paid to find where things break, and Appsec gives them more broken things to find.

Instead of putting a test team through general Appsec training, he recommends taking a more targeted, incremental approach.

Start with a security scan or pen test. Have a security expert review the results and identify the 1 or 2 highest risk types of vulnerabilities found, problems like SQL injection or XSS.

Then get that security expert to train the testers on what these bugs are about and how to find them, and help them to explain the bugs to development. Developers will also learn something about security by working through these bugs. When all of the highest priority bugs are fixed, then train the test team on the next couple of important vulnerabilities, and keep going.
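
As a rough illustration of the kind of targeted check a tester might script after this sort of training, here is a crude sketch of a reflected-XSS probe: send a marker payload to each input and flag any response that echoes it back unencoded. The fetch callable and field names are hypothetical stand-ins for HTTP requests to the application under test; a real scan needs far more payloads and encodings than this:

```python
# A crude sketch of a reflected-XSS probe. fetch() and the field names
# are hypothetical stand-ins for real HTTP requests to the app under test.

XSS_PROBE = '<script>alert("xss-probe")</script>'

def probe_reflected_xss(fetch, fields):
    """fetch(field, value) -> response body as a string."""
    findings = []
    for field in fields:
        body = fetch(field, XSS_PROBE)
        # If the raw payload comes back unescaped, the input is probably
        # not being output-encoded - a likely reflected-XSS bug.
        if XSS_PROBE in body:
            findings.append(field)
    return findings

# Fake fetch standing in for an HTTP request to the app under test:
def fake_fetch(field, value):
    if field == "comment":
        return f"<p>You said: {value}</p>"       # echoes raw input back
    return "<p>You said: [sanitized]</p>"        # encodes/strips input

assert probe_reflected_xss(fake_fetch, ["name", "comment"]) == ["comment"]
```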

Unfortunately, this won't work...

This approach is flawed in a couple of important ways.

First, it doesn’t address the root cause of software security problems: developers making security mistakes when designing and writing software. It’s a short-term bandage.

And in the short term, there is a fundamental problem with asking the QA team to take a leadership role in application security: most testers don’t understand security, even after training.

A recent study by Denim Group assessed how much developers and QA understood about application security before and after they got training. Only 22% of testers passed a basic application security test after finishing security training.

Testing is not the same as Pen Testing

This is disappointing, but not surprising. A few hours, or even a few days, of security training can’t make a black box functional QA tester into an application security pen tester. Nick Coblentz points out that some stars will emerge from security training. Some testers, like some developers, will “get” the White Hat/Black Hat stuff and fall in love with it, and make the investment in time to really get good at it. However, these people probably won’t stay in testing anyway – there's too much demand for talented Appsec specialists today.

But most testers won’t get good at it. Because it’s not their job. Because there are too many technical details to understand about the architecture and platform and languages for people who are just as likely to have a degree in Art History as in Computer Science, who are often inexperienced, and already overworked. And these details are important – in Appsec, making even small mistakes, and missing small mistakes, matters.

Cigital has spent a lot of time helping setup Appsec programs at different companies and studying what works for these companies. They have found that:

Involving QA in software security is non-trivial... Even the "simple" black box Web testing tools are too hard to use.

In order to scale to address the sheer magnitude of the software security problem we've created for ourselves, the QA department has to be part of the solution. The challenge is to get QA to understand security and the all-important attackers' perspective. One sneaky trick to solving this problem is to encapsulate the attackers' perspective in automated tools that can be used by QA. What we learned is that even today's Web application testing tools (badness-ometers of the first order) remain too difficult to use for testers who spend most of their time verifying functional requirements…

Software [In]security: Software Security Top 10 Surprises

But there’s more to Security Testing than Pen Testing

There’s more to security testing than pen testing and running black box scans. So Appsec training can still add value even if it can’t make QA testers into pen testers.

Appsec training can help testers to do a better job of testing security features and the system’s privacy and compliance requirements: making sure that user setup and login and password management work correctly, checking that access control rules are applied consistently, reviewing audit and log files to make sure that activities are properly recorded, and tracing where private and confidential data are used and displayed and stored.

And Appsec training can give testers a license to test the system in a different way.

Most testers spend most of their time verifying correctness: walking through test matrices and manual checklists, writing automated functional tests, focused on test coverage and making sure that the code conforms to the specification, or watching out for regressions when the code is changed. A lot of this is CYA verification. It has to be done, but it is expensive and a poor use of people and time. You won’t find a lot of serious bugs this way unless the programmers are doing a sloppy job. As more development teams adopt practices like TDD, where developers are responsible for testing their own code, having testers doing this kind of manual verification and regression will become less useful and less common.

This kind of testing is not useful at all in security, outside of verifying security features. You can’t prove that a system is secure, that it isn’t vulnerable to injection attacks or privilege escalation or other attacks by running some positive tests. You need to do negative testing until you are satisfied that the risks of a successful exploit are low. You still won’t know that the system is secure, only that it appears to be “secure enough”.
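
A toy example of the difference, with a hypothetical validate_account_id function standing in for the input handling under test: the positive test only demonstrates conformance to the spec, while the negative tests probe cases the spec never mentions:

```python
# Positive vs. negative testing, sketched. validate_account_id is a
# hypothetical stand-in for whatever input handling is under test.

import re

def validate_account_id(raw):
    # Whitelist: exactly 8 digits. Anything else is rejected.
    return bool(re.fullmatch(r"\d{8}", raw))

# Positive test - the kind of check that proves conformance, not safety:
assert validate_account_id("12345678")

# Negative tests - each probes a way an attacker might step off the
# happy path. Passing them raises confidence, but it never proves the
# system is secure, only that it looks "secure enough" for these cases.
for hostile in ["12345678' OR '1'='1",    # SQL injection attempt
                "../../etc/passwd",        # path traversal
                "<script>x</script>",      # script injection
                "1234567890123456789",     # oversized input
                "-0000001"]:               # sign / boundary abuse
    assert not validate_account_id(hostile)
```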

Stepping off of the Happy Path

It’s when testers step off of the tightly-scripted happy path and explore how the system works that things get more interesting. Doing what ifs. Testing boundary conditions. Trying things that aren’t in the specification. Trying to break things, and watching what happens when they break.

Testing high-risk business logic like online shopping or online trading or online banking functions in real-world scenarios, testing in pairs or teams to check for timing errors or TOC/TOU problems or other race conditions or locking problems, injecting errors to see what happens if the system fails half way through a transaction, interrupting the workflow by going back to previous steps again or trying to skip the next step, repeating steps two or three or more times. Entering negative amounts or really big numbers or invalid account numbers. Watching for information leaks in error messages. Acting unpredictably.

This is the kind of testing that can be done better by a QA tester who understands the domain and how the system works than by a pen tester on a short-term engagement. As long as they are willing to do a little hacking.

It shouldn’t take a lot of training, or add a lot to the cost of testing, to get testers doing some exploratory and negative testing in the high-risk areas of an application. A lot of important security bugs (and functional bugs and usability problems) can be found testing this way – bugs that can’t be found by walking through test checklists, or by running vulnerability scanners and dynamic analysis tools. Application Security training should reinforce to testers – and developers, and managers – how important it is to do this kind of testing, and that the bugs found this way are important to fix.

Moving from functional testing of security features to edge and boundary condition testing and “adversarial testing” is the first major step that QA teams need to take in playing a role in Application Security, according to Cigital’s Building Security In Maturity Model. From there, some QA teams may go on to integrate black box security testing tools, and possibly other more advanced security testing tools and practices.

The real value of Security Testing

But even if you can farm some security testing out to QA, you’ll still need to rely on security experts. You need someone who really understands the tools and the technical issues, who has spent a lot of time hacking, who understands security risks and who is paid to keep up with the changing threat landscape. Someone who is at least as good or better than whoever you expect to be attacking you.

This might mean relying on full-time security experts in your own organization, or contracting outside consultants to do pen tests and other reviews, or taking advantage of third party on-demand security testing platforms from companies like WhiteHat, Qualys, Veracode, HP and IBM.

The important thing is to take the results of all of this testing – whether it’s done by QA or a pen tester or by a third party testing service – and act on it.

Developers – and managers – need to understand what these bugs mean and why they need to be fixed and how to fix them properly, and more importantly how to prevent them from happening in the future. The real point of testing is to provide information back to development and management, about risks, about where the design looks weak, or where there are holes in the SDLC that need to be plugged. Until we can move past test-then-fix-then-test-then-fix… to stopping security problems upfront, we aren’t accomplishing anything important. Which means that testers and developers and managers all have to understand security a lot better than they do today. So teach developers security. And testers. And managers too.

Thursday, January 9, 2014

Developers working in Production. Of course! Maybe, sometimes. What, are you nuts?

One of the basic ideas in Devops is that developers and operations should share responsibility for designing systems, for implementing them and keeping them running. Developers should be on call in case something goes wrong, and be the ones to fix whatever breaks. Because the person who wrote the code is often the only one who knows how it really works. And because of the moral hazard argument: if programmers are held fully accountable for the work that they do, they will have an incentive to do a better job, instead of writing garbage and handing it off to somebody else.

But this means that developers need some kind of access to production. How much access developers need, how often, and how this can be made safe, are important questions that have to be answered.

Hire wicked smart people and give them all access to root.
Unnamed devops evangelist, Is Devops Subversive?

If you ask whether developers should have access to production you’ll find that people fall into one of 3 camps:

Yeah, sure, of course – who else is going to support the system?

This is a simple decision for online startups, where there’s often nobody else to install, configure and support the application anyway.

As these organizations grow, developers often continue to stay closely involved in deployment, support and application operations, and in some cases, still play a primary role, especially in shops that heavily leverage cloud infrastructure (think Netflix).

Read my lips: Never Ever! Are you out of your freakin’ mind?

Question: Should developers have access to production?

Answer: Not only no, but hell no.

kperrier, Slashdot: Should Developers Have Access to Production

The situation is much different in large enterprises and government organizations, where walls have been built up between development and operations for many different reasons. It’s not just mergers and acquisitions and inertia and internal politics and protectionism that made this happen. It’s also SOX and PCI and HIPAA and GLBA and other overlapping regulations and privacy rules, and ITIL and COBIT and ISOxxx and CMMI and other IT governance frameworks, and internal and external auditors enforcing separation of duties and need-to-know access limitations in order to ensure the integrity and confidentiality of system data.

The same rules also apply to leaner-and-meaner Devops shops. For example, at Etsy (a Devops leader), PCI DSS compliant functions are managed and supported by a different team in a different way from the rest of their online systems: while developers have R/O access to a lot of production “data porn” (metrics and graphs and logs), they do not have access to production databases; there are more requirements for activity logging; a push to QA is handled in a clearly different way than a push to production; and all changes to production must be tracked and approved through a ticketing system.

And there’s also the problem of shared infrastructure: the same networks and servers and databases and other parts of the stack may be used by many different applications and different business units. Developers of course only understand the applications that they are working on and are only familiar with the simplified test configurations that they use day-to-day – they may not know about other systems and their shared dependencies, and could easily make changes that break these systems without being aware of the risks.

In case of emergency, break glass

Most organizations fall somewhere in between a Noops web startup in the cloud and a legacy-bound enterprise weighted down by too much governance and management politics. Operations is usually run separately, management is still accountable to regulators and auditors, but most people understand and recognize the need for developers to help out, especially when something goes wrong.

When the shit has indeed and truly hit the fan, developers – although usually only senior developers or team leads – are brought in to help troubleshoot and recover. Their access is temporary, maybe using a “fire id” extracted from a vault, then locked down again as soon as they are done. Developers are often paired up with an operations buddy who does most of the driving, or at least watches them carefully when the developer has to take the wheel.

Question: Should developers have access to production?

Answer: Everyone agrees that developers should never have access to production… Unless they’re the developer, in which case it’s different.

SatanicPuppy, Slashdot: Should Developers Have Access to Production

Problems in production can be fixed much faster if developers can see the logs, stack traces and core dumps and look at production data when something goes wrong. Giving at least some developers read access to production logs and alerts and monitors – enough to recognize that something has gone wrong and to figure out what needs to be fixed – makes sense.

Sometimes really bad things happen and all that matters is getting the system back up and running as quickly as possible. You want the best people you can find working on the problem, and this includes developers. You’ll need their help with diagnosis and deciding what options are safest to take for roll back or roll forward, putting in an emergency fix or workaround, and data repair and reconciliation. Everyone will need to check later to make sure that any temporary fixes or workarounds are implemented properly, checked-in and redeployed.

When you run incident management fire drills, make sure that developers are included. And developers should also be included in incident postmortem reviews, even if they weren't part of the incident management team, because this is an important opportunity to learn more about the system and to improve it. But if you have developers firefighting in production more than almost never, then you’re doing almost everything wrong.

Debugging in production?

Some problems, intermittent failures and timing-related problems and heisenbugs, only happen in production and can’t be reproduced in test – or at least not without a lot of time, expense and luck. To debug these problems a developer may need to examine the run-time state of the system when the problem happens. But these problems should be the exception, not the rule. Debugging in production opens up security problems (exposing private data in memory) and run-time risks that developers and Ops both need to be aware of.

Question: Should developers have access to production?

Answer: Whenever an error occurs that I can’t replicate in a dev environment, I'm always SO tempted to hop into prod and start adding in some output statements... Yeah, it’s probably a good thing I don’t have access to prod.

Enderjsy, Slashdot: Should Developers Have Access to Production

Deploying to production?

Auditors will tell you that the people who write the code cannot be the same people who deploy it in production. But some developers will tell you that they need to take care of deployment, because Ops won’t understand all the steps involved, or at least that they need to manually check that all of the config changes were made correctly, and to run the data conversion and check that it worked, and to make sure that the right code was installed in the right places. If this is how your deployment is done, you’re doing it wrong.

And you’re doing a lot of things wrong if Ops won’t trust development enough to push changes out at all:

Most times, when I see devs screwing with production it's either a "hero" coder who is way too good to use best practices, or a situation in which the environment is so hostile that the "best" solution seems to be breaking the rules.

I once did some contract work for a company where the QA and testing process took a minimum of two weeks for the most trivial changes, and where the admins on the production servers refused to deploy things like security patches without a testing period that ran close to a month. The devs there had a hundred tricks for sneaking their code into production, and linking production code to the development servers in an attempt to meet their productivity goals.

SatanicPuppy, Slashdot: Should Developers Have Access to Production

Testing in production?

The only testing that has to be done in production is A/B split testing to see which features customers like or don’t like. You should not need to test in production to see if something works – that’s what test environments are for – except maybe when you are deploying and launching a system for the first time, for a few integration test cases with other systems that can’t be reached from a test environment, or for load testing done with Ops – a lot of shops can’t afford a test environment sized big enough for real load testing.

Making Production Safe for (and from) Developers

Whether developers should have production access (and how much access you can allow them) also depends on how much developers can be trusted to be careful and responsible with the systems and with customer data. It’s inconsistent that while organizations will trust developers to write the software that runs in production, they won’t trust them with the production system. But development and production are different worlds.

Most developers lack the necessary situational awareness. They are used to experimenting and trying things to see what happens. I've seen smart, experienced developers do dangerous things in production without realizing it while they are deep into problem solving. Developers should be scared of working in production. Not too scared to think, but scared enough to think before they act. They need to understand the risks, and be held to the same duties of care as anyone in Ops.

You can spend a lot of time breaking down the wall between development and Ops, only to see it built back up overnight (much thicker and higher too) the first time that a developer blows away a production database when they thought they were in test, or kills the wrong process or hot deploys the wrong version of code or deletes the wrong config file and causes a widespread outage. Make sure that test and development environments are firewalled from production so that it isn’t possible for anything running in test to touch production through hard-wired links. Make it clear to developers when they are in production. Force them to make a jump: open a tunnel, sign on with a different id and password, see a different prompt.
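One way to force that kind of jump at the script level is a guard that makes acting against production an explicit, deliberate step. Here is a minimal sketch – the environment variable, the `prod-` hostname convention, and the typed acknowledgement are all illustrative assumptions, not a standard practice:

```python
import os
import socket

PROD_HOSTNAME_PREFIX = "prod-"  # hypothetical naming convention for production hosts


def confirm_production(action, ask=input):
    """Refuse to run a risky action against production unless the operator
    explicitly types the word PRODUCTION to acknowledge where they are."""
    # APP_ENV_HOST is a hypothetical override; fall back to the real hostname.
    host = os.environ.get("APP_ENV_HOST", socket.gethostname())
    if not host.startswith(PROD_HOSTNAME_PREFIX):
        return True  # test/dev: no ceremony needed
    answer = ask(f"You are on {host}, about to: {action}. Type PRODUCTION to continue: ")
    return answer.strip() == "PRODUCTION"
```

An admin or repair script would call `confirm_production("drop temp tables")` before doing anything destructive, so a developer who thinks they are in test gets stopped at the prompt instead of at the postmortem.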

With great power comes great responsibility

Nobody supporting an app should need – or even want – root access for day-to-day support and troubleshooting. Developers should only be granted the access that they need and no more, so that they can’t do things they shouldn't do, they can’t see things that they shouldn't see, and so that they can’t cause more damage than you can afford.

At the Velocity Conference in 2009, John Allspaw and Paul Hammond explained how important and useful it is for developers to have access to production - but that most of this access can be and should be limited:

Allspaw: “I believe that ops people should make sure that developers can see what’s happening on the systems without going through operations… There’s nothing worse than playing phone tag with shell commands. It’s just dumb.”

“Giving someone [i.e., a developer] a read-only shell account on production hardware is really low risk. Solving problems without it is too difficult.”

Hammond: “We’re not saying that every developer should have root access on every production box.”

Developers who need access to the system should be given a read-only account that allows them to monitor the run-time – logs and metrics. Then force them to make another jump to gain whatever command or write access they need to do admin functions or help with repair and recovery.

One problem is that a lot of systems aren’t designed with fine-grained access control at the admin level: there’s an admin user (which owns the application and can see and do everything needed to set up and run the system), and there’s everybody else. It can be painful to break out the application and environment ownership structure and permissioning scheme, to separate read-only monitoring access from support and control functions, to set up sudo privilege escalation rules, and to track and manage all of the user accounts properly.
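As a sketch of what those sudo rules might look like, a developer group can be granted specific diagnostic commands without a general root shell – the group names, paths and service name below are all illustrative, not a recommendation:

```
# /etc/sudoers.d/dev-diagnostics (illustrative only)
# Developers can read application logs, and nothing more:
%appdevs  ALL = (root) NOPASSWD: /usr/bin/journalctl -u myapp
# On-call developers can restart the app, with a password prompt
# and an entry in the sudo audit log:
%oncall   ALL = (root) /usr/sbin/service myapp restart
```

Keeping the allowed commands narrow and argument-free matters here: wildcards in sudoers command arguments are a well-known way to accidentally grant much more than you intended.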

And none of this works if you aren’t properly protecting confidential and private data, or other data that somebody could use for their own benefit: tokenizing or masking or encrypting data so that it can’t be read, hashing critical data to make sure that it hasn’t been tampered with, and making sure that confidential data isn’t written to logs or temporary files.
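As one small sketch of keeping confidential data out of logs, a logging filter can mask known sensitive patterns before a record reaches any handler. The patterns and masks here are hypothetical examples, and a real system would need rules matched to its own data:

```python
import logging
import re

# Hypothetical patterns for data that must never appear in logs.
SENSITIVE = [
    (re.compile(r"\b\d{13,16}\b"), "****"),               # card-number-like digit runs
    (re.compile(r"\bssn=\S+", re.IGNORECASE), "ssn=****"),  # ssn=... key/value pairs
]


class MaskingFilter(logging.Filter):
    """Replace sensitive substrings in log messages with a fixed mask."""

    def filter(self, record):
        msg = record.getMessage()  # merge the format string and its args first
        for pattern, mask in SENSITIVE:
            msg = pattern.sub(mask, msg)
        record.msg, record.args = msg, ()
        return True  # keep the record, just masked
```

Attaching the filter to the root logger (`logging.getLogger().addFilter(MaskingFilter())`) masks messages from every module, which is safer than relying on each developer to remember what not to log.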

You also have to make sure that you can track what everyone in production does, what they looked at and what they changed through auditing in the application, database and OS; and track changes to important files (including the code) using a detective change control tool like OSSEC.
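In OSSEC, for instance, file integrity monitoring is configured through syscheck; a minimal fragment might look like the following, where the monitored paths are hypothetical and would be replaced by your own config and application directories:

```xml
<!-- ossec.conf fragment: watch config files and application binaries for changes -->
<syscheck>
  <frequency>7200</frequency>
  <directories check_all="yes" report_changes="yes">/etc,/usr/local/myapp/bin</directories>
  <alert_new_files>yes</alert_new_files>
</syscheck>
```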

All of these checks and safeties also make it safer for developers, as well as for Ops, and will hopefully be enough to keep the auditors satisfied.

Try to make it work

There are advantages to having developers working in production besides getting their help with support and troubleshooting.

The more time that developers spend working in production on operations issues with operations staff, the more that they will learn about what it takes to design and build a real-world system. Hopefully they will take this and design better, more resilient systems with more transparency and more consideration for support and admin requirements.

And having developers share responsibility for making the software work and for supporting it – proving that they care and helping out – will go a long way to breaking down the wall of confusion between operations and development.

It’s not a simple thing to do. It might not even be possible in your organization – at least not in your lifetime. You need to understand and balance the risks and advantages. You need to understand the political and governance constraints and how to deal with them. You need to put in the proper safeguards. And you need to make sure that you stay on the right side of compliance and regulations. But you’re leaving too much on the table if you don’t try.
