In the last blog post, we talked about all the things we have to consider in our approach to incident management, and the increasing complexity of our technology services in the “always-on” world. Opsgenie and Atlassian’s other important supporting player, Status Page, are the keys to the kingdom when it comes to filtering out the noise of event management, reducing the time required to resolve incidents, and gaining control over the cost and effects of major incidents.
This is the story of Kevin and Bill.
Meet Kevin, IT Infrastructure Manager at Global Brands, Inc. Global Brands, Inc. (GBI) is a major food and beverage distributor.
GBI is headquartered in London with additional small network and support teams based in Hong Kong, Mumbai, and Kansas City, MO. The entire IT department uses Atlassian Jira Service Desk to manage its key processes, including Incident, Request, Change, and Problem. Their internal software development teams use Jira Software to manage software bugs, development, and releases.
Last year IT adopted Opsgenie and Status Page in response to growing impatience within IT for a way to manage alert noise. Feedback from the GBI business users sent a clear message that major incident communication was abysmal – inconsistent in both message and timing. Kevin was also being pressed to have IT come up with better ways to calculate the cost of outages as both an IT and a business expense.
Meet Bill, Sr. Network Manager and interim IT Director at C-Sweet Systems. They are an industrial cable manufacturer.
C-Sweet is based in London, with offices across the UK and manufacturing and distribution centers in New York, San Francisco, Dubai, Sydney, Hong Kong, and Rotterdam.
C-Sweet uses Atlassian Jira Service Desk in London to manage their internal ITSM processes, which include Incident, Request, and Change. HR, Facilities, Contracts, and Marketing business teams have also adopted JSD to help manage their workflows.
C-Sweet’s FY21 strategy includes introducing new cabling products for use in off-shore drilling rigs and High-G applications, and a customer portal to manage customer requests, inquiries, and service changes.
It’s not magic, it’s Opsgenie
Two IT Managers walk into a bar. They both receive a text on their phones. It’s late in the day on a sunny Friday and the only thing standing between them and the weekend is a text alert about a critical service with performance problems.
Kevin orders a pint and sits back to respond to the alert. He presses a button to acknowledge the alert and presses another button to fix the issue. As his pint arrives, he presses a final button declaring the alert closed.
Meanwhile, Bill apologizes and hurries back to his office, where he works through the fix by hand: log in to the network, remote in to the service, find the script he needs to run (he mutters that he can never find the darned thing, and something about putting a shortcut on his desktop for it), run the fix, test that the service is working properly, and answer another text from the service desk team, where calls are coming in from end users saying the system is broken. Finally, he logs off and sends a text to Kevin: “sorry, mate, let’s catch up next week?”
Kevin’s team configured Opsgenie to fire commands on demand in response to events, so that they can be resolved quickly. Bill’s team uses an event management system, but it’s not integrated with their Atlassian platform.
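To make Kevin’s three button presses concrete, here is a minimal sketch in Python of the acknowledge, run action, close flow. It is not Opsgenie’s API: the `AlertSession` class, the `ACTIONS` mapping, the alert name, and `restart_service` are all hypothetical names invented for illustration. The point is only that the fix is registered ahead of time, so nobody has to hunt for the script on a Friday afternoon.

```python
from typing import Callable, Dict, List


def restart_service() -> str:
    # Hypothetical stand-in for the real remediation script.
    return "service restarted"


# Hypothetical mapping from alert names to pre-registered remediation actions.
ACTIONS: Dict[str, Callable[[], str]] = {
    "critical-service-slow": restart_service,
}


class AlertSession:
    """Tracks one alert through acknowledge -> run action -> close."""

    def __init__(self, alert_name: str) -> None:
        self.alert_name = alert_name
        self.state = "open"
        self.log: List[str] = []

    def acknowledge(self) -> None:
        if self.state != "open":
            raise ValueError("only an open alert can be acknowledged")
        self.state = "acknowledged"
        self.log.append("acknowledged")

    def run_action(self) -> str:
        if self.state != "acknowledged":
            raise ValueError("acknowledge the alert before running an action")
        result = ACTIONS[self.alert_name]()  # look up and run the registered fix
        self.log.append(result)
        return result

    def close(self) -> None:
        self.state = "closed"
        self.log.append("closed")


# Kevin's three button presses:
session = AlertSession("critical-service-slow")
session.acknowledge()
session.run_action()
session.close()
```

Bill’s version of the same flow has the same three steps, but the middle one is a manual search for the script.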
Communicating and collaborating, not scurrying
Kevin and Bill both knew what their alerts meant and how to fix them without further troubleshooting, even though Bill had to do his fix manually and it took him 45 minutes, not including travel time between the pub and his office. Not all incidents have known workarounds, or are quite so obvious.
You’re a distributed organization, and your teammates are literally across the globe. Text, chat, and other asynchronous comms such as email are fine for many tasks, but we know that for complex technical situations, nothing beats face-to-face communication, screenshots, and videoconferencing. Opsgenie has a built-in war room feature, the Incident Command Center (ICC), where your team can wield their problem-solving superpowers alongside the teams in Hong Kong, Mumbai, and Slough. Just click the ICC link to launch the war room, and they’ll stay online while you swarm the issue. You can even reopen the war room as needed.
Focus on what matters most
It’s not just one of our core values, it’s also what Kevin and Bill have to do to make sure that the important incidents are spotted and dealt with, and that recurring incidents are captured for problem management. Event management systems are very effective at producing a lot of noise. While they offer customization around which events raise alerts, adding Opsgenie to your ecosystem gives you another layer of filtering. Additionally, you can use on-call calendars and groups to ensure the right teams are receiving event notifications. Kevin’s team has invested much time and effort configuring the alerts for their event management system, and they didn’t have to reinvent the wheel when they connected Opsgenie. Using the rules they already had in place, they have configured Opsgenie to create a Jira ticket for every alert of major severity and above for specific network and application events, but to send a text to the on-call rota only when the alert is critical. Opsgenie can accept integrations from over 200 monitoring, logging, automation, cloud, chat, ITSM, and deployment tools. The REST API extends Opsgenie’s integration options further.
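The routing rule Kevin’s team set up can be sketched in a few lines of Python. This is an illustration of the logic only, not Opsgenie configuration: the severity ordering, the category names, and the action labels are assumptions made for the example, and the sketch assumes the critical text applies to the same network and application events that get tickets.

```python
from dataclasses import dataclass
from typing import List

# Assumed severity ordering; the post only names "major" and "critical".
SEVERITY_RANK = {"minor": 1, "major": 2, "critical": 3}

# Assumed labels for the "specific network and application events".
TICKETED_CATEGORIES = {"network", "application"}


@dataclass
class Alert:
    category: str
    severity: str
    message: str


def route(alert: Alert) -> List[str]:
    """Return the actions to take for an alert, in order."""
    actions: List[str] = []
    rank = SEVERITY_RANK.get(alert.severity, 0)  # unknown severities rank lowest
    if alert.category in TICKETED_CATEGORIES and rank >= SEVERITY_RANK["major"]:
        # Every major-or-above alert in scope gets a Jira ticket.
        actions.append("create_jira_ticket")
        if rank >= SEVERITY_RANK["critical"]:
            # Only critical alerts page the on-call rota by text.
            actions.append("text_on_call")
    return actions
```

With these assumptions, a major network alert produces a ticket and no text, a critical application alert produces both, and anything minor or out of scope produces neither. Everything below the threshold is the noise that never reaches a phone.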
Bill’s event management system does a great job of sending some key event alerts via email to Jira Service Desk. He’s configured JSD to prioritize and assign them automatically. He could further automate, parse, and assign those with other marketplace apps like Jira Email This Issue, Jira Misc Workflow Extensions, Scriptrunner, and Elements Copy & Sync. Sadly, Bill is still missing key pieces of the puzzle.
Expose Updates via Status Page
Yes, this can be scary. How detailed do you go? We’ve read status updates which clearly stated that an update the company itself rolled out caused the problem, and that the problem would be fixed forward (or rolled back). We’ve also read status updates which say “This has been identified as unavailable. We are investigating.” and that remains the status until resolution. Not everyone wants to be updated. Thankfully, Status Page allows for self-subscription by product. If I’m on the 1st floor in Marketing, maybe I really don’t want to know from Facilities that the elevator is unavailable, but I really want to know that our externally facing website is down.
Yes, some end users just will not care about your updates, or will not read them, or will try to phone up the CIO and demand a personal update from the Head of IT. But there are more who will pay attention and will be grateful for the information. Take control of the message. Look across the entire service value chain and define a policy about what level of detail to share, and how to say it. Make it general enough for people to execute and provide examples. Then, stick to the messaging policy for consistency. Over time, coupled with a sound ITSM strategy, the feedback from your clients or business users will improve on this topic.
And what did Kevin do? Nothing: he fixed the issue before it could affect the customer.
Bill asks the service desk team to send a blast email to the distribution groups with a status update for the incident.
Post-Incident Review (PIR)
The dust has settled and there is still work to be done. Kevin did one final thing before he closed his Opsgenie app: he opened the incident and pressed the publish incident review button. The PIR is also known as the Post Mortem or any number of other monikers, and Opsgenie will generate one on demand as a well-formatted, dashboard-like report based on the data it collected during the incident. This data will be useful for the inevitable Problem and Change work which follows Incidents. And, of course, you can email this – the screenshot below was received from Opsgenie (recipient has been blanked out). You can extract individual elements from the tiles on the report. The PIR also includes the number of Status Page updates that were associated with the incident, if any.
The post-incident review contains statistics that will contribute to the GBI key performance indicators for Incident Management, including “Proactive Resolutions” by Quarter and “Cost v. Cost Avoidance.” Kevin knows that his “cost avoidance” was ahead of “cost” in the last fiscal year. To date, this is trending positive.
Bill will be mashing up his data from Jira and his event management tools using Excel in the week ahead of the C-Sweet IT Stakeholder meeting.
If a certain energy drink gives you wings, then Opsgenie gives you control. Control to filter out the noise. Control of cost through knowledge and reports. Control to tell the story of success without spin. Control of resolution using conferencing tools to bring together all the parts of your world. Add Status Page to the mix and gain control of the message. Regain control of your valuable time to have a pint with your mate or date night with your other half.