Incident management is the group of services by which an organization responds to unplanned events and service disruptions. The severity of incidents can vary widely, from an entire global web service crashing down to a small number of users, or even a single user, experiencing an isolated error.
Incidents are the events that leave users frustrated, blocked, isolated, or unproductive, and that most often result in unfavorable comments about IT. It therefore makes sense to get this important service right.
This is 2020 and we are beyond the mere “it’s broken” process; modern incident management must account for
- a growing body of complex matrices of priorities and response levels for multiple services across customers or lines of business
- one-size-does-not-fit-all communications to stakeholders
- production and non-production environments
- vertical and lateral escalations
- effective collaboration across centralized and distributed teams
- the capability to manage major incidents
- continuous improvement
These are the expectations and the reality of information and communications technology (ICT) teams today.
In other words, while we were heads-down Fixing Stuff, incident management evolved into a Practice with a capital P.
To meet these expectations, teams need reliable methods to prioritize incidents so they can meet restoration service levels effectively. Many organizations report downtime due to outages costing more than $300,000 per hour, according to Gartner. For some web-based services, that number can be dramatically higher. The ITSM tools used should therefore be capable of identifying incidents and alerting teams in multiple ways, based on multiple criteria – fast.
Steps in the Incident Management Process
Incident Management doesn’t have to be the cumbersome, lumbering, heavyweight process many associate with ITIL (incorrectly, in this author’s opinion), but how it looks will in large part depend on the structure and culture of the organization. However, certain characteristics need to be present for the practice to add value to the enterprise.
Most importantly, the rhythms and patterns of categorization and prioritization, service levels, and communication should already have been agreed with stakeholders and customers when the service was designed or introduced.
Identify & Log
An incident can come from anywhere: an employee, a customer, a vendor, or a monitoring system. No matter the source, the first two steps are simple: the incident is identified, and the incident is logged. The resulting incident records (i.e., tickets) typically include:
- The name of the person reporting the incident
- The date and time the incident is reported
- A description of the incident (what is down or not working properly)
- A unique identification number assigned to the incident, for tracking
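As a rough illustration, the four fields above could be captured in a record like the following. This is a minimal sketch, not how any particular ITSM tool stores tickets: the `Incident` class, the `log_incident` helper, and the `INC-` numbering scheme are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical in-memory ID sequence; a real ITSM tool assigns its own IDs.
_ticket_ids = count(1)


@dataclass
class Incident:
    reporter: str       # name of the person reporting the incident
    description: str    # what is down or not working properly
    # date and time the incident is reported (UTC)
    reported_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # unique identification number, for tracking
    incident_id: str = field(
        default_factory=lambda: f"INC-{next(_ticket_ids):06d}"
    )


def log_incident(reporter: str, description: str) -> Incident:
    """Create a new incident record with a timestamp and a unique ID."""
    return Incident(reporter=reporter, description=description)
```

Whatever the source of the report (a person or a monitoring system), the same record shape applies; only the `reporter` value changes.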
Categorize & Prioritize
Assigning a logical, intuitive category (and subcategory, as needed) to every incident helps you analyze your data for trends and patterns, which is a critical part of effective problem management and preventing future incidents.
Every incident must also be prioritized, and in most circumstances, this can be done through automation. It starts with the above categorization and then with assessing the impact on the business and the number of users, applicable SLAs, and potential financial, security, compliance, and reputation implications of the incident. Some tools can even look for incident patterns and respond before users have noticed.
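One common way to automate this is an impact/urgency matrix. The sketch below is only an assumption about how such a rule might look: the `Level` scale, the user-share thresholds, and the `P1` to `P4` labels are hypothetical and would in practice come from the agreed SLAs.

```python
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def impact_from_users(users_affected: int, total_users: int) -> Level:
    """Estimate impact from the share of users affected (thresholds are illustrative)."""
    if total_users <= 0:
        raise ValueError("total_users must be positive")
    share = users_affected / total_users
    if share >= 0.5:
        return Level.HIGH
    if share >= 0.05:
        return Level.MEDIUM
    return Level.LOW


def priority(impact: Level, urgency: Level) -> str:
    """Combine impact and urgency into a priority label."""
    score = impact + urgency  # ranges from 2 (low/low) to 6 (high/high)
    return {6: "P1", 5: "P2", 4: "P3"}.get(score, "P4")
```

Other factors named above, such as security or compliance implications, could be modeled by raising the urgency level before calling `priority`.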
- Initial diagnosis: Ideally, your front-line support team can see an incident through from diagnosis to closure, but if they can’t, the next step is to log all the pertinent information and escalate to the next-tier team.
Swarming is a technique taken from DevOps that, in this context, combines escalation, investigation, diagnosis, and resolution; it can be appropriate for both major and non-major incidents.
- Escalate, investigate, diagnose: Vertical or lateral? You may need both for major incidents, and lateral escalation – single or multiple – may be the only way to address an issue. Sometimes teams bring in outside resources or other department members to consult and assist with the resolution.
- Communicate: Updates are shared with relevant internal and external stakeholders via paths that are best suited to the recipient – SMS, WhatsApp, Teams, Slack, other text/chat, automated messaging, email, webpage, one-to-one.
- Resolution and recovery: Necessary steps are taken to resolve the incident. In a major incident, recovery refers to the time it may take for operations to be fully restored, since some fixes (such as bug patches) may require testing and deployment even after the proper resolution has been identified.
- Closure: This can take multiple forms depending on the type of incident. For simple issues, closure might follow automatically from resolution. For major incidents, it might occur only after an “after-action report” has been completed for the event.
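The steps above can be read as a small state machine. The following is a hedged sketch of that idea only: the state names and the `ALLOWED` transition table are assumptions made for illustration, and a real workflow would be configured in the ITSM tool to match the organization's own process.

```python
from enum import Enum


class State(Enum):
    LOGGED = "logged"
    PRIORITIZED = "prioritized"
    IN_DIAGNOSIS = "in diagnosis"
    ESCALATED = "escalated"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Hypothetical transition table: which states may follow which.
ALLOWED = {
    State.LOGGED: {State.PRIORITIZED},
    State.PRIORITIZED: {State.IN_DIAGNOSIS},
    State.IN_DIAGNOSIS: {State.ESCALATED, State.RESOLVED},
    State.ESCALATED: {State.IN_DIAGNOSIS, State.RESOLVED},
    # A resolved incident may be reopened if the fix does not hold.
    State.RESOLVED: {State.CLOSED, State.IN_DIAGNOSIS},
    State.CLOSED: set(),
}


def advance(current: State, target: State) -> State:
    """Return the target state if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

Making the transitions explicit means an incident cannot, for example, jump straight from logged to closed without being prioritized and resolved first.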
Incident management tools
Incident management isn’t done with a tool alone, but with the right blend of tools, practices, and people. Here are several of the most common tool categories for effective incident management:
- Incident tracking: Every incident should be tracked and documented so you can identify trends and make comparisons over time. Integrating a customer portal and event and asset management tools with Jira Service Desk workflows can help eliminate large swaths of hands-on ticket work.
- Chat room: Real-time text communication is invaluable for diagnosing and resolving the incident as a team, as well as a tool for end-users to report issues. When integrated with your ITSM toolset it can provide a rich set of data for response analysis later on.
- Video chat: Video chat complements text chat for many incidents; a team video call can help distributed teams discuss the findings and map out a response strategy.
- Alerting system: A tool such as OpsGenie integrates with your monitoring system and manages on-call rotations and escalations.
- Documentation tool: A tool such as Confluence can capture incident state documents and post-mortems. Additionally, Confluence integration with Jira Service Desk contributes to self-service and self-help initiatives.
- Communication: Communicating status with both internal stakeholders and customers through Statuspage helps keep everyone in the loop during a major incident. In-built notifications and workflows
Using automation, adopting practices from DevOps, and recognizing incident management as a practice of related activities can keep the incident management process lean, and the Incident Management Practice a valuable source of information for other practices in IT/ICT. If you or anybody in your enterprise wants to learn more about incident management, contact one of our expert ITSM consultants today.