Engaging with Problem Management sounds like a no-brainer, but the when and why can be difficult for an IT organization to nail down. ITIL defines a Problem as the cause of one or more incidents. The point of Problem Management is to eliminate the causes of incidents, either reactively or proactively. Not all incidents need to be considered fodder for the Problem Management practice. How many incidents make a problem? How long is a piece of string? Today we’re going to take a look at these sticky questions and their answers, and see how the C-suite and everyone else handle them.
Know Your Business Terminology
Whilst ITSM is all about IT service management, service desks (especially external ones) are highly dependent on internal business processes and not just the technology. Question: do you know the difference between an incident and a problem? Or rather, what is your company’s definition of an incident and a problem?
An example of an incident is a service outage. When a system goes down, your team focuses on fixing it and restoring service within the agreed Service Level Agreement (SLA). That fix may be temporary or less than ideal, but it restores the service.
An example of a problem is a Node A failure that caused one or more service outages. The problem is identified in relation to one or more incidents and can remain open in the background whilst the incidents are closed as soon as possible. The problem record is where you find out why something failed and look for a more permanent fix or resolution.
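The distinction can be made concrete with a small data model. The following is a minimal sketch in Python, not a representation of any particular ITSM tool; the class names, fields, and ticket IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    """A single service disruption; closed as soon as service is restored."""
    incident_id: str
    summary: str
    closed: bool = False


@dataclass
class Problem:
    """The underlying cause of one or more incidents; may outlive them."""
    problem_id: str
    summary: str
    incidents: List[Incident] = field(default_factory=list)
    resolution: Optional[str] = None  # unknown until a permanent fix is found

    def is_resolved(self) -> bool:
        return self.resolution is not None


# Hypothetical example: two outages traced back to the same Node A failure.
outage_1 = Incident("INC-1", "Service X down", closed=True)
outage_2 = Incident("INC-2", "Service Y down", closed=True)
node_a = Problem("PRB-1", "Node A failure", incidents=[outage_1, outage_2])

# Both incidents are closed, yet the problem stays open until it is resolved.
all_incidents_closed = all(i.closed for i in node_a.incidents)
problem_still_open = not node_a.is_resolved()
```

The point of keeping the two record types separate is that their lifecycles differ: incidents close on service restoration, while the problem closes only once a permanent resolution is recorded.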
The Importance of Having a Problem Management Strategy
To transition incidents to problems effectively, you need to know whether you are going to manage your problems reactively or proactively. By this we mean the following:
Reactive problem management is where you solve problems in response to one or more incidents. In the example above, when one or more services go down, you restore them but also create a problem ticket in the background to assess the cause of the failure (Node A, in this case) and fix it.
Proactive problem management is where you identify or solve a potential problem or a known error in order to avoid future incidents. Whilst resolving the problem in the above-mentioned example, you realize Node A is linked to Node B. To ensure Node B doesn’t fail and cause more service outages, you quickly patch the nodes and make sure that Node A’s failure doesn’t impact Node B. By doing so, you stay one step ahead and reduce the likelihood of further service outages.
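Spotting which nodes are linked to a failed one can be sketched as a simple graph traversal. This is an illustrative example only: the `links` mapping, the `linked_nodes` function, and Nodes C and D are assumptions added for the sketch, and a real environment would take this data from a CMDB or monitoring tool.

```python
from collections import deque


def linked_nodes(links, failed):
    """Return every node reachable from `failed`, excluding `failed` itself."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return sorted(seen - {failed})


# Hypothetical topology: Node A is linked to Node B, which is linked to Node C.
links = {"Node A": ["Node B"], "Node B": ["Node C"], "Node D": []}

# Nodes worth checking or patching before they cause further incidents.
at_risk = linked_nodes(links, "Node A")
```

Here `at_risk` contains Node B and Node C, while the unrelated Node D is left out.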
Having a business strategy that includes both approaches to problem management is key. Many companies either have no strategy or commit to only one approach, and that is where they set themselves up to fail. By defining your strategy ahead of implementing ITIL processes, you make it easier for your teams to transition incidents to problems as and when required.
The Transition Process
The transition process focuses mainly on reactive problem management, but once an organization reaches a certain level of maturity, it shifts to proactive problem management, which we will explore in the following sections.
Once you have successfully restored your services by applying a fix within your SLA, you still need to explore why the failure happened, especially if there are linked incidents or multiple incidents of the same type. You need to log a separate problem ticket and populate it with as much detail as possible, because problem fixes are about being thorough, not about speed.
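Finding multiple incidents of the same type can be sketched as a simple grouping step. The example below is a minimal illustration under stated assumptions: the function name, the incident types, and the threshold of two are all hypothetical, and each organization has to decide for itself how many incidents justify a problem ticket.

```python
from collections import defaultdict


def problem_candidates(incidents, threshold=2):
    """Group incident IDs by type; keep types seen at least `threshold` times."""
    by_type = defaultdict(list)
    for incident_id, incident_type in incidents:
        by_type[incident_type].append(incident_id)
    return {t: ids for t, ids in by_type.items() if len(ids) >= threshold}


# Hypothetical closed incidents as (ID, type) pairs.
incidents = [
    ("INC-1", "node-failure"),
    ("INC-2", "login-error"),
    ("INC-3", "node-failure"),
]

# Each candidate would become a problem ticket with its incidents linked.
candidates = problem_candidates(incidents)
```

With this data, only the `node-failure` type qualifies, with INC-1 and INC-3 linked to it.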
Categorizing and prioritizing problems is essential so that the right teams work on them within the right timeframes. After you have investigated the problem thoroughly, you document the details of the error and look to apply a long-term fix.
Very often, the fix for the error leads to a change request, which could be a standard, normal, or emergency change depending on the urgency and impact of the fix required.
Knowledge sharing is the key differentiator in achieving a smooth, organized transition from incidents to problems, and it is what makes proactive problem management possible.
As outlined above, once you document the details of the error and (once approved and applied) the details of the fix, you start creating a known error database (KEDB).
The KEDB enables you to map all the information between incidents and problems in a structured manner, and includes the following information:
- Past incident information/links
- Category – classifying the information so that it is easier for you to find
- Root cause – the original reason for an incident or a problem
- Error – the flaw that needed to be fixed
- Workaround – temporary fix if required
- Resolution – more permanent fix often including details of the change request
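The fields listed above map naturally onto a single record type. The following is a minimal sketch of one KEDB entry, assuming a plain Python dataclass; the class name, field names, and sample values are hypothetical and not taken from any specific tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KnownError:
    """One KEDB entry, mirroring the fields listed above."""
    category: str                      # classification, to make the entry findable
    root_cause: str                    # original reason for the incident or problem
    error: str                         # the flaw that needed to be fixed
    incident_links: List[str] = field(default_factory=list)  # past incidents
    workaround: Optional[str] = None   # temporary fix, if one is required
    resolution: Optional[str] = None   # permanent fix, e.g. a change request


# Hypothetical entry for the Node A example; all values are illustrative.
entry = KnownError(
    category="infrastructure",
    root_cause="Node A failure",
    error="Fault on Node A affecting dependent services",
    incident_links=["INC-1", "INC-2"],
    workaround="Fail over to a standby node",
)

# The resolution stays empty until the permanent fix is approved and applied.
has_permanent_fix = entry.resolution is not None
```

Keeping the workaround and the resolution as separate optional fields reflects that an entry is useful as soon as a temporary fix is known, well before the permanent one exists.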
Should the same or similar incidents occur in the future, you and your team can resolve them faster and restore services more quickly, updating the KEDB each time you resolve more incidents or identify more problems. Once you have enough information, you also start noticing trends. For example, if a certain software’s quarterly update is known to cause bugs, you will know what workaround to apply to avoid incidents. You will also have a standard change process in place to schedule and apply a long-term fix during downtime.
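Looking up a known workaround and updating the KEDB in the same step can be sketched as follows. This is an illustrative example only: the dictionary layout, the `handle_incident` function, the category key, and the workaround text are all assumptions made for the sketch.

```python
# Hypothetical KEDB keyed by category; a real one would live in your ITSM tool.
kedb = {
    "quarterly-update": {
        "workaround": "Roll back to the previous version",
        "incident_links": ["INC-10"],
    }
}


def handle_incident(kedb, incident_id, category):
    """Return the known workaround, if any, and link the incident to the entry."""
    entry = kedb.get(category)
    if entry is None:
        return None  # no known error yet; investigate as a new problem
    entry["incident_links"].append(incident_id)
    return entry["workaround"]


# A new incident in a known category gets the documented workaround right away.
workaround = handle_incident(kedb, "INC-11", "quarterly-update")
```

Because every lookup also records the incident, the list of links per entry grows over time, which is what makes recurring trends visible.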
When new incidents occur, the process repeats itself, shifting between reactive and proactive and thereby improving incident and problem hand-offs.
At Valiantys, we have leveraged our customer experience and core expertise to build a solution that works seamlessly together for such scenarios using core elements in Jira Service Desk and Confluence. We have also added further enhancements using Opsgenie and Statuspage for more wide-scale incident management processes.
Having an effective and symbiotic incident and problem management relationship in place requires the right business strategy, tools, and people. It increases service availability by reducing the number of incidents, and improves customer satisfaction through faster resolution times. In a wider context, this helps organizations be more agile and commercially competitive thanks to the high availability of their services, ultimately demonstrating business value to all stakeholders.