Managing and resolving high-impact occurrences
Event management is the process applied by the Kayndrex Foundation teams to respond to an unplanned event or service interruption and restore the service to its operational state.
The Foundation defines a high-impact occurrence as an emergency-level unavailability of service.
The definition of emergency-level varies across firms. At the Foundation, we have three severity levels and the top two (SEV 1 and SEV 2) are both considered major occurrences.
If a stakeholder-facing service is unavailable for all the Foundation’s stakeholders, that is a SEV 1 event. If the same service is unavailable for a subset of stakeholders, that is a SEV 2. Both fall under the heading of major occurrences and require an immediate response from our event management teams.
Any issue that does not affect essential operations is considered a SEV 3 and is an inessential event.
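As a rough illustration, the severity rules above could be encoded as follows. This is a minimal sketch, not the Foundation's actual tooling: the names `Severity`, `classify`, and `is_major` are hypothetical, and it assumes that anything short of a stakeholder-facing outage falls to SEV 3.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # service unavailable for all stakeholders
    SEV2 = 2  # service unavailable for a subset of stakeholders
    SEV3 = 3  # everything else (assumption: no stakeholder-facing outage)


def classify(service_unavailable: bool, all_stakeholders_affected: bool) -> Severity:
    """Map the two questions from the text onto a severity level."""
    if not service_unavailable:
        return Severity.SEV3
    return Severity.SEV1 if all_stakeholders_affected else Severity.SEV2


def is_major(severity: Severity) -> bool:
    """SEV 1 and SEV 2 are both treated as major."""
    return severity <= Severity.SEV2
```

For example, `classify(True, False)` yields `Severity.SEV2`, which `is_major` reports as major and which therefore calls for an immediate response.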
The event lifecycle (sometimes also called the event management process) is the path we take to identify, resolve, understand, and prevent recurring events.
Event management processes vary from firm to firm, but the key to success for any team is clearly defining and communicating severity levels, priorities, roles, and processes up front — before a major event arises.
The Foundation’s major event management process
At the Kayndrex Foundation, our event management process includes detection, raising a new event, opening communications, assessing, sending initial communications, escalation, delegation, sending follow-up communications, review, and resolution.
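One way to picture the stages listed above is as an ordered sequence. The sketch below is illustrative only: the `Stage` enum and `next_stage` helper are hypothetical names, and it treats the process as strictly linear, which is a simplification (an event resolved quickly by the on-call team may never need escalation or delegation).

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    DETECTION = "detection"
    RAISING = "raising a new event"
    OPENING_COMMS = "opening communications"
    ASSESSING = "assessing"
    INITIAL_COMMS = "sending initial communications"
    ESCALATION = "escalation"
    DELEGATION = "delegation"
    FOLLOW_UP_COMMS = "sending follow-up communications"
    REVIEW = "review"
    RESOLUTION = "resolution"


# Enum members iterate in definition order, which matches the order in the text.
STAGES = list(Stage)


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one."""
    index = STAGES.index(stage)
    return STAGES[index + 1] if index + 1 < len(STAGES) else None
```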
First, an event is detected by our technology, by stakeholder reports, or by personnel. Whoever detects the event is responsible for logging it in our system and assigning a severity level.
By the time an event reaches our teams, it already has a SEV 1, 2, or 3 attached. We consider SEV 1 and SEV 2 to be major events, while a SEV 3 indicates an inessential event.
Raising a new event
Once an event ticket is created, a notification is sent to the on-call professional responsible for that service.
The alert we send at the Foundation includes information on the severity and priority of the event, as well as a summary, making it clear — at a glance — whether this is the top priority or can wait if another event is in progress.
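To make the alert contents concrete, here is a minimal sketch of what such a notification might carry. The `Alert` class, its field names, and the numeric priority scale (1 = most urgent) are assumptions made for illustration; the text only says that the alert includes severity, priority, and a summary.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Alert:
    service: str
    severity: int  # 1, 2, or 3, as described in the text
    priority: int  # hypothetical scale: 1 = most urgent
    summary: str

    def headline(self) -> str:
        """One-line view meant to be readable at a glance."""
        return f"[SEV {self.severity} / P{self.priority}] {self.service}: {self.summary}"


def most_urgent(alerts: Iterable[Alert]) -> Alert:
    """Pick the alert to handle first: lowest severity number, then priority."""
    return min(alerts, key=lambda a: (a.severity, a.priority))
```

With two alerts in flight, `most_urgent` returns the one with the lower SEV number, which mirrors the idea that a lower-severity event can wait while a higher-severity one is in progress.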
Once the event manager gets an alert, their first responsibility is to communicate that the event fix is in progress. They change the status of the event to fixing and set up the team’s communication channels.
The event manager has been alerted and the communication channels are open. Next step: assessing the event itself.
For our teams, this process starts with a series of questions the team has to answer:
- What’s the effect on the Foundation’s stakeholders and personnel?
- What are stakeholders seeing?
- How many stakeholders are affected? (Some? All?)
- When did the event start?
- How many support inquiries have been opened with regard to this event?
- Are there other factors at play that affect the severity level or priority or change the way we approach the event? (E.g. security concerns, social media PR emergencies, etc.)
Once we have answered those questions, we can confidently move forward with diagnostics and proposed fixes or change the SEV level and priority level of an event as needed.
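The assessment questions above can also be tracked as a simple checklist, so the team can see which answers are still missing before moving on. This is a sketch under assumptions: the `Assessment` class and its field names are hypothetical and simply mirror the six questions in order.

```python
from dataclasses import dataclass, fields
from datetime import datetime
from typing import List, Optional


@dataclass
class Assessment:
    stakeholder_impact: Optional[str] = None     # effect on stakeholders and personnel
    visible_symptoms: Optional[str] = None       # what stakeholders are seeing
    stakeholders_affected: Optional[str] = None  # e.g. "some" or "all"
    started_at: Optional[datetime] = None        # when the event started
    support_inquiries: Optional[int] = None      # inquiries opened so far
    other_factors: Optional[str] = None          # e.g. security or PR concerns

    def unanswered(self) -> List[str]:
        """Names of the questions that have no answer yet."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def is_complete(self) -> bool:
        return not self.unanswered()
```

Note that an answer of `0` support inquiries counts as answered, because only `None` marks a question as open.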
Sending initial communications
Once we have confirmed that the event is real, communication with our stakeholders and personnel becomes top priority. As we describe:
‘The goal of initial internal communication is to focus the event response on one place and remove risk. The goal of external communication is to tell stakeholders that you know something is interrupted and you are looking into it as a matter of urgency.’
Speedy, accurate communication helps build and keep stakeholder trust.
We send an email to a set list of stakeholders that includes our engineering leadership, major event managers, and other key internal personnel.
Sometimes, an event is resolved quickly by the on-call team. But in other instances, the next step is to escalate the concern to another expert or team of experts better suited to resolve this specific event.
Once the concern has been escalated to someone new, the event manager delegates a role to them. At the Kayndrex Foundation, these roles are pre-set, so team members can quickly comprehend what is expected of them.
Sometimes major events require a single event manager and a team. Other times, a situation could call for multiple tech leads or even multiple event managers. The original event manager is responsible for ascertaining when that is the case and bringing on the appropriate people.
Sending follow-up communications
As the event continues to progress, another round of external communication helps keep stakeholders and personnel calm, trusting, and in the loop.
When it comes to event resolution, there is no one-size-fits-all approach. That is why, at this stage of the process, we take the time to:
- Observe what is going on, sharing and confirming observations with the team
- Develop theories with regard to why it is happening (and how we can fix it)
- Develop and implement experiments that prove or disprove our theories
Throughout this process, the event manager keeps an eye on how things are going.
We define resolution as ‘when the current or imminent business impact has concluded.’
At this point, the emergency has passed and the team transitions into cleanup and analysis.
Our event lifecycle concludes when the event is resolved. However, we also want to do everything in our power to prevent the event from happening again. That is why the next step is an analysis, designed to identify the cause of the event and help us reduce our risk in the future.
Roles and responsibilities
Roles and responsibilities will vary based on the firm’s culture, team size, on-call schedules, and more. Some common major event roles include:
Event manager: The person responsible for overseeing the resolution of the event.
Technical lead: A senior-level technical professional responsible for discovering what is interrupted and why, determining the best course of action, and running the technical team.
Communications manager: A communications professional (often from the PR or investor support teams) responsible for communicating with internal and external investors affected by the event.
Investor support lead: The person in charge of making sure incoming tickets, phone calls, and tweets about the event get a timely, appropriate response.
Social media lead: A social media professional in charge of communicating about the event on social channels.
Other common roles include:
Root cause analyst or issue manager: The person responsible for going beyond the event’s resolution to identify the root cause and any changes that need to be made to prevent the concern in the future.
Major event investigation board: A group responsible for investigation and change management.