IIMS – Operational Workflows & Lifecycle

Information Infrastructure Management System (IIMS)

How alerts become incidents, incidents become actions, and operations stay under control

Purpose

This document describes the main operational workflows and lifecycles in IIMS, aligned with the current iims-api implementation and Flutter UI.

It explains:

How alerts are ingested and suppressed by maintenance
How incidents are created and managed
How tickets are created and linked
How topology and geo views provide situational awareness
How caches and dashboards are refreshed

Overview of Operational Flow

At a high level, IIMS manages five operational lifecycles:

Alert ingestion and normalization
Maintenance suppression and control
Incident creation and lifecycle management
Ticket creation and limited synchronization
Topology and geo visualization for situational awareness

These lifecycles transform raw monitoring signals into structured human workflows.

1. Alert Ingestion Workflow

1.1 Alert Sources

Alerts originate primarily from:

Zabbix (the primary monitoring provider in the current implementation)

Notes:

Prometheus and additional providers are planned but not yet implemented

Alerts represent raw technical signals. They may be:

Frequent
Short-lived
Repetitive and noisy

1.2 Alert Ingestion Flow

Implemented flow:

Monitoring provider generates an alert
Provider adapter receives and normalizes the alert
Alert is stored in:
Alert history
Alert cache (for dashboards and maps)
Maintenance suppression rules are applied immediately

At this stage:

The alert is a technical event
Automatic incident creation is not guaranteed
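The implemented flow above can be sketched as follows. This is a minimal illustration, not the iims-api code: the Alert fields, normalize_zabbix, ingest, the in-memory history and cache, and the is_suppressed callback are hypothetical names, and the payload keys are assumptions rather than the actual Zabbix payload format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Alert:
    """Provider-neutral alert shape (field names are assumptions)."""
    provider: str
    external_id: str
    asset: str
    site: str
    severity: str
    suppressed: bool = False


def normalize_zabbix(raw: dict) -> Alert:
    """Map a provider payload to the common shape; the keys used here are assumed."""
    return Alert(
        provider="zabbix",
        external_id=str(raw["eventid"]),
        asset=raw["host"],
        site=raw.get("site", "unknown"),
        severity=raw.get("severity", "info"),
    )


def ingest(raw: dict, history: list, cache: dict,
           is_suppressed: Callable[[Alert], bool]) -> Alert:
    """Normalize, store in history and cache, then apply maintenance suppression."""
    alert = normalize_zabbix(raw)
    history.append(alert)                               # alert history (audit)
    cache[(alert.provider, alert.external_id)] = alert  # read cache for dashboards and maps
    alert.suppressed = is_suppressed(alert)             # applied immediately after storage
    return alert
```

Note that ingest never creates an incident; that decision is left to correlation or to an operator.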

Planned:

Multi-provider alert ingestion
Streaming ingestion pipelines

2. Maintenance Suppression Workflow

Maintenance is a first-class control mechanism in the current implementation.

2.1 Maintenance Matching

For each incoming or updated alert:

Active maintenance windows are evaluated
Matching is done by:
Site
Asset
Tags and scope

2.2 Suppression Behavior

If maintenance applies:

Alert is marked as suppressed
Alert cannot create or update incidents
Ticket creation is blocked
Alert history is still preserved for audit

This prevents:

False incidents
Operator fatigue
Incorrect escalation during planned work
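A minimal sketch of maintenance matching, assuming a window may scope by site, asset, and tags, and that every criterion set on the window must match. MaintenanceWindow and is_suppressed are hypothetical names, and the exact matching semantics in IIMS may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, Optional


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime
    site: Optional[str] = None      # None means "any site"
    asset: Optional[str] = None     # None means "any asset"
    tags: frozenset = frozenset()   # all listed tags must be present on the alert

    def is_active(self, now: datetime) -> bool:
        # Half-open interval: active from start up to, but not including, end.
        return self.start <= now < self.end

    def matches(self, site: str, asset: str, tags: frozenset) -> bool:
        # Assumption: a window with no scope set matches every alert.
        if self.site is not None and self.site != site:
            return False
        if self.asset is not None and self.asset != asset:
            return False
        return self.tags <= tags


def is_suppressed(windows: Iterable[MaintenanceWindow], site: str, asset: str,
                  tags: frozenset, now: datetime) -> bool:
    """An alert is suppressed if any active window matches it."""
    return any(w.is_active(now) and w.matches(site, asset, tags) for w in windows)
```

A suppressed alert would still be written to alert history; only incident and ticket creation are gated on this result.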

Planned:

Provider-side maintenance synchronization
Impact simulation during maintenance windows

3. Alert Correlation Workflow

3.1 Correlation Purpose

Raw alerts are not directly suitable for human workflows.

Correlation groups alerts into meaningful operational problems and reduces noise.

3.2 Correlation Rules (Implemented Scope)

In the current implementation, correlation is limited and rule-light:

Alerts may be grouped by:
Asset
Site
Simple time proximity

Notes:

No advanced correlation engine is implemented yet
Topology-based correlation is not automatic in the current implementation

3.3 Correlation Flow

Implemented behavior:

New alert arrives
IIMS searches for an existing open incident for the same asset or site
If found, the alert is attached to that incident
If not found, an operator or simple rule may create a new incident

Rules:

Many alerts may belong to one incident
One alert belongs to at most one active incident
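The attach-or-defer behavior and the two rules above can be sketched as follows. This is an illustration under assumptions: the Incident shape, OPEN_STATES, and the preference for an asset match over a site match are hypothetical, and time proximity is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

OPEN_STATES = {"New", "Investigating"}  # assumed set of "open" statuses


@dataclass
class Incident:
    id: int
    asset: str
    site: str
    status: str = "New"
    alert_ids: set = field(default_factory=set)


def find_open_incident(incidents: List[Incident], asset: str, site: str) -> Optional[Incident]:
    """Prefer an open incident on the same asset, then one on the same site."""
    for inc in incidents:
        if inc.status in OPEN_STATES and inc.asset == asset:
            return inc
    for inc in incidents:
        if inc.status in OPEN_STATES and inc.site == site:
            return inc
    return None


def correlate(alert_id: str, asset: str, site: str,
              incidents: List[Incident]) -> Optional[Incident]:
    """Attach the alert to an open incident, or return None to defer to an operator or rule."""
    # Rule: one alert belongs to at most one active incident.
    for inc in incidents:
        if inc.status in OPEN_STATES and alert_id in inc.alert_ids:
            return inc
    match = find_open_incident(incidents, asset, site)
    if match is not None:
        match.alert_ids.add(alert_id)  # many alerts may share one incident
    return match
```

Returning None instead of creating an incident mirrors the rule-light scope: creation stays with an operator or a simple rule.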

Planned:

Advanced correlation rules
Automatic incident creation and merging
Topology-aware correlation

4. Incident Creation and Lifecycle

4.1 Incident Creation

An incident is created when:

One or more alerts indicate a real operational problem
An operator or simple rule decides to open an incident

Incidents are the central operational objects in IIMS.

4.2 Incident Lifecycle States

Implemented lifecycle:

New
Investigating / Open
Resolved
Closed

Limited / optional:

Suppressed (maintenance)

Notes:

Assigned, In-Progress, and SLA states are not yet fully implemented
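One way to enforce the lifecycle is a small transition table. The states below come from the list above, but the allowed transitions are an assumption made for illustration; this document does not specify them, and IncidentState, ALLOWED_TRANSITIONS, and transition are hypothetical names.

```python
from enum import Enum


class IncidentState(Enum):
    NEW = "New"
    INVESTIGATING = "Investigating / Open"
    RESOLVED = "Resolved"
    CLOSED = "Closed"
    SUPPRESSED = "Suppressed"  # limited / optional (maintenance)


# Assumed transition table; the source lists the states but not the allowed moves.
ALLOWED_TRANSITIONS = {
    IncidentState.NEW: {IncidentState.INVESTIGATING, IncidentState.SUPPRESSED},
    IncidentState.INVESTIGATING: {IncidentState.RESOLVED, IncidentState.SUPPRESSED},
    IncidentState.SUPPRESSED: {IncidentState.INVESTIGATING},
    IncidentState.RESOLVED: {IncidentState.CLOSED},
    IncidentState.CLOSED: set(),
}


def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the new state, or raise ValueError if the move is not in the table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Keeping the table as data means planned states such as Assigned or In-Progress could be added without changing the transition function.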

Planned:

Assigned and In-Progress states
Incident merging and duplication handling
SLA timers and escalation policies

4.3 Incident Updates

During the lifecycle:

Alerts may be added or removed
Severity may change
Tickets may be created or linked
Comments and actions are recorded

All important changes generate activity events.

4.4 Incident Ownership and Responsibility

Implemented tracking:

Incident status
Related alerts
Linked ticket (optional)

Limitations:

Ownership and team assignment are basic
No automated SLA enforcement

Planned:

Full ownership and team assignment model
SLA timers and breach detection
Escalation workflows

5. Ticket Synchronization Workflow

5.1 Ticket Creation Policy

Tickets are optional.

A ticket may be created when:

Operator requests ticket creation manually
Incident reaches high severity

Not every incident requires a ticket.

5.2 Ticket Creation Flow

Implemented flow:

Operator requests ticket creation from an incident
IIMS sends request to the Zammad adapter
Zammad creates a ticket
Ticket reference is stored in the incident

Maintenance safeguard:

Ticket creation is blocked if maintenance is active for the asset or site
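The creation flow and the maintenance safeguard can be sketched as below. The TicketAdapter interface, its create_ticket method, MaintenanceActiveError, and the Incident fields are hypothetical; they stand in for the Zammad adapter and do not reflect the real Zammad API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


class TicketAdapter(Protocol):
    """Hypothetical adapter interface; returns the external ticket reference."""
    def create_ticket(self, title: str) -> str: ...


class MaintenanceActiveError(Exception):
    """Raised when ticket creation is blocked by an active maintenance window."""


@dataclass
class Incident:
    id: int
    title: str
    asset: str
    site: str
    ticket_ref: Optional[str] = None


def create_ticket_for_incident(incident: Incident, adapter: TicketAdapter,
                               maintenance_active: Callable[[str, str], bool]) -> str:
    """Create a ticket for the incident unless maintenance covers its asset or site."""
    if maintenance_active(incident.asset, incident.site):
        raise MaintenanceActiveError(f"maintenance active for incident {incident.id}")
    if incident.ticket_ref is not None:
        return incident.ticket_ref  # a repeated request reuses the stored reference
    incident.ticket_ref = adapter.create_ticket(incident.title)
    return incident.ticket_ref
```

Storing the reference on the incident is what lets a repeated request return the existing ticket instead of creating a second one.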

5.3 Ticket Synchronization

Current behavior:

Ticket reference and status are stored in IIMS
Limited periodic synchronization may update status

Limitations:

No full bi-directional real-time sync
Comments and workflow states are not fully mirrored

Planned:

Real-time bi-directional synchronization
Multi-ticket per incident
Multi-provider ticketing support

6. Topology and Geo Impact Workflow

6.1 Purpose of Topology in the Current Implementation

Topology and geo views are used primarily for:

Visualization of connectivity
Situational awareness
Manual impact assessment

6.2 Impact Behavior

Implemented behavior:

Asset and link status are computed individually
Link status is evaluated using manual bindings and policy rules
GeoMap shows clusters, assets, and links

Limitations:

No automatic impact propagation
No root cause inference engine

Planned:

Automatic dependency traversal
Blast-radius computation
Root cause candidate identification
Service-level impact modeling

7. Activity and Audit Workflow

7.1 Activity Event Generation

Activity events are generated for:

Incident creation and updates
Ticket creation and status changes
Maintenance creation and updates
User comments and actions

7.2 Audit and Timeline

The current implementation provides:

Per-incident activity timeline
Audit trail for operational actions

Planned:

Cross-object timelines (site, asset, service)
Post-mortem and reporting tools
SLA and compliance reporting

8. Dashboard and Cache Update Workflow

8.1 Operational Caches

IIMS maintains read caches for:

Alert summaries
Incident counts
Asset and site health
GeoMap and topology views

8.2 Refresh Flow

When operational state changes:

Domain services update persistent state
Background workers refresh summary caches
UI queries only IIMS APIs and caches

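Because the UI reads only from IIMS APIs and caches, dashboard queries are served from precomputed summaries, which keeps the UI fast as alert and incident volume grows.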
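A minimal sketch of the refresh flow, assuming a dirty flag signals the background worker. IncidentStore and SummaryCache are hypothetical in-memory stand-ins for the persistent state and the read cache, not the actual IIMS components.

```python
from collections import Counter


class IncidentStore:
    """Stand-in for persistent state written by domain services."""
    def __init__(self):
        self.status_by_id = {}
        self.dirty = False

    def set_status(self, incident_id: int, status: str) -> None:
        self.status_by_id[incident_id] = status
        self.dirty = True  # signal that summaries are stale


class SummaryCache:
    """Stand-in for the read cache queried by the UI."""
    def __init__(self):
        self.incident_counts = {}

    def refresh(self, store: IncidentStore) -> bool:
        """Recompute counts if state changed; would run in a background worker."""
        if not store.dirty:
            return False
        self.incident_counts = dict(Counter(store.status_by_id.values()))
        store.dirty = False
        return True
```

The UI would read incident_counts only, so a slow recomputation delays freshness but never blocks a dashboard query.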

Planned:

Streaming and push-based UI updates
Real-time dashboards with WebSockets

9. Failure Handling and Recovery

9.1 Provider Failures

Behavior:

Provider errors are captured and logged
Background retries handle recovery
IIMS core remains operational

Limitations:

No automatic provider failover

Planned:

Provider health scoring
Automatic failover and degraded-mode routing

9.2 Idempotency and De-duplication

Implemented safeguards:

Alert ingestion is idempotent
Incident creation prevents duplicates
Ticket creation is protected against repeated requests
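Idempotent ingestion is commonly achieved by keying each alert on a stable identifier, so that a redelivered alert updates the existing record instead of creating a new one. The sketch below assumes a (provider, external id) key; upsert_alert and the dict-based store are hypothetical and may not match how IIMS derives its keys.

```python
def upsert_alert(store: dict, provider: str, external_id: str, payload: dict) -> bool:
    """Insert or update an alert by its dedup key; return True only for a new alert."""
    key = (provider, external_id)  # assumed dedup key
    is_new = key not in store
    # Merge so a retry or status update refreshes fields without duplicating the alert.
    store[key] = {**store.get(key, {}), **payload}
    return is_new
```

Calling upsert_alert twice with the same provider and external id leaves a single record, which is what makes background retries after provider errors safe.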

Planned:

Global deduplication rules
Cross-provider idempotency

10. End-to-End Example Workflow

Typical failure scenario:

Router interface goes down in Zabbix
Alert is ingested into IIMS
Maintenance rules are checked (none active)
Operator reviews alert and creates or attaches to an incident
Link and asset status are visualized on GeoMap
Operator creates a ticket in Zammad
Engineer investigates and resolves the issue
Alerts clear and incident moves to Resolved
Ticket is closed and incident is Closed

11. Planned Enhancements

Recommended focus areas for future phases:

Automated alert correlation and incident creation
Topology-based impact propagation and RCA
SLA timers, escalation, and assignment workflows
Real-time ticket synchronization
Streaming dashboards and push notifications
Service and business impact modeling

12. Summary

In the current implementation, IIMS operational workflows provide:

Reliable alert ingestion and suppression
Manual and rule-light incident handling
Safe ticket creation with maintenance gating
Visual topology and geo situational awareness
Full audit and activity tracking

This establishes a stable operational foundation.

Future phases extend this foundation with:

Automation and intelligence
Root cause and impact engines
SLA-driven workflows
Real-time collaboration and dashboards

These enhancements build on the current architecture without breaking existing operations.