
IIMS – Operational Workflows & Lifecycle

Information Infrastructure Management System (IIMS)

How alerts become incidents, incidents become actions, and operations stay under control


Purpose

This document describes the main operational workflows and lifecycles in IIMS, aligned with the current iims-api implementation and Flutter UI.

It explains:

  • How alerts are ingested and suppressed by maintenance
  • How incidents are created and managed
  • How tickets are created and linked
  • How topology and geo views provide situational awareness
  • How caches and dashboards are refreshed

Overview of Operational Flow

At a high level, IIMS manages five operational lifecycles:

  • Alert ingestion and normalization
  • Maintenance suppression and control
  • Incident creation and lifecycle management
  • Ticket creation and limited synchronization
  • Topology and geo visualization for situational awareness

These lifecycles transform raw monitoring signals into structured human workflows.


1. Alert Ingestion Workflow

1.1 Alert Sources

Alerts originate primarily from:

  • Zabbix (the primary monitoring provider in the current implementation)

Notes:

  • Prometheus and additional providers are planned

Alerts represent raw technical signals. They may be:

  • Frequent
  • Short-lived
  • Repetitive and noisy

1.2 Alert Ingestion Flow

Implemented flow:

  • Monitoring provider generates an alert
  • Provider adapter receives and normalizes the alert
  • Alert is stored in the alert history and in the alert cache (for dashboards and maps)
  • Maintenance suppression rules are applied immediately

At this stage:

  • The alert is a technical event
  • Automatic incident creation is not guaranteed
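
The ingestion steps above can be sketched as follows. This is a minimal illustration, not the iims-api code: the `Alert` fields, the payload keys, the severity labels, and the `AlertStore` class are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Zabbix numeric severities (0-5) mapped to hypothetical IIMS labels.
ZABBIX_SEVERITY = {
    0: "info", 1: "info", 2: "warning",
    3: "warning", 4: "high", 5: "critical",
}


@dataclass(frozen=True)
class Alert:
    """Provider-independent alert record (illustrative shape)."""
    provider: str
    external_id: str
    asset: str
    site: str
    severity: str
    active: bool
    received_at: datetime


def normalize_zabbix(payload: dict) -> Alert:
    """Translate a raw payload into the normalized Alert.

    The payload keys are assumptions; real Zabbix webhook payloads
    are defined by the media type configuration.
    """
    return Alert(
        provider="zabbix",
        external_id=str(payload["eventid"]),
        asset=payload["host"],
        site=payload.get("site", "unknown"),
        severity=ZABBIX_SEVERITY.get(int(payload.get("severity", 0)), "info"),
        active=str(payload.get("value", "1")) == "1",  # 1 = problem, 0 = resolved
        received_at=datetime.now(timezone.utc),
    )


@dataclass
class AlertStore:
    history: list = field(default_factory=list)  # append-only, kept for audit
    cache: dict = field(default_factory=dict)    # latest state per alert, for dashboards and maps

    def ingest(self, alert: Alert) -> None:
        self.history.append(alert)
        self.cache[(alert.provider, alert.external_id)] = alert
```

Keeping history append-only while the cache holds only the latest state per alert matches the split described above: history serves audit, the cache serves dashboards and maps.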

Planned:

  • Multi-provider alert ingestion
  • Streaming ingestion pipelines

2. Maintenance Suppression Workflow

Maintenance is a first-class control mechanism in the current implementation.

2.1 Maintenance Matching

For each incoming or updated alert:

  • Active maintenance windows are evaluated
  • Matching is done by site, by asset, and by tags and scope

2.2 Suppression Behavior

If maintenance applies:

  • Alert is marked as suppressed
  • Alert cannot create or update incidents
  • Ticket creation is blocked
  • Alert history is still preserved for audit
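
A minimal sketch of the matching and suppression rules above, assuming a window may scope by site, asset, and tags, and that an unset field acts as a wildcard. The class and function names are hypothetical and do not come from iims-api.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Alert:
    asset: str
    site: str
    tags: frozenset = frozenset()
    suppressed: bool = False


@dataclass(frozen=True)
class MaintenanceWindow:
    start: datetime
    end: datetime
    site: Optional[str] = None     # None means "any site" (assumption)
    asset: Optional[str] = None    # None means "any asset" (assumption)
    tags: frozenset = frozenset()  # empty means "no tag filter" (assumption)

    def matches(self, alert: Alert, now: datetime) -> bool:
        if not (self.start <= now < self.end):
            return False  # only active windows are evaluated
        if self.site is not None and self.site != alert.site:
            return False
        if self.asset is not None and self.asset != alert.asset:
            return False
        return self.tags <= alert.tags  # every window tag must be on the alert


def apply_maintenance(alert: Alert, windows: list, now: datetime) -> Alert:
    """Mark the alert as suppressed if any active window matches it."""
    alert.suppressed = any(w.matches(alert, now) for w in windows)
    return alert


def may_touch_incidents(alert: Alert) -> bool:
    """Suppressed alerts can neither create nor update incidents."""
    return not alert.suppressed


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
window = MaintenanceWindow(now - timedelta(hours=1), now + timedelta(hours=1), site="dc-east")
alert = apply_maintenance(Alert(asset="rtr-01", site="dc-east"), [window], now)
```

The alert object is flagged rather than dropped, so it can still be written to history for audit.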

This prevents:

  • False incidents
  • Operator fatigue
  • Incorrect escalation during planned work

Planned:

  • Provider-side maintenance synchronization
  • Impact simulation during maintenance windows

3. Alert Correlation Workflow

3.1 Correlation Purpose

Raw alerts are not directly suitable for human workflows.

Correlation groups alerts into meaningful operational problems and reduces noise.


3.2 Correlation Rules (Implemented Scope)

In the current implementation, correlation is limited and rule-light:

  • Alerts may be grouped by asset, site, or simple time proximity

Notes:

  • No advanced correlation engine is implemented yet
  • Topology-based correlation is not automatic in the current implementation

3.3 Correlation Flow

Implemented behavior:

  • New alert arrives
  • IIMS searches for an existing open incident for the same asset or site
  • If found, the alert is attached to that incident
  • If not found, an operator or simple rule may create a new incident

Rules:

  • Many alerts may belong to one incident
  • One alert belongs to at most one active incident
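
The lookup-and-attach behavior can be sketched as below. The preference for an asset match over a site match is an assumption, time proximity is left out for brevity, and all names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Optional

OPEN_STATES = {"New", "Investigating"}  # states treated as "open" in this sketch


@dataclass
class Incident:
    id: int
    asset: str
    site: str
    status: str = "New"
    alert_ids: set = field(default_factory=set)


def find_open_incident(incidents: list, asset: str, site: str) -> Optional[Incident]:
    """Return an open incident for the same asset, else for the same site."""
    open_incidents = [i for i in incidents if i.status in OPEN_STATES]
    for inc in open_incidents:
        if inc.asset == asset:
            return inc
    for inc in open_incidents:
        if inc.site == site:
            return inc
    return None


def correlate(alert_id: str, asset: str, site: str, incidents: list) -> Optional[Incident]:
    """Attach the alert to a matching open incident, if one exists."""
    # An alert belongs to at most one active incident: reuse an existing link.
    for inc in incidents:
        if inc.status in OPEN_STATES and alert_id in inc.alert_ids:
            return inc
    match = find_open_incident(incidents, asset, site)
    if match is not None:
        match.alert_ids.add(alert_id)
    return match  # None: an operator or simple rule decides whether to open one
```

Returning `None` instead of creating an incident keeps the decision with the operator or a simple rule, as described above.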

Planned:

  • Advanced correlation rules
  • Automatic incident creation and merging
  • Topology-aware correlation

4. Incident Creation and Lifecycle

4.1 Incident Creation

An incident is created when:

  • One or more alerts indicate a real operational problem
  • An operator or simple rule decides to open an incident

Incidents are the central operational objects in IIMS.


4.2 Incident Lifecycle States

Implemented lifecycle:

  • New
  • Investigating / Open
  • Resolved
  • Closed

Limited / optional:

  • Suppressed (maintenance)

Notes:

  • Assigned, In-Progress, and SLA states are not fully implemented yet
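
One way to model the implemented lifecycle is an explicit transition table. The states come from the list above; the allowed transitions, including reopening a resolved incident, are assumptions made for this sketch.

```python
from enum import Enum


class IncidentState(Enum):
    NEW = "New"
    INVESTIGATING = "Investigating"  # shown as "Investigating / Open" above
    RESOLVED = "Resolved"
    CLOSED = "Closed"


# Hypothetical transition table; the actual rules may differ.
ALLOWED = {
    IncidentState.NEW: {IncidentState.INVESTIGATING},
    IncidentState.INVESTIGATING: {IncidentState.RESOLVED},
    IncidentState.RESOLVED: {IncidentState.CLOSED, IncidentState.INVESTIGATING},
    IncidentState.CLOSED: set(),
}


def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the new state, or raise ValueError for a disallowed move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target
```

A table like this leaves room for the planned Assigned and In-Progress states: they become new keys and edges rather than new branching logic.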

Planned:

  • Assigned and In-Progress states
  • Incident merging and duplication handling
  • SLA timers and escalation policies

4.3 Incident Updates

During the lifecycle:

  • Alerts may be added or removed
  • Severity may change
  • Tickets may be created or linked
  • Comments and actions are recorded

All important changes generate activity events.


4.4 Incident Ownership and Responsibility

Implemented tracking:

  • Incident status
  • Related alerts
  • Linked ticket (optional)

Limitations:

  • Ownership and team assignment are basic
  • No automated SLA enforcement

Planned:

  • Full ownership and team assignment model
  • SLA timers and breach detection
  • Escalation workflows

5. Ticket Synchronization Workflow

5.1 Ticket Creation Policy

Tickets are optional.

A ticket may be created when:

  • Operator requests ticket creation manually
  • Incident reaches high severity

Not every incident requires a ticket.


5.2 Ticket Creation Flow

Implemented flow:

  • Operator requests ticket creation from an incident
  • IIMS sends request to the Zammad adapter
  • Zammad creates a ticket
  • Ticket reference is stored in the incident

Maintenance safeguard:

  • Ticket creation is blocked if maintenance is active for the asset or site
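
The flow and its maintenance safeguard can be sketched as below. The Zammad adapter is represented by a plain callable because its real interface is not described here. `create_ticket`, `MaintenanceActiveError`, and the `Incident` fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class MaintenanceActiveError(Exception):
    """Raised when ticket creation is blocked by active maintenance."""


@dataclass
class Incident:
    id: int
    title: str
    asset: str
    site: str
    ticket_ref: Optional[str] = None


def create_ticket(
    incident: Incident,
    maintenance_active: Callable[[str, str], bool],  # (asset, site) -> bool
    adapter_create: Callable[[str], str],            # title -> ticket reference
) -> str:
    """Create a ticket for the incident and store its reference."""
    if maintenance_active(incident.asset, incident.site):
        raise MaintenanceActiveError(
            f"maintenance is active for {incident.asset} or {incident.site}"
        )
    if incident.ticket_ref is not None:
        return incident.ticket_ref  # repeated request: reuse the stored reference
    incident.ticket_ref = adapter_create(incident.title)
    return incident.ticket_ref
```

Checking for a stored reference before calling the adapter also keeps a repeated request from creating a second ticket.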

5.3 Ticket Synchronization

Current behavior:

  • Ticket reference and status are stored in IIMS
  • Limited periodic synchronization may update status

Limitations:

  • No full bi-directional real-time sync
  • Comments and workflow states are not fully mirrored

Planned:

  • Real-time bi-directional synchronization
  • Multiple tickets per incident
  • Multi-provider ticketing support

6. Topology and Geo Impact Workflow

6.1 Purpose of Topology in the Current Implementation

Topology and geo views are used primarily for:

  • Visualization of connectivity
  • Situational awareness
  • Manual impact assessment

6.2 Impact Behavior

Implemented behavior:

  • Asset and link status are computed individually
  • Link status is evaluated using manual bindings and policy rules
  • GeoMap shows clusters, assets, and links
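
The concrete bindings and policy rules are not specified here, so the following is only a rough illustration of how a link's status might be derived from manually bound assets. The status labels and the policy names `worst_of` and `all_down` are invented for the example.

```python
from dataclasses import dataclass

RANK = {"ok": 0, "degraded": 1, "down": 2}  # hypothetical status ordering


@dataclass(frozen=True)
class LinkBinding:
    link_id: str
    endpoints: tuple          # asset ids manually bound to this link
    policy: str = "worst_of"  # hypothetical policies: "worst_of", "all_down"


def link_status(binding: LinkBinding, asset_status: dict) -> str:
    """Evaluate one link from the statuses of its bound assets."""
    statuses = [asset_status[a] for a in binding.endpoints]
    worst = max(statuses, key=RANK.__getitem__)
    if binding.policy == "all_down":
        # The link is down only when every bound asset is down.
        if all(s == "down" for s in statuses):
            return "down"
        return "ok" if worst == "ok" else "degraded"
    return worst  # "worst_of": the link takes the worst bound status
```

Each link is evaluated on its own; nothing propagates to neighboring links or assets.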

Limitations:

  • No automatic impact propagation
  • No root cause inference engine

Planned:

  • Automatic dependency traversal
  • Blast-radius computation
  • Root cause candidate identification
  • Service-level impact modeling

7. Activity and Audit Workflow

7.1 Activity Event Generation

Activity events are generated for:

  • Incident creation and updates
  • Ticket creation and status changes
  • Maintenance creation and updates
  • User comments and actions

7.2 Audit and Timeline

The current implementation provides:

  • Per-incident activity timeline
  • Audit trail for operational actions
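
A per-incident timeline can be pictured as a filtered, time-ordered view over an append-only activity log. The event fields and class names below are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ActivityEvent:
    at: datetime
    actor: str
    object_type: str                   # e.g. "incident", "ticket", "maintenance"
    object_id: str
    action: str                        # e.g. "created", "status_changed", "commented"
    incident_id: Optional[str] = None  # set when the event relates to an incident


class ActivityLog:
    def __init__(self) -> None:
        self._events: list = []        # append-only audit trail

    def record(self, event: ActivityEvent) -> None:
        self._events.append(event)

    def incident_timeline(self, incident_id: str) -> list:
        """Return all events linked to one incident, oldest first."""
        related = [e for e in self._events if e.incident_id == incident_id]
        return sorted(related, key=lambda e: e.at)
```

Because ticket events carry an `incident_id`, they appear in the same timeline as the incident's own changes.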

Planned:

  • Cross-object timelines (site, asset, service)
  • Post-mortem and reporting tools
  • SLA and compliance reporting

8. Dashboard and Cache Update Workflow

8.1 Operational Caches

IIMS maintains read caches for:

  • Alert summaries
  • Incident counts
  • Asset and site health
  • GeoMap and topology views

8.2 Refresh Flow

When operational state changes:

  • Domain services update persistent state
  • Background workers refresh summary caches
  • UI queries only IIMS APIs and caches

Because the UI reads only from IIMS APIs and caches, dashboard queries stay fast and scale independently of the monitoring and ticketing providers.
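
The refresh flow can be sketched with a change queue that domain services write to and a worker step that rebuilds the affected summaries. The queue-based design, `IncidentService`, and `refresh_once` are assumptions for illustration only.

```python
import queue
from collections import Counter


class IncidentService:
    """Domain service: updates persistent state, then signals a change."""

    def __init__(self, changes: queue.Queue) -> None:
        self.incidents: dict = {}  # stands in for the persistent store
        self.changes = changes

    def set_status(self, incident_id: str, status: str) -> None:
        self.incidents[incident_id] = status
        self.changes.put("incidents")  # mark the incident summary as stale


def refresh_once(changes: queue.Queue, service: IncidentService, cache: dict) -> set:
    """One worker pass: drain pending signals and rebuild stale summaries."""
    dirty = set()
    while True:
        try:
            dirty.add(changes.get_nowait())
        except queue.Empty:
            break
    if "incidents" in dirty:
        cache["incident_counts"] = dict(Counter(service.incidents.values()))
    return dirty
```

Draining the queue into a set coalesces bursts of changes, so several updates in a row trigger a single rebuild of the summary.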

Planned:

  • Streaming and push-based UI updates
  • Real-time dashboards with WebSockets

9. Failure Handling and Recovery

9.1 Provider Failures

Behavior:

  • Provider errors are captured and logged
  • Background retries handle recovery
  • IIMS core remains operational
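
A minimal sketch of this behavior, assuming retries with exponential backoff. The retry policy, the logger name, and `call_with_retry` are hypothetical, and the broad `except` is used only to keep the example short.

```python
import logging
import time
from typing import Callable, Optional

logger = logging.getLogger("iims.providers")  # hypothetical logger name


def call_with_retry(
    fn: Callable[[], object],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[object]:
    """Call a provider function, retrying with exponential backoff.

    Returns None after the final failure instead of raising, so a
    provider outage does not take down the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:  # capture and log the provider error
            logger.warning("provider call failed (attempt %d/%d): %s",
                           attempt, attempts, exc)
            if attempt == attempts:
                return None
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

Passing `sleep` as a parameter lets a background worker supply its own scheduler and keeps the function testable without real delays.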

Limitations:

  • No automatic provider failover

Planned:

  • Provider health scoring
  • Automatic failover and degraded-mode routing

9.2 Idempotency and De-duplication

Implemented safeguards:

  • Alert ingestion is idempotent
  • Incident creation prevents duplicates
  • Ticket creation is protected against repeated requests
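
The first two safeguards can be illustrated with key-based de-duplication. The choice of keys (provider, event id, and state for alerts; asset for open incidents) is an assumption, as are the class names.

```python
class AlertIngest:
    """Ingest each (provider, event id, state) combination at most once."""

    def __init__(self) -> None:
        self._seen: set = set()
        self.history: list = []

    def ingest(self, provider: str, event_id: str, state: str) -> bool:
        key = (provider, event_id, state)
        if key in self._seen:
            return False  # duplicate delivery: no new history entry
        self._seen.add(key)
        self.history.append(key)
        return True


class IncidentRegistry:
    """Return the existing open incident for an asset instead of a duplicate."""

    def __init__(self) -> None:
        self._open: dict = {}  # asset -> incident id
        self._next_id = 1

    def get_or_create(self, asset: str) -> int:
        if asset not in self._open:
            self._open[asset] = self._next_id
            self._next_id += 1
        return self._open[asset]
```

With this shape, a provider that redelivers the same event leaves state unchanged, which is what makes retries safe.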

Planned:

  • Global deduplication rules
  • Cross-provider idempotency

10. End-to-End Example Workflow

Typical failure scenario:

  • Router interface goes down in Zabbix
  • Alert is ingested into IIMS
  • Maintenance rules are checked (none active)
  • Operator reviews alert and creates or attaches to an incident
  • Link and asset status are visualized on GeoMap
  • Operator creates a ticket in Zammad
  • Engineer investigates and resolves the issue
  • Alerts clear and incident moves to Resolved
  • Ticket is closed and incident is Closed

11. Planned Enhancements

Recommended focus areas for planned work:

  1. Automated alert correlation and incident creation
  2. Topology-based impact propagation and RCA
  3. SLA timers, escalation, and assignment workflows
  4. Real-time ticket synchronization
  5. Streaming dashboards and push notifications
  6. Service and business impact modeling

12. Summary

In the current implementation, IIMS operational workflows provide:

  • Reliable alert ingestion and suppression
  • Manual and rule-light incident handling
  • Safe ticket creation with maintenance gating
  • Visual topology and geo situational awareness
  • Full audit and activity tracking

This establishes a stable operational foundation.

Future phases extend this foundation with:

  • Automation and intelligence
  • Root cause and impact engines
  • SLA-driven workflows
  • Real-time collaboration and dashboards

These enhancements build on the implemented architecture without breaking existing operations.