IIMS – Operational Workflows & Lifecycle
Information Infrastructure Management System (IIMS)
How alerts become incidents, incidents become actions, and operations stay under control
Purpose
This document describes the main operational workflows and lifecycles in IIMS, aligned with the current iims-api implementation and Flutter UI.
It explains:
- How alerts are ingested and suppressed by maintenance
- How incidents are created and managed
- How tickets are created and linked
- How topology and geo views provide situational awareness
- How caches and dashboards are refreshed
Overview of Operational Flow
At a high level, IIMS manages five operational lifecycles:
- Alert ingestion and normalization
- Maintenance suppression and control
- Incident creation and lifecycle management
- Ticket creation and limited synchronization
- Topology and geo visualization for situational awareness
These lifecycles transform raw monitoring signals into structured human workflows.
1. Alert Ingestion Workflow
1.1 Alert Sources
Alerts originate primarily from:
- Zabbix (the primary monitoring provider in the current implementation)
Notes:
- Prometheus and additional providers are planned but not yet implemented
Alerts represent raw technical signals. They may be:
- Frequent
- Short-lived
- Repetitive and noisy
1.2 Alert Ingestion Flow
Implemented flow:
- Monitoring provider generates an alert
- Provider adapter receives and normalizes the alert
- Alert is stored in:
  - Alert history
  - Alert cache (for dashboards and maps)
- Maintenance suppression rules are applied immediately
At this stage:
- The alert is a technical event
- No automatic incident creation is guaranteed
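The implemented flow above can be sketched roughly as follows. This is a minimal illustration, not the iims-api code: the `Alert` and `AlertStore` classes, the raw payload field names (`id`, `asset`, `site`, `severity`, `timestamp`), and the `under_maintenance` callback are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Alert:
    provider: str
    external_id: str
    asset: str
    site: str
    severity: str
    raised_at: datetime
    suppressed: bool = False


@dataclass
class AlertStore:
    history: list = field(default_factory=list)  # append-only record, kept for audit
    cache: dict = field(default_factory=dict)    # latest alert per key, read by dashboards and maps


def normalize(provider: str, raw: dict) -> Alert:
    """Map a provider payload onto the internal alert shape (field names are assumed)."""
    return Alert(
        provider=provider,
        external_id=str(raw["id"]),
        asset=raw["asset"],
        site=raw["site"],
        severity=str(raw["severity"]).lower(),
        raised_at=datetime.fromtimestamp(int(raw["timestamp"]), tz=timezone.utc),
    )


def ingest(provider: str, raw: dict, store: AlertStore,
           under_maintenance: Callable[[Alert], bool]) -> Alert:
    alert = normalize(provider, raw)
    store.history.append(alert)
    store.cache[(alert.provider, alert.external_id)] = alert
    # Suppression is evaluated right after storage; no incident is created here.
    alert.suppressed = under_maintenance(alert)
    return alert
```

Note that the sketch stops at storing and flagging the alert. Whether an incident follows is decided later, by an operator or a simple rule.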
Planned:
- Multi-provider alert ingestion
- Streaming ingestion pipelines
2. Maintenance Suppression Workflow
Maintenance is a first-class control mechanism in the current implementation.
2.1 Maintenance Matching
For each incoming or updated alert:
- Active maintenance windows are evaluated
- Matching is done by:
  - Site
  - Asset
  - Tags and scope
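As a rough sketch of the matching step, the function below checks one maintenance window against an alert's site, asset, and tags. The `MaintenanceWindow` shape and the rule that a window with no scope matches nothing are assumptions for illustration; the real matching rules may differ.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class MaintenanceWindow:
    starts_at: datetime
    ends_at: datetime
    site: Optional[str] = None      # None means "any site"
    asset: Optional[str] = None     # None means "any asset"
    tags: frozenset = frozenset()   # all listed tags must be present on the alert


def window_matches(window: MaintenanceWindow, *, site: str, asset: str,
                   tags: Iterable[str], now: datetime) -> bool:
    # Only windows that are active right now are considered.
    if not (window.starts_at <= now < window.ends_at):
        return False
    # Assumption: an unscoped window matches nothing, to avoid suppressing everything by accident.
    if window.site is None and window.asset is None and not window.tags:
        return False
    if window.site is not None and window.site != site:
        return False
    if window.asset is not None and window.asset != asset:
        return False
    return window.tags <= frozenset(tags)
```

An alert would be treated as under maintenance if any active window matches it.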
2.2 Suppression Behavior
If maintenance applies:
- Alert is marked as suppressed
- Alert cannot create or update incidents
- Ticket creation is blocked
- Alert history is still preserved for audit
This prevents:
- False incidents
- Operator fatigue
- Incorrect escalation during planned work
Planned:
- Provider-side maintenance synchronization
- Impact simulation during maintenance windows
3. Alert Correlation Workflow
3.1 Correlation Purpose
Raw alerts are not directly suitable for human workflows.
Correlation groups alerts into meaningful operational problems and reduces noise.
3.2 Correlation Rules (Implemented Scope)
In the current implementation, correlation is limited and rule-light:
- Alerts may be grouped by:
  - Asset
  - Site
  - Simple time proximity
Notes:
- No advanced correlation engine is implemented yet
- Topology-based correlation is not automatic in the current implementation
3.3 Correlation Flow
Implemented behavior:
- New alert arrives
- IIMS searches for an existing open incident for the same asset or site
- If found, the alert is attached to that incident
- If not found, an operator or simple rule may create a new incident
Rules:
- Many alerts may belong to one incident
- One alert belongs to at most one active incident
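The lookup-and-attach behavior and the two rules above might look like the sketch below. The `Incident` shape, the set of states treated as open, and the choice to prefer a same-asset match over a same-site match are assumptions; time proximity is left out for brevity.

```python
from dataclasses import dataclass, field
from typing import List, Optional

OPEN_STATES = {"New", "Investigating"}  # assumed set of "open" states


@dataclass
class Incident:
    id: int
    asset: str
    site: str
    status: str = "New"
    alert_ids: set = field(default_factory=set)


def find_open_incident(incidents: List[Incident], asset: str, site: str) -> Optional[Incident]:
    open_incidents = [i for i in incidents if i.status in OPEN_STATES]
    # Assumption: an exact asset match wins over a site-level match.
    for incident in open_incidents:
        if incident.asset == asset:
            return incident
    for incident in open_incidents:
        if incident.site == site:
            return incident
    return None


def attach_alert(incidents: List[Incident], alert_id: str, asset: str, site: str,
                 suppressed: bool = False) -> Optional[Incident]:
    if suppressed:
        return None  # suppressed alerts cannot create or update incidents
    # One alert belongs to at most one active incident.
    for incident in incidents:
        if incident.status in OPEN_STATES and alert_id in incident.alert_ids:
            return incident
    target = find_open_incident(incidents, asset, site)
    if target is not None:
        target.alert_ids.add(alert_id)
    return target  # None: an operator or simple rule decides whether to open an incident
```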
Planned:
- Advanced correlation rules
- Automatic incident creation and merging
- Topology-aware correlation
4. Incident Creation and Lifecycle
4.1 Incident Creation
An incident is created when:
- One or more alerts indicate a real operational problem
- An operator or simple rule decides to open an incident
Incidents are the central operational objects in IIMS.
4.2 Incident Lifecycle States
Implemented lifecycle:
- New
- Investigating / Open
- Resolved
- Closed
Limited / optional:
- Suppressed (maintenance)
Notes:
- Assigned, In-Progress, and SLA states are not fully implemented yet
Planned:
- Assigned and In-Progress states
- Incident merging and duplication handling
- SLA timers and escalation policies
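The implemented states in 4.2 can be modeled as a small state machine. The state names come from this document, but the allowed transitions below (including reopening a Resolved incident and returning from Suppressed) are assumptions for illustration, since the document does not list them.

```python
from enum import Enum


class IncidentState(Enum):
    NEW = "New"
    INVESTIGATING = "Investigating"  # shown as "Investigating / Open" in the lifecycle list
    RESOLVED = "Resolved"
    CLOSED = "Closed"
    SUPPRESSED = "Suppressed"        # limited / optional, used during maintenance


# Assumed transition table; adjust to the real rules.
ALLOWED_TRANSITIONS = {
    IncidentState.NEW: {IncidentState.INVESTIGATING, IncidentState.RESOLVED, IncidentState.SUPPRESSED},
    IncidentState.INVESTIGATING: {IncidentState.RESOLVED, IncidentState.SUPPRESSED},
    IncidentState.SUPPRESSED: {IncidentState.INVESTIGATING},
    IncidentState.RESOLVED: {IncidentState.INVESTIGATING, IncidentState.CLOSED},
    IncidentState.CLOSED: set(),
}


def transition(current: IncidentState, target: IncidentState) -> IncidentState:
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move incident from {current.value} to {target.value}")
    return target
```

Keeping the table in one place makes it straightforward to add the planned Assigned and In-Progress states later without touching callers.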
4.3 Incident Updates
During the lifecycle:
- Alerts may be added or removed
- Severity may change
- Tickets may be created or linked
- Comments and actions are recorded
All important changes generate activity events.
4.4 Incident Ownership and Responsibility
Implemented tracking:
- Incident status
- Related alerts
- Linked ticket (optional)
Limitations:
- Ownership and team assignment are basic
- No automated SLA enforcement
Planned:
- Full ownership and team assignment model
- SLA timers and breach detection
- Escalation workflows
5. Ticket Synchronization Workflow
5.1 Ticket Creation Policy
Tickets are optional.
A ticket may be created when:
- Operator requests ticket creation manually
- Incident reaches high severity
Not every incident requires a ticket.
5.2 Ticket Creation Flow
Implemented flow:
- Operator requests ticket creation from an incident
- IIMS sends request to the Zammad adapter
- Zammad creates a ticket
- Ticket reference is stored in the incident
Maintenance safeguard:
- Ticket creation is blocked if maintenance is active for the asset or site
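A minimal sketch of the ticket creation flow with the maintenance safeguard follows. The adapter is represented by a plain callable, `adapter_create`, because the actual Zammad adapter interface is not described here; `TicketBlocked`, `Incident`, and `maintenance_active` are likewise hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class TicketBlocked(Exception):
    """Raised when maintenance is active for the incident's asset or site."""


@dataclass
class Incident:
    id: int
    title: str
    asset: str
    site: str
    ticket_ref: Optional[str] = None


def request_ticket(incident: Incident,
                   adapter_create: Callable[[str, str], str],
                   maintenance_active: Callable[[str, str], bool]) -> str:
    # A repeated request returns the stored reference instead of creating a second ticket.
    if incident.ticket_ref is not None:
        return incident.ticket_ref
    # Maintenance safeguard: checked before anything is sent to the adapter.
    if maintenance_active(incident.asset, incident.site):
        raise TicketBlocked(f"maintenance active for {incident.asset} / {incident.site}")
    incident.ticket_ref = adapter_create(incident.title, f"IIMS incident {incident.id}")
    return incident.ticket_ref
```

Checking maintenance before calling the adapter means a blocked request leaves no trace in the ticketing system.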
5.3 Ticket Synchronization
Current behavior:
- Ticket reference and status are stored in IIMS
- Limited periodic synchronization may update status
Limitations:
- No full bi-directional real-time sync
- Comments and workflow states are not fully mirrored
Planned:
- Real-time bi-directional synchronization
- Multi-ticket per incident
- Multi-provider ticketing support
6. Topology and Geo Impact Workflow
6.1 Purpose of Topology in the Current Implementation
Topology and geo views are used primarily for:
- Visualization of connectivity
- Situational awareness
- Manual impact assessment
6.2 Impact Behavior
Implemented behavior:
- Asset and link status are computed individually
- Link status is evaluated using manual bindings and policy rules
- GeoMap shows clusters, assets, and links
Limitations:
- No automatic impact propagation
- No root cause inference engine
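To make "manual bindings and policy rules" concrete, here is one possible reading, offered only as an illustration. It assumes a binding maps a link to the assets it depends on and that a policy is either `worst_of` or `best_of`; neither the policy names nor the `Status` levels come from the IIMS implementation.

```python
from enum import IntEnum
from typing import Dict, List, Optional


class Status(IntEnum):
    UP = 0
    DEGRADED = 1
    DOWN = 2


def link_status(link_id: str,
                bindings: Dict[str, List[str]],
                asset_status: Dict[str, Status],
                policy: str = "worst_of") -> Optional[Status]:
    """Derive one link's status from the assets manually bound to it."""
    bound = [asset_status[a] for a in bindings.get(link_id, []) if a in asset_status]
    if not bound:
        return None  # no binding or no known status: nothing to evaluate
    if policy == "worst_of":
        return max(bound)  # any bound asset down marks the link down
    if policy == "best_of":
        return min(bound)  # e.g. redundant paths: one healthy asset keeps the link up
    raise ValueError(f"unknown policy: {policy}")
```

Each link is evaluated on its own; nothing is propagated to neighboring assets or links, which matches the limitations listed above.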
Planned:
- Automatic dependency traversal
- Blast-radius computation
- Root cause candidate identification
- Service-level impact modeling
7. Activity and Audit Workflow
7.1 Activity Event Generation
Activity events are generated for:
- Incident creation and updates
- Ticket creation and status changes
- Maintenance creation and updates
- User comments and actions
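One simple way to represent these events is an append-only log that can be filtered per object. The `ActivityEvent` fields and the `ActivityLog` class below are assumptions for illustration, not the stored schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class ActivityEvent:
    object_type: str   # e.g. "incident", "ticket", "maintenance"
    object_id: str
    action: str        # e.g. "created", "status_changed", "commented"
    actor: str
    at: datetime
    details: dict = field(default_factory=dict)


class ActivityLog:
    def __init__(self) -> None:
        self._events: List[ActivityEvent] = []

    def record(self, object_type: str, object_id: str, action: str,
               actor: str, **details) -> ActivityEvent:
        event = ActivityEvent(object_type, object_id, action, actor,
                              datetime.now(timezone.utc), details)
        self._events.append(event)  # events are only ever appended, never edited
        return event

    def timeline(self, object_type: str, object_id: str) -> List[ActivityEvent]:
        """Events for one object, in the order they were recorded."""
        return [e for e in self._events
                if e.object_type == object_type and e.object_id == object_id]
```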
7.2 Audit and Timeline
The current implementation provides:
- Per-incident activity timeline
- Audit trail for operational actions
Planned:
- Cross-object timelines (site, asset, service)
- Post-mortem and reporting tools
- SLA and compliance reporting
8. Dashboard and Cache Update Workflow
8.1 Operational Caches
IIMS maintains read caches for:
- Alert summaries
- Incident counts
- Asset and site health
- GeoMap and topology views
8.2 Refresh Flow
When operational state changes:
- Domain services update persistent state
- Background workers refresh summary caches
- UI queries only IIMS APIs and caches
Because the UI reads precomputed summaries rather than querying providers directly, views stay fast as data volume grows.
Planned:
- Streaming and push-based UI updates
- Real-time dashboards with WebSockets
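The implemented refresh flow in 8.2 can be pictured as a dirty-flag cache. In the sketch below, a domain service calls `mark_dirty()` after changing persistent state, a background worker calls `refresh_if_dirty()`, and the UI only calls `read()`. The class and the `load_statuses` callback are hypothetical names chosen for the example.

```python
from collections import Counter
from typing import Callable, Dict, Iterable


class IncidentCountCache:
    def __init__(self, load_statuses: Callable[[], Iterable[str]]) -> None:
        self._load_statuses = load_statuses  # reads incident statuses from persistent state
        self._counts: Dict[str, int] = {}
        self._dirty = True

    def mark_dirty(self) -> None:
        """Called by a domain service after it updates persistent state."""
        self._dirty = True

    def refresh_if_dirty(self) -> bool:
        """Called by a background worker; returns True if the summary was recomputed."""
        if not self._dirty:
            return False
        self._counts = dict(Counter(self._load_statuses()))
        self._dirty = False
        return True

    def read(self) -> Dict[str, int]:
        """Called by the UI-facing API; never touches persistent state."""
        return dict(self._counts)
```

Reads may be slightly stale between a state change and the next worker run, which is the usual trade-off of this pattern.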
9. Failure Handling and Recovery
9.1 Provider Failures
Behavior:
- Provider errors are captured and logged
- Background retries handle recovery
- IIMS core remains operational
Limitations:
- No automatic provider failover
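The capture, log, and retry behavior could be sketched as below. The exponential backoff, the default of three attempts, and the `ProviderError` type are assumptions; the document only states that errors are logged and retried in the background.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("iims.providers")  # logger name is illustrative


class ProviderError(Exception):
    """Hypothetical error type raised by a provider adapter."""


def call_with_retries(operation: Callable[[], T], *, attempts: int = 3,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ProviderError as exc:
            logger.warning("provider call failed (attempt %d/%d): %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up; the caller decides how to degrade
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")
```

Only `ProviderError` is retried, so programming errors inside the adapter surface immediately instead of being masked by retries.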
Planned:
- Provider health scoring
- Automatic failover and degraded-mode routing
9.2 Idempotency and De-duplication
Implemented safeguards:
- Alert ingestion is idempotent
- Incident creation prevents duplicates
- Ticket creation is protected against repeated requests
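Idempotent ingestion usually comes down to keying each alert on a stable identifier, so that a repeated delivery updates the existing record instead of adding a new one. The sketch below assumes the key is the pair (provider, provider-side alert ID) and uses an in-memory dict as a stand-in for the alert store.

```python
from typing import Dict, Tuple

AlertKey = Tuple[str, str]  # (provider, provider-side alert ID); assumed key


def upsert_alert(alerts: Dict[AlertKey, dict], provider: str,
                 external_id: str, fields: dict) -> bool:
    """Insert or update an alert; return True only when a new record was created."""
    key = (provider, external_id)
    created = key not in alerts
    # Re-delivering the same payload leaves the store unchanged.
    alerts.setdefault(key, {}).update(fields)
    return created
```

The returned flag lets the caller trigger "new alert" side effects, such as correlation, exactly once per alert.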
Planned:
- Global deduplication rules
- Cross-provider idempotency
10. End-to-End Example Workflow
Typical failure scenario:
- Router interface goes down in Zabbix
- Alert is ingested into IIMS
- Maintenance rules are checked (none active)
- Operator reviews alert and creates or attaches to an incident
- Link and asset status are visualized on GeoMap
- Operator creates a ticket in Zammad
- Engineer investigates and resolves the issue
- Alerts clear and incident moves to Resolved
- Ticket is closed and incident is Closed
11. Planned Enhancements
Recommended focus areas for planned work:
- Automated alert correlation and incident creation
- Topology-based impact propagation and RCA
- SLA timers, escalation, and assignment workflows
- Real-time ticket synchronization
- Streaming dashboards and push notifications
- Service and business impact modeling
12. Summary
In the current implementation, IIMS operational workflows provide:
- Reliable alert ingestion and suppression
- Manual and rule-light incident handling
- Safe ticket creation with maintenance gating
- Visual topology and geo situational awareness
- Full audit and activity tracking
This establishes a stable operational foundation.
Future phases extend this foundation with:
- Automation and intelligence
- Root cause and impact engines
- SLA-driven workflows
- Real-time collaboration and dashboards
These enhancements build on the implemented architecture without breaking existing operations.