How to Set Up Effective Server Monitoring and Alerting with ChatOps | Raymond Trinidad | DevOps Engineer | Cloud Infrastructure

Server monitoring is one of those things that gets ignored until 2 AM when everything is on fire. This post covers the principles and tooling I’ve used to build monitoring stacks that catch issues before users do - and how to wire them into Microsoft Teams so the right person gets notified in the right channel automatically.

Why Most Monitoring Setups Fail

The common mistake is collecting too much noise and too little signal. Teams instrument everything, set broad thresholds, and then start ignoring alerts because they fire constantly. The result is alert fatigue, and the one critical alert that matters gets buried.

Good monitoring answers three questions fast:

Is the system up? (availability)
Is it performing within acceptable bounds? (performance)
Is something trending toward failure? (predictive)

The Core Metrics to Track

For any Linux server or AWS workload, these are non-negotiable:

CPU - Sustained high CPU (above 85% for more than 5 minutes) usually signals a runaway process or undersized infrastructure. Short spikes are normal; sustained load is not.
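To make the spike-versus-sustained distinction concrete, here is a minimal sketch. It assumes CPU readings arrive as one average per 5-minute period; the function name `is_sustained_breach` and its parameters are illustrative and do not come from any monitoring tool.

```python
from typing import Sequence


def is_sustained_breach(samples: Sequence[float], threshold: float, periods: int) -> bool:
    """Return True if the most recent `periods` samples all exceed `threshold`."""
    if periods <= 0 or len(samples) < periods:
        return False  # not enough data to call it sustained
    return all(value > threshold for value in samples[-periods:])


# A single spike followed by a normal reading: not sustained.
print(is_sustained_breach([40.0, 97.0, 42.0], threshold=85, periods=2))  # False
# Two consecutive periods above 85%: sustained.
print(is_sustained_breach([40.0, 92.4, 88.1], threshold=85, periods=2))  # True
```

Requiring every recent period to breach is what keeps a one-off spike from paging anyone.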

Memory - Watch both usage and swap activity. Heavy swap activity on a box with free RAM usually means memory fragmentation or a leak. Note: EC2 does not publish memory metrics by default - you need the CloudWatch Agent installed to get mem_used_percent.

Disk - Monitor both usage percentage and inode consumption. Disks that are 80% full on data but 100% full on inodes will cause failures that look completely unrelated. Like memory, disk_used_percent requires the CloudWatch Agent.
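Space and inode usage can be read side by side with the standard library. This is a rough sketch for Unix hosts only; the helper names `percent_used` and `disk_report` are my own, and free space here includes root-reserved blocks, so the figure may differ slightly from what df shows.

```python
import os


def percent_used(total: int, free: int) -> float:
    """Percentage of a resource consumed, given total and free counts."""
    if total <= 0:
        return 0.0  # some filesystems report zero inodes
    return 100.0 * (total - free) / total


def disk_report(path: str = "/") -> dict:
    """Block and inode usage for the filesystem containing `path` (Unix only)."""
    st = os.statvfs(path)
    return {
        "space_used_percent": percent_used(st.f_blocks, st.f_bfree),
        "inode_used_percent": percent_used(st.f_files, st.f_ffree),
    }


if __name__ == "__main__":
    print(disk_report("/"))
```

If the two percentages diverge sharply, lots of small files are usually the reason.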

Network I/O - Baseline your normal throughput, then alert on sustained deviations. Unexpected spikes can indicate data exfiltration, DDoS, or a misconfigured backup job hammering an S3 endpoint.
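One simple way to express "sustained deviation from baseline" is a standard-deviation band around the baseline mean. The sketch below is only one possible approach, not a recommendation of a specific detector; the function name, the 3-sigma default, and the sample throughput numbers are all assumptions for illustration.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(
    baseline: Sequence[float], recent: Sequence[float], sigmas: float = 3.0
) -> bool:
    """True if every recent sample lies more than `sigmas` standard deviations from the baseline mean."""
    if len(baseline) < 2 or not recent:
        return False  # stdev needs at least two baseline points
    mu = mean(baseline)
    limit = sigmas * stdev(baseline)
    return all(abs(value - mu) > limit for value in recent)


baseline_mbps = [100, 110, 95, 105, 90]  # made-up "normal" throughput samples
print(deviates_from_baseline(baseline_mbps, [180, 175, 190]))  # True: every sample is far outside the band
print(deviates_from_baseline(baseline_mbps, [180, 101, 99]))   # False: a single spike, then back to normal
```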

Load average - Compare against the number of CPU cores. A load average of 4.0 on a 4-core box is very different from 4.0 on a 32-core box.
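The comparison is just a division, but it helps to see the two cases next to each other. A small sketch, with `load_per_core` as an illustrative helper name; the live reading at the end works only on Unix-like systems.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by core count; roughly 1.0 means fully busy."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


print(load_per_core(4.0, 4))   # 1.0   -> the 4-core box is saturated
print(load_per_core(4.0, 32))  # 0.125 -> the 32-core box is mostly idle

# Live reading for the current host, where the platform supports it.
try:
    one_min, five_min, fifteen_min = os.getloadavg()
    print(load_per_core(five_min, os.cpu_count() or 1))
except (AttributeError, OSError):
    pass  # load average is not available on this platform
```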

Structuring Your Alerts

Before writing a single alert rule, define severity levels:

Severity	Response Time	Example
P1 – Critical	Immediate	Service down, data loss risk
P2 – High	Within 1 hour	Performance degraded, approaching limits
P3 – Medium	Next business day	Trending toward a threshold

Only P1 and P2 should page people. P3s should go to a ticket or dashboard - not a Teams notification at midnight.
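Writing the routing rule down as data makes it harder to break by accident. A minimal sketch of the table above; the `Severity` enum, the `ROUTES` mapping, and the "page" and "ticket" destination names are hypothetical and not tied to any paging product.

```python
from enum import Enum


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"


# Hypothetical destinations: "page" interrupts a person, "ticket" waits for business hours.
ROUTES = {
    Severity.P1: "page",
    Severity.P2: "page",
    Severity.P3: "ticket",
}


def route_for(severity: Severity) -> str:
    """Return where an alert of the given severity should be delivered."""
    return ROUTES[severity]


for level in Severity:
    print(level.name, "->", route_for(level))
```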

ChatOps: Routing Alerts into Microsoft Teams

Done right, engineers never need to open the AWS console to know something broke - the alert finds them in the channel they’re already in. The architecture is:

CloudWatch Alarm
      |
      v
   SNS Topic
      |
      v
Lambda Function
      |
      v
Teams Workflow Webhook
      |
      v
  #ops-alerts channel

Step 1 - Create a Teams Workflow Webhook

Microsoft announced the retirement of Office 365 Connectors (the old outlook.office.com/webhook/ format) in 2024. The replacement is Teams Workflows (Power Automate):

In Teams, go to your #ops-alerts channel
Click ... (More options) > Workflows
Search for “Post to a channel when a webhook request is received”
Click Add workflow > name it CloudWatch Alerts > select your Team and Channel
Click Add workflow and copy the webhook URL

The URL will look like:

https://prod-XX.westus.logic.azure.com:443/workflows/xxxxxxxx/triggers/manual/paths/invoke?api-version=2016-06-01&...

Keep this URL secret - anyone with it can post to your channel.

Step 2 - CloudWatch Alarms and SNS Topic (Terraform)

Prerequisite: The disk_used_percent metric (namespace CWAgent) only appears after installing and configuring the CloudWatch Agent on your EC2 instances. CPU metrics (AWS/EC2 namespace) work without it.

# SNS topic  -  receives CloudWatch alarm state changes
resource "aws_sns_topic" "ops_alerts" {
  name = "ops-alerts"
}

# CPU alarm  -  fires when CPU > 85% sustained across 2 consecutive 5-minute periods (10 min total)
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "HighCPU"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2        # 2 periods must breach before alarm fires
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 300      # 5-minute periods; alarm fires after 10 minutes sustained
  statistic           = "Average"
  threshold           = 85
  alarm_description   = "CPU above 85% for 10 consecutive minutes"
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
  ok_actions          = [aws_sns_topic.ops_alerts.arn] # also notify on recovery

  dimensions = {
    InstanceId = "<YOUR_INSTANCE_ID>" # replace with your aws_instance resource or literal ID
  }
}

# Disk alarm  -  requires CloudWatch Agent publishing to the CWAgent namespace
resource "aws_cloudwatch_metric_alarm" "disk_high" {
  alarm_name          = "HighDiskUsage"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 1
  metric_name         = "disk_used_percent"
  namespace           = "CWAgent"
  period              = 300
  statistic           = "Average"
  threshold           = 80
  alarm_description   = "Disk usage above 80%"
  alarm_actions       = [aws_sns_topic.ops_alerts.arn]
  ok_actions          = [aws_sns_topic.ops_alerts.arn]

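  # Dimensions must match the published metric exactly, or the alarm stays in INSUFFICIENT_DATA.
  # This set assumes the agent appends InstanceId (append_dimensions in the agent config).
  # device and fstype vary by instance type and AMI, so check the CWAgent namespace first.
  dimensions = {
    InstanceId = "<YOUR_INSTANCE_ID>"
    path       = "/"
    device     = "xvda1"
    fstype     = "ext4"
  }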
}

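# Subscribe the Lambda to the SNS topic
# (assumes aws_lambda_function.teams_notifier is defined in this configuration;
#  if you create the function with the CLI as in Step 3, use its ARN here instead)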
resource "aws_sns_topic_subscription" "teams" {
  topic_arn = aws_sns_topic.ops_alerts.arn
  protocol  = "lambda"
  endpoint  = aws_lambda_function.teams_notifier.arn
}

# Grant SNS permission to invoke the Lambda  -  required, or invocations will be denied
resource "aws_lambda_permission" "allow_sns" {
  statement_id  = "AllowSNSInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.teams_notifier.function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.ops_alerts.arn
}

Step 3 - Lambda Function (Python)

This Lambda receives the SNS event, builds an Adaptive Card (the current Teams message format), and POSTs it to your Workflow webhook. The card title is color-coded: red for ALARM, green for OK.

import json
import os
import urllib.parse   # provides quote() for URL-encoding the alarm name
import urllib.request

TEAMS_WEBHOOK = os.environ["TEAMS_WEBHOOK_URL"]

# Adaptive Card colour tokens (Teams Workflow format)
COLORS = {
    "ALARM": "Attention",  # red
    "OK":    "Good",       # green
}


def lambda_handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        _post_to_teams(message)


def _post_to_teams(msg: dict):
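    state       = msg.get("NewStateValue", "UNKNOWN")
    alarm_name  = msg.get("AlarmName", "Unknown Alarm")
    reason      = msg.get("NewStateReason", "")
    region      = msg.get("Region", "")  # display name, e.g. "Asia Pacific (Singapore)"
    account     = msg.get("AWSAccountId", "")
    description = msg.get("AlarmDescription", "")

    # The console URL needs the region code, not the display name. Take it from the
    # alarm ARN (arn:aws:cloudwatch:<region>:<account>:alarm:<name>), falling back
    # to the Lambda's own region.
    arn_parts   = msg.get("AlarmArn", "").split(":")
    region_code = arn_parts[3] if len(arn_parts) > 3 else os.environ.get("AWS_REGION", "")

    icon  = "[ALARM]" if state == "ALARM" else "[OK]" if state == "OK" else "[WARN]"
    color = COLORS.get(state, "Default")

    # Build CloudWatch deep-link (urllib.parse.quote handles spaces and special chars)
    encoded_name = urllib.parse.quote(alarm_name)
    cw_url = (
        f"https://console.aws.amazon.com/cloudwatch/home"
        f"?region={region_code}#alarmsV2:alarm/{encoded_name}"
    )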

    # Adaptive Card payload  -  compatible with Teams Workflows (Power Automate)
    card = {
        "type": "message",
        "attachments": [
            {
                "contentType": "application/vnd.microsoft.card.adaptive",
                "content": {
                    "$schema": "http://adaptivecards.io/schemas/adaptive-card.json",
                    "type": "AdaptiveCard",
                    "version": "1.4",
                    "body": [
                        {
                            "type": "TextBlock",
                            "text": f"{icon} {alarm_name}  -  {state}",
                            "size": "Large",
                            "weight": "Bolder",
                            "color": color,
                            "wrap": True,
                        },
                        {
                            "type": "TextBlock",
                            "text": description or "CloudWatch Alarm",
                            "isSubtle": True,
                            "wrap": True,
                        },
                        {
                            "type": "FactSet",
                            "facts": [
                                {"title": "State",   "value": state},
                                {"title": "Reason",  "value": reason},
                                {"title": "Region",  "value": region},
                                {"title": "Account", "value": account},
                            ],
                        },
                    ],
                    "actions": [
                        {
                            "type": "Action.OpenUrl",
                            "title": "View in CloudWatch",
                            "url": cw_url,
                        }
                    ],
                },
            }
        ],
    }

    req = urllib.request.Request(
        TEAMS_WEBHOOK,
        data=json.dumps(card).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
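Before deploying, it is worth checking the parsing step locally. The sketch below builds a hand-written SNS event and unwraps it the same way the handler does. All field values are made up, and the helper names `sample_sns_event` and `extract_alarms` are hypothetical. Note that SNS delivers the alarm as a JSON string inside Records[].Sns.Message, which is why json.loads is needed.

```python
import json


def sample_sns_event(state: str = "ALARM") -> dict:
    """Build a minimal fake SNS event carrying one CloudWatch alarm message."""
    message = {
        "AlarmName": "HighCPU",
        "AlarmDescription": "CPU above 85% for 10 consecutive minutes",
        "AWSAccountId": "123456789012",  # placeholder account ID
        "NewStateValue": state,
        "NewStateReason": "Threshold Crossed: 2 out of 2 datapoints were greater than the threshold (85.0).",
        "Region": "Asia Pacific (Singapore)",
    }
    # The alarm payload is serialized to a string, as SNS does before invoking Lambda.
    return {"Records": [{"Sns": {"Message": json.dumps(message)}}]}


def extract_alarms(event: dict) -> list:
    """Decode every alarm message contained in an SNS event."""
    return [json.loads(record["Sns"]["Message"]) for record in event["Records"]]


alarms = extract_alarms(sample_sns_event())
print(alarms[0]["AlarmName"], alarms[0]["NewStateValue"])  # HighCPU ALARM
```

Passing the same sample event to lambda_handler with TEAMS_WEBHOOK_URL set to a test channel's webhook exercises the whole path end to end.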

Save the code above as handler.py and deploy it. The function needs an IAM execution role with at least the AWSLambdaBasicExecutionRole managed policy attached:

# Create execution role
aws iam create-role \
  --role-name lambda-teams-notifier \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}'

aws iam attach-role-policy \
  --role-name lambda-teams-notifier \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole

# Package and deploy
zip teams-notifier.zip handler.py

aws lambda create-function \
  --function-name ops-teams-notifier \
  --runtime python3.12 \
  --handler handler.lambda_handler \
  --zip-file fileb://teams-notifier.zip \
  --role arn:aws:iam::ACCOUNT_ID:role/lambda-teams-notifier \
  --environment "Variables={TEAMS_WEBHOOK_URL=https://prod-XX.westus.logic.azure.com/workflows/...}"

What the Teams message looks like

When a CPU alarm fires, your #ops-alerts channel receives an Adaptive Card:

[ALARM] HighCPU  -  ALARM
   CPU above 85% for 10 consecutive minutes

   State   | ALARM
   Reason  | Threshold Crossed: 2 out of 2 datapoints were
           | greater than the threshold (85.0). The most
           | recent datapoints: [92.4, 88.1].
   Region  | Asia Pacific (Singapore)
   Account | 942521690250

   [ View in CloudWatch ]

When the issue clears, a green recovery card posts automatically because ok_actions is wired to the same SNS topic.

Automation First

Any alert that fires more than twice a week for the same root cause should be automated away. If disk fills up because log rotation isn’t running, fix log rotation - don’t keep getting paged for it. The Teams integration is not a replacement for fixing root causes; it’s a faster path to knowing something needs fixing.
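The "more than twice a week" rule is easy to check against exported alarm history. A rough sketch, assuming the history is a list of (alarm name, timestamp) pairs and using the alarm name as a stand-in for root cause; the function name `noisy_alarms` and the sample data are invented for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def noisy_alarms(
    firings: Iterable[Tuple[str, datetime]],
    now: datetime,
    window_days: int = 7,
    max_firings: int = 2,
) -> List[str]:
    """Names of alarms that fired more than `max_firings` times in the last `window_days` days."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter(name for name, fired_at in firings if cutoff <= fired_at <= now)
    return sorted(name for name, count in counts.items() if count > max_firings)


history = [
    ("HighDiskUsage", datetime(2024, 5, 20, 3, 0)),  # outside the 7-day window
    ("HighDiskUsage", datetime(2024, 6, 4, 2, 15)),
    ("HighDiskUsage", datetime(2024, 6, 6, 2, 10)),
    ("HighCPU",       datetime(2024, 6, 8, 14, 0)),
    ("HighDiskUsage", datetime(2024, 6, 9, 2, 20)),
]
print(noisy_alarms(history, now=datetime(2024, 6, 10, 12, 0)))  # ['HighDiskUsage']
```

Anything this returns is a candidate for a root-cause fix or an auto-remediation script rather than another page.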

Runbooks are a good start, but auto-remediation scripts are better for predictable, low-risk issues. Save the Teams ping for problems you haven’t seen before.

Trust Is the Real Metric

Monitoring is only useful if people trust it. A #ops-alerts channel full of noise will get muted. Start with fewer, high-signal alarms. Tune thresholds aggressively in the first month. Review your alarm history monthly and delete anything that hasn’t produced an actionable incident - the goal is zero false positives, not full coverage.

When an alarm fires and someone actually fixes something because of it, that’s when the system is working.