Q6: Explain Prometheus data model and metric types.
Answer:
Understanding Prometheus's data model is fundamental to effectively using the monitoring system. The data model defines how metrics are structured, stored, and identified, forming the foundation for all monitoring and alerting capabilities.
Prometheus Data Model - The Foundation
Core Concept: Prometheus stores all data as time series - sequences of timestamped values belonging to the same metric and the same set of labeled dimensions.
Time Series Identity: Each time series is uniquely identified by:
- Metric Name: What you're measuring (e.g., http_requests_total, cpu_usage_percent)
- Labels: Key-value pairs that add dimensions (e.g., method="GET", status="200")
- Timestamp: When the measurement was taken (Unix timestamp)
- Value: The actual measurement (64-bit floating-point number)
Data Storage Format
Standard Format:
metric_name{label1="value1", label2="value2"} value timestamp
Real Example:
http_requests_total{method="GET", status="200", endpoint="/api/users"} 1027.0 1641024000
Data Model Visualization:
Time Series Database Structure:
┌─────────────────────────────────────────────────────────────────┐
│ PROMETHEUS TSDB │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Metric: http_requests_total │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Labels: {method="GET", status="200", endpoint="/api"} │ │
│ │ ┌─────────────┬─────────────┬─────────────┬───────────┐ │ │
│ │ │ Timestamp │ Value │ Timestamp │ Value │ │ │
│ │ │ 1641024000 │ 1027.0 │ 1641024015 │ 1031.0 │ │ │
│ │ │ 1641024030 │ 1035.0 │ 1641024045 │ 1040.0 │ │ │
│ │ └─────────────┴─────────────┴─────────────┴───────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Labels: {method="POST", status="200", endpoint="/api"} │ │
│ │ ┌─────────────┬─────────────┬─────────────┬───────────┐ │ │
│ │ │ Timestamp │ Value │ Timestamp │ Value │ │ │
│ │ │ 1641024000 │ 45.0 │ 1641024015 │ 47.0 │ │ │
│ │ └─────────────┴─────────────┴─────────────┴───────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Label Theory and Best Practices
Label Cardinality: The number of unique combinations of label values determines a metric's cardinality - the number of distinct time series it produces. High cardinality increases memory and storage usage and can degrade query performance.
Good Label Example:
http_requests_total{method="GET", status="200", service="api"}
# Low cardinality: method (5 values), status (10 values), service (20 values)
# Total combinations: 5 × 10 × 20 = 1,000 time series
Bad Label Example (High Cardinality):
http_requests_total{method="GET", status="200", user_id="12345"}
# High cardinality: user_id could have millions of values
# Avoid using user IDs, request IDs, or other unbounded values as labels
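Cardinality problems can be spotted with plain PromQL against an existing server; these queries use only standard selectors plus the metric name from the examples above:
promql
# Ten metric names with the most time series (a quick cardinality overview)
topk(10, count by (__name__)({__name__=~".+"}))
# Number of time series behind one specific metric
count(http_requests_total)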
Four Metric Types in Detail
1. Counter - The Accumulator
Theory: Counters represent cumulative metrics that only increase (or reset to zero on restart). They're perfect for counting events like requests, errors, or bytes transferred.
Key Characteristics:
- Monotonically increasing
- Resets only when process restarts
- Rate of change is more meaningful than absolute value
- Always starts from 0
Mathematical Representation:
Counter(t) ≥ Counter(t-1) for all t (except resets)
Practical Examples:
prometheus
# HTTP requests counter
http_requests_total{method="GET", status="200"} 1027
# Bytes sent counter
bytes_sent_total{interface="eth0"} 482847392
# Error counter
errors_total{type="timeout", service="payment"} 15
Common PromQL Operations with Counters:
promql
# Rate of HTTP requests per second over 5 minutes
rate(http_requests_total[5m])
# Total increase in requests over 1 hour
increase(http_requests_total[1h])
# Requests per minute
rate(http_requests_total[5m]) * 60
Counter Implementation Example (Go):
go
package main
import (
"net/http"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
httpRequests = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "status", "endpoint"},
)
)
func init() {
prometheus.MustRegister(httpRequests)
}
func handleRequest(w http.ResponseWriter, r *http.Request) {
// Increment counter for each request
httpRequests.WithLabelValues(r.Method, "200", r.URL.Path).Inc()
w.Write([]byte("Hello World"))
}
Counter Reset Detection:
promql
# Detect counter resets (useful for calculating rates across restarts)
resets(http_requests_total[1h])
2. Gauge - The Thermometer
Theory: Gauges represent point-in-time values that can go up and down. They're like a thermometer or speedometer - showing current state rather than cumulative values.
Key Characteristics:
- Values can increase or decrease
- Represents current state/level
- Instant value is meaningful
- No reset behavior
Mathematical Representation:
Gauge(t) can be any value relative to Gauge(t-1)
Practical Examples:
prometheus
# CPU usage percentage
cpu_usage_percent{cpu="0", mode="user"} 45.2
# Memory usage in bytes
memory_usage_bytes{type="heap"} 1073741824
# Current temperature
temperature_celsius{sensor="cpu", location="server_room"} 68.5
# Active connections
active_connections{service="database", pool="primary"} 42
Common PromQL Operations with Gauges:
promql
# Current CPU usage
cpu_usage_percent
# Average memory usage across instances
avg(memory_usage_bytes) by (instance)
# Maximum temperature in last hour
max_over_time(temperature_celsius[1h])
# Memory usage trend (derivative)
deriv(memory_usage_bytes[10m])
Gauge Implementation Example (Python):
python
from prometheus_client import Gauge, start_http_server
import psutil
import time
# Create gauge metrics
cpu_usage = Gauge('cpu_usage_percent', 'CPU usage percentage', ['cpu'])
memory_usage = Gauge('memory_usage_bytes', 'Memory usage in bytes')
disk_usage = Gauge('disk_usage_percent', 'Disk usage percentage', ['device'])
def collect_system_metrics():
while True:
# Update CPU usage for each core
cpu_percentages = psutil.cpu_percent(percpu=True)
for i, usage in enumerate(cpu_percentages):
cpu_usage.labels(cpu=str(i)).set(usage)
# Update memory usage
memory = psutil.virtual_memory()
memory_usage.set(memory.used)
# Update disk usage
for partition in psutil.disk_partitions():
try:
usage = psutil.disk_usage(partition.mountpoint)
disk_usage.labels(device=partition.device).set(
(usage.used / usage.total) * 100
)
except PermissionError:
continue
time.sleep(15)
if __name__ == '__main__':
start_http_server(8000)
collect_system_metrics()
3. Histogram - The Distribution Analyzer
Theory: Histograms sample observations and count them in configurable buckets. They're designed to measure distributions of values like request latencies, response sizes, or any measurement where you need to understand the distribution pattern.
Key Characteristics:
- Samples observations into buckets
- Provides count, sum, and bucket counts
- Buckets are cumulative (le = "less than or equal")
- Enables percentile calculations
Histogram Components:
prometheus
# Bucket counters (cumulative)
http_request_duration_seconds_bucket{le="0.1"} 24054
http_request_duration_seconds_bucket{le="0.25"} 26335
http_request_duration_seconds_bucket{le="0.5"} 27534
http_request_duration_seconds_bucket{le="1.0"} 28126
http_request_duration_seconds_bucket{le="2.5"} 28312
http_request_duration_seconds_bucket{le="5.0"} 28358
http_request_duration_seconds_bucket{le="10.0"} 28367
http_request_duration_seconds_bucket{le="+Inf"} 28367
# Total count of all observations
http_request_duration_seconds_count 28367
# Sum of all observed values
http_request_duration_seconds_sum 1896.04
Histogram Bucket Visualization:
Distribution of Response Times:
┌─────────────────────────────────────────────────────────────────┐
│ Bucket Analysis for http_request_duration_seconds │
├─────────────────────────────────────────────────────────────────┤
│ │
│ le="0.1" │████████████████████████████████████████│ 24,054 │
│ le="0.25" │██│ 2,281 (26,335 - 24,054) │
│ le="0.5" │█│ 1,199 (27,534 - 26,335) │
│ le="1.0" │█│ 592 (28,126 - 27,534) │
│ le="2.5" ││ 186 (28,312 - 28,126) │
│ le="5.0" ││ 46 (28,358 - 28,312) │
│ le="10.0" ││ 9 (28,367 - 28,358) │
│ le="+Inf" ││ 0 (28,367 - 28,367) │
│ │
│ Total Observations: 28,367 │
│ Sum of All Values: 1,896.04 seconds │
│ Average Response Time: 0.067 seconds │
└─────────────────────────────────────────────────────────────────┘
Percentile Calculations with Histograms:
promql
# 50th percentile (median) response time
histogram_quantile(0.5, rate(http_request_duration_seconds_bucket[5m]))
# 95th percentile response time
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 99th percentile response time
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
Histogram Implementation Example (Go):
go
package main
import (
"math/rand"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
var (
requestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request latency distributions",
Buckets: []float64{0.1, 0.25, 0.5, 1, 2.5, 5, 10}, // Custom buckets
},
[]string{"method", "endpoint"},
)
)
func init() {
prometheus.MustRegister(requestDuration)
}
func simulateRequest(method, endpoint string) {
start := time.Now()
// Simulate request processing
processingTime := rand.Float64() * 2 // 0-2 seconds
time.Sleep(time.Duration(processingTime * float64(time.Second)))
// Record the duration
duration := time.Since(start).Seconds()
requestDuration.WithLabelValues(method, endpoint).Observe(duration)
}
4. Summary - The Quantile Calculator
Theory: Summaries are similar to histograms but calculate quantiles directly on the client side. They provide count, sum, and configurable quantiles, offering an alternative approach to understanding distributions.
Key Characteristics:
- Calculates quantiles client-side
- Provides count, sum, and quantiles
- Lower server-side computational cost
- Less flexible for aggregation across instances
Summary Components:
prometheus
# Configured quantiles
http_request_duration_seconds{quantile="0.5"} 0.052
http_request_duration_seconds{quantile="0.9"} 0.564
http_request_duration_seconds{quantile="0.99"} 1.245
# Total count of observations
http_request_duration_seconds_count 28367
# Sum of all observed values
http_request_duration_seconds_sum 1896.04
Summary vs Histogram Comparison:
Aspect | Histogram | Summary
---|---|---
Quantile Calculation | Server-side with PromQL | Client-side
Accuracy | Approximated from buckets | Exact for configured quantiles
Aggregation | Can aggregate across instances | Cannot aggregate quantiles
Storage | Stores bucket counts | Stores quantile values
Flexibility | Any quantile can be calculated | Only pre-configured quantiles
Performance | Higher server load for queries | Higher client memory usage
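The aggregation row above is usually the deciding factor. A short PromQL sketch of the difference, reusing the http_request_duration_seconds metric from this answer:
promql
# Histogram: a fleet-wide 95th percentile is valid because per-instance
# bucket counts can be summed before the quantile is estimated
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
# Summary: quantiles computed on each instance cannot be recombined;
# averaging them does not yield the true fleet-wide percentile
avg(http_request_duration_seconds{quantile="0.95"})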
Summary Implementation Example (Python):
python
from prometheus_client import Summary, start_http_server
import time
import random
# Create a summary metric (the Python client exposes only count and sum; it does not support configurable quantiles)
request_latency = Summary(
'request_processing_seconds',
'Time spent processing requests',
['method', 'endpoint']
)
# Decorator for automatic timing
@request_latency.labels(method='GET', endpoint='/api/users').time()
def process_get_users():
# Simulate processing time
time.sleep(random.uniform(0.1, 0.5))
return "users data"
# Manual timing
def process_post_users():
with request_latency.labels(method='POST', endpoint='/api/users').time():
# Simulate processing
time.sleep(random.uniform(0.2, 1.0))
return "user created"
if __name__ == '__main__':
start_http_server(8000)
while True:
process_get_users()
process_post_users()
time.sleep(1)
Choosing the Right Metric Type
Decision Matrix:
Use Case | Metric Type | Reasoning
---|---|---
Count events (requests, errors) | Counter | Events only increase over time
Current state (CPU, memory, queue size) | Gauge | Values can go up and down
Measure distributions (latency, size) | Histogram | Need percentiles and buckets
Simple quantiles (response time) | Summary | Pre-defined quantiles sufficient
Best Practices:
- Use descriptive names: http_requests_total instead of requests
- Include units in names: duration_seconds, size_bytes
- Use consistent labels: the same label names across related metrics
- Avoid high cardinality: don't use user IDs or request IDs as labels
- Document metrics: include help text describing what each metric measures (see the exposition example below)
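As an illustration of these practices, this is roughly how a well-documented metric appears in the exposition format, with HELP and TYPE lines and a unit-suffixed name (sample values are illustrative):
prometheus
# HELP http_request_duration_seconds HTTP request latency in seconds.
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET", endpoint="/api/users", le="0.1"} 24054
http_request_duration_seconds_bucket{method="GET", endpoint="/api/users", le="+Inf"} 28367
http_request_duration_seconds_count{method="GET", endpoint="/api/users"} 28367
http_request_duration_seconds_sum{method="GET", endpoint="/api/users"} 1896.04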
Q7: How do you configure Prometheus to scrape targets?
Answer:
Prometheus configuration is the cornerstone of effective monitoring. The configuration file defines what to monitor, how often to collect data, and how to process that data. Understanding configuration is essential for building robust monitoring systems.
Configuration File Structure
Prometheus uses a YAML configuration file (typically prometheus.yml) with several main sections:
Complete Configuration Template:
yaml
# Global configuration
global:
scrape_interval: 15s # How often to scrape targets by default
evaluation_interval: 15s # How often to evaluate rules
external_labels: # Labels attached to any time series
monitor: 'production'
datacenter: 'us-east-1'
# Rule files
rule_files:
- "rules/*.yml"
- "alerts/*.yml"
# Alerting configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
timeout: 10s
api_version: v2
# Scrape configurations
scrape_configs:
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100', 'server2:9100']
scrape_interval: 5s
metrics_path: /metrics
scheme: https
# Remote write configuration (optional)
remote_write:
- url: "https://remote-storage-endpoint/api/v1/write"
# Remote read configuration (optional)
remote_read:
- url: "https://remote-storage-endpoint/api/v1/read"
Global Configuration Section
Purpose: Sets default values that apply to all jobs unless overridden.
Key Parameters:
yaml
global:
scrape_interval: 15s # Default scrape frequency
scrape_timeout: 10s # Default scrape timeout
evaluation_interval: 15s # How often to evaluate recording and alerting rules
external_labels: # Labels added to all time series
region: 'us-west-2'
environment: 'production'
Theory Behind Intervals:
- Scrape Interval: Balance between data resolution and system load
- Evaluation Interval: How quickly alerts are triggered
- Timeout: Must not exceed the scrape interval
Scrape Configuration Deep Dive
Basic Static Configuration:
yaml
scrape_configs:
- job_name: 'web-servers' # Logical grouping name
static_configs:
- targets:
- 'web1.example.com:8080'
- 'web2.example.com:8080'
- 'web3.example.com:8080'
scrape_interval: 30s # Override global interval
scrape_timeout: 10s # Maximum time to wait for response
metrics_path: '/metrics' # Path to metrics endpoint
scheme: 'http' # http or https
params: # URL parameters
format: ['prometheus']
basic_auth: # HTTP basic authentication
username: 'prometheus'
password: 'secret'
    bearer_token: 'abc123'            # Bearer token authentication (alternative to basic_auth; configure only one)
tls_config: # TLS configuration
ca_file: '/path/to/ca.pem'
cert_file: '/path/to/cert.pem'
key_file: '/path/to/key.pem'
insecure_skip_verify: false
Service Discovery Mechanisms
Theory: Modern infrastructure is dynamic. Containers start and stop, auto-scaling changes instance counts, and services move between hosts. Static configuration doesn't scale. Service discovery automatically finds targets to monitor.
Kubernetes Service Discovery:
yaml
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod # Discover pods
namespaces:
names:
- default
- monitoring
relabel_configs:
# Only scrape pods with annotation prometheus.io/scrape=true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
# Use custom metrics path if specified
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
# Use custom port if specified
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
# Add pod name as label
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
Service Discovery Flow Diagram:
┌─────────────────────────────────────────────────────────────────┐
│ SERVICE DISCOVERY FLOW │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ API Query ┌─────────────┐ │
│ │ Prometheus │ ────────────── │ Kubernetes │ │
│ │ Server │ │ API Server │ │
│ └─────────────┘ └─────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Target │ │ Pod │ │
│ │ Discovery │ ◀──────────── │ Metadata │ │
│ └─────────────┘ Pod Info └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ Relabeling │ │
│ │ Rules │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ HTTP GET ┌─────────────┐ │
│ │ Scrape │ ────────────── │ Target │ │
│ │ Targets │ /metrics │ Endpoints │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
AWS EC2 Service Discovery:
yaml
scrape_configs:
- job_name: 'ec2-instances'
ec2_sd_configs:
- region: us-west-2
port: 9100
filters:
- name: 'tag:Environment'
values: ['production', 'staging']
- name: 'instance-state-name'
values: ['running']
relabel_configs:
# Use instance ID as instance label
- source_labels: [__meta_ec2_instance_id]
target_label: instance
# Add environment from EC2 tag
- source_labels: [__meta_ec2_tag_Environment]
target_label: environment
# Add instance type
- source_labels: [__meta_ec2_instance_type]
target_label: instance_type
Consul Service Discovery:
yaml
scrape_configs:
- job_name: 'consul-services'
consul_sd_configs:
- server: 'consul.service.consul:8500'
services: ['web', 'api', 'database']
relabel_configs:
# Use service name as job label
- source_labels: [__meta_consul_service]
target_label: job
# Add datacenter information
      - source_labels: [__meta_consul_dc]
target_label: datacenter
# Filter healthy services only
- source_labels: [__meta_consul_health]
regex: passing
action: keep
Relabeling - The Power Tool
Theory: Relabeling is Prometheus's Swiss Army knife. It allows you to modify, add, or remove labels before storing metrics. This is crucial for:
- Filtering unwanted targets
- Adding contextual information
- Standardizing label names
- Reducing cardinality
Relabeling Actions:
- replace: Replace label value with new value
- keep: Keep only targets where label matches regex
- drop: Drop targets where label matches regex
- labelmap: Map label names using regex
- labeldrop: Drop labels matching regex
- labelkeep: Keep only labels matching regex
Advanced Relabeling Examples:
yaml
relabel_configs:
# Keep only targets with specific annotation
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: 'true'
# Extract service name from pod name
- source_labels: [__meta_kubernetes_pod_name]
regex: '([^-]+)-.*'
replacement: '${1}'
target_label: service
# Add custom labels based on namespace
- source_labels: [__meta_kubernetes_namespace]
regex: 'production'
replacement: 'prod'
target_label: environment
  # Map Kubernetes pod labels to Prometheus labels
  # (must run before the __meta_* labels are dropped below)
  - regex: '__meta_kubernetes_pod_label_(.+)'
    action: labelmap
    replacement: 'k8s_${1}'
  # Drop remaining discovery metadata labels
  # (labels beginning with __ are discarded automatically after target relabeling,
  # so an explicit labeldrop mainly matters in metric_relabel_configs)
  - regex: '__meta_.*'
    action: labeldrop
Authentication and Security
Basic Authentication:
yaml
scrape_configs:
- job_name: 'secured-app'
static_configs:
- targets: ['app.example.com:8080']
basic_auth:
username: 'monitoring'
password_file: '/etc/prometheus/passwords/app_password'
Bearer Token Authentication:
yaml
scrape_configs:
- job_name: 'api-service'
static_configs:
- targets: ['api.example.com:8080']
bearer_token_file: '/etc/prometheus/tokens/api_token'
TLS Configuration:
yaml
scrape_configs:
- job_name: 'https-service'
static_configs:
- targets: ['secure.example.com:8443']
scheme: https
tls_config:
ca_file: '/etc/prometheus/ca.pem'
cert_file: '/etc/prometheus/client.pem'
key_file: '/etc/prometheus/client-key.pem'
server_name: 'secure.example.com'
insecure_skip_verify: false
Configuration Validation and Best Practices
Validation Commands:
bash
# Check configuration syntax
promtool check config prometheus.yml
# Check rules syntax
promtool check rules /etc/prometheus/rules/*.yml
# Reload a running server after configuration changes (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
Best Practices:
- Start Simple: Begin with static configs, add service discovery later
- Use Meaningful Job Names: Make them descriptive and consistent
- Group Related Targets: Use job names to group similar services
- Implement Proper Labeling: Use consistent label naming conventions
- Monitor Configuration Changes: Version control your config files
- Test Before Deploy: Always validate configuration changes
- Use Appropriate Intervals: Balance data resolution with system load
- Secure Credentials: Use file-based authentication, never hardcode passwords
Example Production Configuration:
yaml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-west-2'
rule_files:
- '/etc/prometheus/rules/recording.yml'
- '/etc/prometheus/rules/alerting.yml'
alerting:
alertmanagers:
- kubernetes_sd_configs:
- role: pod
namespaces:
names: ['monitoring']
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: alertmanager
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# Node exporters via service discovery
- job_name: 'node-exporter'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: node-exporter
- source_labels: [__address__]
regex: '([^:]+):.*'
replacement: '${1}:9100'
target_label: __address__
# Application pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
Q8: What are Prometheus exporters and how do they work?
Answer:
Prometheus exporters are specialized programs that bridge the gap between Prometheus and systems that don't natively expose metrics in Prometheus format. They're essential components of the Prometheus ecosystem, enabling monitoring of virtually any system or service.
Exporter Architecture and Theory
Core Concept: An exporter acts as a translator, collecting metrics from a target system (database, operating system, application) and exposing them in Prometheus format via an HTTP endpoint.
Exporter Workflow:
┌─────────────────────────────────────────────────────────────────┐
│ EXPORTER ARCHITECTURE │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ Native API ┌─────────────┐ │
│ │ Target │ ────────────── │ Exporter │ │
│ │ System │ (MySQL API, │ │ │
│ │ (Database, │ /proc files, │ - Collects │ │
│ │ App, etc.) │ REST API) │ - Converts │ │
│ └─────────────┘ │ - Exposes │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ │
│ │ HTTP Server │ │
│ │ /metrics │ │
│ │ endpoint │ │
│ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ HTTP GET ┌─────────────┐ │
│ │ Prometheus │ ────────────── │ Prometheus │ │
│ │ Server │ /metrics │ Metrics │ │
│ │ │ │ Format │ │
│ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
How Exporters Work - Step by Step
Step 1: Data Collection
The exporter connects to the target system using its native API, protocols, or interfaces:
- Databases: SQL queries, admin commands
- Operating Systems: /proc filesystem, system calls
- APIs: REST endpoints, GraphQL queries
- Files: Log parsing, configuration files
Step 2: Data Transformation
Raw data is converted into Prometheus metric format:
- Apply naming conventions
- Add appropriate labels
- Choose correct metric types
- Handle missing or invalid data
Step 3: HTTP Exposition
Metrics are exposed via HTTP endpoint (typically /metrics):
- Standard Prometheus text format
- Real-time generation (not cached)
- Consistent response format
Step 4: Prometheus Scraping
Prometheus periodically scrapes the exporter endpoint according to configuration.
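A minimal sketch of these four steps in Python, assuming the official prometheus_client library (the port 9200 and the metric name are arbitrary choices for illustration):
python
from prometheus_client import Gauge, start_http_server
import time

# Step 2 (schema): define the Prometheus representation of the data
load_average = Gauge('node_load_average', 'System load average', ['window'])

def collect():
    # Step 1: read raw data from the target system (here the /proc filesystem)
    with open('/proc/loadavg') as f:
        load1, load5, load15 = f.read().split()[:3]
    # Step 2: convert raw strings into labeled, typed metric samples
    load_average.labels(window='1m').set(float(load1))
    load_average.labels(window='5m').set(float(load5))
    load_average.labels(window='15m').set(float(load15))

if __name__ == '__main__':
    # Step 3: expose /metrics over HTTP; step 4 is Prometheus scraping this port
    start_http_server(9200)
    while True:
        collect()
        time.sleep(15)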
Official Prometheus Exporters
1. Node Exporter - System Metrics Champion
Purpose: Exposes hardware and OS metrics for Unix systems (Linux, FreeBSD, macOS).
Installation and Setup:
bash
# Download and install
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64
# Run with default configuration
./node_exporter
# Run with specific collectors enabled
./node_exporter --collector.systemd --collector.processes --no-collector.hwmon
# Run as systemd service
sudo useradd --no-create-home --shell /bin/false node_exporter
sudo mv node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter
Systemd Service Configuration:
ini
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--collector.systemd \
--collector.processes
[Install]
WantedBy=multi-user.target
Key Metrics Provided:
prometheus
# CPU Metrics
node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
node_cpu_seconds_total{cpu="0",mode="user"} 9876.54
node_cpu_seconds_total{cpu="0",mode="system"} 5432.10
# Memory Metrics
node_memory_MemTotal_bytes 8589934592
node_memory_MemFree_bytes 2147483648
node_memory_MemAvailable_bytes 4294967296
node_memory_Buffers_bytes 536870912
node_memory_Cached_bytes 1073741824
# Disk Metrics
node_filesystem_size_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 107374182400
node_filesystem_free_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 85899345920
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 80530636800
# Network Metrics
node_network_receive_bytes_total{device="eth0"} 12345678901
node_network_transmit_bytes_total{device="eth0"} 9876543210
node_network_receive_packets_total{device="eth0"} 87654321
node_network_transmit_packets_total{device="eth0"} 65432109
# Load Average
node_load1 0.85
node_load5 0.92
node_load15 1.05
# Boot Time
node_boot_time_seconds 1641024000
# Time
node_time_seconds 1641110400
Useful PromQL Queries for Node Exporter:
promql
# CPU usage percentage (average across all cores)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage percentage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
# Disk usage percentage
(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100
# Network throughput (bytes per second)
irate(node_network_receive_bytes_total[5m]) + irate(node_network_transmit_bytes_total[5m])
# Disk I/O operations per second
irate(node_disk_reads_completed_total[5m]) + irate(node_disk_writes_completed_total[5m])
2. MySQL Exporter - Database Monitoring
Purpose: Exposes MySQL server metrics for performance monitoring and capacity planning.
Installation and Configuration:
bash
# Download MySQL exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.15.0/mysqld_exporter-0.15.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.15.0.linux-amd64.tar.gz
# Create MySQL user for monitoring
mysql -u root -p << EOF
CREATE USER 'exporter'@'localhost' IDENTIFIED BY 'password';
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';
FLUSH PRIVILEGES;
EOF
Configuration File (.my.cnf):
ini
[client]
user=exporter
password=secretpassword
host=localhost
port=3306
Running MySQL Exporter:
bash
# Run using the .my.cnf credentials file
./mysqld_exporter --config.my-cnf=.my.cnf

# Enable additional collectors as needed
./mysqld_exporter --config.my-cnf=.my.cnf --collect.info_schema.tables

# Older releases also accepted credentials via an environment variable:
# export DATA_SOURCE_NAME="exporter:password@(localhost:3306)/"
Key MySQL Metrics:
prometheus
# Connection Metrics
mysql_global_status_connections 123456
mysql_global_status_threads_connected 45
mysql_global_status_threads_running 8
mysql_global_variables_max_connections 151
# Query Performance
mysql_global_status_queries 9876543
mysql_global_status_slow_queries 123
mysql_global_status_com_select 654321
mysql_global_status_com_insert 98765
mysql_global_status_com_update 54321
mysql_global_status_com_delete 12345
# InnoDB Metrics
mysql_global_status_innodb_buffer_pool_read_requests 8765432
mysql_global_status_innodb_buffer_pool_reads 87654
mysql_global_status_innodb_buffer_pool_pages_data 45678
mysql_global_status_innodb_buffer_pool_pages_free 12345
# Replication Metrics (for slaves)
mysql_slave_lag_seconds 0.5
mysql_slave_sql_running 1
mysql_slave_io_running 1
MySQL Monitoring Queries:
promql
# Connection usage percentage
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
# Query rate per second
rate(mysql_global_status_queries[5m])
# Slow query rate
rate(mysql_global_status_slow_queries[5m])
# Buffer pool hit ratio (should be > 99%)
(mysql_global_status_innodb_buffer_pool_read_requests - mysql_global_status_innodb_buffer_pool_reads) / mysql_global_status_innodb_buffer_pool_read_requests * 100
# Queries per connection (note: not a response-time measurement)
rate(mysql_global_status_queries[5m]) / rate(mysql_global_status_connections[5m])
3. Blackbox Exporter - External Monitoring
Purpose: Probes endpoints over HTTP, HTTPS, DNS, TCP, and ICMP to monitor external services and network connectivity.
Configuration (blackbox.yml):
yaml
modules:
http_2xx:
prober: http
timeout: 5s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
method: GET
headers:
User-Agent: "Prometheus Blackbox Exporter"
follow_redirects: true
preferred_ip_protocol: "ip4"
http_post_2xx:
prober: http
timeout: 5s
http:
valid_status_codes: [200]
method: POST
headers:
Content-Type: application/json
body: '{"test": "data"}'
tcp_connect:
prober: tcp
timeout: 5s
tcp:
preferred_ip_protocol: "ip4"
icmp:
prober: icmp
timeout: 5s
icmp:
preferred_ip_protocol: "ip4"
dns:
prober: dns
timeout: 5s
dns:
query_name: "example.com"
query_type: "A"
valid_rcodes:
- NOERROR
Prometheus Configuration for Blackbox:
yaml
scrape_configs:
- job_name: 'blackbox'
metrics_path: /probe
params:
module: [http_2xx] # Default module
static_configs:
- targets:
- http://prometheus.io
- https://prometheus.io
- http://example.com:8080
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
- job_name: 'blackbox-tcp'
metrics_path: /probe
params:
module: [tcp_connect]
static_configs:
- targets:
- example.com:22
- example.com:80
- example.com:443
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
Building Custom Exporters
Theory Behind Custom Exporters
Sometimes you need to monitor systems that don't have existing exporters. Building custom exporters allows you to:
- Monitor proprietary applications
- Expose business metrics
- Integrate with internal APIs
- Create specialized monitoring solutions
Custom Exporter Best Practices
- Use Official Client Libraries: Leverage Prometheus client libraries for your language
- Follow Naming Conventions: Use descriptive metric names with units
- Handle Errors Gracefully: Don't crash on collection failures
- Keep It Simple: Focus on essential metrics
- Document Your Metrics: Include help text and examples
- Test Thoroughly: Verify metric accuracy and performance
Simple Custom Exporter (Python)
python
#!/usr/bin/env python3
"""
Custom application exporter example
Monitors a web application's performance and business metrics
"""
import time
import requests
import json
from prometheus_client import start_http_server, Gauge, Counter, Histogram, Info
from prometheus_client.core import CollectorRegistry, REGISTRY
import threading
import logging
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ApplicationExporter:
"""Custom exporter for application metrics"""
def __init__(self, app_url, app_token):
self.app_url = app_url
self.app_token = app_token
# Define metrics
self.app_info = Info('app_info', 'Application information')
self.app_up = Gauge('app_up', 'Application availability', ['service'])
self.app_response_time = Histogram(
'app_response_time_seconds',
'Application response time',
['endpoint', 'method'],
buckets=[0.1, 0.25, 0.5, 1, 2.5, 5, 10]
)
self.app_requests_total = Counter(
'app_requests_total',
'Total application requests',
['endpoint', 'method', 'status']
)
self.app_active_users = Gauge(
'app_active_users',
'Number of active users',
['type']
)
self.app_database_connections = Gauge(
'app_database_connections',
'Active database connections',
['pool']
)
self.app_queue_size = Gauge(
'app_queue_size',
'Queue size',
['queue_name']
)
        self.app_revenue_total = Counter(
            'app_revenue_total_cents',
            'Total revenue in cents',
            ['product_type']
        )
        # Last cumulative values reported by the API, so counters can be
        # advanced by the delta on each collection cycle
        self._last_revenue = {}
        self._last_requests = {}
# Application info (static)
self.app_info.info({
'version': self.get_app_version(),
'environment': 'production',
'build_date': '2024-01-15'
})
def get_app_version(self):
"""Get application version from API"""
try:
response = requests.get(f"{self.app_url}/api/version",
headers={'Authorization': f'Bearer {self.app_token}'},
timeout=5)
return response.json().get('version', 'unknown')
except Exception as e:
logger.error(f"Failed to get app version: {e}")
return 'unknown'
def collect_health_metrics(self):
"""Collect application health metrics"""
try:
# Check main application
start_time = time.time()
response = requests.get(f"{self.app_url}/health",
headers={'Authorization': f'Bearer {self.app_token}'},
timeout=10)
response_time = time.time() - start_time
if response.status_code == 200:
self.app_up.labels(service='main').set(1)
self.app_response_time.labels(endpoint='/health', method='GET').observe(response_time)
# Parse health data
health_data = response.json()
# Database connections
for pool_name, count in health_data.get('database_pools', {}).items():
self.app_database_connections.labels(pool=pool_name).set(count)
# Queue sizes
for queue_name, size in health_data.get('queues', {}).items():
self.app_queue_size.labels(queue_name=queue_name).set(size)
else:
self.app_up.labels(service='main').set(0)
except Exception as e:
logger.error(f"Failed to collect health metrics: {e}")
self.app_up.labels(service='main').set(0)
def collect_business_metrics(self):
"""Collect business metrics"""
try:
response = requests.get(f"{self.app_url}/api/metrics",
headers={'Authorization': f'Bearer {self.app_token}'},
timeout=10)
if response.status_code == 200:
metrics_data = response.json()
# Active users
for user_type, count in metrics_data.get('active_users', {}).items():
self.app_active_users.labels(type=user_type).set(count)
                # Revenue (business metric) - the API reports cumulative totals,
                # so advance the counter by the positive delta since the last cycle
                # (Prometheus counters must never be set to an absolute value)
                for product_type, revenue in metrics_data.get('revenue_today', {}).items():
                    # Convert to cents to avoid floating point issues
                    revenue_cents = int(revenue * 100)
                    last = self._last_revenue.get(product_type, 0)
                    if revenue_cents > last:
                        self.app_revenue_total.labels(product_type=product_type).inc(revenue_cents - last)
                    self._last_revenue[product_type] = revenue_cents
                # Request statistics - also cumulative, handled with the same delta logic
                for endpoint_data in metrics_data.get('endpoints', []):
                    endpoint = endpoint_data['path']
                    method = endpoint_data['method']
                    for status_code, count in endpoint_data.get('status_codes', {}).items():
                        key = (endpoint, method, status_code)
                        last = self._last_requests.get(key, 0)
                        if count > last:
                            self.app_requests_total.labels(
                                endpoint=endpoint,
                                method=method,
                                status=status_code
                            ).inc(count - last)
                        self._last_requests[key] = count
except Exception as e:
logger.error(f"Failed to collect business metrics: {e}")
def collect_all_metrics(self):
"""Collect all metrics"""
logger.info("Collecting application metrics...")
self.collect_health_metrics()
self.collect_business_metrics()
logger.info("Metrics collection completed")
def run_exporter():
"""Run the custom exporter"""
# Configuration
APP_URL = "https://api.myapp.com"
APP_TOKEN = "your-api-token-here"
EXPORTER_PORT = 8000
COLLECTION_INTERVAL = 30 # seconds
# Create exporter instance
exporter = ApplicationExporter(APP_URL, APP_TOKEN)
# Start HTTP server
start_http_server(EXPORTER_PORT)
logger.info(f"Custom exporter started on port {EXPORTER_PORT}")
# Collection loop
while True:
try:
exporter.collect_all_metrics()
time.sleep(COLLECTION_INTERVAL)
except KeyboardInterrupt:
logger.info("Exporter stopped by user")
break
except Exception as e:
logger.error(f"Error in collection loop: {e}")
time.sleep(10) # Wait before retrying
if __name__ == '__main__':
run_exporter()
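To have Prometheus scrape this exporter, a scrape job pointing at the port configured above (8000) might look like the following; the host name is a placeholder:
yaml
scrape_configs:
  - job_name: 'application-exporter'
    scrape_interval: 30s               # matches COLLECTION_INTERVAL above
    static_configs:
      - targets: ['app-exporter.example.com:8000']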
Advanced Custom Exporter (Go)
go
package main
import (
"context"
"database/sql"
"encoding/json"
"fmt"
"log"
"net/http"
"time"
_ "github.com/lib/pq"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
)
type DatabaseExporter struct {
db *sql.DB
// Metrics
dbUp prometheus.Gauge
dbConnections *prometheus.GaugeVec
queryDuration *prometheus.HistogramVec
slowQueries prometheus.Counter
tableRows *prometheus.GaugeVec
tableSize *prometheus.GaugeVec
}
func NewDatabaseExporter(dsn string) (*DatabaseExporter, error) {
db, err := sql.Open("postgres", dsn)
if err != nil {
return nil, err
}
return &DatabaseExporter{
db: db,
dbUp: prometheus.NewGauge(prometheus.GaugeOpts{
Name: "database_up",
Help: "Database availability",
}),
dbConnections: prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "database_connections",
Help: "Database connections by state",
},
[]string{"state"},
),
queryDuration: prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "database_query_duration_seconds",
Help: "Database query duration",
Buckets: []float64{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5},
},
[]string{"query_type"},
),
slowQueries: prometheus.NewCounter(prometheus.CounterOpts{
Name: "database_slow_queries_total",
Help: "Total number of slow queries",
}),
tableRows: prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "database_table_rows",
Help: "Number of rows in each table",
},
[]string{"table", "schema"},
),
tableSize: prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "database_table_size_bytes",
Help: "Size of each table in bytes",
},
[]string{"table", "schema"},
),
}, nil
}
func (e *DatabaseExporter) collectConnectionMetrics() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
start := time.Now()
// Test database connectivity
if err := e.db.PingContext(ctx); err != nil {
e.dbUp.Set(0)
log.Printf("Database ping failed: %v", err)
return
}
e.dbUp.Set(1)
// Collect connection statistics
query := `
SELECT state, count(*)
FROM pg_stat_activity
WHERE datname = current_database()
GROUP BY state
`
rows, err := e.db.QueryContext(ctx, query)
if err != nil {
log.Printf("Failed to query connection stats: %v", err)
return
}
defer rows.Close()
for rows.Next() {
var state string
var count float64
if err := rows.Scan(&state, &count); err != nil {
log.Printf("Failed to scan connection stats: %v", err)
continue
}
e.dbConnections.WithLabelValues(state).Set(count)
}
e.queryDuration.WithLabelValues("connection_stats").Observe(time.Since(start).Seconds())
}
func (e *DatabaseExporter) collectTableMetrics() {
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
start := time.Now()
query := `
SELECT
schemaname,
tablename,
			n_live_tup as total_rows,
pg_total_relation_size(schemaname||'.'||tablename) as table_size
FROM pg_stat_user_tables
`
rows, err := e.db.QueryContext(ctx, query)
if err != nil {
log.Printf("Failed to query table stats: %v", err)
return
}
defer rows.Close()
for rows.Next() {
var schema, table string
var rowCount, tableSize float64
if err := rows.Scan(&schema, &table, &rowCount, &tableSize); err != nil {
log.Printf("Failed to scan table stats: %v", err)
continue
}
e.tableRows.WithLabelValues(table, schema).Set(rowCount)
e.tableSize.WithLabelValues(table, schema).Set(tableSize)
}
e.queryDuration.WithLabelValues("table_stats").Observe(time.Since(start).Seconds())
}
func (e *DatabaseExporter) collectSlowQueries() {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
start := time.Now()
query := `
SELECT count(*)
FROM pg_stat_statements
WHERE mean_time > 1000
`
var slowCount float64
if err := e.db.QueryRowContext(ctx, query).Scan(&slowCount); err != nil {
log.Printf("Failed to query slow queries: %v", err)
return
}
// This is a simplified example - in practice you'd track the delta
e.slowQueries.Add(slowCount)
e.queryDuration.WithLabelValues("slow_queries").Observe(time.Since(start).Seconds())
}
func (e *DatabaseExporter) Collect(ch chan<- prometheus.Metric) {
e.collectConnectionMetrics()
e.collectTableMetrics()
e.collectSlowQueries()
e.dbUp.Collect(ch)
e.dbConnections.Collect(ch)
e.queryDuration.Collect(ch)
e.slowQueries.Collect(ch)
e.tableRows.Collect(ch)
e.tableSize.Collect(ch)
}
func (e *DatabaseExporter) Describe(ch chan<- *prometheus.Desc) {
e.dbUp.Describe(ch)
e.dbConnections.Describe(ch)
e.queryDuration.Describe(ch)
e.slowQueries.Describe(ch)
e.tableRows.Describe(ch)
e.tableSize.Describe(ch)
}
func main() {
dsn := "postgresql://user:password@localhost/dbname?sslmode=disable"
exporter, err := NewDatabaseExporter(dsn)
if err != nil {
log.Fatalf("Failed to create exporter: %v", err)
}
prometheus.MustRegister(exporter)
http.Handle("/metrics", promhttp.Handler())
log.Println("Database exporter started on :8080")
log.Fatal(http.ListenAndServe(":8080", nil))
}
Exporter Deployment Patterns
Sidecar Pattern (Kubernetes):
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-with-exporter
spec:
replicas: 3
selector:
matchLabels:
app: myapp
template:
metadata:
labels:
app: myapp
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: application
image: myapp:latest
ports:
- containerPort: 3000
- name: exporter
image: custom-exporter:latest
ports:
- containerPort: 8080
env:
- name: APP_URL
value: "http://localhost:3000"
- name: SCRAPE_INTERVAL
value: "30s"
Standalone Exporter:
yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: database-exporter
spec:
replicas: 1
selector:
matchLabels:
app: database-exporter
template:
metadata:
labels:
app: database-exporter
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9104"
spec:
containers:
- name: exporter
image: prom/mysqld-exporter:latest
ports:
- containerPort: 9104
env:
- name: DATA_SOURCE_NAME
valueFrom:
secretKeyRef:
name: mysql-secret
key: dsn