Kibana Dashboards for ft_transcendence: Visualizing Project Data

March 31, 2024 — #kibana #dashboards #visualization #monitoring #ft_transcendence

Why Dashboards Matter

Dashboards are crucial for transforming raw log data into actionable insights. For our ft_transcendence project, well-designed Kibana dashboards help us:

Monitor System Health: Quickly spot issues with any service
Track User Experience: Understand how users interact with the application
Identify Security Concerns: Detect unusual patterns or potential threats
Optimize Performance: Pinpoint bottlenecks and inefficiencies

Dashboard Structure for ft_transcendence

We've organized our dashboards into three main categories:

1. Operations Dashboards

System Overview: High-level health of all services
Service-Specific: Detailed metrics for each service
Error Tracking: Aggregated view of all errors

2. User Experience Dashboards

API Performance: Response times and error rates
User Sessions: Login patterns and session durations
Feature Usage: Which parts of the app users engage with most

3. Security Dashboards

Authentication Events: Logins, logouts, and failures
Access Patterns: Unusual activity detection
Error Clustering: Finding patterns in security-related errors

Creating Our Main Dashboard

Here's how we built our System Overview dashboard:

Step 1: Index Pattern Configuration

First, we created index patterns to access our data:

In Kibana, go to Stack Management > Index Patterns (renamed Data Views in Kibana 8.x)
Create patterns for each service:
- django-*
- nginx-*
- nextjs-*
- postgresql-*
- redis-*
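
The same patterns can also be created through Kibana's saved objects API instead of by hand. A minimal sketch, assuming Kibana 7.x at a hypothetical http://localhost:5601 with no authentication, and assuming every index uses @timestamp as its time field. The helper names and the "<service>-pattern" IDs are illustrative, not part of our setup:

```python
import json
import urllib.request

SERVICES = ["django", "nginx", "nextjs", "postgresql", "redis"]


def index_pattern_request(service, kibana_url="http://localhost:5601"):
    """Build (but do not send) the request that creates the `<service>-*` pattern."""
    body = {"attributes": {"title": f"{service}-*", "timeFieldName": "@timestamp"}}
    return urllib.request.Request(
        url=f"{kibana_url}/api/saved_objects/index-pattern/{service}-pattern",
        data=json.dumps(body).encode("utf-8"),
        # Kibana rejects write requests that lack the kbn-xsrf header.
        headers={"Content-Type": "application/json", "kbn-xsrf": "true"},
        method="POST",
    )


def create_all(kibana_url="http://localhost:5601"):
    """Send one request per service; raises urllib.error.HTTPError on failure."""
    for service in SERVICES:
        with urllib.request.urlopen(index_pattern_request(service, kibana_url)) as resp:
            print(service, resp.status)
```

Calling create_all() once after the stack is up should be enough; nothing is sent until that call.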

Step 2: Building Visualizations

We created these key visualizations:

Service Health Status

{
  "aggs": {
    "services": {
      "terms": {
        "field": "service.keyword",
        "size": 10
      },
      "aggs": {
        "errors": {
          "filter": {
            "bool": {
              "should": [
                { "term": { "log_level.keyword": "ERROR" } },
                { "term": { "log_level.keyword": "CRITICAL" } },
                { "range": { "http_status": { "gte": 500 } } }
              ]
            }
          }
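        },
        "health": {
          "bucket_script": {
            "buckets_path": {
              "errors": "errors._count"
            },
            "script": "params.errors > 0 ? 1 : 0"
          }
        }
      }
    }
  }
}

Note that bucket_script can only return a number, so health is 1 when a service has logged errors in the selected time range and 0 otherwise; map the value to a label or color in the visualization.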
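
The per-service errors counts in the response can also be turned into status labels on the client side. A minimal sketch, assuming the standard aggregation response shape for the query above; the function name and the sample data are made up for illustration:

```python
def summarize_health(response):
    """Map each service bucket to 'error' or 'healthy' from its errors filter count."""
    buckets = response["aggregations"]["services"]["buckets"]
    return {
        b["key"]: "error" if b["errors"]["doc_count"] > 0 else "healthy"
        for b in buckets
    }


# Hand-written sample in the shape Elasticsearch returns for a terms + filter aggregation.
sample = {
    "aggregations": {"services": {"buckets": [
        {"key": "django", "doc_count": 120, "errors": {"doc_count": 3}},
        {"key": "nginx", "doc_count": 900, "errors": {"doc_count": 0}},
    ]}}
}
print(summarize_health(sample))  # {'django': 'error', 'nginx': 'healthy'}
```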

Request Volume Over Time

{
  "aggs": {
    "requests_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "30s"
      },
      "aggs": {
        "by_service": {
          "terms": {
            "field": "service.keyword",
            "size": 5
          }
        }
      }
    }
  }
}
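
Each histogram bucket holds a document count per 30-second interval, so dividing by the interval length gives a per-second rate. A minimal sketch of that conversion, assuming the standard response shape for the query above (the function name and sample values are illustrative):

```python
def requests_per_second(response, interval_seconds=30):
    """Flatten the nested buckets into (bucket_start_ms, service, rate) rows."""
    rows = []
    for bucket in response["aggregations"]["requests_over_time"]["buckets"]:
        for svc in bucket["by_service"]["buckets"]:
            rows.append((bucket["key"], svc["key"], svc["doc_count"] / interval_seconds))
    return rows


# Hand-written sample: one 30s bucket with two services.
sample = {"aggregations": {"requests_over_time": {"buckets": [
    {"key": 1711843200000, "doc_count": 75, "by_service": {"buckets": [
        {"key": "nginx", "doc_count": 60},
        {"key": "django", "doc_count": 15},
    ]}},
]}}}
print(requests_per_second(sample))
```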

Error Distribution Pie Chart

{
  "aggs": {
    "error_types": {
      "terms": {
        "field": "error_type.keyword",
        "size": 10
      }
    }
  },
  "query": {
    "bool": {
      "must": [{ "term": { "log_level.keyword": "ERROR" } }]
    }
  }
}

Step 3: Assembling the Dashboard

We arranged visualizations in a logical flow:

Top Section: Status indicators showing at-a-glance health
Middle Section: Time-series graphs showing activity trends
Bottom Section: Detailed tables for drilling down into specific issues

Service-Specific Dashboards

Django Dashboard Example

For our Django application, we focused on:

View Performance: Response times by view function
Database Queries: Query execution times and counts
Error Traceback: Full error details with context

Key visualization for slow database queries:

{
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "1m"
      },
      "aggs": {
        "slow_queries": {
          "filter": {
            "bool": {
              "must": [
                { "exists": { "field": "query_time" } },
                { "range": { "query_time": { "gte": 100 } } }
              ]
            }
          }
        }
      }
    }
  }
}
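
The filter above only matches documents that carry a numeric query_time field. How that field reaches the logs is not covered in this post; one possible approach is a JSON log formatter on the Django side. A minimal sketch, assuming query_time is recorded in milliseconds and using hypothetical names (JsonFormatter, is_slow):

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, ready for ingestion."""

    converter = time.gmtime  # emit UTC so the trailing "Z" below is accurate

    def format(self, record):
        doc = {
            "@timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "service": "django",
            "log_level": record.levelname,
            "message": record.getMessage(),
        }
        # Fields passed via `extra=` become attributes on the record.
        if hasattr(record, "query_time"):
            doc["query_time"] = record.query_time
        return json.dumps(doc)


def is_slow(doc, threshold_ms=100):
    """Mirror the range filter: query_time exists and is >= threshold."""
    return doc.get("query_time", -1) >= threshold_ms
```

A call such as logger.warning("slow query", extra={"query_time": 250}) would then produce a document that the slow_queries filter counts.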

Nginx Dashboard Example

For Nginx monitoring, we focus on:

Request Rate: Requests per second
Status Codes: Distribution of HTTP response codes
Response Times: Latency percentiles
Top URLs: Most frequently accessed endpoints
Client IPs: Source of requests by geography
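
As an illustration of the latency panel, percentiles can be requested with a percentiles aggregation. A minimal sketch, assuming a numeric request_time field; that field name is hypothetical and depends on how the Nginx logs are parsed:

```python
import json


def latency_percentiles_query(field="request_time", percents=(50, 95, 99)):
    """Build a search body that returns only latency percentiles, no hits."""
    return {
        "size": 0,  # aggregations only; skip the raw documents
        "aggs": {
            "latency": {
                "percentiles": {"field": field, "percents": list(percents)}
            }
        },
    }


print(json.dumps(latency_percentiles_query(), indent=2))
```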

User Activity Dashboard

This dashboard helps us understand user behavior:

Session Flow Visualization: Shows how users navigate through the app
User Retention: How often users return to the app
Feature Adoption: Which features are used most

Example visualization for session duration, approximated as the minutes between each user's first and last event in the selected time range:

{
  "aggs": {
    "users": {
      "terms": {
        "field": "user_id.keyword",
        "size": 100
      },
      "aggs": {
        "session_start": {
          "min": {
            "field": "@timestamp"
          }
        },
        "session_end": {
          "max": {
            "field": "@timestamp"
          }
        },
        "session_duration": {
          "bucket_script": {
            "buckets_path": {
              "start": "session_start",
              "end": "session_end"
            },
            "script": "(params.end - params.start) / 60000"
          }
        }
      }
    }
  }
}
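
The aggregation yields a single span per user, so a user who visits twice in the time range appears as one long session. If separate visits matter, one option is to split events on an inactivity gap on the client side. A minimal sketch, assuming timestamps in epoch milliseconds and a 30-minute timeout (an arbitrary choice, not a value from our setup):

```python
def session_durations_minutes(timestamps_ms, gap_ms=30 * 60 * 1000):
    """Split event times into sessions and return each session's length in minutes."""
    durations = []
    start = prev = None
    for ts in sorted(timestamps_ms):
        if start is None:
            start = prev = ts
        elif ts - prev > gap_ms:
            # Inactivity gap exceeded: close the current session, open a new one.
            durations.append((prev - start) / 60000)
            start = prev = ts
        else:
            prev = ts
    if start is not None:
        durations.append((prev - start) / 60000)
    return durations


# Three events within 20 minutes, then two more events 100 minutes later.
events = [0, 600_000, 1_200_000, 7_200_000, 7_500_000]
print(session_durations_minutes(events))  # [20.0, 5.0]
```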

Security Monitoring Dashboard

Our security dashboard focuses on:

Authentication Events: Success/failure tracking
Geographic Anomalies: Unusual access locations
Rate Limiting Violations: Potential brute force attacks
Permission Denials: Unauthorized access attempts

Example visualization for failed login attempts:

{
  "aggs": {
    "over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      },
      "aggs": {
        "by_ip": {
          "terms": {
            "field": "client_ip.keyword",
            "size": 10
          }
        }
      }
    }
  },
  "query": {
    "bool": {
      "must": [{ "term": { "event.keyword": "login_failed" } }]
    }
  }
}
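
The response from this query can be scanned for IPs that fail repeatedly within one 5-minute interval, which is the pattern a brute force attempt tends to produce. A minimal sketch, assuming the standard response shape; the threshold of 5 and the sample addresses are made up:

```python
def suspicious_ips(response, threshold=5):
    """Return {ip: max failures in any one interval} for IPs at or above threshold."""
    worst = {}
    for window in response["aggregations"]["over_time"]["buckets"]:
        for ip in window["by_ip"]["buckets"]:
            worst[ip["key"]] = max(worst.get(ip["key"], 0), ip["doc_count"])
    return {ip: n for ip, n in worst.items() if n >= threshold}


# Hand-written sample: two 5-minute windows.
sample = {"aggregations": {"over_time": {"buckets": [
    {"key": 0, "by_ip": {"buckets": [
        {"key": "10.0.0.5", "doc_count": 7}, {"key": "10.0.0.9", "doc_count": 1}]}},
    {"key": 300000, "by_ip": {"buckets": [
        {"key": "10.0.0.5", "doc_count": 2}, {"key": "10.0.0.9", "doc_count": 3}]}},
]}}}
print(suspicious_ips(sample))  # {'10.0.0.5': 7}
```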

Setting Up Alerts

We've configured alerts to notify us of critical issues. The definition below uses the Elasticsearch Watcher format:

Error Rate Alert

{
  "trigger": {
    "schedule": {
      "interval": "5m"
    }
  },
  "input": {
    "search": {
      "request": {
        "indices": ["django-*", "nginx-*", "nextjs-*", "postgresql-*", "redis-*"],
        "body": {
          "query": {
            "bool": {
              "must": [
                { "range": { "@timestamp": { "gte": "now-5m" } } },
                { "terms": { "log_level.keyword": ["ERROR", "CRITICAL"] } }
              ]
            }
          }
        }
      }
    }
  },
  "condition": {
    "compare": {
      "ctx.payload.hits.total": {
        "gt": 10
      }
    }
  },
  "actions": {
    "notify_team": {
      "webhook": {
        "method": "post",
        "url": "https://hooks.slack.com/services/YOUR_WEBHOOK",
        "headers": { "Content-Type": "application/json" },
        "body": "{\"text\": \"Error rate exceeded threshold: {{ctx.payload.hits.total}} errors in the last 5 minutes\"}"
      }
    }
  }
}
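
Slack incoming webhooks expect an HTTP POST whose JSON body contains a text field. The threshold check and the message can be mirrored locally to verify both before deploying the alert. A minimal sketch with hypothetical helper names:

```python
import json


def should_alert(error_count, threshold=10):
    """Same comparison as the alert condition: strictly greater than the threshold."""
    return error_count > threshold


def slack_payload(error_count, window="5 minutes"):
    """Build the JSON body a Slack incoming webhook accepts."""
    text = f"Error rate exceeded threshold: {error_count} errors in the last {window}"
    return json.dumps({"text": text})


if should_alert(12):
    print(slack_payload(12))
```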

Performance Considerations

Since we're running in a resource-constrained environment, we've tuned our dashboards to keep query load low:

Time Range Control: Default to shorter time ranges (last 15 minutes)
Reduced Refresh Rate: Default to manual refresh or 1-minute intervals
Aggregation Over Raw Data: Use aggregations instead of raw document tables
Limited Cardinality: Avoid visualizations with high cardinality fields
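
The first and third points translate directly into the search body: a short time filter and size 0 so that no raw documents are returned. A minimal sketch of a wrapper that applies both to any aggregation; the function name is illustrative:

```python
def bounded_search(aggs, window="now-15m"):
    """Wrap aggregations in a time-bounded, hits-free search body."""
    return {
        "size": 0,  # return aggregation results only
        "query": {
            "bool": {
                # filter context skips scoring and is cacheable
                "filter": [{"range": {"@timestamp": {"gte": window}}}]
            }
        },
        "aggs": aggs,
    }


body = bounded_search({
    "by_service": {"terms": {"field": "service.keyword", "size": 5}}
})
print(body["query"])
```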

Sharing Dashboards

To make dashboards accessible to team members:

Export/Import: Share dashboard JSON definitions
Saved Object Management: Export and import through Kibana UI
Version Control: Store dashboard definitions in Git
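
Kibana 7.x exports saved objects as NDJSON, one object per line, and the order of lines and keys is not guaranteed to be stable between exports. Normalizing the file before committing keeps Git diffs readable. A minimal sketch, assuming the file ends with an export summary line that has no type field, and assuming import does not depend on line order (worth verifying against your Kibana version):

```python
import json


def normalize_export(ndjson_text):
    """Sort saved objects by (type, id) and sort keys so Git diffs stay small."""
    objects = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    # The trailing export summary line has no "type"; keep it last.
    objects.sort(key=lambda o: ("type" not in o, o.get("type", ""), o.get("id", "")))
    return "\n".join(json.dumps(o, sort_keys=True) for o in objects) + "\n"


raw = (
    '{"type":"visualization","id":"b"}\n'
    '{"id":"a","type":"dashboard"}\n'
    '{"exportedCount":2}\n'
)
print(normalize_export(raw), end="")
```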

Dashboard Templates

We've created templates for quick setup of new dashboards:

{
  "attributes": {
    "title": "Service Dashboard Template",
    "hits": 0,
    "description": "Template for monitoring any service",
    "panelsJSON": "[{\"id\":\"request-rate\",\"type\":\"visualization\",\"panelIndex\":1,\"gridData\":{\"x\":0,\"y\":0,\"w\":24,\"h\":8},\"version\":\"7.10.0\"},{\"id\":\"error-rate\",\"type\":\"visualization\",\"panelIndex\":2,\"gridData\":{\"x\":24,\"y\":0,\"w\":24,\"h\":8},\"version\":\"7.10.0\"},{\"id\":\"response-time\",\"type\":\"visualization\",\"panelIndex\":3,\"gridData\":{\"x\":0,\"y\":8,\"w\":48,\"h\":8},\"version\":\"7.10.0\"}]"
  }
}
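
Writing the escaped panelsJSON string by hand is error-prone, so it can be generated instead. A minimal sketch that lays panels out on the 48-column grid used in the template; unlike the template, every panel gets the same width, and the function name and defaults are illustrative:

```python
import json


def build_panels_json(panel_ids, version="7.10.0", columns=2, grid_width=48, height=8):
    """Lay panels out left to right in rows and return the panelsJSON string."""
    width = grid_width // columns
    panels = []
    for i, panel_id in enumerate(panel_ids):
        row, col = divmod(i, columns)
        panels.append({
            "id": panel_id,
            "type": "visualization",
            "panelIndex": i + 1,
            "gridData": {"x": col * width, "y": row * height, "w": width, "h": height},
            "version": version,
        })
    return json.dumps(panels)


template = {
    "attributes": {
        "title": "Service Dashboard Template",
        "hits": 0,
        "description": "Template for monitoring any service",
        "panelsJSON": build_panels_json(["request-rate", "error-rate", "response-time"]),
    }
}
print(json.dumps(template, indent=2))
```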

Conclusion

Well-designed Kibana dashboards transform our raw log data into valuable insights for ft_transcendence. By taking the time to create effective visualizations and dashboards, we gain:

Better understanding of system behavior
Faster troubleshooting of issues
Deeper insights into user activity
Improved security monitoring

In our resource-constrained environment, these dashboards help us maximize the value of our observability data without requiring excessive resources.