RiotKit’s Infracheck¶
HTTP healthcheck endpoint + shell healthcheck runner. Simple, easy to set up, easy to understand. Works perfectly with Docker. A perfectly fitting universal brick in your monitoring.
{
"checks": {
"disk-space": {
"ident": "disk-space=True",
"output": "There is 350.8GB disk space at '/', nothing to worry about, defined minimum is 15GB\n",
"status": true
},
"docker-health": {
"ident": "docker-health=True",
"output": "Docker daemon reports that there is no 'unhealthy' service running in '' space\n",
"status": true
},
"minio": {
"ident": "minio=True",
"output": "",
"status": true
},
"replication-running": {
"ident": "replication-running=True",
"output": "Replica seems to be in good state\n",
"status": true
},
"storage-synchronization": {
"ident": "storage-synchronization=True",
"output": "Storage synchronization looks fine\n",
"status": true
}
},
"global_status": true
}
Quick start¶
To monitor applications and infrastructure components, you need to configure checks. A configured check is a JSON file that defines a method name (the script to be used) and its input parameters. Each check is executed when your external monitoring software invokes the HTTP endpoint, or when you execute the shell command.
Infracheck can work as an HTTP endpoint responding with JSON, or as a console command.
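A monitoring script can consume the JSON response shown at the top of this page. The following is a minimal sketch that relies only on the fields visible in that sample (`checks`, `output`, `status`, `global_status`). The helper name `failing_checks` and the sample payload are hypothetical and not part of Infracheck.

```python
import json


def failing_checks(payload: dict) -> dict:
    """Return {check name: output} for every check whose status is not true."""
    return {
        name: result.get("output", "")
        for name, result in payload.get("checks", {}).items()
        if not result.get("status")
    }


# Hypothetical sample, shaped like the response at the top of this page
response_body = """
{
  "checks": {
    "disk-space": {"output": "ok\\n", "status": true},
    "minio": {"output": "connection refused\\n", "status": false}
  },
  "global_status": false
}
"""

payload = json.loads(response_body)
print("global status:", payload["global_status"])
print("failing:", failing_checks(payload))
```

In practice the response body would come from the HTTP endpoint rather than from a string literal.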

1. Requirements¶
You need to install all requirements manually if you decide not to use a docker container.
Requirements:
- Python 3.7+
- OpenSSH Client
- sshpass (for SSH checks)
- whois (for domain checks)
- mysql-client (for MySQL checks)
- postgresql-client (for PostgreSQL checks)
- docker client (for Docker checks)
- curl
Python package requirements:
ovh >= 0.5.0, < 1.1
psutil >= 5.7.2, < 6
psycopg2-binary >= 2.8.4, < 3
python-dateutil >= 2.8.1, < 3
pytz >= 2019.3
six >= 1.15.0, < 2
tornado >= 5.1.1, < 7
whois >= 0.9.13, < 1
influxdb >= 5.3.1, < 6
msgpack >= 1.0, < 2
rkd>=2.3.3, <3
rkd-python>=2.3.3, <3
docker >= 5
croniter >= 1.0.13, < 1.4
2. Structure¶
You need to create a project structure from the following template:
- checks/
- http
- smtp
- port
- configured/
- redis
- duckduckgo_http
- smtp_is_alive
The checks directory should contain scripts that take parameters as environment variables, process them, and report a result. For simpler cases you may not need to write any scripts; just configure the predefined ones.
The configured directory should contain your actual use cases. For example, “duckduckgo_http” from the example above could use the “http” check with the URL “https://duckduckgo.com” as a parameter.
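The layout above can also be created programmatically. This is a minimal sketch, not an official Infracheck tool: the function `scaffold` is hypothetical, and the file name is taken literally from the template (whether a file extension is required is not stated here, so adjust it to your setup).

```python
import json
import tempfile
from pathlib import Path


def scaffold(root: Path, check_name: str, check_type: str, params: dict) -> Path:
    """Create checks/ and configured/ under root and write one configured check."""
    (root / "checks").mkdir(parents=True, exist_ok=True)
    configured = root / "configured"
    configured.mkdir(parents=True, exist_ok=True)

    # File name used as given; an extension may be needed in a real project
    target = configured / check_name
    body = {"type": check_type, "input": params}
    target.write_text(json.dumps(body, indent=4) + "\n")
    return target


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    written = scaffold(Path(tmp), "duckduckgo_http", "http", {"url": "https://duckduckgo.com"})
    print(written.read_text())
```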
3. Configuring a first check¶
Let’s assume that we need to check that a page contains a given keyword and does not contain another one. The following check uses curl to fetch the page content.
Test cases:
- If the page does not load, then THE CHECK RETURNS FAILURE
- If the page contains “Server error”, then THE CHECK RETURNS FAILURE
- If the page does not contain the keyword “iwa”, then THE CHECK RETURNS FAILURE
- If the page loads properly and contains the “iwa” keyword, then THE CHECK RETURNS SUCCESS
{
"type": "http",
"input": {
"url": "http://iwa-ait.org",
"expect_keyword": "iwa",
"not_expect_keyword": "Server error"
}
}
Hint: You can pass environment variables in parameters - see: Templating section.
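The four test cases above can be expressed as a small decision function. This is an illustrative sketch only: the bundled check uses curl, while this hypothetical `evaluate_page` function works on an already fetched body (`None` meaning the page did not load) and assumes case-sensitive matching.

```python
from typing import Optional, Tuple


def evaluate_page(body: Optional[str], expect_keyword: str,
                  not_expect_keyword: str) -> Tuple[bool, str]:
    """Apply the documented test cases to a fetched page body."""
    if body is None:
        return False, "Page did not load"

    if not_expect_keyword and not_expect_keyword in body:
        return False, "Page contains unexpected keyword '{}'".format(not_expect_keyword)

    if expect_keyword and expect_keyword not in body:
        return False, "Page does not contain expected keyword '{}'".format(expect_keyword)

    return True, "Page loaded and contains '{}'".format(expect_keyword)


print(evaluate_page("<h1>iwa-ait</h1>", "iwa", "Server error"))
print(evaluate_page("500 Server error", "iwa", "Server error"))
```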
4. Running checks¶
With Docker
You can use a ready-to-use docker image quay.io/riotkit/infracheck or quay.io/riotkit/infracheck for ARM. Please check the list of available versions.
By default, the image exposes an HTTP endpoint.
# create directory structure that will be present in "/data" inside container (see one of previous steps about the structure)
mkdir checks configured
sudo docker run --name infracheck -p 8000:8000 -v $(pwd):/data -d --rm quay.io/riotkit/infracheck:v2.0-x86_64 \
--directory=/data --server-path-prefix=/your-secret-code-there
# now test it
curl http://localhost:8000/your-secret-code-there/
List of supported environment variables:
- REFRESH_TIME=120
- CHECK_TIMEOUT=120
- WAIT_TIME=0
Without Docker
git clone https://github.com/riotkit-org/infracheck
cd infracheck
rkd :install
# run checks in the shell
infracheck --directory=/your-project-directory-path-there --no-server
# run the application with webserver and background worker
infracheck --directory=/your-project-directory-path-there --server-port=8000 --refresh-time=120 --log-level=info
Using PIP
sudo pip install infracheck
# run checks in the shell
infracheck --directory=/your-project-directory-path-there --no-server
# run the application with webserver and background worker
infracheck --directory=/your-project-directory-path-there --server-port=8000 --refresh-time=120 --log-level=info
Advanced¶
Setting timeout per check: Set the INFRACHECK_TIMEOUT environment variable in the JSON file to adjust the timeout for a given check.
Hooks¶
After each execution of your checks there is a possibility to execute some commands.
Example:
{
"type": "disk-space",
"input": {
"dir": "/",
"min_req_space": "6"
},
"hooks": {
"on_each_up": [
"rm -f /tmp/maintenance.html"
],
"on_each_down": [
"echo \"Site under maintenance\" > /tmp/maintenance.html"
]
}
}
The example above deletes the /tmp/maintenance.html file when disk space is at an acceptable level. If there is not enough disk space, “Site under maintenance” is written to /tmp/maintenance.html. With this practical example you can add a rule to your NGINX/Apache gateway to show a maintenance page when the file is present.
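How the hook lists relate to a check result can be sketched as follows. This is an assumption about the behaviour described above, not Infracheck's actual implementation; `hooks_for` and `run_hooks` are hypothetical names.

```python
import subprocess
from typing import List


def hooks_for(check_config: dict, status: bool) -> List[str]:
    """Pick the commands matching the check result: up when True, down when False."""
    hooks = check_config.get("hooks", {})
    return list(hooks.get("on_each_up" if status else "on_each_down", []))


def run_hooks(commands: List[str]) -> None:
    """Run each hook through the shell; a failing hook does not stop the others."""
    for command in commands:
        subprocess.run(command, shell=True, check=False)


config = {
    "type": "disk-space",
    "input": {"dir": "/", "min_req_space": "6"},
    "hooks": {
        "on_each_up": ["rm -f /tmp/maintenance.html"],
        "on_each_down": ["echo \"Site under maintenance\" > /tmp/maintenance.html"],
    },
}

print(hooks_for(config, status=True))
print(hooks_for(config, status=False))
```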
Predefined check types reference¶
Infracheck comes with some standard checks by default. Here is a list of them:
http¶
Performs an HTTP call using curl.
Example:
{
"type": "http",
"input": {
"url": "http://iwa-ait.org",
"expect_keyword": "iwa",
"not_expect_keyword": "Server error"
}
}
Parameters:
- url
- expect_keyword
- not_expect_keyword
rkd://¶
Infracheck can execute RiotKit-Do tasks. RKD is a task executor, similar to Makefile or Gradle. Its essential feature is the ability to load tasks from PyPI (Python packages).
Using RKD you can write a Python class, version it, release it to PyPI with a list of dependencies, and install it anywhere with PIP. A packaged task can require extra dependencies that you do not always want to install, e.g. MySQL, PostgreSQL, Redis or other clients that you want to install selectively on your Infracheck instances.
More information on how to write RKD tasks can be found in RiotKit-Do’s documentation.
{
"type": "rkd://rkd.standardlib.shell:sh",
"input": {
"-c": "ps aux |grep X11"
}
}
{
"type": "rkd://my_rkd_check:mysql:temporary-table-size-check",
"input": {
"--max": "100000",
"--host": "localhost",
"--port": 3306,
"--user": "infracheck",
"--password": "${TEMP_TABLE_SIZE_CHECK_PASSWORD}"
}
}
docker-health¶
Checks if containers are healthy.
Parameters:
- docker_env_name (a prefix; only containers whose names begin with it are checked, following the docker-compose naming idea)
port-open¶
Checks if the port is open.
Parameters:
- po_host
- po_port
- po_timeout (in seconds)
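The idea behind this check can be illustrated with a plain TCP connection attempt. This is a sketch under assumptions, not the bundled implementation: `is_port_open` is a hypothetical helper, and its arguments correspond to the po_host, po_port and po_timeout parameters.

```python
import socket


def is_port_open(host: str, port: int, timeout: float) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


# Demonstration against a temporary local listener
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

print(is_port_open("127.0.0.1", demo_port, 2.0))
listener.close()
```

A check script built on this would read PO_HOST, PO_PORT and PO_TIMEOUT from the environment and exit with 0 or 1 accordingly.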
replication-running¶
Checks if the MySQL replication is in good state. Works with Docker only.
Parameters:
- container
- mysql_root_password
free-ram¶
Monitors RAM memory usage to notify that a maximum percent of memory was used.
Parameters:
- max_ram_percentage (in percent, e.g. 80)
domain-expiration¶
Checks if the domain is close to its expiration date or has already expired.
Notice: Using this check multiple times can cause a “request limit exceeded” error.
Warning: Due to per-IP limits on whois usage, we recommend caching this health check aggressively (e.g. for 1-2 days). When checking multiple domains, use the “wait time” feature to sleep between checks, so that too many requests are not sent at once.
Parameters:
- domain (domain name)
- alert_days_before (number of days before expiration date to start alerting)
disk-space¶
Monitors disk space.
Parameters:
- min_req_space (in gigabytes)
- dir (path)
Example JSON:
{
"type": "disk-space",
"input": {
"dir": "/",
"min_req_space": "6"
}
}
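The logic of a disk space check can be sketched with the standard library. This is not the bundled script: `check_disk_space` is a hypothetical function, and treating one gigabyte as 1024³ bytes is an assumption.

```python
import shutil
from typing import Tuple


def check_disk_space(directory: str, min_req_space_gb: float) -> Tuple[bool, str]:
    """Compare free space at directory against a minimum, both in gigabytes."""
    # Assumption: 1 GB = 1024 ** 3 bytes
    free_gb = shutil.disk_usage(directory).free / 1024 ** 3
    ok = free_gb >= min_req_space_gb
    message = "There is {:.1f}GB disk space at '{}', defined minimum is {}GB".format(
        free_gb, directory, min_req_space_gb
    )
    return ok, message


status, message = check_disk_space(".", 6)
print(status, message)
```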
ovh-expiration¶
Checks if a VPS is not expired. Grab credentials at https://api.ovh.com/createToken/index.cgi
Required privileges on OVH API: “GET /vps*”
Parameters:
- endpoint (ex. ovh-eu)
- app_key
- app_secret
- app_consumer_key
- service_name (ex. somevps.ovh.net)
- days_to_alert (ex. 30 for 30 days)
Example JSON:
{
"type": "ovh-expiration",
"input": {
"endpoint": "ovh-eu",
"app_key": "xyyyyyyyyyyyyzz",
"app_secret": "xyxyxyxyyxyxyxyxyxyxxyyxyxyxyxy",
"app_consumer_key": "xyxyyxyxyxyxyxyxyxyyxyxyxyxyxy",
"service_name": "vps12345678.ovh.net",
"days_to_alert": 5
}
}
ssh-fingerprint¶
Verifies if the remote host fingerprint matches. Helps detect man-in-the-middle and server takeover attacks.
Parameters:
- expected_fingerprint (example: zsp.net.pl ssh-rsa SOMESOMESOMESOMESOMEKEYHERE)
- method (default: rsa)
- host (example: zsp.net.pl)
- port (example: 22)
ssh-files-checksum¶
Calls a remote process using SSH and expects the listed files to match the given checksums.
Parameters:
- user (default: root)
- host
- port (default: 22)
- private_key
- password
- ssh_bin (default: ssh)
- sshpass_bin (default: sshpass)
- ssh_opts (example: -o StrictHostKeyChecking=no)
- known_hosts_file (default: ~/.ssh/known_hosts)
- command (default: uname -a)
- timeout (default: 15, unit: seconds)
- method (default: sha256sum)
- expects (json dict, example: {“/usr/bin/bahub”: “d6e85b50756a08e24c1d46f07b68e288c9e7e565fd662a15baca214f576c34be”})
ssh-command¶
Calls remote process using SSH and expects: exit code, keywords in the output
Parameters:
- user (default: root)
- host
- port (default: 22)
- private_key
- password
- ssh_bin (default: ssh)
- sshpass_bin (default: sshpass)
- ssh_opts (example: -o StrictHostKeyChecking=no)
- known_hosts_file (default: ~/.ssh/known_hosts)
- command (default: uname -a)
- timeout (default: 15, unit: seconds)
- expected_keywords (Keywords expected to be in stdout/stderr. Separated by “;”)
- unexpected_keywords (Keywords not expected to be present in stdout/stderr. Separated by “;”)
- expected_exit_code (default: 0)
reminder¶
Reminds about a recurring date. Example: extending the validity of your hosting account.
Parameters:
- ref_date (example: 2019-05-01 for the 1st of May 2019)
- each (values: week; month; year, default: year)
- alert_days_before (default: 5, the health check turns red 5 days before the date)
load-average-auto¶
Checks if the load average is not more than 100%
Parameters:
- maximum_above (unit: processor cores, default: 0.5 - half of a core)
- timing (default: 15. The load average time: 1, 5, 15)
load-average¶
Checks if the load average is not above the specified number
Parameters:
- max_load (unit: processor cores, example: 5.0, default: 1)
- timing (default: 15. The load average time: 1, 5, 15)
swap-usage-max-percent¶
Defines maximum percentage of allowed swap usage
Parameters:
- max_allowed_percentage (default: 0.0)
influxdb-query¶
Queries an InfluxDB database and compares results.
Parameters:
- host
- port (default: 8086)
- user
- password
- database
- query
- expected: A json serialized result (not pretty formatted)
Example of JSON serialized result for query ‘select value from cpu_load_short;’:
[
[
{"time": "2009-11-10T23:00:10Z", "value": 10.64},
{"time": "2009-11-10T23:00:20Z", "value": 20.64},
{"time": "2009-11-10T23:00:30Z", "value": 30.64},
{"time": "2009-11-10T23:00:40Z", "value": 40.64}
]
]
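One way to produce the single-line value for the `expected` parameter is to serialize the rows with `json.dumps` without indentation. The sketch below also shows a comparison that parses both sides, so whitespace differences do not matter. How Infracheck compares the values internally is not described here; `matches_expected` is a hypothetical helper.

```python
import json


def matches_expected(actual_rows: list, expected_json: str) -> bool:
    """Compare query results with the expected value by parsed content."""
    return json.loads(expected_json) == actual_rows


rows = [
    [
        {"time": "2009-11-10T23:00:10Z", "value": 10.64},
        {"time": "2009-11-10T23:00:20Z", "value": 20.64},
    ]
]

# json.dumps without indent yields a single-line (not pretty formatted) string
expected = json.dumps(rows)
print(expected)
print(matches_expected(rows, expected))
```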
postgres¶
Uses pg_isready tool to verify if PostgreSQL is up and ready to connect.
Parameters:
- pg_host (hostname or socket path, defaults to “localhost” which will use local unix socket, use IP address eg. 127.0.0.1 to connect via tcp)
- pg_port (port, defaults to 5432)
- pg_db_name (database name to connect to, defaults to “postgres”)
- pg_user (username, defaults to “postgres”)
- pg_conn_timeout (defaults to 15 which means 15 seconds)
postgres-primary-streaming-status¶
Verifies if local PostgreSQL instance is currently serving WALs to a specified replica. The SQL command that is validated: select * from pg_stat_replication;
Parameters:
- pg_host (hostname or socket path, defaults to “localhost” which will use local unix socket, use IP address eg. 127.0.0.1 to connect via tcp)
- pg_port (port, defaults to 5432)
- pg_db_name (database name to connect to, defaults to “postgres”)
- pg_user (username, defaults to “postgres”)
- pg_password
- pg_conn_timeout (defaults to 15 which means 15 seconds)
- expected_status (defaults to “streaming”)
- expected_replication_user: Expected user that will be used for replication connection (defaults to “replication”)
postgres-replica-status¶
Checks if local PostgreSQL server acts as a replication server, by validating the list of active wal receivers. The SQL command that is validated: select * from pg_stat_wal_receiver;
Parameters:
- pg_host (hostname or socket path, defaults to “localhost” which will use local unix socket, use IP address eg. 127.0.0.1 to connect via tcp)
- pg_port (port, defaults to 5432)
- pg_db_name (database name to connect to, defaults to “postgres”)
- pg_user (username, defaults to “postgres”)
- pg_password
- pg_conn_timeout (defaults to 15 which means 15 seconds)
- expected_status (defaults to “streaming”)
- expected_replication_user: Expected user that will be used for replication connection (defaults to “replication”)
docker-container-log¶
Searches Docker container logs for lines matching the given regular expression.
Parameters:
- container: Docker container name
- regexp: Regular expression
- max_lines: Number of last lines to check (defaults to 5)
- since_seconds: Get only logs since this time (eg. last 5 minutes = 5 * 60 = 300) (defaults to 300)
- present: Boolean, if the string should be present in the output or not
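The interplay of regexp, max_lines and present can be sketched on plain text. This is an assumption about the semantics described above, not the bundled implementation; the hypothetical `log_matches` function receives log text that has already been fetched (and already limited by since_seconds).

```python
import re


def log_matches(log_text: str, regexp: str, max_lines: int = 5, present: bool = True) -> bool:
    """Search the last max_lines lines; succeed when the match state equals present."""
    lines = log_text.splitlines()
    last_lines = lines[-max_lines:] if max_lines > 0 else []
    found = any(re.search(regexp, line) for line in last_lines)
    return found == present


sample_log = "\n".join([
    "starting",
    "connected to database",
    "ERROR: replication lag too high",
    "retrying",
])

print(log_matches(sample_log, r"ERROR", max_lines=5, present=False))
print(log_matches(sample_log, r"connected", max_lines=5, present=True))
```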
smtp_credentials_check.py¶
Verifies the connection, TLS certificate and credentials of an SMTP server by doing a ping and an authentication attempt.
Parameters:
- smtp_host (example: bakunin.example.org)
- smtp_port (example: 25)
- smtp_user (example: noreply@example.org)
- smtp_password (example: bakunin-1936)
- smtp_encryption (example: starttls. Values: “”, “ssl”, “starttls”)
- smtp_timeout (default: 30, unit: seconds)
tls¶
TLS/SSL certificate expiration validation
Parameters:
- domain: TLS certificate domain for which the certificate was created
- host: IP address or DNS hostname from which the certificate should be downloaded (defaults to domain value)
- port: Port (defaults to 443)
- alert_days_before: Number of days before expiration date to start alerting (defaults to 3)
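The expiration logic behind these parameters can be sketched with Python's `ssl` module. This is not the bundled implementation: `fetch_not_after`, `days_until_expiry` and `certificate_ok` are hypothetical names, and `fetch_not_after` needs network access, so it is defined here but not called.

```python
import socket
import ssl


def fetch_not_after(domain: str, host: str = "", port: int = 443, timeout: float = 10.0) -> str:
    """Download the certificate and return its 'notAfter' field.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host or domain, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=domain) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: float) -> float:
    """Days between now (epoch seconds) and the certificate's notAfter time."""
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def certificate_ok(not_after: str, now: float, alert_days_before: int = 3) -> bool:
    """Fail once the remaining validity drops to alert_days_before days or less."""
    return days_until_expiry(not_after, now) > alert_days_before


not_after = "Jan  5 09:34:43 2018 GMT"
expiry = ssl.cert_time_to_seconds(not_after)
print(certificate_ok(not_after, now=expiry - 10 * 86400))
print(certificate_ok(not_after, now=expiry - 2 * 86400))
```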
tls-docker-network¶
Automated TLS certificate verification for Docker-based flows like docker-gen. Scans the list of Docker containers based on a label or environment variable that contains a domain name.
Parameters:
- parameter_type: Label or environment variable
- parameter_name: Name of the label or environment variable
- alert_days_before: Number of days before expiration date to start alerting (defaults to 3)
- docker_host: (Optional) The URL to the Docker host.
- docker_tls_verify: (Optional) Verify the host against a CA certificate.
- docker_cert_path: (Optional) A path to a directory containing TLS certificates to use when connecting to the Docker host
- debug: (Optional) Debugging mode
Check configuration reference¶
{
"type": "http",
"description": "IWA-AIT check",
"results_cache_time": 300,
"input": {
"url": "http://iwa-ait.org",
"expect_keyword": "iwa",
"not_expect_keyword": "Server error"
},
"hooks": {
"on_each_up": [
"rm -f /var/www/maintenance.html"
],
"on_each_down": [
"echo \"Site under maintenance\" > /var/www/maintenance.html"
]
},
"quiet_periods": [
{"starts": "30 00 * * *", "duration": 60}
]
}
type¶
Name of the binary/script file placed in the “checks” directory. Infracheck first looks in the path specified by the “--directory” CLI parameter, then falls back to its internal check library.
Example values:
- disk-space
- load-average
- http
- smtp_credentials_check.py
description¶
Optional text field. A note can be left here for other administrators, to share knowledge quickly in case of a failure.
results_cache_time¶
How long the check result should be kept in cache (in seconds)
input¶
Parameters passed to the binary/script file (chosen in “type” field). Case insensitive, everything is converted to UPPERCASE and passed as environment variables.
Notice: Environment variables and internal variables can be injected using templating feature - check Templating
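The conversion described above can be sketched in a few lines. This is an illustration, not Infracheck's code: `input_to_env` is a hypothetical name, and converting values with `str()` is an assumption (environment variables are always strings, but how nested values are serialized is not stated here).

```python
def input_to_env(params: dict) -> dict:
    """Turn the "input" section into uppercase environment variable pairs."""
    # Assumption: values are converted with str()
    return {str(name).upper(): str(value) for name, value in params.items()}


check = {
    "type": "http",
    "input": {
        "url": "http://iwa-ait.org",
        "expect_keyword": "iwa",
        "not_expect_keyword": "Server error",
    },
}

print(input_to_env(check["input"]))
```

A check script then reads, for example, URL and EXPECT_KEYWORD from its environment.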
hooks¶
(Optional) Execute shell commands on given events.
- on_each_up: Every time the check is OK
- on_each_down: Every time the check is FAILING
Templating¶
To increase security, there is a simple templating mechanism that allows injecting variables into the parameters passed to the checks.
Example:
{
"type": "ssh-command",
"input": {
"user": "thesecurityman",
"host": "iwa-ait.org",
"port": 6200,
"password": "${ENV.IWA_SECURITY_MAN_PASSWD}",
"command": "/usr/bin/some-security-check --is-secure",
"expected_exit_code": 0,
"timeout": 30
}
}
Reference table¶
Pattern | Example | Description |
---|---|---|
${ENV.*} | ${ENV.USER} | Injects an environment variable from the host |
${checkName} | http | Name of the currently executed check |
${date} | 2019-10-31T07:53:45.380307 | Current date and time |
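The three patterns in the table can be illustrated with a small substitution function. This is a sketch under assumptions, not the actual templating code: `render` is a hypothetical name, a missing environment variable is replaced with an empty string, and unknown placeholders are left untouched.

```python
import re
from datetime import datetime

PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_.]*)\}")


def render(value: str, check_name: str, env: dict, now: datetime) -> str:
    """Replace ${ENV.*}, ${checkName} and ${date} inside a parameter value."""
    def substitute(match: "re.Match") -> str:
        key = match.group(1)
        if key.startswith("ENV."):
            return env.get(key[len("ENV."):], "")  # assumption: missing -> empty
        if key == "checkName":
            return check_name
        if key == "date":
            return now.isoformat()
        return match.group(0)  # unknown placeholder stays as written

    return PLACEHOLDER.sub(substitute, value)


fake_env = {"IWA_SECURITY_MAN_PASSWD": "secret"}
moment = datetime(2019, 10, 31, 7, 53, 45, 380307)
print(render("${ENV.IWA_SECURITY_MAN_PASSWD}", "ssh-command", fake_env, moment))
print(render("${checkName} at ${date}", "http", fake_env, moment))
```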
Example strategy of deploying passwords with Docker Compose and Ansible¶
- Encrypt your passwords with ansible-vault
- Decrypt them during deployment into .env on target machine for docker-compose
- In docker-compose service definition pass variable explicitly from the .env file
environment:
# variables in checks
- IWA_SECURITY_MAN_PASSWD=${IWA_SECURITY_MAN_PASSWD}
Writing custom checks¶
Infracheck provides very basic scripts for health checking, so you will probably want to write your own. It’s really simple.
- “check” scripts live in the “checks” directory of your project structure; this is where you add a new check script
- Your script needs to take uppercase environment variables as input
- It is considered good practice to validate the presence of environment variables in scripts
- Your script needs to return a valid exit code when:
  - Any of the environment variables is missing or has an invalid value
  - The check fails
  - The check succeeds
That’s all!
A few examples:
#!/bin/bash

#
# Directory presence check
#
# @author Krzysztof Wesołowski
# @url https://iwa-ait.org
#

if [[ ! "${DIR}" ]]; then
    echo "DIR parameter is missing"
    exit 1
fi

if [[ ! -d "${DIR}" ]]; then
    echo "Failed asserting that directory at '${DIR}' is present"
    exit 1
fi

echo "'${DIR}' directory is present"
exit 0
#!/usr/bin/env python3

"""
<sphinx>
load-average
------------

Checks if the load average is not above the specified number

Parameters:

- max_load (unit: processor cores, example: 5.0, default: 1)
- timing (default: 15. The load average time: 1, 5, 15)
</sphinx>
"""

import os
import sys
import inspect

path = os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe()))) + '/../../'
sys.path.insert(0, path)

from infracheck.infracheck.checklib.loadavg import BaseLoadAverageCheck


class LoadAverageAuto(BaseLoadAverageCheck):
    def main(self, timing: str, max_load: float):
        current_load_average = self.get_load_average(timing)

        if current_load_average > max_load:
            return False, "Load {:.2f} exceeds allowed max. load of {:.2f}. Current load: {:s}".format(
                current_load_average, max_load, self.get_complete_avg()
            )

        return True, "Load at level of {:.2f} is ok, current load: {:s}".format(
            current_load_average, self.get_complete_avg())


if __name__ == '__main__':
    app = LoadAverageAuto()
    status, message = app.main(
        timing=os.getenv('TIMING', '15'),
        max_load=float(os.getenv('MAX_LOAD', 1))
    )

    print(message)
    sys.exit(0 if status else 1)
Cache and freshness¶
Running all checks on each HTTP endpoint call could be harmful to the server, so the application runs them periodically, every X seconds, as specified by the --refresh-time switch or the REFRESH_TIME environment variable (in Docker).
Refresh time¶
If you use an official docker image, then you can set an environment variable.
Example: check once a day (good for domains whois check).
REFRESH_TIME=86400
From the CLI you can set --refresh-time=86400
Wait time¶
Some checks call external APIs, which can have rate limits. A good example is the domain-expiration check, which uses whois. Set --wait=60 to wait, for example, 60 seconds before each check, where a check is a single entry on the list of checks.
Customizing check freshness time per check¶
Besides the global refresh time setting, there can be a per-check setting called “results_cache_time”.
Example of caching the check result for at least 300 seconds
{
"type": "swap-usage-max-percent",
"results_cache_time": "300",
"input": {
"max_allowed_percentage": 0
}
}
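The effect of “results_cache_time” can be pictured as a small cache keyed by check name. This is a sketch of the idea only, not Infracheck's storage layer: the `ResultCache` class is hypothetical, and accepting the setting as either a number or a numeric string is an assumption based on the examples on this page.

```python
from typing import Any, Dict, Optional, Tuple


class ResultCache:
    """Keeps the last result of each check together with the time it was stored."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, check_name: str, result: Any, now: float) -> None:
        self._entries[check_name] = (now, result)

    def get(self, check_name: str, results_cache_time: Any, now: float) -> Optional[Any]:
        """Return the cached result while it is fresh, otherwise None."""
        entry = self._entries.get(check_name)
        if entry is None:
            return None
        stored_at, result = entry
        # The setting appears both as 300 and "300" in examples, so normalize it
        if now - stored_at >= int(results_cache_time):
            return None
        return result


cache = ResultCache()
cache.put("swap-usage-max-percent", {"status": True}, now=1000.0)
print(cache.get("swap-usage-max-percent", "300", now=1299.0))
print(cache.get("swap-usage-max-percent", "300", now=1300.0))
```

When `get` returns None, the check would be executed again and its new result stored with `put`.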
From authors¶
The project was started as a part of the RiotKit initiative, for the needs of grassroots organizations such as:
- Syndicalist unions fighting for better working conditions (the International Workers Association, for example)
- Tenants’ rights organizations
- Various grassroots organizations that help people organize themselves without authority
RiotKit Collective