Hi all,
I've set up a RabbitMQ service on AKS using the latest (at the time of writing, 4.1.1) Cluster and Topology Operators. The setup was intuitive enough: I created some manifest files and applied them to the AKS cluster with kubectl.
I was able to set up:
- a namespace
- an external certificate for TLS, generated with cert-manager (and Let's Encrypt)
- an internal self-signed certificate, also generated with cert-manager
- the RabbitMQ cluster-operator
- the RabbitMQ messaging-topology-operator-with-certmanager
- a RabbitmqCluster (with the MQTT plugin enabled)
- users (via Secrets and Permissions)
- an external load balancer (with the ports for client access)
- an internal load balancer (with the ports for services)
- an ingress resource using nginx for the internal TLS proxy
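For reference, the cluster item above boils down to a manifest along these lines (a minimal sketch; the names and secret references are placeholders, not my actual values):

```yaml
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: rabbitmq                # placeholder name
  namespace: rabbitmq-system    # placeholder namespace
spec:
  replicas: 3
  rabbitmq:
    additionalPlugins:
      - rabbitmq_mqtt           # enables the MQTT listener
  tls:
    secretName: rabbitmq-internal-tls    # internal cert-manager certificate
    caSecretName: rabbitmq-internal-ca   # CA secret for peer verification
```

With spec.tls set, the operator exposes the amqp/ssl and mqtt/ssl listeners that show up in the error logs below.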
The setup is running well and my clients and services connect to it without issue, so technically I have a healthy running service...
-- BUT --
My pod logs are being flooded with TLS errors originating from internal namespaces, including:
- kube-system
- calico-system
- tigera-operator
(I cannot distinguish which one it is, as they share the same IPs.)
These are the errors:
[error] timeout error during handshake for protocol 'amqp/ssl' and peer 10.2.1.11:48397
[error] timeout error during handshake for protocol 'mqtt/ssl' and peer 10.2.1.11:63038
[error] timeout error during handshake for protocol 'mqtt/ssl' and peer 10.2.1.11:35356
[error] timeout error during handshake for protocol 'amqp/ssl' and peer 10.2.1.11:63171
[error] timeout error during handshake for protocol 'mqtt/ssl' and peer 10.2.1.11:32319
[error] timeout error during handshake for protocol 'amqp/ssl' and peer 10.2.1.12:25044
[error] timeout error during handshake for protocol 'mqtt/ssl' and peer 10.2.1.12:19395
[error] timeout error during handshake for protocol 'mqtt/ssl' and peer 10.2.1.12:52790
[error] timeout error during handshake for protocol 'mqtt/ssl' and peer 10.2.1.11:55244
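Since pod IPs show up in `kubectl get pods -o wide`, one way to narrow down which namespace a peer belongs to is to grep for one of the IPs from the logs above:

```shell
# Find which pod(s) currently hold the peer IP from the handshake errors
kubectl get pods --all-namespaces -o wide | grep 10.2.1.11
```

The catch, as far as I understand it, is that host-networked pods (calico-node, kube-proxy, and friends) all report the node's IP, which is presumably why several namespaces show the same address.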
I reached out to the RabbitMQ team, who graciously helped me. It soon turned out that these errors were logged correctly but originated from one of the services mentioned above, and as this is an AKS deployment, that was as far as they could help with the debugging.
It would seem the issue is with the AKS monitoring service, which uses Prometheus scraping to ship monitoring statistics into Azure (and which, by the way, is enabled by default): because the monitoring service does not have the self-generated certificate needed to access the TLS ports, every probe leaves an error log in the RabbitMQ pod.
This is outside my domain. I did find a somewhat hidden section deep in the docs where you can opt out of metrics collection ("Disable control plane metrics on your AKS cluster"), but that is an "opt out of everything" switch; I could not find any granular setting to exclude just the TLS ports of my pods. It sounds a bit bizarre, but that is why I'm posting this question.
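If it really is Azure Monitor's managed Prometheus (the ama-metrics pods in kube-system) doing the probing, then as far as I can tell from the Azure docs the granularity lives in a settings ConfigMap rather than in the portal. A sketch of the relevant keys, assuming the documented names are current (I have not confirmed this actually stops the TLS probes):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-settings-configmap
  namespace: kube-system
data:
  # Toggle individual default scrape targets instead of
  # disabling monitoring wholesale
  default-scrape-settings-enabled: |-
    kubelet = true
    cadvisor = true
    kubestate = true
    nodeexporter = true
    coredns = false
    kubeproxy = false
    apiserver = false
  # Limit pod-annotation-based scraping to specific namespaces
  # (empty regex = no namespaces)
  pod-annotation-based-scraping: |-
    podannotationnamespaceregex = ""
```

If anyone knows whether one of these settings maps onto whatever is dialing my amqp/ssl and mqtt/ssl ports, that would already be a big help.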
So, the question:
How can I configure the Azure monitoring service (which I assume uses Prometheus scraping) so that it stops trying to connect to my pod endpoints over TLS, while still giving me some level of monitoring (and without flooding the pod logs)?
PS. I have disabled the AKS monitoring on my cluster with the CLI and rebuilt my RabbitMQ instance, yet default monitoring still shows as enabled in the Azure portal, and unfortunately the errors are still being generated in the pod logs.
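In case it matters, the kind of disable command I mean is something like this (resource group and cluster name are placeholders):

```shell
# Disable Azure Monitor managed Prometheus metrics collection for the cluster
az aks update --disable-azure-monitor-metrics \
  --resource-group <my-resource-group> \
  --name <my-aks-cluster>
```

Note this is distinct from Container insights (log collection), which has its own `--disable-addons monitoring` switch, so it's possible I only turned off one of the two.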
Cheers,
Pieter