HomeNetwork/docs/mqtt-broker-broad-analysis.md
MaddoScientisto 6cc281d372 Add MQTT/Zigbee diagnostics tool and documentation updates
- Introduced `scripts/mqtt_z2m_diag.py` for reusable MQTT and Zigbee2MQTT diagnostics.
- Added `copilot-instructions.md` section for MQTT/Zigbee diagnostics tool usage.
- Created `docs/mqtt-broker-broad-analysis.md` for comprehensive MQTT broker analysis.
- Documented Salotto Overview Switch investigation in `docs/salotto-overview-switch-investigation.md`.
2026-04-17 18:25:37 +02:00

MQTT broker broad analysis

Scope

Broad passive review of the MQTT broker used for casa, focused on signs of broker stress, unusual traffic, excessive retained state, or noisy publishers that could affect performance or the network.

Method

  1. Verified broker reachability on TCP/1883.
  2. Took a 45-second full subscription sample (# plus $SYS/#, since # alone does not match topics that begin with $) to capture retained-state bursts and broker metrics.
  3. Took a 60-second steady-state sample, ignoring a 5-second warmup, to separate the retained snapshot delivered on subscribe from ongoing traffic.
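
The reachability check and the warmup filtering can be sketched in Python with only the standard library. This is a minimal illustration, not the repository's diagnostics script: `broker_reachable`, `steady_state`, and the `(elapsed_s, topic, payload_len, retained)` record layout are hypothetical names chosen for the example, and it assumes the warmup is measured from the moment of subscribing.

```python
import socket


def broker_reachable(host: str, port: int = 1883, timeout: float = 3.0) -> bool:
    """Return True if a plain TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def steady_state(records, warmup_s: float = 5.0):
    """Keep only non-retained messages that arrived after the warmup.

    Each record is (elapsed_s, topic, payload_len, retained), where
    elapsed_s is the number of seconds since the subscription was made.
    """
    return [r for r in records if r[0] >= warmup_s and not r[3]]
```

Dropping both the warmup window and anything flagged as retained is deliberately conservative: a retained replay that arrives late still does not count as live traffic.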

Findings

Overall status

No obvious broker health issue showed up. The broker appears stable and not under noticeable pressure:

| Signal | Observation |
| --- | --- |
| Connected clients | 46 |
| Max clients seen | 47 |
| Subscriptions | 945 |
| Dropped publishes | 0 |
| Retained messages stored | 1026 |
| Retained store size | 1,494,548 bytes |
| Heap current / max | 4,149,612 / 4,887,547 bytes |

The $SYS counters did not suggest backlog, churn, or message loss.

Traffic shape

The first sample had a large initial burst, but it was mostly explained by retained state replay and Zigbee2MQTT bridge metadata sent immediately after subscribing:

  • 970 retained messages were seen on connect.
  • Largest payloads were:
    • zigbee2mqtt_2/bridge/definitions - 245,350 bytes
    • zigbee2mqtt/bridge/definitions - 217,824 bytes
    • zigbee2mqtt/bridge/devices - 82,585 bytes
    • zigbee2mqtt_2/bridge/devices - 81,884 bytes

That explains the one-shot peak of 1084 messages/second during the broad sample. It looks like a subscription snapshot, not an ongoing flood.
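
One way to tell a one-shot burst from a sustained flood is to bucket message arrivals per second and look at where the peak falls. The sketch below uses a hypothetical helper, `peak_second`, and synthetic arrival times that only mimic the shape described above; they are not the captured data.

```python
from collections import Counter


def peak_second(arrival_times):
    """Return (second_index, message_count) for the busiest one-second bucket.

    arrival_times are non-negative seconds elapsed since subscribing.
    """
    buckets = Counter(int(t) for t in arrival_times)
    return max(buckets.items(), key=lambda item: item[1])


# Synthetic arrival times: a burst in the first second, then a much
# lower ongoing rate for the rest of a 45-second window.
times = [0.5] * 1084 + [s + 0.5 for s in range(1, 45) for _ in range(25)]
print(peak_second(times))  # (0, 1084)
```

A peak that sits in the first bucket and is never approached again is consistent with a subscription snapshot rather than a noisy publisher.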

Steady-state load

After excluding the initial retained burst:

| Metric | Value |
| --- | --- |
| Sample window | 60 seconds |
| Non-retained messages | 1511 |
| Non-retained bytes | 42,949 |
| Average rate | 25.18 messages/second |
| Peak second | 97 messages |
| Unique topics seen | 202 |

At about 25 messages/second and roughly 43 KB of non-retained data over the minute, this is a modest steady-state load. The broker is handling it without signs of distress.
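
The derived rates follow directly from the table. The short calculation below reproduces the reported average and adds a few ratios that are not in the original report, assuming the byte count refers to payload bytes.

```python
window_s = 60
messages = 1511
payload_bytes = 42_949
peak_per_s = 97

avg_rate = messages / window_s          # messages per second
avg_payload = payload_bytes / messages  # bytes per message
byte_rate = payload_bytes / window_s    # bytes per second
peak_to_avg = peak_per_s / avg_rate     # how spiky the busiest second was

print(f"{avg_rate:.2f} msg/s")       # 25.18 msg/s
print(f"{avg_payload:.2f} B/msg")    # 28.42 B/msg
print(f"{byte_rate:.2f} B/s")        # 715.82 B/s
print(f"{peak_to_avg:.2f}x peak")    # 3.85x peak
```

An average payload under 30 bytes suggests the traffic is dominated by small sensor-style values rather than large JSON documents.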

Noisiest publishers

The clear dominant talker is a Shelly EM3 namespace:

  • Root prefix shellies accounted for 1362 / 1511 steady-state messages (about 90%).
  • The top topics were all from shellies/shellyem3-485519D91C40/emeter/....
  • Individual EM3 topics appeared 35 times in 60 seconds, roughly one publish every 1.7 seconds per topic, which is chatty but not bandwidth-heavy.

Important nuance:

  • This is mostly a message-count issue, not a bandwidth issue.
  • In the same steady-state sample, shellies produced only 7004 of the 42,949 non-retained bytes (about 16%).

So the Shelly EM3 is the main source of ongoing chatter, but it does not currently look like a broker or network problem by itself.
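
The count-versus-bytes distinction can be checked by grouping messages by their first topic level. The sketch below is illustrative: `totals_by_root` and the `(topic, payload_len)` record layout are hypothetical, while the share calculation uses the figures reported above.

```python
from collections import defaultdict


def totals_by_root(records):
    """Sum message count and payload bytes per first topic level.

    Each record is (topic, payload_len).
    """
    totals = defaultdict(lambda: [0, 0])
    for topic, payload_len in records:
        root = topic.split("/", 1)[0]
        totals[root][0] += 1
        totals[root][1] += payload_len
    return {root: tuple(pair) for root, pair in totals.items()}


# Shares computed from the figures reported above.
msg_share = 1362 / 1511     # shellies messages / all steady-state messages
byte_share = 7004 / 42_949  # shellies bytes / all steady-state bytes
print(f"{msg_share:.1%} of messages, {byte_share:.1%} of bytes")
print(f"{7004 / 1362:.1f} bytes per shellies message on average")
```

Roughly 90% of the messages but only about 16% of the bytes, at around 5 bytes per message, is what small numeric meter readings would look like.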

Large payloads

Outside the retained startup burst, large payloads were minimal:

  • Only one large non-retained payload was observed in the steady-state sample:
    • frigate/stats - 10,743 bytes

Large payloads on frigate snapshot and hass.agent thumbnail topics appeared early in the broad capture, but they did not recur as sustained heavy traffic in the steady-state sample.

Topic naming oddities

Several Zigbee2MQTT topics contain spaces, for example:

  • zigbee2mqtt/Btcino coso salotto/availability
  • zigbee2mqtt/Letty Condizionatore Ufficio/availability

This is not a broker anomaly, but it is worth noting:

  • it can make tooling and ad-hoc topic handling more brittle
  • it increases the chance of mistakes in scripts, automations, and CLI work

If you want cleaner topic hygiene, consider slug-style friendly names for Zigbee2MQTT devices.
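
As a rough illustration of what slug-style names could look like, the sketch below lowercases a friendly name and collapses anything that is not a letter or digit into an underscore. `slugify_friendly_name` is a hypothetical helper and the underscore convention is an assumption; it is not part of Zigbee2MQTT.

```python
import re
import unicodedata


def slugify_friendly_name(name: str) -> str:
    """Lowercase a name and collapse anything non-alphanumeric into '_'."""
    # Fold accented letters to their ASCII base where possible.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


for name in ("Btcino coso salotto", "Letty Condizionatore Ufficio"):
    print(f"zigbee2mqtt/{slugify_friendly_name(name)}/availability")
# zigbee2mqtt/btcino_coso_salotto/availability
# zigbee2mqtt/letty_condizionatore_ufficio/availability
```

This sketch also replaces `/`, so it would flatten any friendly name that uses slashes for topic hierarchy. Renaming a device changes its MQTT topic, so anything subscribed to the old topic would need updating.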

Conclusion

What looks healthy

  • No dropped publishes
  • No sign of broker backlog or unstable client churn
  • Retained store is not unusually large (1026 messages, about 1.5 MB)
  • Heap usage is not alarming (about 4.1 MB current, 4.9 MB max)
  • Steady-state traffic volume is modest (about 25 messages/second)

What stands out

  1. A retained-state burst on subscribe, mostly from Zigbee2MQTT bridge metadata. This is expected behavior and not a live flood.
  2. A very chatty Shelly EM3 publisher dominating message count. It is the main thing to watch, but at current byte volume it does not look harmful.
  3. Topic names with spaces in Zigbee2MQTT. Not a performance issue, but a maintainability footgun.

Recommendation

No urgent remediation is indicated from this pass.

If you want to reduce noise further, the next place to look is the publish frequency configuration of shellies/shellyem3-485519D91C40, since that device accounts for most of the ongoing message count.