Add MQTT/Zigbee diagnostics tool and documentation updates

- Introduced `scripts/mqtt_z2m_diag.py` for reusable MQTT and Zigbee2MQTT diagnostics.
- Added `copilot-instructions.md` section for MQTT/Zigbee diagnostics tool usage.
- Created `docs/mqtt-broker-broad-analysis.md` for comprehensive MQTT broker analysis.
- Documented Salotto Overview Switch investigation in `docs/salotto-overview-switch-investigation.md`.
MaddoScientisto 2026-04-17 18:25:37 +02:00
commit 6cc281d372
5 changed files with 925 additions and 0 deletions


@ -0,0 +1,117 @@
# MQTT broker broad analysis
## Scope
Broad passive review of the MQTT broker used for **casa**, focused on signs of broker stress, unusual traffic, excessive retained state, or noisy publishers that could affect performance or the network.
## Method
1. Verified broker reachability on TCP/1883.
2. Took a **45-second full subscription sample** (`#` + `$SYS/#`) to capture retained-state bursts and broker metrics.
3. Took a **60-second steady-state sample** with a **5-second warmup ignored** to separate normal retained snapshots from ongoing traffic.
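The warmup split in step 3 can be sketched as follows. This is a minimal illustration and not the logic of `scripts/mqtt_z2m_diag.py`, whose contents are not shown here. The `Sample` record and the `steady_state_stats` function are hypothetical names, and the sketch assumes the 60-second window is measured after the warmup ends and that the capturing client exposes each message's retain flag.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One captured message (hypothetical record shape)."""
    t: float        # seconds since the subscription started
    topic: str
    size: int       # payload length in bytes
    retained: bool  # retain flag as delivered to the subscriber


def steady_state_stats(samples, warmup=5.0, window=60.0):
    """Summarise non-retained traffic in [warmup, warmup + window)."""
    kept = [
        s for s in samples
        if not s.retained and warmup <= s.t < warmup + window
    ]
    # Bucket messages by whole second to find the busiest second.
    per_second = Counter(int(s.t) for s in kept)
    return {
        "messages": len(kept),
        "bytes": sum(s.size for s in kept),
        "avg_rate": round(len(kept) / window, 2),
        "peak_second": max(per_second.values(), default=0),
        "unique_topics": len({s.topic for s in kept}),
    }
```

Filtering on both the retain flag and the warmup window keeps the retained replay that arrives right after subscribing out of the steady-state figures.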
## Findings
### Overall status
**No obvious broker health issue showed up.** The broker appears stable and not under noticeable pressure:
| Signal | Observation |
| --- | --- |
| Connected clients | 46 |
| Max clients seen | 47 |
| Subscriptions | 945 |
| Dropped publishes | 0 |
| Retained messages stored | 1026 |
| Retained store size | 1,494,548 bytes |
| Heap current / max | 4,149,612 / 4,887,547 bytes |
The `$SYS` counters did **not** suggest backlog, churn, or message loss.
### Traffic shape
The first sample had a large initial burst, but it was mostly explained by **retained state replay** and **Zigbee2MQTT bridge metadata** sent immediately after subscribing:
- 970 retained messages were seen on connect.
- Largest payloads were:
- `zigbee2mqtt_2/bridge/definitions` - 245,350 bytes
- `zigbee2mqtt/bridge/definitions` - 217,824 bytes
- `zigbee2mqtt/bridge/devices` - 82,585 bytes
- `zigbee2mqtt_2/bridge/devices` - 81,884 bytes
That explains the one-shot peak of **1084 messages/second** during the broad sample. It looks like a subscription snapshot, **not** an ongoing flood.
### Steady-state load
After excluding the initial retained burst:
| Metric | Value |
| --- | --- |
| Sample window | 60 seconds |
| Non-retained messages | 1511 |
| Non-retained bytes | 42,949 |
| Average rate | 25.18 messages/second |
| Peak second | 97 messages |
| Unique topics seen | 202 |
That is a modest steady-state load: about 25 messages and roughly 716 bytes per second on average. The broker handles it without signs of distress.
### Noisiest publishers
The clear dominant talker is a **Shelly EM3** namespace:
- Root prefix `shellies` accounted for **1362 / 1511** steady-state messages.
- The top topics were all from `shellies/shellyem3-485519D91C40/emeter/...`.
- Individual EM3 topics appeared **35 times in 60 seconds**, which is chatty but not bandwidth-heavy.
Important nuance:
- This is mostly a **message-count** issue, not a **bandwidth** issue.
- The same steady-state sample shows `shellies` produced only **7004 bytes** total.
So the Shelly EM3 is the main source of ongoing chatter, but it does **not** currently look like a broker or network problem by itself.
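The count-versus-bandwidth distinction can be checked with a little arithmetic. The sketch below uses only the figures reported in this document; `share` and `by_root_prefix` are hypothetical helper names, and the grouping function assumes messages are available as `(topic, size)` pairs.

```python
from collections import Counter


def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100 * part / total, 1)


def by_root_prefix(messages):
    """Group (topic, size) pairs by the first topic level."""
    counts, sizes = Counter(), Counter()
    for topic, size in messages:
        root = topic.split("/", 1)[0]
        counts[root] += 1
        sizes[root] += size
    return counts, sizes


# Figures taken from the steady-state sample above.
steady_messages, steady_bytes = 1511, 42_949
shelly_messages, shelly_bytes = 1362, 7004

message_share = share(shelly_messages, steady_messages)  # 90.1 (% of messages)
byte_share = share(shelly_bytes, steady_bytes)            # 16.3 (% of bytes)
avg_shelly_payload = round(shelly_bytes / shelly_messages, 1)  # 5.1 bytes
```

So `shellies` carries about 90% of the messages but only about 16% of the bytes, with an average payload of roughly 5 bytes.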
### Large payloads
Outside the retained startup burst, large payloads were minimal:
- Only one large non-retained payload was observed in the steady-state sample:
- `frigate/stats` - 10,743 bytes
Early large-byte topics from `frigate` snapshots and `hass.agent` thumbnails appeared in the broad capture, but they did **not** show up as sustained heavy traffic in the steady-state sample.
### Topic naming oddities
Several Zigbee2MQTT topics contain spaces, for example:
- `zigbee2mqtt/Btcino coso salotto/availability`
- `zigbee2mqtt/Letty Condizionatore Ufficio/availability`
This is **not** a broker anomaly, but it is worth noting:
- it can make tooling and ad-hoc topic handling more brittle
- it increases the chance of mistakes in scripts, automations, and CLI work
If you want cleaner topic hygiene, consider slug-style friendly names for Zigbee2MQTT devices.
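A slug-style name could be derived along these lines. This is only a sketch: `slugify` is a hypothetical helper, and the underscore-separated lower-case convention is an assumption, not something Zigbee2MQTT requires.

```python
import re
import unicodedata


def slugify(friendly_name):
    """Lower-case, ASCII-only, underscore-separated version of a name."""
    # Decompose accented characters, then drop anything non-ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", friendly_name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Collapse every run of other characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


print(slugify("Letty Condizionatore Ufficio"))
```

Renaming a device changes its MQTT topic, so anything subscribed to the old topic would need updating at the same time.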
## Conclusion
### What looks healthy
- No dropped publishes
- No sign of broker backlog or unstable client churn
- Retained store is present but not unusually large
- Heap usage is not alarming
- Steady-state traffic volume is modest
### What stands out
1. **A retained-state burst on subscribe**, mostly from Zigbee2MQTT bridge metadata. This is expected behavior and not a live flood.
2. **A very chatty Shelly EM3 publisher** dominating message count. It is the main thing to watch, but at current byte volume it does not look harmful.
3. **Topic names with spaces** in Zigbee2MQTT. Not a performance issue, but a maintainability footgun.
## Recommendation
No urgent remediation is indicated from this pass.
If you want to reduce noise further, the best next place to look would be the publish frequency/config of `shellies/shellyem3-485519D91C40`, since that device is responsible for most of the ongoing message count.


@ -0,0 +1,298 @@
# Salotto Overview Switch Investigation
Date: 2026-04-17
Instance: Casa
The initial investigation was read-only: no write actions were taken while isolating the problem. Changes made afterwards are recorded under Applied Changes.
## Scope
This document investigates why the main **Salotto** light control in the **Overview** dashboard no longer works, while the individual Salotto lights still turn on correctly.
MQTT-level debugging was not needed to isolate the dashboard problem, because the Home Assistant entity, dashboard, history, and automation data were enough. It was used later to diagnose the unresponsive switch (see MQTT and Zigbee2MQTT Restore Diagnostics).
## Executive Summary
The Overview control is not directly toggling the Salotto light group.
Instead:
1. The card displays `light.luci_buone_salotto`.
2. Its tap action calls `switch.toggle` on `switch.nitori_salotto_1_left`.
3. The automation `automation.pulsanti_luce_salotto` listens for changes on that switch and then turns `light.luci_buone_salotto` on or off.
That means the dashboard button depends on an indirect path:
`Overview card` -> `switch.nitori_salotto_1_left` -> `automation.pulsanti_luce_salotto` -> `light.luci_buone_salotto`
Right now that path is out of sync:
- `light.luci_buone_salotto` is currently `off`
- `switch.nitori_salotto_1_left` is currently `on`
- the switch has not changed state since `2026-04-14T16:17:11Z`
- the light group has continued changing independently through `2026-04-17T09:59:53Z`
So the Overview card is using the light group as its displayed state, but it is controlling a different entity whose state no longer matches the light group. That is the reason the button appears broken.
## Confirmed Facts
### 1. The Casa instance was queried
The active Home Assistant instance returned:
- location name: `Home`
- base URL: `http://supervisor/core`
- size: about `1414` entities and `11` areas
This matches the expected Casa fingerprint.
### 2. The Overview dashboard Salotto control targets the wrong entity for a direct light toggle
In the `lovelace` dashboard, the main Salotto controls were found in multiple places, including:
- `.views[0].badges[4]`
- `.views[0].sections[0].cards[1]`
- `.views[9].sections[0].cards[3]`
All of them display `light.luci_buone_salotto`, but the action is:
```yaml
tap_action:
  action: perform-action
  perform_action: switch.toggle
  target:
    entity_id:
      - switch.nitori_salotto_1_left
```
The icon tap action is wired the same way.
So the card is not toggling the light entity it displays.
### 3. The switch is part of a Zigbee2MQTT wall switch device
`switch.nitori_salotto_1_left` belongs to device:
- device: `nitori_salotto_1`
- model: `Smart light switch - 3 gang without neutral wire`
- integration: `zigbee2mqtt`
Related entities:
- `switch.nitori_salotto_1_left`
- `switch.nitori_salotto_1_center`
- `switch.nitori_salotto_1_right`
### 4. The automation still links that switch to the Salotto light group
`automation.pulsanti_luce_salotto` is configured to react to state changes of the left switch and then control the group:
- if the switch is off -> `light.turn_off` on `light.luci_buone_salotto`
- if the switch is on -> `light.turn_on` on `light.luci_buone_salotto`
So the dashboard button currently relies on this automation path instead of directly toggling the group.
### 5. The switch and the light group are no longer aligned
Current state snapshot:
- `switch.nitori_salotto_1_left`: `on`
- `light.luci_buone_salotto`: `off`
Recent history shows:
- the switch last changed on `2026-04-14T16:17:11Z`
- the light group continued changing after that, including:
- `2026-04-15T15:28:57Z` -> `on`
- `2026-04-15T22:52:06Z` -> `off`
- `2026-04-16T16:03:40Z` -> `on`
- `2026-04-16T22:12:21Z` -> `off`
- `2026-04-17T09:57:13Z` -> `on`
- `2026-04-17T09:59:53Z` -> `off`
This confirms the light group is being controlled independently of the switch, so the dashboard button and the displayed light state can no longer be trusted to represent the same thing.
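The drift can be quantified directly from the timestamps listed above. The sketch below is illustrative; `parse_utc` is a hypothetical helper, and the data is copied from the history excerpt in this section.

```python
from datetime import datetime


def parse_utc(stamp):
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


switch_last_changed = parse_utc("2026-04-14T16:17:11Z")
light_changes = [
    ("2026-04-15T15:28:57Z", "on"),
    ("2026-04-15T22:52:06Z", "off"),
    ("2026-04-16T16:03:40Z", "on"),
    ("2026-04-16T22:12:21Z", "off"),
    ("2026-04-17T09:57:13Z", "on"),
    ("2026-04-17T09:59:53Z", "off"),
]

# Light-group changes that happened after the switch went quiet.
later = [
    (stamp, state)
    for stamp, state in light_changes
    if parse_utc(stamp) > switch_last_changed
]
# Time between the last switch change and the last light-group change.
drift = parse_utc(light_changes[-1][0]) - switch_last_changed
print(len(later), drift)
```

All six listed light-group changes fall after the last switch change, spanning a little under three days.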
### 6. The automation path itself is not missing
Recent traces exist for `automation.pulsanti_luce_salotto`, including runs on `2026-04-13` and `2026-04-14`, triggered by `switch.nitori_salotto_1_left`.
That means the automation definition is present and did run when the switch state changed. The issue is not a missing automation.
## Diagnosis
The Overview control broke because it mixes:
- **display state** from `light.luci_buone_salotto`
- **control action** on `switch.nitori_salotto_1_left`
This only behaves correctly while the switch state and the light group state stay synchronized.
They no longer do.
Once the switch stayed `on` while some other path turned the group `off`, pressing the Overview button stopped behaving like a normal room-light toggle. The card looks like a light control, but it actually drives a separate intermediate switch entity whose state has drifted from the light group.
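A toy model makes the failure mode concrete. It is a simplification under stated assumptions: `SalottoModel` is a hypothetical class, the automation is reduced to "copy the switch state onto the light", and the switch is assumed to accept every command.

```python
class SalottoModel:
    """Toy model of the original wiring: card -> switch -> automation -> light."""

    def __init__(self, switch="on", light="off"):
        self.switch = switch
        self.light = light

    def automation(self):
        # Mirrors the switch state onto the light group, as described
        # for automation.pulsanti_luce_salotto.
        self.light = self.switch

    def tap_card(self):
        # The card toggles the switch, not the light it displays.
        self.switch = "off" if self.switch == "on" else "on"
        self.automation()
        return self.light


# Start from the observed mismatch: switch on, light off.
model = SalottoModel(switch="on", light="off")
first = model.tap_card()   # switch on -> off, light stays off
second = model.tap_card()  # switch off -> on, light turns on
print(first, second)
```

Even with a fully responsive switch, the first tap after a desync does nothing visible, which is enough for the button to look broken.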
## Secondary Findings
- `light.salotto` also exists as a separate group entity that includes all Salotto lights.
- `light.luce_salotto` exists but is currently `unavailable` and appears to be a Tuya entity. It does not appear to be the Overview control target found in this investigation.
- The main failure is the dashboard action wiring, not a full outage of the Salotto light entities.
## Recommended Fix
The safest fix is to make the Overview Salotto card toggle the actual light group it displays instead of the wall-switch entity.
Good options:
1. Change the card action to toggle `light.luci_buone_salotto` directly.
2. If the desired room-level target is broader, use `light.salotto` directly instead.
Less robust option:
1. Keep the switch-based path and add more logic to keep the switch state synchronized with the light group.
That indirect design is the source of the breakage, so direct light control is the cleaner fix.
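The direct-control option amounts to rewriting the card's action block. The sketch below shows that transformation on a plain dictionary shaped like the YAML excerpt above. `retarget_tap_action` is a hypothetical helper, the `type: tile` value in the example is an assumption, and the real dashboard edit may have been made differently.

```python
import copy


def retarget_tap_action(card, light_entity):
    """Return a copy of a card config whose tap actions toggle light_entity."""
    fixed = copy.deepcopy(card)
    action = {
        "action": "perform-action",
        "perform_action": "light.toggle",
        "target": {"entity_id": [light_entity]},
    }
    # Only rewrite the action keys the card already defines.
    for key in ("tap_action", "icon_tap_action"):
        if key in fixed:
            fixed[key] = copy.deepcopy(action)
    return fixed


old_action = {
    "action": "perform-action",
    "perform_action": "switch.toggle",
    "target": {"entity_id": ["switch.nitori_salotto_1_left"]},
}
card = {
    "type": "tile",  # assumed card type, for illustration only
    "entity": "light.luci_buone_salotto",
    "tap_action": old_action,
    "icon_tap_action": copy.deepcopy(old_action),
}
fixed_card = retarget_tap_action(card, card["entity"])
```

Working on a copy leaves the original configuration available for comparison or rollback.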
## Applied Changes
The following Home Assistant changes were applied on the Casa instance during this pass:
1. The `lovelace` Overview Salotto controls were changed to toggle `light.luci_buone_salotto` directly instead of calling `switch.toggle` on `switch.nitori_salotto_1_left`.
2. A new automation, `automation.sync_salotto_control_switches`, was created to mirror the state of `light.luci_buone_salotto` back to:
- `switch.nitori_salotto_1_left`
- `switch.switch_cucina_neo_l2`
The intent of that automation is:
- when the light group turns on -> both control switches should be turned on
- when the light group turns off -> both control switches should be turned off
This preserves:
- direct and reliable dashboard control of the actual light group
- consistent switch state for the two wall-switch control paths
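The intent of the sync automation can be expressed as a small decision function. This is a sketch of the intended behaviour only; the actual automation lives in Home Assistant and its definition is not shown here. `plan_sync` is a hypothetical name.

```python
SYNCED_SWITCHES = (
    "switch.nitori_salotto_1_left",
    "switch.switch_cucina_neo_l2",
)


def plan_sync(light_state, switch_states):
    """Return the (service, entity_id) calls needed to mirror the light."""
    if light_state not in ("on", "off"):
        # Ignore states such as "unavailable" or "unknown".
        return []
    service = f"switch.turn_{light_state}"
    # Only touch switches whose state differs from the light group.
    return [
        (service, entity)
        for entity in SYNCED_SWITCHES
        if switch_states.get(entity) != light_state
    ]


calls = plan_sync(
    "off",
    {
        "switch.nitori_salotto_1_left": "on",
        "switch.switch_cucina_neo_l2": "off",
    },
)
print(calls)
```

Skipping switches that already match avoids redundant service calls. It should also let the loop with `automation.pulsanti_luce_salotto` settle, since that automation sets the light to the state it already has.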
## Additional Finding: the Nitori switch is currently stale at the service layer
After applying the config fix, a direct `switch.turn_off` call was sent to `switch.nitori_salotto_1_left` to reconcile its stale `on` state with the currently `off` light group.
The service call was accepted, but Home Assistant could not verify a state change.
Current evidence:
- `light.luci_buone_salotto` is `off`
- `switch.switch_cucina_neo_l2` is `off`
- `switch.nitori_salotto_1_left` still reports `on`
- `switch.nitori_salotto_1_left` has not changed state since `2026-04-14T16:17:11Z`
- Home Assistant system logs contain repeated warnings that `switch.nitori_salotto_1_left` is "missing or not currently available" when targeted by services
This means the original dashboard problem and the current switch-state problem are related but not identical:
1. **Dashboard problem:** fixed by changing the card to control the light group directly.
2. **Nitori state-sync problem:** still blocked because the `switch.nitori_salotto_1_left` entity is not currently accepting or reflecting Home Assistant service commands reliably.
## Practical Conclusion
The reliable control path is now:
`Overview card` -> `light.luci_buone_salotto`
The intended synchronization path is now:
`light.luci_buone_salotto` -> `automation.sync_salotto_control_switches` -> both wall-switch entities
However, the Nitori leg of that sync will only work once `switch.nitori_salotto_1_left` is healthy again at the Zigbee2MQTT/device layer.
So the configuration fix is in place, but full synchronization of the Salotto Nitori switch still depends on restoring normal commandability of that switch entity.
## MQTT and Zigbee2MQTT Restore Diagnostics
Broker-level MQTT diagnostics were run against the Casa Zigbee2MQTT base topic after creating the local file:
- `.local/mqtt-home.env`
### What Zigbee2MQTT still knows about the device
The retained `bridge/devices` record still contains `nitori_salotto_1`:
- friendly name: `nitori_salotto_1`
- IEEE address: `0xa4c1386a5b20e7a7`
- model: `TS0013`
- vendor: `Tuya`
- interview state: `SUCCESSFUL`
- supported: `true`
- type: `EndDevice`
So the device has not been removed from Zigbee2MQTT and still exists in its device database.
### Important difference vs the working kitchen switch
Using the same broker-level probe against the working fallback switch `Switch_Cucina_Neo` produced:
- a live topic payload on `zigbee2mqtt_2/Switch_Cucina_Neo`
- fresh state values such as `state_l2`
- a successful response on `zigbee2mqtt_2/bridge/response/device/configure`
Using the same probe against `nitori_salotto_1` produced:
- no live topic payload on `zigbee2mqtt_2/nitori_salotto_1`
- no availability payload
- no bridge log output
- no response on `zigbee2mqtt_2/bridge/response/device/configure`
This is a strong indicator that Zigbee2MQTT still has the device definition, but the device is not currently responding on the Zigbee network.
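The comparison above can be summarised as a probe with three possible outcomes. The sketch below builds the configure request and classifies what came back, without touching the network. It assumes the Zigbee2MQTT request/response topic layout named in this section and a JSON request body of the form `{"id": <friendly name>}`. `configure_request` and `classify_probe` are hypothetical helpers and not part of `scripts/mqtt_z2m_diag.py`.

```python
import json


def configure_request(base_topic, friendly_name):
    """Return (request_topic, response_topic, payload) for a configure probe."""
    request_topic = f"{base_topic}/bridge/request/device/configure"
    response_topic = f"{base_topic}/bridge/response/device/configure"
    payload = json.dumps({"id": friendly_name})
    return request_topic, response_topic, payload


def classify_probe(state_payload, configure_response):
    """Classify a probe from the raw JSON strings received (None if nothing)."""
    if configure_response is not None:
        status = json.loads(configure_response).get("status")
        return "responding" if status == "ok" else "configure failed"
    if state_payload is not None:
        return "state seen, configure unanswered"
    return "silent"


request = configure_request("zigbee2mqtt_2", "nitori_salotto_1")
# Nothing came back for the Nitori switch during the probe.
nitori_result = classify_probe(None, None)
print(request[0], nitori_result)
```

Running the same probe against a known-good device first, as was done with `Switch_Cucina_Neo`, separates a faulty request flow from an unresponsive device.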
### What this means
At this point the likely fault domain is one of:
1. the switch has lost effective connectivity to the Zigbee mesh
2. the switch is powered but stuck and not answering Zigbee commands
3. the device has fallen off the mesh badly enough that reconfigure requests never complete
This is not behaving like a dashboard, automation, or Home Assistant entity-registry issue anymore.
## Best Next Restore Step
The highest-probability recovery action now is:
1. physically power-cycle the Nitori switch circuit or otherwise restore power to the device
2. immediately retest Zigbee2MQTT commandability
3. if it still does not answer, re-pair or re-interview the device in Zigbee2MQTT while preserving the same friendly name if possible
Because the kitchen comparison proved the MQTT request flow is correct, there is no value in continuing to tweak the Home Assistant dashboard or automation config until the Nitori device starts answering Zigbee2MQTT again.
## Successful Recovery
The device was recovered without a power-cycle by putting `nitori_salotto_1` back into pairing mode while Zigbee2MQTT permit-join was open.
During the recovery watch, Zigbee2MQTT reported:
- `bridge/response/permit_join` with `status: ok`
- `bridge/event` with `type: device_announce` for `nitori_salotto_1`
- fresh live payloads again on `zigbee2mqtt_2/nitori_salotto_1`
Fresh MQTT payloads after re-announce included:
- `state_left: OFF`
- `state_center: OFF`
- `state_right: OFF`
- `backlight_mode: normal`
- linkquality around `134-145`
After the re-announce:
- Home Assistant updated `switch.nitori_salotto_1_left` back to `off`
- `switch.nitori_salotto_1_right` also recovered from stale state and updated to `off`
- the Salotto light group and kitchen fallback switch remained synchronized
Final validation:
1. The left Nitori button was pressed once after recovery.
2. `switch.nitori_salotto_1_left` changed from `off` to `on`.
3. `light.luci_buone_salotto` turned `on` immediately after.
4. `switch.switch_cucina_neo_l2` was then synchronized to `on` by `automation.sync_salotto_control_switches`.
So the switch is now restored at the Zigbee2MQTT layer and the end-to-end control path is working again:
`nitori_salotto_1_left` -> `automation.pulsanti_luce_salotto` -> `light.luci_buone_salotto` -> `automation.sync_salotto_control_switches` -> fallback switch state