[{"content":"This post documents the troubleshooting and repair of an IIoT ingress deployment. The target was straightforward: run EMQX 5.8.9 in Podman Rootless on a cloud host, then complete MQTT device access, HTTP-based dynamic authentication, webhook forwarding, and time-series persistence.\nIn practice, the path exposed four separate problems: emqx ctl could not reliably reach the running node inside the container, EMQX crashed at startup after switching to a prewritten cluster.hocon, external MQTT clients kept timing out, and the backend returned 403 Forbidden for M2M HTTP requests sent by EMQX. These symptoms belonged to four different layers: runtime control, configuration schema, network policy, and security middleware.\nNote: project names, service names, domains, IPs, paths, accounts, table names, request headers, log snippets, and command outputs in this article are all desensitized, abstracted, or rewritten placeholders. They preserve only the technical details relevant to the troubleshooting path.\nBackground And Goal Container runtime: Podman Rootless Broker: EMQX 5.8.9 Authentication mode: HTTP AuthN Data path: MQTT -\u0026gt; Rule Engine -\u0026gt; Webhook -\u0026gt; TimescaleDB Backend protection: browser-side CSRF defense + M2M token validation The intended workflow was simple: devices connect over MQTT, EMQX calls the backend for dynamic authentication, then telemetry is forwarded through a webhook and persisted in the time-series database.\nIncident Summary This was not a single-point failure. 
Four fault chains overlapped:\nPodman Rootless isolation made Erlang EPMD IPC unreliable, so emqx ctl could not serve as a reliable runtime configuration path EMQX 5.8.9 enforced stricter HOCON schema validation on bridge configuration, and legacy fields triggered unknown_fields The cloud security group did not allow inbound traffic on 1883, so external MQTT access was blocked The backend CSRF middleware protected all POST requests by default and accidentally blocked AuthN and webhook calls from EMQX The final remediation matched those four chains:\nstop relying on emqx ctl for dynamic loading and render a complete cluster.hocon before EMQX starts remove obsolete resource_opts fields open inbound 1883 in the security group bypass CSRF for /api/v1/iot/* and add constant-time X-EMQX-Token validation on /iot/ingest Troubleshooting Timeline Phase 1: Why Dynamic Configuration Was Unreliable The initial deployment plan started EMQX first and then injected HTTP AuthN and webhook configuration with:\nemqx ctl conf load --merge\nWhat the container repeatedly returned was:\nNode emqx@\u0026lt;service\u0026gt;-emqx not responding to pings The first useful conclusion was that the problem was not yet in business logic or payload format. It was in the runtime control plane. If emqx ctl could not reliably reconnect to the running Erlang node, the entire hot-load approach had no stable foundation.\nHypothesis A: The node hostname was wrong The first guess was a node-name resolution issue, so the Compose file was updated with an explicit hostname. That made the node naming more consistent, but emqx ctl was still unreliable. Hostname alignment helped remove noise, but it was not the root cause.\nHypothesis B: Compose expanded the sed placeholder too early Another confirmed issue was that ${VAR} in the command string was expanded too early by podman-compose, which broke the sed replacement target and caused token injection to fail. 
Replacing that pattern with a static placeholder fixed token rendering, but emqx ctl was still unstable. This showed that template replacement had a defect, but it still was not the main reason the deployment path failed.\nPhase 2: Bypass emqx ctl and Write Config Before Startup Once the problem was narrowed to runtime IPC, the next step was to avoid that control path entirely. The rendered configuration was written directly to:\n/opt/emqx/data/configs/cluster.hocon\nand EMQX was started only after that file was in place. This bypassed emqx ctl, but immediately exposed a second blocker: EMQX crashed during startup.\nThe decisive log lines were:\n[error] failed_to_check_schema: emqx_conf_schema [error] #{reason =\u0026gt; unknown_fields, path =\u0026gt; \u0026#34;bridges.webhook.\u0026lt;service\u0026gt;_backend_ingest_bridge.resource_opts\u0026#34;, unknown =\u0026gt; \u0026#34;pool_size,buffer_type,max_retries,...\u0026#34;} This made the next conclusion clear. Pre-rendering the configuration was the right direction, but the template still contained legacy fields that were no longer accepted by the current EMQX version.\nHypothesis C: Removing only part of resource_opts would be enough The first attempt removed fields such as buffer_type and max_retries while keeping pool_size. EMQX still failed schema validation. That showed the issue was not a bad value in one field; the resource_opts subtree itself no longer matched the current version. EMQX started normally only after the whole block was removed.\nA second issue discovered at the same stage This phase also exposed a configuration-overwrite risk. Once cluster.hocon was written as a full replacement, a template that only contained webhook rules could silently wipe previously configured HTTP AuthN settings. That would let the broker start but break device authentication.\nThe final template therefore declared three parts together:\nHTTP AuthN rule engine rules webhook bridge That change was not cosmetic. 
It ensured startup configuration remained atomic instead of being split across multiple mutation paths.\nPhase 3: The Broker Was Up, but External Devices Still Timed Out Once EMQX started normally and container logs showed:\nListener tcp:default on 0.0.0.0:1883 started external clients were still timing out. At that point, the problem was no longer inside the broker. The investigation moved outward to the network boundary.\nmosquitto_pub from outside the host still timed out against \u0026lt;IP\u0026gt;:1883, which narrowed the problem down to the IaaS layer. The final cause was a missing inbound 1883 rule in the cloud security group. Once that rule was added, MQTT connectivity from devices recovered immediately.\nThis is a common failure pattern: a container can bind the port correctly while the real access path is still blocked farther out.\nPhase 4: M2M HTTP Requests Were Blocked by Backend Security Middleware After MQTT connectivity was restored, EMQX still received 403 Forbidden on HTTP AuthN and webhook requests. That led the investigation into the backend security layer, where the root cause turned out to be the global CSRF middleware.\nThe backend originally enforced strict Origin and Referer checks for browser traffic. That is a reasonable default for web-facing endpoints, but it does not fit machine-to-machine requests from EMQX. The issue is not that such requests can never carry those headers. It is that they do not operate within the browser trust model and therefore cannot reliably satisfy CSRF checks built around Origin and Referer.\nThe resolution was not to turn CSRF off globally, but to split the security model:\nbypass CSRF for the /api/v1/iot/* path prefix validate X-EMQX-Token on /iot/ingest compare the token with crypto/hmac.Equal to avoid timing side channels This kept browser-facing protection in place while switching the IoT ingress path to a mechanism that actually matches M2M traffic.\nRoot Cause Analysis 1. 
Runtime Friction Between Podman Rootless and Erlang EPMD emqx ctl depends on Erlang distribution and EPMD node discovery. In a Rootless container environment, user namespaces and the virtual network stack alter loopback and local IPC behavior enough that this control path can become unreliable. The problem was not that the command syntax was wrong. The problem was that the chosen runtime control mechanism was not a stable fit for this environment.\n2. EMQX 5.8.x Tightened Acceptance of Legacy HOCON Fields Older webhook bridge configurations often used fields such as resource_opts.pool_size, buffer_type, and max_retries. In 5.8.x, those paths are no longer accepted in the same way. The broker now performs strict startup validation and rejects unknown fields instead of ignoring them, so legacy configuration can block boot completely.\n3. Network Policy and Application Logs Can Drift Apart The fact that EMQX listened successfully on 1883 only proved that the broker had bound the local socket. It did not prove that the end-to-end ingress path from devices was open. When clients time out and the application stays quiet, the next checks need to move outward to security groups, firewall rules, and external routing.\n4. Browser Security Logic Should Not Be Applied Directly to M2M Traffic CSRF is designed for browser-originated traffic. AuthN and webhook calls from EMQX are M2M requests. If the same browser-oriented checks are applied to both classes of traffic, legitimate broker calls will be treated as suspicious simply because they do not carry browser semantics. The correct fix is not \u0026ldquo;less security,\u0026rdquo; but a different security model for a different caller class.\nResolution The final solution had four parts.\n1. Render A Complete cluster.hocon Before Startup The hot-load approach was abandoned. Instead, the container entrypoint renders the final configuration and writes it before EMQX starts. This removes the dependency on unstable runtime IPC.\n2. 
Merge AuthN, Rule Engine, and Webhook Into One Template The final template contains:\nHTTP AuthN telemetry forwarding rules webhook bridge and removes the incompatible resource_opts block entirely.\n3. Open the External Network Path After verifying EMQX was listening on 0.0.0.0:1883, the cloud security-group rule was added so devices could actually reach the broker from outside.\n4. Split the M2M Security Path from Browser Security The backend now bypasses CSRF on /api/v1/iot/* and enforces a shared-secret check on /iot/ingest. The two request classes are now protected differently:\nbrowser endpoints: CSRF remains enabled IoT M2M endpoints: X-EMQX-Token is required That makes the backend security boundary more explicit and avoids mixing browser and device traffic under one validation model.\nValidation Validation was completed in three layers.\n1. Backend Middleware Validation Tests for CSRF and token middleware confirmed:\nordinary web/API requests still return 403 when Origin or Referer is missing or incorrect the /api/v1/iot/* prefix bypasses CSRF /iot/ingest rejects missing or invalid tokens 2. End-to-End MQTT Validation An external publish test was executed with:\nmosquitto_pub -d -h \u0026lt;Domain\u0026gt; -p 1883 \\ -i \u0026#34;\u0026lt;Device_SN\u0026gt;\u0026#34; -u \u0026#34;\u0026lt;Device_SN\u0026gt;\u0026#34; -P \u0026#34;\u0026lt;Token\u0026gt;\u0026#34; \\ -t \u0026#34;telemetry/fms/\u0026lt;Device_SN\u0026gt;\u0026#34; \\ -m \u0026#39;{\u0026#34;fcs_speed_rpm\u0026#34;: 7766, \u0026#34;fcs_soc\u0026#34;: 88.9, \u0026#34;fms_fault_code\u0026#34;: 0}\u0026#39; The observed result confirmed that:\ndevice connection succeeded HTTP AuthN passed the message entered the broker and was processed by the rule engine 3. 
Persistence Validation A database query confirmed successful writes into the time-series table, while the _iot data-plane role remained isolated from control-plane permissions.\nTakeaways And Next Steps This troubleshooting cycle leaves a few clear engineering lessons:\nunder Rootless Podman, prefer declarative configuration over runtime Erlang IPC whenever both are possible when middleware versions change, advanced tuning fields are more fragile than baseline functional settings, so validation should start from a minimal config \u0026ldquo;the service is listening\u0026rdquo; does not mean \u0026ldquo;the service is reachable\u0026rdquo;; ingress must be verified layer by layer out to the security group or firewall browser security and M2M security should be split explicitly instead of forced into one middleware model At the current stage, the single-node template-rendering solution is sufficient for MVP verification. If the system moves toward production-grade availability, the next concerns are already clear:\nbroker clustering and centralized configuration management asynchronous buffering between webhook and backend migration from shared-secret authentication toward mTLS or a stronger device identity model The value of this repair is not that one startup error disappeared. The real value is that a failure chain spread across runtime, configuration, network, and security was turned into an ingress path that is understandable, testable, and repeatable.\n","permalink":"https://intent.me/en-us/blog/tech/iiot-emqx-rootless-deployment-postmortem/","summary":"A postmortem of an IIoT ingress deployment failure: under Podman Rootless, EMQX 5.8 exposed unstable Erlang IPC, HOCON schema validation failures, a blocked security-group port, and M2M requests rejected by CSRF middleware. 
This post covers the confirmed causes, fixes, and validation path.","title":"IIoT Ingress Postmortem: Troubleshooting EMQX 5.8 Under Podman Rootless"},{"content":"In a 100ms polling workload, CPU usage of the process hosting TrafficMonitor stayed around 3.6%~4.2% for a long period. Beyond the elevated resource cost, the UI also showed visible refresh lag and less responsive scrolling during high-frequency traffic. After comparing two implementations, the optimized version brought CPU usage down to 0.1%~0.6%, while noticeably reducing UI stutter under continuous polling.\nThe gain did not come from a single hot function. It came from a broader engineering simplification: cutting intermediate layers, state machines, and unnecessary wakeups from high-frequency paths so the system could return to a shape closer to its intended responsibility boundary.\nThe optimization discussed here comes from the open-source tool Modbus-Tools. If you want the broader product context first, the introduction article is a better starting point.\nBackground And Conclusion This optimization can be reduced to two core conclusions:\nTrafficMonitor returns to being a lightweight traffic viewer instead of acting like a full logging system worker threads truly block when no event is present, rather than waking up on a fixed time slice At the code level, the main changes fall into four areas:\nTrafficMonitorWidget moves from an event model back to direct text append ModbusTcpView and ModbusRtuView remove polling summary and suppression logic ModbusClient switches from wait_for + 5ms slice to wait_until SerialChannel and TcpChannel remove thread diagnostic logs from high-frequency paths Key Changes 1. Shrink The UI Logging Path In the more complex version, TrafficMonitor introduced a full event abstraction. 
The logging path looked like this:\ncaller -\u0026gt; build event object -\u0026gt; classify level/mode -\u0026gt; render text -\u0026gt; write to list\nThe optimized version removes that intermediate layer and keeps only direct entry points such as appendTx(), appendRx(), appendInfo(), and appendWarning(). The path becomes:\ncaller -\u0026gt; format text directly -\u0026gt; append to QListWidget\nThis cuts object creation, event classification, second-pass rendering, and state synchronization cost. It is one of the main sources of the CPU reduction.\n2. Move From Model/View Back To A Direct List The complex TrafficMonitorWidget was not just a visual control. It also carried a full data layer, including:\nQListView + QAbstractListModel pendingEvents_ eventHistory_ flushTimer_ rebuildScheduled_ pausedEventCount_ This design is more feature-complete, but under 100ms high-frequency refresh it keeps amplifying costs in:\nevent enqueueing and batch flushing text formatting and filter checks history rebuilding and scroll synchronization The optimized version goes back to direct append on QListWidget. In practice, that means removing a resident data-management layer. For high-frequency log display, the benefit is immediate.\n3. Remove High-Frequency Extra Features And Polling State Machines The complex version added several observability-oriented features:\nPause View Raw Frames Level Filter paused counters visible list rebuild Poll Summary state machine Each feature is reasonable on its own, but together they turn every log line into an event that must be evaluated and scheduled, instead of plain text that can be displayed directly.\nThis is especially visible in the Poll Summary logic inside ModbusTcpView and ModbusRtuView. It reduced visible log volume, but did not reduce total system work. Each polling cycle still had to update counters, maintain state, check timing windows, and format summary strings. 
Once this logic was removed, the high-frequency path became much shorter.\n4. Make The Wait Model Truly Blocking Outside the UI, the other key change was the wait strategy inside ModbusClient.\nThe complex version used cv_.wait_for(lock, 5ms, ...) together with an event pumping pattern. That means the thread woke up every 5ms even when no real event existed. Under a 100ms polling workload, that creates a large amount of useless wakeup traffic.\nThe optimized version standardizes on:\ncv_.wait_until(lock, deadline, predicate)\nThe result is:\nthreads stay asleep when no event is present wakeup happens only on timeout or when the condition is satisfied current-thread event pumping is no longer done periodically This directly cuts worker-thread spinning and is the other main source of improvement.\n5. Remove Debug Output From High-Frequency Paths In SerialChannel.cpp and TcpChannel.cpp, the complex version kept thread-context diagnostics such as:\nthreadToken(...) logThreadContextOnce(...) thread logs inside open(), onReadyRead(), and onConnected() Each individual log is cheap, but these calls sit on high-frequency send/receive paths. Over time they amplify formatting and branch overhead. Removing them is a typical example of reducing weight on a hot path.\nWhy The Gain Was So Visible The value of this change is not that one function dropped from 10ms to 2ms. The real value is that multiple multiplicative terms were removed from a high-frequency path.\nA single polling send/receive cycle in the complex version usually involved:\nI/O callback event object construction level and mode checks queue operations timer wakeup batch rendering filter checks history rebuild auto-scroll The optimized version is much closer to:\nI/O callback text formatting direct append to the list At a fixed 100ms call frequency, this difference in path length keeps compounding. The final result is a CPU drop from 3.6%~4.2% to 0.1%~0.6%. 
Since the main thread no longer carries continuous pressure from event dispatch, batch refresh, and list rebuild, UI interaction also returns to a much steadier state.\nEngineering Takeaways This optimization leaves a few clear lessons:\nin high-frequency scenarios, feature complexity itself is a performance cost GUI performance issues often come not from painting, but from event systems and state machines before painting if a wait can block, do not poll; if a wakeup can be condition-based, do not wake on time slices the clearer a logging component\u0026rsquo;s responsibility is, the easier it is to keep both implementation and performance stable In the end, this was not a story about clever micro-optimizations. It was a story about shrinking system boundaries again. Once TrafficMonitor returned to a lightweight viewer and ModbusClient returned to truly blocking waits, the system no longer paid high-frequency cost for low-frequency features, and the overall load dropped accordingly.\nFor a broader view of the tool itself, see the Modbus-Tools introduction. Source code is available on GitHub: mingyucheng692/Modbus-Tools.\n","permalink":"https://intent.me/en-us/blog/tech/trafficmonitor-cpu-optimization-postmortem/","summary":"A postmortem on CPU optimization under a high-frequency polling workload: by shrinking the UI logging path, removing high-frequency state machines, and fixing the wait model, CPU usage dropped from 3.6%~4.2% to 0.1%~0.6%.","title":"TrafficMonitor CPU Optimization Postmortem: An Engineering Simplification from 4% to 0.6%"},{"content":"This post documents the investigation and repair of a failed container recovery path. In a service stack managed with Rootless Podman, a database container exited unexpectedly and was not restarted as intended. 
That initial miss then exposed several adjacent failures: startup timeouts, services reported as inactive while containers were still online, and rate-limit lockouts after repeated restart attempts.\nNote: project names, service names, paths, domains, usernames, host identifiers, timestamps, PIDs, command outputs, and log snippets in this article are all desensitized, abstracted, or rewritten placeholders. They preserve only the technical details relevant to the troubleshooting path and do not map to any real production identifiers.\nRuntime: Rootless Podman 4.9.x Process supervision: Systemd user units Service components: TimescaleDB, Redis, Go backend, Nginx frontend Intended capability: let Systemd recover containers automatically after abnormal exits Incident Summary Trigger context: after the database container exited unexpectedly, the first on-site state was that the database Unit became inactive while the other service Units still appeared active Observable symptoms: once the repair work began, the database service recovered to active, but the other services were not yet reliably handed over to systemd, which led to startup timeouts, container state drifting away from Unit state, and rate-limit lockouts on some services Confirmed contributing factors: the main Unit had been edited in place with sed -i and was later corrupted after a PowerShell escaping truncation; Compose and Systemd shared control of the runtime; generated Unit files still required post-processing; and the recovery scripts did not yet fully cover dependency cleanup and restart-failure handling Fix direction: stop editing the main Unit with sed -i, regenerate .service files and patch them in Python, explicitly hand runtime control back to systemd, and add state polling plus a final health check Current result: in the current environment, the recovery path now runs in a fixed order and exposes a clear diagnostic entry point on failure; readiness semantics and observability still need more 
work Symptoms and Impact The initial on-site state was not \u0026ldquo;everything went down at once.\u0026rdquo; The database Unit first turned inactive while the other services still looked active. The picture became more complicated during the repair and handoff phase:\nsome services showed timeout-based restart loops in systemctl --user status some containers were already Up while the corresponding systemd services were still inactive some services had failed often enough that systemd stopped retrying entirely That meant this was not just one container failure. The real issue was drift between three things: actual container state, systemd supervision state, and the assumptions baked into the deployment scripts. This write-up focuses on why the recovery path failed and how the scripts were adjusted afterward. It does not try to expand on the business-side reason the database exited in the first place.\nInvestigation Approach: First Identify Who Actually Owns the Process We did not begin with a single error line. The first question was more basic: when a container exits, who is actually responsible for noticing the failure, deciding it is unhealthy, and bringing it back. Following that thread eventually exposed four overlapping problems.\nProblem 1: The Main Unit Was Edited In Place and Eventually Corrupted The first visible issue was that the database unit file had been damaged into an empty file:\nsystemd[\u0026lt;pid\u0026gt;]: /home/\u0026lt;deploy-user\u0026gt;/.config/systemd/user/container-\u0026lt;db-service\u0026gt;.service:1: Missing \u0026#39;=\u0026#39;. 
Inspecting the file itself showed that it had been truncated to 0 bytes:\nls -l /home/\u0026lt;deploy-user\u0026gt;/.config/systemd/user/container-\u0026lt;db-service\u0026gt;.service\n# -rw-rw-r-- 1 \u0026lt;deploy-user\u0026gt; \u0026lt;deploy-user\u0026gt; 0 \u0026lt;timestamp\u0026gt; container-\u0026lt;db-service\u0026gt;.service\nfile /home/\u0026lt;deploy-user\u0026gt;/.config/systemd/user/container-\u0026lt;db-service\u0026gt;.service\n# container-\u0026lt;db-service\u0026gt;.service: empty\nAt this point we can describe the trigger chain more concretely. The older script did in fact modify the main Unit with sed -i. When that command was dispatched from PowerShell, escaping was handled incorrectly, the target file was not written back successfully, and the main Unit was truncated to 0 bytes. The issue was not merely that the script used sed -i; it was that the main Unit was treated as something safe to edit in place, and that choice was then amplified by cross-shell escaping differences.\nThe older script path looked roughly like this:\n# old approach: edit the main Unit directly\nsed -i \u0026#39;s/Restart=always/Restart=no/g\u0026#39; \u0026#34;$service\u0026#34;\nThe design problem is clear enough on its own: the main service file was being used both as the runtime definition and as an operational toggle surface. 
Once the main Unit breaks, parsing, startup, and restart policy all fail together.\nThe later adjustment was straightforward:\nkeep the main service file as a generated artifact only, and stop editing it with sed -i\nmove restart-policy switches and similar runtime knobs into .service.d/watchdog.conf\nhandle compatibility patches for generated .service files in Python, while letting the script overwrite Drop-ins explicitly\nIn the current script, the Drop-in write path looks roughly like this:\ncat \u0026gt; \u0026#34;$DROP_IN_DIR/watchdog.conf\u0026#34; \u0026lt;\u0026lt; EOF\n[Service]\nRestart=always\nRestartSec=10s\nExecStartPre=-/usr/bin/podman rm -f \u0026lt;db-service\u0026gt;\nEOF\nThis at least restores a cleaner boundary: the main Unit is treated as a rebuildable artifact, while runtime switches live in Drop-ins. Based on the current scripts and subsequent field checks, we have not seen the same \u0026ldquo;main Unit damaged by in-place edits\u0026rdquo; problem recur.\nProblem 2: Systemd Reported Failure Even Though the Container Was Already Running\nThe second issue was subtler. During the incident, systemd marked some services as startup timeouts:\nActive: activating (auto-restart) (Result: timeout)\nMain PID: \u0026lt;pid\u0026gt; (code=exited, status=0/SUCCESS)\nAt the same time, podman ps showed that the container itself was already up. In other words, the process contract had split.\npodman generate systemd uses Type=notify by default. 
That means systemd does not consider process existence alone to be enough for startup success; it expects an sd_notify readiness signal from the running container back to the host.\nBefore the recovery path was patched, the generated Unit still contained the default fields:\nType=notify NotifyAccess=all When we re-collected the environment after the repair, freshly generated output still carried that default, while the Unit that actually went live after the fix had already been changed to Type=simple:\n# default generated result during the incident path Type=notify # effective Unit after the repair went live Type=simple What we can confirm from the incident is this: while services were still using the default generated Unit with Type=notify, containers could already be Up and yet the corresponding Unit would still be treated as timed out and restarted by systemd. The repair changed generated Units from Type=notify to Type=simple, shifting status recognition back to \u0026ldquo;systemd tracks the foreground process directly.\u0026rdquo;\nThe boundary here matters. This is not proof that notify is categorically unusable in all Rootless scenarios. It only shows that, in this environment, the readiness semantics delivered by notify were not stable enough to rely on. For these long-running foreground processes, Type=simple was the safer and easier-to-debug choice.\nProblem 3: Some Containers Were Not Actually Under Systemd Control More precisely, this dangerous state showed up during the repair process rather than in the initial incident snapshot: the database Unit had already recovered to active, while the other service Units still showed inactive (dead).\nThe state on the host looked roughly like this:\npodman ps # ... some service containers might already be back up systemctl --user list-units \u0026#39;container-*.service\u0026#39; # ... container-\u0026lt;db-service\u0026gt;.service was active # ... 
the other service units were inactive (dead) The reason was that those containers had not been started by systemd in the first place. They had been launched earlier through podman-compose up -d. In that state, even if the container process already exists, systemd still does not own it and therefore cannot take over cleanly, detect failure consistently, or recover it.\nThat exposed an easy-to-miss prerequisite: supervision capability depends not only on whether a Unit file exists, but on whether the process is actually held by systemd. If runtime control still belongs to Compose, then \u0026ldquo;restart configuration exists\u0026rdquo; and \u0026ldquo;recovery really works\u0026rdquo; are not the same thing.\nAn implementation detail also matters here: the current deployment script does not eliminate Compose entirely. It still uses podman-compose for build and cold start, then calls watchdog-enable.sh to generate and patch Units, explicitly stops the Compose-launched containers, and finally hands control over through ordered systemctl --user start calls. The issue was never \u0026ldquo;Compose appeared in the script\u0026rdquo;; it was whether Compose still owned the runtime process when recovery was expected. 
The later change made that handoff explicit:\nstop existing non-supervised containers and let systemctl --user start take over add state checks around the handoff so the system does not silently fall back into \u0026ldquo;container is running but systemd does not know it\u0026rdquo; Problem 4: Generated Units and Recovery Scripts Still Needed More Detail Work 1) Generated Units Still Required Post-Processing After switching back to podman generate systemd --new, at least three categories of issues showed up:\nparameters introduced by tools such as podman-compose can leak into the generated result; for example, if the original container was created with -d, podman generate systemd --new will carry that forward, which makes podman run -d exit immediately from systemd\u0026rsquo;s point of view and prevents it from tracking the real container process Redis is another typical case: if the original configuration disables dangerous commands with an empty argument such as rename-command \u0026quot;\u0026quot;, the generator silently drops the empty string and turns the final command into redis-server --rename-command FLUSHALL, which causes startup-parameter errors the default Type=notify value would reintroduce the same Rootless timeout problem The Redis case is directly visible by comparing original container arguments with the generated output:\n# original container arguments command: - --rename-command - \u0026#34;FLUSHALL\u0026#34; - \u0026#34;\u0026#34; # unpatched generated result redis-server --rename-command FLUSHALL Even after recovery, rerunning podman generate systemd --new --name \u0026lt;redis-service\u0026gt; still shows the empty string being dropped, while the effective Unit already has it restored:\n# regenerated output after recovery --rename-command FLUSHALL # effective Unit after the fix went live --rename-command FLUSHALL \u0026#34;\u0026#34; -d is the same kind of problem. 
Comparing regenerated output with the effective Unit after the fix shows that the default generated result still contains -d, while the Unit used for supervision removes it:\n# regenerated output after recovery ExecStart=/usr/bin/podman run \\ ... -d \\ --sdnotify=conmon \\ # effective Unit after the fix went live ExecStart=/usr/bin/podman run \\ ... --sdnotify=conmon \\ The current watchdog-enable.sh workflow is therefore: rerun podman generate systemd --new --name ..., then patch the generated result in Python. In practice, it does three things:\nchange Type=notify to Type=simple remove any leftover -d from ExecStart restore Redis rename-command \u0026quot;\u0026quot; so empty arguments are not lost during generation Python was chosen for a specific reason: it handles multi-line commands split with trailing \\ more safely and predictably.\n2) Dependency Relationships Affect Cleanup of Failed Containers Before the database service could recover, the script first had to run podman rm -f to remove the old container. But because upstream services still held dependency references, the delete returned exit 125. That means single-container recovery is not isolated; the dependency graph can push back on cleanup.\nThe direct error looked like this:\nProcess: ExecStartPre=/usr/bin/podman rm -f \u0026lt;db-service\u0026gt; (code=exited, status=125) The current script handles this by putting dependency cleanup into the Drop-in:\nfor the database service, ExecStartPre first stops dependent upper-layer services, then runs podman rm -f \u0026lt;db-service\u0026gt; for the other services, each one only cleans up its own container This is not a generic template. It is just the scripted handling required by this service graph.\n3) The Recovery Path Cannot Rely on Fixed sleep The earlier script did in fact rely on fixed sleep windows to wait for dependencies.\nThe current deployment script still keeps a small amount of fixed waiting in the Compose cold-start phase. 
But after control is handed to systemd, startup is now ordered explicitly through systemctl --user start, with wait_for_service() polling is-active and is-failed, and printing the relevant journalctl output when a failure or timeout occurs. That is far easier to debug than blind waiting.\nThe final startup order is therefore made explicit:\n\u0026lt;db-service\u0026gt; -\u0026gt; \u0026lt;redis-service\u0026gt; -\u0026gt; \u0026lt;backend-service\u0026gt; -\u0026gt; \u0026lt;frontend-service\u0026gt; After all services are handed over, the script also adds one application-level health check to verify that \u0026ldquo;active\u0026rdquo; is at least moving closer to \u0026ldquo;externally usable.\u0026rdquo; That is a compensating measure for the Type=simple tradeoff, not a deep readiness probe on every service startup.\nThe wait logic after handoff looks roughly like this:\nsystemctl --user start container-\u0026lt;db-service\u0026gt;.service for i in $(seq 1 60); do systemctl --user is-active --quiet container-\u0026lt;db-service\u0026gt;.service \u0026amp;\u0026amp; break systemctl --user is-failed --quiet container-\u0026lt;db-service\u0026gt;.service \u0026amp;\u0026amp; exit 1 sleep 1 done 4) Systemd Rate Limiting Can Lock Services Out When dependencies are still not ready, Frontend may repeatedly fail and restart. For example, Nginx tries to resolve the upstream name for \u0026lt;backend-service\u0026gt; at startup time. Once restart attempts exceed StartLimitBurst within the configured window, systemd stops retrying. At that point, even if dependencies become healthy later, the service will not come back automatically unless we reset the failure state explicitly:\ncontainer-\u0026lt;frontend-service\u0026gt;.service: Start request repeated too quickly. container-\u0026lt;frontend-service\u0026gt;.service: Failed with result \u0026#39;exit-code\u0026#39;. 
On the Nginx side, the application-level evidence is often more explicit:\n[emerg] 1#1: host not found in upstream \u0026#34;\u0026lt;backend-service\u0026gt;\u0026#34; in default.conf:31 systemctl --user reset-failed container-\u0026lt;frontend-service\u0026gt;.service deploy-all.sh now inserts a reset-failed call before starting Frontend to clear any residual lockout state.\nConfirmed Contributing Factors in This Incident On the surface, the incident looked like a database failure followed by drifting service and supervision states. Looking back, it was really a stack of contributing issues:\nthe main Unit was edited directly by scripts, creating a risk of service-definition corruption podman-compose startup and systemd supervision coexisted, splitting container state from supervision state the output of podman generate systemd was not directly runnable and still needed Type, -d, and argument patching the recovery scripts initially did not cover dependency cleanup, rate-limit recovery, and final health validation thoroughly enough This was not one flag set incorrectly. The recovery path had breaks across definition, handoff, and execution. 
The repair therefore was not about fixing one error line; it was about reconnecting runtime ownership, generation-time patching, and failure-time diagnostics into a single workable path.\nRemediation Already Landed Combining watchdog-enable.sh and deploy-all.sh, the practical changes now in place are roughly:\nthe deployment script validates execution user, HOME, XDG_RUNTIME_DIR, and DBUS_SESSION_BUS_ADDRESS first, reducing the chance that Rootless user units fail because of an inconsistent runtime environment the deployment phase still uses podman-compose for build and cold start, but the handoff phase explicitly stops those containers and gives ordered control back to systemd watchdog-enable.sh backs up old Units before regenerating .service files generated .service files are now patched in Python in three ways: change to Type=simple, remove -d, and restore Redis empty arguments; this also avoids repeating the PowerShell escaping issue that broke the earlier sed -i path Drop-ins now write Restart=always, RestartSec, StartLimit*, and service-specific ExecStartPre after handoff to systemd, services start in the fixed order \u0026lt;db-service\u0026gt; -\u0026gt; \u0026lt;redis-service\u0026gt; -\u0026gt; \u0026lt;backend-service\u0026gt; -\u0026gt; \u0026lt;frontend-service\u0026gt; and wait_for_service() checks is-active and is-failed; failures and timeouts print the corresponding journalctl before Frontend is started, the script runs reset-failed once to clear any leftover rate-limit lockout after all services are under systemd control, the script adds one application-level health check so the system is verified beyond mere process liveness This is still an engineering solution under a specific set of constraints: single host, Rootless Podman, and user units. It solves who owns the process, how recovery is executed, and where to look when it fails. 
It is not a claim of generalized high availability.\nHow the Fix Is Verified compare podman ps with systemctl --user list-units before and after handoff, confirming that we no longer leave the system in a state where containers are running but the corresponding Units are inactive use wait_for_service() to poll is-active and is-failed, exposing failures during startup instead of hiding them behind fixed sleep print the relevant journalctl immediately on timeout or failure so each problem maps back to a specific Unit add a final application-level /health check after all services are handed over, so external availability is not inferred from Type=simple process liveness alone A later verification pass showed that all four containers and all four user units were back in sync:\npodman ps \u0026lt;db-service\u0026gt; / \u0026lt;redis-service\u0026gt; / \u0026lt;backend-service\u0026gt; / \u0026lt;frontend-service\u0026gt; all Up systemctl --user list-units \u0026#39;container-*.service\u0026#39; all four corresponding Units active (running) These checks mainly cover whether systemd truly owns the process, whether failures are exposed in time, and whether the system becomes externally usable after handoff. 
They still do not cover finer-grained readiness signals or ongoing telemetry collection.\nConstraints Made Explicit After This Incident the main Unit should remain a generated artifact only, while runtime switches belong in Drop-ins; this is not ceremonial separation, but a way to avoid breaking parsing, startup, and restart behavior all at once when the main file is damaged podman generate systemd output cannot be treated as ready to run; generation, patching, and daemon reload must be considered one continuous step, and in this case Type=notify, -d, and empty-argument handling all fall into mandatory post-generation compatibility work deployment control and runtime control must be separated; Compose may still be useful for build and cold start, but runtime ownership must be handed back explicitly to systemd, otherwise \u0026ldquo;restart config exists\u0026rdquo; and \u0026ldquo;recovery actually works\u0026rdquo; remain conflated a recovery path is not complete just because Restart=always exists; it must also cover dependency cleanup, rate-limit recovery, state polling, and a final health check so the failure path remains repeatable and diagnosable Follow-Up Work The current repair closes several known breaks in the recovery path, but some risks are still not fully covered and need follow-up:\nType=simple plus is-active / is-failed and the final /health check close the gap between \u0026ldquo;process is alive\u0026rdquo; and \u0026ldquo;service is actually usable,\u0026rdquo; but there is still no finer-grained readiness signal; if a service becomes active briefly and only then exposes an internal initialization failure, the current scripts may still detect it late the solution still depends on podman generate systemd plus script-level patching; if generated output format, argument expansion, or Podman behavior changes across versions, the current patch logic can break, which is why a move to Quadlet or another declarative path still deserves separate evaluation 
current validation remains mostly inside scripts; failure counts, exit codes, container state, and rate-limit hits are not yet collected, shipped, or alerted on, which means the broader system still lacks continuous observability and similar problems may remain reactive instead of proactively visible Direct Improvements from This Troubleshooting Cycle process ownership and restart responsibility are now much clearer there is now an explicit standard for deciding whether generated Units are directly usable the script now provides a stable entry point for where to inspect logs and where execution should stop on failure the gap between \u0026ldquo;service is active\u0026rdquo; and \u0026ldquo;service is externally usable\u0026rdquo; is now covered by an application-level check These changes do not mean the whole setup has become a generalized HA design. They do mean the previous state, where restart configuration existed but recovery was still unreliable, has been turned into a path that is more testable, more explainable, and more repeatable.\nClosing Note The core takeaway from this incident is simple: writing Restart=always in configuration does not mean the recovery path is complete. Main Unit stability, who actually starts the container, whether generated output is patched, and whether dependency and rate-limit behavior are handled in script all affect the final result.\nAt least in this single-host Rootless Podman setup, once those breaks were repaired in script, process ownership, diagnostic entry points, and the failure-handling path all became much clearer. That does not make the solution \u0026ldquo;perfect,\u0026rdquo; but it does turn recovery from something driven largely by operator intuition into a path that is easier to explain and easier to verify.\nAppendix: A Note on the Linger Mechanism If Rootless Podman is supervised through systemd --user, the target user usually needs linger enabled first. 
Otherwise, once that user logs out, the user-level systemd instance may be reclaimed and the related services may no longer stay alive without an interactive session. In this incident, the environment already had Linger=yes, so linger was not part of the root cause. It is included here only as a prerequisite check.\n# check current state loginctl show-user \u0026lt;deploy-user\u0026gt; --property=Linger # if it is not enabled, run this as a sudo-capable user: sudo loginctl enable-linger \u0026lt;deploy-user\u0026gt; ","permalink":"https://intent.me/en-us/blog/tech/rootless-podman-systemd-watchdog-postmortem/","summary":"A postmortem on a failed recovery path under Rootless Podman + Systemd user units, covering confirmed contributing factors, concrete remediation work, validation steps, and remaining risks.","title":"Rootless Podman + Systemd Supervision Failure Postmortem: Diagnosing and Repairing a Broken Recovery Path"},{"content":"This post documents a practical JWT dual-token (AT/RT) hardening effort.\nThe real question was simple: if a refresh token is replayed in an abnormal scenario, can the system detect it quickly and contain impact.\nNote: all domains, IPs, account identifiers, session IDs, and log samples are desensitized placeholders.\nSystem boundary:\nFrontend: Web SPA (AT stays in memory) Backend: Go + Gin (auth and refresh) Session layer: Redis (token state index) Gateway: HTTPS reverse proxy Symptom and Trigger In routine security exercises for a utility-scale energy cloud platform, we ran penetration tests on the authentication path across large volumes of edge gateways and dashboard traffic, including an extreme case where an old RT was intercepted and replayed.\nThe previous design had AT/RT separation, but the refresh path did not fully enforce state transitions, so old RTs could still be abused in narrow concurrency windows.\nDesensitized event sample:\n{ \u0026#34;event\u0026#34;: \u0026#34;auth.refresh.reuse_detected\u0026#34;, 
\u0026#34;user_id_masked\u0026#34;: \u0026#34;u-***39\u0026#34;, \u0026#34;session_id\u0026#34;: \u0026#34;s-***b1\u0026#34;, \u0026#34;ip_masked\u0026#34;: \u0026#34;10.**.**.21\u0026#34;, \u0026#34;action\u0026#34;: \u0026#34;session_revoked\u0026#34; } Strategy: Keep AT Stateless, Strengthen RT Governance We kept the existing AT/RT architecture and focused on hardening the RT refresh path to make it revocable, auditable, and operationally manageable.\nLayer 1: Keep Token Responsibilities Clear AT handles high-frequency access via Authorization: Bearer RT is delivered only through HttpOnly Cookie AT validation remains stateless for performance and scalability This keeps the access path lightweight while concentrating control in the refresh path.\nLayer 2: Make RT Refresh a Single-Use Traceable Flow Refresh follows a strict sequence: validate old RT -\u0026gt; mint new RT -\u0026gt; update Redis state -\u0026gt; invalidate old RT.\nRedis key model (desensitized):\nrt:active:{jti} -\u0026gt; { user_id, session_id, exp } // primary state, TTL follows RT expiry rt:session:{session_id} -\u0026gt; Set\u0026lt;jti\u0026gt; // secondary index for one-shot session cleanup rt:deny:{jti} -\u0026gt; 1 (TTL=remaining lifetime) // deny-list for replayed legacy tokens user:sessions:{user_id} -\u0026gt; Set\u0026lt;session_id\u0026gt; // user-level control, supports \u0026#34;sign out other devices\u0026#34; Controls:\none RT can refresh successfully only once old RT is deny-listed immediately after rotation short refresh lock prevents race-based double refresh Concurrency Debounce Design (Result Reuse Window) Modern SPA clients often emit multiple requests in parallel. Without coordination, one legacy RT can trigger multiple refresh attempts and cause valid traffic to be mistaken as replay. 
We introduced a 5-second reuse window:\nthe first request acquires the lock and completes rotation the generated AT/RT pair is cached for 5 seconds concurrent requests with the same key reuse that exact pair within the window, with no second rotation after the window expires, the flow returns to strict single-use rotation Layer 3: Handling Replay Events When RT replay is detected, the system revokes the related session, writes an audit event, and drives the client back to login.\nThe goal is to terminate risky sessions quickly instead of trying to patch them in place.\nLayer 4: Tie Logout and High-Risk Operations to Revocation Session invalidation is linked to:\nuser logout admin password reset account lock/disable transitions This upgrades logout from browser-only cookie clearing to actual server-side session invalidation.\nRoot Cause This was a lifecycle governance gap rather than a single coding mistake:\nEarly implementation focused on issuance and verification. Session-state capability existed technically, but was not fully integrated into the refresh critical path. 
Outcomes Three practical improvements after hardening:\nsessions can be revoked quickly server-side RT replay becomes detectable and attributable security incidents are easier to triage by user and session scope User experience remains stable: normal traffic keeps silent refresh, while risky paths degrade to explicit re-login.\nProcess Improvements 1) Add Token Lifecycle Checks to Release Gate Pre-release checks now include:\nRT single-use validation refresh concurrency consistency session invalidation checks after logout/password reset 2) Standardize Security Audit Fields Audit logs use desensitized fields consistently: user_id_masked, session_id, jti_prefix, ip_masked, ua_hash, risk_level.\n3) Run Regular Replay Drills A monthly RT replay drill verifies that detection, revocation, and recovery paths still work end to end.\nPhase 2 Roadmap: From Session Control to Contextual Zero Trust After this round, Phase 1 goals are in place: Redis-backed session revocation and concurrency-safe anti-replay controls. The refresh path already enforces real-time user status checks and blocks token issuance immediately for locked or disabled accounts.\nPhase 2 focuses on advanced hijack scenarios under contextual risk:\nContext-aware risk controls: introduce environment fingerprint comparison (IP range + UA hash). On abnormal geographic RT jumps, the system will revoke all AT/RT, freeze the account, and enforce third-party step-up identity verification. False-positive management: because industrial field networks may switch 4G/5G links frequently and legitimately, the policy will be promoted into the critical path only after risk-model tuning and controlled rollout. 
Closing Authentication resilience is less about token issuance speed and more about containment speed during abnormal events.\nThe key value of this hardening effort is moving AT/RT from “works” to “governable”.\n","permalink":"https://intent.me/en-us/blog/tech/jwt-at-rt-redis-hardening-postmortem/","summary":"A security hardening postmortem for JWT AT/RT architecture: treating Redis reservation as completed and implementing RT rotation, replay detection, and revocable sessions.","title":"JWT Dual-Token Hardening Postmortem: From Stateless Refresh to Revocable Redis Sessions"},{"content":"While hardening infrastructure for a core business line, we migrated the global protocol from HTTP to HTTPS. SSL was configured on Nginx and all traffic was redirected to 443, but regression tests consistently hit 403 Forbidden on the login API.\nResponse payload:\n{\u0026#34;code\u0026#34;:403,\u0026#34;msg\u0026#34;:\u0026#34;forbidden: invalid origin\u0026#34;} System architecture:\nFrontend: Vue Backend: Go + Gin Gateway: Nginx reverse proxy Runtime: Podman (Rootless mode) Troubleshooting Strategy: Top-Down Layer Isolation We followed a strict top-down path with clear boundaries: gateway layer, application layer, and runtime layer.\nLayer 1: Gateway (Nginx) The first suspicion was header forwarding. Configuration review confirmed Origin and related headers were already forwarded correctly.\nDuring debugging, we found this workaround could make requests pass:\nproxy_set_header Origin \u0026#34;http://business-domain\u0026#34;; But this approach forges the request origin and effectively bypasses backend origin verification, weakening CSRF protection. We rejected it and continued to root-cause analysis in the application layer.\nLayer 2: Application (CORS vs CSRF Misalignment) Packet inspection showed this response header was correct:\nAccess-Control-Allow-Origin: https://business-domain That proved CORS already allowed HTTPS. 
The key was that:\nCORS controls whether browsers can read cross-origin responses CSRF validates whether backend should trust the request origin They are parallel but independent defenses.\nSource review exposed the real gap: during protocol migration, only the CORS allowlist was updated. The CSRF middleware still trusted the old http:// origin list. Requests passed CORS but were blocked by CSRF with invalid origin.\nFix: add https:// domains to CSRF trusted origin allowlist in sync with CORS.\nLayer 3: Runtime (Podman Rootless Isolation) After code fix and build, production still failed. We compared image hashes via podman inspect and found a hidden operational trap:\nProduction services run under deploy in a Rootless namespace Some images were built in the Root namespace Under Rootless isolation, deploy cannot see newly built images in Root namespace Deployment scripts kept restarting containers from stale images, creating a false state where code was fixed but runtime never updated.\nTo prevent this class of failure at the OS boundary, we enforced wrapper guards on /usr/local/bin/podman and /usr/local/bin/podman-compose: if uid=0, execution is rejected immediately. Once which podman resolves to the wrapper, accidental root operations fail fast instead of polluting the wrong image namespace.\nRoot Cause Summary This incident was not a single bug, but two systemic gaps stacked together:\nArchitecture governance gap: CORS and CSRF allowlists were configured separately without centralized security config ownership. Engineering loop gap: image build authority was not fully constrained to standardized CI/CD, allowing cross-namespace dirty state. Action Items and Process Evolution 1) Strengthen Security Release SOP and Checklist Add mandatory “CORS-CSRF policy alignment” checks for new services and protocol upgrades.\n2) Converge Production Operations Baseline Enforce that application image build and lifecycle are handled only by CI/CD pipelines. 
Manual cross-privilege namespace intervention is prohibited.\n3) Standardize Observability for Microservices Make middleware order an architecture rule: logging must be loaded before security interception, ensuring blocked requests are always auditable.\nClosing The value of this 403 investigation was not just restoring one API. It revealed weak points in security governance and delivery reliability. Code fixes solve the current outage; process constraints prevent the next one.\n","permalink":"https://intent.me/en-us/blog/tech/https-upgrade-403-postmortem/","summary":"A postmortem on a persistent 403 after HTTPS migration, traced to both missing CSRF allowlist updates and Podman Rootless image namespace isolation.","title":"HTTPS Upgrade Triggered 403: A Deep Postmortem from Security Middleware to Container Isolation"},{"content":"During iteration of Modbus-Tools, we ran into a classic concurrency issue: the UI window closed, but the process stayed alive and the debug session could not exit cleanly. After addressing that, timeout misjudgment and re-entrancy contention also surfaced. This note records the diagnosis and fixes.\nSymptom 1: Hang During Shutdown Main thread waits for worker thread termination while destroying MainWindow Worker thread may still be blocked on serial or socket I/O UI is gone, but process remains Cause 1: Inconsistent Thread Affinity QSerialPort and QTcpSocket rely on the event loop of their owning thread. 
If created in the main thread but effectively used from worker-side flow, shutdown can fall into cross-thread waiting.\nFix 1: Make Ownership Explicit Move the channel object to the worker thread right after construction:\nstack.thread = std::make_shared\u0026lt;QThread\u0026gt;(); stack.channel-\u0026gt;moveToThread(stack.thread.get()); Also keep parent unset at creation time so migration remains valid.\nSymptom 2: Intermittent Timeout After Connect Packet monitor shows device responses Business layer still returns timeout Cause 2: Blocking Wait Starves Event Loop After sending requests, condition_variable::wait_until blocked the I/O thread. As a result, readyRead callbacks were delayed even when data had already arrived.\nFix 2: Replace With Event-Yielding Wait Loop while (true) { QCoreApplication::processEvents(QEventLoop::AllEvents); if (signalReceived) break; if (timeout) return Error; std::this_thread::sleep_for(std::chrono::milliseconds(1)); } This keeps responsiveness within a controlled loop and avoids busy-spin.\nSymptom 3: Self-Deadlock Under High-Frequency Operations After introducing processEvents, re-entrant calls could happen within the same thread during the waiting window, causing lock contention on the same request path.\nCause 3: Lock Type Mismatch With Re-entrancy std::mutex does not allow repeated acquisition by the same thread.\nFix 3: Use Recursive Mutex for Request Serialization std::recursive_mutex requestMutex_; Takeaways Fix I/O object thread ownership early and keep it consistent through lifecycle. Avoid long blocking waits in I/O threads; if waiting is required, use an event-aware strategy. Whenever event yielding is introduced, re-check re-entrancy paths and lock model together. This fix not only removed shutdown stalls, but also improved timeout consistency in Modbus communication. 
For industrial software, deterministic shutdown and explainable timing are as important as functional completeness.\n","permalink":"https://intent.me/en-us/blog/tech/modbus-deadlock-fix-notes/","summary":"A field note on shutdown deadlocks and timeout misjudgment in a Qt multithreaded Modbus workflow, with practical fixes.","title":"Pitfall Log: Fixing Modbus Multithread Deadlocks"},{"content":"Modbus-Tools C++20 | Qt6 | Industrial Protocols\nBased on the latest Modbus-Tools release. Updated 2026-04-21\nIn industrial automation and embedded development, Modbus is part of everyday work. The real challenge is how much time we spend on repetitive tasks during field commissioning: manually building frames, repeatedly checking logs, and converting raw register values one by one.\nModbus-Tools was built with a clear goal: make the high-frequency steps of sending frames, reviewing logs, and parsing data smoother to use. Rather than stacking features, it prioritizes debugging efficiency and usability. Internally it uses a layered channel / transport / session / parser design, with CI/CD, multi-language support, and auto-update capabilities for practical iteration.\nBelow is a walkthrough of the core features in typical usage order.\n1. Fast connection and frame creation Whether you are using Modbus RTU (serial) or Modbus TCP (network), connection parameters are in the left panel at a glance.\nDuring debugging, the priority is rapid trial-and-error. Modbus-Tools separates parameter input from function code actions clearly:\nOne-click function code actions: Fill Slave ID, Start Address, and Quantity/Data, then click 01/02/03/04/05/06/0F/10 buttons to send. The underlying layer assembles frames and calculates CRC/LRC automatically. Function codes cover 0x01–0x04 (read), 0x05–0x06 (single write), 0x0F–0x10 (multiple write). 
HEX / DEC smart recognition: Slave ID and Start Address accept both HEX (e.g., 0x10, 10H) and DEC (e.g., 16) formats, parsed uniformly by parseSmartInt() with range validation. Writable data format switch (HEX / DEC / Binary): Use the Format selector next to Write Data to switch between Hex, Decimal, or Binary input styles. Enhanced Raw mode: For custom Hex frames, switch to Raw input and send — suitable for non-standard scenarios or edge-case validation. Two auxiliary buttons are included: Append CRC (RTU): Automatically computes and appends a CRC16 checksum to the input field. Add MBAP (TCP): Automatically wraps the input with a Modbus TCP MBAP header (Transaction ID / Protocol ID / Length / Unit ID). Coil (Coils) binary write interaction For bit-level operations, the tool provides an intuitive Binary input mode:\nBinary input: Enter bit strings directly (e.g., 1 0 1 1). The system auto-encodes and sends via 0x05 (single coil write) or 0x0F (multiple coils write) function codes. Bit-level read: Pair with 0x01/0x02 read commands to verify remote device coil and discrete input states. 2. Log inspection and quick copy (Traffic Monitor) Troubleshooting usually starts with logs. Traffic Monitor is designed to make log reading and sharing straightforward.\nClear TX/RX separation: Sent and received frames are visually separated, with millisecond timestamps. One-click copy: Copy key frames quickly for issue reproduction, team discussion, or test records. Direction filters: Show only TX or only RX to focus in high-frequency polling scenarios. Log export: Save current communication logs for bug archives and field records. 3. Frame Analyzer: from Hex to useful values Frame Analyzer is one of the most frequently used modules. Paste a Hex frame, click parse, and get structured output with a readable table.\nThe following capabilities are especially useful:\nDecode mode switch (Unsigned / Signed) The toolbar provides Decode Mode with Unsigned and Signed. 
After switching, parsed values are refreshed immediately: decimal, hex, binary, and scaled output all update together. This is useful for signed telemetry such as negative temperature or reverse power.\nScaling (Multiplier) Devices often send scaled integers (for example, 220.5V transmitted as 2205).\nSet Scale per register (such as 0.1 or 0.01), and the analyzer shows engineering values in real time.\nByte order analysis (Byte Order) Different PLCs and instruments may use different data layouts. The analyzer supports four byte/word order modes:\nABCD (Big Endian): Most significant byte first. CDAB (Little Endian Byte Swap): The two 16-bit words are swapped, with big-endian bytes inside each word. BADC (Big Endian Byte Swap): Words stay in big-endian order, with the two bytes inside each word swapped. DCBA (Little Endian): Least significant byte first. Switching byte order recalculates and redisplays register values.\nRegister annotations (Description) Add descriptions like \u0026ldquo;Phase A Voltage\u0026rdquo; or \u0026ldquo;Motor Speed\u0026rdquo; to register addresses.\nValues and meanings appear side by side, reducing time spent cross-checking spreadsheets.\nPersistent config and JSON / CSV templates Metadata can be saved and reused.\nAuto-save keeps your latest scaling and descriptions. JSON / CSV import/export lets you maintain per-device templates and import them when switching targets. Other practical details Format Hex button: Cleans and normalizes pasted Hex text for better readability. Configurable response start address: Set Start Address for response parsing to align with your register map. Protocol selection: Auto Detect / Modbus TCP / Modbus RTU helps in mixed capture scenarios. Force Parse During field captures, frames may fail integrity checks due to truncation or modification by intermediate devices. 
When the user manually selects TCP or RTU in the Protocol dropdown (instead of Auto Detect), the parser enters force mode:\nRTU force mode: On CRC mismatch, instead of aborting with an error, it marks checksumValid = false, logs \u0026quot;CRC Mismatch (Forced)\u0026quot; in warnings, and continues parsing PDU data fields. TCP force mode: On abnormal MBAP length or trailing bytes, it logs the corresponding warning but still extracts PDU based on actual frame length. This mechanism is useful for analyzing frames modified by gateways/relays or incomplete captures from serial sniffing tools. Auto Detect mode enforces strict checks, suitable for precise verification in normal communication scenarios.\nLink to Analyzer (live linkage) In addition to manual paste, Frame Analyzer can receive live data from Traffic Monitor:\nAuto-push: With the Linkage toggle enabled in Modbus TCP / RTU views, RX response PDUs are sent to the parser automatically — no manual copy needed. Pause / Resume: Click Pause Refresh to freeze the current frame so you can edit Scale or Description; click Resume Refresh to resume auto-refresh. Stop linkage: Click Stop Link to disconnect the data stream; the analyzer returns to manual mode. Async execution: Parsing logic runs on a QThread background thread, non-blocking to Traffic Monitor list scrolling. If you maintain multiple device models, saving per-device templates and importing them when switching targets reduces repeated configuration.\n4. Handy extras Beyond the core Modbus workflow, two lightweight helpers are available for ad-hoc tasks:\nTCP Client: For quick custom network message verification. Serial Port: For basic serial port send/receive testing (ASCII/Hex). 5. 
Engineering quality and test coverage As an open-source project under active iteration, Modbus-Tools invests in code quality:\nAutomated tests The project uses the Google Test (GTest) and Google Mock (GMock) frameworks for automated quality assurance, covering:\nSession management: Connect/disconnect logic, request timeout retry, and exception state recovery. Protocol transport: TCP/RTU frame assembly/disassembly, checksum computation, and integrity verification. Parsing logic: Robustness validation against valid instructions and malformed frames. Data processing: Byte-order conversion, engineering-value scaling, and formatting algorithm accuracy. There are currently 42 automated test cases (TEST + TEST_F) covering session management, protocol transport, parsing logic, data processing, and formatting. Regression tests run on every Release.\nCI/CD integration The GitHub Actions pipeline integrates MSVC AddressSanitizer (ASan) for automated memory corruption and leak detection. It also supports automatic build, test, and Release artifact distribution. Auto-update (OTA) The tool integrates a GitHub Releases-based auto-update mechanism: a silent version check at startup plus a manual check from the menu bar. Update packages are verified via SHA256 before replacement, and both UpdateOnly (incremental) and Full Package modes are supported.\nFinal notes The design goal of Modbus-Tools is to encapsulate high-frequency debugging operations (send frames, review logs, parse data) into reusable workflows, allowing developers to focus more on business logic and issue diagnosis.\nFeature summary:\nFast framing: HEX/DEC smart recognition (parseSmartInt) + Raw mode CRC/MBAP helper calculation. Live linkage: Link to Analyzer supports auto-push of RX frames to the parser. Deep analysis: Scaling (Scale Factor) + four byte orders (ABCD/BADC/CDAB/DCBA) + register descriptions. Bit-level control: Coil Binary input mode, supporting 0x05/0x0F function codes. Quality assurance: 42 automated tests + CI/CD integrated with MSVC AddressSanitizer. 
If you work on embedded firmware or host-side tools, feel free to try it and share feedback in the GitHub repository.\n","permalink":"https://intent.me/en-us/blog/tech/modbus-tools-intro/","summary":"A practical guide to a lightweight Modbus tool for embedded development and field commissioning: fast frame building, log viewing, Frame Analyzer with scaling/register annotations/JSON-CSV persistence, Link to Analyzer live linkage, and Force Parse.","title":"Modbus-Tools Deep Dive: A Practical Modbus Workflow for Daily Debugging"},{"content":"package main import \u0026#34;fmt\u0026#34; func main() { fmt.Println(\u0026#34;Hello, Intent\u0026#34;) } Who am I? I am a Software Engineer navigating the intersection of Embedded Systems and Cloud Native technologies.\nWith a background in Power Systems (EMS/PCS), I enjoy building reliable systems that bridge the physical and digital worlds.\nCore: C/C++, Golang, Qt, Modbus, CAN, MQTT, IEC-104, IEC-61850, Protocol Conversion Focus: IoT Protocols, Real-time Control, Distributed Systems \u0026ldquo;Simplicity is the ultimate sophistication.\u0026rdquo;\nThis blog serves as a log of my technical explorations and open source contributions.\n","permalink":"https://intent.me/en-us/blog/hello-world/","summary":"fmt.Println(\u0026ldquo;Hello, Intent\u0026rdquo;)","title":"Hello World"},{"content":"🏗️ Open Source Projects This page showcases my personal open source projects. For commercial projects I\u0026rsquo;ve worked on (such as Energy Storage Monitoring Systems, Industrial Cloud Platforms, etc.), please refer to my Resume.\nProject Name Description Tech Stack Portal Modbus-Tools A lightweight Modbus debugging tool for embedded integration work, with RTU/TCP support, visual frame building, Frame Analyzer parsing, and reusable JSON / CSV templates. 
C++20 Qt6 Industrial Protocols 📖 Deep Dive\n🏗 Source Code For more experimental projects, please visit my GitHub Profile.\n","permalink":"https://intent.me/en-us/projects/","summary":"Open Source Projects \u0026amp; Technical Practice","title":"Projects"},{"content":" YUCHENG MING Embedded / PC Software Engineer ｜ 3 Years Experience ｜ Energy Storage, Industrial Comm \u0026amp; Cloud-Edge Synergy\nEducation: Bachelor ｜ Location: Zhongshan (Intent: Guangzhou/Shenzhen) ｜ Email: Click to View ｜ GitHub: mingyucheng692\nCore Competencies Languages: C / C++ ｜ Golang ｜ Python ｜ Shell Protocols: Modbus RTU / TCP ｜ MQTT ｜ IEC-104 ｜ CAN Frameworks \u0026amp; Tools: Qt6 ｜ CMake ｜ Docker ｜ Redis ｜ PostgreSQL ｜ Git Work Experience Honghui Energy (South HQ) / Guangdong Ruilai Huakong Technology Co., Ltd. Software Engineer ｜ 2025.06 - Present ｜ Zhongshan, Guangdong\n▸ A wholly-owned subsidiary of Honghui Energy (leading flywheel energy storage enterprise), responsible for core system development at the Southern R\u0026amp;D and production base\nParticipated in the 0-to-1 construction of the Southern base\u0026rsquo;s energy storage software ecosystem (edge-side-cloud), aligning with group technical standards; delivered software modules directly serving production line equipment testing and on-site delivery. Collaborated closely with hardware and electrical protocol teams to resolve communication congestion and signal jitter issues in complex industrial environments, ensuring reliable interaction between power electronics devices and upper-layer systems. Facilitated team Git collaboration and code review practices; developed multiple universal debugging toolchains to replace manual lookups with visual parsing, significantly reducing on-site troubleshooting cycles. 
Key Projects Flywheel Energy Storage Monitoring System (FMS) Core Developer ｜ C++ / Qt6 / Modbus-TCP / CAN / SQLite / IOCP ｜ 2025.06 - Present\nRefactored Modbus-TCP communication link with Windows IOCP and connection pooling, resolving UI blocking caused by high-frequency data acquisition, reducing CPU usage by ~15%. Developed dedicated debugging module for DSP control boards (based on ZLG CAN SDK), implementing bidirectional dynamic parsing of IEEE 754 floating-point numbers and HEX frames, supporting multi-byte order (CDAB/ABCD) auto-conversion and register semantic mapping, replacing manual CANTest workflows and significantly improving hardware-software integration efficiency. Designed sliding window algorithm for alarm signal debouncing; implemented high-frequency message persistence based on SQLite WAL mode, enabling efficient local retrieval and tracing of historical on-site data. Promoted CMake modular builds and integrated Crash Dump exception capture mechanism, compressing full compilation time from 5 minutes to under 50 seconds, significantly enhancing development iteration and on-site troubleshooting efficiency. Flywheel Energy Storage Edge Communication Gateway Core Developer ｜ C / STM32F407 / FreeRTOS / MQTT / Modbus / IEC-104 ｜ 2025.12 - Present\nResponsible for edge gateway core business logic based on STM32F407 and FreeRTOS, polling underlying devices via Modbus/RS485 downwards and establishing time-series data channels to the cloud via MQTT upwards. Implemented protocol parsing and mapping conversion among Modbus, IEC-104, and MQTT, enabling a closed-loop data interaction across device-side, power protocol-side, and platform-side. Designed and implemented time-series data caching and breakpoint resume mechanism for weak network conditions, ensuring data integrity during network fluctuations. 
Flywheel Energy Storage Smart Cloud Platform Backend Developer ｜ Golang / Docker / Podman / Redpanda / PostgreSQL / Redis ｜ 2025.06 - Present\nParticipated in core backend service development for the energy storage cloud platform (based on Alibaba Cloud Linux), handling high-frequency device data ingestion, protocol parsing, and time-series storage. Led deployment environment configuration using Docker for local development and Podman rootless combined with Shell scripts for production, achieving secure isolated deployment with rootless accounts. Introduced Redpanda message queue to decouple data acquisition and storage layers, smoothing concurrent write peaks; independently designed authentication middleware maintaining secure session state via JWT (AT/RT) and Redis. Modbus-Tools (Personal Open Source Project) Independent Developer ｜ C++20 / Qt6 / CMake / CI-CD ｜ 2025.12 - Present\nDeveloped cross-platform debugging tool using channel/transport/session/parser layered architecture with built-in visual frame builder, eliminating tedious manual table lookups and hexadecimal frame concatenation. Built Frame Analyzer core parser supporting automatic protocol recognition, custom scaling conversion, and register semantic annotation; supports JSON/CSV configuration import/export, compressing on-site troubleshooting time from minutes to seconds, improving integration efficiency by 5x+. Established fully automated CI/CD pipeline via GitHub Actions for automatic build and release upon code push; client features built-in multi-language switching and Auto-Updater mechanism, ensuring agile iteration in field deployments. Education Chongqing College of Foreign Trade and Business ｜ Internet of Things Engineering ｜ Bachelor of Engineering\n","permalink":"https://intent.me/en-us/resume/","summary":"YUCHENG MING - Embedded / PC Software Engineer, focusing on Energy Storage Systems, Industrial Communication, and Cloud-Edge Synergy","title":"Resume"}]