DNS Privacy Stack — Troubleshooting
Every issue documented here was encountered in production. Each section includes the symptoms, root cause, diagnosis steps, and fix.
dig @127.0.0.1 Times Out but LAN IP Works¶
Symptoms:
$ dig @127.0.0.1 google.com +timeout=3
;; communications error to 127.0.0.1#53: timed out
;; no servers could be reached
$ dig @<YOUR_SERVER_IP> google.com +timeout=3
142.251.220.206 # Works!
Root cause: AdGuard Home's allowed_clients does not include 127.0.0.0/8. When a query arrives from 127.0.0.1, AdGuard refuses it because the source IP isn't in the allowlist. TCP queries get REFUSED; UDP queries are silently dropped.
Diagnosis:
## TCP shows REFUSED (the clue)
dig @127.0.0.1 google.com +tcp +timeout=3
## status: REFUSED
## Check allowed_clients
docker exec adguardhome grep -A5 'allowed_clients' /opt/adguardhome/conf/AdGuardHome.yaml
Fix:
docker exec adguardhome sed -i 's/ allowed_clients:/ allowed_clients:\n - 127.0.0.0\/8/' \
/opt/adguardhome/conf/AdGuardHome.yaml
docker restart adguardhome
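Before and after applying the fix, it can help to confirm whether the allowlist actually contains a loopback entry. The following is a minimal sketch, not part of AdGuard Home: `has_loopback_allowed` is a hypothetical helper name, and it assumes the list is written as plain unquoted YAML items directly under `allowed_clients:`. An empty list is not treated specially.

```shell
# Hypothetical helper: succeed (exit 0) if the allowed_clients list in the
# given YAML file contains 127.0.0.0/8 or 127.0.0.1, fail (exit 1) otherwise.
# Assumes unquoted "- value" items directly below "allowed_clients:".
has_loopback_allowed() {
  awk '
    /^[ \t]*allowed_clients:/ { in_list = 1; next }
    in_list && /^[ \t]*-[ \t]/ {
      # $1 is the dash, $2 is the list entry
      if ($2 == "127.0.0.0/8" || $2 == "127.0.0.1") found = 1
      next
    }
    in_list { in_list = 0 }   # first non-item line ends the list
    END { exit(found ? 0 : 1) }
  ' "$1"
}
```

One way to use it is to copy the config out of the container first, for example with `docker exec adguardhome cat /opt/adguardhome/conf/AdGuardHome.yaml > /tmp/agh.yaml`, and then run `has_loopback_allowed /tmp/agh.yaml || echo "loopback missing from allowed_clients"`.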
Unbound Returns SERVFAIL on Everything¶
Symptoms: Every query returns status: SERVFAIL. Unbound is running and listening.
Possible causes (check in order):
1. use-caps-for-id: yes¶
This feature randomizes query name casing to detect DNS spoofing. Many authoritative servers don't preserve case, causing Unbound to treat every response as spoofed.
grep 'use-caps-for-id' /etc/unbound/unbound.conf.d/adguard.conf
## If yes, change to no
Logs will show module_event_capsfail repeatedly.
2. harden-referral-path: yes¶
This does extra queries to validate the referral chain. If any validation query fails, the entire resolution fails. Remove it entirely — the security benefit is minimal for a forwarder.
3. DNSSEC trust anchor priming failure¶
info: failed to prime trust anchor -- could not fetch DNSKEY rrset
If Unbound can't validate the root DNSSEC key (common with ISP hijacking), and val-permissive-mode: no, all queries fail. Set val-permissive-mode: yes to log failures without blocking.
4. subnetcache module interference¶
On Ubuntu 24.04, Unbound 1.19.2 has the subnetcache module compiled in. It loads automatically even without send-client-subnet in the config. Combined with serve-expired and prefetch, it produces warnings:
warning: subnetcache: serve-expired is set but not working for data originating from the subnet module cache
This doesn't break forwarding but caused issues with recursive resolution. The module auto-loads — don't try to exclude it with module-config: "validator iterator" as this can cause worse problems on some builds.
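Causes 1 to 3 above are all visible in the config file, so they can be checked in one pass. This is a rough sketch under stated assumptions: `check_unbound_conf` is a hypothetical helper, it only inspects the single file it is given (included files are not followed), and it expects each option on its own line.

```shell
# Hypothetical helper: print one line per risky setting found in an
# Unbound config file. Prints nothing if none of the three apply.
check_unbound_conf() {
  conf=$1
  if grep -Eq '^[[:space:]]*use-caps-for-id:[[:space:]]*yes' "$conf"; then
    echo "use-caps-for-id: yes (cause 1)"
  fi
  if grep -Eq '^[[:space:]]*harden-referral-path:[[:space:]]*yes' "$conf"; then
    echo "harden-referral-path: yes (cause 2)"
  fi
  # Only relevant when trust anchor priming fails (cause 3).
  if ! grep -Eq '^[[:space:]]*val-permissive-mode:[[:space:]]*yes' "$conf"; then
    echo "val-permissive-mode is not yes (cause 3)"
  fi
}
```

For the file used in this guide, the call would be `check_unbound_conf /etc/unbound/unbound.conf.d/adguard.conf`.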
ISP DNS Hijacking Breaks Recursive Resolution¶
Symptoms: Unbound forwarding to 1.1.1.1 works, but recursive resolution (no forward-zone) times out on every root server query.
Root cause: Your ISP transparently redirects all port 53 traffic (UDP and TCP) to their own DNS servers. When Unbound sends a non-recursive query (RD=0) to a root server, the ISP's proxy intercepts it. The proxy can't handle non-recursive queries, so it drops them or returns garbage.
How to confirm:
## This works (your shell sends RD=1, ISP resolver handles it)
dig @198.41.0.4 google.com +short +timeout=3
## Returns an IP — but you're NOT actually talking to the root server
## This is what Unbound sends (RD=0) — fails because ISP can't handle it
dig @198.41.0.4 . NS +norec +timeout=3
## Timeout or SERVFAIL
The definitive test from RIPE Labs:
dig @198.41.0.4 hostname.bind CH TXT +timeout=3
## If hijacked: timeout, SERVFAIL, or wrong answer
## If not hijacked: returns the root server's hostname
Fix: Use DNS-over-TLS forwarding instead of recursive resolution. DoT uses port 853, which ISPs don't hijack:
forward-zone:
    name: "."
    forward-tls-upstream: yes
    forward-addr: 194.242.2.2@853#dns.mullvad.net
ISPs known to hijack: Many ISPs in Asia and elsewhere, including China, India, Indonesia, Brazil, and Turkey. If you're on one of these, recursive resolution will not work without a VPN tunnel.
Unbound Swap Thrashing Causes Cascading DNS Outage¶
Symptoms: After 3-7 days, the server becomes unresponsive. SSH sessions freeze. All containers lose DNS. Server swap usage is 12GB+.
Root cause: Unbound's caches were set too large (1GB rrset + 512MB msg + 256MB key = 1.8GB). With malloc overhead, actual usage is ~2.5x the configured value (~4.5GB). On a 16GB server running 100+ containers, this pushes the system into heavy swap usage. Unbound's cache access patterns cause constant page faults, which cascade into I/O wait, which blocks DNS responses, which causes all containers to retry, which increases load further.
Fix: Right-size the caches. For a homelab with ~10,000 unique domains:
rrset-cache-size: 64m
msg-cache-size: 32m
key-cache-size: 16m
neg-cache-size: 8m
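The sizing arithmetic from the root cause can be reproduced with a few lines of shell. This is only a sketch: `estimate_unbound_mem_mb` is a hypothetical helper, and the 2.5x factor is the rule of thumb quoted above rather than a measured constant. The result is a worst case for completely full caches, so real usage on a small network will normally sit well below it.

```shell
# Hypothetical helper: sum the configured cache sizes (in MB) and apply
# the ~2.5x malloc overhead factor quoted above.
estimate_unbound_mem_mb() {
  total=0
  for size in "$@"; do
    total=$((total + size))
  done
  # Integer arithmetic: x * 2.5 == x * 5 / 2
  echo $((total * 5 / 2))
}

estimate_unbound_mem_mb 1024 512 256   # oversized config: prints 4480
estimate_unbound_mem_mb 64 32 16 8     # right-sized config: prints 300
```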
Monitor:
## Check Unbound memory
systemctl status unbound | grep Memory
## Should show ~20-30MB, peak ~50MB. If it exceeds 200MB, caches are too large.
## Check system swap
free -m | grep Swap
## Swap used should be under 1GB for healthy operation
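For a cron job or monitoring hook, the swap check can be turned into a pass/fail line. A minimal sketch, assuming the usual `free -m` layout where the third column of the `Swap:` row is the used amount in MB; `swap_status` is a hypothetical name, and the 1024 MB threshold is the "under 1GB" guideline above.

```shell
# Hypothetical helper: read `free -m` output on stdin and report whether
# swap usage is under the 1 GB guideline.
swap_status() {
  used=$(awk '/^Swap:/ { print $3 }')
  if [ -z "$used" ]; then
    echo "unknown"
  elif [ "$used" -lt 1024 ]; then
    echo "ok: ${used} MB swap used"
  else
    echo "high: ${used} MB swap used"
  fi
}
```

Typical use is `free -m | swap_status`.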
GL.iNet Router: DNS Stops Working After Configuration¶
Symptoms: After changing the router's DNS settings, ping google.com returns bad address on the router and all clients lose internet.
Trap 1: force_dns='1'¶
GL.iNet routers have force_dns='1' by default. This creates iptables DNAT rules that redirect ALL port 53 traffic passing through the router to dnsmasq. If you set DHCP option 6 to <YOUR_SERVER_IP>, clients try to reach your DNS server, but the router intercepts the traffic and hands it to dnsmasq. When dnsmasq forwards the query, that traffic is intercepted too, which creates a DNS loop and total resolution failure.
Fix: Always disable before changing DNS:
uci set dhcp.@dnsmasq[0].force_dns='0'
uci commit dhcp
/etc/init.d/dnsmasq restart
Trap 2: Setting noresolv and WAN DNS¶
Setting noresolv='1' and pointing the router's upstream to <YOUR_SERVER_IP> sounds logical but is fragile. If AdGuard/Unbound restarts, the router itself loses DNS, which can prevent dnsmasq from resolving anything — including the path back to <YOUR_SERVER_IP> if it goes through DNS.
Safe approach: Only use DHCP option 6. Leave the router's own DNS untouched (ISP DNS). This way the router always works, and only clients use your privacy DNS.
## SAFE: Only affects clients
uci add_list dhcp.lan.dhcp_option='6,<YOUR_SERVER_IP>'
uci commit dhcp
/etc/init.d/dnsmasq restart
## DANGEROUS: Don't do this
## uci set dhcp.@dnsmasq[0].noresolv='1'
## uci add_list dhcp.@dnsmasq[0].server='<YOUR_SERVER_IP>'
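The first two traps can be spotted from the router's current config before anything is changed. The sketch below is an illustration only: `audit_router_dns` is a hypothetical helper that reads the text printed by `uci show dhcp` on stdin, and it assumes the usual `key='value'` output format. It is written in plain POSIX sh so it can run in the router's shell.

```shell
# Hypothetical helper: read `uci show dhcp` output on stdin and print a
# line for each setting that matches one of the traps described above.
audit_router_dns() {
  cfg=$(cat)
  if printf '%s\n' "$cfg" | grep -q "\.force_dns='1'"; then
    echo "force_dns is enabled (Trap 1)"
  fi
  if printf '%s\n' "$cfg" | grep -q "\.noresolv='1'"; then
    echo "noresolv is enabled (Trap 2)"
  fi
  if ! printf '%s\n' "$cfg" | grep -q "dhcp_option=.*'6,"; then
    echo "DHCP option 6 is not set"
  fi
}
```

On the router this would be run as `uci show dhcp | audit_router_dns`; no output means none of the three conditions were found.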
Trap 3: /etc/init.d/network restart¶
Never restart the router's network stack when testing DNS changes. It takes down all interfaces briefly, which disconnects your SSH session and can leave the router in a bad state. Only restart dnsmasq.
systemd Watchdog Kills Unbound Every 60 Seconds¶
Symptoms: Monitoring alerts show Unbound entering failed state every 60 seconds:
unbound.service: Failed with result 'watchdog'
unbound.service: Killing process with signal SIGABRT
Root cause: WatchdogSec=60 in the systemd override requires Unbound to send sd_notify(WATCHDOG=1) pings. Unbound does not implement systemd watchdog. After 60 seconds without a ping, systemd kills it.
Fix: Replace the override with one that does not set WatchdogSec:
sudo tee /etc/systemd/system/unbound.service.d/override.conf << 'EOF'
[Service]
LimitNOFILE=65536
LimitNPROC=512
Restart=on-failure
RestartSec=5
EOF
sudo systemctl daemon-reload
sudo systemctl restart unbound
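If the restarts continue, a WatchdogSec line may still be present in another drop-in file. A small sketch for finding it, assuming the drop-ins live in the directory used above; `find_watchdog_overrides` is a hypothetical helper name.

```shell
# Hypothetical helper: list any WatchdogSec= lines in the drop-in files of
# a unit. Exits 0 if at least one is found, non-zero otherwise.
find_watchdog_overrides() {
  dir=${1:-/etc/systemd/system/unbound.service.d}
  grep -H -E '^[[:space:]]*WatchdogSec=' "$dir"/*.conf 2>/dev/null
}
```

Running `find_watchdog_overrides` with no argument checks the default directory. `systemctl cat unbound` prints the unit file together with all drop-ins, which also covers a WatchdogSec set in the packaged unit itself.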
AdGuard Container Takes 10+ Seconds to Start¶
Symptoms: After docker restart adguardhome, port 53 returns "connection refused" for 10-15 seconds. Scripts that test immediately after restart fail.
Root cause: AdGuard Home enumerates all Docker veth interfaces on startup (host networking mode). With 100+ containers, this takes 7-15 seconds before the DNS listener starts.
Workaround: When scripting, poll instead of using a fixed sleep:
docker restart adguardhome
## Try up to 12 times, 5 seconds apart; +tries=1 keeps each attempt to one 3-second query
for i in $(seq 1 12); do
  r=$(dig @127.0.0.1 google.com +short +timeout=3 +tries=1 2>&1 | head -1)
  if [[ "$r" =~ ^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    echo "Ready on attempt $i"
    break
  fi
  sleep 5
done
resolv.conf Resets After Reboot¶
Symptoms: After rebooting the server, cat /etc/resolv.conf shows the ISP/DHCP nameserver instead of 127.0.0.1. All containers that use the host resolver fail.
Root cause: Netplan or DHCP client overwrites /etc/resolv.conf on boot.
Fix: Create a netplan override:
sudo tee /etc/netplan/99-dns-override.yaml << 'EOF'
network:
  version: 2
  ethernets:
    <YOUR_INTERFACE>:
      nameservers:
        addresses: [127.0.0.1]
      dhcp4-overrides:
        use-dns: false
EOF
sudo chmod 600 /etc/netplan/99-dns-override.yaml
sudo netplan apply
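To confirm the override survives a reboot, the first nameserver entry can be checked from a script. A minimal sketch; `first_nameserver` is a hypothetical helper and only looks at the first `nameserver` line of the file it is given.

```shell
# Hypothetical helper: print the first nameserver listed in a resolv.conf
# style file (defaults to /etc/resolv.conf).
first_nameserver() {
  awk '$1 == "nameserver" { print $2; exit }' "${1:-/etc/resolv.conf}"
}
```

After a reboot, `[ "$(first_nameserver)" = "127.0.0.1" ] || echo "resolv.conf was overwritten"` reports whether the DHCP-provided resolver has come back.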
Quick Diagnostic Commands¶
## Full stack test (run all at once; each check runs even if an earlier one fails)
echo "=== resolv.conf ==="; cat /etc/resolv.conf
echo "=== Unbound ==="; dig @127.0.0.1 -p 5335 google.com +short +timeout=5
echo "=== AdGuard localhost ==="; dig @127.0.0.1 google.com +short +timeout=5
echo "=== AdGuard LAN ==="; dig @<YOUR_SERVER_IP> google.com +short +timeout=5
echo "=== System ==="; ping -c 1 google.com | head -2
## Check Unbound is using DoT
ss -tnp | grep :853
## Check Unbound memory
systemctl status unbound | grep Memory
## Check AdGuard is running
docker ps --filter name=adguardhome --format '{{.Status}}'
## Check that caching works (run twice)
dig @127.0.0.1 google.com +timeout=5 | grep "Query time"
## Second run should report 0 msec