I've spent weeks trying to get to the bottom of this issue and cannot figure it out. My Interceptor board seems to reach a state where the realtek-smi driver reports an ACK timeout, after which the board drops off the network. After plenty of digging, I still cannot tell what causes the error, why it recurs so consistently, or why it takes some time before it occurs.
The kernel log at the time of failure is below. I would be very grateful to anyone who could shed some light on why this occurs. I am happy to provide any further data that might help diagnose it.
Jan 29 16:54:11 kernel: realtek-smi switch0: ACK timeout
Jan 29 16:54:11 kernel: realtek-smi switch0: failed to read PHY0 reg 00 @ a400, ret -110
Jan 29 16:54:11 kernel: ------------[ cut here ]------------
Jan 29 16:54:11 kernel: phy_check_link_status+0x0/0xc8: returned: -110
Jan 29 16:54:11 kernel: WARNING: CPU: 2 PID: 514044 at drivers/net/phy/phy.c:1233 phy_state_machine+0xa4/0x2dc
Jan 29 16:54:11 kernel: Modules linked in: xt_connmark xt_mark xt_comment wireguard libchacha20poly1305 chacha_neon libchacha poly1305_neon ip6_udp_tunnel udp_tunnel lib>
Jan 29 16:54:11 kernel: CPU: 1 PID: 514044 Comm: kworker/u8:1 Not tainted 6.6.8 #16
Jan 29 16:54:11 kernel: Hardware name: Raspberry Pi Compute Module 4 Rev 1.0 (DT)
Jan 29 16:54:11 kernel: Workqueue: events_power_efficient phy_state_machine
Jan 29 16:54:11 kernel: pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
Jan 29 16:54:11 kernel: pc : phy_state_machine+0xa4/0x2dc
Jan 29 16:54:11 kernel: lr : phy_state_machine+0xa4/0x2dc
Jan 29 16:54:11 kernel: sp : ffff800082c7bd70
Jan 29 16:54:11 kernel: x29: ffff800082c7bd70 x28: 0000000000000000 x27: 0000000000000000
Jan 29 16:54:11 kernel: x26: ffff6ca080014028 x25: ffff6c9fc336dd00 x24: ffff6ca080012c05
Jan 29 16:54:11 kernel: x23: 00000000ffffff92 x22: ffff6ca0870ad4d8 x21: 0000000000000005
Jan 29 16:54:11 kernel: x20: ffff6ca0870ad530 x19: ffff6ca0870ad000 x18: 00000000fffffffe
Jan 29 16:54:11 kernel: x17: 0000000000000000 x16: ffffd5bee75f9cf4 x15: ffff800082c7b990
Jan 29 16:54:11 kernel: x14: 0000000000000000 x13: ffffd5bee83e6638 x12: 00000000000007bf
Jan 29 16:54:11 kernel: x11: 0000000000000295 x10: ffffd5bee843e638 x9 : ffffd5bee83e6638
Jan 29 16:54:11 kernel: x8 : 00000000ffffefff x7 : ffffd5bee843e638 x6 : 80000000fffff000
Jan 29 16:54:11 kernel: x5 : ffff6ca17efa4e08 x4 : 0000000000000000 x3 : 0000000000000027
Jan 29 16:54:11 kernel: x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff6ca0800ce900
Jan 29 16:54:11 kernel: Call trace:
Jan 29 16:54:11 kernel: phy_state_machine+0xa4/0x2dc
Jan 29 16:54:11 kernel: process_one_work+0x138/0x244
Jan 29 16:54:11 kernel: worker_thread+0x320/0x438
Jan 29 16:54:11 kernel: kthread+0x114/0x118
Jan 29 16:54:11 kernel: ret_from_fork+0x10/0x20
Jan 29 16:54:11 kernel: ---[ end trace 0000000000000000 ]---
Jan 29 16:54:11 kernel: realtek-smi switch0: Link is Down
Jan 29 16:54:11 kernel: realtek-smi switch0: ACK timeout
Jan 29 16:54:11 kernel: realtek-smi switch0: failed to read PHY0 reg 00 @ a400, ret -110
Jan 29 16:54:12 systemd-networkd[367]: end0: Lost carrier
Jan 29 16:54:12 kernel: bcmgenet fd580000.ethernet end0: Link is Down
Jan 29 16:54:12 systemd-networkd[367]: end0: DHCPv6 lease lost
Jan 29 16:54:12 systemd-networkd[367]: wan: Lost carrier
Jan 29 16:54:12 systemd-networkd[367]: wan: DHCP lease lost
Jan 29 16:54:12 dbus-daemon[351]: [system] Activating via systemd: service name='org.freedesktop.hostname1' unit='dbus-org.freedesktop.hostname1.service' requested by ':>
Jan 29 16:54:12 kernel: realtek-smi switch0 wan: Link is Down
Jan 29 16:54:12 systemd-networkd[367]: wan: DHCPv6 lease lost
Jan 29 16:54:12 systemd[1]: Starting systemd-hostnamed.service - Hostname Service...
Jan 29 16:54:12 systemd-timesyncd[336]: No network connectivity, watching for changes.
Jan 29 16:54:12 dbus-daemon[351]: [system] Successfully activated service 'org.freedesktop.hostname1'
Jan 29 16:54:12 systemd[1]: Started systemd-hostnamed.service - Hostname Service.
Jan 29 16:54:12 systemd-hostnamed[515219]: Hostname set to <nas> (static)
Jan 29 16:54:15 kernel: net end0: non-realtek ethertype 0x0080
Jan 29 16:54:15 systemd-networkd[367]: end0: Gained carrier
Jan 29 16:54:15 kernel: bcmgenet fd580000.ethernet end0: Link is Up - 1Gbps/Full - flow control rx/tx
Jan 29 16:54:15 systemd-timesyncd[336]: Network configuration changed, trying to establish connection.
Jan 29 16:54:15 kernel: realtek-smi switch0 wan: Link is Up - 1Gbps/Full - flow control off
Jan 29 16:54:15 systemd-networkd[367]: wan: Gained carrier
Jan 29 16:54:15 systemd-timesyncd[336]: Network configuration changed, trying to establish connection.
Jan 29 16:54:16 kernel: net end0: non-realtek ethertype 0x0080
Jan 29 16:54:16 kernel: net end0: non-realtek ethertype 0x0000
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x0080
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0xadad
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x0000
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x01c0
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0xa767
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0xa767
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x0000
Jan 29 16:54:20 kernel: rtl8_4_read_tag: 13 callbacks suppressed
Jan 29 16:54:20 kernel: net end0: non-realtek ethertype 0x0000
Jan 29 16:54:20 kernel: net end0: non-realtek ethertype 0x0000
After this, the log just spams the "non-realtek ethertype" message over and over until shutdown. I assume this is due to some sort of offset issue in the frame data.
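For anyone who wants to check whether the bogus ethertype values follow a pattern, here is a minimal sketch that tallies them from journal lines. It assumes the message format shown in the log above; the function name `tally_ethertypes` and the `SAMPLE` excerpt are only illustrative, and the regular expression would need adjusting if your kernel words the message differently. As far as I know, the tagger expects the Realtek ethertype 0x8899, so every value counted here is a frame where something else sat in that position.

```python
import re
from collections import Counter

# Matches the tagger warning as it appears in the log above and
# captures the four-digit hex value that was found in place of the tag.
ETHERTYPE_RE = re.compile(r"non-realtek ethertype (0x[0-9a-fA-F]{4})")


def tally_ethertypes(lines):
    """Count how often each unexpected ethertype value is reported."""
    counts = Counter()
    for line in lines:
        match = ETHERTYPE_RE.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return counts


# Excerpt copied from the log above (16:54:16 to 16:54:17).
SAMPLE = """\
Jan 29 16:54:16 kernel: net end0: non-realtek ethertype 0x0080
Jan 29 16:54:16 kernel: net end0: non-realtek ethertype 0x0000
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x0080
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0xadad
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x0000
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x01c0
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0xa767
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0xa767
Jan 29 16:54:17 kernel: net end0: non-realtek ethertype 0x0000
""".splitlines()

if __name__ == "__main__":
    # Print the values from most to least frequent.
    for value, count in tally_ethertypes(SAMPLE).most_common():
        print(value, count)
```

To run it against a full log, replace `SAMPLE` with the lines of a saved log file. Keep in mind that the kernel rate-limits this message (note the "callbacks suppressed" line), so the counts only reflect the warnings that were actually printed.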
@Antony Derham, it is my belief that you have a loose connection with the Realtek IC and it was exposed by the move to the new case. I believe this is a manufacturing defect. We'd like to go ahead and replace this board. We'll get one shipped out to you shortly. We apologize for the inconvenience.
I first noticed the issue a while after installing the board into a new case and mounting it in a rack. I didn't notice any problems at first, so I'm not sure the new case is involved, but after a few weeks the error began occurring often. I was using the standard OS image from the Axzez downloads page and am still using that system, but I have since compiled a more recent kernel version using the same config and Axzez patches. Unfortunately, that didn't fix the issue either.
@Antony Derham,
Thank you for bringing this to our attention. Can you tell us what version of our OS you are using? Could you also describe your hardware configuration, just in case there is anything unusual? When did you start seeing this: was there a significant change, or did it just start happening? Did you experience this with other versions of our OS?
Axzez Support