Having a strange problem with network hanging that I can't troubleshoot

My Purism 14 (running PureOS and i3) is having an issue lately. Every so often, at random intervals, it locks up:

  • shows one core as running at 100% on htop
  • locks any use of sudo (even “sudo echo hello world” does nothing)
  • any process that goes anywhere near the network hangs (but the rest of the application remains responsive - so I can read old emails in Thunderbird, but it’s stuck fetching new ones)
  • any disk access is fine, no problem (luckily - I can save all my work before rebooting)

I can’t work out what is going on (partly because I can’t run anything that requires sudo).

Rebooting or powering off using any command line or GUI tool does nothing. I have to hard-reboot the laptop. When it comes up again, everything is fine.

Anyone experienced anything similar, or have any ideas what’s going on?

You should be able to figure out which process it is that is using 100% CPU. If you open a terminal window and just run the regular “top” command there, which process is shown highest there? The name of the process is in the rightmost column, under “COMMAND”.

I thought the same. But nope. There isn’t one. All processes look like they’re behaving normally, there’s nothing doing anything weird. But one core is running at 100% (and the fans are blowing to match that).

Hm. The top command shows a list of running processes with the following columns of info:

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND

Below that comes one line for each process.

If you have 100% CPU usage in total, that means that the values in the %CPU column should add up to 100% together. Is that what you see?

Does the highest process have 100% CPU usage, or do you have two processes using 50% each, or some other combination?

What are the names of the processes with highest %CPU values?
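
If it helps to capture this as text rather than a screenshot, here is a minimal sketch. It assumes the procps `top` (batch mode via `-b`); the `second_sample` helper name is made up for illustration. It keeps only the second frame of a two-frame batch run, since the first frame may not cover a full sampling interval.

```shell
# Filter for `top -b -n 2` output: print the column header and the first
# few task lines of the second frame only (helper name is hypothetical).
second_sample() {
    awk -v limit="${1:-10}" '
        /^ *PID / { sample++; shown = 0; if (sample == 2) print; next }
        sample == 2 && shown < limit { print; shown++ }
    '
}

# Usage while the machine is misbehaving:
#   top -b -n 2 -d 1 | second_sample 10
```

Kernel threads show up in this list too, so a busy kernel thread should appear near the top even when no ordinary application looks guilty.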

If you are cautious and the lockup occurs often enough, open a terminal after booting and run

sudo -i

so you have the root session available for when the lock up occurs.

However this may not bypass the problem.

I’ll have to get a screenshot of htop (and of top, as that actually shows the % utilisation in user/sys/idle etc., which htop doesn’t) the next time it happens. That is usually at least once a day, so it shouldn’t be long.

Good idea. I’ll get a root terminal going now and see how that behaves when it happens next.

Another approach could be to press Ctrl + Alt + F1, log in at the text console, and try sudo there.

P.S.: go back to the graphical session with Ctrl + Alt + F7

Found the culprit: ksoftirqd

[screenshot of top output]

Everything was working fine, only the network had disconnected. I found out the router was having a problem, so rebooted that. Came back to the laptop spinning and unable to do anything that touched the network.

My saved root terminal that I created earlier was non-responsive.

Any ideas? At this point I’m suspecting the network driver, but haven’t much of a clue what to do next about it

It seems you have so many interrupt requests (IRQs) that the kernel can’t process them quickly enough and hands the deferred work to ksoftirqd. It may indeed be network related.

Thanks, yeah, I found that too. I’ll check /proc/interrupts next time.
It doesn’t actually give any answers, though.
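
One thing that might narrow it down next time is comparing two snapshots of /proc/softirqs, which is readable without sudo. The sketch below is only an illustration; the `softirq_delta` name and the /tmp paths are made up. It sums each counter over all CPUs and prints how much it grew between the snapshots, so a runaway class such as NET_RX would stand out.

```shell
# Print how much each softirq counter grew between two snapshots of
# /proc/softirqs, summed over all CPUs (function name is hypothetical).
softirq_delta() {
    awk '
        FNR == 1  { next }                       # skip the CPU0 CPU1 ... header
                  { total = 0; for (i = 2; i <= NF; i++) total += $i }
        NR == FNR { before[$1] = total; next }   # first file: remember totals
                  { printf "%-10s %d\n", $1, total - before[$1] }
    ' "$1" "$2"
}

# Usage (the /tmp paths are arbitrary):
#   cat /proc/softirqs > /tmp/softirqs.1; sleep 5; cat /proc/softirqs > /tmp/softirqs.2
#   softirq_delta /tmp/softirqs.1 /tmp/softirqs.2 | sort -k2,2nr
```

This only shows which kind of softirq is busy, not why, so it is a starting point rather than an answer.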

Do you use iptables or nftables? If yes, can you try removing all the rules to see if that helps?

iptables --list… There’s a ton of them!
The VPN sets a ton of rules, Docker sets a whole chain, and UFW sets some. I’m not really comfortable deleting all of these (I really don’t want to break the VPN, Docker networking, or the firewall). Is there any way of telling which rules are getting used a lot?
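
For what it's worth, iptables keeps a packet and byte counter for every rule, and `iptables-save -c` prints them as a `[packets:bytes]` prefix on each rule line. A minimal sketch that ranks rules by packet count follows; the `busiest_rules` name is made up, and a high count only shows that a rule matches often, not that it is the cause of the load.

```shell
# Filter for `iptables-save -c` output: list rules by packet count,
# busiest first (function name is hypothetical).
busiest_rules() {
    grep '^\[' |
        sed 's/^\[\([0-9]*\):\([0-9]*\)\] /\1 \2 /' |
        sort -k1,1nr |
        head -n "${1:-10}"
}

# Usage (needs root, so run it from a root shell opened in advance):
#   iptables-save -c | busiest_rules 20
```

Each output line is the packet count, the byte count, and then the rule as it would be passed to iptables. IPv6 rules would need `ip6tables-save -c` instead.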

I’m not sure how you could determine that, but I do think that the large list of rules is the cause of ksoftirqd running at 100%.

that would make sense.

At least I have something to go on, now. I’ll investigate further, thanks everyone :slight_smile:

Maybe use iptables-save and iptables-restore (and the corresponding versions for IPv6 if relevant) around the delete?
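
Roughly like the sketch below, which is untested on a setup like yours. The `run` and `flush_rules` helpers and the backup path are made up for illustration. Note that it sets the built-in filter chains to ACCEPT, so the machine is unfirewalled until the rules are restored, and the VPN and Docker rules are gone until then too.

```shell
# With DRY_RUN set, print each command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# Set the built-in filter chains to ACCEPT, then flush all rules and
# delete user-defined chains in each table (IPv4 only).
flush_rules() {
    for chain in INPUT FORWARD OUTPUT; do
        run iptables -P "$chain" ACCEPT
    done
    for table in filter nat mangle raw; do
        run iptables -t "$table" -F
        run iptables -t "$table" -X
    done
}

# Usage as root (the backup path is a placeholder):
#   iptables-save > /root/iptables.backup
#   flush_rules
#   ...check whether the hang still occurs...
#   iptables-restore < /root/iptables.backup
```

Running `DRY_RUN=1 flush_rules` first prints the commands without touching anything, which is a cheap way to check what would happen.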

That’s useful, thanks :slight_smile:

Update: Changed PIA to use WireGuard and the problem stopped happening.

However, now the laptop won’t go to sleep unless the Ethernet cable is unplugged first. I guess WireGuard likes being connected :wink: