I use Manjaro Linux with the Cinnamon desktop and sometimes run into system-level issues, but I have no idea how to properly debug them. It doesn’t feel as straightforward as debugging a normal program. What’s the best way or resource to learn system debugging on Linux?
Op should be tagged as a bot.
Sysadmin here, this is my usual flow for various distros
- As /u/FigMcLargeHuge mentions, check recent logfiles in /var/log. Notably /var/log/messages (EL) and /var/log/syslog (Debian), but anything that’s recent.
- journalctl: more and more things are moving to binary logging. If you know the service, then journalctl -u servicename restricts output to just that unit. Also add -f to tail it for ongoing logs.
- dmesg -T: especially at system level, this captures any hardware/low-level logs. (-T reports actual times, not just seconds since boot.)
- Once you have some logs that you think are related, but don’t know WTF they actually mean, you have two options. The first is to google likely strings. This is… ineffective much of the time: accidental misinformation and outdated advice are increasingly common. The answer might be there, but it takes time and can be frustrating to weed out the cruft.
The better way (IMO, and people downvote me for saying this) is to use AI. Get a few lines of logs with the errors, check them for confidential information, and simply paste the suspect lines into chatgpt, gemini, claude, co-pilot, whatever. No need for context, it’ll figure that out. The LLM will, 4 times out of 5, identify the problem very quickly.
Now, once it’s identified that, it will offer to fix it for you. This is where you’ve got to be on your toes as LLMs are really really quick to give bad advice at this level. But that first triage is nearly always worth doing and helps shape your own mind as to what’s going on. AI is still useful for fixing it, but do understand what it’s telling you to do.
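For the “check them for confidential information” step, a rough first pass could be scripted before anything gets pasted. This is only a sketch: the function name redact_log and the three patterns (IPv4 addresses, email addresses, /home/<user> paths) are assumptions chosen for illustration, not an exhaustive list, so still read the result yourself.

```shell
#!/usr/bin/env bash
# redact_log: hypothetical helper. Reads log text on stdin and masks a few
# common kinds of sensitive strings. The patterns are illustrative only and
# will also mask harmless look-alikes (e.g. four-part version numbers).
redact_log() {
    sed -E \
        -e 's/[0-9]{1,3}(\.[0-9]{1,3}){3}/<IP>/g' \
        -e 's/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/<EMAIL>/g' \
        -e 's#/home/[^/[:space:]]+#/home/<USER>#g'
}

# Example: mask a made-up log line before copying it anywhere.
echo 'sshd: Failed password for bob@example.com from 192.168.1.20 (/home/bob/.ssh)' | redact_log
```

Hostnames, API tokens, and serial numbers are not covered here; extra sed expressions could be added the same way.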
If there’s one thing AI is good at, it’s trawling through a hellscape of random data from across time and space. And outputting… something reasonably close.
At least close enough to actually put you on the right path 9 times out of 10. Even if the AI is wrong, it’s still vastly more time-effective than going to Stack Overflow and finding “this was solved” with no solution present.
Hell, I had an issue trying to find documentation for an old game I was modding from 10 years ago that maybe sold 1000 copies. The wiki was dead, the dev’s site was dead, and the Wayback Machine didn’t have the pages I needed.
Some fucking how ChatGPT knew the game, was able to identify it correctly from a script I gave it, then spat out the exact context and documentation I needed.
It was flabbergasting, ’cause at the same time it told me the game came out in 1995 when it came out in 2015… It got basic-ass info wrong but the technical details right.
I was so fucking confused how it managed that.
I have resorted to the AI step also, if Stract.com doesn’t give me a good link, because if I paste a minidlna crash log Google responds with:
- Mini Cooper on sale
- Buy your DAC device here
- want to sign up to streaming music
- network and NAS comparisons
Useless.
At least AI said: based on your error it appears a file in your database has metadata tags it cannot parse properly. Sure enough the tagger I used had applied a tag to a wmv file and Minidlna couldn’t deal with tag1 area vs tag 2 areas used in other file formats.
Did you try to do this workflow with local models? If so, in your experience what are the better models for this?
We did experiment with local models. They were okay, if a little slow with the resources we allocated for testing. Ultimately though, we paid for copilot. I’m still a little sceptical that it won’t leak data, despite the assurances, so I do clean anything sensitive before pasting.
As for best models - generally gpt4 or 5 is my go-to, but the others have their uses. I tend to stick with one until it annoys me, then move on. Claude’s pretty good for code help, imo, but there’s not really a huge difference between them.
What’s your experience?
I do not use online models in general, but my needs are also much smaller. The most I use my local model (via ollama) for is translations. I am always interested in seeing more focused models that we can use on lower-end hardware.
use AI. Get a few lines of logs with the errors, check them for confidential information, and simply paste the suspect lines into chatgpt, gemini, claude, co-pilot, whatever
Concur. I used to put smaller snippets of the logs into Google search to hopefully bring up pages from fellow sufferers of the same malaise; that usually worked, but AI is doing it better now.
If you’re using systemd you should know journald. There are UIs to make searching the journal logs easier, like journald browser
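From the command line, a handful of journalctl filters tend to cover most first-pass searches. The sketch below wraps them in small functions; the function names are made up for illustration, and the unit name is whatever service you are investigating. It assumes a systemd-based system and only prints a note if journalctl is not installed.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around real journalctl options.

# Errors (priority "err" and worse) from the current boot.
boot_errors() { journalctl -b -p err --no-pager; }

# Messages from one unit in the last N minutes (default 10).
unit_recent() { journalctl -u "$1" --since "${2:-10} min ago" --no-pager; }

# Kernel messages only, similar to what dmesg shows.
kernel_log() { journalctl -k -b --no-pager; }

if command -v journalctl >/dev/null 2>&1; then
    boot_errors | tail -n 20
else
    echo "journalctl not found; is this a systemd system?" >&2
fi
```

Depending on group membership, some entries may only be visible with sudo.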
I simply stopped using Manjaro, this resolved all system-level issues I had encountered.
First thing I do is to check the kernel output:
sudo dmesg -Tw
Just as a general rule, I would start checking log files. You can start by searching /var/log for files that have been modified in the last few mins with something like “sudo find /var/log -mmin -10 -ls 2>/dev/null”. That will get you all log files in /var/log changed within the last 10 mins. Then you can tail those or grep them looking for clues. I have done searches of the entire file system looking for log files that were recently modified to find clues. It might also help to send the output to a file so you can view that and scroll up and down rather than just trying to read the output of the find, tail or grep commands. Put a “1>/{path}/filenameyouwanttouse.out” at the end of the command or you can pipe it to the tee command and it will show on the screen and write to the file you specify.
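Put together, the find-then-save approach described above might look like the sketch below. The function name recent_logs and the temporary output file are placeholders, not anything standard; run it with sudo if you need files your user cannot read.

```shell
#!/usr/bin/env bash
# recent_logs: hypothetical helper. Lists regular files under a directory
# that were modified within the last N minutes (default 10).
recent_logs() {
    local dir="${1:-/var/log}" minutes="${2:-10}"
    find "$dir" -type f -mmin "-$minutes" 2>/dev/null
}

# Show the list on screen and also save it for scrolling through later.
out_file="$(mktemp)"   # placeholder location; use any path you like
recent_logs /var/log 10 | tee "$out_file"
echo "Saved list to $out_file"
```

From there, each listed file can be passed to tail or grep as usual.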
Generally, it depends on the issue. The first thing I’d check is journalctl, and if there are any errors, I usually look up “[pasted error] [distro name]” and go from there. If I’m unable to find errors, then my next bet is to look up “[description of issue] [distro name]”. Unless I am directly familiar with the component that is having an issue, I try to see if I can find a solution online first. Of course, I never recommend running commands you read online that you don’t understand, so take it as a learning experience and pull up some man pages to see what everything is doing. By doing that, you can even begin to learn how to debug and fix these issues by yourself. Even just finding issues other people have and proving it isn’t your issue helps narrow it down.
What I will never under any circumstances recommend is using an LLM. Please, just use a normal search engine (I prefer DDG), and find forum posts from real people. Those people are generally capable of understanding what they’re saying, so they won’t give completely made up information based on generation of the most likely next word from the data an LLM model was trained on. Besides, chances are that the LLMs are trained on the data you would find by searching anyway, so why not go straight to the source?
I do find myself having to troubleshoot issues entirely on my own sometimes, but usually those are of my own doing, and I can likely figure out what I did wrong (I host my own server and tinker with it quite often). Of course, since switching to atomic distros on my desktop, I haven’t had any system issues to troubleshoot with it in years. Running Manjaro is practically a guarantee that you’ll have system issues, though. I’ve never had a worse experience with my system than when I ran it, and I’m not alone in that.
Otherwise, if you find yourself unable to find an easy solution, backups are a wonderful thing. My server recently had part of its boot corrupted, and it was just a case of recovering from a backup to restore it. Remember, with backups: 2 is 1 and 1 is none. Data can (and will) get corrupted eventually.
Stract.com, if you remember the good Google times pre-2010
journalctl and log files are very valuable. If it’s specific to an application, running said application in a terminal with verbose output can also give you a clear indication of what’s going wrong.
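When running an application from a terminal like this, errors usually go to stderr, so it can help to merge both streams and keep a copy. A minimal sketch, assuming a hypothetical helper called run_logged; the sh -c command is a stand-in for something like “myapp --verbose”, since the actual verbose flag depends on the application.

```shell
#!/usr/bin/env bash
# run_logged: hypothetical helper. Runs a command with stderr merged into
# stdout, shows the output on screen, and saves a copy to a log file.
run_logged() {
    local log_file="$1"
    shift
    "$@" 2>&1 | tee "$log_file"
}

# Stand-in for "myapp --verbose"; the real flag depends on the application.
log_file="$(mktemp)"
run_logged "$log_file" sh -c 'echo "starting up"; echo "error: config missing" >&2'

# Pull out the lines most likely to matter.
grep -iE 'error|fail|warn' "$log_file"
```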
I’m dyslexic so I get syntax errors all the damn time and thankfully using NixOS it likes to remind me on rebuild how much of an idiot I am.
Worse comes to worse you can always plug the error into an LLM like Claude or Chatgpt. But take that with a grain of salt. It’ll give you a good base to start from for debugging but never trust something like Claude that will constantly tell you “it’s a known issue” when it isn’t.
All this being said, I’ve had the best experience getting help via the relevant application/distro IRC channels on Libera Chat.
There’s an entire step between trying to figure it out yourself and resorting to an LLM which is probably likely to tell you to shove cheese into the USB ports. Regular web searching. Forum and social media posts. The distro’s wiki itself or other such resources. You know, the stuff the AI originally sucked up, mashed together, mixed around, and spat back out.
Right, that’s why I said worse comes to worse and to take it with a grain of salt. For very simple issues it’s fine, beyond that it’s a coin toss. It’s a fine rubber duck. Like if I missed something obvious but I’m just not seeing it, then it might point that out for me. For example, I recently reinstalled my OS and I couldn’t get WireGuard to work, so as a last-ditch effort I plugged it into Claude and it told me that I had forgotten to replace a private key on one of the peers. I had just completely missed it.
There are guides available. Search for ‘Linux kernel debugging’ or ‘Linux module debugging’, depending on which you are more interested in. And, of course, learn about the relevant parts of the kernel.
You might have a look at “Debugging kernel and modules via gdb”. The kernel.org site has a wealth of information.
I plug the error into ChatGPT and usually get pointed in the right direction. I mean, I no longer have a functional laptop and was given instructions on how to build a really good toaster. But hey, I’m learning!
What distro is the toaster running?
As a new user, GPT said it would be best to install a beginner-level distro like Arch.
“beginner level”
On Guix, I could bisect (like git bisecting) my OS. So usually what would happen is:
- I’m running in a good state
- I accidentally mess something up
- oh no
- guix system switch-generation $n, where n is the last known good state
- then binary search until I find the first bad generation
- look at the config changes I made
- fix them
- back to good state
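The binary search step in the list above can be sketched in a few lines of shell. This is an illustration, not a Guix feature: first_bad_generation and is_good are hypothetical names, the check is simulated so the example runs anywhere, and it assumes generation numbers in the range are contiguous and that every generation after the first bad one is also bad.

```shell
#!/usr/bin/env bash
# first_bad_generation: hypothetical helper. Binary-searches the range
# (good, bad] and prints the first generation for which is_good fails.
first_bad_generation() {
    local good="$1" bad="$2" mid
    while [ $((bad - good)) -gt 1 ]; do
        mid=$(( (good + bad) / 2 ))
        if is_good "$mid"; then
            good="$mid"
        else
            bad="$mid"
        fi
    done
    echo "$bad"
}

# Simulated check for the demo: pretend generation 37 introduced the bug.
# In real use this would be something like
#   sudo guix system switch-generation "$1"
# followed by a reboot if needed and a manual or scripted test.
is_good() { [ "$1" -lt 37 ]; }

first_bad_generation 30 42   # prints 37
```

With 12 candidate generations this takes three checks instead of up to twelve, which matters when each check means switching generations.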
Unfortunately, my laptop is too new so Guix isn’t fully compatible with all my hardware. (Yes, I was using nonguix)
But that was a pretty neat experience compared to debugging something on Arch.
I use a distro that doesn’t fall into such things all the time. Linux Mint works great for me, as does Debian.