[Discuss] ATA Access Errors For Spinning Disk
Steve Litt
slitt at troubleshooters.com
Sun Dec 17 21:05:39 EST 2023
jbk said on Sun, 17 Dec 2023 10:13:36 -0500
>I periodically get access errors for a specific spinning
>disk that I have done these things to diagnose:
>Changed Sata Cable
>Switched Sata bus on MB
>Run E2fsck on the 3 formatted ext4 partitions w/ no errors found
>Run smartctl -a: all results within norms
>Run smartctl -t short: No errors found
>
>Disk operation age is about 7.5 years with around a couple
>hundred starts. It has been in continuous operation for over
>8 years except during vacations. On occasion the disk
>partitions will become unmounted and a mount -a will remount
>the partitions as a different device from lets say sda to
>sdd. I've not lost any data and I do regular backups to
>another device that's rotated out of system.
>
>I seem to have always had these errors present on this MB
>that is maybe 4 or 5 years in operation. Any thoughts on the
>cause of this issue? Do others see this behavior on occasion
>on systems they manage?
>
>On this same system my Rocky OS on an SSD is showing no
>issues at all. Same operation age as the spinner.
I really like the troubleshooting strategy you've pursued in trying to
find the root cause of this intermittent problem. As we all know,
intermittents are much more difficult to diagnose than reproducible
symptoms. If you look at the Universal Troubleshooting Process (UTP) on
Troubleshooters.Com, you'll see that UTP step 5, Corrective
Maintenance, is extremely powerful and necessary with intermittent
problems. I have some suggestions for Corrective Maintenance and
further diagnostic tests...
* You get occasional disk errors, any of which could cause data
corruption. To prevent things from getting worse, boot a rescue
distro and ddrescue your current disk to a larger disk, and if you
ever mount that backup disk, mount it read-only.
* Lubricate all electronic contacts for all cables, daughter cards, RAM
sticks, switches with associated cables, and jacks and plugs for all
peripherals. Apply the lubricant to conductive surfaces on both plug
(male) and jack (female), then insert and remove twenty times to bust
off all corrosion. Please take 10 minutes to read this 20 year old
discussion of electronic lubrication:
http://troubleshooters.com/tpromag/200310/200310.htm
I've used transmission fluid, WD-40, Lube-Job electronics lubricant,
Breakthrough CLP, Deoxit Gold, Superslick Slick Stuff, and CRC
QD Contact Cleaner, and was very satisfied with all of them. I
currently use mostly Superslick Slick Stuff. The important thing is
that there's residual lubrication to prevent build-up of Fretting
Corrosion. Stabilant 22 and Deoxit Gold are the safest to prevent
damage to non-metals and prevent conduction between non-mating
surfaces, but they're pretty expensive. My experience has been that
the cheaper products work fine as long as I carefully limit
application to the mating conductors.
Lubricating all mating electronic contacts takes 2 or 3 hours, but
doing so can save you weeks of frustration if an intermittent is
being caused by fretting corrosion between electronic contacts. I do
complete electronic contact lubrication during the initial build of all my
computers. Because you've observed this intermittent since you bought
the mobo several years ago, lubricating the RAM stick contacts is
especially important, as it's likely those sticks have been in place
since you bought the mobo.
* Run a complete RAM test overnight by booting a memtest86 CD or thumb
drive. Get rid of any sticks with errors. Intermittents are too
expensive to try to limp along with RAM errors. Note that if you're
not using UEFI, you'll need an older version of memtest86.
* Temporarily swap in a known good power supply, use for several days,
and see whether the problem has gone away. If so, use the known good
power supply or a known good newly purchased power supply. If the
problem persists, put back the original power supply at the
conclusion of troubleshooting.
* Power switches and reboot switches can go intermittent and cause
hangs and spontaneous reboots. If I have suspicions of these things,
I disconnect the reboot switch (you can always unplug the computer
for an abrupt shutdown), and temporarily disconnect the power switch,
starting and stopping the computer by CAREFULLY shorting the power
switch pins with a screwdriver. I then run the machine for about 3
days to see if the problem really went away. If the problem appears
to be the power switch, I replace it with a cheap, wired, no light, 2
contact doorbell switch, available at home warehouse stores. If you
can't find it there (most doorbell switches are now lighted), I'm
pretty sure that this is what you need:
https://www.ebay.com/itm/155929670486 . You might need extra wire so
your front panel can be removed enough to service the front parts
without needing to disconnect the power button leads and fish them
around the motherboard and through the chassis.
* If you're overclocked, roll it back to the non-overclocked
frequencies. Often simply telling the BIOS to reset to its factory
state is a great way to rule out a whole bunch of BIOS caused
problems. As always, test for several days to make sure the
intermittent symptom really went away.
* Use various sensor programs to check various CPU temperatures and
disk temperatures. If temperatures even begin to approach maximum
specs, take
* Try to observe whether this intermittent symptom occurs significantly
more when running a specific set of software, and act accordingly.
* Boot a radically different distro, use for several days, and see if
the intermittent symptom still occurs. If so, you've for the most
part ruled out your distro, software, and config settings. If not,
investigate your software and configs.
* If you have a known good spinning rust hard disk bigger than the
current one, you could ddrescue the current one onto the new, bigger
one, test for a few days, and if the symptom doesn't recur, the hard
disk had a problem not detected by smartctl.
* If none of the preceding works, you need to consider how much time,
money and energy you're willing to throw at this intermittent
problem. Personally, at this point, I'd bite the bullet and buy a new
motherboard, RAM and processor and processor heat sink. Be sure to
use high quality thermal heat sink compound between processor and heat
sink, be sure to remove any labels the manufacturer stupidly put on
the processor where it should be mating with the heat sink, and clean
all label adhesive residue before applying heat sink compound. Don't
cheap out on the heat sink: A lot of times the heat sink packaged
with the processor is great for email and light web browsing, but
allows overheat in intense operations like compiling a kernel.
Remember, you want this new setup to last for many years.
* If you're going to buy a new mobo, CPU and RAM anyway, it costs you
nothing to take the very risky step of updating your BIOS. Who knows,
it might work. Because of risks involved in BIOS updates, I don't
recommend them except in cases where your symptom is a well known
effect of your specific BIOS version, or else when you're about to
throw the mobo in the trash anyway. Be sure to run the computer on a
known good uninterruptible power supply when updating your BIOS so
your electric company's problems don't brick your computer.
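A minimal sketch of the imaging step from the first bullet, assuming GNU
ddrescue is available on the rescue distro. The device names /dev/sdX and
/dev/sdY, the mapfile name and the /mnt mount point are placeholders, not
your real devices; check them with lsblk first, because the target disk
gets overwritten. The script only prints the commands unless DRY_RUN=0 is
set.

```shell
#!/bin/sh
# Placeholder names (assumptions): replace after checking lsblk.
SRC="${SRC:-/dev/sdX}"      # the flaky source disk
DST="${DST:-/dev/sdY}"      # the larger target disk; it WILL be overwritten
MAP="${MAP:-rescue.map}"    # ddrescue mapfile, lets an interrupted run resume

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

rescue_plan() {
    # Pass 1: copy everything that reads easily, skipping the scraping phase (-n).
    # -f is required because the output is a device, not a regular file.
    run ddrescue -f -n "$SRC" "$DST" "$MAP"
    # Pass 2: revisit the bad areas with direct disc access (-d), up to 3 retry passes.
    run ddrescue -f -d -r3 "$SRC" "$DST" "$MAP"
    # Mount the copy read-only; "${DST}1" assumes the first partition is named sdY1.
    run mount -o ro "${DST}1" /mnt
}

rescue_plan
```

Keeping the same mapfile between the two passes is what lets the second
pass go after only the areas the first pass could not read.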
I'm very aware of the time and energy the preceding steps require. Your
computer is now 8 years old and probably anemic by today's standards.
If your current computer has enough capability for your needs, you
could probably buy a whole new computer of equal capability for under
$700. If you want to replace it with a modern computer with huge
capacity, you can probably do it for between $1500 and $2300. Remember,
the alternative is all the troubleshooting steps I listed (and probably
other people can think of even more).
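For the temperature check in the list above, here is a rough sketch that
pulls the drive temperature out of smartctl -A output. It assumes the
drive reports SMART attribute 194 (Temperature_Celsius) with the raw value
in the tenth column, which is common but not universal. The helper names
disk_temp and check_temp, the 10 degree margin and the 60C limit in the
usage line are illustrative assumptions; take the real limit from the
drive's datasheet.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print the raw value of
# attribute 194 (assumed to be the drive temperature in Celsius).
disk_temp() {
    awk '$1 == 194 { print $10; exit }'
}

# usage: check_temp LABEL TEMP LIMIT
# Warn when TEMP comes within 10 degrees of LIMIT (margin is an assumption).
check_temp() {
    if [ "$2" -ge $(( $3 - 10 )) ]; then
        echo "WARN $1: ${2}C is within 10C of the ${3}C limit"
    else
        echo "ok $1: ${2}C"
    fi
}

# Example, run as root against a real disk (device name is a placeholder):
#   check_temp sdX "$(smartctl -A /dev/sdX | disk_temp)" 60
```

CPU temperatures can be read separately with the sensors command from
lm-sensors, if it is installed and configured on your distro.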
HTH,
SteveT
Steve Litt
Autumn 2023 featured book: Rapid Learning for the 21st Century
http://www.troubleshooters.com/rl21