BSOD: when experts get it wrong

Intro

BSOD (Blue Screen of Death): the infamous sight of Windows crashing, perhaps taking your work with it, perhaps taking everything with it, as the crash corrupts the file allocation tables beyond repair! (This is why we do back-ups, people!)

The problem with the BSOD, and Microsoft’s cryptic messages, is that they can be a sod to get to the root of, as was the case with me over the past few months. But first a few clarifications:

This isn’t going to be a rant at a particular person, or company. No finger-pointing here.

While the experts got the root of the problem wrong, their assessment of what was (or was not) happening was correct; in most cases they didn’t have the system to fiddle with, only verbal, diagnostic and log reports from me. Their assessment of what I was reporting was correct and, like pieces of a jigsaw, helped me get to the bottom of the problem. Intermittent faults can be horrendous to sort, and sometimes, even when you solve them, the general feeling can be, “It worked, but it shouldn’t have, that makes no sense. But let’s not jinx it, leave it as it is!”

The experts in question were not simply “a neighbour that’s good with computers”; they were long-established tech companies, highly respected and widely used tools, and world-class tech firms. And it was me; and while I’m long since retired, back in the day I was a consultant with decades of experience: system building, networking, support, graphic design, programming. In tech terms, “I know kung fu.”



Error, does not compute

I didn’t keep a log of all the different errors, as I thought I’d fixed it once or twice, but they included:

event_tracing_fatal_error
This is typically a driver issue (missing, corrupted, or out of date), but can be related to viruses, to a Windows update, or in rare cases, hardware issues.
(Or, it turns out, my problem.)


kernel_security_check_failure
As the report on Howto-connect explains, this has all the above and more, but again, not what I was seeing. That’s not to point the finger at this site; Microsoft got it wrong too, as did WindowsReport and others. Many others.


So far, it’s probably drivers, or maybe a virus, but possibly memory. Windows corruption rears its ugly face too.


driver_irql_not_less_or_equal
Yet again, same list of culprits, just in a different order, just not the actual cause, though it did point towards it: ‘incorrect allocation of memory may be a reason’.
Memory was indeed my first thought, and it caught me out.


system_service_exception
This one specifically blamed ‘bhddrvx64.sys’, a Norton 360 security suite file.

This error is particularly gnarly:

System Service Exception is the result of shabby, outdated drivers and missing system resources because of the reasons as malware infection and conflict between programs. Furthermore, incorrect allocating of Memory, bad sector in Hard Disk, RAM, and low disk storage low can play the trick and cause Stop errors. Since the issue is a notorious one and found to cause a series of damages to Windows, we are here today for discussing some very fruitful workarounds.

(How-to connect)

Had all that; it was still pointing me in the wrong direction!


video_scheduler_internal_error

And there you have it, finally, it is caused by a bad Intel or Nvidia HD driver.

No, no it wasn’t!

Besides a particular game, crashes were also caused by paint and graphics packages (e.g. CorelDraw, Paintshop), development packages (e.g. Visual Studio, Unity), and others. A tally from the past week included:

Windows
Unity Editor
Unity helper
Unity helper / Opera
Background Task Host
MoUSO Core Worker process
NVIDIA ShadowPlay Helper
Unity (csc.exe) / CorelDraw
Shell Infrastructure Host / Firefox
Microsoft.Photos.Exe (wasn’t even using it!)
HARDWARE ERROR (Problem Event Name: LiveKernelEvent)

So, while a picture was building up for me, nothing jumped out. It was like those horrid fractal autostereograms.



Side note

I was an authorised dealer for all the major companies (Seagate, Intel, Microsoft, Apple, etc.) and used to build systems, sometimes for the budget market, sometimes for engineers, CAD, etc., sometimes for blue-chip companies.

In my time I had the dubious distinction of being the first person in the world (to report being) infected with a particular virus, on a SCSI driver disk. I saw viruses on magazine cover disks (the magazine folded the next month), I’ve had them direct from Microsoft, and from Intel.

I’ve bought ‘new’ hardware from distributors and suppliers that have had people’s personal data on (so returns), and hardware with a 90% failure rate. I was also the main tech support guy. Some of the horror stories tech support tell are real!

So, partially as a consequence of being ‘behind enemy lines’ for decades, I am paranoid as heck! Not quite in the tinfoil brigade, but I have trust issues, so I don’t buy cheap parts, I don’t go to questionable download sites, especially not for device drivers, and I only use registered software, bought either direct or from reliable sources.

So if something on my system goes wrong I have two modes:

1) Telling my computer (you talk to yours too, right? Right? Good!), “No, you don’t get to do that.”

2) Rip parts out and replace them, or replace the entire system. Glitches and intermittent issues trigger me.

(and now 3, ‘cos 2020: “This year, I tell y’!“)

I have no tolerance for slow or misbehaving tech.



Let me tell you a story

I was procrastinating over an end paper (or two) for university and decided to try a new game, one promoted on Facebook. It was from the Warhammer franchise, on the Windows store, and on Steam, so reliable, right?

Well, the day I installed it my computer crashed, for the first time ever. And it got worse, and worse, to the point where within minutes of loading the game my computer fell over. So, I tried it on my laptop. That fell over. AMD, Intel, Windows store, Steam, no matter what or where, this damned game caused crashes and BSOD. So I went to their Discord support page and asked if it was just me. A few others saw the same. Yet – claimed their support – this had never been reported by anyone else, ever. I glanced at the developer telling the other guy he’d look into it, sighed and Googled. And found others, who’d also “not” reported it.

It should be noted, of course, that correlation is not causation, but this bloody game was now causing bad sectors on my root drive. And, it turned out, appeared to have physically damaged my main memory!

I’ll get to it in a minute but, jumping ahead, I uninstalled the game, replaced the memory chips, reformatted my drives to get rid of every last trace of that game, then reinstalled Windows. Problem solved. It seems.

It should be noted that around the same time, also due to university work, I updated my BIOS firmware because I needed VPN support.



Reports, detection and repair

Over the course of this, sometimes repeatedly, I ran diagnostic and repair tools, reformatted, re-patched the firmware, reinstalled Windows, ripped out stuff. And every time I thought I had it sussed, this happened, again:

It’s exhausting!

It also affected the final grades in one of my Y2 University modules, dropping me from a 90% average to a poxy 68% in the final! Though, if I’m honest, even though all this happened in the middle of my finals, the time and frustration loss only accounted for about 10% to 15% of the marks lost. The rest of it was on me. Annoying though, another day to write and I’d have got a distinction, I reckon.

That said, (post exams, post memory upgrade), my reliability history still looked like this until yesterday:

[Screenshot: Windows Reliability Monitor history]

DIY: How and why

A lot of people are technophobes and will waste money on ‘repair guys’ when half the time you can do some or even most of the grunt work yourself. A decent techie will charge £25 to £40 an hour for this; a shyster might charge you £60 to £200. The difference is that the former knows exactly what he (or she) is doing and is charging you for that experience; the latter often has (maybe) an idea what to do and is intent on ripping you off.
(Note this relates to run of the mill issues, not the “A disgruntled ex-employee deleted and overwrote the company accounts, and the back-ups! We can’t bill customers, can you help?” level of serious).

Anyway, in rough order, my tests and such included:

In Windows, search for: ‘reliability history’

Most of it may be gibberish to you, but it gives a visual idea of how often it’s happening.
If you want to know more, this is a good article by Dell: How to use Windows Reliability Monitor to identify software issues.


The next ones will need to get your hands a little dirty, but it’s not really scary, honest.

Again in Search, type ‘command’ and run the command prompt as administrator (if you can). See below.

[Screenshot: running the Command Prompt as administrator]

From there you can run further executables, starting with:
sfc /scannow
This will look for protected Windows files that have been corrupted and try to repair them.


Next you can run: dism.exe /online /cleanup-image /scanhealth. This will look for other types of corruption in the Windows image. Note that /scanhealth only scans; to attempt a repair, use /restorehealth in its place.


If either or both of these throw up error warnings (e.g. ‘corrupt files found‘) you should run check disk, thus:

chkdsk C: /f /r /x

Check disk will require a reboot and will look for and attempt to repair corrupted file tables and bad sectors.
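For anyone who likes to see the order of play written down, here is a minimal sketch of the sequence above. It doesn't run anything; it just lists the commands in the order described, adding check disk only if one of the first two reported corruption. The function name repair_plan and its two flags are made up for illustration, and you supply the flags yourself after reading the output of each tool.

```python
from typing import List

# The three commands named in this post, in the order they are run.
SFC = ["sfc", "/scannow"]
DISM_SCAN = ["dism.exe", "/online", "/cleanup-image", "/scanhealth"]
CHKDSK = ["chkdsk", "C:", "/f", "/r", "/x"]


def repair_plan(sfc_found_errors: bool, dism_found_errors: bool) -> List[List[str]]:
    """Return the commands to run, in order (hypothetical helper).

    Both flags are things you decide by reading each tool's output;
    this sketch does not execute or parse anything itself.
    """
    plan = [SFC, DISM_SCAN]
    # Check disk is only worth the reboot if either scan reported corruption.
    if sfc_found_errors or dism_found_errors:
        plan.append(CHKDSK)
    return plan


if __name__ == "__main__":
    for cmd in repair_plan(sfc_found_errors=True, dism_found_errors=False):
        print(" ".join(cmd))
```

All three need an administrator command prompt, which is why the sketch only prints them rather than trying to launch them.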


In a similar way, to look for outdated drivers, you can run: devmgmt.msc. It doesn’t actually do much, it’s just a quick way to open up the Device Manager in Control Panel. Any clear problems will be shown via a yellow exclamation warning. It’s up to you to tell Windows to look for updated drivers.

You can also turn off devices that aren’t wanted or required, or that may be causing problems – e.g. network drivers, Bluetooth, webcams, NVIDIA audio drivers.


Depending on how all the above went, and if it helped, you next want to check the memory. It’s not foolproof, but the tester bundled with Windows is generally reliable. ‘Proper’ memory testers test just the memory module and can be very expensive; this is more like holding your finger up to the air to see which way the wind is blowing. Still, better than nothing, usually.

mdsched.exe

This requires a reboot, either immediately, or on the next start-up; the test can take a considerable time (hours), depending on the amount of memory you have, and whether there are errors.

However, remember that with the latter, and indeed most, if not all software tools there is an intrinsic problem: whatever is being tested is part of a tightly bound system, some errors are hidden, others compounded, and not everything is as it seems.

Take the ‘mdsched.exe’ test, for instance. It told me I had bad memory, which fit with a lot of the crash reports, so the solution was to replace the memory. Except the memory, while misbehaving, wasn’t itself faulty. You can get good software trade tools at reasonable prices (but still more than the price of new memory), or discrete hardware DIMM module testers, which can run into thousands of pounds. New memory seemed the cheaper solution.



Extending the search

The errors were all over the place, and I’d ruled out the main memory (‘cos I replaced it), so this was proving to be a nightmare. Culprits include BIOS firmware (bugs), a corrupted BIOS firmware, the mainboard,* or the GPU*. And/or driver(s). Could be the CPU, but I doubted it.

*Have you seen the price of high-end mainboards and GPUs lately!? The RTX 2080 TI graphics card is a wallet choking £1,250, while the new AMD TR4’s come in at up to a grand just for the board, and another £1,399 (24 core) to £3,400 (64 core) for the processor!
😱

At this point I was at the stage of asking the company that built it (Overclockers UK) to take a look, because I was at a loss, but I didn’t want to be without my PC either, I have too much to do.

So I ran more tests, using whatever free (but recommended) software I could get hold of. These included:

gpumemtest.exe from Programming4Beginners.
Small file, just tests video memory. No whistles or bells, but something. No errors reported.

HWiNFO64 from HWINFO.
More of a monitoring tool, but with a few whistles and bells. This also detected no errors while running.

Finally I used OCCT v6.1.0
It’s free for personal use and very competitively priced for professional and corporate licences.

And it works!

Huzzah! Finally we have a winner! 🏅

Almost… 🏅 😩

This is OCCT running now, with a number of apps loaded in the background:

[Screenshot: OCCT test running]

Yesterday, before I worked out what the heck was happening, that test – with nothing else running – was throwing out over 1,000 errors a minute!

*mouths: “a minute!” 😱

Not for long mind, ‘cos it crashed the computer every time it ran! Sometimes in under a minute, once it made it to nearly 5 minutes, but it was like watching a train wreck.


OCCT can perform a number of checks and is meant for stress testing, but here’s what I found:

‘Power’, as the name suggests, pushes the GPU and CPU to test the mainboard and power supply. That was fine.

‘Memtest’ tests your video memory, as long as it’s not overclocked. That agreed with the previous GPU tester, nothing appeared wrong here.

‘3D’, again obvious, runs more tests on your GPU. All good here, Bob.

‘Linpack’ is something from Intel to stress test systems; it has error detection. It found nothing wrong.

The ‘OCCT’ test is a less aggressive large data set stress test to find stability issues. It gagged, and fell to its knees, choking, which was both interesting and mortifying to watch!
🤢

I’d long since worked out it was (probably) memory related due to malloc errors, but how, and which memory? RAM? Video? CPU cache? Mainboard cache? Something else (e.g. damaged DIMM rails)? Was software causing it? Hardware? Drivers?

Yet everything seems to be OK itself, so what’s going on?

All the tests I can run say the video memory, and the (replaced) main memory are fine, but OCCT (v6.1) consistently reports errors with a large data set. OCCT say on their site that:
1 error is too many – and I am getting up to 1,000 a minute before it crashes their test.

Alas, it doesn’t actually say what’s causing it, doesn’t know, doesn’t care, but it’s there in black and white, or red and white at least: “Son, you got problems!”
As they describe it:

CPU:OCCT, Large data set, with Auto number of threads is, hands-down, what will detect CPU/Memory/Motherboard stability issues faster
Starting at how many errors do you consider a computer unstable ? Seriously : One.
On the software point of view, it is as if it asked the for 2 + 2 and got 5 as an answer.

It is because the CPU miscalculated and answered that 2 + 2 = 5 ?
Is it because the memory stored a 3 instead of a 2 ?
Is it because when transferring the 2 from the memory to the CPU, the 2 got transformed to a 3 ?
As there’s no sure way of reproducing the error that occurred, there’s no way of knowing what is going on. To further pinpoint errors, you have to dig yourself.
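To make the “2 + 2 = 5” idea a bit more concrete, here is a rough sketch of that style of check. This is not OCCT’s algorithm (I have no idea what they do under the hood), just an illustration of the principle: do the same deterministic piece of work over a largish data set several times and count how often the answer differs. The function names, the seed and the sizes are all made up for the example. Note the reference answer is worked out on the same machine, so this can only catch inconsistency, and Python is nowhere near a real stress test.

```python
import hashlib
import random


def checksum_of_large_dataset(seed: int, size_bytes: int) -> str:
    """Build a reproducible pseudo-random buffer and return its SHA-256."""
    rng = random.Random(seed)
    # Same seed gives the same bytes every time, so the hash should too.
    data = rng.getrandbits(size_bytes * 8).to_bytes(size_bytes, "little")
    return hashlib.sha256(data).hexdigest()


def count_errors(passes: int = 5, size_bytes: int = 1 << 20, seed: int = 2020) -> int:
    """Repeat the same work and count the passes whose result differs."""
    expected = checksum_of_large_dataset(seed, size_bytes)
    errors = 0
    for _ in range(passes):
        # On a stable machine identical input must give identical output.
        if checksum_of_large_dataset(seed, size_bytes) != expected:
            errors += 1
    return errors


if __name__ == "__main__":
    print("errors:", count_errors())
```

As with the real thing, a mismatch here would only say that something between the CPU, the memory and the bits in between gave a different answer; it would not say which.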

But it did actually tell me what was wrong in its own way, despite all the other tests trying to lynch the video card, having already buried the falsely accused RAM in a shallow grave!



So, I asked my mate, Bob

Actually I asked a few people, and a few others offered advice because, techies.

One was an old friend, a local trader (Express Computers) who had carried on in the trade twenty years after I retired. He offered this opinion:

Could be a windows update that’s triggering it. I remember had something weird very similar a few years ago. Turned out to be a particular Windows update was conflicting with the graphics driver.

One of the Kxxxxxxxxx windows updates.

Explored the Web and found a common theme and it all. Pointed towards a particular update. Once I uninstall the update it was fine. Had to disable ‘Automatically download and install. Updates’. Then changed the option to notify me of updates and allow me to select which ones to install. Every time the suspect one came up I unchecked it.

It wasn’t the case here, but (at the time) it fit with a lot of the error reports.

Unity support were of a similar opinion and a direct crash dump analysis was telling them this:

Upon further inspection, this seems to be an NVIDIA driver issue.
As a workaround, please try the following:
-Update your GPU drivers
-Uninstall GeForce Experience
-Disable any app used to monitor framerate
-Uninstall or reinstall NVIDIA Audio drivers

Some users recently reported that turning off various NVIDIA settings helped with similar crashes.
Steps to restore the settings to the default state:
1. Open the NVIDIA Control Panel
2. Select Manage 3D settings
3. Click the “Restore” button

One more thing to mention, if you’re using any VPNs/proxies, please try disabling them and see if it helps. If it does, please respond with additional information on what kind of proxy/VPN you were using.

I tried all that. Computer crashed again. At the time and even up to yesterday, it did definitely seem to point towards a video problem (I have noticed occasional video glitches for some time) as certain games can cause it, as can graphics and development software (e.g. CorelDraw, PaintshopPro, Unity, Visual Studio), while running text editors and web browsers (regardless of the number of windows open) does not seem to cause crashes.

Thoughts?



Denouement

Denouement is a posh word for ‘the end of the story’. “The final part of a play, film, or narrative in which the strands of the plot are drawn together and matters are explained or resolved.”

It is amazing how much it focuses your mind when the alternative is paying out several thousand quid to replace parts that might – like the memory (possibly) – be perfectly fine.

It started with a game, one that – without question – was a vile memory hog, and that caused an untold number of crashes, bad sectors, and right royally corrupted Windows to the point where a format was needed (SSDs do not like being formatted).

But as I said at the start, correlation is not causation. I had also updated the BIOS firmware for VPN reasons. But the BIOS was stable, and I’d long since reset the VPN, ‘cos it can cause problems.

But I went for another look anyway, ‘cos the 2080 card is eye-watering, and I really hate swapping motherboards at the best of times.

And my BIOS hung! 😨

There’s two things that can give us techies the heebie jeebies:

Needing to patch the firmware on a motherboard (or anything that expensive/critical) – and that.

I have NEVER seen a BIOS hang in my life – and I’ve seen a SIMM burst into flames during a stress test. I’ve put my boot through one of my server cases in frustration after 30 minutes on the phone with the most computer illiterate person on the planet. I’ve heard of a customer removing a video card by pulling it through the back port of a PC with a pair of pliers and demanding it was fixed because the “void if removed” sticky was intact. But the BIOS freezing? Nope.

My jaw dropped.

Literally.

Pretty sure I held my breath and checked my heart was still beating. This is like bunny in the headlights of a speeding car level of freezing. (Me, not the computer, that was just staring back at me, an evil glint in its digital headlights).

So I rebooted and held my breath.

And got Windows, ‘cos it killed my USB ports, and thus my keyboard. Can’t press ‘delete’ to enter the BIOS without it. Luckily a) it was still working, after a fashion, and b) I’d seen it do this before and know how to fix it – pull all the USB cables out and shove ’em back in again.

So I rebooted again, got into BIOS, told it to reset everything to ‘default’ and

it gave me the middle finger! The BIOS screen turned black with a pulsing underscore in the middle of the screen. “Oh, it’s on!” I whispered back.

The wrestling continued for some time, ‘cos I knew I had the bugger trapped.

Took a while, and only one thing saved me – I believe. Some motherboards have a dual BIOS, like the Gigabyte X399. The firmware, somehow, either through a setting change, or something else, had become corrupted and I managed to clear and reset it.

Have to say, fingers crossed, ‘cos I thought I had it worked out last time, but I have put the old (non-faulty) memory back, doubling my RAM, and have been running tests and pushing the computer since. Everything is stable.

HUZZAH!

Anyway, while everything pointed to memory or drivers, the real villain was the BIOS, hiding in plain sight.



Extra bit

Showing my age, back in the day my ‘go to’ diagnostics were from a company called Microhaus. The kit included a pair of excellent P.O.S.T. cards (8-bit ISA and PCI), and software on 5.25″ and 3.5″ disks.

Good times, simpler times. Times when big trade shows for dealers were still a thing, vendors gave out freebies like software, t-shirts, mugs, toys and, umm, socks, while distributors laid on silver waitress meals. Over the years, as margins fell, it became butties, cheap wine and little fluffy things with the vendor’s name on a ribbon. Then nothing. Part of me died when the suits and market men took over the industry. Most if not all of the PC trade magazines are long gone too. Nothing can suck the soul out of a thing faster than avaricious corporate greed, driven by a mindless ‘need’ for ever greater quarterly results.


Surprisingly, I found today, the other ‘go to’ company of the time is still going strong: UXD. ‘Ultra-X Professional Diagnostic Tools’ were highly desirable. Nice to see they are still with us.


These seem to ring a bell too, or at least the PC-Check trademark does. Another for professional dealers and repair shops, but always handy to know. Eurosoft UK sell PC-Check® Diagnostic Software & Hardware Diagnostic Tools


I’ve seen things like this next one before, but not sure if the company (Innoventions) is still trading, so just included out of interest. A lot of motherboards have built in POST now, but there are other tools for USB devices, hard drives, etc. This, from 2018, is for RAM. INNOVENTIONS Launches Memory Testers for DDR4:
“The RAMCHECK LX DDR4 tester is available for $2,495. The DDR4 Pro tester is $3,850.”


Lastly, because it’s free, and because, well, I might need it another time, there’s ESET’s SysInspector, a free PC diagnostic tool. Eset are better known for antivirus and security software, and this is another of their products. The blurb reads:

“Use the ESET SysInspector utility to inspect your PC.
Fix problems straight away, or submit a log to ESET Customer Care for resolution.”


Didn’t get round to using these last two memory (RAM) testers either (I probably should have!), but I’m fairly sure I’ve used one or both in the past.

Passmark offer a range of professional software and hardware tools, from forensic software to PCIe test cards and a PC Test Kit (for around $1,200).


Plus Memtest86, which comes in free and pro versions. Needs installing on a bootable USB stick to do its tests.

Why use MemTest86

Unreliable RAM can cause a multitude of problems. Corrupted data, crashes and other unexplained behaviour.

Bad RAM is one of the most frustrating computer problems to have as symptoms are often random and hard to pin down. MemTest86 can help diagnose faulty RAM (or rule it out as a cause of system instability).

MemTest86 is often used by system builders, PC repair shops, overclockers, computer enthusiasts & PC manufacturers.

There is also Memtest86+, which is based on early versions of Memtest86, but is open source. Does a similar job.

For partition issues you might want to look at TestDisk. Haven’t tried this, though I have used similar (commercial) software in the past, Partition Wizard, I think. Sort of thing that you only play with if you know what you are doing, really. Windows does have a partition manager as well, which does the job, but it’s not pleasant to use!

Hope this page helps!

~ Ack

Ack

Been playing with computers since the stone age, online since the '80s, and developing websites since the '90s.