April 21, 2010

Perhaps They Should Have Tested More - McAfee

A faulty virus definition update causes McAfee's anti-virus software to delete svchost.exe on machines running Windows XP Service Pack 3, so many key Windows services fail to start. The file is mistakenly detected as W32/wecorl.a. When svchost.exe fails to start, Windows automatically reboots.

  • National software glitch
  • Hundreds of thousands of computers disabled
  • A huge disruption
  • Strangely similar to a widespread virus outbreak
  • Software update caused the anti-virus program to misidentify a harmless file (svchost.exe) as infected
  • Endless cycle of rebooting
  • A chain of uncontrolled restarts and loss of networking functionality
  • Shut down the State of Vermont's computer network
  • Many hospitals postpone elective surgeries
  • Organizations that had to close for business until the problem was fixed
  • According to Ars Technica it "Would be trivially detected with even basic QA, which makes the regularity of such problems perplexing"
  • According to Amrit Williams (a former director of engineering with McAfee) it shows "a complete failure in their quality control process"
  • Unmitigated disaster for McAfee
McAfee says:
"We are investigating how the incorrect detection made it into our DAT files and will take measures to prevent this from reoccurring."
"Mistakes happen. No excuses. The nearly 7,000 employees of McAfee are focused right now on two things, in this order. First, help our customers who have been affected by this issue get back to business as usual. And second, once that is done, make sure we put the processes in place so this never happens again."
This is not the first time for McAfee. Back in 2006, they similarly flagged system files as infected.

Perhaps they should have learned their lesson in 2006.  Perhaps they should have tested more.

Updated, April 23, 2010
From Barry McPherson on McAfee's blog:
"Of course many of you are asking how the faulty DAT made it past our quality assurance checks. The problem arose during the testing process for this DAT file. We recently made a change to our QA environment that resulted in a faulty DAT making its way out of our test environment and onto customer systems.

To prevent this from happening again, we are implementing additional QA protocols for any releases that directly impact critical system files. In addition, we plan to add capabilities to our cloud-based Artemis system that will provide an additional level of protection against false positives by leveraging an expansive whitelist of critical system files."
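The whitelist idea in that statement can be pictured as a release gate: scan a known-clean system with the candidate definitions and refuse to ship if any critical system file is flagged. What follows is only a rough sketch under stated assumptions, not McAfee's actual process. The names `CRITICAL_FILES`, `false_positives`, and `safe_to_release` are hypothetical, and a real whitelist would more likely match on file hashes than on file names.

```python
from pathlib import PureWindowsPath

# Hypothetical whitelist of critical system file names (illustrative only;
# a real whitelist would more likely be keyed on file hashes).
CRITICAL_FILES = {"svchost.exe", "winlogon.exe", "lsass.exe"}


def false_positives(detections):
    """Return the detections that hit a whitelisted critical system file.

    `detections` is an iterable of (file_path, detection_name) pairs, for
    example the result of scanning a known-clean Windows install with a
    candidate definition file.
    """
    hits = []
    for path, name in detections:
        # Compare on the lower-cased file name, since Windows paths are
        # case-insensitive.
        if PureWindowsPath(path).name.lower() in CRITICAL_FILES:
            hits.append((path, name))
    return hits


def safe_to_release(detections):
    """A candidate update passes the gate only if no critical file is flagged."""
    return not false_positives(detections)
```

With a gate like this, a detection of `C:\WINDOWS\system32\svchost.exe` as W32/wecorl.a on a clean machine would block the release instead of reaching customers.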

A change to the QA environment caused a faulty DAT to get released to production?

And in his blog, technology writer Ed Bott tells us that he received a document from an anonymous source that seems to be a pre-scrubbed (and perhaps more telling) version of the post on the McAfee blog.

Among the interesting nuggets:
"Specifically, XP SP3 with VSE 8.7 was not included in the test configuration at the time of release."

They left Windows XP SP3 out of their test matrix? Wow!
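A gap like this is the kind of thing a simple coverage check on the test matrix can catch before release. Here is a minimal sketch, assuming the supported configurations can be listed as operating system and product version pairs. The function name `missing_configurations` is hypothetical, and the only configuration taken from the quote above is XP SP3 with VSE 8.7; the other entries are placeholders.

```python
from itertools import product


def missing_configurations(required_os, required_products, tested):
    """Return every required (os, product) pair that was never tested."""
    required = set(product(required_os, required_products))
    return sorted(required - set(tested))


# "XP SP3" and "VSE 8.7" come from the quoted document; "other-os" and
# "other-product" are placeholders for the rest of a supported matrix.
required_os = ["XP SP3", "other-os"]
required_products = ["VSE 8.7", "other-product"]

# Configurations actually exercised before release (illustrative data).
tested = [
    ("other-os", "VSE 8.7"),
    ("other-os", "other-product"),
    ("XP SP3", "other-product"),
]

gaps = missing_configurations(required_os, required_products, tested)
if gaps:
    print("Untested configurations:", gaps)
```

Failing the build whenever the list of gaps is non-empty turns "we forgot a configuration" into a visible error rather than a surprise in production.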