Thursday, September 30, 2010

[solved] The WD WD5000BEVT hard disk self-destruction

This goes out to all owners of a QNAP SS-839 Pro. I've owned the device for about 1.5 months now, and in general I'm pretty happy with it. However, the load cycle behaviour of the hard disks I use in the QNAP (8x WD5000BEVT 2.5-inch drives) made me doubt that my happiness would last long.
The problem became clear when I checked the SMART details of each drive in the admin interface. Only a few days had passed since I first powered the machine up, and the load cycle count was already beyond 3000 for most of the drives. That seemed excessive: it meant each drive was parking its heads (which is what a load cycle counts) and waking up again about 400 times a day! It's just a belief, but I think hard disk drives last longer if they run 24/7 instead of changing their power state that often. The QNAP options didn't seem to contain any switches to influence this from the host side.
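To put the numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes a rated limit of 600,000 load cycles, which is my assumption for illustration only (check the data sheet of your own drive), and it assumes the daily rate stays constant. The function name is made up for this example.

```python
def years_until_limit(current_count, cycles_per_day, rated_limit):
    """Years until the load cycle counter reaches rated_limit at a constant rate."""
    if cycles_per_day <= 0:
        raise ValueError("cycles_per_day must be positive")
    remaining = max(rated_limit - current_count, 0)
    return remaining / cycles_per_day / 365.0


# Figures from this post: about 3000 cycles so far, growing by about 400 per day.
# ASSUMPTION: 600,000 is an illustrative rating, not a value confirmed in this post.
ASSUMED_RATED_CYCLES = 600_000

years = years_until_limit(3000, 400, ASSUMED_RATED_CYCLES)
print(f"About {years:.1f} years until the assumed rating is reached")
```

Under these assumptions the counter would hit the limit after roughly four years of 24/7 operation, which is not much for a NAS that is meant to run for a long time.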

SMART Statistics Illustrating the Problem

This is the chart that shows the load_cycle_count (193) SMART indicator for all of the drives. Note that I disconnected drive 8 on August 22nd. It was configured as the hot spare drive, and I didn't want to continue wasting its lifetime.
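I read the values from the QNAP admin interface, but on any Linux machine with smartmontools installed you could collect the same counter yourself. The following is only a sketch: it assumes the usual table layout printed by "smartctl -A", the sample text is made up, and the function names are mine.

```python
import subprocess
from typing import Optional


def parse_load_cycle_count(smartctl_output: str) -> Optional[int]:
    """Return the raw value of SMART attribute 193, or None if it is not listed."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, name, flag, value, worst,
        # thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0] == "193":
            return int(fields[9])
    return None


def read_load_cycle_count(device: str) -> Optional[int]:
    """Run 'smartctl -A' on a device (needs smartmontools and usually root)."""
    result = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_load_cycle_count(result.stdout)


# HYPOTHETICAL sample output, shortened to two attribute rows.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       504
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       8800
"""

print(parse_load_cycle_count(SAMPLE))  # prints 8800
```

Calling read_load_cycle_count("/dev/sda") once a day and logging the result would give you the raw data for a chart like the one above.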

SMART Statistics

I googled for these symptoms and found a considerable number of reports of this problem. Again and again, Linux systems and 2.5-inch drives were affected, regardless of whether a NAS or a classic PC was used. There was some advice, too, but it required manufacturer-specific tools that I couldn't use, and WD didn't offer any tools for the 5000BEVT drives. The WD data sheets didn't contain any usable information on the power saving behaviour of the drives, so I decided to contact WD.

WD Support

This is the text of the e-mail I wrote on August 19th:

Dear Sirs,

I have a question regarding the general power saving behavior of Scorpio Blue hard disk drives of which I own eight units that run in a QNAP SS-839 Pro NAS enclosure as a RAID-6. Watching the SMART data of the drives from time to time, I found out that the load cycle count of each disk goes up by 300 to 400 each day! That seems too much for my taste, and I'd like to find out why this happens.
The NAS is up and running all the time, so there is no impact on the general operation, but I'm worried about the reduced lifetime of the disks caused by these frequent shutdowns.
I noticed that the one drive that I have assigned as the spare drive does not shut down as often as the other drives. While most drives have about 8800 load cycles by now (after 3 weeks), the spare drive has ~5400 cycles. It is still questionable why the spare is this active, as it should not be doing much except being checked every once in a while. Instead it is used nearly as much as all the other disks, which makes me worry whether it is capable of saving the RAID for long once another disk fails.
I haven't found any options in the QNAP device software to control any of this. Currently it is unknown who is responsible for the behavior at all, either the NAS device or the disks. People in the QNAP forums couldn't help me so far, I hope you can help me find out.
The question is, do Scorpio hard disks shut down on their own, e.g. after being idle for some minutes to save power? Is there any way I can control this? I'd prefer to have each disk run 4 hours or so before it shuts down. In my opinion it is better for a disk to keep running than to shut down and spin up again constantly. Am I right with this assumption? I cannot find a tool on the WDC web site that would let me view or change the hard disk parameters. Does such a tool exist? I know from other HDD manufacturers that they deliver tools to adjust parameters such as AAM, and read detailed SMART data. Maybe WD has some tool of this kind, too?
What do you recommend I should do?

Thanks a lot for your support!
Best regards,

Johannes Franke

And wow! Indeed, the reply came in the evening of the same day, but it was rather unsatisfying:

Dear JOHANNES,

Thank you for contacting Western Digital Customer Service and Support. My name is Tahimi.

I truly apologize for the inconvenience you are currently experiencing. Unfortunately Mr. Franke the first problem you are facing are the drives you are using on a RAID. Unfortunately we don't support Scorpio drives on a RAID array because they don't have the TLER feature enabled.

However, another bad news is that we don't have any feature or tool to manage the power or sleep timer of the drive.

To build a RAID array I recommend you to use our RAID edition drives like the RE3 or RE4. Please use the following link to get more info about the RAID edition drives.
http://www.wdc.com/en/products/index.asp?cat=2

If you have any further questions, please reply to this email and we will be happy to assist you further.

Sincerely,
Tahimi
Western Digital Service and Support
http://support.wdc.com

TLER (time-limited error recovery) is a nice feature that caps the time a drive spends trying to recover from a read or write error, so that it reports the problem to the host controller quickly and lets the controller decide how to deal with it. Without this feature, a hard disk has about 8 seconds to come back with the requested data after the controller has sent the request. If it takes longer, the controller automatically sets it to FAILED state, and the user can't do anything but replace the drive. So TLER may help you use drives longer, but it is certainly not required for normal RAID operation. Furthermore, the advice to use the RE3 or RE4 series is of no use to me, as those drives are only available in the 3.5-inch form factor. Tahimi probably didn't look up what an SS-839 Pro is and which drives it supports.
I was also mad at QNAP because they list the 5000BEVT as compatible with the SS-839 Pro [1]. WD's answer did not exactly confirm that…

Finding the Needed Hint

After this, I tried tweaking the QNAP settings: I removed all power saving settings as far as the admin interface offered them, and also removed the disk test jobs that I had created to run a daily quick test and a weekly full test on each drive. This brought the load cycle growth down to about 10 per day per disk (August 25 in the chart). But for some reason, something must have changed between August 30th and September 8th that made the load cycles climb again. I cannot remember what I did, and I found no way to return to the previous state that I considered good.
Up to this point I didn't even know whether the frequent load cycling came from the QNAP and one of its built-in power saving mechanisms, or whether the hard disks shut themselves down after some idle time, so I investigated a little more yesterday. Victor Meldrew’s blog [7] was very interesting to read and pointed exactly in the direction I wanted. Thanks, Victor!

Enter: wdidle3!

Even though WD support denied it, there is a tool with only one purpose: tweaking the built-in "idle3" timer, which by default makes the disk park its heads after eight seconds (!) of idle time. WD calls this "IntelliPark", I'd rather call it "StupiSuicide"... oh well, get the tool here [2] or here [3].
The catch is that wdidle3 is a pure DOS tool; it cannot be run in any Windows environment. If you try, you will just be informed that the application is not allowed to run the way you intended. In other words, you need a DOS environment. Yes, such things still happen!

Creating a DOS Environment

Nowadays, FreeDOS and UBCD (the Ultimate Boot CD) are pretty popular and free of charge. I chose UBCD [5], but you can just as well use FreeDOS [4]. You will also need a PC that supports booting from USB memory sticks and features a built-in SATA controller. Third-party controllers will most probably not be recognized by the tool.
To customize UBCD and run it from a USB stick instead of a CD, follow the instructions at [6]. I skipped the chapters from "Adding floppy images" through "Generating customized ISO image", and instead placed wdidle3.exe in the ubcd\tools\win32 folder inside the path to which I had unpacked the ISO. It can be run from there after booting. Creating the bootable USB stick is described in the chapter "Making UBCD memory stick" of the customization instructions.

Lights On: Tweaking the Drives

This is what I did (main steps):

•    Created a bootable USB stick with the Ultimate Boot CD plus wdidle3 as described
•    Shut down my PC, disconnected all hard drives from the mainboard, then connected the first of the eight drives I wanted to tweak
•    Placed the USB stick in one of the USB ports
•    Turned the PC back on and went to setup to modify the boot order: USB-Floppy (not USB-CD or USB-Harddisk) should be the first entry to ensure the system boots from the USB stick
•    Watched the system boot from USB. There may be some dialogs during the boot process that you need to confirm.
•    When the main menu appeared, chose UBCD FreeDOS
•    After the command prompt appeared, entered c: to get to the root of the USB stick
•    Entered cd \ubcd\tools\win32

Now you’re ready to use wdidle3 with the hard drive currently connected.

PLEASE NOTE: the steps described here worked for me, but may fail with your hardware. The wdidle3 tool is not officially designed to tweak the Scorpio Blue series of WD drives, and probably you are going to void your warranty once you use it. If you are extremely unlucky, the tool may corrupt your drive’s firmware and render it useless. Please keep in mind that you do this at your own risk. I am not responsible for any loss of data or damage to your hardware.
Please consider performing a full backup of your QNAP (or of each disk you are about to tweak) to make sure that a damaged drive is the worst thing that happens.

You can use wdidle3 with these parameters:

•    /? – displays a command line help
•    /R – reports the current timer status of the disk connected, along with the model and serial number
•    /D – disables the timer completely, i.e. the drive will never shut down on its own even when idle
•    /S{n} – sets the timer to the number of seconds specified in place of {n} (values from 1 to 255)

To disable the timer on the current hard disk, just enter

wdidle3 /d

Again, the hard disk model and serial number are shown, and the message should now also say that the timer is disabled. If so: congratulations! You are done!
I repeated this for all eight drives, and didn’t even need to power the system down and back up to swap one drive for the next. That was a great timesaver, but let me repeat: this is something that no reasonable support person would ever recommend. Disconnecting and plugging in hardware while powered on is a very dangerous game, particularly with internal SATA ports that are not hotplug-enabled. It worked anyway with my Gigabyte GA-880GA-UD3 mainboard, using the following pattern. Feel free to try, but be extremely careful, and remember that it’s at your own risk:

  1. After the current drive is done, disconnect the SATA data cable from it (the smaller of the two plugs)
  2. Then disconnect the SATA power cable (larger plug)
  3. Get the next drive and connect SATA power first
  4. Take a listen - you should hear the drive spin up
  5. Then connect the SATA data cable
  6. Wait some seconds for the drive to be ready
  7. Repeat the command line “wdidle3 /d”, and verify that it shows the serial number of the drive you have connected in step 3
  8. Wait some seconds to ensure that no more writing to the drive takes place
  9. Continue with the next disk at step 1

That way, I disabled the timers of all hard drives within a few minutes. Finally the big moment came: all disks were reinserted into the QNAP (ensure the same order as before!), then powered up, and whoa, all data still there, wonderful! Since then, the load cycle count has not increased on any of the disks. Case closed.
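If you want to verify the result without comparing numbers by hand, a small sketch like the following does the job. The disk names and readings are hypothetical, and the function names are made up for this example; the idea is simply to compare two snapshots of the load cycle count and list every disk whose counter has moved.

```python
from typing import Dict, List


def load_cycle_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-disk change in load cycle count between two snapshots."""
    return {disk: after[disk] - before[disk] for disk in before if disk in after}


def still_cycling(before: Dict[str, int], after: Dict[str, int]) -> List[str]:
    """Names of the disks whose counter increased between the snapshots."""
    deltas = load_cycle_deltas(before, after)
    return sorted(disk for disk, delta in deltas.items() if delta > 0)


# HYPOTHETICAL readings, taken some days apart. disk3 is deliberately
# shown as still counting to illustrate what a missed drive looks like.
before = {"disk1": 8800, "disk2": 8795, "disk3": 8810}
after = {"disk1": 8800, "disk2": 8795, "disk3": 8812}

print(still_cycling(before, after))  # prints ['disk3']
```

An empty list means the timers are really off on all drives. Keep in mind that a few extra cycles are normal whenever the NAS itself is shut down or restarted.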

Conclusion

WD's strategy for making its drives less power-consuming is a double-edged sword: while the drives do save power by quickly shutting down whenever possible, they tend to wear themselves out more rapidly than drives that keep running. In a mobile PC, the hard disk is one of the minor power consumers; it is mostly the display that draws the energy, so this power saving approach goes a little too far in my opinion. Another downside is that the OS is not aware of the drive's behaviour. On many laptops, the result is that the system seems to "hang" from time to time because the OS doesn't even know the drive has shut down. When the drive is accessed the next time (and in Windows, that's rather frequent), it has to spin up again, which causes a delay of a few seconds and makes the system completely unresponsive until the drive is back up.
For sure, WD is not the only manufacturer that implements power saving like this. No manufacturer is likely to let you tweak predefined settings, and most of them will see this as a violation of the warranty conditions. It's a matter of trust. I think the WD drives are well built and will not fail spontaneously. At least that's less likely if they keep running instead of interrupting their operation over and over again. They will probably handle running 24/7 with ease, but only time will tell. If they do fail, I will probably have a hard time getting a replacement, even if this happens within the warranty period. It's sad that there is so little official information about this. I hope this article helps other hardware desperados find what they need more quickly.

[1] QNAP compatibility list for 2.5-inch drives
[2] wdidle3 download at WD
[3] wdidle3 download at private mirror
[4] Use wdidle3 with FreeDOS (German)
[5] The Ultimate Boot CD Download
[6] The Ultimate Boot CD Customizing
[7] I Don’t Believe It! Blog about WD self-torture

Addendum


[2010-01-12 19:28] Today, I inspected the SMART values again, and still the load cycle count has not changed for any of the disks! Wonderful!

Statistics from mid-August 2010 until today

7 comments:

  1. Hi,

    Great job!
    Three weeks ago I bought a Lenovo ThinkPad Edge 15 with a WDC WD5000BEVT on board, and I cannot stand the struggling sound coming from my HDD.

    I would like to use this wdidle3 tool, but I am afraid of losing my warranty (BTW 1 year).

    Currently I have 60 hours of operation on it and 350 start/stop cycles, so not even a small disaster, but this sound drives me crazy!

    The next sad thing is that WD does not recommend, at least officially, the wdidle3 tool for Scorpio Blue hard drives, so I guess I am stuck for one year ;)

    Maybe if somebody has already done this "update" on their Lenovo (or any other laptop with this HDD) and got positive results, I would consider using the wdidle3 tool even now.

    Thank you!
    G.

  2. Dear G.,

    I agree with you that a lot of care is needed. The WDC drives may sound ugly when they park their heads, but especially in a mobile environment such as yours you might consider continuing to use the drive's power saving features; in the end this will give you a longer runtime when you are working on battery power.
    I am pretty sure that you can use WDIdle3 with your Lenovo the same way I described, but again, I have fixed the drives because they work in a 24/7 environment where frequent shutdowns (and the faster decay of the drives) are rather undesirable. That is totally different from notebook usage. The problem here is that current operating systems such as Windows 7 frequently access the disk for indexing, optimization, background services, logging, etc., so the chance that a drive is not used for more than 2 minutes is extremely low, even if you are not actively using the computer. Windows intentionally uses system idle time for optimization processes that would otherwise slow down the system in the user's perception, so some processes wait until no user appears to be present and do their work then. They stop as soon as any user interaction happens, to return the full CPU / disk speed to the user. That behavior is kind of odd because you'd expect your system to start saving power as soon as you don't use it, and Windows works just the other way round.
    From this point of view, it would make sense to keep drives running all the time and leave it up to the Windows power management to shut the drive(s) down. But you should expect a shorter battery runtime. If you can live with that, there should be no technical reason why you could not use WDIdle3.
    About warranty: even if you destroy the drive, a brand new replacement costs around 50 EUR. You won't void your Lenovo notebook warranty when you change the disk drive. And if you create a full backup of the drive, you can't even lose data by this operation.

    Regards,
    Joe

  3. Thank you Johannes.
    Seeing that it worked in your case, and also following this instruction: http://www.synology.com/support/faq_show.php?q_id=407
    I disabled the timer on my WD5000BEVT drive. Since then the LCC (load cycle count) only increases when I shut down my NAS or when the HDD goes into hibernation.
    I took a FreeDOS bootable CD, made a CD image, edited the image with UltraISO to add the wdidle executable, and got things done in 2 minutes.
    Thank you again for your effort to let us know your case.

  4. Hi Johannes,

    thanks for the instructions using Ultimate Boot CD and a USB stick, it worked without any trouble.

    I used wdidle today to stop my WD10EADS, whose load cycle counter was at 1.8 million (!) after 18 months of service in a Linux software RAID1.

    The funny thing is that the second drive in the mirror failed, but as it had a slightly different firmware and drive version it did not have the load cycle power management, so its counter was at around 50! Only this failure made me review the other drive as well...

  5. At 1.8 million cycles and no problem - maybe the drives were designed to handle many start/stop cycles (7.2 million for 7 years)?
    Has anybody tried to set the timer to, say, 2 hours (for 12 cycles per day)?

  6. Nonono, don't fight with DOS when you can do it under Windows - just download HDDScan from http://hddscan.com/, click the fancy big round knob and select features->ide features. Then you can disable everything ;-)

  7. I've made an ISO which contains a bootable FreeDOS and added wdidle3.exe to it. Just download it from http://tibi78.dyndns.org/wordpress/2011/07/wd-load-cycle-count-issue/ , burn it to a disc and follow the instructions.
