Talk:Minimizing Hard Disk Drive Failure and Data Loss
Human error protection
I've moved the following text from the article here. I've compacted and incorporated the ideas behind it in the sections on revision control and on managing backups and revisions. I think it is important that the article stays compact, and that information available in Wikipedia is not duplicated here.
- It is all too easy to accidentally delete a file, or to accidentally delete parts of a file.
- Many system administrators set up a file server that archives earlier versions of useful files.
- The user at his computer (perhaps using a revision control GUI like TortoiseSVN) or a periodic cron program on the file server (perhaps using something like the "backup" option of Unison installed on the user's computer) periodically copies the latest version from the user's computer to the file server.
- Most such systems skip any files that have not changed, and compress the changes of files that have changed, such that storing 100 versions of a file folder, each version slightly different from the next, takes only a little more storage than storing only 1 version of that file folder.
- The key to preventing data loss is that the file server itself is set up in such a way that normal users (or any virus running on the user's machine) can't accidentally (or deliberately) delete or change files on the file server.
- Some system administrators set things up so that obsolete versions are eventually purged. For example, a system administrator might make sure that today's version will still be accessible a month from now by setting up a cron job that periodically scans the archive, keeping at least 1 version more than a month old and automatically purging any versions older than that (a sketch of one such purge policy follows this list).
- (Don't delete *every* version more than a month old; that's a common backup mistake. For example, consider the banner graphic file referenced by every page on a company's website -- often a file that hasn't changed in years. If I accidentally damage it, and then that damaged version gets saved in the archive, I'm going to be upset if I can't revert to "yesterday's version" because the system automatically purged that "obsolete 2-year-old version".)
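The following is a minimal sketch of such a purge policy, run for example from a daily cron job. It assumes, purely for illustration (none of these names come from the article), that each archived version lives in its own ISO-timestamped directory under /srv/archive and that a thirty-day cutoff is wanted; it deletes old versions but always keeps the newest version that is already older than the cutoff, so a rarely-changed file can still be reverted.

 import shutil
 from datetime import datetime, timedelta
 from pathlib import Path

 ARCHIVE = Path("/srv/archive")  # hypothetical archive location
 CUTOFF = datetime.now() - timedelta(days=30)

 # Versions are assumed to be directories named like 2009-03-24T02:21:00.
 versions = sorted(ARCHIVE.iterdir(), key=lambda p: datetime.fromisoformat(p.name))

 # All versions older than the cutoff, oldest first.
 old_versions = [p for p in versions if datetime.fromisoformat(p.name) < CUTOFF]

 # Purge all but the newest of the old versions, so at least one version
 # more than a month old always survives.
 for version in old_versions[:-1]:
     shutil.rmtree(version)

Purging old_versions[:-1] rather than the whole list is exactly the distinction the parenthetical above warns about.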
--Antibogon (talk) 19:16, 5 April 2008 (UTC)
Condensation control
The 'Condensation control' section says condensation occurs when something is moved from a cold to a warm place, so from outdoors to indoors. But isn't that the other way around? Cold air holds less water, so water condenses out of it. So the condensation occurs not when you bring a computer back in, but earlier, when you bring it out. I am fairly sure of this, but not sufficiently so to change the article.
Oh, hold on, if the computer has been outside for some time and is then brought inside, the cold metal will cool the air and cause condensation. Just like when you blow against a cold window.
So it can be one of two effects, or both, depending on the circumstances. Right? DirkvdM (talk) 07:48, 1 August 2008 (UTC)
- Thanks for bringing up the concern. The information I previously entered in the article was roughly obtained from another web article, the link to which I unfortunately cannot currently recall.
- I'm certain that condensation will occur in the latter case, when the computer is brought inside. As per the Wikipedia article on condensation, water vapor will condense onto a surface only when the temperature of that surface is cooler than the temperature of the water vapor.
- About your point that "cold air holds less water", the Wikipedia article on dew point states that the claim "cold air cannot hold as much water as warm air" is only superficially true. It seems that the physics involved is complex, and at this time I really don't know what to make of the case of taking the computer outside. It is by no means certain that condensation will happen, or that it'll damage the hard drive even if it does happen. One easy way to answer the question may be to run several limited experiments. I will update this post if I learn anything new and concrete with regards to this matter. Until then, I will by no means be taking my laptop outside in the winter without it being in its case.
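- For whatever it's worth, a rough way to reason about the indoor case is to compare the cold drive's surface temperature with the dew point of the indoor air. Below is a minimal sketch using the Magnus approximation for dew point; the temperatures and humidity are made-up example values, not measurements.

 import math

 def dew_point_c(air_temp_c, rel_humidity_pct):
     # Magnus approximation, reasonable for roughly 0-60 C.
     b, c = 17.62, 243.12
     gamma = math.log(rel_humidity_pct / 100.0) + (b * air_temp_c) / (c + air_temp_c)
     return (c * gamma) / (b - gamma)

 # Made-up example: a drive at -5 C is carried into a 22 C room at 50% humidity.
 dew_point = dew_point_c(22.0, 50.0)
 drive_surface_c = -5.0
 print(f"Indoor dew point is about {dew_point:.1f} C")
 if drive_surface_c < dew_point:
     print("Drive surface is below the dew point, so condensation is likely.")
 else:
     print("Drive surface is above the dew point, so condensation is unlikely.")

- By this rough estimate the cold drive sits well below a typical indoor dew point, which supports the intuition that the main risk is when the cold machine comes back inside, not while it is outside.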
Temperature control
There are 2 sections in this book that talk about the effects of temperature on hard drives: minimizing hard disk drive failure and data loss#Adequate cooling and minimizing hard disk drive failure and data loss#Temperature monitoring. One section implies that cooling is good for hard drives. The other section says that cold drives have a higher failure rate than hot drives, implying that cooling is bad for hard drives.
Could we somehow merge those sections together into one section -- perhaps "controlling the temperature of hard drives"? --DavidCary (talk) 03:54, 2 November 2008 (UTC)
- Thanks for your feedback. I have now done a preliminary merge of the two sections into a single large section on temperature control.
- The details, as mentioned in the article, are important and shouldn't be overlooked. By and large, cooling is indeed important, as mentioned in the first part. I'm not so sure that the second part implies that cooling is bad; instead it prescribes having finer control, beyond the simple notion of hot versus cool.
- I would welcome any further comments you may have. --Antibogon (talk) 02:32, 7 November 2008 (UTC)
- Good job merging the sections.
- It seems obvious that "cooling is indeed important". However, is there any evidence for this? Usually when we make measurements, the results are about what we expected. However, occasionally we are surprised by counter-intuitive results. Do we reject and ignore all measurements that don't match our preconceived biases? The Google measurements show that, the further the average temperature is below 38 C, the higher the failure rate. What other implication can I draw other than "cooling below an average temperature of 38 C is bad"? --DavidCary (talk) 08:30, 23 March 2009 (UTC)
- Doesn't heat increase friction and wear and tear? We know the mean temperatures, yes, but do we know anything about the actual distributions of temperatures in the Google study, or are we simply assuming they were Gaussian? Is the Google study the only study in this regard? I find it safe to assume that there have been studies with conflicting results in the past. Do we know what kind (makes and models) of drives Google used?
- Based on the Google study, I personally find it safe to draw the conservative conclusion that I probably wouldn't want my drive running at less than 30 C. I think air-based cooling is essential because without it the drives could easily be running in the mid-40s, and there's no way that internal air alone will bring the temperature below 30 C. --Antibogon (talk) 02:21, 24 March 2009 (UTC)
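- For anyone wanting to see where their own drive sits relative to these numbers, here is a minimal sketch that reads the drive's SMART temperature with smartctl. It assumes smartmontools is installed, that /dev/sda is the drive of interest, and that the drive reports a Temperature_Celsius or Airflow_Temperature_Cel attribute; the column layout of smartctl's output can vary between drives and versions.

 import subprocess

 def drive_temperature_c(device="/dev/sda"):
     # Parse 'smartctl -A' output; the raw value is normally the 10th column.
     out = subprocess.run(["smartctl", "-A", device],
                          capture_output=True, text=True, check=False).stdout
     for line in out.splitlines():
         fields = line.split()
         if len(fields) >= 10 and fields[1] in ("Temperature_Celsius",
                                                "Airflow_Temperature_Cel"):
             try:
                 return int(fields[9])
             except ValueError:
                 return None  # some drives report the raw value in another format
     return None

 temp = drive_temperature_c()
 if temp is None:
     print("No temperature attribute found for this drive.")
 elif temp < 30:
     print(f"Drive is at {temp} C, below the ~30 C floor discussed above.")
 elif temp > 45:
     print(f"Drive is at {temp} C, above the mid-40s range discussed above.")
 else:
     print(f"Drive is at {temp} C, within the 30-45 C band.")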
- Yes, in theory I expected that heat would be bad for hard drives. But in practice, the Google study says "In our study, we did not find much correlation between failure rate and either elevated temperature or utilization. It is the most surprising result of our study."
- I'm looking at the graph on this page, "Minimizing hard disk drive failure and data loss". It shows the actual distribution of the mean temperatures of the hard drives studied. Or are you talking about the distribution of temperatures (over time) for any one hard drive? I can assure you I do not assume that was Gaussian.
- If there have been *any* other studies, I don't care whether their results are conflicting or not, *please* add a reference to those studies to this book.
- I agree that, based on the Google study, I don't want my drives running at less than 30 C.
- I agree that, in theory, it seems unlikely that air cooling will bring the temperature below 30 C. However, I see that the same graph shows that over half of Google's hard drives had an average temperature below 30 C. Are you assuming that Google cools their hard drives using some kind of liquid cooling? --DavidCary (talk) 04:49, 30 March 2009 (UTC)
- Google using liquid cooling? Entirely possible, but I wouldn't bet on it - they're more likely to use air-cooled blade servers.
- However, a lot of this may be a moot point. The computer needs to be cooled anyways, and most standard cooling systems are not going to get your HD into the lower ranges you're referring to. This of course depends on the hardware, but I'm not sure it's worth spending lots of time discussing in the text. — Mike.lifeguard | talk 16:27, 20 November 2009 (UTC)
Generalizing Google's results
Generalizing the results of the Google study to all HDDs would imply the assumption that the distribution of SMART values of HDDs used in the Google study and the distribution of SMART values of HDDs in general is the same (a sketch of how such an assumption could in principle be tested appears after this post). This assumption, however, is not entirely true, because:
- Google's HDDs probably have a very high uptime, whereas desktop HDDs probably don't.
- Google's HDDs had temperature distributions that are possibly different from those of HDDs in other places, including many data centers.
- Google included in the study only those HDDs that did not exhibit infant mortality.
- There are probably big differences in reliability of the HDDs of different manufacturers.
Having said this, as is obvious, the results would translate better to data centers that use mixed drives. --Antibogon (talk) 00:31, 29 March 2009 (UTC)
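To make "the same distribution" concrete: if one had per-drive SMART values from both populations, the assumption could be checked with a two-sample Kolmogorov-Smirnov test. The sketch below uses made-up placeholder numbers (Google's raw per-drive data isn't published) and assumes NumPy and SciPy are available.

 import numpy as np
 from scipy.stats import ks_2samp

 # Placeholder samples of average drive temperature (C) for two hypothetical
 # populations; real use would need the actual per-drive SMART values.
 google_like = np.random.default_rng(0).normal(loc=28, scale=5, size=1000)
 desktop_like = np.random.default_rng(1).normal(loc=35, scale=8, size=1000)

 statistic, p_value = ks_2samp(google_like, desktop_like)
 print(f"KS statistic = {statistic:.3f}, p-value = {p_value:.3g}")
 if p_value < 0.05:
     print("Unlikely that both samples come from the same distribution.")
 else:
     print("No evidence that the two distributions differ.")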
By "not true", are you saying that you have data that shows that other hard drives definitely do act differently than Google's HDDs? Or are you merely speculating that they *might* act differently?
- Yes, Google's HDDs are turned on and spinning essentially all the time.
- Yes, the temperature swings of Google's HDDs are probably very different from those of desktop HDDs and even more different from those of laptop HDDs.
- Yes, the Google study notes that different manufacturers had different "age-related results".
However,
- I assume that Google's data centers have the same temperature distribution as other data centers. (Do you have any reason to think differently?)
- The Google study specifically says that, even though it does not include infant mortality that occurs during the burn-in stress test, "our data should be consistent with what a regular end-user should see, since most equipment manufacturers put their systems through similar tests before shipment."
- The Google study reports that "None of our SMART data results change significantly when normalized by drive model. The only exception is seek error rate, which is dependent on one specific drive manufacturer, ..."
Therefore, since Google specifically says that the SMART data results are the same for every manufacturer, it seems reasonable to me to assume that the results of the Google study are valid for all data centers, including data centers that use drives from a single manufacturer as well as data centers that use mixed drives. I would be very surprised if a data center's HDDs act differently than Google's HDDs.
Since I know of no contradictory information, it seems reasonable to me to assume that their results are also valid for desktop HDDs and laptop HDDs. Although I would be less surprised to learn that those HDDs act differently than Google's HDDs. --DavidCary (talk) 04:44, 30 March 2009 (UTC)
delta T
Neither the Google paper nor other papers about this subject give attention to delta T, as opposed to absolute T, which is what the study reports. That delta T is a larger factor in drive failure is supported directly by Google's study, in that lower temperature was associated with failure.
A machine, whether a desktop PC or a SAN (ambient temperature is not relevant), can be viewed as an enclosed system, where a volume of its surrounding air comprises the outermost surface of the system. By itself, this may not create an important thermal layer, and its presence may not come into effect.
It is well established by work in other areas that mechanical failure is the primary cause of failure in any electrical or electronic device. Consider a light bulb as an example. Consumer data reported on boxes often rates a lamp in terms of hours. If you dig into the engineering data, you will find life-cycle data about the same lamp measured in on-off cycles, not in hours of operation. The association between on-off cycles and change in temperature (delta T) should be straightforward -- the lamp is at higher T while powered and lower T while not powered. Thus, the count of delta-T events increases as the lamp is turned from off to on and back to off, repeated over many cycles. In a lamp, metal that is part of the filament is ionized when power is turned on, as a function of the heat that is generated while electrons pass through it. When power is removed, the energy level of the ionized atoms falls and most of the ionized metal returns to the filament. The remaining metal is deposited on the inside of the glass lamp as residue.
While the lamp operates, and while it is off, its temperature is in steady state. (While air currents may cause the lamp to change temperature slightly from moment to moment, the change in temperature is negligible compared to the change in temperature that results during either power-up or power-down.) Therefore, the length of time the lamp is in one state or the other is not a contributor to the build-up of residue. (Build-up of residue is directly related to loss of metal from the filament, which is directly related to failure.)
Semiconductor devices experience a similar life in that the n-p and p-n junctions within them suffer as a result when the device changes temperature. A semiconductor device is more likely to suffer from the loss of material than a tungsten lamp, so operation in an overheated state is more likely to affect the semiconductor device. But operated within their specifications, the junctions will succumb to degradation in a manner similar to the filament of a tungsten bulb. As a result, the number of power cycles is a greater predictor of failure than is the time in operation.
The Google study reports on steady-state conditions -- a logical use of data produced by SMART. Whether or not the Google data supports what I've written isn't likely to be obvious from reviewing the report itself, but it may be visible in the raw data used by Google to support the report. As a matter of physics, or as a study of data collected by various manufacturing sectors, change in temperature is a well-known contributor to failure, and it is overlooked in the Google paper. However, examination of the temperature curve shows greater area under the curve to the left than to the right, where the center is taken as ambient for a given drive. This range correlates with the power-up and power-down cycles.
The characteristics of any system can include thermal conductivity. Thermal conductivity can be considered constant over certain, limited temperature ranges. Exactly where the range causes this predictable parameter to fail in Google's study is uncertain, but one thing does remain certain: it will fail, at least, in an area of temperature to the right of the center.
I've provided this for consideration by those who wish for a practical solution to drive-related problems, because it is empirically true and because most reports (all reports I've looked at) fall short of analyzing the effect of delta T. This is a shortcoming of the given report, because the ubiquity of evidence about the effect of delta T is not hidden. If this is considered removable information from Wikipedia, because no citations are offered, so be it. If one wishes to find relevant sources for it, they can be found. I hope it is useful to someone other than myself. Kernel.package (talk) 18:53, 20 October 2009 (UTC)
- The article does have a section on power cycling control. --IO Device (talk) 17:19, 21 October 2009 (UTC)
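- As a small companion to that section, here is a minimal sketch that compares a drive's power-on hours with its power-cycle count, the ratio the delta-T argument above would care about. The same assumptions apply as for any smartctl-based check: smartmontools installed, /dev/sda as the device, standard attribute names, and raw values reported as plain integers (some drives use other formats).

 import subprocess

 def smart_raw_value(device, attribute_name):
     # Return the raw value of a named SMART attribute, or None if absent.
     out = subprocess.run(["smartctl", "-A", device],
                          capture_output=True, text=True, check=False).stdout
     for line in out.splitlines():
         fields = line.split()
         if len(fields) >= 10 and fields[1] == attribute_name:
             try:
                 return int(fields[9])
             except ValueError:
                 return None
     return None

 device = "/dev/sda"
 hours = smart_raw_value(device, "Power_On_Hours")
 cycles = smart_raw_value(device, "Power_Cycle_Count")
 if hours and cycles:
     print(f"{hours} power-on hours over {cycles} power cycles, "
           f"about {hours / cycles:.1f} hours per cycle.")
 else:
     print("This drive does not report both attributes in plain-integer form.")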
Should text be split?
The {{split}} template was recently added to the text. Should the text be split into smaller pages?
Personally I'm of the opinion that it should not be split at this time. One reason for this is that the individual sections are not very big, and are not expected to grow very much, given the limited historical growth. --IO Device (talk) 23:11, 26 November 2009 (UTC)
Advertising
Should a WikiMedia project really be promoting any particular (especially proprietary & commercial) software? I realise the benefit of giving examples, but in such cases Free (Libre/‘open source’) Software should be mentioned, which removes the dubious nature of proprietary and/or commercial suggestions. As Jimmy Wales said, Free knowledge requires Free software, standards, and formats. — Lee Carré (talk) 14:45, 27 December 2010 (UTC)
- Please specify the names of the software in question. I think mentioning freely usable proprietary software is fine, assuming there is no comparable open source alternative available. I am not in favor of including commercial software (that costs money) either. --IO Device (talk) 16:50, 2 January 2011 (UTC)
- [Apologies for the very long delay.]
- It seems that I had mistakenly added this entry to the main talk page, instead of the one for the specific sub-page. Originally, I was referring to the Case Study section, in which both Cobian BackUp and TransLogic are endorsed.
- Proprietary software always has ethical entanglements by its very nature, both upon the reader and the author. Particularly in the case of backups, which essentially come down to copying files, for which no special (let alone proprietary) software is needed. Wikimedia, being a libre project, implies that the knowledge it imparts shouldn't require a reader to surrender their liberty in other contexts (especially on their own computer).
- Commercial software is a different matter. Price is a mere practical concern. Your objection seems to exclude the possibility of non-gratis libre software.
- I notice the use of certain buzz-terms, which may explain a few things. — Lee Carré (discuss • contribs) 17:35, 15 March 2021 (UTC)