{"id":276,"date":"2006-04-22T19:20:00","date_gmt":"2006-04-22T19:20:00","guid":{"rendered":"http:\/\/blog.zerowait.com\/?p=276"},"modified":"2006-04-22T19:20:00","modified_gmt":"2006-04-22T19:20:00","slug":"276","status":"publish","type":"post","link":"https:\/\/blog.zerowait.com\/index.php\/2006\/04\/22\/276\/","title":{"rendered":""},"content":{"rendered":"<p><a href=\"http:\/\/www.drunkendata.com\/?p=388#comments\">Finally &#8212; An answer of sorts from NetApp on the BCS &#038; ZCS issue<\/a><\/p>\n<p><span style=\"color: rgb(0, 153, 0);\">The answers below give a great perspective on the switch back to 512-byte sectors from 520-byte sectors. <span style=\"font-weight: bold; font-style: italic;\">It is worth reading all the way through.<\/span><\/span><\/p>\n<p><span style=\"color: rgb(0, 153, 0);\">But I still think <span style=\"font-weight: bold;\">NetApp should release reliable, repeatable and verifiable performance data, so that customers can make informed, economically sound business decisions based on the costs and risk factors of storing D\/R data on ATA disk as compared to FC disk.<\/span> Additionally, since running Dual Parity disks to protect against a second disk failure carries a cost in extra drives and lost capacity, customers need to know what percentage of disk space is consumed by parity in these configurations, and what that overhead costs. And because DP is not required with FC disks, is it possible that in certain smaller RAID configurations it could actually be cheaper to run FC than ATA on NetApp filers?<\/span><\/p>\n<p><span style=\"color: rgb(0, 153, 0); font-weight: bold;\">Finally, is there a read or write penalty to running databases on ATA disks with RAID-DP and ZCS formatting, as compared to the faster Fibre Channel disks with BCS formatting?
<\/span><\/p>\n<p><a title=\"\" href=\"http:\/\/www.drunkendata.com\/?p=388#comment-2567\">April 22nd, 2006 at 10:57 am<\/a><br \/>Cross-posted from the previous thread: From Dave Hitz, CTO, Network Appliance:<br \/>Let me take a shot at this. I asked one of our engineers to take a look at this thread as well, so if I mess up the details, hopefully he can set me right. (Hi Steve.)<br \/>Reformatting the disk drives from 512-byte blocks to 520-byte blocks and putting the checksum right in each individual block is the best solution, because it doesn\u2019t take any extra seeks or reads to get the checksum data you need. This is called BCS or Block Checksum. (Most high-end storage vendors have something similar. EMC and Hitachi certainly do.)<br \/>Unfortunately, we aren\u2019t able to format ATA drives with 520-byte blocks. Maybe someday, but not yet. So with ATA we use a different technology called Zoned Checksum (or ZCS) where we steal every Nth block on the disk and use it for the checksums. (I think N is 64, but can\u2019t remember for sure.) This is less efficient because you have to read extra data, but it allows you to get the reliability benefits of checksums even with ATA drives, which is important because ATA drives are less reliable.<br \/>And what about RAID-DP (DP = \u201cdouble parity\u201d)? I think that RAID-DP is a wise choice for all drives, Fibre Channel or ATA, but given that ATA drives are less reliable we make RAID-DP the default there. I\u2019m wondering if it\u2019s time to make it the default for Fibre Channel drives as well, but as far as I know, we haven\u2019t done that yet.<br \/>Why sell less reliable drives? ATA drives are cheaper!
If you\u2019ve got the money, then by all means keep buying Fibre Channel drives and keep using block checksums.<br \/>On the other hand, if you want to save money, and your application can get by with a bit less performance, then the combination of RAID-DP and Zoned Checksums can make ATA drives very safe. We used to recommend ATA only for disk-based backup or for archival storage, but now that we have RAID-DP and ZCS, we see lots of customers using it for primary storage, which is why we are starting to support ATA through the entire product line, and not just in the R-Series.<\/p>\n<p>************************<\/p>\n<ol class=\"commentlist\">\n<li class=\"alt\" id=\"comment-2583\"><a href=\"http:\/\/www.drunkendata.com\/?p=385#comment-2583\"><cite>Steve Strange<\/cite> Says: <\/a><br \/><small class=\"commentmetadata\"><a title=\"\" href=\"#comment-2583\">April 22nd, 2006 at 2:36 pm<\/a> <\/small>\n<p>Let me see if I can fill in a few more details (Hi Dave).<\/p>\n<p>First, let me try to clear up the confusion about BCS vs. ZCS, and provide a little history. As Dave says, ZCS works by taking every 64th 4K block in the filesystem and using it to store a checksum on the preceding 63 4K blocks. We originally did it this way so we could do on-the-fly upgrades of WAFL volumes (from not-checksum-protected to checksum-protected). Clearly, reformatting each drive from 512-byte sectors to 520-byte sectors would not make for an easy, on-line upgrade.<\/p>\n<p>As Dave says above, the primary drawback to ZCS is performance, particularly on reads. Since the data does not always live adjacent to its checksum, a 4K read from WAFL often turns into two I\/O requests to the disk. Thus was born the NetApp 520-byte-formatted drive and Block Checksums (BCS). For newly-created volumes, this is the preferred checksum method.
Note that a volume cannot use a combination of both methods \u2014 a volume is either ZCS or BCS.<\/p>\n<p>Pq65 provides some spare-disk output from a filer running ONTAP 7.x showing spares that could be used in either a BCS or a ZCS volume. The FC drive shown here is formatted with 520-byte sectors. If it is used in a ZCS volume, ONTAP will simply not use those extra 8 bytes in each sector.<\/p>\n<p>When ATA drives came along, we were stuck with 512-byte sectors. But we wanted to use BCS for performance reasons. So rather than going back to using ZCS, we use what we call an \u201c8\/9ths\u201d scheme down in the storage layer of the software stack (underneath RAID). Every 9th 512-byte sector is deemed a checksum sector that contains checksums for each of the previous 8 512-byte sectors (which together make up a single 4K WAFL block). This scheme allows RAID to treat the disk as if it were formatted with 520-byte sectors, and therefore these are considered BCS drives. And because the checksum data lives adjacent to the data it protects, a single disk I\/O can read both the data and checksum, so it really does perform similarly to a 520-byte-sector FC drive (modulo the fact that ATA drives have slower seek times and data transfer\/rotational speeds).<\/p>\n<p>Starting in ONTAP 7.0, the default RAID type for aggregates is RAID-DP, regardless of disk type. For traditional volumes, the default is still RAID-4 for FC drives, but RAID-DP for ATA drives. You cannot mix FC drives and ATA drives in the same traditional volume or aggregate.<\/p>\n<p>The default RAID group size for RAID-DP is typically double that of RAID-4, so if you are deploying large aggregates, the cost of parity is quite similar for either RAID type.
But the ability to protect you  from a single media error during a reconstruct is of course far superior with  RAID-DP (the topic of one of Dave\u2019s recent blogs on the NetApp website).<\/p>\n<p>You can easily upgrade a RAID-4 aggregate to RAID-DP, or downgrade a RAID-DP  aggregate to RAID-4. But you cannot shrink a RAID group, so you do want to be  careful about how you configure your RAID groups before you populate them with  data (assuming you don\u2019t like the defaults).<\/p>\n<p>There was an implication earlier in this blog that we used to use RAID 4, but  on newer systems we use RAID 5. That\u2019s not the case \u2014 we do not use RAID 5 on  any of our systems (though an HDS system sitting behind a V-series gateway might  use it internally). This is a whole topic in itself, but the reason, stated  briefly, is that RAID-4 is more flexible when it comes to adding drives to a  RAID group, and because of WAFL, RAID-4 does not present a performance penalty  for us, as it does for most other storage vendors. RAID-DP looks much like  RAID-4, but with a second parity drive.<\/p>\n<p>Our \u201clost-writes\u201d protection capability was also mentioned. Though it is  rare, disk drives occasionally indicate that they have written a block (or  series of blocks) of data, when in fact they have not. Or, they have written it  in the wrong place! Because we control both the filesystem and RAID, we have a  unique ability to catch these errors when the blocks are subsequently read. In  addition to the checksum of the data, we also store some WAFL metadata in each  checksum block, which can help us determine if the block we are reading is  valid. For example, we might store the inode number of the file containing the  block, along with the offset of that block in the file, in the checksum block.  If it doesn\u2019t match what WAFL was expecting, RAID can reconstruct the data from  the other drives and see if that result is what is expected. 
With RAID-DP, this can be done even if a disk is currently missing!<\/p>\n<p>We\u2019re constantly looking for opportunities to add features to ONTAP RAID and WAFL that can hide some of the deficiencies and quirks of disk drives from clients. I think NetApp is in a unique position to be able to do this sort of thing. It\u2019s great to see that you guys are noticing!<\/p>\n<p>Steve <\/p>\n<\/li>\n<li class=\"\" id=\"comment-2588\"><cite><a href=\"http:\/\/www.drunkendata.com\/?p=385#comment-2588\" rel=\"external nofollow\">Administrator<\/a><\/cite> Says:<br \/><small class=\"commentmetadata\"><a title=\"\" href=\"#comment-2588\">April 22nd, 2006 at 3:56 pm<\/a> <\/small>\n<p>Wow: great historical and technical clarification. My hat is off to Dave and Steve for jumping to the task of helping clear up the confusion around this 512\/520 issue.<\/p>\n<p>Truly appreciated by all.<\/p>\n<\/li>\n<\/ol>\n<p>I agree with Jon on this; <a href=\"http:\/\/zerowait.blogspot.com\/2005\/05\/i-really-like-dave-hitz-he-took-me-to.html\">I guess it is my turn to invite Dave out for dinner<\/a> to thank him for clarifying the issues so well.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Finally &#8212; An answer of sorts from NetApp on the BCS &#038; ZCS issue These answers below give a great perspective on the switch back to 512 sectors, from 520 sectors. And it is worth reading all the way through.
&hellip; <a href=\"https:\/\/blog.zerowait.com\/index.php\/2006\/04\/22\/276\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/posts\/276"}],"collection":[{"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/comments?post=276"}],"version-history":[{"count":0,"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/posts\/276\/revisions"}],"wp:attachment":[{"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/media?parent=276"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/categories?post=276"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blog.zerowait.com\/index.php\/wp-json\/wp\/v2\/tags?post=276"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}