StorageMojo




Robin Harris    




Flash isn’t tier zero

August 13th, 2008 by Robin Harris in Architecture, Disk, SSD/Flash Disk

A panel discussion on enterprise SSDs at the Flash Memory Summit came to an almost unanimous conclusion: NAND flash is best seen as an extension to DRAM and a layer between DRAM and disk - not as the guts of a disk drive replacement.

I don’t think the guy from Seagate agreed.

Since I was on the panel, my recollections have to be taken with a grain of salt. But I was trying to resist the group think that too many panels fall prey to. Yet I agreed with the result.

Price changes everything
StorageMojo has reported at length on the problems of making a big, quirky EEPROM look like a disk. Flash doesn’t look much like DRAM either, but the two are cousins.

In the last few years price has altered the landscape. On today’s spot market a Gbit of DRAM costs 7-10x as much as a Gbit of MLC NAND.

That wasn’t the case 3 years ago, so substituting flash for DRAM made no sense.

The market resistance to flash drives is because flash costs more than disk. Not a problem when augmenting DRAM.

The performance fit
Disks are millisecond devices; DRAM DIMMs are nanosecond devices; and NAND chips are microsecond devices.

More than once it was suggested that maybe it is time to bring back the 3600 RPM drive. Optimized for capacity, power and long life, it would be a good complement to servers with several hundred GB of flash.

The StorageMojo take
Flash as a new storage layer between DRAM and disk just sounds more logical than flash-as-a-disk-like product. Let disks be disks!

And flash be flash.

Courteous comments welcome, of course. More on this topic later. Stay tuned.

StorageMojo at Flash Memory Summit

August 9th, 2008 by Robin Harris in SSD/Flash Disk

If you are attending the Flash Memory Summit in Santa Clara on Tuesday and Wednesday please say hello. Tuesday morning I will be sprinting between my two concurrent sessions.

In Forum F1B: Laptop Design session I’ll be giving a 25 minute presentation titled “Can The Flash Consumer SSD Be Saved?” In “Flash in Enterprise Storage Systems” a panel will hold forth on the promise of enterprise/solid state disks.

For reasons regular readers will appreciate, the latter should be more interesting.

The StorageMojo take
The summit will also have vendors showing their wares. I’m hoping to see some creative work.

The first thought with new technologies is to replicate what we already have. The real benefits from flash will come as we rethink the old architectures.

Courteous comments and questions welcome, of course.

Power-play, power work

August 7th, 2008 by Robin Harris in Off-Topic

Steve Denegri, author of The Data Center’s Green Direction Is A Dead End, turned me on to an interesting Microsoft blog post. Titled Changing Data Center Behavior Based On Chargeback Metrics, the post breaks down data center costs at Microsoft.

The author, Christian Belady, professional engineer and principal power and cooling architect, discovered that over 80% of the data center costs scale with power consumption and less than 10% scale with space. Why, he asked, do data centers charge for space and not for power?

Power use is the cost issue
He reports that since Microsoft started charging for wattage there have been a number of important changes. He writes:

From our perspective, our charging model is now more closely aligned with our costs. By getting our customers to consider the power that they use rather than space, then power efficiency becomes their guiding light. This new charging model has already resulted in the following changes:


  • Optimizing the data center design

    • Implement best practices to increase power efficiency.

    • Adopt newer, more power efficient technologies.

    • Optimize code for reduced load on hard disks and processors.

    • Engineer the data center to reduce power consumption.

  • Sizing equipment correctly

    • Drive to eliminate Stranded Compute by:

      • Increase utilization by using virtualization and power management technologies.

      • Selecting servers based on application throughput per watt.

      • Right sizing the number of processor cores and memory chips for the application needs.

    • Drive to eliminate stranded power and cooling—ensure that the total capacity of the data center is used. Another name for this is data center utilization and it means that you better be using all of your power capacity before you build your next data center. Otherwise, why did you have the extra power or cooling capacity in the first place…these are all costs you didn’t need.

Christian goes on to quote James Hamilton, a Microsoft architect, whose study (PowerPoint here) has convinced him that a power saving of nearly 4x is both possible and affordable using only current technology.

The StorageMojo take
That data centers can be almost 4x more power efficient should surprise no one. Power efficiency has never been a criterion so they should be grossly inefficient.

The same mindset that justifies pricing software add-ons at their business value rather than a reasonable profit margin is designing data centers. Plenty of power, plenty of cooling, gold-plated backup power because the costs of being down are so high.

It is easy to see how such an attitude creates such wasteful infrastructures and workloads. The question is: will power costs create a significant incentive for change?

My guess is no. The vast majority of computer users, consumer and small to medium enterprises alike, simply do not see power consumption as a significant buying criterion. This is the downside of the consumerization of IT.

Power efficient IT infrastructures will only come with cost effective stepwise enhancements. Component efficiency enhancements that do not cost extra will be successful. Big engineering programs to build energy-efficient servers will not. If you have fewer than 25 servers, as most IT sites do, power demand is simply not a very large part of your costs.

Companies that sell to server vendors should take the issue seriously. The biggest buyers of servers will care about power, even if they are a minority of the units shipped. But significant power savings will require national standards.

Detroit knew that gas prices wouldn’t remain low forever, but in the absence of higher fuel economy standards they went for the easy money. Are server vendors any different?

Comments welcome, of course. Does your datacenter charge per square foot or meter, or by wattage?

Green-plated storage

August 3rd, 2008 by Robin Harris in Enterprise, Security & Public Policy

Stepping beyond marketing green-washing, the folks at Wikibon have done something. Tomorrow morning they’ll announce, along with California-based PG&E, Conserve IT,

. . . a first-of-its-kind service that accelerates the qualification of storage products for energy rebates and provides independent validation of energy efficiency for storage platforms from a number of leading vendors, spanning emerging Web 2.0 suppliers to the most recognized brands in the business.

Conserve IT was launched on behalf of IT customers in the Wikibon community who wanted to take advantage of the excellent programs PG&E and other utilities have put in place to conserve energy. The community felt that it could help to dramatically increase the participation of storage technologies which are major consumers of power and cooling in data centers. PG&E responded to Wikibon by allocating resources to help qualify additional storage technologies and providing guidance to the storage industry at large.

A watt saved is a watt earned
3PAR, Compellent, DataDirect Networks, EMC, Hitachi Data Systems, Nexsan and Xiotech, have signed on to the program. Customers who want PG&E’s incentives must be accepted into the program before buying new equipment.

PG&E has long understood that conservation cuts their marginal cost of power. Since that power is the most expensive they buy - usually natural gas-fired turbines - it is cheaper for them to pay customers to conserve power than to build more power plants. Faster and better for the environment too.

The StorageMojo take
Kudos to Wikibon for shepherding this program and to PG&E and the storage companies for their support. Now it is up to the customers to take the next step.

Of course, looking at the companies involved, you are wondering “where are HP, IBM, Sun and NetApp?” I hope they are already in process, but if not, get the lead out. Company reps are invited to comment to update StorageMojo readers on your progress.

Comments welcome, of course.

Seattle scalability conference videos

August 1st, 2008 by Robin Harris in Off-Topic

No, not here on StorageMojo.
Over on Anil Gupta’s blog Network Storage. Scroll down.

I liked the transactional memory preso and found Maidsafe intriguing. Perhaps one of StorageMojo’s smarter readers can explain the latter to me.

I didn’t get the Nemo preso, but maybe it’s just me. I wrote a little about Carmen and Chapel earlier. The scalable directories problem is absolutely critical.

Check it out.

The Data Center’s Green Direction is a Dead End

July 29th, 2008 by Robin Harris in Architecture, Enterprise, Future Tech

Is the data center doomed? Or just the IT industry? That’s what the following article, by Steve Denegri - star industry financial analyst - asks.

Steve’s thesis: the IT industry is entering a period of energy scarcity that, in other industries, has meant that revenue growth slows to a crawl. Instead of embracing “green” marketing initiatives, he says, the industry should be working with customers to help drive down electricity costs.

If electricity costs continue to rise there will be a massive industry shake-out, similar to the consolidation in the auto industry since 1973. Are you ready for that?

I’m still pondering Steve’s argument, but its many insights are well worth considering. Higher energy costs create massive problems for airlines - why not for IT?

More later, I’m sure.

Robin

The Data Center’s Green Direction is a Dead End
By Steve Denegri, Storage Consultant, Financial Analyst

A study on data center electricity usage published a year ago by the Environmental Protection Agency (EPA) continues to receive attention in the storage industry. The study illustrates that storage is not keeping pace with servers and networking equipment as it relates to the amount of energy each of these hardware products uses in the data center. In fact, the EPA study shows that storage is consuming an increasing portion of the data center’s power budget as networking equipment and servers are maintaining a steady appetite for electricity, not a good trend in these times of skyrocketing utility costs.

No wonder the EPA study recommends that the storage industry dramatically improve upon its power management semantics for disk and tape systems. And the industry pundits are taking this data and running with it, with talk of underutilized storage resources and customers not getting the most of the equipment they’ve purchased, as if that’s a new theme.

Regardless, many vendors in the storage industry are salivating at the thought of bringing new energy-efficient products to market, believing that this problem has all the ingredients of a paradigm shift that could rearrange the competitive playing field.

However, these vendors would be better off recognizing that this heightened attention to energy efficiency is less indicative of a new growth opportunity and, more likely, portends an uncertain future for the industry, as a whole. Countless industries have reached an energy ceiling over the past half century, only to realize, soon after, that revenue potential had peaked.

What follows is a survival contest that only Darwin would love: more combinations at the top of the food chain and significant consolidation or closed doors among the multitude of suppliers. As revenue potential falls, those who are fortunate enough to survive must remain in cost-cutting mode in order to stay competitive.

Where is the Storage Industry Going?
If this, indeed, is the direction that storage is headed, the coming decade will see a massive shake-out in the enterprise computing industry. The storage industry, in particular, is very vulnerable to this outcome.

The simple fact is that the storage industry has been and always will be an OEM-dominant industry, whereby 70% or more of the sales to end users are sourced from fewer than ten vendors. Consequently, dozens of companies compete to supply product to this small number of very powerful OEMs. In this regard, storage closely parallels the automotive industry, one that’s dominated by five or six vendors.

For a glimpse of the future of storage, look at the automotive industry, which saw a major transition to energy-efficient products beginning in the late 1970s. Since that time, the automotive industry has seen its supplier-to-OEM ratio shrink by a factor of five.

If your company is one of the many suppliers to OEMs in the storage industry, then you should recognize that this “green” trend, over the long term, bodes poorly for your company’s existence, and consequently, your personal livelihood. The cold, hard truth is that an ample supply of energy is necessary to grow any business over the long-term, and the storage industry is shying away from the harsh reality that a sufficient amount of energy is, unfortunately, not available to keep the industry growing.


Source: Environmental Protection Agency’s “Report to Congress on Server and Data Center Energy Efficiency”, 8/2/07

Instead of elevating the rhetoric on the essential need to expand the capacity of the power grid, the storage industry is incomprehensibly embracing the energy efficiency paradigm, deploying marketing strategies that resemble those of the oil and gas industry. The websites of storage companies these days make mention of carbon footprints, green initiatives, and environmental stewardship, clearly having no idea that they are using buzz words that highlight the industry’s dire state.

A recent press release from one OEM actually boasted of its efforts to generate electricity at its headquarters from burning its employees’ garbage! With this as the most suitable example, the world is deploying utterly ridiculous new strategies to generate electricity, none of which have any scale to them. Unfortunately, the storage industry is buying into this nonsense.

What Do Customers Really Need?
There will be those who argue that the storage industry is only doing what the customer wants. For example, in its ever-increasing appetite for computing performance, customers are being forced to conserve energy in order to facilitate the necessary degree of computing scale for their businesses. Adding more servers and storage resources results in an exponential level of growth in electricity usage, which is an undesirable effect since utility costs are said to have grown 30% over the last five years.

So the storage industry reasons that it must redesign its products to allow more resources to be utilized at increasingly lower levels of power consumption per unit of storage capacity. This will allow its customers to expand while maintaining more control over the power budget.

However, the storage industry is blind to its customers’ true needs. All vendors in the industry should take heed: what the customer really wants is lower utility costs. Here’s a great example to prove it. In a study by The Uptime Institute called The Invisible Crisis in the Data Center: The Economic Meltdown of Moore’s Law, a report which was published at roughly the same time as the aforementioned EPA study, the authors cite that the three-year cost of powering a server exceeds the purchase cost of the server beginning next year.

Imagine buying a new car faced with the dilemma that the gas required over the first three years of ownership will exceed the cost of the vehicle. Now consider how the storage industry would respond to the problem: furnish the consumer with frequent refreshes of new models of vehicles that get more miles to the gallon. Chalk up yet another example of the storage industry furnishing its customers with products that they do not really want. The customer wants cheaper gas, not the financial burden of a new car every few years.

Would the typical storage vendor agree with this deduction? In order to grow the top line, the storage company might say that it will relentlessly focus its R&D effort towards providing new product models and accompanying software that gain more benefit per watt of electricity with each successive product generation so that the customer sees increasing energy efficiency over time.

In reality, the storage vendor might say, the customer is “stuck”: they have no choice but to frequently rip up and replace/upgrade storage equipment in order to contain utility costs. So the customer’s power dilemma actually provides job security for employees and provides greater visibility to top line growth.

If this is the attitude of any storage provider, they are in for a rude awakening. In fact, they should ask Intel or AMD how they’re coming along with this strategy. These two companies are finding that Moore’s Law will soon be downgraded to theorem status, because you can’t increase compute performance at the necessary pace for very long without an ample supply of electricity.

Likewise, the storage industry will quickly hit a wall, because the vast majority of energy improvements are bound to come about in the first few generations of product. Those storage vendors who are hoping to blaze a new trail in the quest for energy efficiency should re-familiarize themselves with the law of diminishing returns. The return on investment (ROI) for the R&D that will be necessary to expand the storage benefit per consumed watt will almost certainly shrink over time, resulting in lower and lower profit margins. Flat to shrinking revenue and declining ROI are the ingredients of a decaying industry.

What Can We Learn From Research?
Let’s reference one last, key finding of the study by the Uptime Institute, which found that 2010 will be the year that the benefits of server virtualization will peak, meaning that the number of servers used in the data center in 2008 has, for all practical purposes, now been fully optimized. Therefore, the number of server units per data center will soon start increasing again.

The EPA study found that servers, as a whole, had decreased their portion of the power budget between 2000 and 2006. However, since server virtualization benefits have largely been realized, the amount of electricity needed to power the data center is on the verge of increasing even more in the coming years, meaning that electricity requirements will continue growing at an exponential pace.

In order to satisfy the demand, the Uptime Institute suggests that “multiple thousand-megawatt” power plants will be needed once the server virtualization benefit phase ends. Consequently, you would think that the largest consortiums representing the storage industry would have already formed working groups with end customers that speak to the need for more power plants.

Instead, these organizations boast of their “green” initiatives promoting energy efficiency that they wrongly think will be adequate to offset the unending explosion in demand for electricity. It’s quite fascinating. Green computing is almost the equivalent of battling a raging inferno through the design of smaller matches. If only these consortiums realized that by hailing their energy-efficiency activities, they merely appear content with a reputation of environmental responsibility as they proclaim their industry’s doomed state.

In fact, without more power plants, the typical storage industry consortium had best realize that its membership numbers will soon be on the decline. As the automotive industry’s experience proves, the number of suppliers to the storage industry will soon fall off of a cliff.

The Future
There will also be those who say that energy efficiency is an important responsibility, and that customers benefit tremendously from the availability of such products. There is no denying this. In the late-1990s, when oil prices returned to levels not seen since the years prior to the 1970s oil crisis, consumers certainly benefited from the lower expense required to fuel their tanks.

However, data center computing has achieved its favorable reputation and widespread adoption thanks to its performance, not its energy efficiency. Said another way, the Indianapolis 500 isn’t won by the driver who can make the most laps on a single tank of gas.

If, going forward, data center computing is hindered by the need to expand performance not unabated but rather at a predetermined rate of consumed electricity, then the industry simply can’t expand much further. It’s that simple. Furthermore, if the industry’s fears of being labeled an “energy hog” outweigh its determination to expand the supply of energy to its customers, it may as well pack it in now. Either the industry wields its influence to expand the power grid or else faces the consequences associated with a shrinking opportunity - those are the choices.

In the grand scheme, it’s mesmerizing to consider the possibility that the peak in the storage industry will come about not by technological limitations with regard to areal densities, not by the commoditization effects of TCP/IP networking, but rather by something as simple as the lack of available electricity to facilitate growth.

Furthermore, the fact that the storage industry, in light of all its tremendous innovations over the last fifty years, is content in letting this happen is even more disconcerting. Consider the dire state of the automotive industry, storage vendors: do you really aspire to follow their lead?

Biography
Steve Denegri is a storage consultant and financial analyst whose experience in the storage industry dates back to 1995. He has been a senior financial analyst at two investment banks: Morgan Keegan and the Capital Markets Group of the Royal Bank of Canada, specializing in industry research that covered both enterprise computing and data center infrastructure. In addition to his involvement with the SCSI Trade Association, Steve has worked with Fibre Channel Industry Association and Storage Networking Industry Association along with ANSI T10 and T11 communities in speaking and consulting roles. Steve earned his B.S. in Mechanical Engineering and has a Masters degree in Business Administration.

steve (at) denegri.net
http://www.scsita.org

Update: Published with permission of the author. The article was originally published at SCSI Trade Association. I got the article from Steve and didn’t think to credit them at the time.

Comments welcome, of course.

Samsung follows StorageMojo’s lead - finally!

July 24th, 2008 by Robin Harris in SSD/Flash Disk

Samsung announced a 500,000 R/W cycle spec for their server-grade NAND flash. I thought that was pretty smart - even though the “several month” project didn’t sound like it involved a lot of engineering.

Then Flash analyst extraordinaire Jim Handy, who runs Objective Analysis, saw my post on ZDnet where I talked about the announcement. He reminded me that in October ‘06 I’d written

. . . the cells are actually good for closer to a million read/write cycles. If true, Samsung is silly not to adjust their spec upwards, even to 250k. Engineers can be their own worst enemies sometimes when it comes to promoting a cool new product.

A mere 21 months later Samsung got with the StorageMojo program.

The StorageMojo take
Better late than never.

Samsung knew sooner than I did that flash has some serious deficits as a storage medium. The 500k “server-grade” moniker is a way of attacking one of those deficits - longevity - in a way that should reassure customers and increase margins.

What Samsung has lost is 2 years in building customer awareness of server-grade flash. Now it looks more reactive than proactive. Not a bad move, but sooner would have been better.

Comments welcome, of course.

The virtual machine I/O blender

July 23rd, 2008 by Robin Harris in Architecture, Enterprise

I’m at the SNIA Symposium this morning. Hence the short post.

What is the impact of virtual machines on I/O?
Engineers have spent decades optimizing the OS, drivers, caching, controllers and disks for specific workloads.

Observed behaviors such as locality of reference have informed many strategies. Like read-ahead.

A smear
But when you put 25 virtual machines on a single server, what happens to all this hard-won empiricism? It’s gone.

Each of the 25 machines may have predictable I/O behavior. But together all those I/O patterns smear together. One I/O may have nothing to do with the next 10.
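A toy simulation makes the smear concrete. It is a minimal sketch, not a measurement: it assumes every VM reads its own disk region strictly sequentially and that the hypervisor picks the next VM to service at random. The function name and all parameters are made up for illustration.

```python
import random


def sequential_fraction(num_vms, ios_per_vm=1000, region_blocks=1_000_000, seed=0):
    """Fraction of I/Os that land on the block right after the previous I/O.

    Assumptions (illustrative only): each VM reads its own region
    sequentially, and requests from the VMs are interleaved at random.
    """
    rng = random.Random(seed)
    # Next block each VM will request; the regions never overlap.
    next_block = [vm * region_blocks for vm in range(num_vms)]
    remaining = [ios_per_vm] * num_vms
    active = list(range(num_vms))

    sequential = 0
    total = 0
    last = None
    while active:
        vm = rng.choice(active)
        block = next_block[vm]
        if last is not None and block == last + 1:
            sequential += 1
        total += 1
        last = block
        next_block[vm] += 1
        remaining[vm] -= 1
        if remaining[vm] == 0:
            active.remove(vm)
    return sequential / total


if __name__ == "__main__":
    for n in (1, 5, 25):
        print(n, "VMs:", round(sequential_fraction(n), 3))
```

With one VM nearly every request follows its predecessor. With 25 the fraction should fall to the neighborhood of one in 25, so read-ahead tuned for a single stream has little to work with.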

Fast and stupid
That puts a premium on stupid, but fast storage. Storage that doesn’t think about what you may be trying to do because you aren’t trying to do anything. A jumble of VMs is doing it.

The StorageMojo take
The “stupid vs smart” network debate has been around for decades. In storage we’ve always taken it for granted that smart is better. But now?

Not so much.

Comments welcome, of course. If you are running a lot of VMs what I/O issues have you noticed?

Write off-loading enterprise storage

July 20th, 2008 by Robin Harris in Architecture, Enterprise, Future Tech

It isn’t clear how serious the enterprise storage vendors and their customers are about reducing energy consumption. A server may have 4-8 cores, consuming 50 W when idle, attached to 8, 16 or even 24 drives each pulling 8 W at idle.

High-end drives, whose demise is widely predicted, may consume 12 W at idle. If they are serious, storage is a good place to start.
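A back-of-the-envelope calculation with the idle figures above shows why. This is a rough sketch: it treats the 50 W as the idle draw of the whole processor side and ignores power supply losses and cooling. The helper name is hypothetical.

```python
def idle_storage_share(server_idle_w, drive_count, drive_idle_w):
    """Return (total idle watts, fraction of that drawn by the drives)."""
    drives_w = drive_count * drive_idle_w
    total_w = server_idle_w + drives_w
    return total_w, drives_w / total_w


if __name__ == "__main__":
    for drives in (8, 16, 24):
        total, share = idle_storage_share(50, drives, 8)
        print(f"{drives} drives: {total} W idle, {share:.0%} from disks")
```

Under these assumptions the disks draw more than half of the idle power even in the 8-drive case, and the share grows with every drive added.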

But how?
A recent paper from Microsoft Research in Cambridge, Write Off-Loading: Practical Power Management for Enterprise Storage (pdf) by Dushyanth Narayanan, Austin Donnelly and Antony Rowstron, studies the issue. The traditional view is that enterprise workloads are too intense to generate savings by spinning down disks.

The team analyzed block level traces from 36 volumes in an enterprise data center and found that significant idle periods exist. They found that a technique they call write off-loading can save 60% of the energy used by enterprise disk drives.

Ring for the MAID
Main memory caches are good for handling reads but their lack of persistence means they are not effective for writes. That is the impetus for the write off-loading techniques.

Blocks intended for one volume are redirected to other storage in the data center. During write-intensive periods the disks are spun down and the writes redirected. Blocks are off-loaded temporarily, for as much as several hours, and are reclaimed in the background after the home volume disks are spun up.

The team reports

Write off-loading modifies the per volume access patterns, creating idle periods during which all the volume’s disks can be spun down. For our traces this causes volumes to be idle for 79% of the time on average. The cost of doing this is that when a read occurs for a non-off-loaded block it incurs a significant latency while the disks spin up. However our results show that this is rare.

Locality of reference hasn’t gone away.

Yes, you can spin disks down in the enterprise
The Microsoft team used servers in their Cambridge research facility to measure volume access patterns. This isn’t hard-core OLTP but there are generic server functions such as user home directories, project directories, print server, firewall, Web staging, Web/SQL server, terminal server and a media server.

They acknowledge that for TPC-C and TPC-H benchmarks disks are too busy to benefit from write off-loading. Nonetheless, even OLTP systems have significant variations in their workloads. At night for example, traffic might be light enough to power down many array disks.

The team took a week’s worth of traces. The total number of requests was 434 million, with 70% reads. They found that peak loads were substantially higher than average loads. This over-provisioning enables the power savings of write off-loading.

They also found that the overall workload is read dominated. Yet 19 of the 36 traced volumes had 5 writes for every read.

How write off-loading works
A dedicated manager is responsible for each volume. The manager decides whether to spin the disks up or down and also when and where to off-load writes.

The manager off-loads blocks to one or more loggers for temporary storage. The storage could be a disk or SSD but the team only tested disk-based loggers.

Loggers support four remote operations: write, read, invalidate and reclaim. They write the blocks and the associated metadata, including the source manager identity, the logical block numbers and a version number.

The invalidate request includes the version number and the logger marks the corresponding versions as invalid. Reclaim is like a read except the logger can return any valid range it is holding for the requesting manager.

Their implementation uses a log-based on-disk layout.

The manager determines when to off-load blocks and when to reclaim them while ensuring consistency and performing failure recovery. The manager fields all read and write requests, handing them off to loggers and/or caches as needed.
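To make the division of labor concrete, here is a minimal in-memory sketch of a manager and its loggers. It is not the paper's implementation: the class names, the way a logger is chosen, and the dictionary-backed "disks" are all assumptions for illustration, and it skips the log-based on-disk layout, caching and failure recovery entirely.

```python
class Logger:
    """Temporary home for off-loaded blocks (a dict stands in for the log)."""

    def __init__(self):
        self._log = {}  # (manager_id, lbn) -> {version: data}

    def write(self, manager_id, lbn, version, data):
        self._log.setdefault((manager_id, lbn), {})[version] = data

    def read(self, manager_id, lbn):
        """Return (version, data) for the newest valid copy, or None."""
        versions = self._log.get((manager_id, lbn))
        if not versions:
            return None
        newest = max(versions)
        return newest, versions[newest]

    def invalidate(self, manager_id, lbn, version):
        versions = self._log.get((manager_id, lbn), {})
        versions.pop(version, None)
        if not versions:
            self._log.pop((manager_id, lbn), None)

    def reclaim(self, manager_id):
        """Like read, but the logger picks any valid block it holds."""
        for (owner, lbn), versions in self._log.items():
            if owner == manager_id:
                newest = max(versions)
                return lbn, newest, versions[newest]
        return None


class Manager:
    """Owns one volume: tracks spin state and decides where writes go."""

    def __init__(self, manager_id, loggers):
        self.id = manager_id
        self.loggers = loggers
        self.home = {}        # lbn -> data on the home volume
        self.offloaded = {}   # lbn -> (logger, version)
        self.spinning = True
        self.spin_ups = 0
        self._version = 0

    def spin_down(self):
        self.spinning = False

    def _spin_up(self):
        if not self.spinning:
            self.spinning = True
            self.spin_ups += 1

    def write(self, lbn, data):
        stale = self.offloaded.pop(lbn, None)
        if self.spinning:
            self.home[lbn] = data
        else:
            # Disks are down: off-load rather than wake them.
            self._version += 1
            logger = self.loggers[lbn % len(self.loggers)]  # arbitrary choice
            logger.write(self.id, lbn, self._version, data)
            self.offloaded[lbn] = (logger, self._version)
        if stale:
            old_logger, old_version = stale
            old_logger.invalidate(self.id, lbn, old_version)

    def read(self, lbn):
        if lbn in self.offloaded:
            logger, _ = self.offloaded[lbn]
            return logger.read(self.id, lbn)[1]
        self._spin_up()  # the rare, slow path
        return self.home.get(lbn)

    def reclaim_all(self):
        """Background step: pull off-loaded blocks back to the home volume."""
        self._spin_up()
        for logger in self.loggers:
            while (item := logger.reclaim(self.id)) is not None:
                lbn, version, data = item
                if self.offloaded.get(lbn) == (logger, version):
                    self.home[lbn] = data
                    del self.offloaded[lbn]
                logger.invalidate(self.id, lbn, version)
```

In this sketch a write to a spun-down volume never wakes the disks, a read of an off-loaded block is served by the logger, and only a read of a block that was never off-loaded pays for a spin-up.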

Performance
Write off-loading is vulnerable to 10-15 second delays when a read forces a disk to spin up. 1% of the read requests had a response time of more than 1 second.

The write performance is equivalent to array performance in 99.999% of the cases. Here’s a figure that gives results for the “least idle” servers.


The tested configurations:

  • baseline: Volumes are never spun down. This gives no energy savings and no performance overhead.
  • vanilla: Volumes spin down when idle, and spin up again on the next request, whether read or write.
  • machine-level off-load: Write off-loading is enabled but managers can only off-load writes to loggers running on the same server: here the “server” is the original traced server, not the test bed replay server.
  • rack-level off-load: Managers can off-load writes to any logger in the rack.

And this differs from MAID how?
In a massive array of idle disks (MAID) a small number of the disks are kept spinning to act as a cache while the rest are spun down. This requires additional disks per volume. Copan Systems claims power savings of 75% with their “enterprise MAID” product. [Note to Copan - I'd be happy to have you compare your approach in the comments.]

Write off-loading does not require additional disks per volume or new hardware. The technique can use any unused data storage on the LAN.

The StorageMojo take
Can write off-loading become a viable commercial product? If Microsoft were to commercialize it in Windows Server at a low price it certainly could. Given the general reluctance of Redmond to productize MR concepts I wouldn’t expect anything soon. Too bad.

What this also underscores is the continued development of tightly coupled storage and server architectures for cost-effective solutions with unique benefits. The ability to relax some constraints of the (increasingly atypical) “typical” enterprise data center work load shows what can be accomplished through creative architecture.

As the leading OS vendor, Microsoft has an unparalleled opportunity to bring these ideas to market and create functional differentiation with Linux. I hope someone with clout in Redmond is looking at this.

Comments welcome, of course. What could be more appropriate in an era of massive write-offs?

Design Tradeoffs for SSD Performance

July 15th, 2008 by Robin Harris in Architecture, Future Tech, SSD/Flash Disk

A new Usenix paper looks at NAND flash SSD performance. It comes from a team at Microsoft Research and the University of Wisconsin, including Ted Wobber, who worked on last year’s A Design for High-Performance Flash Disks [see Flash chance for the StorageMojo take on that excellent paper - a post Ted was kind enough to review and comment on].

Design Tradeoffs for SSD Performance (by Nitin Agrawal, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse and Rina Panigrahy) takes a deep dive into flash translation layer (FTL) issues. As the authors note, flash vendors keep their FTL designs secret, so the team developed a NAND flash simulator to look at how design choices affected performance.

What they found
They ran several workloads on their trace-based simulator, including TPC-C, Exchange and some file system benchmarks. They found several critical issues in SSD design.

  • Data placement Needed for wear leveling and load balancing.
  • Parallelism Single flash chips aren’t very fast so they need to work together.
  • Write ordering Small random writes are a killer.
  • Workload management You can optimize for sequential or random workloads, but managing both well is hard.

Canonical part
The paper’s discussion of flash memory is based on the spec for Samsung’s K9XXG08UXM 4 GB Single Level Cell (SLC) package. Other parts may differ, but NAND physics are the basic challenge.

The Samsung part has two 2 GB dies (chips) in the package. Each die has 8192 blocks - a block is 64 pages of 4 KB each - organized into 4 planes of 2048 blocks. The dies can be addressed independently, while cross-plane operations are limited to planes 0 & 1 or 2 & 3. Each page has 128 bytes for metadata.

Cross-plane operations are a form of parallelism. The Samsung part also provides a copy-back operation so one page can be copied to another without transporting the data off of the die. Copy-back is limited to copies within the same flash plane of 2048 blocks.
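The geometry above is easy to sanity-check with a little arithmetic. This is only a sketch using the figures quoted in this post; the constant names are mine, not Samsung's or the paper's.

```python
# Geometry figures as quoted in this post for the Samsung SLC package.
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

DIES_PER_PACKAGE = 2
PLANES_PER_DIE = 4
BLOCKS_PER_PLANE = 2048
PAGES_PER_BLOCK = 64
PAGE_BYTES = 4 * KB
METADATA_BYTES_PER_PAGE = 128

blocks_per_die = PLANES_PER_DIE * BLOCKS_PER_PLANE
block_bytes = PAGES_PER_BLOCK * PAGE_BYTES          # smallest erasable unit
die_bytes = blocks_per_die * block_bytes
package_bytes = DIES_PER_PACKAGE * die_bytes
metadata_bytes_per_die = blocks_per_die * PAGES_PER_BLOCK * METADATA_BYTES_PER_PAGE

print("blocks per die:", blocks_per_die)
print("block size (KB):", block_bytes // KB)
print("die capacity (GB):", die_bytes // GB)
print("package capacity (GB):", package_bytes // GB)
print("metadata per die (MB):", metadata_bytes_per_die // MB)
```

The numbers hang together: 4 planes of 2048 blocks give the 8192 blocks per die, and each erase touches a 256 KB block even if only one 4 KB page needs to change.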

Expensive writes
NAND flash is a type of EEPROM. About the only characteristics it shares with disks are block structure and persistence. To write - or as the flash guys say program - it must first be erased. And you can’t just erase a 4 KB page - you have to erase an entire block.

An erase operation takes 1.5ms, making it considerably more expensive than a read or a write. To maintain a supply of empty blocks a cleaning process - garbage collection - runs when the free block supply gets low.

SLC flash is good for about 100,000 writes, so not only do you have to manage the full block erasure problem, but you also have to manage the life span of each block - the wear-leveling problem.
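To make the erase-before-write rule and the wear problem concrete, here is a toy model. It is my own illustration, not the paper's simulator: the class, the function and the "pick the least-erased block" policy are all hypothetical, and I am treating the 100,000 figure as a per-block erase budget.

```python
class FlashBlock:
    """Toy NAND block: a page can be programmed once, erasure is per block."""

    def __init__(self, pages_per_block=64, erase_limit=100_000):
        self.pages = [None] * pages_per_block  # None means erased
        self.erase_count = 0
        self.erase_limit = erase_limit         # assumed per-block budget

    def program(self, page_index, data):
        if self.pages[page_index] is not None:
            raise ValueError("page already programmed; erase the block first")
        self.pages[page_index] = data

    def erase(self):
        if self.erase_count >= self.erase_limit:
            raise RuntimeError("block worn out")
        self.pages = [None] * len(self.pages)  # every page in the block is lost
        self.erase_count += 1


def least_worn(blocks):
    """Naive wear-leveling choice: the block erased the fewest times."""
    return min(blocks, key=lambda block: block.erase_count)


blocks = [FlashBlock() for _ in range(4)]
blocks[0].program(0, b"v1")
try:
    blocks[0].program(0, b"v2")   # in-place overwrite is not allowed
    overwrite_ok = True
except ValueError:
    overwrite_ok = False

blocks[0].erase()                 # the whole block goes, not just one page
blocks[0].program(0, b"v2")
target = least_worn(blocks)       # block 0 now has one erase, so it is skipped
```

A real FTL would write the new version to a fresh page elsewhere and clean the old block later; the point here is only that the erase counter is what wear leveling has to spread around.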

[Wear-leveling will become even more acute with next-gen 3 and 4 level cells. Speculation is that the write spec could drop as low as 1,000 per cell.]

Here is a table of the operational flash parameters for the Samsung part from the paper:

SSD controller architecture
The flash packages of course are only the building blocks of an SSD. Much of the magic comes from the architecture and optimizations of the SSD controller logic. This is a generalized block diagram for an SSD controller:

Key elements:

  • Host interconnect SATA, USB, FC, PCI-e
  • Buffer management for pending and satisfied requests.
  • Multiplexer to manage instruction and data transport along the serial connections to the flash packages.
  • Processor to manage request flow and mappings from the logical block address to physical flash locations.
  • RAM for the processor.

On a cheap USB thumb drive all these elements may be integrated into a single chip. On a high-performance Fibre Channel SSD these elements may be separated on their own PC board.

The size of the flash packages also has an impact on cost and architecture. A 32 GB SSD built with the Samsung parts would require 136 pins at the controller. Larger SSDs may not have enough pins for full interconnection between the controller and the flash packages, requiring additional engineering trade-offs.

Faking it
Borrowing a simulator, DiskSim from Garth Gibson’s Parallel Data Lab at CMU, the team modified it to reflect SSD latency and architecture. Features unique to SSDs, such as multiple request queues, logical block maps, cleaning and wear-leveling states were added.

Workloads
They used a collection of workload traces they named TPC-C, Exchange, IOzone and Postmark, as well as a group of microbenchmarks generated by DiskSim.

The TPC-C trace came from a large-scale configuration comprising 14 HP MSA1500 FC controllers, each supporting 28 36 GB disks. Exemplifying the current high-end OLTP problem, each controller had over a terabyte of disk, but the benchmark used only 160 GB of that capacity.

The Exchange server was similarly over-configured with 6 RAID controllers each running 1 TB capacity, while the 15 minute trace utilized only 250 GB of that, with a workload of 3 reads for every 2 writes.

Microbenchmarks
These were run using 4 KB I/Os. With cleaning enabled the write operations include the extra overhead. Sequential I/Os have less cleaning overhead. Note cleaning has a ~30% hit to the random write rate.

Trade-off summary
The researchers looked at several design techniques:

  • large allocation pool
  • large page size
  • over provisioning
  • ganging
  • striping

These deserve some explanation.

A large allocation pool is convenient for achieving performance, but there is a cost. If the page size is small, there is more overhead of managing the pages.

If the page size is large, it is easier to manage the pages, but writes smaller than the page size require a read-modify-write operation, which kills performance.
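Here is a rough sketch of why the read-modify-write hurts. The 16 KB page size and the 512-byte write are hypothetical numbers picked for illustration, and the function is mine, not anything from the paper.

```python
PAGE_BYTES = 16 * 1024  # hypothetical large page size, for illustration only


def write_small(page, offset, data, stats):
    """Update part of a page: read all of it, patch it, write all of it back."""
    if offset + len(data) > len(page):
        raise ValueError("write crosses the page boundary")
    buffer = bytearray(page)                   # read the whole page
    stats["bytes_read"] += len(page)
    buffer[offset:offset + len(data)] = data   # modify the copy in memory
    stats["bytes_written"] += len(buffer)      # write the whole page back
    return bytes(buffer)


stats = {"bytes_read": 0, "bytes_written": 0}
old_page = bytes(PAGE_BYTES)                   # a page of zeros
new_page = write_small(old_page, 4096, b"\xff" * 512, stats)

# Flash bytes written per byte the host actually asked to write.
amplification = stats["bytes_written"] / 512
print(f"read {stats['bytes_read']} bytes, wrote {stats['bytes_written']} bytes, "
      f"amplification {amplification:.0f}x")
```

With these made-up sizes a 512-byte update costs a full page read plus a full page write, which is the overhead the large-page design has to live with on small random writes.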

Over provisioning reduces the cleaning overhead, at the cost of more expensive storage.

Ganging requires more explanation. A flash package is made of one or more dies or chips. The serial interface to the flash packages is a primary bottleneck for SSD performance. Spreading a write across multiple serial interfaces is an obvious way to improve performance. The cost comes in the interconnect density between the packages and the dies.

If a write can be interleaved across multiple flash packages, read or write bandwidth can be substantially improved. The ability to place multiple packages in an SSD, and to interleave operations across those packages, is key to the performance improvements that SSD vendors have been advertising.
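A minimal sketch of the interleaving idea, assuming simple round-robin placement of logical pages across packages. The package count and function names are hypothetical; the paper explores several allocation schemes and this is not meant to be any one of them.

```python
def stripe(logical_page, num_packages):
    """Round-robin placement: (package, page index within that package)."""
    return logical_page % num_packages, logical_page // num_packages


def packages_touched(first_page, page_count, num_packages):
    """Set of packages that a run of consecutive logical pages lands on."""
    return {stripe(page, num_packages)[0]
            for page in range(first_page, first_page + page_count)}


NUM_PACKAGES = 8  # hypothetical SSD with 8 flash packages

# An 8-page sequential write lands on 8 different packages, so the slow
# serial interfaces can all be busy at once.
busy = packages_touched(0, 8, NUM_PACKAGES)
print(len(busy), "packages working in parallel")

# A single-page write can only ever keep one package busy.
print(len(packages_touched(5, 1, NUM_PACKAGES)), "package for a 1-page write")
```

This is also why workload matters so much: the parallelism only shows up when requests are large enough, or numerous enough, to spread across packages.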

The StorageMojo take
This paper is too rich in detail to summarize well. If understanding SSD controller design is important there is no substitute for a careful read.

The net is that engineers have many options in configuring and managing flash devices inside a solid state disk. The interaction of these design choices with applications is likely to remain a fruitful area of study for years to come.

Expect to see many performance oddities as new solid state disk designs are released. This is a different world than disk drives. There is much innovation and much to learn.

A macro longer-term trade-off is the extent to which SSD vendors should attempt to alter operating system behavior to better match SSDs. In the short term designers must conform to today’s disk I/O oriented operating systems. In the long term however, there must be major opportunities to tweak operating systems to enhance solid-state disk performance.

For this reason SSDs may find their best short term market to be inside storage arrays where array vendors have complete control over the interface to the array software. This will be no small advantage as array vendors struggle to remain relevant in a world where high performance solid state disks have the potential to replace midsize arrays.

Comments welcome, of course.

Update:
Ted Wobber kindly wrote in with a comment I’m reproducing in full, since he does a better job of getting to the heart of the matter than I did:

I think the bottom line is that flash devices are a lot more complicated than you might think they would be. At first glance, the conventional wisdom is that something constructed out of solid-state circuitry should be fundamentally simpler than a device with very small parts moving at high speed. However, you have to remember that NAND-flash is built on quantum tunneling, and while the software layers that build up from there don’t involve advanced physics, the properties of the medium create complexities and tradeoffs that might not be expected.

We don’t talk with SSD vendors at a great level of detail since we’d prefer not to be under NDA unless there is a good reason. However, informal discussions and other materials I’ve seen have convinced me that our evaluation of the state of affairs isn’t far from the truth. It’s my opinion that most manufacturers are well aware of these sorts of tradeoffs, and they carefully consider them along with the requirements of their target markets and cost structures. The point of our article was to talk about these tradeoffs in an academic forum unconstrained by IP issues, and to begin to tease apart the tangle of related issues.

In sum, SSDs constitute a marvelous step forward and are really useful in many applications. However, they are not a panacea, at least not yet.

/Ted

Thank you, Ted.

Testing, testing, 1 2 3 . . .

July 7th, 2008 by Robin Harris in Architecture, Disk, SSD/Flash Disk

George Ou weighs in
Many good points have been made about the problems with the Tom’s Hardware flash SSD tests. My former colleague George Ou, late of ZDnet, weighed in with an excellent summary of the TH testing problems:

The tests are very flawed.  If you read the results, the SSDs with the worst power consumption aren’t the ones getting the worst battery life.  The ones with great performance and above average power consumption turn out to be the worst on battery life WITH THE TEST THEY RAN.
 
What this says is that Tomshardware’s measurements weren’t wrong, but what they were measuring was wrong.
 
The load test was not well controlled.  The SSDs with great performance allowed the benchmark to run faster which cranked the CPU more.  The difference in the CPU state is what explains the discrepancy in their data.
 
A proper measurement would have done a fixed amount of CPU work and a fixed amount of storage work and then you can see how long the battery lasts.  They could have simply played a movie off the storage system and let it play until the battery died.  Videos are great because they’re fixed computational workload and fixed storage workload.
 
This is yet another example of bad science by Tomshardware.

I don’t buy the “play a movie” test - that only tests playing a movie - but I do accept that Tom’s Hardware didn’t do a great job of testing. So what?

I’ll be returning to the testing issues shortly - after pausing for this disclosure.

Disclosure: I’m biased towards notebook flash drives
Unlike, AFAIK, any of the commenters - pro or con - I used a flash-based Windows notebook every day for 5 years and loved it. It had a 10 hour battery life, a full-size keyboard and a sleep mode that really worked. Bliss!

I also paid an extra 20% - $400 back when the dollar was worth something - for the dinky 10 MB CF card it used. It was worth every penny.

Based on my sample size of 1 (me) here’s WHY it was worth an extra 20%:

  • Battery life. The Omnibook 300 went from 5 hours to 10 hours of battery life with flash.

Factors that didn’t matter:

  • Performance: I never compared the disk to the flash, but the performance was “good enough” with either.
  • Durability: nobody gets 5 years out of a notebook drive, but crashing wasn’t a liability since all docs were copied to an external system.
  • Boot up time: sleep mode worked perfectly, so I’d reboot once a month at most. I did not care about boot time.
  • Multi-media workloads: while I agree with George that a video provides a good fixed workload, notebook SSDs are aimed at business travelers whose workloads commonly allow drives to spin down. But this is a topic that deserves a deeper look.
  • Capacity. The Omnibook had a compression utility that effectively doubled capacity to 20 MB. But it was easy to copy stuff off the ‘book - Laplink - so it never felt cramped.

Those are my biases. They may or may not be the biases of Mr. Road Warrior - but I suspect they are close. End disclosure.

Testing, testing, testing
Performance testing is a black art. That’s why test driving applications remains popular: there are so many variables that predictions based on benchmarks are close to useless.

Because of that I prefer to look at the preponderance of evidence rather than a single benchmark or set of tests. More data points paint a clearer picture.

For example, the single most positive SSD test I’ve found is Anandtech’s MacBook Air SSD. Similar results from another test are here.

Battery Life Test (H:MM) | 80GB 4200RPM HDD | 64GB SSD | % Improvement
Wireless Internet + MP3 | 4:16 | 4:59 | 16.8%
DVD Playback | 3:25 | 3:56 | 15.1%
Heavy Downloading + XviD + Web Browsing | 2:26 | 2:42 | 11.0%

Bottom line best case: a 17% improvement. Not zero but not, as most reviewers concluded, enough to justify the price.
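For anyone who wants to check the percentages, here is a quick sketch that recomputes them from the Anandtech run times quoted above. The helper names are mine.

```python
def minutes(h_mm):
    """Convert an H:MM string such as '4:16' into minutes."""
    hours, mins = h_mm.split(":")
    return int(hours) * 60 + int(mins)


def improvement_pct(hdd, ssd):
    """Percentage gain in battery run time of the SSD over the HDD."""
    return round((minutes(ssd) - minutes(hdd)) / minutes(hdd) * 100, 1)


# (HDD run time, SSD run time) as quoted in the table above.
results = {
    "Wireless Internet + MP3": ("4:16", "4:59"),
    "DVD Playback": ("3:25", "3:56"),
    "Heavy Downloading + XviD + Web Browsing": ("2:26", "2:42"),
}

for test, (hdd, ssd) in results.items():
    gained = minutes(ssd) - minutes(hdd)
    print(f"{test}: {improvement_pct(hdd, ssd)}% (+{gained} min)")
```

In absolute terms the gains work out to 43, 31 and 16 minutes.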

Ars Technica also reviewed the MBA SSD and had mixed results. They concluded:

. . . I had high hopes for the battery life on the SSD model. Unfortunately, I was met with only moderate gains when there were any at all.

More Anandtech
Anandtech also tested a high-end Memoright SSD in a high-end MacBook Pro. Here are their results:

Battery Life in Hours (Higher is Better) | MacBook Pro (Hitachi 5400RPM) | MacBook Pro (Memoright SSD)
Wireless Internet Browsing + MP3 Playback | 5.13 hours | 5.0 hours
DVD Playback | 3.88 hours | 3.58 hours
Heavy Downloading + XviD Playback + Web Browsing | 3.38 hours | 3.37 hours

The StorageMojo take
All workload testing is a compromise - but the preponderance of the evidence is clear: significant - i.e. 40% or better - notebook power advantages just aren’t there. UMPCs that can’t afford a disk - flash will win. Notebooks? Hasta la vista, baby.

The one SSD advantage that is yet to be debunked is durability. Someone made a case that just the maintenance advantages alone justify SSDs for enterprise notebooks. And it may be that simple.

Yet even there, the issues of hard CapEx dollars against softer expense dollars will work against SSDs.

Maybe the next gen of flash controllers will solve all the problems and usher in the age of flash storage everywhere. But piddly 20-30 minute gains for an extra $300 won’t do it.

Comments welcome, of course. Just so everyone knows: I haven’t done any work in the last few years for either flash drive or disk drive vendors. I wish them both the best.

StorageMojo at SNIA Symposium

July 7th, 2008 by Robin Harris in Off-Topic

If your company is a SNIA member and you’re in the Bay Area the Storage Networking Industry Association Symposium might be the excuse you’re looking for to cut work on a lovely summer day.

I’ll be delivering a keynote address on Wednesday morning, July 23rd, at the St. Claire. Think of it as an interactive Animatronic version of StorageMojo.

Topic: Crossing the Next Storage Chasm: 5 New Technologies that will Change your Data Center. The blurb:

New technologies are changing the face of storage. Robin Harris, analyst and author of the StorageMojo blog, looks at 5 of them, including flash SSDs, 10 GigE, and Google-scale storage. Get the incisive StorageMojo take on these topics and what you really need to know.

Notice I left myself some wiggle room. What do you think the other 2 topics should be?

The StorageMojo take
The data center has more pieces in motion today than ever before. The possibilities are almost infinite, but budgets and attention spans aren’t.

As an industry we don’t do a very good job of a) listening to customers and b) responding with insightful solutions. How can the industry help itself and customers get through the maze? I have a few suggestions.

Comments welcome, of course.

Notebook SSDs are dead

July 2nd, 2008 by Robin Harris in Disk, Future Tech, SSD/Flash Disk

It’s all over but the shouting
The scoop: the gap between notebook SSD promise and performance has been growing steadily. Now a review in Tom’s Hardware puts the final nail in the coffin. The title says it all:

The SSD Power Consumption Hoax : Flash SSDs Don’t Improve Your Notebook Battery Runtime – they Reduce It

By as much as an hour. A winner with the stupid high-end notebook demographic. The Paris Hilton market.

Ouch. Oops. Who knew?

Or who should have known?

Details
There’s a longer piece with some detail at Storage Bits but here’s the summary:

  • A Crucial SSD - costing $25/GB - used more power - 1.6 W at idle - than any 2.5″ notebook drive requires.
  • A Memoright 32 GB drive used a full 2 W at idle
  • An Mtron 32 GB flash drive reduced battery life by almost an hour.
  • The slowest drive - a year old Sandisk SSD 5000 - almost equaled the Hitachi 7200 RPM Travelstar’s energy use. But the SSD offers fewer IOPS than the hard drive!
  • They tested against a 200 GB Hitachi Travelstar 7k200, but other 2.5″ 7200 RPM drives have similar power envelopes.

And, of course, a 5400 RPM drive is more efficient. And a 160 GB 1.8″ drive is even more efficient, roomier and cheaper than any of the SSDs TH tested.

My guess on the not-easily-or-quickly-fixed culprit? The flash control logic - disk translation layer - needs cycles for wear leveling, garbage collection, buffer and cache management, flash mux/demux and the SATA interface - with frequent background operations even when the drive is idle.

And don’t forget the 20 volts required to write a cell.

Tom’s singles out Crucial for special mention:

Users who purchase this drive because of Crucial’s statements such as “low power consumption” and the product being ideal for “users who want longer battery life” will most likely be disappointed. While the total battery runtime certainly depends on the workload — we used Mobilemark 07 — the minimum and maximum power consumption measurements prove that Crucial’s statements of low power consumption are in fact wrong: 1.6 W idle power is more than any 2.5” notebook hard drive requires.

Did anyone even think to check the facts? At least one engineer had to know - and he told someone.

What’s the dynamic?
Some will say I’m premature, like when I said HD DVD was dead a year ago. But think about the market dynamic:

  • Cool but costly new technology needs early adopters
  • Based on the marketing, hip high-end adopters spring for a costly status symbol with claimed road-warrior features
  • But the supposed advantages don’t exist, so the early adopters feel like chumps
  • Word of mouth stops. Who wants to admit they were suckered?
  • Notebook SSDs slip into obscurity as enterprise and very low-end SSDs move into the spotlight

Making early investors/adopters look stupid is not a winning strategy.

The StorageMojo take
The notebook SSD vendors have dug themselves a very deep hole. How to fix?

  1. Stop digging. A month in detox would help. Some encounter group time with the HD DVD folks.
  2. Form a serious performance consortium and get real about performance, power and longevity.
  3. Do the hard work of getting notebook operating systems better optimized for flash. Use Linux and OS X to beat Microsoft into some semblance of cooperation. Do the engineering for Apple - they’re open source, right? If Apple does it, it’s cool - and you need cool.

What the SSD guys will do:

  • Deny and obfuscate. “Not representative. Slanted. Unfair. Conspiracy.”
  • Claim next gen will fix all problems.
  • Performance, performance, performance. Which is a weak reed as well.
  • Point to cost curves showing that, without a doubt, flash overtakes disk in 5 years.

And then hope the smart, techy, affluent road warrior demographic has a short memory. Good luck with that.

Comments welcome, of course.

The Hitz report

July 1st, 2008 by Robin Harris in Off-Topic

The NetApp/Sun patent battle continues. I don’t see how NetApp can win this, given the Supreme Court’s Teleflex decision, which makes prior art a question that can be appealed all the way to the Supreme Court.

But the company is doggedly pursuing the battle, and Dave Hitz’s recent declaration - which he hoped would remain private - has been unsealed.

It is an illuminating document.

Lame logic
Early on, Dave points to Sun’s Jeff Bonwick’s statement that NetApp’s WAFL was

. . . the first commercial file system to use the copy-on-write tree of blocks approach to file system consistency.

As if that proves anything. Sun is arguing that earlier non-commercial research experimented with those and other techniques, establishing the prior art and invalidating NetApp’s patents. One NetApp patent has already been removed from the litigation and I expect more to follow.

Fear and trembling
Hitz goes on to say

Because Sun is exploiting NetApp’s patented technology for free and creating interest in ZFS by giving it away for free, it does not have to cover the true cost of incorporating ZFS into the Sun Fire X4500 and marketing it. Sun is thus able to undercut NetApp’s pricing on a per gigabyte basis, like any counterfeiter. This negatively affects NetApp’s ability to compete in the storage space. In responding to normal market pressures, NetApp would have to consider shrinking its normal profit margins. Reduced profit margins in this marketplace can be permanent and difficult to quantify.

One would have to believe that if Sun were paying a reasonable license fee to NetApp the Sun Fire x4500 wouldn’t be competitive with NetApp’s products. That doesn’t compute: the x4500 is a box of disks with a commodity motherboard. It’s the packaging density and cost amortization across 48 drives that gives the x4500 its $/GB advantage.

NetApp could do the same tomorrow - and I hope they’re working on it. They’d enjoy the same cost structure as Sun. Sun still has all the costs of building, debugging and marketing a complex product so NetApp would have the cost advantage.

Losing in the court of public opinion
Dave later comments on the public campaign Sun is waging against NetApp:

I am painfully aware that IP litigation is not favorably viewed by many members of the open source community. Indeed, the mere participation in a lawsuit can bear a reputational cost. Aside from the obvious monetary costs of protracted litigation and the distraction of resources from normal business functions, if the Court grants Sun’s motion for a partial stay, NetApp will suffer irreparable harm to its reputation because it will prolong this whole matter rather than allowing for a prompt disposition.

Newsflash: NetApp has suffered harm and will suffer further harm no matter how this gets resolved. If they win, they lose. In the court of public opinion it would be better if they lost.

The court did grant the partial stay. And unsealed Dave’s declaration.

The StorageMojo take
So sad and unnecessary. I understand the impulse to protect one’s intellectual property. I’ve done it myself on occasion.

NetApp’s biggest misperception is that WAFL is somehow central to the success they are enjoying today. That was true about 10 years ago. Guys, your average F500 CIO today could care less about WAFL.

NetApp is growing because they offer a compelling value proposition of quality products, relevant services and worldwide support. WAFL certainly supports that, but as NetApp execs note much of their recent success is due to the integration software that NetApp now offers.

WAFL is a small piece of the picture. Sun could copy it line for line and still not have a quarter of what NetApp offers.

NetApp faces challenges. Storage commoditization threatens all vendors’ traditional 60% gross margins. The GX integration is problematic and the bottom line benefit uncertain. EMC’s move into cloud file services is a clever flanking strategy.

But letting fear drive you isn’t the answer. Boldness and innovation - NetApp’s traditional strengths - are the way to a profitable and high-growth future. Sun is a distraction, not a direction.

Comments welcome, of course. Disclosure: I’ve met Dave Hitz a couple of times and he is a genuinely fine person. If you think I’ve pulled some punches here that’s the reason.

Update: A commenter felt I didn’t get Dave’s point across because I’d edited the quote to what I - perhaps mistakenly - thought were the Most Significant Bits. Here’s the salient part of ⁋5 of Dave’s declaration:

5. Sun’s ZFS technology appears to be a conscious reimplementation of NetApp’s innovative WAFL filesystem, as admitted by the creators of ZFS: “The file system that has come closest to our design principles, other than ZFS itself, is WAFL . . . the first commercial file system to use the copy-on-write tree of blocks approach to file system consistency.”

I still don’t follow the logic that Bonwick’s acknowledgment of WAFL’s technical features means that ZFS is a “conscious reimplementation” of WAFL. Evidently the judge wasn’t persuaded either.
End update.

David Caminer: app design for 1st business computer

June 29th, 2008 by Robin Harris in Enterprise

Sometimes we forget how young the computer revolution is. The death 10 days ago of David Caminer, who led the application programming for the world’s first business computer, the Lyons Electronic Office (LEO), is a reminder.

LEO performed its first business calculation - with 2,000 words of memory - on November 17, 1951, evaluating costs and margins on baked goods for J. Lyons & Company, a British chain of tea shops. Mr. Caminer was the systems analyst for the project, which grew into an early computer company that eventually became part of ICL.

From the obituary in the Independent

In 1947 a Lyons fact-finding team visited the United States to catch up on new developments in office methods. They learned for the first time about the newly invented electronic computer. No machine had yet been built, but they learned that Maurice Wilkes at Cambridge University was as far ahead as anyone in constructing a machine. On its return to England, the team made contact with Wilkes, who agreed to supply the design information to Lyons, and Lyons agreed to provide some additional finance and manpower to the project.

The Cambridge machine sprang into life in May 1949, and Lyons then proceeded to construct a copy of the machine. A Cambridge engineer, John Pinkerton, led on the hardware side, while Caminer was put in charge of application development.

As today, many early computer projects went disastrously wrong. Not so at Lyons. Although the technology was radical and innovative, Caminer’s approach to the computerisation of business processes was utterly conservative. He assumed that what could go wrong would go wrong. He therefore set out on a learning curve – computerising simple jobs first, and gradually taking on ones that were critical to the business, such as payroll and stock control. Caminer was an early advocate of management by exception, using the computer to bring critical issues to the attention of management.

Like some current computer industry luminaries, Mr. Caminer was politically active, campaigning against British Fascist Oswald Mosley in the 30s and 40s and apartheid later, welcoming Bishop Desmond Tutu to his Borough.

Read more. The New York Times obit. An appreciation from Frank Land at the Leo Computers Society web site.

The first LEO ran for over 13 years - presaging IT’s “if it ain’t broke, why fix it?” mentality.


A LEO computer [courtesy the LEO Computers Society]

The StorageMojo take
As with so many revolutionary 20th century technologies - jet aircraft, radar, antibiotics - the British had an early lead that Americans eventually erased. Arguably the British lead in commercial business computers was the largest of all.

Given Mr. Caminer’s success in bringing large IT projects in on time, we should probably be sorry that we didn’t learn more from him and his methods.

Comments welcome, of course.


