The importance of balance
Balanced systems, yes!
Do not focus on the parts.
Optimize the whole.
Schooner Labs Evaluation and Optimization of Database Technologies: Flash Memory + Multi-Core Processors + Software
Dr. John R Busch, CTO and Founder
At Schooner Labs we continually analyze system architectures and evaluate technologies as we create our tightly-integrated appliances. Our goal is to optimize the overall system, including both software and hardware, to maximize performance and availability while minimizing the total cost of ownership. Our experimental methodology includes micro-benchmarks and system-level application benchmarks, in real life situations. We analyze the resulting relative performance, capacity, cost, availability, lifetime, power, and total cost of ownership to drive our architectural and engineering optimization decisions.
In this blog entry we discuss optimization for MySQL to fully exploit the power of flash memory and multi-core processors. In upcoming blogs we will report on our research in emerging technologies and other application segments, including processors with more cores, distributed caching, NoSQL, and high-performance synchronous replication.
I. Balanced Database Systems Exploiting Flash Memory, Multi-core Processors and Optimized Software
Multi-core processors and flash memory are an excellent fit for databases, including MySQL. Databases can inherently exploit extensive multi-threading since they are designed to process thousands of concurrent connections and transactions, which makes multi-core processors an ideal technology fit. The performance of database workloads has, however, been severely limited by hard drive performance (both IOPS and latency), which makes flash memory, with random IOPS rates of ~100x those of hard drives, an equally good fit. The challenge is to use multi-core processors and flash memory effectively to create a balanced database system with optimal performance, cost, and availability.
Even when the best-of-class multi-core processors and flash memory are assembled into a server, the utilization of the CPUs and flash is typically very low: the system is not balanced. The software needs sufficient parallelism and concurrency control to exploit multi-core and flash parallelism, and the memory hierarchy management needs to be optimized as well. The flash components need to be selected to provide the proper balance of read IOPS, write IOPS, and storage capacity, and the overall system needs to be optimized for performance, cost, and availability.
II. Schooner Analysis and Optimization Methodology of Balanced Database Systems
The standard 2U server with dual socket multi-core processors is the natural building block for the database tier of a typical scale-out datacenter since this is the design point that has been technology- and cost-optimized by the dominant computer system and microprocessor vendors. In order to evaluate and optimize balanced database solutions exploiting multi-core processor, flash memory, and software technologies, we select a standard 2U server, and then optimize the storage hierarchy and software to create a balanced, scalable database server building block.
1. Configuration Description
In terms of flash technologies, there are numerous SATA, SAS and PCIe flash alternatives on the market today, and a new generation of products that have been announced and that are beginning to enter the market. We report here on our comparative measurements and analyses of both early and next-generation PCIe and SSD enterprise flash products. Our specific studies reported in this blog include: first-generation PCIe flash storage from Fusion-io (ioDrive Duo); first-generation SSD flash storage from Intel (Intel X25); and next-generation enterprise flash storage SSD from OCZ/Sandforce (SF1500 flash processor in OCZ Deneva SSD).
In the analysis reported here, we normalized across technologies by using a standard IBM 3650M2 2U server with dual quad-core 3GHz Nehalem CPUs and 64GB of DRAM. SATA SSDs are connected to multiple LSI PCIe SATA controllers. PCIe flash adapters are inserted into x8 PCIe riser slots. To compare the results on a consistent basis, we normalized them to this 2U server configuration. There are 8 available drive bays and 2 available PCIe slots. Our test results indicate near-perfect linear scalability up to 8 SATA drives using PCIe SATA adapters. The 2U configurations consist of 8 SSDs in the case of Intel and OCZ, or 2 Fusion-io Duo cards occupying 2 PCIe slots.
2. Micro-benchmark Level Analysis
2.a Benchmark Description: iozone (http://www.iozone.org/)
We use the results of micro-benchmarking to validate datasheet performance claims in a comparable and realistic context, to measure the impact of configuration and settings, and to gain an understanding of performance, cost and availability at the component level as it relates to system-level performance, cost and availability impacts.
Each micro-benchmark cycle starts by sequentially writing two passes across the entire drive. Then all locations are read randomly, followed by randomly writing all locations on the drive. This benchmark scenario is required to get a drive to a steady-state condition. Until a flash drive has been completely written, there is extra spare capacity on the drive, easing the burden on the drive’s garbage collection system. Without this write pre-conditioning, the drive’s write performance is not representative of what it delivers in real-life usage. With some of the drives tested, the true steady-state performance is far less than the data sheet performance claims indicate.
Throughput data is collected continuously during the micro-benchmark execution so the time-varying impact of garbage collection can be assessed. Each test result generates a graph of the throughput over time, through the precondition, read, and random write phases of the test. Real applications depend on sustainable throughput during peak loads. Thus we report the minimum one-minute average during a test as the comparison throughput for a measured flash technology.
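The "minimum one-minute average" metric described above can be sketched in a few lines. This is only an illustration: it assumes throughput was sampled once per second into a list of IOPS values, which the text does not specify, and the function name `min_window_average` is hypothetical.

```python
def min_window_average(samples, window=60):
    """Return the lowest average over any `window` consecutive samples.

    `samples` is assumed to hold one throughput reading (IOPS) per second,
    so window=60 corresponds to a one-minute average.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        raise ValueError("need at least one full window of samples")

    # Maintain a running sum so each step is O(1) instead of re-summing.
    running = sum(samples[:window])
    lowest = running
    for i in range(window, len(samples)):
        running += samples[i] - samples[i - window]
        lowest = min(lowest, running)
    return lowest / window


# Hypothetical trace: steady throughput with a 90-second garbage collection dip.
trace = [100_000] * 120 + [40_000] * 90 + [100_000] * 120
sustained = min_window_average(trace, window=60)
```

Reporting the minimum window rather than the overall mean keeps a drive from looking better than it behaves during its worst garbage collection interval.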
The detailed iozone specifics: We used 32 threads and 32 separate files. We also tried fewer and more threads with each device to make sure there was enough parallelism to keep the device busy, but not so many threads that throughput dropped off. Here are the command line options we used:
iozone -O -I -i 0 -i 2 -R -t 32 -r 16k -s 7.5g -F /mnt/ssd/file1 /mnt/ssd/file2 …
-O => output using IOPS metric
-I => use direct I/O, bypass file cache
-i0 -i2 => Run test 0, sequential write, followed by test 2, random read/write
-R => output MS-Excel compatible format to stdout
-t32 => throughput mode, use 32 accessor threads (processes) and 32 files
-r16K => use a record size (block size) of 16 KBytes
-s7.5G => each file to be 7.5 binary gigabytes size
-F <files> => list of 32 file paths
iozone -O -I -i 0 -i 2 -R -t 32 -r 4k -s 7.5g -F /mnt/ssd/file1 /mnt/ssd/file2 … /mnt/ssd/file32
You need to name all 32 files, from /mnt/ssd/file1 through /mnt/ssd/file32. The directory /mnt/ssd needs to exist; iozone will create the files specified on the command line. Adapt the command to your exact path, desired block size, and device size. The file size used above is for a 256GB Fusion-io drive.
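Typing 32 file paths by hand is error-prone, so the command line above can be generated. The sketch below uses only the iozone options listed in this post; the helper name `build_iozone_command` and its parameters are hypothetical, and the path and sizes should be adapted to your device.

```python
import shlex


def build_iozone_command(mount_dir, threads=32, record_size="4k", file_size="7.5g"):
    """Build the iozone argument list with one file per thread."""
    # file1 .. fileN, matching the naming used in the command shown above.
    files = [f"{mount_dir}/file{i}" for i in range(1, threads + 1)]
    return [
        "iozone", "-O", "-I",      # report IOPS, use direct I/O
        "-i", "0", "-i", "2",      # sequential write, then random read/write
        "-R",                      # Excel-compatible report on stdout
        "-t", str(threads),        # throughput mode with N threads and files
        "-r", record_size,         # record (block) size
        "-s", file_size,           # size of each file
        "-F", *files,
    ]


cmd = build_iozone_command("/mnt/ssd")
print(shlex.join(cmd))  # paste the printed line into a shell to run the test
```

With the defaults, the 32 files of 7.5 GB each cover 240 GB, which fits within the 256GB de-configured capacity mentioned above.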
2.b Technology Evaluation: First-Generation PCIe Flash: Fusion-io ioDrive Duo SLC
Fusion-io sells PCIe-based flash cards. Fusion-io’s ioDrive Duo product is described at http://www.fusionio.com/products/iodriveduo/. It provides 160 GB (RAID10) or 320 GB (non-RAID) of SLC flash with a street price of approximately $13,179, for example at http://accessories.us.dell.com/sna/productdetail.aspx?sku=A3131275&cs=04&c=us&l=en&dgc=SS&cid=27722&lid=628335.
The Fusion-io architecture is based on simple discrete hardware coupled with server-side software. It uses the host server’s processor cores and DRAM to manage the flash card’s logical-to-physical mapping, write coalescing, garbage collection, and wear leveling. The Fusion-io card uses an FPGA to control the interface to the flash chips and relies on software in the host to perform the allocation, garbage collection, and high-level program/erase algorithms. As a consequence, the peak IOPS throughput is very high relative to the small storage capacity, but there is a substantial burden on the host CPU and host DRAM, as reported in the application system level benchmark results below. Additionally, the measured micro- and application system-level benchmark results indicate that Fusion-io garbage collection algorithms operate with great variability, which has a substantial and unpredictable impact on the sustainable write throughput. 
Some notes about the Fusion-io configuration and results: We de-configured the available storage on each Fusion-io card from 320GB to 256GB in order to minimize the impact of the Fusion-io periodic garbage collection algorithm on steady-state write performance. Fusion-io recommends this for write-intensive workloads. Our initial tests with the ioDrive at the default storage size (320GB per card) resulted in highly variable performance and relatively low write throughput. The results shown are for the de-configured storage size, but they still show a 2:1 performance variability. The normalized throughput is the minimum that is sustainable during the garbage collection interval, which lasts several minutes.
| Normalized to 2U | ioDrive Duo |
| Capacity GB | 440 |
| 4KB Read IOPS | 324,000 |
| 4KB Write IOPS | 220,000 |
| CPU time per I/O | 19 µs |
| System $/GB | $60 |
| Storage subsystem cost | $26,360 |
2.c Technology Evaluation: First-Generation SSD Flash: Intel X25e
The Intel X25e is a SATA 2.5” SSD 64GB SLC drive incorporating an SSD-based intelligent ASIC which manages all flash buffering, write coalescing, garbage collection, etc. on the SSD drive. This product is fully described at: http://accessories.us.dell.com/sna/productdetail.aspx?sku=A3131275&cs=04&c=us&l=en&dgc=SS&cid=27722&lid=628335. The product carries a street price of $679 http://www.newegg.com/Product/Product.aspx?Item=N82E16820167014&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Solid+State+Disk-_-Intel-_-20167014.
The Intel X25e SATA SSD is a first-generation enterprise flash technology in a 2.5” SATA form factor. It is the first affordable NAND SSD to incorporate an efficient on-board SSD program/erase algorithm. The X25e uses an embedded microprocessor in the drive to perform the program/erase, allocation, and garbage collection algorithms. The capacity per drive, at 32GB or 64GB, is smaller than that of the Fusion-io card. As a consequence, the peak throughput and sustainable write throughput of a single X25e are less than those of a single Fusion-io card. However, the X25e’s relatively low cost allows the efficient use of several drives and controllers in parallel to cost-effectively scale IOPS and bandwidth.
| Scaled to 2U | Intel X25E |
| Capacity GB | 488 |
| 4KB Read IOPS | 288,000 |
| 4KB Write IOPS | 29,600 |
| CPU time per I/O | 7 µs |
| System $/GB | $14 |
| Storage subsystem cost | $6,792 |
2.d Technology Evaluation: Next-Generation SSD Flash: SandForce SF-1500 and OCZ Deneva SSD
There has been a significant maturation in the flash industry in the last two years, with a value-chain shift to optimized flash processors which are incorporated into a broad range of PCIe and SSD products by numerous suppliers. This next generation of flash technologies provides significant improvements in flash performance, endurance, reliability, cost, and capacity. These flash processors have also been optimized to provide significant improvements in the performance and durability of MLC-based flash, especially when exploiting eMLC for both excellent performance and wear characteristics.
We evaluated the SandForce SF-1500 processor as built into the OCZ Deneva 200GB SATA II SSD. The datasheet for the SandForce SF-1500 Enterprise SSD Processor is located at http://www.sandforce.com/index.php?id=21&parentId=2. The datasheet for the OCZ Deneva SSD is located at http://www.oczenterprise.com/details/ocz-deneva-reliability-2-5-slc-ssd.html.
| Scaled to 2U | OCZ Deneva |
| Capacity GB | 1456 |
| 4KB Read IOPS | 160,000 |
| 4KB Write IOPS | 88,000 |
| CPU time per I/O | 7 µs |
| System $/GB | $7 |
| Storage subsystem cost | $10,600 |
2.e Summary of Normalized 2U Micro-benchmark Results across Flash Component Technologies
To compare the results on a consistent basis, they were normalized to a 2U server configuration. There are 8 available drive bays and 2 available PCIe slots. Our test results indicate near-perfectly linear scalability up to 8 SATA drives using PCIe SATA adapters. The 2U configurations consist of 8 SSDs in the case of Intel and OCZ, or 2 Fusion-io Duo cards occupying 2 PCIe slots.
| Scaled to 2U | Fusion-io ioDrive Duo | Intel X25E | OCZ Deneva |
| Capacity GB | 440 | 488 | 1456 |
| 4KB Read IOPS | 324,000 | 288,000 | 160,000 |
| 4KB Write IOPS | 220,000 | 29,600 | 88,000 |
| CPU time per I/O | 19 µs | 7 µs | 7 µs |
| Subsystem $/GB | $60 | $14 | $7 |
| Storage Subsystem Cost | $26,360 | $6,792 | $10,600 |
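The $/GB row follows directly from the cost and capacity rows. As a quick check, the sketch below recomputes it from the table values; the dictionary layout and function name are illustrative only.

```python
# Capacity and storage subsystem cost per normalized 2U server, from the table above.
configs = {
    "Fusion-io ioDrive Duo": {"capacity_gb": 440, "cost_usd": 26_360},
    "Intel X25E": {"capacity_gb": 488, "cost_usd": 6_792},
    "OCZ Deneva": {"capacity_gb": 1_456, "cost_usd": 10_600},
}


def dollars_per_gb(cost_usd, capacity_gb):
    """Storage subsystem cost divided by usable capacity."""
    return cost_usd / capacity_gb


per_gb = {
    name: round(dollars_per_gb(c["cost_usd"], c["capacity_gb"]))
    for name, c in configs.items()
}
for name, value in per_gb.items():
    print(f"{name}: ${value}/GB")
```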
Observations:
- The Intel and OCZ flash technologies, which use SSD-based flash processors rather than server driver-based mapping, garbage collection, and write coalescing, provided a better than 2.5x improvement in server CPU time per I/O over Fusion-io (7 µs vs. 19 µs) while incurring no server memory overhead (vs. many GB for Fusion-io). This leaves the host processor cores and memory available for useful application work.
- Both PCIe card-based flash and SATA SSD-based flash with PCIe-based controllers use the PCIe bus effectively. In both configurations, using 2 PCIe slots, the highest observed bandwidth was on the order of 1.6 gigabytes per second. In these measurements, direct PCIe attachment provided no bandwidth benefit for flash.
- The cost/GB of the SATA SSD solutions is dramatically lower than the PCIe-based solutions.
- As we discuss in the next section, real-world system-level application benchmarks and customer workloads require a balance of read and write performance, along with large storage capacity, to achieve a balanced system with optimum resource utilization and TCO. The performance (IOPS and bandwidth) and capacity of parallel SATA SSDs can be scaled to meet the needs of a balanced system, and they also provide the critical availability feature of hot swap that is not available with PCIe-based solutions.
- Systems based on hard drives typically accommodate up to 12 HDDs and offer I/O throughput of up to about 2,000 IOPS. Database applications such as MySQL are currently limited by this I/O performance; based on the available CPU capacity of dual-socket servers, they could typically run at over 10x the current throughput. The system-level benchmarks in the next section show that software optimization is critical for obtaining a balanced system that fully utilizes multi-core processors and parallel flash memory for database workloads. With optimized software, a well-balanced 2U, two-socket server needs on the order of 80,000 IOPS at about a 2:1 ratio of reads to writes, a figure that allows for a further 2x increase in CPU processing power. As the summary table above shows, the parallel Intel drives are adequate for current CPUs and the OCZ drive performance is well balanced for next-generation CPUs, whereas Fusion-io is overkill for database application workloads: its additional expense per gigabyte and its overhead in CPU time and DRAM are not justified.
- The critical characteristics of flash technologies for creating balanced systems in terms of performance, capacity, and cost have been significantly improved with the next generation of flash processors and SSD products.
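The balance argument in the observations above can be made concrete with a little arithmetic. The sketch below splits the 80,000 IOPS target at a 2:1 read-to-write ratio, compares it with the measured 4KB read and write IOPS from the summary table, and estimates how many cores' worth of CPU time the I/O path would consume at that rate. It is a simplification, not the model we used: it assumes the separately measured read and write rates are both available under a mixed load, and the function names are hypothetical.

```python
TARGET_IOPS = 80_000             # balanced 2U target from the text
READ_PARTS, WRITE_PARTS = 2, 1   # about a 2:1 read-to-write ratio

# (4KB read IOPS, 4KB write IOPS, CPU microseconds per I/O) from the summary table.
measured = {
    "Fusion-io ioDrive Duo": (324_000, 220_000, 19),
    "Intel X25E": (288_000, 29_600, 7),
    "OCZ Deneva": (160_000, 88_000, 7),
}


def required_mix(total_iops, read_parts, write_parts):
    """Split a total IOPS target into required read and write IOPS."""
    parts = read_parts + write_parts
    return total_iops * read_parts / parts, total_iops * write_parts / parts


def headroom(read_iops, write_iops, need_read, need_write):
    """Factor by which the tighter of the two limits exceeds the requirement."""
    return min(read_iops / need_read, write_iops / need_write)


def cores_busy(iops, cpu_us_per_io):
    """CPU-seconds consumed per second of wall time, i.e. cores kept busy by I/O."""
    return iops * cpu_us_per_io / 1_000_000


need_read, need_write = required_mix(TARGET_IOPS, READ_PARTS, WRITE_PARTS)
for name, (reads, writes, cpu_us) in measured.items():
    print(
        f"{name}: headroom {headroom(reads, writes, need_read, need_write):.2f}x, "
        f"{cores_busy(TARGET_IOPS, cpu_us):.2f} cores spent on I/O"
    )
```

Under these assumptions the Intel configuration clears the target with little to spare (it is write-limited), the OCZ configuration has roughly 3x headroom, and the Fusion-io configuration has roughly 6x, while spending about 1.5 cores on I/O handling versus about 0.6 for the SSDs.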
3. Application Level System Database Benchmarks: DBT2 (TPC-C equivalent)
DBT2 is an open-source implementation of TPC-C, the industry-standard benchmark for on-line transactional databases. The benchmark was constructed by a consortium including IBM, Oracle, HP, Cisco, Microsoft, and others to be representative of real workloads running on real databases. Participating companies measure their systems with the standard benchmark according to its rules and publish their results (total tpmC, $/tpmC, and watts/tpmC) at:
http://www.tpc.org/tpcc/results/tpcc_perf_results.asp
While TPC benchmarks certainly involve the measurement and evaluation of computer functions and operations, the TPC regards a transaction as it is commonly understood in the business world: a commercial exchange of goods, services, or money. A typical transaction, as defined by the TPC, would include the updating to a database system for such things as inventory control (goods), airline reservations (services), or banking (money). The TPC-C workload simulates the on-line entry and processing of customer orders. It is representative of write-intensive workloads with moderate transaction complexity and well-defined data consistency requirements.
3.a System-Level Database Performance Benchmark Results
The graph below shows DBT2 performance on the normalized 2U server with dual quad-core Nehalem processors and 64 GB DRAM with various storage subsystem and software alternatives. The RAID modes reflect the most common configuration for availability and performance in a 2U form factor. In all cases, the operating system level is CentOS version 5.4 with the Linux Kernel version 2.6.18. We compare the software versions MySQL 5.1.44, MySQL 5.5.4 Beta (GA in late 2010), and Schooner MySQL Enterprise 2.5 (available now and certified by Oracle as 100% compatible with MySQL Enterprise 5.1.44).
The first and second columns, labeled Hard Drives, show the performance of MySQL 5.1.44 and Schooner 2.5 optimized software with 12 hard disk drives (15K RPM) configured in RAID 5. The third, fourth, and fifth columns, labeled Intel X25E, show the performance of MySQL 5.1.44, MySQL 5.5.4 Beta, and Schooner MySQL 2.5 with 8 Intel X25E solid state drives (SSDs) configured in RAID 5. The sixth, seventh, and eighth columns, labeled Fusion-io, show the performance of MySQL 5.1.44, MySQL 5.5.4 Beta, and Schooner MySQL 2.5 with 2 Fusion-io ioDrive Duo 320s configured in RAID 10. The ninth column, labeled OCZ, shows Schooner MySQL 2.5 using 8 parallel 200GB SandForce/OCZ SSDs in RAID 5.
The performance results indicate clearly the critical role of system architecture and software optimization in creating optimal balanced systems. The highly optimized, tightly-coupled Schooner MySQL Enterprise 2.5 solution achieves:
- more than 6x the transaction throughput of currently available MySQL 5.1.44 with equivalent commodity Intel SSDs;
- more than 14x when compared to a MySQL 5.1.44 server using parallel optimized hard disk drives; and
- over 3x relative to Fusion-io flash storage with MySQL 5.1.44.
The currently available Schooner Appliance provides 4x the transaction throughput of the upcoming MySQL 5.5.4 Beta on identical hardware using commodity Intel X25E SSDs, and more than 250% of the transaction throughput of the upcoming MySQL 5.5.4 Beta with Fusion-io flash storage.
The Schooner MySQL Enterprise 2.5 software is able to effectively utilize all of the flash technologies to get a high-performance balanced system. With Schooner MySQL 2.5, Fusion-io has marginally higher throughput than the Intel X25Es, while OCZ/SandForce outperforms Fusion-io (due to the low OCZ/SandForce consumption of server processor and DRAM resources). In addition to improved performance characteristics, the second-generation technologies such as OCZ/SandForce provide a large increase in cost-effective capacity.
The figure below shows the connection scalability of the tightly-coupled Schooner software stack relative to legacy MySQL software alternatives. The Schooner Appliance supports increasing connections without loss of throughput to over 20,000 connections, whereas the performance of stock MySQL software on identical hardware drops off rapidly as the number of connections increases.
4. Total Cost of Ownership (TCO)
Our TCO analysis normalizes to a 2M TPM solution with an 8 TB total data size. It includes the cost of systems, racks, networking, and power. Our TCO model calculates the number of 2U servers needed to meet both required storage capacity and required throughput for each server configuration. The 3 year TCO is then computed based on a hosted datacenter model with fixed server loading per rack and monthly cost quantized per rack. Initial capital expense is totaled for server acquisition and per-rack installation. Monthly operating expense is then added in for a total of 36 months including hosting charges for rack, power, and pipe plus monthly maintenance. The results for first-generation flash technologies are shown, normalized to the Schooner 3 year TCO.

As indicated in the chart, creating a balanced system provides great benefits in both performance and total cost of ownership. In addition to the 3x to 11x performance benefit, the Schooner tightly-coupled database server building block provides over a 50% improvement in TCO over any roll-your-own alternative.
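The structure of the TCO model described above can be sketched as follows. Only the shape of the calculation comes from the text (servers sized by both capacity and throughput, per-rack quantization, 36 months of operating expense); every price, rack density, and per-server throughput figure in the example call is a hypothetical placeholder, not a number from our analysis.

```python
import math


def servers_needed(data_gb, tpm_required, server_capacity_gb, server_tpm):
    """Servers required to satisfy both the capacity and the throughput target."""
    by_capacity = math.ceil(data_gb / server_capacity_gb)
    by_throughput = math.ceil(tpm_required / server_tpm)
    return max(by_capacity, by_throughput)


def three_year_tco(servers, servers_per_rack, server_cost, rack_install_cost,
                   rack_monthly_hosting, server_monthly_maintenance, months=36):
    """Capital expense plus monthly operating expense, quantized per rack."""
    racks = math.ceil(servers / servers_per_rack)
    capex = servers * server_cost + racks * rack_install_cost
    opex = months * (racks * rack_monthly_hosting
                     + servers * server_monthly_maintenance)
    return capex + opex


# Hypothetical inputs for an 8 TB, 2M TPM deployment.
n = servers_needed(data_gb=8_000, tpm_required=2_000_000,
                   server_capacity_gb=500, server_tpm=100_000)
total = three_year_tco(servers=n, servers_per_rack=16, server_cost=20_000,
                       rack_install_cost=1_000, rack_monthly_hosting=2_000,
                       server_monthly_maintenance=100)
print(n, total)
```

Because the server count is the larger of the capacity-driven and throughput-driven counts, a configuration that is strong on only one dimension still pays for servers it cannot fully use, which is the cost of an unbalanced system.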
III. Effectively Exploiting Multi-Core and Flash with Databases: System Architecture, Tight Coupling and Software Optimization Required
As indicated clearly from the studies above, when assembling the best-of-class multi-core processors and flash memory to match the workload, effective utilization of the CPUs, DRAM and flash to achieve a balanced system requires extensive software optimization.
Based on extensive research and experimentation, Schooner has developed the Schooner 2.5 MySQL Enterprise for InnoDB Appliance, which optimizes high-performance multi-core processors, flash memory, DRAM, and low-latency interconnects in a highly-parallel, low-overhead manner to balance system resources and maximize system throughput. The Schooner software is designed with high thread-level parallelism, granular concurrency control, optimized thread and data affinity management, optimized memory hierarchy management, efficient and parallel flash I/O initiation and completion, recovery log/checkpoint management, and scan resistance for workload variation resiliency. Schooner Appliances make optimal use of the CPU cores, flash memory, and low-latency interconnects to optimize application-level transactions/second, transactions/$ and transactions/watt while minimizing total cost of ownership. They also maximize service availability through improved fault tolerance, RAID on all flash and hard drives, high-speed replication and recovery, and high-speed backup and restore. Additionally, they provide rich administrative services for easy deployment, management, monitoring, and scaling.
Schooner Appliances deliver a very cost-effective solution with the currently available processor cores and flash. The next generation of multi-core and flash technologies enables further advances in performance, TCO and availability. We will be reporting on these in upcoming blogs.
Schooner solves big problems
Expensive DRAM,
and underused multi-cores?
Schooner server solves!
Enterprise Flash Technology in Perspective: Beyond IOPS Hype and Misrepresentations
Dr. John R Busch, CTO and Founder
Before we dive into our Schooner Labs Enterprise Flash Technology Evaluation reports, let’s get some perspective. There is a lot of hype and misrepresentation in the industry around flash. We think it should stop. It adds no value to users.
IOPS Wars and PC Processor-Frequency Wars: Lessons in Local Optimization with Diminishing Returns…
Today’s flash hype is akin to the days when PC micro-processor manufacturers focused on and advertised processor clock frequency, even though higher clock frequencies provided no significant benefit to most real applications or to the user experience. The same has been true of the hype around IOPS in the flash industry. Flash manufacturers have been chasing the max IOPS crown (with LSI in the lead at one million IOPS in a single dual-socket commodity server), even though the IOPS provided by these products greatly exceed the need of most applications in a balanced system configuration.
This is a classic story of local optimization with diminishing returns. We need to shift away from this thinking. We need to shift to understanding the applications and the datasets for which flash technology is a fit. We need to focus on the design of balanced system solutions for them.
Exploiting Parallelism and Cost Matters! We Learned This with Multi-core…
There have also been blatant mis-compares of flash component technologies that ignore relative costs and parallelism. For example, several microbenchmark results are reported comparing a $15,000 Fusion-io card directly with a single $750 Intel X25e SSD. But these benchmarks do not factor in the ability to utilize commodity SSDs in parallel. Do so with a properly-designed balanced system and you can achieve the IOPS required by many applications and data sets at a small fraction of the cost.
This is reminiscent of the misunderstandings when we created multi-core processors. Multi-core processors are optimized to exploit thread-level parallelism rather than optimizing for a single thread of instructions through instruction-level parallelism. The complex software and hardware that optimize single-thread instruction streams in expensive, non-commodity processors continue to be justified in some very high-performance computing (HPC) applications. But the mainstream of computing has shifted to computer systems based on multi-core processors, exploiting the thread-level parallelism possible in properly-tuned modern software. The effective cumulative real instruction throughput and cost benefits of easily replicating simple cores on a single die, together with the benefits of industry commoditization, greatly outweigh any benefits from optimizing for single-thread instruction-level parallelism through complex applications, compilers, and a multi-issue, out-of-order pipeline processor.
This is also the case with the evolution of enterprise flash technologies. There’s a clear shift away from proprietary discrete FPGA-based controllers requiring complex server-based driver software to simple, parallel, commodity self-managing ASIC-based flash building blocks.
Sound Basis for Analysis: an Open Benchmarking Manifesto
Let’s get the metrics right. Let’s understand which applications and workloads are a good fit for flash. Let’s identify appropriate benchmarks and normalized configurations. Let’s define appropriate experimental designs. Let’s understand enterprise flash technology trends. Let’s evaluate first- and new-generation products accordingly. We hope our upcoming Schooner Labs Enterprise Flash Technology Evaluation reports on enterprise flash will help move the industry in this direction.
Why flash memory for MySQL?
Hard drives spin slowly.
Users wait, unhappy, sad.
MySQL needs flash!
Flash Memory Architecture Alternatives
Dr. John R Busch, CTO and Founder
Flash Chips: NOR vs NAND
Flash memory chips are constructed from different types of cells (NOR and NAND), and with different numbers of cells per memory location (single-level cell or SLC; and multi-level cell or MLC). These variations result in very different performance, cost, and reliability characteristics. NOR flash memory chips have much lower density, much lower bandwidth, much longer write and erase latencies, and much higher cost than NAND flash memory chips. For these reasons, NOR flash has minimal penetration in enterprise deployments; it is primarily used in consumer devices. Leading enterprise solid state drives (SSDs) are all designed with NAND flash.
Flash Chips: SLC vs MLC
Another distinction in flash memory is SLC versus MLC. MLC increases density by storing more than a single bit per memory cell. With their increased density, MLC flash chips cost roughly half as much as SLC, but MLC write bandwidth is about half that of SLC, and MLC supports from 3 to 30 times fewer erase cycles than SLC. A new generation of SSDs incorporates special firmware that narrows the performance and durability gap between SLC and MLC.
Flash Form Factor and Physical Interface: PCIe vs SSD, SATA/SAS
Flash memory can be installed into a server as PCIe flash or as SSDs. With PCIe flash, the controller and flash chips are placed onto standard form factor PCIe cards that are plugged directly into the server’s PCIe slots. With SSDs, flash chips and controllers are placed into 2.5” or 1.8” cartridges which are installed into server hard disk drive slots and which interface through SATA or SAS controllers that are plugged into the server’s PCIe slots.
Because of the direct connection to PCIe, a single PCIe flash card has a lower latency and higher bandwidth than a single SSD (which is connected to PCIe through a controller card). However, in most workloads the latency difference is not significant, and any desired level of flash bandwidth can be achieved by using multiple SSDs. PCIe flash cards are significantly more expensive than SSDs on a $/GB basis.
In a typical 2U server, many more SSDs can be operated in parallel than PCIe cards. As a result a much higher total degree of flash parallelism, bandwidth, and capacity can be achieved through the use of parallel SSDs instead of PCIe flash memory subsystems. The SSD flash memory configuration can be adjusted to match workload capacity, bandwidth, and latency requirements with optimized controller/SSD configurations.
An SSD-based flash subsystem is also easier to maintain. It is much easier to replace an SSD than a PCIe flash memory card—somewhat similar to the difference between installing a memory stick into a USB port on a typical personal computer (PC) versus opening up a PC to install a graphics card.
An SSD flash subsystem allows hot swapping, resulting in lower downtime and higher serviceability than possible with a PCIe flash subsystem, since the latter requires a system to be taken out of service to add or replace a flash card.
Flash Space Management: Server-Based vs Device-Based
The flash memory management functions of write coalescing, space management, logical-to-physical mapping, wear leveling, and garbage collection require significant on-going computation and data movement. The first generation of enterprise-class flash technology was based on discrete logic, using FPGAs (field programmable gate arrays) for control. FPGAs have limited computational capability, so the flash interface to software was very low level. This forced the write coalescing, garbage collection, logical-to-physical mapping, and wear leveling all to be performed by special driver software executing in the server.
The new generation of enterprise flash contains advanced ASICs which provide the flash management functions very efficiently on the flash cards themselves, exploiting internal flash buses and device characteristics. This frees the server’s processor cores and DRAM for application use. This is very significant. For example, the first generation PCIe FPGA-based flash memory cards we evaluated perform these functions using the server’s resources. In our system-level benchmarking we measured that 25% of the server’s processor cores and 10 GB of the server DRAM were consumed for flash management overhead.
Selecting Flash Technology
The analysis of an appropriate flash subsystem configuration comes down to creating a balanced system for the target workload with required system uptime at the best price/performance. This requires workload characterization, system measurement, and TCO and availability modeling. Our Schooner Labs evaluation of flash technologies analyzes in this context.
An introductory haiku from Schooner CEO, Jerry Rudisin
Expensive. Too slow.
Not Schooner! Fusion-io:
Second best, wrong choice.
Enterprise Flash Memory: Overview
Dr. John R Busch, CTO and Founder
Why Use Flash in Scale-Out Datacenters?
Flash memory is a high-performance, non-volatile computer memory that can be electrically erased and reprogrammed. Enterprise-class flash memory in the form of commodity solid state disks (SSDs) and PCI-e cards is available today with a broad range of architectures, performance and reliability characteristics, capacities, and price points. In this series of blog entries we provide our analysis of the state and trends of enterprise-class commodity flash memory as it relates to deployment in scale-out datacenters.
The latency, bandwidth, capacity, cost, and persistence benefits of flash memory are compelling.
Flash memory offers access times that are 100x faster than those of hard disk drives (HDDs), and it requires much less space and power than HDDs. It consumes only 1/100th the power of DRAM, and can be packed much more densely—providing much higher capacities than DRAM. And flash memory is far less expensive than DRAM to both purchase and operate. Flash memory is persistent when written, whereas DRAM loses its content when its power is turned off. Flash memory can be organized into modules of different capacities, form factors, and physical and programmatic interfaces.
Challenges in Exploiting Flash Memory
Flash memory has many promising characteristics—but also many idiosyncrasies. Effectively incorporating flash memory into system architectures requires thoughtful design and optimization—starting at the application layer, and extending throughout the operating environment and down to the physical machine organization.
The algorithms and mechanisms of current operating systems and applications were not designed to exploit the very high access rates of flash memory. These algorithms and mechanisms generally lack the high degree of parallelism and granular concurrency control necessary to effectively exploit the very high IOPS of flash memory.
Flash memory access times are roughly 1000x slower than those of DRAM, so effectively utilizing flash memory requires a memory hierarchy with intelligent, specialized DRAM caching of flash content.
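To make the caching point concrete, here is a minimal sketch of a DRAM cache sitting in front of flash. It uses a plain LRU eviction policy purely as an illustration; the class name, the `read_from_flash` callback, and the relative timings (DRAM = 1, flash = 1000, taken from the 1000x figure above) are assumptions for this example, not a description of any particular product.

```python
from collections import OrderedDict


class DramCache:
    """Toy LRU cache of flash pages held in DRAM (illustrative only)."""

    def __init__(self, capacity_pages, read_from_flash):
        self.capacity = capacity_pages
        self.read_from_flash = read_from_flash  # hypothetical slow-path callback
        self.pages = OrderedDict()              # page id -> data, oldest first
        self.hits = 0
        self.misses = 0

    def read(self, page_id):
        if page_id in self.pages:
            # Hit: serve from DRAM and mark the page as most recently used.
            self.pages.move_to_end(page_id)
            self.hits += 1
            return self.pages[page_id]
        # Miss: fetch from flash, then cache the page in DRAM.
        self.misses += 1
        data = self.read_from_flash(page_id)
        self.pages[page_id] = data
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)      # evict the least recently used page
        return data


def effective_access_time(hit_rate, dram_time=1.0, flash_time=1000.0):
    """Average access time in DRAM-access units for a given cache hit rate."""
    return hit_rate * dram_time + (1.0 - hit_rate) * flash_time


cache = DramCache(capacity_pages=2, read_from_flash=lambda page_id: f"page-{page_id}")
for page_id in [1, 2, 1, 3, 2]:
    cache.read(page_id)
print(cache.hits, cache.misses)
print(effective_access_time(0.9))
```

Even in this toy model, the average access time is dominated by the miss path unless the hit rate is very high, which is why the caching policy matters so much.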
Typical software was also not designed to deal with the idiosyncrasies of flash memory. Flash memory chips have write-access behavior that is very different from that of DRAM. At the chip level, flash memory can only be written in pages (~4 KB), and before any page can be written, the entire block of pages that contains it (128 KB) must be erased, which is very slow (~1.5 milliseconds). As a result, smaller logical writes need to be buffered and combined into larger physical write blocks before writing (this is called write coalescing). Deleting or overwriting small logical writes leaves holes (internal fragmentation) in the physical write blocks. The remaining live data must be consolidated in the background to free up space, with the locations of the stored logical writes remapped accordingly (garbage collection). Garbage collection needs to be aggressive, since new data cannot be written into flash until erase blocks have been freed up and pre-erased.
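The interplay of write coalescing, logical-to-physical mapping, and garbage collection described above can be sketched in a few lines. This is a deliberately simplified toy model, not how any real flash controller or driver is implemented: the class and method names are hypothetical, a block holds only 4 pages, and a whole block is filled in one flush.

```python
PAGES_PER_BLOCK = 4  # toy value; real blocks hold many more pages


class ToyFlashLayer:
    """Illustrative flash translation layer (all names are hypothetical)."""

    def __init__(self, num_blocks):
        # None marks an erased (writable) page slot.
        self.blocks = [[None] * PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))  # pre-erased blocks
        self.mapping = {}  # logical id -> (block, slot) of the live copy
        self.buffer = {}   # DRAM write buffer: logical id -> data
        self.erases = 0

    def write(self, logical_id, data):
        # Write coalescing: small writes collect in DRAM; a repeated write
        # to the same logical id simply replaces the buffered copy.
        self.buffer[logical_id] = data
        if len(self.buffer) >= PAGES_PER_BLOCK:
            self.flush()

    def read(self, logical_id):
        if logical_id in self.buffer:
            return self.buffer[logical_id]
        block, slot = self.mapping[logical_id]
        return self.blocks[block][slot][1]

    def live_pages(self, block):
        # A stored page is live only if the mapping still points at it.
        live = []
        for slot, entry in enumerate(self.blocks[block]):
            if entry is not None and self.mapping.get(entry[0]) == (block, slot):
                live.append(entry)
        return live

    def collect_garbage(self):
        # Pick the used block with the fewest live pages, move its live
        # pages back into the buffer, then erase it.
        used = [b for b in range(len(self.blocks)) if b not in self.free_blocks]
        victim = min(used, key=lambda b: len(self.live_pages(b)))
        for logical_id, data in self.live_pages(victim):
            self.buffer.setdefault(logical_id, data)  # keep newer buffered data
            del self.mapping[logical_id]
        self.blocks[victim] = [None] * PAGES_PER_BLOCK
        self.erases += 1
        self.free_blocks.append(victim)

    def flush(self):
        if not self.buffer:
            return
        if not self.free_blocks:
            self.collect_garbage()
        block = self.free_blocks.pop(0)
        batch = list(self.buffer.items())[:PAGES_PER_BLOCK]
        for slot, (logical_id, data) in enumerate(batch):
            self.blocks[block][slot] = (logical_id, data)
            self.mapping[logical_id] = (block, slot)  # remap to the new location
            del self.buffer[logical_id]
```

In this model, rewriting the same logical pages leaves the old block full of stale data, so the garbage collector can reclaim it with a single erase and no copying. Blocks that still hold live data cost extra reads and writes to reclaim, which is one reason this work is expensive when it runs on the server.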
Flash memory has limits on how many times a block can be erased (~100k times for SLC, ~3k times for MLC). As a result, block writes need to be spread uniformly across the total flash memory subsystem to maximize the effective lifetime (this is called wear leveling).
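A small simulation shows why spreading erases matters. The sketch below is a toy model under strong assumptions: every block is always available for reuse, and the "wear-leveled" policy simply picks the block with the lowest erase count. The function names are illustrative, and real wear-leveling algorithms are considerably more involved.

```python
def pick_block(erase_counts, candidate_blocks):
    """Return the candidate block that has been erased the fewest times."""
    return min(candidate_blocks, key=lambda b: erase_counts[b])


def simulate(num_blocks, num_erases, wear_leveled):
    """Return per-block erase counts after num_erases erase/write cycles."""
    erase_counts = [0] * num_blocks
    for _ in range(num_erases):
        if wear_leveled:
            block = pick_block(erase_counts, range(num_blocks))
        else:
            block = 0  # naive policy: keep reusing the same block
        erase_counts[block] += 1
    return erase_counts


naive = simulate(num_blocks=4, num_erases=100, wear_leveled=False)
leveled = simulate(num_blocks=4, num_erases=100, wear_leveled=True)
print(max(naive), max(leveled))
```

The device wears out when its most-erased block reaches the endurance limit, so in this toy run the wear-leveled policy sustains four times as many total erases before any single block reaches a given count.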
Flash memory stores persistent data, so fault tolerance is required to ensure the accuracy and availability of the data once it is written. Strong error detection and correction codes are required to recover from flash cell and flash chip failures. Replication at the system level and/or at the SSD/flash card level is required to handle SSD or PCI-e flash card failures.
There are numerous architectural and product alternatives for enterprise flash memory. The selection of an appropriate flash technology and flash subsystem configuration comes down to creating a balanced system for the target workload that achieves the required system uptime at the best price/performance. This analysis requires workload characterization, system measurement, and TCO and availability modeling.
Our Schooner Labs evaluation of flash technologies analyzes alternative flash architectures, alternative flash products, and system and application level performance, availability and TCO.
Datacenter Challenges in Selecting Hardware, Selecting Software, Creating Balanced Systems
Dr. John R Busch, CTO and Founder
Datacenter quality-of-service and total cost of ownership have a major impact on the success of any enterprise. Solution architects and datacenter managers seek architectures and deployments that provide excellent performance scalability and high service availability to effectively meet rising service demand, while controlling capital and operating expenses.
Tremendous technology advances have been made in recent years. But it is a major engineering effort to define and develop highly-effective hardware and software architectures and deployment technologies and implementations that can meet datacenter performance, availability and cost objectives.
In order to deliver cost-effective, high-performance, highly-available solutions, we need to carefully integrate technologies to create balanced networked system configurations for targeted workloads.
When configuring servers for a workload, we must ask:
- How much DRAM, how many processor cores, what storage (flash, hard disk drives)?
- Are the software and hardware effectively integrated?
- What caching levels are required to exploit DRAM and flash and disk access time and bandwidth variations?
- Is there sufficient thread-level parallelism and concurrency control to exploit more processor cores and flash IOPS?
Service availability is as important as performance scalability.
- To what extent should failure tolerance be handled by replication at the component level (e.g. RAID of hard disks and flash drives, redundant fans and power supplies, etc.), between systems with low-level replication and recovery mechanisms, or at the distributed application level?
How much networking capacity is required between clients and servers and among servers?
Where are the performance bottlenecks? How can the performance scalability be improved? What is the resulting service availability, and how can it be improved and at what cost? What is the total cost of ownership, and how is it affected by technology choices we are considering?
Technology selection is complicated by confusion in the market around the relative benefits of different technologies.
- When combined into a standard server with real workloads, what is the application-level performance difference, the system cost difference, and the actual resulting price/performance of using $15,000 PCI-e based flash cards as opposed to an array of $750 flash SSDs?
- How much application-level performance improvement will be achieved by going to servers with 6- and 8-core processors vs 4-core processors, and what is the resulting price/performance?
- To what extent will 10/40 Gb Ethernet or InfiniBand technology improve data center quality-of-service, and at what cost?
- Which applications are best suited to a SQL or NoSQL data-access architecture?
- For which application scenarios can Cassandra, Couch, MongoDB, etc. effectively replace databases and what is the resulting scalability, performance, availability, and cost?
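The first question in the list above reduces to simple arithmetic once application-level throughput has been measured. The sketch below shows the shape of that calculation only. The flash prices come from the text; the base server cost, the number of SSDs in the array, and the throughput figures are placeholders chosen for illustration, not measurements.

```python
def price_performance(system_cost_usd, transactions_per_sec):
    """Dollars per unit of application-level throughput (lower is better)."""
    return system_cost_usd / transactions_per_sec


BASE_SERVER_USD = 10_000   # hypothetical cost of the server without flash
SSD_COUNT = 8              # hypothetical number of SSDs in the array

pcie_system_cost = BASE_SERVER_USD + 15_000           # one $15,000 PCI-e card
ssd_system_cost = BASE_SERVER_USD + SSD_COUNT * 750   # array of $750 SSDs

# Placeholder throughputs: substitute values measured with the real workload.
pcie_tps = 20_000
ssd_tps = 20_000

print(price_performance(pcie_system_cost, pcie_tps))
print(price_performance(ssd_system_cost, ssd_tps))
```

The point of writing it this way is that the comparison is made on whole-system cost and application-level throughput, not on device-level IOPS or device price alone.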
Which experiments, measurement and analysis tools, and studies can help determine the best technology choices, balanced system configurations, and the optimal software, high availability, and deployment architectures for your workloads?
Our Schooner research focuses on answering these questions.
Schooner Labs: Research Findings
Sharing our Industry Observations and our Research Results
Dr. John R Busch, CTO and Founder
Our research and engineering teams at Schooner have studied datacenter challenges and opportunities in depth, and have done extensive modeling and measurement of emerging technologies. In this blog we will share our observations on technology trends along with our research, analysis, and measurement results.
Some of the topics we will be discussing in this blog in coming months include:
- Datacenter Solutions: Hardware, Software, and Balanced Systems
- Analysis of Advanced Technologies: Riding the Innovation Wave
- Effectively Exploiting Multi-core and Flash Technologies: Software Optimization Required
- Achieving High Availability and Performance without Sacrificing Consistency
- Factors in the Total Cost of Ownership and the Models
- Industry Trends:
  - Combining Vertical and Horizontal Scaling
  - Using High-Level Generic Building Blocks
  - SQL and/or NoSQL
  - Cloud Computing Evolution
We hope you find our blog useful, and look forward to your posts!





