Schooner Labs Evaluation and Optimization of Database Technologies: Flash Memory + Multi-Core Processors + Software
Dr. John R Busch, CTO and Founder
At Schooner Labs we continually analyze system architectures and evaluate technologies as we create our tightly-integrated appliances. Our goal is to optimize the overall system, including both software and hardware, to maximize performance and availability while minimizing the total cost of ownership. Our experimental methodology includes micro-benchmarks and system-level application benchmarks run in real-life situations. We analyze the resulting relative performance, capacity, cost, availability, lifetime, power, and total cost of ownership to drive our architectural and engineering optimization decisions.
In this blog entry we discuss optimization for MySQL to fully exploit the power of flash memory and multi-core processors. In upcoming blogs we will report on our research in emerging technologies and other application segments, including processors with more cores, distributed caching, NoSQL, and high-performance synchronous replication.
I. Balanced Database Systems Exploiting Flash Memory, Multi-core Processors and Optimized Software
Multi-core processors and flash memory are an excellent fit for databases, including MySQL. Databases can inherently exploit extensive multi-threading since they are designed to process thousands of concurrent connections and transactions, making multi-core processors an ideal technology fit. At the same time, the performance of database workloads has been severely limited by hard-drive performance (both IOPS and latency), making flash memory, with random IOP rates ~100x those of hard drives, an ideal fit as well. However, effectively utilizing multi-core processors and flash memory to create a balanced database system solution with optimal performance, cost, and availability is a challenge.
Even when the best-of-class multi-core processors and flash memory are assembled into a server, the utilization of the CPUs and flash is typically very low: the system is not balanced. The software needs sufficient parallelism and concurrency control to exploit multi-core and flash parallelism and the memory hierarchy management needs to be optimized as well. The flash components need to be selected to provide the proper balance of read IOPS, write IOPS, and storage capacity, and the overall system needs to be optimized for performance, cost, and availability.
II. Schooner Analysis and Optimization Methodology of Balanced Database Systems
The standard 2U server with dual socket multi-core processors is the natural building block for the database tier of a typical scale-out datacenter since this is the design point that has been technology- and cost-optimized by the dominant computer system and microprocessor vendors. In order to evaluate and optimize balanced database solutions exploiting multi-core processor, flash memory, and software technologies, we select a standard 2U server, and then optimize the storage hierarchy and software to create a balanced, scalable database server building block.
1. Configuration Description
In terms of flash technologies, there are numerous SATA, SAS and PCIe flash alternatives on the market today, and a new generation of products that have been announced and that are beginning to enter the market. We report here on our comparative measurements and analyses of both early and next-generation PCIe and SSD enterprise flash products. Our specific studies reported in this blog include: first-generation PCIe flash storage from Fusion-io (ioDrive Duo); first-generation SSD flash storage from Intel (Intel X25); and next-generation enterprise flash storage SSD from OCZ/Sandforce (SF1500 flash processor in OCZ Deneva SSD).
In the analysis reported here, we normalized across technologies by using a standard IBM 3650 M2 2U server with dual quad-core 3GHz Nehalem CPUs and 64GB of DRAM. SATA SSDs are connected through multiple LSI PCIe SATA controllers; PCIe flash adapters are inserted into x8 PCIe riser slots. To compare the results on a consistent basis, they were normalized to this 2U server configuration, which has 8 available drive bays and 2 available PCIe slots. Our test results indicate near-perfect linear scalability up to 8 SATA drives using the PCIe SATA adapters. The 2U configurations consist of 8 SSDs in the case of Intel and OCZ, or 2 Fusion-io Duo cards occupying the 2 PCIe slots.
2. Micro-benchmark Level Analysis
2.a Benchmark Description: iozone (https://www.iozone.org/)
We use the results of micro-benchmarking to validate datasheet performance claims in a comparable and realistic context, to measure the impact of configuration and settings, and to gain an understanding of performance, cost and availability at the component level as it relates to system-level performance, cost and availability impacts.
Each micro-benchmark cycle starts by sequentially writing two passes across the entire drive. Then all locations are read randomly, followed by randomly writing all locations on the drive. This benchmark scenario is required to get a drive to a steady-state condition. Until a flash drive has been completely written, there is extra spare capacity on the drive, easing the burden on the drive’s garbage collection system. Without this write pre-conditioning, the drive’s write performance is not representative of what it delivers in real-life usage. With some of the drives tested, the true steady-state performance is far less than the data sheet performance claims indicate.
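The preconditioning cycle described above can be expressed as a job file for a tool such as fio. This is our own translation for illustration (the measurements in this post were made with iozone), and `/dev/sdX` is a placeholder for the device under test:

```ini
; Hypothetical fio job file approximating the preconditioning cycle:
; two full sequential write passes, then random reads, then random
; writes across the whole device. WARNING: destructive to /dev/sdX.
[global]
filename=/dev/sdX
direct=1
ioengine=libaio
iodepth=32
bs=4k

[precondition]
rw=write
bs=128k
loops=2                  ; two sequential passes over the entire drive

[random-read]
stonewall                ; wait for the previous phase to complete
rw=randread

[random-write]
stonewall
rw=randwrite
```

The `stonewall` directives keep the phases strictly ordered, so the random-write phase runs against a fully written drive, as the steady-state methodology requires.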
Throughput data is collected continuously during micro-benchmark execution so that the time-varying impact of garbage collection can be assessed. Each test result generates a graph of the throughput over time, through the precondition, read, and random-write phases of the test. Real applications depend on sustainable throughput during peak loads. Thus we report the minimum one-minute average observed during a test as the comparison throughput for each measured flash technology.
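The "minimum one-minute average" statistic above is straightforward to compute from a per-second throughput trace; a minimal sketch (the sample trace is illustrative, not actual measurement data):

```python
# Sketch: derive the reported "minimum one-minute average" from a
# per-second throughput trace. The trace here is invented to show a
# garbage-collection dip dominating the reported figure.

def min_one_minute_average(samples_per_sec, window=60):
    """Return the lowest average over any contiguous `window` samples."""
    if len(samples_per_sec) < window:
        raise ValueError("trace shorter than the averaging window")
    # Rolling sum over the window; track the minimum as it slides.
    cur = sum(samples_per_sec[:window])
    lowest = cur
    for i in range(window, len(samples_per_sec)):
        cur += samples_per_sec[i] - samples_per_sec[i - window]
        lowest = min(lowest, cur)
    return lowest / window

# A trace that dips during a garbage-collection episode:
trace = [100_000] * 120 + [40_000] * 60 + [100_000] * 120
print(min_one_minute_average(trace))  # the GC dip dominates: 40000.0
```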
2.b Technology Evaluation: First-Generation PCIe Flash: Fusion-io ioDrive Duo SLC
Fusion-io sells PCIe-based flash cards. Fusion-io’s ioDrive Duo product is described at https://www.fusionio.com/products/iodriveduo/. It provides 160 GB (RAID10) or 320 GB (non-RAID) of SLC flash per card, with a street price of approximately $13,179 (for example, at http://accessories.us.dell.com/sna/productdetail.aspx?sku=A3131275&cs=04&c=us&l=en&dgc=SS&cid=27722&lid=628335).
The Fusion-io architecture is based on simple discrete hardware coupled with server-side software. It uses the host server’s processor cores and DRAM to manage the flash card’s logical-to-physical mapping, write coalescing, garbage collection, and wear leveling. The Fusion-io card uses an FPGA to control the interface to the flash chips and relies on software in the host to perform the allocation, garbage collection, and high-level program/erase algorithms. As a consequence, the peak IOPS throughput is very high relative to the small storage capacity, but there is a substantial burden on the host CPU and host DRAM, as reported in the application system level benchmark results below. Additionally, the measured micro- and application system-level benchmark results indicate that Fusion-io garbage collection algorithms operate with great variability, which has a substantial and unpredictable impact on the sustainable write throughput.
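The flash-translation work described above (logical-to-physical mapping, out-of-place writes, garbage collection) can be illustrated with a deliberately tiny toy model. This is our own grossly simplified sketch, not Fusion-io's or Intel's actual algorithm; real flash translation layers also handle wear leveling, block-granular erases, and power-fail safety:

```python
# Toy flash-translation-layer (FTL) sketch. With Fusion-io this kind
# of bookkeeping runs on the host CPU and DRAM; with the Intel and OCZ
# SSDs it runs on the drive's own controller.

class ToyFTL:
    def __init__(self, n_pages):
        self.free = list(range(n_pages))   # physical pages not in use
        self.l2p = {}                      # logical page -> physical page
        self.stale = set()                 # old copies awaiting reclamation

    def write(self, lpage):
        # Flash cannot overwrite in place: every write goes to a fresh
        # physical page, and the previous copy becomes garbage.
        if not self.free:
            self.garbage_collect()
        ppage = self.free.pop()
        if lpage in self.l2p:
            self.stale.add(self.l2p[lpage])
        self.l2p[lpage] = ppage
        return ppage

    def garbage_collect(self):
        # Reclaim stale pages. A real FTL erases whole blocks and must
        # relocate still-live pages first -- the source of the write
        # stalls and throughput variability discussed in this post.
        self.free.extend(self.stale)
        self.stale.clear()

ftl = ToyFTL(n_pages=4)
ftl.write(0); ftl.write(0); ftl.write(0)
print(len(ftl.stale))   # two older copies of logical page 0 are now garbage
```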
Some notes about the Fusion-io configuration and results: we de-configured the available storage on each Fusion-io card from 320GB to 256GB in order to minimize the impact of the Fusion-io periodic garbage-collection algorithm on steady-state write performance; this is recommended by Fusion-io for write-intensive workloads. Our initial tests with the ioDrive at the default storage size (320GB per card) resulted in highly variable performance and relatively low write throughput. The results shown are for the de-configured storage size, but still show a 2:1 performance variability. The normalized throughput is the minimum sustainable during the several-minute garbage-collection interval.
| Normalized to 2U | ioDrive Duo |
| Capacity GB | 440 |
| 4KB Read IOPS | 324,000 |
| 4KB Write IOPS | 220,000 |
| CPU time per I/O | 19 uS |
| System $/GB | $60 |
| Storage subsystem cost | $26,360 |
2.c Technology Evaluation: First-Generation SSD Flash: Intel X25E
The Intel X25E is a 64GB SLC SATA 2.5” SSD incorporating an intelligent on-drive ASIC which manages all flash buffering, write coalescing, garbage collection, etc. on the SSD itself. The product carries a street price of $679 (https://www.newegg.com/Product/Product.aspx?Item=N82E16820167014&nm_mc=OTC-Froogle&cm_mmc=OTC-Froogle-_-Solid+State+Disk-_-Intel-_-20167014).
The Intel X25E SATA SSD is a first-generation enterprise flash technology in a 2.5” SATA form factor. It is the first affordable NAND SSD to incorporate an efficient on-board program/erase algorithm. The X25E uses an embedded microprocessor in the drive to perform the program/erase, allocation, and garbage collection algorithms. The capacity per drive, at 32GB or 64GB, is smaller than that of the Fusion-io card. As a consequence, the peak throughput and sustainable write throughput of a single X25E are less than those of a single Fusion-io card. However, the X25E’s relatively low cost allows the efficient use of several drives and controllers in parallel to cost-effectively scale IOPS and bandwidth.
| Scaled to 2U | Intel X25E |
| Capacity GB | 488 |
| 4KB Read IOPS | 288,000 |
| 4KB Write IOPS | 29,600 |
| CPU time per I/O | 7 uS |
| System $/GB | $14 |
| Storage subsystem cost | $6,792 |
2.d Technology Evaluation: Next-Generation SSD Flash: SandForce SF-1500 and OCZ Deneva SSD
There has been a significant maturation in the flash industry in the last two years, with a value-chain shift to optimized flash processors which are incorporated into a broad range of PCIe and SSD products by numerous suppliers. This next generation of flash technologies provides significant improvements in flash performance, endurance, reliability, cost, and capacity. These flash processors have also been optimized to provide significant improvements in the performance and durability of MLC-based flash, especially when exploiting eMLC for both excellent performance and wear characteristics.
We evaluated the SandForce SF-1500 processor as built into the OCZ Deneva 200GB SATA II SSD. The datasheet for the SandForce SF-1500 Enterprise SSD Processor is located at https://www.sandforce.com/index.php?id=21&parentId=2. The datasheet for the OCZ Deneva SSD is located at https://www.oczenterprise.com/details/ocz-deneva-reliability-2-5-slc-ssd.html.
| Scaled to 2U | OCZ Deneva |
| Capacity GB | 1456 |
| 4KB Read IOPS | 160,000 |
| 4KB Write IOPS | 88,000 |
| CPU time per I/O | 7 uS |
| System $/GB | $7 |
| Storage subsystem cost | $10,600 |
2.e Summary of Normalized 2U Micro-benchmark Results across Flash Component Technologies
To compare the results on a consistent basis, they were normalized to a 2U server configuration. There are 8 available drive bays and 2 available PCIe slots. Our test results indicate near-perfect linear scalability up to 8 SATA drives using PCIe SATA adapters. The 2U configurations consist of 8 SSDs in the case of Intel and OCZ, or 2 Fusion-io Duo cards occupying 2 PCIe slots.
| Scaled to 2U | Fusion-io Duo | Intel X25E | OCZ Deneva |
| Capacity GB | 440 | 488 | 1456 |
| 4KB Read IOPS | 324,000 | 288,000 | 160,000 |
| 4KB Write IOPS | 220,000 | 29,600 | 88,000 |
| CPU time per I/O | 19 uS | 7 uS | 7 uS |
| Subsystem $/GB | $60 | $14 | $7 |
| Storage Subsystem Cost | $26,360 | $6,792 | $10,600 |
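The $/GB rows in the table above follow directly from the quoted subsystem costs and capacities, and the per-device figures are implied by the 2U normalization (8 SSDs, or 2 Fusion-io Duo cards). A small cross-check sketch with the figures copied from the table (the per-device numbers are our derivation, not vendor specifications):

```python
# Cross-check of the summary table: $/GB from subsystem cost and
# capacity, and per-device IOPS implied by the 2U normalization.

configs = {
    # name: (capacity_gb, read_iops, write_iops, cost_usd, devices)
    "Fusion-io Duo": (440,  324_000, 220_000, 26_360, 2),
    "Intel X25E":    (488,  288_000,  29_600,  6_792, 8),
    "OCZ Deneva":    (1456, 160_000,  88_000, 10_600, 8),
}

for name, (gb, r, w, cost, n) in configs.items():
    print(f"{name}: ${cost / gb:.0f}/GB, "
          f"{r // n:,} read / {w // n:,} write IOPS per device")
```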
Observations:
- The Intel and OCZ flash technologies, which use SSD-resident flash processors rather than server-side driver-based mapping, garbage collection, and write coalescing, provided a 2.5x improvement in server CPU time per I/O over Fusion-io while incurring no server memory overhead (vs. many GB for Fusion-io). This leaves the host processor cores and memory available for useful application work.
- Both PCIe card-based flash and SATA SSD-based flash behind PCIe controllers use the PCIe bus effectively. In both configurations using 2 PCIe slots, the highest observed bandwidth was on the order of 1.6 gigabytes per second, so there is no inherent bandwidth benefit from direct PCIe attachment for flash.
- The cost/GB of the SATA SSD solutions is dramatically lower than the PCIe-based solutions.
- As we discuss in the next section, real-world system-level application benchmarks and customer workloads require a balance of read and write performance, along with large storage capacity, to achieve a balanced system with optimum resource utilization and TCO. The performance (IOPS and bandwidth) and capacity of parallel SATA SSDs can be scaled to meet the needs of a balanced system, and they also provide the critical availability feature of hot swap that is not available with PCIe-based solutions.
- Systems based on hard drives typically accommodate up to 12 HDDs and offer I/O throughput up to about 2,000 IOPS. Database applications such as MySQL are currently limited by this I/O performance and could typically run at over 10x the current throughput given the available CPU capacity of dual-socket servers. We show in the next section, presenting system-level benchmarks, that software optimization is critical for obtaining a balanced system that fully utilizes multi-core processors and parallel flash memory for database workloads. With optimized software, a good balance of IOPS capability per 2U, dual-socket server, allowing for a further 2x increase in CPU processing power, is on the order of 80,000 IOPS at about a 2:1 ratio of reads to writes. As can be seen from the summary table above, the parallel Intel drives are adequate for current CPUs, and the OCZ drive performance is well balanced for next-generation CPUs, whereas Fusion-io is overkill for database application workloads: its additional expense per gigabyte and its overhead in CPU time and DRAM are unjustified.
- The critical characteristics of flash technologies for creating balanced systems in terms of performance, capacity, and cost have been significantly improved with the next generation of flash processors and SSD products.
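The sizing argument in the observations above can be made concrete with a back-of-envelope sketch. The per-drive IOPS figures are our own derivation from the summary table (2U totals divided by 8 drives), and the calculation assumes the near-linear SATA scaling reported earlier:

```python
# Back-of-envelope sizing for the "balanced system" target discussed
# above: ~80,000 mixed IOPS per 2U server at a 2:1 read:write ratio.

import math

def drives_needed(target_iops, read_frac, drive_read_iops, drive_write_iops):
    """Smallest drive count whose mixed-workload throughput meets target."""
    reads = target_iops * read_frac
    writes = target_iops * (1 - read_frac)
    # Each drive must cover its share of both the read and write streams.
    return max(math.ceil(reads / drive_read_iops),
               math.ceil(writes / drive_write_iops))

# Per-drive figures implied by the summary table (2U totals / 8 drives):
print(drives_needed(80_000, 2/3, 36_000, 3_700))    # Intel X25E: 8 drives
print(drives_needed(160_000, 2/3, 20_000, 11_000))  # OCZ Deneva at a 2x target: 6
```

Consistent with the observation above: a full complement of 8 Intel drives is needed for today's 80,000-IOPS target, while 8 OCZ drives leave headroom even at a doubled, next-generation CPU target.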
3. Application Level System Database Benchmarks: DBT2 (TPC-C equivalent)
DBT2 is an open-source implementation of TPC-C, the industry-standard benchmark for on-line transactional databases. The benchmark was constructed by a consortium including IBM, Oracle, HP, Cisco, Microsoft, and others to be representative of real workloads running on real databases. All companies measure their systems with the standard benchmark according to its rules and publish their TpmC results (total TpmC, $/TpmC, watts/TpmC) at:
https://www.tpc.org/tpcc/results/tpcc_perf_results.asp
While TPC benchmarks certainly involve the measurement and evaluation of computer functions and operations, the TPC regards a transaction as it is commonly understood in the business world: a commercial exchange of goods, services, or money. A typical transaction, as defined by the TPC, would include updating a database system for such things as inventory control (goods), airline reservations (services), or banking (money). The TPC-C workload simulates the on-line entry and processing of customer orders. It is representative of write-intensive workloads with moderate transaction complexity and well-defined data consistency requirements.
3.a System-Level Database Performance Benchmark Results
The graph below shows DBT2 performance on the normalized 2U server with dual quad-core Nehalem processors and 64 GB DRAM with various storage subsystem and software alternatives. The RAID modes reflect the most common configuration for availability and performance in a 2U form factor. In all cases, the operating system level is CentOS version 5.4 with the Linux Kernel version 2.6.18. We compare the software versions MySQL 5.1.44, MySQL 5.5.4 Beta (GA in late 2010), and Schooner MySQL Enterprise 2.5 (available now and certified by Oracle as 100% compatible with MySQL Enterprise 5.1.44).
The first and second columns, labeled Hard Drives, show the performance of MySQL 5.1.44 and Schooner 2.5 optimized software with 12 hard disk drives (15K RPM) configured in RAID 5. The third, fourth, and fifth columns, labeled Intel X25E, show the performance of MySQL 5.1.44, MySQL 5.5.4 Beta, and Schooner MySQL 2.5 with 8 Intel X25E solid state drives (SSDs) configured in RAID 5. The sixth, seventh, and eighth columns, labeled Fusion-io, show the performance of MySQL 5.1.44, MySQL 5.5.4 Beta, and Schooner MySQL 2.5 with 2 Fusion-io ioDrive Duo 320s configured in RAID 10. The ninth column, labeled OCZ, shows Schooner MySQL 2.5 using 8 parallel 200GB SandForce/OCZ SSDs in RAID 5.
The performance results indicate clearly the critical role of system architecture and software optimization in creating optimal balanced systems. The highly optimized, tightly-coupled Schooner MySQL Enterprise 2.5 solution achieves:
- more than 6x the transaction throughput of currently available MySQL 5.1.44 with equivalent commodity Intel SSDs;
- more than 14x when compared to a MySQL 5.1.44 server using parallel optimized hard disk drives; and
- over 3x relative to Fusion-io flash storage with MySQL 5.1.44.
The currently available Schooner Appliance provides 4x the transaction throughput of the upcoming MySQL 5.5.4 Beta on identical hardware using commodity Intel X25E SSDs, and more than 2.5x the transaction throughput of the upcoming MySQL 5.5.4 Beta with Fusion-io flash storage.
The Schooner MySQL Enterprise 2.5 software is able to effectively utilize all of the flash technologies to achieve a high-performance balanced system. With Schooner MySQL 2.5, Fusion-io delivers marginally higher throughput than the Intel X25Es, while Schooner MySQL 2.5 with OCZ/SandForce outperforms Schooner MySQL 2.5 with Fusion-io (due to OCZ/SandForce's low consumption of server processor and DRAM resources). In addition to improved performance characteristics, second-generation technologies such as OCZ/SandForce provide a large increase in cost-effective capacity.
The figure below shows the connection scalability of the tightly-coupled Schooner software stack relative to legacy MySQL software alternatives. The Schooner Appliance supports increasing connections without loss of throughput to over 20,000 connections, whereas the performance of stock MySQL software on identical hardware drops off rapidly as the number of connections increases.
4. Total Cost of Ownership (TCO)
Our TCO analysis normalizes to a 2M TPM solution with an 8 TB total data size. It includes the cost of systems, racks, networking, and power. Our TCO model calculates the number of 2U servers needed to meet both required storage capacity and required throughput for each server configuration. The 3 year TCO is then computed based on a hosted datacenter model with fixed server loading per rack and monthly cost quantized per rack. Initial capital expense is totaled for server acquisition and per-rack installation. Monthly operating expense is then added in for a total of 36 months including hosting charges for rack, power, and pipe plus monthly maintenance. The results for first-generation flash technologies are shown, normalized to the Schooner 3 year TCO.
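The TCO model above can be sketched in a few lines. The structure (servers sized to meet both capacity and throughput, monthly cost quantized per rack, 36 months of operating expense) follows the description; all unit costs below are made-up placeholders, since the actual Schooner cost inputs are not given in this post:

```python
# Sketch of the 3-year TCO model described above. All dollar figures
# and per-server capabilities are hypothetical illustration values.

import math

def three_year_tco(req_tpm, req_capacity_gb,
                   tpm_per_server, gb_per_server,
                   server_cost, servers_per_rack,
                   rack_install_cost, rack_monthly_cost,
                   server_monthly_maint, months=36):
    # Servers must satisfy BOTH the throughput and the capacity need.
    servers = max(math.ceil(req_tpm / tpm_per_server),
                  math.ceil(req_capacity_gb / gb_per_server))
    # Monthly hosting cost is quantized per rack.
    racks = math.ceil(servers / servers_per_rack)
    capex = servers * server_cost + racks * rack_install_cost
    opex = months * (racks * rack_monthly_cost + servers * server_monthly_maint)
    return servers, racks, capex + opex

# Hypothetical inputs at the post's 2M TPM / 8 TB normalization point:
servers, racks, tco = three_year_tco(
    req_tpm=2_000_000, req_capacity_gb=8_000,
    tpm_per_server=100_000, gb_per_server=500,
    server_cost=20_000, servers_per_rack=16,
    rack_install_cost=5_000, rack_monthly_cost=2_000,
    server_monthly_maint=100)
print(servers, racks, tco)  # 20 servers, 2 racks, $626,000
```

Note how the `max()` captures the balance argument: an unbalanced configuration that is throughput-rich but capacity-poor (or vice versa) forces extra servers, inflating both capex and 36 months of opex.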
As indicated in the chart, creating a balanced system provides great benefits in both performance and total cost of ownership. In addition to the 3x to 11x performance benefit, the Schooner tightly-coupled database server building block provides over a 50% improvement in TCO over any roll-your-own alternative.
III. Effectively Exploiting Multi-Core and Flash with Databases: System Architecture, Tight Coupling and Software Optimization Required
As indicated clearly from the studies above, when assembling the best-of-class multi-core processors and flash memory to match the workload, effective utilization of the CPUs, DRAM and flash to achieve a balanced system requires extensive software optimization.
Based on extensive research and experimentation, Schooner has developed the Schooner MySQL Enterprise 2.5 for InnoDB Appliance, which combines high-performance multi-core processors, flash memory, DRAM, and low-latency interconnects in a highly-parallel, low-overhead manner to balance system resources and maximize system throughput. The Schooner software is designed with high thread-level parallelism, granular concurrency control, optimized thread and data affinity management, optimized memory-hierarchy management, efficient and parallel flash I/O initiation and completion, recovery log/checkpoint management, and scan resistance for resiliency under workload variation. Schooner Appliances make optimal use of the CPU cores, flash memory, and low-latency interconnects to optimize application-level transactions/second, transactions/$, and transactions/watt while minimizing total cost of ownership. They also maximize service availability through fault tolerance and RAID on all flash and hard drives, high-speed replication and recovery, and high-speed backup and restore. Additionally, they provide rich administrative services for easy deployment, management, monitoring, and scaling.
Schooner Appliances deliver a very cost-effective solution with the currently available processor cores and flash. The next generation of multi-core and flash technologies enables further advances in performance, TCO and availability. We will be reporting on these in upcoming blogs.
Comment from Rick Cattell
Time July 3, 2010 at 11:54 PM
This is an excellent analysis of the issues, and the first good scientific study I’ve seen of the software and hardware payoffs/tradeoffs creating a balanced DBMS with flash memory and multi-core processors.
The major RDBMSs, including MySQL (more precisely, InnoDB and its use of the underlying file system) were written a decade or more ago, and were designed for a two-level storage hierarchy, not three (with flash). Effectively using flash really requires a rewrite of low-level code, not a simple user-level change like moving some MySQL tables to SSD, or a simple system-level change like making flash+disk look like a disk. The major RDBMSs were also not designed for the large number of CPU cores available today; their effective use likewise requires major code changes. Your results make clear that Schooner has done the right rewrites to make MySQL/InnoDB perform… good job!
I’ve been studying a lot of the new systems designed for horizontal scaling (see http://cattell.net/datastores). With all the interest in MySQL sharding and NoSQL, I find that people forget the importance of single-node performance. 10X improvement can make the complexity of horizontal scaling unnecessary, or greatly reduce the number of nodes required. It also reduces the limitations of horizontal scaling: for example, queries and transactions that span multiple nodes can run on a single node or fewer nodes.