Automated Database Design for Large-Scale Scientific Data

Overview:

Database support for scientific data is challenging due to the massive data volumes and the complex and diverse query workloads in modern scientific applications. An effective database physical design is critical in supporting a large variety of SQL queries over large-scale data. Taking advantage of detailed workload information is the key to guiding the physical design process towards efficient solutions: For instance, the Sloan Digital Sky Survey (SDSS) astronomical database contains tables with hundreds of attributes, which can be queried in various combinations. Designing indexes requires detailed workload knowledge in order to identify attribute subsets that are important for queries and must be indexed. Besides performance, database physical design must satisfy additional constraints, related to the maintenance of large-scale data: Database structures like indexes or materialized views are constrained by the available resources (like disk space) and also by the intensity of database updates which is proportional do the number of structures that must be maintained.

The details of each component of the project are given below.

AutoPart: Automated Schema Design Through Data Partitioning

Partitioning improves performance by replacing massive database tables by smaller ones, that are faster to access. Workload information is necessary, however, in order to determine which table attributes must be kept in the same table fragment so that the queries accessing them does not suffer a performance penalty by having to combine multiple fragments. We developed AutoPart, an algorithm for table partitioning given a representative input query workload, that can interface to commercial systems, similarly to available index and materialized view design tools. We experimented with AutoPart in the context of the SDSS database, and demonstrated
that it can improve performance by up to an order of magnitude, without utilizing any additional database structures. On the other hand, when combined
with indexes, AutoPart required less storage and exhibited much better update performance compared to an index-only design approach. Furthermore, we demonstrated that partitioning improves performance while minimizing the impact on the storage requirements and update performance of the database.


Schematic Diagram of AutoPart Implementation

Using AutoPart in the Physical Design of Database Caches

Making multi-terabyte scientific databases publicly accessible over the Internet is increasingly important in disciplines such as Biology and Astronomy. However, contention at a centralized, backend database is a major performance bottleneck limiting the scalability of Internet-based, database applications. Mid-tier caching reduces contention at the backend database by distributing database operations to the cache, increasing scalability. To improve the performance of mid-tier caches, we propose the caching of query prototypes, a workload-driven unit of cache replacement in which the cache object is chosen from various classes of queries in the workload. In existing mid-tier caching systems, the storage organization in the cache is statically defined. Our approach adapts cache storage to workload changes, requires no prior knowledge about the workload, and is transparent to the application. Experiments over a one-month, 1.4 million query Astronomy workload demonstrate up to 70% reduction in network traffic and reduce query response time by up to a factor of three compared to alternative units of cache replacement.

Schematic Diagram of SkyQuery System

Efficient and Accurate Cost Estimation for Automated Database Design Algorithms

State-of-the-art database design tools rely on the query optimizer for comparing between physical design alternatives. Despite providing an appropriate cost model for physical design, query optimization is a computationally expensive process. The significant time consumed by optimizer invocations poses serious performance limitations for physical design tools, causing long running times, especially for large problem instances. In this paper we introduce the Index Usage Model (INUM), a technique for minimizing the number of optimizer calls performed by index selection algorithms. INUM intelligently caches the results of a few key optimizer invocations and reuses them for query cost estimation during the tuning process. INUM is several orders of magnitude faster than the query optimizer and produces nearly identical cost estimates. Integrating INUM with index selection tools offers significant improvements in terms of their performance and their recommendation quality.



Design of INUM

People:

Carnegie Mellon University John Hopkins University

Publications:

  • S.Papadomanolakis and A.Ailamaki. AutoPart: Automated schema design for large scientific databases using data partitioning. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management, 2004. (PDF)
  • Stratos Papadomanolakis, Anastassia Ailamaki, "Workload-Driven Schema Design for Large Scientific Databases", Bulletin of the Technical Committee on Data Engineering, p. 21, vol. 27(4), (2004). (PS)
  • S. Papadomanolakis, A. Ailamaki, T. Tu, D. R. O.Hallaron, G. Heber. “Efficient Query Processing on Unstructured Tetrahedral Meshes”. ACM SIGMOD International Conference on Management of Data.(2006) (PDF)
  • Minglong Shao, Steven W. Schlosser, Stratos Papadomanolakis, Jiri Schindler, Anastassia Ailamaki, Christos Faloutsos, and Gregory R. Ganger, "MultiMap: Preserving disk locality for multidimensional datasets", IEEE International Conference for Data Engineering, p. 1, vol. 1,(2007). (PDF)
  • X. Wang, T. Malik, R. Burns, S. Papadomanolakis, and A. Ailamaki, "A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching", In Proceedings of Database Systems for Advanced Applications. (2007). (PDF)
  • S. Papadomanolakis, D. Dash, A. Ailamaki, “Intelligent Use of the Query Optimizer for Automated Physical Design”, VLDB 2007. (PDF)
  • T. Malik, R. Burns, and N. Chawla.  "A Black-Box Approach to Query Cardinality Estimation,"  Conference on Innovative Data Systems Research (CIDR),  2007 (PDF)
  • “An Integer Linear Programming Approach to Database Physical Design”. Stratos Papadomanolakis, Anastassia Ailamaki, ICDE Workshop on Self-Managing Database Systems (SMDB 2007).
  • "Automated Physical Design for Database Caches",T. Malik, X. Wang, R. Burns, D. Dash and A. Ailamaki, ICDE Workshop on Self-Managing Database Systems (SMDB 2008) (PDF)

Acknowledgment:

We would like to thank NSF for the grant supporting this project: Award # 0431008, COLLABORATIVE RESEARCH: SEI + II (AST): Bypass-Yield Caching for Large-Scale Scientific Database Workloads in the World-Wide Telescope.