Coalescing Analysis & Storage

Carlos Maltzahn

Overview

World-wide threats like pandemics, terrorism, wars, cyber-attacks, and climate change, as well as regional emergencies such as hurricanes and wildfires, can only be successfully addressed by rapid response using large amounts of real-time data, predictive models, and highly evolved logistics. Today those components are often available in high fidelity individually, but their composition to quickly synthesize new data into meaningful information and predictions is limited by the current state of the art in large-scale data processing and high-performance computing.


The world's fastest supercomputers are designed for compute-intensive simulations based on fixed data sets. Focusing on computational speed, they separate computation from storage and off-load I/O to a few dedicated nodes which access data on a file system running on a separate cluster. Analysis is run on yet another cluster that retrieves simulation results from the file system. In data-intensive rapid response problems, the data sets are continually changing so that I/O performance dominates overall performance. Simulations need to query for updates based on the state of the simulation and the nature of recent updates, e.g. "find all routes between airports within the last day that experienced an exponential increase of infections of disease X." This has far-reaching consequences for future super-computing systems, algorithms, and architectures, favoring designs where computation is closer to the data to minimize overall data movement.


This class will focus on the research of the UCSC Systems Research Lab (and others) to coalesce common data analysis tasks with large-scale storage systems, and will include guest speakers from industry and government centers of excellence. We will begin with background reading that will cover large-scale data analysis system stacks, including visualization and data analysis frameworks, super-computing and data center architectures, and distributed storage systems. We will then survey system architecture designs that move computation closer to data. Next, we will examine the different technologies that play an important role in these architectures, including data management, data models, high-level I/O interfaces and their implementation, and performance management and virtualization. Finally, we will focus on specific projects enabling the integration of data analysis with storage.


The class will consist of three parts:

  1. Weekly readings and class discussions on papers related to the class topic.

  2. An individual or group project. We will develop a number of specific project ideas as part of the class and everyone will be expected to implement one of these ideas, either individually or as part of a group.

  3. A final report. Everyone will be expected to turn in a project writeup similar to the conference papers we will be reading in class.


Prerequisites: you are expected to have basic operating system knowledge, as presented in a standard undergraduate course such as CMPS 111. Furthermore, you are expected to have taken CMPS 221, Advanced Operating Systems. Others will be admitted with the instructor's permission based upon demonstrated systems background and the sophistication necessary for successful completion of this course.


Course Requirements

One or more articles will be assigned as reading prior to each class meeting - usually two per class. These articles should be read carefully, and a short summary of each article and a few questions or insightful comments about the material (at least 3 per paper) prepared for the following class meeting. The summary of each article consists of brief answers to the following seven questions, followed by your own comments or questions:


   1. What is the problem the authors are trying to solve?

   2. What other approaches or solutions existed at the time that this work was done?

   3. What was wrong with the other approaches or solutions?

   4. What is the authors' approach or solution?

   5. Why is it better than the other approaches or solutions?

   6. How does it perform?

   7. Why is this work important?

   8. At least three comments or questions of your own


You will be required to write a report on a topic in the area of storage systems. This report should present the results of a project, original research (preferred), or a strong survey of prior art. Reporting work done for another course is not acceptable. You must choose a topic by the second week of the quarter. Each student will give a final presentation on their project at the end of the quarter.


Your grade in the course is based 25% on preparedness and class participation, 25% on presentations, and 50% on your term project and report.


Attendance

Class attendance is required. This is a discussion-based seminar course and you will not pass if you routinely miss class.


Academic Honesty

All the work you turn in must be your own. If you get ideas or material from any source other than your own mind (even from conversations with others), you must cite that source. Failure to do so constitutes plagiarism and will not be tolerated - you will not pass the course.