Big Data Systems
Class Meetings
Tuesdays and Thursdays (1/8/13 - 3/14/13), 10-11:45am PT, E2-506
Office Hours (contacts: Scott, Carlos)
Tuesdays: 1-3pm PT.
Schedule and Readings
Please retrieve the readings from the Web. Contact us if you have trouble finding them. The schedule and readings might change to accommodate guest speakers.
Guest speakers so far:
1/22: Christopher Olston (Google)
2/12: Dean Hildebrand (IBM Almaden)
2/21: Pat Helland (Salesforce)
2/26: Fatma Özcan (IBM Almaden)
2/28: Vishy Vishwanathan (Purdue University)
3/5: Bob Felderman (Google)
Tuesday, January 8, 2013
Introduction
Thursday, January 10, 2013
Big Data Now
O’Reilly Media, “Big Data Now” (2nd edition)
Tuesday, January 15, 2013
Big Data Storage
K. V. Shvachko, “HDFS scalability: The limits to growth,” ;login:, vol. 35, no. 2, 2010.
S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, “Ceph: A scalable, high-performance distributed file system,” in OSDI’06, (Seattle, WA), Nov. 2006.
[optional] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, “Finding a needle in Haystack: Facebook’s photo storage,” in OSDI’10, (Vancouver, Canada), October 4-6 2010.
Optional: CMPS 290H: Provenance
Thursday, January 17, 2013
Batch-oriented processing
J. Dean and S. Ghemawat, “MapReduce: Simplified data processing on large clusters,” in OSDI’04, (San Francisco, CA), Dec. 2004.
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, “Dryad: Distributed data-parallel programs from sequential building blocks,” in EuroSys 2007, (Lisbon, Portugal), March 21–23 2007.
Tuesday, January 22, 2013
Guest Lecture: Christopher Olston (Google)
This talk gives an overview of my team's work on large-scale data
processing at Yahoo! Research. The talk begins by introducing two data
processing systems we helped develop: PIG, a dataflow programming
environment and Hadoop-based runtime, and NOVA, a workflow manager for
Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at what
can be done before, during and after execution of a data processing
operation:
* Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig
job to get a feel for what it will do, enabling certain kinds of
mistakes to be caught early and cheaply. The algorithm behind the
example generator performs a combination of sampling and synthesis to
balance several key factors---realism, conciseness and completeness---of
the example data it produces.
* INSPECTOR GADGET is a framework for creating custom tools that
monitor Pig job execution. We implemented a dozen user-requested tools,
ranging from data integrity checks to crash cause investigation to
performance profiling, each in just a few hundred lines of code.
* IBIS is a system that collects metadata about what happened during
data processing, for post-hoc analysis. The metadata is collected from
multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and
processing elements at different granularities (e.g. tables vs. records;
relational operators vs. reduce task attempts) and offer disparate ways
of querying it. IBIS integrates this metadata and presents a uniform and
powerful query interface to users.
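The sampling-plus-synthesis idea behind the example data generator can be sketched in a few lines. This is a deliberately simplified illustration, not Pig's actual algorithm: the function names, the fixed sample size, and the fallback strategy are assumptions made here for clarity.

```python
import random

def example_data(records, predicate, synthesize, k=3, seed=0):
    """Toy sketch of Pig-style example generation for a FILTER step:
    prefer real sampled records (realism), cap how many are shown
    (conciseness), and synthesize a passing record when the sample
    would exercise no operator path (completeness)."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(k, len(records)))
    passing = [r for r in sample if predicate(r)]
    failing = [r for r in sample if not predicate(r)]
    if not passing:                   # the filter would drop every shown record,
        passing.append(synthesize())  # so fabricate one record that passes
    return passing + failing[:1]      # also keep one filtered-out example
```

The tension the real algorithm balances shows up even here: a pure sample may miss a rare branch (so we synthesize), while pure synthesis would show the user data that never occurs (so we sample first).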
[optional reading] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, “Pig Latin: A not-so-foreign language for data processing,” in SIGMOD ’08, (Vancouver, Canada), June 9-12 2008.
Thursday, January 24, 2013
Batch-oriented processing
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, “Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing,” in NSDI 2012, 2012.
C. Engle, A. Lupher, R. Xin, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica, “Shark: Fast data analysis using coarse-grained distributed memory,” in SIGMOD ’12, (Scottsdale, AZ), May 20-24 2012.
Tuesday, January 29, 2013
Batch-processing of Graphs and Stream Processing
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph processing,” in SIGMOD ’10, (Indianapolis, IN), June 6-11 2010.
L. Neumeyer, B. Robbins, A. Nair, and A. Kesari, “S4: Distributed stream computing platform,” in KDCloud’10, 2010.
Thursday, January 31, 2013
Key/value Stores
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, “Bigtable: A distributed storage system for structured data,” in OSDI’06, (Seattle, WA), November 2006.
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, “Dynamo: Amazon’s highly available key-value store,” in Proceedings of the 21st ACM Symposium on Operating Systems Principles (SOSP ’07), pp. 205–220, 2007.
Tuesday, February 5, 2013
Interactive transactions on distributed data
D. Peng and F. Dabek, “Large-scale incremental processing using distributed transactions and notifications,” in OSDI’10, (Vancouver, Canada), October 4-6 2010.
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis, “Dremel: Interactive analysis of web-scale datasets,” in Proc. of the 36th Int’l Conf on Very Large Data Bases, pp. 330–339, 2010.
Thursday, February 7, 2013
Performance Virtualization
J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Léon, Y. Li, A. Lloyd, and V. Yushprakh, “Megastore: Providing scalable, highly available storage for interactive services,” in CIDR ’11, (Asilomar, CA), January 9-12 2011.
J. C. Corbett et al., “Spanner: Google’s globally-distributed database,” in OSDI’12, 2012.
Tuesday, February 12, 2013
Guest Lecture: Dean Hildebrand (IBM)
F. Schmuck and R. Haskin, “GPFS: A shared-disk file system for large computing clusters,” in Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pp. 231–244, USENIX, Jan. 2002.
D. Hildebrand and P. Honeyman, “Exporting storage systems in a scalable manner with pNFS,” in MSST ’05, (Monterey, CA), April 11-14 2005.
[Optional] R. Ananthanarayanan, M. Eshel, R. Haskin, M. Naik, F. Schmuck, and R. Tewari, “Panache: A parallel WAN cache for clustered filesystems,” ACM SIGOPS Operating Systems Review, vol. 42, pp. 48–53, January 2008.
Thursday, February 14, 2013
FAST 2013
Attend FAST 2013 (no regular class meeting)
Tuesday, February 19, 2013
Structured Data Processing
J. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. A. Brandt, “SciHadoop: Array-based query processing in Hadoop,” in SC ’11, (Seattle, WA), November 2011.
T. Kaldewey, E. J. Shekita, and S. Tata, “Clydesdale: Structured data processing on MapReduce,” in EDBT 2012, (Berlin, Germany), March 26-30 2012.
Thursday, February 21, 2013
Guest Lecture: Pat Helland (Salesforce)
Talk 1: If You Have Too Much Data, then “Good Enough” Is Good Enough
Classic database systems offer crisp answers for a relatively small amount of data. These systems hold their data in one or a relatively small number of computers. With a tightly defined schema and transactional consistency, the results returned from queries are crisp and accurate.
New systems have humongous amounts of data content, change rates, and querying rates and take lots of computers to hold and process. The data quality and meaning are fuzzy. The schema, if present, is likely to vary across the data. The origin of the data may be suspect, and its staleness may vary.
Today's data systems coalesce data from many sources. The Internet, B2B, and enterprise application integration (EAI) combine data from different places. No computer is an island. This large amount of interconnectivity and interdependency has led to a relaxation of many database principles. Let's consider the ways in which today's answers differ from what we used to expect.
Talk 2: Immutability Changes Everything
For a number of decades, I've been saying "Computing Is Like Hubble's Universe, Everything Is Getting Farther Away from Everything Else". It used to be that everything you cared about ran on a single database and the transaction system presented you the abstraction of a singularity; your transaction happened at a single point in space (the database) and a single point in time (it looked like it was before or after all other transactions).
Now, we see a more complicated world. Across the Internet, we put up HTML documents or send SOAP calls and these are not in a transaction. Within a cluster, we typically write files in a file system and then read them later in a big map-reduce job that sucks up read-only files, crunches, and writes files as output. Even inside the emerging many-core systems, we see high-performance computation on shared memory but increasing cost to using semaphores. Indeed, it is clear that "Shared Memory Works Great as Long as You Don't Actually SHARE Memory".
There are emerging solutions which are based on immutable data. It seems we need to look back to our grandparents and how they managed distributed work in the days before telephones. We realize that "Accountants Don't Use Erasers" but rather accumulate immutable knowledge and then offer interpretations of their understanding based on the limited knowledge presented to them. This talk will explore a number of the ways in which our new distributed systems leverage write-once and read-many immutable data.
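The "accountants don't use erasers" idea can be sketched as an append-only ledger: entries are write-once, and any current state is an interpretation derived from the immutable history. This is a toy illustration of the pattern, not any particular system's design.

```python
class Ledger:
    """Minimal append-only ledger: entries are never updated or deleted;
    balances are a fold over the immutable history, and mistakes are
    fixed with compensating entries rather than erasure."""

    def __init__(self):
        self._entries = []            # write-once, read-many

    def append(self, account, delta):
        self._entries.append((account, delta))

    def balance(self, account):
        # derived interpretation of the history, not stored mutable state
        return sum(d for a, d in self._entries if a == account)

    def correct(self, account, delta):
        # accountants don't use erasers: append a compensating entry
        self.append(account, delta)
```

The same shape recurs in the systems above: HDFS files written once and consumed by MapReduce jobs, and log-structured stores that accumulate immutable records and reinterpret them on read.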
Tuesday, February 26, 2013
Guest Lecture: Fatma Özcan (IBM Almaden): Improving Query Processing on Hadoop
ABSTRACT:
Hadoop has become an attractive platform for large-scale data analytics. An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. Despite Hadoop’s popularity, it still suffers from major performance bottlenecks. In this seminar, I will talk about some techniques, borrowed from parallel databases, to speed up query processing on Hadoop.
The first of these techniques addresses the lack of ability to colocate related data on the same set of nodes in an HDFS cluster. To overcome this bottleneck, I will describe CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization.
Next, I will present the Eagle-Eyed-Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data.
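The core split-elimination idea can be illustrated with a simple inverted index from field values to split IDs: a query with an equality predicate consults the index and scans only the splits that can match. This is a sketch of the concept only; E3's actual machinery (domain segmentation, materialized views, adaptive caching) is considerably more sophisticated, and the names here are assumptions.

```python
def build_split_index(splits, field):
    """Toy per-field inverted index over splits: map each distinct field
    value to the set of split IDs containing at least one such record."""
    index = {}
    for sid, split in enumerate(splits):
        for rec in split:
            index.setdefault(rec[field], set()).add(sid)
    return index

def relevant_splits(index, value):
    """Splits worth scanning for the predicate `field == value`; values
    absent from the data touch no split at all."""
    return sorted(index.get(value, set()))
```

The payoff is the common case the abstract describes: ad hoc selective queries over a slowly changing dataset, where most splits are irrelevant and a cheap index lookup replaces a full scan.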
Thursday, February 28, 2013
Guest Lecture: Vishy Vishwanathan (Purdue): Challenges in Scaling Machine Learning to Big Data
Abstract: We will start with a very innocuous sounding question: How do
you estimate a multinomial distribution given some data? We will then
show that this fundamental question underlies many applications of
machine learning. We will survey some learning algorithms and challenges
in scaling them to massive datasets. The last part of the class will be
an interactive thought session on how we can bring together ideas from
systems and Machine Learning to attack this problem.
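The "innocuous" question can be made concrete. The maximum-likelihood estimate of a multinomial is just normalized counts, and additive (Laplace) smoothing is one classical fix for the zero-probability problem that becomes acute at scale, where most of a huge vocabulary is unseen. This is a textbook sketch, not the methods from the talk.

```python
from collections import Counter

def multinomial_mle(observations, vocab):
    """Maximum-likelihood estimate: p(x) = count(x) / N.
    Assigns probability zero to any symbol never observed."""
    counts = Counter(observations)
    n = len(observations)
    return {x: counts[x] / n for x in vocab}

def multinomial_smoothed(observations, vocab, alpha=1.0):
    """Additive smoothing: p(x) = (count(x) + alpha) / (N + alpha*|V|).
    Every symbol in the vocabulary gets nonzero probability."""
    counts = Counter(observations)
    n = len(observations)
    v = len(vocab)
    return {x: (counts[x] + alpha) / (n + alpha * v) for x in vocab}
```

With web-scale vocabularies the pseudocount `alpha * |V|` can dwarf the real counts, which is one reason more careful estimators (such as the two-stage language models in the suggested reading) matter.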
Suggested reading:
S. Goldwater, T. L. Griffiths, and M. Johnson, “Producing power-law distributions and damping word frequencies with two-stage language models,” Journal of Machine Learning Research, vol. 12, pp. 2335–2382, 2011.
Tuesday, March 5, 2013
Guest Lecture: Bob Felderman (Google): Warehouse Scale Computing and the Perils and Pitfalls of Optimization
Abstract: Lots of the action in computer system design has migrated to the ends of the scale spectrum. Today, warehouse scale computing and mobile computing get a lot of attention. We'll present WSC from the Google perspective, then dive in to get a better idea on what it takes to create optimal systems.
D. Abts and B. Felderman, “A guided tour of data-center networking,” CACM, vol. 55, pp. 44–51, June 2012.
S. Han, S. Marshall, B.-G. Chun, and S. Ratnasamy, “MegaPipe: A new programming interface for scalable network I/O,” in OSDI’12, 2012.
Thursday, March 7, 2013
Structured Data Processing
N. Watkins, C. Maltzahn, S. A. Brandt, and A. Manzanares, “Datamods: Programmable file system services,” in PDSW’12, (Salt Lake City, UT), November 12 2012.
J. He, J. Bent, A. Torres, G. Grider, G. Gibson, C. Maltzahn, and X.-H. Sun, “Discovering structure in unstructured I/O,” in PDSW’12, (Salt Lake City, UT), November 12 2012.
Tuesday, March 12, 2013
Project presentations
Thursday, March 14, 2013
Project presentations
Friday, March 22, 2013
Firm deadline for class project reports