Speakers
Tuesday, 1/22/13 Christopher Olston, Staff Research Scientist, Google
(hosted by Phokion Kolaitis and the Database Group)
At Yahoo, Olston co-created Apache Pig, which is used for large-scale data processing by LinkedIn, Netflix, Salesforce, Twitter, Yahoo and others, and is offered by Amazon as a cloud service. Olston gave the 2011 Symposium on Cloud Computing keynote, and won the 2009 SIGMOD best paper award. During his flirtation with academia, Olston taught undergrad and grad courses at Berkeley, Carnegie Mellon and Stanford, and signed several Ph.D. dissertations.
Abstract: This talk gives an overview of my team's work on large-scale data processing at Yahoo! Research. The talk begins by introducing two data processing systems we helped develop: PIG, a dataflow programming environment and Hadoop-based runtime, and NOVA, a workflow manager for Pig/Hadoop. The bulk of the talk focuses on debugging, and looks at what can be done before, during and after execution of a data processing operation:
• Pig's automatic EXAMPLE DATA GENERATOR is used before running a Pig job to get a feel for what it will do, enabling certain kinds of mistakes to be caught early and cheaply. The algorithm behind the example generator performs a combination of sampling and synthesis to balance several key factors of the example data it produces: realism, conciseness, and completeness.
• INSPECTOR GADGET is a framework for creating custom tools that monitor Pig job execution. We implemented a dozen user-requested tools, ranging from data integrity checks to crash cause investigation to performance profiling, each in just a few hundred lines of code.
• IBIS is a system that collects metadata about what happened during data processing, for post-hoc analysis. The metadata is collected from multiple sub-systems (e.g. Nova, Pig, Hadoop) that deal with data and processing elements at different granularities (e.g. tables vs. records; relational operators vs. reduce task attempts) and offer disparate ways of querying it. IBIS integrates this metadata and presents a uniform and powerful query interface to users.
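To make the example-generator idea concrete, here is a minimal sketch (not Pig's actual algorithm; the function and field names are illustrative) of the sampling-plus-synthesis trade-off: sample a few real rows for realism and conciseness, and fall back to adding a row that satisfies the filter predicate when the sample alone would not exercise it, for completeness.

```python
import random

def example_data(rows, predicate, k=3, seed=0):
    """Sample up to k real rows; if none of them passes the filter
    predicate, add one that does so the example set is complete."""
    rng = random.Random(seed)
    sample = rng.sample(rows, min(k, len(rows)))
    if not any(predicate(r) for r in sample):
        passing = [r for r in rows if predicate(r)]
        if passing:
            sample.append(rng.choice(passing))    # real row: stays realistic
        else:
            sample.append({"synthesized": True})  # nothing real passes: fabricate
    return sample

rows = [{"user": u, "clicks": c} for u, c in
        [("a", 1), ("b", 2), ("c", 50), ("d", 3)]]
examples = example_data(rows, lambda r: r.get("clicks", 0) > 10)
print(examples)  # small sample guaranteed to exercise the filter
```

The real generator works over whole Pig dataflows, propagating these constraints through every operator rather than a single filter.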
Tuesday, 2/12/13 Dean Hildebrand, Master Inventor, IBM Almaden
Title: Moving Beyond Storage of the Undead
Abstract: Many of today's data centers store and access data using several different storage systems, each of which has a specialized purpose. While a large variety of storage management and optimization options exist, these systems have limited functionality. Not unlike a zombie that utilizes only part of its brain, these systems are not living up to their full potential.
As the amount of data grows, the tolerance for using undead data management techniques is shrinking fast. All of this fits into the emerging focus on Software Defined Storage, which proposes, among other things, to make storage more intelligent. Enterprise features such as parallel data access, high availability, backup, disaster recovery, compression, encryption, tiering, etc., that are considered optional today will become mandatory for many Big Data infrastructures.
In this talk I will discuss some of the ways our research group at IBM Almaden is seeking to make GPFS smarter for managing Big Data. This includes new features and enhancements to existing features to support cloud architectures, pNFS, analytics, efficient disaster recovery, and more.
Thursday, 2/21/13 Pat Helland, Software Architect, Salesforce.com
Abstracts:
Talk 1: If You Have Too Much Data, then “Good Enough” Is Good Enough
Classic database systems offer crisp answers for a relatively small amount of data. These systems hold their data in one or a relatively small number of computers. With a tightly defined schema and transactional consistency, the results returned from queries are crisp and accurate.
New systems hold humongous amounts of data with high change and query rates, and take many computers to store and process. The data quality and meaning are fuzzy. The schema, if present, is likely to vary across the data. The origin of the data may be suspect, and its staleness may vary.
Today's data systems coalesce data from many sources. The Internet, B2B, and enterprise application integration (EAI) combine data from different places. No computer is an island. This large amount of interconnectivity and interdependency has led to a relaxation of many database principles. Let's consider the ways in which today's answers differ from what we used to expect.
Talk 2: Immutability Changes Everything
For a number of decades, I've been saying "Computing Is Like Hubble's Universe, Everything Is Getting Farther Away from Everything Else". It used to be that everything you cared about ran on a single database and the transaction system presented you the abstraction of a singularity; your transaction happened at a single point in space (the database) and a single point in time (it looked like it was before or after all other transactions).
Now, we see a more complicated world. Across the Internet, we put up HTML documents or send SOAP calls and these are not in a transaction. Within a cluster, we typically write files in a file system and then read them later in a big map-reduce job that sucks up read-only files, crunches, and writes files as output. Even inside the emerging many-core systems, we see high-performance computation on shared memory but increasing cost to using semaphores. Indeed, it is clear that "Shared Memory Works Great as Long as You Don't Actually SHARE Memory".
There are emerging solutions which are based on immutable data. It seems we need to look back to our grandparents and how they managed distributed work in the days before telephones. We realize that "Accountants Don't Use Erasers" but rather accumulate immutable knowledge and then offer interpretations of their understanding based on the limited knowledge presented to them. This talk will explore a number of the ways in which our new distributed systems leverage write-once and read-many immutable data.
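The "accountants don't use erasers" pattern can be sketched in a few lines (a hypothetical illustration, not Helland's code): state lives in an append-only journal of immutable entries, corrections are new compensating entries rather than updates in place, and any "current balance" is just an interpretation derived from the facts recorded so far.

```python
class Ledger:
    """Append-only journal: entries are never updated or deleted;
    current state is derived by interpreting immutable facts."""
    def __init__(self):
        self._entries = []          # write-once, read-many

    def record(self, account, amount):
        self._entries.append((len(self._entries), account, amount))

    def balance(self, account, as_of=None):
        # A "correction" is a new compensating entry, not an erasure,
        # so we can also reconstruct any earlier point in time.
        end = len(self._entries) if as_of is None else as_of
        return sum(amt for seq, acct, amt in self._entries[:end]
                   if acct == account)

led = Ledger()
led.record("alice", 100)
led.record("alice", -30)
led.record("alice", 30)   # compensating entry that "undoes" the debit
print(led.balance("alice"))            # prints 100
print(led.balance("alice", as_of=2))   # the view before the correction: 70
```

Because nothing is ever overwritten, replicas and map-reduce readers can consume the journal concurrently without locks, which is exactly the property the talk exploits.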
Tuesday, 2/26/13 Fatma Özcan, Research Staff Member, IBM Research - Almaden
(hosted by Phokion Kolaitis and the Database Group)
Title: Improving Query Processing on Hadoop
Abstract: Hadoop has become an attractive platform for large-scale data analytics. An increasingly important analytics scenario for Hadoop involves multiple (often ad hoc) grouping and aggregation queries with selection predicates over a slowly changing dataset. These queries are typically expressed via high-level query languages such as Jaql, Pig, and Hive, and are used either directly for business-intelligence applications or to prepare the data for statistical model building and machine learning. Despite Hadoop’s popularity, it still suffers from major performance bottlenecks. In this seminar, I will talk about some techniques, borrowed from parallel databases, to speed up query processing on Hadoop.
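The workload described above (grouping and aggregation with selection predicates) maps directly onto Hadoop's execution model. A minimal single-process sketch, with invented field names and a sum aggregate standing in for what Jaql, Pig, or Hive would compile to:

```python
from collections import defaultdict

def map_phase(records, predicate, key_fn, val_fn):
    """Map: apply the selection predicate, emit (group-key, value) pairs."""
    for r in records:
        if predicate(r):
            yield key_fn(r), val_fn(r)

def reduce_phase(pairs):
    """Reduce: aggregate the values arriving for each group key (a sum here)."""
    groups = defaultdict(int)
    for k, v in pairs:
        groups[k] += v
    return dict(groups)

logs = [{"page": "home", "ms": 12}, {"page": "search", "ms": 340},
        {"page": "home", "ms": 95}, {"page": "search", "ms": 20}]
# "Total latency per page, slow requests only" as a map/reduce pipeline.
result = reduce_phase(map_phase(logs, lambda r: r["ms"] > 15,
                                lambda r: r["page"], lambda r: r["ms"]))
print(result)  # search: 340 + 20, home: 95 (home's 12 ms record is filtered out)
```

On a real cluster the shuffle between the two phases dominates cost, which is why the data-placement and split-pruning techniques below matter.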
The first of these techniques addresses the lack of ability to colocate related data on the same set of nodes in an HDFS cluster. To overcome this bottleneck, I will describe CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. Colocation can be used to improve the efficiency of many operations, including indexing, grouping, aggregation, columnar storage, joins, and sessionization. Next, I will present the Eagle-Eyed-Elephant (E3) framework for boosting the efficiency of query processing in Hadoop by avoiding accesses of data splits that are irrelevant to the query at hand. Using novel techniques involving inverted indexes over splits, domain segmentation, materialized views, and adaptive caching, E3 avoids accessing irrelevant splits even in the face of evolving workloads and data.
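The split-elimination idea behind E3 can be illustrated with a toy inverted index over splits (a simplified sketch with invented data; E3 additionally uses domain segmentation, materialized views, and adaptive caching): map each field value to the set of splits containing it, and read only those splits when evaluating an equality predicate.

```python
from collections import defaultdict

# Toy dataset partitioned into three HDFS-style splits.
splits = [
    [{"country": "US"}, {"country": "CA"}],   # split 0
    [{"country": "DE"}, {"country": "FR"}],   # split 1
    [{"country": "US"}, {"country": "MX"}],   # split 2
]

# Inverted index: field value -> ids of splits that contain it.
index = defaultdict(set)
for sid, split in enumerate(splits):
    for rec in split:
        index[rec["country"]].add(sid)

def relevant_splits(value):
    """Splits that could satisfy `country == value`; all others are skipped."""
    return sorted(index.get(value, set()))

print(relevant_splits("US"))   # only splits 0 and 2 need to be read
```

With a predicate like `country == "DE"`, two of the three splits are never scheduled as map tasks at all, which is where the speedup comes from.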
Thursday, 2/28/13 Vishy Vishwanathan, Associate Professor of Statistics and Computer Science, Purdue University
Title: Challenges in Scaling Machine Learning to Big Data
Abstract: We will start with a very innocuous-sounding question: how do you estimate a multinomial distribution given some data? We will then show that this fundamental question underlies many applications of machine learning. We will survey some learning algorithms and challenges in scaling them to massive datasets. The last part of the talk will be an interactive thought session on how we can bring together ideas from systems and machine learning to attack this problem.
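For readers who want the opening question made concrete, here is the textbook answer in a few lines (the talk presumably goes far beyond this): the maximum-likelihood estimate is just normalized counts, and additive (Laplace) smoothing keeps unseen categories from getting probability zero.

```python
from collections import Counter

def multinomial_estimate(observations, categories, alpha=0.0):
    """Estimate multinomial parameters from counts.
    alpha=0 gives the MLE count/N; alpha>0 adds Laplace smoothing,
    (count + alpha) / (N + alpha * |categories|)."""
    counts = Counter(observations)
    n = len(observations)
    denom = n + alpha * len(categories)
    return {c: (counts[c] + alpha) / denom for c in categories}

data = list("aabac")  # 3 a's, 1 b, 1 c over categories {a, b, c, d}
print(multinomial_estimate(data, "abcd"))             # MLE: d gets probability 0
print(multinomial_estimate(data, "abcd", alpha=1.0))  # smoothed: d gets 1/9
```

At massive scale even this becomes interesting: the counts must be accumulated across many machines, which is precisely the systems/ML intersection the talk targets.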
Tuesday, 3/5/13 Bob Felderman, Principal Engineer, Platforms Hardware, Google
Title: Warehouse Scale Computing and the Perils and Pitfalls of Optimization
Abstract: Much of the action in computer system design has migrated to the ends of the scale spectrum: today, warehouse scale computing and mobile computing get a lot of attention. We'll present WSC from the Google perspective, then dive in to get a better idea of what it takes to create optimal systems.