Lecture: External Memory Systems for Data Science and Machine Learning
In modern computer architectures, the movement of data from storage or memory to the processor limits the performance and scale of scientific data analysis and machine learning. This bottleneck has grown far more acute with the adoption of multicore processors and GPUs. This talk covers a decade of research in my lab that redesigned computer systems to move data efficiently through the memory hierarchy, and applied these systems to make data science (graph analytics and sparse linear algebra) and machine learning (k-means and random forests) more efficient.
The lecture is self-contained and designed for the computational scientist who is familiar with data science and machine learning programming tools. It will lightly review computer science concepts, including the memory hierarchy, external memory algorithms, and non-uniform memory architectures.
Date: January 21, 2 p.m., in the Auditório do CEA II.
Lecture: Reproducible Data Science with Gigantum
Best practices in software engineering have defined programming environments for reproducibility and code sharing based on a combination of versioning (git), containers (docker), and tools for documentation, continuous integration, and code review. Data science development environments (JupyterLab and RStudio) have become literate, mixing code and markdown, but they do not provide meaningful support for versioning and sharing.
This talk presents the Gigantum open-source data science work environment, which automates the best practices and skill-intensive tasks that are crucial to good data science. The data scientist works in familiar tools, such as RStudio and Jupyter, and Gigantum ensures that all aspects of a data science project (code, data, and environment) are portable, shareable, and continuously versioned. Gigantum runs locally (on laptops) as well as in the cloud, so that the data scientist can work without incurring cloud computing costs. Users can collaborate in groups or on public projects, exploring by launching on the cloud, contributing in their own branch, or customizing with new code or private data in their own fork.
Date: January 23, 2 p.m., in the Auditório do CEA II.
The same lecture will be presented at CPTEC on January 22 at 2 p.m.
Bio: Randal Burns is a Professor of Computer Science at Johns Hopkins University and has served as Department Chair since 2018. Randal’s research has pushed the scalability limits of data science based on emergent storage technologies, ranging from engineering file systems for storage area networks in the 1990s, to building scientific Web services on scale-out cloud storage in the 2000s, to developing graph and sparse-matrix engines for machine learning in the 2010s. His work has been inspired by high-throughput science, including numerical simulations of turbulence, neuroscience microscopy, and observational astronomy.
Randal earned his PhD in Computer Science from the University of California Santa Cruz in 2000 and a BS in Geophysics from Stanford in 1993. Prior to joining the faculty at Johns Hopkins in 2002, he was a Research Staff Member at IBM’s Almaden Research Center, where he won an Outstanding Innovation Award. Randal is a recipient of the NSF CAREER Award and was a DOE Early Career Principal Investigator. He is a Kavli Fellow and served as a member of the Defense Science Study Group class of 2012-2013.