
Creating an Active Learning Environment using Reproducible Data Science Tools

By Randal Burns / June 2020

TYPE: DESIGN FOR LEARNING, HIGHER EDUCATION

At Johns Hopkins University, I have taught a computer science course called “Parallel Programming” for more than 10 years. The projects in the course provide a broad introduction to modern parallel architectures and to the tools, languages, and runtimes used to program them. The course helps students understand how to encode parallelism in the solution to a performance-oriented programming task. It is designed for undergraduate CS majors and for graduate students from all disciplines who do computational science.

This class exposes students to a diverse set of technologies. Hardware parallelism includes multicore processors, GPU accelerators, cloud computing, and high-performance computing. Software tools and runtimes include OpenMP, Java threads, MPI, CUDA, PyTorch, TensorFlow, Spark, Hadoop, Dask, Hive, and Pig. Students program in C/C++, Java, Python, R, and Scala. Breadth of experience is an important goal: undergraduates gain access to industry-relevant tools that provide inroads into internships, and graduate students are exposed to the programming techniques they need to conduct data science and numerical computing research.

The diversity of technologies makes the class onerous to organize and teach. Each combination of hardware, software, and tools creates its own set of software dependencies, and instructors and TAs must prepare every technology with usability in mind. We have attempted to solve this in various ways. In the early years (2008-2010), we ran a dedicated cluster that we reconfigured and reinstalled. In 2011, we moved the course to the Amazon cloud, providing virtual machines and using managed services such as Elastic MapReduce for Hadoop. When using the Amazon cloud, students first developed on their laptops before running experiments in order to reduce cloud-computing costs. This meant they also had to install and configure these tools locally.

Instructors, TAs, and students spent over one-third of course time installing and configuring software, including operating systems and virtual machines. Additionally, every student needed to become a proficient system administrator to complete the course.

Meanwhile, I was tracking the evolution of virtual machines, software containers, and the associated automated configuration tools, looking for a solution that could move software environments from a laptop to the cloud with minimal expertise required.

Moving to Gigantum

In 2019, I undertook moving the entire course to Gigantum, a data science platform for reproducibility and sharing [1]. Gigantum captures the entire state of a data-science experiment and records a complete history of all changes to code, data, and environment. It is an automation tool that integrates Docker for software configuration, Git for versioning and storage, and JupyterLab or RStudio as a user environment for data science and machine learning. I also migrated lectures and lecture notes from PowerPoint slides to Jupyter notebooks, making the entire course an active-learning experience in the spirit of boot camps and massive open online courses (MOOCs).

For coursework, Gigantum allowed instructors and TAs to build programming projects that included example source code, input data, and a software container. We delivered an entire project that a student could run on the cloud or on their laptop. Students started programming without installing software, launching instances on the cloud, or building virtual machines. Jupyter notebooks support Python, R, C/C++, and Java programming environments. Projects contained complex software stacks such as Dask, Hadoop, OpenMP, MPI, CUDA, and Spark.

Presenting lectures in Gigantum was a profound transformation for the course. Treating lectures as data science projects made them interactive. Before each class, I would publish lecture notes and programming examples as a Jupyter notebook in a Gigantum project. Students would download or pull the changes to the Gigantum project at the start of class, which would rebuild the software environments on their laptops. They could then follow the lecture in the Jupyter notebook and run and modify examples during the lecture. When reviewing material for assignments or exams, students could reproduce all previous work exactly. If an example or solution was modified or broken, they could revert the project to any state in its history to recover their work.

The Learning Outcome

Jupyter notebooks changed the course into an active-learning experience for students, who went from listening passively to writing and running programs during lecture [2]. Active learning has been shown to increase performance in STEM disciplines [3]. A community of educators concluded that teaching in Jupyter allowed them to “increase student engagement with material and their participation in class” [4].

Figure 1. LaTeX markdown of Amdahl's law in a Jupyter notebook.



When “porting” the course from slides to notebooks, I converted graphs, equations, and analysis into interactive code that students manipulate. A good example is the unit on Amdahl’s law, the fundamental concept in parallel efficiency. The equations behind Amdahl’s law were typeset as LaTeX in markdown (see Figure 1), the law was implemented in a notebook cell in the R programming language (see Figure 2), and figures were generated over sequences of data and parameters using the ggplot package (see Figure 3). In class, we varied the parameter p, which describes what fraction of the code is parallelized, to study its effect on speedup. Students executed the example during the presentation and gained intuition that is difficult to glean when passively consuming content.
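For reference, the standard form of Amdahl’s law, which the markdown in Figure 1 typesets (possibly with different notation), gives the speedup S of a program in which a fraction p of the work is parallelized across n processors:

    S(p, n) = \frac{1}{(1 - p) + \frac{p}{n}}

As p approaches 1, the speedup approaches n; for any p < 1, the speedup is bounded by 1/(1 - p) no matter how many processors are used. A minimal Python sketch of this computation appears after Figure 3.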

Figure 2. R snippet to graph Amdahl's law using ggplot.



Figure 3. Speedup plot produced by the code snippet in Figure 2.


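The following is a minimal Python sketch of the computation behind Figures 2 and 3. The course notebooks used R and ggplot, so this is an illustrative translation rather than the course’s actual code; it assumes only NumPy and Matplotlib.

    # Amdahl's law: plot speedup as the parallel fraction p varies.
    # Illustrative Python translation of the R/ggplot example in Figure 2.
    import numpy as np
    import matplotlib.pyplot as plt

    def speedup(p, n):
        # Speedup for parallel fraction p on n processors.
        return 1.0 / ((1.0 - p) + p / n)

    n = np.arange(1, 129)                 # processor counts 1..128
    for p in (0.5, 0.9, 0.95, 0.99):      # vary the parallel fraction, as in lecture
        plt.plot(n, speedup(p, n), label=f"p = {p}")

    plt.xlabel("processors (n)")
    plt.ylabel("speedup")
    plt.title("Amdahl's law")
    plt.legend()
    plt.show()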

The process of lecturing with interactive examples provides a nice balance between traditional lectures and the flipped classroom. Computer science education has been exploring new teaching methods that are experiential and scalable. In universities, the leading example is the flipped classroom, in which large classes are broken into smaller sections, lectures are recorded and consumed offline, and course time is spent team programming with guidance from instructors and TAs [5]. We conducted an informal review of introductory programming courses at Johns Hopkins that compared flipped classrooms with the same material in traditional lectures offered in prior years. Learning outcomes, measured by grades and preparedness for subsequent courses, were good. However, student satisfaction was down significantly. Feedback indicated that students felt they were teaching themselves and were not benefiting from the expertise of faculty. Lectures remain a good way to maximize interaction with faculty. Interactive notebooks preserve the lecture in its traditional format while injecting active learning to improve performance and outcomes [3].

A Course in Jupyter Notebooks

The student course view was modeled after idioms for presenting Jupyter notebooks on GitHub used by the machine learning community [6]. Figure 4 shows the view during the lecture on OpenMP. This matches the student’s view on their laptop. The organizing principles are:

  • Each lecture is a Jupyter notebook that interleaves markdown containing speaking points with runnable examples, in this case C/C++ code (see the sketch after this list).
  • Lectures are numbered sequentially and organized in a single repository to create a syllabus view.
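Notebooks run a Python kernel by default, so one common way to keep a compiled C/C++ example runnable inside a lecture notebook is to write the source out from one cell and compile it with shell escapes in the next. The sketch below uses this standard Jupyter pattern; it is an assumption about the mechanics (the course notebooks may use a different mechanism, such as a dedicated C++ kernel) and assumes gcc is installed in the container.

    %%writefile hello_omp.c
    /* The %%writefile magic above saves this cell's body to hello_omp.c. */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
        return 0;
    }

Then, in a following cell:

    # Compile with OpenMP support and run, using Jupyter shell escapes.
    !gcc -fopenmp hello_omp.c -o hello_omp
    !./hello_omp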

The versioning capabilities of Git allow students to customize content with their own notes. Gigantum merges changes at the project level, such as the addition of new lectures. Merging changes within a Jupyter notebook itself does not work well; this is a limitation of Jupyter [7], not of Git or Gigantum.

Figure 4. Course lectures in Jupyter. Each lecture is a notebook that mixes markdown, code, and output. 



Course Workflow, Running Gigantum

The parallel programming course hosted cloud instances on which students ran Gigantum. This had two advantages: (1) students could access material and interact with lectures without installing software on their laptops, and (2) we could provide specific hardware, e.g., multicore servers, so that parallel performance was predictable, made sense, and matched the lesson. A student needed only a laptop with a browser to follow the lectures and examples.

We provided a single project for all lectures (https://gigantum.com/randalg/jhupp-lectures), linked from the course Web page. Students import the project, which downloads the content, builds the environment, and opens the project page. This automation clones a Git repository and builds a Docker image that replicates the environment. It puts the student on the project home page (see Figure 5), where they can browse the environment to see what packages are installed, view the README, and inspect code and data.

Figure 5. Project page in Gigantum used to launch and interact with course content.



To get started, the student clicks “Launch Jupyter Lab,” which opens a new window and places them in Jupyter (see Figure 4) to follow lectures, run individual examples, or do their own work. The launch button automates the process of starting a Docker container, launching Jupyter, and opening the lectures directory. At this point, two button clicks have automated more than 10 command lines, with many options and flags, across three different tools: Git, Docker, and Jupyter.

As one works, Gigantum captures a record of activity that it presents as a sequential history of inputs, outputs, and actions. Gigantum monitors the Jupyter (or RStudio) runtime so that it can present plots and graphs (see Figure 6) alongside the code used to generate them (see Figure 7). This history can be browsed even when the project is not running. Each entry also marks a point in time that can be recovered: students can roll back to the code, data, and configuration that generated a result. This includes rolling back to prior versions of software, which is particularly valuable when installing new software breaks an example.

Figure 6. Activity record that captures a plot.



Figure 7. Activity record that shows the R code that generated the plot.



The course relied heavily on Gigantum’s automated versioning to manage changing content, allowing instructors to produce and deliver new lectures, corrections, and course content continuously. I prefer to change and evolve content as I prepare a lecture: I would modify and complete a lecture on the day of the class and then push it to Gigantum. In class, students open the project and see new updates in the synchronization toolbar (see Figure 8), which shows the number of new activity records available from the published branch. Clicking sync brings these changes into the local project and rebuilds it as necessary. For example, when I pushed a lecture for Spark, the updates installed the Spark runtime, the Scala language, and the PySpark package (a sketch of the kind of example this enables follows Figure 8).

Figure 8. Toolbar showing that 14 new updates are ready to be pulled into the project.


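Once an update of this kind installs PySpark, a lecture can include runnable Spark examples. The snippet below is a hypothetical, minimal PySpark word count, not the course’s actual material; it runs Spark in local mode with no cluster required.

    # Hypothetical lecture example: word count on an in-memory RDD.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lecture-example").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(["hello spark", "hello dask", "hello openmp"])
    counts = (lines.flatMap(lambda line: line.split())   # split lines into words
                   .map(lambda word: (word, 1))          # pair each word with a count
                   .reduceByKey(lambda a, b: a + b))     # sum counts per word
    print(sorted(counts.collect()))                      # [('dask', 1), ('hello', 3), ...]

    spark.stop()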

From a Laptop to the Cloud

Before Gigantum, the course used the Amazon cloud through education grants and free education usage, so we had to monitor and be cautious with compute time and Amazon credits. Assignments used a lot of resources: measuring scaleup requires many cores, and Hadoop is only interesting across multiple nodes. We asked students to develop on their laptops and run on the cloud only when collecting performance data for experiments. This was tedious and error-prone because students had to replicate their environments in both places.

Gigantum solves these problems, allowing us to distribute complex environments that work seamlessly across laptops and the cloud. Students install the Gigantum client on their laptop and then have no further work to do. Syncing projects from the cloud to their laptop installs and builds environments; syncing from their laptop pushes local changes so that they can be run on the cloud instance.

Transparently moving between a laptop and the cloud meant that students spent less time debugging and developing on cloud instances. It also mitigated a persistent problem: every semester, a few students forgot to turn off cloud instances and consumed hundreds of dollars of compute time. Projects in Gigantum close in a single step in the UI, with no need to go into a management console to stop or terminate instances.

Conclusion

For the instructors and TAs, the use of Gigantum was an outright success. It eliminated most of the course time spent on the installation and configuration of software. We did spend time early in the semester helping students install Gigantum on their laptops, and there were issues with installing Docker, particularly on Windows. Having Gigantum available made me more ambitious with course topics; I could add examples that had complex dependencies. I included a new section on the Dask parallel programming framework (a minimal example appears below) and expanded Spark to demonstrate both the Scala and PySpark bindings.
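As an illustration of the kind of example the new Dask section could include, the sketch below builds a chunked array computation that runs in parallel across local cores. It is illustrative, not the course’s actual material.

    # Illustrative Dask example: a lazy, chunked array computation.
    import dask.array as da

    # A 20000 x 20000 array split into 2000 x 2000 chunks; nothing is computed yet.
    x = da.random.random((20000, 20000), chunks=(2000, 2000))

    # Building the expression constructs a task graph over the chunks.
    result = (x + x.T).mean()

    # compute() executes the graph in parallel on a local thread pool.
    print(result.compute())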

The students had an overall positive experience with Gigantum once they understood its value. Initially, students did not appreciate the tool because they did not comprehend the complexity of the tasks it automated. As the semester progressed, students gravitated to the platform because it gave them direct access to course content and did away with software installation and configuration when doing homework.

I liked the process of lecturing in Jupyter, but it revealed some limitations of using a data science platform as an educational tool. Notebooks seem modern and interactive, particularly when compared with PowerPoint. However, Gigantum and Jupyter were not designed for teaching courses and could be customized to improve the experience. Some features that would help include:

  • Integration with course management software, such as autograders and plagiarism detection;
  • A way to connect Gigantum with cloud platforms to use educational grants or credits.

Although the student experience is not as seamless as courses built in Google Colaboratory [8] or in custom Web applications such as DataCamp [9], Gigantum is a powerful and flexible tool for delivering complex software to students with a wide variety of skill levels.

References

[1] Gigantum. Gigantum—A way to create and share reproducible data science and research. July 23, 2018. 

[2] Bonwell, C. C. and Eison, J. A. Active learning: creating excitement in the classroom. ASHE-ERIC Higher Education Report No. 1. Washington, DC: The George Washington University, School of Education and Human Development. 1991.

[3] Freeman, S., Eddy, S. L., McDonough, M., Smith, M. K., Okoroafor, N., Jordt, H., and Wenderoth, M. P. Active learning increases student performance in science, engineering, and mathematics. Proceedings of the National Academy of Sciences 111, 23 (2014), 8410-8415.

[4] Barba, L. A., et al. Teaching and Learning with Jupyter. 2019. 

[5] Maher, M. L., Latulipe, C., Lipford, H., and Rorrer, A. Flipped classroom strategies for CS education. In Proceedings of the 46th ACM Technical Symposium on Computer Science Education (2015), 218-223.

[6] Jupyter. A gallery of interesting Jupyter notebooks. GitHub, 2019.

[7] Stein, W. Real-time collaboration with Jupyter notebooks using CoCalc. JupyterCon, 2018.

[8] Frontiera, P. Google Colaboratory as a data science learning environment. Oct. 30, 2018. 

[9] Sheehy, R. Get DataCamp for the classroom for free. DataCamp. Aug. 20, 2019.

About the Author

Randal Burns is the chair of the Department of Computer Science at Johns Hopkins University. His research interests lie in building the high-performance, scalable data systems that allow scientists to make discoveries through the exploration, mining, and statistical analysis of big data. Recently, he has focused primarily on high-throughput neuroscience, but he retains a vigorous interest in high-performance computing and numerical simulations. Burns is a member of, and on the steering committee of, the Kavli Neuroscience Discovery Institute. He is a member of the Institute for Data-Intensive Science and Engineering and is on the steering committee of the Science of Learning Institute. He was a research staff member in storage systems at IBM’s Almaden Research Center from 1996 to 2002. He earned his bachelor’s in geophysics at Stanford University (1993), and his master’s (1997) and doctorate (2000) in computer science at the University of California, Santa Cruz.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

© ACM 2020. 1535-394X/2020/06-3403400 $15.00


