BigData
This portfolio captures the work I completed for the course Big Data and Large Scale Computing at Carnegie Mellon University in Fall 2021. The work involves hands-on experience with MapReduce and Apache Spark on real-world datasets. Each of the assignments below reflects a thorough grounding in the technologies and best practices used in big data machine learning. To view my course repository on GitHub, please click here.
Key Learnings
From the course Big Data & Large Scale Computing, I gained the knowledge and practical skills to develop big data/machine learning solutions with state-of-the-art tools, particularly those in the Spark environment, with a focus on the programming models in MLlib, GraphX, and Spark SQL.
Portfolio
Here are the assignments I completed over the course of the class.
Assignments
To view code for each, please click on the hyperlinks below.
I. Introduction to PySpark and RDDs
The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this homework, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg.
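The word-count logic can be sketched in plain Python, with each step annotated with the RDD transformation it mirrors (in PySpark this would be a `flatMap` / `map` / `reduceByKey` pipeline); the sample lines here are illustrative, not the actual Gutenberg text:

```python
# Pure-Python sketch of the Spark word-count pipeline.
# In PySpark: lines.flatMap(tokenize).map(lambda w: (w, 1)).reduceByKey(add)
import re
from collections import Counter

def word_counts(lines):
    """Return a word -> count mapping, mirroring flatMap + reduceByKey."""
    counts = Counter()
    for line in lines:                                # each RDD element
        words = re.findall(r"[a-z']+", line.lower())  # flatMap: line -> words
        counts.update(words)                          # (word, 1) pairs, summed by key
    return counts

lines = ["To be, or not to be", "that is the question"]
top = word_counts(lines).most_common(2)  # → [('to', 2), ('be', 2)]
```

In Spark, the same `reduceByKey` aggregation happens per partition before shuffling, which is what makes the word count scale to the full Complete Works.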
II. Linear Regression
This assignment covers a common supervised learning pipeline, using a modified version of the Million Song Dataset from the UCI Machine Learning Repository. Our goal is to train a linear regression model to predict the release year of a song given a set of audio features.
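The core idea can be shown with a minimal pure-Python sketch: fit y ≈ w·x + b by gradient descent on squared error. The assignment does this at scale with Spark and many audio features; the single feature, learning rate, and toy data below are illustrative assumptions only:

```python
# Minimal gradient-descent linear regression (one feature, toy data).
def fit_linear(xs, ys, lr=0.01, epochs=5000):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # gradients of MSE with respect to w and b
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# toy data: one audio feature vs. (release year - 1900), exactly y = 10x + 50
xs = [1.0, 2.0, 3.0, 4.0]
ys = [60.0, 70.0, 80.0, 90.0]
w, b = fit_linear(xs, ys)  # converges near w = 10, b = 50
```

In the Spark version, the per-example gradient terms are computed in parallel across the RDD and summed with a reduce, which is what makes the same update rule scale.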
III. Click-Through Rate Prediction
This assignment covers the steps for creating a click-through rate (CTR) prediction pipeline. We will work with the Criteo Labs dataset that was used for a recent Kaggle competition.
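A key step in a CTR pipeline for high-cardinality categorical data like Criteo's is encoding features, often via the hashing trick, before feeding them to logistic regression. The sketch below shows that idea in plain Python; the bucket count, feature names, and values are illustrative assumptions, not the assignment's exact setup:

```python
# Feature hashing + logistic prediction, sketched without Spark.
import math

def hash_features(raw, num_buckets=2 ** 10):
    """Map (field, value) pairs to a sparse {bucket_index: count} vector.
    Note: Python's hash() is salted per process; a stable hash (e.g.
    hashlib) would be used for reproducibility across runs."""
    vec = {}
    for field, value in raw.items():
        idx = hash((field, value)) % num_buckets
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

def predict_ctr(weights, vec):
    """Logistic-regression click probability for a hashed example."""
    z = sum(weights.get(i, 0.0) * v for i, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))

example = {"site": "news.example.com", "ad_id": "1234", "device": "mobile"}
p = predict_ctr({}, hash_features(example))  # zero weights → p = 0.5
```

Hashing fixes the feature-vector dimension up front, so no dictionary of all categorical values needs to be built or broadcast across the cluster.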
IV. Principal Component Analysis
This assignment delves into exploratory analysis of neuroscience data, specifically using principal component analysis (PCA) and feature-based aggregation. We will use a dataset of light-sheet imaging recorded by the Ahrens Lab at Janelia Research Campus, and hosted on the CodeNeuro data repository.
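The essence of PCA is: center the data, form the covariance matrix, and take its top eigenvectors as the principal components. The assignment applies this to high-dimensional imaging data with Spark; the closed-form 2-D version below is only a sketch of the idea on made-up points:

```python
# First principal component of 2-D data, in closed form.
import math

def first_pc(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # 2x2 covariance matrix [[a, b], [b, c]]
    a = sum(x * x for x, _ in centered) / n
    c = sum(y * y for _, y in centered) / n
    b = sum(x * y for x, y in centered) / n
    # top eigenvalue of a symmetric 2x2 matrix
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    # corresponding eigenvector; handle the diagonal case b == 0
    vx, vy = (b, lam - a) if b != 0 else ((1.0, 0.0) if a >= c else (0.0, 1.0))
    norm = math.hypot(vx, vy)
    return vx / norm, vy / norm

# points spread mostly along y = x, so the first PC is close to (1,1)/sqrt(2)
pc = first_pc([(0, 0), (1, 1.1), (2, 1.9), (3, 3.0)])
```

Projecting each sample onto the top few components is the feature-based aggregation step: it compresses each high-dimensional recording into a handful of numbers that capture most of the variance.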
Final Project
Brief Description: The final project delves into exploratory analysis and building predictive models using the Yelp academic dataset. We will explore machine learning tasks in the context of a real-world data set using big data analysis tools. It contains the following parts:
- Part 1: Exploratory Data Analysis
- Part 2: Prediction using tree ensemble methods
- Part 3: Collaborative filtering for recommendation
- Part 4: Topic modeling for text reviews
- Part 5: Word2Vec for text reviews
- Part 6: Frequent pattern mining using FP-Growth algorithm