
BigData

This portfolio captures the work I completed for the course Big Data and Large Scale Computing at Carnegie Mellon University in Fall 2021. The work involved hands-on experience with MapReduce and Apache Spark on real-world datasets. Each of the assignments below reflects a thorough grounding in the technologies and best practices used in big data machine learning. To view my course repository on GitHub, please click here.

Key Learnings

From the course Big Data & Large Scale Computing, I gained the knowledge and practical skills to develop big data and machine learning solutions with state-of-the-art tools, particularly those in the Spark ecosystem, with a focus on the programming models of MLlib, GraphX, and SparkSQL.

Portfolio

Here are the assignments that I completed during the course.

Assignments

To view code for each, please click on the hyperlinks below.

I. Introduction to PySpark and RDDs

The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool for analyzing this type of data. In this homework, we will write code that calculates the most common words in the Complete Works of William Shakespeare retrieved from Project Gutenberg.
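The core word-count logic can be sketched in plain Python; the Spark version of the assignment expresses the same steps as flatMap, map, and reduceByKey transformations over an RDD of lines. Function and variable names here are illustrative, not the assignment's actual code:

```python
import re
from collections import Counter

def top_words(text, n=5):
    """Tokenize, lowercase, and count word frequencies.
    In Spark: flatMap(line -> words), map(word -> (word, 1)),
    reduceByKey(add), then takeOrdered by count."""
    words = re.findall(r"[a-z']+", text.lower())  # flatMap: split lines into words
    counts = Counter(words)                       # map + reduceByKey in one step
    return counts.most_common(n)                  # takeOrdered by descending count
```

The Counter collapses the map and reduce phases that Spark would distribute across partitions of the Shakespeare corpus.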

II. Linear Regression

This assignment covers a common supervised learning pipeline, using a modified version of the Million Song Dataset from the UCI Machine Learning Repository. Our goal is to train a linear regression model to predict the release year of a song given a set of audio features.
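The model at the heart of that pipeline is ordinary least squares. A minimal in-memory sketch with NumPy's least-squares solver is below; the assignment itself trains at scale with gradient descent and MLlib, and the function name here is an assumption for illustration:

```python
import numpy as np

def fit_linear_regression(X, y):
    """Ordinary least squares with an intercept term, solved
    directly on a small in-memory matrix (at scale, MLlib fits
    the same model iteratively)."""
    A = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a bias column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)     # minimize ||Aw - y||^2
    return w  # w[0] is the intercept, w[1:] are the feature weights
```

For the song data, X would hold the audio features and y the (shifted) release years.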

III. Click-Through Rate Prediction

This assignment covers the steps for creating a click-through rate (CTR) prediction pipeline. We will work with the Criteo Labs dataset that was used for a Kaggle competition.
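Two ideas from such a pipeline can be sketched compactly: the hashing trick for one-hot encoding high-cardinality categorical features, and the logistic function that turns a weighted sum into a click probability. This is a small illustrative sketch under assumed names (hash_features, predict_ctr, dim), not the assignment's code:

```python
import math

def hash_features(raw, dim=32):
    """One-hot encode categorical (field, value) pairs into a
    fixed-length vector via the hashing trick, so the feature
    space stays bounded regardless of vocabulary size."""
    x = [0.0] * dim
    for field, value in raw.items():
        x[hash((field, value)) % dim] = 1.0  # collisions are tolerated
    return x

def predict_ctr(w, x):
    """Logistic-regression click probability: sigmoid(w . x)."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))
```

With all-zero weights the model predicts 0.5 for any input; training adjusts w to push predictions toward the observed click labels.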

IV. Principal Component Analysis

This assignment delves into exploratory analysis of neuroscience data, specifically using principal component analysis (PCA) and feature-based aggregation. We will use a dataset of light-sheet imaging recorded by the Ahrens Lab at Janelia Research Campus, and hosted on the CodeNeuro data repository.
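PCA itself reduces to an eigendecomposition of the data's covariance matrix. A minimal NumPy sketch of the projection step is below; the assignment applies the same idea to the light-sheet imaging data at scale, and the function name here is illustrative:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components,
    computed from the eigendecomposition of the sample covariance."""
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = Xc.T @ Xc / (X.shape[0] - 1)      # sample covariance matrix
    vals, vecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :k]              # columns for the k largest eigenvalues
    return Xc @ top                         # scores in the reduced space
```

Because the data are centered first, each column of the projected scores sums to zero, which is a handy sanity check.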

Final Project

Brief Description: The final project delves into exploratory analysis and building predictive models using the Yelp academic dataset. We will explore machine learning tasks in the context of a real-world data set using big data analysis tools. It contains the following parts: