Unstructured Data Analytics
This portfolio captures the work I completed for the course Unstructured Data Analytics at Carnegie Mellon University in Spring 2022. Companies, governments, and other organizations now collect massive amounts of data such as text, images, audio, and video, and turning this heterogeneous mess of data into actionable insights has become a challenge. The course addresses this problem by first examining the possible structure present in the data through visualization and other exploratory methods. It then applies several Machine Learning techniques that exploit this structure to make predictions. To view my course repository on GitHub, please click here.
Key Learnings
From this course, I gained an extensive understanding of Natural Language Processing, Machine Learning, and Neural Networks. Key focus areas included frequency and co-occurrence analysis using metrics such as phi-squared; topic modeling using Latent Dirichlet Allocation; dimensionality reduction techniques such as Principal Component Analysis, Isomap, and t-Distributed Stochastic Neighbor Embedding; clustering methods such as K-Means, Gaussian Mixture Models (including hyperparameter optimization), and DP-means; classification models such as K-NN and Random Forests; and image analysis with CNNs and time series analysis with RNNs.
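As a small illustration of the co-occurrence analysis mentioned above, here is a minimal sketch of the phi-squared statistic for two terms across a collection of documents. It is not taken from the course materials: the function names `cooccurrence_counts` and `phi_squared` and the toy documents are hypothetical, and each document is assumed to be represented as a set of terms.

```python
def cooccurrence_counts(docs, term_a, term_b):
    """Count documents by presence/absence of two terms.

    `docs` is assumed to be an iterable of sets of terms (hypothetical format).
    Returns (both, only_a, only_b, neither).
    """
    both = only_a = only_b = neither = 0
    for doc in docs:
        has_a, has_b = term_a in doc, term_b in doc
        if has_a and has_b:
            both += 1
        elif has_a:
            only_a += 1
        elif has_b:
            only_b += 1
        else:
            neither += 1
    return both, only_a, only_b, neither


def phi_squared(both, only_a, only_b, neither):
    """Phi-squared for a 2x2 contingency table of term co-occurrence.

    Ranges from 0 (no association) to 1 (perfect association).
    Returns 0.0 when a marginal total is zero and the statistic is undefined.
    """
    denom = (
        (both + only_a)        # documents containing term A
        * (only_b + neither)   # documents without term A
        * (both + only_b)      # documents containing term B
        * (only_a + neither)   # documents without term B
    )
    if denom == 0:
        return 0.0
    return (both * neither - only_a * only_b) ** 2 / denom


# Toy example: the two terms always appear together or not at all.
toy_docs = [{"opioid", "pain"}, {"opioid", "pain"}, {"spam"}, {"spam"}]
score = phi_squared(*cooccurrence_counts(toy_docs, "opioid", "pain"))
```

A score near 1 suggests the two terms tend to appear in the same documents, while a score near 0 suggests they occur independently of each other.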
Portfolio
The projects and assignments below were completed in Python on real-world data, using tools such as NLTK, scikit-learn, spaCy, and PyTorch.
Assignments
To view the Jupyter Notebook for each assignment and a subset of the data used, please click on the folder hyperlinks below.
I. Text Analysis, Entity Recognition and Co-Occurrence Analysis (spaCy) on 100 Books
II. Spam Identification (Clustering) & Identifying Latent Purposes in Mobile Apps (LDA)
III. Spam Email Classification (K-NN & Random Forests)
Final Project
Brief Description: “Facilitating Community-Informed Opioid Prescription Guidelines.” This project conducts an in-depth analysis of public comments on the draft of CDC’s Clinical Practice Guideline for Prescribing Opioids 2022. These comments illuminate the stories, concerns, and experiences of patients, providers, and the community, and help delineate the impact of the change in guidelines. We therefore used advanced NLP techniques to conduct a sentiment analysis and reveal the priorities and preferences of those impacted by the clinical guidelines.
Please click on the hyperlinks below to view the project details: