
Learn

I've had a lot of engineering students ask how to move into the field of data science.  Here, I will gather resources to help get people started.

 

     What is data science- I know, "what is data science" posts are a dime a dozen, but establishing a working definition helps provide scope and motivation for future discussions.

     Linear regression- Linear regression is often introduced as a first machine learning project.  The reasons for this are obvious- most people understand it intuitively, it can be compared against the least squares method of linear regression, and it provides a solid foundation to introduce terminology.

     A different look at neural networks- This page is a quick write-up targeted towards people who are comfortable with a bit of linear algebra. It demonstrates how to "unwrap" a neural network so that once a user trains a NN, he or she can implement it as a series of low-level operations rather than the typical "feed forward perceptron" that everyone likes to visualize.

    A semi-analytic approach to Support Vector Machines.  SVM is a great technique for classifying data; here we try to bridge the gap between its intuitive definition and the rigor behind that definition.

    Decision trees and random forests.  A decision tree is a type of machine learning algorithm that consists of a tree-like graph. While decision trees are easy to understand and computationally inexpensive, they are prone to high variance. Random forests are a powerful ensemble technique that can be used to address these shortcomings. Here, we review how decision trees work, introduce random forests and discuss their benefits and limitations, and show how a random forest can be used to evaluate feature importance. We assume basic knowledge of the vocabulary of graph theory, which can be reviewed at https://en.wikipedia.org/wiki/Graph_theory.

    K-Means Clustering is an iterative process that places every point in a dataset into one of K categories, where K is chosen by the user. This is often a preliminary step for data-naive clients (clients who did not develop a plan before collecting data, and may not know what they want out of their data), but it can also be helpful in getting a better idea of the structure of your data in general. Here we will define k-means clustering (KMC) in a technical and qualitative manner, review the advantages and disadvantages of KMC, and cover an example with synthetic data.

    Judging the performance of a machine learning algorithm is vital to ensure confidence in future predictions. Machine learning is a method of analysis that allows us to create models using data. Unless we know how a model will perform on new data, training it is a pointless exercise. Here we will discuss a few techniques to measure error, and the different types of errors.

    Working with missing data.  With the growing amount of data in our world, it is inevitable that some data will become corrupt or go unreported. Often it is impossible to apply an algorithm to such data, and they become a source of errors. Here, I discuss four strategies to deal with missing data: deleting the entire row, replacing with placeholder values, interpolating from nearby values, and interpolating with a machine learning solution. This writeup is presented as a Jupyter Notebook available for running or download on Google Colab.
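
As a quick taste of the missing-data writeup, here is a minimal sketch of the first three strategies in plain Python. It is not the code from the notebook: the function names are made up for illustration, missing entries are assumed to be marked with None, and the fourth strategy (a machine learning solution) is left out.

```python
def drop_missing(rows):
    """Strategy 1: delete every row that contains a missing value (None)."""
    return [row for row in rows if None not in row]


def fill_placeholder(values, placeholder=0.0):
    """Strategy 2: replace each missing value with a fixed placeholder."""
    return [placeholder if v is None else v for v in values]


def interpolate_linear(values):
    """Strategy 3: fill gaps linearly between the nearest known neighbors.

    Gaps at the very start or end have only one known neighbor,
    so this sketch leaves them as None.
    """
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


rows = [(1, 2), (None, 3), (4, None), (5, 6)]
series = [None, 1.0, None, None, 4.0, None]

print(drop_missing(rows))
print(fill_placeholder(series, placeholder=-1.0))
print(interpolate_linear(series))
```

Which strategy is appropriate depends on the data: deleting rows is simple but throws information away, placeholders can bias a model, and interpolation only makes sense when neighboring values are actually related (a time series, for example).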