What is data science?

Ask a dozen people that question and you will probably get two dozen answers! They may come as lists, overly complicated Venn diagrams, trendy infographics, or the aesthetically pleasing but relatively uninformative word cloud. Still, it is an essential question to answer, not so much to establish some overarching philosophy on the topic as to motivate and provide context for future discussions. This post will discuss what kinds of questions data science can answer and what kind of training data scientists have, and will finish by surveying a few basic data science techniques.

Data scientists can answer the following general types of questions: Is this X or Y? How much? And is this weird? The first is a classification question: it assigns a sample to one of two or more categories. An example would be “is this email spam?” The second, “how much?”, is a regression problem. Rather than predicting a category, we predict a quantity, along the lines of “given ambient conditions, engine speed, and current fuel/oil conditions, what is the rate of pollutant production?” Lastly, anomaly detection tells us whether a sample is aberrant. Examples include cancer detection and credit card fraud detection.
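To make the three question types a little more concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption made for illustration: the toy numbers, the function names (classify, fit_line, is_anomaly), and the deliberately simple methods (nearest class mean, a least-squares line, and a z-score threshold). Real projects would normally use richer features and a dedicated library.

```python
from statistics import mean, stdev

# Hypothetical toy data, invented purely for illustration.
SPAM_COUNTS = [8, 10, 12]   # exclamation marks seen in known spam emails
HAM_COUNTS = [0, 1, 2]      # exclamation marks seen in known normal emails


def classify(exclamations):
    """'Is this X or Y?': label an email by whichever class mean is closer."""
    to_spam = abs(exclamations - mean(SPAM_COUNTS))
    to_ham = abs(exclamations - mean(HAM_COUNTS))
    return "spam" if to_spam < to_ham else "not spam"


def fit_line(xs, ys):
    """'How much?': return (slope, intercept) of the least-squares line."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def is_anomaly(value, history, threshold=3.0):
    """'Is this weird?': flag values far from the historical mean.

    The distance is measured in standard deviations (a z-score).
    """
    return abs(value - mean(history)) / stdev(history) > threshold


print(classify(9))                               # classification
print(fit_line([1, 2, 3], [2, 4, 6]))            # regression
print(is_anomaly(500, [20, 22, 19, 21, 18]))     # anomaly detection
```

The point of the sketch is the shape of each answer: a category, a number, and a yes/no flag.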

Most bloggers will insist that I also mention that data science can answer questions involving clustering or reinforcement learning. While this is true, these are more advanced techniques that can wait until the reader is ready for further training.

Data science training is another complex issue. There is no official licensing, training, or certification. I’ve heard stories from recruiters about candidates who identified themselves as data scientists but had only limited Excel skills! On the other hand, PhD physicists have made lucrative careers in the private sector by applying their skills in statistics, advanced math, and computer programming to improving services similar to Amazon or GrubHub. If you want to be a data scientist, you are taking a good step forward by trying to figure out just what a data scientist is! Other subjects to learn include computer science (programming concepts, familiarity with Linux, etc.), math (linear algebra, calculus, statistics), and communication skills (creating visuals, summarizing data, and drawing conclusions).

As for obtaining formal training, there are many books and online courses. Some certifications are offered, but there is no accrediting board for them. You may be paying a lot of money for a piece of paper that means very little! One thing I can recommend is to specialize in a certain industry. Data science skills are vital, but domain knowledge can give you an edge that differentiates you from the competition.

In addition to a strong foundation in computer science and mathematics, every data scientist should have a few tools in their toolbox: basic statistical tools, principal component analysis, and support vector machines.

Statistical tools include things such as confidence intervals, correlation matrices, Bayes’ theorem, regression, and knowledge of certain probability distributions.
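Two of these tools are short enough to sketch directly. The example below is only an illustration under stated assumptions: the function names and all of the numbers are made up, and the confidence interval uses a normal approximation (for small samples a t-distribution would usually be the better choice).

```python
from statistics import NormalDist, mean, stdev


def bayes_posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem: P(H | E) from P(H), P(E | H), and P(E | not H)."""
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


def confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean of `sample`."""
    z = NormalDist().inv_cdf(0.5 + level / 2)        # about 1.96 for 95%
    half_width = z * stdev(sample) / len(sample) ** 0.5
    centre = mean(sample)
    return centre - half_width, centre + half_width


# Hypothetical numbers: 20% of mail is spam, the word "free" appears in
# 60% of spam and in 5% of normal mail. How likely is a "free" email spam?
print(bayes_posterior(prior=0.2, likelihood=0.6, false_positive_rate=0.05))

# Hypothetical measurements of some quantity.
print(confidence_interval([10, 12, 11, 13, 9]))
```

With these made-up inputs the posterior comes out at roughly 0.75, which shows how a fairly weak piece of evidence can still shift a 20% prior a long way.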

Principal component analysis (PCA) allows a data scientist to reduce the dimensionality of a data set. When you are just starting out, your inputs might be vectors of no more than 100 dimensions. However, some data sets have thousands of dimensions! This increases both computing time and analytic complexity. PCA helps reduce the dimensionality, often with only minimal information loss, by keeping the directions along which the data varies most.
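Here is a rough, dependency-free sketch of the idea: it finds only the first principal component, using power iteration on the covariance matrix. The function name, the toy data, and the choice of power iteration are my assumptions for illustration. In practice you would more likely reach for a numerical library, and this simple version assumes the data is not constant and that the starting vector is not orthogonal to the main direction.

```python
def first_principal_component(rows, iterations=100):
    """Return (means, direction, scores) for the first principal component.

    rows is a list of equal-length numeric lists (one list per sample).
    """
    n, d = len(rows), len(rows[0])

    # Centre each column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly multiply by cov and renormalise, which
    # converges towards the eigenvector with the largest eigenvalue.
    direction = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in product) ** 0.5
        direction = [x / norm for x in product]

    # One number per sample: its coordinate along the principal direction.
    scores = [sum(c[j] * direction[j] for j in range(d)) for c in centred]
    return means, direction, scores


# Hypothetical 2-D data lying on the line y = 2x.
data = [[1, 2], [2, 4], [3, 6], [4, 8]]
means, direction, scores = first_principal_component(data)
print(direction)   # unit vector along the line
print(scores)      # each 2-D point reduced to a single number
```

Because the toy points lie exactly on a line, a single score per point is enough to rebuild them (mean plus score times direction), which is the "minimal information loss" case in its purest form.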

A support vector machine (SVM) is a powerful yet easy-to-explain technique that divides the data with the “best” hyperplane possible, meaning the one with the widest margin between the classes. It can be used for classification or regression. This may not sound too exciting when you consider 2-D data separated by a line; however, with techniques such as the “kernel trick” and higher-dimensional data, SVMs can quickly provide high-quality predictions.
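For the curious, a linear SVM classifier can be sketched in a few lines by running sub-gradient descent on the regularised hinge loss. This is only a sketch under assumptions: the function names, the hyperparameters (lam, epochs, lr), and the toy points are all invented here, there is no kernel trick, and production code would normally rely on a well-tested library instead.

```python
def train_linear_svm(points, labels, lam=0.01, epochs=200, lr=0.1):
    """Fit weights w and bias b for a linear SVM.

    labels must be +1 or -1. Uses sub-gradient descent on the
    L2-regularised hinge loss; lam, epochs, and lr are illustrative defaults.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularisation step: shrink the weights slightly,
            # which favours a wider margin.
            w = [wi - lr * lam * wi for wi in w]
            # Hinge-loss step: only points inside the margin (or on the
            # wrong side of the hyperplane) push the hyperplane.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical, linearly separable 2-D data.
points = [[2, 2], [3, 3], [2, 3], [-2, -2], [-3, -3], [-2, -3]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
print(predict(w, b, [4, 4]), predict(w, b, [-4, -4]))
```

Only the points closest to the boundary ever trigger the hinge-loss update once training settles down; those are the "support vectors" that give the method its name.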

Like many complicated subjects, data science is hard to explain in only 600-some words. Still, we have tried to scratch the surface here. We discussed what kinds of questions data science can answer and what kind of training data scientists have, and surveyed a few basic data science techniques. Future posts will go into further detail about these and other topics, so check back soon!