Blog posts

The blog posts are also shared on R-bloggers.

Principal Component Analysis Mark Zwart • 10-12-2017

Principal Component Analysis (PCA) is a method for reducing a data set with a large number of variables to a smaller set of new variables, ‘juicing’ as much of the information as possible out of the whole set of variables. In data science it is mostly used to achieve one or more of the following goals:

Reducing the number of variables in a dataset reduces the number of degrees of freedom of a statistical model, which in turn reduces the risk of overfitting the model.
Machine learning algorithms generally run faster when fewer variables are included.
It can simplify the interpretation of data by showing which variables play the biggest role in describing the data set.

In this tutorial I’ll explain the concept behind Principal Component Analysis, and with an example I’ll show you how to perform a PCA, how to choose the principal components and how to interpret them. Read more…

You can download the script here.

Mining Alice's Wonderland Mark Zwart • 31-10-2017

In this tutorial we’ll be text-mining Lewis Carroll’s Alice’s Adventures in Wonderland using the gutenbergr, tidytext and ggplot2 libraries. I’ve assumed that you know the basics of the tidyverse and ggplot2 libraries. First I’ll discuss the concepts that drove the script, after which I’ll jump into the scripting of these concepts and their results. Read more…

You can download the script here.