Sometime around 1950, quality guru Kaoru Ishikawa proposed seven basic quality tools. One of these seven tools was the scatter plot.
A scatter plot is used to show the relationship between two variables visually. A simple academic example is the relationship between outside temperature and ice cream sales: a scatter plot could show that as temperature increases, ice cream sales increase.
What has changed since 1950?
Since 1950, computing power has changed drastically. However, even today, textbooks and quality literature are still stuck with a concept that was developed in 1950 for drawing the plot by hand.
Excel and Minitab are commonly used by quality professionals to visualize data. Both are commercial software. There is also open source software such as R, which can help us analyze and visualize data.
Using R for Scatter Plots:
Why use R? The most important thing is that R is free. R has a steep learning curve, but once you realize its power, you will start loving this software. Let’s take an example to see what R can do when it comes to scatter plots.
Example:
Let’s take an example data set that provides the population, GDP per capita and life expectancy of all countries. Each country has these values for 12 separate years. See a snapshot of this data below:
Based on this data, the task is to determine whether there is any relationship among these variables.
Conventionally, a scatter plot shows just two variables. In this case, we could plot GDP per capita against life expectancy to show that richer countries have higher life expectancy. So conventionally we would have created the following scatter plot.
On the X axis we have GDP per capita, which ranges from a minimum of $241 to a maximum of $113,523. On the Y axis we have life expectancy in years.
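The conventional two-variable plot described above can be sketched in base R. This is only a sketch: it assumes the data resembles the data set shipped in the gapminder package (an assumption, since the article does not name its source), and the column names gdpPercap and lifeExp come from that package, not from the article.

```r
# Assumption: the 'gapminder' package provides data like that described
# in the article (country, continent, year, lifeExp, pop, gdpPercap).
library(gapminder)

# One point per country per year: GDP per capita (x) vs life expectancy (y).
plot(gapminder$gdpPercap, gapminder$lifeExp,
     xlab = "GDP per capita ($)",
     ylab = "Life expectancy (years)",
     main = "All countries, all years")

# Check the range of the X axis against the figures quoted in the text.
gdp_range <- range(gapminder$gdpPercap)
print(gdp_range)
```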
Since the above plot mixes 12 years of data for each country, it does not support any clear conclusion. Let’s use the data for just one year (2007) and see if there is any relationship between GDP per capita and life expectancy. See the scatter plot for that below.
The above graph does show a relationship between GDP per capita and life expectancy: richer countries tend to have higher life expectancy.
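Restricting the plot to a single year might look like the following sketch. As before, it assumes the gapminder package as a stand-in for the article's data; the name gap2007 is a hypothetical helper introduced here for illustration.

```r
# Assumption: 'gapminder' stands in for the article's data set.
library(gapminder)

# Keep only the rows for 2007, so each country appears exactly once.
gap2007 <- subset(gapminder, year == 2007)

plot(gap2007$gdpPercap, gap2007$lifeExp,
     xlab = "GDP per capita ($)",
     ylab = "Life expectancy (years)",
     main = "All countries, 2007")
```

Because GDP per capita is heavily skewed, adding log = "x" to the plot() call is one optional way to spread out the low-income countries.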
What about adding more variables to the plot:
Using packages such as ggplot2, we can make this scatter plot show more than two variables. We can color each point to show the continent, and we can change the size of each dot to show the population of the country. That way we can show four variables in one plot.
In the chart below, the data was filtered for the year 2007 only, and for the continents of Europe and Africa. Green dots are for Europe and brown dots for Africa. The graph very powerfully conveys the message that in the year 2007, African countries had lower GDP per capita and life expectancy compared to European countries.
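A four-variable plot of this kind could be sketched with ggplot2 as follows. The gapminder package is again an assumed stand-in for the article's data, gap_sub and p are hypothetical names, and the exact colors ("darkgreen", "brown") are a guess at the ones used in the original chart.

```r
# Assumption: 'gapminder' stands in for the article's data set.
library(gapminder)
library(ggplot2)

# Filter to 2007 and to the two continents being compared,
# then drop the factor levels that are no longer used.
gap_sub <- droplevels(
  subset(gapminder, year == 2007 & continent %in% c("Europe", "Africa"))
)

# x = GDP per capita, y = life expectancy,
# color = continent, point size = population.
p <- ggplot(gap_sub,
            aes(x = gdpPercap, y = lifeExp,
                colour = continent, size = pop)) +
  geom_point(alpha = 0.7) +
  scale_colour_manual(values = c(Europe = "darkgreen", Africa = "brown")) +
  labs(x = "GDP per capita ($)",
       y = "Life expectancy (years)",
       colour = "Continent",
       size = "Population",
       title = "Europe and Africa, 2007")

print(p)
```

Mapping continent and population inside aes() lets ggplot2 build both legends automatically, which is what makes the extra variables readable.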
….. to be continued …