Some tips to make your visualizations in R more insightful

I am going to explain the plots using the diamonds dataset that ships with ggplot2 in R. Once ggplot2 is loaded with library(ggplot2), the dataset is available directly in RStudio, and str(diamonds) shows its structure:

tibble [53,940 x 10] (S3: tbl_df/tbl/data.frame)
$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
$ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
$ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
$ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
$ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
$ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

The dataset has 10 variables describing features of diamonds such as their cut, color, clarity and price. The remaining variables (x, y, z, depth and table) describe the geometry of the stone and are explained in the figure below.
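For reference, the ggplot2 documentation for diamonds describes x, y and z as length, width and depth in mm, depth as the total depth percentage 2 * z / (x + y), and table as the width of the top of the diamond relative to its widest point. Here is a minimal sketch that recomputes depth from x, y and z to make that definition concrete. The names d and depth_check are introduced here for illustration, and the recomputed values should only match the recorded ones approximately because the measurements are rounded.

```r
library(ggplot2)  # provides the diamonds dataset

# Drop rows with zero dimensions, which would make the ratio undefined
d <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Total depth percentage: z relative to the mean of x and y, times 100
depth_check <- 100 * 2 * d$z / (d$x + d$y)

# Compare recorded and recomputed depth for the first rows
head(data.frame(recorded = d$depth, recomputed = round(depth_check, 1)))
```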

1. Plot multiple columns in a graph

More often than not, we need to compare the values of multiple columns at the same time. In ggplot2, a plot has a single x and a single y aesthetic, so to visualize several columns together we need a workaround: reshaping the data from wide to long format.

For example, if we want to create bar charts of depth, carat and table of diamonds by their color, we can do this by first reshaping our data.

new_diamonds <- reshape2::melt(diamonds[, c('color', 'depth', 'carat', 'table')], id.vars = 1)
head(new_diamonds)
  color variable value
1     E    depth  61.5
2     E    depth  59.8
3     E    depth  56.9
4     I    depth  62.4
5     J    depth  63.3
6     J    depth  62.8

The data is now reshaped to contain 3 columns. Every row has a color, a variable name (depth, carat or table) and the corresponding value.
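The same reshaping can also be done with tidyr instead of reshape2. This is only an alternative sketch, not what the plots in this article were built from; the name long_diamonds is introduced here. Note that pivot_longer orders the rows differently from melt and returns variable as a character column rather than a factor.

```r
library(ggplot2)  # provides the diamonds dataset
library(tidyr)

# One row per (diamond, measured column) pair, keeping color as the id column
long_diamonds <- pivot_longer(
  diamonds[, c("color", "depth", "carat", "table")],
  cols = c(depth, carat, table),
  names_to = "variable",
  values_to = "value"
)

head(long_diamonds)
```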

This data can now be used to create a plot, which compares the values of different columns.

ggplot(new_diamonds) +
  geom_col(aes(x = color, y = value, fill = variable), position = "dodge") +
  labs(title = "Comparison of depth, carat and table size of diamonds by color")

We can see in the above plot that it becomes a lot easier to compare the 3 variables when they are shown together. We can conclude from the plot that the maximum depth is found in diamonds of color ‘E’ and the maximum table value in diamonds of color ‘F’.
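One caveat: as far as I can tell, geom_col draws one bar per row, so with the raw data the bars of a group are drawn on top of each other and the visible height is the largest value in that group. If you would rather compare typical values, you can aggregate before reshaping. Here is a minimal sketch of that idea using the mean; the names means_wide, means_long and p are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

cols <- c("depth", "carat", "table")

# Mean of each column per color: one row per color
means_wide <- aggregate(diamonds[, cols],
                        by = list(color = diamonds$color),
                        FUN = mean)

# Long format: one row per (color, variable) pair
means_long <- reshape2::melt(means_wide, id.vars = "color")

# One bar per color and variable, so nothing is overplotted
p <- ggplot(means_long) +
  geom_col(aes(x = color, y = value, fill = variable), position = "dodge") +
  labs(title = "Mean depth, carat and table of diamonds by color")
p
```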

2. Plot two sets of values on the same plot with different ordinate scales on the left and right

If we try to plot a bar graph of the price and carat columns of the diamonds dataset on the same plot, the carat bars would be tiny in comparison to the price bars. To compare the two properly, we can plot both price and carat on the same plot but with different y axes.

Since the dataset has so many data points that the plot would be too crowded, I decided to take the mean price and mean carat of diamonds by clarity. This is done as follows:

library(dplyr)
# The variable name was lost in the source; mean_by_clarity is a stand-in.
mean_by_clarity <- diamonds %>% group_by(clarity) %>% summarise(mean_price = mean(price), mean_carat = mean(carat))

To plot 2 different columns on two y axes, we can use the function twoord.plot from the plotrix library. This function allows us to plot two sets of values on two different scales. It can draw bar graphs as well as line plots on the two y axes. More details can be found in the plotrix documentation for twoord.plot.

library(plotrix)
# Two arguments whose names were lost in the source (values 5 and 10) are omitted here.
twoord.plot(mean_by_clarity$clarity, mean_by_clarity$mean_price,
            mean_by_clarity$clarity, mean_by_clarity$mean_carat,
            lcol = 4, rcol = 2, ylab = "price", rylab = "carat", xlab = "Clarity",
            type = "l", main = "Price and Carat comparison by Clarity")

We can see that the same chart now shows 2 columns with 2 different scales on the y axes. These plots must be used with caution though, because they are easy to misread. For example, at the 7th clarity it looks as if the price and carat values are the same, but they are not: each line is read against its own scale. These plots are best used to understand the general trend of the data.
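If you prefer to stay within ggplot2, a similar two-axis chart can be sketched with sec_axis. This is an alternative to twoord.plot, not the code behind the chart above. The names by_clarity, k and p are introduced here, and the scale factor k is an arbitrary choice that stretches mean carat onto the price axis.

```r
library(ggplot2)
library(dplyr)

by_clarity <- diamonds %>%
  group_by(clarity) %>%
  summarise(mean_price = mean(price), mean_carat = mean(carat))

# Factor used to draw carat on the price axis (assumed choice for this sketch)
k <- max(by_clarity$mean_price) / max(by_clarity$mean_carat)

p <- ggplot(by_clarity, aes(x = clarity, group = 1)) +
  geom_line(aes(y = mean_price), colour = "blue") +
  geom_line(aes(y = mean_carat * k), colour = "red") +
  # The secondary axis undoes the scaling so its labels read in carats
  scale_y_continuous(name = "Mean price (USD)",
                     sec.axis = sec_axis(~ . / k, name = "Mean carat")) +
  labs(x = "Clarity", title = "Price and carat comparison by clarity")
p
```

The same caution applies here: the point where the two lines cross depends entirely on k and carries no meaning.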

3. Add more dimensions to your plot

The ggplot2 library in R provides many features that make our visualizations more meaningful. We can map additional variables to aesthetics of the graph to get more insight from the data.

For example, in a scatter plot we can specify the size and color of the dots, adding more dimensions. Let’s start by plotting a simple scatter plot between carat and price of diamonds.

ggplot(diamonds) +
  geom_point(aes(x = price, y = carat)) +
  labs(y = "Carat", x = "Price in USD", title = "Price vs Carat of diamonds")

Let’s now add another dimension to the plot by coloring the dots by cut:

ggplot(diamonds) +
  geom_point(aes(x = price, y = carat, color = cut)) +
  labs(y = "Carat", x = "Price in USD", title = "Price vs Carat and cut of diamonds")

We can see that the purple dots are concentrated towards the top of the chart, indicating that most diamonds with a fair cut have higher carat values than diamonds with other cuts in the same price range. The overall trend is increasing, meaning that a diamond with a higher carat value generally has a higher price.

We can add another dimension to this plot by adding a parameter size.

# alpha is a fixed setting rather than a mapped variable, so it sits outside aes()
ggplot(diamonds) +
  geom_point(aes(x = price, y = carat, color = cut, size = depth), alpha = 0.3) +
  scale_size(limits = c(55, 75), breaks = c(55, 60, 64, 68, 70)) +
  labs(y = "Carat", x = "Price in USD", title = "Price vs Carat, Cut and Depth of diamonds")

Here I have decided to focus on depth values between 55 and 75 and to give custom break points within that range, because the depth of most diamonds falls between 60 and 70 and I would like to better understand the trend in those values.
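The choice of limits can be checked against the data. Here is a small sketch that computes the share of diamonds with depth between 60 and 70 and counts the rows outside the 55 to 75 limits. The names share_60_70 and n_outside are introduced here. With the default out-of-bounds handling of a continuous scale, values outside the limits become NA, so those points are dropped from the plot with a warning.

```r
library(ggplot2)  # provides the diamonds dataset

# Share of diamonds whose depth lies in the 60-70 range
share_60_70 <- mean(diamonds$depth >= 60 & diamonds$depth <= 70)

# Number of rows outside the 55-75 limits passed to scale_size()
n_outside <- sum(diamonds$depth < 55 | diamonds$depth > 75)

share_60_70
n_outside
```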

The plot now represents 4 dimensions in 1 plot: price on the x axis, carat on the y axis, color by cut and dot size by depth. We can conclude from the above plot that diamonds with more depth tend to have a higher carat value. Towards the right of the plot we can see many small dots, indicating that depth might not be a deciding factor in the price of a diamond. (More analysis needs to be done to say anything conclusive, though.)

We can add yet another dimension to the plot using facets. Facets partition a plot into a matrix of panels, each panel showing a subset of the data. I decided to facet the above plot by the color of the diamonds.

ggplot(diamonds) +
  geom_point(aes(x = price, y = carat, color = cut, size = depth), alpha = 0.3) +
  scale_size(limits = c(55, 75), breaks = c(55, 60, 64, 68, 70)) +
  labs(y = "Carat", x = "Price in USD", title = "Price vs Carat, Cut, Depth and color of diamonds") +
  # The facet call is missing from the source; facet_wrap(~ color) is an assumed reconstruction.
  facet_wrap(~ color)

We can add yet another dimension in the plot by adding one more field in facet as below:

ggplot(diamonds) +
  geom_point(aes(x = price, y = carat, color = cut, size = depth), alpha = 0.3) +
  scale_size(limits = c(55, 75), breaks = c(55, 60, 64, 68, 70)) +
  labs(y = "Carat", x = "Price in USD", title = "Price vs Carat, Cut, Depth, Color and Clarity of diamonds") +
  # The facet call is missing from the source; this grid layout is an assumed reconstruction.
  facet_grid(clarity ~ color)

This plot now represents 6 columns of the dataset in a single plot. I1, SI1, SI2 etc. are the clarity values and D, E, F etc. are the color codes.

We can make the following conclusions from the above plot:

  1. Most diamonds of the clarity I1 have low prices and fair cut is quite common in this clarity group.
  2. Not a lot of diamonds have clarity IF and color I or J.
  3. Plots for clarity SI1 and SI2 are quite dense indicating that they are common clarities to have in diamonds.
  4. For the clarity values VVS1 and IF, Ideal and Premium cuts are most commonly available.
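Conclusions read off a dense faceted plot are worth checking against simple counts. Here is a minimal sketch with base R tables; the names clarity_by_color and cut_share are introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Number of diamonds for every clarity/color combination
clarity_by_color <- table(clarity = diamonds$clarity, color = diamonds$color)
clarity_by_color

# Share of each cut within every clarity group (each row sums to 1)
cut_share <- prop.table(table(clarity = diamonds$clarity, cut = diamonds$cut),
                        margin = 1)
round(cut_share, 2)
```

The first table shows how sparse or dense each panel is (conclusions 2 and 3), and the second shows which cuts dominate within each clarity group (conclusions 1 and 4).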

