A Powerful Data Analysis Tool for one-click EDA and Inferential Statistics

Why do you need a free tool like CyberDeck again?

1. What is the actual problem of Data Analysis and Data Science?

Are you tired of writing hundreds of lines of code again and again? Are you tired of not being able to separate the wheat from the chaff? Tired of seeing Data Science projects fail due to too many dependencies and lack of communication between the different teams?

Then your toolkit is missing a powerful Data Analysis tool that can perform end-to-end Data Science. In this article, we will explore one of the best Data Analysis tools out there.

Whether you are a Data Scientist, Product Manager, Professor, or Advisor, consider this figure, attributed to McKinsey, Gartner, and VentureBeat: “87% of Data Science projects fail.”

And do you know why? It’s simply too time-consuming. By the time you have planned the project, made the necessary preparations, and implemented it, it is often too late.

If this scenario sounds familiar to you, then read on. Otherwise, this article is not for you, and I suggest you don’t let it waste your time.

Previously, we saw how to use the CyberDeck platform to perform end-to-end Machine Learning and Time Series Forecasting (univariate and multivariate). If you have missed them, you can find them here.

  1. End-to-End AutoML and Model Explainability
  2. End to End Time Series EDA and Forecasting

In the first article, we performed some basic EDA/Dashboarding and then ran end-to-end Machine Learning with the click of a mouse. We also visualized interpretable AI through feature importance, model analysis, and What-If analysis.

But those who have spent time in Data Science know that Machine Learning is one of the smaller parts of any Data Science project.

Most of the time goes into Data Cleaning, Exploratory Data Analysis, Data Processing, Statistical tests, etc. We will cover two of them in this article.

First — A gift for you

A handy reference book is useful if you are new to the Data Science/Analytics domain. We have compiled an A-Z list of Data Science and Machine Learning glossaries in this e-book. Download it for free.

Overview of CyberDeck: the free, no-code Data Analysis tool

In this article, we will see how to use CyberDeck, the powerful Data Analysis tool, to perform basic and advanced Exploratory Data Analysis. Last but not least, we will perform multiple Inferential statistical tests like T-tests, ANOVA, Chi-Square, etc., again with the click of a mouse.

For this example, I will primarily use the Titanic data, as almost everyone knows it by heart and has used it at some point. When we reach the Inferential Statistical tests section, I might need to switch datasets so that the tests make sense.

Let’s jump right in.

2. Exploratory Data Analysis

a) Variables tab

Since I showed how to load the data and use the Overview and Pivot Chart tabs of the EDA section in the previous articles, I will go directly to the Variables tab of this Data Analysis tool and start the analysis from there.

EDA Univariate Analysis (Ref: CyberDeck app)

So here, it’s pretty basic stuff. You select any column, and the tool shows you the distribution of that variable. The bottom table can show you aggregated values at any level. So now, let’s move on to the Interactions tab for the next step of our Data Analysis.

b) Interactions Tab

2d Scatter plot (Ref: CyberDeck app)
3d Scatter plot (Ref: CyberDeck app)
2d Histogram and Boxplots (Ref: CyberDeck app)

So you see that in the Interactions tab, we can plot all kinds of interactions in one go just by selecting the appropriate columns. Next, let’s move to the Correlations tab in this Data Analysis tool.

c) Correlations tab

In the Correlations tab, we can select multiple variables at once and see their correlation matrix. We can choose Pearson, Kendall, or Spearman as the correlation coefficient.

Correlation matrix (Ref: CyberDeck app)
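If you want to see what the three coefficients actually measure, here is a minimal pure-Python sketch. The toy values and the helper names (`pearson`, `ranks`, `spearman`, `kendall_tau`) are my own illustrative assumptions; this is not CyberDeck's implementation, and the Kendall helper assumes there are no tied values.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant) pairs / all pairs. Assumes no ties."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)

# Toy data: y grows with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(pearson(x, y), spearman(x, y), kendall_tau(x, y))
```

On this toy data the two rank-based coefficients report a perfect monotonic relationship, while Pearson stays slightly below 1 because the relationship is not a straight line.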

d) Summary Statistics

In the first table, we can see every kind of summary possible for a particular column which is a helpful starting point for further Data Analysis. The possibilities will change depending on the type of column selected (numeric or categorical).

Titanic Age Statistics (Ref: CyberDeck app)

Here we get all the summary stats for the Age variable like mean, standard deviation, different percentiles, etc. If I expand the dropdown, it contains a lot of other options.
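For comparison, the same kind of summary can be sketched in a few lines with Python's standard `statistics` module. The `ages` list is a made-up sample, not the real Titanic column, and the choice of the `"inclusive"` quantile method is an assumption; the app may compute percentiles differently.

```python
import statistics

ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]  # made-up sample values

summary = {
    "count": len(ages),
    "mean": statistics.mean(ages),
    "std": statistics.stdev(ages),      # sample standard deviation (divides by n - 1)
    "min": min(ages),
    "median": statistics.median(ages),
    "max": max(ages),
}

# Quartiles by linear interpolation between the sorted data points.
q1, q2, q3 = statistics.quantiles(ages, n=4, method="inclusive")
summary.update({"25%": q1, "50%": q2, "75%": q3})

for name, value in summary.items():
    print(f"{name:>6}: {value}")
```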

Options for other numeric statistics (Ref: CyberDeck app)

Now, if I had selected a Categorical column like Sex, the options would have changed.

Options for Categorical columns (Ref: CyberDeck app)

e) Outlier Detection

In the next section, we can detect outliers for any column. In this example, I will select the Parch column (number of parents or children) and look for outliers.

Outlier Detection (Ref: CyberDeck app)

So we see that values 7 and 10 are treated as outliers for the number of parents or children.
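The article does not say which rule the app uses to flag outliers. One common choice is the 1.5 x IQR rule, sketched below as an assumption; the function name `iqr_outliers` and the toy `parch` values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Toy column: mostly 0-2 relatives, with one unusually large family.
parch = [0, 0, 0, 1, 0, 2, 0, 1, 0, 0, 6, 0, 1, 0, 2, 0]
print(iqr_outliers(parch))
```

With a skewed count column like this one, the IQR is small, so even moderately large values get flagged. That is worth keeping in mind before dropping any "outliers".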

f) Text/Categorical column Data Analysis

This is one of the more advanced analyses. I want to understand how mean survival differed between cabins. Note that some Titanic cabins were reserved for 1st class passengers and were very close to the lifeboats, while others were below deck and far from them. Does that make a difference?

We select Cabin as the categorical column, Survived as the value column, and Mean as the aggregation function.

Titanic Survival vs. Cabin number (Ref: CyberDeck app)

So we see that all the passengers in the cabins on the left have very slim chances of survival. The survival probability increases as we move to the right. It becomes around 50% for the cabins like C22, A14, etc. The mean survival rises to 100% for the cabins on the right-hand side like C32, E34, etc.
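Under the hood, this chart is a group-by followed by a mean. Here is a minimal sketch of that aggregation, assuming rows are plain dictionaries; the cabin labels and the helper name `mean_by_category` are hypothetical and the rows are toy data.

```python
from collections import defaultdict

rows = [
    {"Cabin": "X10", "Survived": 0},
    {"Cabin": "X10", "Survived": 0},
    {"Cabin": "Y20", "Survived": 1},
    {"Cabin": "Y20", "Survived": 0},
    {"Cabin": "Z30", "Survived": 1},
    {"Cabin": None, "Survived": 0},  # missing cabin, skipped below
]

def mean_by_category(rows, category, value):
    """Mean of `value` per distinct `category`, sorted from lowest to highest mean."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        label = row[category]
        if label is None:
            continue
        sums[label] += row[value]
        counts[label] += 1
    means = {label: sums[label] / counts[label] for label in sums}
    return dict(sorted(means.items(), key=lambda item: item[1]))

survival_by_cabin = mean_by_category(rows, "Cabin", "Survived")
print(survival_by_cabin)
```

Keep in mind that a cabin with a single passenger can only show 0% or 100%, so the extremes of such a chart are often based on very few people.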

g) Individual Word Data Analysis

The categorical columns for this data set are mostly multiword like Name, Cabin Number, etc. But what if we want to use this Data Analysis tool at a word level? Let’s ask this question (probably it’s a ludicrous one). If the passengers have specific words in their names, does that increase their chances of survival?

Titanic Survival with Name (Ref: CyberDeck app)

Wow! That’s quite something, isn’t it? Do you see how survival goes up after a certain point? Names containing Mrs, Miss, Barbara, and Aurora had higher survival. We mostly see female names in the region of higher survival because of the rule “Women and children first!”
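A word-level analysis like this can be sketched by splitting each name into words and averaging the outcome per word. The names below are invented, the helper `survival_by_word` is hypothetical, and the tokenisation (lowercased runs of letters) is an assumption about how such a tool might split text.

```python
import re
from collections import defaultdict

# Invented (name, survived) records for illustration only.
passengers = [
    ("Doe, Mrs. Jane", 1),
    ("Doe, Mr. John", 0),
    ("Roe, Miss. Ann", 1),
    ("Poe, Mr. Sam", 0),
    ("Poe, Mrs. Ann", 1),
]

def survival_by_word(records):
    """Mean survival over all passengers whose name contains each word."""
    stats = defaultdict(lambda: [0, 0])  # word -> [survivors, passengers]
    for name, survived in records:
        # set() so a word repeated inside one name is counted once per passenger
        for word in set(re.findall(r"[a-z]+", name.lower())):
            stats[word][0] += survived
            stats[word][1] += 1
    return {word: survivors / total for word, (survivors, total) in stats.items()}

rates = survival_by_word(passengers)
print(sorted(rates.items(), key=lambda item: item[1]))
```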

h) Kernel Density Estimator (KDE)

We can select any numerical variable in the following plot and simultaneously see its histogram and KDE plot.

Histogram and KDE Plot (Ref: CyberDeck app)
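A KDE is simply a smooth curve built by placing a small bump (here a Gaussian) on every data point and averaging them. The sketch below assumes a Gaussian kernel and a normal-reference rule of thumb for the bandwidth; both are my choices for illustration, and the `ages` sample is made up.

```python
import statistics
from math import exp, pi, sqrt

def gaussian_kde(sample, bandwidth=None):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    n = len(sample)
    if bandwidth is None:
        # Normal-reference rule of thumb: 1.06 * sigma * n^(-1/5)
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)

    def density(x):
        kernels = sum(exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
        return kernels / (n * bandwidth * sqrt(2 * pi))

    return density

ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]  # made-up sample values
density = gaussian_kde(ages)
for x in (0, 20, 40, 60):
    print(x, round(density(x), 4))
```

A smaller bandwidth gives a wigglier curve that follows the histogram closely; a larger one gives a smoother curve that can hide real structure.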

i) Q-Q plot

Lastly, we have the Q-Q plot, a graphical method for comparing two probability distributions by plotting their quantiles against each other.

Q-Q plot (Ref: CyberDeck app)
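To make the idea concrete, here is a sketch that computes the points of a normal Q-Q plot: each sorted observation is paired with the quantile a fitted normal distribution would give at the same position. The plotting positions `(i - 0.5) / n`, the helper name `normal_qq_points`, and the sample values are assumptions for illustration.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(sample):
    """Pair each sorted observation with the matching quantile of a fitted normal."""
    data = sorted(sample)
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    return [
        (fitted.inv_cdf((i - 0.5) / n), observed)
        for i, observed in enumerate(data, start=1)
    ]

ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]  # made-up sample values
points = normal_qq_points(ages)
for theoretical, observed in points:
    print(f"{theoretical:7.2f}  {observed}")
```

If the data were close to normal, the two columns would roughly agree and the points would fall near the diagonal line of the plot.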

That was quite an extensive EDA! Of course, I could have gone much further with this Data Analysis tool, but my intention is not to bore you; it is to show you the possibilities of the CyberDeck Platform as a Data Science and Data Analysis tool. If you have made it this far, that’s fantastic! Let’s move on to the next, more theoretical part: conducting Inferential Statistical tests at the click of a mouse.

3. Inferential Statistical Tests for Data Analysis

Inferential statistics is one of the essential branches of Data Science where you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to infer what the population might think from the sample data. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions and descriptive statistics to describe what’s going on in our data.

Now I will show you what types of tests you can perform with the CyberDeck Data Analysis tool with the click of a mouse. Please note that I might need to switch datasets so that some of the tests make sense.

a) T-Test for Data Analysis

We use a t-test to compare the means of two groups. It is often used in hypothesis testing to determine whether a process or treatment affects the population of interest or whether two groups are different. Also, we assume that the data has a normal distribution while performing a t-test. But as this example is just for demonstration purposes, I will select the SibSp and the Parch columns for this test.

T-test (Ref: CyberDeck app)

We see that the p-value is 0.002, which is below 0.05, so we can reject the Null Hypothesis that the two columns have the same mean. The T-value is not very large, so although the difference is statistically significant, it is not necessarily a big one; the effect size (Cohen’s d, listed below) is the better guide to its magnitude. Apart from these two values, we also get:

  1. Degrees of Freedom (DOF)
  2. What type of t-test was this (Two-sided)
  3. 95% Confidence Interval (CI95%)
  4. The Cohen-d value: Cohen’s d measures the effect size of the difference between two means
  5. The Bayes factor: the ratio of the marginal likelihoods of two competing hypotheses, usually a null and an alternative
  6. The power of this test: It is the probability that the test correctly rejects the null hypothesis when a specific alternative hypothesis is true
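Here is a rough sketch of how the T-value, the degrees of freedom, and Cohen's d can be computed by hand. I am assuming an independent two-sample test with Welch's correction; the article does not say which variant the app runs, and the helper names and toy samples are hypothetical. The p-value, Bayes factor, and power are left out because they need distribution functions that the standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and its approximate degrees of freedom."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

def cohens_d(a, b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled

# Toy samples, not the real SibSp and Parch columns.
group_a = [1, 2, 3, 4]
group_b = [2, 3, 4, 5]
t_value, dof = welch_t(group_a, group_b)
print(t_value, dof, cohens_d(group_a, group_b))
```

Note how the T-value depends on the sample size while Cohen's d does not: with more rows, the same difference in means produces a larger T-value but the same effect size.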

b) Pearson’s Correlation for Data Analysis

Pearson correlation is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations. Let’s measure the correlation between Age and SibSp with this Data Analysis tool. My hypothesis is that as Age increases, the number of siblings should decrease.

Pearson Correlation (Ref: CyberDeck app)

So indeed, we see that Pearson’s correlation coefficient (r-value) is negative (-0.30), indicating some negative correlation.
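Written out, the ratio described above (covariance over the product of the standard deviations) looks like this for n paired observations, where the bars denote sample means:

```latex
r_{XY} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
       = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value always lies between -1 and 1, so an r of -0.30 points to a weak-to-moderate negative linear relationship.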

c) Robust Correlation for Data Analysis

Pearson’s correlation is often distorted by non-normal data and by outliers. A robust correlation method reduces this problem. Let’s look at the correlation between Age and Pclass with this method.

Robust Correlation (Ref: CyberDeck app)

So we see a negative correlation between the variables: older passengers tend to be in 1st class, and younger passengers in 3rd class.

d) Shapiro–Wilk univariate Normality Test

In our Data Analysis journey, we often need to check whether the data follows a Gaussian (normal) distribution. Normal distributions have many convenient properties that statisticians can exploit, and they are much easier to handle than more complicated distributions.

Null Hypothesis: The random sample was drawn from a normal population.

Alternate Hypothesis: The random sample does not follow a normal distribution.

e) Henze-Zirkler (HZ) multivariate Normality test

We perform this test to check whether several variables jointly follow a multivariate normal distribution.

Normality test (Ref: CyberDeck app)

In the univariate normality test, Age does not follow a normal distribution.

In the multivariate normality test, Age and SibSp together do not follow a multivariate normal distribution either.

f) One Way ANOVA Test

The one-way ANOVA compares the means between the groups you are interested in and determines whether any of those means are statistically significantly different from each other.

Null Hypothesis: The means of all the groups are equal.

Alternative Hypothesis: At least one group mean differs from the others.

For this part, I will use different data. It looks like this.

ANOVA Data

We see several subjects whose Scores are measured in 3 months (August, January, and June). We also divide the subjects into two groups: Control and Meditation.

First, let us find if the mean scores for the months are significantly different.

One Way ANOVA test (Ref: CyberDeck App)

Before interpreting this table, let’s understand the terminologies.

  1. SS: Sum of Squares
  2. DF: Degrees of Freedom
  3. MS: Mean Squares (= SS / DF)
  4. F: F-value (test statistic)
  5. p-unc: uncorrected p-value
  6. np2: partial eta-squared effect size

The p-value here is 0.027 and F(2, 177) = 3.685. At an alpha level of 0.05, we can reject the null hypothesis and conclude that the mean scores differ between the months. At an alpha level of 0.02, however, we would fail to reject the null hypothesis.
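To connect the table columns to actual arithmetic, here is a minimal sketch of a one-way ANOVA computed by hand. The three toy groups and the function name `one_way_anova` are assumptions for illustration. The p-value is omitted because it needs the F distribution, which is not in Python's standard library.

```python
from statistics import mean

def one_way_anova(groups):
    """Return SS, DF, MS, F and eta-squared for a list of independent groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "SS": (ss_between, ss_within),
        "DF": (df_between, df_within),
        "MS": (ms_between, ms_within),
        "F": ms_between / ms_within,
        # In a one-way design, partial eta-squared equals eta-squared.
        "np2": ss_between / (ss_between + ss_within),
    }

# Toy scores for three groups (for example, three months).
result = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(result)
```

The degrees of freedom also explain the notation F(2, 177): with 3 groups, the first number is 3 - 1 = 2, and the second is the number of observations minus 3, which corresponds to 180 observations.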

g) Repeated Measures ANOVA

The one-way repeated measures ANOVA is the equivalent of the one-way ANOVA, but for related, not independent groups. It is sometimes called within-subjects ANOVA.

In this example, we will again measure if the means differ between the months. But we will specify the Subject column as the related groups.

Repeated Measures ANOVA (Ref: CyberDeck App)

Here we can again reject the null hypothesis at the alpha level of 0.05. So within subjects also, the means of the distinct periods are different.
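The difference from the ordinary one-way ANOVA is that the variation between subjects is removed from the error term before the F-value is computed. A minimal sketch, assuming a complete table with one score per subject per condition; the function name `rm_anova` and the toy scores are hypothetical.

```python
def rm_anova(table):
    """One-way repeated measures ANOVA.

    table[i][j] is the score of subject i under condition j.
    Returns (F, df_conditions, df_error).
    """
    n, k = len(table), len(table[0])
    grand_mean = sum(sum(row) for row in table) / (n * k)
    condition_means = [sum(row[j] for row in table) / n for j in range(k)]
    subject_means = [sum(row) / k for row in table]

    ss_conditions = n * sum((m - grand_mean) ** 2 for m in condition_means)
    ss_subjects = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_total = sum((v - grand_mean) ** 2 for row in table for v in row)
    # Whatever is explained by neither conditions nor subjects is the error.
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions, df_error = k - 1, (k - 1) * (n - 1)
    f_value = (ss_conditions / df_conditions) / (ss_error / df_error)
    return f_value, df_conditions, df_error

# Toy data: 4 subjects measured under 3 conditions (for example, 3 months).
scores = [
    [1, 2, 4],
    [2, 3, 6],
    [3, 5, 7],
    [4, 5, 9],
]
f_value, df_conditions, df_error = rm_anova(scores)
print(f_value, df_conditions, df_error)
```

Because subjects who score consistently high or low no longer inflate the error term, this test is usually more sensitive than the independent-groups version on the same data.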

h) Pairwise t-test (Parametric)

An ANOVA test does not tell us which groups’ means differ from each other. It only tells us whether at least one group mean differs from the rest. To understand the pairwise differences between groups, we can perform various post hoc tests. One such test is the pairwise t-test.

Pairwise t-test (Ref: CyberDeck app)

Here we see each month compared with every other month. The p-values for August-January and January-June are not statistically significant. Only for August-June is the p-value (0.008) statistically significant, so we can conclude that the mean scores of these two months differ. We therefore reject the null hypothesis for August-June and fail to reject it for the other two pairs.
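Here is a sketch of how such pairwise comparisons can be produced: loop over every pair of months and compute a t statistic for each. I am assuming paired tests because the same subjects are measured each month; the article does not state this, and the scores and the helper name `paired_t` are invented. The p-values are omitted because they need the t distribution.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic: mean difference divided by its standard error."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Invented scores for the same four subjects in each month.
scores = {
    "August": [5, 6, 5, 7],
    "January": [5, 7, 6, 6],
    "June": [6, 8, 7, 8],
}

pairwise = {
    (first, second): paired_t(scores[first], scores[second])
    for first, second in combinations(scores, 2)
}
for pair, t_value in pairwise.items():
    print(pair, round(t_value, 3))
```

When many pairs are tested at once, the chance of a false positive grows, which is why post hoc procedures often report corrected p-values alongside the uncorrected ones.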

The pairwise t-test is a parametric test, so it makes assumptions about the distribution of the data. If we are unsure about the distribution, we should carry out a non-parametric test instead. The pairwise Wilcoxon test (referred to below as the “Pairwise Wilcoxon t-test”, although it is rank-based rather than a true t-test) is one such test.

i) Pairwise Wilcoxon t-test (Non-Parametric)

We put the same columns in this test and got the following results.

Pairwise Wilcoxon t-test (Ref: CyberDeck app)

Here, in the 2nd table (Pairwise Wilcoxon t-test), we get very similar results to the pairwise t-test.

j) Pairwise Correlation and Multiple Linear Regression

Here, we can measure the pairwise correlation between multiple columns of the dataset. We can also compute the regression equation between the variables. We only have two numeric columns here, so we get only one row.

Pairwise correlation and Multiple Linear Regression (Ref: CyberDeck app)

We see that Score and Subject have a very weak correlation. We also get:

  1. linear regression equation (the slope and the intercept coefficients)
  2. The Standard error
  3. T-value (coefficient/standard error)
  4. p-value
  5. r2 value
  6. adjusted r2 value
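With only two numeric columns, the regression reduces to a simple one-predictor fit, and every quantity in the list above can be computed in a few lines. This is a sketch under that assumption; the function name `simple_ols` and the toy data are hypothetical, and the p-value is left out because it needs the t distribution.

```python
from math import sqrt

def simple_ols(x, y):
    """Ordinary least squares with one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)   # one predictor plus an intercept
    se_slope = sqrt(ss_res / (n - 2) / sxx)      # standard error of the slope
    return {
        "slope": slope,
        "intercept": intercept,
        "se": se_slope,
        "T": slope / se_slope,                   # coefficient / standard error
        "r2": r2,
        "adj_r2": adj_r2,
    }

# Toy data for illustration.
fit = simple_ols([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit)
```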

k) Chi-Square test of Independence

The Chi-square test of independence checks whether two variables are likely to be related or not.

Null hypothesis: there is no association between the two variables.

Alternative hypothesis: there is an association between the two variables.

We will use different data for this (Heart disease). The data looks like this.

Heart Disease Data

Let’s try to understand if there’s any association between Sex and target (Whether Sex has any association with whether the person had a heart attack or not).

Chi-Square test (Ref: CyberDeck app)

We see from the different tests that the p-value is below the alpha level of 0.05. So Sex has a statistically significant association with whether the person had a heart attack. Note that a small p-value tells us that an association exists, not how strong it is.
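The test itself compares the observed counts in a contingency table with the counts we would expect if the two variables were independent. Here is a minimal sketch of the plain Pearson chi-square statistic without any continuity correction; the counts are invented (not the real heart disease data), the helper names are hypothetical, and the p-value shortcut shown is valid only for 1 degree of freedom, as in a 2x2 table.

```python
from math import erfc, sqrt

def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof

def p_value_one_dof(chi2):
    """Upper-tail p-value, valid for 1 degree of freedom only."""
    return erfc(sqrt(chi2 / 2))

# Invented 2x2 counts: rows are the two Sex categories, columns are target = 0 / 1.
counts = [[10, 20], [30, 40]]
chi2, dof = chi_square_independence(counts)
print(chi2, dof, p_value_one_dof(chi2))
```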

Now you know why you need this free Data analysis tool.

We saw how we could use the CyberDeck platform for various Exploratory Data Analysis tasks and perform different inferential statistical tests. Even though we only scratched the surface of this app in this article, I believe it’s enough to understand the power of this platform and how much time this Data Analysis tool can save you in your Data Science journey.

Sign up today for CyberDeck, the no-code Data Science platform. It has a free tier, and no credit card is required. So don’t wait for some miracle to change the course of this beautiful journey you have embarked on. Take the reins and change it yourself once and for all.

Important Resources

  1. CyberDeck: Sign Up
  2. Request a Demo Today
  3. Youtube Channel
  4. Blogs
  5. CyberDeck explores COVID data and predicts the number of deaths
  6. CyberDeck Time Series Forecasting
  7. Wiki
