IBM Data Science Certification: Course 1 — What is Data Science?

I recently summarized my experience completing the IBM Data Science Certification (read it here!) and received tons of awesome feedback, including requests to learn more about each course. I decided to break down each of the 10 courses and discuss the important concepts and definitions covered in them. My goal is to provide information from outside the course that can help strengthen your knowledge of the objectives laid out by the course creators. While these publications will not necessarily give you all of the answers, they will provide foundational knowledge for completing the data science specialization and for further development in your own career.

Disclaimer: I have no affiliation or financial incentives with IBM.

Photo by Kaleidico on Unsplash

Course 1: What is Data Science?

Course 1 of the IBM Data Science Certification introduces the student to what data science is and to some of its important concepts. This course is not coding intensive; rather, it provides definitions and information so the student gains knowledge of the tools used by data scientists. The course begins by explaining data science and how one becomes a data scientist, and ends with real-world applications of data science as well as a report structure for presenting data to a client.

Week 1

The learning objectives for this week are directed toward understanding three concepts:

  1. What is Data Science?
  2. Paths that can lead to careers in Data Science
  3. Advice from Data Scientists

What is Data Science?

There are many definitions of data science.

Wikipedia:

“An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from noisy, structured and unstructured data, and apply knowledge and actionable insights from data across a broad range of application domains.”

IBM (From the course):

“Data Science is a process, not an event. It is the process of using data to understand different things, to understand the world.” — “Data science is the art of uncovering the insights and trends that are hiding behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.”

IBM (Website):

“Data science is a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations. Data science encompasses preparing data for analysis and processing, performing advanced data analysis, and presenting the results to reveal patterns and enable stakeholders to draw informed conclusions.”

While Data Science can have many meanings, the main takeaway is that Data Science encompasses many different disciplines (really, any industry can utilize data science tools). For example, my background is an undergraduate degree in economics with Chinese and a master's degree in Operations Research.

Paths that can lead to careers in Data Science

There are many paths to becoming a data scientist. Whether you are self-taught or enrolled in a data science curriculum, everyone has the chance to make a difference for clients using data science. As reported by careerkarma, the following industries currently provide the most support for an entryway and path into a data science career:

  • Financial Services
  • Information Technology
  • Healthcare
  • Retail
  • Media/Entertainment

Advice from Data Scientists

Course Advice:

The best data scientists are curious, ask good questions, and are comfortable dealing with unstructured situations.

My advice:

While I am still early in my data science career, my advice is to keep pushing forward: learning, reading, and researching everything about and related to data science (Artificial Intelligence, Machine Learning, etc.). Additionally, I am constantly trying to meet successful data scientists and ask what they recommend I keep doing. Finally, I continuously reach out to businesses to see if they have any projects that need a data science approach, both to gain more experience and to build a project portfolio with more breadth.

Week 2

The learning objectives for week 2 are directed toward understanding three concepts:

  1. Big Data and its characteristics (Volume, Velocity, Veracity, Value)
  2. Big Data Tools (The course focuses on Hadoop)
  3. Skills Required to be a Data Scientist as well as to work with Big Data

Big Data

Definition (Oxford):

Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.

The following definitions from TechTarget help understand the V’s of big data:

Volume: The amount of data existing within a given set of data.

Velocity: The speed at which data is generated and moved across platforms.

Veracity: The degree of quality and accuracy present within a dataset (e.g., are there many missing entries?).

Value: The value Big Data can bring to an organization.
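To make two of these definitions concrete, here is a minimal sketch in plain Python that measures volume (how much data there is) and a rough veracity indicator (the share of missing entries). The dataset, the function names `volume` and `missing_fraction`, and the choice of `None` as the missing-value marker are my own assumptions for illustration, not something taken from the course or from TechTarget.

```python
# Hypothetical toy dataset: each record is a dict, and None marks a missing entry.
records = [
    {"user": "a", "age": 34, "city": "Austin"},
    {"user": "b", "age": None, "city": "Boston"},
    {"user": "c", "age": 29, "city": None},
    {"user": "d", "age": 41, "city": "Denver"},
]


def volume(rows):
    """Volume: the number of records and the total number of cells."""
    n_cells = sum(len(row) for row in rows)
    return len(rows), n_cells


def missing_fraction(rows):
    """A rough veracity indicator: the share of cells that are missing (None)."""
    _, n_cells = volume(rows)
    n_missing = sum(1 for row in rows for value in row.values() if value is None)
    return n_missing / n_cells if n_cells else 0.0


n_rows, n_cells = volume(records)
print(f"rows={n_rows}, cells={n_cells}, missing={missing_fraction(records):.1%}")
```

A real veracity check would also look at accuracy (wrong values, duplicates, inconsistent units), which a simple missing-entry count cannot capture.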

Big Data Tools

Apache Hadoop — From the website:

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
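One of the "simple programming models" Hadoop is known for is MapReduce. The sketch below imitates the map, shuffle, and reduce phases for a word count, running locally in plain Python. It does not use Hadoop or any Hadoop API; the function names and sample documents are assumptions chosen only to illustrate the pattern.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the counts collected for each word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    pairs = []
    # On a real cluster, each document (or split) could be mapped on a different machine.
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


documents = ["big data big ideas", "data science uses data"]
counts = word_count(documents)
print(counts)
```

The appeal of the model is that the map and reduce functions stay this simple while the framework takes care of distributing the work and handling machine failures.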

MongoDB — From the website:

Get your ideas to market faster with an application data platform built on the leading modern database. Support transactional, search, analytics, and mobile use cases while using a common query interface and the data model developers love.

Apache Cassandra — From the website:

Apache Cassandra is an open source NoSQL distributed database trusted by thousands of companies for scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

Cloudera — From the website:

Building a data-driven culture across the enterprise no longer has to add layers of complexity that impact business agility. As the growth and distribution of data continues, businesses must provide employees easy access to the data needed to make the right decisions. It’s essential. Because where data flows, ideas follow.

OpenRefine — From the website:

OpenRefine (previously Google Refine) is a powerful tool for working with messy data: cleaning it; transforming it from one format into another; and extending it with web services and external data.

OpenRefine always keeps your data private on your own computer until YOU want to share or collaborate. Your private data never leaves your computer unless you want it to. (It works by running a small server on your computer and you use your web browser to interact with it)

Data Science Skills

Skills discussed by the course:

As outlined by zipreporting, there are 7 steps in the data science process:

  1. Data Cleaning ~ Clean the data so it can be read in the program being used (fill in missing entries, etc).
  2. Data Integration ~ Combining different types of data and datasets together.
  3. Data Reduction for Data Quality ~ Decreasing the size of the dataset by discarding unimportant features and information.
  4. Data Transformation ~ Depending on the analysis, the data may need to be transformed (for example, applying a log transformation to a variable before a regression analysis).
  5. Data Mining ~ The implementation of applications by organizations to collect big data and extract latent patterns.
  6. Pattern Evaluation ~ The discussion and gathering of insights on patterns of a dataset by a subject matter expert and data scientist.
  7. Representing Knowledge in Data Mining ~ The use of visualizations, reports, and presentations to inform the client of the information uncovered through data mining.
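The first four steps above can be sketched in a few lines of plain Python. Everything here is hypothetical: the `orders` and `profiles` data, the function names, and the choice to fill missing values with the mean are assumptions made for illustration, not the method described by the course or by zipreporting.

```python
import math

# Hypothetical data: two small sources keyed by customer id.
orders = [
    {"id": 1, "spend": 120.0, "notes": "gift"},
    {"id": 2, "spend": None, "notes": ""},
    {"id": 3, "spend": 45.5, "notes": "rush"},
]
profiles = {1: {"region": "east"}, 2: {"region": "west"}, 3: {"region": "east"}}


def clean(rows):
    """Data cleaning: fill a missing spend with the mean of the observed values."""
    observed = [r["spend"] for r in rows if r["spend"] is not None]
    mean_spend = sum(observed) / len(observed)
    return [{**r, "spend": mean_spend if r["spend"] is None else r["spend"]} for r in rows]


def integrate(rows, lookup):
    """Data integration: join each order with its customer profile by id."""
    return [{**r, **lookup.get(r["id"], {})} for r in rows]


def reduce_features(rows, keep):
    """Data reduction: keep only the features needed for the analysis."""
    return [{k: r[k] for k in keep} for r in rows]


def transform(rows):
    """Data transformation: add a natural-log version of spend."""
    return [{**r, "log_spend": math.log(r["spend"])} for r in rows]


prepared = transform(
    reduce_features(integrate(clean(orders), profiles), ["id", "spend", "region"])
)
for row in prepared:
    print(row)
```

In practice a library such as Pandas would handle these steps more conveniently, but the plain version shows what each step actually does to the data.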

How to use Machine Learning

There are many ways to learn how to use and implement machine learning techniques on a dataset. My advice is to learn the math behind the machine learning algorithms you implement before using them. Once you know what a specific algorithm does, I recommend picking a coding language and sticking to it. For example, I code in Python and some of the main Python machine learning libraries are:

  1. TensorFlow ~ Useful for programming machine learning algorithms which will become operational in a given organization’s day-to-day workflow.
  2. PyTorch ~ Useful for building machine learning models with dynamic computation graphs.
  3. Scikit-Learn ~ Useful for implementing classification, regression, and clustering machine learning algorithms.
  4. Keras ~ An API that is useful for building and training neural networks.
  5. Natural Language Toolkit (NLTK) ~ Useful for Natural Language Processing (NLP).
  6. Pandas ~ Useful for creating data frames and manipulating the data within a dataset.
  7. OpenCV ~ Useful for computer vision projects.
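As an example of what "learning the math first" can look like, here is a minimal sketch of simple linear regression fitted by ordinary least squares, written without any library. The function name `fit_line` and the sample points are my own assumptions; a library such as Scikit-Learn would normally do this fitting for you.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points that lie exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Once the formula is familiar, it is much easier to interpret what a library's regression output is telling you.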

How to use Deep Learning

Learning how to actually understand and implement Deep Learning is definitely beyond the scope of this publication. Wikipedia defines Deep Learning as the use of artificial neural networks (or variations of neural networks) for representation learning. This can include supervised, semi-supervised, and unsupervised learning tasks. The main idea is that Deep Learning uses neural networks.
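To give a feel for what a neural network computes, the sketch below runs a single forward pass through a tiny network with one hidden layer, in plain Python. The weights are hand-picked and hypothetical (a real network learns them from data), and the helper names `dense` and `forward` are assumptions for illustration rather than any library's API.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), one output per weight row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


# Hypothetical, hand-picked weights (a real network would learn these from data).
hidden_w = [[1.0, -1.0], [0.5, 0.5]]
hidden_b = [0.0, -0.5]
output_w = [[1.0, 1.0]]
output_b = [0.0]


def forward(x):
    """Forward pass: input -> hidden layer (ReLU) -> single output (sigmoid)."""
    hidden = dense(x, hidden_w, hidden_b, relu)
    return dense(hidden, output_w, output_b, sigmoid)[0]


print(forward([2.0, 1.0]))
```

Libraries like TensorFlow, PyTorch, and Keras perform the same kind of computation at scale and also handle the training step that adjusts the weights.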

Week 3

The learning objectives for Week 3 are directed toward understanding three concepts:

  1. Applications of Data Science (The course specifically looks at health care)
  2. How a company can start its data science journey
  3. Consumer Data Generation

Applications for Data Science

There are MANY applications of data science in various industries. The examples given in the course include:

  • Amazon, Netflix, Spotify, Google ~ Recommendation systems
  • Siri ~ Creates answers through the use of data science
  • Fitbits, Apple Watches, Android Watches ~ Biometric tracking and analysis

Consumer-Generated Data

The reason for this objective in the course is for the student to understand that a lot of the data used by companies for training their machine learning algorithms is produced by the consumer. Consumer-generated data includes:

  • Search history
  • Order history
  • Time on screen
  • Advertisement clicks
  • Data from wearable devices
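As a rough illustration of how consumer-generated data such as order history can feed a recommendation system, here is a minimal sketch that recommends items frequently bought together. The order data and the functions `co_purchase_counts` and `recommend` are hypothetical; real systems at companies like Amazon or Netflix are far more sophisticated, and this is not a description of how they work.

```python
from collections import Counter
from itertools import combinations

# Hypothetical order history: one set of purchased items per order.
orders = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "keyboard"},
    {"mouse", "keyboard"},
    {"laptop", "monitor"},
]


def co_purchase_counts(order_history):
    """Count how often each ordered pair of items appears in the same order."""
    counts = Counter()
    for items in order_history:
        for a, b in combinations(sorted(items), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


def recommend(item, order_history, top_n=2):
    """Return the items most often bought together with the given item."""
    counts = co_purchase_counts(order_history)
    scores = {other: n for (first, other), n in counts.items() if first == item}
    # Sort by count (highest first), then by name so ties are ordered predictably.
    ranked = sorted(scores, key=lambda other: (-scores[other], other))
    return ranked[:top_n]


print(recommend("laptop", orders))
```

Every new order a customer places adds to the counts, which is why the data consumers generate is so valuable to these companies.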

Report Structure

Week 3 also provides the layout for a data science report.

  1. Cover Page ~ Title, authors’ names, authors’ affiliations, authors’ contacts, publisher, publication date
  2. Table of Contents ~ Chapter page numbers, section page numbers, list of tables, list of figures
  3. Abstract ~ Provides a summary of the techniques implemented as well as the main conclusion from the research
  4. Methodology Section ~ Provides the reader with the dataset organization, processes, and research methods used.
  5. Results Section ~ Key findings
  6. Discussion Section ~ Talk about the results and what they mean.
  7. Conclusion Section ~ Reiterate important findings from the analysis, answer any questions previously stated, and provide recommendations for future research.
  8. Appendix ~ Supplementary Information

While the structure of a report will vary from project to project, this is a great general guideline to follow when creating the final document for your research!

Conclusion

For being the first of ten courses in the IBM Data Science Certification, this course does a great job of providing background information to the user. Understanding the background of data science is important so the user knows why they may be implementing a certain machine learning algorithm later in the certification. I highly recommend this course for anyone who wants to gain a general understanding of what data science is as well as to see which companies are implementing data science in their day-to-day operations.

If you enjoyed today’s reading, please give me a follow and let me know if there is another topic you would like me to explore! Additionally, add me on LinkedIn, or feel free to reach out! Thanks for reading!

