Data Science Problem-Solving Series: Day 01

Zohaib Ahmed | Kaggle Master
Dec 16, 2022

I am starting this series for data scientists who want to turn their thoughts and ideas into code using different data science and machine learning tools.

Day 01 — Agenda

  • What is Data Science
  • Basic Libraries used for data science and machine learning
  • Series Vs DataFrame
  • Walk through some basic code

What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract or extrapolate knowledge and insights from noisy, structured, and unstructured data, and apply knowledge from data across a broad range of application domains. Data science is related to data mining, machine learning, big data, computational statistics, and analytics.

Skills required in data science
In a field like data science, there are a number of technical skills that are helpful to have before diving in, such as:

  • Deep familiarity with statistical analysis
  • Machine learning
  • Deep learning
  • Data visualization
  • Mathematics
  • Programming
  • Ability to manage unstructured data
  • Familiarity with SAS, Hadoop, Spark, Python, R, and other data analysis tools
  • Big data processes, systems, and networks
  • Software engineering

A career in data science is not limited to technical knowledge. You’ll work on teams with other engineers, developers, coders, analysts, and business managers. These workplace skills will help take you further:

  • Communication skills
  • Storytelling
  • Critical thinking and logic
  • Business acumen

Basic Libraries used for Data Science and Machine Learning

Python is one of the most widely used programming languages today, and many data scientists rely on it every day. It is easy to learn, easy to debug, object-oriented, and open source. It also has a rich ecosystem of data science libraries that programmers use daily to solve problems. Here are the top Python libraries for data science:

NumPy

NumPy (Numerical Python) is the fundamental package for numerical computation in Python; it contains a powerful N-dimensional array object. It has around 18,000 commits on GitHub and an active community of 700 contributors. It is a general-purpose array-processing package that provides high-performance multidimensional objects called arrays, along with tools for working with them. NumPy partly addresses the slowness of pure-Python loops by providing functions and operators that work efficiently on whole arrays.

Features:
  • Provides fast, precompiled functions for numerical routines
  • Array-oriented computing for better efficiency
  • Supports an object-oriented approach
  • Compact and faster computations with vectorization

Applications:
  • Extensively used in data analysis
  • Creates powerful N-dimensional arrays
  • Forms the base of other libraries, such as SciPy and scikit-learn
  • Serves as a replacement for MATLAB when used with SciPy and Matplotlib
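
To make the idea of array-oriented, vectorized computing concrete, here is a minimal sketch. The sample numbers and variable names are made up for illustration; only the NumPy calls themselves come from the library.

```python
import numpy as np

# Hypothetical sample data: four made-up article view counts.
views = np.array([210, 211, 114, 178])

# Vectorized arithmetic: the multiplication is applied to every element
# at once, without writing an explicit Python loop.
doubled = views * 2

# Built-in reductions run in precompiled code.
total = views.sum()
average = views.mean()

# A 2-D array (2 rows, 3 columns) illustrates the N-dimensional array object.
matrix = np.arange(6).reshape(2, 3)
column_sums = matrix.sum(axis=0)  # sum down each column

print(doubled)
print(total, average)
print(matrix.shape, column_sums)
```

The same operations written with plain Python lists would need loops or comprehensions, which is where most of the speed difference tends to come from.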

Pandas

Pandas (Python data analysis) is a must in the data science life cycle. It is one of the most popular and widely used Python libraries for data science, along with NumPy and Matplotlib. With around 17,000 commits on GitHub and an active community of 1,200 contributors, it is heavily used for data analysis and cleaning. Pandas provides fast, flexible data structures, such as Series and DataFrame, which are designed to make working with structured data easy and intuitive.

Features:
  • Eloquent syntax and rich functionality that give you the freedom to deal with missing data
  • Enables you to create your own function and run it across a series of data
  • High-level abstraction
  • Contains high-level data structures and manipulation tools
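
Two of these features, handling missing data and running your own function across a series of data, can be sketched as follows. The scores, the threshold of 25, and the helper name `to_grade` are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value (np.nan).
scores = pd.Series([10.0, np.nan, 30.0, 40.0])

# Count the missing entries, then fill them with the mean of the known values.
missing_count = scores.isna().sum()
filled = scores.fillna(scores.mean())


def to_grade(value):
    """Return 'high' for values of 25 or more, otherwise 'low'."""
    return "high" if value >= 25 else "low"


# Run the user-defined function on every element of the Series.
grades = filled.apply(to_grade)

print(missing_count)
print(filled.tolist())
print(grades.tolist())
```

Filling with the mean is only one possible strategy; depending on the data, dropping the rows with `dropna()` or filling with a constant may be a better fit.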

Applications:
  • General data wrangling and data cleaning
  • ETL (extract, transform, load) jobs for data transformation and data storage, as it has excellent support for loading CSV files into its DataFrame format
  • Used in a variety of academic and commercial areas, including statistics, finance, and neuroscience
  • Time-series-specific functionality, such as date range generation, moving windows, and date shifting
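
The time-series functionality mentioned above can be tried with a small sketch like the one below. The start date and the values are invented for the example; `date_range`, `rolling`, and `shift` are standard pandas calls.

```python
import pandas as pd

# Date range generation: five consecutive days (the start date is arbitrary).
dates = pd.date_range(start="2022-12-01", periods=5, freq="D")

# A made-up daily measurement indexed by those dates.
daily = pd.Series([1, 2, 3, 4, 5], index=dates)

# Moving window: mean over the current day and the two days before it.
# The first two entries are NaN because the window is not yet full.
rolling_mean = daily.rolling(window=3).mean()

# Date shifting: move every value one day forward in time.
shifted = daily.shift(1)

print(rolling_mean)
print(shifted)
```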

Matplotlib

Matplotlib produces powerful yet beautiful visualizations. It is a plotting library for Python with around 26,000 commits on GitHub and a very vibrant community of about 700 contributors. Because of the graphs and plots that it produces, it is extensively used for data visualization. It also provides an object-oriented API, which can be used to embed those plots into applications.

Features:
  • Usable as a MATLAB replacement, with the advantage of being free and open source
  • Supports dozens of backends and output types, which means you can use it regardless of which operating system you are using or which output format you wish to use
  • Pandas can act as a wrapper around the Matplotlib API through its plot methods, so plots can be drawn directly from Series and DataFrame objects
  • Low memory consumption and better runtime behavior

Applications:
  • Correlation analysis of variables
  • Visualize 95 percent confidence intervals of the models
  • Outlier detection, for example with a scatter plot
  • Visualize the distribution of data to gain instant insights
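
As a rough illustration of the object-oriented API, the sketch below draws a scatter plot and a histogram side by side. The data is randomly generated, and the output file name `day01_plots.png` and the use of the non-interactive `Agg` backend are assumptions made so the script runs without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Made-up data: y depends roughly linearly on x, plus some noise.
rng = np.random.default_rng(seed=0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

# Object-oriented API: create a Figure with two Axes and draw on each one.
fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(8, 3))

# Scatter plot: useful for spotting correlation and outliers.
ax_scatter.scatter(x, y, s=10)
ax_scatter.set_xlabel("x")
ax_scatter.set_ylabel("y")
ax_scatter.set_title("Scatter plot")

# Histogram: shows the distribution of a single variable.
ax_hist.hist(x, bins=20)
ax_hist.set_title("Distribution of x")

# Save the figure to the system temporary directory (hypothetical file name).
fig.tight_layout()
output_path = os.path.join(tempfile.gettempdir(), "day01_plots.png")
fig.savefig(output_path)
```

In a notebook you would normally skip the `matplotlib.use("Agg")` line and the file saving, and let the figure display inline.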

Scikit-learn

Next in the list of top Python libraries for data science comes scikit-learn, a machine learning library that provides most of the classical machine learning algorithms you might need. Scikit-learn is designed to interoperate with NumPy and SciPy.

Applications:
  • Clustering
  • Classification
  • Regression
  • Model selection
  • Dimensionality reduction
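
To give a feel for the typical workflow, here is a minimal classification sketch. The choice of the bundled Iris dataset, logistic regression, a 25 percent test split, and `random_state=42` are all assumptions made for this example, not recommendations from the article.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small dataset bundled with scikit-learn as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing; the fixed seed makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit a simple classifier on the training rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the unseen rows and measure the share of correct predictions.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Test accuracy: {accuracy:.2f}")
```

Most scikit-learn estimators follow the same `fit` and `predict` pattern, so swapping in a different algorithm usually means changing only the line that creates the model.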

Series Vs DataFrame

A Series is a one-dimensional labeled array in pandas that can hold integers, strings, floating-point numbers, and more. Like a list, it stores values in order, but each value also has an index label; by default the index runs from 0 to n - 1, where n is the number of values in the Series. Before discussing DataFrames in more detail, we first need to understand the main difference between a Series and a DataFrame. A Series holds a single column of values with an index, whereas a DataFrame can be made of more than one Series; in other words, a DataFrame is a collection of Series that share an index and can be used to analyze the data.

Code #1: Creating a simple Series

# Import the pandas library
import pandas as pd

# Create a list
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']

# Create a Series by passing the list to pd.Series()
auth_series = pd.Series(author)

# Print the Series
print(auth_series)

Code #2: Creating Dataframe from multiple Series

# Import the pandas library
import pandas as pd

# Create two lists
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
article = [210, 211, 114, 178]

# Create two Series from the lists
auth_series = pd.Series(author)
article_series = pd.Series(article)

# Create a dictionary with the Series objects as values
frame = {'Author': auth_series, 'Article': article_series}

# Create a DataFrame from the dictionary
result = pd.DataFrame(frame)

# Print the DataFrame
print(result)
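
Going the other way also helps show the relationship: selecting one column or one row of a DataFrame gives back a Series. The sketch below rebuilds the same DataFrame so it can run on its own; the names `article_column` and `first_row` are introduced here only for illustration.

```python
import pandas as pd

# Rebuild the DataFrame from the previous example.
author = ['Jitender', 'Purnima', 'Arpit', 'Jyoti']
article = [210, 211, 114, 178]
result = pd.DataFrame({'Author': pd.Series(author), 'Article': pd.Series(article)})

# Selecting a single column returns a Series.
article_column = result['Article']
print(type(article_column).__name__)

# Both columns share the same default index, 0 to n - 1.
print(list(result.index))

# Selecting a single row by its index label also returns a Series,
# this time labeled by the column names.
first_row = result.loc[0]
print(first_row['Author'], first_row['Article'])

# The DataFrame has 4 rows and 2 columns.
print(result.shape)
```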

In this tutorial, we looked at what data science is and which main libraries will help us write code for data and machine learning. In the next tutorial, we will start coding properly to solve problems. Thanks, and see you in the next tutorial.
