Data Science Practical Basics: Day 01/100

Zohaib Ahmed | Kaggle Master
6 min read | Oct 2, 2023

Introduction

Data science is a field that deals with the collection, analysis, and interpretation of data. It is used in a wide variety of industries, including finance, healthcare, and marketing.

Some of the basic tasks in data science include:

  • Calculating the mean, median, and mode of a dataset. These are measures of the central tendency of a dataset and describe the typical value around which the data cluster.
  • Calculating the standard deviation and variance of a dataset. These are measures of the spread of a dataset and can be used to understand how much the data values vary from the mean.
  • Calculating the correlation coefficient between two variables. This is a measure of the strength and direction of the relationship between two variables.
  • Fitting a linear regression model to a dataset. This is a simple machine learning algorithm that can be used to predict the value of one variable based on the value of another variable.
  • Implementing a simple decision tree classifier. This is a machine learning algorithm that can be used to classify data points into different categories.

These tasks are essential for understanding and interpreting data. They can be used to identify patterns and trends in data, to make predictions, and to classify data points.

Practical

Task 01

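Calculate the mean, median, and mode of a dataset. This is a basic task in data science, and it is often the first step in analyzing a new dataset. To do this in Python, you can use the NumPy and pandas libraries: the code below uses np.mean() for the mean and the pandas Series methods mode() and median() for the other two.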

import pandas as pd
import numpy as np

# Create a pandas DataFrame from the list of ages
data = pd.DataFrame({"Age": [34, 35, 23, 44, 23, 34, 23, 34, 23, 34, 45, 34, 44, 34, 34, 23, 34]})

# Calculate the mean of the ages
print("dataset Mean =", np.mean(data['Age']))

# Calculate the mode of the ages (mode() returns every tied value; [0] takes the smallest)
print("dataset Mode =", data['Age'].mode()[0])

# Calculate the median of the ages
print("dataset Median =", data['Age'].median())

Output:

dataset Mean = 32.64705882352941

dataset Mode = 34

dataset Median = 34.0
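As a cross-check, the same three statistics can be computed with Python's standard-library statistics module. This is only a minimal sketch and not part of the workflow above. It also uses multimode(), which returns every tied value and so shows what a single "mode" can hide when two values are equally frequent. The second list, tied_example, is a made-up illustration.

```python
import statistics

ages = [34, 35, 23, 44, 23, 34, 23, 34, 23, 34, 45, 34, 44, 34, 34, 23, 34]

mean_age = statistics.mean(ages)        # arithmetic average
median_age = statistics.median(ages)    # middle value of the sorted list
modes = statistics.multimode(ages)      # list of all most-frequent values

print("mean:", mean_age)
print("median:", median_age)
print("modes:", modes)

# Hypothetical data with a tie: 1 and 2 both appear twice
tied_example = [1, 1, 2, 2, 3]
tied_modes = statistics.multimode(tied_example)
print("modes of tied example:", tied_modes)
```

When a dataset has several modes, it is usually worth reporting all of them rather than only the first.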

Task 02

Calculate the standard deviation and variance of a dataset. The standard deviation and variance are measures of the spread of a dataset. To calculate these in Python, you can use NumPy's np.std() and np.var() functions, which by default return the population versions (dividing by n rather than n - 1).

# Import the pandas and NumPy libraries.
import pandas as pd
import numpy as np

# Create a pandas DataFrame from the list of ages.
data = pd.DataFrame({"Age": [34, 35, 23, 44, 23, 34, 23, 34, 23, 34, 45, 34, 44, 34, 34, 23, 34]})

# Calculate the standard deviation of the ages using the np.std() function.
# The standard deviation is a measure of how spread out the values in a dataset are.
print("dataset std is =", np.std(data['Age']))

# Calculate the variance of the ages using the np.var() function.
# The variance is the square of the standard deviation.
print("dataset variance is =", np.var(data['Age']))

Output:

dataset std is = 7.259405067752884

dataset variance is = 52.69896193771626
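The numbers above are population statistics: np.std() and np.var() use ddof=0 by default, whereas the pandas methods Series.std() and Series.var() default to ddof=1 (the sample versions). The sketch below uses the standard-library statistics module, rather than the NumPy calls from the task, to show how the two conventions differ on the same ages.

```python
import statistics

ages = [34, 35, 23, 44, 23, 34, 23, 34, 23, 34, 45, 34, 44, 34, 34, 23, 34]

# Population versions: divide the sum of squared deviations by n
pop_var = statistics.pvariance(ages)
pop_std = statistics.pstdev(ages)

# Sample versions: divide by n - 1 (Bessel's correction)
samp_var = statistics.variance(ages)
samp_std = statistics.stdev(ages)

print("population variance:", pop_var, "population std:", pop_std)
print("sample variance:    ", samp_var, "sample std:    ", samp_std)
```

The sample versions are always slightly larger. Which one is appropriate depends on whether the data are the whole population of interest or a sample drawn from it.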

Task 03

Calculate the correlation coefficient between two variables. The correlation coefficient is a measure of the strength and direction of the linear relationship between two variables, ranging from -1 to 1. To calculate this in Python, you can use NumPy's np.corrcoef() function.

# Import the NumPy library.
import numpy as np

# Create two NumPy arrays, one for x and one for y.
x = np.array([1, 2, 3, 4, 5])
y = np.array([0, -1, -2, -3, -4])

# Calculate the correlation matrix between x and y using the np.corrcoef() function.
# The correlation matrix is a square matrix that shows the correlation between each pair of variables.
correlation_matrix = np.corrcoef(x, y)

# Print the correlation matrix.
print(correlation_matrix)

Output:

[[ 1. -1.]
 [-1.  1.]]
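The off-diagonal entries of this matrix are the Pearson correlation coefficient between x and y. The diagonal entries are each variable's correlation with itself, which is always 1. To make the calculation concrete, here is a minimal pure-Python sketch of the Pearson formula. The helper name pearson_r is introduced only for illustration.

```python
import math

x = [1, 2, 3, 4, 5]
y = [0, -1, -2, -3, -4]

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of the covariance)
    co_dev = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    # Sums of squared deviations for each variable
    ss_x = sum((a - mean_x) ** 2 for a in xs)
    ss_y = sum((b - mean_y) ** 2 for b in ys)
    return co_dev / math.sqrt(ss_x * ss_y)

r = pearson_r(x, y)
print("Pearson r =", r)
```

Because y decreases by exactly one unit for every unit increase in x, the relationship is perfectly linear and negative, which is why the coefficient is -1.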

Task 04

Fit a linear regression model to a dataset. Linear regression is a simple machine learning algorithm that can be used to predict the value of one variable based on the value of another variable. To fit a linear regression model in Python, you can use the LinearRegression class from scikit-learn's sklearn.linear_model module.

import numpy as np

# Create a list of data points
# Each data point is a list of two numbers, representing the x and y values
data = [
[1, 2],
[2, 3],
[3, 4],
[4, 5],
[5, 6]
]

# Convert the list to a NumPy array
# This will create two separate arrays, one for the x values and one for the y values
X = np.array(data)[:, 0]
y = np.array(data)[:, 1]

# Reshape X into a 2D array of shape (n_samples, 1)
# LinearRegression expects a 2D feature array; reshaping y is optional and simply makes the target a column vector
X = X.reshape(-1, 1)
y = y.reshape(-1, 1)

# Import the LinearRegression class from the sklearn.linear_model library
from sklearn.linear_model import LinearRegression

# Create an instance of the LinearRegression class
reg = LinearRegression()

# Fit the model to the data
# This will train the model to predict the y values for the given x values
reg.fit(X, y)

# Print the score of the model
# For LinearRegression this is R^2 (the coefficient of determination), where 1.0 is a perfect fit
print(reg.score(X, y))

# Import the matplotlib.pyplot library for plotting the data
import matplotlib.pyplot as plt

# Make predictions
# This will predict the y values for the given x values using the trained model
y_pred = reg.predict(X)

# Create a scatter plot of the data
# This will plot the actual y values (blue dots) and the predicted y values (red line)
plt.scatter(X, y)
plt.plot(X, y_pred, color='red')

# Set the labels and title of the plot
plt.xlabel('X')
plt.ylabel('y')
plt.title('Fit Line for Linear Regression Model')

# Display the plot
plt.show()

Output:

[Figure: Task 04 output, a scatter plot of the data points with the fitted regression line drawn in red]
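With a single input variable, ordinary least squares (the method LinearRegression fits) has a simple closed form: the slope is the sum of co-deviations divided by the sum of squared deviations of x, and the intercept follows from the means. The sketch below reproduces that calculation in pure Python on the same five points. The helper names fit_line and r_squared are illustrative and are not part of scikit-learn.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 4, 5, 6]

def fit_line(xs, ys):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

slope, intercept = fit_line(xs, ys)
score = r_squared(xs, ys, slope, intercept)
prediction = slope * 6 + intercept  # predict y for an unseen x = 6

print("slope:", slope, "intercept:", intercept)
print("R^2:", score)
print("prediction for x = 6:", prediction)
```

Every point in this toy dataset lies exactly on the line y = x + 1, so the fit is perfect. Real data will leave residuals and give an R^2 below 1.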

Task 05

Implement a simple decision tree classifier. A decision tree classifier is a machine learning algorithm that classifies data points into categories by learning a sequence of threshold tests on the input features. To implement a simple decision tree classifier in Python, you can use the DecisionTreeClassifier class from scikit-learn's sklearn.tree module.

import numpy as np

# Create a dummy data set
X = np.array([
[1, 2],
[3, 4],
[5, 6],
[7, 8]
])

y = np.array([0, 1, 0, 1])

# Import the DecisionTreeClassifier class from the sklearn.tree module
from sklearn.tree import DecisionTreeClassifier

# Create a decision tree classifier object
clf = DecisionTreeClassifier()

# Fit the classifier to the dummy data
clf.fit(X, y)

# Make predictions on new data
X_new = np.array([
[9, 10],
[11, 12]
])

y_pred = clf.predict(X_new)

# Display the predictions (a notebook shows the value of the last expression)
y_pred

Output:

array([1, 1])
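To give some intuition for how the tree chooses its splits, the sketch below scores candidate thresholds on the first feature of the dummy data using Gini impurity, which is the default criterion of DecisionTreeClassifier. It is a simplified illustration and not scikit-learn's actual implementation. The helper names are made up, and taking the midpoints between neighboring values as candidate thresholds is an assumption of this sketch.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def weighted_gini(values, labels, threshold):
    """Impurity after splitting on value <= threshold, weighted by group size."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

feature = [1, 3, 5, 7]   # first column of X from the task above
labels = [0, 1, 0, 1]    # y from the task above

# Candidate thresholds: midpoints between neighboring feature values
sorted_vals = sorted(set(feature))
thresholds = [(a + b) / 2 for a, b in zip(sorted_vals, sorted_vals[1:])]

scores = {t: weighted_gini(feature, labels, t) for t in thresholds}
for t, s in scores.items():
    print(f"split at {t}: weighted Gini = {s:.3f}")
```

Lower impurity means a cleaner split. Here the thresholds 2 and 6 tie and both beat 4. Whichever of the two is used first, any point whose first feature is above 7 ends up in the same leaf as [7, 8], which is consistent with both new points being predicted as class 1.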

Conclusion

The five tasks above are only a small part of what data scientists do, but they are the building blocks that more advanced analysis relies on. Being comfortable with them gives you a solid foundation for further work in data science.

Here are some specific examples of how these tasks can be used in the real world:

  • The mean, median, and mode of customer purchase amounts show the typical purchase size and whether a few large orders are skewing the average. This information can inform marketing strategies and pricing.
  • The standard deviation and variance of a set of test scores show how much the scores vary around the mean. Scores that fall well below the mean relative to that spread can help flag students who may need extra support.
  • The correlation coefficient between customer satisfaction and customer churn shows how strongly the two move together. This information can guide strategies to reduce churn, keeping in mind that correlation alone does not establish cause.
  • A linear regression model fitted to past sales data can be used to predict future sales. These predictions can support business decisions such as how much inventory to order and how much to spend on marketing.
  • A simple decision tree classifier trained on features extracted from medical images can help doctors diagnose diseases, supporting better care for patients.

These are just a few examples of how the basic tasks of data science can be applied to real-world problems.
