Data Science/U20CSCJ11
UNIT-1
PART-A
1. Outline of NumPy
NumPy (Numerical Python) is a powerful library for numerical computing in Python. It provides:
- Multidimensional array objects (ndarray).
- Functions for performing mathematical, logical, shape manipulation, and statistical operations.
- Tools for integrating C/C++ and Fortran code.
- Support for large datasets with fast operations due to optimized C code.
2. Creating 1D and 2D Arrays in NumPy
1D Array:
import numpy as np
array_1d = np.array([1, 2, 3, 4, 5])
print(array_1d)
2D Array:
array_2d = np.array([[1, 2, 3], [4, 5, 6]])
print(array_2d)
3. Outline of Pandas
Pandas is a data analysis and manipulation library in Python. It offers:
- Data structures: Series (1D) and DataFrame (2D).
- Tools for data cleaning, transformation, and aggregation.
- Support for handling missing data.
- Built-in functions for merging, joining, and reshaping data.
Key Benefits:
- Simplifies handling structured data like spreadsheets and databases.
- Integrates seamlessly with NumPy for numerical operations.
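The Series structure and the missing-data support mentioned above can be sketched as follows. This is only an illustration: the names and marks are made up, and filling with the mean is one possible strategy among several.

```python
import numpy as np
import pandas as pd

# A Series is a 1D labeled array; these labels and values are illustrative.
marks = pd.Series([78.0, np.nan, 91.0], index=["Alice", "Bob", "Charlie"])

# isna() flags missing entries; fillna() returns a new Series with them replaced.
missing_count = int(marks.isna().sum())
filled = marks.fillna(marks.mean())  # mean() skips NaN by default

print(marks)
print("Missing values:", missing_count)
print(filled)
```

The original Series is left unchanged; fillna() returns a new object unless the result is assigned back.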
4. Creating 3D Arrays in NumPy
import numpy as np
array_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(array_3d)
5. Sketch about Data Science
Data Science involves extracting insights and knowledge from data using:
- Statistics and mathematics.
- Programming languages like Python and R.
- Machine Learning and AI for predictive analysis.
- Tools like Pandas, NumPy, TensorFlow, and more.
Components of Data Science:
- Data Collection
- Data Cleaning
- Data Analysis
- Model Building
- Visualization and Reporting
6. Benefits of Data Science
- Enhanced Decision-Making: Data-driven insights lead to better business strategies.
- Automation: Predictive models reduce human intervention.
- Personalization: Tailors customer experiences (e.g., recommendation systems).
- Optimization: Improves efficiency across industries like healthcare, finance, and logistics.
7. Uses of Data Science
- Healthcare: Predicting diseases, personalized treatments.
- E-Commerce: Recommendations, trend analysis.
- Finance: Fraud detection, credit scoring.
- Marketing: Customer segmentation, sentiment analysis.
- Transportation: Traffic optimization, autonomous vehicles.
8. Types of Data
- Structured Data: Tabular format, easily stored in databases.
- Unstructured Data: Free-form like text, images, audio.
- Semi-Structured Data: Contains elements of both, like JSON, XML.
- Time-Series Data: Data points indexed in time order.
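As a rough sketch of how semi-structured data relates to structured data, the JSON records below (invented for illustration) do not all share the same keys, yet they can be loaded into a tabular DataFrame.

```python
import json
import pandas as pd

# Semi-structured data: a JSON string whose records have different keys.
raw = '[{"name": "Alice", "age": 25}, {"name": "Bob", "city": "Chicago"}]'
records = json.loads(raw)

# Loading into a DataFrame gives a structured (tabular) view;
# keys missing from a record show up as NaN.
df = pd.DataFrame(records)
print(df)
```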
9. Operations using NumPy
Mathematical Operations:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
print(a + b) # Element-wise addition
Matrix Multiplication:
a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])
print(np.dot(a, b))
Statistical Operations:
arr = np.array([1, 2, 3, 4, 5])
print(np.mean(arr), np.std(arr))
10. Pandas DataFrame
DataFrame is a 2D data structure with labeled rows and columns.
Creating a DataFrame:
import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
Viewing Data:
print(df.head()) # First few rows
print(df.tail()) # Last few rows
Filtering:
print(df[df['Age'] > 25])
UNIT-1
PART-B
1. Summarize Pandas in Python and write a sample program.
Pandas is a popular Python library used for data manipulation and analysis. It provides two main data structures:
- Series: One-dimensional labeled array capable of holding any data type (integer, string, float, etc.).
- DataFrame: Two-dimensional labeled data structure, similar to a spreadsheet or SQL table.
Key Features of Pandas:
- Handles missing data effectively.
- Data alignment for operations on data with differing indexes.
- Built-in grouping and aggregation functions.
- Tools for merging, reshaping, and filtering data.
- Excellent for data cleaning and preprocessing.
import pandas as pd
# Create a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Add a new column
df['Salary'] = [70000, 80000, 75000, 85000]
# Filter rows where Age > 30
filtered_df = df[df['Age'] > 30]
# Display the modified DataFrame
print("\nModified DataFrame:")
print(df)
# Display the filtered rows
print("\nFiltered DataFrame (Age > 30):")
print(filtered_df)
# Save the DataFrame to a CSV file
df.to_csv('sample_data.csv', index=False)
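o/p
Original DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 Diana 40 Houston
Modified DataFrame:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 80000
2 Charlie 35 Chicago 75000
3 Diana 40 Houston 85000
Filtered DataFrame (Age > 30):
Name Age City Salary
2 Charlie 35 Chicago 75000
3 Diana 40 Houston 85000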
2.Outline of Box-Plot and write syntax to create it.
A Box Plot (or Box-and-Whisker Plot) is a graphical representation of data distribution based on five summary statistics:
- Minimum: The smallest value in the dataset.
- First Quartile (Q1): The median of the lower half of the data.
- Median (Q2): The middle value of the dataset.
- Third Quartile (Q3): The median of the upper half of the data.
- Maximum: The largest value in the dataset.
Additionally:
- The Interquartile Range (IQR) is Q3−Q1.
- Whiskers extend to the smallest and largest values within 1.5×IQR from the quartiles.
- Outliers are points outside the whiskers, often shown as individual dots.
Purpose of a Box Plot:
- Summarizes data distribution.
- Identifies outliers.
- Compares distributions across different groups.
sample code:
import matplotlib.pyplot as plt
# Sample data
data = [7, 8, 8, 5, 4, 6, 7, 8, 3, 9, 10, 6, 4]
# Create a box plot
plt.boxplot(data)
plt.title('Box Plot Example')
plt.ylabel('Values')
plt.show()
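The summary statistics behind the plot can also be computed directly. The sketch below uses np.percentile with NumPy's default linear interpolation, which can give slightly different quartiles from the "median of each half" definition above; the variable names are illustrative.

```python
import numpy as np

data = [7, 8, 8, 5, 4, 6, 7, 8, 3, 9, 10, 6, 4]

# Quartiles via linear interpolation (NumPy's default percentile method).
q1, median, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1

# Values beyond these fences would be treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("Q1:", q1, "Median:", median, "Q3:", q3, "IQR:", iqr)
print("Fences:", lower_fence, upper_fence)
print("Outliers:", outliers)
```

For this dataset every value lies inside the fences, so the outlier list is empty and the whiskers reach the minimum and maximum.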
3.Outline of NumPy in python? Write a sample program.
NumPy (Numerical Python) is a fundamental library for numerical computing in Python, offering support for arrays, matrices, and many mathematical functions. Below is a structured outline:
Core Features
- N-dimensional Arrays: Homogeneous multidimensional arrays (ndarray).
- Mathematical Operations: Element-wise operations, linear algebra, and random number generation.
- Broadcasting: Apply operations on arrays of different but compatible shapes.
- Indexing and Slicing: Access array elements using slicing and indexing.
Key Functions
- Array Creation:
numpy.array() - Create arrays from lists/tuples.
numpy.zeros() / numpy.ones() - Initialize arrays with zeros/ones.
numpy.arange() - Create evenly spaced values with a given step.
numpy.linspace() - Create a given number of evenly spaced values over an interval.
- Array Operations:
Addition, subtraction, multiplication, division, dot products.
- Reshaping Arrays:
numpy.reshape(), numpy.transpose().
- Statistical Operations:
numpy.mean(), numpy.median(), numpy.std().
- Linear Algebra:
numpy.linalg.inv(), numpy.dot(), numpy.linalg.eig().
- Random Numbers:
numpy.random.rand(), numpy.random.randint().
sample code:
import numpy as np
# 1. Creating Arrays
a = np.array([1, 2, 3])
b = np.array([[1, 2], [3, 4]])
# 2. Array Operations
c = a + 5 # Element-wise addition
d = b * 2 # Element-wise multiplication
# 3. Mathematical Operations
sum_array = np.sum(a)
mean_array = np.mean(b)
product = np.dot([1, 2], [3, 4]) # Dot product
# 4. Array Reshaping
reshaped = b.reshape(4, 1)
# 5. Random Numbers
random_array = np.random.rand(3, 3)
# Display Results
print("Array a:", a)
print("Array b:\n", b)
print("Array c (a + 5):", c)
print("Array d (b * 2):\n", d)
print("Sum of a:", sum_array)
print("Mean of b:", mean_array)
print("Dot Product:", product)
print("Reshaped Array b:\n", reshaped)
print("Random Array:\n", random_array)
o/p:
Array a: [1 2 3]
Array b:
[[1 2]
[3 4]]
Array c (a + 5): [6 7 8]
Array d (b * 2):
[[ 2 4]
[ 6 8]]
Sum of a: 6
Mean of b: 2.5
Dot Product: 11
Reshaped Array b:
[[1]
[2]
[3]
[4]]
Random Array:
[[0.41677682 0.93526423 0.04242474]
[0.71555686 0.48251262 0.8344023 ]
[0.08776882 0.9741319 0.33967011]]
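Broadcasting, listed among the core features above, is not exercised by the sample program. A minimal sketch with illustrative arrays: shapes are compared from the last axis backwards, and an axis of length 1 (or a missing axis) is stretched to match.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])        # shape (2, 3)
row = np.array([10, 20, 30])          # shape (3,)
column = np.array([[100], [200]])     # shape (2, 1)

# The 1D row is stretched across both rows of the matrix.
print(matrix + row)
# The (2, 1) column is stretched across all three columns.
print(matrix + column)
```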
4.Develop a data frame using list.
In Python, you can use the pandas library to create a DataFrame from a list. A DataFrame is a two-dimensional, tabular data structure that is ideal for handling structured data. Below is an example:
import pandas as pd
# Single-dimensional list
data = [10, 20, 30, 40, 50]
# Create DataFrame from list
df_single = pd.DataFrame(data, columns=['Numbers'])
print("DataFrame from Single-dimensional List:")
print(df_single)
# Two-dimensional list (list of lists)
data_2d = [
[1, 'Alice', 23],
[2, 'Bob', 27],
[3, 'Charlie', 22]
]
# Create DataFrame from 2D list
df_2d = pd.DataFrame(data_2d, columns=['ID', 'Name', 'Age'])
print("\nDataFrame from Two-dimensional List:")
print(df_2d)
o/p:
DataFrame from Single-dimensional List:
Numbers
0 10
1 20
2 30
3 40
4 50
DataFrame from Two-dimensional List:
ID Name Age
0 1 Alice 23
1 2 Bob 27
2 3 Charlie 22
5. Finding the Transpose of a NumPy array - transpose() method and Reshaping a NumPy array
A.
Finding the Transpose of a NumPy Array
The numpy.transpose() method or .T attribute is used to get the transpose of an array. The transpose swaps the array's rows and columns.
Syntax -
numpy.transpose(a, axes=None)
- a: The array to be transposed.
- axes: Optional; specify the order of axes. Default is to reverse axes.
Example: Transpose of a NumPy Array
import numpy as np
# Original Array
array = np.array([[1, 2, 3], [4, 5, 6]])
# Transpose using .T
transpose1 = array.T
# Transpose using numpy.transpose()
transpose2 = np.transpose(array)
print("Original Array:")
print(array)
print("\nTranspose using .T:")
print(transpose1)
print("\nTranspose using numpy.transpose():")
print(transpose2)
output -
Original Array:
[[1 2 3]
[4 5 6]]
Transpose using .T:
[[1 4]
[2 5]
[3 6]]
Transpose using numpy.transpose():
[[1 4]
[2 5]
[3 6]]
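For arrays with more than two dimensions, the axes argument decides which axis goes where. A small sketch, using an illustrative 3D array built with np.arange:

```python
import numpy as np

cube = np.arange(24).reshape(2, 3, 4)   # shape (2, 3, 4)

# Default: the axes are reversed, so the shape becomes (4, 3, 2).
reversed_axes = np.transpose(cube)

# Explicit order: keep axis 0 first, swap axes 1 and 2 -> shape (2, 4, 3).
swapped = np.transpose(cube, axes=(0, 2, 1))

print(reversed_axes.shape)
print(swapped.shape)
```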
Reshaping a NumPy Array
The numpy.reshape() method changes the shape of an array without altering its data.
Syntax
numpy.reshape(a, newshape)
- a: The array to be reshaped.
- newshape: A tuple specifying the desired shape. The product of dimensions must match the size of the original array.
Example: Reshaping a NumPy Array
import numpy as np
# Original Array
array = np.array([1, 2, 3, 4, 5, 6])
# Reshape into a 2x3 array
reshaped1 = array.reshape(2, 3)
# Reshape into a 3x2 array
reshaped2 = array.reshape(3, 2)
# Reshape using -1 (automatic dimension calculation)
reshaped_auto = array.reshape(2, -1)
print("Original Array:")
print(array)
print("\nReshaped into 2x3 Array:")
print(reshaped1)
print("\nReshaped into 3x2 Array:")
print(reshaped2)
print("\nReshaped with Automatic Calculation (2x-1):")
print(reshaped_auto)
Output
Original Array:
[1 2 3 4 5 6]
Reshaped into 2x3 Array:
[[1 2 3]
[4 5 6]]
Reshaped into 3x2 Array:
[[1 2]
[3 4]
[5 6]]
Reshaped with Automatic Calculation (2x-1):
[[1 2 3]
[4 5 6]]
Key Points
- Transpose: Rearranges rows and columns.
- Reshape: Changes the dimensionality of the array while preserving data.
- Use -1 in reshape() to let NumPy calculate one dimension automatically.
6.Finding Mean, Median and Standard deviation on NumPy arrays
Explanation:
- Mean: Average of all elements in the array.
- Median: Middle value when the elements are sorted.
- Standard Deviation: Measure of the amount of variation or dispersion of data values.
Program:
import numpy as np
# Create a NumPy array
data = np.array([4, 2, 7, 5, 9])
# Calculate Mean
mean = np.mean(data) # Sum of all elements divided by total count
# Calculate Median
median = np.median(data) # Middle value in sorted array
# Calculate Standard Deviation
std_dev = np.std(data) # Square root of the average of squared deviations from the mean
print("Array:", data)
print("Mean:", mean)
print("Median:", median)
print("Standard Deviation:", std_dev)
Output:
Array: [4 2 7 5 9]
Mean: 5.4
Median: 5.0
Standard Deviation: 2.4166 (rounded; NumPy prints the full-precision value of sqrt(5.84))
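np.std() computes the population standard deviation by default (it divides by n). If the sample standard deviation (dividing by n - 1) is wanted instead, the ddof argument can be set to 1, as sketched below.

```python
import numpy as np

data = np.array([4, 2, 7, 5, 9])

population_std = np.std(data)          # divides by n (ddof=0, the default)
sample_std = np.std(data, ddof=1)      # divides by n - 1

print("Population std:", population_std)
print("Sample std:", sample_std)
```

For this array the squared deviations from the mean (5.4) sum to 29.2, so the two results are sqrt(29.2 / 5) and sqrt(29.2 / 4).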
7.Write a program to print the eigen values and eigen vectors of a matrix
Explanation:
- Eigenvalues: Scalars λ for which A·v = λ·v has a non-zero solution v; each one gives the factor by which its eigenvector is scaled.
- Eigenvectors: Non-zero vectors v that change only in scale (not in direction) when the linear transformation A is applied.
Program:
import numpy as np
# Create a square matrix
matrix = np.array([[4, 2],
[3, 1]])
# Compute Eigenvalues and Eigenvectors
eigen_values, eigen_vectors = np.linalg.eig(matrix)
print("Matrix:")
print(matrix)
print("\nEigenvalues:")
print(eigen_values)
print("\nEigenvectors:")
print(eigen_vectors)
Output:
Matrix:
[[4 2]
[3 1]]
Eigenvalues:
[5.37228132 -0.37228132]
Eigenvectors:
[[ 0.82456484 -0.41597356]
[ 0.56576746 0.90937671]]
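One way to check the result is to confirm that A·v = λ·v holds for each pair. In the sketch below, note that np.linalg.eig returns the eigenvectors as the columns of the second array.

```python
import numpy as np

matrix = np.array([[4, 2],
                   [3, 1]])
eigen_values, eigen_vectors = np.linalg.eig(matrix)

# Column i of eigen_vectors pairs with eigen_values[i].
for i in range(len(eigen_values)):
    v = eigen_vectors[:, i]
    lhs = matrix @ v
    rhs = eigen_values[i] * v
    print(f"lambda = {eigen_values[i]:.4f}, A v == lambda v:", np.allclose(lhs, rhs))
```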
8.Compute and display the sine and cosine of a matrix.
Explanation:
- Sine (sin) and cosine (cos) are trigonometric functions.
- These are applied element-wise to a matrix.
program
import numpy as np
# Create a matrix
matrix = np.array([[0, np.pi/2],
[np.pi, 3*np.pi/2]])
# Compute Sine
sine_matrix = np.sin(matrix) # Sine of each element in the matrix
# Compute Cosine
cosine_matrix = np.cos(matrix) # Cosine of each element in the matrix
print("Matrix:")
print(matrix)
print("\nSine of Matrix:")
print(sine_matrix)
print("\nCosine of Matrix:")
print(cosine_matrix)
output
Matrix:
[[0. 1.57079633]
[3.14159265 4.71238898]]
Sine of Matrix:
[[ 0.0000000e+00 1.0000000e+00]
[ 1.2246468e-16 -1.0000000e+00]]
Cosine of Matrix:
[[ 1.0000000e+00 6.1232340e-17]
[-1.0000000e+00 -1.8369702e-16]]
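Values such as 1.2246468e-16 are floating-point residue: mathematically they are zero. A small sketch of tidying the output by rounding is shown below; the choice of 10 decimals is arbitrary. The functions here work element by element; the matrix-function sense of sine and cosine is a different operation (SciPy offers scipy.linalg.sinm and scipy.linalg.cosm for that).

```python
import numpy as np

matrix = np.array([[0, np.pi / 2],
                   [np.pi, 3 * np.pi / 2]])

# Rounding hides floating-point residue; adding 0.0 turns any -0.0 into 0.0.
sine_clean = np.round(np.sin(matrix), 10) + 0.0
cosine_clean = np.round(np.cos(matrix), 10) + 0.0

print(sine_clean)
print(cosine_clean)
```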
1.i) Illustrate the Data Science with its types.
ii) Take any two lists and construct a program to draw a scatter plot with a grid and appropriate titles for the graph, X-axis and Y-axis.
i) Illustrating Data Science and its Types
What is Data Science?
Data Science is an interdisciplinary field that combines domain expertise, programming skills, and knowledge of mathematics and statistics to extract meaningful insights from structured and unstructured data. It involves processes such as data collection, cleaning, exploration, modeling, and visualization to solve real-world problems.
Types of Data Science
- Descriptive Analytics: Focuses on summarizing historical data to understand what happened in the past. Example: Dashboards showing sales trends.
- Diagnostic Analytics: Explores data to determine why a certain event occurred. Example: Identifying reasons for a decline in website traffic.
- Predictive Analytics: Uses statistical models and machine learning algorithms to forecast future outcomes. Example: Predicting stock prices.
- Prescriptive Analytics: Suggests actions to achieve desired outcomes. Example: Recommending products on e-commerce platforms.
- Exploratory Data Analysis (EDA): Involves visualizing data to uncover patterns and relationships. Example: Scatter plots, histograms.
- Machine Learning/AI Models: Algorithms used for classification, regression, clustering, and more. Example: Fraud detection systems.
ii) Python Program to Plot a Scatter Plot with a Grid
Below is a program that demonstrates how to create a scatter plot using two lists. It includes adding a grid and setting appropriate titles for the graph, X-axis, and Y-axis.
Code Example
import matplotlib.pyplot as plt
# Data for scatter plot
x = [10, 20, 30, 40, 50] # X-axis data
y = [15, 30, 10, 45, 25] # Y-axis data
# Creating the scatter plot
plt.scatter(x, y, color='blue', label='Data Points', s=100) # 's' for marker size
# Adding grid to the plot
plt.grid(True, linestyle='--', linewidth=0.5, alpha=0.7)
# Adding titles and labels
plt.title('Scatter Plot Example with Grid', fontsize=14, fontweight='bold')
plt.xlabel('X-Axis Label', fontsize=12)
plt.ylabel('Y-Axis Label', fontsize=12)
# Adding legend
plt.legend()
# Displaying the plot
plt.show()
Explanation of the Code
Importing Libraries:
- matplotlib.pyplot is imported for plotting the graph.
Data Initialization:
- Two lists, x and y, contain the data points for the X and Y axes.
Scatter Plot:
- plt.scatter(x, y, ...) creates the scatter plot.
- The color of points is set to blue, and the size (s) is increased for better visibility.
Grid Addition:
- plt.grid() adds a grid to the background with customizable line style, width, and transparency (alpha).
Labels and Titles:
- plt.title() sets the graph's title.
- plt.xlabel() and plt.ylabel() add labels to the X and Y axes.
Legend:
- plt.legend() provides a label for the data points.
Display:
- plt.show() renders the scatter plot.
Output:
The plot will feature:
- A set of scattered points.
- A grid with dashed lines for clarity.
- Title: Scatter Plot Example with Grid.
- X-axis label: X-Axis Label.
- Y-axis label: Y-Axis Label.
2. Explain SciPy and construct a program to calculate the Inverse and Pseudo-Inverse of an input Matrix.
SciPy: Overview
What is SciPy?
SciPy is a Python library used for scientific and technical computing. It builds on the NumPy library and provides a wide range of modules for optimization, integration, interpolation, eigenvalue problems, algebraic equations, and other standard scientific computing tasks.
Key Features of SciPy:
- Scientific Functions: Advanced math functions like integration, optimization, and solving differential equations.
- Linear Algebra: Matrix operations, eigenvalues, singular value decomposition (SVD), and solving linear systems.
- Statistical Functions: Probability distributions, statistical tests, and random sampling.
- Signal and Image Processing: Tools for filtering, Fourier transforms, and image manipulation.
Program to Calculate the Inverse and Pseudo-Inverse of a Matrix
The inverse of a square matrix exists only if the matrix is non-singular (determinant ≠ 0). The pseudo-inverse (Moore-Penrose inverse) is a generalized inverse that can be computed for any matrix, including non-square or singular matrices.
Code Example
import numpy as np
from scipy.linalg import inv, pinv # Importing inverse and pseudo-inverse functions
# Input matrix
matrix = np.array([[1, 2], [3, 4]])
# Calculate inverse
try:
    inverse_matrix = inv(matrix) # Compute inverse using SciPy
    print("Inverse of the Matrix:")
    print(inverse_matrix)
except np.linalg.LinAlgError:
    print("Matrix is singular and does not have an inverse.")
# Calculate pseudo-inverse
pseudo_inverse_matrix = pinv(matrix) # Compute pseudo-inverse using SciPy
print("\nPseudo-Inverse of the Matrix:")
print(pseudo_inverse_matrix)
Explanation of the Code
Importing Libraries:
- numpy is used to create and handle matrices.
- scipy.linalg.inv computes the matrix inverse.
- scipy.linalg.pinv computes the pseudo-inverse of the matrix.
Input Matrix:
- A 2x2 matrix is initialized as [[1, 2], [3, 4]].
Matrix Inverse:
- inv(matrix) computes the inverse.
- If the matrix is singular (non-invertible), an exception is raised, which is handled using a try-except block.
Matrix Pseudo-Inverse:
- pinv(matrix) computes the Moore-Penrose pseudo-inverse, which exists for all matrices.
Output:
- The program prints the inverse (if possible) and the pseudo-inverse of the given matrix.
Output Example
For the input matrix [[1, 2], [3, 4]], the output will be:
Inverse of the Matrix:
[[-2. 1. ]
[ 1.5 -0.5]]
Pseudo-Inverse of the Matrix:
[[-2. 1. ]
[ 1.5 -0.5]]
If the matrix is singular (e.g., [[1, 2], [2, 4]]):
Matrix is singular and does not have an inverse.
Pseudo-Inverse of the Matrix:
[[0.04 0.08]
[0.08 0.16]]
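The pseudo-inverse also exists for non-square matrices. The sketch below checks two of the Moore-Penrose conditions on an illustrative 3x2 matrix. To keep the example dependent on NumPy only, it uses numpy.linalg.pinv, NumPy's counterpart to scipy.linalg.pinv, rather than the SciPy function used above.

```python
import numpy as np

# A 3x2 matrix has no ordinary inverse, but its pseudo-inverse exists.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
A_pinv = np.linalg.pinv(A)   # shape (2, 3)

# Two of the Moore-Penrose conditions: A A+ A = A and A+ A A+ = A+.
print(A_pinv.shape)
print(np.allclose(A @ A_pinv @ A, A))
print(np.allclose(A_pinv @ A @ A_pinv, A_pinv))
```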
3.i) Illustrate the Data Science process.
ii) Explain MatplotLib and construct a sample program to draw a line
A.
(i) Illustrating the Data Science Process
The Data Science process involves a series of steps aimed at extracting meaningful insights and creating actionable outputs from data. Below is an overview of the process:
- Define the Problem: Understand the business problem or objective that the data science project aims to address.
- Data Collection: Gather data from various sources, including databases, APIs, web scraping, or user-generated data.
- Data Cleaning: Remove inconsistencies, handle missing data, and correct errors to prepare the dataset.
- Exploratory Data Analysis (EDA): Analyze the dataset to find patterns, relationships, and insights using statistics and visualizations.
- Feature Engineering: Create meaningful features, normalize data, and encode categorical variables to improve model performance.
- Modeling: Select algorithms, train machine learning models, and tune hyperparameters.
- Evaluation: Assess the model's performance using metrics like accuracy, precision, recall, and F1-score.
- Deployment: Deploy the model into production or generate reports for stakeholders.
- Monitoring and Maintenance: Continuously monitor model performance and update as necessary.
Define Problem --> Collect Data --> Clean Data --> EDA --> Feature Engineering --> Model Building --> Evaluation --> Deployment --> Monitoring
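A few of these steps can be sketched on a toy problem. Everything below is an assumption made for illustration: the dataset (study hours vs. exam score) is invented, the "model" is a straight-line fit with np.polyfit, and the error is measured on the training data only, which a real project would avoid.

```python
import numpy as np
import pandas as pd

# Collect: a made-up dataset of study hours vs. exam scores.
raw = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, None],
    "score": [52, 54, 56, 58, 60, 70],
})

# Clean: drop rows with missing values.
clean = raw.dropna()

# Explore: summary statistics.
print(clean.describe())

# Model: fit a straight line score = slope * hours + intercept.
slope, intercept = np.polyfit(clean["hours"], clean["score"], deg=1)

# Evaluate: mean absolute error on the same data used for fitting.
predicted = slope * clean["hours"] + intercept
mae = float(np.mean(np.abs(predicted - clean["score"])))
print("slope:", slope, "intercept:", intercept, "MAE:", mae)
```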
(ii) Matplotlib Explanation and Sample Program
Matplotlib: Matplotlib is a widely used Python library for creating static, interactive, and animated visualizations. It is highly versatile and provides functions for creating a variety of plots, including line plots, bar charts, histograms, scatter plots, and more.
Key Features:
- Easy integration with NumPy and Pandas.
- Customizable plots with options for titles, labels, colors, etc.
- Ability to save plots in multiple formats (e.g., PNG, PDF).
Sample Program to Draw a Line
import matplotlib.pyplot as plt
# Data for the plot
x = [0, 1, 2, 3, 4, 5]
y = [0, 1, 4, 9, 16, 25]
# Creating the line plot
plt.plot(x, y, label='y = x^2', color='blue', linestyle='-', marker='o')
# Adding labels and title
plt.title('Line Plot Example')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.legend()
# Display the plot
plt.show()
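Saving a plot to a file, mentioned among the key features, can be sketched with savefig(). The file names below are arbitrary, and the "Agg" backend is selected only so that the example runs without opening a window.

```python
import os
import matplotlib
matplotlib.use("Agg")            # non-interactive backend; no window is opened
import matplotlib.pyplot as plt

x = [0, 1, 2, 3, 4, 5]
y = [0, 1, 4, 9, 16, 25]

fig, ax = plt.subplots()
ax.plot(x, y, marker="o")
ax.set_title("Line Plot Example")

# The file extension selects the output format.
fig.savefig("line_plot.png", dpi=150)
fig.savefig("line_plot.pdf")
plt.close(fig)

print(os.path.getsize("line_plot.png") > 0)
```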
4.Explain NumPy and construct programs for the following i) Printing the
Dimensions of NumPy arrays ii) Displaying the Shape of NumPy array iii) Finding
the Size of NumPy array iv) Flattening a NumPy array
A.NumPy (Numerical Python) is a powerful Python library used for numerical computing. It provides support for:
- Multidimensional arrays and matrices.
- A large collection of mathematical functions to operate on arrays.
- Efficient handling of data due to its optimized C implementation.
Programs for the Requested Tasks
i) Printing the Dimensions of NumPy Arrays
The dimensions of a NumPy array can be obtained using the .ndim attribute.
import numpy as np
# Example array
array_1 = np.array([[1, 2, 3], [4, 5, 6]])
# Print dimensions
print("Dimensions of the array:", array_1.ndim)
ii) Displaying the Shape of a NumPy Array
The shape of a NumPy array can be obtained using the .shape attribute.
# Display the shape
print("Shape of the array:", array_1.shape)
iii) Finding the Size of a NumPy Array
The size (total number of elements) in a NumPy array can be obtained using the .size attribute.
# Find the size
print("Size of the array:", array_1.size)
iv) Flattening a NumPy Array
Flattening an array converts a multidimensional array into a 1D array using the .flatten() method.
# Flatten the array
flattened_array = array_1.flatten()
print("Flattened array:", flattened_array)
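A related method, .ravel(), also returns a 1D array but avoids copying when it can. The sketch below shows the difference: flatten() always returns an independent copy, while ravel() on a contiguous array returns a view of the same data.

```python
import numpy as np

array_1 = np.array([[1, 2, 3], [4, 5, 6]])

flat_copy = array_1.flatten()   # always a new, independent array
flat_view = array_1.ravel()     # a view of the same data when possible

array_1[0, 0] = 99
print(flat_copy[0])   # still the old value
print(flat_view[0])   # reflects the change
```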
5. Explain Pandas DataFrame and illustrate the way to create it with a program.
A Pandas DataFrame is a two-dimensional, mutable, and heterogeneous data structure in Python that is similar to a table in databases or an Excel spreadsheet. It is part of the Pandas library and provides labeled axes (rows and columns) to store and manipulate data efficiently.
Key features:
- Supports labeled data (row and column labels).
- Can hold data of different types (integers, floats, strings, etc.).
- Offers powerful operations for data analysis and manipulation.
Creating a Pandas DataFrame
A Pandas DataFrame can be created from various data structures:
- Dictionary of lists/arrays.
- List of dictionaries.
- NumPy arrays.
- External data sources like CSV or Excel files.
Example: Creating a Pandas DataFrame
Here is a program illustrating multiple ways to create a DataFrame.
import pandas as pd
# 1. Creating a DataFrame from a dictionary of lists
data = {
"Name": ["Alice", "Bob", "Charlie"],
"Age": [25, 30, 35],
"City": ["New York", "Los Angeles", "Chicago"]
}
df1 = pd.DataFrame(data)
print("DataFrame from a dictionary of lists:")
print(df1)
# 2. Creating a DataFrame from a list of dictionaries
data_list = [
{"Name": "David", "Age": 40, "City": "Seattle"},
{"Name": "Eve", "Age": 28, "City": "Boston"}
]
df2 = pd.DataFrame(data_list)
print("\nDataFrame from a list of dictionaries:")
print(df2)
# 3. Creating a DataFrame from a NumPy array
import numpy as np
array_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df3 = pd.DataFrame(array_data, columns=["Column1", "Column2", "Column3"])
print("\nDataFrame from a NumPy array:")
print(df3)
Output
DataFrame from a dictionary of lists:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
DataFrame from a list of dictionaries:
Name Age City
0 David 40 Seattle
1 Eve 28 Boston
DataFrame from a NumPy array:
Column1 Column2 Column3
0 1 2 3
1 4 5 6
2 7 8 9
Notes
- Custom Indexing: You can specify custom row labels (index) using the index parameter.
- Custom Column Names: Specify column names using the columns parameter.
- Integration: DataFrames are highly compatible with NumPy and other Python libraries, making them essential for data science and machine learning tasks.
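A short sketch of the index and columns parameters; the row labels and values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [[25, "New York"], [30, "Chicago"]],
    index=["alice", "bob"],          # custom row labels
    columns=["Age", "City"],         # custom column names
)

print(df)
print(df.loc["bob", "City"])   # label-based lookup
```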
6.Explain Seaborn and construct a sample program to make pairplot.
Seaborn is a Python data visualization library built on top of Matplotlib. It provides:
- A high-level interface for creating attractive and informative statistical graphics.
- Built-in themes and color palettes to make charts visually appealing.
- Simplified creation of complex visualizations, like pair plots, heatmaps, violin plots, and more.
What is a Pair Plot?
A pair plot is a grid of scatter plots and histograms used to visualize pairwise relationships in a dataset. It is particularly useful for exploring the relationships between multiple variables and their distributions.
Sample Program to Create a Pair Plot
Here is an example using Seaborn's pairplot() function.
import seaborn as sns
import matplotlib.pyplot as plt
# Load a sample dataset
data = sns.load_dataset('iris')
# Create a pair plot
sns.pairplot(data, hue='species', diag_kind='kde', palette='Set2')
# Show the plot
plt.show()
Explanation of the Code
Loading Dataset:
- sns.load_dataset('iris'): Loads the famous Iris dataset, which contains measurements of sepal and petal dimensions for three species of iris flowers.
Creating the Pair Plot:
- hue='species': Colors the points based on the species of the iris.
- diag_kind='kde': Displays kernel density estimation (KDE) plots on the diagonal to show the distribution of each variable.
- palette='Set2': Sets a predefined color palette for better visual appeal.
Displaying the Plot:
- plt.show(): Displays the plot in the output window.
Output
The pair plot will:
- Show scatter plots for pairwise combinations of all numerical variables (sepal length, sepal width, petal length, and petal width).
- Color code the points based on the species of the iris.
- Show KDE plots on the diagonal for each variable.
Key Features of Seaborn Pair Plot
- It’s customizable, with options to change the types of plots (diag_kind='kde' or diag_kind='hist'), color palettes, markers, etc.
- Automatically handles categorical data via the hue parameter.
- Can add regression lines or other overlays for advanced analysis.
7. Explain DataFrame methods with at least 12 functions and their respective descriptions.
A.
Pandas DataFrame Methods with Descriptions and Examples
Here’s a list of 12 commonly used Pandas DataFrame methods with brief explanations and examples:
1. head(n)
- Description: Returns the first n rows of the DataFrame.
- Use Case: Quickly view the top rows of your dataset.
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
print(df.head(2)) # First 2 rows
2. tail(n)
- Description: Returns the last n rows of the DataFrame.
- Use Case: Quickly inspect the bottom rows of your dataset.
print(df.tail(1)) # Last row
3. info()
- Description: Displays a concise summary of the DataFrame, including data types, non-null counts, and memory usage.
- Use Case: Understand the structure of your dataset.
4. describe()
- Description: Generates descriptive statistics (count, mean, min, max, etc.) for numerical columns.
- Use Case: Quickly summarize the central tendency and dispersion of data.
5. shape
- Description: Returns the dimensions of the DataFrame as a tuple (rows, columns).
- Use Case: Determine the size of your dataset.
6. columns
- Description: Returns the column labels of the DataFrame as an Index object (use list(df.columns) to get a plain list).
- Use Case: Check column names for further manipulation.
7. sort_values(by)
- Description: Sorts the DataFrame by the specified column(s).
- Use Case: Reorganize data based on specific criteria.
sorted_df = df.sort_values(by="B")
print(sorted_df)
8. drop(columns)
- Description: Removes specified columns or rows.
- Use Case: Eliminate unnecessary data from your DataFrame.
df_dropped = df.drop(columns=["A"])
print(df_dropped)
9. isnull()
- Description: Returns a DataFrame of the same shape with boolean values indicating NaN (missing) values.
- Use Case: Identify missing data in your dataset.
10. fillna(value)
- Description: Fills missing values (NaN) with the specified value or method.
- Use Case: Handle missing data.
df_filled = df.fillna(0)
print(df_filled)
11. groupby(by)
- Description: Groups the DataFrame by the specified column(s) for aggregation.
- Use Case: Perform aggregate calculations like sum, mean, etc., on grouped data.
grouped = df.groupby("A").sum()
print(grouped)
12. merge()
- Description: Merges two DataFrames based on a key column or index.
- Use Case: Combine datasets for analysis
df1 = pd.DataFrame({"Key": [1, 2], "Value1": ["A", "B"]})
df2 = pd.DataFrame({"Key": [1, 2], "Value2": ["C", "D"]})
merged_df = pd.merge(df1, df2, on="Key")
print(merged_df)
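Several entries above (info(), describe(), shape, columns, isnull()) are listed without an example, and the fillna() example runs on a DataFrame that has no missing values. The sketch below exercises them together on a small, made-up DataFrame that does contain a NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4, 5, 6]})

df.info()                       # prints dtypes and non-null counts
print(df.describe())            # count, mean, std, min, quartiles, max
print(df.shape)                 # (rows, columns)
print(list(df.columns))         # column labels as a plain list
print(df.isnull().sum())        # number of missing values per column
print(df.fillna(0))             # NaN in column A replaced by 0
```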