Python Data Visualization and Analysis Guide

Step 1: Install Necessary Libraries

Install the required libraries using pip:

pip install pandas matplotlib seaborn

Step 2: Import Required Libraries

import pandas as pd # For data manipulation
import matplotlib.pyplot as plt # For basic plotting
import seaborn as sns # For advanced statistical visualizations

Step 3: Load Data into a DataFrame

df = pd.read_csv('your_dataset.csv')

If your data comes from another source, use the matching pandas reader instead, for example pd.read_excel for Excel files or pd.read_sql for SQL databases.
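
As an illustration of loading from something other than a CSV file, the sketch below reads a table from an in-memory SQLite database with pd.read_sql_query. The table name sales and its columns are made up for the example; a real database would need its own connection and query.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database (hypothetical table and columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, value REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("A", 10.0), ("B", 20.5), ("A", 7.25)],
)
conn.commit()

# Load the query result straight into a DataFrame.
df_sql = pd.read_sql_query("SELECT category, value FROM sales", conn)
conn.close()
print(df_sql)
```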

If you do not have a dataset of your own, you can use sample data instead.
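
One way to get sample data is to generate it. The sketch below builds a small synthetic DataFrame whose column names match the placeholders used in this guide, so the later snippets can run against it. The values are random and carry no real-world meaning; the seed, row count, and distributions are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sample is reproducible
n = 120  # arbitrary number of rows

df = pd.DataFrame({
    "date_column": pd.date_range("2024-01-01", periods=n, freq="D"),
    "category_column": rng.choice(["A", "B", "C"], size=n),
    "another_category_column": rng.choice(["X", "Y"], size=n),
    "numeric_column_1": rng.normal(loc=50, scale=10, size=n),
})
# Derived columns so the later plots have some structure to show.
df["numeric_column_2"] = 2 * df["numeric_column_1"] + rng.normal(0, 5, size=n)
df["numeric_column_3"] = rng.uniform(0, 100, size=n)
df["value_column"] = df["numeric_column_1"].round(1)
df["numeric_column"] = df["numeric_column_3"]

print(df.head())
```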

Step 4: Explore the Data

4.1 View the first few rows of the DataFrame:

df.head()

4.2 Get basic statistics:

df.describe()

4.3 Check for missing values:

df.isnull().sum()

4.4 View data types:

df.dtypes
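
To make the exploration calls concrete, here is a minimal sketch on a four-row frame with two deliberately missing values. The frame df_demo and its column names are invented for the example.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (hypothetical columns) with two missing scores.
df_demo = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "score": [1.5, np.nan, 3.0, np.nan],
    "count": [1, 2, 3, 4],
})

missing_per_column = df_demo.isnull().sum()  # missing cells per column
summary = df_demo.describe()                 # numeric columns only by default

print(missing_per_column)
print(df_demo.dtypes)
print(summary)
```

Note that describe() counts only non-missing values, so the count row for score is lower than the number of rows in the frame.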

Step 5: Data Cleaning (Optional)

5.1 Drop missing values:

df_clean = df.dropna()

5.2 Fill missing values:

df['column_name'] = df['column_name'].fillna(df['column_name'].mean()) # Assign the result back; inplace=True on a selected column is deprecated in recent pandas

5.3 Drop duplicate rows:

df_clean = df_clean.drop_duplicates() # Continues from df_clean in 5.1; use df.drop_duplicates() if you skipped that step
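
Dropping and filling are alternative ways to handle missing values, and they leave you with different row counts. The sketch below compares them on a five-row frame; the frame raw and its values are invented for the example.

```python
import numpy as np
import pandas as pd

# Illustrative data: one duplicated row and one missing value.
raw = pd.DataFrame({
    "category_column": ["A", "A", "B", "B", "C"],
    "value_column": [1.0, 1.0, np.nan, 4.0, 7.0],
})

# Option 1: drop rows with any missing value, then drop exact duplicates.
dropped = raw.dropna().drop_duplicates()

# Option 2: keep every row and fill the gap with the column mean instead.
filled = raw.copy()
filled["value_column"] = filled["value_column"].fillna(filled["value_column"].mean())

print(len(raw), len(dropped), len(filled))
```

Filling keeps all five rows but replaces the gap with an estimate (here the mean of the four known values), while dropping keeps only observed data at the cost of fewer rows.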

Step 6: Data Visualization

6.1 Line Plot (for Time Series or Continuous Data)

plt.figure(figsize=(10, 6))
plt.plot(df['date_column'], df['value_column'])
plt.title('Line Plot of Values Over Time')
plt.xlabel('Date')
plt.ylabel('Value')
plt.show()
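
A line plot over time usually reads correctly only if the date column holds real timestamps and the rows are in chronological order. The sketch below shows one way to prepare such a column before plotting; the three-row frame ts is invented, and the Agg backend line is only there so the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Dates stored as text and out of order (illustrative values).
ts = pd.DataFrame({
    "date_column": ["2024-03-01", "2024-01-01", "2024-02-01"],
    "value_column": [30, 10, 20],
})

# Convert text to real timestamps and put the rows in chronological order.
ts["date_column"] = pd.to_datetime(ts["date_column"])
ts = ts.sort_values("date_column")

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(ts["date_column"], ts["value_column"])
ax.set_xlabel("Date")
ax.set_ylabel("Value")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
plt.close(fig)       # release the figure; use plt.show() when working interactively
```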

6.2 Bar Plot (for Categorical Data)

plt.figure(figsize=(10, 6))
sns.barplot(x='category_column', y='value_column', data=df)
plt.title('Bar Plot of Categories vs Values')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()

6.3 Histogram (for Distribution of Data)

plt.figure(figsize=(10, 6))
plt.hist(df['numeric_column'], bins=20, color='blue', alpha=0.7)
plt.title('Histogram of Numeric Column')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()

6.4 Box Plot (for Detecting Outliers)

plt.figure(figsize=(10, 6))
sns.boxplot(x='category_column', y='value_column', data=df)
plt.title('Box Plot of Values by Category')
plt.xlabel('Category')
plt.ylabel('Value')
plt.show()
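
By default, box plot whiskers extend to the last point within 1.5 times the interquartile range (IQR) of the box, and anything beyond is drawn as an individual outlier point. The sketch below applies the same rule numerically so the flagged values can be listed; the series values is invented for the example.

```python
import pandas as pd

# Illustrative values with one obvious outlier.
values = pd.Series([10, 12, 11, 13, 12, 11, 40])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# The conventional 1.5 * IQR fences.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```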

6.5 Scatter Plot (for Relationships Between Two Variables)

plt.figure(figsize=(10, 6))
plt.scatter(df['numeric_column_1'], df['numeric_column_2'])
plt.title('Scatter Plot of Numeric Column 1 vs Numeric Column 2')
plt.xlabel('Numeric Column 1')
plt.ylabel('Numeric Column 2')
plt.show()

6.6 Pair Plot (for Multivariate Data Exploration)

sns.pairplot(df[['numeric_column_1', 'numeric_column_2', 'numeric_column_3']])
plt.show()

6.7 Correlation Heatmap (for Correlation Between Numerical Variables)

plt.figure(figsize=(10, 6))
correlation_matrix = df.corr(numeric_only=True) # Skip text and date columns, which would raise an error in pandas 2.0+
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()
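
Correlation is only defined for numeric columns, so a frame that mixes text and numbers has to be filtered first. A minimal sketch of one way to do that with select_dtypes, using an invented four-row frame:

```python
import pandas as pd

# Illustrative frame mixing one text column with three numeric ones.
mixed = pd.DataFrame({
    "category_column": ["A", "B", "A", "B"],
    "numeric_column_1": [1.0, 2.0, 3.0, 4.0],
    "numeric_column_2": [2.0, 4.0, 6.0, 8.0],
    "numeric_column_3": [4.0, 3.0, 2.0, 1.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_part = mixed.select_dtypes(include="number")
corr_demo = numeric_part.corr()
print(corr_demo.round(2))
```

The resulting square matrix is what sns.heatmap expects: values near 1 or -1 mark strong positive or negative linear relationships.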

Step 7: Data Analysis

7.1 Grouping Data (GroupBy)

grouped_data = df.groupby('category_column')['value_column'].mean()
print(grouped_data)

7.2 Aggregating Data

aggregated_data = df.groupby('category_column').agg({
    'value_column': ['mean', 'sum', 'count']
})
print(aggregated_data)
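
The dictionary form of agg produces two-level column names such as ('value_column', 'mean'). If flat, self-describing names are easier to work with, pandas also supports named aggregation. A short sketch, with an invented frame sales and output names chosen for the example:

```python
import pandas as pd

# Illustrative data: two categories with a few values each.
sales = pd.DataFrame({
    "category_column": ["A", "A", "B", "B", "B"],
    "value_column": [10, 20, 5, 15, 40],
})

# Each keyword names an output column: (input column, aggregation function).
summary = sales.groupby("category_column").agg(
    mean_value=("value_column", "mean"),
    total_value=("value_column", "sum"),
    n_rows=("value_column", "count"),
)
print(summary)
```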

7.3 Pivot Table

pivot = df.pivot_table(values='value_column', index='category_column', columns='another_category_column', aggfunc='mean')
print(pivot)
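
To see what the pivot table computes, it helps to run it on data small enough to check by hand. In the sketch below (the frame orders is invented), each cell holds the mean of value_column for one combination of row category and column category.

```python
import pandas as pd

# Illustrative data: category A appears twice with X, so that cell is an average.
orders = pd.DataFrame({
    "category_column": ["A", "A", "A", "B", "B"],
    "another_category_column": ["X", "X", "Y", "X", "Y"],
    "value_column": [10, 30, 5, 8, 12],
})

pivot_demo = orders.pivot_table(
    values="value_column",
    index="category_column",
    columns="another_category_column",
    aggfunc="mean",
)
print(pivot_demo)
```

The (A, X) cell is the mean of 10 and 30; the other cells each come from a single row.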

7.4 Crosstab (Contingency Table)

crosstab = pd.crosstab(df['category_column_1'], df['category_column_2'])
print(crosstab)
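
A crosstab counts rows per combination of two categories. Raw counts can be hard to compare when the groups differ in size, so the sketch below also shows the normalize="index" option, which turns each row into proportions. The frame survey is invented for the example.

```python
import pandas as pd

# Illustrative data: group B has more rows than group A.
survey = pd.DataFrame({
    "category_column_1": ["A", "A", "B", "B", "B"],
    "category_column_2": ["yes", "no", "yes", "yes", "no"],
})

counts = pd.crosstab(survey["category_column_1"], survey["category_column_2"])

# normalize="index" rescales each row so its entries sum to 1.
shares = pd.crosstab(
    survey["category_column_1"],
    survey["category_column_2"],
    normalize="index",
)
print(counts)
print(shares.round(2))
```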

7.5 Statistical Analysis (Correlation)

correlation = df['numeric_column_1'].corr(df['numeric_column_2'])
print(f'Correlation: {correlation}')
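
Series.corr returns the Pearson correlation coefficient by default, which is the covariance of the two series divided by the product of their standard deviations. The sketch below checks that relationship on five invented data points.

```python
import pandas as pd

# Illustrative paired observations.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

r_pandas = x.corr(y)  # Pearson by default

# Same quantity from its definition: cov(x, y) / (std(x) * std(y)).
r_manual = x.cov(y) / (x.std() * y.std())

print(round(r_pandas, 4), round(r_manual, 4))
```

The coefficient ranges from -1 to 1 and measures only linear association, so a value near 0 does not rule out a nonlinear relationship.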

Step 8: Advanced Visualization

8.1 Violin Plot (for Distribution Comparison)

plt.figure(figsize=(10, 6))
sns.violinplot(x='category_column', y='numeric_column', data=df)
plt.title('Violin Plot of Numeric Column by Category')
plt.show()

8.2 Count Plot (for Frequency of Categorical Data)

plt.figure(figsize=(10, 6))
sns.countplot(x='category_column', data=df)
plt.title('Count Plot of Categories')
plt.show()

8.3 Facet Grid (Multiple Plots for Different Subsets of Data)

g = sns.FacetGrid(df, col='category_column', height=5)
g.map(plt.hist, 'numeric_column', bins=20)
plt.show()

Step 9: Save Your Visualizations

9.1 Saving a Plot to File

plt.figure(figsize=(10, 6))
sns.scatterplot(x='numeric_column_1', y='numeric_column_2', data=df)
plt.title('Scatter Plot Example')
plt.savefig('scatter_plot.png') # Call this before plt.show(); saving afterwards can produce a blank image
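
For files that go into reports, the resolution and margins are often worth setting explicitly. The sketch below saves a figure with the dpi and bbox_inches options of savefig and then closes it; the data points, the dpi value, and the temporary output directory are arbitrary choices for illustration.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # render without a display; not needed in a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter([1, 2, 3, 4], [2, 4, 1, 3])  # illustrative points
ax.set_title("Scatter Plot Example")

# Write into a temporary directory so the example leaves no files behind.
out_path = os.path.join(tempfile.mkdtemp(), "scatter_plot.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")  # higher resolution, trimmed margins
plt.close(fig)  # free the figure once it is on disk

print(out_path)
```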

Step 10: Export Processed Data (Optional)

10.1 Save DataFrame to CSV

df_clean.to_csv('processed_data.csv', index=False)

10.2 Save DataFrame to Excel

df_clean.to_excel('processed_data.xlsx', index=False) # Requires an Excel engine such as openpyxl (pip install openpyxl)
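
A quick way to confirm an export worked is to read the file back and compare it with what was written. A minimal sketch for the CSV case, using an invented two-row frame and a temporary directory:

```python
import os
import tempfile

import pandas as pd

# Illustrative frame standing in for the cleaned data.
df_out = pd.DataFrame({
    "category_column": ["A", "B"],
    "value_column": [1.5, 2.5],
})

csv_path = os.path.join(tempfile.mkdtemp(), "processed_data.csv")
df_out.to_csv(csv_path, index=False)  # index=False avoids an extra unnamed column

df_back = pd.read_csv(csv_path)
print(df_back)
```

CSV stores plain text, so column types such as dates are not preserved automatically and may need to be converted again after loading.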