data-cleaning

Pandas for Data Analysts: The Complete Practical Guide

June 08, 2026 ยท 3 min read

Pandas is the core tool of Python data analysis. Every analyst uses it. But most people learn it reactively โ€” Googling each operation as they need it โ€” and never build a complete mental model of how it works.

This guide covers everything a working data analyst needs to know about Pandas, from loading data to building automation scripts. No machine learning, no statistics theory โ€” just the operations you'll use every single day.

Loading Data: Where Every Analysis Starts

import pandas as pd

# From files
df = pd.read_csv('sales_data.csv')
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
df = pd.read_json('data.json')

# First look at your data โ€” always do this first
print(df.shape)          # (rows, columns) e.g. (1250, 8)
df.head(10)              # first 10 rows
df.dtypes                # data type of each column
df.describe()            # count, mean, std, min, max, percentiles
df.isnull().sum()        # missing values per column
df.duplicated().sum()    # number of duplicate rows

Selecting and Filtering

# Select columns
df['salary']                           # single column โ†’ Series
df[['name', 'dept', 'salary']]         # multiple columns โ†’ DataFrame

# Filter rows
df[df['salary'] > 60000]
df[(df['dept'] == 'Sales') & (df['salary'] > 50000)]
df[df['region'].isin(['North', 'West'])]
df[df['name'].str.contains('Smith', case=False)]

# loc (label-based) vs iloc (position-based)
df.loc[df['dept'] == 'Finance', 'salary']  # Finance salaries only
df.iloc[0:5, 0:3]                          # first 5 rows, first 3 columns

Cleaning Data โ€” The 80% of Real Work

# Handle missing values
df.dropna()                                    # drop rows with any null
df.dropna(subset=['salary', 'dept'])           # drop only if these are null
df['salary'].fillna(df['salary'].median())     # fill with median
df.fillna({'dept': 'Unknown', 'salary': 0})    # fill per column

# Fix data types
df['date'] = pd.to_datetime(df['date'])
df['salary'] = df['salary'].astype(float)
df['id'] = df['id'].astype(str)

# Remove duplicates
df.drop_duplicates(subset=['employee_id'], keep='first')

# Rename columns
df.rename(columns={'emp_name': 'name', 'dept_code': 'dept'}, inplace=True)

# Clean string columns
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower().str.strip()

GroupBy โ€” The SQL GROUP BY Equivalent

# Basic groupby
df.groupby('dept')['salary'].mean()

# Multiple aggregations at once โ€” the most useful pattern
result = df.groupby('dept').agg(
    avg_salary=('salary', 'mean'),
    total_employees=('name', 'count'),
    max_salary=('salary', 'max'),
    min_salary=('salary', 'min')
).reset_index()

# Sort the result
result.sort_values('avg_salary', ascending=False)

# GroupBy with multiple columns
df.groupby(['dept', 'region'])['revenue'].sum().reset_index()

Merging DataFrames โ€” The SQL JOIN Equivalent

# Inner join (default) โ€” only matching rows
result = pd.merge(orders, customers, on='customer_id')

# Left join โ€” all orders, matching customer data
result = pd.merge(orders, customers, on='customer_id', how='left')

# Join on different column names
result = pd.merge(orders, customers,
                  left_on='cust_id', right_on='id',
                  how='left')

# Stack DataFrames vertically (same columns, more rows)
all_data = pd.concat([df_2023, df_2024, df_2025], ignore_index=True)

Pivot Tables

# Revenue by product and region
pivot = df.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum',
    fill_value=0
)

# Add row/column totals
pivot_with_totals = df.pivot_table(
    values='revenue',
    index='product',
    columns='region',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

Working with Dates

df['date'] = pd.to_datetime(df['date'])

# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['month_name'] = df['date'].dt.strftime('%b %Y')  # 'Jan 2025'
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter

# Filter by date
df[df['date'] >= '2025-01-01']
df[(df['date'] >= '2025-01-01') & (df['date'] < '2025-04-01')]

# Group by month
df.groupby(df['date'].dt.to_period('M'))['revenue'].sum()

Exporting Results

# Save to files
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', sheet_name='Analysis', index=False)

# Multiple sheets
with pd.ExcelWriter('report.xlsx') as writer:
    summary.to_excel(writer, sheet_name='Summary', index=False)
    detail.to_excel(writer, sheet_name='Detail', index=False)
    pivot.to_excel(writer, sheet_name='Pivot')

The Pattern That Separates Good Analysts from Great Ones

The difference isn't knowing more functions โ€” it's structuring your analysis clearly. Always start with a business question. Always validate your data before analysing it (check nulls, dtypes, duplicates). Always document your assumptions. And when in doubt, check your work against a simpler method โ€” if your Pandas groupby result doesn't match a manual Excel pivot, something is wrong.

For deeper Python practice, our Python for Data Analysis ebook covers three full end-to-end projects using these exact patterns on real datasets.

โ† Back to blog
data-cleaningintermediatepandaspythontutorial