Pandas is the core tool of Python data analysis. Every analyst uses it. But most people learn it reactively โ Googling each operation as they need it โ and never build a complete mental model of how it works.
This guide covers everything a working data analyst needs to know about Pandas, from loading data to building automation scripts. No machine learning, no statistics theory โ just the operations you'll use every single day.
Loading Data: Where Every Analysis Starts
import pandas as pd
# From files
df = pd.read_csv('sales_data.csv')
df = pd.read_excel('report.xlsx', sheet_name='Sheet1')
df = pd.read_json('data.json')
# First look at your data โ always do this first
print(df.shape) # (rows, columns) e.g. (1250, 8)
df.head(10) # first 10 rows
df.dtypes # data type of each column
df.describe() # count, mean, std, min, max, percentiles
df.isnull().sum() # missing values per column
df.duplicated().sum() # number of duplicate rowsSelecting and Filtering
# Select columns
df['salary'] # single column โ Series
df[['name', 'dept', 'salary']] # multiple columns โ DataFrame
# Filter rows
df[df['salary'] > 60000]
df[(df['dept'] == 'Sales') & (df['salary'] > 50000)]
df[df['region'].isin(['North', 'West'])]
df[df['name'].str.contains('Smith', case=False)]
# loc (label-based) vs iloc (position-based)
df.loc[df['dept'] == 'Finance', 'salary'] # Finance salaries only
df.iloc[0:5, 0:3] # first 5 rows, first 3 columnsCleaning Data โ The 80% of Real Work
# Handle missing values
df.dropna() # drop rows with any null
df.dropna(subset=['salary', 'dept']) # drop only if these are null
df['salary'].fillna(df['salary'].median()) # fill with median
df.fillna({'dept': 'Unknown', 'salary': 0}) # fill per column
# Fix data types
df['date'] = pd.to_datetime(df['date'])
df['salary'] = df['salary'].astype(float)
df['id'] = df['id'].astype(str)
# Remove duplicates
df.drop_duplicates(subset=['employee_id'], keep='first')
# Rename columns
df.rename(columns={'emp_name': 'name', 'dept_code': 'dept'}, inplace=True)
# Clean string columns
df['name'] = df['name'].str.strip().str.title()
df['email'] = df['email'].str.lower().str.strip()GroupBy โ The SQL GROUP BY Equivalent
# Basic groupby
df.groupby('dept')['salary'].mean()
# Multiple aggregations at once โ the most useful pattern
result = df.groupby('dept').agg(
avg_salary=('salary', 'mean'),
total_employees=('name', 'count'),
max_salary=('salary', 'max'),
min_salary=('salary', 'min')
).reset_index()
# Sort the result
result.sort_values('avg_salary', ascending=False)
# GroupBy with multiple columns
df.groupby(['dept', 'region'])['revenue'].sum().reset_index()Merging DataFrames โ The SQL JOIN Equivalent
# Inner join (default) โ only matching rows
result = pd.merge(orders, customers, on='customer_id')
# Left join โ all orders, matching customer data
result = pd.merge(orders, customers, on='customer_id', how='left')
# Join on different column names
result = pd.merge(orders, customers,
left_on='cust_id', right_on='id',
how='left')
# Stack DataFrames vertically (same columns, more rows)
all_data = pd.concat([df_2023, df_2024, df_2025], ignore_index=True)Pivot Tables
# Revenue by product and region
pivot = df.pivot_table(
values='revenue',
index='product',
columns='region',
aggfunc='sum',
fill_value=0
)
# Add row/column totals
pivot_with_totals = df.pivot_table(
values='revenue',
index='product',
columns='region',
aggfunc='sum',
margins=True,
margins_name='Total'
)Working with Dates
df['date'] = pd.to_datetime(df['date'])
# Extract components
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['month_name'] = df['date'].dt.strftime('%b %Y') # 'Jan 2025'
df['day_of_week'] = df['date'].dt.day_name()
df['quarter'] = df['date'].dt.quarter
# Filter by date
df[df['date'] >= '2025-01-01']
df[(df['date'] >= '2025-01-01') & (df['date'] < '2025-04-01')]
# Group by month
df.groupby(df['date'].dt.to_period('M'))['revenue'].sum()Exporting Results
# Save to files
df.to_csv('output.csv', index=False)
df.to_excel('output.xlsx', sheet_name='Analysis', index=False)
# Multiple sheets
with pd.ExcelWriter('report.xlsx') as writer:
summary.to_excel(writer, sheet_name='Summary', index=False)
detail.to_excel(writer, sheet_name='Detail', index=False)
pivot.to_excel(writer, sheet_name='Pivot')The Pattern That Separates Good Analysts from Great Ones
The difference isn't knowing more functions โ it's structuring your analysis clearly. Always start with a business question. Always validate your data before analysing it (check nulls, dtypes, duplicates). Always document your assumptions. And when in doubt, check your work against a simpler method โ if your Pandas groupby result doesn't match a manual Excel pivot, something is wrong.
For deeper Python practice, our Python for Data Analysis ebook covers three full end-to-end projects using these exact patterns on real datasets.