Let’s dive into Python DataFrames! The go-to library for DataFrames in Python is Pandas. It’s incredibly powerful and widely used for data manipulation and analysis.
This tutorial will cover:
- What is a Pandas DataFrame?
- Basic Installation
- Creating a DataFrame
- Reading a DataFrame from a File (CSV example)
- Exploring a DataFrame (Basic Operations)
- Filtering Data within a DataFrame
Let’s get started!
1. What is a Pandas DataFrame?
Think of a Pandas DataFrame as a table or spreadsheet. It’s a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Key characteristics:
- Rows and Columns: Data is organized into rows and columns, similar to an Excel sheet or a database table.
- Labeled Axes: Both rows (index) and columns have labels, making it easy to access specific data.
- Heterogeneous Data: Different columns can hold different data types (e.g., one column can be numbers, another text, another dates).
- Tabular Structure: It’s ideal for structured data analysis.
It’s essentially a collection of Pandas Series objects, where each Series represents a column of the DataFrame.
2. Basic Installation
If you don’t have Pandas installed, you can easily install it using pip:
Bash
pip install pandas openpyxl matplotlib seaborn scikit-learn
pandas: The core library for DataFrames.openpyxl: (Optional) Needed if you want to read/write Excel files (.xlsx).matplotlib,seaborn,scikit-learn: (Optional) Very common libraries used alongside Pandas for plotting and machine learning, good to have them if you’re doing data science.
3. Creating a DataFrame
You can create DataFrames from various data structures, such as dictionaries, lists of lists, NumPy arrays, etc. The most common way is from a dictionary where keys become column names and values become column data.
Python
import pandas as pd
# 1. Create a DataFrame from a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [24, 27, 22, 32, 29],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
'Score': [85, 92, 78, 95, 88]
}
df_from_dict = pd.DataFrame(data)
print("DataFrame from Dictionary:")
print(df_from_dict)
print("\n" + "="*30 + "\n")
# 2. Create a DataFrame from a list of lists (explicitly naming columns)
data_list = [
['Frank', 30, 'Miami', 90],
['Grace', 25, 'Dallas', 80],
['Heidi', 28, 'New York', 85]
]
columns_list = ['Name', 'Age', 'City', 'Score']
df_from_list = pd.DataFrame(data_list, columns=columns_list)
print("DataFrame from List of Lists:")
print(df_from_list)
print("\n" + "="*30 + "\n")
Output:
DataFrame from Dictionary:
Name Age City Score
0 Alice 24 New York 85
1 Bob 27 Los Angeles 92
2 Charlie 22 Chicago 78
3 David 32 Houston 95
4 Eve 29 Phoenix 88
==============================
DataFrame from List of Lists:
Name Age City Score
0 Frank 30 Miami 90
1 Grace 25 Dallas 80
2 Heidi 28 New York 85
==============================
4. Reading a DataFrame from a File (CSV Example)
One of the most common operations is reading data from a file. Pandas supports various file formats like CSV, Excel, SQL databases, JSON, and more.
Let’s create a sample CSV file first to demonstrate:
students.csv
Code snippet
Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75
Now, let’s read it into a DataFrame:
Python
import pandas as pd
# Create a dummy CSV file for demonstration
csv_content = """Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75
"""
with open('students.csv', 'w') as f:
f.write(csv_content)
# Read the CSV file into a DataFrame
df = pd.read_csv('students.csv')
print("DataFrame read from 'students.csv':")
print(df)
print("\n" + "="*30 + "\n")
Output:
DataFrame read from 'students.csv':
Name Age City Score
0 Alice 24 New York 85
1 Bob 27 Los Angeles 92
2 Charlie 22 Chicago 78
3 David 32 Houston 95
4 Eve 29 Phoenix 88
5 Frank 30 Miami 90
6 Grace 25 Dallas 80
7 Heidi 28 New York 85
8 Ivy 23 Boston 91
9 Jack 31 Seattle 75
==============================
Common read_csv arguments:
sep=',': Specifies the delimiter (comma by default). Usesep='\t'for tab-separated values.header=0: Row number to use as the column names (0-indexed, default is 0). Set toNoneif your file has no header.names=['col1', 'col2']: Provide a list of column names if your file doesn’t have a header or you want to override existing names.index_col=0: Column to use as the row labels (index).skiprows=[1, 3]: Skip specific row numbers.na_values=['N/A', 'Unknown']: Additional strings to recognize asNaN(Not a Number) / missing values.
5. Exploring a DataFrame (Basic Operations)
Once you have a DataFrame, you’ll want to inspect it.
Python
import pandas as pd
# Assuming df is already loaded from 'students.csv'
csv_content = """Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75
"""
with open('students.csv', 'w') as f:
f.write(csv_content)
df = pd.read_csv('students.csv')
print("DataFrame head (first 5 rows):")
print(df.head()) # Shows the first 5 rows by default, df.head(n) for first n rows
print("\n" + "="*30 + "\n")
print("DataFrame tail (last 5 rows):")
print(df.tail()) # Shows the last 5 rows by default, df.tail(n) for last n rows
print("\n" + "="*30 + "\n")
print("DataFrame info (summary):")
df.info() # Provides a concise summary of the DataFrame, including data types and non-null values
print("\n" + "="*30 + "\n")
print("DataFrame description (statistical summary):")
print(df.describe()) # Generates descriptive statistics of numerical columns
print("\n" + "="*30 + "\n")
print("DataFrame columns:")
print(df.columns) # Lists all column names
print("\n" + "="*30 + "\n")
print("DataFrame shape (rows, columns):")
print(df.shape) # Returns a tuple representing the dimensionality of the DataFrame (rows, columns)
print("\n" + "="*30 + "\n")
print("Accessing a single column (as a Series):")
print(df['Name'])
print("\n" + "="*30 + "\n")
print("Accessing multiple columns (as a DataFrame):")
print(df[['Name', 'City']])
print("\n" + "="*30 + "\n")
Partial Output:
DataFrame head (first 5 rows):
Name Age City Score
0 Alice 24 New York 85
1 Bob 27 Los Angeles 92
2 Charlie 22 Chicago 78
3 David 32 Houston 95
4 Eve 29 Phoenix 88
==============================
DataFrame info (summary):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 10 non-null object
1 Age 10 non-null int64
2 City 10 non-null object
3 Score 10 non-null int64
dtypes: int64(2), object(2)
memory usage: 452.0+ bytes
==============================
DataFrame description (statistical summary):
Age Score
count 10.000000 10.000000
mean 27.100000 86.900000
std 3.314959 6.314278
min 22.000000 75.000000
25% 24.250000 80.000000
50% 27.500000 86.500000
75% 28.750000 91.000000
max 32.000000 95.000000
==============================
...
6. Filtering Data within a DataFrame
Filtering (or “querying” or “subsetting”) is one of the most powerful and frequent operations you’ll perform with DataFrames. You can select rows based on one or more conditions.
The general syntax for filtering is df[df['ColumnName'] condition]. This creates a boolean Series(True/False for each row), and when you pass that boolean Series to the DataFrame, it only returns the rows where the condition was True.
Let’s use our df from students.csv.
Python
import pandas as pd
# Assuming df is already loaded from 'students.csv'
csv_content = """Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75
"""
with open('students.csv', 'w') as f:
f.write(csv_content)
df = pd.read_csv('students.csv')
print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")
# --- Example 1: Filter by a single numerical condition (e.g., Score > 85) ---
high_scorers = df[df['Score'] > 85]
print("Students with Score > 85:")
print(high_scorers)
print("\n" + "="*30 + "\n")
# --- Example 2: Filter by a single categorical condition (e.g., City == 'New York') ---
new_york_students = df[df['City'] == 'New York']
print("Students from New York:")
print(new_york_students)
print("\n" + "="*30 + "\n")
# --- Example 3: Filter by multiple conditions (AND logic) ---
# Students from New York AND Score > 85
high_scoring_ny_students = df[(df['City'] == 'New York') & (df['Score'] > 85)]
print("Students from New York with Score > 85:")
print(high_scoring_ny_students)
print("\n" + "="*30 + "\n")
# --- Example 4: Filter by multiple conditions (OR logic) ---
# Students from New York OR Los Angeles
ny_la_students = df[(df['City'] == 'New York') | (df['City'] == 'Los Angeles')]
print("Students from New York OR Los Angeles:")
print(ny_la_students)
print("\n" + "="*30 + "\n")
# --- Example 5: Using .isin() for multiple categorical values ---
# Students from a list of cities
selected_cities = ['New York', 'Boston', 'Miami']
students_from_selected_cities = df[df['City'].isin(selected_cities)]
print(f"Students from {selected_cities}:")
print(students_from_selected_cities)
print("\n" + "="*30 + "\n")
# --- Example 6: Filtering based on text in a string column ---
# Students whose name starts with 'A'
# Note: .str accessor is used for string operations
name_starts_with_a = df[df['Name'].str.startswith('A')]
print("Students whose name starts with 'A':")
print(name_starts_with_a)
print("\n" + "="*30 + "\n")
# --- Example 7: Filtering and selecting specific columns ---
# Get names and ages of students older than 25
older_students_names_ages = df[df['Age'] > 25][['Name', 'Age']]
print("Names and Ages of students older than 25:")
print(older_students_names_ages)
print("\n" + "="*30 + "\n")
# --- Example 8: Chaining multiple filters (alternative to boolean operators) ---
# Students with age between 25 and 30 (inclusive)
age_filtered_students = df[df['Age'].between(25, 30)]
print("Students with Age between 25 and 30:")
print(age_filtered_students)
print("\n" + "="*30 + "\n")
Partial Output:
Original DataFrame:
Name Age City Score
0 Alice 24 New York 85
1 Bob 27 Los Angeles 92
2 Charlie 22 Chicago 78
3 David 32 Houston 95
4 Eve 29 Phoenix 88
5 Frank 30 Miami 90
6 Grace 25 Dallas 80
7 Heidi 28 New York 85
8 Ivy 23 Boston 91
9 Jack 31 Seattle 75
==============================
Students with Score > 85:
Name Age City Score
1 Bob 27 Los Angeles 92
3 David 32 Houston 95
5 Frank 30 Miami 90
8 Ivy 23 Boston 91
==============================
Students from New York:
Name Age City Score
0 Alice 24 New York 85
7 Heidi 28 New York 85
==============================
Students from New York with Score > 85:
Empty DataFrame
Columns: [Name, Age, City, Score]
Index: []
==============================
...
Important Notes on Filtering:
- Parentheses are Crucial: When combining multiple conditions with
&(AND) or|(OR), each individual condition must be enclosed in parentheses (e.g.,(df['Column'] > value)). This is because of operator precedence in Python. - Logical Operators:
&for logical AND|for logical OR~for logical NOT (e.g.,df[~(df['City'] == 'New York')]to get all students not from New York)
.locfor Label-based Indexing: Whiledf[condition]works well for row selection based on boolean arrays, for more complex selection involving both rows and columns by label,locis preferred and more explicit:Python# Select rows where Score > 85 AND only 'Name' and 'Score' columns high_scorers_subset = df.loc[df['Score'] > 85, ['Name', 'Score']] print("\nUsing .loc for filtering and column selection:") print(high_scorers_subset)
This tutorial provides a solid foundation for working with Python DataFrames in Pandas. The ability to read data and filter it efficiently is fundamental to almost any data analysis task!