Pandas Dataframes

Let’s dive into Python DataFrames! The go-to library for DataFrames in Python is Pandas. It’s incredibly powerful and widely used for data manipulation and analysis.

This tutorial will cover:

  1. What is a Pandas DataFrame?
  2. Basic Installation
  3. Creating a DataFrame
  4. Reading a DataFrame from a File (CSV example)
  5. Exploring a DataFrame (Basic Operations)
  6. Filtering Data within a DataFrame

Let’s get started!


1. What is a Pandas DataFrame?

Think of a Pandas DataFrame as a table or spreadsheet. It’s a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

Key characteristics:

  • Rows and Columns: Data is organized into rows and columns, similar to an Excel sheet or a database table.
  • Labeled Axes: Both rows (index) and columns have labels, making it easy to access specific data.
  • Heterogeneous Data: Different columns can hold different data types (e.g., one column can be numbers, another text, another dates).
  • Tabular Structure: It’s ideal for structured data analysis.

It’s essentially a collection of Pandas Series objects, where each Series represents a column of the DataFrame.

2. Basic Installation

If you don’t have Pandas installed, you can easily install it using pip:

Bash

pip install pandas openpyxl matplotlib seaborn scikit-learn
  • pandas: The core library for DataFrames.
  • openpyxl: (Optional) Needed if you want to read/write Excel files (.xlsx).
  • matplotlibseabornscikit-learn: (Optional) Very common libraries used alongside Pandas for plotting and machine learning, good to have them if you’re doing data science.

3. Creating a DataFrame

You can create DataFrames from various data structures, such as dictionaries, lists of lists, NumPy arrays, etc. The most common way is from a dictionary where keys become column names and values become column data.

Python

import pandas as pd

# 1. Create a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [24, 27, 22, 32, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'Score': [85, 92, 78, 95, 88]
}
df_from_dict = pd.DataFrame(data)
print("DataFrame from Dictionary:")
print(df_from_dict)
print("\n" + "="*30 + "\n")

# 2. Create a DataFrame from a list of lists (explicitly naming columns)
data_list = [
    ['Frank', 30, 'Miami', 90],
    ['Grace', 25, 'Dallas', 80],
    ['Heidi', 28, 'New York', 85]
]
columns_list = ['Name', 'Age', 'City', 'Score']
df_from_list = pd.DataFrame(data_list, columns=columns_list)
print("DataFrame from List of Lists:")
print(df_from_list)
print("\n" + "="*30 + "\n")

Output:

DataFrame from Dictionary:
      Name       Age        City                   Score
0    Alice        24           New York         85
1     Bob         27            Los Angeles    92
2    Charlie    22            Chicago           78
3    David      32            Houston           95
4    Eve          29           Phoenix            88

==============================

DataFrame from List of Lists:
     Name  Age      City              Score
0  Frank   30         Miami          90
1  Grace   25         Dallas          80
2  Heidi    28         New York    85

==============================

4. Reading a DataFrame from a File (CSV Example)

One of the most common operations is reading data from a file. Pandas supports various file formats like CSV, Excel, SQL databases, JSON, and more.

Let’s create a sample CSV file first to demonstrate:

students.csv

Code snippet

Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75

Now, let’s read it into a DataFrame:

Python

import pandas as pd

# Create a dummy CSV file for demonstration
csv_content = """Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75
"""

with open('students.csv', 'w') as f:
    f.write(csv_content)

# Read the CSV file into a DataFrame
df = pd.read_csv('students.csv')

print("DataFrame read from 'students.csv':")
print(df)
print("\n" + "="*30 + "\n")

Output:

DataFrame read from 'students.csv':
      Name  Age         City  Score
0    Alice   24     New York     85
1      Bob   27  Los Angeles     92
2    Charlie   22      Chicago     78
3    David   32      Houston     95
4      Eve   29      Phoenix     88
5    Frank   30        Miami     90
6    Grace   25       Dallas     80
7    Heidi   28     New York     85
8      Ivy   23       Boston     91
9     Jack   31      Seattle     75

==============================

Common read_csv arguments:

  • sep=',': Specifies the delimiter (comma by default). Use sep='\t' for tab-separated values.
  • header=0: Row number to use as the column names (0-indexed, default is 0). Set to None if your file has no header.
  • names=['col1', 'col2']: Provide a list of column names if your file doesn’t have a header or you want to override existing names.
  • index_col=0: Column to use as the row labels (index).
  • skiprows=[1, 3]: Skip specific row numbers.
  • na_values=['N/A', 'Unknown']: Additional strings to recognize as NaN (Not a Number) / missing values.

5. Exploring a DataFrame (Basic Operations)

Once you have a DataFrame, you’ll want to inspect it.

Python

import pandas as pd

# Assuming df is already loaded from 'students.csv'
csv_content = """Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75
"""
with open('students.csv', 'w') as f:
    f.write(csv_content)
df = pd.read_csv('students.csv')


print("DataFrame head (first 5 rows):")
print(df.head()) # Shows the first 5 rows by default, df.head(n) for first n rows
print("\n" + "="*30 + "\n")

print("DataFrame tail (last 5 rows):")
print(df.tail()) # Shows the last 5 rows by default, df.tail(n) for last n rows
print("\n" + "="*30 + "\n")

print("DataFrame info (summary):")
df.info() # Provides a concise summary of the DataFrame, including data types and non-null values
print("\n" + "="*30 + "\n")

print("DataFrame description (statistical summary):")
print(df.describe()) # Generates descriptive statistics of numerical columns
print("\n" + "="*30 + "\n")

print("DataFrame columns:")
print(df.columns) # Lists all column names
print("\n" + "="*30 + "\n")

print("DataFrame shape (rows, columns):")
print(df.shape) # Returns a tuple representing the dimensionality of the DataFrame (rows, columns)
print("\n" + "="*30 + "\n")

print("Accessing a single column (as a Series):")
print(df['Name'])
print("\n" + "="*30 + "\n")

print("Accessing multiple columns (as a DataFrame):")
print(df[['Name', 'City']])
print("\n" + "="*30 + "\n")

Partial Output:

DataFrame head (first 5 rows):
      Name  Age         City  Score
0    Alice   24     New York     85
1      Bob   27  Los Angeles     92
2  Charlie   22      Chicago     78
3    David   32      Houston     95
4      Eve   29      Phoenix     88

==============================

DataFrame info (summary):
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    10 non-null     object
 1   Age     10 non-null     int64
 2   City    10 non-null     object
 3   Score   10 non-null     int64
dtypes: int64(2), object(2)
memory usage: 452.0+ bytes

==============================

DataFrame description (statistical summary):
             Age      Score
count  10.000000  10.000000
mean   27.100000  86.900000
std     3.314959   6.314278
min    22.000000  75.000000
25%    24.250000  80.000000
50%    27.500000  86.500000
75%    28.750000  91.000000
max    32.000000  95.000000

==============================
...

6. Filtering Data within a DataFrame

Filtering (or “querying” or “subsetting”) is one of the most powerful and frequent operations you’ll perform with DataFrames. You can select rows based on one or more conditions.

The general syntax for filtering is df[df['ColumnName'] condition]. This creates a boolean Series(True/False for each row), and when you pass that boolean Series to the DataFrame, it only returns the rows where the condition was True.

Let’s use our df from students.csv.

Python

import pandas as pd

# Assuming df is already loaded from 'students.csv'
csv_content = """Name,Age,City,Score
Alice,24,New York,85
Bob,27,Los Angeles,92
Charlie,22,Chicago,78
David,32,Houston,95
Eve,29,Phoenix,88
Frank,30,Miami,90
Grace,25,Dallas,80
Heidi,28,New York,85
Ivy,23,Boston,91
Jack,31,Seattle,75
"""
with open('students.csv', 'w') as f:
    f.write(csv_content)
df = pd.read_csv('students.csv')


print("Original DataFrame:")
print(df)
print("\n" + "="*30 + "\n")

# --- Example 1: Filter by a single numerical condition (e.g., Score > 85) ---
high_scorers = df[df['Score'] > 85]
print("Students with Score > 85:")
print(high_scorers)
print("\n" + "="*30 + "\n")

# --- Example 2: Filter by a single categorical condition (e.g., City == 'New York') ---
new_york_students = df[df['City'] == 'New York']
print("Students from New York:")
print(new_york_students)
print("\n" + "="*30 + "\n")

# --- Example 3: Filter by multiple conditions (AND logic) ---
# Students from New York AND Score > 85
high_scoring_ny_students = df[(df['City'] == 'New York') & (df['Score'] > 85)]
print("Students from New York with Score > 85:")
print(high_scoring_ny_students)
print("\n" + "="*30 + "\n")

# --- Example 4: Filter by multiple conditions (OR logic) ---
# Students from New York OR Los Angeles
ny_la_students = df[(df['City'] == 'New York') | (df['City'] == 'Los Angeles')]
print("Students from New York OR Los Angeles:")
print(ny_la_students)
print("\n" + "="*30 + "\n")

# --- Example 5: Using .isin() for multiple categorical values ---
# Students from a list of cities
selected_cities = ['New York', 'Boston', 'Miami']
students_from_selected_cities = df[df['City'].isin(selected_cities)]
print(f"Students from {selected_cities}:")
print(students_from_selected_cities)
print("\n" + "="*30 + "\n")

# --- Example 6: Filtering based on text in a string column ---
# Students whose name starts with 'A'
# Note: .str accessor is used for string operations
name_starts_with_a = df[df['Name'].str.startswith('A')]
print("Students whose name starts with 'A':")
print(name_starts_with_a)
print("\n" + "="*30 + "\n")

# --- Example 7: Filtering and selecting specific columns ---
# Get names and ages of students older than 25
older_students_names_ages = df[df['Age'] > 25][['Name', 'Age']]
print("Names and Ages of students older than 25:")
print(older_students_names_ages)
print("\n" + "="*30 + "\n")

# --- Example 8: Chaining multiple filters (alternative to boolean operators) ---
# Students with age between 25 and 30 (inclusive)
age_filtered_students = df[df['Age'].between(25, 30)]
print("Students with Age between 25 and 30:")
print(age_filtered_students)
print("\n" + "="*30 + "\n")

Partial Output:

Original DataFrame:
      Name  Age         City  Score
0    Alice   24     New York     85
1      Bob   27  Los Angeles     92
2  Charlie   22      Chicago     78
3    David   32      Houston     95
4      Eve   29      Phoenix     88
5    Frank   30        Miami     90
6    Grace   25       Dallas     80
7    Heidi   28     New York     85
8      Ivy   23       Boston     91
9     Jack   31      Seattle     75

==============================

Students with Score > 85:
    Name  Age         City  Score
1    Bob   27  Los Angeles     92
3  David   32      Houston     95
5  Frank   30        Miami     90
8    Ivy   23       Boston     91

==============================

Students from New York:
     Name  Age      City  Score
0   Alice   24  New York     85
7   Heidi   28  New York     85

==============================

Students from New York with Score > 85:
Empty DataFrame
Columns: [Name, Age, City, Score]
Index: []

==============================
...

Important Notes on Filtering:

  • Parentheses are Crucial: When combining multiple conditions with & (AND) or | (OR), each individual condition must be enclosed in parentheses (e.g., (df['Column'] > value)). This is because of operator precedence in Python.
  • Logical Operators:
    • & for logical AND
    • | for logical OR
    • ~ for logical NOT (e.g., df[~(df['City'] == 'New York')] to get all students not from New York)
  • .loc for Label-based Indexing: While df[condition] works well for row selection based on boolean arrays, for more complex selection involving both rows and columns by labelloc is preferred and more explicit:Python# Select rows where Score > 85 AND only 'Name' and 'Score' columns high_scorers_subset = df.loc[df['Score'] > 85, ['Name', 'Score']] print("\nUsing .loc for filtering and column selection:") print(high_scorers_subset)

This tutorial provides a solid foundation for working with Python DataFrames in Pandas. The ability to read data and filter it efficiently is fundamental to almost any data analysis task!

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.