{"id":474,"date":"2025-07-11T10:30:25","date_gmt":"2025-07-11T00:30:25","guid":{"rendered":"https:\/\/www.mrmarkyoung.com\/oracle\/?p=474"},"modified":"2025-07-11T10:36:05","modified_gmt":"2025-07-11T00:36:05","slug":"pandas-dataframes","status":"publish","type":"post","link":"http:\/\/www.mrmarkyoung.com\/oracle\/2025\/07\/11\/pandas-dataframes\/","title":{"rendered":"Pandas Dataframes"},"content":{"rendered":"\n<p>Let&#8217;s dive into Python DataFrames! The go-to library for DataFrames in Python is\u00a0<strong>Pandas<\/strong>. It&#8217;s incredibly powerful and widely used for data manipulation and analysis.<\/p>\n\n\n\n<p>This tutorial will cover:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>What is a Pandas DataFrame?<\/strong><\/li>\n\n\n\n<li><strong>Basic Installation<\/strong><\/li>\n\n\n\n<li><strong>Creating a DataFrame<\/strong><\/li>\n\n\n\n<li><strong>Reading a DataFrame from a File (CSV example)<\/strong><\/li>\n\n\n\n<li><strong>Exploring a DataFrame (Basic Operations)<\/strong><\/li>\n\n\n\n<li><strong>Filtering Data within a DataFrame<\/strong><\/li>\n<\/ol>\n\n\n\n<p>Let&#8217;s get started!<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">1. What is a Pandas DataFrame?<\/h2>\n\n\n\n<p>Think of a Pandas DataFrame as a&nbsp;<strong>table or spreadsheet<\/strong>. It&#8217;s a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).<\/p>\n\n\n\n<p>Key characteristics:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Rows and Columns:<\/strong>\u00a0Data is organized into rows and columns, similar to an Excel sheet or a database table.<\/li>\n\n\n\n<li><strong>Labeled Axes:<\/strong>\u00a0Both rows (index) and columns have labels, making it easy to access specific data.<\/li>\n\n\n\n<li><strong>Heterogeneous Data:<\/strong>\u00a0Different columns can hold different data types (e.g., one column can be numbers, another text, another dates).<\/li>\n\n\n\n<li><strong>Tabular Structure:<\/strong>\u00a0It&#8217;s ideal for structured data analysis.<\/li>\n<\/ul>\n\n\n\n<p>It&#8217;s essentially a collection of Pandas&nbsp;<code>Series<\/code>&nbsp;objects, where each&nbsp;<code>Series<\/code>&nbsp;represents a column of the DataFrame.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Basic Installation<\/h2>\n\n\n\n<p>If you don&#8217;t have Pandas installed, you can easily install it using&nbsp;<code>pip<\/code>:<\/p>\n\n\n\n<p>Bash<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>pip install pandas openpyxl matplotlib seaborn scikit-learn\n<\/code><\/pre>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>pandas<\/code>: The core library for DataFrames.<\/li>\n\n\n\n<li><code>openpyxl<\/code>: (Optional) Needed if you want to read\/write Excel files (<code>.xlsx<\/code>).<\/li>\n\n\n\n<li><code>matplotlib<\/code>,\u00a0<code>seaborn<\/code>,\u00a0<code>scikit-learn<\/code>: (Optional) Very common libraries used alongside Pandas for plotting and machine learning, good to have them if you&#8217;re doing data science.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">3. Creating a DataFrame<\/h2>\n\n\n\n<p>You can create DataFrames from various data structures, such as dictionaries, lists of lists, NumPy arrays, etc. The most common way is from a dictionary where keys become column names and values become column data.<\/p>\n\n\n\n<p>Python<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# 1. Create a DataFrame from a dictionary\ndata = {\n    'Name': &#91;'Alice', 'Bob', 'Charlie', 'David', 'Eve'],\n    'Age': &#91;24, 27, 22, 32, 29],\n    'City': &#91;'New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],\n    'Score': &#91;85, 92, 78, 95, 88]\n}\ndf_from_dict = pd.DataFrame(data)\nprint(\"DataFrame from Dictionary:\")\nprint(df_from_dict)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# 2. Create a DataFrame from a list of lists (explicitly naming columns)\ndata_list = &#91;\n    &#91;'Frank', 30, 'Miami', 90],\n    &#91;'Grace', 25, 'Dallas', 80],\n    &#91;'Heidi', 28, 'New York', 85]\n]\ncolumns_list = &#91;'Name', 'Age', 'City', 'Score']\ndf_from_list = pd.DataFrame(data_list, columns=columns_list)\nprint(\"DataFrame from List of Lists:\")\nprint(df_from_list)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n<\/code><\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DataFrame from Dictionary:\n      Name       Age        City                   Score\n0    Alice        24           New York         85\n1     Bob         27            Los Angeles    92\n2    Charlie    22            Chicago           78\n3    David      32            Houston           95\n4    Eve          29           Phoenix            88\n\n==============================\n\nDataFrame from List of Lists:\n     Name  Age      City              Score\n0  Frank   30         Miami          90\n1  Grace   25         Dallas          80\n2  Heidi    28         New York    85\n\n==============================\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">4. Reading a DataFrame from a File (CSV Example)<\/h2>\n\n\n\n<p>One of the most common operations is reading data from a file. Pandas supports various file formats like CSV, Excel, SQL databases, JSON, and more.<\/p>\n\n\n\n<p>Let&#8217;s create a sample CSV file first to demonstrate:<\/p>\n\n\n\n<p><strong><code>students.csv<\/code><\/strong><\/p>\n\n\n\n<p>Code snippet<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Name,Age,City,Score\nAlice,24,New York,85\nBob,27,Los Angeles,92\nCharlie,22,Chicago,78\nDavid,32,Houston,95\nEve,29,Phoenix,88\nFrank,30,Miami,90\nGrace,25,Dallas,80\nHeidi,28,New York,85\nIvy,23,Boston,91\nJack,31,Seattle,75\n<\/code><\/pre>\n\n\n\n<p>Now, let&#8217;s read it into a DataFrame:<\/p>\n\n\n\n<p>Python<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# Create a dummy CSV file for demonstration\ncsv_content = \"\"\"Name,Age,City,Score\nAlice,24,New York,85\nBob,27,Los Angeles,92\nCharlie,22,Chicago,78\nDavid,32,Houston,95\nEve,29,Phoenix,88\nFrank,30,Miami,90\nGrace,25,Dallas,80\nHeidi,28,New York,85\nIvy,23,Boston,91\nJack,31,Seattle,75\n\"\"\"\n\nwith open('students.csv', 'w') as f:\n    f.write(csv_content)\n\n# Read the CSV file into a DataFrame\ndf = pd.read_csv('students.csv')\n\nprint(\"DataFrame read from 'students.csv':\")\nprint(df)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n<\/code><\/pre>\n\n\n\n<p><strong>Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DataFrame read from 'students.csv':\n      Name  Age         City  Score\n0    Alice   24     New York     85\n1      Bob   27  Los Angeles     92\n2    Charlie   22      Chicago     78\n3    David   32      Houston     95\n4      Eve   29      Phoenix     88\n5    Frank   30        Miami     90\n6    Grace   25       Dallas     80\n7    Heidi   28     New York     85\n8      Ivy   23       Boston     91\n9     Jack   31      Seattle     75\n\n==============================\n<\/code><\/pre>\n\n\n\n<p><strong>Common&nbsp;<code>read_csv<\/code>&nbsp;arguments:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><code>sep=','<\/code>: Specifies the delimiter (comma by default). Use\u00a0<code>sep='\\t'<\/code>\u00a0for tab-separated values.<\/li>\n\n\n\n<li><code>header=0<\/code>: Row number to use as the column names (0-indexed, default is 0). Set to\u00a0<code>None<\/code>\u00a0if your file has no header.<\/li>\n\n\n\n<li><code>names=['col1', 'col2']<\/code>: Provide a list of column names if your file doesn&#8217;t have a header or you want to override existing names.<\/li>\n\n\n\n<li><code>index_col=0<\/code>: Column to use as the row labels (index).<\/li>\n\n\n\n<li><code>skiprows=[1, 3]<\/code>: Skip specific row numbers.<\/li>\n\n\n\n<li><code>na_values=['N\/A', 'Unknown']<\/code>: Additional strings to recognize as\u00a0<code>NaN<\/code>\u00a0(Not a Number) \/ missing values.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Exploring a DataFrame (Basic Operations)<\/h2>\n\n\n\n<p>Once you have a DataFrame, you&#8217;ll want to inspect it.<\/p>\n\n\n\n<p>Python<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# Assuming df is already loaded from 'students.csv'\ncsv_content = \"\"\"Name,Age,City,Score\nAlice,24,New York,85\nBob,27,Los Angeles,92\nCharlie,22,Chicago,78\nDavid,32,Houston,95\nEve,29,Phoenix,88\nFrank,30,Miami,90\nGrace,25,Dallas,80\nHeidi,28,New York,85\nIvy,23,Boston,91\nJack,31,Seattle,75\n\"\"\"\nwith open('students.csv', 'w') as f:\n    f.write(csv_content)\ndf = pd.read_csv('students.csv')\n\n\nprint(\"DataFrame head (first 5 rows):\")\nprint(df.head()) # Shows the first 5 rows by default, df.head(n) for first n rows\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\nprint(\"DataFrame tail (last 5 rows):\")\nprint(df.tail()) # Shows the last 5 rows by default, df.tail(n) for last n rows\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\nprint(\"DataFrame info (summary):\")\ndf.info() # Provides a concise summary of the DataFrame, including data types and non-null values\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\nprint(\"DataFrame description (statistical summary):\")\nprint(df.describe()) # Generates descriptive statistics of numerical columns\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\nprint(\"DataFrame columns:\")\nprint(df.columns) # Lists all column names\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\nprint(\"DataFrame shape (rows, columns):\")\nprint(df.shape) # Returns a tuple representing the dimensionality of the DataFrame (rows, columns)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\nprint(\"Accessing a single column (as a Series):\")\nprint(df&#91;'Name'])\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\nprint(\"Accessing multiple columns (as a DataFrame):\")\nprint(df&#91;&#91;'Name', 'City']])\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n<\/code><\/pre>\n\n\n\n<p><strong>Partial Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DataFrame head (first 5 rows):\n      Name  Age         City  Score\n0    Alice   24     New York     85\n1      Bob   27  Los Angeles     92\n2  Charlie   22      Chicago     78\n3    David   32      Houston     95\n4      Eve   29      Phoenix     88\n\n==============================\n\nDataFrame info (summary):\n&lt;class 'pandas.core.frame.DataFrame'&gt;\nRangeIndex: 10 entries, 0 to 9\nData columns (total 4 columns):\n #   Column  Non-Null Count  Dtype\n---  ------  --------------  -----\n 0   Name    10 non-null     object\n 1   Age     10 non-null     int64\n 2   City    10 non-null     object\n 3   Score   10 non-null     int64\ndtypes: int64(2), object(2)\nmemory usage: 452.0+ bytes\n\n==============================\n\nDataFrame description (statistical summary):\n             Age      Score\ncount  10.000000  10.000000\nmean   27.100000  86.900000\nstd     3.314959   6.314278\nmin    22.000000  75.000000\n25%    24.250000  80.000000\n50%    27.500000  86.500000\n75%    28.750000  91.000000\nmax    32.000000  95.000000\n\n==============================\n...\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">6. Filtering Data within a DataFrame<\/h2>\n\n\n\n<p>Filtering (or &#8220;querying&#8221; or &#8220;subsetting&#8221;) is one of the most powerful and frequent operations you&#8217;ll perform with DataFrames. You can select rows based on one or more conditions.<\/p>\n\n\n\n<p>The general syntax for filtering is&nbsp;<code>df[df['ColumnName'] condition]<\/code>. This creates a boolean&nbsp;<code>Series<\/code>(True\/False for each row), and when you pass that boolean Series to the DataFrame, it only returns the rows where the condition was&nbsp;<code>True<\/code>.<\/p>\n\n\n\n<p>Let&#8217;s use our&nbsp;<code>df<\/code>&nbsp;from&nbsp;<code>students.csv<\/code>.<\/p>\n\n\n\n<p>Python<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import pandas as pd\n\n# Assuming df is already loaded from 'students.csv'\ncsv_content = \"\"\"Name,Age,City,Score\nAlice,24,New York,85\nBob,27,Los Angeles,92\nCharlie,22,Chicago,78\nDavid,32,Houston,95\nEve,29,Phoenix,88\nFrank,30,Miami,90\nGrace,25,Dallas,80\nHeidi,28,New York,85\nIvy,23,Boston,91\nJack,31,Seattle,75\n\"\"\"\nwith open('students.csv', 'w') as f:\n    f.write(csv_content)\ndf = pd.read_csv('students.csv')\n\n\nprint(\"Original DataFrame:\")\nprint(df)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 1: Filter by a single numerical condition (e.g., Score &gt; 85) ---\nhigh_scorers = df&#91;df&#91;'Score'] &gt; 85]\nprint(\"Students with Score &gt; 85:\")\nprint(high_scorers)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 2: Filter by a single categorical condition (e.g., City == 'New York') ---\nnew_york_students = df&#91;df&#91;'City'] == 'New York']\nprint(\"Students from New York:\")\nprint(new_york_students)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 3: Filter by multiple conditions (AND logic) ---\n# Students from New York AND Score &gt; 85\nhigh_scoring_ny_students = df&#91;(df&#91;'City'] == 'New York') &amp; (df&#91;'Score'] &gt; 85)]\nprint(\"Students from New York with Score &gt; 85:\")\nprint(high_scoring_ny_students)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 4: Filter by multiple conditions (OR logic) ---\n# Students from New York OR Los Angeles\nny_la_students = df&#91;(df&#91;'City'] == 'New York') | (df&#91;'City'] == 'Los Angeles')]\nprint(\"Students from New York OR Los Angeles:\")\nprint(ny_la_students)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 5: Using .isin() for multiple categorical values ---\n# Students from a list of cities\nselected_cities = &#91;'New York', 'Boston', 'Miami']\nstudents_from_selected_cities = df&#91;df&#91;'City'].isin(selected_cities)]\nprint(f\"Students from {selected_cities}:\")\nprint(students_from_selected_cities)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 6: Filtering based on text in a string column ---\n# Students whose name starts with 'A'\n# Note: .str accessor is used for string operations\nname_starts_with_a = df&#91;df&#91;'Name'].str.startswith('A')]\nprint(\"Students whose name starts with 'A':\")\nprint(name_starts_with_a)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 7: Filtering and selecting specific columns ---\n# Get names and ages of students older than 25\nolder_students_names_ages = df&#91;df&#91;'Age'] &gt; 25]&#91;&#91;'Name', 'Age']]\nprint(\"Names and Ages of students older than 25:\")\nprint(older_students_names_ages)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n\n# --- Example 8: Chaining multiple filters (alternative to boolean operators) ---\n# Students with age between 25 and 30 (inclusive)\nage_filtered_students = df&#91;df&#91;'Age'].between(25, 30)]\nprint(\"Students with Age between 25 and 30:\")\nprint(age_filtered_students)\nprint(\"\\n\" + \"=\"*30 + \"\\n\")\n<\/code><\/pre>\n\n\n\n<p><strong>Partial Output:<\/strong><\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>Original DataFrame:\n      Name  Age         City  Score\n0    Alice   24     New York     85\n1      Bob   27  Los Angeles     92\n2  Charlie   22      Chicago     78\n3    David   32      Houston     95\n4      Eve   29      Phoenix     88\n5    Frank   30        Miami     90\n6    Grace   25       Dallas     80\n7    Heidi   28     New York     85\n8      Ivy   23       Boston     91\n9     Jack   31      Seattle     75\n\n==============================\n\nStudents with Score &gt; 85:\n    Name  Age         City  Score\n1    Bob   27  Los Angeles     92\n3  David   32      Houston     95\n5  Frank   30        Miami     90\n8    Ivy   23       Boston     91\n\n==============================\n\nStudents from New York:\n     Name  Age      City  Score\n0   Alice   24  New York     85\n7   Heidi   28  New York     85\n\n==============================\n\nStudents from New York with Score &gt; 85:\nEmpty DataFrame\nColumns: &#91;Name, Age, City, Score]\nIndex: &#91;]\n\n==============================\n...\n<\/code><\/pre>\n\n\n\n<p><strong>Important Notes on Filtering:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Parentheses are Crucial:<\/strong>\u00a0When combining multiple conditions with\u00a0<code>&amp;<\/code>\u00a0(AND) or\u00a0<code>|<\/code>\u00a0(OR),\u00a0<strong>each individual condition must be enclosed in parentheses<\/strong>\u00a0(e.g.,\u00a0<code>(df['Column'] > value)<\/code>). This is because of operator precedence in Python.<\/li>\n\n\n\n<li><strong>Logical Operators:<\/strong>\n<ul class=\"wp-block-list\">\n<li><code>&amp;<\/code>\u00a0for logical AND<\/li>\n\n\n\n<li><code>|<\/code>\u00a0for logical OR<\/li>\n\n\n\n<li><code>~<\/code>\u00a0for logical NOT (e.g.,\u00a0<code>df[~(df['City'] == 'New York')]<\/code>\u00a0to get all students\u00a0<em>not<\/em>\u00a0from New York)<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong><code>.loc<\/code>\u00a0for Label-based Indexing:<\/strong>\u00a0While\u00a0<code>df[condition]<\/code>\u00a0works well for row selection based on boolean arrays, for more complex selection involving both rows and columns by\u00a0<em>label<\/em>,\u00a0<code>loc<\/code>\u00a0is preferred and more explicit:Python<code># Select rows where Score > 85 AND only 'Name' and 'Score' columns high_scorers_subset = df.loc[df['Score'] > 85, ['Name', 'Score']] print(\"\\nUsing .loc for filtering and column selection:\") print(high_scorers_subset)<\/code><\/li>\n<\/ul>\n\n\n\n<p>This tutorial provides a solid foundation for working with Python DataFrames in Pandas. The ability to read data and filter it efficiently is fundamental to almost any data analysis task!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s dive into Python DataFrames! The go-to library for DataFrames in Python is\u00a0Pandas. It&#8217;s incredibly powerful and widely used for data manipulation and analysis. This tutorial will cover: Let&#8217;s get started! 1. What is a Pandas DataFrame? Think of a Pandas DataFrame as a&nbsp;table or spreadsheet. It&#8217;s a two-dimensional, size-mutable, and potentially heterogeneous tabular data &#8230; <a title=\"Pandas Dataframes\" class=\"read-more\" href=\"http:\/\/www.mrmarkyoung.com\/oracle\/2025\/07\/11\/pandas-dataframes\/\" aria-label=\"Read more about Pandas Dataframes\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[63,30],"tags":[31],"class_list":["post-474","post","type-post","status-publish","format-standard","hentry","category-data-science","category-python","tag-python"],"_links":{"self":[{"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/posts\/474","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/comments?post=474"}],"version-history":[{"count":1,"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/posts\/474\/revisions"}],"predecessor-version":[{"id":475,"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/posts\/474\/revisions\/475"}],"wp:attachment":[{"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/media?parent=474"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/categories?post=474"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.mrmarkyoung.com\/oracle\/wp-json\/wp\/v2\/tags?post=474"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}