Introduction to Pandas

Pandas is a widely used Python library for data analysis. Its core structure, the DataFrame, is designed for working with tabular data.

What is Pandas?

Pandas builds on NumPy and provides two main data structures: Series (a labeled 1D array) and DataFrame (a labeled 2D table, like a spreadsheet). Install with pip install pandas.
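The two structures can be seen side by side in a minimal sketch. The names and values here are illustrative sample data, not part of any real dataset.

```python
import pandas as pd

# A Series is a 1D array of values paired with an index of labels.
ages = pd.Series([25, 30, 35], index=["alice", "bob", "charlie"], name="age")
print(ages["bob"])  # look up a value by its label

# A DataFrame is a 2D table with row labels and named columns.
people = pd.DataFrame(
    {"age": [25, 30, 35], "city": ["NYC", "LA", "Chicago"]},
    index=["alice", "bob", "charlie"],
)

# Pulling one column out of a DataFrame gives back a Series.
age_column = people["age"]
print(isinstance(age_column, pd.Series))
print(people.shape)  # (rows, columns)
```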

Creating DataFrames

DataFrames can be created from dictionaries, lists of lists, or by reading files such as CSV with pd.read_csv().
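As a rough illustration of the other two routes, the sketch below builds the same table from a list of lists and from CSV text. The CSV is held in memory with io.StringIO so the example runs without a file on disk; with a real file you would pass its path instead.

```python
import io
import pandas as pd

# From a list of lists: each inner list is a row, so column names
# have to be supplied separately.
rows = [["Alice", 25], ["Bob", 30], ["Charlie", 35]]
from_rows = pd.DataFrame(rows, columns=["name", "age"])

# From CSV: read_csv accepts a file path or any file-like object.
# This in-memory text stands in for a file such as "data.csv".
csv_text = "name,age\nAlice,25\nBob,30\nCharlie,35\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

print(from_rows)
print(from_csv)
```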

Exploring Data

Use .head(), .tail(), .info(), .describe(), and .shape to quickly understand a dataset.
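A short sketch of these calls on a small made-up table follows. Note that .shape is an attribute rather than a method, and that .info() prints its report itself instead of returning it.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Dana"],
    "age": [25, 30, 35, 40],
})

first_two = df.head(2)   # first 2 rows (the default is 5)
last_one = df.tail(1)    # last row
print(first_two)
print(last_one)
print(df.shape)          # (4, 2): no parentheses, it is an attribute

df.info()                # prints column types and non-null counts; returns None

summary = df.describe()  # count, mean, std, min, quartiles, max of numeric columns
print(summary.loc["mean", "age"])  # 32.5
```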

Selecting Data

Select columns with bracket notation (df["col"]), and rows with .loc[] (label-based) or .iloc[] (position-based).
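The difference between .loc[] and .iloc[] is easiest to see when row labels do not match row positions. The sketch below assumes an index of 10, 20, 30, chosen only to make that difference visible.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob", "Charlie"], "age": [25, 30, 35]},
    index=[10, 20, 30],  # labels deliberately differ from positions 0, 1, 2
)

by_label = df.loc[20]      # the row whose label is 20 (Bob)
by_position = df.iloc[0]   # the first row, whatever its label (Alice)

# Slices behave differently: .loc includes the end label,
# while .iloc excludes the end position, like ordinary Python slicing.
label_slice = df.loc[10:20]     # rows labeled 10 and 20
position_slice = df.iloc[0:1]   # only the row at position 0

# Both accept a column selector as a second argument.
age_of_bob = df.loc[20, "age"]
print(by_label["name"], by_position["name"], age_of_bob)
```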

Filtering

Boolean conditions filter rows, e.g. df[df["age"] > 18].
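Under the hood, the condition produces a boolean Series, and indexing with it keeps the rows marked True. A minimal sketch with illustrative data, including how conditions might be combined:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NYC", "LA", "Chicago"],
})

mask = df["age"] > 28   # boolean Series, one True/False per row
older = df[mask]        # keeps only the rows where mask is True

# Combine conditions with & (and), | (or), ~ (not).
# Each condition needs its own parentheses.
older_not_la = df[(df["age"] > 28) & (df["city"] != "LA")]

# .isin() tests membership in a list of values.
nyc_or_la = df[df["city"].isin(["NYC", "LA"])]

print(older)
print(older_not_la)
print(nyc_or_la)
```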

Common Operations

Common operations include sorting with .sort_values(), grouping with .groupby(), handling missing data with .dropna() and .fillna(), and adding new columns by assignment.
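These operations are easier to follow on a table that actually contains a missing value. The sketch below uses made-up data with one missing age; filling with the column mean is just one possible choice, not a recommendation.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "LA"],
    "age": [25, 30, None, 40],  # None becomes NaN in a numeric column
})

dropped = df.dropna()                           # removes the row with NaN
filled = df.fillna({"age": df["age"].mean()})   # fills only the "age" column

# groupby + mean skips NaN values by default.
mean_by_city = df.groupby("city")["age"].mean()

# Rows with NaN in the sort column are placed last by default.
ranked = df.sort_values("age", ascending=False)

# Assigning to a new name adds a column to the DataFrame.
df["age_in_months"] = df["age"] * 12

print(mean_by_city)
print(ranked)
```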

Saving Data

Export a DataFrame with .to_csv() or .to_json().
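When no path is given, both methods return the output as a string, which is a convenient way to inspect what would be written. A small sketch with sample data; orient="records" is one of several JSON layouts and is chosen here only because it reads naturally.

```python
import io
import pandas as pd

df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]})

# With no path argument, the text is returned instead of written to disk.
csv_text = df.to_csv(index=False)          # index=False omits the row labels
json_text = df.to_json(orient="records")   # a JSON array with one object per row

print(csv_text)
print(json_text)

# Reading the CSV text back gives a table with the same values.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```

To write to disk instead, pass a path as the first argument, for example df.to_csv("output.csv", index=False).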

Syntax

<pre><code>import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    "name": ["Alice", "Bob", "Charlie"],
    "age": [25, 30, 35],
    "city": ["NYC", "LA", "Chicago"]
}
df = pd.DataFrame(data)

# Exploring data
print(df.head())     # first 5 rows
df.info()            # prints column types and non-null counts (returns None, so no print())
print(df.describe()) # statistical summary of numeric columns
print(df.shape)      # (rows, columns)

# Selecting columns
print(df["name"])         # single column (Series)
print(df[["name", "age"]]) # multiple columns (DataFrame)

# Selecting rows
print(df.loc[0])      # row by label/index
print(df.iloc[0:2])   # rows by position

# Filtering
age_30_plus = df[df["age"] >= 30]   # keeps rows where age is at least 30

# Adding a new column
df["is_adult"] = df["age"] >= 18

# Sorting
sorted_df = df.sort_values("age", ascending=False)

# Grouping
grouped = df.groupby("city")["age"].mean()

# Handling missing data
df_clean = df.dropna()          # remove rows with NaN
df_filled = df.fillna(0)        # replace NaN with 0

# Reading and writing CSV
# df = pd.read_csv("data.csv")
# df.to_csv("output.csv", index=False)
</code></pre>

Revision Notes

• Series = 1D labeled array, DataFrame = 2D labeled table
• pd.DataFrame(dict) or pd.read_csv() to create a DataFrame
• head(), info(), describe(), shape for exploration
• df["col"] for one column, df[["a","b"]] for multiple
• .loc[] is label-based, .iloc[] is position-based
• df[df["col"] > x] filters rows
• groupby(), sort_values(), dropna(), fillna() for transforms

Compute Column Average with Pandas

Easy

You are given a list of dictionaries representing rows of data, e.g. [{"name": "Alice", "score": 80}, {"name": "Bob", "score": 90}]. Write a function average_score(data) that builds a Pandas DataFrame from this data and returns the average of the "score" column.

Input:

[{"name": "Alice", "score": 80}, {"name": "Bob", "score": 90}, {"name": "Carl", "score": 70}]

Output:

80.0

Hint:

Use pd.DataFrame(data) to build the DataFrame, then df["score"].mean() to compute the average.

Solution:

<pre><code>import pandas as pd

def average_score(data):
    # Build a DataFrame from the list of row dictionaries
    df = pd.DataFrame(data)
    # Average of the "score" column
    return df["score"].mean()

data = [{"name": "Alice", "score": 80}, {"name": "Bob", "score": 90}, {"name": "Carl", "score": 70}]
print(average_score(data))
</code></pre>
