Data Science Basics
Analyze data with NumPy arrays, Pandas DataFrames, and Matplotlib visualizations.
NumPy Fundamentals
NumPy ndarray provides fast vectorized math on homogeneous arrays. Broadcasting applies operations across shapes without explicit loops.
Construct arrays with np.array, np.zeros, and np.arange. Reductions such as mean and std, together with linear-algebra operations such as the dot product, cover most everyday numerical work.
NumPy underpins Pandas and scikit-learn—learn array slicing and boolean masking early.
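As a rough illustration of broadcasting, slicing, and boolean masking, the sketch below uses a small made-up table of sales figures. The array names and the numbers are assumptions chosen for the example, not data from a real source.

```python
import numpy as np

# Hypothetical data: rows are stores, columns are quarters.
sales = np.array([[120.0, 95.0, 140.0, 88.0],
                  [80.0, 110.0, 60.0, 130.0]])

# Shape (4,) broadcasts against shape (2, 4): every row is divided element-wise.
quarter_targets = np.array([100.0, 100.0, 120.0, 90.0])
ratio = sales / quarter_targets

# Shape (2, 1) broadcasts across the columns: subtract each store's own mean.
centered = sales - sales.mean(axis=1, keepdims=True)

# A boolean mask keeps only the entries that met their target (result is 1-D).
hits = sales[ratio >= 1.0]

# Slicing: the first two quarters for every store.
first_half = sales[:, :2]

print(ratio.shape, hits, first_half.shape)
```

No Python loop appears anywhere: the shapes line up by broadcasting, and the mask does the filtering.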
- Prefer vectorization over Python loops on large arrays
- Watch dtype: float64 versus an integer type changes both memory use and precision
- Use np.random.default_rng for reproducible randomness
import numpy as np
sales = np.array([120, 95, 140, 88])
print(sales.mean(), sales[sales > 100])
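The dtype and reproducibility tips above can be sketched as follows. The seed value 42 and the array sizes are arbitrary choices made for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same draws.
rng = np.random.default_rng(seed=42)
noise = rng.normal(loc=0.0, scale=1.0, size=5)

rng_again = np.random.default_rng(seed=42)
same_noise = rng_again.normal(loc=0.0, scale=1.0, size=5)
print(np.array_equal(noise, same_noise))  # True

# dtype sets the bytes used per element: 4 for int32, 8 for float64.
counts = np.arange(1_000, dtype=np.int32)
as_float = counts.astype(np.float64)
print(counts.nbytes, as_float.nbytes)  # 4000 8000

# np.zeros defaults to float64 unless a dtype is passed explicitly.
print(np.zeros(3).dtype)  # float64
```

Passing the generator object around, rather than relying on global random state, keeps each experiment's randomness explicit.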
Pandas DataFrames
Pandas DataFrames are tabular with labeled columns. read_csv loads data; groupby aggregates; merge joins tables.
Handle missing values deliberately with isna, fillna, and dropna. Avoid chained-assignment pitfalls by selecting rows and columns in a single .loc call when assigning, for example df.loc[mask, "col"] = value.
Export cleaned datasets to Parquet, which reloads faster than CSV and preserves column dtypes.
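A minimal sketch of the missing-value and .loc advice, assuming a hypothetical orders table. The column names and the decision to drop or fill are illustrative, and the right policy depends on the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical orders table with gaps in two columns.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["north", "south", None, "north"],
    "total": [120.0, np.nan, 140.0, 88.0],
})

# Count missing values per column before deciding what to do with them.
missing_per_column = orders.isna().sum()

# Decide per column: drop rows without a region, treat a missing total as 0.
# .copy() makes the derived frame independent of the original.
cleaned = orders.dropna(subset=["region"]).copy()
cleaned["total"] = cleaned["total"].fillna(0.0)

# One .loc call selects rows and column together, so the assignment
# writes to the frame itself rather than to a temporary object.
cleaned["tier"] = "standard"
cleaned.loc[cleaned["total"] > 100, "tier"] = "large"

print(missing_per_column)
print(cleaned)
```

The original orders frame still contains its missing values afterwards, which makes it easy to compare before and after.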
- Set parse_dates in read_csv for time series
- Use .copy() when creating derived DataFrames
- Profile memory with df.info() on large files
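The loading, joining, and aggregating steps described in this section might look like the sketch below. To stay self-contained it reads CSV text from memory instead of a file, and the column names and the customers table are assumptions made for the example.

```python
import io

import pandas as pd

# Inline CSV text stands in for a file on disk (hypothetical columns).
orders_csv = io.StringIO(
    "order_id,customer_id,order_date,total\n"
    "1,10,2024-01-05,120.0\n"
    "2,11,2024-01-20,95.0\n"
    "3,10,2024-02-03,140.0\n"
)

# parse_dates converts the column to datetime64 while loading.
orders = pd.read_csv(orders_csv, parse_dates=["order_date"])

customers = pd.DataFrame({
    "customer_id": [10, 11],
    "region": ["north", "south"],
})

# A left join keeps every order even if a customer record is missing.
joined = orders.merge(customers, on="customer_id", how="left")

# Derive a monthly period from the parsed dates, then aggregate.
joined["month"] = joined["order_date"].dt.to_period("M")
monthly = joined.groupby("month")["total"].sum()

print(monthly)
joined.info(memory_usage="deep")  # dtypes and memory footprint
```

Because order_date is a real datetime column, the .dt accessor is available and no string slicing is needed to get the month.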
import pandas as pd
df = pd.read_csv("orders.csv")
monthly = df.groupby("month")["total"].sum()
Visualization and Cleaning
Matplotlib pyplot creates line, bar, and scatter plots. Seaborn builds statistical visuals on top for exploration.
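As a small example of a Matplotlib chart with labeled axes and a title, the sketch below draws a bar plot from made-up monthly figures. The Agg backend and the in-memory buffer are choices made so the script runs without a display or a writable directory; a file path passed to savefig works just as well.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, no display required
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 88]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(months, sales, color="steelblue")

# Labels and a title make the chart readable outside the notebook.
ax.set_xlabel("Month")
ax.set_ylabel("Sales (units)")
ax.set_title("Monthly sales")
fig.tight_layout()

# Render to an in-memory PNG.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Working with the fig and ax objects, rather than the implicit pyplot state, makes it easier to move the plotting code into a function later.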
Data cleaning dominates real projects: normalize text, parse dates, deduplicate keys, and document transformation pipelines.
Keep notebooks exploratory; promote stable transforms to tested Python modules for production.
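One way to promote a notebook transform into a testable unit is to wrap it in a plain function, as sketched below. The function name clean_customers, the column names, and the sample rows are all hypothetical.

```python
import pandas as pd


def clean_customers(raw: pd.DataFrame) -> pd.DataFrame:
    """Normalize text, parse dates, and deduplicate on the email key."""
    df = raw.copy()  # leave the caller's frame untouched
    # Normalize text: trim whitespace and lowercase the key column.
    df["email"] = df["email"].str.strip().str.lower()
    # Parse dates: values that cannot be parsed become NaT instead of raising.
    df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
    # Deduplicate on the normalized key, keeping the first occurrence.
    df = df.drop_duplicates(subset="email", keep="first")
    return df.reset_index(drop=True)


raw = pd.DataFrame({
    "email": [" Ana@Example.com", "ana@example.com ", "bo@example.com"],
    "signup_date": ["2024-01-05", "2024-01-05", "not a date"],
})
cleaned = clean_customers(raw)
print(cleaned)
```

Since the function takes a DataFrame and returns a new one, a unit test can pass in a few hand-written rows and assert on the result without touching any files.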
- Label axes and titles for shareable charts
- Validate assumptions with df.describe() and value_counts()
- Version control small sample datasets, not multi-GB blobs
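The validation tip above can be turned into explicit checks, roughly as follows. The table, the allowed region labels, and the non-negativity rule are assumptions picked for the example.

```python
import pandas as pd

# Hypothetical sample of an orders table.
orders = pd.DataFrame({
    "region": ["north", "south", "north", "north"],
    "total": [120.0, 95.0, 140.0, 88.0],
})

# describe() summarizes a numeric column; value_counts() tallies categories.
summary = orders["total"].describe()
region_counts = orders["region"].value_counts()
print(summary)
print(region_counts)

# Turn the assumptions into checks that fail loudly when the data drifts.
assert summary["min"] >= 0, "totals should be non-negative"
assert set(region_counts.index) <= {"north", "south"}, "unexpected region label"
```

Checks like these are cheap to run on a small versioned sample before the same code touches the full dataset.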