Intermediate · 18 min read

Data Science Basics

Analyze data with NumPy arrays, Pandas DataFrames, and Matplotlib visualizations.

NumPy Fundamentals

NumPy's ndarray provides fast, vectorized math on homogeneous arrays, meaning every element shares one dtype. Broadcasting applies an operation across arrays of different but compatible shapes without explicit loops.
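To make broadcasting concrete, here is a minimal sketch. The price and quantity figures are made-up illustration data, not part of the tutorial's dataset.

```python
import numpy as np

# Hypothetical unit prices for 3 products, shape (3,)
prices = np.array([10.0, 20.0, 30.0])

# Hypothetical quantities sold: 2 stores x 3 products, shape (2, 3)
quantities = np.array([[1, 2, 3],
                       [4, 5, 6]])

# Broadcasting stretches prices across both rows, so no loop is needed
revenue = quantities * prices

# Sum across products (axis=1) to get one total per store
store_totals = revenue.sum(axis=1)
print(revenue)
print(store_totals)
```

The shapes (2, 3) and (3,) are compatible because their trailing dimensions match; a prices array of length 2 would raise a ValueError here.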

Use np.array, np.zeros, and np.arange to construct arrays. Statistical functions such as mean and std, together with linear-algebra operations such as the dot product, power most numerical code.
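A short sketch of those constructors and functions, using small made-up values so the results are easy to check by hand:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])   # built from a Python list
z = np.zeros(3)                 # three zeros, float64 by default
r = np.arange(0, 6, 2)          # 0, 2, 4 (the stop value is exclusive)

print(a.mean())       # arithmetic mean
print(a.std())        # population standard deviation (ddof=0 by default)
print(np.dot(a, r))   # dot product: 1*0 + 2*2 + 3*4
```

Note that std defaults to the population formula; pass ddof=1 if a sample standard deviation is what you want.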

NumPy underpins Pandas and scikit-learn, so learn array slicing and boolean masking early.

  • Prefer vectorization over Python loops on large arrays
  • Watch dtype: float64 vs. int affects memory and precision
  • Pass a fixed seed to np.random.default_rng for reproducible randomness

import numpy as np

sales = np.array([120, 95, 140, 88])
# Print the mean, then only the values above 100 (boolean mask)
print(sales.mean(), sales[sales > 100])
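The dtype and randomness tips above can be sketched as follows. The seed value 42 and the array sizes are arbitrary choices for illustration.

```python
import numpy as np

# Two generators created with the same seed produce the same stream
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
sample_a = rng_a.normal(loc=100, scale=15, size=5)
sample_b = rng_b.normal(loc=100, scale=15, size=5)
print(np.array_equal(sample_a, sample_b))

# dtype drives memory use: 8 bytes per float64 element, 4 per float32
big = np.zeros(1000, dtype=np.float64)
small = big.astype(np.float32)
print(big.nbytes, small.nbytes)

# Integer arrays silently truncate fractional values on assignment
counts = np.array([1, 2, 3])
counts[0] = 2.9
print(counts[0])
```

Calling default_rng() with no seed draws fresh entropy from the operating system, so results differ between runs.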

Pandas DataFrames

A Pandas DataFrame is a table with labeled rows and columns. read_csv loads data, groupby aggregates it, and merge joins tables.

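Handle missing values deliberately with isna, fillna, and dropna rather than letting them propagate silently. Avoid chained-assignment pitfalls by selecting rows and columns in a single .loc call, which writes to the original DataFrame instead of a temporary copy.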
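As an illustration, the sketch below works on a small made-up orders table. The column names and the choice of the median as a fill value are assumptions, not recommendations for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical orders table with one missing total
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": ["north", "south", "north", "south"],
    "total": [120.0, np.nan, 140.0, 88.0],
})

# Count the gaps before deciding what to do with them
missing_before = int(df["total"].isna().sum())
print(missing_before)

filled = df.fillna({"total": 0.0})      # option 1: substitute a constant
dropped = df.dropna(subset=["total"])   # option 2: discard incomplete rows

# Option 3: one .loc call selects rows and column together, so the
# assignment lands on df itself rather than on a temporary copy
df.loc[df["total"].isna(), "total"] = df["total"].median()
print(df)
```

By contrast, a chained form such as df[df["total"].isna()]["total"] = ... assigns into an intermediate object and may leave df unchanged.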

Export cleaned datasets to Parquet, which preserves dtypes and typically reloads faster than CSV.

  • Set parse_dates in read_csv for time series
  • Use .copy() when creating derived DataFrames
  • Profile memory with df.info() on large files

import pandas as pd

# Assumes orders.csv has "month" and "total" columns
df = pd.read_csv("orders.csv")
# Sum order totals within each month
monthly = df.groupby("month")["total"].sum()
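For a version that runs without a file on disk, the sketch below reads an in-memory stand-in for orders.csv. Its columns (order_date, customer_id, total) and the customers table are invented for illustration. It also exercises parse_dates, merge, and .copy() from the tips above.

```python
import io
import pandas as pd

# Stand-in for orders.csv; the column names are assumptions
csv_text = """order_date,customer_id,total
2024-01-05,1,120.0
2024-01-20,2,95.0
2024-02-03,1,140.0
2024-02-14,3,88.0
"""

# parse_dates converts the text column to datetime64 values
orders = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

# Derive a month column from the parsed dates, then aggregate
orders["month"] = orders["order_date"].dt.month
monthly = orders.groupby("month")["total"].sum()
print(monthly)

# Hypothetical lookup table joined on the shared key
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ada", "Ben", "Cy"]})
joined = orders.merge(customers, on="customer_id", how="left")

# .copy() makes the derived frame independent of orders
large = orders[orders["total"] > 100].copy()
large["flag"] = "review"
```

Because large is an explicit copy, adding the flag column does not touch orders and raises no SettingWithCopyWarning.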

Visualization and Cleaning

Matplotlib's pyplot interface creates line, bar, and scatter plots. Seaborn builds statistical visuals on top of Matplotlib for exploration.
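A minimal bar-chart sketch follows. The month labels are made up, and the Agg backend is an assumption that lets the script run without a display; in a notebook you would normally omit that line and call plt.show() instead of saving to a buffer.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed for headless runs
import matplotlib.pyplot as plt

# Hypothetical labels paired with example sales figures
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 95, 140, 88]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(months, sales)

# A title and axis labels make the chart readable out of context
ax.set_title("Monthly sales")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")

# Render to an in-memory PNG; pass a filename to write to disk instead
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Swapping ax.bar for ax.plot or ax.scatter gives a line or scatter plot with the same labeling calls.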

Data cleaning dominates real projects: normalize text, parse dates, deduplicate keys, and document transformation pipelines.
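The cleaning steps named above might look like this when gathered into one documented function. The clean_customers name, the column names, and the sample rows are all hypothetical.

```python
import pandas as pd

# Hypothetical raw export with inconsistent casing and whitespace
raw = pd.DataFrame({
    "email": [" Ada@Example.com", "ben@example.com ", "ada@example.com"],
    "signup": ["2024-01-05", "2024-02-10", "2024-01-05"],
})


def clean_customers(frame: pd.DataFrame) -> pd.DataFrame:
    """Normalize text, parse dates, and drop duplicate keys."""
    out = frame.copy()  # leave the caller's data untouched
    out["email"] = out["email"].str.strip().str.lower()
    out["signup"] = pd.to_datetime(out["signup"], format="%Y-%m-%d")
    out = out.drop_duplicates(subset="email", keep="first")
    return out.reset_index(drop=True)


cleaned = clean_customers(raw)
print(cleaned)
```

Text is normalized before deduplication on purpose: the first and third rows only collide once casing and whitespace are stripped. A function like this is also easy to move from a notebook into a module and cover with a unit test.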

Keep notebooks exploratory; promote stable transforms to tested Python modules for production.

  • Label axes and titles for shareable charts
  • Validate assumptions with df.describe() and value_counts()
  • Version control small sample datasets, not multi-GB blobs
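The validation tip can be sketched as a quick sanity check. The table below is made-up data, and the non-negative rule is only an example of an assumption you might assert.

```python
import pandas as pd

# Hypothetical cleaned orders
df = pd.DataFrame({
    "region": ["north", "south", "north", "north"],
    "total": [120.0, 95.0, 140.0, 88.0],
})

# describe() summarizes count, mean, std, min, quartiles, and max
summary = df["total"].describe()
print(summary)

# value_counts() reveals unexpected or misspelled categories
counts = df["region"].value_counts()
print(counts)

# Turn an assumption into an explicit check that fails loudly
assert summary["min"] >= 0, "order totals should not be negative"
```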
