memwall by Ansh-Sarkar
🧱 MemWall: The Ultimate ML Memory & Performance Estimator

Ever stared at a CUDA Out of Memory error and wondered where it all went wrong? Or perhaps you’ve noticed your GPU utilization sitting at a measly 30% while training your shiny new Transformer? Welcome to MemWall.

MemWall is a comprehensive Python library designed to help ML practitioners plan, profile, and optimize their hardware utilization. We bridge the gap between abstract model architectures and the harsh reality of hardware physics.


🏔️ The Core Philosophy: Understanding the “Ridge Point”

At the heart of optimization lies the Roofline Model, an intuitive visual model that connects computational performance, memory bandwidth, and arithmetic intensity.

**What is the Ridge Point?**
Every piece of hardware (like your A100 or RTX 4090) has two critical ceilings:

  1. **Peak Memory Bandwidth (GB/s):** How fast can it move data?
  2. **Peak FLOPs (TFLOP/s):** How fast can it crunch numbers?

The Ridge Point is the exact ratio where these two meet, calculated simply as:
Ridge Point = Peak FLOPs / Peak Bandwidth

This single number is your North Star. By calculating the Arithmetic Intensity (FLOPs / Bytes) of your operation, you can compare it to the Ridge Point:

  • **Arithmetic Intensity < Ridge Point:** You are **Memory-Bound**. Your GPU is starved for data. Adding more compute power won’t help; you need faster memory or better data reuse.
  • **Arithmetic Intensity > Ridge Point:** You are **Compute-Bound**. You are crunching numbers efficiently. Your bottleneck is the raw math power of the GPU.
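
The comparison above is a one-liner in plain Python. A minimal sketch (not MemWall’s API) using illustrative A100-class peak figures:

```python
# Illustrative peak figures for an A100-class GPU:
# ~312 TFLOP/s FP16 tensor-core compute, ~2.0 TB/s HBM bandwidth.
peak_flops = 312e12      # FLOP/s
peak_bandwidth = 2.0e12  # bytes/s

ridge_point = peak_flops / peak_bandwidth  # FLOP per byte (~156)

# A hypothetical operation: 2 TFLOPs of work moving 40 GB of data.
intensity = 2e12 / 4e10  # FLOPs / bytes = 50 FLOP per byte

bound = "compute-bound" if intensity > ridge_point else "memory-bound"
print(f"ridge={ridge_point:.0f} FLOP/B, intensity={intensity:.0f} FLOP/B -> {bound}")
```

Here the operation’s intensity (50 FLOP/B) sits below the ridge point (156 FLOP/B), so it is memory-bound: faster compute would not speed it up.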

MemWall makes understanding and visualizing this dynamic effortless.


🚀 Key Features

  • **Transformer VRAM Estimator:** Instantly compute memory needs for weights, KV cache, and activations.
  • **Roofline Model Calculator:** Automatically classify operations as memory-bound or compute-bound.
  • **PyTorch Integration:** Lightweight hooks to profile real-world peak memory and layer-by-layer usage.
  • **Optimization Advisors:** Actionable recommendations for mixed precision and batch sizes.

📦 Installation

pip install memwall

📖 Deep Dive: Exploring the Examples

To really understand what MemWall can do, let’s walk through the three core examples included in the library. Think of this as your interactive guide to solving ML performance bottlenecks.

🔍 Example 1: Basic VRAM Estimation (examples/basic_estimation.py)

The Question: “I want to run Llama-7B with a batch size of 4 and a sequence length of 2048. Will it fit on my GPU?”

Before you even load a single tensor or write a PyTorch script, MemWall can tell you exactly what you’re getting into. The basic_estimation.py script demonstrates how to load a preset estimator for models like Llama-7B.

**What it does:**
It mathematically estimates the exact VRAM breakdown. It doesn’t just guess a single number; it breaks it down into:

  • **Weights Memory:** The static size of the model parameters.
  • **KV Cache Memory:** The dynamic memory required for generating tokens during inference.
  • **Activation Memory:** The memory needed for intermediate calculations.

Result: You get a clean readout of exactly how many Gigabytes you need, saving you from trial-and-error OOM crashes.
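
The arithmetic behind such an estimate is straightforward. A back-of-the-envelope sketch (not MemWall’s API) for the weights and KV-cache terms, using Llama-7B-like shapes as assumptions (32 layers, 32 heads of dimension 128, ~6.7B parameters, fp16 everywhere; activation memory is omitted here):

```python
def estimate_vram_gb(n_params, n_layers, n_heads, head_dim,
                     batch, seq_len, bytes_per_elem=2):
    """Rough inference VRAM breakdown: weights + KV cache, in GiB.

    bytes_per_elem=2 assumes fp16/bf16 for both weights and cache.
    """
    weights = n_params * bytes_per_elem
    # KV cache: 2 tensors (K and V) per layer, each [batch, seq, heads, head_dim]
    kv_cache = (2 * n_layers * batch * seq_len
                * n_heads * head_dim * bytes_per_elem)
    gib = 1024 ** 3
    return {"weights_gb": weights / gib, "kv_cache_gb": kv_cache / gib}

print(estimate_vram_gb(n_params=6.7e9, n_layers=32, n_heads=32,
                       head_dim=128, batch=4, seq_len=2048))
```

For this configuration the KV cache alone works out to 4 GiB on top of roughly 12.5 GiB of fp16 weights, which is exactly the kind of breakdown the example script reports.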

⏱️ Example 2: PyTorch Profiling (examples/pytorch_profiling.py)

The Question: “My training script is eating 40GB of VRAM, but my model is only 2GB. Where is the memory going?”

Sometimes math isn’t enough—you need to see what PyTorch is actually doing. The pytorch_profiling.py example highlights MemWall’s lightweight PyTorch hooks.

**What it does:**
By wrapping your model with MemWall’s profile_model function and passing dummy data, the library tracks every single forward pass operation. It outputs:

  • **Peak Memory:** The absolute highest water-mark your GPU hit during the pass.
  • **Layer Breakdown:** A microscopic view of exactly how much memory (in MB) every single layer (fc1, relu, fc2) consumed.
  • **Incremental VRAM:** How much additional memory was allocated beyond the base model weights.

Result: You instantly spot the specific layers that are hoarding memory, allowing you to selectively apply gradient checkpointing or optimize your architecture.
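
Under the hood, this style of per-layer tracking can be built with standard PyTorch forward hooks. A simplified sketch of the idea (not MemWall’s actual `profile_model` implementation) that records each leaf layer’s output-tensor size; real peak-memory numbers would additionally come from `torch.cuda.max_memory_allocated()` around the pass:

```python
import torch
import torch.nn as nn

def profile_layers(model, x):
    """Record the output-tensor size (MiB) of every leaf module."""
    sizes_mb, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                sizes_mb[name] = output.numel() * output.element_size() / 2**20
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(x)
    finally:
        for h in handles:  # always detach hooks so they don't leak
            h.remove()
    return sizes_mb

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10))
print(profile_layers(model, torch.randn(8, 1024)))
```

Each entry tells you how much memory that layer’s activations occupy, which is the raw data behind a layer-by-layer breakdown.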

📈 Example 3: Roofline Analysis (examples/roofline_analysis.py)

The Question: “Why is my matrix multiplication taking so long? Do I need more compute, or faster memory?”

This script brings the Ridge Point concept to life.

**What it does:**

  1. You select a target GPU (e.g., A100_80GB_SXM), and MemWall loads its peak constraints.
  2. You define specific operations by their workload: how many FLOPs they execute and how many bytes they move.
  3. MemWall calculates the Arithmetic Intensity for each operation.
  4. Finally, it compares this to the GPU’s Ridge Point to print out a diagnosis: Is this layer Memory-Bound or Compute-Bound? What is the predicted utilization?
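
The four steps above can be sketched in a few lines of plain Python (hypothetical spec table and operation figures, not MemWall’s API), including the predicted utilization, which is simply attainable performance divided by peak:

```python
# Hypothetical spec table; figures are published peak numbers for the A100.
GPUS = {
    "A100_80GB_SXM": {"peak_flops": 312e12, "peak_bw": 2.0e12},  # FP16 / HBM2e
}

def diagnose(gpu, ops):
    """For each (flops, bytes) op, classify it and predict FLOP utilization."""
    spec = GPUS[gpu]
    ridge = spec["peak_flops"] / spec["peak_bw"]
    report = {}
    for name, (flops, nbytes) in ops.items():
        intensity = flops / nbytes
        attainable = min(spec["peak_flops"], intensity * spec["peak_bw"])
        report[name] = {
            "bound": "compute" if intensity > ridge else "memory",
            "utilization": attainable / spec["peak_flops"],
        }
    return report

ops = {
    "large_matmul": (8e12, 2.5e10),    # heavy data reuse: ~320 FLOP/B
    "elementwise_add": (1e9, 1.2e10),  # ~0.08 FLOP/B, badly memory-bound
}
for name, r in diagnose("A100_80GB_SXM", ops).items():
    print(f"{name}: {r['bound']}-bound, predicted utilization {r['utilization']:.1%}")
```

The elementwise add lands far left of the ridge point and can never use more than a fraction of a percent of peak FLOPs, while the large matmul sits to the right and can, in principle, saturate the device.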

As a bonus, the script automatically generates a beautiful roofline_example.png plot, mapping your operations directly against the theoretical limits of your hardware.

Result: You stop guessing why your code is slow. You get a mathematical proof pointing you toward either memory optimization or algorithmic improvements.


📉 Supported GPUs

MemWall ships with precise hardware specifications for modern accelerators:

  • NVIDIA A100 (80GB/40GB SXM & PCIe)
  • NVIDIA H100
  • NVIDIA RTX 4090 / 3090
  • …or define your own custom peak FLOPs and Bandwidth!

📜 License

MIT License. Built for the community, by the community.

Current release: v0.14.0 (beta)