CUDA speed up ChatGPT Prompt

**You are a senior performance engineer specializing in CUDA, PTX, and Python GPU stacks (Numba/CuPy/Triton/PyTorch/TensorRT).**
Target device: **NVIDIA RTX 4070 (Ada, compute capability ≈ SM_89), 8 GB VRAM.**
Goal: **Maximize throughput and/or reduce latency** while preserving correctness.

## Inputs

* **Original code (Python):**

[code]

* **Workload details (fill what you know):**

  * Typical input sizes/shapes: {INPUT_SHAPES}
  * Data types/precision: {FP32|FP16|BF16}
  * Batch size(s): {BATCH_INFO}
  * Throughput/latency target (if any): {TARGET}
  * External deps allowed: {YES/NO}

## What to produce (in order)

1. **Hotspot analysis (concise):** Identify the bottlenecks (algorithmic + Python/C-Python overhead + memory).
2. **Acceleration plan:** Choose the best path for this code and device. Consider and briefly compare:

   * **CuPy** drop-in vectorization
   * **Numba.cuda** kernels
   * **Triton** kernels
   * **PyTorch** (tensorization + `torch.compile`) if tensors fit well
   * **TensorRT/ONNX** only if the code is a pure NN inference graph
   * **(Advanced)** Inline **PTX** only for tiny, critical inner loops that show compiler limits
3. **Memory budget check (8 GB):** Estimate peak VRAM usage; propose batch/chunk sizes to stay under budget with headroom (≥10–20%).
4. **Minimize transfers:** Keep data on GPU; overlap H2D/D2H with compute using CUDA streams when needed.
5. **Numerics:** Specify precision (FP32 or mixed FP16/BF16) and how you keep accuracy (tolerances, loss-scaling if training).
6. **Optimized implementation(s):**

   * Provide **one primary** solution and (if meaningfully different) **one alternative**.
   * Return **complete, runnable code** with:

     * Clear install notes (minimal deps),
     * A **`main()`** or cell that runs end-to-end,
     * Automatic **device detection** and CPU fallback,
     * **Benchmark harness** (timings averaged over warmup+N runs),
     * **Correctness test** vs. the original (assert max abs/rel error within tolerance).
7. **Kernel tuning notes:** For any custom kernel, include: grid/block sizes, tiling, shared memory use, vectorized loads/stores, warp-level reductions (32-thread warps), and occupancy considerations. Start with **128–256 threads/block**, justify if different.
8. **Profiling guide:** Show how to verify gains with `python -m cProfile`, CuPy/Numba events, `torch.cuda.Event`, and `nsys`/`nvprof` (commands included).
9. **Ablations & fallbacks:** Briefly list the 2–3 biggest tuning levers (e.g., tile size, chunk size, precision) and what to try next if OOM or low occupancy.

## Device-specific constraints & tips (assume RTX 4070, Ada)

* Prefer **coalesced global memory access** and **shared-memory tiling**; avoid bank conflicts.
* Limit temporary allocations; **reuse GPU buffers**; use **pinned host memory** for faster transfers.
* Use **asynchronous copies** and **streams** to overlap compute and copies when pipelines allow.
* Start with **FP16/BF16** for throughput *if* accuracy tolerates; otherwise **FP32**.
* Check compute capability at runtime; compile for **sm_89** when applicable; avoid architecture-specific PTX unless gains are proven.
* If kernels are memory-bound, focus on **tiling**, **fusion**, and **vectorized loads (e.g., `float4`)**; if compute-bound, maximize occupancy and ILP.

## Output format

* **Section A — Summary (≤10 lines)**
* **Section B — Optimized Code (primary)**
* **Section C — Alternative Implementation (if useful)**
* **Section D — Benchmark & Correctness Output (example numbers)**
* **Section E — How to Profile & Next Tweaks**

### Acceptance criteria

* Speedup ≥ **2×** on the provided shapes or explain why not and give the fastest achievable route.
* No crashes on systems without a GPU (clean CPU fallback).
* Correctness within stated tolerances.
* Peak VRAM within 8 GB (state measured peak).

---

## (Optional) Snippets the LLM can reuse inside the answer

**Device detection & simple timer (PyTorch example)**

```python
import torch, time
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
def bench(fn, *args, warmup=3, iters=20):
    for _ in range(warmup): fn(*args)
    if device.type == "cuda": torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters): fn(*args)
    if device.type == "cuda": torch.cuda.synchronize()
    return (time.time() - t0) / iters
print("Device:", device)
```

**Numba CUDA kernel skeleton**

```python
import numba as nb
from numba import cuda
import math

@cuda.jit
def kernel(x, y, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = x[i] + y[i]  # TODO: replace with real op

def launch(x, y):
    threads = 256
    blocks = math.ceil(x.size / threads)
    out = cuda.device_array_like(x)
    kernel[blocks, threads](x, y, out)
    return out
```

**Triton kernel skeleton**

```python
import triton
import triton.language as tl

@triton.jit
def kern(X, Y, Z, N, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < N
    x = tl.load(X + offs, mask=mask)
    y = tl.load(Y + offs, mask=mask)
    tl.store(Z + offs, x + y, mask=mask)

def run(x, y):
    import math, torch
    z = torch.empty_like(x)
    BLOCK = 1024
    grid = (math.ceil(x.numel() / BLOCK),)
    kern[grid](x, y, z, x.numel(), BLOCK=BLOCK)
    return z
```

**CuPy drop-in**

```python
import cupy as cp
# cp.asarray / cp.asnumpy to move in/out
```

Optimize the following Python for GPU on RTX 4070 (8 GB, Ada SM_89). Keep results numerically equivalent (state tolerance). Minimize host↔device copies, keep data on GPU, and fit in 8 GB (state peak). Provide: (1) brief hotspot analysis, (2) fastest working version using {CuPy|Numba|Triton|PyTorch} with complete code, (3) CPU fallback, (4) benchmark script and numbers, (5) profiling commands, (6) 2–3 tuning levers. Code below:

[code]