What Is a Tensor Core GPU?
NVIDIA Tensor Core GPUs are the de facto standard for AI computing, thanks to an architecture specifically designed for the types of operations used in neural networks.
Tensors are the core data structure in AI: multidimensional arrays that hold weights, activations, and input data. Processing them requires massive matrix multiplication, which is accelerated by a specialized hardware unit called the Tensor Core. Unlike traditional CUDA cores, which operate on individual values, a Tensor Core performs a mixed-precision matrix multiply-accumulate on an entire small block of numbers per clock cycle.
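To make "mixed-precision multiply-accumulate on a block" concrete, here is a minimal pure-Python sketch of the operation D = A × B + C on one 4×4 tile, with inputs rounded to FP16 and the running sum kept in FP32. It only illustrates the numerics; the helper names (`to_fp16`, `to_fp32`, `tile_fma`) are invented for this example, and real hardware does this in parallel and may round intermediate sums differently.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tile_fma(a, b, c):
    """Compute D = A x B + C for square tiles: FP16 inputs, FP32 accumulation."""
    n = len(a)
    d = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = to_fp32(c[i][j])
            for k in range(n):
                # The product of two FP16 values is exact in a Python float (FP64),
                # so the only rounding here is the FP32 accumulation step.
                acc = to_fp32(acc + to_fp16(a[i][k]) * to_fp16(b[k][j]))
            row.append(acc)
        d.append(row)
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
b = [[0.5 * (i + j) for j in range(4)] for i in range(4)]
zeros = [[0.0] * 4 for _ in range(4)]

result = tile_fma(identity, b, zeros)
print(result == b)          # identity x B + 0 reproduces B for FP16-exact values
print(to_fp16(0.1) == 0.1)  # 0.1 is not exactly representable in FP16
```

Keeping the accumulator in a wider format than the inputs is what lets low-precision inputs be used without the sum losing accuracy as quickly as it would in pure FP16.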
Tensor Cores first appeared in the Tesla V100 (Volta, 2017). Starting with the Ampere generation, NVIDIA moved away from the “Tesla” brand in favor of the broader “Tensor Core GPU” positioning — emphasizing that tensor performance and high-bandwidth HBM memory determine the real total cost of ownership (TCO) of an AI cluster.
Cloud4Y offers all major generations of these GPUs for rent: data center accelerators (from V100 to B300) and workstation-class cards (RTX 4090, RTX A6000 Ada, RTX 5090, RTX 6000 Blackwell). Let’s examine how they differ and which GPU to choose for specific tasks.
Summary Table: All Cloud4Y GPUs
For server GPUs, specifications refer to SXM versions (HGX/DGX platforms). PCIe versions have reduced performance.
How to Read the Table
- FP32, FP16, FP8 — Compute performance at different precision formats. Higher values mean faster task execution. FP16 and FP8 are the primary formats for AI workloads.
- Memory — The amount of data that fits “on the card.” Determines the maximum model size that can be deployed.
- Memory bandwidth — The speed at which data is delivered to the compute units. Critical for large-model inference.
- NVLink — Interconnect between GPUs. Available only on server GPUs and enables multi-GPU clustering.
- TDP — Power consumption. Affects operating costs and cooling requirements.
Performance values are given in TFLOPS (trillions of floating-point operations per second). The higher the value, the faster the GPU performs computations at the specified precision.
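Why memory bandwidth matters for large-model inference can be shown with a common rule of thumb: when generating one token at a time, every weight has to be read from memory once per token, so bandwidth divided by model size gives a ceiling on single-stream decode speed. The sketch below applies that assumption to the H100 and H200 bandwidth figures from this article's summary table. It ignores batching, the KV cache, and compute limits, so treat the results as upper bounds, not benchmarks.

```python
def decode_ceiling_tokens_per_s(bandwidth_tb_s: float,
                                params_billions: float,
                                bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed, assuming each weight
    is read from GPU memory exactly once per generated token."""
    model_bytes = params_billions * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes


# 70B-parameter model stored in FP8 (1 byte per parameter).
h100 = decode_ceiling_tokens_per_s(3.35, 70, 1.0)
h200 = decode_ceiling_tokens_per_s(4.8, 70, 1.0)

print(f"H100 ceiling: {h100:.1f} tokens/s")
print(f"H200 ceiling: {h200:.1f} tokens/s")
```

Under this simplification the H200's advantage over the H100 comes from bandwidth alone, since the table lists the same FP32 figure for both.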
| | V100 | A100 | H100 | H200 | B200 | B300 | RTX 4090 | A6000 Ada | RTX 5090 | RTX 6000 Blackwell |
|---|---|---|---|---|---|---|---|---|---|---|
| Architecture | Volta | Ampere | Hopper | Hopper | Blackwell | Blackwell Ultra | Ada Lovelace | Ada Lovelace | Blackwell | Blackwell |
| Year | 2017 | 2020 | 2022 | 2024 | 2025 | 2025 | 2022 | 2022 | 2025 | 2025 |
| Segment | Data Center | Data Center | Data Center | Data Center | Data Center | Data Center | Workstation | Workstation | Workstation | Workstation |
| FP64 | 7.8 TFLOPS | 9.7 TFLOPS | 34 TFLOPS | 34 TFLOPS | 37 TFLOPS | 1.2 TFLOPS | — | — | — | — |
| FP32 | 15.7 TFLOPS | 19.5 TFLOPS | 67 TFLOPS | 67 TFLOPS | 75 TFLOPS | 75 TFLOPS | 82.6 TFLOPS | 91.1 TFLOPS | 104.8 TFLOPS | 125 TFLOPS |
| Memory | 32 GB HBM2 | 80 GB HBM2e | 80 GB HBM3 | 141 GB HBM3e | 192 GB HBM3e | 288 GB HBM3e | 24 GB GDDR6X | 48 GB GDDR6 | 32 GB GDDR7 | 96 GB GDDR7 |
| Memory Bandwidth | 900 GB/s | 2 TB/s | 3.35 TB/s | 4.8 TB/s | 8 TB/s | 8 TB/s | 1.01 TB/s | 960 GB/s | 1.79 TB/s | 1.8 TB/s |
| NVLink | 300 GB/s | 600 GB/s | 900 GB/s | 900 GB/s | 1.8 TB/s | 1.8 TB/s | — | — | — | — |
| TDP | 300 W | 400 W | 700 W | 700 W | 1000 W | 1400 W | 450 W | 300 W | 575 W | 600 W |
*Sources: NVIDIA Datasheets (V100, A100, H100, H200, B200, RTX PRO 6000 Blackwell); NVIDIA Technical Blog “Inside NVIDIA Blackwell Ultra” (B300, January 2026); Exxact Corporation (A100–B200); TechPowerUp GPU Database (RTX 4090, A6000 Ada); Notebookcheck, Spheron, GPUPoet (RTX 5090: 3352 AI TOPS FP4 sparse, converted to dense: FP16 = 419, FP8 = 838, FP4 = 1676 TFLOPS); WareDB (RTX PRO 6000 Blackwell: FP16 dense = 500, converted: FP8 = 1000, FP4 = 2000 TFLOPS); Leadtek (RTX PRO 6000 Blackwell: 4000 AI TOPS FP4 sparse).*
When comparing floating-point performance at different precision levels, it becomes clear that the Blackwell generation sacrifices FP64 Tensor Core performance in favor of dramatically increasing performance at FP32 and lower precisions.
For example, the B300 delivers just 1.2 TFLOPS in FP64, but up to 15 PFLOPS in FP4.
Neural network training does not require 64-bit precision for weight and parameter calculations. By reducing FP64 Tensor Core resources, NVIDIA reallocates transistor budget toward FP32, FP16, FP8/FP6, and FP4 — the formats actually used in real-world AI workloads.
The performance of B300 and B200 in TF32, FP16, and FP8 is more than double that of the previous-generation H200. In addition, Blackwell introduces a new Transformer Engine with FP4 support. These reduced-precision formats are not used across entire computations but as part of mixed precision workflows, delivering significant performance gains.
The V100 and RTX series (4090, A6000 Ada, 5090) are not included in the original Exxact comparison; we added them because they are available in the Cloud4Y fleet. The V100 remains a reasonable option for workloads that fit within 125 TFLOPS FP16 and 32 GB of memory.
RTX cards do not support NVLink and use GDDR memory instead of HBM, but they offer an attractive price-to-FP32 ratio and are well suited for rendering, Stable Diffusion, and inference workloads.
The RTX 6000 Blackwell with 96 GB ECC memory occupies a unique niche between workstation and server GPUs. It is currently the only non-server card capable of running a 70B model in FP8 on a single accelerator.
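The "70B in FP8 on one card" claim follows from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below checks this against the workstation memory sizes from the table. The 20% headroom for KV cache and activations is an assumption chosen for illustration, not a measured figure, and the function names are hypothetical.

```python
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate weight size in GB (1e9 parameters x bytes per parameter)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, memory_gb: float,
         overhead_fraction: float = 0.2) -> bool:
    """True if weights plus an assumed overhead fit in GPU memory.

    overhead_fraction is an illustrative allowance for KV cache and
    activations; real requirements depend on context length and batch size.
    """
    needed = weights_gb(params_billions, precision) * (1 + overhead_fraction)
    return needed <= memory_gb


cards = {"RTX 4090": 24, "RTX 5090": 32, "A6000 Ada": 48, "RTX 6000 Blackwell": 96}
for name, memory in cards.items():
    print(f"{name} ({memory} GB): 70B FP8 fits = {fits(70, 'FP8', memory)}")
```

With these assumptions, 70 GB of FP8 weights fit only on the 96 GB card, while the same model in FP16 (about 140 GB of weights) would not fit on any of the four.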
Should You Upgrade?
“Newer means better” often applies to hardware. However, migrating to the latest Tensor Core GPU platform is a strategic decision that depends on your organization’s compute requirements, workload type, and scaling plans.
New architectures deliver clear performance gains, but real ROI appears only when the hardware aligns with workload priorities.
Deploying New AI Infrastructure → Blackwell
The B300 and B200 platforms provide significant gains in both training and inference compared to Hopper.
The B300 offers more than three times the memory capacity of the H100 (288 GB vs 80 GB).
Verified performance data for B300 and B200 shows up to 11–15× higher LLM throughput per GPU compared to Hopper. In multi-GPU configurations, this multiplier scales further.
The Blackwell architecture supports reduced-precision modes (FP8, FP4), which substantially improve efficiency in large-scale training and inference.
Upgrading an Existing H100 or H200 Fleet → Hybrid Strategy
Consider a hybrid workload distribution:
- B300 or B200 — for latency-critical inference tasks
- H200 — for background, resource-intensive workloads
Continue training large models on H100 or H200 — they remain strong in FP64 and FP8 for HPC and training workloads.
Use B200 or B300 for inference and production deployment — this is where Blackwell delivers the greatest gains in throughput and latency.
NVIDIA continues to evolve its lineup, and migration to new hardware can be performed gradually. Large-scale infrastructures require time for deployment and return on investment. Even after a new generation is released, the previous generation continues to deliver high performance.
Pricing
Final cost may vary depending on CPU, RAM, NVMe storage, network bandwidth, and certification requirements.
| GPU | ₽/hour | ₽/month | Typical Use Case |
|---|---|---|---|
| Tesla V100 32 GB | 147 | 68 814* | Computer vision, OCR, classical ML, rendering |
| A100 40 GB | 155 | 72 410* | Fine-tuning and inference up to 7B models, MIG, classical ML |
| H100 80 GB | 686 | 321 157* | Transformer training, 13–70B inference |
| H200 141 GB | 686 | 321 157* | 70B+ LLM inference, long-context workloads |
| B200 180 GB | 1 123 | 525 559* | Flagship models, HPC + AI |
| B300 288 GB | 1 116 | 803 306 | 100B+ inference with FP4, large KV cache |
| RTX 4090 24 GB | 100 | 72 061* | Stable Diffusion, inference up to 13B |
| RTX 5090 32 GB | 83 | 75 667* | FP4 inference up to 24B, rendering, Stable Diffusion |
| RTX A6000 Ada 48 GB | 105 | 81 967* | Production inference 13–30B, ECC |
| RTX 6000 Blackwell 96 GB | 137 | 98 364* | 70B FP8 inference on a single GPU, 96 GB ECC |
* Discounted price. Please refer to the current pricing and terms for details.
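One practical way to read the two price columns is to compute the break-even point: the number of hours per month at which the monthly price costs the same as paying by the hour. The sketch below does this for three rows of the table. It assumes the monthly figure is a flat fee and the hourly figure is pure pay-as-you-go; the actual billing terms may differ, so check the current pricing before relying on it.

```python
def break_even_hours(rub_per_hour: float, rub_per_month: float) -> float:
    """Hours of use per month at which monthly and hourly billing cost the same."""
    return rub_per_month / rub_per_hour


# (hourly price, monthly price) in rubles, taken from the pricing table.
prices = {
    "V100 32 GB": (147, 68_814),
    "H100 80 GB": (686, 321_157),
    "B300 288 GB": (1_116, 803_306),
}

for gpu, (hourly, monthly) in prices.items():
    hours = break_even_hours(hourly, monthly)
    print(f"{gpu}: monthly price equals about {hours:.0f} hours of hourly billing")
```

For comparison, a 30-day month has 720 hours. Under these assumptions, the discounted V100 and H100 monthly prices pay off above roughly 468 hours of use, while the undiscounted B300 monthly price corresponds to almost continuous use.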
Efficiency Metrics
Comparing GPUs by hourly price alone is misleading — a GPU that costs twice as much may complete a task three times faster. The correct approach is to calculate the cost of the result.
Method 1 — Cost per TFLOPS
Divide the hourly rate by FP16 performance. The lower the cost per TFLOPS, the better the value.
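As a rough illustration of Method 1, the sketch below uses only the FP16 figures this article states explicitly (125 TFLOPS for the V100, and the dense values of 419 and 500 TFLOPS quoted in the table sources for the RTX 5090 and RTX 6000 Blackwell) together with the hourly prices from the pricing table. The dictionary layout and function name are illustrative.

```python
# Hourly price in rubles and dense FP16 TFLOPS, both taken from this article.
GPUS = {
    "V100": {"rub_per_hour": 147, "fp16_tflops": 125},
    "RTX 5090": {"rub_per_hour": 83, "fp16_tflops": 419},
    "RTX 6000 Blackwell": {"rub_per_hour": 137, "fp16_tflops": 500},
}


def cost_per_tflops(rub_per_hour: float, fp16_tflops: float) -> float:
    """Hourly price divided by FP16 throughput; lower is better."""
    return rub_per_hour / fp16_tflops


# Sort GPU names from cheapest to most expensive per TFLOPS.
ranking = sorted(GPUS, key=lambda name: cost_per_tflops(**GPUS[name]))
for name in ranking:
    print(f"{name}: {cost_per_tflops(**GPUS[name]):.3f} RUB per hour per TFLOPS")
```

Peak TFLOPS is a theoretical ceiling, so this ranking says nothing about memory capacity or sustained throughput on a specific model.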
Method 2 — Cost per Million Tokens
Using TensorRT-LLM benchmarks on Llama-3 70B FP8, calculate tokens per hour, then divide the hourly rate by that figure expressed in millions of tokens.
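By this metric, the H200 beats the H100 even at a 25% higher price: a 1.9× throughput increase cuts the cost per token by about 34% (1.25 / 1.9 ≈ 0.66), within the 30–40% range. At the identical hourly rates shown in the pricing table above, the saving is closer to 47%. The B200 and B300 outperform the H200 by another 2–3×.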
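The H100 versus H200 comparison can be reproduced with a few lines of arithmetic. In the sketch below, the baseline price of 100 and throughput of 1000 tokens per second are placeholders, not benchmark results; only the ratios from the text (a 25% higher price and 1.9× throughput) affect the outcome.

```python
def cost_per_million_tokens(hourly_rate: float, tokens_per_second: float) -> float:
    """Hourly price divided by millions of tokens generated per hour."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_rate / (tokens_per_hour / 1e6)


# Placeholder baseline values; only the ratios below come from the article.
base_rate, base_tps = 100.0, 1000.0

h100_cost = cost_per_million_tokens(base_rate, base_tps)
h200_cost = cost_per_million_tokens(base_rate * 1.25, base_tps * 1.9)

saving = 1 - h200_cost / h100_cost
print(f"Token cost reduction: {saving:.0%}")
```

Because the placeholders cancel out, the reduction depends only on the price ratio divided by the throughput ratio.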
Key takeaway: evaluate GPUs by cost per completed workload, not cost per hour.
Why Renting GPUs from Cloud4Y Is More Cost-Effective Than Purchasing
For Russian businesses in 2026, purchasing GPU infrastructure is not just a major capital investment. It also involves parallel import logistics, delivery delays of several months, and warranty challenges.
- CapEx → OpEx. Pay only for the hours you actually use.
- Data centers in Russia and abroad. Moscow, Novosibirsk, Turkey, Germany, the Netherlands.
- Compliance: Federal Law 152-FZ, Federal Law 187-FZ, PCI DSS, CSA STAR — certifications that foreign cloud providers operating outside Russian regulation may not offer.
- Hourly billing. Pay for GPU usage, not idle hardware.
- Fast generation upgrades. Switch to new hardware without procurement, installation, or asset write-offs.
Conclusion
GPU selection should be based not on release date, but on cost per result.
The correct formula: choose the GPU with the lowest cost per unit of completed work.
- 70B+ models → H200 or Blackwell
- 70B inference on a single non-server GPU → RTX 6000 Blackwell
- 13–30B models → H100 or A6000 Ada
- Classical ML → V100 or A100
- Development and rendering → RTX 4090, 5090, A6000 Ada
Cloud4Y provides access to the full GPU range — from V100 to B300 and RTX 6000 Blackwell — with hourly billing and operation within the Russian legal framework.
To select a GPU server for rent, please follow the link.
This material is based on Exxact Corporation analytics (November 2025), expanded to cover the full Cloud4Y GPU fleet. B300 data is clarified according to the official NVIDIA technical blog (January 2026).