Blog

Blog

Fast 2-Simplicial Attention: Hardware-Efficient Kernels in TLX

In this blog post, we explore the kernel design details presented in the paper Fast…

Sijia Chen, Timothy Chou, Aurko Roy†, Hongtao Yu, Yuanwei (Kevin) Fang, Xiaodong Wang, Jiecao Yu, Tony CW Liu†, Chuanhao Zhuge, Josh Fromm, Ying Zhang†, Rohan Anil†, Ajit Mathews | September 5, 2025

Blog

PyTorch 2.8+TorchAO: Unlock Efficient LLM Inference on Intel® AI PCs

Large Language Models (LLMs) have transformed tasks across numerous industries, including drafting emails, generating code,…

Intel PyTorch Team | September 3, 2025

Blog

Accelerating 2K scale pre-training up to 1.28x with TorchAO, MXFP8 and TorchTitan on Crusoe B200 Cluster

TL;DR: 1.22x-1.28x training acceleration with MXFP8, equivalent convergence compared to BF16. We recently…

Less Wright, Vasiliy Kuznetsov, Daniel Vega-Myhre, Driss Guessous, Hamid Shojanazeri, Elias Ellison, Martin Cala, Ethan Petersen | September 3, 2025

Blog

A Primer on LLM Post-Training

Large Language Models (LLMs) have revolutionized how we write and consume documents. In the past…

Davide Testuggine | August 26, 2025

Blog

DRAMA Model Inference Efficiency Boosted by 1.7x-2.3x

TL;DR: NJTs (Nested Jagged Tensors) boost DRAMA model inference efficiency by 1.7x-2.3x, making it more…

Shreya Goyal | August 22, 2025

Blog

ZenFlow: Stall-Free Offloading Engine for LLM Training

Introduction: ZenFlow is a new extension to DeepSpeed introduced in summer 2025, designed as a…

Tingfeng Lan, Yusen Wu, Bin Ma, Zhaoyuan Su, Rui Yang, Tekin Bicer, Masahiro Tanaka, Olatunji Ruwase, Dong Li, Yue Cheng | August 20, 2025

Blog

Accelerating MoE’s with a Triton Persistent Cache-Aware Grouped GEMM Kernel

In this post, we present an optimized Triton BF16 Grouped GEMM kernel for running training…

Less Wright, Adnan Hoque, Garrett Goon | August 18, 2025

Blog

PyTorch Wheel Variants, the Frontier of Python Packaging

PyTorch is the leading machine learning framework for developing and…

Eli Uriegas | August 13, 2025

Blog Community

PyTorch Day China Recap

On June 7, 2025, PyTorch Day China was held in Beijing, co-hosted by PyTorch Foundation…

PyTorch Foundation | August 12, 2025

Blog

Introducing Mixed Precision Training in Opacus

Introduction: We integrate mixed and low-precision training with Opacus to unlock increased throughput and training…

Iden Kalemaj, Huanyu Zhang | August 12, 2025

Blog

Bringing Generative AI to the Masses with ExecuTorch and KleidiAI

Key Takeaways: ExecuTorch 0.7 now enables KleidiAI by default, delivering automatic acceleration on Arm CPUs…

Gian Marco Iodice, GenAI Engineering Lead, Arm; Mary Bennion, Director Ecosystem, Arm; Digant Desai, Software Engineer, Meta | August 11, 2025

Blog Community

vLLM Beijing Meetup: Advancing Large-scale LLM Deployment

On August 2, 2025, Tencent’s Beijing Headquarters hosted a major event in the field of…

vLLM Team | August 7, 2025

Blog

Advancing Low-Bit Operators in PyTorch and ExecuTorch: Dynamic Kernel Selection, KleidiAI, and Quantized Tied Embeddings

TorchAO brings high-performance low-bit linear and embedding operators to Arm CPUs. In this update, we’re…

Scott Roy, Digant Desai, Ed Miller, Gian Marco Iodice, Ronan Naughton | August 7, 2025

Blog

PyTorch 2.8 Release Blog

We are excited to announce the release of PyTorch® 2.8 (release notes)! This release features: …

PyTorch Foundation | August 6, 2025

Blog Ecosystem

PyTorch on Kubernetes: Kubeflow Trainer Joins the PyTorch Ecosystem

We’re thrilled to announce that the Kubeflow Trainer project has been integrated into the PyTorch…

Andrey Velichkevich, Apple; Yuki Iwai, CyberAgent, Inc.; Yuan Tang, Red Hat; Antonin Stefanutti, Red Hat; Johnu George, Nutanix | July 28, 2025

Blog

torch.compile and Diffusers: A Hands-On Guide to Peak Performance

Diffusers is the go-to library that provides a unified interface to cutting-edge and open diffusion…

Sayak Paul (Hugging Face), Animesh Jain (Meta), Benjamin Bossan (Hugging Face) | July 17, 2025

Blog

Enabling Fully Sharded Data Parallel (FSDP2) in Opacus

Introduction and Context: Opacus is making significant strides in supporting private training of large-scale models…

Sai Aparna Aketi, Huanyu Zhang | July 7, 2025

Blog

Reducing Storage Footprint and Bandwidth Usage for Distributed Checkpoints with PyTorch DCP

Summary: PyTorch Distributed Checkpointing (DCP) is a versatile and powerful tool for managing model checkpoints…

Meta: Sibasish Acharya, Marc Horowitz, Pradeep Fernando, Saurabh Mishra; IBM: Saransh Gupta, Swaminathan Sundararaman, Raghu Ganti | July 2, 2025

Blog

PyTorch + vLLM = ♥️

Key takeaways: PyTorch and vLLM are both critical to the AI ecosystem and are increasingly…

Simon Mo, Woosuk Kwon, Kaichao You, The PyTorch Team @Meta | June 25, 2025

Blog Ecosystem

FlagGems Joins the PyTorch Ecosystem: Triton-Powered Operator Library for Universal AI Acceleration

In the race to accelerate large language models across diverse AI hardware, FlagGems delivers a…

FlagGems Team | June 25, 2025
