<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>PyTorch | Home - Jyothi Swaroop</title><link>https://kjyothiswaroop.github.io/tag/pytorch/</link><atom:link href="https://kjyothiswaroop.github.io/tag/pytorch/index.xml" rel="self" type="application/rss+xml"/><description>PyTorch</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 01 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://kjyothiswaroop.github.io/media/icon_hue06d895dbab0d0c9b72b1a0534685c49_26160_512x512_fill_lanczos_center_3.png</url><title>PyTorch</title><link>https://kjyothiswaroop.github.io/tag/pytorch/</link></image><item><title>Deformable Object Manipulation</title><link>https://kjyothiswaroop.github.io/project/lehome-challenge/</link><pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate><guid>https://kjyothiswaroop.github.io/project/lehome-challenge/</guid><description>&lt;hr>
&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>The &lt;strong>leHome Challenge&lt;/strong> (ICRA 2026) is a &lt;strong>Deformable BiManipulation Challenge&lt;/strong> focused on folding laundry using a bimanual &lt;strong>SO-101&lt;/strong> robot. Teams must train policies capable of manipulating deformable objects — one of the hardest open problems in robot learning due to the near-infinite configuration space of cloth.&lt;/p>
&lt;p>We finished &lt;strong>54th out of 230+ teams&lt;/strong>.&lt;/p>
&lt;hr>
&lt;h2 id="data-collection">Data Collection&lt;/h2>
&lt;p>We first attempted automatic data collection by extracting &lt;strong>privileged information&lt;/strong> about the cloth mesh directly from the &lt;strong>Isaac Sim&lt;/strong> scene and passing goal poses to &lt;strong>cuRobo&lt;/strong> for bimanual IK solving and motion planning. In practice this pipeline was too brittle: cuRobo IK failures and sim-to-real cloth geometry mismatches meant the planned trajectories rarely yielded usable demonstrations.&lt;/p>
&lt;p>We fell back to &lt;strong>manual teleoperation&lt;/strong> on the physical SO-101 arms, then significantly expanded the dataset through &lt;strong>data augmentation&lt;/strong>: applying colour-jitter, brightness, contrast, and blur filters to camera observations to improve policy generalisation across lighting conditions.&lt;/p>
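&lt;p>A minimal sketch of that augmentation pipeline, assuming torchvision transforms applied to the camera frames (the parameter ranges here are illustrative, not the values used in the competition):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch
from torchvision import transforms

# Photometric augmentations applied to camera observations only;
# actions and proprioception are left untouched. The ranges below
# are illustrative assumptions.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.2, hue=0.05),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

frame = torch.rand(3, 240, 320)  # dummy RGB camera frame in [0, 1]
augmented = augment(frame)
&lt;/code>&lt;/pre>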
&lt;hr>
&lt;h2 id="policy">Policy&lt;/h2>
&lt;p>Four policy architectures were trained and evaluated:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Policy&lt;/th>
&lt;th>Notes&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>ACT&lt;/strong>&lt;/td>
&lt;td>Action Chunking with Transformers&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Diffusion Policy&lt;/strong>&lt;/td>
&lt;td>Denoising diffusion over action sequences&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>SmolVLA&lt;/strong>&lt;/td>
&lt;td>Vision-Language-Action model — &lt;strong>best performance&lt;/strong>&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>LingBotVLA&lt;/strong>&lt;/td>
&lt;td>Language-conditioned VLA&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>&lt;strong>SmolVLA&lt;/strong> achieved the highest task success rate across our evaluations. All policies were trained using the &lt;strong>LeRobot&lt;/strong> framework with &lt;strong>PyTorch&lt;/strong>.&lt;/p>
&lt;hr>
&lt;h2 id="results">Results&lt;/h2>
&lt;p>Ranked &lt;strong>54th / 230+ teams&lt;/strong> at the leHome Challenge (ICRA 2026).&lt;/p></description></item><item><title>Grayscale Image Colourization</title><link>https://kjyothiswaroop.github.io/project/conditional-diffusion/</link><pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate><guid>https://kjyothiswaroop.github.io/project/conditional-diffusion/</guid><description>&lt;hr>
&lt;h2 id="overview">Overview&lt;/h2>
&lt;p>A &lt;strong>conditional diffusion model&lt;/strong> for grayscale image colourization, built entirely from scratch in PyTorch: forward noising process, noise schedule, U-Net, EMA, and reverse diffusion loop. The grayscale image acts as the conditioning signal, concatenated as an additional channel to the noisy RGB image at each denoising step.&lt;/p>
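&lt;p>A minimal sketch of the conditioning step (&lt;code>unet&lt;/code> here stands in for the 4-input-channel network described below):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

noisy_rgb = torch.randn(8, 3, 128, 128)  # noised colour image at timestep t
gray = torch.rand(8, 1, 128, 128)        # grayscale conditioning image

# The grayscale image rides along as a 4th channel at every denoising
# step, so the network always sees the conditioning signal.
x = torch.cat([noisy_rgb, gray], dim=1)  # (B, 4, 128, 128)
# pred_noise = unet(x, noise_variance)   # hypothetical forward call
&lt;/code>&lt;/pre>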
&lt;hr>
&lt;h2 id="dataset">Dataset&lt;/h2>
&lt;p>Source images come from the &lt;strong>CelebA-HQ&lt;/strong> dataset (&lt;code>korexyz/celeba-hq-256x256&lt;/code>). Each image is resized to &lt;strong>128×128&lt;/strong> and paired with its grayscale version. The resulting dataset is pushed to HuggingFace (&lt;code>kjswaroopNU/celebahq-128-gray&lt;/code>) and loaded directly during training.&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Split&lt;/th>
&lt;th>Samples&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>Train&lt;/td>
&lt;td>28,000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Validation&lt;/td>
&lt;td>1,000&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>Test&lt;/td>
&lt;td>1,000&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;hr>
&lt;h2 id="forward-process">Forward Process&lt;/h2>
&lt;p>The forward noising process uses an &lt;strong>offset cosine schedule&lt;/strong> over $T = 1000$ timesteps. The signal rate $s(t)$ and noise rate $n(t)$ satisfy the identity $s(t)^2 + n(t)^2 = 1$:&lt;/p>
&lt;p>$$x_t = s(t) \cdot x_0 + n(t) \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$&lt;/p>
&lt;p>The schedule interpolates the diffusion angle between $\arccos(0.95)$ (&lt;code>max_signal_rate&lt;/code>) and $\arccos(0.02)$ (&lt;code>min_signal_rate&lt;/code>), keeping the signal-to-noise ratio well-behaved at both ends.&lt;/p>
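&lt;p>A sketch of the schedule in PyTorch, treating $t$ as normalised to $[0, 1]$ and interpolating linearly in angle (an assumption consistent with the description above):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

MAX_SIGNAL_RATE = 0.95
MIN_SIGNAL_RATE = 0.02

def diffusion_schedule(t):
    """t in [0, 1]; returns (signal_rate, noise_rate) with s^2 + n^2 = 1."""
    start_angle = torch.acos(torch.tensor(MAX_SIGNAL_RATE))
    end_angle = torch.acos(torch.tensor(MIN_SIGNAL_RATE))
    angle = start_angle + t * (end_angle - start_angle)
    return torch.cos(angle), torch.sin(angle)  # cos^2 + sin^2 = 1 by construction

# Forward process: x_t = s(t) * x_0 + n(t) * eps
t = torch.rand(8).view(-1, 1, 1, 1)  # one random timestep per batch element
x0 = torch.rand(8, 3, 128, 128)
eps = torch.randn_like(x0)
s, n = diffusion_schedule(t)
x_t = s * x0 + n * eps
&lt;/code>&lt;/pre>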
&lt;hr>
&lt;h2 id="u-net">U-Net&lt;/h2>
&lt;p>The U-Net is written from scratch with the following structure:&lt;/p>
&lt;table>
&lt;thead>
&lt;tr>
&lt;th>Block&lt;/th>
&lt;th>Details&lt;/th>
&lt;/tr>
&lt;/thead>
&lt;tbody>
&lt;tr>
&lt;td>&lt;strong>DownBlock&lt;/strong>&lt;/td>
&lt;td>2× ResidualBlock + AvgPool2d&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Bottleneck&lt;/strong>&lt;/td>
&lt;td>2× ResidualBlock&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>UpBlock&lt;/strong>&lt;/td>
&lt;td>Bilinear upsample + 2× ResidualBlock with skip connections&lt;/td>
&lt;/tr>
&lt;tr>
&lt;td>&lt;strong>Activation&lt;/strong>&lt;/td>
&lt;td>SiLU throughout&lt;/td>
&lt;/tr>
&lt;/tbody>
&lt;/table>
&lt;p>The noise variance is injected via a &lt;strong>sinusoidal embedding&lt;/strong> (log-spaced frequencies, sin + cos concatenated), upsampled to 128×128 and concatenated after the first convolution. The grayscale conditioning is concatenated to the noisy RGB input, giving the network 4 input channels.&lt;/p>
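&lt;p>A sketch of that embedding as described (the embedding width and the exact frequency range are assumptions):&lt;/p>
&lt;pre>&lt;code class="language-python">import math
import torch
import torch.nn.functional as F

EMBED_DIM = 32  # assumed width: 16 sin + 16 cos channels

def sinusoidal_embedding(noise_variance):
    """(B, 1, 1, 1) noise variance -> (B, EMBED_DIM, 1, 1) embedding."""
    # Log-spaced frequencies; the 1..1000 range is an assumption.
    freqs = torch.exp(torch.linspace(math.log(1.0), math.log(1000.0), EMBED_DIM // 2))
    angles = 2.0 * math.pi * noise_variance * freqs.view(1, -1, 1, 1)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=1)

emb = sinusoidal_embedding(torch.rand(8, 1, 1, 1))
# Broadcast to the spatial resolution, then concatenate after the first conv.
emb = F.interpolate(emb, size=(128, 128), mode="nearest")
&lt;/code>&lt;/pre>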
&lt;hr>
&lt;h2 id="training">Training&lt;/h2>
&lt;p>The network is trained to predict the noise $\epsilon$ added at each timestep (MSE loss). An &lt;strong>EMA copy&lt;/strong> of the U-Net (decay = 0.999) is maintained throughout and used exclusively at inference time for smoother outputs.&lt;/p>
&lt;p>A subtle bug was caught during development: BatchNorm &lt;strong>buffers&lt;/strong> (running mean/variance) are not touched by &lt;code>model.parameters()&lt;/code>, so the EMA network was computing fresh batch statistics at inference instead of using the accumulated training stats. The fix copies buffers explicitly alongside weight averaging each step.&lt;/p>
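&lt;p>A minimal sketch of the fixed EMA update (assuming the standard exponential-moving-average form):&lt;/p>
&lt;pre>&lt;code class="language-python">import torch

EMA_DECAY = 0.999

@torch.no_grad()
def update_ema(ema_model, model):
    # Exponential moving average of the learnable weights.
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(EMA_DECAY).add_(p, alpha=1.0 - EMA_DECAY)
    # The fix: BatchNorm running stats live in buffers, not parameters,
    # so they must be copied across explicitly as well.
    for ema_b, b in zip(ema_model.buffers(), model.buffers()):
        ema_b.copy_(b)

# ema_model = copy.deepcopy(model)  # initialise once, then call every step
&lt;/code>&lt;/pre>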
&lt;figure>
&lt;img src="https://kjyothiswaroop.github.io/project/conditional-diffusion/result_hu28a7587743642c4d8ca51a3af053489c_665629_347784f7e79da2651b2934a55cac1c3a.webp" alt="Colourization results" width="760" height="460" loading="lazy" />
&lt;/figure>
&lt;hr>
&lt;h2 id="gradio-ui">Gradio UI&lt;/h2>
&lt;p>Running &lt;code>python src/eval.py&lt;/code> launches a &lt;strong>Gradio web app&lt;/strong> for interactive inference. Upload any grayscale image, set the number of diffusion steps (10–100) and the number of colourized samples to generate (1–8); the EMA network runs the full reverse diffusion loop and returns the colourized outputs in a gallery, no code required.&lt;/p>
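&lt;p>A hedged sketch of how such an interface can be wired up with Gradio (the function body is a placeholder; the real &lt;code>src/eval.py&lt;/code> calls the EMA reverse-diffusion loop):&lt;/p>
&lt;pre>&lt;code class="language-python">import gradio as gr

def colourize(gray_image, steps, num_samples):
    # Placeholder: the real app runs the EMA U-Net's reverse diffusion
    # loop for `steps` iterations and returns `num_samples` RGB images.
    return [gray_image] * int(num_samples)

demo = gr.Interface(
    fn=colourize,
    inputs=[
        gr.Image(type="pil", image_mode="L", label="Grayscale input"),
        gr.Slider(10, 100, value=50, step=1, label="Diffusion steps"),
        gr.Slider(1, 8, value=4, step=1, label="Samples"),
    ],
    outputs=gr.Gallery(label="Colourized samples"),
)
demo.launch()
&lt;/code>&lt;/pre>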
&lt;hr>
&lt;h2 id="mlops">MLOps&lt;/h2>
&lt;ul>
&lt;li>&lt;strong>Hydra&lt;/strong> — all hyperparameters live in &lt;code>configs/config.yaml&lt;/code> and are overridable from the CLI (&lt;code>python src/train.py lr=0.0001 batch_size=64&lt;/code>), making every run reproducible without touching source code.&lt;/li>
&lt;li>&lt;strong>Weights &amp;amp; Biases&lt;/strong> — per-step loss, per-epoch loss, and sample grids (grayscale / generated / ground truth) logged every 10 epochs.&lt;/li>
&lt;li>Checkpoints saved every 50 epochs; training resumes from any checkpoint via &lt;code>resume=true&lt;/code>.&lt;/li>
&lt;/ul></description></item></channel></rss>