[Guide] Train Custom Lora for Fashion Clothes in Stable Diffusion

(Nov 2, 2024) Update: ComfyUI is a complete solution for image generation. Want to use it quickly and easily? Visit https://comfyai.run to run ComfyUI online without any GPU setup. Start generating images with just one click.

(April 27, 2024) Updates: We've launched a free-to-use tool at Jinta.AI, a platform for image generation for custom products. This allows you to quickly achieve results without having to train Stable Diffusion models yourself. Sign up here.

Introduction:

Training Lora for custom objects can be challenging, as Stable Diffusion may not accurately capture product details, including unique shapes, cuts, and symbols. This is due to the base model's lack of sufficient data for these specifics, leading to mistakes and odd details when working with product images.

This guide is designed to assist you in training Lora specifically for fashion clothing, which tends to be simpler than other categories. The base model's proficiency with images of people and its ability to appropriately fit clothes on various body types provides a solid foundation for custom model training in this area. Although this tutorial focuses on fashion, the principles may be applicable to other commercial products as well. If you have success in these areas, I would appreciate your insights.

I am optimistic that as the base model evolves, these techniques will become applicable to a broader range of commercial products. I am excited about the possibilities this technology offers, and I encourage you to embark on this journey.

Note: If this is your first time training a custom Lora model, I suggest starting with characters and styles, as they are generally more straightforward than product training.

Alright, let’s get started!

What AI-generated fashion clothes look like:

We'll use Armani jackets as an example because of their simple designs. For this example, I used photos of Armani jackets found on the internet.

Original products from Armani's website:

https://www.armani.com/en-us/upton-line-single-breasted-velvet-jacket_cod1647597313155678.html

Once you've successfully trained it, you can use Stable Diffusion to create AI-generated images that showcase the jacket in a way similar to the original photos on Armani’s website. Since we're training this as a Lora, we can apply it on top of various base SD models. Below is an output from the Juggernaut XL model before and after applying the custom Lora.

Note that the faces and overall photo composition may differ from the original photo, which is normal when using a Lora. The Lora changes how the base model responds to the prompt, so even with the same seed the generation produces a different image.

If you wish to maintain the consistency of the face from the original photo, it will necessitate a different approach. Unfortunately, this falls outside the scope of our current tutorial. Perhaps we can explore this in a future session!

Currently, the Lora training process doesn't incorporate faces as input, allowing the use of custom Lora for various people, compositions, and poses. See more examples below.

High-level Steps for Developing Custom Lora:

Prepare the training dataset.
Train the Lora model and monitor its behavior.
Use the model to generate photos and refine through iteration.

First Step: Preparing the Training Dataset, which consists of photos of the fashion clothes we plan to generate later in the Generative AI process.

The dataset should concentrate on a single product and concept. For our purposes, we will primarily focus on the front view of the product worn by a male model. Minor variations, such as an open or buttoned jacket, are acceptable, but we aim to avoid different pose variations. For instance, we will not include variations like jackets hung over arms, folded, thrown, or torn jackets, to simplify our training.

For this concept, I found that 30 to 50 images are typically sufficient. In this example, we use 37 images. Here's what they look like:

To build the training images, follow the standard Lora/Dreambooth training procedures. I resized the training images to 712x712 pixels for the training size, though 512x512 pixels should also work just fine.
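As a rough sketch of the resizing step, the helper below computes a centered square crop box so a photo can be scaled to 712x712 without distortion. The function name and the center-crop choice are my assumptions, not part of the original workflow; for detail shots you would pick the crop region by hand instead.

```python
def center_square_box(width: int, height: int) -> tuple[int, int, int, int]:
    """Return (left, upper, right, lower) for the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    upper = (height - side) // 2
    return (left, upper, left + side, upper + side)


TRAIN_SIZE = 712  # training resolution used in this guide; 512 also works

# Example: a 1600x2400 portrait product photo
box = center_square_box(1600, 2400)
print(box)  # (0, 400, 1600, 2000)
```

With Pillow, for instance, the box can be passed to `Image.crop(box)` and the result to `Image.resize((TRAIN_SIZE, TRAIN_SIZE))`.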

Training images technique:

Quantity: Aim to quickly gather 30 to 50 high-quality photos. Ensure the photos are clear and not blurry before resizing.
Zoom and Rotation: Use the high-resolution photos and 360-degree product views available on Armani's website to capture screenshots from various angles and of different parts of the product.
Product Details: The training images will be resized to 712x712 pixels. At this size, it's easy to lose intricate details if we try to fit the entire jacket into the frame. To prevent this, we also capture screenshots focusing on specific parts of the jacket and close-up shots of its details. This approach helps ensure that the important features of the product are not lost during resizing and are effectively incorporated into the training process.
Exclude Faces: To focus the model on clothing, exclude faces from most photos. This prevents the custom Lora from affecting facial features in the final output. A few photos can include faces to help the model understand jacket positioning and size, but not enough to influence facial alterations.

Labels (aka Tagging):

To keep it simple, I use the same phrase for all photos:
- arman1jacket, jacket, Giorgio Armani
The term "arman1jacket" is a unique identifier that we made up so that SD learns it as a new product concept instead of mixing it with words it already knows. We will use this term as the trigger word when enabling the custom model at generation time.
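If your trainer reads captions from a text file next to each image (kohya_ss can be set up this way, though the folder layout and the `.txt` extension here are assumptions on my side), a small script can write the same phrase for every photo. `write_captions` is a hypothetical helper name.

```python
from pathlib import Path

CAPTION = "arman1jacket, jacket, Giorgio Armani"
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".webp"}


def write_captions(image_dir: str, caption: str = CAPTION) -> int:
    """Write `caption` to a .txt file beside every image; return the count."""
    count = 0
    # sorted() snapshots the listing first, so new .txt files are not revisited
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() in IMAGE_SUFFIXES:
            image_path.with_suffix(".txt").write_text(caption, encoding="utf-8")
            count += 1
    return count
```

Calling `write_captions("path/to/your/images")` on a folder with 37 photos should return 37 and leave one caption file per photo.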

Training Repetitions and Steps:

Best practices I found online for Lora training suggest using 3,500 to 5,000 steps. I found this varies with the complexity of the custom object. Initially, we can aim for 3,500 steps. If needed, we can increase the steps later by adjusting the number of epochs, or stop training early if the model reaches good results before then. I'll elaborate on this in the following section.
To calculate the number of times each image is used, divide the target steps by the number of training images. In our case, with 3,500 targeted steps and 37 training images, the repetition rate is 3,500 divided by 37, which equals approximately 94.59. We round this down to 94 repetitions per image.
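The arithmetic above can be written as a small helper. The function name is hypothetical; the numbers are the ones from this guide, and the step count assumes a training batch size of 1.

```python
def repeats_per_image(target_steps: int, num_images: int) -> int:
    """Round down so the total never exceeds the target step count."""
    return target_steps // num_images


repeats = repeats_per_image(3500, 37)
total_steps = repeats * 37  # steps in one pass at batch size 1
print(repeats, total_steps)  # 94 3478
```

Rounding down leaves the run slightly under the target (3,478 instead of 3,500 steps), which is close enough for this purpose.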

Therefore, we have our final dataset: the training set.

Next, we will utilize these training images and labels for the training process.

Second Step: Lora Model Training

For the Lora training, I primarily used kohya_ss. This follows the standard training guidelines, with additional insights tailored for fashion clothing:

GPU Usage

I used an RTX 4070 Ti SUPER with 16GB VRAM. Training 3,500 steps takes about 15 minutes. Training time can decrease if we opt for a lower resolution of 512x512 pixels.

Setting Up Lora Training Configuration

To help you set up faster, I've saved a configuration for Lora training here. It mostly uses default settings with minor adjustments.
Training Folders: In the Lora tab, set the training images folder to our prepared dataset and point the output and logging folders to your chosen destination. We'll name the model output "arman1jacket-xl". Make sure to select the correct folders to prevent errors when starting the training.
Training Parameters: I kept all settings at their default values. Below are some key parameters, in case your defaults differ:
- Source model > Base model:
- For SD XL, I chose "stabilityai/stable-diffusion-xl-base-1.0" as my base model. This can vary if you're using SD 1.5 or another version.
- Parameters > Basic > Training batch: 1
- Parameters > Basic > Precision: fp16
- Parameters > Basic > Learning rate: 0.0001
- Parameters > Basic > Text Encoder learning rate: 0.00005
- Parameters > Basic > Max resolution: 712, 712

Advanced parameters

Parameters > Samples >
- Sample prompts: Choose a prompt for the sample images that are generated periodically during training:
  - (GIORGIO ARMANI, full body photo of 30s years old male model wearing upton line single-breasted velvet jacket:1.2),
- Sample every N steps: Set to 250 by default. Adjust this to control how frequently you want to check the model's progress through sample outputs.
Parameters > Advanced > Save every N steps: Also set to 250, aligning with the sampling frequency. This ensures that whenever the model produces an optimal output, it can be saved for us to use later.
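To see how many checkpoints and sample images to expect, the interval can be expanded into a list of steps. This is only a sketch; `snapshot_steps` is a hypothetical helper, and it assumes the trainer saves exactly on multiples of N.

```python
def snapshot_steps(total_steps: int, every_n: int = 250) -> list[int]:
    """Steps at which a sample image and a model snapshot are produced."""
    return list(range(every_n, total_steps + 1, every_n))


steps = snapshot_steps(3500)
print(len(steps), steps[:3], steps[-1])  # 14 [250, 500, 750] 3500
```

A smaller interval gives finer control over which snapshot to keep, at the cost of more disk space and more time spent generating samples.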

Observing Training Samples:

In your specified output folder, you'll find snapshots of the models and a sample folder containing generated images at every specified interval (in this case, every 250 steps).
Training typically progresses through three phases:
- Early phase: Expect a consistent decrease in the average loss (avg_loss) shown in the kohya_ss output and continuous improvement in the quality of sample outputs as training progresses.
- Optimal phase: The avg_loss value becomes stable. Different custom objects have different optimal loss values. In this stage, deciding which snapshot fits best is subjective, so a person typically judges the sample outputs by eye. In our case, the choice was between snapshots 1250 and 1500; I went with 1500.
- Overfit phase: In this phase, the output begins to show unexpected issues, such as distortions and noise. At times, the model might even generate corrupted results, indicated by 'Not-a-Number' (NaN) average loss values. It's crucial to halt training at this point and discard these model snapshots, as they will not be suitable for use.
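If you copy the avg_loss values into a small table by hand, the NaN rule above can be applied mechanically. The sketch below is an illustration under that assumption; `usable_snapshots` is a hypothetical helper and the loss values are invented, not taken from a real run.

```python
import math


def usable_snapshots(loss_by_step: dict[int, float]) -> list[int]:
    """Return snapshot steps recorded before the first NaN loss."""
    usable = []
    for step in sorted(loss_by_step):
        if math.isnan(loss_by_step[step]):
            break  # this snapshot and all later ones are treated as corrupted
        usable.append(step)
    return usable


# Invented avg_loss values, for illustration only (not from a real run).
example_log = {
    1000: 0.12,
    1250: 0.10,
    1500: 0.10,
    2750: 0.11,
    3000: float("nan"),
    3250: 0.30,
}
print(usable_snapshots(example_log))  # [1000, 1250, 1500, 2750]
```

This only filters out the clearly broken snapshots. Picking the best one among the remaining candidates still means comparing the sample images by eye.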

Now we have the final model output. If you're curious about what they look like, you can download them here.

Last Step: Time to Test the Trained Model

You can use the optimal model found in step 2 or try our pre-trained model available here.

Recommendation

- Base model: Juggernaut XL, but any XL model should work

- Trigger phrase: (GIORGIO ARMANI, dark gray jacket, arman1jacket:1.2)

- Lora weight: <lora:arman1jacket-xl-step00003500:0.6>

Example prompt:

(GIORGIO ARMANI, full body photo of blonde male model wearing dark gray jacket, arman1jacket:1.2), (front view wide angle full body:1.1), realistic face, by Helmut Newton, movie still, Hasselblad x2D 100c, ISO 300, 1/250s, F/2,8, 38 mm, extremely high quality RAW photograph, detailed background, intricate, Exquisite details and textures, highly detailed, ultra detailed photograph, warm lighting, 4k, sharp focus, high resolution, detailed skin, detailed eyes, 8k UHD, DSLR, high quality, film grain, Fujifilm XT3, (full body shot:1.3), (wide angle shot:1.3), (side view:1.2), luxury walk in a street of paris, <lora:epi_noiseoffset2:0.5> <lora:add-detail-xl:1> <lora:xl_more_art-full_v1:0.5> <lora:arman1jacket-xl-step00003500:0.6>
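To try different snapshots or weights without retyping the prompt, the trigger phrase and the Lora tag can be assembled with a few helpers. This is a sketch: the helper names are hypothetical, and the `-step` plus 8-digit naming pattern is inferred from the single filename shown in this guide.

```python
def snapshot_name(output_name: str, step: int) -> str:
    """Zero-pad the step to 8 digits, e.g. arman1jacket-xl-step00003500."""
    return f"{output_name}-step{step:08d}"


def lora_tag(name: str, weight: float) -> str:
    """Format a Lora reference in the <lora:name:weight> prompt syntax."""
    return f"<lora:{name}:{weight}>"


def emphasize(text: str, weight: float) -> str:
    """Wrap text in the (text:weight) emphasis syntax."""
    return f"({text}:{weight})"


trigger = emphasize("GIORGIO ARMANI, dark gray jacket, arman1jacket", 1.2)
tag = lora_tag(snapshot_name("arman1jacket-xl", 3500), 0.6)
print(f"{trigger}, {tag}")
```

Swapping the step for another saved snapshot, for example `snapshot_name("arman1jacket-xl", 1500)`, lets you compare snapshots with an otherwise identical prompt.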

Negative prompt:

(worst quality, greyscale), ac_neg2, zip2d_neg, ziprealism_neg, watermark, username, signature, text, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, jpeg artifacts, bad feet, extra fingers, mutated hands, poorly drawn hands, bad proportions, extra limbs, disfigured, bad anatomy, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, mutated hands, fused fingers, too many fingers, long neck

Controlling pose:

To control the pose, you can use the ControlNet extension as usual. In this example, I use Depth ControlNet with the extension’s standard model. If you’re not familiar with ControlNet, it needs a separate tutorial to explain fully.

Ending notes

Thank you for following this detailed guide. Please keep me updated on your progress with model training. If you're interested in using Stable Diffusion for commercial content, join r/GenAiCommerce and stay in touch.

We aim to establish a community to support AI artists pursuing AI art for commercial purposes. Discussions related to the commercial usage of AI are also welcome!
We will regularly post tutorials and share useful resources to assist our members in producing successful GenAI commercial content.
Our community is new, so feel free to post suggestions about what you would like to see from us. Cheers!