
FASHN Human Parser


A SegFormer-B4 model fine-tuned for human parsing with 18 semantic classes, optimized for fashion and virtual try-on applications.

Human Parsing Example

Model Description

This model segments human images into 18 semantic categories including body parts (face, hair, arms, hands, legs, feet, torso), clothing items (top, dress, skirt, pants, belt, scarf), and accessories (bag, hat, glasses, jewelry).

  • Architecture: SegFormer-B4 (MIT-B4 encoder + MLP decoder)
  • Input Size: 384 x 576 (width x height)
  • Output: 18-class semantic segmentation mask
  • Base Model: nvidia/mit-b4

Usage

Quick Start with Pipeline

from transformers import pipeline

pipe = pipeline("image-segmentation", model="fashn-ai/fashn-human-parser")
result = pipe("image.jpg")
# result is a list of dicts with 'label', 'score', 'mask' for each detected class

The pipeline automatically manages GPU/CPU and returns per-class masks at the original image resolution.
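If you need a single class-ID map instead of per-class masks, the pipeline's output can be collapsed into one array. The sketch below assumes each result entry holds a binary PIL mask (255 where the pixel belongs to that class), the standard `image-segmentation` pipeline format; the demo uses synthetic masks and a subset of labels for illustration.

```python
import numpy as np
from PIL import Image

LABELS = ["background", "face", "hair", "top"]  # subset for illustration

def merge_masks(results, labels):
    """Merge per-class binary masks into one (H, W) label-ID map.

    Later entries win ties; real pipeline outputs are non-overlapping.
    """
    first = np.array(results[0]["mask"])
    label_map = np.zeros(first.shape, dtype=np.uint8)
    for entry in results:
        mask = np.array(entry["mask"]) > 0
        label_map[mask] = labels.index(entry["label"])
    return label_map

# Synthetic demo: a 4x4 image, top half "hair", bottom half "face"
hair = Image.fromarray(np.uint8([[255] * 4] * 2 + [[0] * 4] * 2))
face = Image.fromarray(np.uint8([[0] * 4] * 2 + [[255] * 4] * 2))
results = [{"label": "hair", "score": None, "mask": hair},
           {"label": "face", "score": None, "mask": face}]
label_map = merge_masks(results, LABELS)
```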

Explicit Usage

from transformers import SegformerForSemanticSegmentation, SegformerImageProcessor
from PIL import Image
import torch

# Load model and processor
processor = SegformerImageProcessor.from_pretrained("fashn-ai/fashn-human-parser")
model = SegformerForSemanticSegmentation.from_pretrained("fashn-ai/fashn-human-parser")

# Load and preprocess image
image = Image.open("path/to/image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits  # (1, 18, H/4, W/4)

# Upsample to original size and get predictions
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
)
predictions = upsampled.argmax(dim=1).squeeze().numpy()
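To inspect the result visually, the class-ID map can be indexed into an RGB palette. This is a minimal sketch; the 18 palette colors here are arbitrary placeholders, not the model's official colors.

```python
import numpy as np

# Fixed 18-entry palette (placeholder colors, seeded for reproducibility)
rng = np.random.default_rng(0)
PALETTE = rng.integers(0, 256, size=(18, 3), dtype=np.uint8)
PALETTE[0] = (0, 0, 0)  # keep background black

def colorize(predictions):
    """Index the palette with an (H, W) class-ID map to get (H, W, 3) RGB."""
    return PALETTE[predictions]

# Demo on a tiny dummy prediction map
preds = np.array([[0, 3], [16, 17]])
rgb = colorize(preds)
```

The resulting array can be saved or overlaid directly, e.g. `Image.fromarray(rgb)`.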

Production Usage (Recommended)

For maximum accuracy, use our Python package, which implements the exact preprocessing used during training:

pip install fashn-human-parser

from fashn_human_parser import FashnHumanParser

parser = FashnHumanParser()  # auto-detects GPU
segmentation = parser.predict("image.jpg")
# segmentation is a numpy array of shape (H, W) with class IDs 0-17

The package resizes with cv2.INTER_AREA (matching the training preprocessing), while the Hugging Face pipeline resizes with PIL's LANCZOS filter; the mismatch can produce slightly different masks, especially near class boundaries.

Label Definitions

ID  Label
0   background
1   face
2   hair
3   top
4   dress
5   skirt
6   pants
7   belt
8   bag
9   hat
10  scarf
11  glasses
12  arms
13  hands
14  legs
15  feet
16  torso
17  jewelry
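For programmatic use, the label table can be expressed as a Python mapping (the model's config should also expose this via `model.config.id2label`, the standard transformers convention):

```python
# The label table as a mapping, handy for building class-specific masks
ID2LABEL = {
    0: "background", 1: "face", 2: "hair", 3: "top", 4: "dress", 5: "skirt",
    6: "pants", 7: "belt", 8: "bag", 9: "hat", 10: "scarf", 11: "glasses",
    12: "arms", 13: "hands", 14: "legs", 15: "feet", 16: "torso", 17: "jewelry",
}
LABEL2ID = {name: i for i, name in ID2LABEL.items()}
```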

Category Mappings

For virtual try-on applications:

Category    Body Coverage  Relevant Labels
Tops        Upper body     top, dress, scarf
Bottoms     Lower body     skirt, pants, belt
One-pieces  Full body      top, dress, scarf, skirt, pants, belt
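A category's garment region can be extracted by pooling its label IDs from the segmentation map. This is a sketch using `np.isin` with the IDs from the Label Definitions table; the `CATEGORY_IDS` dict is our own encoding of the mapping above, not part of the package API.

```python
import numpy as np

CATEGORY_IDS = {
    "tops": [3, 4, 10],                 # top, dress, scarf
    "bottoms": [5, 6, 7],               # skirt, pants, belt
    "one-pieces": [3, 4, 10, 5, 6, 7],  # union of tops and bottoms
}

def category_mask(segmentation, category):
    """Boolean mask of pixels belonging to any of the category's labels."""
    return np.isin(segmentation, CATEGORY_IDS[category])

# Demo: dummy segmentation with a "top" (3) and "pants" (6) region
seg = np.array([[3, 3, 0], [6, 6, 0]])
tops = category_mask(seg, "tops")
```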

Identity Labels

Labels typically preserved during virtual try-on: face, hair, jewelry, bag, glasses, hat
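The complement operation, selecting pixels to leave untouched during try-on, can be sketched the same way. IDs follow the Label Definitions table; treating background as preserved is a common convention, made optional here as an assumption rather than a stated rule.

```python
import numpy as np

IDENTITY_IDS = [1, 2, 17, 8, 11, 9]  # face, hair, jewelry, bag, glasses, hat

def preserve_mask(segmentation, keep_background=True):
    """Boolean mask of pixels that should survive clothing transfer."""
    ids = IDENTITY_IDS + ([0] if keep_background else [])
    return np.isin(segmentation, ids)

# Demo: face and background are preserved; top and pants are replaceable
seg = np.array([[1, 3], [0, 6]])
m = preserve_mask(seg)
```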

Training

This model was fine-tuned on a proprietary dataset curated and annotated by FASHN AI, specifically designed for virtual try-on applications. The 18-class label schema was developed to capture the semantic regions most relevant for clothing transfer and human body understanding in fashion contexts.

Limitations

  • Optimized for single-person images with clear visibility
  • Best results on fashion/e-commerce style photography
  • Input images are resized to 384x576; very small subjects may lose detail

Citation

@misc{fashn-human-parser,
  author = {FASHN AI},
  title = {FASHN Human Parser: SegFormer for Fashion Human Parsing},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/fashn-ai/fashn-human-parser}
}

License

This model inherits the NVIDIA Source Code License for SegFormer. Please review the license terms before use.
