RapGPT: Rap Lyric Generator
Overview
RapGPT is a pre-trained and fine-tuned GPT-2 model designed to generate original rap lyrics in the style of Eminem from user prompts. It uses a decoder-only Transformer architecture built on a slightly modified GPT-2 configuration. The project was iteratively optimized to run in CPU-only environments due to resource constraints, making it accessible for lightweight deployment.
This project was started as a hands-on initiative to gain end-to-end experience in large language model (LLM) development, covering data collection, pre-training, parameter-efficient fine-tuning (PEFT), evaluation, and deployment. While model performance is constrained by limited compute resources, the focus was on building practical skills across the full LLM pipeline.
Data Preparation
- Source: Genius.com
- Cleaning: Removed metadata, repetition tags, and extra whitespace.
- Tokenization: GPT-2's `tiktoken`-based Byte Pair Encoding (BPE); see the sketch below.
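A minimal sketch of the tokenization step with `tiktoken`; the file path `lyrics_clean.txt` is a hypothetical placeholder for the cleaned, concatenated corpus:

```python
import tiktoken

# GPT-2's BPE tokenizer via tiktoken (the same vocabulary the model is pretrained with)
enc = tiktoken.get_encoding("gpt2")

# Hypothetical path to the cleaned lyrics corpus
with open("lyrics_clean.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Encode to token ids; allow the <|endoftext|> separator between songs
tokens = enc.encode(text, allowed_special={"<|endoftext|>"})
print(f"Total tokens: {len(tokens):,}")
```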
Dataset Summary
- Total Words: 23,009,521
- Total Songs: 17,273
- Total Artists: 226
- Total Tokens: 25,367,203
Pretraining
The model was pretrained using a decoder-only Transformer architecture based on GPT-2, with modifications to reduce size and optimize for CPU-based environments.
Training Optimization Techniques
- Model Selection: While larger-parameter models (774M) were trained, the base GPT-2 variant (124M) was chosen to strike a balance between performance and computational efficiency, enabling inference on a CPU-only instance.
- Tokenization: Used `tiktoken` for its faster encoding (2x-5x) compared to the equivalent Hugging Face tokenizer and for its lower memory usage.
- Mixed Precision Training: Mixed precision training was implemented using PyTorch's `autocast` context manager, allowing faster training steps and reduced memory usage, which in turn enabled larger batch sizes and more efficient training. `bfloat16` was used instead of `float16` so that mixed-precision training and inference could run on a CPU instance, as `bfloat16` maintains the same exponent range as `float32`. Because `bfloat16` keeps 8 exponent bits, gradient scaling (`GradScaler`) was not required to prevent overflow.
- Gradient Accumulation: Enabled to effectively simulate larger batch sizes and stabilize training without requiring additional memory overhead. A combined sketch of mixed precision and gradient accumulation follows this list.
- Distributed Data Parallel: Distributed Data Parallel (DDP) was implemented to enable scalable multi-GPU training, but it was ultimately not used because no multi-GPU instance was available during development.
Training Stability Techniques
- Weight Initialization: Model weights were initialized using the standard GPT-2 scheme, a normal distribution with a standard deviation of 0.02, promoting convergence stability early in training.
- Gradient Clipping: Used to prevent exploding gradients by capping gradient norms during backpropagation.
- Learning Rate Scheduling: A cosine decay scheduler was used to gradually reduce the learning rate; see the sketch after this list.
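A compact sketch of these stability techniques. The `min_lr` and `max_norm` values are illustrative assumptions, not the project's exact settings:

```python
import math
import torch
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # GPT-2 style initialization: weights ~ N(0, 0.02), biases zeroed
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

def cosine_lr(step: int, max_steps: int, max_lr: float = 6e-4, min_lr: float = 6e-5) -> float:
    # Cosine decay from max_lr down to min_lr over the course of training
    progress = min(step / max_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Usage (assumed placement; `model`, `optimizer`, `step`, `max_steps` are placeholders):
#   model.apply(init_weights)                                          # once, at model creation
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping
#   for group in optimizer.param_groups:
#       group["lr"] = cosine_lr(step, max_steps)
```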
Training Process
- Batch Size: 8 (per step)
- Block Size (context length): 1024 tokens
- Learning Rate: 6e-4 (AdamW optimizer)
- Training Duration: 4 epochs
- Hardware: AWS EC2 g4dn.xlarge & RTX 4080
Overall, the pretraining process reflects practical design choices made to maximize learning outcomes under limited hardware while keeping the model architecture extensible for future improvements.
Fine-Tuning (LoRA)
To reduce computational cost and memory usage during fine-tuning, this project uses a custom Low-Rank Adaptation (LoRA) implementation applied to both linear and embedding layers.
Fine-Tuning Dataset: The model was fine-tuned exclusively on a curated set of Eminem's lyrics, consisting of approximately 213,680 tokens. This dataset was used to guide the model toward generating lyrics in a style consistent with the artist.
Training Configuration: The fine-tuning loop followed the same architectural and hyperparameter conventions as the main pretraining process, including gradient accumulation, cosine learning rate decay, and mixed precision training, ensuring a consistent and efficient optimization pipeline.
- Target Modules: Attention + MLP layers
- Excluded Modules: Final LayerNorm
- Rank: 8
- Alpha: 8
- RSLoRA: used a rank-stabilized scaling factor (alpha / sqrt(rank)) to stabilize fine-tuning; well suited to small datasets and robust at low-rank settings.
LoRA Rank/Alpha Selection: I chose a relatively low LoRA rank/alpha value of 8 to suit the characteristics of this project. Since the fine-tuning task is not highly complex and is performed on a small dataset in a low-resource environment, a lower alpha helps maintain training stability without overwhelming the base model.
Training Efficiency: Only 1.06% of the total model parameters were updated during fine-tuning, significantly reducing the computational load while preserving performance.
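A minimal sketch of a custom LoRA wrapper for a linear layer with rank-stabilized (rsLoRA) scaling. It illustrates the general technique rather than the project's exact implementation; the embedding-layer variant is analogous:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update, with rsLoRA scaling."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero update at init
        self.scaling = alpha / math.sqrt(rank)         # rsLoRA: alpha / sqrt(rank), not alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank update (B @ A)
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: wrap an attention/MLP projection, then train only the LoRA parameters
layer = LoRALinear(nn.Linear(768, 768), rank=8, alpha=8.0)
trainable = [p for p in layer.parameters() if p.requires_grad]   # lora_A and lora_B only
```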
Optimization
KV Caching
Key-Value caching improves generation speed by storing previously computed attention values, avoiding recomputation for earlier tokens during inference.
KV-Caching Memory Usage: For a maximum context length of 1024 tokens and `bfloat16` tensors, the model consumes approximately 16 MB of memory for caching key/value tensors during inference.
KV-Cache Memory Usage Estimation
- Batch Size: 1
- Layers: 8
- Heads: 8
- Context Length: 1024
- Head Dim: 64
- Dtype: `bfloat16` (2 bytes)
- Memory Usage = 1 × 8 × 8 × 1024 × 64 × 2 (keys and values) × 2 bytes = 16 MB
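A minimal sketch of the caching idea inside an attention layer, using the shapes from the estimate above; the function name and cache layout are illustrative, not the project's exact code:

```python
import torch

def append_kv(cache: dict, k_new: torch.Tensor, v_new: torch.Tensor):
    """Append the current token's key/value (batch, heads, 1, head_dim) to the cache
    and return the full key/value tensors used for attention."""
    if "k" in cache:
        cache["k"] = torch.cat([cache["k"], k_new], dim=2)
        cache["v"] = torch.cat([cache["v"], v_new], dim=2)
    else:
        cache["k"], cache["v"] = k_new, v_new
    return cache["k"], cache["v"]

# During decoding, only the new token's K/V are computed; earlier ones come from the cache
cache = {}
for _ in range(4):
    k_new = torch.randn(1, 8, 1, 64, dtype=torch.bfloat16)
    v_new = torch.randn(1, 8, 1, 64, dtype=torch.bfloat16)
    k, v = append_kv(cache, k_new, v_new)
print(k.shape)  # torch.Size([1, 8, 4, 64])
```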
Quantization
To reduce model size and improve inference speed, dynamic quantization was applied using PyTorch’s built-in API. This optimization allows the model to run efficiently on CPU without retraining.
Although int8 quantization is typically preferred for maximum compression, it significantly degraded the model's output quality in this case. As a result, `float16` was used instead, which preserved model performance while still achieving an estimated 20% reduction in inference time.
- Method: `torch.quantization.quantize_dynamic`
- Scope: Applied to `nn.Linear` layers
- Precision: `float16` (half precision)
- Goal: Decrease memory footprint and enable faster CPU inference
- Result: Achieved lower latency and smaller model size for deployment on a CPU-only instance
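A sketch of the quantization call as configured above; the `nn.Sequential` stand-in replaces the trained GPT-2-style network, which is assumed to already be loaded on CPU:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the trained model on CPU
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},           # convert only Linear layers
    dtype=torch.float16,   # half-precision weights; int8 hurt output quality here
)
```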
Dynamic Quantization Summary
Dynamic quantization reduces model size and improves inference speed by converting certain weights (those in `Linear` layers) from 32-bit floats to lower-precision types like `int8` or `float16` at runtime.
It uses asymmetric quantization, where the scale (α) and zero-point (β) are automatically calculated based on the distribution of model weights.
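A toy illustration of the asymmetric scheme for the int8 case (PyTorch computes these parameters internally, and the `float16` path used in this project is effectively a cast rather than integer quantization):

```python
import torch

def asymmetric_qparams(w: torch.Tensor, qmin: int = -128, qmax: int = 127):
    """Scale (alpha) and zero-point (beta) chosen from the weight range."""
    w_min, w_max = w.min().item(), w.max().item()
    alpha = (w_max - w_min) / (qmax - qmin)            # scale
    beta = int(round(qmin - w_min / alpha))            # zero-point
    return alpha, beta

def quantize_int8(w: torch.Tensor, alpha: float, beta: int) -> torch.Tensor:
    return torch.clamp(torch.round(w / alpha) + beta, -128, 127).to(torch.int8)

w = torch.randn(4, 4)
alpha, beta = asymmetric_qparams(w)
w_q = quantize_int8(w, alpha, beta)
w_dq = (w_q.float() - beta) * alpha                    # approximate reconstruction
```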
Evaluation
To assess the quality and stylistic relevance of generated rap lyrics, a combination of a DistilBERT-based classifier and cosine similarity metrics was used.
Performance Impact: The average cosine similarity score before fine-tuning was approximately 0.34, indicating limited stylistic alignment. After fine-tuning on Eminem's lyrics, the average score increased to around 0.71, reflecting a significant improvement in stylistic relevance and generation quality.
Evaluation Method
- Embedding Generation: Each generated lyric is converted into a semantic vector using a pre-trained `distilbert-base-uncased` model.
- Reference Comparison: The generated vectors are compared against embeddings of real rap lyrics from the training set or manually curated references.
- Cosine Similarity: Calculates similarity between vectors—higher scores indicate better alignment with real rap lyric style.
- Thresholding / Labeling (Optional): A fine-tuned classifier can assign quality labels (e.g., authentic-style vs off-style) or confidence scores to each generation.
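A sketch of the embedding-plus-cosine-similarity check, assuming mean pooling over DistilBERT's last hidden state; the project's exact pooling strategy and reference lyrics may differ:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden state into a single sentence vector
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

generated = embed("Chasin' dreams in the rain")
reference = embed("A reference lyric from the curated set goes here")  # placeholder reference
score = torch.cosine_similarity(generated, reference, dim=0).item()
print(f"Cosine similarity: {score:.2f}")
```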
Example Results
| Prompt | Cosine Similarity | Comment |
|---|---|---|
| Chasin' dreams in the rain | 0.82 | Strong stylistic match |
| I got Loyalty, got royalty inside my DNA | 0.48 | Off-style / Kendrick Lamar |
| It feels so empty without me | 0.92 | Excellent stylistic match |
Deployment
RapGPT is deployed as a full-stack application, consisting of a FastAPI backend and a React/Next.js frontend.
- Backend: Powered by `FastAPI`, hosted on an `AWS EC2 c5.large` instance (CPU-only).
- Frontend: Built with `React` and `Next.js`, deployed via Vercel.