RapGPT: Rap Lyric Generator
Overview
RapGPT is a pre-trained and fine-tuned GPT-2 model designed to generate original rap lyrics in the style of Eminem from user prompts. It uses a decoder-only Transformer architecture built on a slightly modified GPT-2 configuration. The project was iteratively optimized to run in CPU-only environments due to resource constraints, making it accessible for lightweight deployment.
This project was started as a hands-on initiative to gain end-to-end experience in large language model (LLM) development, covering data collection, pre-training, parameter-efficient fine-tuning (PEFT), evaluation, and deployment. While model performance is constrained by limited compute resources, the focus was on building practical skills across the full LLM pipeline.
Data Preparation
- Source: Genius.com
- Cleaning: Removed metadata, repetition tags, and extra whitespace.
- Tokenization: GPT-2's `tiktoken`-based Byte Pair Encoding (BPE); see the sketch below.
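A minimal sketch of the tokenization step with `tiktoken`; the file path `lyrics_clean.txt` is a hypothetical placeholder for the cleaned, concatenated corpus:

```python
import tiktoken

# GPT-2's BPE tokenizer via tiktoken (the same vocabulary the model is pretrained with)
enc = tiktoken.get_encoding("gpt2")

# Hypothetical path to the cleaned lyrics corpus
with open("lyrics_clean.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Encode to token ids; allow the <|endoftext|> separator between songs
tokens = enc.encode(text, allowed_special={"<|endoftext|>"})
print(f"Total tokens: {len(tokens):,}")
```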
Dataset Summary
- Total Words: 23,009,521
- Total Songs: 17,273
- Total Artists: 226
- Total Tokens: 25,367,203
Pretraining
The model was pretrained using a decoder-only Transformer architecture based on GPT-2, with modifications to reduce size and optimize for CPU-based environments.
Training Optimization Techniques
- Model Selection: While larger-parameter models (774M) were trained, the base GPT-2 variant (124M) was chosen to strike a balance between performance and computational efficiency, enabling inference on a CPU-only instance.
- Tokenization: Used `tiktoken` for its faster encoding (2x-5x) compared to the equivalent Hugging Face tokenizer and for its lower memory usage.
- Mixed Precision Training: Mixed precision training was implemented using PyTorch's `autocast` context manager, allowing faster training steps and reduced memory usage, which in turn enabled larger batch sizes and more efficient training. `bfloat16` was used instead of `float16` so that mixed-precision training and inference could run on a CPU instance, as `bfloat16` maintains the same exponent range as `float32`. Because `bfloat16` keeps 8 exponent bits, gradient scaling (`GradScaler`) was not required to prevent overflow.
- Gradient Accumulation: Enabled to effectively simulate larger batch sizes and stabilize training without requiring additional memory overhead. A combined sketch of mixed precision and gradient accumulation follows this list.
- Distributed Data Parallel: Distributed Data Parallel (DDP) was implemented to enable scalable multi-GPU training, but it was ultimately not used because no multi-GPU instance was available during development.
Training Stability Techniques
- Weight Initialization: Model weights were initialized using the standard GPT-2 scheme, a normal distribution with a standard deviation of 0.02, promoting convergence stability early in training.
- Gradient Clipping: Used to prevent exploding gradients by capping gradient norms during backpropagation.
- Learning Rate Scheduling: A cosine decay scheduler was used to gradually reduce the learning rate; see the sketch after this list.
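A compact sketch of these stability techniques. The `min_lr` and `max_norm` values are illustrative assumptions, not the project's exact settings:

```python
import math
import torch
import torch.nn as nn

def init_weights(module: nn.Module) -> None:
    # GPT-2 style initialization: weights ~ N(0, 0.02), biases zeroed
    if isinstance(module, (nn.Linear, nn.Embedding)):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)
    if isinstance(module, nn.Linear) and module.bias is not None:
        nn.init.zeros_(module.bias)

def cosine_lr(step: int, max_steps: int, max_lr: float = 6e-4, min_lr: float = 6e-5) -> float:
    # Cosine decay from max_lr down to min_lr over the course of training
    progress = min(step / max_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Usage (assumed placement; `model`, `optimizer`, `step`, `max_steps` are placeholders):
#   model.apply(init_weights)                                          # once, at model creation
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping
#   for group in optimizer.param_groups:
#       group["lr"] = cosine_lr(step, max_steps)
```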
Training Process
- Batch Size: 8 (per step)
- Block Size (context length): 1024 tokens
- Learning Rate: 6e-4 (AdamW optimizer)
- Training Duration: 4 epochs
- Hardware: AWS EC2 g4dn.xlarge & RTX 4080
Overall, the pretraining process reflects practical design choices made to maximize learning outcomes under limited hardware while keeping the model architecture extensible for future improvements.
Fine-Tuning (LoRA)
To reduce computational cost and memory usage during fine-tuning, this project uses a custom Low-Rank Adaptation (LoRA) implementation applied to both linear and embedding layers.
Fine-Tuning Dataset: The model was fine-tuned exclusively on a curated set of Eminem's lyrics, consisting of approximately 213,680 tokens. This dataset was used to guide the model toward generating lyrics in a style consistent with the artist.
Training Configuration: The fine-tuning loop followed the same architectural and hyperparameter conventions as the main pretraining process, including gradient accumulation, cosine learning rate decay, and mixed precision training, ensuring a consistent and efficient optimization pipeline.
- Target Modules: Attention + MLP layers
- Excluded Modules: Final LayerNorm
- Rank: 8
- Alpha: 8
- RSLoRA: used a rank-stabilized scaling factor (alpha / sqrt(rank)) to stabilize fine-tuning; well suited to small datasets and robust at low-rank settings.
LoRA Rank/Alpha Selection: I chose a relatively low LoRA rank/alpha value of 8 to suit the characteristics of this project. Since the fine-tuning task is not highly complex and is performed on a small dataset in a low-resource environment, a lower alpha helps maintain training stability without overwhelming the base model.
Training Efficiency: Only 1.06% of the total model parameters were updated during fine-tuning, significantly reducing the computational load while preserving performance.
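A minimal sketch of a custom LoRA wrapper for a linear layer with rank-stabilized (rsLoRA) scaling. It illustrates the general technique rather than the project's exact implementation; the embedding-layer variant is analogous:

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen nn.Linear plus a trainable low-rank update, with rsLoRA scaling."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero update at init
        self.scaling = alpha / math.sqrt(rank)         # rsLoRA: alpha / sqrt(rank), not alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank update (B @ A)
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Usage: wrap an attention/MLP projection, then train only the LoRA parameters
layer = LoRALinear(nn.Linear(768, 768), rank=8, alpha=8.0)
trainable = [p for p in layer.parameters() if p.requires_grad]   # lora_A and lora_B only
```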
Optimization
KV Caching
Key-Value caching improves generation speed by storing previously computed attention values, avoiding recomputation for earlier tokens during inference.
KV-Caching Memory Usage: For a maximum context length of 1024 tokens and `bfloat16` tensors, the model consumes approximately 16 MB of memory for caching key/value tensors during inference.
KV-Cache Memory Usage Estimation
- Batch Size: 1
- Layers: 8
- Heads: 8
- Context Length: 1024
- Head Dim: 64
- Dtype: `bfloat16` (2 bytes)
- Memory Usage = 1 × 8 × 8 × 1024 × 64 × 2 (keys and values) × 2 bytes = 16 MB
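A minimal sketch of the caching idea inside an attention layer, using the shapes from the estimate above; the function name and cache layout are illustrative, not the project's exact code:

```python
import torch

def append_kv(cache: dict, k_new: torch.Tensor, v_new: torch.Tensor):
    """Append the current token's key/value (batch, heads, 1, head_dim) to the cache
    and return the full key/value tensors used for attention."""
    if "k" in cache:
        cache["k"] = torch.cat([cache["k"], k_new], dim=2)
        cache["v"] = torch.cat([cache["v"], v_new], dim=2)
    else:
        cache["k"], cache["v"] = k_new, v_new
    return cache["k"], cache["v"]

# During decoding, only the new token's K/V are computed; earlier ones come from the cache
cache = {}
for _ in range(4):
    k_new = torch.randn(1, 8, 1, 64, dtype=torch.bfloat16)
    v_new = torch.randn(1, 8, 1, 64, dtype=torch.bfloat16)
    k, v = append_kv(cache, k_new, v_new)
print(k.shape)  # torch.Size([1, 8, 4, 64])
```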
Quantization
To reduce model size and improve inference speed, dynamic quantization was applied using PyTorch’s built-in API. This optimization allows the model to run efficiently on CPU without retraining.
Although int8 quantization is typically preferred for maximum compression, it significantly degraded the model's output quality in this case. As a result, `float16` was used instead, which preserved model performance while still achieving an estimated 20% reduction in inference time.
- Method: `torch.quantization.quantize_dynamic`
- Scope: Applied to `nn.Linear` layers
- Precision: `float16` (half precision)
- Goal: Decrease memory footprint and enable faster CPU inference
- Result: Achieved lower latency and smaller model size for deployment on a CPU-only instance
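A sketch of the quantization call as configured above; the `nn.Sequential` stand-in replaces the trained GPT-2-style network, which is assumed to already be loaded on CPU:

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the trained model on CPU
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
model.eval()

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},           # convert only Linear layers
    dtype=torch.float16,   # half-precision weights; int8 hurt output quality here
)
```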
Dynamic Quantization Summary
Dynamic quantization reduces model size and improves inference speed by converting certain weights (those in `Linear` layers) from 32-bit floats to lower-precision types like `int8` or `float16` at runtime.
It uses asymmetric quantization, where the scale (α) and zero-point (β) are automatically calculated based on the distribution of model weights.
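A toy illustration of the asymmetric scheme for the int8 case (PyTorch computes these parameters internally, and the `float16` path used in this project is effectively a cast rather than integer quantization):

```python
import torch

def asymmetric_qparams(w: torch.Tensor, qmin: int = -128, qmax: int = 127):
    """Scale (alpha) and zero-point (beta) chosen from the weight range."""
    w_min, w_max = w.min().item(), w.max().item()
    alpha = (w_max - w_min) / (qmax - qmin)            # scale
    beta = int(round(qmin - w_min / alpha))            # zero-point
    return alpha, beta

def quantize_int8(w: torch.Tensor, alpha: float, beta: int) -> torch.Tensor:
    return torch.clamp(torch.round(w / alpha) + beta, -128, 127).to(torch.int8)

w = torch.randn(4, 4)
alpha, beta = asymmetric_qparams(w)
w_q = quantize_int8(w, alpha, beta)
w_dq = (w_q.float() - beta) * alpha                    # approximate reconstruction
```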
Evaluation
To assess the quality and stylistic relevance of generated rap lyrics, a combination of a DistilBERT-based classifier and cosine similarity metrics was used.
Performance Impact: The average cosine similarity score before fine-tuning was approximately 0.34, indicating limited stylistic alignment. After fine-tuning on Eminem's lyrics, the average score increased to around 0.71, reflecting a significant improvement in stylistic relevance and generation quality.
Evaluation Method
- Embedding Generation: Each generated lyric is converted into a semantic vector using a pre-trained `distilbert-base-uncased` model.
- Reference Comparison: The generated vectors are compared against embeddings of real rap lyrics from the training set or manually curated references.
- Cosine Similarity: Calculates similarity between vectors—higher scores indicate better alignment with real rap lyric style.
- Thresholding / Labeling (Optional): A fine-tuned classifier can assign quality labels (e.g., authentic-style vs off-style) or confidence scores to each generation.
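A sketch of the embedding-plus-cosine-similarity check, assuming mean pooling over DistilBERT's last hidden state; the project's exact pooling strategy and reference lyrics may differ:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> torch.Tensor:
    # Mean-pool the last hidden state into a single sentence vector
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

generated = embed("Chasin' dreams in the rain")
reference = embed("A reference lyric from the curated set goes here")  # placeholder reference
score = torch.cosine_similarity(generated, reference, dim=0).item()
print(f"Cosine similarity: {score:.2f}")
```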
Example Results
| Prompt | Cosine Similarity | Comment |
|---|---|---|
| Chasin' dreams in the rain | 0.82 | Strong stylistic match |
| I got Loyalty, got royalty inside my DNA | 0.48 | Off-style / Kendrick Lamar |
| It feels so empty without me | 0.92 | Excellent stylistic match |
Deployment
RapGPT is deployed as a full-stack application, consisting of a FastAPI backend and a React/Next.js frontend.
- Backend: Powered by `FastAPI`, hosted on an `AWS EC2 c5.large` instance (CPU-only).
- Frontend: Built with `React` and `Next.js`, deployed via Vercel.