Introduction
With the rapid growth of social media, extracting specific entities from tweets has become increasingly useful for businesses, organizations, and researchers. One common task is identifying location mentions, which can be challenging given the variability and informality of language on platforms like Twitter. This project involved developing an NLP model to predict locations in tweets and optimizing it with various model architectures and ensemble techniques to achieve the lowest possible Word Error Rate (WER).
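Since WER drives every decision below, it is worth pinning down what it measures: the word-level edit distance between the predicted location string and the gold one, normalized by the reference length. A minimal sketch, assuming a standard implementation such as the jiwer package (an illustrative choice, not necessarily what the original pipeline used):

import jiwer

reference = "new york"           # gold location mention for a tweet
hypothesis = "new york city"     # a model's prediction
print(jiwer.wer(reference, hypothesis))  # 0.5: one extra word against two reference words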
Below, we’ll walk through the detailed approach we followed, discussing the steps, experiments, and adjustments that shaped our solution.
Step 1: Initial Model Selection and Setup
For location recognition, transformer-based models are highly effective, as they capture context across sentence structures. We began with BERT-based models (such as bert-large-uncased), which perform well on general language tasks, and later explored other architectures like RoBERTa and DeBERTa to see if they could offer improvements.
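As a starting point, the task can be framed as token classification with the Hugging Face Transformers library. The sketch below is illustrative; the label set (a BIO scheme with a single LOC type) is an assumption rather than the project's exact configuration.

from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-large-uncased"  # later swapped for RoBERTa / DeBERTa checkpoints

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME,
    num_labels=3,  # e.g. O / B-LOC / I-LOC (assumed tagging scheme)
)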
Initial Setup
To begin, we configured the following key parameters:
- Batch Sizes: Given the memory limitations, we set initial training batch sizes at moderate values and later experimented with different settings.
- Gradient Accumulation: To simulate a larger batch size without overwhelming GPU memory, we added gradient_accumulation_steps, allowing the model to accumulate gradients over multiple steps.
- Warmup Steps: This was included to control the learning rate in the early stages, making training smoother (see the configuration sketch after this list).
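A minimal sketch of this configuration with Hugging Face's TrainingArguments; the specific numbers are illustrative rather than the exact values we settled on:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    per_device_train_batch_size=8,   # moderate batch size to fit GPU memory
    gradient_accumulation_steps=4,   # effective batch size of 32 without extra memory
    warmup_steps=25,                 # small warmup keeps the early learning rate in check
    num_train_epochs=3,
    learning_rate=2e-5,
)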
Memory Management Issues
Memory overflow was an early challenge. Adjusting batch size helped initially, but we implemented additional techniques such as mixed-precision training and gradient checkpointing to further reduce memory usage. By setting the PYTORCH_CUDA_ALLOC_CONF environment variable, we mitigated fragmentation issues as well.
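A short sketch of those memory-saving measures; the allocator option must be set before the first CUDA allocation, and the value shown is illustrative:

import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # curb memory fragmentation

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs",
    fp16=True,                    # mixed-precision training roughly halves activation memory
    gradient_checkpointing=True,  # recompute activations instead of storing them
)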
Step 2: Fine-Tuning and Experimentation
Tuning Hyperparameters
We experimented with several hyperparameters to optimize WER scores:
- MAX_LEN: Reducing MAX_LEN from 512 down to 300, and then to 256, significantly reduced memory usage and improved the WER score.
- Warmup Steps: Smaller values (25–100) were found to be more effective. Higher values increased training stability but worsened the WER score.
- Epochs: The impact was relatively minimal; scores were stable even as we varied this from 1 to 10.
After numerous trials, bert-large-uncased with MAX_LEN=300 and WARMUP_STEPS=25 yielded the best result of 0.31 WER.
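For reference, a sketch of what that winning setting looks like at tokenization time; the example tweet and variable names are illustrative:

from transformers import AutoTokenizer

MAX_LEN = 300
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
encoded = tokenizer(
    "Flooding reported near downtown Houston and Galveston this morning",
    truncation=True,
    max_length=MAX_LEN,        # the value that gave the best WER
    padding="max_length",
    return_tensors="pt",
)
print(encoded["input_ids"].shape)  # torch.Size([1, 300])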
Step 3: Exploring Other Models
After optimizing bert-large-uncased, we tried additional models:
- RoBERTa: While robust in general tasks, RoBERTa underperformed for this dataset, despite training for 30 epochs. We found that RoBERTa’s longer training time didn’t translate to lower WER.
- DeBERTa: Even with different values of MAX_LEN and WARMUP_STEPS, DeBERTa’s WER scores remained high (>1.6), likely due to the small batch sizes and memory constraints.
After these experiments, the BERT-based model remained the top performer.
Step 4: Ensemble Modeling
With several models trained, we created an ensemble to improve predictions by leveraging each model’s strengths.
Ensemble Strategy 1: Majority Voting
This method builds the final prediction from the words that occur most frequently across the models' outputs for a given tweet. We found that majority voting improved accuracy slightly without adding substantial computational load.
from collections import Counter

# Word-level majority voting: pool the words from every model's prediction for a tweet
# and return them ordered by how often they occur across the model outputs.
def majority_vote(words_list):
    all_words = []
    for words in words_list:
        all_words.extend(words.split())
    word_counter = Counter(all_words)
    return ' '.join([word for word, count in word_counter.most_common()])
Ensemble Strategy 2: Aggregating Unique Mentions
Alternatively, we also tried aggregating unique location mentions across model predictions, which ensured all potential locations from different models were included.
# Union of all location words predicted by any model for a given tweet.
def aggregate_predictions(words_list):
    all_words = set()
    for words in words_list:
        all_words.update(words.split())
    return ' '.join(all_words)
Both ensemble methods allowed us to experiment with combining models, providing a more comprehensive prediction than a single model alone.
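For illustration, here is how the two helpers behave on a small, made-up set of per-model predictions for a single tweet:

predictions = ["new york", "new york", "manhattan new york"]

print(majority_vote(predictions))          # 'new york manhattan' -- every word, most frequent first
print(aggregate_predictions(predictions))  # the same three words, in arbitrary (set) order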
Step 5: Saving and Loading Models for Reproducibility
To streamline future experimentation, we saved each model’s weights and predictions. Using Hugging Face’s Trainer API, we saved models directly to a directory (output_dir). For the ensemble, we stored individual predictions as CSV files, allowing quick access without re-running predictions.
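A hedged sketch of that persistence step; the trainer and tokenizer objects are assumed to come from the fine-tuning stage, and the paths, column names, and example rows are illustrative:

import pandas as pd

trainer.save_model("outputs/bert-large-uncased")        # weights and config via the Trainer API
tokenizer.save_pretrained("outputs/bert-large-uncased")

# Persist this model's predictions so the ensemble can reuse them without re-running inference.
pred_df = pd.DataFrame({"tweet_id": [101, 102], "location": ["new york", "houston"]})
pred_df.to_csv("outputs/bert_large_predictions.csv", index=False)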
Loading and Utilizing Models for Ensembling
By saving each model’s predictions as CSV files, we were able to load them directly to avoid re-computation, saving time and resources during ensemble testing.
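A minimal sketch of that reload-and-ensemble step, assuming each CSV holds one row per tweet in the same order; the file and column names are illustrative:

import pandas as pd

files = [
    "outputs/bert_large_predictions.csv",
    "outputs/roberta_predictions.csv",
    "outputs/deberta_predictions.csv",
]
frames = [pd.read_csv(f) for f in files]

# Combine the per-model location strings row by row with the majority_vote helper above;
# aggregate_predictions can be swapped in the same way.
ensembled = [
    majority_vote(list(rows))
    for rows in zip(*(df["location"].fillna("") for df in frames))
]

submission = frames[0][["tweet_id"]].copy()
submission["location"] = ensembled
submission.to_csv("outputs/ensemble_predictions.csv", index=False)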
Step 6: Final Testing and Conclusion
Ultimately, bert-large-uncased combined with the majority-voting ensemble delivered the most robust performance. The lower MAX_LEN values and careful adjustment of warmup steps contributed significantly to this outcome.
Conclusion
This project offered deep insights into fine-tuning NLP models for entity recognition. Despite RoBERTa and DeBERTa’s general strengths, bert-large-uncased emerged as the best option for our dataset. Ensemble methods further enhanced performance, providing a comprehensive and balanced final prediction by combining the strengths of each model.
Through careful experimentation with hyperparameters, memory management, and model ensembling, we achieved a refined system for location mention recognition in tweets. This process highlights how balancing performance with resource constraints is key to building practical NLP applications.
Final Ranking
