Most reinforcement learning approaches to information retrieval struggle with the same problem: rewards are sparse, noisy, and hard to define when the task is open-ended. Privileged World Supervision (PWS) takes a different route. Instead of trying to evaluate agents in the real world, we build synthetic worlds where the ground truth is fully known, and use that access to compute exact reward signals.
The idea is simple but powerful. An agentic pipeline generates coherent synthetic corpora and knowledge graphs, complete with realistic noise, contradictions, and missing information. Because we control the environment, we know exactly which claims are true, which sources are reliable, and what the correct graph structure looks like. This lets us train RL agents (using PPO without a critic network) with precise per-action advantage signals, avoiding the instability that comes with standard value approximation.
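Because the environment's ground truth yields an exact reward for every action, the usual learned value network can be replaced by an empirical baseline over rollouts. A minimal sketch of that critic-free advantage computation, assuming per-action scalar rewards (the actual reward function is environment-specific):

```python
import numpy as np

def privileged_advantages(rollout_rewards, gamma=0.99):
    """Per-action advantages without a learned critic.

    rollout_rewards: list of reward sequences, one per rollout; each reward
    is computed exactly from the synthetic world's ground truth (hypothetical
    shape -- the real reward function depends on the environment).
    """
    # Discounted reward-to-go for each action in each rollout.
    returns = []
    for rewards in rollout_rewards:
        acc, g = 0.0, []
        for r in reversed(rewards):
            acc = r + gamma * acc
            g.append(acc)
        returns.append(list(reversed(g)))

    # Baseline: mean return across rollouts at each timestep, standing in
    # for the value-network estimate PPO would normally use.
    max_len = max(len(g) for g in returns)
    baseline, counts = np.zeros(max_len), np.zeros(max_len)
    for g in returns:
        baseline[: len(g)] += g
        counts[: len(g)] += 1
    baseline /= np.maximum(counts, 1)

    return [[g[t] - baseline[t] for t in range(len(g))] for g in returns]
```

Since every reward here is exact rather than estimated, the variance of this baseline comes only from policy stochasticity, which is what sidesteps the value-approximation instability mentioned above.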
To bootstrap performance before RL optimization, we fine-tuned Qwen3-30B-A3B using LoRA adapters on expert trajectories generated within these environments. The synthetic world generation itself relies on Claude 4.5's API to produce diverse, internally consistent document ecosystems across multiple domains.
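The LoRA adapters keep the base weights frozen and train only a low-rank update. A self-contained sketch of the underlying math (symbols W, A, B, alpha, r follow the standard LoRA formulation; the actual fine-tuning uses library tooling rather than this toy forward pass):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA forward pass: frozen base weight W plus low-rank update B @ A.

    W: (out, in) frozen weight; A: (r, in) and B: (out, r) are the trainable
    adapter matrices; alpha / r scales the update. B is typically initialized
    to zero so training starts from the base model's behavior.
    """
    r = A.shape[0]
    return W @ x + (alpha / r) * (B @ (A @ x))
```

With B initialized to zeros, the adapted model initially reproduces the base model exactly, which is why LoRA fine-tuning starts from the pretrained behavior rather than perturbing it.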
This work is still exploratory and actively evolving. The long-term goal is to show that training in privileged environments can transfer to real-world information retrieval, where ground truth is never fully available but the reasoning skills learned in simulation still apply.
Goal: Build an end-to-end RAG system to enhance the factual accuracy of the Llama2-7b model by augmenting it with external, verified context.
Contributions: Designed and built a complete system, including an embedding database and an efficient retriever.
Technologies: Python, UAE-Large-V1, FAISS, prompt engineering.
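The retrieval step can be sketched with a plain numpy nearest-neighbor search standing in for the FAISS index (in the actual system, a FAISS inner-product index over normalized UAE-Large-V1 embeddings behaves the same way):

```python
import numpy as np

def retrieve(query_vec, doc_vecs, k=3):
    """Return indices of the k documents most similar to the query.

    Cosine similarity via normalized dot products -- a numpy stand-in for
    the FAISS index used in the real retriever.
    """
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    return np.argsort(-scores)[:k]
```

The top-k document texts are then inserted into the prompt as verified context before generation.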
Goal: Fine-tune an open-source GPT-2 model on multiple datasets to create a conversational assistant capable of answering specialized questions.
Contributions: Led the data collection and the fine-tuning of the model.
Technologies: Python, Hugging Face, NLP.
Goal: Improve the efficiency and performance of NLP models for internal data classification.
Contributions:
Engineered a data preprocessing pipeline that reduced processing time by 99.3%, from 35 minutes to just 15 seconds.
Led custom fine-tuning of a ModernBERT-based model on open (CC0) documents from the Federal Aviation Administration (FAA).
Designed and implemented a training pipeline that improved NLP model performance by 15%.
Designed and implemented a Streamlit application for key users to provide feedback on the model's performance with explainability through SHAP values.
Technologies: Python, AWS, GitLab CI/CD.
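Speedups of this magnitude typically come from replacing row-by-row Python loops with vectorized batch operations. A minimal sketch in that spirit, assuming a hypothetical "text" column (the actual pipeline and transforms are internal):

```python
import pandas as pd

def preprocess(df):
    """Batch text normalization with vectorized pandas string ops.

    Swapping a per-row Python loop for calls like these is the kind of
    change behind a 35-minute-to-15-second reduction; column name "text"
    is illustrative, not the real schema.
    """
    out = df.copy()
    out["text"] = (
        out["text"]
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)  # collapse runs of whitespace
        .str.strip()
    )
    return out
```

Because each `.str` call dispatches to compiled code over the whole column, the per-document Python interpreter overhead disappears, which dominates at scale.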