Projects
ICLR 2025
NovelQA
A long-context benchmark with 2,305 questions over novels averaging 200K+ tokens, covering question answering, retrieval, and reasoning.
VLM-Lens
Probes and test cases for Qwen-VL and InternLM, with a focus on color naming, language gaps, and modality gaps in vision-language models.
GLoRE
A unified logical reasoning benchmark and evaluation pipeline across datasets such as LogiQA, ReClor, FOLIO, AR-LSAT, ProofWriter, and RuleTaker.
Latinate-Germanic Ratio
A linguistically grounded metric for genre formality, built from 40K etymology entries and Brown-family corpora, that correlates strongly with existing formality scores.