Martin Jaggi
mjaggi.bsky.social
did:plc:z2fvdaytvutyictg45b4pmt7
Using the 'right' data can hugely speed up LLM training, but how do you find the best training data in the vast sea of a whole web crawl?
We propose a simple classifier-based selection, enabling multilingual LLMs 🧵
2025-04-23T05:06:38.682Z