Daniel van Strien
danielvanstrien.bsky.social
did:plc:7e5mpxuweopubhexwqg5l3ba
Huginn Dataset: 1.1T token reasoning-focused LLM training corpus
- 274M sequences with 4097 tokens each
- Prioritizes code (25%), math (6%), scientific content (19%)
- Direct instruction data integration in pretraining
- Special handling of long mathematical content
https://huggingface.co/datasets/tomg-group-umd/huginn-dataset
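A quick sanity check of the headline figure: 274M sequences at 4097 tokens each should land near the quoted 1.1T tokens. The sketch below uses only the rounded numbers from the post. Treating the listed percentages as shares of tokens is an assumption, so the per-category counts are rough estimates, not official statistics.

```python
# Rounded figures taken from the post; the exact sequence count may differ.
sequences = 274_000_000
tokens_per_sequence = 4097

total_tokens = sequences * tokens_per_sequence
print(f"total: {total_tokens / 1e12:.2f}T tokens")

# Assumption: the percentages in the post are shares of tokens.
shares = {"code": 0.25, "math": 0.06, "scientific content": 0.19}
for name, share in shares.items():
    print(f"{name}: ~{total_tokens * share / 1e9:.0f}B tokens")
```

The total comes out at about 1.12T, consistent with the 1.1T in the post.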
2025-04-02T16:56:25.399Z