Ryan Heuser
heuser.bsky.social
did:plc:llsfba2zryavs4uunglknwie
Any of my DH/techy friends know how to check which websites are included in LLM training data corpora (Common Crawl, The Stack/Pile, etc)? I just want a way to search the URLs included for poetry sites, but it's annoyingly kind of difficult. CC's index page/API returns a 503 error.
cc @tedunderwood.me
2024-04-13T07:00:27.659Z