SNEWPapers | The World's First AI Newspaper Archive

What's up Product Hunt! 👋

I'm excited to share SNEWPapers, the world's first AI-powered historical newspaper archive. We've read and organized 6 million+ stories from 250 years of American newspapers (1730s–1960s) so you can finally explore history by meaning, not just broken keyword matches.

Possibly the biggest news since sliced bread for digital humanities, historians, researchers, and genealogists?

I built this after trying to research references in The Fourth Turning. Traditional archives dumped me into faded page scans with terrible search. So I created my own.

The result: clean, summarized articles and near-perfect full-text OCR extractions + The Sleuth (your personal AI research assistant), smart categorization (24 categories / 1,000+ sub-categories), Collections for sharing, and a fun Today in History daily feed.

Quick start (10 minutes): → Tutorials

A few things I'd love your thoughts on:

  • Today in History: would you actually open this daily?

  • Search + Sleuth: how useful are semantic search and the AI assistant for your research?

  • Collections: would you use/share public collections?

Pricing: 7-day free trial. I priced it ~50% below traditional archives because we actually deliver usable, intelligent access. Product Hunt special: use PRODUCTHUNT20 for 20% off any plan (valid until May 8).

Huge technical journey. I had to figure out how to acquire, store, and process nearly a million high-resolution newspaper images, build custom multi-modal systems to detect and segment articles, massively improve OCR on centuries-old ink, train models to understand newspaper layout and context, run prompt engineering at scale, balance cost vs. quality with LLMs and vLLMs, build semantic and agentic search infrastructure that actually works on millions of documents, and scale a cost-effective GPU fleet.
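For the technically curious: here's a toy Python sketch of the core idea behind semantic search (illustrative only, not our production stack, and the function names and dimensions are made up for the example). Articles are embedded offline as vectors by a text-embedding model; a query is embedded the same way and matched by cosine similarity instead of keyword overlap.

    import numpy as np

    def top_k(query_vec, article_vecs, k=10):
        # Normalize so a plain dot product equals cosine similarity.
        q = query_vec / np.linalg.norm(query_vec)
        a = article_vecs / np.linalg.norm(article_vecs, axis=1, keepdims=True)
        # Indices of the k articles closest to the query by meaning.
        return np.argsort(-(a @ q))[:k]

    # Stand-in data: real vectors would come from an embedding model
    # and live in a vector index, not a raw in-memory array.
    rng = np.random.default_rng(0)
    articles = rng.normal(size=(100_000, 384))  # one 384-dim vector per article
    query = rng.normal(size=384)                # the embedded search query
    print(top_k(query, articles, k=5))

At archive scale you swap the brute-force scan for an approximate nearest-neighbor index, but the matching principle is the same.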

Some “AWS-ish” stats so far:

  • 115,000+ GPU GB-hours (OCR / Layouts)

  • 26,000+ Lambda GB-hours moving data around

  • 44.7 billion LLM/vLLM tokens processed

  • 7 months of 80+ hour work weeks (pure neural network compute)

Would love your honest feedback and any discoveries you make in the archive! 🫡 (here or hi@snewpapers.com)

