Stu Mason

PR #9 merged: feat: add full content extraction pipeline

Summary

Transforms DevTrends from metadata-only to comprehensive content capture. The system already had residential proxies (Webshare) and anti-detection (ScraperClient) - this PR puts them to use.

  • Quick wins: Extract content already in raw_json at fetch time (RSS content:encoded, HN self-post text, Reddit selftext)
  • Async pipeline: Job-based extraction with rate limiting for items that need fetching
  • High-value extractors: HN comment threads, GitHub READMEs, generic article content

Changes

Database & Model

  • New columns on raw_items: content, content_type, content_length, extraction_status, content_extracted_at, extraction_error
  • Helper methods: markContentExtracted(), markExtractionFailed(), markExtractionSkipped()
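
The helper methods above can be read as simple state transitions over the new columns. The following is a minimal sketch in Python rather than the project's actual Eloquent model: the status strings ("pending", "extracted", "failed", "skipped"), the method signatures, and the choice to store a character count in content_length are all assumptions, since the PR text lists only the column and method names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class RawItem:
    """Illustrative stand-in for a raw_items row; only the new columns are modelled."""

    content: Optional[str] = None
    content_type: Optional[str] = None
    content_length: Optional[int] = None
    extraction_status: str = "pending"  # assumed default; status values are not given in the PR
    content_extracted_at: Optional[datetime] = None
    extraction_error: Optional[str] = None

    def mark_content_extracted(self, content: str, content_type: str) -> None:
        # Record the content and clear any error left by an earlier failed attempt.
        self.content = content
        self.content_type = content_type
        self.content_length = len(content)  # assumed to be a character count
        self.extraction_status = "extracted"
        self.content_extracted_at = datetime.now(timezone.utc)
        self.extraction_error = None

    def mark_extraction_failed(self, error: str) -> None:
        # Keep the error message so failures can be inspected later.
        self.extraction_status = "failed"
        self.extraction_error = error

    def mark_extraction_skipped(self) -> None:
        # Skipped items have nothing worth fetching, so only the status changes.
        self.extraction_status = "skipped"
```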

Fetcher Updates (Quick Wins)

| Fetcher | Content Source |
|---|---|
| RssFetcher | content:encoded, description, summary |
| HackerNewsFetcher | text field for Ask HN/Show HN |
| RedditFetcher | selftext for self-posts |
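
The quick wins amount to reading a field that is already present in the fetched payload. Here is a rough Python sketch of that lookup, assuming the fields are tried in the order listed and the first non-empty one wins. The source keys ("rss", "hackernews", "reddit") and the function name are hypothetical, and the real fetchers may read nested or differently named keys.

```python
from typing import Any, Mapping, Optional

# Candidate fields per source, in assumed fallback order (taken from the table above).
QUICK_WIN_FIELDS = {
    "rss": ("content:encoded", "description", "summary"),
    "hackernews": ("text",),
    "reddit": ("selftext",),
}


def extract_inline_content(source: str, raw: Mapping[str, Any]) -> Optional[str]:
    """Return the first non-empty candidate field for this source, or None."""
    for field in QUICK_WIN_FIELDS.get(source, ()):
        value = raw.get(field)
        if isinstance(value, str) and value.strip():
            return value.strip()
    # Nothing usable in the payload; the item would need the async pipeline.
    return None
```

Items for which this returns None, such as link posts with an empty selftext, are the ones left for the extraction pipeline.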

Content Extraction Pipeline

  • ContentExtractionManager - Orchestrates extractors by priority
  • ExtractContentJob - Rate-limited queue job with retry logic
  • content:extract command - Manual/scheduled extraction
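
The retry side of ExtractContentJob can be pictured as follows. This is a generic Python sketch of retry with exponential backoff, not the job's real configuration: the attempt count, the delays, and the function name are assumptions, and rate limiting is left out entirely. In the actual project these concerns would normally be handled by the queue system.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,       # assumed; the PR does not state a retry count
    base_delay: float = 1.0,     # assumed; seconds before the first retry
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run task, retrying on any exception with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing sleep as a parameter keeps the sketch testable without real waiting.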

Extractors

| Extractor | Priority | Content Type | Source |
|---|---|---|---|
| HackerNewsCommentExtractor | 10 | comment_thread | Firebase API |
| GitHubReadmeExtractor | 20 | readme | GitHub API |
| ArticleExtractor | 100 | article | Readability.php |
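
Priority-based orchestration of the extractors listed above might look roughly like this. The sketch assumes that a lower priority number is tried first, which fits ArticleExtractor at 100 acting as the generic fallback, and that each extractor can say whether it supports a URL. The URL matching rules and the stubbed extract functions are hypothetical, and no network calls are made.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Tuple


@dataclass
class Extractor:
    name: str
    priority: int
    content_type: str
    supports: Callable[[str], bool]            # hypothetical URL check
    extract: Callable[[str], Optional[str]]    # stubbed; a real one would fetch content


class ContentExtractionManager:
    def __init__(self, extractors: Iterable[Extractor]) -> None:
        # Assumption: lower priority value means the extractor is tried earlier.
        self._extractors = sorted(extractors, key=lambda e: e.priority)

    def extract(self, url: str) -> Optional[Tuple[str, str]]:
        """Return (content_type, content) from the first extractor that succeeds."""
        for extractor in self._extractors:
            if not extractor.supports(url):
                continue
            content = extractor.extract(url)
            if content:
                return extractor.content_type, content
        return None


def build_demo_manager() -> ContentExtractionManager:
    # Registration order is deliberately shuffled; sorting restores priority order.
    return ContentExtractionManager([
        Extractor("ArticleExtractor", 100, "article",
                  lambda url: True, lambda url: "article body"),
        Extractor("GitHubReadmeExtractor", 20, "readme",
                  lambda url: "github.com/" in url, lambda url: "readme text"),
        Extractor("HackerNewsCommentExtractor", 10, "comment_thread",
                  lambda url: "news.ycombinator.com/item" in url,
                  lambda url: "comment thread"),
    ])
```

With this ordering, a GitHub URL is handled by the README extractor before the generic article extractor sees it, and any URL not matched by a specialised extractor falls through to the article fallback.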

Schedule

  • Hourly at :35 - High-priority items (high engagement)
  • Twice daily (5am/5pm) - Normal priority items

Test plan

  • Run migration: php artisan migrate
  • Test quick wins: Fetch RSS/HN/Reddit, verify content column populated
  • Test command: php artisan content:extract --dry-run --limit=5
  • Test sync extraction: php artisan content:extract --sync --limit=2 --source=arstechnica
  • Verify ArticleExtractor: ~1,780 chars avg from Ars Technica
  • Verify GitHubReadmeExtractor: ~1,000 chars avg from trending repos
  • Verify HackerNewsCommentExtractor: ~11,763 chars from HN discussions
  • Monitor Horizon for job processing in production
+2060 additions, -39 deletions, 39 files changed