trends.stumason.dev
TypeScript
PR #9 merged: feat: add full content extraction pipeline
Summary
Transforms DevTrends from metadata-only to comprehensive content capture. The system already had residential proxies (Webshare) and anti-detection (ScraperClient) - this PR puts them to use.
- Quick wins: Extract content already in `raw_json` at fetch time (RSS content:encoded, HN self-post text, Reddit selftext)
- Async pipeline: Job-based extraction with rate limiting for items that need fetching
- High-value extractors: HN comment threads, GitHub READMEs, generic article content
Changes
Database & Model
- New columns on `raw_items`: `content`, `content_type`, `content_length`, `extraction_status`, `content_extracted_at`, `extraction_error`
- Helper methods: `markContentExtracted()`, `markExtractionFailed()`, `markExtractionSkipped()`
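A rough sketch of how these helpers might move an item between extraction states. It is written in TypeScript for illustration only; the PR's model code is not shown here. The status values (`pending`, `extracted`, `failed`, `skipped`), the column types, the helper signatures, and the `pendingItem()` factory are all assumptions.

```typescript
// Hypothetical status values; the PR names the column but not its values.
type ExtractionStatus = "pending" | "extracted" | "failed" | "skipped";

// Mirrors the new raw_items columns; the types are assumed.
interface RawItemContent {
  content: string | null;
  content_type: string | null;
  content_length: number | null;
  extraction_status: ExtractionStatus;
  content_extracted_at: Date | null;
  extraction_error: string | null;
}

// Hypothetical factory for an item that has not been processed yet.
function pendingItem(): RawItemContent {
  return {
    content: null,
    content_type: null,
    content_length: null,
    extraction_status: "pending",
    content_extracted_at: null,
    extraction_error: null,
  };
}

// Store the content and clear any error left by an earlier failed attempt.
function markContentExtracted(
  item: RawItemContent,
  content: string,
  contentType: string,
  now: Date = new Date(),
): RawItemContent {
  return {
    ...item,
    content,
    content_type: contentType,
    content_length: content.length, // length in UTF-16 code units
    extraction_status: "extracted",
    content_extracted_at: now,
    extraction_error: null,
  };
}

// Record the failure reason without touching previously stored content.
function markExtractionFailed(item: RawItemContent, error: string): RawItemContent {
  return { ...item, extraction_status: "failed", extraction_error: error };
}

// Flag items for which no extraction should be attempted.
function markExtractionSkipped(item: RawItemContent): RawItemContent {
  return { ...item, extraction_status: "skipped" };
}
```

Returning a new object rather than mutating keeps the sketch easy to test; a real model would more likely update the row in place and save it.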
Fetcher Updates (Quick Wins)
| Fetcher | Content Source |
|---|---|
| RssFetcher | content:encoded, description, summary |
| HackerNewsFetcher | text field for Ask HN/Show HN |
| RedditFetcher | selftext for self-posts |
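The quick-win lookup in the table can be sketched as a per-source list of candidate fields. This is an illustrative TypeScript sketch, not the fetchers' actual code: the field names come from the table, while the flat payload shape, the first-non-empty precedence, and the `extractInlineContent` name are assumptions.

```typescript
type Source = "rss" | "hackernews" | "reddit";

// Candidate payload fields per source, in the order the table lists them.
const CONTENT_FIELDS: Record<Source, string[]> = {
  rss: ["content:encoded", "description", "summary"],
  hackernews: ["text"],
  reddit: ["selftext"],
};

// Return the first non-empty string field, or null when the payload
// carries no inline content and the item needs the async pipeline.
function extractInlineContent(
  source: Source,
  raw: Record<string, unknown>,
): string | null {
  for (const field of CONTENT_FIELDS[source]) {
    const value = raw[field];
    if (typeof value === "string" && value.trim() !== "") {
      return value.trim();
    }
  }
  return null;
}
```

A `null` result here would correspond to an item that still needs fetching by one of the extractors.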
Content Extraction Pipeline
- `ContentExtractionManager` - Orchestrates extractors by priority
- `ExtractContentJob` - Rate-limited queue job with retry logic
- `content:extract` command - Manual/scheduled extraction
Extractors
| Extractor | Priority | Content Type | Source |
|---|---|---|---|
| HackerNewsCommentExtractor | 10 | comment_thread | Firebase API |
| GitHubReadmeExtractor | 20 | readme | GitHub API |
| ArticleExtractor | 100 | article | Readability.php |
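One way to read the priority column is that lower numbers are tried first, with the generic `ArticleExtractor` as the fallback. The TypeScript sketch below illustrates that selection under this assumption; the `supports()` rules, the `Item` shape, and `pickExtractor` are hypothetical and are not taken from the PR.

```typescript
interface Item {
  source: string;
  url: string;
}

interface Extractor {
  name: string;
  priority: number; // assumed: lower value is tried first
  contentType: string;
  supports(item: Item): boolean;
}

// Names, priorities, and content types come from the table above;
// the matching rules are placeholders.
const extractors: Extractor[] = [
  {
    name: "ArticleExtractor",
    priority: 100,
    contentType: "article",
    supports: () => true, // generic fallback
  },
  {
    name: "HackerNewsCommentExtractor",
    priority: 10,
    contentType: "comment_thread",
    supports: (item) => item.source === "hackernews",
  },
  {
    name: "GitHubReadmeExtractor",
    priority: 20,
    contentType: "readme",
    supports: (item) => item.url.startsWith("https://github.com/"),
  },
];

// Sort a copy by ascending priority and take the first extractor that matches.
function pickExtractor(candidates: Extractor[], item: Item): Extractor | null {
  const ordered = [...candidates].sort((a, b) => a.priority - b.priority);
  return ordered.find((e) => e.supports(item)) ?? null;
}
```

Under these rules a Hacker News item that links to a GitHub repository would go to the comment extractor, since priority 10 is checked before 20.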
Schedule
- Hourly at :35 - High-priority items (high engagement)
- Twice daily (5am/5pm) - Normal priority items
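In cron terms, the schedule above would correspond to roughly `35 * * * *` and `0 5,17 * * *`. The TypeScript sketch below checks which batch is due at a given minute; it assumes the twice-daily run fires on the hour and that times are UTC, neither of which the PR states. The `Batch` labels and `dueBatches` are hypothetical names.

```typescript
type Batch = "high-priority" | "normal-priority";

// Return the batches whose schedule matches the given minute (UTC assumed).
function dueBatches(now: Date): Batch[] {
  const due: Batch[] = [];
  const hour = now.getUTCHours();
  const minute = now.getUTCMinutes();

  // Hourly at :35 (cron: 35 * * * *)
  if (minute === 35) {
    due.push("high-priority");
  }
  // 5am and 5pm, assumed to fire at :00 (cron: 0 5,17 * * *)
  if (minute === 0 && (hour === 5 || hour === 17)) {
    due.push("normal-priority");
  }
  return due;
}
```

Offsetting the hourly run to :35 means the two schedules never land on the same minute.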
Test plan
- Run migration: `php artisan migrate`
- Test quick wins: Fetch RSS/HN/Reddit, verify `content` column populated
- Test command: `php artisan content:extract --dry-run --limit=5`
- Test sync extraction: `php artisan content:extract --sync --limit=2 --source=arstechnica`
- Verify ArticleExtractor: ~1,780 chars avg from Ars Technica
- Verify GitHubReadmeExtractor: ~1,000 chars avg from trending repos
- Verify HackerNewsCommentExtractor: ~11,763 chars from HN discussions
- Monitor Horizon for job processing in production
+2060 additions, -39 deletions, 39 files changed