Stu Mason

Activity

Pull Request Merged

PR #151 merged: fix(fetchers): swap dead Reddit .json endpoint for /.rss

Summary

  • Reddit's unauthenticated .json endpoint has returned 403 globally for ~6 days. Confirmed live through every route tried: rotating residential proxy, US/GB country-locked proxy, sticky-session proxy, and direct (no proxy) from the server IP. Every variant returns the same 190KB anti-bot HTML page. Webshare itself is healthy (rotation verified via httpbin).
  • The Atom feed at /r/<sub>/.rss still returns 200 OK with the post list, so this PR routes RedditFetcher through it as a stopgap. URL firehose + cross-platform matching come back online; score-based features (Predictions, breakout) cleanly skip RSS rows until OAuth lands.
  • Fixes a silent-failure bug that hid the outage: the previous fetcher returned [] on non-2xx and let the runner mark the run success with items_fetched=0. It now throws so the run gets marked failed with the real HTTP status.
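The silent-failure fix described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: FetchFailedException, bodyOrFail() and runFetch() are hypothetical names, and the shape of the run record is assumed from the fetch_runs columns mentioned in this description.

```php
<?php

declare(strict_types=1);

// Hypothetical exception type; the PR does not name the one it throws.
final class FetchFailedException extends RuntimeException {}

/**
 * Return the body for a 2xx response; throw otherwise, so the caller can
 * mark the run failed instead of recording an empty but "successful" run.
 */
function bodyOrFail(int $status, string $body, string $url): string
{
    if ($status < 200 || $status >= 300) {
        throw new FetchFailedException(
            sprintf('Fetch failed: HTTP %d from %s', $status, $url),
            $status
        );
    }

    return $body;
}

/** Hypothetical runner step: record success, or the real failure reason. */
function runFetch(callable $fetch): array
{
    try {
        $items = $fetch();

        return ['status' => 'success', 'items_fetched' => count($items), 'error_message' => null];
    } catch (FetchFailedException $e) {
        return ['status' => 'failed', 'items_fetched' => 0, 'error_message' => $e->getMessage()];
    }
}
```

With the old behaviour (returning [] on a 403), runFetch() would have reported success with items_fetched=0; with the throw, the HTTP status ends up in error_message.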

What changes

  • RedditFetcher::fetch() rewrites the stored .json URL to /.rss on the fly (no DB migration needed).
  • Atom parsing pulls id (strips t3_ prefix), link[href], title, author/name, published/updated.
  • raw_json deliberately omits score, num_comments, upvote_ratio. Every Predictions and breakout SQL query filters on jsonb_exists(raw_json, 'score'), so those features skip these rows naturally instead of seeing zeros and ranking everything dead last.
  • Proxy still attached via getScraperClient(withProxy: true) — defensive in case Reddit eventually blocks server IPs from RSS too.
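The URL rewrite and Atom parsing listed above might look something like the sketch below. It is an assumption-laden illustration rather than the merged code: toRssUrl() and parseAtomEntries() are hypothetical helpers, the exact form of the stored URLs is guessed, and falling back from published to updated is one reading of "published/updated".

```php
<?php

declare(strict_types=1);

/** Swap a trailing ".json" (optionally followed by a query string) for "/.rss". */
function toRssUrl(string $url): string
{
    return preg_replace('#/?\.json(?=\?|$)#', '/.rss', $url, 1) ?? $url;
}

/**
 * Parse Atom <entry> elements into rows. The rows deliberately carry no
 * score, num_comments or upvote_ratio keys, since the feed has none.
 */
function parseAtomEntries(string $xml): array
{
    libxml_use_internal_errors(true); // report bad XML via the return value, not warnings
    $feed = simplexml_load_string($xml);
    if ($feed === false) {
        throw new RuntimeException('Feed is not valid XML');
    }

    $rows = [];
    foreach ($feed->entry as $entry) {
        $published = (string) $entry->published;

        $rows[] = [
            'id' => preg_replace('/^t3_/', '', (string) $entry->id), // strip the t3_ prefix
            'url' => (string) $entry->link['href'],
            'title' => (string) $entry->title,
            'author' => (string) $entry->author->name,
            // Assumption: prefer <published>, fall back to <updated>.
            'created_at' => $published !== '' ? $published : (string) $entry->updated,
        ];
    }

    return $rows;
}
```

Because the rewrite happens at fetch time, the stored source URLs can stay as they are, which is why no migration is needed.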

What we lose until OAuth lands

  • Reddit score velocity → no Reddit predictions fire (correct degradation, no false positives).
  • Comment counts, self-post bodies, flair.

What we keep

  • Titles, URLs, timestamps, author, subreddit, cross-platform URL matching, opportunity finder lite.

Test plan

  • php artisan test --filter=RedditFetcher — 5 tests, 15 assertions, all green.
  • Wider sweep (--filter='Reddit|Predictions|Fetch') — 60 passed, 177 assertions.
  • vendor/bin/pint --dirty clean.
  • After merge + deploy: confirm reddit_* fetch_runs start reporting non-zero items_fetched again. Watch fetch_runs for any newly-failed ones with informative error_message (silent-failure fix should surface real errors loudly now).
  • Confirm Predictions / breakout queries still skip Reddit rows (they should — jsonb_exists filter).
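The skip behaviour that the last check relies on can be mimicked in plain PHP. The real filter is the jsonb_exists(raw_json, 'score') condition in SQL; the sketch below only illustrates why a missing key behaves differently from a zero, and rankableRows() plus the row shape are hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Keep only rows whose raw_json has a "score" key at all, then rank by score.
 * A row with score 0 is kept and ranked last; a row without the key is skipped.
 */
function rankableRows(array $rows): array
{
    $scored = array_filter(
        $rows,
        fn (array $row): bool => array_key_exists('score', $row['raw_json'])
    );

    // usort() reindexes the filtered array and orders it by descending score.
    usort($scored, fn (array $a, array $b): int => $b['raw_json']['score'] <=> $a['raw_json']['score']);

    return $scored;
}
```

Had the RSS rows been stored with score set to 0, they would pass the filter and sink to the bottom of every ranking; leaving the key out removes them from the ranking entirely.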

Follow-up

Register a Reddit OAuth "script" app and add a token-refreshing RedditOAuthFetcher for full data (score, comments, upvote_ratio). Blocked on Reddit's developer registration gate.
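One possible shape for the token-refreshing part of that follow-up is sketched below. Nothing here comes from the PR: RefreshingToken is a hypothetical class, the actual token request is left as an injected closure, and the access_token / expires_in keys are assumed to follow the usual OAuth 2.0 token response fields.

```php
<?php

declare(strict_types=1);

/** Caches an access token and requests a new one shortly before it expires. */
final class RefreshingToken
{
    private ?string $token = null;
    private int $expiresAt = 0;

    /**
     * @param Closure $requestToken hypothetical callback returning
     *                              ['access_token' => string, 'expires_in' => int]
     */
    public function __construct(
        private Closure $requestToken,
        private int $leewaySeconds = 60,
    ) {}

    /** Return a usable token for the given Unix timestamp, refreshing if needed. */
    public function get(int $now): string
    {
        if ($this->token === null || $now >= $this->expiresAt - $this->leewaySeconds) {
            $response = ($this->requestToken)();
            $this->token = $response['access_token'];
            $this->expiresAt = $now + $response['expires_in'];
        }

        return $this->token;
    }
}
```

Passing the clock in as $now keeps the refresh logic testable without waiting for a real token to expire.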

+228 additions, -72 deletions, 2 files changed