Building a News Aggregator API as a Take-Home Assignment
I recently completed a take-home assignment for a Backend Web Developer position: build a news aggregator API in Laravel. Here's how I approached the architecture, dealt with dead APIs, and kept things simple without sacrificing good engineering.
A few days ago I got a take-home assignment: build a news aggregator backend in Laravel. Fetch articles from multiple sources, store them, expose a RESTful API with filtering. Straightforward enough on paper. In practice, half the suggested source APIs were dead or unsuitable. Here's what I built, why I made the decisions I did, and what I'd change with more time.
The Brief
The task asked for a Laravel API that:
- Pulls articles from multiple news sources on demand
- Stores them in a database with authors and tags
- Exposes endpoints to list, filter, and read articles
- Follows clean code principles
The suggested sources were NewsAPI, OpenNews, NewsCred, The Guardian, NYTimes, BBC News, and NewsAPI.org. Half of these either don't exist anymore, have no public API, or are themselves aggregators (which feels redundant when you're building one). The BBC public API has been discontinued entirely.
So I picked four real, working sources and built around those.
Source Selection
| Source | Approach | Notes |
|---|---|---|
| The Guardian | Official REST API | Free tier, great content API |
| New York Times | Official REST API | Free tier, metadata only (no full body) |
| ESPN | Unofficial API | Undocumented but stable for soccer leagues |
| BBC | Public RSS feeds | No API, but RSS is officially maintained |
For BBC I fetched 9 category feeds (Top Stories, World, Business, Technology, Science & Environment, Health, Entertainment & Arts, Politics, UK). RSS gives me everything I need except the full post content: title, summary, published date, link, and categories, all without requiring auth. It's not glamorous, but it works, and it's better than pretending the BBC API still exists.
The ESPN integration uses an undocumented internal API. I iterate over 6 leagues (Premier League, La Liga, Bundesliga, Serie A, MLS, UEFA Champions League) and pull soccer news per league. It could break if ESPN changes their internals, but for a take-home it's fine, and it was a fun find.
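The loop looks roughly like this. This is a sketch, not the actual code: the endpoint shape and league slugs are assumptions based on ESPN's widely known unofficial "site API", and `$maxItems` / the DTO mapping are stand-ins.

```php
use Illuminate\Support\Facades\Http;

// Hypothetical sketch: per-league fetch against ESPN's unofficial API.
// The endpoint is undocumented and could change without notice.
$leagues = ['eng.1', 'esp.1', 'ger.1', 'ita.1', 'usa.1', 'uefa.champions'];
$perLeague = (int) ceil($maxItems / count($leagues));

foreach ($leagues as $league) {
    $response = Http::get(
        "https://site.api.espn.com/apis/site/v2/sports/soccer/{$league}/news",
        ['limit' => $perLeague]
    );

    foreach ($response->json('articles', []) as $article) {
        // map to an ESPN DTO and insert (omitted)
    }
}
```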
Architecture
I wanted a clean separation between "how do we talk to the API" and "what do we do with the data". Three layers handle this:
Services
Each source has a dedicated service class (GuardianNewsService, NYTimesNewsService, etc.) that extends a shared NewsService base. The service is responsible for one thing: making HTTP requests and returning typed DTOs.
NewsService provides the shared insert() method that persists an article, authors, tags, pivots, in a single database transaction. All source services inherit this.
public function insert(NewsItemDTO $article): Document
{
    return DB::transaction(function () use ($article) {
        $document = Document::query()->create([...]);

        $authors = $article->authors->map(fn (AuthorDTO $author) =>
            Author::query()->firstOrCreate(['slug' => $author->slug], ['name' => $author->name])
        )->map(fn (Author $author) => $author->id);
        $document->authors()->sync($authors);

        $tags = $article->tags->mapWithKeys(function ($data) {
            $tag = Tag::query()->firstOrCreate(['slug' => $data->slug], ['title' => $data->title]);
            return [$tag->id => ['role' => $data->role]];
        });
        $document->tags()->sync($tags);

        return $document;
    });
}

DTOs
Each source maps its raw API response into a typed DTO. GuardianNewsItemDTO, NYTimesNewsItemDTO, etc. all extend the base NewsItemDTO. This means insert() doesn't know or care which source the article came from.
class NewsItemDTO
{
    public function __construct(
        public DocumentSource $sourceType,
        public string $sourceId,
        public string $title,
        public string $content,
        public Carbon $publishedAt,
        public ?string $image,
        public ?Collection $authors,
        public ?Collection $tags,
    ) {}
}

Tags carry a role enum value, KEYWORD or CATEGORY, stored on the pivot table. Guardian's API returns both keywords and sections, so this distinction matters for filtering later.
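To illustrate the mapping step, a Guardian item might be converted like this. A sketch only: the `fromApi` factory, the `AuthorDTO`/`TagDTO` constructors, and the `TagRole` enum name are assumptions; the response field names (`webTitle`, `webPublicationDate`, `fields.body`, tag `type`s) come from the Guardian Content API.

```php
class GuardianNewsItemDTO extends NewsItemDTO
{
    // Hypothetical factory; field names follow the Guardian Content API.
    public static function fromApi(array $item): self
    {
        return new self(
            sourceType: DocumentSource::GUARDIAN,
            sourceId: $item['id'],
            title: $item['webTitle'],
            content: $item['fields']['body'] ?? '',
            publishedAt: Carbon::parse($item['webPublicationDate']),
            image: $item['fields']['thumbnail'] ?? null,
            authors: collect($item['tags'] ?? [])
                ->where('type', 'contributor')
                ->map(fn (array $t) => new AuthorDTO(
                    slug: Str::slug($t['webTitle']),
                    name: $t['webTitle'],
                )),
            tags: collect($item['tags'] ?? [])
                ->where('type', 'keyword')
                ->map(fn (array $t) => new TagDTO(
                    slug: $t['id'],
                    title: $t['webTitle'],
                    role: TagRole::KEYWORD,
                )),
        );
    }
}
```

Because every source produces the same base DTO shape, the shared `insert()` never needs source-specific logic.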
Sources
Sources implement a single ISource interface with one method: fetch(?int $maxItems). The source handles pagination, deduplication, and per-article error isolation.
interface ISource
{
    public function fetch(?int $maxItems = null): void;
}

Each source queries the database for existing IDs before inserting, so re-running the fetch command is safe. Errors on individual articles are logged and skipped; one bad article doesn't abort the whole batch.
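The dedup-and-isolate loop described above might look like the following (a sketch with assumed names; not the actual implementation):

```php
// Look up already-stored IDs so re-running the command is idempotent.
$existing = Document::query()
    ->where('source_type', DocumentSource::GUARDIAN->value)
    ->whereIn('source_id', $items->pluck('sourceId'))
    ->pluck('source_id')
    ->all();

foreach ($items as $dto) {
    if (in_array($dto->sourceId, $existing, true)) {
        continue; // already stored
    }

    try {
        $this->service->insert($dto);
    } catch (\Throwable $e) {
        // One bad article is logged and skipped, not fatal to the batch.
        Log::warning('Skipping article', [
            'source_id' => $dto->sourceId,
            'error'     => $e->getMessage(),
        ]);
    }
}
```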
Pagination strategies differ by source:
- Guardian: cursor-based, using the latest stored source_id as the starting point
- NYTimes: page-based, hard-limited to 1000 results by the API
- ESPN: page-based per league, distributes maxItems evenly across leagues
- BBC: no pagination needed; RSS always returns the current items
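The Guardian's "latest stored source_id as the starting point" idea can be sketched like this (assumed names, including the hypothetical `fetchPage()` helper; not the actual code):

```php
// Page forward through newest-first results until we hit the
// newest article we already have in the database.
$latestStored = Document::query()
    ->where('source_type', DocumentSource::GUARDIAN->value)
    ->latest('published_at')
    ->value('source_id');

$fetched = 0;
$page = 1;
while ($fetched < $maxItems) {
    $items = $this->service->fetchPage($page++); // hypothetical helper

    if ($items->isEmpty()) {
        break; // no more results from the API
    }

    foreach ($items as $dto) {
        if ($dto->sourceId === $latestStored) {
            return; // caught up with what's already stored
        }
        $this->service->insert($dto);
        $fetched++;
    }
}
```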
Contracts
Every service is bound to an interface in the IoC container. IGuardianNewsService, INYTimesNewsService, etc. This keeps sources decoupled from service implementations and makes mocking in tests straightforward.
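The bindings amount to a few lines in a service provider. The interface and class names below follow the pattern described above; the exact wiring is an assumption.

```php
class AppServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        // One interface-to-implementation binding per source service.
        $this->app->bind(IGuardianNewsService::class, GuardianNewsService::class);
        $this->app->bind(INYTimesNewsService::class, NYTimesNewsService::class);
        $this->app->bind(IESPNNewsService::class, ESPNNewsService::class);
        $this->app->bind(IBBCNewsService::class, BBCNewsService::class);
    }
}
```

In tests, swapping a binding for a mock is then a one-liner with `$this->app->bind()` or `$this->mock()`.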
The CLI Command
php artisan app:fetch --source=guardian --max-items=200

The DocumentSource enum drives source resolution:
enum DocumentSource: string
{
    case GUARDIAN = 'guardian';
    case NYTIMES = 'nytimes';
    case ESPN = 'espn';
    case BBC = 'bbc';

    public function getHandler(): ISource
    {
        return app(match ($this) {
            DocumentSource::GUARDIAN => Guardian::class,
            DocumentSource::NYTIMES => NYTimes::class,
            DocumentSource::ESPN => ESPN::class,
            DocumentSource::BBC => BBC::class,
        });
    }
}

Adding a new source means: new service, new source class, new enum case. Nothing else changes.
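With the enum doing the resolution, the command itself reduces to a few lines; roughly (a sketch, not the actual handle() method):

```php
public function handle(): int
{
    // Resolve the enum case from --source; DocumentSource::from()
    // throws on an unknown value, which is the desired behavior.
    $source = DocumentSource::from($this->option('source'));
    $maxItems = $this->option('max-items');

    $source->getHandler()->fetch($maxItems !== null ? (int) $maxItems : null);

    return self::SUCCESS;
}
```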
The API
The document list endpoint is powered by spatie/laravel-query-builder, which gives the client control over what data comes back and how it's filtered, all from query parameters, with no extra controller code.
Filtering: custom filter classes handle each concern:
- filter[title]: partial title match
- filter[tag-slug]: articles tagged with a given slug
- filter[author-slug]: articles by a given author
- filter[source-type]: by source (guardian, nytimes, espn, bbc)
- filter[published-from] / filter[published-to]: date range
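Wired together, the list action might look like this. The `QueryBuilder` and `AllowedFilter` calls are spatie/laravel-query-builder's real API; the custom filter class names are assumptions matching the filters above.

```php
public function index(): AnonymousResourceCollection
{
    $documents = QueryBuilder::for(Document::class)
        ->allowedFilters([
            AllowedFilter::partial('title'),
            AllowedFilter::custom('tag-slug', new TagSlugFilter()),
            AllowedFilter::custom('author-slug', new AuthorSlugFilter()),
            AllowedFilter::exact('source-type', 'source_type'),
            AllowedFilter::custom('published-from', new PublishedFromFilter()),
            AllowedFilter::custom('published-to', new PublishedToFilter()),
        ])
        ->allowedFields(['title', 'slug', 'published_at', 'content', 'image'])
        ->allowedIncludes(['authors', 'tags'])
        ->allowedSorts(['published_at', 'title'])
        ->defaultSort('-published_at')
        ->paginate();

    return DocumentResource::collection($documents);
}
```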
Field selection: the client can request only the columns it needs:
GET /api/documents?fields[documents]=title,slug,published_at

By default, the content column is excluded from list responses (articles can be large). It's only returned on the single-document endpoint or when the client explicitly requests it via fields.
Includes: relationships are opt-in, not always loaded:
GET /api/documents?include=authors,tags

Sorting: prefix with - for descending:
GET /api/documents?sort=-published_at

The result is a flexible, frontend-friendly API where the client fetches exactly what it needs, nothing more.
OpenAPI docs are auto-generated by dedoc/scramble, no manual spec maintenance needed.
Performance: Laravel Octane
The app runs on Laravel Octane. The difference from a standard PHP setup is worth explaining.
With a traditional PHP-FPM setup, every HTTP request boots Laravel from scratch, loads the framework, registers service providers, resolves bindings, then handles the request and discards everything. That bootstrap cost is paid on every single request.
Octane starts Laravel once and keeps it alive as a long-running process. Requests are handled by the already-booted application, so the framework overhead is paid once at startup, not per request. The result is significantly lower CPU and memory usage under load, and much higher throughput: the same server handles more concurrent requests without needing more resources.
For a news API where the application state is read-heavy and mostly stateless between requests, this is a straightforward win.
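Starting the server is a single command. The flags below are real octane:start options, but the server choice and worker count here are illustrative, not the project's actual settings.

```shell
# Octane can run on Swoole, RoadRunner, or FrankenPHP; worker count
# is typically tuned to CPU cores, and --max-requests recycles
# workers periodically to guard against memory leaks.
php artisan octane:start --server=swoole --workers=4 --max-requests=500
```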
Database Schema
documents : source_type, source_id, title, content, slug, image, published_at
authors : name, slug
tags : title, slug
document_author: pivot (cascades on delete)
document_tag : pivot with role enum (KEYWORD / CATEGORY)

Authors and tags are deduplicated by slug using firstOrCreate. The same author appearing across multiple articles shares one row.
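The document_tag pivot from the schema above could be created with a migration along these lines (a sketch; exact column types are assumed, with role stored as a string backed by the enum):

```php
Schema::create('document_tag', function (Blueprint $table) {
    $table->foreignId('document_id')->constrained()->cascadeOnDelete();
    $table->foreignId('tag_id')->constrained()->cascadeOnDelete();
    $table->string('role'); // KEYWORD or CATEGORY, from the role enum
    $table->primary(['document_id', 'tag_id']);
});
```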
Testing
Tests are split by layer:
- Service tests: MockHandler for Guzzle, no real HTTP. Tests that the service maps API responses to correct DTOs.
- Source tests: mock the service, test pagination logic, deduplication, error handling.
- Controller tests: full HTTP feature tests with RefreshDatabase on in-memory SQLite.
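A service test in this style might look like the following. MockHandler and HandlerStack are Guzzle's real test utilities; the fixture path, the `fetchPage()` helper, and the class names are assumptions.

```php
public function test_maps_guardian_response_to_dtos(): void
{
    // Queue a canned response; no real HTTP leaves the test.
    $mock = new MockHandler([
        new Response(200, [], file_get_contents(__DIR__.'/fixtures/guardian.json')),
    ]);
    $client = new Client(['handler' => HandlerStack::create($mock)]);

    $service = new GuardianNewsService($client);
    $items = $service->fetchPage(1); // hypothetical helper

    $this->assertInstanceOf(GuardianNewsItemDTO::class, $items->first());
    $this->assertCount(0, $mock); // every queued response was consumed
}
```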
Several bugs surfaced only because of this coverage: the authors() relation pointed at the wrong pivot table name, two filter classes called whereHas with singular relation names, and a form request read the wrong route parameter. All caught and fixed. This is exactly why you write tests.
What I'd Do Differently
Scheduler: Right now you run the fetch command manually. In production you'd schedule it with php artisan schedule:work or a cron. Trivial to add; I left it out to keep the scope clean.
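In Laravel 11+ that's one line per source in routes/console.php (a sketch; the hourly cadence is illustrative):

```php
use Illuminate\Support\Facades\Schedule;

// Stagger or tune per source as needed; hourly is just an example.
Schedule::command('app:fetch --source=guardian')->hourly();
Schedule::command('app:fetch --source=nytimes')->hourly();
Schedule::command('app:fetch --source=espn')->hourly();
Schedule::command('app:fetch --source=bbc')->hourly();
```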
Queue: Fetching hundreds of articles per source is currently synchronous. For production, each source fetch should be a queued job.
Full article body for NYTimes: The NYT API doesn't return article body text, only abstracts. A real implementation would scrape or use a different endpoint. I stored what the API gives.
ESPN stability: The unofficial API could change without notice. Guardian and NYTimes have stable, documented APIs. Worth replacing ESPN with a proper source if this went to production.
Closing
The assignment took 2 days. The interesting part wasn't the Laravel plumbing (that's routine); it was the source selection problem. Half the suggested APIs don't exist anymore, and BBC RSS is a legitimate engineering decision, not a cop-out. Sometimes the right answer to "integrate with X API" is "X API is gone; here's what I did instead and why."
The code is on GitHub: itsmattius/briefing