From Ingestion to Analytics: How Witnes Processes Web Performance Data

At Witnes, we’ve built a data processing pipeline that transforms raw browser telemetry into actionable performance insights. This post walks through our architecture, from the moment a user visits a page to when developers see diagnostic insights in their dashboard. At Witnes we have build a processing pipeline that focuses on short to mid term viability (scalability goal is up to 400 paying customers) that can easily ingest our tracker’s telemetry data and transform them into actionable performance insights. In this post we will go over the short history of our data processing architecture which will invovle the infrastructure decision and the code design choices that lead to our targeted dashboards.

The Medallion Architecture

The first first technical decision was to adopt a medallion architecture for our data processing pipeline. This design pattern organizes data into three layers: Bronze, Silver, and Gold. Each layer serves a specific purpose in the transformation process, allowing us to maintain data integrity while enabling complex transformations and analytics.

Bronze Layer: Raw Ingestion

The Bronze layer captures raw telemetry exactly as received from the browser. When the tracker script (w.js) sends performance data, our ingestion endpoint stores it as-is in the metrics_bronze table. This preserves the original structure and metadata without any transformation.

Key features:

Stores complete browser payloads in JSONB columns
Maintains full fidelity of original data
Serves as the immutable source of truth for all processing

By having an immutable Bronze layer, we can always reprocess data if our transformation logic changes due to algorithm improvements or fixing bugs. Our current ‘cold’ storage architecture which we will discuss on another post allows us to keep this raw data for a long period of time and always rehydrate it when needed.

Silver Layer: Structured Transformation

The Silver layer processes raw Bronze data into structured, queryable entities. Our SilverService performs complex transformations such as our Performance metric extraction or the Session Stitching logic. To be very honest sometimes the decision if a ‘logic’ should be placed in the Silver or Gold layer is a bit blurry but we try to keep the Silver layer focused on data transformations and the Gold layer focused on applying diagnostic intelligence as a rule of thumb. In addition as we intend for the Gold layer to be less heavy, we reduce the amount of data that gets sent to the Gold layer - unless we need it for the diagnostic intelligence. For example the waterfal dump of the page load is only stored in the Silver layer as the metrics that we extract from it have already been computed and sent to the Gold layer. This way we reduce the heavy hitting of only 1 layer and as deep dives are less frequent than the general performance metrics, we can keep the Gold layer more lightweight and focused on the diagnostic intelligence.

Gold Layer: Diagnostic Classification

The Gold layer applies Witnes’ diagnostic intelligence to classify performance issues. Our GoldService analyzes Silver data to determine topics such as User Experience Sentiment, Backend vs. Frontend Bottlenecks which are all based on the metrics that Silver has computed. This layer is where we classify performance issues and provide actionable insights to developers.

This produces the metrics_gold table with diagnostic classifications that power our UI.

Continuous Aggregation with TimescaleDB (and a bit of history)

Before adopting TimescaleDB continuous aggregates, we maintained separate aggregation tables that we had to manually update through scheduled jobs. This meant dealing with multiple tables, complex JOINs, and timing issues between real-time data collection and batch aggregation processes. We had to handle data consistency across these different tables and we were not fond of that and specifically the query complexity that arose with it.

With TimescaleDB continuous aggregates, we eliminated these problems. Instead of managing scheduled jobs to aggregate our analytics data, we could define materialized views that would automatically update and provide consistent data.

When you query a TimescaleDB CA, it combines pre-aggregated data with the most recent raw data from the underlying tables on its own without you having to worry if the aggregate is up to date or not.

Therefore the use of TimescaleDB continuous aggregates instead of application-level aggregation jobs resulted in:

Real-time query results that include both historical aggregates and recent data
Elimination of custom C# aggregation jobs

We split the CAs into different categories such as performance, sentiment, device, channel etc in order to keep the context of each separate and any changes only affecting the relevant CA.

One caveat of using CAs is that when you want to introduce a new column you have to drop and recreate the CA which can be quite problematic if you no longer have the raw data and that is why we had to come up with our ‘cold’ storage of the Bronze layer which allows us to always recalculate the CAs.

Lastly one key important aspect of TimescaleDB is the native columnstore compression which we use for the medalion tables. With a service that focuses on storing more and more data points and have them available for analysis for long periods of time (at the moment our goal is 4 years, but would like to increase it in the future) compression is a key aspect of our architecture. You can read more about this here TimescaleDB Compression. Spoilers the compression also helps keep the memory footprint small which was a key deciding factor over ClickHouse.

Why TimescaleDB Over ClickHouse?

We evaluated several options for our analytics backend, including ClickHouse. However, we ultimately chose TimescaleDB for several key reasons:

Self-Hosting Simplicity: As a single-person team that self-hosts everything, we wanted to minimize operational complexity. ClickHouse is RAM-intensive and requires significant operational overhead to run reliably in production.
Single Database Architecture: We already had a solid PostgreSQL foundation and didn’t want to add the operational complexity of maintaining another database system. TimescaleDB runs as a PostgreSQL extension, eliminating the need for separate infrastructure.
Memory Efficiency: ClickHouse’s columnar architecture is memory-hungry, which didn’t align with our self-hosting requirements. TimescaleDB, running on PostgreSQL, provides better memory efficiency for our use case.
Reduced Infrastructure: By choosing TimescaleDB over ClickHouse, we avoided adding another complex system to our stack. This keeps our infrastructure simpler and more maintainable.
Own Knowledge: TimescaleDB has already been familiar to us therefore although it would be great to learn yet another tool, for this project it was preferred to avoid taking extra risks.

Why This Architecture Matters

To Witnes this architecture apart from the technical parts provides one very key aspect, Organization. By keep distinct layers of processing we can keep the codebase and our minds more lean. Each layer has a clear responsibility and we can easily reason about where to place new logic or how to debug existing logic. This is especially important as we are a single-person team and we want to keep the codebase as maintainable as possible.

In addition, this design enables key capabilities to the design philosophy of Witnes:

Immutable source data: Raw telemetry is preserved forever in Bronze, allowing reprocessing. i.e if we mess up we can fix things for years in the past.
Independent processing layers: Each layer communicates with its own contracts, this hopefully will reduce oupsies and allow individual layer edits.
Scalable analytics: TimescaleDB continuous aggregates provide fast, consistent analytics performance, somethig important when you cant infinitely scale your infrastructure.

Ingestion Endpoint

We left last the topic of the ingestion endpoint as it might seem the least complex. We currently run a single API server that handles just the ingestion of the events and sends them over to a Queue that then gets processed. We have 2 events, the main event and the patch event, which means that we have a stricky approach per navigationId so that messages do not go to different consumers as these events might come unordered and a basic retry mechanism works but only because the consumer is the same. In the future using an on-edge endpoint on Bunny CDN or similar will allow us to reduce the latency of ingestion worldwide.

We hope this architecture will scale your needs as we grow!