Scrub: Real-Time Insights into Complex Distributed Systems

At Turn, we build highly complex distributed systems. Because these systems are critical, we need to be able to look deep into the internals of the application logic at runtime to understand why the system is behaving the way it is. To do this, we built Scrub: a tool to interactively analyze the state of distributed systems with minimal disruption.

Given the mixture of legacy code, critical financial transactions and the ever-growing complexity of our partner relationships, the range of insights we need on a daily basis is very broad. In an ideal world, we would be able to store, index and search every request that arrives at our servers, the result of every computation triggered by these requests, and every RPC invoked to carry out these computations. This would give us perfect visibility into the workings of the system.

Unfortunately, this approach will not scale. At more than three million QPS, we could easily collect multiple petabytes of raw data every day. Besides the prohibitively expensive storage cost, gaining insights from this data is non-trivial: even a thousand-node Hadoop cluster would need at least a few minutes for each query. For debugging issues, we need insights quickly.
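As a rough back-of-the-envelope check of the "multiple petabytes" figure, the calculation below assumes about 10 KB of recorded data per request (request, computation results and RPC payloads combined). That per-request size is an illustrative assumption, not a number from Turn.

```latex
\begin{align*}
3 \times 10^{6}\ \tfrac{\text{requests}}{\text{s}} \times 86{,}400\ \tfrac{\text{s}}{\text{day}}
  &= 2.592 \times 10^{11}\ \tfrac{\text{requests}}{\text{day}} \\
2.592 \times 10^{11}\ \tfrac{\text{requests}}{\text{day}} \times 10^{4}\ \tfrac{\text{bytes}}{\text{request}}
  &= 2.592 \times 10^{15}\ \tfrac{\text{bytes}}{\text{day}} \approx 2.6\ \text{PB/day}
\end{align*}
```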

A Practical Middle-Ground: Scrub

Scrub models a distributed platform as a streaming database. There are two main parts to Scrub. The first is a client library, which provides APIs and annotations (for Java clients) to instrument different parts of the application code. The instrumentation tells the Scrub client which objects should be made visible to Scrub users, and at what point in the code an object can safely be handed off from application threads to Scrub's thread pool for processing and transmission to the second part: the aggregation servers.

These objects are emitted from the various services in your application cluster to a Kafka topic, which is read by an array of consumers that aggregate the objects and store them in an HBase table. We have tried to ensure that developers don't have to make many code changes to integrate Scrub. The Scrub instrumentation API resembles that of any commonly used logging framework (such as SLF4J). Instead of taking simple strings, Scrub’s log() method accepts any Java object. Each object can be thought of as a “database table” whose “columns” are the fields of its class. Every log() method invocation creates a separate stream of objects of the same type, which can then be queried with Scrub’s query language.

In the figure above, Scrub is used to look at raw requests from an ad exchange, the RPC calls made to an ad server, the responses from these ad servers, and the various computations occurring within the ad server.
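To make the "object as table, fields as columns" idea concrete, here is a minimal sketch of what a logger-style call that accepts arbitrary objects might look like. The names ScrubRowSketch, toRow and AdRequest are hypothetical and chosen for illustration; this is not Scrub's actual API, and the use of reflection to discover columns is an assumption.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event type: its fields play the role of table columns.
final class AdRequest {
    final String exchange;
    final String country;
    final double bidFloor;

    AdRequest(String exchange, String country, double bidFloor) {
        this.exchange = exchange;
        this.country = country;
        this.bidFloor = bidFloor;
    }
}

// Hypothetical stand-in for the Scrub client; not Turn's actual API.
final class ScrubRowSketch {
    // Flattens an object into column-name -> value pairs using reflection.
    static Map<String, Object> toRow(Object event) {
        Map<String, Object> row = new TreeMap<>(); // sorted by column name
        for (Field field : event.getClass().getDeclaredFields()) {
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue; // only instance fields act as columns
            }
            try {
                field.setAccessible(true);
                row.put(field.getName(), field.get(event));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException("cannot read " + field.getName(), e);
            }
        }
        return row;
    }

    // Logger-style entry point that accepts any object instead of a string.
    static void log(Object event) {
        System.out.println(event.getClass().getSimpleName() + " " + toRow(event));
    }
}

final class ScrubRowDemo {
    public static void main(String[] args) {
        ScrubRowSketch.log(new AdRequest("exchange-a", "US", 0.25));
    }
}
```

In this sketch the call site looks like an ordinary logging statement, which is the property the post emphasizes: instrumenting code should require very little change.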

Optimizations: Queries, TTLs and Sampling (Or How Scrub Avoids Doing Any Work)

Scrub does not emit any data from a service until a user asks for it. Until a query is registered for a particular stream, all "log-to-scrub" calls in application code are effectively no-ops. This allows us to instrument all corners of our codebase without worrying about the overhead imposed on the host service. Users can pose queries against this "collective schema" as if it were a regular database. For example, a query can contain standard SQL operations such as selection, projection and aggregation, with filters specified in WHERE clauses. Scrub tries to execute as many operations as possible in the host service itself. Specifically, it evaluates all filter and projection operations at the host. This reduces the network load, which could be extremely high if every tuple were emitted unfiltered.
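The host-side behavior described here can be sketched as follows: with no query installed, log() returns immediately; with a query installed, the filter and projection run before anything is queued for emission. HostQuery, HostStream and the map-based row representation are hypothetical names and simplifications for illustration, not Scrub's real classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical host-side query: a filter (WHERE) plus a projection (selected columns).
final class HostQuery {
    final Predicate<Map<String, Object>> where;
    final List<String> select;

    HostQuery(Predicate<Map<String, Object>> where, List<String> select) {
        this.where = where;
        this.select = select;
    }
}

// Hypothetical stream owned by one instrumented point in the host service.
final class HostStream {
    private volatile HostQuery query; // null means no active query
    // Stand-in for the tuples that would be sent over the network.
    private final List<Map<String, Object>> outbox = new ArrayList<>();

    void install(HostQuery q) {
        query = q;
    }

    void log(Map<String, Object> row) {
        HostQuery q = query;
        if (q == null) {
            return; // no query on this stream: effectively a no-op
        }
        if (!q.where.test(row)) {
            return; // filter evaluated at the host, nothing leaves the service
        }
        Map<String, Object> projected = new LinkedHashMap<>();
        for (String column : q.select) {
            projected.put(column, row.get(column)); // keep only requested columns
        }
        synchronized (outbox) {
            outbox.add(projected);
        }
    }

    List<Map<String, Object>> outbox() {
        synchronized (outbox) {
            return new ArrayList<>(outbox);
        }
    }
}

final class HostStreamDemo {
    public static void main(String[] args) {
        HostStream stream = new HostStream();
        Map<String, Object> row = new HashMap<>();
        row.put("exchange", "exchange-a");
        row.put("country", "US");
        row.put("bidFloor", 0.25);

        stream.log(row); // dropped: no query installed yet
        stream.install(new HostQuery(
                r -> "US".equals(r.get("country")),
                Arrays.asList("exchange", "bidFloor")));
        stream.log(row); // passes the filter, emitted with two columns
        System.out.println(stream.outbox());
    }
}
```

Aggregation is deliberately left out of the sketch, since the post states only that filters and projections are evaluated at the host.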

Scrub achieves its insights at the expense of historical data: it logs data only while a query is active. Every query has a TTL, after which it stops emitting tuples from the service. This TTL can vary from two minutes to 24 hours, but having an explicit value guarantees that a host service will never "leak" queries, so CPU and I/O resources are dedicated to Scrub threads only for a limited period of time.
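One simple way to guarantee that queries cannot leak is to give each one an absolute expiry time and sweep the registry periodically. The sketch below illustrates that idea; TimedQuery, QueryRegistry and the explicit clock parameter are assumptions made for illustration, and the post does not say how Scrub implements expiry internally.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical query handle that knows when it expires.
final class TimedQuery {
    private final long expiresAtMillis;

    TimedQuery(long nowMillis, long ttlMillis) {
        if (ttlMillis <= 0) {
            throw new IllegalArgumentException("TTL must be positive");
        }
        this.expiresAtMillis = nowMillis + ttlMillis;
    }

    boolean isActive(long nowMillis) {
        return nowMillis < expiresAtMillis;
    }
}

// Hypothetical per-host registry of active queries.
final class QueryRegistry {
    private final Map<String, TimedQuery> queries = new ConcurrentHashMap<>();

    void register(String queryId, TimedQuery query) {
        queries.put(queryId, query);
    }

    // Removes every expired query and returns how many are still active.
    int sweep(long nowMillis) {
        queries.values().removeIf(q -> !q.isActive(nowMillis));
        return queries.size();
    }
}

final class QueryRegistryDemo {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        long twoMinutes = 2 * 60 * 1000L;

        QueryRegistry registry = new QueryRegistry();
        registry.register("q1", new TimedQuery(start, twoMinutes));

        System.out.println(registry.sweep(start + twoMinutes - 1)); // still active
        System.out.println(registry.sweep(start + twoMinutes));     // expired and removed
    }
}
```

Passing the current time in explicitly keeps the expiry logic deterministic and easy to test; a real client would more likely read a clock on a background thread.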

As a user, you have the flexibility to run a query on a sample of services in your topology. You can ask Scrub to do the load balancing for you, or cherry-pick specific hosts to run queries (this is useful if you are A/B testing a new feature).
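The two ways of choosing hosts can be sketched as below. Note that uniform random sampling is swapped in here as a simple stand-in for "let Scrub do the load balancing"; the post does not describe Scrub's actual selection policy, and HostSampler and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Hypothetical helper for deciding which hosts run a query.
final class HostSampler {
    // Picks up to sampleSize distinct hosts uniformly at random.
    static List<String> sample(List<String> hosts, int sampleSize, Random rng) {
        List<String> shuffled = new ArrayList<>(hosts);
        Collections.shuffle(shuffled, rng);
        int count = Math.max(0, Math.min(sampleSize, shuffled.size()));
        return new ArrayList<>(shuffled.subList(0, count));
    }

    // Keeps only the explicitly requested hosts that exist in the topology.
    static List<String> cherryPick(List<String> hosts, Set<String> wanted) {
        List<String> picked = new ArrayList<>();
        for (String host : hosts) {
            if (wanted.contains(host)) {
                picked.add(host);
            }
        }
        return picked;
    }
}

final class HostSamplerDemo {
    public static void main(String[] args) {
        List<String> topology = Arrays.asList("ad-01", "ad-02", "ad-03", "ad-04");

        // Let the tool choose half of the hosts.
        System.out.println(HostSampler.sample(topology, 2, new Random()));

        // Target only the hosts running the experimental build.
        Set<String> experiment = new HashSet<>(Arrays.asList("ad-02", "ad-04"));
        System.out.println(HostSampler.cherryPick(topology, experiment));
    }
}
```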

Scrub @ Turn

We have been using Scrub in production at Turn for more than a year. It has helped debug various production issues and campaign performance issues. Developers have used it to closely monitor new features. Because its query language is schema-based, people with no knowledge of the system's internals can query for data. This frees up developers' time for development and lets support staff gather data independently (and even fix certain issues themselves, where possible).

As Eric Raymond once said, “Given enough eyeballs, all bugs are shallow.” Scrub allows anyone at Turn to easily dig deep down into the fine details of our system, allowing us to iterate faster on issues both internal and external to our system. Taking less time to get to a better product has made Turn faster than we would be without Scrub and faster than other companies working on similarly challenging problems.

We hope to release Scrub to the public sometime early next year, along with details of how it works and how to integrate it with your code. There are some advanced features in Scrub that we didn’t get a chance to talk about here. Keep an eye out for future posts on Scrub.

Arjun Satish

Sr. Software Engineer
