Engineering

Strength and Efficiency: How We Reached 4M QPS

Turn has reached a remarkable milestone in reaching the global digital audience: our platform now handles 4 million queries per second (QPS), which means that every second, Turn has 4 million opportunities to reach Internet users. This is particularly striking given that Turn reached 3 million QPS just over five months ago, so traffic grew by roughly a third in that time. While such scale and growth are not unexpected on the Internet, what is extraordinary is that Turn absorbed it without any additional hardware spending while also increasing ROI for our advertisers. In other words, our engineering team extracted more performance from existing servers through a series of application performance optimizations. In this post, we’d like to share some of the work that made this scale possible.

The reality of programmatic bidding is that while we have a tremendous number of opportunities to bid, we do not always want to bid when doing so would not benefit Turn or our advertisers. So we built a rich set of fast decision-making algorithms into our bid servers to perform “early exit.” These algorithms filter out potentially undesirable bid requests before they are passed to ad servers. This is where Turn’s “secret sauce” resides: preventing the waste of expensive computing and network resources.
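
To make the early-exit idea concrete, here is a minimal sketch of a filter chain in Python. It is not Turn’s implementation: the `BidRequest` fields, the two filter stages (a source blocklist and a random sampler for low-value traffic), and all thresholds are hypothetical placeholders. The point it illustrates is that each stage is cheap, and the chain stops at the first rejection so most unwanted requests never reach later, costlier stages.

```python
import random
from dataclasses import dataclass
from typing import Callable, Iterable, Optional, Set


@dataclass
class BidRequest:
    # Hypothetical fields for illustration; real bid requests carry far more data.
    source_id: str
    site_score: float        # assumed predicted site performance in [0, 1]
    has_user_signals: bool


# A filter returns a rejection reason, or None to let the request continue.
Filter = Callable[[BidRequest], Optional[str]]


def make_blocklist_filter(blocked: Set[str]) -> Filter:
    """Reject requests from known-bad sources (e.g. a botnet blocklist)."""
    def check(req: BidRequest) -> Optional[str]:
        return "blocked_source" if req.source_id in blocked else None
    return check


def make_low_value_sampler(threshold: float, drop_rate: float,
                           rng: random.Random) -> Filter:
    """Randomly drop a fraction `drop_rate` of low-value requests."""
    def check(req: BidRequest) -> Optional[str]:
        low_value = req.site_score < threshold or not req.has_user_signals
        if low_value and rng.random() < drop_rate:
            return "low_value_sampled"
        return None
    return check


def early_exit(req: BidRequest, filters: Iterable[Filter]) -> Optional[str]:
    """Run cheap filters in order and stop at the first rejection.

    Returns the rejection reason, or None if the request should be
    forwarded to an ad server.
    """
    for f in filters:
        reason = f(req)
        if reason is not None:
            return reason
    return None
```

Putting the cheapest, most selective checks first keeps the average cost per request low, since rejected requests skip every stage after the one that caught them.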

One example of early exit is dropping bid requests from fraudulent sources such as botnets. Our integration with DoubleVerify helps us determine whether a request is fraudulent, but because we observe many unique signals of our own, we also developed and deployed an in-house fraud detection model. Additionally, some requests are less valuable, such as those from sites with low predicted performance or from users with weak or no signals; for these, we probabilistically filter out a percentage of the traffic. In our most recent round of this optimization, for example, we reduced ad server queue overflow by 20% and increased delivery and ROI on high-impact campaigns by as much as 75% in some cases.

The ad server is the most computationally intensive part of our pipeline, and we continue to look for ways to improve it. We reduced the ad server’s memory usage by 15% and put some of those savings toward a cache for the most complex user-targeting rules, which decreased ad selection time by 5%. This was part of a larger ad selection optimization effort that reduced average execution time from 2.6ms to 2.1ms, a 19% reduction.
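
The post does not describe how the targeting-rule cache works. As a hedged illustration only, the sketch below caches *compiled* rules so that repeated complex rules skip parsing. The rule syntax and the `compile_rule` and `matches` helpers are hypothetical, and Python’s `functools.lru_cache` stands in for whatever cache Turn actually built. A bounded cache (`maxsize`) matters here because the cache competes for the same memory budget that was just freed.

```python
from functools import lru_cache
from typing import Callable, Dict, List

Profile = Dict[str, str]
Predicate = Callable[[Profile], bool]


def _parse_clause(clause: str) -> Predicate:
    """Turn 'key=value' into a predicate on a user profile."""
    key, value = (part.strip() for part in clause.split("=", 1))
    return lambda profile: profile.get(key) == value


@lru_cache(maxsize=10_000)  # bounded, so the cache cannot grow without limit
def compile_rule(rule: str) -> Predicate:
    """Compile a rule such as 'geo=US AND segment=auto OR geo=CA'.

    OR binds more loosely than AND. The compiled predicate is cached,
    so a rule seen before is not parsed again.
    """
    disjuncts: List[List[Predicate]] = [
        [_parse_clause(c) for c in group.split(" AND ")]
        for group in rule.split(" OR ")
    ]

    def predicate(profile: Profile) -> bool:
        return any(all(clause(profile) for clause in group) for group in disjuncts)

    return predicate


def matches(rule: str, profile: Profile) -> bool:
    # Only the parsed rule is cached; the match itself depends on the user.
    return compile_rule(rule)(profile)
```

For example, `matches("geo=US AND segment=auto OR geo=CA", {"geo": "CA"})` returns `True`, and a second call with the same rule string is served from the cache, which `compile_rule.cache_info()` reports as a hit.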

Turn leverages large-scale data to make intelligent bidding decisions, including page-level contextual data, which is important for determining whether a page fits an advertiser’s branding requirements. But the feature set for each page has become very rich, and the data has grown to the point where it has memory and network implications. We therefore set out to optimize the contextual data, which is stored in our NoSQL database and transferred to bid and ad servers. By restructuring the data, we reduced its footprint by 90%, which cut network load by 30% and allowed us to take on 30% more traffic.
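
The post does not say how the contextual data was restructured. One common technique for shrinking records like these is dictionary encoding plus compact binary packing, sketched below under stated assumptions: the record shape, the shared `VOCAB` dictionary, and the one-byte score quantization are all hypothetical, and this is not presented as Turn’s format or as the source of its 90% figure.

```python
import json
import struct
from typing import Any, Dict

# Hypothetical shared dictionary mapping category names to small integer IDs.
# Writer and reader must agree on it (e.g. by versioning it alongside the data).
VOCAB = {"sports": 1, "news": 2, "finance": 3}
INV_VOCAB = {v: k for k, v in VOCAB.items()}

HEADER = struct.Struct("<BB")   # category count (max 255), brand-safe flag
ENTRY = struct.Struct("<HB")    # category ID (uint16), quantized score (uint8)


def encode(features: Dict[str, Any]) -> bytes:
    cats = features["categories"]
    out = bytearray(HEADER.pack(len(cats), int(features["brand_safe"])))
    for cat in cats:
        # Quantize a score in [0, 1] to one byte; precision drops to ~1/255.
        out += ENTRY.pack(VOCAB[cat["name"]], round(cat["score"] * 255))
    return bytes(out)


def decode(blob: bytes) -> Dict[str, Any]:
    count, safe = HEADER.unpack_from(blob, 0)
    cats = []
    for i in range(count):
        cat_id, q = ENTRY.unpack_from(blob, HEADER.size + i * ENTRY.size)
        cats.append({"name": INV_VOCAB[cat_id], "score": q / 255})
    return {"categories": cats, "brand_safe": bool(safe)}


page = {
    "categories": [{"name": "sports", "score": 0.91}, {"name": "news", "score": 0.40}],
    "brand_safe": True,
}
as_json = json.dumps(page).encode("utf-8")
as_binary = encode(page)  # 2-byte header + 3 bytes per category
```

Here the two-category record packs into 8 bytes versus roughly a hundred bytes of JSON. The trade-off is that scores become lossy and every consumer needs the same vocabulary, which is usually acceptable for features that only drive coarse decisions such as brand safety.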

Communication is the backbone of distributed systems, and Turn has spent a lot of effort tuning it. We recently adopted gRPC (see http://www.grpc.io/) as our new messaging layer, which led to faster response times and lower failure rates. For example, the 98th-percentile remote procedure call (RPC) latency between the bid server and ad server dropped from 80ms to 60ms, a 25% improvement. The timeout rate also dropped from 10% to 1%, leading to a 10% global increase in RPC volume. However, this created a new problem: the extra traffic to ad servers could exceed their capacity. We solved it by implementing new concurrency management and robust back pressure that smooth out traffic bursts, allowing ad servers to run at nearly 100% CPU utilization with a failure rate below 1%.
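
The post does not detail Turn’s concurrency management. A minimal sketch of one common back-pressure pattern is shown below, using Python’s `asyncio`: bound the number of in-flight calls, allow a small waiting queue, and reject anything beyond that immediately with an explicit overload error the caller can react to. `BackPressureGate`, `OverloadedError`, and `fake_ad_server` are hypothetical names, and the limits are arbitrary.

```python
import asyncio
from typing import Any, Awaitable, Callable


class OverloadedError(Exception):
    """Raised when the gate refuses work; the caller should back off or drop."""


class BackPressureGate:
    """Bound concurrency in front of a downstream service.

    At most `max_in_flight` calls run at once, up to `max_waiting` more may
    queue, and anything beyond that is rejected immediately instead of
    piling up. Create it inside a running event loop.
    """

    def __init__(self, max_in_flight: int, max_waiting: int) -> None:
        self._sem = asyncio.Semaphore(max_in_flight)
        self._capacity = max_in_flight + max_waiting
        self._pending = 0  # calls currently running or waiting for a slot
        self.rejected = 0

    async def call(self, handler: Callable[..., Awaitable[Any]], *args: Any) -> Any:
        if self._pending >= self._capacity:
            self.rejected += 1
            raise OverloadedError("downstream at capacity")
        self._pending += 1
        try:
            async with self._sem:
                return await handler(*args)
        finally:
            self._pending -= 1


async def fake_ad_server(x: int) -> int:
    # Stand-in for an RPC to an ad server.
    await asyncio.sleep(0.01)
    return x * 2


async def demo():
    gate = BackPressureGate(max_in_flight=2, max_waiting=2)
    burst = [gate.call(fake_ad_server, i) for i in range(10)]
    results = await asyncio.gather(*burst, return_exceptions=True)
    return gate, results
```

Running `asyncio.run(demo())` sends a burst of 10 calls through a gate with 2 slots and 2 waiting places: 4 calls succeed and 6 are rejected at once, so the server stays busy without building an unbounded queue. In a gRPC service, a rejection like this would typically be surfaced to the client as the `RESOURCE_EXHAUSTED` status code.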

In closing, these innovations are possible only because Turn’s culture empowers our engineers to act on their natural curiosity, giving them interesting challenges that bear directly on the company’s bottom line. The improvements described in this post were not the work of a single super-developer but the accomplishments of many different team members. By democratizing the spirit of innovation, we see the fruits of our labor time and time again.

Larry Lo