A year ago, we at Taboola kicked off a POC to migrate our Spark workloads from a cluster of thousands of CPU cores to GPUs. I started with a small test, just to get a sense of what it would require, and very quickly found myself in an intense project involving a great group of people from Taboola and NVIDIA's Rapids team. In this blog post, I'll describe what Taboola does with Spark, our motivation to move to Rapids, and the insights, pitfalls, challenges, and achievements so far.
Taboola’s Business
Taboola is the leading content recommendation company in the world. When you surf the web and see a native advertisement, it is most likely content served by Taboola. To make that happen, Taboola runs a very complex data pipeline that stretches from the user's browser or phone, through multiple data centers running complex deep learning algorithms, databases, and infrastructure services such as Kafka, across thousands of servers, in order to serve the best-fitting ad for each user. This blog focuses on one component of this complex pipeline: our Spark cluster of thousands of CPU cores and our effort to migrate it to the GPU. Data arrives at this cluster from data centers all over the world, where we collect it and build a unique "pageview". A pageview is a very big and wide data structure describing a user and their interaction with our system. This pageview structure, which contains over 1,500 distinct columns and amounts to over 1TB of hourly data, is what we process in our Spark cluster.
Project Motivation
Our motivation originated from the fact that scaling out the CPU cluster is very challenging in terms of both hardware costs and data center capacity. To cope with the ever-increasing load of data to be processed, Taboola has to increase its Spark cluster capacity quite often. We have many distinct analyzers, SQL queries that process the incoming pageviews, 1TB of raw data every hour, plus 2-, 6-, 12-, and 48-hour catch-up runs. New analyzers are created all the time, increasing the load on the Spark cluster, hence the constant need for more compute power. NVIDIA's Rapids accelerator for Spark was a perfect match.
It's POC Time
We first defined what we were going to test. We took real production data from Cyber Monday so that we would test and benchmark against a very big dataset. The data is 1.5TB of ZSTD-compressed Parquet files per hour. It has over 1,500 columns of all native types, including arrays, structures, and nested structures with arrays: the full monty. Hardware-wise, we started with a 72-CPU-core Intel server with three A30 GPUs, a 900GB local SSD drive for Spark to store its intermediate files, 380GB of RAM, and a 10Gb/s NIC. We set a minimum bar of a 3x speedup for the GPU solution to be considered a success cost-wise. We picked 15 production queries from multiple R&D departments that would resemble as many as possible of the hundreds of queries we have in production. These queries are mostly very complex, including many SQL operations such as aggregations, sorts, lateral view explode, distribute by, window functions, and UDFs. Figure-1 shows such a query.
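Since Figure-1 is not reproduced here, the following snippet sketches the flavor of such a query. All table and column names (pageviews, recommended_items, and so on) are made up for illustration, and an active SparkSession named spark is assumed:

```scala
// Hypothetical query in the spirit of our production analyzers: an array
// explode, an aggregation, a window function, and a DISTRIBUTE BY clause.
val topItems = spark.sql("""
  SELECT publisher,
         item_id,
         hour,
         SUM(clicks) AS clicks,
         RANK() OVER (PARTITION BY publisher ORDER BY SUM(clicks) DESC) AS rnk
  FROM (
    SELECT publisher, hour, clicks, item.id AS item_id
    FROM pageviews
    LATERAL VIEW explode(recommended_items) t AS item  -- unnest the array column
  ) exploded
  GROUP BY publisher, item_id, hour
  DISTRIBUTE BY publisher
""")
```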
Figure-2 gives a sneak peek at the speedup factors we achieved.
POC Goals
We started with a single server, as described above, but obviously we want the solution to scale to a multi-GPU, multi-server cluster. The cluster will be managed by Kubernetes, as opposed to our current Mesos cluster, since Mesos is going to be obsolete and NVIDIA supports Kubernetes. Taboola's R&D should be oblivious to the change and should not care whether their queries run on CPUs or GPUs. The GPU output should be identical to the CPU output, which is sometimes challenging when migrating to the GPU, but the Spark Rapids team is very aware of this. Stability in production is also a crucial goal. Lastly, the GPU should outperform the CPU by a minimum factor of 3. We benchmarked several GPUs (P100, V100, A100, and A30) and found that the A30 was the best price-performance fit.
Run Rapids Run
I ran some SQL queries using Rapids, and the results were a bit disappointing. Some of the less complex queries, mostly those with lateral view explode, gave a 3x-5x speedup over the CPU. Others showed much lower factors, and some queries even crashed. I started asking around in Rapids' GitHub repo, played with the relevant Spark and Rapids-specific parameters, and started to get better results. The relevant parameters were:
- spark.sql.files.maxPartitionBytes – The CPU uses a default value of 128MB, which is too low for the GPU. We are using 1-2GB.
- spark.sql.shuffle.partitions – We found the default of 200 to be good enough in most cases.
- spark.rapids.sql.concurrentGpuTasks – Determines the number of tasks that can run concurrently on the GPU. You should use at least two.
Tuning these parameters helped the queries run more smoothly and, in some cases, perform better; the sketch below shows how they can be applied.
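Here is a minimal sketch of a Spark session built with these settings in Scala. The app name and values shown are just our starting points, not universal recommendations, and the plugin line assumes the rapids-4-spark jar is already on the classpath:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rapids-poc")
  // enable the RAPIDS accelerator (requires the rapids-4-spark jar)
  .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
  // CPU default is 128MB; too low for the GPU, so we use 1-2GB
  .config("spark.sql.files.maxPartitionBytes", "2g")
  // the default of 200 proved good enough for most of our queries
  .config("spark.sql.shuffle.partitions", "200")
  // tasks allowed to run concurrently on each GPU; try at least two
  .config("spark.rapids.sql.concurrentGpuTasks", "2")
  .getOrCreate()
```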
First Bottleneck
So what was holding us back? We profiled some of the less performant queries and saw that most of them wasted a lot of time parsing Parquet footer data on the CPU. Our Parquet data has more than 1,500 columns, and apparently the regular Java code that parses the footer was not adequate for such a big footer. Figure-3 shows a snippet of NVIDIA's profiler output for a 9-second Spark task in which the GPU was mostly idle and only worked for 330ms.
Figure-4 shows a flame graph of one of our queries that suffered from this behavior. The purple bars indicate time spent inside the org.apache.parquet.hadoop.ParquetFileReader class. Almost 50% of the query's time was spent parsing the Parquet footer, and during that time the GPU was idle.
We set off to test an idea we had. When parsing the footer, Parquet's code iterates over the footer metadata serially, once per row group. We made some adjustments to Parquet's parameters and decreased the number of row groups in each file. That gave us about a 10-15% improvement, but it obviously was not enough. Remember that we have 1,500 columns, so every time the footer metadata is read, the metadata of all 1,500 columns is read and parsed serially, even though a query typically asks for only 50-100 of them. We wanted to index the footer metadata so that instead of reading all 1,500 columns' metadata serially, we could access just the needed columns directly. We managed to pull this off by changing Parquet-mr public code in C++ and Java and did get nice performance results, but it was too cumbersome and complex. Luckily for us, the Rapids team at NVIDIA came up with a much better idea and replaced the Java footer-parsing code with Arrow's C++ implementation. We now have spark.rapids.sql.format.parquet.reader.footer.type set to NATIVE by default for our GPU implementation. Bottleneck resolved: no more queries with the GPU idle because of footer-parsing overhead on the CPU. Both knobs are sketched below.
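As a rough illustration, assuming an active SparkSession named spark and a DataFrame df, the two changes look roughly like this; the 256MB row-group size and the output path are hypothetical values chosen for the example, not recommendations:

```scala
// Use the native (Arrow C++) Parquet footer parser instead of the Java one.
spark.conf.set("spark.rapids.sql.format.parquet.reader.footer.type", "NATIVE")

// Fewer, larger row groups mean fewer serial passes over footer metadata.
df.write
  .option("parquet.block.size", 256L * 1024 * 1024) // row-group size in bytes (illustrative)
  .option("compression", "zstd")
  .parquet("/path/to/pageviews") // hypothetical output path
```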
Network Bottleneck
The next bottleneck was a network card that was too weak. While the 10Gb/s Ethernet card sustained the CPU load, it failed to do so for the GPU load. Replacing it with a 25Gb/s card resolved this bottleneck.
Disk I/O Bottleneck
So, two bottlenecks resolved, yet the queries still ran slowly. Now what? A look at Spark's UI page gave a clear indication of what was happening; see the following table.
| Metric | Min | 25th percentile | Median | 75th percentile | Max |
|---|---|---|---|---|---|
| Duration | 0.4 s | 0.6 s | 0.8 s | 1 s | 1.2 min |
| GC Time | 0.0 ms | 0.0 ms | 0.0 ms | 90.0 ms | 0.5 s |
| Shuffle Read Size / Records | 21.4 MB / 1000 | 22.3 MB / 1000 | 22.5 MB / 1000 | 22.7 MB / 1000 | 27.3 MB / 1000 |
| Shuffle Write Size / Records | 17.5 MB / 1000 | 17.9 MB / 1000 | 18 MB / 1000 | 18.1 MB / 1000 | 18.1 MB / 1000 |
| Scheduler Delay | 3.0 ms | 5.0 ms | 5.0 ms | 7.0 ms | 3 s |
| Peak Execution Memory | 64 MB | 64 MB | 64 MB | 64 MB | 64 MB |
| Shuffle Write Time | 9.0 ms | 13.0 ms | 18.0 ms | 21.0 ms | 59 s |