Using Data Sketches to extract fast & cheap insights from Big Data

Description

In this talk, we will discuss several streaming algorithms that use hashing techniques to produce fast estimates for queries on large datasets while using a fixed amount of memory. We will review some open source implementations and discuss how to implement them on a data lake using Apache Spark. Finally, we will discuss a few examples where we have applied these techniques with great advantage to real-world data at Hybrid Theory.

An outline of the talk:
- The problem of analyzing Big Data
- Data Sketching algorithms. Open source implementations
- Save sketches of the data in your lake using Apache Spark
- Real use case: from weekly batches to near-real-time on-demand reports
- Real use case: real-time audience size estimation
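As a rough illustration of the hashing idea described above, here is a minimal K-Minimum-Values (KMV) sketch for estimating distinct counts in bounded memory. This is not taken from the talk or from any particular open source library: the class name `KMVSketch`, the helper `hash01`, and the default `k=1024` are all hypothetical choices made for this example. The sketch keeps only the `k` smallest hash values it has seen, and two sketches can be merged, which is what makes it practical to store per-partition sketches and combine them at query time.

```python
import hashlib
import heapq


def hash01(item: str) -> float:
    """Map an item to a pseudo-uniform float in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Illustrative K-Minimum-Values sketch: keeps the k smallest distinct hashes."""

    def __init__(self, k: int = 1024):
        self.k = k
        self._heap = []         # max-heap of kept hashes, stored negated
        self._members = set()   # the hash values currently kept

    def _add_hash(self, h: float) -> None:
        if h in self._members:
            return  # duplicates hash to the same value and are ignored
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item: str) -> None:
        self._add_hash(hash01(item))

    def merge(self, other: "KMVSketch") -> "KMVSketch":
        """Return a sketch of the union of both input streams."""
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._add_hash(h)
        return merged

    def estimate(self) -> float:
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        # Standard KMV estimator: (k - 1) / (k-th smallest hash value).
        return (self.k - 1) / (-self._heap[0])


# Usage: sketch two overlapping streams separately, then merge them.
monday = KMVSketch()
tuesday = KMVSketch()
for i in range(50_000):
    monday.update(f"user-{i}")
for i in range(25_000, 75_000):
    tuesday.update(f"user-{i}")
print(round(monday.merge(tuesday).estimate()))  # roughly 75,000 distinct users
```

Memory stays proportional to `k` regardless of stream length, and the relative error shrinks roughly as `1 / sqrt(k)`. Production libraries typically use more compact structures (for example HyperLogLog or Theta sketches), so this should be read only as a sketch of the principle.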

Session 🗣 Intermediate ⭐⭐ Track: AI, ML, Bigdata, Python

Slides

big data

apache spark

scala

python

sketches
