In this talk, we will discuss several streaming algorithms that use hashing techniques to produce fast approximate answers to queries on large datasets while using a fixed amount of memory. We will review some open source implementations and discuss how to implement them on a data lake using Apache Spark. Finally, we will discuss a few examples where we have applied these techniques to great advantage on real-world data at Hybrid Theory.

An outline of the talk:
- The problem of analyzing Big Data
- Data sketching algorithms and their open source implementations
- Saving sketches of the data in your lake using Apache Spark
- Real use case: from weekly batches to near-real-time, on-demand reports
- Real use case: real-time audience size estimation
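
To give a flavour of what a hashing-based sketch with fixed memory can look like, here is a minimal illustration in Python. The abstract does not say which algorithms the talk covers, so this example uses a K-Minimum-Values (KMV) distinct-count sketch as a stand-in; the class name `KMVSketch`, its methods, and the default `k` are hypothetical and are not taken from any of the open source implementations mentioned above.

```python
import hashlib
import heapq


class KMVSketch:
    """Illustrative K-Minimum-Values sketch for approximate distinct counts.

    Memory is bounded by k hash values, no matter how many items are added.
    """

    def __init__(self, k=256):
        self.k = k
        self._heap = []         # negated hashes, so the largest kept hash is at index 0
        self._members = set()   # hashes currently kept, used to skip duplicates

    @staticmethod
    def _hash(item):
        # Map the item to a pseudo-uniform float in [0, 1].
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    def add(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # The new hash is smaller than the largest kept one: swap them.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        # The k smallest hashes of the union come from the two kept sets.
        merged = KMVSketch(min(self.k, other.k))
        for h in sorted(self._members | other._members)[: merged.k]:
            heapq.heappush(merged._heap, -h)
            merged._members.add(h)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            # Fewer than k distinct hashes seen: the count is exact.
            return float(len(self._heap))
        # Standard KMV estimator: (k - 1) / (k-th smallest hash).
        return (self.k - 1) / -self._heap[0]
```

Because two sketches can be merged without revisiting the raw data, one sketch per partition or per day could in principle be stored alongside the data and combined at query time, which is the property that makes on-demand reports and audience size estimates cheap to compute.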
Session 🗣 Intermediate ⭐⭐ Track: AI, ML, Bigdata, Python
big data
apache spark
scala
python
sketches