PySpark optimization techniques

Mayank Choudhary - Aug 28 - Dev Community
There are a variety of parts of a Spark job that you might want to optimize, and it's valuable to be specific about which one you are targeting. Some of these areas are:
  • Code-level design choices (e.g., RDDs versus DataFrames)
  • Joins (e.g., use broadcast joins and avoid Cartesian joins or even full outer joins)
  • Aggregations (e.g., using reduceByKey when possible over groupByKey; see the sketch after this list)
  • Individual application properties
  • Inside of the Java Virtual Machine (JVM) of an executor
  • Worker nodes
  • Cluster and deployment properties
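
For the aggregation point above, here is a minimal sketch, using a made-up pair RDD of word counts, of why reduceByKey is usually preferred over groupByKey: reduceByKey combines values within each partition before the shuffle, so far less data moves across the network.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("agg-example").getOrCreate()
sc = spark.sparkContext

# Illustrative pair RDD of (word, 1) tuples
pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("c", 1), ("a", 1)])

# groupByKey shuffles every (key, value) pair, then sums on the reducer side
counts_group = pairs.groupByKey().mapValues(sum)

# reduceByKey sums within each partition first, so far less data is shuffled
counts_reduce = pairs.reduceByKey(lambda x, y: x + y)

print(counts_group.collect())
print(counts_reduce.collect())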

  • Using efficient data storage formats like Parquet or ORC can significantly reduce storage size and improve read/write performance.
  • Efficient Storage: Using formats like Parquet or ORC compresses the data, reducing storage costs and improving disk I/O performance.
  • Faster Query Performance: These formats are optimized for large-scale processing, leading to faster query execution times due to their columnar storage structure.
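
As a rough illustration (the file paths, columns, and sample rows here are made up for the example), the same DataFrame can be written out as CSV, Parquet, or ORC through the DataFrameWriter API:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-formats").getOrCreate()

# Illustrative employees data
df = spark.createDataFrame(
    [(1, "Alice", 34, 55000.0), (2, "Bob", 41, 72000.0)],
    ["id", "name", "age", "salary"],
)

# Row-based text: simple, but large on disk and slow to scan
df.write.mode("overwrite").csv("/tmp/employees_csv", header=True)

# Columnar and compressed (snappy by default): smaller files, faster analytics
df.write.mode("overwrite").parquet("/tmp/employees_parquet")
df.write.mode("overwrite").orc("/tmp/employees_orc")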

  • Row-based file formats (e.g., CSV, JSON) store data by rows. Each row contains all the fields for a particular record, making it efficient for writing and retrieving whole records.
  • Columnar-based file formats (e.g., Parquet, ORC) store data by columns. Each column contains all the values for a particular field, making it more efficient for analytical queries that involve aggregation and filtering.

ORC (Optimized Row Columnar) and Parquet are popular columnar storage file formats used in big data processing frameworks like Apache Spark and Hadoop. They are optimized for storage and query performance in distributed data environments. Both ORC and Parquet files are binary formats, which means you cannot read them directly the way you can a CSV file.
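
Because the files are binary, they are read back through Spark's format-aware readers rather than as plain text. A minimal sketch, reusing the hypothetical paths from the previous example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-columnar").getOrCreate()

# Schema and column statistics live in the file footers,
# so no schema-inference pass over the data is needed
parquet_df = spark.read.parquet("/tmp/employees_parquet")  # illustrative path
orc_df = spark.read.orc("/tmp/employees_orc")              # illustrative path

parquet_df.printSchema()
orc_df.show()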

SELECT AVG(salary) FROM employees WHERE age > 30;
  • Row-Based (CSV): Reads all rows, including unnecessary data, resulting in higher I/O.
  • Columnar-Based (Parquet): Reads only the age and salary columns, reducing I/O.
  • Columnar-Based (ORC): Reads only the age and salary columns, but with additional optimization due to lightweight indexing, it skips irrelevant rows faster, resulting in even better query performance.
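
The same query expressed on a Parquet-backed DataFrame (again using the hypothetical employees data) lets Spark prune to the age and salary columns and push the age filter down to the reader, which you can confirm with explain():

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("column-pruning").getOrCreate()

employees = spark.read.parquet("/tmp/employees_parquet")  # illustrative path

avg_salary = (
    employees
    .where(F.col("age") > 30)  # predicate pushed down to the Parquet reader
    .agg(F.avg("salary").alias("avg_salary"))
)

# The physical plan shows a ReadSchema containing only age and salary,
# plus PushedFilters for the age predicate
avg_salary.explain()
avg_salary.show()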

  • Broadcast joins improve join performance when one of the tables is small enough to fit into the memory of each worker node.
  • Improved Join Performance: Broadcasting a small table to all nodes minimizes the need for shuffling large datasets, significantly speeding up the join operation.
  • Memory Efficiency: This method works best when the small table fits in memory, avoiding expensive disk I/O operations.
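
A minimal broadcast-join sketch, assuming a large fact DataFrame and a small lookup DataFrame (the paths, names, and join key are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join").getOrCreate()

orders = spark.read.parquet("/tmp/orders")        # large table (illustrative)
countries = spark.read.parquet("/tmp/countries")  # small lookup table (illustrative)

# broadcast() hints Spark to ship the small table to every executor,
# so the large table can be joined locally without shuffling it
joined = orders.join(broadcast(countries), on="country_code", how="inner")

joined.explain()  # the physical plan should show a BroadcastHashJoin

Spark will also broadcast automatically when a table's estimated size is below spark.sql.autoBroadcastJoinThreshold (10 MB by default), but the explicit hint is useful when size statistics are missing or inaccurate.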

  • Caching is useful when a DataFrame is reused multiple times. It avoids recomputation and speeds up the workflow.
  • Avoids Recomputations: Caching prevents the need to recompute DataFrames multiple times during a workflow, saving time.
  • Increases Performance: By storing DataFrames in memory, subsequent actions on the DataFrame are executed much faster.
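
A minimal caching sketch, assuming a filtered DataFrame that feeds several downstream aggregations (paths and column names are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("caching").getOrCreate()

events = spark.read.parquet("/tmp/events")  # illustrative path

# This intermediate result is reused below, so keep it in memory
recent = events.where(F.col("event_date") >= "2024-01-01").cache()

# cache() is lazy: the first action materializes the cached partitions,
# and the later actions reuse them instead of re-reading and re-filtering
recent.count()
recent.groupBy("event_date").count().show()
recent.groupBy("user_id").agg(F.sum("amount")).show()

# Release the memory once the DataFrame is no longer needed
recent.unpersist()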

  • Proper partitioning of DataFrames can improve parallelism and reduce shuffling, enhancing performance.
  • Enhanced Parallelism: Proper repartitioning ensures that the workload is evenly distributed across nodes, improving parallel processing.
  • Reduced Shuffling: By partitioning data based on key columns, you minimize costly shuffle operations during joins or aggregations.
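
A minimal repartitioning sketch, assuming an under-partitioned DataFrame that is about to be aggregated on customer_id (the path, key, and partition counts are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning").getOrCreate()

transactions = spark.read.parquet("/tmp/transactions")  # illustrative path

print(transactions.rdd.getNumPartitions())  # inspect the current layout

# Hash-partition on the aggregation/join key so matching rows are co-located
by_customer = transactions.repartition(200, "customer_id")

totals = by_customer.groupBy("customer_id").sum("amount")

# When only *reducing* the partition count (e.g. before writing small output),
# coalesce avoids a full shuffle
totals.coalesce(10).write.mode("overwrite").parquet("/tmp/customer_totals")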

  • DataFrames are optimized for performance and provide a higher level of abstraction compared to RDDs.
  • Higher Abstraction: DataFrames provide a more user-friendly API compared to RDDs, with automatic optimization under the hood.
  • Performance Optimization: The Catalyst optimizer in Spark SQL optimizes DataFrame operations, making them faster than equivalent RDD operations.
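
As a rough comparison on illustrative data, here is the same average-salary-by-department computation written against an RDD and against a DataFrame; only the latter goes through the Catalyst optimizer and Tungsten's compact binary row format:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()
sc = spark.sparkContext

data = [("eng", 75000.0), ("hr", 52000.0), ("eng", 81000.0), ("hr", 49000.0)]

# RDD version: opaque Python lambdas that Spark cannot optimize
rdd_avg = (
    sc.parallelize(data)
      .mapValues(lambda s: (s, 1))
      .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
      .mapValues(lambda t: t[0] / t[1])
)
print(rdd_avg.collect())

# DataFrame version: declarative, planned and optimized by Catalyst before execution
df = spark.createDataFrame(data, ["dept", "salary"])
df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()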

  • User-defined functions (UDFs) are often slower as they operate row-wise. Use built-in functions whenever possible.
  • Performance Overhead: UDFs can slow down processing since they operate on each row individually and bypass many of Spark's internal optimizations.
  • Leverage Built-in Functions: Built-in functions are optimized for distributed processing and often execute much faster than UDFs.
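
A minimal sketch contrasting a Python UDF with the equivalent built-in function (column names are illustrative); the built-in version stays inside the JVM, is visible to the Catalyst optimizer, and avoids per-row Python serialization:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-vs-builtin").getOrCreate()

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# UDF: each row is shipped to a Python worker, converted, and shipped back
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
slow = df.withColumn("name_upper", upper_udf("name"))

# Built-in: runs natively in the JVM and can be optimized away or reordered
fast = df.withColumn("name_upper", F.upper("name"))

slow.show()
fast.show()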
