Friday, September 8, 2023

Data Engineering

What is PySpark?

Python API for Apache Spark

It is open source

Distributed computing framework

A set of libraries for real-time and large-scale data processing

Apache Spark is written in Scala

Key Features:

Rapid processing

Effectiveness with RDDs -> Resilient Distributed Datasets

  RDD objects are immutable, and processing is distributed across the cluster

  Fault tolerance: if a node fails, the lost partitions are recomputed

Why PySpark is needed

Spark applications can be written in Python or Scala

Python code is generally more readable, concise, and maintainable; equivalent Scala code tends to be more complex

It also provides machine-learning libraries (MLlib)

Python is easier to learn, with simpler syntax than Scala

PySpark vs Scala 

Courtesy link: https://www.youtube.com/watch?v=6F2doPE0-vc
