What is PySpark?
Python API for Apache Spark
Open source
Distributed computing framework
A set of libraries for real-time and large-scale data processing
Apache Spark itself is written in Scala; PySpark is its Python interface (minimal example below)
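A minimal sketch of a PySpark program, assuming a local Spark installation (the app name and data here are illustrative, not from the source):

```python
from pyspark.sql import SparkSession

# Entry point for the DataFrame and SQL APIs; "local[*]" uses all local cores
spark = SparkSession.builder.appName("intro").master("local[*]").getOrCreate()

# Build a tiny DataFrame and run a distributed filter on it
df = spark.createDataFrame([("alice", 34), ("bob", 45)], ["name", "age"])
df.filter(df.age > 40).show()

spark.stop()
```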
Key Features:
Rapid, in-memory processing
Efficient work with RDDs (Resilient Distributed Datasets)
RDD objects are immutable, and operations on them run distributed across the cluster
Fault tolerance: after a node failure, Spark recomputes the lost data (see the sketch below)
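A small sketch of these RDD features (the numbers are illustrative). Transformations return new immutable RDDs, and the recorded lineage is what lets Spark recompute a lost partition after a node failure:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Transformations never mutate an RDD; each returns a new immutable RDD
numbers = sc.parallelize([1, 2, 3, 4, 5])   # split across partitions
doubled = numbers.map(lambda x: x * 2)      # new RDD; `numbers` is unchanged

# The lineage (parallelize -> map) is recorded, so if a node holding a
# partition of `doubled` fails, Spark re-runs only that slice of the chain
print(doubled.collect())   # [2, 4, 6, 8, 10]
print(numbers.collect())   # [1, 2, 3, 4, 5] -- original untouched

spark.stop()
```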
Why PySpark is needed
Spark applications can be written in Python or Scala
Python code is more readable, concise, and maintainable; equivalent Scala code tends to be more complex
PySpark also provides machine-learning libraries (MLlib, sketched below)
Python's simple syntax makes it easier to learn than Scala
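As a hedged example of the machine-learning support, a minimal MLlib linear regression (the toy data and model choice are assumptions for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mllib-demo").master("local[*]").getOrCreate()

# Toy dataset: label roughly follows 2 * x
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "label"])

# MLlib expects features packed into a single vector column
assembled = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)

# Fit a linear regression and inspect the learned coefficient (~2)
model = LinearRegression(featuresCol="features", labelCol="label").fit(assembled)
print(model.coefficients, model.intercept)

spark.stop()
```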
PySpark vs Scala
Courtesy link: https://www.youtube.com/watch?v=6F2doPE0-vc