Spark and Python for Big Data with PySpark

Duration: 10h 35m 43s
Language: English
Access: Paid

Master the latest Big Data Technology - Spark! Learn to leverage it with one of the most popular programming languages, Python! In the modern job market, the ability to analyze huge data sets is invaluable, and this course is designed to equip you with the knowledge of one of the best technologies for this task, Apache Spark! Top technology companies such as Google, Facebook, Netflix, Airbnb, Amazon, and NASA utilize Spark to tackle their big data challenges!

Course Overview

Spark can run up to 100 times faster than Hadoop MapReduce on some workloads, which has driven strong demand for this skill! The Spark 2.0 DataFrame framework is relatively new, offering you the chance to quickly become a sought-after professional in the job market.

This course covers everything from a crash course in Python to using Spark DataFrames with the latest Spark 2.0 syntax. Progressing further, you'll learn how to use the MLlib Machine Learning Library alongside the DataFrame syntax and Spark. Throughout the course, you'll engage in exercises and Mock Consulting Projects that simulate real-world scenarios, allowing you to apply your new skills to tackle real problems!

We delve into the latest Spark Technologies such as Spark SQL, Spark Streaming, and advanced models including Gradient Boosted Trees! After completing this course, you will be confident in adding Spark and PySpark to your resume!

If you're ready to step into the realm of Python, Spark, and Big Data, this is the course for you!

Requirements

  • General Programming Skills in any Language (preferably Python)
  • 20 GB of free space on your local computer (or a strong internet connection for AWS usage)

Who This Course Is For

  • Individuals proficient in Python looking to apply it to Big Data
  • Those skilled in another programming language needing to learn Spark

What You'll Learn

  • Utilize Python and Spark in unison to analyze Big Data
  • Master the new Spark 2.0 DataFrame Syntax
  • Engage in Consulting Projects mimicking real-world situations
  • Classify Customer Churn using Logistic Regression
  • Apply Spark with Random Forests for Classification
  • Use Spark's Gradient Boosted Trees
  • Create robust Machine Learning Models with Spark's MLlib
  • Understand the Databricks Platform
  • Get started with Amazon Web Services EC2 for Big Data Analysis
  • Learn how to use the AWS Elastic MapReduce (EMR) service
  • Leverage the power of Linux in a Spark Environment
  • Create a Spam filter using Spark and Natural Language Processing
  • Analyze Tweets in Real Time with Spark Streaming

About the Author: Udemy

Udemy is the largest open marketplace for online courses on the internet. Founded in 2010 by Eren Bali, Oktay Caglar, and Gagan Biyani and headquartered in San Francisco, the company went public on the Nasdaq in 2021 under the ticker UDMY. The platform hosts well over two hundred thousand courses across software development, IT and cloud, data science, design, business, marketing, and creative skills, taught by tens of thousands of independent instructors. Roughly seventy million learners use it worldwide, and the corporate arm — Udemy Business — supplies a curated subset of that catalog to enterprise customers.

Because Udemy is a marketplace rather than a single editorial publisher, the catalog is uneven by design. The strongest material lives in the long-form, project-based courses authored by working engineers — full-stack JavaScript, React, Node.js, Python data science, AWS, Docker and Kubernetes, mobile development with Flutter and React Native, and cloud certification preparation. The CourseFlix listing under this source is the slice of that catalog that has been mirrored here for offline-friendly viewing, organized by topic and updated as new releases land. Pricing on Udemy itself swings dramatically with the site's near-permanent sales, which is why the platform is best treated as a deep reference catalog: pick instructors with strong reviews and a track record of updating their material rather than buying on the headline price alone.

All Course Lessons (63)
# | Lesson Title | Duration
1 | Introduction (Demo) | 03:10
2 | Course Overview | 07:56
3 | What is Spark? Why Python? | 18:58
4 | Set-up Overview | 05:59
5 | Local Installation VirtualBox Part 1 | 11:26
6 | Local Installation VirtualBox Part 2 | 14:00
7 | Setting up PySpark | 05:46
8 | AWS EC2 Set-up Guide | 02:47
9 | Creating the EC2 Instance | 16:19
10 | SSH with Mac or Linux | 04:50
11 | Installations on EC2 | 15:06
12 | Databricks Setup | 11:42
13 | AWS EMR Setup | 17:17
14 | Introduction to Python Crash Course | 01:34
15 | Jupyter Notebook Overview | 06:50
16 | Python Crash Course Part One | 16:09
17 | Python Crash Course Part Two | 12:08
18 | Python Crash Course Part Three | 11:20
19 | Python Crash Course Exercises | 01:30
20 | Python Crash Course Exercise Solutions | 09:27
21 | Introduction to Spark DataFrames | 02:27
22 | Spark DataFrame Basics | 10:52
23 | Spark DataFrame Basics Part Two | 09:56
24 | Spark DataFrame Basic Operations | 10:16
25 | Groupby and Aggregate Operations | 12:28
26 | Missing Data | 08:57
27 | Dates and Timestamps | 10:05
28 | DataFrame Project Exercise | 03:14
29 | DataFrame Project Exercise Solutions | 16:54
30 | Introduction to Machine Learning and ISLR | 10:22
31 | Machine Learning with Spark and Python with MLlib | 09:05
32 | Linear Regression Theory and Reading | 05:04
33 | Linear Regression Documentation Example | 14:20
34 | Regression Evaluation | 06:47
35 | Linear Regression Example Code Along | 15:14
36 | Linear Regression Consulting Project | 03:12
37 | Linear Regression Consulting Project Solutions | 15:33
38 | Logistic Regression Theory and Reading | 11:23
39 | Logistic Regression Example Code Along | 15:40
40 | Logistic Regression Code Along | 18:37
41 | Logistic Regression Consulting Project | 03:14
42 | Logistic Regression Consulting Project Solutions | 11:14
43 | Tree Methods Theory and Reading | 08:01
44 | Tree Methods Documentation Examples | 13:19
45 | Decision Trees and Random Forest Code Along Examples | 20:38
46 | Random Forest - Classification Consulting Project | 02:34
47 | Random Forest Classification Consulting Project Solutions | 08:01
48 | K-means Clustering Theory and Reading | 06:55
49 | KMeans Clustering Documentation Example | 09:52
50 | Clustering Example Code Along | 12:46
51 | Clustering Consulting Project | 03:10
52 | Clustering Consulting Project Solutions | 08:43
53 | Introduction to Recommender Systems | 06:33
54 | Recommender System - Code Along Project | 12:09
55 | Introduction to Natural Language Processing | 08:03
56 | NLP Tools Part One | 16:13
57 | NLP Tools Part Two | 08:06
58 | Natural Language Processing Code Along Project | 14:09
59 | Introduction to Streaming with Spark! | 10:20
60 | Spark Streaming Documentation Example | 11:48
61 | Spark Streaming Twitter Project - Part One | 04:30
62 | Spark Streaming Twitter Project - Part Two | 13:09
63 | Spark Streaming Twitter Project - Part Three | 17:36

Frequently asked questions

What prior knowledge do I need before enrolling in this course?
Before enrolling in this course, you should have general programming skills. The course includes a Python crash course to get you familiar with Python syntax and programming concepts. However, having a basic understanding of programming logic and constructs will be beneficial to keep up with the pace of the lessons.
What kinds of projects will I work on during this course?
Throughout the course, you'll engage in various Mock Consulting Projects that simulate real-world scenarios. These projects include a DataFrame project exercise, Linear Regression consulting project, Logistic Regression consulting project, Random Forest classification consulting project, and a clustering consulting project, among others. These practical exercises are designed to help you apply Spark and PySpark tools to solve real data challenges.
Who is the target audience for this course?
This course is ideal for individuals interested in mastering big data technologies, specifically Apache Spark and Python. It's suitable for those looking to enhance their data analysis skills, such as data scientists, analysts, and IT professionals seeking to leverage Spark in their work. Individuals aiming to add Spark and PySpark to their resumes to increase their job marketability would also benefit from this course.
How does this course compare in depth to other big data courses?
This course offers a comprehensive look at Apache Spark, from setup to advanced data processing techniques. It covers the latest Spark 2.0 DataFrame syntax and includes sections on machine learning with MLlib, Spark SQL, and real-time data processing with Spark Streaming. Compared to other courses, it provides a wide-ranging exploration of Spark's capabilities, focusing on practical, hands-on projects to reinforce learning.
What specific tools and platforms are covered in the course?
The course covers several tools and platforms, including VirtualBox for local installation, AWS EC2 for cloud-based setups, Databricks for collaborative data science work, and AWS EMR for managed Hadoop frameworks. It also includes a section on setting up PySpark and using Jupyter Notebook for interactive data analysis.
Is there anything that is not covered in this course?
While the course extensively covers Apache Spark and its associated technologies, it does not delve into other big data frameworks like Hadoop MapReduce beyond a brief comparison with Spark. Additionally, it assumes prior general programming knowledge, so foundational programming concepts are not covered in depth outside of the Python crash course.
How much time should I expect to commit to complete this course?
The course comprises 63 lessons with a listed total runtime of 10h 35m 43s, including a variety of practical exercises and projects. Beyond the video time, students should be prepared to dedicate a significant amount of time to the hands-on projects and exercises, which are crucial for mastering the skills taught in the course.