Recommended textbooks
Prerequisites
- Data Mining Algorithms: supervised/unsupervised learning techniques
- Basic programming - Java or scripting languages: Python/Perl
- Database query language: SQL
- Algorithms and data structures
- Probability, Statistics, and Matrices
Useful Documentation and Data
Class Lectures
1st Week Introduction to Big Data
Lecture Notes: Introduction
Welcome to BUDT 758B: Big Data Analytics. Big data is an emerging topic that has become popular in many domains, including business, computer science, biological science, and climate science. User-generated data is growing rapidly, so we need scalable methods and platforms to collect, store, and analyze it in order to support informed decisions.
This is a graduate-level class. Knowledge of data mining and basic math (probability and statistics) is required. Many online resources can help you understand this area, and no textbook is specified. During the first week, we will discuss: (1) the syllabus; (2) an introduction to big data; (3) examples of successful applications of big data techniques; (4) the business value of big data and AI.
The required reading for the first week is the course syllabus.
2nd Week Business Value and Internet of Things
Lecture Notes: Business Value | IoT
3rd Week Review of Machine Learning Algorithms
Lecture Notes: ML Review
4th to 7th Week Deep Learning (1)
Lecture Notes: DL Foundation | Autoencoder | Word2vec
Deep learning (also known as deep structured learning, hierarchical learning, or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data. In these lectures, we introduce the basics of deep learning: how to build deep neural networks and learn their parameters, followed by autoencoders and word2vec.
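As a rough illustration of what "building a network and learning its parameters" involves, here is a minimal sketch of a one-hidden-layer network trained by gradient descent with NumPy. The XOR data, layer sizes, learning rate, and step count are assumptions chosen for illustration and are not taken from the lecture notes or labs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: XOR, which a network without a hidden layer cannot fit.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Parameters: 2 inputs -> 4 hidden units -> 1 output (sizes are arbitrary).
W1 = rng.normal(scale=1.0, size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(scale=1.0, size=(4, 1))
b2 = np.zeros((1, 1))

learning_rate = 0.5  # assumed value, not tuned
losses = []
for step in range(5000):
    # Forward pass: compute hidden activations and predictions.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((y_hat - y) ** 2)))

    # Backward pass: chain rule applied to the mean squared error.
    d_out = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)
    d_hidden = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent update of every parameter.
    W2 -= learning_rate * (h.T @ d_out)
    b2 -= learning_rate * d_out.sum(axis=0, keepdims=True)
    W1 -= learning_rate * (X.T @ d_hidden)
    b1 -= learning_rate * d_hidden.sum(axis=0, keepdims=True)

print("initial loss:", losses[0], "final loss:", losses[-1])
```

Deep learning frameworks compute the backward pass automatically; writing it by hand once makes it clearer what those frameworks do.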
DL foundation lab (Solution)
Dropout lab
Spark Lab
8th and 9th Week Deep Learning (2)
Lecture Notes: CNN, RNN
Continuing the deep learning unit, we introduce variants of network architecture, such as convolutional neural networks (CNN) and recurrent neural networks (RNN), including LSTM.
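The core operation of a CNN layer can be sketched in a few lines of NumPy: a small kernel slides over the image and produces a feature map. The image and the edge-detecting kernel below are hand-picked assumptions for illustration; in a trained CNN the kernel values are learned.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the
    elementwise products at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


# 4x4 image with a vertical edge between columns 1 and 2.
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
# Kernel that responds where intensity increases from left to right.
kernel = np.array([[-1, 1], [-1, 1]], dtype=float)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The feature map is nonzero only in the middle column, which is where the window straddles the edge. Note that, as in most deep learning libraries, the kernel is not flipped, so this is strictly a cross-correlation.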
Image classification (Solution)
Sentiment classification using LSTM (Solution)
10th Week Architecture of Hadoop
Lecture Notes: Hadoop
This week, we will discuss the framework and architecture of the Hadoop system. We will also have an in-class lab on installing and configuring a pseudo-distributed Hadoop system, covering how to configure the environment of the Hadoop daemons, how to set their site-specific configuration, how to start and stop the Hadoop system, and how to monitor it. Many online tutorials and documentation are available; please read them in advance if you have time. Please bring your laptop to class.
11th Week MapReduce Programming
Lecture Notes: MapReduce
This week, we discuss how to write a MapReduce program and run it in Hadoop. A program has three important components: the Mapper, Reducer, and Driver classes. To make a MapReduce program more efficient, we also introduce operations such as the Combiner, the Partitioner, and sorting. To handle specific tasks, you can write your own Writable data types and your own InputFormat and OutputFormat.
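To make the roles of these components concrete, the following is a single-process Python sketch of a word count job. It only mimics the data flow (map, combine, partition, sort, reduce); it is not the Hadoop Java API, and all function names here are hypothetical.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combiner: pre-aggregate one mapper's output locally, so that fewer
    pairs have to be shuffled to the reducers."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def partitioner(key, num_reducers):
    """Partitioner: choose the reducer for a key. A character sum is used
    here only because it is deterministic across runs."""
    return sum(ord(c) for c in key) % num_reducers


def reducer(word, counts):
    """Reduce phase: sum all counts collected for one word."""
    return word, sum(counts)


def run_job(lines, num_reducers=2):
    """Driver: wire map -> combine -> partition/shuffle -> sort -> reduce."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:  # each line stands in for one input split
        for word, count in combiner(mapper(line)):
            partitions[partitioner(word, num_reducers)][word].append(count)
    results = {}
    for partition in partitions:
        for word in sorted(partition):  # keys reach a reducer in sorted order
            key, total = reducer(word, partition[word])
            results[key] = total
    return results


word_counts = run_job(["big data big ideas", "data beats ideas"])
print(word_counts)
```

In Hadoop, the same roles are filled by Mapper and Reducer classes, with the Driver configuring and submitting the job; the framework handles the shuffle and sort between the two phases.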
12th Week Hive, Impala, Pig, and Sqoop
Lecture Notes: Hive Impala | Pig Sqoop
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query it using a SQL-like language called HiveQL. At the same time, this language allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. This week covers Hive (introduction, architecture, HiveQL, and installation and configuration), with similar coverage of Pig and Sqoop.
13th Week Spark RDD
Lecture Notes: Spark RDD and DataFrame
Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. This week, we will cover an introduction to Spark, working with RDDs, and building and running Spark applications.
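One idea behind RDDs is that transformations are lazy and only an action triggers computation. The toy class below sketches that behavior on a single machine in plain Python. It is a stand-in written for illustration, not the PySpark API: `ToyRDD` and its method names are hypothetical, loosely modeled on RDD operations, and it has no partitioning or fault tolerance.

```python
class ToyRDD:
    """Single-machine stand-in for an RDD: transformations only record
    work to be done; actions such as collect() actually run it."""

    def __init__(self, source):
        self._source = source  # zero-argument callable returning an iterator

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: each returns a new ToyRDD and computes nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()))

    def flat_map(self, fn):
        return ToyRDD(lambda: (y for x in self._source() for y in fn(x)))

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._source() if predicate(x)))

    def reduce_by_key(self, fn):
        def compute():
            merged = {}
            for key, value in self._source():
                merged[key] = fn(merged[key], value) if key in merged else value
            return iter(merged.items())
        return ToyRDD(compute)

    # Action: runs the whole chain of recorded transformations.
    def collect(self):
        return list(self._source())


lines = ToyRDD.parallelize(["spark is fast", "spark is general"])
pairs = (lines.flat_map(str.split)
              .map(lambda word: (word, 1))
              .reduce_by_key(lambda a, b: a + b))  # nothing has run so far
result = dict(pairs.collect())                     # the action triggers the work
print(result)
```

Because each transformation stores only a recipe, calling `collect()` twice recomputes the chain from the source data, which mirrors how an uncached RDD is recomputed from its lineage.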
14th and 15th Week Spark SQL/ML/GraphX
Lecture Notes: Spark ML
Spark Lab
These two weeks cover Spark's higher-level built-in modules: Spark SQL, MLlib, and GraphX.