BUDT 758B: Big Data and AI for Business (Fall 2020)

    Department of Decision, Operations & Information Technologies
    University of Maryland, College Park

    Instructor:  Kunpeng Zhang  (kpzhang@umd.edu)

    Course Syllabus

    Lecture-Discussion: Tuesday, Thursday (Zoom ID: 3206804434)
                        2:00--3:15, Online (section 01)
                        3:30--4:45, Online (section 02)
                        5:00--6:15, Online (section 03)
           Office hours: Monday (1:00 - 3:00pm), Online

    TA: 
    Session 0501 - Mingwei Sun (ms1991@umd.edu): Thursday 6:00 - 7:00PM: (Zoom ID: 395294617761)

    Session 0502 - Zhonghao Li (zhonghao.li@marylandsmith.umd.edu): Wednesday 3:00 - 4:00PM: (Zoom ID: 36811380544)

    Session 0503 - Adithya Solai (adithyasolai7@gmail.com): Friday 2:00 - 3:00PM: (Zoom ID: 37531538659)

Prerequisites

Data Mining Algorithms: supervised/unsupervised learning techniques
Basic programming - Java or scripting languages: Python/Perl
Database query language: SQL
Algorithm and data structures
Probability, Statistics, and Matrices

Useful Documentation and Data

Apache Hadoop Online Documentation: http://hadoop.apache.org/docs/current/
Apache Hbase: http://hbase.apache.org/
Apache Hive: http://hive.apache.org/
Apache Mahout - Scalable Machine Learning Library: http://mahout.apache.org/
Apache Spark: https://spark.apache.org/
Deep Learning: http://deeplearning4j.org/
Stanford Large Network Dataset Collection: http://snap.stanford.edu/data/
Some Time Series Data Collections: http://www-bcf.usc.edu/~liu32/data.htm
Mahout Testing Data: https://cwiki.apache.org/confluence/display/MAHOUT/Collections
Collecting Online Data (Facebook Graph API and Twitter Stream API)
Facebook Graph API
Twitter Stream API
Quandl: financial, economic and social datasets
http://kevinchai.net/datasets
Kaggle: http://www.kaggle.com/competitions

Class Lectures

1st Week Introduction to Big Data
Lecture Notes: Introduction

Welcome to BUDT 758B: Big Data Analytics. Big data is a newly emerging topic that has become popular in various domains, including business, computer science, biological science, climate science, and others. The volume of user-generated data is growing rapidly, which requires scalable methods and platforms to collect, store, and analyze it in order to support informed decisions.

This is a graduate-level class. Knowledge of data mining and some basic math (probability, statistics) is required. Many online resources can help you understand this area; no textbook is specified. During the first week, we will discuss: (1) the syllabus; (2) an introduction to big data; (3) examples of successful applications of big data techniques; (4) the business value of big data and AI.

The required reading for the first week is the course syllabus.

2nd Week Business Value and Internet of Things
Lecture Notes: Business Value | IoT

3rd Week Review of Machine Learning Algorithms
Lecture Notes: ML Review

4, 5, 6, 7th Week Deep Learning (1)
Lecture Notes: DL Foundation | Autoencoder | Word2vec

Deep learning (also known as deep structured learning, hierarchical learning, or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data. In these lectures, we will introduce the basics of deep learning, including how to build deep neural networks and perform parameter learning, as well as autoencoders and word2vec.
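As a rough illustration of parameter learning, the sketch below fits a single sigmoid unit by batch gradient descent on the log-loss. It is a simplification and not the course's lab code: a deep network stacks many such units and propagates the same kind of gradient through its layers. The function names, the OR toy dataset, the learning rate, and the epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Output of one sigmoid unit for input vector x."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(samples, epochs=2000, lr=1.0):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in samples:
            # (prediction - label) is the log-loss gradient w.r.t. the pre-activation
            err = predict(weights, bias, x) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        # Step against the average gradient
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias


# Hypothetical toy dataset: the logical OR function
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(OR_DATA)
predictions = [round(predict(weights, bias, x)) for x, _ in OR_DATA]
```

After training, `predictions` holds the rounded outputs for the four OR inputs, which can be compared against the labels in `OR_DATA`.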

DL foundation lab (Solution)
Dropout lab
Spark Lab

8, 9th Week Deep Learning (2)
Lecture Notes: CNN, RNN

Deep learning (also known as deep structured learning, hierarchical learning, or deep machine learning) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data. We will introduce variants of network architectures, such as CNN and RNN (LSTM).
Image classification (Solution)
Sentiment classification using LSTM (Solution)
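To make the CNN building block concrete, here is a minimal sketch of a single convolution step followed by a ReLU, written in plain Python. It assumes no padding and a stride of 1, and like most deep learning libraries it computes a cross-correlation (the kernel is not flipped). The image, the kernel, and the function names are hypothetical; a real CNN learns its kernels and applies many of them per layer.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Element-wise max(0, v), the usual nonlinearity after a convolution."""
    return [[max(0, v) for v in row] for row in feature_map]


# Hypothetical 3x4 image: dark on the left, bright on the right
IMAGE = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to dark-to-bright vertical edges
VERTICAL_EDGE = [
    [-1, 1],
    [-1, 1],
]
feature_map = relu(conv2d_valid(IMAGE, VERTICAL_EDGE))
# feature_map == [[0, 2, 0], [0, 2, 0]]: the response peaks at the edge column
```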

10th Week Architecture of Hadoop
Lecture Notes: Hadoop

This week, we will discuss the framework and architecture of the Hadoop system. In addition, we will have an in-class lab on installing and configuring a pseudo-distributed Hadoop system, covering how to configure the environment of the Hadoop daemons, site-specific configuration of the daemons, how to start and stop the Hadoop system, and how to monitor it. Many online tutorials and documentation are available; please read them in advance if you have time. Please bring your laptop to class.

11th Week MapReduce Programming
Lecture Notes: MapReduce

This week, we discuss how to write a MapReduce program and run it on Hadoop. A program has three important components: the Mapper, Reducer, and Driver classes. To make a MapReduce program more efficient, operations such as the Combiner, the Partitioner, and sorting are also introduced. To handle specific tasks, you can write your own Writable data types and your own InputFormat and OutputFormat.
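The roles of these components can be sketched with an in-process word count in plain Python. This is only a simulation of the data flow, not the Hadoop Java API: the function names are illustrative, the partitioner is a simple deterministic stand-in for a hash partitioner, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def combiner(pairs):
    """Combiner: pre-aggregate one mapper's output before the shuffle."""
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())


def partitioner(key, num_reducers):
    """Partitioner: decide which reducer receives a key (byte-sum stand-in for a hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Reduce phase: sum all partial counts for one key."""
    return key, sum(values)


def run_job(lines, num_reducers=2):
    """Driver: wire map, combine, partition, sort, and reduce together."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for line in lines:
        for key, value in combiner(mapper(line)):
            partitions[partitioner(key, num_reducers)][key].append(value)
    results = {}
    for partition in partitions:
        for key in sorted(partition):  # keys reach each reducer in sorted order
            word, total = reducer(key, partition[key])
            results[word] = total
    return results


word_counts = run_job(["big data big ideas", "data beats ideas"])
# word_counts == {"big": 2, "data": 2, "ideas": 2, "beats": 1} (key order may differ)
```

The combiner cuts the number of pairs that cross the shuffle: the first line emits four pairs from the mapper but only three after local aggregation.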

12th Week Hive, Impala, Pig, and Sqoop
Lecture Notes: Hive Impala, Pig Sqoop

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time, this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL. This week discusses Hive, including an introduction to Hive, the Hive architecture, HiveQL, and installation and configuration, with similar coverage of Pig and Sqoop.
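For a feel of the kind of SQL-style aggregation HiveQL expresses, the sketch below runs a simple GROUP BY query using Python's built-in sqlite3 module as a stand-in engine. This is an assumption made so the example runs without a cluster: it does not connect to Hive, the `page_views` table and its rows are hypothetical, and Hive's table-creation syntax differs from SQLite's. The SELECT statement itself uses only constructs that HiveQL also supports.

```python
import sqlite3

# In-memory SQLite database standing in for a Hive table (illustration only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "home", 30),
        ("u2", "home", 45),
        ("u1", "pricing", 120),
        ("u3", "pricing", 60),
    ],
)

# Views and total time per page; Hive would compile a query like this into distributed jobs
QUERY = """
SELECT page, COUNT(*) AS views, SUM(seconds) AS total_seconds
FROM page_views
GROUP BY page
ORDER BY page
"""
rows = conn.execute(QUERY).fetchall()
conn.close()
# rows == [("home", 2, 75), ("pricing", 2, 180)]
```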

13th Week Spark RDD
Lecture Notes: Spark RDD and DataFrame

Apache Spark is a fast and general engine for big data processing, with built-in modules for streaming, SQL, machine learning, and graph processing. This week, we will cover the following material: an introduction to Spark, working with RDDs, and building and running Spark applications.
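One idea central to working with RDDs is that transformations such as map and filter are lazy, while actions such as collect and reduce trigger the computation. The toy class below imitates that behavior in plain Python so it runs without a Spark installation. `MiniRDD` is a hypothetical name and not part of Spark; it mirrors only the method names of the real RDD API and ignores partitioning, distribution, and fault tolerance.

```python
from functools import reduce as fold


class MiniRDD:
    """Toy stand-in for an RDD: records transformations, runs them only on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps

    def map(self, fn):
        """Transformation: returns a new MiniRDD, computes nothing yet."""
        return MiniRDD(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        """Transformation: returns a new MiniRDD, computes nothing yet."""
        return MiniRDD(self._source, self._steps + (("filter", fn),))

    def _iterate(self):
        """Apply the recorded steps to each element of the source, in order."""
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                yield item

    def collect(self):
        """Action: run the pipeline and return all results as a list."""
        return list(self._iterate())

    def reduce(self, fn):
        """Action: run the pipeline and fold the results with fn."""
        return fold(fn, self._iterate())


calls = []  # records when square() actually runs


def square(x):
    calls.append(x)
    return x * x


rdd = MiniRDD(range(1, 6)).map(square).filter(lambda v: v % 2 == 1)
calls_before_action = len(calls)  # 0: nothing has been computed yet
result = rdd.collect()  # [1, 9, 25]
total = rdd.reduce(lambda a, b: a + b)  # 35
```

Each action re-runs the whole pipeline from the source, which is why Spark offers caching for RDDs that are reused.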

14, 15th Week Spark SQL/ML/GraphX
Lecture Notes: Spark ML
Spark Lab