
CIS787: FINAL REVIEW
Reza Zafarani (reza@data.syr.edu)
DATA MINING
Given lots of data, the data mining process discovers patterns and models that are:
1) Valid: hold on new data with some certainty
2) Useful (Actionable): should be possible to act on the item
3) Unexpected: non-obvious to the system
4) Understandable: humans should be able to interpret the pattern

Patterns are the relationships and summaries derived through a data mining exercise.
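One way to read "Valid" is that a pattern's strength, measured on the data it was found in, should roughly carry over to data it has not seen. The sketch below is a minimal illustration under assumptions: the function names, the 30% hold-out split, and the use of rule confidence as the strength measure are choices made for this example, not part of the course material.

```python
import random


def rule_confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    covered = [t for t in transactions if antecedent <= t]
    if not covered:
        return 0.0
    return sum(1 for t in covered if consequent <= t) / len(covered)


def holdout_confidence(transactions, antecedent, consequent,
                       holdout_fraction=0.3, seed=0):
    """Return (confidence on the seen split, confidence on the held-out split).

    The split ratio and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    shuffled = list(transactions)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    seen, unseen = shuffled[:cut], shuffled[cut:]
    return (rule_confidence(seen, antecedent, consequent),
            rule_confidence(unseen, antecedent, consequent))


if __name__ == "__main__":
    # Hypothetical toy transactions, each a set of items.
    data = [{"bread", "milk"}, {"bread"}, {"milk"}, {"bread", "milk", "eggs"}]
    print(rule_confidence(data, {"bread"}, {"milk"}))
    print(holdout_confidence(data, {"bread"}, {"milk"}))
```

A large gap between the two returned values would suggest the pattern does not hold on new data; on a dataset as small as the toy one above the comparison is not meaningful and only shows the mechanics.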

THIS CLASS: CIS787
This class overlaps with machine learning, statistics, artificial intelligence, and databases, but puts more stress on:
• Fundamental Data Mining Algorithms
• Scalability (big data)
• Algorithms
• Hands-on Experience
  • Go to www.awseducate.com

[Figure: Venn diagram with Data Mining at the overlap of Statistics, Machine Learning, and Database systems]
WHAT WE HAVE COVERED
• Data Types/Data Preprocessing
• Sampling/Stratified Sampling
• Curse of Dimensionality
• PCA
• Discretization
• Decision Trees
• Hunt’s/ID3/C4.5/Oblique Trees
• Gini/Entropy
• Evaluation (Accuracy/Recall/F-Measure/AUC)
• KNN/Naïve Bayes
• Support Vectors / SVM
• DTW

• Linear/Multivariate/Logistic/Lasso Regression
• Logistic Regression
• Bagging/Bootstrap Samples/Boosting/AdaBoost
• Partitional Clustering/K-means/Objective Function/Centroids
• Bisecting K-means
• Hierarchical Clustering (Min, Max, group average, centroid, Ward) -> Dendrogram
• Inversion/Globular Clusters/Chain Effect
• Lance-Williams Formula
• Minimum Spanning Tree Divisive Clustering
• SSE/Cohesion/Separation/Silhouette Index
• DBScan/Chameleon
• Frequent Itemsets
• Support/Confidence/Lift
• Apriori
• Maximal/Closed Itemsets

• Rank of a matrix
• SVD/Dimensionality Reduction with SVD
• SVD and Eigen Decomposition
• CUR decomposition
• Power method
• MapReduce/Mapper/Reducer/Combiner
• Sampling a fixed proportion
• Reservoir Sampling
• Bloom Filter
• Flajolet Martin
• AMS method for computing moments
• Shingling/Minhashing/LSH
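As a quick refresher on one of the streaming topics listed above, here is a minimal sketch of reservoir sampling (the classic "Algorithm R" variant), which keeps a uniform random sample of k items from a stream of unknown length in a single pass. The function name and the `seed` parameter are illustrative assumptions, not taken from the course material.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Uses O(k) memory and one pass over the stream.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot uniformly from 0..i (inclusive); the new item is
            # kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

If the stream has fewer than k items, the function simply returns all of them in order.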
WHAT WE HAVEN’T COVERED
• Data Preprocessing
  • Wavelet Transforms (Haar)
  • * MDS / Multidimensional Scaling
  • * Embedding Techniques
    • ISOMAP!
    • MLLE
• Similarity-based methods
  • Similarity between graphs
• Other Association Rule Mining methods
  • * FP-Tree/FP-Growth

• Other clustering methods
  • CLARANS/DENCLUE/NMF/BIRCH/CURE/CLIQUE/PROCLUS/ORCLUS
• Classification
  • Fisher LDA
  • Rule Induction
  • Bayesian Networks
  • Kernel Methods
  • Neural Networks
    • MLP/Perceptron/Hebbian Learning

• Semi-Supervised Learning
  • Co-training/Self-training
• Active Learning
• Ensembles
  • Random Forest
• Clustering data streams
  • STREAM/CluStream
• Text Mining
  • Co-clustering
  • PLSA
  • Rocchio Method
  • Topic Models
• Discrete Sequences
  • HMMs
  • Prob. Suffix Trees

• Spatial Data
  • Trajectory Mining
• Graph Mining
  • Graph Isomorphism
  • MCGs
• Web Mining
  • Collaborative Filtering
  • SimRank/TrustRank
  • PageRank/HITS
• Social Network Analysis
  • Community Detection
  • Collective Classification
• Privacy Preserving Algorithms
  • K-anonymity / Samarati’s method
WHAT CAN I READ TO KNOW MORE
1. Han, Jiawei, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. Elsevier, 2011.
2. Aggarwal, Charu C. Data Mining: The Textbook. Springer, 2015.
3. Zaki, Mohammed J., and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, 2014.
4. Quinlan, J. Ross. C4.5: Programs for Machine Learning. Elsevier, 2014.
5. Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. Springer, 2001.
6. Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
7. Schölkopf, Bernhard, and Alexander J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
8. Koller, Daphne, and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
ENGINEERING AND COMPUTER SCIENCE | SYRACUSE UNIVERSITY 6
WE DON’T NEED NO BOOKS!

Stanford - Machine Learning Course
https://www.youtube.com/playlist?list=PLA89DCFA6ADACE599

Caltech - Machine Learning Course - CS 156
https://www.youtube.com/playlist?list=PLD63A284B7615313A



COMPETITIONS
Kaggle.com

KDD Cup
• http://www.kdd.org/kdd-cup
• Participate in KDD Cups!

CIKM Cup



WHAT’S NEXT: 3RD EXAM



WHAT’S AFTER THE CLASS
[possibility] Advanced Data Mining (Fall 2018)
  • Sparse Learning / More math and stats
  • Large-Scale Machine Learning/Sketching/etc.
  • Graph Mining
[possibility] Machine Learning (Fall 2018)
[possibility] Spectral Graph Theory (Fall 2018)
[possibility] or the same course, but more refined
[possibility] Social Media Mining (Spring 2019)
  • Mine Big Data on Social Media
  • Study Networks, Users, Influentials, Content on Social Media
  • How does information propagate on Social Networks?
  • Measure Influence, Model Human Behavior, Determine Communities

FINALLY!

• You have done a lot!!!
  • Spent time answering questions, proving results, and preparing for quizzes
  • Have implemented a number of methods
  • And did great on the 3rd exam!
• I hope you have learned a lot!

Thank you for the Hard Work!!!