Advanced Statistics and Data Mining Summer School

Madrid, July 6th to 17th, 2009
This summer school is organized by the Artificial Intelligence
Department of the Computer Science Faculty of the Univ. Politécnica de Madrid.
It continues the summer school organized by Univ. San Pablo - CEU for the
previous three years, making this its 4th edition.
It is an intensive course that aims to provide attendees with an
introduction to the theoretical foundations as well as the practical
applications of some of the modern statistical analysis techniques
currently in use. The summer school lasts two weeks and is divided into 18
courses. Each course has 8 theoretical classes and 7 practical classes
in which each technique is put into practice with a computer program.
Students may register only for those courses of interest to them.
Academic interest: this course complements
the background of students from a variety of disciplines with the
theoretical and practical fundamentals of the modern techniques
employed in the analysis and modelling of large data sets. Its academic
interest is high, since no specific university degree covers this kind
of technique.
Scientific interest: scientists
in most fields (engineering, life sciences, economics, etc.)
are confronted with the problem of extracting conclusions from a set of
experimental data. This course supplies experimentalists with
sufficient resources to select the appropriate analysis
technique and to apply it to their specific problem.
Professional interest: the
application of modern data analysis is widespread in industry,
since it is needed in nearly all disciplines. It is also a topic in
high demand in the job market: a search on Monster.com as of
March 2008 retrieved more than 5000 offers for “data
analysis”, more than 1784 offers for “data mining”,
and 438 offers for “statistical consultant”.
|
The goal of this summer school is to complement the technical
background of attendees in the field of data analysis and modelling.
The courses are open to any student or professional wishing to broaden
their knowledge of a topic that is increasingly relevant to nearly all
productive areas (Computer Science, Engineering, Pharmacy, Medicine,
Economics, Statistics, etc.).
A second objective of the summer school is to acquaint students
with a set of computational tools in which to try the
techniques studied during the course on practical problems, either
brought by the students themselves or proposed by the summer school professors.
Note that the summer school is about advanced techniques, and the
courses provide insight into modern methods that, almost
by definition, are not mathematically trivial. Although the emphasis
is placed on their use rather than on the mathematics behind them,
attendees should not be afraid or surprised to see some mathematics. Teachers
will make the course content accessible to students of all
backgrounds. To make the courses easier to follow, students are expected to
be familiar with certain concepts, described as
"prerequisites" for each course, and they are encouraged to read the
"before attending" documents to benefit as much as possible from the
course.
|
All classes will be given in
English. Courses 1, 2 and 3 run simultaneously, as do courses 4, 5 and 6,
and so on; therefore, a student cannot register for two courses given at
the same time.
Week 1 (July 6th - July 10th,
2009)
9:00-12:00
Course 1: Bayesian networks (15
h), Practical sessions: Hugin, Elvira, Weka, LibB (see sample)
Prof. Mª Concepción Bielza, Pedro Larrañaga (Univ. Politécnica
de Madrid)
Theory: Block 6, Room 6105; Practice: Block 4, Room Los Verdes
Course 2: Multivariate data
analysis (15 h), Practical sessions: R (see sample)
Prof. Carlos Óscar S. Sorzano (Univ. San Pablo CEU, CSIC)
Theory: Block 6, Room 6101; Practice: Block 4, Room Monje
Course 3: Dimensionality
reduction (15 h), Practical sessions: MATLAB (see sample)
Prof. Alberto Pascual Montano (CSIC)
13:00-16:00
Course 4: Supervised pattern
recognition (Classification) (15 h), Practical sessions: Weka (see sample)
Prof. Pedro Larrañaga (Univ. País Vasco)
Theory: Block 6, Room 6105; Practice: Block 4, Room Los Verdes
Course 5: Introduction to MATLAB (15 h), Practical sessions: MATLAB
Prof. Rubén
Armañanzas (Univ. Politécnica Madrid)
Course 6: Data Mining: a
practical perspective (15 h), Practical sessions: Weka, R, MATLAB
Prof. Alberto Pascual Montano (CSIC), Miguel Vázquez, Mariana Lara,
Pedro Carmona (Univ. Complutense)
Theory: Block 6, Room 6101; Practice: Block 4, Room Monje
16:30-19:30
Course 7: Time series analysis
(15 h), Practical sessions: R (see sample)
Prof. Carlos Óscar S. Sorzano (Univ. San Pablo CEU)
Theory: Block 6, Room 6105; Practice: Block 4, Room Los Verdes
Course 8: Neural networks (15
h), Practical sessions: MATLAB (see sample)
Prof. Santiago
Falcón (Unión Fenosa)
Course 9: Introduction to SPSS (15 h), Practical sessions: SPSS
Prof. Concepción
Bielza, Antonio Jiménez (Univ. Politécnica
Madrid)
Week 2 (July 13th - July 17th,
2009)
9:00-12:00
Course 10: Regression (15 h),
Practical sessions: SPSS (see sample)
Prof. Carlos
Rivero Rodríguez (Univ. Complutense de Madrid)
Course 11: Practical Statistical
Questions (15 h), Practical sessions: study of cases (without computer)
(see sample)
Prof. Carlos Óscar S. Sorzano (Univ. San Pablo CEU,CSIC)
Theory: Block 6, Room 6105
Course 12: Missing data and outliers (15 h), Practical sessions: R
Prof. Román
Mínguez (Univ. Castilla-La Mancha)
13:00-16:00
Course 13: Hidden Markov Models (15 h), Practical sessions: HTK (see sample)
Prof. Agustín
Álvarez (Univ. Politécnica de Madrid)
Theory: Block 6, Room 6105; Practice: Block 4,
Room Los Verdes
Course 14: Statistical inference
(15 h), Practical sessions: SPSS (see sample)
Prof. Román
Mínguez (Univ. Castilla-La Mancha)
Theory: Block 6, Room 6101; Practice: Block 4,
Room Monje
Course 15: Feature subset
selection (15 h), Practical sessions: Weka, R, MATLAB
Prof. Rubén
Armañanzas, Víctor Robles, Pedro Larrañaga, (Univ. Politécnica de Madrid)
16:30-19:30
Course 16: Introduction to R (15 h), Practical sessions: R
Prof. Pedro Carmona (Univ. Complutense Madrid)
Theory: Block 6, Room 6101; Practice: Block 4, Room Monje
Course 17: Unsupervised pattern
recognition (clustering) (15 h), Practical sessions: MATLAB (see sample)
Prof. Carlos Oscar
S. Sorzano (Univ. San Pablo CEU, CSIC)
Theory: Block 6, Room 6105; Practice: Block 4, Room Los Verdes
Course 18: Evolutionary
computation (15 h), Practical sessions: MATLAB
Prof. Daniel Manrique, Roberto Santana (Univ. Politécnica de Madrid)
A more detailed program is given below.
|
- Week 1: July 6th - July 10th, 2009
- Week 2: July 13th - July 17th, 2009
The timetable of each week is as follows:

Timetable   | Monday-Friday Week 1 | Monday-Friday Week 2
9:00-12:00  | Courses 1, 2 and 3   | Courses 10, 11 and 12
13:00-16:00 | Courses 4, 5 and 6   | Courses 13, 14 and 15
16:30-19:30 | Courses 7, 8 and 9   | Courses 16, 17 and 18
|
The summer school takes place at the Computer Science Faculty of
the Univ. Politécnica de Madrid, located in Urb. Montepríncipe,
Boadilla del Monte, Madrid (Metro: Montepríncipe station on Metro Ligero
line 3, whose first stop, Colonia Jardín, connects with metro line 10;
buses: 571 and 573). Note that we do not provide lodging for
students from outside Madrid.
Students are recommended to stay in Madrid
and commute daily to the Faculty (about 40 minutes from Madrid, depending
on location). When asking for directions in Spanish, it might be
useful to ask for the "Facultad de Informática de la
Universidad Politécnica de Madrid".
|
Diplomas and academic recognition
|
All students will obtain an attendance diploma, and 1 ECTS credit
(European Credit Transfer System) will be awarded.
|
The price of each course in the
summer school is 150 euros. This fee includes attendance at the lectures
and the educational materials.
To apply for the summer school, students
must send an email to the organizers (pedro.larranaga@fi.upm.es, coss.eps@ceu.es)
asking for course availability. After confirmation from the organizers,
they should make the payment and send another e-mail stating their name,
e-mail, telephone, institution, nationality (note that depending on
your nationality you may need a visa to enter Spain),
the courses paid for, and a copy (PDF) of the payment. Courses have a maximum
attendance of 40 people, and attendees will be selected in strict order of
payment date. Courses with fewer than 6 people will be cancelled.
The payment shall be made by bank transfer to
the following account:
To: Títulos propios de la UPM
Subject: Course identifier+Your name
Address: Madrid
Bank name: Banco Bilbao Vizcaya Argentaria
(BBVA)
Bank Address: c/Alcalá, 16
Account Number: 0182 2370 48 0201516245
SWIFT: BBVAESMMXXX
IBAN: ES0501822370480201516245
For official invoices, the fiscal data is:
Universidad
Politécnica de Madrid
CIF: Q2818015F
|
Coordinators of the course:
Pedro
Larrañaga
Facultad de Informática
Univ. Politécnica de Madrid
Campus Urb. Montepríncipe s/n
28660 Boadilla del Monte, Madrid
Spain
email: pedro.larranaga@fi.upm.es
Tel.: +34 91 336 7443
Fax: +34 91 352 4819
|
Carlos
Oscar Sánchez Sorzano
Escuela Politécnica Superior
Univ. San Pablo - CEU
Campus Urb. Montepríncipe
s/n
28668 Boadilla del Monte, Madrid
Spain
email:
coss.eps@ceu.es
Tel.: +34 91 372 4034
Fax: +34 91 372 4049
|
Collaborating institutions
|
Red temática de Investigación Cooperativa de Centros de Cáncer (RTICCC)
|
Detailed program
|
Course 1: Bayesian networks
|
Program
|
1. Bayesian networks basics
1.1. Reasoning with uncertainty
1.2. Probabilistic conditional independence
1.3. Correspondence between graph and model: D-separation
1.4. Probabilistic graphical models
1.5. Bayesian networks: properties
1.6. Building Bayesian networks (examples)
2. Inference in Bayesian networks
2.1. Queries in Bayesian networks: deductive, diagnostic and
intercausal reasoning
2.2. Exact inference:
- Brute force approach
- Variable elimination
- Message passing
2.3. Searching for explanations: abduction
2.4. Approximate inference
3. Learning Bayesian networks from data
3.1 Introduction
3.2 Factorization of the joint probability distribution
- Methods based on testing conditional
independence
- Methods based on score + search
- Hybrid methods
- Applications
3.3 Bayesian classifiers
- Introduction
- Naive Bayes
- Seminaive Bayes
- Tree augmented network
- K-dependence network
- Markov Blanket
- Applications
3.4 Clustering
- Introduction
- The EM algorithm
- The structural EM algorithm
- Applications
Practical demonstration: Hugin, Elvira,
Weka, BayesBuilder
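As a flavour of the exact-inference material in block 2, here is a minimal
base-R sketch that answers a query by brute-force enumeration of the joint
distribution. The sprinkler network and all of its probability tables are
illustrative inventions, not course material:

    # Toy network: Rain -> Sprinkler, and {Rain, Sprinkler} -> GrassWet
    p_rain <- c(T = 0.2, F = 0.8)                          # P(Rain)
    p_sprk <- matrix(c(0.01, 0.99, 0.40, 0.60), nrow = 2,  # P(Sprinkler | Rain)
                     dimnames = list(Sprinkler = c("T", "F"), Rain = c("T", "F")))
    p_wet  <- array(c(0.99, 0.01, 0.80, 0.20, 0.90, 0.10, 0.00, 1.00),
                    dim = c(2, 2, 2),                      # P(Wet | Sprinkler, Rain)
                    dimnames = list(Wet = c("T", "F"),
                                    Sprinkler = c("T", "F"), Rain = c("T", "F")))

    # The joint factorises as P(R) P(S|R) P(W|S,R); enumerate all 8 configurations
    grid   <- expand.grid(Rain = c("T", "F"), Sprinkler = c("T", "F"),
                          Wet = c("T", "F"))
    grid$p <- apply(grid, 1, function(x)
      p_rain[x["Rain"]] * p_sprk[x["Sprinkler"], x["Rain"]] *
      p_wet[x["Wet"], x["Sprinkler"], x["Rain"]])

    # Diagnostic query: P(Rain = T | Wet = T)
    sum(grid$p[grid$Rain == "T" & grid$Wet == "T"]) / sum(grid$p[grid$Wet == "T"])

Brute force scales exponentially in the number of variables, which is exactly
why the course spends time on variable elimination, message passing and
approximate inference.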
|
Bibliography
|
Learning Bayesian Networks by Richard E.
Neapolitan. Prentice Hall; 1st edition (2003)
Expert Systems and Probabilistic Network Models by E. Castillo, J.
Gutierrez, and A. Hadi. Springer-Verlag (1997)
Bayesian Networks and Decision Graphs (second edition) Finn V.
Jensen and Thomas D. Nielsen. Springer Verlag (2007)
|
Prerequisites
|
The attendee is expected to be familiar with basic notions of
probability and graphs.
|
Readings before coming
|
Attendees will benefit more from the course if they read the
“before attending” documents beforehand.
|
Course 2: Multivariate data analysis
|
Program
|
1. Introduction
1.1. Types of variables
1.2. Types of analysis and technique selection
1.3. Descriptors (mean, covariance matrix)
1.4. Variability and distance
1.5. Linear dependence
2. Data Examination
2.1. Graphical examination of the data
2.2. Missing Data
2.3. Outliers
2.4. Assumptions of multivariate analysis
3. Principal
component analysis (PCA)
3.1. Introduction
3.2. Component computation
3.3. Example
3.4. Properties
3.5. Extensions
4. Factor Analysis
4.1. Introduction
4.2. Factor computation
4.3. Example
4.4. Extensions
5. Multidimensional Scaling (MDS)
5.1. Introduction
5.2. Metric scaling
5.3. Example
5.4. Non-metric scaling
5.5. Extensions
6. Correspondence analysis
6.1. Introduction
6.2. Projection search
6.3. Example
6.4. Extensions
7. Multivariate Analysis of Variance (MANOVA)
7.1. Introduction
7.2. Computations (1-way)
7.3. Computations (2-way)
7.4. Post-hoc tests
8. Canonical correlation
8.1. Introduction
8.2. Construction of the canonical variables
8.3. Example
8.4. Extensions
Practical
demonstration: R
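Since the practical sessions of this course use R, here is a minimal taste of
the PCA block, using R's built-in prcomp function on the standard iris data
set; the data set and plotting choices are ours, purely for illustration:

    x   <- iris[, 1:4]                             # four numeric measurements
    pca <- prcomp(x, center = TRUE, scale. = TRUE) # standardise, then rotate

    summary(pca)             # variance explained by each component
    pca$rotation             # loadings of variables on components
    scores <- pca$x[, 1:2]   # data projected onto the first two PCs

    # The three species separate quite well in just two dimensions
    plot(scores, col = iris$Species, pch = 19, xlab = "PC1", ylab = "PC2")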
|
Bibliography
|
Multivariate Data Analysis (6th
Edition) by Joseph F. Hair, Bill Black, Barry Babin, Rolph E. Anderson,
Ronald L. Tatham. Prentice Hall; 6 edition (2005)
Análisis de datos multivariantes by Daniel Peña. McGraw-Hill (2002)
|
Prerequisites
|
The attendee is expected to be familiar with univariate
statistics, the univariate Gaussian, ANOVA, and correlation.
|
Readings before coming
|
Attendees will benefit more from the course if they read the
“before attending” documents beforehand.
|
Course 3: Dimensionality reduction
|
Program
|
1. Introduction
1.1. Why dimensionality reduction
1.2. Curse of dimensionality
1.3. Feature selection vs. feature extraction
1.4. Linear vs. non-linear
1.5. Accuracy vs. Interpretation
2. Matrix factorization methods
2.1. Principal Component Analysis
2.2. Singular Value Decomposition
2.3. Factor analysis
2.4. Non-negative matrix factorization
2.5. Non-negative tensor factorization
2.6. Independent Component Analysis and Blind Source
Separation techniques
3. Clustering methods
3.1. Motivation and theoretical basis
3.2. K-means
3.3. Fuzzy c-means
3.4. Hierarchical clustering
4. Projection methods
4.1. Random mapping
4.2. Sammon mapping
4.3. Self-organizing maps
4.4. Isomap
4.5. Locally linear embedding (LLE)
5. Applications
5.1. Pattern recognition
5.2. Image classification
5.3. Gene expression analysis
5.4. Text mining
6. Practical exercises
6.1. Image classification
6.2. Gene expression analysis
6.3. Scientific text analysis
Practical
Demonstration: MATLAB and Web applications
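The practical demonstrations use MATLAB, but the underlying linear algebra is
tool-independent. As an illustration of the matrix factorization block, the
base-R sketch below builds the best rank-2 approximation of a synthetic
matrix via the singular value decomposition (all choices are illustrative):

    set.seed(1)
    A <- matrix(rnorm(100 * 20), nrow = 100)  # synthetic 100 x 20 data matrix

    s   <- svd(A)                             # A = U diag(d) V'
    k   <- 2                                  # target dimensionality
    A_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

    # Eckart-Young: the Frobenius error equals the norm of the discarded
    # singular values
    sqrt(sum((A - A_k)^2))
    sqrt(sum(s$d[-(1:k)]^2))                  # same number, no matrix needed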
|
Bibliography
|
Mitchell, T. (1997) Machine Learning. McGraw-Hill.
Russell, S., Norvig, P., (2003) Artificial Intelligence: A Modern
Approach, 2nd Ed. Prentice
Hall.
|
Prerequisites
|
The attendee is expected to be familiar with basic statistical
concepts.
|
Reading before coming
|
Attendees will benefit more from this course if they read the
“before attending” documents beforehand.
|
Course 4: Supervised pattern recognition (Classification)
|
Program
|
1. Introduction
1.1. Supervised classification
1.2. Semisupervised classification
1.3. Partially supervised classification
1.4. Unsupervised classification
2. Assessing the Performance of Supervised Classification Algorithms
2.1. Error generalization
2.2. Area under the ROC curve
2.3. Brier score
2.4. Holdout method
2.5. k-fold cross validation
2.6. Bootstrapping
3. Classification techniques
3.1. Discriminant analysis
3.2. Classification trees
3.3. Nearest neighbour classifier
3.4. Logistic regression
3.5. Bayesian network classifiers
3.6. Neural network classifiers
3.7. Support Vector Machines (SVM)
4. Combining Classifiers
4.1. Hybridizing classifiers
4.2. Basic methods:
- Fusion of label outputs
- Fusion of continuous-valued outputs
- Stacked generalization
- Cascading
4.3. Advanced methods:
- Bagging
- Randomization
- Boosting
- Hybrid classifiers
5. Comparing Supervised Classification Algorithms
5.1. Two classifiers in the same database
5.2. More than two classifiers in the same database
5.3. Two classifiers in multiple databases
5.4. More than two classifiers in multiple databases
Practical
demonstration: WEKA, SPSS
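As a small foretaste of block 2 (honest performance assessment), here is a
sketch of 10-fold cross-validation around a 3-nearest-neighbour classifier in
R, using the knn function from the standard 'class' package; the data set and
the choice k = 3 are illustrative only:

    library(class)                        # recommended package shipped with R
    set.seed(42)

    x <- iris[, 1:4]; y <- iris$Species
    fold <- sample(rep(1:10, length.out = nrow(x)))   # random fold labels

    errs <- sapply(1:10, function(f) {
      test <- fold == f
      pred <- knn(train = x[!test, ], test = x[test, ], cl = y[!test], k = 3)
      mean(pred != y[test])               # error on the held-out fold only
    })
    mean(errs)                            # cross-validated error estimate

Estimating the error on the training data itself would be optimistically
biased; resampling schemes such as this one, or the bootstrap, avoid that.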
|
Bibliography
|
Statistical Pattern Recognition by Andrew R. Webb.
John Wiley & Sons; 2nd edition (2002)
Kuncheva,
L. (2004) Combining Pattern Classifiers, Wiley.
|
Readings before coming
|
Recommended viewing: “An Introduction to
Pattern Classification” by E. Yom-Tov, IBM Haifa Research Lab, at http://videolectures.net/mlss03_tov_ipc
|
Course 5: Introduction to MATLAB
|
Program
|
1. Overview of the MATLAB suite
1.1. History
1.2. Elements and use of the Graphical User Interface
1.3. Editors
1.4. The help system
1.5. Computing with Matlab
1.6. Matlab peculiarities
2. Data structures and files
2.1. Arrays
2.2. Matrices
2.3. Cell arrays
2.4. Structure arrays
2.6. Operations using arrays/matrices
2.7. Importing / Exporting data
3. Programming in Matlab
3.1. Types of files
3.2. Scripts and .m code
3.3. Functions and design conventions
3.4. Operators
3.6. Imperative statements
3.7. Complex structures
3.8. Toolboxes
4. Visualization tools
4.1. Plots and subplots
4.2. Property editor
4.3. Command line customization
4.4. Advanced plots
4.5. Exporting figures
5. Some applications in pattern recognition
5.1. Clustering
5.2. Feature selection
5.3. Supervised classification
Practical
demonstration: MATLAB
|
Bibliography
|
Introduction
to Matlab 7 for Engineers. William Palm III. McGraw-Hill
Science/Engineering/Math, 2004. ISBN 978-0072548181.
|
Prerequisites
|
The attendee is supposed to be familiar with
imperative programming and basic pattern recognition concepts.
Familiarity with mathematical software packages and spreadsheets is also desirable.
|
Readings before coming
|
Attendees will get a precise idea of the MATLAB suite by watching the
following three videos before the course:
http://www.mathworks.com/demos/matlab/getting-started-with-matlab-video-tutorial.html
http://www.mathworks.com/demos/matlab/working-in-the-development-environment-matlab-video-tutorial.html
http://www.mathworks.com/demos/matlab/writing-a-matlab-program-matlab-video-tutorial.html
|
Course 6: Data mining: a practical perspective
|
Program
|
1. Introduction to Data Mining and Knowledge Discovery
1.1. What are data mining and knowledge discovery?
1.2. Preparing data for mining
1.3. Data warehouses
1.4. What kind of patterns can be mined?
1.5. Data cleaning and transformation
1.6. Online Analytical Processing (OLAP)
1.7. Visual data mining
1.8. Practical examples
2. Prediction in data mining
2.1. Regression of one dependent variable
2.2. Predictor variable selection and regularization
2.3. Non-linear fits
2.4. Model assessment
2.5. Real data examples
3. Classification
3.1. Two class classification
3.2. Evaluation of classification models
3.3. Interpretability. Support Vector Machines vs Random
Forest
3.4. Multiclass classification approaches
3.5. Unsupervised classification and Clustering
3.6. Real data examples
4. Association studies
4.1. Introduction
4.2. Frequent Itemset Mining
4.3. Association Rules
4.4. Application areas and Hands-on
5. Data mining in free-form texts: text mining
5.1. Why is text mining needed?
5.2. What is natural language processing?
5.3. Words, syntax and semantics
5.4. Information retrieval
5.5. Information extraction
5.6. Text categorization
5.7. Corpora
5.8. Applications
Practical demonstration: Biomedical text mining
online tools, open source data mining applications, R, Weka, MATLAB
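As a taste of the association-studies block, the base-R sketch below computes
the support and confidence that association-rule miners search for, on a
made-up five-transaction basket matrix; in the practical sessions dedicated
tools would be used instead:

    # One row per transaction, one column per item (invented toy data)
    tx <- matrix(c(1,1,0,1,  1,0,1,1,  0,1,1,0,  1,1,1,1,  1,1,0,0),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(NULL, c("bread", "milk", "beer", "eggs")))

    support <- function(items)     # fraction of transactions holding all items
      mean(rowSums(tx[, items, drop = FALSE]) == length(items))

    support(c("bread", "milk"))                     # support of the itemset
    support(c("bread", "milk")) / support("bread")  # confidence of {bread} => {milk}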
|
Bibliography
|
• Data Mining Techniques: For Marketing, Sales,
and Customer Relationship Management. Michael J. A. Berry, Gordon S.
Linoff. Wiley. Second Edition.
• Data Mining: Concepts and Techniques (The Morgan Kaufmann Series
in Data Management Systems). Jiawei Han, Micheline Kamber. Academic
Press.
• Data Mining: Practical machine learning tools and techniques with
Java implementations. I. H. Witten, E. Frank. Morgan Kaufmann, San Francisco,
2005.
• The Elements of Statistical Learning. Hastie, Tibshirani and
Friedman - Springer-Verlag, 2008
• Bart Goethals: Survey on Frequent Pattern Mining. http://www.adrem.ua.ac.be/~goethals/software/survey.pdf
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules
between sets of items in large databases. In Proceedings of the ACM
SIGMOD International Conference on Management of Data, pages 207-216,
Washington D.C., May 1993.
• Natural language processing for online applications: text
retrieval, extraction and categorization. Peter Jackson and Isabelle Moulinier.
John Benjamins Publishing Company. 2nd revised edition. 2002.
• Text mining for Biology and Biomedicine. Sophia Ananiadou and
John McNaught. Artech House Publisher. 2005.
• A Survey of Current Work in Biomedical Text Mining. Aaron M.
Cohen and William R. Hersh. Briefings in Bioinformatics. 6(1):57-71.
2005.
|
Course 7: Time series analysis
|
Program
|
1. Introduction
1.1. Areas of applications
1.2. Objectives of time series analysis
1.3. Components of time series
1.4. Descriptive analysis
1.5. Distributional properties: independence,
autocorrelation, stationarity.
1.6. Detection and removal of outliers
2. Trend and seasonal component analysis
2.1. Linear and non-linear regression
2.2. Polynomial fitting
2.3. Cubic spline fitting
2.4. Fourier representation of a sequence
2.5. Spectral representation of stationary processes
2.6. Detrending and filtering
3. Probability models for time series:
3.1. Random walk
3.2. Autoregressive model (AR)
3.3. Moving Average model (MA)
3.4. Mixed models (ARMA, ARIMA, FARIMA, SARIMA, Box-Jenkins,
ARMAX)
3.5. System identification and model families
3.6. Generalized Autoregressive Conditional
Heteroscedasticity (GARCH)
3.7. Parameter estimation
3.8. Order selection
3.9. Model checking
4. Forecasting and Data mining
4.1. Optimal forecasts
4.2. Forecasts for ARMA models
4.3. Analysis of the prediction error
4.4. State-space modelling
4.5. Mining of seasonal trends
4.6. Frequently occurring patterns
4.7. Connections between different time-series
Practical
demonstration: R
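The practical sessions use R, so here is a minimal sketch of the modelling
and forecasting workflow on R's built-in AirPassengers series; the model
orders below are illustrative choices, not a recommendation:

    y <- log(AirPassengers)               # log transform stabilises the variance

    fit <- arima(y, order = c(1, 1, 1),   # ARIMA(1,1,1) with a seasonal MA term
                 seasonal = list(order = c(0, 1, 1), period = 12))
    tsdiag(fit)                           # residual diagnostics (model checking)

    fc <- predict(fit, n.ahead = 24)      # forecast two years ahead
    ts.plot(exp(y), exp(fc$pred), lty = c(1, 2))   # back-transform and plot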
|
Bibliography
|
The Analysis of Time Series: An Introduction, by Chris
Chatfield. Chapman & Hall/CRC; 6th edition (2003)
Time series analysis, by James D. Hamilton. Princeton
University Press (1994)
Handbook of Time Series Analysis, Signal Processing, and Dynamics
by D. S.G. Pollock, Richard C. Green, Truong Nguyen. Academic Press
(1999)
|
Prerequisites
|
The attendee is expected to be familiar with the
concepts of correlation, regression, and inference.
|
Readings before coming
|
Attendees will benefit more from the course if they read the
“before attending” documents beforehand.
|
Course 8: Neural networks
|
Program
|
1. Introduction to the biological models. Nomenclature
1.1. The biological model
1.2. Artificial neural networks
1.3. Notation
1.4. The neural model
1.5. Architecture of neural networks
1.6. Learning mechanisms
1.7. First examples and geometrical representation.
2. Perceptron networks
2.1. Perceptron architecture
2.2. Decision contour
2.3. Learning rules
2.4. Classification examples
3. The Hebb rule
3.1. Linear associator and the Hebb rule
3.2. Pseudoinverse rule
4. Foundations of multivariate optimization. Numerical optimization
4.1. Mathematical foundations
4.2. Function optimization: minimization
4.3. Gradient algorithms
4.3.1. Gradient method
4.3.2. Conjugate gradient method
5. The Widrow-Hoff rule
5.1. Mathematical foundations
5.2. Widrow-Hoff algorithm
5.3. Examples
6. Backpropagation algorithm
6.1. Backpropagation
6.2. Backpropagation algorithm
6.3. Examples
7. Practical data modelling with neural networks
Practical
demonstration: MATLAB Neural network toolbox
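The course demos use the MATLAB Neural Network Toolbox; purely as an
illustration of the perceptron learning rule of block 2, here is a
self-contained base-R sketch on invented, linearly separable data:

    set.seed(1)
    n   <- 100
    x   <- cbind(1, matrix(runif(2 * n, -1, 1), ncol = 2))  # first column = bias
    lab <- ifelse(x[, 2] + x[, 3] > 0, 1, -1)    # labels from a known line

    w <- rep(0, 3)                               # initial weights
    for (epoch in 1:20)
      for (i in 1:n) {
        yhat <- ifelse(sum(w * x[i, ]) > 0, 1, -1)   # hard-threshold neuron
        w    <- w + 0.1 * (lab[i] - yhat) * x[i, ]   # update only on mistakes
      }

    mean(ifelse(x %*% w > 0, 1, -1) == lab)      # training accuracy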
|
Bibliography
|
Neural Networks: A Comprehensive Foundation by Simon
Haykin. Prentice Hall; 2nd edition (1998)
|
Course 9: Introduction to SPSS
|
Program
|
1. Introduction
1.1. Menu structure
1.2. Getting help
1.3. Basic operations with .sav data: read, save, Data
Editor
1.4. Running an analysis
1.5. Using the Viewer
1.6. Data transformations: functions, random number
generators, recoding
1.7. File handling: sort cases and variables, merging and
split files, select cases, restructuring data
2. Describing data
2.1. Frequencies: statistics, charts
2.2. Descriptive: options
2.3. Explore: statistics, plots
2.4. Interactive plots
2.5. Contingency tables
3. Statistical inference
3.1. Confidence intervals with Explore and error bars
3.2. Compare means: independent samples and paired samples T
test
3.3. Test whether the mean differs from a constant: One
sample T test
3.4. Nonparametric tests: chi-square, binomial, runs,
Kolmogorov-Smirnov, two-independent-samples, two-related-samples, several
independent samples, several related samples
4. Time series
4.1. Time series creation and transformations
5. Sampling
5.1. Complex Samples option
6. Classification and regression
6.1. Discriminant analysis, logistic regression and decision
trees
6.2. Variable selection methods in regression
6.3. Plots, statistics, options and Curve Estimation
Practical demonstration: SPSS
|
Bibliography
|
J. Pallant (2007) “SPSS Survival Manual: A Step
by Step Guide to Data Analysis Using SPSS for Windows (Version
15)”. Open University Press.
SPSS 16.0 Student Version for Windows. SPSS Inc.
Tutorials and manuals provided by SPSS.
|
Prerequisites
|
Assumed familiarity with the basics of probability and
inference
|
Reading before coming
|
Students will benefit from taking a look at the
following videos:
http://www.as.ysu.edu/~chang/SPSS/SPSSmain.htm
http://www.stat.tamu.edu/spss.php
|
Course 10: Regression
|
Program
|
1. Introduction
1.1. A brief historical framework
1.2. Some applications and examples of Regression Analysis
1.3. Specification of a Regression Model
1.4. Organization of the Regression Analysis
2. Simple Linear Regression Model
2.1. Introduction and examples
2.2. Data graphs
2.3. Specification of a Linear Regression Model
2.4. Parameter estimation
2.5. Inference on the model parameters
2.5.1. Hypothesis testing
2.5.2. Confidence intervals
2.6. Prediction of new observations
2.7. Coefficient of Determination
2.8. Correlation
2.9. Examples
3. Measures of model adequacy
3.1. Residual analysis
3.2. Outliers
3.3. Transformations
3.3.1. To a straight line
3.3.2. To stabilize variance
3.4. Methods to select a transformation
4. Multiple Linear Regression
4.1. Introduction and examples
4.2. Interpretation of regression coefficients
4.3. Parameter estimation
4.3.1. Ordinary Least Squares estimation
4.3.2. Geometrical interpretation
4.4. Inadequacy of some data graphs in Multiple Linear
Regression
4.5. Inference on the model parameters
4.5.1. Hypothesis testing: significance of
regression; general linear hypothesis
4.5.2. Confidence intervals
4.5.3. Simultaneous inference
4.6. Prediction of new observations
4.7. Standardized regression coefficients
4.8. Multiple correlation coefficient
5. Regression Diagnostics and model violations
5.1. Some types of residuals
5.2. Residuals plot
5.3. Linearity assumption
5.4. Normality assumption
5.5. Influence diagnostics
5.6. Multicollinearity
5.7. Additional predictors
5.8. Heteroskedasticity and autocorrelation
6. Polynomial regression
6.1. Introduction
6.2. Polynomial model in one variable
6.3. Polynomial model in more than one variable
7. Variable selection
7.1. Consequences of model misspecification
7.2. Evaluation of subset regression models
7.3. All possible regressions
7.4. Stepwise regression
8. Indicator variables as regressors
8.1. Use of indicator variables
8.2. Qualitative data
8.3. Interaction
8.4. Indicator response variables
9. Logistic regression
9.1. Introduction and examples
9.2. Parameter estimation: maximum likelihood and nonlinear
least squares
9.3. Measures of model adequacy
9.4. Inference
9.5. Regressor selection
10. Nonlinear Regression
10.1. Model specification
10.2. Iterative estimation: nonlinear least squares
10.3. Linear approximation and normal approximation
10.4. Inference
Practical
demonstration: SPSS
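The practical sessions use SPSS, but the same analysis can be sketched in a
few lines with R's lm function on the built-in cars data (stopping distance
vs. speed); this is an illustration, not course material:

    fit <- lm(dist ~ speed, data = cars)  # simple linear regression

    summary(fit)          # estimates, t tests on coefficients, R-squared
    confint(fit)          # confidence intervals for intercept and slope
    plot(fit, which = 1)  # residuals vs fitted: check linearity and variance

    # Prediction of a new observation, with a prediction interval
    predict(fit, newdata = data.frame(speed = 21), interval = "prediction")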
|
Bibliography
|
Ryan, T. P. (1997) Modern Regression Methods.
New York: Wiley
Draper, N. R. and Smith, H. (1998) Applied Regression Analysis. Third
edition. New York: Wiley
Greene, W. H. (2007) Econometric Analysis. Prentice Hall
Seber, G. A. F. and Lee, A. J. (2003) Linear Regression Analysis. Second
edition. New Jersey: Wiley
Montgomery, D. C. and Peck, E. A. (1992) Introduction to Linear
Regression Analysis. Second edition. New York: Wiley
Chatterjee, S., Hadi, A. S. and Price, B. (2000) Regression Analysis by
Example. Third edition. New York: Wiley
Goldberger, A. S. (1998) Introductory Econometrics. Harvard University
Press
|
Prerequisites
|
The attendee is
expected to be familiar with basic concepts of probability and
statistical inference: probability, random variables, discrete and
continuous distributions, the normal distribution, random samples, maximum
likelihood estimators, confidence intervals, hypothesis testing…
|
Readings before coming
|
Attendees will benefit more from the course if they read the
“before attending” documents beforehand.
|
Course 11: Practical statistical questions
|
Program
|
1. I would like to know the
intuitive definition and use of …: The basics
1.1. Descriptive vs inferential statistics
1.2. Statistic vs. parameter. What is a sampling
distribution?
1.3. Types of variables
1.4. Parametric vs non-parametric statistics
1.5. What to measure? Central tendency, differences,
variability, skewness and kurtosis, association
1.6. Use and abuse of the normal distribution
1.7. Is my data really independent?
2. How do I collect the data? Experimental design
2.1. Methodology
2.2. Design types
2.3. Basics of experimental design
2.4. Some designs: Randomized Complete Blocks, Balanced
Incomplete Blocks, Latin squares, Graeco-Latin squares, Full 2^k
factorial, Fractional 2^(k-p) factorial
2.5. What is a covariate?
3. Now I have data, how do I extract information? Parameter estimation
3.1. How to estimate a parameter of a distribution?
3.2. How to report on a parameter of a distribution? What are
confidence intervals?
3.3. What if my data is “contaminated”? Robust
statistics
4. Can I see any interesting association between two variables, two
populations, …?
4.1. What are the different measures available?
4.2. Use and abuse of the correlation coefficient
4.3. How can I use models and regression to improve my
measure?
5. How can I know if what I see is “true”? Hypothesis testing
5.1. The basics
5.1.a. What is a hypothesis test?
5.1.b. What is the statistical power?
5.1.c. What is a p-value? How to use it?
5.1.d. What is the effect size?
5.1.e. What is the relationship between sample
size, sampling error, effect size and power?
5.1.f. What are the assumptions of hypothesis
testing?
5.2. How to select the appropriate statistical test
5.2.a. Tests about a population central tendency
5.2.b. Tests about a population variability
5.2.c. Tests about a population parameter
5.2.d. Tests about differences between
populations
5.2.e. Tests about the ordering of data
5.2.f. Tests about distributions
5.2.g. Tests about correlation/association
measures
5.3. Multiple testing
5.4. Permutation tests
5.5. Words of caution
6. How many samples do I need for my test?: Sample size
6.1. Basic formulas for different distributions
6.2. Formulas for samples with different costs
6.3. What if I cannot get more samples? Resampling:
Bootstrapping, jackknife
7. Can I deduce a model for my data?
7.1. What kind of models are available?
7.2. How to select the appropriate model?
7.3. Analysis of Variance as a model
7.3.a. What is ANOVA?
7.3.b. What is ANCOVA?
7.3.c. How do I use them with the pretest and the
posttest designs?
7.3.d. What are planned and post-hoc comparisons?
7.3.e. What are fixed effects and random effects?
7.3.f. When should I use Multivariate ANOVA
(MANOVA)?
7.4. Regression as a model
7.4.a. What are the assumptions of regression?
7.4.b. Are there other kinds of regression?
7.4.c. How reliable are the coefficients?
Confidence intervals
7.4.d. How reliable are the coefficients?
Validation
7.4.e. When should I use nonlinear regression?
Practical
sessions: study of cases of different fields (economics, biology,
engineering, computer science, ...)
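To make the hypothesis-testing questions of block 5 concrete, here is a small
base-R sketch on simulated data that touches the p-value, the effect size and
the power question in one breath; all numbers are invented for illustration:

    set.seed(7)
    a <- rnorm(30, mean = 10.0, sd = 2)   # "control" group
    b <- rnorm(30, mean = 11.2, sd = 2)   # "treatment" group

    t.test(a, b)$p.value                  # evidence against H0, not P(H0 is true)

    # Effect size (Cohen's d) from a pooled standard deviation
    d <- (mean(b) - mean(a)) / sqrt((var(a) + var(b)) / 2)

    # How many subjects per group to detect this effect with 80% power?
    power.t.test(delta = d, sd = 1, power = 0.8)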
|
Bibliography
|
David J. Sheskin. Handbook of Parametric and
Nonparametric Statistical Procedures. Chapman & Hall/CRC (2007)
R. R. Newton, K. E. Rudestam. Your Statistical Consultant: Answers to
Your Data Analysis Questions. Sage Publications, Inc (1999)
G. van Belle. Statistical Rules of Thumb. Wiley-Interscience
(2002)
P. I. Good, J. W. Hardin. Common Errors in Statistics (and How to
Avoid Them). Wiley-Interscience (2006)
|
Prerequisites
|
The attendee is expected to be familiar with the
basics of probability, experimental design, hypothesis testing,
regression and ANOVA.
|
Readings before coming
|
Attendees will benefit more from the course if they read the
“before attending” documents beforehand.
|
Course 12: Missing data and outliers
|
Program
|
1. Missing Data
1.1. Introduction
1.2. Typology of missing data
1.2.1. Missing completely at random (MCAR)
1.2.2. Missing at random (MAR)
1.2.3. Non-ignorable missingness
1.2.3.1 Depends on unobserved
predictors
1.2.3.2. Depends on the missing
value itself
1.3. Simple missing-data methods
1.3.1. Missing-data methods that discard data
1.3.2. Missing-data methods that retain all the
data
1.4. Imputation Methods
1.4.1. Single Imputation
1.4.1.1. Regression imputation
1.4.1.2. Hot-deck imputation
1.4.2. Multiple Imputation
1.4.2.1. Iterative regression
imputation
1.4.2.2. Likelihood-based
approach. EM Algorithm
1.4.2.3. Markov Chain Monte Carlo
(MCMC)
1.5. Diagnostics and Overimputing.
2. Outliers and Robust Statistics
2.1. Introduction
2.2. Typology of outliers
2.2.1. Additive Outliers
2.2.2. Level Shifts
2.2.3. Innovational Outliers
2.3. Influence measures
2.4. Robust methods
2.4.1. M-estimates of location and scale
2.4.2. Robust inference
2.4.3. Huber estimators for regression
2.4.4. Some robust techniques for multivariate
analysis and time series
2.4.5. Software based on R.
Practical
demonstration: R
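As a base-R taste of both halves of the course, the sketch below performs
single mean and regression imputation on R's airquality data (where the Ozone
variable genuinely has missing values) and contrasts robust location
estimates with the mean, using the huber function of the standard MASS
package; all modelling choices are illustrative:

    d <- airquality[, c("Ozone", "Temp")]          # Ozone contains NAs

    mean_imp <- d$Ozone                            # single mean imputation
    mean_imp[is.na(mean_imp)] <- mean(d$Ozone, na.rm = TRUE)

    fit <- lm(Ozone ~ Temp, data = d)              # fitted on complete cases
    reg_imp <- d$Ozone                             # single regression imputation
    reg_imp[is.na(reg_imp)] <- predict(fit, newdata = d[is.na(d$Ozone), ])

    # Robust location: median and Huber M-estimate resist the Ozone outliers
    c(mean   = mean(d$Ozone, na.rm = TRUE),
      median = median(d$Ozone, na.rm = TRUE),
      huber  = MASS::huber(na.omit(d$Ozone))$mu)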
|
Course 13
|
Hidden Markov Models
|
Program
|
1. Introduction
1.1. Introduction to Hidden Markov Models
1.2. Hidden Markov Models definition
1.3. Application of HMMs to speech recognition
1.4. Overview of quantization
2. Discrete Hidden Markov Models
2.1. Presentation of Discrete HMMs: model description
2.2. HMM simple examples
3. Basic algorithms for Hidden Markov Models
3.1. Forward-backward
3.2. Viterbi decoding
3.3. Baum-Welch reestimation
3.4. Practical issues for the implementation
4. Semicontinuous Hidden Markov Models
4.1. Overview, advantages and disadvantages
4.2. Formulae modification in the basic algorithms
5. Continuous Hidden Markov Models
5.1. Overview, advantages and disadvantages
5.2. Formulae modification in the basic algorithms
5.3. Multi-Gaussian modeling
6. Unit selection and clustering
6.1. Considerations for unit selection in Hidden Markov
Models
6.2. Parameter sharing
6.3. Unit clustering in HMMs
7. Speaker and Environment Adaptation for HMMs
7.1. Adaptation modes
7.2. Maximum Likelihood Linear Regression (MLLR)
7.3. Maximum a Posteriori (MAP)
7.4. Rapid adaptation
8. Other applications of HMMs
8.1. HMMs for alignment and gene finding in DNA
8.2. HMMs for handwritten word recognition
8.3. HMMs for lip reading
8.4. HMMs for stroke recognition in tennis videos
8.5. HMMs for electricity market modelling
Practical
demonstration: An open-source solution for HMM modeling: The HTK toolkit
from Cambridge University.
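The practical sessions use HTK, but the core decoding algorithm of item 3.2
fits in a few lines of base R. The toy weather/activity HMM below is a
standard textbook illustration with made-up parameters:

    states <- c("Rainy", "Sunny")
    A  <- matrix(c(0.7, 0.3, 0.4, 0.6), 2, byrow = TRUE,
                 dimnames = list(states, states))       # transition probabilities
    B  <- matrix(c(0.1, 0.4, 0.5, 0.6, 0.3, 0.1), 2, byrow = TRUE,
                 dimnames = list(states, c("walk", "shop", "clean")))  # emissions
    p0 <- c(Rainy = 0.6, Sunny = 0.4)                   # initial distribution

    viterbi <- function(obs, A, B, p0) {
      n <- length(obs); k <- nrow(A)
      delta <- matrix(0, k, n); psi <- matrix(0L, k, n) # log-probs, backpointers
      delta[, 1] <- log(p0) + log(B[, obs[1]])
      for (t in 2:n)
        for (j in 1:k) {
          cand <- delta[, t - 1] + log(A[, j])          # best way into state j
          psi[j, t]   <- which.max(cand)
          delta[j, t] <- max(cand) + log(B[j, obs[t]])
        }
      path <- integer(n); path[n] <- which.max(delta[, n])
      for (t in (n - 1):1) path[t] <- psi[path[t + 1], t + 1]
      rownames(A)[path]                                 # decoded state sequence
    }

    viterbi(c("walk", "shop", "clean", "clean"), A, B, p0)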
|
Bibliography
|
Hidden Markov Models for Speech Recognition. X. D. Huang,
Y. Ariki, M. A. Jack. Edinburgh University Press, 1990.
Spoken Language Processing. Huang, X., Acero, A., Hon, H.W. Ed. Prentice
Hall, New Jersey, 2001.
|
Prerequisites
|
The attendee should
be familiar with the basics of pattern recognition, the multivariate
Gaussian distribution, and dynamic programming.
|
Readings before
attending
|
Attendees will benefit more from the course if they read the
“before attending” documents beforehand.
|
Course 14: Statistical inference
|
Program
|
1. Introduction
1.1. The general problem of Statistical inference
1.2. Deduction vs induction
1.3. Statistics and Probability
1.4. Estimation
1.5. Hypothesis Testing
1.6. Decision Theory
1.7. Examples
2. Some basic statistical tests
2.1. Cross tabulation
2.2. Chi Square test
2.3. Nominal data cross tabulation tests
2.4. Ordinal data cross tabulation tests
2.5. Nominal by scale test
2.6. Concordance measures
2.7. T test for comparing means: paired and independent
samples
2.8. Non-parametric versions
2.9. One-way ANOVA and its non-parametric version
2.10. Comparing variances of two samples, the F distribution
2.11. Correlations and partial correlations
2.12. Regression and non-linear regression
2.13. Kolmogorov-Smirnov test
2.14. Run Test
2.15. Randomized tests
3. Multiple testing
3.1. The Sidak correction
3.2. The Bonferroni correction
3.3. Holm's step-wise correction
3.4. The False Discovery Rate
3.5. Permutation correction
4. Introduction to bootstrapping
4.1. Parametric bootstrapping
4.2. Nonparametric bootstrapping
Practical
demonstration: SPSS
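As a concrete taste of the multiple-testing block, the base-R sketch below
applies several of the corrections in the program to a simulated batch of 100
p-values, 10 of which come from real effects; the simulation recipe is ours,
purely for illustration:

    set.seed(11)
    p <- c(runif(90),              # 90 true null hypotheses: uniform p-values
           rbeta(10, 0.5, 10))     # 10 real effects: p-values piled up near 0

    sum(p < 0.05)                                    # uncorrected testing
    sum(p.adjust(p, method = "bonferroni") < 0.05)   # strict familywise control
    sum(p.adjust(p, method = "holm") < 0.05)         # step-wise, never worse
    sum(p.adjust(p, method = "BH") < 0.05)           # Benjamini-Hochberg FDR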
|
Bibliography
|
Essentials of Statistical Inference, G. A. Young and R. L.
Smith. Cambridge University Press
Statistical
Inference, S.D. Silvey. Chapman & Hall Monographs on Statistics and
Applied Probability, 7
Statistical
Inference, Adelchi Azzalini. Chapman & Hall Monographs on Statistics
and Applied Probability, 68
Applied Linear Statistical Models, Neter et al., McGraw-Hill
|
Prerequisites
|
The student is assumed to be familiar with the basics
of probability, random variables and probability distributions (binomial,
Poisson, normal, Student's t, chi-square and F), and with the concepts of
random sampling and estimators.
|
Readings before coming
|
Students will benefit more from the course if they read the
“before attending” documents beforehand.
|
Course 15: Feature subset selection
|
Program
|
1. Introduction
2. Filter approaches
2.1. Introduction
2.2. Univariate filters: parametric (t-test, ANOVA,
Bayesian, regression) and model-free (Wilcoxon rank-sum, BSS/WSS, rank
products, random permutations, TNoM)
2.3. Multivariate filters: bivariate, CFS, MRMR, USC, Markov
blanket
3. Wrapper methods: sequential search, genetic algorithms, EDAs
4. Embedded methods: random forest, weight vector of SVM, weights of
logistic regression, regularization
5. Practical exercises: Weka, Spider (MATLAB), SVM and Kernel (MATLAB),
SAM (R), GALGO (R), EDGE (R)
Practical
demonstration: MATLAB, Weka, R
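As a minimal illustration of the univariate filters of block 2, here is a
base-R sketch that ranks 200 simulated features by a t-test p-value; only the
first 5 features are made informative, and a good filter should push them to
the top:

    set.seed(5)
    n <- 60; p <- 200
    cls <- rep(c(0, 1), each = n / 2)
    x <- matrix(rnorm(n * p), n, p)              # mostly pure-noise features
    x[cls == 1, 1:5] <- x[cls == 1, 1:5] + 1.5   # features 1-5 carry real signal

    pvals <- apply(x, 2, function(f) t.test(f[cls == 0], f[cls == 1])$p.value)
    head(order(pvals), 10)                       # informative features float up

Filters ignore interactions between features; that is the gap the wrapper and
embedded methods of blocks 3-4 are designed to fill, at a higher
computational cost.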
|
Bibliography
|
H. Liu, H. Motoda (2008). Computational Methods of
Feature Selection. Chapman and Hall/CRC
H. Liu, H. Motoda (1998). Feature Selection for Knowledge Discovery
and Data Mining. Kluwer Academic Publishers.
Y. Saeys, I. Inza, P. Larrañaga (2007). A review of feature selection
techniques in bioinformatics. Bioinformatics, 23(19), 2507-2517.
|
Course 16: Introduction to R
|
Program
|
1. Introduction
2. An introductory R session
2.1. The R Environment
2.2. Getting help
2.3. R Packages
2.4. Editors for R scripts
2.5. Simple manipulations and basic commands
3. Data in R
3.1. Objects in R
3.2. Basic operators
3.3. Data access and manipulation
4. Importing/Exporting data
5. Programming in R
5.1. Loops and conditionals
5.2. Writing R functions
5.3. Debugging
6. R Graphics
6.1. Types of graphs
6.2. The plot function
6.3. Editing graphs
6.4. Trellis graphics
7. Statistical Functions in R
8. R session 1: Feature selection
9. R session 2: Clustering Analysis
10. R session 3: Survival Analysis
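A short, self-contained session of the kind practiced throughout the course
(the data values are invented):

    x <- c(5.1, 4.9, 6.3, 5.8)      # a numeric vector
    x * 2                           # vectorised arithmetic: no loop needed
    mean(x); sd(x)                  # basic statistical functions

    d <- data.frame(len = x, grp = c("a", "a", "b", "b"))
    d[d$grp == "b", ]               # logical indexing of a data frame
    tapply(d$len, d$grp, mean)      # group-wise summaries

    squares <- function(n) n^2      # defining a function
    sapply(1:5, squares)

    plot(d$len, col = factor(d$grp), pch = 19)   # a first graph
    help(plot)                      # the built-in help system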
|
Bibliography
|
• Jason Owen: “The R Guide”
• W. N. Venables, D. M. Smith et al.: An Introduction to R
• Brian S. Everitt and Torsten Hothorn: A Handbook of Statistical
Analyses Using R
• Tutorials and manuals provided by R users at the R web site: http://cran.r-project.org/other-docs.html
|
Course 17: Unsupervised pattern recognition (Clustering)
|
Program
|
1. Introduction
1.1. Problem formulation
1.2. Types of features
1.3. Feature extraction
1.4. Graphical examination
1.5. Data quality
1.6. Distance measures
1.7. Preprocessing
1.8. Data reduction
1.9. Types of clustering
2. Prototype-based clustering
2.1. K-means: problem formulation
2.2. K-means: suboptimal solution
2.3. K-means: initialization
2.4. K-means: limitations
2.5. ISODATA
2.6. Fuzzy K-means
2.7. Mixture models (EM algorithm)
2.8. Self-Organizing Maps
2.9. Extensions
3. Density-based clustering
3.1. DBSCAN (Density based spatial clustering of
applications with noise)
3.2. Grid-clustering
3.3. Denclue (density clustering)
3.4. More algorithms
4. Graph-based clustering
4.1. Hierarchical clustering: introduction
4.2. Hierarchical clustering: locally optimal algorithm
4.3. Hierarchical clustering: linking comparison
4.4. Chameleon
4.5. Hybrid graph-density based clustering: SNN-DBSCAN
4.6. Extensions
5. Cluster evaluation
5.1. Clustering tendency
5.2. Unsupervised cluster evaluation
5.3. Supervised cluster evaluation
5.4. Determining the number of clusters
6. Miscellanea
6.1. Categorical clustering
6.2. Conceptual clustering
6.3. Subspace clustering
6.4. Information theoretic clustering
6.5. Ensemble/Consensus clustering
6.6. Semisupervised clustering
6.7. Clustering with obstacles
6.8. Biclustering, coclustering, two-mode clustering
6.9. Turning a supervised classification algorithm into a
clustering algorithm
Practical
demonstration: MATLAB
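The demonstrations use MATLAB, but as a tool-independent taste of block 2,
here is the same exercise in base R: K-means on the iris measurements, a
supervised check against the known species, and the common elbow heuristic
for choosing K (all choices below are illustrative):

    set.seed(2)
    x <- scale(iris[, 1:4])     # standardise so distances weigh features equally

    km <- kmeans(x, centers = 3, nstart = 25)   # restarts matter: K-means is
                                                # only locally optimal (see 2.2-2.3)
    table(km$cluster, iris$Species)   # supervised evaluation against true labels

    # Elbow heuristic: within-cluster sum of squares for K = 1..8
    wss <- sapply(1:8, function(k) kmeans(x, k, nstart = 25)$tot.withinss)
    plot(1:8, wss, type = "b", xlab = "K", ylab = "total within-cluster SS")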
|
Bibliography
|
Data Mining: Concepts and Techniques by Jiawei Han,
Micheline Kamber. Morgan Kaufmann (2000)
Principles of Data Mining by David Hand, Heikki
Mannila, Padhraic Smyth. MIT Press (2001)
|
Course 18: Evolutionary computation
|
Program
|
1. Introduction
1.1. Genetic algorithms
1.2. Genetic programming
1.3. Robust intelligent systems
1.4. Self-adapting intelligent systems
1.5. Sub-symbolic knowledge representation: artificial
neural networks
1.6. Symbolic knowledge representation: rule and fuzzy-rule
based systems
2. Genetic algorithms
2.1. How do they work?
2.2. Main features
2.3. Problem codification
2.4. Convergence
2.5. Reproduction operators
2.6. Mutation operators
2.7. Crossover operators for finite alphabets
2.8. Crossover operators for real numbers codification
methods
2.9. Individual replacement
3. Genetic programming
3.1. Grammar-guided genetic programming
3.2. Initialization methods
3.3. How initialization method influences the evolution
process
3.4. Classic crossover operators
3.5. The grammar-based crossover operator
3.6. Mutation operators
4. Robust and self-adapting intelligent systems
4.1. Fitness function
4.2. Binary direct codification method of neural
architectures
4.3. Grammar codification method
4.4. Basic neural architectures codification method
4.5. Training neural networks with genetic algorithms
4.6. Designing neural architectures with genetic programming
4.7. Rule and fuzzy rule based systems
4.8. Real-world applications
5. Introduction to Estimation of Distribution Algorithms
5.1. Probabilistic modelling in optimization
5.2. Components of an EDA
5.3. EDAs based on univariate probabilistic models
5.4. EDAs based on multivariate probabilistic models
5.5. EDAs for problems with discrete representation
5.6. EDAs for problems with continuous and mixed
representation
6. Improvements, extensions and applications of EDAs
6.1. Parallel EDAs
6.2. Fitness partial evaluation using probabilistic models
6.3. Multi-objective optimization using EDAs
6.4. Hybrid EDAs
6.6. Applications in Bioinformatics
6.7. Applications in Robotics
6.8. Applications in Engineering and Scheduling
7. Current research in EDAs
7.1. Information extraction from probabilistic models
learned by EDAs
7.2. EDA approaches to highly complicated, constrained,
mixed and other difficult problems
7.3. Addition of prior knowledge about the problem domain
into the search
Practical
Demonstration: MATLAB and GeLi library
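As a base-R sketch of the genetic-algorithm machinery of block 2, the snippet
below evolves bit strings towards the trivial "onemax" objective with
tournament selection, one-point crossover and bit-flip mutation; all operator
choices and rates are illustrative:

    set.seed(9)
    n_bits <- 40; pop_size <- 50
    fitness <- function(bits) sum(bits)      # toy objective: count the ones

    pop <- matrix(sample(0:1, pop_size * n_bits, replace = TRUE), pop_size)
    for (gen in 1:100) {
      fit  <- apply(pop, 1, fitness)
      pick <- function() {                   # tournament selection of a parent
        i <- sample(pop_size, 2); i[which.max(fit[i])]
      }
      pop <- t(sapply(1:pop_size, function(j) {
        p1 <- pop[pick(), ]; p2 <- pop[pick(), ]
        cut <- sample(n_bits - 1, 1)         # one-point crossover
        child <- c(p1[1:cut], p2[(cut + 1):n_bits])
        flip <- runif(n_bits) < 0.01         # bit-flip mutation
        child[flip] <- 1 - child[flip]
        child
      }))
    }
    max(apply(pop, 1, fitness))              # close to n_bits after evolution

Estimation of Distribution Algorithms (blocks 5-7) replace the crossover and
mutation step with learning and sampling a probabilistic model of the
selected individuals.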
|
Bibliography
|
- A.E. Eiben,
J.E. Smith. Introduction to Evolutionary Computing. Springer-Verlag
Berlin Heidelberg New York (2003)
- M. Mitchell. An
Introduction to Genetic Algorithms. MIT Press (1998)
- W.B. Langdon,
R. Poli. Foundations of Genetic Programming. Springer-Verlag Berlin
Heidelberg New York (2002).
- Estimation of
Distribution Algorithms. A New Tool for Evolutionary Computation.
Series: Genetic Algorithms and Evolutionary Computation , Vol. 2 ,
Pedro Larrañaga and , José A. Lozano (Eds.) 2001.
|
Prerequisites
|
The attendee is expected to be familiar with neural
networks, the structure of knowledge-based systems and basic
statistical concepts, and to have attended the course on Bayesian networks.
|
Reading before coming
|
Attendees will benefit more from the course if they read the
following before attending:
Daniel Manrique and Juan Ríos. Artificial Neural Networks.
Martin Pelikan, David E. Goldberg and Fernando G. Lobo. A Survey of
Optimization by Building and Using Probabilistic Models. Computational
Optimization and Applications, Volume 21, Issue 1 (January 2002),
pages 5-20.
|