1
00:00:17,340 --> 00:00:31,369
Good morning. In the last class we
discussed that data equals pattern
2
00:00:31,369 --> 00:00:52,710
plus error. So, how do
we extract the pattern? That was the issue, so
3
00:00:52,710 --> 00:01:13,050
with respect to this we will discuss. Now,
what are the statistical approaches, so statistical
4
00:01:13,050 --> 00:01:23,140
approaches to problem solving. If you see
the slide, you see that here we will start
5
00:01:23,140 --> 00:01:45,860
with practical problem then this practical problem will be converted into statistical problem. Statistical
6
00:01:45,860 --> 00:01:58,729
problem will help us to generate a statistical
solution, and this statistical solution must in turn give a practical solution.
7
00:01:58,729 --> 00:02:04,259
So, if I say that first one is practical problem,
second is statistical problem, third one is
8
00:02:04,259 --> 00:02:10,399
statistical solution, then this statistical
solution must be practical enough. So, that
9
00:02:10,399 --> 00:02:23,349
is the practical solution. Now, whatever statistical
solution you get is converted
10
00:02:23,349 --> 00:02:32,430
to practical solution. It should be checked
whether the practical problem is truly solved
11
00:02:32,430 --> 00:02:37,280
or not, that is why the cycle.
12
00:02:37,280 --> 00:02:45,260
If you see the slide, you see that it is basically
a cyclic one. It starts with your practical problem, and that
13
00:02:45,260 --> 00:02:54,120
is the crux of the matter. If you find out
the practical problem, nicely identify and
14
00:02:54,120 --> 00:03:01,800
define the problem nicely, because the rest of
the things follow from it. So, then the
15
00:03:01,900 --> 00:03:06,480
statistical problem then your statistical
solution, and finally the practical solution.
16
00:03:06,480 --> 00:03:20,310
Many a time, we falter in effectively defining
the practical problem. Are you getting me?
17
00:03:20,310 --> 00:03:35,060
So, what is a practical problem, and how do you understand
a practical problem? In the last class, we saw a
18
00:03:35,060 --> 00:03:45,139
small company in a city doing business primarily
at the local level, and the company is
19
00:03:45,139 --> 00:03:52,900
monitoring its profit and sales volume. It
is observed that absenteeism is quite high
20
00:03:52,900 --> 00:04:05,430
and there are machine breakdowns also, and
in order to improve the business performance
21
00:04:05,430 --> 00:04:12,469
a marketing department has also been established, maybe
recently. Then they want to check
22
00:04:12,469 --> 00:04:23,840
how the marketing department is performing. Now, here, what is your practical problem? If
23
00:04:23,840 --> 00:04:27,830
I go back to this slide again, what is the
practical problem.
24
00:04:27,830 --> 00:04:36,550
The practical problem is, as I told you, that
there is a threat to profit and sales volume,
25
00:04:36,550 --> 00:04:48,270
high absenteeism, and occasional to frequent
breakdowns. There may be poor performance of
26
00:04:48,270 --> 00:04:57,990
the marketing department or there may not be, but I
am assuming that the performance of the marketing
27
00:04:57,990 --> 00:05:08,130
department is also not good. That means the question is
how to overcome the threat to profit and sales
28
00:05:08,130 --> 00:05:18,960
volume. If high absenteeism is causing lower
profit, what is the extent of that effect? If that is known,
29
00:05:18,960 --> 00:05:24,410
then it is better for the management to take
actions.
30
00:05:24,410 --> 00:05:31,620
Now, we will convert it into a statistical
problem. If you really want to build a statistical
31
00:05:31,620 --> 00:05:39,759
problem from a practical problem, these are the
few steps which you must follow.
32
00:05:39,759 --> 00:05:51,009
The first one is to identify the variables of interest:
in this example, profit, sales volume, percentage
33
00:05:51,009 --> 00:05:57,729
absenteeism, machine breakdown in hours, and
M ratio. These are the variables that we have
34
00:05:57,729 --> 00:06:08,569
identified. Now, identify the response variables:
out of those variables, what are the response
35
00:06:08,569 --> 00:06:12,180
variables?
Now, by response variable, what
36
00:06:12,180 --> 00:06:18,870
I mean to say is
that these variables are affected
37
00:06:18,870 --> 00:06:25,289
by the presence of other variables in the system.
Are you getting me? In other words, we can call them
38
00:06:25,289 --> 00:06:32,750
dependent variables. So, Y 1 and Y 2 are response
variables; they are dependent variables. Why?
39
00:06:32,750 --> 00:06:44,569
Because absenteeism, machine breakdown, and
M ratio can influence profit and sales volume.
40
00:06:44,569 --> 00:06:52,110
So, immediately the next objective is to identify
the explanatory variables and find out the
41
00:06:52,110 --> 00:07:01,909
dependence relationship. Are you getting me? For this
particular problem, it is a little
42
00:07:01,909 --> 00:07:09,789
bit difficult at this moment to see why suddenly
we are going for this dependence modeling
43
00:07:09,789 --> 00:07:14,500
and what the different kinds of dependence
modeling techniques are.
44
00:07:14,500 --> 00:07:23,300
But what I mean to say is, you
should not be governed by the techniques;
45
00:07:23,540 --> 00:07:30,840
that is one of the threats. Many a time, what
happens is that you know many multivariate techniques
46
00:07:30,840 --> 00:07:38,750
and you think that your problem will fit
that technique. Please, do not do this. That
47
00:07:38,750 --> 00:07:43,770
is why, in finding the relationship, I have written here
that Y is a function of X. We have not
48
00:07:43,770 --> 00:07:49,460
said it is a linear relationship or non linear
relationship or it is a regression. It is
49
00:07:49,460 --> 00:07:55,840
some other way of doing things; nothing is
mentioned. So, you must be driven by the problem
50
00:07:55,840 --> 00:07:59,979
at hand. So, our problem at hand is the threat to profit
51
00:07:59,979 --> 00:08:09,389
and the threat to sales volume, and so we
got the response variables on which we want to
52
00:08:09,389 --> 00:08:15,419
concentrate. Accordingly, the others are explanatory
variables, meaning the variables which explain
53
00:08:15,419 --> 00:08:27,360
why there is variability in profit and variability
in sales volume. So, Y is a function of X. In
54
00:08:27,360 --> 00:08:37,500
statistical modeling, variability
is the crux of the matter. Please understand,
55
00:08:37,500 --> 00:08:46,130
you must understand the variability structure.
In case of one variable, there is variance.
56
00:08:46,130 --> 00:08:52,660
In case of several variables, there are variances
and covariances; in totality, we say that is the
57
00:08:52,660 --> 00:08:53,940
covariance structure.
58
00:08:53,940 --> 00:09:10,310
So, that covariance
structure will give you the pattern;
59
00:09:10,310 --> 00:09:15,130
the pattern you want to extract will
be given by the covariance structure.
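A minimal sketch of the variance and covariance structure described above, assuming NumPy is available. The numbers are made up for illustration (they are not the small-company data from the lecture), and the function names are hypothetical.

```python
import numpy as np

# Hypothetical data: 5 observations of 3 variables (made-up numbers,
# not the small-company data used in the lecture).
data = np.array([
    [10.0, 2.0, 1.0],
    [12.0, 3.0, 0.5],
    [9.0, 1.0, 1.5],
    [15.0, 4.0, 0.2],
    [11.0, 2.5, 0.9],
])


def variance(x):
    """Sample variance of one variable (divisor n - 1)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x - x.mean()) ** 2) / (len(x) - 1))


def covariance_matrix(X):
    """Sample covariance matrix: variances on the diagonal, covariances off it."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (X.shape[0] - 1)


S = covariance_matrix(data)
print(variance(data[:, 0]))  # one variable: a single variance
print(S)                     # several variables: the covariance structure
```

With one variable there is a single number; with several variables the same idea becomes a symmetric matrix whose diagonal holds the variances.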
60
00:09:15,130 --> 00:09:23,280
So, let us see the statistical problem. Here,
the statistical problem is the variability
61
00:09:23,280 --> 00:09:31,570
in profit and sales volume. By variability,
we are saying that definitely there is high
62
00:09:31,570 --> 00:09:42,260
variability caused by X 1, which is absenteeism,
X 2, machine breakdown in hours, and X 3, M ratio;
63
00:09:42,260 --> 00:09:49,110
there may be linear relationships. Now, you
are coming to the statistics domain; you have
64
00:09:49,110 --> 00:10:01,730
to assume something: there may be a linear relationship.
Any problem up to this? Do you have any query?
65
00:10:01,730 --> 00:10:15,280
Any query from your side, from this side?
No problem? Yes.
66
00:10:15,280 --> 00:10:27,570
Then, we say that, fine, there
is a relationship, and in a statistical way you
67
00:10:27,570 --> 00:10:36,070
can examine that relationship. And we also
assume that the relationship is linear.
68
00:10:36,070 --> 00:10:44,720
Now, let us see it pictorially. If you see
this particular figure, here what
69
00:10:44,720 --> 00:10:55,470
we have said is that Y 1 is affected by X 1,
X 2 and X 3. What is Y 1? We have said
70
00:10:55,470 --> 00:11:05,020
Y 1 is profit, correct? So, Y 1 is profit,
X 1 is percentage absenteeism, X 2 is breakdown
71
00:11:05,020 --> 00:11:27,270
hours, and X 3 is M ratio. So, Y 1 is profit, and
X 1, X 2, X 3 are absenteeism, breakdown
72
00:11:27,270 --> 00:11:36,450
hours and M ratio.
Now, this is the linear relationship between
73
00:11:36,450 --> 00:11:52,180
Y 1 and X 1, X 2 and X 3. You may be wondering
what X 0 is. X 0 is a constant
74
00:11:52,180 --> 00:11:57,640
value, which will come later on in multiple regression.
You will see that X 0 will be given
75
00:11:57,640 --> 00:12:04,760
a value of one; for all observations the value one will be given, and beta 1 0 will be the intercept, which
76
00:12:04,760 --> 00:12:10,670
means that, irrespective of the explanatory
variables considered here, there will
77
00:12:10,670 --> 00:12:17,820
still be some value for Y, and that will be
determined by beta 1 0. Let it be; slowly,
78
00:12:17,820 --> 00:12:23,880
later on, we will see.
But, the sole purpose of this particular figure
79
00:12:23,880 --> 00:12:32,380
is to say that, for Y 1, if you
want a linear relationship, you can pictorially
80
00:12:32,380 --> 00:12:39,010
represent it like this. Similarly, for sales
volume also you can represent it like this in
81
00:12:39,010 --> 00:12:48,900
the pictorial representation. Please keep
in mind two things. First of all, the arrow
82
00:12:48,900 --> 00:12:56,460
head: suppose you take X 1 versus
Y 1. You see that the arrow
83
00:12:56,460 --> 00:13:03,940
starts at X 1 and ends at Y 1, and the arrow
head is at the Y 1 end. So, it simply
84
00:13:03,940 --> 00:13:10,700
indicates that Y 1 is a dependent or affected
variable and X 1 is the causal variable or
85
00:13:10,700 --> 00:13:17,490
explanatory variable.
Another issue is the epsilon 1 that is basically
86
00:13:17,490 --> 00:13:24,660
some error component. As I told you, there are two
parts: one is pattern and the other is error. Apart from
87
00:13:24,660 --> 00:13:28,530
this error, the rest of the things are basically
pattern: beta 1 0, beta 1 1, beta 1 2, beta
88
00:13:28,530 --> 00:13:38,190
1 3, all will reflect the pattern, similarly
for Y 2. But, for this particular company,
89
00:13:38,190 --> 00:13:45,750
Y 1 and Y 2 are profit and sales volume, and for the same one they are simultaneously
90
00:13:45,940 --> 00:13:51,200
occurring. So, if you go by two different
linear models, will it suffice?
91
00:13:51,230 --> 00:14:30,530
That is why I have written that Y 1 and Y 2 are
measured simultaneously; they are covarying
92
00:14:30,530 --> 00:14:53,000
also like beta 1, p X p epsilon 1.
93
00:14:53,000 --> 00:15:04,540
So, what is this? What I am saying further is
that Y 1 and Y 2 you are measuring simultaneously.
94
00:15:04,540 --> 00:15:10,270
Do you want to keep the structure intact while
estimating, while modeling? What is the structure?
95
00:15:10,270 --> 00:15:17,240
Here, both things you are measuring simultaneously,
you see the next slide here, well next slide
96
00:15:17,240 --> 00:15:28,190
left one, what you are doing. Here, you are
basically saying: no, both are occurring simultaneously
97
00:15:28,190 --> 00:15:33,770
and they are multivariate observations.
So, I want to keep them in one
98
00:15:33,770 --> 00:15:42,270
model. I will not go for two linear models;
in one linear model only I will do this.
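A minimal sketch of keeping both responses in one linear model, assuming NumPy and ordinary least squares as the fitting method (the lecture has not yet said how the betas are estimated). The data, the sample size, and the "true" coefficients are all made up for illustration. X 0 is set to one for every observation, so its coefficient plays the role of the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12  # hypothetical number of observations

# Made-up explanatory variables X1, X2, X3 (not the lecture's data).
X = rng.normal(size=(n, 3))
# X0 = 1 for every observation, so its coefficient is the intercept.
design = np.column_stack([np.ones(n), X])

# Hypothetical "true" coefficients: column 0 is for Y1, column 1 is for Y2.
B_true = np.array([
    [5.0, 20.0],   # intercepts (beta 1 0 and beta 2 0)
    [-1.0, -2.0],  # effect of X1
    [-0.5, -1.5],  # effect of X2
    [2.0, 3.0],    # effect of X3
])
errors = rng.normal(scale=0.1, size=(n, 2))
Y = design @ B_true + errors  # data = pattern + error

# Least squares for both responses at once; B_hat has one column per response.
B_hat, *_ = np.linalg.lstsq(design, Y, rcond=None)
residuals = Y - design @ B_hat

# Residual covariance between Y1 and Y2: the part two separate models ignore.
resid_cov = residuals.T @ residuals / (n - design.shape[1])
print(B_hat)
print(resid_cov)
```

Under least squares the coefficient estimates coincide with what two separate regressions would give; what the single model keeps in addition is the covariance between the two error terms, which is the "measured simultaneously" structure.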
99
00:15:42,270 --> 00:15:50,310
But there is another problem. What is the other
problem? The other problem is that between Y 1
100
00:15:50,310 --> 00:15:58,240
and Y 2 there may be a relationship;
it may so happen that Y 1 may affect
101
00:15:58,240 --> 00:16:08,740
Y 2. That means Y 1 is not only a dependent
variable or response variable, it also becomes
102
00:16:08,740 --> 00:16:15,420
a causal variable or explanatory variable,
it all depends on the situation. Suppose,
103
00:16:15,420 --> 00:16:23,370
in a given situation this is not the case:
no problem. But many a time what will happen is that
104
00:16:23,370 --> 00:16:30,410
you will find out that this type of structure
is there. So, when such type of a structure
105
00:16:30,410 --> 00:16:38,310
is present in the practical problem, how can
you ignore it? You cannot ignore it.
106
00:16:38,310 --> 00:16:48,920
That means you have to keep the
real, practical behavior intact and then
107
00:16:48,920 --> 00:16:55,920
find out the model. Not the other way round,
find out the model with the data and accordingly
108
00:16:55,920 --> 00:17:05,699
you say that is the behavior of the system
studied, it is not like this. So, actually
109
00:17:05,699 --> 00:17:14,100
there are a few models; by saying
this, a few models I have discussed pictorially:
110
00:17:14,100 --> 00:17:20,959
one is multiple regression, one is multivariate
regression, then path model. Later on, we
111
00:17:20,959 --> 00:17:34,870
will see that what are all those things, now
what is the purpose effectively, basically
112
00:17:34,870 --> 00:17:40,769
from statistical sense I said that what is
the purposes of multivariate modeling at the
113
00:17:40,769 --> 00:17:44,820
beginning?
But, from statistical sense, what are the
114
00:17:44,820 --> 00:17:56,779
purposes? You see, the first purpose is description:
any model you build must definitely describe
115
00:17:56,779 --> 00:18:05,440
what the problem at hand is, what the
purpose of the study is, what the different
116
00:18:05,440 --> 00:18:15,129
variables involved in this particular problem are,
and how these variables are measured.
117
00:18:15,129 --> 00:18:21,669
How these variables, I can say, are stored
or kept, from what data source they are found,
118
00:18:21,669 --> 00:18:26,700
whether it is primary data, meaning
you have just gone and collected it, or taken
119
00:18:26,700 --> 00:18:35,919
from some other source. So, all those things
are coming under description. This description
120
00:18:35,919 --> 00:18:40,970
part should be done ritually. In fact, everything
should be done ritually descriptive, you should
121
00:18:40,970 --> 00:18:49,080
not falter because this is the problem definition
part. Then explanation: what is explanation?
122
00:18:49,080 --> 00:18:56,509
The relationships amongst
the variables, what those relationships are, that
123
00:18:56,509 --> 00:19:08,289
is coming under explanation. Then prediction.
So, whenever we talk about any model, we talk
124
00:19:08,289 --> 00:19:18,330
about description of the problem, explanation
of the relationships of the variables,
125
00:19:18,330 --> 00:19:24,860
which is basically used to build the model,
and prediction. Now, many models are not able
and prediction. Now, many models are not able
126
00:19:24,860 --> 00:19:33,419
to predict. So, that means, in any
model, if the first two portions are missing,
127
00:19:33,419 --> 00:19:42,039
then that is not a model. You will just develop
a structure like this, a schematic diagram
128
00:19:42,039 --> 00:19:47,080
like this and describe something and you will
say that is my model. That is the description
129
00:19:47,080 --> 00:19:55,970
part; that is not the model. You have to explain
the relationship. That means, in this particular
130
00:19:55,970 --> 00:20:01,379
diagram, this is the description plus relation.
Where is the relation?
131
00:20:01,379 --> 00:20:11,519
Relation is gamma 1, gamma 2, gamma 1 3, gamma
2 1, gamma 2 2, these are the relations and
132
00:20:11,519 --> 00:20:21,749
without this gamma and beta values, this is
only a description. If you use certain model
133
00:20:21,749 --> 00:20:31,909
which will give you all those beta and gamma estimates,
then explanation is completed. Now, using
134
00:20:31,909 --> 00:20:39,610
this side, the left hand side plus the estimates
of beta gamma and errors if you predict some
135
00:20:39,610 --> 00:20:52,960
values for Y 1 or for Y 2, that is what is
prediction in any model.
136
00:20:52,960 --> 00:21:06,490
I understand the problem, this is the first
class, basically second lecture here, now
137
00:21:06,490 --> 00:21:13,159
what I am trying to do is give you a pictorial
view here. Now, beta 1, gamma 1
138
00:21:13,159 --> 00:21:18,999
and all those things: what they mean, what
the issue is, and how they will be estimated,
139
00:21:18,999 --> 00:21:24,470
all those things will come slowly in the subsequent
lectures. You will come to know the estimation
140
00:21:24,470 --> 00:22:04,690
process. Here, this is only the diagram. Fine,
I understand. Now, your question is this one,
141
00:22:04,690 --> 00:22:11,840
this side. Actually, in this case, this is a
representation of multivariate regression.
142
00:22:11,840 --> 00:22:19,149
Here, we consider all these variables are
purely independent, and in this other case,
143
00:22:19,149 --> 00:22:25,990
this is a pictorial representation of path
model. Here, we consider that the independent
144
00:22:25,990 --> 00:22:33,610
variables can covary. Are you getting me? But
in reality you will not find that all
145
00:22:33,610 --> 00:22:43,490
variables are independent, truly independent.
So, this is a better representation.
146
00:22:43,490 --> 00:22:52,970
If multicollinearity is a problem, then
this is the case; in a few subsequent lectures
147
00:22:52,970 --> 00:22:56,159
it is there.
148
00:22:56,159 --> 00:23:05,240
Now, what are the different multivariate models
that will be discussed in this subject? So,
149
00:23:05,240 --> 00:23:13,110
I have given you some idea of what
multivariate analysis is, what multivariate
150
00:23:13,110 --> 00:23:20,559
statistical modeling is, what we mean by
multivariate, the different data types, and
151
00:23:20,559 --> 00:23:26,679
how practical problem can be converted to
statistical problem. In statistical domain,
152
00:23:26,679 --> 00:23:34,860
what are the problems then there we will now
see that there are different types of multivariate
153
00:23:34,860 --> 00:23:43,389
models, and primarily these models will be discussed in the subject. So, multivariate
154
00:23:43,389 --> 00:23:51,340
models are categorized into two broad groups:
one is dependence models and the other one
155
00:23:51,340 --> 00:23:58,659
is interdependence models. By dependence model,
we mean that there are two sets of variables.
156
00:23:58,659 --> 00:24:10,610
One is response side or response variables
another is explanatory variables, getting
157
00:24:10,610 --> 00:24:24,980
me. There are two sides, explanatory variables
are used to explain the response variables,
158
00:24:24,980 --> 00:24:38,450
so that is what dependence is; that is
the dependence structure. And the other
159
00:24:38,450 --> 00:24:49,200
is interdependence. If you see the interdependence
model, there is no response side at all; variables
160
00:24:49,200 --> 00:24:59,509
are coming under one bracket, all variables
under one bracket.
161
00:24:59,509 --> 00:25:04,679
What we mean to say is that there are relationships,
definitely, but within the same set of
162
00:25:04,679 --> 00:25:09,559
variables. There is no categorization that
one is a dependent variable or response variable and
163
00:25:09,559 --> 00:25:16,749
the other is an explanatory variable. So, based
on this concept, there are dependence models and
164
00:25:16,749 --> 00:25:28,330
there are interdependence models. Among the dependence
models, MANOVA is there, which is multivariate
165
00:25:28,330 --> 00:25:42,879
analysis of variance.
166
00:25:42,879 --> 00:25:48,499
Multivariate analysis of variance: I think
all of you know what analysis of variance is; that
167
00:25:48,499 --> 00:25:56,850
is ANOVA, and it is presumed that ANOVA
is known; only then can MANOVA be understood.
168
00:25:56,850 --> 00:26:11,960
Then multiple linear regression, then multivariate
linear regression. You have the path model; structural
169
00:26:11,960 --> 00:26:30,860
equation modeling is another. In addition, there are logistic regression and discriminant analysis;
170
00:26:30,860 --> 00:26:40,720
there are many more dependence models. But in this set of models, the basic feature is that
171
00:26:41,040 --> 00:26:52,360
we have two sets of variables, and one set can be affected by the other set. We can build a model like Y equal to
172
00:26:52,369 --> 00:27:02,519
a function of X; this type of relationship model we can build. The other group is what
173
00:27:02,519 --> 00:27:07,309
I say the interdependence model.
So, under interdependence models, there
174
00:27:07,309 --> 00:27:20,450
are principal component analysis, factor analysis,
cluster analysis. Here, we will not segregate
175
00:27:20,450 --> 00:27:27,119
the variables into dependent or independent
side, we take everything as just one set of
176
00:27:27,119 --> 00:27:35,869
variables. Then we want to see the covariance
structure of these X variables; there
177
00:27:35,869 --> 00:27:41,460
are many X variables. Based on this structure,
our primary objective in principal component
178
00:27:41,460 --> 00:27:52,570
analysis is reduction of dimension, so dimension
reduction. Factor analysis is not only dimension
179
00:27:52,570 --> 00:28:05,019
reduction but also naming the factors. So, dimension
reduction plus naming the factors. In cluster
180
00:28:05,019 --> 00:28:13,830
analysis, we want to group the individuals,
not the variables. In factor analysis and PCA,
181
00:28:13,830 --> 00:28:19,629
we basically group the variables, and so
we reduce the dimension, whereas in cluster
182
00:28:19,629 --> 00:28:23,789
analysis we will not group the variables.
We will group the individuals or items or
183
00:28:23,789 --> 00:28:36,330
objects on which data is collected and then
we make several clusters, cluster 1, cluster
184
00:28:36,330 --> 00:28:45,879
2, cluster 3. So, like this, several clusters
will be formed. Because we will
185
00:28:45,879 --> 00:28:52,919
try to cover so many models, it may not be
possible to cover all. But a substantial number
186
00:28:52,919 --> 00:29:00,239
of models will be covered here. But before
that, the basics of multivariate statistics,
187
00:29:00,239 --> 00:29:03,710
which are very, very important, will
be covered.
188
00:29:03,710 --> 00:29:12,389
So, as far as the sequence of the lectures
is concerned, next lecture we will be covering
189
00:29:12,389 --> 00:29:22,649
univariate statistics: a few lectures
on univariate statistics, followed by multivariate
190
00:29:22,649 --> 00:29:34,200
statistics. So, there may be few hours, 4
hours or 5 hours, let it be 4 to 5 hours of
191
00:29:34,200 --> 00:29:41,840
the univariate case; then
around 7 to 10 hours will be on multivariate
192
00:29:41,840 --> 00:29:46,659
statistics. Understanding of multivariate
statistics is very, very important for all
193
00:29:46,659 --> 00:29:54,669
of us, otherwise you will not be able to explain
the multivariate models. Now, what are those
194
00:29:54,669 --> 00:29:58,789
models, where from these models are coming,
what is the utility of these models, we will
195
00:29:58,789 --> 00:30:01,279
say many things and you will not be able to
explain.
196
00:30:01,279 --> 00:30:07,179
So, the backbone for all these multivariate
models that I have shown you just a few minutes
197
00:30:07,179 --> 00:30:14,860
back is multivariate statistics. Under multivariate
statistics, there will be multivariate normal
198
00:30:14,860 --> 00:30:32,759
distribution, the multivariate normal distribution,
which we will denote as MND, followed
199
00:30:32,759 --> 00:30:45,840
by multivariate descriptive statistics,
descriptive statistics which covers primarily
200
00:30:45,840 --> 00:31:05,119
mean vector, covariance matrix, and correlation matrix. So, under multivariate descriptive statistics,
201
00:31:05,120 --> 00:31:13,860
we will be concentrating on this.
Then under multivariate statistics, the next
202
00:31:13,860 --> 00:31:30,580
issue is multivariate inferential statistics,
so under multivariate inferential statistics,
203
00:31:30,580 --> 00:31:45,320
what are the things we will be covering?
First is Hotelling’s T square. Are you getting me?
204
00:31:45,320 --> 00:32:07,799
Hotelling’s T square, then confidence
regions and simultaneous confidence
205
00:32:07,799 --> 00:32:19,809
intervals, then hypothesis
testing. So, whatever you have learnt in univariate
206
00:32:19,809 --> 00:32:27,499
statistics, in univariate inferential statistics,
its counterpart in the multivariate domain
207
00:32:27,499 --> 00:32:35,350
will be discussed, up to hypothesis testing,
in case of one population and in case of
208
00:32:35,350 --> 00:32:44,539
two populations. All those things will be discussed,
and you will see them later on. As I told you,
209
00:32:44,539 --> 00:32:51,899
the multivariate normal distribution
is the crux of the matter. The univariate normal
210
00:32:51,899 --> 00:32:59,609
distribution all of you know, and we
will denote it like this: N of mu and sigma square,
211
00:32:59,609 --> 00:33:08,200
where mu is the population mean.
That population is characterized by one variable,
212
00:33:08,200 --> 00:33:15,720
that is x, and sigma square is the
variance of the population with respect to
213
00:33:15,720 --> 00:33:28,580
x. The density is one by square root of 2 pi sigma
square, times e to the power minus half of x minus mu
214
00:33:28,580 --> 00:33:43,889
by sigma, whole square, where the x value lies between
minus infinity and plus infinity; it is visible here.
215
00:33:43,889 --> 00:33:49,299
So, now what will be the multivariate counterpart
of this.
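The univariate normal density read out above can be written as a small function. This is only a sketch using the Python standard library; the function name and the example values of mu and sigma square are chosen here for illustration.

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x, where sigma2 is the variance.

    f(x) = 1 / sqrt(2 * pi * sigma2) * exp(-0.5 * ((x - mu) / sigma) ** 2)
    """
    if sigma2 <= 0:
        raise ValueError("variance must be positive")
    z = (x - mu) / math.sqrt(sigma2)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi * sigma2)


# Density at the mean and one unit away, for a hypothetical N(0, 1).
print(normal_pdf(0.0, 0.0, 1.0))
print(normal_pdf(1.0, 0.0, 1.0))
```

The density is highest at x equal to mu and is symmetric around it, which is the bell shape drawn on the slide.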
216
00:33:49,299 --> 00:33:59,869
So, if you draw the normal distribution, you will
find it like this. This is your univariate
217
00:33:59,869 --> 00:34:07,389
normal distribution, which is known as the probability
density function, or in other words we say f of x.
218
00:34:07,389 --> 00:34:18,030
In this direction, the x axis will be the x variable and
your y axis will be f of x. So, if you
219
00:34:18,030 --> 00:34:24,050
want to understand its multivariate counterpart,
that will be a difficult one; it is not
220
00:34:24,050 --> 00:34:32,090
that easy. Suppose there are two variables
x 1 and x 2; then there will be a joint
221
00:34:32,090 --> 00:34:45,480
distribution of x 1 and x 2. So, what is the distribution?
You will be getting something like this, a
222
00:34:45,480 --> 00:34:52,919
two dimensional picture, when we talk about
only two variables. When we talk about more
223
00:34:52,919 --> 00:34:59,910
than two variables, it is not possible to
draw, but you have to think of that concept.
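Even when the picture cannot be drawn, the joint density can still be evaluated. The sketch below assumes NumPy and uses the standard multivariate normal density with a mean vector and a covariance matrix; the lecture introduces this distribution formally later, so the function name and the example numbers here are illustrative only.

```python
import numpy as np


def mvn_pdf(x, mean, cov):
    """Multivariate normal density at the point x.

    f(x) = exp(-0.5 * (x - mean)' cov^{-1} (x - mean))
           / sqrt((2 * pi) ** p * det(cov))
    """
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    p = mean.shape[0]
    diff = x - mean
    # Solve cov @ w = diff rather than forming the inverse explicitly.
    quad = diff @ np.linalg.solve(cov, diff)
    norm_const = np.sqrt((2.0 * np.pi) ** p * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm_const)


# Two variables x1 and x2 with hypothetical mean vector and covariance matrix.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
print(mvn_pdf([0.5, -0.2], mean, cov))
```

The same function works unchanged for six variables: only the length of the mean vector and the size of the covariance matrix change, which is why the matrix domain is needed.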
224
00:34:59,910 --> 00:35:08,240
So, that abstraction level of thinking is
required for understanding multivariate normal
225
00:35:08,240 --> 00:35:15,280
distribution. As I told you that multivariate
normal distribution is very, very important
226
00:35:15,280 --> 00:35:24,560
because almost all the models in the multivariate
statistical modeling subject assume a multivariate
227
00:35:24,560 --> 00:35:31,610
normal distribution. If the data follow a
multivariate normal distribution, what will
228
00:35:31,610 --> 00:35:37,840
happen?
Most of the tests become possible; you can
229
00:35:37,840 --> 00:35:45,640
measure the goodness of fit with appropriate
statistical distribution, how the statistics
230
00:35:45,640 --> 00:35:54,130
that will be used in those models. So, you
will know all those things: mean vector, covariance
231
00:35:54,130 --> 00:36:03,910
matrix, correlation matrix; it is like this.
Now, with respect to this small company data
232
00:36:03,910 --> 00:36:12,460
I think I have told you that the data is
12 by 6; a 12 by 6 data matrix was given.
233
00:36:12,460 --> 00:36:21,290
So, you have seen the matrix, you have also
seen this matrix, now if I ask you what will
234
00:36:21,290 --> 00:36:29,790
be the distribution of this data, as
there are 6 variables? Are you getting me?
235
00:36:29,790 --> 00:36:42,290
So, it will be a 6 dimensional issue and your
joint distribution will be F x 1, x 2 to x
236
00:36:42,290 --> 00:36:54,290
6. So, how to get this distribution? You
cannot work now in the scalar domain like the
237
00:36:54,290 --> 00:37:00,430
univariate case; you have to go to the matrix
domain. As a result, you will see that
238
00:37:00,430 --> 00:37:06,840
I have written a few things under multivariate distribution:
one is your mean vector, so a
239
00:37:06,840 --> 00:37:12,680
vector is coming into consideration. Second
one is covariance matrix, third one is correlation
240
00:37:12,680 --> 00:37:24,900
matrix, so we will be discussing this after
univariate statistics. Once the multivariate
241
00:37:24,900 --> 00:37:30,670
normal distribution and descriptive statistics
are covered, we go for multivariate inferential
242
00:37:30,670 --> 00:37:36,220
statistics, which we will basically discuss
based on Hotelling’s T square.
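As a preview of what that statistic looks like, here is a minimal one-sample sketch, assuming NumPy. It uses the standard definition, T square equal to n times (x bar minus mu 0) transpose, S inverse, (x bar minus mu 0), and the usual scaling to an F statistic with p and n minus p degrees of freedom, which holds when the data are multivariate normal. The data and the hypothesized mean are made up, and the function name is hypothetical.

```python
import numpy as np


def hotelling_t2_one_sample(X, mu0):
    """One-sample Hotelling's T square and its F-scaled version.

    X is an n-by-p data matrix, mu0 the hypothesized mean vector.
    """
    X = np.asarray(X, dtype=float)
    mu0 = np.asarray(mu0, dtype=float)
    n, p = X.shape
    if n <= p:
        raise ValueError("need more observations than variables")
    xbar = X.mean(axis=0)                          # mean vector
    S = np.atleast_2d(np.cov(X, rowvar=False))     # covariance matrix (n - 1)
    diff = xbar - mu0
    t2 = float(n * diff @ np.linalg.solve(S, diff))
    f_stat = (n - p) / (p * (n - 1)) * t2          # compare with F(p, n - p)
    return t2, f_stat


# Made-up data: 4 observations on 2 variables, tested against mu0 = (1, 1).
X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
t2, f_stat = hotelling_t2_one_sample(X, [1.0, 1.0])
print(t2, f_stat)
```

The ingredients are exactly the mean vector and the covariance matrix from descriptive statistics, which is why those are covered first.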
243
00:37:36,220 --> 00:37:43,560
We will be using the Hotelling’s
T square distribution extensively. So, apart from this,
244
00:37:43,560 --> 00:37:52,860
I request all of you to do one more thing,
that you please go through probability distributions
245
00:37:52,860 --> 00:37:59,340
a little bit, because this knowledge
is very, very important, both discrete and
246
00:37:59,340 --> 00:38:14,640
continuous. The probability
distributions will not be taught; I will definitely
247
00:38:14,640 --> 00:38:21,400
bring in some probability distributions, but
I cannot explain each of the distributions,
248
00:38:21,400 --> 00:38:23,890
so you have to go through this.
249
00:38:23,890 --> 00:38:31,470
Another issue in this subject is the statistical
distribution, no, not statistical, sorry, the sampling
250
00:38:31,470 --> 00:38:43,930
distribution. It is also a probability distribution,
but it is the probability
251
00:38:43,930 --> 00:39:00,270
distribution of a statistic:
the probability distribution of one
252
00:39:00,270 --> 00:39:10,490
statistic, or there may
be several statistics. By a
253
00:39:10,490 --> 00:39:17,120
statistic, we are basically talking about
a random variable. In the univariate
254
00:39:17,120 --> 00:39:25,630
as well as the multivariate
case, what a statistic is and how it is a random
255
00:39:25,630 --> 00:39:30,360
variable, all those things will be discussed.
256
00:39:30,360 --> 00:39:40,180
Now, let us come back again to the introduction
257
00:39:40,180 --> 00:39:49,840
part. I told you that multivariate analysis,
for this particular subject,
258
00:39:49,840 --> 00:39:56,370
is purely for practice, for the people who
will be using it for solving the real life
259
00:39:56,370 --> 00:40:04,160
problems. Now, what do we mean by real life
problems? Here are some illustrative examples. Quality
260
00:40:04,160 --> 00:40:14,110
control: any idea about quality control? For
example, suppose in Kharagpur
261
00:40:14,110 --> 00:40:21,110
there is Tata Bearings, there is Tata Metaliks.
Now, Tata Bearings will produce ball bearings,
262
00:40:21,110 --> 00:40:29,130
roller bearings. These components they produce,
and these components have certain quality features.
263
00:40:29,130 --> 00:40:41,510
For example, the dimensions; for example, the
strength. Now, customers require the bearings,
264
00:40:41,510 --> 00:40:47,580
the product with certain quality features.
These quality features are basically converted
265
00:40:47,580 --> 00:40:54,770
into specifications. If you produce beyond
specifications, meaning not within the specification
266
00:40:54,770 --> 00:41:01,670
provided by the customer, what will happen?
That product will be rejected by the customer.
267
00:41:01,670 --> 00:41:07,660
But, if you want to find out why you are producing
rejects then you will find out that there
268
00:41:07,660 --> 00:41:10,980
is a problem with the process manufacturing
process.
269
00:41:10,980 --> 00:41:18,270
Now, manufacturing process variables
affect the product quality, so
270
00:41:18,270 --> 00:41:23,660
the product quality variables will be on the
dependent side and the process variables will
271
00:41:23,660 --> 00:41:31,720
be on the independent side. You can model
product quality vis-a-vis process variables. Numerous
272
00:41:31,720 --> 00:41:41,320
examples of quality control issues solved using
statistical techniques are there in the literature.
273
00:41:41,320 --> 00:41:48,550
The second example is a hotel room
service case, a service example. Suppose you
274
00:41:48,550 --> 00:41:57,460
see a hotel, such a big
hotel. I am basically talking about a hotel where
275
00:41:57,460 --> 00:42:04,710
they are providing service to many
customers. So, they also require some sort
276
00:42:04,710 --> 00:42:13,300
of service quality management; otherwise
they will not earn the profit they are
277
00:42:13,300 --> 00:42:18,020
interested in. So, there also a lot of data is generated, and
278
00:42:18,020 --> 00:42:30,010
you have to use those data, develop a model
and serve the purpose for which the model
279
00:42:30,010 --> 00:42:36,770
is developed. Basically, the hotel room service
can be made much better. Test of medicine is another
280
00:42:36,770 --> 00:42:43,150
issue. Test of medicine means that every
month you will see that there is a different
281
00:42:43,150 --> 00:42:49,650
medicine coming for different diseases. That means
for every single disease, there may be 4 or
282
00:42:49,650 --> 00:42:57,410
5 medicines. So, my question is, how
do I know that a medicine works? Medicine A
283
00:42:57,410 --> 00:43:01,790
works for some groups, medicine B may work
for some other groups even though it is the same
284
00:43:01,790 --> 00:43:08,830
disease, or some medicines are performing better
than the other medicines for the same disease.
285
00:43:08,830 --> 00:43:14,570
So, that type of testing is also possible
if you collect appropriate data.
286
00:43:14,570 --> 00:43:21,540
Now, vendor selection is a very
tricky problem, a very essential and critical
287
00:43:21,540 --> 00:43:30,580
problem in almost all small, medium and large enterprises,
because your vendor’s, the supplier’s, quality
288
00:43:30,580 --> 00:43:39,880
is very important. If you get poor items,
poor raw materials supplied by your vendor,
289
00:43:39,880 --> 00:43:48,120
ultimately your product quality will suffer.
So, how to go about it? What modeling is
290
00:43:48,120 --> 00:43:54,040
possible there, so that you will select the
best vendor? There are different ways of
291
00:43:54,040 --> 00:44:01,340
selection. Someone may go for a route other than
the statistical route, but multivariate statistics
292
00:44:01,340 --> 00:44:06,820
also helps you in selecting vendors. Next, safety
management.
293
00:44:06,820 --> 00:44:14,470
So, as I told you a few minutes back, there are many
variables affecting people’s safety.
294
00:44:14,470 --> 00:44:19,430
I want to know which variables
affect it more and which variables affect it
295
00:44:19,430 --> 00:44:27,450
less, so that I can take appropriate actions
which will improve the overall safety standard
296
00:44:27,450 --> 00:44:38,530
of the plant, of the shop floor, of the workplace,
everywhere. Now, lean production. What does
297
00:44:38,530 --> 00:44:43,310
lean production mean?
Suppose inventory control
298
00:44:43,310 --> 00:44:49,370
is a big issue. You are producing 10 items,
but on the raw material side you are keeping
299
00:44:49,370 --> 00:44:56,900
a huge amount of raw material in inventory. It
is a loss to the company. So, what do you want?
300
00:44:56,900 --> 00:45:04,210
You want to minimize the inventory. But you
cannot minimize inventory unless you have
301
00:45:04,210 --> 00:45:09,480
taken some steps, some buffer or something
else, through which you can satisfy
302
00:45:09,480 --> 00:45:16,740
the customer at the same time, based on your
product requirement. Now, lean production
303
00:45:16,740 --> 00:45:22,130
says that you minimize everything in such
a manner that you will satisfy the customer
304
00:45:22,130 --> 00:45:29,060
and, from the inventory point of view,
everything is also at the minimum level: raw material
305
00:45:29,060 --> 00:45:34,360
inventory, work-in-process inventory, everything
will be at the minimum levels.
306
00:45:34,360 --> 00:45:42,630
There are very good models in the operations research
area, but statistics can also be useful there.
307
00:45:42,630 --> 00:45:48,730
Now, damage during transportation. You see,
damage during transportation is a logistics
308
00:45:48,730 --> 00:45:57,450
problem. I think I will give you one
tutorial later on, on how
309
00:45:57,450 --> 00:46:04,480
to model that damage with respect to
what mode of transport you are using,
310
00:46:04,480 --> 00:46:09,650
what distance you are transporting over,
and what type of packaging system you are using.
311
00:46:09,650 --> 00:46:15,620
So, all those things ultimately lead to damage,
and it is a huge cost to the company because
312
00:46:15,620 --> 00:46:24,200
you are basically supplying those products
to the customer. Next, effectiveness of training.
313
00:46:24,200 --> 00:46:27,290
That is there.
Now, the question is whether the
314
00:46:27,290 --> 00:46:35,030
teaching should be PPT based or blackboard based,
chalk and talk. There are questions like that
315
00:46:35,030 --> 00:46:41,160
in training. At the industry level,
whether it is basically classroom training
316
00:46:41,160 --> 00:46:45,780
or on-shop-floor training, what will be there,
what will be the teachers’ combination,
317
00:46:45,780 --> 00:46:52,190
the faculty’s combination: so many issues
are there. So, if you collect data effectively,
318
00:46:52,190 --> 00:46:58,780
you will be able to use multivariate statistical
models to find out which mode of training
319
00:46:58,780 --> 00:47:06,580
or which type of
training is better. Marketing is an area where
320
00:47:06,580 --> 00:47:17,040
statistics is used very
much. For example, if you go through any marketing
321
00:47:17,040 --> 00:47:22,500
journal, you will find that it is full
of statistics.
322
00:47:22,500 --> 00:47:30,610
Many models come from that side. Also, marketing
research is an area where you will be
323
00:47:30,610 --> 00:47:35,890
using statistics for marketing performance.
Basically, we have conducted one study to
324
00:47:35,890 --> 00:47:46,410
understand the purchase intention of customers;
we used structural equation modeling there.
325
00:47:46,410 --> 00:47:52,370
You can use a structural equation model there.
So, some other
326
00:47:52,370 --> 00:48:01,790
illustrative examples can be framed, but you
are the person who knows your system, so please
327
00:48:01,790 --> 00:48:08,960
find out some case.
Find some example from your workplace, your domain
328
00:48:08,960 --> 00:48:15,090
of expertise, so that whatever I teach
here you will be able to translate to your
329
00:48:15,090 --> 00:48:24,270
own system. In administrative science, management
science, social science, everywhere multivariate
330
00:48:24,270 --> 00:48:30,170
statistics is used. You have to find out
all those things, and I will be giving them one
331
00:48:30,170 --> 00:48:36,900
after another. Based on
these explanations and the things you will
332
00:48:36,900 --> 00:48:44,010
be learning here, you try to find an analogy
to your system and develop accordingly. Then
333
00:48:44,010 --> 00:48:51,150
learning will be complete; otherwise learning
will remain incomplete. You have to be very
334
00:48:51,150 --> 00:48:54,700
careful about this.
335
00:48:54,700 --> 00:49:04,320
I will show you some cases later on, so that
you understand the totality, as I told
336
00:49:04,320 --> 00:49:10,060
you, starting from practical problem to
practical solution. That means practical problem,
337
00:49:10,060 --> 00:49:16,320
statistical problem, statistical solution,
practical solution: this total cycle will
338
00:49:16,320 --> 00:49:24,950
be completed using these cases. The first case
is the characteristics of accident-involved and
339
00:49:24,950 --> 00:49:34,750
non-involved workers working in an underground
coal mine. I told you about this in safety management.
340
00:49:34,750 --> 00:49:42,650
I just told you that in safety management
one of the theories that
341
00:49:42,650 --> 00:49:47,110
was there is now not that popular.
342
00:49:47,110 --> 00:49:56,070
But this theory still works in the sense that some people are inherently accident prone;
343
00:49:56,070 --> 00:50:08,680
irrespective of the situation, they meet with an
accident. But some people are able to avoid
344
00:50:08,680 --> 00:50:21,010
accidents. So, that is the issue. Then we started
thinking about whether people can be categorized
345
00:50:21,010 --> 00:50:30,920
as accident prone or not accident prone. So,
there are twenty-two variables we have collected,
346
00:50:30,920 --> 00:50:40,170
and I will show you this case wherever it
is required, based on the need of the model.
347
00:50:40,170 --> 00:50:47,470
Then, the second case is a study of
process variables as well as quality
348
00:50:47,470 --> 00:50:54,560
characteristics in a cast iron melting process.
That is also a real case study like the earlier
349
00:50:54,560 --> 00:51:02,810
one, and there we considered two sets of
variables.
350
00:51:02,810 --> 00:51:10,530
We found out the relationship between the
two sets and, based on this relationship, how
351
00:51:10,530 --> 00:51:19,010
effectively the quality of the melting process
can be improved. The third one is job stress assessment
352
00:51:19,010 --> 00:51:30,670
for industrial workers. So, there are people from
officials to the shop floor, maybe a worker
353
00:51:30,670 --> 00:51:37,090
like a mechanic or someone who is doing
welding. So, different job profiles are there,
354
00:51:37,090 --> 00:51:45,160
their responsibilities and roles vary, and
accordingly they may suffer
355
00:51:45,160 --> 00:51:55,840
from different levels of job stress.
So, I will show you in that case how the job profile
356
00:51:55,840 --> 00:52:02,340
as well as the demographic variables
affect job stress, and you will see
357
00:52:02,340 --> 00:52:10,980
in what way it is done. You will face a similar situation
in your work also, and in effect
358
00:52:10,980 --> 00:52:19,410
you will be able to straightaway translate
this to your work. It is possible. Are you getting
359
00:52:19,410 --> 00:52:31,710
me? Any questions? Because this is a purely qualitative
lecture. But from the next class onwards it
360
00:52:31,710 --> 00:52:43,630
will be mathematical. Then there are
some famous quotes; I think all of you know them.
361
00:52:43,630 --> 00:52:53,320
Let us close today’s lecture with these famous quotes
on statistics.
362
00:52:53,320 --> 00:53:05,280
You know George Bernard Shaw; what
has he said? "It is the mark
363
00:53:05,280 --> 00:53:16,700
of a truly intelligent person to be moved
by statistics." Now, you know Mark Twain also,
364
00:53:16,700 --> 00:53:25,810
but he says, "There are lies, damned lies and
statistics." These are two different philosophies. Mark
365
00:53:25,810 --> 00:53:31,890
Twain is saying, do not believe in statistics,
because statistics is the biggest lie,
366
00:53:31,890 --> 00:53:39,790
at that level.
But we believe in Edwards Deming, because
367
00:53:39,790 --> 00:53:48,130
ultimately Edwards Deming is the quality guru,
in the sense that he used statistics in the quality
368
00:53:48,130 --> 00:53:57,830
domain, the quality management domain. He showed
that statistics can do wonders in improving
369
00:53:57,830 --> 00:54:08,170
quality, in terms of total quality management.
What he said is, "In God we trust; all others must
370
00:54:08,170 --> 00:54:22,180
bring data." If there is no data, I will not
believe. So, we can then summarize
371
00:54:22,180 --> 00:54:24,140
what you have learnt today.
372
00:54:24,140 --> 00:54:37,140
So, today we started first with
the definition of the word multivariate.
373
00:54:37,140 --> 00:54:55,620
Then we said how statistical modeling
will be useful. We also said that when you
374
00:54:55,620 --> 00:55:05,070
talk about multivariate, the multivariate
observations need to be carefully measured;
375
00:55:05,070 --> 00:55:13,060
you require careful measurement of multivariate
observations. You must be thorough about the
376
00:55:13,060 --> 00:55:24,090
data types: nominal data, ordinal
data, interval data and ratio data.
377
00:55:24,090 --> 00:55:31,500
If possible, you go by this order; that means, if
possible, collect ratio data.
378
00:55:31,500 --> 00:55:38,940
We have also seen that ratio data and interval
data basically come under continuous data,
379
00:55:38,940 --> 00:55:51,830
and the first two come under discrete data. But by
discrete data we also mean
380
00:55:51,830 --> 00:56:02,590
count data; that we will keep in mind. Then
I showed you that multivariate analysis
381
00:56:02,590 --> 00:56:11,980
means you have to work in the domain of matrices,
because X, n cross p, is our multivariate
382
00:56:11,980 --> 00:56:23,630
observation, and it can be written like
this. There are several vectors,
383
00:56:23,630 --> 00:56:32,300
x 1, x 2, up to x p, where x 1 denotes all observations
in the first column, x 2 denotes
384
00:56:32,300 --> 00:56:36,120
the observations in the second column; these are the
variable vectors.
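(The n cross p data matrix described here can be sketched as follows, assuming n observations in rows and p variables in columns. The entry notation x_{ij}, observation i on variable j, is an assumption for illustration and is not read from the slide.)
```latex
\[
\mathbf{X}_{n \times p} =
\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1p} \\
x_{21} & x_{22} & \cdots & x_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
=
\begin{bmatrix} \mathbf{x}_1 & \mathbf{x}_2 & \cdots & \mathbf{x}_p \end{bmatrix},
\qquad
\mathbf{x}_j =
\begin{bmatrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{bmatrix},
\quad j = 1, \dots, p.
\]
```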
385
00:56:36,120 --> 00:56:42,630
Variable vector 1, variable vector 2, variable vector
3: that is very important for all of
386
00:56:42,630 --> 00:56:58,690
us. Then another thing we said is, do
not rely on technique, do not be biased towards
387
00:56:58,690 --> 00:57:14,190
techniques. You go by the problem, go by the need.
What I mean to say is, go by the problem
388
00:57:14,190 --> 00:57:19,280
and carefully define it.
389
00:57:19,280 --> 00:57:32,670
So, that means you are basically going from
practical problem to statistical problem,
390
00:57:32,670 --> 00:57:43,890
statistical problem to statistical solution,
statistical solution to practical solution.
391
00:57:43,890 --> 00:57:50,810
Then from the practical solution back to the practical problem. What you
have also understood here is
392
00:57:50,810 --> 00:58:07,580
that in multivariate models
there are two sets of modeling techniques.
One is dependence and the other one is interdependence.
393
00:58:07,580 --> 00:58:18,080
Correct? But please keep in mind that these
two will not be fully understood if you do
394
00:58:18,080 --> 00:58:28,220
not understand multivariate statistics, mainly
the descriptive statistics part. Three things
395
00:58:28,220 --> 00:58:35,330
you have to understand here; these three things
are the multivariate normal distribution, descriptive
396
00:58:35,330 --> 00:58:41,710
statistics and inferential statistics.
397
00:58:41,710 --> 00:59:11,360
Last, but also very important:
your prerequisite is basic statistics
398
00:59:11,360 --> 00:59:25,990
and, I think, some linear algebra will be better.
So, thank you very much, we will meet tomorrow
399
00:59:25,990 --> 00:59:26,850
again.