QM Exam 1
Term | Definition |
---|---|
data warehouses | vast digital repositories that record and store data electronically |
Big Data | a term used to describe data sets so large that traditional methods of storage and analysis are inadequate |
transactional data | data collected to record a company's transactions |
data mining or predictive analytics | the process of using data, especially transactional data, to make decisions and predictions |
business analytics | describes any use of data and statistical analysis to drive business decisions, whether the purpose is predictive or simply descriptive |
data | numerical, alphabetic, or alphanumerical; useless unless we know what it represents |
context | answering the questions who, what, when, where, why, and how can make data values meaningful |
data table | clearly shows who the data are about and what was measured |
cases | rows of a data table correspond to individual __________ |
variables | the recorded characteristics of the cases; the columns of a data table |
respondents | individuals who answer a survey |
subjects/participants | people on whom we experiment |
experimental units | animals, plants, websites, and other inanimate subjects |
records | rows in a database |
metadata | typically contains information about how, when, and where (and possibly why) the data were collected; who each case represents; and the definitions of all variables |
spreadsheet | a name that comes from bookkeeping ledgers of financial information |
relational database | two or more separate data tables are linked together so that information can be merged across them |
categorical/qualitative variable | when the values of a variable are simply the names of categories |
quantitative variable | when the values of a variable are measured numerical quantities |
identifier variables | categorical variables whose only purpose is to assign a unique identifier code to each individual in the data set |
ordinal | the variable is ______________ when the values of a categorical variable have an intrinsic order |
nominal | categorical variable with unordered categories |
cross-sectional data | several variables are measured at the same time point |
frequency table | records the counts for each of the categories of the variable |
area principle | says that the area occupied by a part of the graph should correspond to the magnitude of the value it represents |
bar chart | displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison |
relative frequency bar chart | replaces the counts with percentages to draw attention to the relative proportion of cases in each category (see the frequency-table sketch after this table) |
pie chart | shows how a whole group breaks into several categories |
contingency tables | they show how individuals are distributed along each variable depending on, or contingent on, the value of the other variable |
marginal distribution | when presented like this, at the margins of a contingency table, the frequency distribution of either one of the variables is called __________ |
cell | any intersection of a row and column of the table; gives the count for a combination of values of the two variables |
total percent, row percent, or column percent | the three ways most statistics programs offer to express the counts in a contingency table as percentages (all three are computed in the contingency-table sketch after this table) |
conditional distribution | shows the distribution of one variable for just those cases that satisfy a condition on another |
independent | in a contingency table, when the distribution of one variable is the same for all categories of another variable, we say that the two variables are ________ |
segmented (or stacked) bar chart | treats each bar as the "whole" and divides it proportionally into segments corresponding to the percentage in each group |
mosaic plot | looks like a segmented bar chart, but obeys the area principle better by making the bars proportional to the sizes of the groups |
Simpson's Paradox | a result in which a relationship that holds within each of several groups reverses or disappears when the groups are combined; the lesson is to combine only compatible measurements for comparable individuals |
bins | equal-width intervals that slice up a quantitative variable's range; their counts give the distribution of the variable and provide the building blocks for the display of the distribution called a histogram (see the binning sketch after this table) |
histogram | plots the bin counts as the heights of bars |
gaps | indicate a region where there are no values |
relative frequency histogram | an alternative to the histogram that reports the percentage of cases in each bin |
stem-and-leaf displays | like histograms, but they also show the individual values |
quantitative data condition | the data must be values of a quantitative variable whose units are known |
shape, center, and spread | when you describe a distribution, you should pay attention to these three things |
shape | we describe the shape of a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values |
modes | humps of a histogram |
unimodal | a distribution whose histogram has one main hump |
bimodal | distributions whose histograms have two humps |
multimodal | histograms with three or more humps |
uniform | a distribution whose histogram doesn't appear to have any mode and in which all the bars are approximately the same height |
symmetric | the halves of a distribution on either side of the center look, at least approximately, like mirror images |
tails | the (usually) thinner ends of a distribution |
skewed | if one tail stretches out farther than the other, the distribution is said to be ________ to the side of the longer tail |
outliers | any stragglers that stand off away from the body of the distribution |
mean (average) | add up all the values of the variable, x, and divide that sum by the number of data values |
median | the value that splits the histogram into two equal areas |
range | the difference between the extremes: max-min |
lower quartile (Q1) | value for which one quarter of the data lie below it |
upper quartile (Q3) | value for which one quarter of the data lie above it |
interquartile range (IQR) | summarizes the spread by focusing on the middle half of the data; it's defined as the difference between the two quartiles: Q3-Q1 |
variance | the average of the squared deviations |
standard deviation | we want measures of spread to have the same units as the data, so we usually take the square root of the variance, giving the __________ |
standardized value | the value obtained by subtracting the mean and dividing by the standard deviation (see the summary-statistics sketch after this table) |
z-score | tells us how many standard deviations a value is from its mean |
five-number summary | reports a distribution's median, quartiles, and extremes (max and min) |
boxplot | displays the information from a five-number summary |
stationary | when a time series has no strong trend or change in variability |
time series plot | a display of values against time |
re-express/transform | one way to make a skewed distribution more symmetric is to ___________ the data by applying a simple function to all the data values |
scatterplot | plots one quantitative variable against another |
direction | pattern that can either be negative, positive, or neither |
form | straight, curved, exotic, no pattern? |
straight line relationship/linear form | will appear as a cloud or swarm of points stretched out in a generally consistent, straight form |
strength | tightly clustered in a single stream or so variable and spread out that we can barely discern a trend or pattern? |
explanatory or predictor variable | variable on the x-axis |
response variable | variable on the y-axis |
independent and dependent variables | the idea is that the y-variable depends on the x-variable and the x-variable acts independently to make y respond |
correlation coefficient | a numerical measure of the direction and strength of a linear association |
correlation | measures the strength of the linear association between two quantitative variables |
quantitative variables condition | correlation applies only to quantitative variables |
linearity condition | correlation measures the strength only of the linear association and will be misleading if the relationship is not straight enough |
outlier condition | unusual observations can distort the correlation and can make an otherwise small correlation look big or, on the other hand, hide a large correlation |
lurking variable | some third variable that affects both of the variables you have observed |
linear model | just an equation of a straight line through the data |
predicted value | the prediction for y found for each x-value in the data; found by substituting the x-value in the regression equation; values on the fitted line |
residual | the difference between the observed value and the predicted value |
line of best fit/least squares line | the line for which the sum of the squared residuals is smallest |
slope | b1 is given in y-units per x-unit. differences of one unit in x are associated with differences of b1 units in predicted values of y |
intercept | the value of the line when the x-variable is zero |
regression lines | common name for least squares lines |
regression to the mean | because the correlation is always less than 1.0 in magnitude, each predicted y tends to be fewer standard deviations from its mean than its corresponding x is from its mean |
quantitative data condition | pretty easy to check, but don't be fooled by categorical data recorded as numbers |
linearity assumption | the regression model assumes that the relationship between the variables is, in fact, linear |
linearity condition | the two variables must have a linear association, or the model won't mean a thing and decisions you base on the model may be wrong |
outlier condition | make sure that no points need special attention |
independence assumption | assumption that the residuals are independent of each other |
equal spread condition | the assumption that the standard deviation of the residuals is the same everywhere around the line gives us this condition |
R-squared | all regression analyses include this statistic, although by tradition it is written with a capital letter; it gives the fraction of the variability of y accounted for by the model and is often given as a percentage (see the regression sketch after this table) |
Spearman rank correlation | works with the ranks of the data rather than their values |
random phenomena | we can't predict the individual outcomes, but we can hope to understand characteristics of their long-run behavior |
trial | each attempt of a random phenomenon |
outcome | the value generated by each trial of a random phenomenon |
event | a more general term used to refer to outcomes or combinations of outcomes |
sample space | a special event; the collection of all possible outcomes |
probability | the long-run relative frequency of an event's occurrence |
independence | the outcome of one trial doesn't influence or change the outcome of another |
Law of Large Numbers (LLN) | states that if the events are independent, then as the number of trials increases, the long-run relative frequency of any outcome gets closer and closer to a single value (simulated in a sketch after this table) |
empirical probability | because it is based on repeatedly observing the event's outcome, this definition of probability is often called ____________ |
theoretical probability | when we have equally likely outcomes, the probability of an event is the number of outcomes in the event divided by the total number of possible outcomes |
personal probability | a subjective degree of belief in an event, not based on long-run frequency or equally likely outcomes |
probability | a number between 0 and 1 |
probability assignment rule | the probability of the set of all possible outcomes must be 1. P(S) =1 |
complement rule | the probability of an event occurring is 1 minus the probability that it doesn't occur. P(A)=1-P(A^c) |
multiplication rule | to find the probability that two independent events occur, we multiply the probabilities; P(A and B)=P(A) x P(B), provided that A and B are independent |
disjoint or mutually exclusive | two events are _________ if they have no outcome in common |
addition rule | allows us to add the probabilities of disjoint events to get the probability that either event occurs: P(A or B)=P(A) + P(B), provided that A and B are disjoint |
general addition rule | does not require disjoint events: P(A or B)=P(A) + P(B) - P(A and B) for any two events A and B (checked in the probability-rules sketch after this table) |
marginal probability | uses a marginal frequency (from either the total row or total column) to compute the probability |
joint probabilities | probability that two events occur together |
conditional probability | a probability that takes into account a given condition |
general multiplication rule | a multiplication rule for compound events that does not require the events to be independent: P(A and B)=P(A) x P(B|A) for any two events A and B |
independent | events A and B are __________ whenever P(B|A)=P(B) |
tree diagram | probability tree used to help think through the decision-making process |
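
A minimal Python sketch of a frequency table and its relative frequency version (see the bar-chart entries above); the website-source values are made up for illustration.

```python
# Sketch: a frequency table and relative frequencies for a categorical
# variable. The website-source values below are made up for illustration.
from collections import Counter

sources = ["search", "direct", "email", "search", "social",
           "search", "direct", "email", "search", "social"]

counts = Counter(sources)        # frequency table: a count for each category
total = sum(counts.values())

for category, count in counts.most_common():
    # a relative frequency table replaces each count with a percentage
    print(f"{category:<8} {count:>3}  {count / total:.0%}")
```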
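
A minimal sketch of a contingency table's marginal distribution, a conditional distribution, and the total/row/column percent choices; the 2x2 counts (Sex by shopping Preference) are hypothetical.

```python
# Sketch: marginal and conditional distributions plus the three percent
# choices for a contingency table. The 2x2 counts below are hypothetical.
table = {
    "Male":   {"Online": 30, "In-store": 20},
    "Female": {"Online": 40, "In-store": 10},
}
rows = list(table)
cols = list(table[rows[0]])

row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
grand = sum(row_totals.values())

# marginal distribution of Preference (read off the total row)
print("marginal:", {c: col_totals[c] / grand for c in cols})

# conditional distribution of Preference for just the Male cases
print("given Male:", {c: table["Male"][c] / row_totals["Male"] for c in cols})

# total percent, row percent, and column percent for each cell
for r in rows:
    for c in cols:
        n = table[r][c]
        print(f"{r}/{c}: total {n / grand:.0%}, row {n / row_totals[r]:.0%}, "
              f"column {n / col_totals[c]:.0%}")
```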
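
A minimal sketch of slicing a quantitative variable into bins, the building blocks of a histogram; the values and the bin width of 10 are arbitrary choices.

```python
# Sketch: grouping values into equal-width bins and printing a text
# histogram. The data values and bin width are made up for illustration.
values = [3, 7, 8, 12, 13, 14, 18, 21, 22, 24, 31, 35]
width = 10

bins = {}
for v in values:
    low = (v // width) * width          # left edge of the bin holding v
    bins[low] = bins.get(low, 0) + 1    # bin count, i.e. the bar height

for low in sorted(bins):
    count = bins[low]
    share = count / len(values)         # relative frequency version
    print(f"[{low:>2}, {low + width:>2}): {'#' * count}  ({count}, {share:.0%})")
```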
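
A minimal sketch of the one-variable summaries (mean, median, quartiles, IQR, variance, standard deviation, z-score, five-number summary) using only Python's standard library; the data values are made up, and note that statistics.variance and statistics.stdev use the usual n - 1 sample denominator.

```python
# Sketch: numerical summaries of a quantitative variable. The data values
# below are made up for illustration.
import statistics

data = [12, 15, 17, 18, 21, 24, 25, 29, 33, 48]

mean = statistics.mean(data)                 # sum of the values divided by n
median = statistics.median(data)             # splits the histogram into equal areas
q1, _, q3 = statistics.quantiles(data, n=4)  # lower and upper quartiles
iqr = q3 - q1                                # spread of the middle half: Q3 - Q1
variance = statistics.variance(data)         # squared deviations, n - 1 denominator
sd = statistics.stdev(data)                  # square root of the variance, in data units

z = (48 - mean) / sd                         # z-score: standard deviations from the mean

five_number = (min(data), q1, median, q3, max(data))
print(f"mean={mean:.1f} median={median} IQR={iqr:.1f} sd={sd:.2f} z(48)={z:.2f}")
print("five-number summary:", five_number)
```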
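
A minimal sketch of the correlation coefficient and the least squares line built from it, using the identities slope b1 = r(sy/sx) and intercept b0 = ybar - b1(xbar); the (x, y) pairs are made up, and for a simple straight-line fit R-squared is just the square of r.

```python
# Sketch: correlation and the least squares line from their definitions.
# The (x, y) pairs below are made up for illustration.
import statistics

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]

x_bar, y_bar = statistics.mean(x), statistics.mean(y)
s_x, s_y = statistics.stdev(x), statistics.stdev(y)
n = len(x)

# correlation: the average product of z-scores (n - 1 in the denominator)
r = sum((xi - x_bar) / s_x * (yi - y_bar) / s_y for xi, yi in zip(x, y)) / (n - 1)

b1 = r * s_y / s_x        # slope, in y-units per x-unit
b0 = y_bar - b1 * x_bar   # intercept: the predicted value when x is zero

y_hat = [b0 + b1 * xi for xi in x]                 # predicted values on the line
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # observed minus predicted
r_squared = r ** 2                                 # fraction of variability accounted for

print(f"r = {r:.3f}   y_hat = {b0:.2f} + {b1:.2f} x   R^2 = {r_squared:.1%}")
```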
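
A minimal simulation of the Law of Large Numbers with independent fair coin flips: the relative frequency of heads drifts toward the single value 0.5 as the number of trials grows. The seed and checkpoints are arbitrary.

```python
# Sketch: Law of Large Numbers. With independent trials, the long-run
# relative frequency of heads settles toward 0.5.
import random

random.seed(1)          # fixed seed so the run is reproducible
heads = 0
for trial in range(1, 100_001):
    heads += random.random() < 0.5          # one independent coin-flip trial
    if trial in (10, 100, 1_000, 10_000, 100_000):
        print(f"after {trial:>7} trials: relative frequency = {heads / trial:.4f}")
```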
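
Finally, a minimal sketch that checks the probability rules by counting equally likely outcomes in a small sample space (two fair dice); the events A and B are hypothetical examples chosen so that they are independent.

```python
# Sketch: verifying the probability rules on the sample space of two fair
# dice, where all 36 outcomes are equally likely.
from fractions import Fraction

S = [(d1, d2) for d1 in range(1, 7) for d2 in range(1, 7)]  # sample space

def p(event):
    # theoretical probability: outcomes in the event / total outcomes
    return Fraction(sum(event(o) for o in S), len(S))

A = lambda o: o[0] == 6     # event A: the first die shows 6
B = lambda o: o[1] == 6     # event B: the second die shows 6

assert p(lambda o: True) == 1                         # probability assignment rule
assert p(lambda o: not A(o)) == 1 - p(A)              # complement rule
assert p(lambda o: A(o) and B(o)) == p(A) * p(B)      # multiplication rule (independent events)
assert p(lambda o: A(o) or B(o)) == p(A) + p(B) - p(lambda o: A(o) and B(o))  # general addition rule

# conditional probability and the general multiplication rule
p_B_given_A = p(lambda o: A(o) and B(o)) / p(A)
assert p(lambda o: A(o) and B(o)) == p(A) * p_B_given_A
print("P(A) =", p(A), " P(B|A) =", p_B_given_A)
```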