
Lesson 1

Categorical Variables: word descriptions that reflect categories to
which an individual belongs
Use frequency table: category (x) and count/percentage
(frequency- y)
Pie chart
o Can only graph parts of a whole; adds up to 100%
o Can't compare results of different studies
Bar graph:
o Horizontal axis: labels/words (category)
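A frequency table can be sketched in a few lines of Python. This is only an illustration: the blood-type data and the name frequency_table are made up and not part of the lesson.

```python
from collections import Counter

# Hypothetical categorical data (made up for illustration)
blood_types = ["O", "A", "O", "B", "A", "O", "AB", "O", "A", "O"]

def frequency_table(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    n = len(values)
    return {category: (count, 100 * count / n) for category, count in counts.items()}

table = frequency_table(blood_types)
for category, (count, percent) in sorted(table.items()):
    print(f"{category:<3} {count:>3} {percent:5.1f}%")
```

The percentages add up to 100%, which is why the same table could feed a pie chart.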
Quantitative variable: numeric values that can be added or averaged
Histogram
o Horizontal axis: scaled according to measurements
(numbers!!)
o Vertical axis: frequency or percent
o Best for large data sets; shows the number of data points
that fall in each class, not each individual data point
Stem plots
o Stems and leaves
o Reveal outliers
Time plots
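How a histogram groups data into classes can be sketched as below. The scores, the class width of 10, and the name class_counts are assumptions chosen for illustration.

```python
# Hypothetical quantitative data (made up for illustration)
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 79, 82, 85, 88, 97]

def class_counts(data, start, width, num_classes):
    """Count how many values fall in each class [lo, lo + width)."""
    counts = [0] * num_classes
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < num_classes:
            counts[index] += 1
    return counts

counts = class_counts(scores, start=50, width=10, num_classes=5)
for i, count in enumerate(counts):
    lo = 50 + 10 * i
    # One star per data point in the class: a sideways text histogram
    print(f"{lo}-{lo + 9}: {'*' * count}")
```

Only the count per class is kept, so the individual data points are no longer visible.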
Differences between bar graph and histogram
Bar Graph                                Histogram
For categorical data                     For quantitative data
Shape, center, spread have NO meaning    Shape, center, and spread are important!!
Bars are separated by spaces             Bars are adjacent; gaps indicate no data in that interval
Histogram
Center: value on the horizontal axis that represents the midpoint
of the distribution (half of the data on each side of it)
Spread: variability of data in a distribution; describes how tightly
grouped or widely dispersed the data are around the center
o Look for min and max values on x-axis
Outliers: observations that fall outside overall pattern of
distribution
Bell shaped, right skewed (long tail on right), left skewed,
uniform (no obvious peaks or valleys)
Time Plots/Time Series
X-axis: time
Y-axis: variable of interest

Look for trends in overall pattern (upward, downward, plateau,
etc.) and cyclical variations (small variations that occur at regular
intervals)

Lesson 2
The Quartiles
Q1: The 1st Quartile
o Has 25% of the observations in the ordered data set below
it and 75% above it
Q2: The 2nd Quartile
o Median (M)
Q3: The 3rd Quartile
o 75% below, 25% above
Computing Q1 and Q3
o Order the data and find the median (doesn't have to be an
actual data point)
o If the number of data points (n) is odd, leave the overall
median out of the computation; if n is even, use all
observations
o Q1 = median of lower half of data and Q3 = median of
upper half of data
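The quartile rule above can be sketched in Python as follows. The function names and the two small data sets are made up for illustration, and the sketch assumes at least two data points. Other software may use a slightly different quartile rule and give slightly different values.

```python
def median(sorted_values):
    """Median of an already sorted list (average of the two middle values if n is even)."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(data):
    """Return (Q1, M, Q3): Q1 and Q3 are the medians of the lower and upper halves."""
    values = sorted(data)
    n = len(values)
    half = n // 2
    lower = values[:half]      # when n is odd this leaves the overall median out
    upper = values[n - half:]  # same number of points as the lower half
    return median(lower), median(values), median(upper)

odd_n = quartiles([1, 3, 5, 7, 9, 11, 13])   # n = 7, median 7 is left out of both halves
even_n = quartiles([2, 4, 6, 8, 10, 12])     # n = 6, all observations are used
print(odd_n, even_n)
```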
5 Number Summary and Box Plot
Box plot: symmetry of box and whiskers
o Width of brackets gives info about spread (measure of
spread = quartiles)
o Central box spans quartiles
o Center line of box = median
o 50% data above median and 50% below
o Whisker on left longer = left skewed
o Range: entire length of box plot from min to max (max - min)
o IQR: Q3 - Q1
Standard Deviation
Affected by strong skewness and outliers
Has same unit of measurement of original observations
Is either 0 or positive
o S = 0 means the data don't vary: all the numbers in the data
set are equal
* 1-Var Stats L1 gives the mean (x̄), standard deviation (Sx), and
basically the 5 number summary
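A minimal sketch of the sample standard deviation, assuming the n - 1 version that the calculator reports as Sx. The height data and the name sample_sd are made up for illustration.

```python
import math

def sample_sd(data):
    """Sample standard deviation: square root of the sum of squared deviations over n - 1."""
    n = len(data)
    x_bar = sum(data) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))

heights_cm = [160, 165, 170, 175, 180]  # hypothetical heights, so the result is also in cm
all_equal = [5, 5, 5, 5]                # no variation at all

print(round(sample_sd(heights_cm), 2))
print(sample_sd(all_equal))  # 0.0 because every value equals the mean
```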
Steps of Statistical Problem-Solving

1. State the question and determine what needs to be measured on
each unit
2. Formulate what statistical procedure should be used to analyze
the data and answer the research question
a. Decide how to collect appropriate data, summarize and
graph data, analyze data for inference
3. Collect and summarize the data
a. Collect data: use scientific surveys or designed
experiments
i. Federal/state/local surveys, institutional, private
research firm surveys
ii. Lab studies/experiments, clinical trials, university
sponsored experiments, hospital studies
b. Data summarization: obtain graphs and numerical
summaries appropriate for the data that will aid in
answering the research question; interpret these graphs
in context
c. Data analysis for inference: compute values necessary for
statistical inference
4. Conclude
a. Draw conclusions based on results in step 3
NOT affected by outliers/skewness (robust):
IQR, because it measures the range between Q1 and Q3
Affected by outliers/skewness:
Mean
Standard deviation
Variance
Range
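The difference between robust and non-robust measures can be checked on a small example. The data below are made up, and the median is included only for comparison (the list above names only the IQR as robust). The quartiles follow the "median of each half" rule from these notes.

```python
from statistics import mean, median, stdev

def iqr(data):
    """IQR = Q3 - Q1, with quartiles as medians of the lower and upper halves."""
    values = sorted(data)
    half = len(values) // 2
    return median(values[len(values) - half:]) - median(values[:half])

def summarize(data):
    return {
        "mean": mean(data),
        "sd": stdev(data),
        "range": max(data) - min(data),
        "median": median(data),
        "IQR": iqr(data),
    }

clean = [10, 12, 13, 15, 16, 18, 20]          # hypothetical data
with_outlier = [10, 12, 13, 15, 16, 18, 200]  # same data with one extreme value

before, after = summarize(clean), summarize(with_outlier)
for name in before:
    print(f"{name:<6} {before[name]:8.2f} -> {after[name]:8.2f}")
```

In this example the mean, standard deviation, and range all jump when the outlier is added, while the median and IQR stay the same.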

Lesson 3
Explanatory variable (x) may explain/influence changes in the response
variable
Comes first and predicts response variables
Response variable (y) measures outcome on each individual
Scatterplots
To graph bivariate data, determine scales on x and y axes (don't
have to be same scale) and plot each (x,y) pair
Direction
o Positive: positively sloped; as x increases, y increases
o Negative: downward slope
Form
o Linear (straight line)
o Nonlinear (curvature): quadratic, exponential, etc.
Strength of relationship
o Strong: points concentrated about the form
o Weak: points widely scattered about the form
Correlation Coefficient (r)
A number that gives a measure of the direction and strength of
the linear relationship between two quantitative variables, x and
y
o Direction: + or -
o Strength: |r| = 0, weakest; |r| = 1, strongest
-1 ≤ r ≤ 1
o r=-1: points on straight line with negative slope
o r =0: no linear association/pattern; points appear randomly
o r=1: points on straight line with positive slope
Has no units of measure
Doesn't distinguish between explanatory and response variables
NOT appropriate for curvilinear relationships
Is affected by outliers

Gives idea about how concentrated the data are about the regression
line
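Several of the properties of r listed above can be checked numerically. The sketch below uses the usual formula for r; the study-hours data and the name correlation are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Correlation coefficient r for paired quantitative data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]            # hypothetical study hours
scores = [52, 60, 61, 70, 77]      # hypothetical test scores
minutes = [60 * h for h in hours]  # same x-values in different units

r = correlation(hours, scores)
r_minutes = correlation(minutes, scores)  # changing units should not change r
r_swapped = correlation(scores, hours)    # swapping x and y should not change r
r_curved = correlation([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])  # perfect parabola

print(round(r, 3), round(r_minutes, 3), round(r_swapped, 3), r_curved)
```

The last case has a perfect curved pattern yet r = 0, which is why r is not appropriate for curvilinear relationships.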

Least Squares Regression Line


Models linear relationship between x and y and can be used to
predict a value for the response variable for a specific value of
the explanatory variable
o Its R² represents the percent of variation in y that can be explained
by changes in x
STAT > CALC > LinReg(a+bx) L1, L2
Coefficient of determination (R²)
o Matches exactly with strength of fit; is easier to access
virtually than r
o Tells percentage of variation in y that's explained by least
squares regression line
o A measure of how successfully the regression explains the
response y
Extrapolation: BAD!! Use of regression line for predictions outside
the range of x-values used to obtain the line
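A minimal sketch of the least squares line y-hat = a + bx, its R², and a guard against extrapolation. The data and the function names are made up for illustration; the formulas are the standard least squares ones, not taken from the lesson.

```python
def least_squares(xs, ys):
    """Return (a, b, r_squared) for the line y-hat = a + b*x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # intercept
    r_squared = sxy ** 2 / (sxx * syy)  # fraction of variation in y explained by the line
    return a, b, r_squared

def predict(a, b, x, x_min, x_max):
    """Predict y at x, refusing to extrapolate outside the observed x-range."""
    if not x_min <= x <= x_max:
        raise ValueError("x is outside the range of the data (extrapolation)")
    return a + b * x

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical responses

a, b, r_squared = least_squares(xs, ys)
print(round(a, 2), round(b, 2), round(r_squared, 2))
print(round(predict(a, b, 3.5, min(xs), max(xs)), 2))  # inside the x-range, so allowed
```

Raising an error for x-values outside the data is one possible design choice; a warning would serve the same purpose.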
Lesson 4
Experiment
Subjects are assigned to groups and then treatments are imposed
Establishes causation
Explanatory variable or factor = set of treatments imposed on
subjects that may affect outcome of study
Purpose: to determine whether the treatments affect response
Three Principles of Experimentation
1. Randomization
2. Replication
3. Control
Randomized Block Design (RBD)
An experimental design where the random assignment of
individuals to treatment is carried out separately within each
block
Block: group of individuals that are similar with respect to some
characteristic known before the experiment begins and that
characteristic is expected to affect the response to the
treatments
o Blocks are another form of control!! They control effects of
the variable that defines the blocks
Individuals in each block vary from individuals in other blocks
Recommended when individuals are similar within a block but
differ from block to block
Removes confounding of lurking variable with response variable
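Random assignment carried out separately within each block might look like the sketch below. The block names, subject labels, and treatments are hypothetical, and the fixed seed is only there to make the example repeatable.

```python
import random

def randomized_block_assignment(blocks, treatments, seed=0):
    """Randomly assign treatments separately within each block.

    blocks: dict mapping block name -> list of individuals.
    Returns a dict mapping individual -> treatment.
    """
    rng = random.Random(seed)
    assignment = {}
    for members in blocks.values():
        shuffled = members[:]   # copy so the original block list is untouched
        rng.shuffle(shuffled)   # random order within this block only
        for i, individual in enumerate(shuffled):
            # Cycle through treatments so each gets an equal share of the block
            assignment[individual] = treatments[i % len(treatments)]
    return assignment

blocks = {
    "men": ["M1", "M2", "M3", "M4"],
    "women": ["W1", "W2", "W3", "W4"],
}
assignment = randomized_block_assignment(blocks, ["drug", "placebo"])
print(assignment)
```

Because the shuffle happens inside each block, every block ends up with the same number of subjects on each treatment.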

Matched Pairs
Special case of RBD
Explanatory variable: two treatments (makes a pair, duh); more
than 2 = normal RBD
o Ex. twins each get treatment, 2 treatments for each
individual, measurements before and after treatment on
each individual
Cautions about Experimentation
Hidden bias: bias that's introduced by not treating all individuals
equally after treatments are applied
Placebo effect: positive response in subjects taking a placebo due
to confidence in the doctor and hope in the medication; that's why a
control group with a placebo is necessary
Observational Studies
Researchers observe individuals and record info about variables
of interest
No treatment is imposed!! And individuals self-select which
treatment to receive
o Observe them in a neutral way without changing the course of
their lives
Influential factors can be controlled
All sample surveys are observational
Often have confounding = fail to have clear causal conclusions
Bad Sampling: non-probability sample chosen using personal
judgement or human subjectivity
Convenience sample = produces unrepresentative sample
Voluntary response sample: individuals choose themselves =
consists of people with really strong opinions who are more likely to
respond
Mall-intercept sample: mall shoppers are interviewed = retired,
middle class, and teens are overrepresented whereas poor are
underrepresented
Quota sample: individuals selected to fill quotas; most
interviewers use own preferences in choosing individuals to
sample
Simple Random Samples (SRS)
A sample of size n chosen from the population in such a way
that every set of n individuals has an equal chance to be part of
the sample actually selected
In other words, every possible sample from the population has
equal chance of selection
Sample taken from entire population
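Drawing an SRS can be sketched with the standard library's random.sample, which picks n distinct individuals so that every set of n is equally likely. The roster and the function name are made up for illustration.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct individuals from the whole population, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

population = [f"student_{i}" for i in range(1, 101)]  # hypothetical roster of 100
sample = simple_random_sample(population, 10, seed=42)
print(sample)
```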
Stratified Sample

Classify individuals of a population into groups (strata) according
to some known characteristic prior to survey
Then choose probability sample within each stratum
Combine the samples from all strata to form complete sample
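The three steps above can be sketched as follows, taking an SRS inside each stratum as the probability sample. The population, the class-year strata, and the function name are hypothetical.

```python
import random

def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Group individuals into strata, sample within each, then combine."""
    rng = random.Random(seed)
    strata = {}
    for individual in population:
        strata.setdefault(stratum_of(individual), []).append(individual)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))  # SRS within the stratum
    return sample

# Hypothetical population of (id, class year) pairs: 60 freshmen and 40 seniors
population = [(i, "freshman" if i <= 60 else "senior") for i in range(1, 101)]
sample = stratified_sample(population, lambda person: person[1], 5, seed=1)
print(sample)
```

Taking the same number from every stratum is just one option; sampling in proportion to stratum size is another.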
Sources of Sample Survey Bias
Undercoverage: occurs because some groups in a population are
left out when the sample is chosen
o Ex. a sample of households excludes homeless, busy
people, etc.
Non-response: when an individual chosen in the sample refuses
to provide answers or can't be contacted
Response: when the respondent gives responses that influence
results in a systematic way
o Respondents often lie when asked about illegal and
unpopular behavior or a topic they don't know about
o Ex. cheating on tests, most will say they don't
Interviewer: when an interviewer influences the response in a
systematic way
o Because of his/her social position, poor training, gender,
etc.
Question wording: when questions have leading phrases, loaded
words, or ambiguities that influence response
Confounding: two variables are confounded when their effects on a
third variable can't be distinguished from each other

Lesson 5
Probability
Describes what happens over many trials and how likely an event is
to occur; describes what happens in the long run
P = 0: event is impossible and will never occur
P = 1: event is certain and will occur on every trial
Discrete probability models
Sample space made up of a list of individual, discrete outcomes
(whole numbers)
Continuous Probability Models
Probabilities = area under a density curve
o Area = 1 corresponding to total probability = 1
o This total area under the curve represents the whole population
(sample space)
o Probability of a single value = 0 because we're looking at
intervals/ranges of values
Density curve: the smooth curve drawn over the bars of a histogram
o Doesn't describe outliers
o Similar to regression lines used to make predictions and
observe linear trend
The area under the curve and above any ranges of values on the
horizontal axis is the probability of an outcome in that range
o Have intervals of outcomes rather than individual
outcomes
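Probability as area under a density curve can be illustrated numerically. The density f(x) = 2x on [0, 1] is a made-up example, and the midpoint-rule helper area_under is only a rough sketch of how the area could be computed.

```python
def density(x):
    """Hypothetical density: f(x) = 2x on [0, 1], 0 elsewhere."""
    return 2 * x if 0 <= x <= 1 else 0.0

def area_under(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

total = area_under(density, 0, 1)         # whole sample space
upper_half = area_under(density, 0.5, 1)  # P(0.5 < X < 1)
single = area_under(density, 0.3, 0.3)    # a single value has no width, so no area

print(round(total, 4), round(upper_half, 4), single)
```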
Conditional Probability

Disjoint: events A and B have no outcomes in common or can never
occur together
Ex. male and pregnant are disjoint whereas male and Caucasian
are not
P(A or B) = P(A ∪ B) = P(A) + P(B)
Independent Events
Knowing that one event is true or has happened does NOT
change the probability of the other event
Ex. male and pregnant are not independent because knowing that
the individual is male tells you the probability that he's pregnant
is 0
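Disjoint and independent events can be told apart by enumerating a small sample space. The sketch below assumes one roll of a fair die; the event names and helper functions are made up for illustration, and cond_prob uses the standard definition P(B | A) = P(A and B) / P(A).

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}  # one roll of a fair die (assumed example)

def prob(event):
    """Probability of an event (a set of outcomes) with equally likely outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

def cond_prob(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(event & given) / prob(given)

low = {1, 2}
high = {5, 6}
even = {2, 4, 6}

# Disjoint: no outcomes in common, so the probabilities add
print(prob(low | high), prob(low) + prob(high))

# Independent: knowing the roll is even does not change P(low)
print(cond_prob(low, even), prob(low))

# Disjoint events are not independent: knowing "low" happened makes "high" impossible
print(cond_prob(high, low), prob(high))
```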
Screening Tests
Diagnosis specificity = P(negative | no disease)
Diagnosis sensitivity = P (positive | disease)
Positive predictive value (PPV) = P (disease | positive) = # of
true positives / (# true positives + # false positives)
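The three screening measures can be computed from the four cells of a test-result table. The counts below are invented for illustration (1000 people, 100 with the disease), and the function name is hypothetical.

```python
def screening_summary(true_pos, false_neg, false_pos, true_neg):
    """Return (sensitivity, specificity, PPV) from the four counts."""
    sensitivity = true_pos / (true_pos + false_neg)  # P(positive | disease)
    specificity = true_neg / (true_neg + false_pos)  # P(negative | no disease)
    ppv = true_pos / (true_pos + false_pos)          # P(disease | positive)
    return sensitivity, specificity, ppv

# Hypothetical counts: 100 with the disease, 900 without
sensitivity, specificity, ppv = screening_summary(
    true_pos=90, false_neg=10, false_pos=45, true_neg=855
)
print(round(sensitivity, 3), round(specificity, 3), round(ppv, 3))
```

With these made-up counts the test is right for 90% of the sick and 95% of the healthy, yet only about two thirds of the positive results are true positives.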

Lesson 6
Normal Distributions
