
SI-2102 Statistical Analysis and Probability
Course Objectives

Students master the fundamentals of statistics and probability, can make decisions under uncertainty, and understand probability-based planning, particularly as it relates to civil engineering.
Introduction
Statistics: the theory and methodology for analyzing quantitative data from sample observations in light of hypothesized relationships.
A tool for planning and assessment.
Statistics helps an analyst who has a pile of data to produce an orderly arrangement and a simplification of what is complex and irregular.
Introduction

Most phenomena or processes in engineering involve uncertainty, in that the actual outcome cannot be predicted.

Such phenomena show up when observing an experiment: results differ from one trial to the next under the same conditions. There is always a range of measured or observed values, and values in certain ranges may occur more often than others.

The effect of uncertainty on engineering design and planning is important, and evaluating its effect on the performance and design of an engineering system should incorporate the concepts and methods of probability.

Graphical displays help in understanding the uncertainty and variation in data. These displays include line diagrams/bar charts, dot diagrams, histograms, frequency polygons, duration curves, and others.
Example: Annual Rainfall Intensity Data
Esopus Creek Watershed (1918-1946)

The characteristics of the data in the table can be viewed graphically as a histogram or frequency diagram.

For comparison with a theoretical probability density function (PDF), a frequency diagram is required.
Frequency Diagram
A histogram, or frequency diagram, gives a graphical picture of the relative frequencies of the various observations or measurements.

For general engineering purposes, a summary of a set of observations is more useful than a detailed histogram. Such a summary includes the mean value and a measure of dispersion.

These quantities can be evaluated from a given histogram; in statistics they are usually expressed as the sample mean and the sample standard deviation.

If the recorded data for a variable show scatter, as illustrated earlier, the value of the variable cannot be predicted with certainty. Such a variable is called a random variable, and its value (or range of values) can be predicted only with an associated probability.
When two (or more) random variables are involved, the characteristics of one variable may depend on the value of the other.

Example: Relationship between mean annual discharge and drainage area in the Honolulu area: Todd and Meyer (1971)
Statistics
There are two types of statistics:
Descriptive statistics: tabulating, simplifying, and describing data, or summarizing complex data with a single value.
Inferential statistics: estimating the characteristics of a population from knowledge of the characteristics of a sample drawn from that population.
Statistical Estimation
Every member of the population has an equal chance of being selected for the sample.
[Diagram labels: Population, Parameters, Random Sample, Statistics, Estimation]
Descriptive Statistics, Scales of Measurement (1)

Nominal
No numerical or quantitative properties; classifies into groups or categories.
Gender: male or female
Field: Structures or Water Resources
Ordinal
Used to rank the levels of the variable being analyzed. No specific values are attached to the ranks of the scale.
Hotel rating: 4 stars, 3 stars, 2 stars, and 1 star
Descriptive Statistics, Scales of Measurement (2)
Interval
Differences between values on the scale are meaningful, and the intervals are of equal size. There is no true zero point.
Measured values can be compared by their differences.
Temperature: the difference between 20 and 30 degrees is the same as the difference between 30 and 40 degrees. We cannot say that 40 degrees is twice as hot as 20 degrees, only that it is 20 degrees hotter.
Ratio
A scale with a zero point that indicates the absence of the variable.
Values can be expressed as ratios.
Weight: 100 kg is half of 200 kg
Descriptive Statistics, Frequency Distribution
In a table, a frequency distribution is formed by summarizing the data as the number of observations in each category, score, or score interval.

In a graph, a frequency distribution is formed by summarizing the data as a histogram or a frequency polygon.
Frequency distribution, histogram, and frequency polygon
[Figure: histogram with frequency polygon; horizontal axis "Age in years" (22.5 to 60.0), vertical axis "Frequency" (0 to 50)]
Frequency Table
Constructing a frequency table
Group the data into classes
Count the number of items/samples in each class
Arrange it so the data are easy to understand
Considerations
Classes do not overlap
The number of classes is generally between 5 and 18
Where possible, classes have equal widths, although unequal widths are sometimes needed
Each observation falls in exactly one class
Number of classes
Small enough to give a summary
Large enough to show the relevant characteristics
The class with the lowest boundary must include the smallest data value
The class with the highest boundary must include the largest data value
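
The construction steps above can be sketched in Python as follows. This is a minimal illustration, assuming equal-width classes that are closed on the left and open on the right, with the last class also including its upper boundary. The function name `frequency_table` and the age values are hypothetical and are not taken from the slides.

```python
def frequency_table(data, lower, width, n_classes):
    """Count observations in equal-width, non-overlapping classes.

    Each class is [lo, hi); the last class also includes its upper
    boundary so that the largest data value is covered.
    """
    upper = lower + width * n_classes
    counts = [0] * n_classes
    for x in data:
        if not lower <= x <= upper:
            raise ValueError(f"{x} lies outside the class boundaries")
        # Index of the class; a value equal to the top boundary goes in the last class.
        k = min(int((x - lower) // width), n_classes - 1)
        counts[k] += 1
    return [(lower + i * width, lower + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical ages in years (illustrative only).
ages = [23, 27, 31, 34, 35, 38, 41, 42, 44, 47, 52, 58]
table = frequency_table(ages, lower=20, width=8, n_classes=5)
for lo, hi, count in table:
    print(f"{lo:>3} to {hi:<3} {count}")
```

Every observation lands in exactly one class, so the counts add up to the number of observations.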
Histogram
Visually displays the information from a frequency table.
Plot the group boundaries on the horizontal axis, using a constant linear scale.
Plot frequency (or relative frequency) on the vertical axis.
Draw a vertical bar for each group, where the area of the bar is proportional to the (relative) frequency.
For equal class widths, the height is proportional to the frequency.
Note that there are no gaps between the bars for continuous data.
To convert to a relative frequency histogram, or a percentage histogram, just change the vertical scale.
Another Example
Descriptive Statistics
[Figure panels: normal curve; bimodal curve; positively skewed; negatively skewed]
Ogive (Cumulative Frequency Polygon)
Visually displays the information from a frequency table.
On the horizontal axis, draw the group boundaries on a linear scale.
On the vertical axis, plot the cumulative percentages or proportions.
For the first point, plot (lower boundary of lowest class, 0).
Then, for each class, plot (upper class boundary, cumulative frequency) on the x and y axes respectively.
Join the points.
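
The plotting recipe above amounts to computing a list of (x, y) points. The sketch below computes those points with cumulative frequency expressed as a percentage; the function name `ogive_points` and the class boundaries and frequencies are hypothetical.

```python
from itertools import accumulate


def ogive_points(boundaries, frequencies):
    """Return (boundary, cumulative percentage) points for an ogive.

    boundaries has one more entry than frequencies: the lower boundary
    of the lowest class followed by every upper class boundary.
    """
    if len(boundaries) != len(frequencies) + 1:
        raise ValueError("need one more boundary than frequencies")
    total = sum(frequencies)
    # Start at 0 for the lower boundary of the lowest class.
    cumulative = [0] + list(accumulate(frequencies))
    return [(b, 100 * c / total) for b, c in zip(boundaries, cumulative)]


# Hypothetical classes 20-28, 28-36, ..., 52-60 with their frequencies.
points = ogive_points([20, 28, 36, 44, 52, 60], [2, 3, 3, 2, 2])
for x, pct in points:
    print(f"{x:>3}  {pct:6.1f}%")
```

Joining these points with straight lines gives the ogive; the curve starts at 0% and ends at 100%.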
Histogram and Frequency Polygon
A histogram can also be presented as a frequency polygon.
Comparisons with other data sets can be shown by superimposing the polygons.
Central Tendency
Mode
The value with the highest frequency
3 3 3 4 4 4 5 5 5 6 6 6 6: Mode = 6
3 3 3 4 4 4 5 5 6 6 7 7 8: Modes are 3 and 4
Median
The value that splits the ordered values into two groups, with 50% above and 50% below the median
3 3 3 5 8 8 8: Median = 5
3 3 5 6: Median = 4 (the average of the two middle values)
Mean
The most commonly preferred measure, and the measure of central tendency most widely used in advanced statistical analysis.
More accurate and reliable
Suited to arithmetic computation
Computed as the sum of all values divided by the number of values.
2 3 4 6 10: Mean = 5 (25/5)
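
The small data sets on this slide can be checked with Python's standard `statistics` module. This is only a sketch for verifying the slide values; the variable names are illustrative.

```python
import statistics

# Mode: the most frequent value(s); multimode returns every tied mode.
one_mode = statistics.multimode([3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6])
two_modes = statistics.multimode([3, 3, 3, 4, 4, 4, 5, 5, 6, 6, 7, 7, 8])

# Median: the middle value, or the average of the two middle values.
median_odd = statistics.median([3, 3, 3, 5, 8, 8, 8])
median_even = statistics.median([3, 3, 5, 6])

# Mean: sum of all values divided by the number of values.
mean_value = statistics.mean([2, 3, 4, 6, 10])

print(one_mode, two_modes, median_odd, median_even, mean_value)
```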
Review: Measures of Central Tendency
Mean
Advantages:
It is easy to compute.
It combines well, i.e. the mean of a combined sample is the weighted mean of the two sample means, where the weights are the proportions in each sample.
It corresponds to the 'centre of gravity' of the data values.
Disadvantages:
It is affected by outliers or extreme values.
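
The "combines well" property can be illustrated numerically. The sketch below uses two made-up samples and a hypothetical helper `combined_mean`; it shows that weighting each sample mean by its share of the observations reproduces the mean of the pooled data.

```python
import math
import statistics


def combined_mean(mean_a, n_a, mean_b, n_b):
    """Weighted mean of two sample means, weighted by sample proportions."""
    total = n_a + n_b
    return (n_a / total) * mean_a + (n_b / total) * mean_b


# Two hypothetical samples (illustrative values only).
sample_a = [2, 4]
sample_b = [6, 8, 10, 12]

from_means = combined_mean(statistics.fmean(sample_a), len(sample_a),
                           statistics.fmean(sample_b), len(sample_b))
pooled = statistics.fmean(sample_a + sample_b)
print(from_means, pooled)
```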
Median and Mode
Advantages of the median
As it is the central observation, it is not affected by extreme values.
Disadvantages of the median
It doesn't combine well.
It requires ranking the observations.
Advantages of the mode
None
Disadvantages of the mode
It may not exist.
If it exists, it may be far from the "centre" of the data.
Choosing a Measure of Central Tendency
Mean
Affected by outliers
Median
Not affected by a small number of outliers
The median is generally used for skewed data (but not always)
A single measure of central tendency is not always enough
Be careful with averaging; average only when it makes sense in context
The mean and median generally differ unless the data are symmetric
Properties of a frequency distribution: Spread/Variability/Dispersion
Range
Computed by subtracting the lowest value from the highest value
Used only for ordinal, interval, and ratio scales; the data must be ordered
Example: 2 3 4 6 8 11 24 (Range = 22)
Variance
The extent to which individual scores in a distribution of scores differ from one another
Standard Deviation
The square root of the variance
Used to describe the dispersion of a set of observations in a distribution
Review: measure of spread
Range
the largest observation minus the smallest observation
Xmax-Xmin
Advantage of Range
It is easy to calculate
Disadvantage
It is strongly influenced by outliers
Interquartile Range
shows the range of the middle 50% of observations
Defined as the difference between third quartile & first
quartile
Q3 - Q1
Advantage
It is not affected by outliers
Disadvantage
It is harder to calculate as it requires ranking
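
The contrast between the two measures can be seen on the range example used earlier in these slides (2 3 4 6 8 11 24). The sketch below assumes the default quartile convention of `statistics.quantiles` (the "exclusive" method); other quartile conventions can give slightly different values. The second data set, with an exaggerated outlier, is made up for illustration.

```python
import statistics


def data_range(values):
    """Largest observation minus smallest observation."""
    return max(values) - min(values)


def interquartile_range(values):
    """Q3 - Q1 using the default quartile method of statistics.quantiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


data = [2, 3, 4, 6, 8, 11, 24]
with_outlier = [2, 3, 4, 6, 8, 11, 240]  # hypothetical extreme value

print(data_range(data), interquartile_range(data))
print(data_range(with_outlier), interquartile_range(with_outlier))
```

The range changes dramatically when the largest value is replaced by an extreme one, while the interquartile range stays the same.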
Review: measure of spread (2)
Variance, s^2
This is the "average" of the squared deviations from the mean:
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
Note the use of the divisor (n - 1) for the sample variance. This makes the sample variance an unbiased estimator of the population variance.
Advantages
Good mathematical properties
Disadvantages
It is strongly influenced by outliers.
It is in squared units, so it is hard to get a good sense of its size.
Standard Deviation
This is the square root of the variance.
Advantages
This statistic is in the original units and is thus directly comparable to the mean.
Disadvantages
It is influenced by outliers.
Coefficient of Variation
This is the ratio of the standard deviation to the mean.
It is usually expressed as a percentage.
It measures the spread relative to the average size.
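
As a worked example of these three measures, the sketch below reuses the data set from the mean example (2 3 4 6 10). It uses the standard `statistics` module, whose `variance` and `stdev` functions use the (n - 1) divisor; the variable names are illustrative.

```python
import statistics

data = [2, 3, 4, 6, 10]

mean = statistics.mean(data)

# Sample variance: squared deviations 9 + 4 + 1 + 1 + 25 = 40, divided by n - 1 = 4.
variance = statistics.variance(data)

# Sample standard deviation: square root of the sample variance.
sd = statistics.stdev(data)

# Coefficient of variation: standard deviation relative to the mean, in percent.
cv_percent = 100 * sd / mean

print(f"mean={mean}, variance={variance}, sd={sd:.3f}, CV={cv_percent:.1f}%")
```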
Estimating the mean and variance from grouped data
With grouped data we have lost the information about where in the group each observation lies; all we know is the group in which each observation lies.
Therefore we assume each observation in a group lies at the group midpoint.
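
Under the midpoint assumption, the grouped estimates are the mean and variance of a data set in which each midpoint is repeated as many times as its class frequency. A minimal sketch follows, assuming the (n - 1) divisor for the variance; the function name `grouped_mean_variance` and the class midpoints and frequencies are hypothetical.

```python
import math


def grouped_mean_variance(midpoints, frequencies):
    """Estimate mean and sample variance from class midpoints and frequencies."""
    n = sum(frequencies)
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    # Every observation in a class is treated as sitting at the class midpoint.
    squared = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies))
    return mean, squared / (n - 1)


# Hypothetical classes 0-10, 10-20, 20-30 with frequencies 2, 5, 3.
g_mean, g_var = grouped_mean_variance([5, 15, 25], [2, 5, 3])
print(g_mean, g_var)
```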
About Dispersion
Dispersion is the amount of spread or scatter that occurs in a data set.
If the values in a set are clustered tightly around their mean, the measured dispersion (std. dev.) is small.
If the standard deviation is small, the items are grouped closely around their mean.
If the standard deviation is large, the data values are widely dispersed about their mean.
Rule of Thumb
For data distributions that are bell-shaped (approximately normal):
About 68% of observations lie within one standard deviation of the mean
About 95% of observations lie within two standard deviations of the mean
About 99.7% of observations lie within three standard deviations of the mean
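
These percentages come from the normal distribution and can be reproduced with `statistics.NormalDist` from the Python standard library. The sketch below computes the probability of falling within k standard deviations of the mean for a standard normal variable; the dictionary name `coverage` is illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Probability of lying within k standard deviations of the mean.
coverage = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, p in coverage.items():
    print(f"within {k} SD: {100 * p:.1f}%")
```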
Chebyshev's theorem
The proportion (percentage) of any data set that lies within k standard deviations of the mean (k is any positive number greater than 1) is at least
1 - 1/k^2
e.g. k = 2: at least 75% of the items in a data set lie within 2 standard deviations of the mean, no matter how skewed the data set is.
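
The bound can be checked against a deliberately skewed data set. In the sketch below the function names and the data are hypothetical, and the population standard deviation (`statistics.pstdev`) is used as an assumption so that the bound applies directly to the data set itself.

```python
import statistics


def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed proportion of data within k standard deviations of the mean."""
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    return sum(abs(x - mean) <= k * sd for x in data) / len(data)


# Strongly skewed, made-up data: nine small values and one large one.
skewed = [1] * 9 + [100]

bound = chebyshev_bound(2)
observed = fraction_within(skewed, 2)
print(bound, observed)
```

The observed proportion is at least as large as the bound, as the theorem guarantees, even though the data are far from bell-shaped.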
Z-Scores and T-Scores
Z-Scores
Most widely used standard score in statistics
It is the number of standard deviations above or below the mean.
A Z score of 1.5 means that the score is 1.5 standard deviations above the mean; a Z score of -1.5 means that the score is 1.5 standard deviations below the mean.
Always have the same meaning in all distributions
To find a percentile rank, first convert to a Z score and then look up the percentile rank in a normal-curve table
T-Scores
Most commonly used standard score for reporting performance
May be converted from Z-scores and are always rounded to two figures, thereby eliminating decimals
Always reported as positive numbers
The mean is always 50 and the standard deviation is always 10.
A T-score of 70 is 2 SDs above the mean
A T-score of 20 is 3 SDs below the mean
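
The conversions described above can be written as short functions. This sketch assumes the usual relation T = 50 + 10 Z implied by a mean of 50 and a standard deviation of 10, and it uses `statistics.NormalDist` in place of a printed normal-curve table; the function names and the raw score are hypothetical.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


def t_score(z):
    """Convert a Z score to a T score (mean 50, SD 10), rounded to a whole number."""
    return round(50 + 10 * z)


def percentile_rank(z):
    """Percentage of a normal distribution lying below the given Z score."""
    return 100 * NormalDist().cdf(z)


z = z_score(65, mean=50, sd=10)  # hypothetical raw score of 65
print(z, t_score(z), round(percentile_rank(z), 1))
print(t_score(2), t_score(-3))
```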
Correlation and Linear Regression
Correlation or Covariation

The correlation coefficient is a summary statistic of the degree of association or relationship between two variables.

The correlation can be negative or positive.

Linear Regression
The purpose of a regression equation is to predict new sample observations based on the findings from a previous sample.
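
Both ideas can be sketched with a few lines of plain Python. The example below assumes the Pearson correlation coefficient and an ordinary least-squares straight line, which the slides do not name explicitly; the function names and the data are hypothetical.

```python
import math


def _sums(x, y):
    """Return means and sums of squares/products about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return mx, my, sxy, sxx, syy


def pearson_r(x, y):
    """Pearson correlation coefficient, between -1 and +1."""
    _, _, sxy, sxx, syy = _sums(x, y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    mx, my, sxy, sxx, _ = _sums(x, y)
    slope = sxy / sxx
    return slope, my - slope * mx


# Made-up paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
slope, intercept = fit_line(x, y)
predicted = intercept + slope * 6  # prediction for a new observation at x = 6
print(round(r, 3), slope, intercept, predicted)
```

A positive r indicates that y tends to increase with x; the fitted line is then used to predict y for a new x.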
Types of Statistical Analysis -
Descriptive
Quantify the degree of relationship
between variables
Parametric tests are used to test
hypotheses with stringent
assumptions about observations
e.g., t-test, ANOVA
Nonparametric tests are used with
data in a nominal or ordinal scale
e.g., Chi-Square, Mann-Whitney U,
Wilcoxon
Types of Statistical Analysis -
Inferential
Allow generalization about populations using data
from samples
Non-parametric
Non-parametric tests do not require any assumptions
about normal distribution, but are generally less
sensitive than parametric tests.
The test for nominal data is the Chi-Square test
The tests for ordinal data are the Kolmogorov-Smirnov
test, the Mann-Whitney U test, and the Wilcoxon
Matched-Pairs Signed-Ranks test
Parametric
The tests for interval and ratio data include the t-test, among others
Statistics and Probability
Statistics: Procedures for describing, analyzing,
and interpreting quantitative data
The choice of statistical technique should be
guided by the research design and the type of
data collected
Probability simply represents a judgment about
likelihood of outcomes, i.e., how likely is it that I
could obtain a result like this purely by chance?
Statistical inferences
Significant: it is very unlikely that the effect would occur by chance, e.g. a probability of less than 5%
Not significant: the results are likely to have occurred by chance
Inferential Statistics: Sampling (1)
Sampling relates to the degree to which those surveyed are representative of a specific population.

The sample frame is the set of people who have the chance to respond to the survey.

A question related to external validity is the degree to which the sample frame corresponds to the population to which the researcher wants to apply the results (Fowler, 1988).
Sampling (2)
Two basic types: probability and non-probability.

Probability sampling (PS) can include random sampling, stratified random sampling, and cluster sampling.

Non-probability sampling (NPS) can include quota sampling, snowball sampling, and convenience sampling.
Random Sampling (PS)
Every unit has an equal chance of selection.

Although it is relatively simple, members of specific subgroups may not be included in appropriate proportions.
Stratified Random Sampling (PS)
The population is grouped according to meaningful characteristics or strata.

This method is more likely to reflect the general population, and subgroup analysis is possible.

However, it can be time consuming and costly.
Systematic Sampling (PS)
Every xth unit is selected (e.g., every other person entering the gate was selected).

The method is convenient and close to random sampling if the starting point is randomly chosen.

Recurring patterns can occur and should be examined.
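
The simple random, stratified, and systematic schemes described so far can be sketched with Python's `random` module. This is an illustration under stated assumptions: the population is a list of 100 numbered units, the two strata and the 10% sampling fraction are made up, and the function names are hypothetical.

```python
import random


def simple_random_sample(population, n, rng):
    """Every unit has an equal chance of selection (without replacement)."""
    return rng.sample(population, n)


def systematic_sample(population, step, rng):
    """Select every step-th unit, starting from a randomly chosen point."""
    start = rng.randrange(step)
    return population[start::step]


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction at random from each stratum."""
    return {
        name: rng.sample(members, round(len(members) * fraction))
        for name, members in strata.items()
    }


rng = random.Random(42)  # fixed seed so the sketch is repeatable
population = list(range(100))

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
strata = {"A": population[:60], "B": population[60:]}
stratified = stratified_sample(strata, 0.1, rng)

print(sorted(srs))
print(systematic)
print({name: sorted(units) for name, units in stratified.items()})
```

In the stratified sample each subgroup is represented in proportion to its size, which a simple random sample does not guarantee.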
Cluster/Multistage Sampling (PS)
Natural groups are sampled and then their members are sampled.

This method is convenient and can use existing units.
Quota Sampling (NPS)
The population is divided into subgroups and the sample is selected based on the proportions of the subgroups necessary to represent the population.

This method depends on reliable data about the proportions in the population.
Convenience Sampling (NPS)
This method uses readily available groups or units of individuals.

It is practical and easy to use.

However, it may produce a biased sample.

Convenience sampling can be perfectly acceptable if the purpose of the research is to test a hypothesis that certain variables are related to one another.
Snowball Sampling (NPS)
Previously identified members identify others.

This method is useful when a list of potential names is difficult to obtain.

However, it may produce a biased sample.
Statistics & Parameters
A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average value of a quantity.

A statistic is a quantity that is calculated from a sample of data. It is used to give information about unknown values in the corresponding population. For example, the average of the data in a sample is used to give information about the overall average in the population from which that sample was drawn.
The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.
Interval Estimate & Sampling Distributions
Interval Estimate
A range or band within which the parameter is thought to lie, instead of a single point or value as the estimate of the parameter.

Sampling Distributions

The sampling distribution of the mean is a frequency distribution, not of observations, but of means of samples, each based on n observations.

The standard error of the mean is used as an estimate of the magnitude of sampling error. It is the standard deviation of the sampling distribution of the sample means.
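
In practice the standard error of the mean is usually estimated from a single sample as the sample standard deviation divided by the square root of n. That formula is not stated on the slide and is added here as an assumption; the function name `standard_error` and the data are illustrative.

```python
import math
import statistics


def standard_error(sample):
    """Estimated standard error of the mean: s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


sample = [2, 3, 4, 6, 10]  # illustrative observations
se = standard_error(sample)
print(round(se, 4))
```

Because n appears under a square root, quadrupling the sample size only halves the standard error.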
Inferential Statistics
Confidence Intervals
Confidence levels correspond to the percentage of cases in a normal distribution that lie within 1, 2, or 3 standard deviations of the mean (about 68%, 95%, and 99.7%).

Central Limit Theorem

States that the sampling distribution of sample means (and of many other statistical measures, such as medians and variances) approaches a normal distribution as the sample size, n, increases.
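
A small simulation makes the theorem concrete for sample means. The sketch below draws repeated samples from a uniform distribution on (0, 1), which is not bell-shaped, and summarizes the resulting sample means. The sample size, number of replications, seed, and function name are arbitrary choices made for illustration.

```python
import math
import random
import statistics


def simulate_sample_means(n, replications, rng):
    """Means of repeated samples of size n from a uniform(0, 1) population."""
    return [
        statistics.fmean([rng.random() for _ in range(n)])
        for _ in range(replications)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
means = simulate_sample_means(n=30, replications=2000, rng=rng)

centre = statistics.fmean(means)
spread = statistics.stdev(means)

# Theory for uniform(0, 1): population mean 0.5, population SD sqrt(1/12).
expected_spread = math.sqrt(1 / 12) / math.sqrt(30)
print(round(centre, 3), round(spread, 4), round(expected_spread, 4))
```

The sample means cluster around the population mean, and their spread is close to the population standard deviation divided by the square root of n; a histogram of `means` would look approximately bell-shaped.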
Summary: Descriptive & Inferential Statistics
Descriptive
A. For one variable ("univariate analysis"):
Measures of CENTRAL TENDENCY (averages) and of DISPERSION or variance around that average.
Examples: means, modes, medians, standard deviation, quartiles
Descriptive statistics for the strength of relationship between two variables (bivariate analysis) or among a set of variables (multivariate analysis) are measures of ASSOCIATION or correlation.
Inferential
Are measures of the SIGNIFICANCE of the relationship between two or more variables. Significance refers to the probability that the findings could be attributed to sampling error.
Appropriate statistics depend on the LEVEL OF MEASUREMENT OF THE DEPENDENT VARIABLE (and of the independent variable).
