HW1 Updated Sep9 2014

CSci 5523
Homework 1
Due: Sep 17, 2014
Note: Only a subset of questions will be graded. However, you are required to submit solutions to
all eight questions (1 - 8). Practice question solutions should not be turned in for grading. You are
encouraged to try them out on your own.
For questions specific to this homework please contact Xi Chen and Guruprasad Nayak.
Q1.
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative
(nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one
interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
1. Intensity of a pixel in an 8-bit Grayscale image (http://en.wikipedia.org/wiki/Grayscale)
2. Frequency of light incident on Earth's surface (http://en.wikipedia.org/wiki/Frequency#Light)
3. Brightness as measured by people's judgments
4. HTML color codes (for example, Yellow is encoded as FFFF00. http://html-colorcodes.info/color-names/)
5. Year, as measured on the Gregorian calendar
Q2.
Decide which of the similarity measures described in Chapter 2 would be most appropriate for the
following situations and why.
1. Suppose two of your friends are numismatists (collecting coins from different countries as a
hobby). You also have coins from various countries. You want to decide which friend has the most
similar collection to you. Hint: You can represent each collection as a vector of length 196 of the
official independent countries of the world, where the corresponding entry denotes the number of
coins collected for that country. State any assumption you make about the coin collections.
2. Suppose you measure the temperatures every day at a particular location (e.g., the airport) in 10
widely distributed major cities across the country. Similarity is to be computed between the
temperatures of these cities today and the same day of last month. (You have a vector of all 10 cities'
temperatures for today and a vector of all 10 cities' temperatures for the same day of last month.)
3. A nutritionist wants to measure the dissimilarity of you and your friend based on the following
attributes: your height (in meters), weight (in pounds), and your dietary requirement (in calories).
Note that the feature set includes only continuous features.
Q3.
Mapping fire activity is an important problem for supporting climate and carbon cycle studies as
well as forest management. Automated approaches to address this problem use reflectance data
collected from satellites orbiting the Earth to learn a classifier that can differentiate burned areas on
the ground from the unburned ones. Consider the output of two such classification algorithms.
The input to the algorithms is the reflectance in 7 wavelength ranges for each 1 km by 1 km region in
a 1200 km by 1200 km region (tile), i.e., the data is a matrix with 1200*1200 = 1,440,000 rows and
7 columns, having the reflectance values for each of the 1,440,000 pixels in each of the 7
wavelength regions.
The algorithms output a probability of fire activity for each pixel on the day the satellite imagery
was taken. This probability is also part of the data.
1. We are interested in knowing if the two algorithms agree on the presence or absence of fire at a
given location (pixel), i.e., a True/False answer at each location. How would you pre-process the given
data to answer this?
2. The consensus of the two algorithms on the fire activity in the given region (the whole image)
can be quantified by counting the number of such agreements and disagreements with respect to the
predictions of burned/unburned. Considering the output of each algorithm on the whole region as a
vector of length 1,440,000 (one value for each pixel) and doing the preprocessing in step 1, suggest
a suitable similarity measure to quantify the consensus between the two outputs. Note that
consensus between the outputs at a location implies that they both agree that it is a positive (burned
location) or they both call it a negative (unburned location). (For this question, assume that the
number of fire and non-fire pixels is roughly equal)
3. Typically, the number of fire pixels is much smaller than the number of non-fire pixels
(<0.1%). Does this have any effect on the similarity measure suggested in part 2? Explain.
4. Given that the data in hand is spatial in nature, how would you address the issue of missing
reflectance data for a certain pixel?
Q4
For each of the following vectors, x and y, calculate the values of the indicated similarity or
distance measures:
1. x = (1 1 0 0 0), y = (0 0 0 1 1) Jaccard, Cosine, Euclidean, Correlation, mutual information
2. x = (0 1 2 4 5 3), y = (5 6 7 9 10 8) Cosine, Euclidean, Correlation
3. x = (0 1 2 4 5 3), y = (5 5 5 5 5 5) Cosine, Euclidean, Correlation
4. x = (0 1 2 4 5 3), y = (5 5 5 5 5 5.0001) Cosine, Euclidean, Correlation
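Note: the measures named in Q4 can be sketched in a few lines of Python. This is only an illustrative sketch using the standard definitions; the function names and the example vectors u and v are made up for illustration and are not the vectors from the question. Mutual information is left out.

```python
import math

def cosine(x, y):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")  # undefined for an all-zero vector
    return dot / (norm_x * norm_y)

def euclidean(x, y):
    # Square root of the summed squared differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    # Pearson correlation: covariance over the product of the spreads.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return float("nan")  # undefined when either vector has zero variance
    return cov / (spread_x * spread_y)

def jaccard(x, y):
    # Binary vectors only: 1-1 matches over positions where at least one is 1.
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else float("nan")

# Hypothetical binary vectors, chosen only to exercise the functions.
u = (1, 0, 1, 1, 0)
v = (1, 1, 0, 1, 0)
print(jaccard(u, v), cosine(u, v), euclidean(u, v), correlation(u, v))
```

Returning NaN for the degenerate cases is a design choice made here for brevity; raising an error would be equally reasonable.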

Q5
You have two similarity measures, the Dice coefficient and the Rogers and Tanimoto coefficient, which are
defined below.
Dice coefficient
S_DICE = 2a/(2a+b+c)
Rogers and Tanimoto coefficient
S_RT = (a+d)/(a+2b+2c+d)
where a is the number of features that are equal to 1 in both objects; b is the number of features that
are equal to 1 in the first object but are equal to 0 in the second object; c is the number of features
that are equal to 0 in the first object but are equal to 1 in the second object; and d is the number of
features that are equal to 0 in both objects.
Answer the following questions.
1. Is the Dice coefficient appropriate for asymmetric variables? Why?
2. Is the Rogers and Tanimoto coefficient appropriate for asymmetric variables? Why?
3. Here, the task is to compute the similarity between two documents. Each document is
represented as a vector of binary features, where each feature is a word in a dictionary. In this
vector, a 1 indicates the presence of the word and a 0 indicates its absence. In this scenario, which
measure (Dice coefficient or Rogers and Tanimoto coefficient) would you choose? Why?
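Note: the counts a, b, c, d and the two coefficients defined in Q5 can be computed as in the sketch below. The function names and the example vectors are hypothetical, chosen only to show the arithmetic.

```python
def match_counts(x, y):
    # a: 1 in both, b: 1 only in x, c: 1 only in y, d: 0 in both.
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    b = sum(1 for p, q in zip(x, y) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(x, y) if p == 0 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, b, c, d

def dice_coefficient(x, y):
    # S_DICE = 2a / (2a + b + c); undefined if neither vector contains a 1.
    a, b, c, _ = match_counts(x, y)
    return 2 * a / (2 * a + b + c)

def rt_coefficient(x, y):
    # S_RT = (a + d) / (a + 2b + 2c + d)
    a, b, c, d = match_counts(x, y)
    return (a + d) / (a + 2 * b + 2 * c + d)

# Hypothetical 8-feature binary objects.
x = [1, 1, 0, 0, 0, 0, 0, 0]
y = [1, 0, 1, 0, 0, 0, 0, 0]
print(match_counts(x, y))      # (1, 1, 1, 5)
print(dice_coefficient(x, y))  # 0.5
print(rt_coefficient(x, y))    # 0.6
```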
Q6
Both the L1 distance and correlation are widely used to compare two time series. The most
appropriate measure depends on the specific problem or domain requirements. Decide the proper
measure for each of the following scenarios in climate data analysis:
1. We want to compare the temperatures of two cities for 100 days with respect to their levels.
2. We want to compare the temperatures of two cities for 100 days with respect to the trends (up
and down) that occur in temperature during the 100 days.
Q7.
Data reduction (sampling, dimensionality reduction, or selecting a subset of features) is necessary
or useful for a wide variety of reasons, but can be problematic if information necessary to the
analysis is lost in the process. The following questions explore several issues at a conceptual level.
1. Assume the property of interest is the rate at which a particular event occurs, i.e.,
rate = number of times a particular type of event occurs / total number of all events.
a) If the event occurs at a rate of 0.001, i.e., 0.1% of the time, then what problems, if any, would
you encounter in trying to estimate the rate from a single sample of size 100?
b) If the event occurs at a rate of 0.50, i.e., 50% of the time, then what problems, if any, would you
encounter in trying to estimate the rate from a single sample of size 100?
2. You are given a data set of 10,000,000 time series, each of which records the temperature of the
Earth at a particular location on the surface of the Earth daily for 10 years. The locations are
arranged in a regular grid that covers the surface of the Earth. (Details of the exact nature of the
grid are unimportant. The important fact is that each point has neighbors to the left and right, up
and down.) Note that temperature displays considerable autocorrelation, i.e., the temperature at a
given location and time is similar to that of nearby locations and times. The size of the data needs
to be reduced so that you can apply your favorite data analysis algorithm. Both aggregation and
sampling could be used to reduce the amount of data.
a). If you use aggregation, would you aggregate over location or time or both?
b). How would you use the spatial and temporal autocorrelation of temperature to guide you in
aggregating the data?
c). If you use sampling, would you sample over location or time or both?
d). Would you prefer aggregation or sampling or both? (You can argue any of these as long as you
support your answer.)
Q8.
1. You are given a set of m objects that is divided into K groups, where the i-th group is of size m_i. If
the goal is to obtain a sample of size n < m, what is the difference between the following two
sampling schemes? (Assume sampling with replacement.)
(a) We randomly select n * m_i / m elements from each group.
(b) We randomly select n elements from the data set, without regard for the group to which an
object belongs.
2. Consider the problem of learning a classifier for forest fire mapping as described in question 3.
The classification algorithm needs to be trained to recognize the signatures of burned pixels (and
how they are different from non-burned ones). Typically, there is a huge imbalance in the number
of fire (<0.1% of the total area) and non-fire pixels in any given region. Keeping this in mind,
answer the following questions:
(a) What would happen if one were to randomly sample some locations (pixels) for training the
classifier ignoring the presence/absence of fire at the location during sampling?
(b) How would stratified sampling help here?
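Note: the two sampling schemes in Q8 part 1 can be written down as in the sketch below, which may help when experimenting with them. The group labels, group sizes, and function names are hypothetical; rounding n * m_i / m to an integer is an assumption made here, and in general the rounded group sizes need not add up to exactly n.

```python
import random

def proportional_sample(groups, n, rng):
    # Scheme (a): draw about n * m_i / m elements from each group,
    # with replacement.
    m = sum(len(members) for members in groups.values())
    sample = {}
    for label, members in groups.items():
        k = round(n * len(members) / m)
        sample[label] = rng.choices(members, k=k)
    return sample

def simple_random_sample(groups, n, rng):
    # Scheme (b): pool all objects and draw n of them with replacement,
    # ignoring group membership.
    pool = [(label, item) for label, members in groups.items() for item in members]
    return rng.choices(pool, k=n)

# Hypothetical data set: three groups of sizes 600, 300, and 100.
groups = {
    "A": list(range(600)),
    "B": list(range(300)),
    "C": list(range(100)),
}

by_group = proportional_sample(groups, 50, random.Random(0))
print({label: len(items) for label, items in by_group.items()})

pooled = simple_random_sample(groups, 50, random.Random(0))
print({label: sum(1 for g, _ in pooled if g == label) for label in groups})
```

Running the second scheme with several different seeds shows how the per-group counts vary from draw to draw.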
Practice questions:
Q1
Consider a data set where the objects are images from a weather satellite and each image consists
of one million pixels. (Assume that each pixel consists of a real value representing the brightness.
Also, assume that the images are snapshots of different areas and do not represent images of the
same area at successive intervals in time.) The data can be represented as record data, where each
image is a record (object) and each pixel is an attribute.
1. Is there any spatial autocorrelation? Explain.
2. An image often has missing values for scattered pixels. (A pixel is missing, but those around it
are not.) Which of the three techniques for handling missing values (p. 40-41) would be the most
appropriate for this situation and why?
3. Some images are missing large blocks of pixels. Which of the three techniques for handling
missing values (p. 40-41) would be the most appropriate for this situation and why?
4. Would any of the following proximity measures (correlation, cosine, or Euclidean) be suitable for
computing similarity/distance among images? Explain.
Q2
Consider a document-term matrix, where tf_ij is the frequency of the i-th word (term) in the j-th
document and m is the number of documents. Consider the variable transformation that is defined
by
tf'_ij = tf_ij * log(m / df_i)    (1)
where df_i is the number of documents in which the i-th term appears and is known as the document
frequency of the term. This transformation is known as the inverse document frequency
transformation.
1) What is the effect of this transformation if a term occurs in one document? In every document?
2) What might be the purpose of this transformation?
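Note: the transformation in equation (1) can be applied to a small matrix as in the sketch below. The matrix values and the function name are hypothetical, the natural logarithm is assumed because the question does not state a base, and terms that appear in no document are given weight 0 here to avoid dividing by zero.

```python
import math

def idf_transform(tf):
    # tf[i][j] is the frequency of term i in document j.
    m = len(tf[0])  # number of documents
    transformed = []
    for row in tf:
        df = sum(1 for count in row if count > 0)  # document frequency of the term
        weight = math.log(m / df) if df > 0 else 0.0
        transformed.append([count * weight for count in row])
    return transformed

# Hypothetical matrix: 2 terms (rows) by 4 documents (columns).
tf = [
    [2, 0, 1, 0],  # appears in 2 of 4 documents
    [1, 1, 0, 1],  # appears in 3 of 4 documents
]
for row in idf_transform(tf):
    print([round(value, 3) for value in row])
```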
Q3
Justify your answers.
1. Is the Jaccard coefficient for two binary strings (i.e., strings of 0s and 1s) always greater than or
equal to their cosine similarity?
2. The cosine measure can take values in [-1, 1]. Give an example of a type of data for which the
cosine measure will always be non-negative.
Q4
Consider the following distance measure:
d(x, y) = |max(x) - max(y)|
where x and y are real-valued vectors.
1. Let x = [3, 4, 2, 5, 1] and y = [8, 6, 9, 3, 0]. Compute their distance using the above
measure.
2. State whether the above measure has the following properties.
i. Positive definiteness
ii. Symmetry
iii. Triangle Inequality
iv. Is d (x, y) a metric?
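Note: the measure in this question is short enough to code directly, which makes it easy to try candidate vectors when checking the listed properties. The sketch assumes the operator between max(x) and max(y) is a subtraction; the function name and the example vectors are hypothetical and are not the ones from part 1.

```python
def max_distance(x, y):
    # Absolute difference between the largest entries of the two vectors
    # (assumes the formula is |max(x) - max(y)|).
    return abs(max(x) - max(y))

# Hypothetical vectors, for illustration only.
p = [1, 7, 2]
q = [4, 0, 3]
print(max_distance(p, q))  # 3
```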
Q5
The population for a clinical study has 500 Asian, 1000 Hispanic and 500 Native American people.
What is a good way of sampling this population to ensure that the distribution of various
subpopulations is maintained if only 100 samples have to be chosen? Give the distribution of the
various sub-populations in the final sample.
Q6
Suppose you want to analyze the blood-pressure data collected for 100 Intensive Care Unit patients
measured every hour over a period of one month. However, many of the patients have missing
values at some time points. Among the three techniques for handling missing values discussed
in the book (Page 41): (i) eliminating data objects, (ii) estimating missing values, and (iii)
designing a robust algorithm, which one would you prefer and why? Explain briefly.
Q7
Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative
(nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation,
so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years.
Answer: Discrete, quantitative, ratio
a. Age as measured by whether the age in years is at most 30 (value = 0) or greater than 30 (value = 1).
b. Speed of a vehicle measured in mph.
c. Intensity of rain as indicated using the values: no rain, intermittent rain, incessant rain.
d. Different flavors of ice cream.
Q8
A group of biologists conducts a field study to evaluate the relative occurrence of different types of birds at
a number of locations. Their data is collected in a table where
the rows correspond to locations,
the columns correspond to different species of birds, and
the (i,j)th entry is the number of birds of the jth species at the ith location.

The following table indicates the representation, but is intended only for illustration.

Location/Species | Blue Jay | Crow | Robin | Sparrow
Alabama: 30, 50
Arkansas: 15, 50
Wisconsin: 20, 10

Answer the following questions based on the above information.
a. To which type of data described in Chapter 2 is this data most similar?
b. Describe the attribute type.
c. What proximity measure would you use if you wanted to find areas that were similar in terms of
the percentage of birds of each species? Explain.
d. Suppose that you only care about the presence or absence of a bird species at a location. How
would you transform the data and to which type of data set in Chapter 2 would this correspond?
e. What types of pairs of locations (represented as pairs of rows in the data matrix) would yield the
same similarity score even after performing the transformation proposed in (d)? Also provide the
similarity measure used.

Q9
Label each of the following similarity measures as good or bad for finding similarity in document-term data.
Provide a one-line justification for each answer you provide.
a. Correlation
b. Cosine
c. Euclidean
Q10
For each of the following questions, give a True / False answer and a one sentence justification.
a. There are cases in which Euclidean distance may not be symmetric, i.e., d(x,y) ≠ d(y,x).
b. Quantitative variables are always continuous.
c. If the correlation of two attributes is -0.95, then they are completely unrelated.
d. Let a1, a2, and a3 be three attributes. If the correlation of (a1, a2) = 0.5 and the correlation of (a2, a3) = 0.5,
then the correlation of (a1, a3) = 0.5.
e. The Hamming distance between two binary strings (i.e., strings of 0s and 1s) will never exceed the
Euclidean distance.

Q11
There is a group of n female students, and another group of n male students. Two n-dimensional vectors A
and B record the heights of the two groups of students respectively. Consider the variable transformation
that is defined by
A = (A - mean(A)) / std(A)    (1)
B = (B - mean(B)) / std(B)    (2)
where std stands for the standard deviation of a vector.

a. What will be the effect of this transformation?
b. How will the correlation between A and B change after this transformation?
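Note: the transformation in equations (1) and (2) can be tried on a small vector as in the sketch below. The heights and the function name are hypothetical, and the population standard deviation is assumed because the question does not say which form of std is meant.

```python
import statistics

def standardize(values):
    # Subtract the mean and divide by the standard deviation
    # (population form, an assumption made for this sketch).
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("standard deviation is zero; transformation undefined")
    return [(v - mean) / std for v in values]

# Hypothetical heights in centimeters.
heights = [160.0, 165.0, 170.0, 175.0]
z = standardize(heights)
print([round(value, 4) for value in z])
print(round(statistics.fmean(z), 4), round(statistics.pstdev(z), 4))
```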
