Segment queries in query logs into one or more query word sequences that maximize overall probability for the query
Determine frequent n-grams (n-word sequences) and count the query word sequences where all adjacent pairs of words in the sequence are frequent n-grams
Filter out non-compound or non-phrasal word sequences by requiring a compound/phrase to appear at both the beginning and the end of some queries (but not necessarily in the same query)
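
The segmentation and boundary-filter steps above can be illustrated with a minimal sketch. It assumes that segment probabilities are available in a lookup table and that segments are independent, so the query probability is the product of its segment probabilities. The names `segment_query`, `passes_boundary_filter`, `phrase_prob`, `max_len` and `unknown_prob` are hypothetical and do not come from the source.

```python
import math

def segment_query(words, phrase_prob, max_len=3, unknown_prob=1e-6):
    """Split `words` into contiguous segments that maximize the product
    of segment probabilities (assumed independent)."""
    n = len(words)
    # best[i] holds (log-probability, segments) for the best split of words[:i]
    best = [(0.0, [])] + [(-math.inf, None)] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            segment = " ".join(words[start:end])
            prob = phrase_prob.get(segment)
            if prob is None:
                # Assumption: unseen multi-word segments are not allowed,
                # while unseen single words receive a small floor probability.
                if end - start > 1:
                    continue
                prob = unknown_prob
            score = best[start][0] + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [segment])
    return best[n][1]

def passes_boundary_filter(phrase, queries):
    """Keep a candidate only if it starts at least one query and ends at
    least one query (not necessarily the same query)."""
    p = phrase.split()
    k = len(p)
    at_start = any(q[:k] == p for q in queries)
    at_end = any(q[-k:] == p for q in queries)
    return at_start and at_end
```

With a toy table in which "new york" is far more probable than the product of "new" and "york", the query "new york hotels" would be segmented as "new york" followed by "hotels".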
Construct a feature vector for each n-gram in a corpus, including a count for each feature f in the feature vector
Determine the value of each feature f in the feature vector as the point-wise mutual information (MI) between the n-gram and the feature f
Determine a similarity value between two n-grams as the cosine of the angle between their feature vectors, using the values of the features in the feature vectors
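
The feature-vector steps above can be sketched as follows. The source does not say what the features are or give the exact MI formula, so this sketch assumes that features are context items with co-occurrence counts and that point-wise mutual information is computed as log(c(w, f) · N / (c(w) · c(f))), where N is the total count. The function names `pmi_vectors` and `cosine` are hypothetical.

```python
import math
from collections import defaultdict

def pmi_vectors(counts):
    """counts[ngram][feature] is a co-occurrence count.
    Returns, per n-gram, a sparse vector of PMI values."""
    ngram_total = {w: sum(fs.values()) for w, fs in counts.items()}
    feature_total = defaultdict(int)
    for fs in counts.values():
        for f, c in fs.items():
            feature_total[f] += c
    total = sum(ngram_total.values())
    return {
        w: {f: math.log(c * total / (ngram_total[w] * feature_total[f]))
            for f, c in fs.items() if c > 0}
        for w, fs in counts.items()
    }

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x * v[f] for f, x in u.items() if f in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: similarity with an all-zero vector is 0
    return dot / (norm_u * norm_v)
```

Two n-grams whose counts are spread over the same features in the same proportions get a similarity of 1, while n-grams that share no features get 0.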
Generate an extraction/contraction table of pairs of compounds where one compound is a substring of another compound, with respective counts (e.g., from query logs) or similarity values
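
One possible way to build such a table is sketched below. It assumes that "substring" means a contiguous word sequence and that each compound comes with a single count, for example from query logs. The names `contains_words`, `build_table` and `compound_counts` are hypothetical.

```python
def contains_words(long_words, short_words):
    """True if short_words occurs as a contiguous run inside long_words."""
    k = len(short_words)
    return any(long_words[i:i + k] == short_words
               for i in range(len(long_words) - k + 1))

def build_table(compound_counts):
    """Return (shorter, longer, shorter_count, longer_count) rows for every
    pair in which the shorter compound is contained in the longer one."""
    table = []
    items = sorted(compound_counts.items())
    for short, short_count in items:
        for long, long_count in items:
            if short != long and contains_words(long.split(), short.split()):
                table.append((short, long, short_count, long_count))
    return table
```

Similarity values could be stored in place of, or alongside, the counts in each row. The quadratic pairwise loop is only meant for illustration on small inputs.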