Type: | Package |
Title: | Fit Multi-Modal Mallows' Models to Ranking Data |
Version: | 1.1 |
Date: | 2020-01-23 |
Depends: | combinat |
Description: | An EM algorithm to fit Mallows' Models to full or partial rankings, with or without ties. Based on Adkins and Flinger (1998) <doi:10.1080/03610929808832223>. |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
Packaged: | 2020-02-06 17:51:43 UTC; adrien |
Repository: | CRAN |
Date/Publication: | 2020-02-07 01:10:16 UTC |
NeedsCompilation: | no |
Author: | Erik Gregory [aut, cre] |
Maintainer: | Erik Gregory <egregory2011@gmail.com> |
Fit Multi-modal Mallows' models to ranking data.
Description
Fits the Mallows' model to ranking data. Data can be partially or fully-ranked.
Details
Package: | RMallow |
Type: | Package |
Version: | 1.0 |
Date: | 2012-02-18 |
License: | GPL (>= 2) |
Author(s)
Erik Gregory Maintainer: <egregory2007@yahoo.com>
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
"Estimating a Population Distribution of Sequences of k Items from Cross- Sectional Data". Laurel A. Smith (Beckett) and Denis A. Evans. Journal of the Royal Statistical Society. Series C (Applied Statistics). Vol. 40, No. 1 (1991), pp.31-42. Blackwell Publishing for the Royal Statistical Society. Accessed 16/08/2010. http://www.jstor.org/stable/2347903 .
"A Non-iterative procedure for maximum likelihood estimation of the parameters of Mallows' Model Based on Partial Rankings". Laura Adkins and Michael Flinger. Communication in Statistics - Theory and Methods, 27:9, 2199-2220. 1998, Marchel Dekker, Inc. http://dx.doi.org/10.1080/03610929808832223 .
All Kendall's distances between two sets of rankings.
Description
Calculates all of the Kendall's distances between two different sets of rankings.
Usage
AllKendall(r, seqs, data.info = NULL)
Arguments
r |
One set of sequences. |
seqs |
Another set of sequences. |
data.info |
Optional argument, a 0/1/NA matrix specifying all of the relevant information to calculate Kendall's difference for "r". Used for efficiency in "Solve". |
Value
Matrix where output[i, j] represents the distance from sequence "i" in "r" to sequence "j" in "seqs".
Author(s)
Erik Gregory
Examples
data1 <- do.call("rbind", list(1:5, 5:1, c(3, 2, 1, 4, 5)))
data2 <- do.call("rbind", list(1:5, 5:1))
# AllKendall(data1, data2)
Calculate all distances between a set of sequences and a fixed sequence.
Description
Used to calculate the sequence Kendall distance distribution in N! space.
Usage
AllSeqDists(seqs)
Arguments
seqs |
Matrix or data frame of sequences. |
Value
Vector of the distances from the sequences to 1:N.
Author(s)
Erik Gregory
Fit Mallows model N times and select most likely model. The EM algorithm to fit Multi-Modal Mallows' models is prone to getting stuck in local maxima, so we run it several times and selec the best one.
Description
Fit Mallows model N times and select most likely model. The EM algorithm to fit Multi-Modal Mallows' models is prone to getting stuck in local maxima, so we run it several times and selec the best one.
Usage
BestFit(datas, N, iter, G)
Arguments
N |
number of times to run the model |
iter |
maximum number of iterations for each run |
G |
Number of cluster centers |
datas |
data set to fit |
Value
best fitting model.
Calculate the normalizing coefficient for Mallow's model in a sequence space.
Description
Calculate the normalizing coefficient, as a function of the lambda parameter, and the size of the sequence space.
Usage
C_lam(lambda, dists = NULL, dists.table = NULL)
Arguments
lambda |
Spread parameter for Mallows' model. |
dists |
Vector of all distances from each sequence to 1:N |
dists.table |
Table version of "dists" above. |
Value
Normalizing coefficient of Mallows' model in N! space with lambda = lambda.
Author(s)
Erik Gregory
Constructs sequences from Kendall Information matricies.
Description
Sequences in a fully-ordered sequence space have a unique Kendall Information vector associated with them. This function creates the sequence from the Kendall information vector.
Usage
ConstructSeqs(prefs, n.abils)
Arguments
prefs |
Ordering preference between columns in the data. 1 cooresponds to an increase, 0 to a decrease. |
n.abils |
Number of columns in the original data set. |
Value
List of fully-ordered sequences, one for each row of prefs.
Author(s)
Erik Gregory
Examples
ConstructSeqs(matrix(c(1, 1, 1, 0, 0, 0), nrow = 1), 4)
# Should output (4, 1, 2, 3)
Calculate the Kendall distance distribution in N! space.
Description
This function counts the number of fully-ordered vectors at each distance in N! space.
Usage
DistanceDistribution(N = 3)
Arguments
N |
Integer value, greater than or equal to 3. |
Value
Table-like structure, where the names represent the distance from the modal sequence of each sequence in N! space, and the values represent the number of sequences at that distance in the sequence space.
Author(s)
Erik Gregory
The Expectation step of the EM algorithm.
Description
Assigns each ranking the probability that it belongs to each cluster, given current parameters.
Usage
EStep(R, r, p, lambda, G, N, C, all.dists = NULL)
Arguments
R |
Current cluster modal sequences. |
r |
The data of partial or full rankings. |
p |
The proportion of the data currently assigned to each cluster. |
lambda |
The lambda parameters from Mallow's model for each cluster. |
G |
Number of clusters, length(R). |
N |
Number of rows in the data. |
C |
Vector of normalizing coefficients for the clusters. |
all.dists |
For efficiency, provide all of the Kendall distances between each sequence and each cluster mode. |
Value
Matrix where output[i, j] represents the current probability that subject "i" belongs to cluster "j".
Author(s)
Erik Gregory
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Formats the data in the "Solve" function for output.
Description
Data formatting function.
Usage
FormatOut(R, p, lambda, z, datas, likelihood)
Arguments
R |
The modal sequences. |
p |
Proportion of data in each cluster. |
lambda |
Mallows' spread parameters for each cluster. |
z |
Probability of cluster membership for each individual. |
datas |
Matrix of partial sequences. |
likelihood |
Vector of the log-likelihood of the model at each iteration. |
Value
R |
The modal sequences |
p |
Proportion in each cluster |
lambda |
Spread parameters for each cluster |
datas |
Rankings merged with their cluster membership, distance from each cluster center, and probability of each cluster membership |
min.like |
Likelihood at each iteration |
Author(s)
Erik Gregory
All information used to calculate Kendall's distance.
Description
Performs each column-wise comparison on a matrix of sequences. A 0 value denotes that there is an increase between the two columns, 1 a decrease, and NA indicates that the column values are identical in the row.
Usage
KendallInfo(r, inds = NULL)
Arguments
r |
Matrix of sequences. |
inds |
Possibly efficiency increase when doing repeated calculations, currently not used. |
Value
Matrix of 0s, 1s, and NAs representing pairwise comparisons of vector values.
Author(s)
Erik Gregory
References
http://en.wikipedia.org/wiki/Kendall_tau_distance
Objective function to determine lambda.
Description
Objective function to find the root of in calculating the lambda parameters for each cluster.
Usage
Lambda(lambda, rhs, dists, dists.table = NULL)
Arguments
lambda |
lambda value to calculate the function output at. |
rhs |
Right-hand side of the equation in the referenced paper. |
dists |
Not used. |
dists.table |
Table of distances between each sequence and the modal sequence in N! space. |
Value
Output of the objective function to determine the root of. Goal is zero.
Author(s)
Erik Gregory
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Likelihood of the data and parameters.
Description
Calculates the log-likelihood of the data with the current parameters and Kendall's distance.
Usage
Likelihood(z, p, C.lam, lambda, all.dists.data)
Arguments
z |
Probability of each cluster membership. |
p |
Proportion in each cluster. |
C.lam |
Vector of normalizing coefficients for Mallows' model. |
lambda |
Current spread parameters |
all.dists.data |
All distances from the data to the modal sequences. |
Value
Current log-likelihood of the data with the current parameters.
Author(s)
Erik Gregory
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Fits a Multi-Modal Mallows' model to ranking data.
Description
Fits the Multi-Modal Mallows' model to partial or full ranking data, using Kendall's metric and an EM algorithm. This is essentially metric sequence clustering.
Usage
Mallows(datas, G, iter = 10, hyp = NULL,
plot.like = FALSE)
Arguments
datas |
Matrix of partial or fully-ranked data. |
G |
Number of modes, 2 or greater. |
iter |
Maximum number of iterations. |
hyp |
Hypothesis sequence vector, to initialize one of the cluster centers at. |
plot.like |
Should the likelihood be printed at each iteration? |
Value
See output of FormatOut
Author(s)
Erik Gregory
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Calculates the table of Kendall distances in (N+1)! space, given those in N! space.
Description
This is identical to counting the number of fully-ordered vectors at each bubble sort distance in (N+1)! space.
Usage
NextTable(last.table, N.last)
Arguments
last.table |
Table of distances in N! space. |
N.last |
N |
Value
Table of distances in (N+1)! space.
Author(s)
Erik Gregory
Initialize sequence modes for the clustering process.
Description
Initialize sequence modes for the clustering process.
Usage
Rgen(G, hyp = NULL, abils)
Arguments
G |
number of cluster centers, including the hypothesis if provided |
hyp |
a single sequence of length |
abils |
number of items being ranked |
Value
A list of G cluster centers, each of length abils
Author(s)
Erik Gregory
Examples
Rgen(3, 1:5, 5)
Calculates distances in N! space.
Description
Calculates Kendall's distances of each sequence in N! space. This is VERY Inefficient for N >= 8. See DistanceDistribution for an astronomical improvement (possibly on the order of 10^10).
Usage
SeqDistribution(N)
Arguments
N |
Length of the ranking. Preferrably less than 9. |
Value
Vector of Kendall distances from 1:N to each sequence in N! space.
Author(s)
Erik Gregory
Change the form of ordered sequences.
Description
Simplifies sequences so that each tie group is only of distance 1 to the next tie group. For example, we would simplify (1, 1, 2, 4, 4, 5) to (1, 1, 2, 3, 3, 4).
Usage
SimplifySequences(loss.time)
Arguments
loss.time |
Matrix of sequences to be simplified. |
Value
Simplified sequences, as described in Description.
Author(s)
Erik Gregory
Update the Lambda parameters of clusters.
Description
Updates the Lambda parameters to maximize the likelihood of the data under Mallows' model.
Usage
UpdateLambda(r, R, z, G, dists.to.Rg, dists.table,
top.bound = 1000)
Arguments
r |
Matrix of partial rankings. |
R |
Current modal sequences. |
z |
Current probabilities of memberships in each cluster. |
G |
Number of modal sequences. |
dists.to.Rg |
Matrix of the distances between the data and the current modal sequences. |
dists.table |
Table of the distance distribution in N! space, under Kendall's metric. |
top.bound |
The maximum value for the lambda parameter. |
Value
Vector of new lambda parameters for the clusters.
Author(s)
Erik Gregory
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Update Proportion in each cluster.
Description
Updates the proportion of data assigned to each cluster.
Usage
UpdateP(z)
Arguments
z |
Probabilities that each sequence is in each cluster. |
Value
Proportion of data in each cluster.
Author(s)
Erik Gregory
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Update modal sequences in each cluster.
Description
Maximizes the likelihood of the data by updating the cluster centers of the model.
Usage
UpdateR(r, z, infos = NULL)
Arguments
r |
Matrix of sequences being clustered. |
z |
Probability of cluster membership for each sequence and each cluster. |
infos |
The KendallInfo matrix for "r". |
Value
New cluster centers for each cluster.
Author(s)
Erik Gregory
References
"Mixtures of distance-based models for ranking data". Thomas Brendan Murphy & Donal Martin. 1 April 2002. Computational Statistics & Data Analysis 41 (2003) 645-655.
Sample data set.
Description
Simple synthetic data set containing 3 modal sequences in 15! space, with some noise added.
Format
The format is: num [1:1700, 1:15] 1 15 1 15 15 12 10 4 1 15 ...
Examples
data(datas)
head(datas)
1980 APA Presidential Candidate ranking data.
Description
This data is a pre-processed version of the 1980 American Psychological Association Presidential candidate ranking data. It has uninformative rankings removed, and values pre-simplified into partial rankings.
Format
The format is: int [1:1378, 1:3] 1 1 1 1 2 2 1 1 2 2 ... - attr(*, "dimnames")=List of 2 ..$ : chr [1:1378] "1" "2" "3" "6" ... ..$ : chr [1:3] "Carter" "Reagan" "Anderson"
Source
The American Psychological Association, http://www.electionstudies.org/studypages/1980prepost/1980prepost.htm
Examples
data(elect)
head(elect)
Hello, World!
Description
Prints 'Hello, world!'.
Usage
hello()
Examples
hello()
Fitted version of the toy datas data set, with three modal sequences.
Description
The data has 3 modal sequences, and we can compare this to the two.mode data set.
Format
The format is: List of 5 $ R :List of 3 ..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ... ..$ : int [1:15] 1 3 5 7 9 2 4 6 8 10 ... ..$ : int [1:15] 15 14 13 12 11 10 9 8 7 6 ... $ p : num [1:3] 0.447 0.118 0.435 $ lambda : num [1:3] 2.01 1000 2.04 $ datas :'data.frame': 1700 obs. of 23 variables: ..$ X1 : num [1:1700] 1 15 1 15 15 12 10 4 1 15 ... ..$ X2 : num [1:1700] 2 14 2 14 14 13 13 12 2 14 ... ..$ X3 : num [1:1700] 3 13 3 13 13 2 4 6 3 13 ... ..$ X4 : num [1:1700] 4 12 4 12 12 8 7 1 4 12 ... ..$ X5 : num [1:1700] 5 11 5 11 11 9 14 5 5 11 ... ..$ X6 : num [1:1700] 6 10 6 10 10 1 8 10 6 10 ... ..$ X7 : num [1:1700] 7 9 7 9 9 15 1 13 7 9 ... ..$ X8 : num [1:1700] 8 8 8 8 8 10 9 9 8 8 ... ..$ X9 : num [1:1700] 9 7 9 7 7 6 5 14 9 7 ... ..$ X10 : num [1:1700] 10 6 10 6 6 11 11 8 10 6 ... ..$ X11 : num [1:1700] 11 5 11 5 5 3 15 2 11 5 ... ..$ X12 : num [1:1700] 12 4 12 4 4 14 12 11 12 4 ... ..$ X13 : num [1:1700] 13 3 13 3 3 7 2 7 13 3 ... ..$ X14 : num [1:1700] 14 2 14 2 2 5 3 15 14 2 ... ..$ X15 : num [1:1700] 15 1 15 1 1 4 6 3 15 1 ... ..$ clust : int [1:1700] 1 3 1 3 3 3 3 1 1 3 ... ..$ pvals.1: num [1:1700] 1.00 1.03e-91 1.00 2.04e-93 1.03e-91 ... ..$ pvals.2: num [1:1700] 0 0 0 0 0 0 0 0 0 0 ... ..$ pvals.3: num [1:1700] 1.02e-92 1.00 1.34e-93 1.00 1.00 ... ..$ seq : Factor w/ 3 levels "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",..: 1 3 1 3 3 3 3 1 1 3 ... ..$ dists.1: num [1:1700] 0 105 0 105 105 61 58 46 0 105 ... ..$ dists.2: num [1:1700] 10 95 10 95 95 61 54 54 10 95 ... ..$ dists.3: num [1:1700] 105 0 105 0 0 44 47 59 105 0 ... $ min.like: num [1:100] -122710 -51439 -50310 -49976 -49718 ...
Examples
data(three.mode)
head(three.mode[[4]])
Two-mode Mallows' model fit to toy data set "datas"
Description
"datas" has 3 modes, but we observe here what happens when we try to fit it with 2 modal sequences. The most prominent modal sequences are 1:15, 15:1
Format
The format is: List of 5 $ R :List of 2 ..$ : int [1:15] 1 2 3 4 5 6 7 8 9 10 ... ..$ : int [1:15] 15 14 13 12 11 10 9 8 7 6 ... $ p : num [1:2] 0.557 0.443 $ lambda : num [1:2] 2.05 2.02 $ datas :'data.frame': 1700 obs. of 21 variables: ..$ X1 : num [1:1700] 1 15 1 15 15 12 10 4 1 15 ... ..$ X2 : num [1:1700] 2 14 2 14 14 13 13 12 2 14 ... ..$ X3 : num [1:1700] 3 13 3 13 13 2 4 6 3 13 ... ..$ X4 : num [1:1700] 4 12 4 12 12 8 7 1 4 12 ... ..$ X5 : num [1:1700] 5 11 5 11 11 9 14 5 5 11 ... ..$ X6 : num [1:1700] 6 10 6 10 10 1 8 10 6 10 ... ..$ X7 : num [1:1700] 7 9 7 9 9 15 1 13 7 9 ... ..$ X8 : num [1:1700] 8 8 8 8 8 10 9 9 8 8 ... ..$ X9 : num [1:1700] 9 7 9 7 7 6 5 14 9 7 ... ..$ X10 : num [1:1700] 10 6 10 6 6 11 11 8 10 6 ... ..$ X11 : num [1:1700] 11 5 11 5 5 3 15 2 11 5 ... ..$ X12 : num [1:1700] 12 4 12 4 4 14 12 11 12 4 ... ..$ X13 : num [1:1700] 13 3 13 3 3 7 2 7 13 3 ... ..$ X14 : num [1:1700] 14 2 14 2 2 5 3 15 14 2 ... ..$ X15 : num [1:1700] 15 1 15 1 1 4 6 3 15 1 ... ..$ clust : int [1:1700] 1 2 1 2 2 2 2 1 1 2 ... ..$ pvals.1: num [1:1700] 1.00 4.15e-94 1.00 4.15e-94 4.15e-94 ... ..$ pvals.2: num [1:1700] 5.4e-93 1.0 5.4e-93 1.0 1.0 ... ..$ seq : Factor w/ 2 levels "1 2 3 4 5 6 7 8 9 10 11 12 13 14 15",..: 1 2 1 2 2 2 2 1 1 2 ... ..$ dists.1: num [1:1700] 0 105 0 105 105 61 58 46 0 105 ... ..$ dists.2: num [1:1700] 105 0 105 0 0 44 47 59 105 0 ... $ min.like: num [1:100] -178063 -139298 -58290 -54074 -53902 ...
Examples
data(two.mode)
head(two.mode[[4]])
Bi-modal Mallow's model fit to the APA data set.
Description
The two-modes seem to divide well between Democrats and Republicans...
Format
The format is: List of 5 $ R :List of 2 ..$ : int [1:3] 1 3 2 ..$ : int [1:3] 3 1 2 $ p : num [1:2] 0.541 0.459 $ lambda : num [1:2] 2.19 2.32 $ datas :'data.frame': 1378 obs. of 9 variables: ..$ Carter : int [1:1378] 1 1 1 1 2 2 1 1 2 2 ... ..$ Reagan : int [1:1378] 1 2 2 2 1 1 2 3 1 1 ... ..$ Anderson: int [1:1378] 1 2 2 3 3 3 3 2 3 3 ... ..$ clust : int [1:1378] 1 1 1 1 2 2 1 1 2 2 ... ..$ pvals.1 : num [1:1378] 0.541 0.992 0.992 0.932 0.131 ... ..$ pvals.2 : num [1:1378] 0.45893 0.00809 0.00809 0.06802 0.86945 ... ..$ seq : Factor w/ 2 levels "1 3 2","3 1 2": 1 1 1 1 2 2 1 1 2 2 ... ..$ dists.1 : num [1:1378] 0 0 0 1 2 2 1 0 2 2 ... ..$ dists.2 : num [1:1378] 0 2 2 2 1 1 2 3 1 1 ... $ min.like: num [1:100] -6421 -3386 -2916 -2811 -2799 ...
Source
American Psychological Association http://www.electionstudies.org/studypages/anes_mergedfile_1980/anes_mergedfile_1980.htm
Examples
data(two.seq)
head(two.seq[[4]])