Clustal Omega  1.0.3
Data Structures | Defines | Enumerations | Functions
src/clustal/mbed.c File Reference
#include <assert.h>
#include <float.h>
#include "mbed.h"
#include "pair_dist.h"
#include "symmatrix.h"
#include "ktuple_pair.h"
#include "tree.h"
#include "util.h"
#include "progress.h"
#include "queue.h"
#include "log.h"
#include "kmpp/KMeans.h"
Include dependency graph for mbed.c:

Data Structures

struct  bisecting_kmeans_result_t

Defines

#define TIMING   0
#define FULL_WITHIN_CLUSTER_DISTANCES   1
#define COMPUTE_WITHIN_SUBCLUSTER_AVERAGE   0
#define USE_KMEANS_LLOYDS   0
#define log2(x)   (log(x) / 0.69314718055994530942)
#define NUMBER_OF_SEEDS(n)   pow(log2(((double)n)), 2)
#define SEED_SELECTION   SELECT_SEEDS_BY_LENGTH
#define USE_EUCLIDEAN_DISTANCE   1
#define PRINT_CLUSTER_DISTRIBUTION   0
#define TRACE   0

Enumerations

enum  SEED_SELECTION_TYPE { SELECT_SEEDS_RANDOMLY, SELECT_SEEDS_BY_LENGTH }

Functions

void FreeKMeansResult (bisecting_kmeans_result_t **prResult_p)
 Free KMeans result structure.
void NewKMeansResult (bisecting_kmeans_result_t **prKMeansResult_p)
 Allocate new KMeans result.
double EuclDist (const double *v1, const double *v2, const int dim)
 Calculate the euclidean distance between two vectors.
double CosDist (const double *v1, const double *v2, const int dim)
 Calculate the cosine distance between two vectors.
int SeqToVec (double **ppdSeqVec, mseq_t *prMSeq, int *piSeeds, const int iNumSeeds, const int iPairDistType)
 convert sequences into mbed-like (distance) vector representation. Seeds (prMSeq sequence indices) have to be picked before
int SeedSelection (int *piSeeds, int iNumSeeds, int iSelectionMethod, mseq_t *prMSeq)
 Select seeds to be used from an prMSeq.
void BisectingKmeans (bisecting_kmeans_result_t **prKMeansResult_p, const int iNObjs, const int iDim, double **ppdVectors, const int iMinRequiredObjsPerCluster, const int iMaxAllowedObjsPerCluster)
 Bisecting K-Means clustering. Repeatedly calls K-Means with a K of 2 until no cluster has more than iMaxAllowedObjsPerCluster.
int Mbed (tree_t **prMbedTree_p, mseq_t *prMSeq, const int iPairDistType, const char *pcGuidetreeOut)
 From scratch reimplementation of mBed: Blackshields et al. (2010); PMID 20470396.

Define Documentation

#define log2 (   x)    (log(x) / 0.69314718055994530942)
#define NUMBER_OF_SEEDS (   n)    pow(log2(((double)n)), 2)
#define TIMING   0
#define TRACE   0
#define USE_EUCLIDEAN_DISTANCE   1
#define USE_KMEANS_LLOYDS   0

Enumeration Type Documentation

Enumerator:
SELECT_SEEDS_RANDOMLY 
SELECT_SEEDS_BY_LENGTH 

Function Documentation

void BisectingKmeans ( bisecting_kmeans_result_t **  prKMeansResult_p,
const int  iNObjs,
const int  iDim,
double **  ppdVectors,
const int  iMinRequiredObjsPerCluster,
const int  iMaxAllowedObjsPerCluster 
)

Bisecting K-Means clustering. Repeatedly calls K-Means with a K of 2 until no cluster has more than iMaxAllowedObjsPerCluster.

Parameters:
[out]prKMeansResult_pResult of Bisecting KMeans. Will be allocated here. Caller has to free. See
See also:
FreeKMeansResult()
Parameters:
[in]iNObjsNumber of objects/sequences to cluster
[in]iDimDimensionality of input data
[in]ppdVectorseach row holds iDim points for this object's coordinates
[in]iMinRequiredObjsPerClusterMinimum number of objects per Cluster (inclusive)/
[in]iMaxAllowedObjsPerClusterMaximum number of objects per Cluster (inclusive). Function returns once no cluster contains more then this number of objects. Soft limit!
Note:
Convoluted code. Could use some restructuring. My apologies. AW
double CosDist ( const double *  v1,
const double *  v2,
const int  dim 
)

Calculate the cosine distance between two vectors.

Parameters:
[in]v1First vector with dim dimensions
[in]v2Second vector with dim dimensions
[in]dimDimensionality of v1 and v2
Returns:
cosine distance as double
Note:
Could probably be inlined. Also perfect for SSE
double EuclDist ( const double *  v1,
const double *  v2,
const int  dim 
)

Calculate the euclidean distance between two vectors.

Parameters:
[in]v1First vector with dim dimensions
[in]v2Second vector with dim dimensions
[in]dimDimensionality of v1 and v2
Returns:
euclidean distance as double
Note:
Could probably be inlined. Also perfect for SSE

Free KMeans result structure.

Parameters:
[out]prResult_pK-Means result to free
See also:
NewKMeansResult()
int Mbed ( tree_t **  prMbedTree_p,
mseq_t prMSeq,
const int  iPairDistType,
const char *  pcGuidetreeOut 
)

From scratch reimplementation of mBed: Blackshields et al. (2010); PMID 20470396.

Idea is a follows:

  • convert sequences into vectors of distances
  • cluster the vectors using k-means
  • cluster each of the k clusters using upgma (used cached distances from above?)
  • join the sub-clusters to create on tree (use UPGMA on k-means medoids)
Parameters:
[out]prMbedTree_pCreated upgma tree. will be allocated here. use FreeMuscleTree() to free
[in]prMSeqMultiple sequences
[in]iPairDistTypeDistance measure for pairwise alignments
[in]pcGuidetreeOutPassed down to GuideTreeUpgma()
Returns:
Zero on success, non-zero on error
void NewKMeansResult ( bisecting_kmeans_result_t **  prKMeansResult_p)

Allocate new KMeans result.

Parameters:
[out]prKMeansResult_pK-Means result to free
See also:
NewKMeansResult()
int SeedSelection ( int *  piSeeds,
int  iNumSeeds,
int  iSelectionMethod,
mseq_t prMSeq 
)

Select seeds to be used from an prMSeq.

Parameters:
[out]piSeedsWill store the indices of prMSeqs seqs used to be as seeds here. Must be preallocated.
[in]iNumSeedsNumber of seeds to be picked
[in]iSelectionMethodSeed selection method to be used
[in]prMSeqThe prMSeq structure to pick sequences from
Returns:
: Non-zero on failure
int SeqToVec ( double **  ppdSeqVec,
mseq_t prMSeq,
int *  piSeeds,
const int  iNumSeeds,
const int  iPairDistType 
)

convert sequences into mbed-like (distance) vector representation. Seeds (prMSeq sequence indices) have to be picked before

Parameters:
[out]ppdSeqVecPointer to preallocated matrix of size nseqs x iSeeds
[in]prMSeqSequences which are to be converted
[in]piSeedsArray of sequences in prMSeq which are to be used as seeds.
[in]iNumSeedsNumber of seeds/elements in piSeeds
[in]iPairDistTypeType of pairwise distance comparison