shark::Data< Type > Class Template Reference

Data container.

#include <shark/Data/Dataset.h>


Public Types

typedef batch_type & batch_reference
 
typedef batch_type const & const_batch_reference
 
typedef Batch< element_type >::reference element_reference
 
typedef Batch< element_type >::const_reference const_element_reference
 
typedef std::vector< std::size_t > IndexSet
 
typedef boost::iterator_range< typename Container::element_iterator > element_range
 
typedef boost::iterator_range< typename Container::const_element_iterator > const_element_range
 
typedef boost::iterator_range< typename Container::iterator > batch_range
 
typedef boost::iterator_range< typename Container::const_iterator > const_batch_range
 

Public Member Functions

 BOOST_STATIC_CONSTANT (std::size_t, DefaultBatchSize=256)
 Defines the default batch size of the Container.
 
const_element_range elements () const
 Returns the range of elements.
 
element_range elements ()
 Returns the range of elements.
 
const_batch_range batches () const
 Returns the range of batches.
 
batch_range batches ()
 Returns the range of batches.
 
std::size_t numberOfBatches () const
 Returns the number of batches of the set.
 
std::size_t numberOfElements () const
 Returns the total number of elements.
 
bool empty () const
 Check whether the set is empty.
 
element_reference element (std::size_t i)
 
const_element_reference element (std::size_t i) const
 
batch_reference batch (std::size_t i)
 
const_batch_reference batch (std::size_t i) const
 
 Data ()
 Constructor which constructs an empty set.
 
 Data (std::size_t numBatches)
 Construct a dataset with empty batches.
 
 Data (Data const &container, std::vector< std::size_t > batchSizes)
 Construct a dataset with different batch sizes as a copy of another dataset.
 
 Data (std::size_t size, element_type const &element, std::size_t batchSize=DefaultBatchSize)
 Construction with size and a single element.
 
void read (InArchive &archive)
 Read the component from the supplied archive.
 
void write (OutArchive &archive) const
 Write the component to the supplied archive.
 
virtual void makeIndependent ()
 This method makes the vector independent of all siblings and parents.
 
void splitBatch (std::size_t batch, std::size_t elementIndex)
 
self_type splice (std::size_t batch)
 Splits the container into two independent parts. The front part remains in the container, the back part is returned.
 
void append (self_type const &other)
 Appends the contents of another data object to the end.
 
void push_back (const_batch_reference batch)
 
template<class Range >
void repartition (Range const &batchSizes)
 Reorders the batch structure in the container to that indicated by the batchSizes vector.
 
std::vector< std::size_t > getPartitioning () const
 Creates a vector with the batch sizes of every batch.
 
void indexedSubset (IndexSet const &indices, self_type &subset) const
 Fill in the subset defined by the list of indices.
 
void indexedSubset (IndexSet const &indices, self_type &subset, self_type &complement) const
 Fill in the subset defined by the list of indices as well as its complement.
 
Public Member Functions inherited from shark::ISerializable
virtual ~ISerializable ()
 Virtual destructor.
 
void load (InArchive &archive, unsigned int version)
 Versioned loading of components, calls read(...).
 
void save (OutArchive &archive, unsigned int version) const
 Versioned storing of components, calls write(...).
 
 BOOST_SERIALIZATION_SPLIT_MEMBER ()
 

Protected Types

typedef detail::SharedContainer< Type > Container
 

Protected Attributes

Container m_data
 data
 

Friends

template<class InputT , class LabelT >
class LabeledData
 
template<class T >
bool operator== (const Data< T > &op1, const Data< T > &op2)
 
void swap (Data &a, Data &b)
 

Detailed Description

template<class Type>
class shark::Data< Type >

Data container.

The Data class is Shark's container for machine learning data. This container (and its sub-classes) is used for input data, labels, and model outputs.

The Data container organizes the data it holds in batches: it chooses a representation that stores a whole group of data points, for example 100 of them, together. If the element type is RealVector, for example, each batch is a RealMatrix. This pays off because operations on a whole matrix are usually faster than the same operations on the individual vectors. Nearly all operations of the set are defined in terms of batches. The iterator interface therefore gives access to the batches, not to single elements. For single elements, the separate element_iterators and const_element_iterators can be used.
The container defines many typedefs. The usual container typedefs such as batch_type or iterator belong to the batch interface. A parallel set of typedefs exists for accessing single elements: write element_iterator instead of iterator, and element_type instead of batch_type. Avoid element_type unless you actually want to copy the data; use element_reference or const_element_reference instead. Note that these are proxy objects and not actual references to element_type! A short example for these typedefs:
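typedef Data<RealVector> Set;
Set data;
// elements() returns the range of single elements across all batches
Set::element_range elements=data.elements();
for(Set::element_range::iterator pos=elements.begin();pos!=elements.end();++pos){
    std::cout<<*pos<<" ";
    Set::element_reference ref=*pos; // proxy into the batch, not a copy
    ref*=2; // scales the element stored in the container
    std::cout<<*pos<<std::endl;
}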
In C++11 code this becomes simpler:
Data<RealVector> data;
auto elements=data.elements();
for(auto pos=elements.begin();pos!=elements.end();++pos){
    std::cout<<*pos<<" ";
    auto ref=*pos; // auto deduces the proxy type, so ref still refers to the stored element
    ref*=2;
    std::cout<<*pos<<std::endl;
}
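To make the proxy behaviour of element_reference more concrete, here is a minimal, self-contained sketch that does not use Shark at all. The names MatrixBatch and RowProxy are hypothetical stand-ins invented for this illustration; Shark's actual batch and proxy types are not shown on this page and may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a matrix batch: each row is one element,
// stored row-major in a single contiguous array.
struct MatrixBatch {
    std::size_t rows, cols;
    std::vector<double> values;
    MatrixBatch(std::size_t r, std::size_t c, double init)
    : rows(r), cols(c), values(r * c, init) {}
};

// Hypothetical proxy in the role of element_reference: it owns no data and
// only remembers which row of which batch it refers to.
class RowProxy {
public:
    RowProxy(MatrixBatch& batch, std::size_t row) : m_batch(&batch), m_row(row) {
        assert(row < batch.rows);
    }
    // Access to the j-th entry of the referenced row.
    double& operator()(std::size_t j) const {
        assert(j < m_batch->cols);
        return m_batch->values[m_row * m_batch->cols + j];
    }
    // Scaling through the proxy modifies the batch itself.
    RowProxy const& operator*=(double factor) const {
        for (std::size_t j = 0; j != m_batch->cols; ++j) (*this)(j) *= factor;
        return *this;
    }
    // Explicit deep copy, the analogue of converting to element_type.
    std::vector<double> copy() const {
        std::vector<double> result(m_batch->cols);
        for (std::size_t j = 0; j != m_batch->cols; ++j) result[j] = (*this)(j);
        return result;
    }
private:
    MatrixBatch* m_batch;
    std::size_t m_row;
};
```

In this sketch, writing through a RowProxy changes the row inside the batch, while the vector returned by copy() is unaffected by later changes. This mirrors the advice above to use the reference types unless a copy is really wanted.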
Element-wise access is usually slower than access to whole batches. If possible, use direct batch access, or at least use the iterator interface to iterate over all elements. Random access to a single element takes linear time, so use it sparingly. To work with batches you need to know the actual batch type, which depends on the type of the input. The rules are: if the input is an arithmetic type like int or double, the batch is a vector of that type (i.e. double->RealVector or int->IntVector). For vectors the batches are matrices, as mentioned above. If the vector is sparse, so is the matrix. For everything else the batch type is just a std::vector of the type, so no optimization can be applied.
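The batch type rules above can be pictured as a compile-time mapping from element type to batch type. The following sketch expresses them with template specialization. DenseVector, DenseMatrix, SparseVector, SparseMatrix and BatchOf are hypothetical names chosen for this illustration and are not Shark's own types or traits.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for dense and sparse linear algebra types.
template<class T> struct DenseVector { std::vector<T> values; };
template<class T> struct DenseMatrix { std::vector<T> values; std::size_t rows, cols; };
template<class T> struct SparseVector { std::vector<std::size_t> indices; std::vector<T> values; };
template<class T> struct SparseMatrix { std::vector<SparseVector<T> > rows; };

// Fallback rule: for any other type the batch is a plain std::vector.
template<class T, class Enable = void>
struct BatchOf { typedef std::vector<T> type; };

// Arithmetic types (int, double, ...) are batched into a dense vector.
template<class T>
struct BatchOf<T, typename std::enable_if<std::is_arithmetic<T>::value>::type> {
    typedef DenseVector<T> type;
};

// Dense vectors are batched into a dense matrix (one row per element).
template<class T>
struct BatchOf<DenseVector<T>, void> { typedef DenseMatrix<T> type; };

// Sparse vectors are batched into a sparse matrix.
template<class T>
struct BatchOf<SparseVector<T>, void> { typedef SparseMatrix<T> type; };
```

With these definitions, BatchOf<double>::type is DenseVector<double>, BatchOf<DenseVector<double> >::type is DenseMatrix<double>, and BatchOf<std::string>::type falls back to std::vector<std::string>.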
When constructing the container, the batch size can be set. If the user does not set it, the default batch size (DefaultBatchSize) is used. A batch size of 0 corresponds to putting all data into a single batch. Beware that not only the data needs storage but also the various models during computation, so the memory actually needed to process a batch can greatly exceed the size of the batch itself.
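As an illustration of how a batch size translates into a batch structure, and why random element access costs more than batch access, here is a small sketch in plain C++. It assumes the simplest possible scheme (full batches followed by one smaller remainder batch); Shark may distribute elements over batches differently. The function names batchSizes and locate are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Computes the sizes of the batches for numElements elements.
// A batchSize of 0 means "put everything into a single batch".
// Assumption of this sketch: full batches, then one remainder batch.
std::vector<std::size_t> batchSizes(std::size_t numElements, std::size_t batchSize) {
    std::vector<std::size_t> sizes;
    if (numElements == 0) return sizes;
    if (batchSize == 0) batchSize = numElements;
    for (std::size_t remaining = numElements; remaining > 0; ) {
        std::size_t current = remaining < batchSize ? remaining : batchSize;
        sizes.push_back(current);
        remaining -= current;
    }
    return sizes;
}

// Finds element i as (batch index, position inside that batch).
// The loop walks over the batches one by one, so the cost grows
// linearly with the number of batches in front of the element.
std::pair<std::size_t, std::size_t> locate(std::vector<std::size_t> const& sizes, std::size_t i) {
    std::size_t batch = 0;
    while (batch != sizes.size() && i >= sizes[batch]) {
        i -= sizes[batch];
        ++batch;
    }
    assert(batch != sizes.size()); // otherwise i was out of range
    return std::make_pair(batch, i);
}
```

Under these assumptions, 600 elements with a batch size of 256 yield batches of sizes 256, 256 and 88, and element 300 is found at position 44 of batch 1.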

An additional feature of the Data class is that it can be used to create lazy subsets: the batches of a dataset can be shared between several instances of the Data class without additional memory overhead.
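One common way to share batches without copying them is reference counting. The sketch below illustrates the idea with std::shared_ptr; it is an assumption about the general technique, not a description of how detail::SharedContainer is implemented. ToyBatch and SharedBatches are hypothetical names, and the makeIndependent function here only imitates the role of the member of the same name listed above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

typedef std::vector<double> ToyBatch; // hypothetical batch type for this sketch

class SharedBatches {
public:
    // Stores a new batch owned (for now) only by this container.
    void push_back(ToyBatch const& batch) {
        m_batches.push_back(std::make_shared<ToyBatch>(batch));
    }
    std::size_t numberOfBatches() const { return m_batches.size(); }
    ToyBatch& batch(std::size_t i) {
        assert(i < m_batches.size());
        return *m_batches[i];
    }
    // Lazy subset: only the pointers are copied, the batch storage is shared.
    SharedBatches subset(std::vector<std::size_t> const& batchIndices) const {
        SharedBatches result;
        for (std::size_t k = 0; k != batchIndices.size(); ++k) {
            assert(batchIndices[k] < m_batches.size());
            result.m_batches.push_back(m_batches[batchIndices[k]]);
        }
        return result;
    }
    // Deep-copies every batch so that later writes no longer affect relatives.
    void makeIndependent() {
        for (std::size_t k = 0; k != m_batches.size(); ++k)
            m_batches[k] = std::make_shared<ToyBatch>(*m_batches[k]);
    }
private:
    std::vector<std::shared_ptr<ToyBatch> > m_batches;
};
```

In this sketch, a subset sees every change made to the parent's batches until makeIndependent() is called on it. That is the price of sharing: no extra memory, but writes are visible to all containers that hold the same batch.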

Warning
Be aware, especially for derived containers like LabeledData, that the set does not enforce structural consistency. When you change the structure of the data part, for example by directly changing the size of the batches, the labels are not forced to change accordingly. Also, when subsets of a set are created, changing the parent will change its siblings and vice versa. The programmer needs to ensure structural integrity! For example, this is dangerous:
void function(Data<unsigned int>& data){
Data<unsigned int> newData(...);
data=newData;
}
If data originally came from a LabeledData object and newData has a different batch structure than data, this will lead to structural inconsistencies. If function is rewritten such that newData has the same structure as data, this code is perfectly fine. The best way to get around this problem is to rewrite the code as:
Data<unsigned int> function(){
Data<unsigned int> newData(...);
return newData;
}
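If an in-place assignment like the dangerous one above cannot be avoided, one possible safeguard is to compare the batch structures first. The sketch below works on plain vectors of batch sizes, such as those returned by getPartitioning(); the helper names samePartitioning and assertSamePartitioning are hypothetical and not part of Shark.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if both partitionings have the same number of batches
// and every pair of corresponding batches has the same size.
bool samePartitioning(std::vector<std::size_t> const& a, std::vector<std::size_t> const& b) {
    return a == b;
}

// Debug-mode guard: aborts if the two batch structures differ.
void assertSamePartitioning(std::vector<std::size_t> const& a, std::vector<std::size_t> const& b) {
    assert(samePartitioning(a, b) && "batch structure differs");
}
```

With Shark, the two arguments would typically be data.getPartitioning() and newData.getPartitioning(), checked before data is overwritten.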
Todo:
expand docu

Definition at line 144 of file Dataset.h.


The documentation for this class was generated from the following file: