|
| BOOST_STATIC_CONSTANT (std::size_t, DefaultBatchSize=256) |
| Defines the default batch size of the Container. More...
|
|
const_element_range | elements () const |
| Returns the range of elements. More...
|
|
element_range | elements () |
| Returns the range of elements. More...
|
|
const_batch_range | batches () const |
| Returns the range of batches. More...
|
|
batch_range | batches () |
| Returns the range of batches. More...
|
|
std::size_t | numberOfBatches () const |
| Returns the number of batches of the set. More...
|
|
std::size_t | numberOfElements () const |
| Returns the total number of elements. More...
|
|
bool | empty () const |
| Check whether the set is empty. More...
|
|
element_reference | element (std::size_t i) |
|
const_element_reference | element (std::size_t i) const |
|
batch_reference | batch (std::size_t i) |
|
const_batch_reference | batch (std::size_t i) const |
|
| Data () |
| Constructor which constructs an empty set. More...
|
|
| Data (std::size_t numBatches) |
| Construct a dataset with empty batches. More...
|
|
| Data (Data const &container, std::vector< std::size_t > batchSizes) |
| Construct a dataset with different batch sizes as a copy of another dataset. More...
|
|
| Data (std::size_t size, element_type const &element, std::size_t batchSize=DefaultBatchSize) |
| Construction with size and a single element. More...
|
|
void | read (InArchive &archive) |
| Read the component from the supplied archive. More...
|
|
void | write (OutArchive &archive) const |
| Write the component to the supplied archive. More...
|
|
virtual void | makeIndependent () |
| This method makes the vector independent of all siblings and parents. More...
|
|
void | splitBatch (std::size_t batch, std::size_t elementIndex) |
|
self_type | splice (std::size_t batch) |
| Splits the container into two independent parts. The front part remains in the container, the back part is returned. More...
|
|
void | append (self_type const &other) |
| Appends the contents of another data object to the end. More...
|
|
void | push_back (const_batch_reference batch) |
|
template<class Range > |
void | repartition (Range const &batchSizes) |
| Reorders the batch structure in the container to that indicated by the batchSizes vector. More...
|
|
std::vector< std::size_t > | getPartitioning () const |
| Creates a vector with the batch sizes of every batch. More...
|
|
void | indexedSubset (IndexSet const &indices, self_type &subset) const |
| Fill in the subset defined by the list of indices. More...
|
|
void | indexedSubset (IndexSet const &indices, self_type &subset, self_type &complement) const |
| Fill in the subset defined by the list of indices as well as its complement. More...
|
|
virtual | ~ISerializable () |
| Virtual d'tor. More...
|
|
void | load (InArchive &archive, unsigned int version) |
| Versioned loading of components, calls read(...). More...
|
|
void | save (OutArchive &archive, unsigned int version) const |
| Versioned storing of components, calls write(...). More...
|
|
| BOOST_SERIALIZATION_SPLIT_MEMBER () |
|
template<class Type>
class shark::Data< Type >
Data container.
The Data class is Shark's container for machine learning data. This container (and its sub-classes) is used for input data, labels, and model outputs.
- The Data container organizes the data it holds in batches. That is, it tries to find a good representation for a whole group of data points (for example, 100) at once. If the stored type is, for example, RealVector, the batches are RealMatrices. This is useful because operations on a whole matrix are usually faster than operations on the separate vectors. Nearly all operations of the set have to be interpreted in terms of batches. The iterator interface therefore gives access to the batches, not to single elements. To iterate over single elements, use the separate element_iterators and const_element_iterators.
- There are a lot of these typedefs. The typical typedefs for containers like batch_type or iterator are chosen as types for the batch interface. For accessing single elements, a different set of typedefs is in place. Thus instead of iterator you must write element_iterator and instead of batch_type write element_type. Usually you should not use element_type except when you want to actually copy the data. Instead use element_reference or const_element_reference. Note that these are proxy objects and not actual references to element_type! A short example for these typedefs:
typedef Data<RealVector> Set;
Set data;
for(Set::element_iterator pos = data.elemBegin(); pos != data.elemEnd(); ++pos){
    std::cout << *pos << " ";
    Set::element_reference ref = *pos;
    ref *= 2;
    std::cout << *pos << std::endl;
}
When you write C++11 code, this is of course much simpler:
Data<RealVector> data;
for(auto pos = data.elemBegin(); pos != data.elemEnd(); ++pos){
    std::cout << *pos << " ";
    auto ref = *pos;
    ref *= 2;
    std::cout << *pos << std::endl;
}
- Element-wise access is usually slower than accessing the batches. If possible, use direct batch access, or at least use the iterator interface to iterate over all elements. Random access to single elements takes linear time, so use it wisely. Of course, when you want to use batches, you need to know the actual batch type, which depends on the type of the input. Here are the rules: if the input is an arithmetic type like int or double, the batch is a vector of that type (i.e. double->RealVector or int->IntVector). For vectors, the batches are matrices, as mentioned above. If the vector is sparse, so is the matrix. For everything else, the batch type is just a std::vector of the type, so no optimization can be applied.
- When constructing the container, the batchSize can be set. If the user does not set it, the default batchSize (DefaultBatchSize) is chosen. A batchSize of 0 corresponds to putting all data into a single batch. Beware that not only the data needs storage, but also the intermediate results of the various models during computation, so the memory needed to process a batch can greatly exceed the size of the batch itself.
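The batch-size rule and the linear cost of random element access can be illustrated with a small standalone sketch. It does not use Shark: partitionSizes and locateElement are hypothetical helpers, and the way Shark actually distributes elements over batches may differ (for example, it may balance batch sizes instead of filling them greedily).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical helper, not Shark API: split n elements into batches of at
// most batchSize elements. A batchSize of 0 puts everything in one batch.
std::vector<std::size_t> partitionSizes(std::size_t n, std::size_t batchSize) {
    std::vector<std::size_t> sizes;
    if (batchSize == 0) batchSize = n;  // 0 means: a single batch
    std::size_t remaining = n;
    while (remaining > 0) {
        std::size_t s = std::min(remaining, batchSize);
        sizes.push_back(s);
        remaining -= s;
    }
    // Sanity check: the batches together hold exactly n elements.
    assert(std::accumulate(sizes.begin(), sizes.end(), std::size_t{0}) == n);
    return sizes;
}

// Hypothetical helper: map a global element index to (batch, offset) by
// walking the batch sizes. The cost grows linearly with the number of
// batches, which is why random element access should be used sparingly.
std::pair<std::size_t, std::size_t>
locateElement(std::vector<std::size_t> const& sizes, std::size_t i) {
    for (std::size_t b = 0; b < sizes.size(); ++b) {
        if (i < sizes[b]) return {b, i};
        i -= sizes[b];
    }
    throw std::out_of_range("element index out of range");
}
```

With 600 elements and a batch size of 256, this sketch yields three batches, and element 300 is found in the second batch only after skipping over the first.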
An additional feature of the Data class is that it can be used to create lazy subsets: the batches of a dataset can be shared between several instances of the Data class without additional memory overhead.
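A minimal sketch of how such batch sharing can work, assuming reference-counted batch handles. ToyData and batchSubset are illustrative names, not Shark's implementation; only makeIndependent mirrors a member listed above, and its body here is an assumption about what "independent" means.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Toy stand-in for a batched container (not Shark code).
struct ToyData {
    using Batch = std::vector<double>;
    std::vector<std::shared_ptr<Batch>> batches;

    // A subset copies only the batch handles, so no element data is duplicated.
    ToyData batchSubset(std::vector<std::size_t> const& indices) const {
        ToyData subset;
        for (std::size_t i : indices) {
            assert(i < batches.size());
            subset.batches.push_back(batches[i]);
        }
        return subset;
    }

    // Deep-copy every batch so that later writes no longer reach other instances.
    void makeIndependent() {
        for (auto& b : batches) b = std::make_shared<Batch>(*b);
    }
};
```

In this sketch, a write through the parent is visible in the subset until makeIndependent() is called on the subset, which is the sharing behaviour the warning below is concerned with.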
- Warning
- Be aware, especially for derived containers like LabeledData, that the set does not enforce structural consistency. When you change the structure of the data part, for example by directly changing the size of the batches, the labels are not resized accordingly. Also, when creating subsets of a set, changing the parent will change its siblings and conversely. The programmer needs to ensure structural integrity! For example, this is dangerous:
void function(Data<unsigned int>& data){
    Data<unsigned int> newData(...);
    data = newData;
}
When data was originally part of a LabeledData object and newData has a different batch structure than data, this will lead to structural inconsistencies. When function is rewritten such that newData has the same structure as data, this code is perfectly fine. The best way to get around this problem is by rewriting the code as:
Data<unsigned int> function(){
    Data<unsigned int> newData(...);
    return newData;
}
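One way to guard against such inconsistencies is to compare batch layouts before assigning. The sketch below is standalone and hypothetical: it assumes the two vectors were obtained from something like getPartitioning() on the inputs and on the labels, and the helper names are not part of Shark.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using Partitioning = std::vector<std::size_t>;

// Hypothetical helper: index of the first batch whose size differs between
// the two layouts. If one layout is a prefix of the other, the length of the
// shorter layout is returned.
std::size_t firstMismatch(Partitioning const& a, Partitioning const& b) {
    std::size_t n = std::min(a.size(), b.size());
    std::size_t i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

// Hypothetical helper: true when both layouts have the same number of
// batches and every pair of batches has the same size.
bool sameStructure(Partitioning const& a, Partitioning const& b) {
    return a.size() == b.size() && firstMismatch(a, b) == a.size();
}
```

A caller could check sameStructure on the input and label partitionings before replacing the data part, and report firstMismatch when the check fails.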
- Todo:
- expand docu
Definition at line 144 of file Dataset.h.