TabularMSA.
loc
¶Slice the MSA on first axis by index label, second axis by position.
State: Experimental as of 0.4.1.
This will return an object with the following interface:
msa.loc[seq_idx]
msa.loc[seq_idx, pos_idx]
msa.loc(axis='sequence')[seq_idx]
msa.loc(axis='position')[pos_idx]
Parameters: | seq_idx : label, slice, 1D array_like (bool or label)
pos_idx : (same as seq_idx), optional
axis : {‘sequence’, ‘position’, 0, 1, None}, optional
|
---|---|
Returns: | TabularMSA, GrammaredSequence, Sequence
|
See also
Notes
If the slice operation results in a TabularMSA
without any
sequences, the MSA’s positional_metadata
will be unset.
When the MSA’s index is a pd.MultiIndex
a tuple may be given to
seq_idx to indicate the slicing operations to perform on each
component index.
Examples
First we need to set up an MSA to slice:
>>> from skbio import TabularMSA, DNA
>>> msa = TabularMSA([DNA("ACGT"), DNA("A-GT"), DNA("AC-T"),
... DNA("ACGA")], index=['a', 'b', 'c', 'd'])
>>> msa
TabularMSA[DNA]
---------------------
Stats:
sequence count: 4
position count: 4
---------------------
ACGT
A-GT
AC-T
ACGA
>>> msa.index
Index(['a', 'b', 'c', 'd'], dtype='object')
When we slice by a scalar we get the original sequence back out of the MSA:
>>> msa.loc['b']
DNA
--------------------------
Stats:
length: 4
has gaps: True
has degenerates: False
has definites: True
GC-content: 33.33%
--------------------------
0 A-GT
Similarly when we slice the second axis by a scalar we get a column of the MSA:
>>> msa.loc[..., 1]
Sequence
-------------
Stats:
length: 4
-------------
0 C-CC
Note: we return an skbio.Sequence
object because the column of an
alignment has no biological meaning and many operations defined for the
MSA’s sequence dtype would be meaningless.
When we slice both axes by a scalar, operations are applied left to right:
>>> msa.loc['a', 0]
DNA
--------------------------
Stats:
length: 1
has gaps: False
has degenerates: False
has definites: True
GC-content: 0.00%
--------------------------
0 A
In other words, it exactly matches slicing the resulting sequence object directly:
>>> msa.loc['a'][0]
DNA
--------------------------
Stats:
length: 1
has gaps: False
has degenerates: False
has definites: True
GC-content: 0.00%
--------------------------
0 A
When our slice is non-scalar we get back an MSA of the same dtype:
>>> msa.loc[['a', 'c']]
TabularMSA[DNA]
---------------------
Stats:
sequence count: 2
position count: 4
---------------------
ACGT
AC-T
We can similarly slice out a column of that:
>>> msa.loc[['a', 'c'], 2]
Sequence
-------------
Stats:
length: 2
-------------
0 G-
Slice syntax works as well:
>>> msa.loc[:'c']
TabularMSA[DNA]
---------------------
Stats:
sequence count: 3
position count: 4
---------------------
ACGT
A-GT
AC-T
Notice how the end label is included in the results. This is different from how positional slices behave:
>>> msa.loc[[True, False, False, True], 2:3]
TabularMSA[DNA]
---------------------
Stats:
sequence count: 2
position count: 1
---------------------
G
G
Here we sliced the first axis by a boolean vector, but then restricted the columns to a single column. Because the second axis was given a nonscalar we still recieve an MSA even though only one column is present.
Duplicate labels can be an unfortunate reality in the real world, however loc is capable of handling this:
>>> msa.index = ['a', 'a', 'b', 'c']
Notice how the label ‘a’ happens twice. If we were to access ‘a’ we get back an MSA with both sequences:
>>> msa.loc['a']
TabularMSA[DNA]
---------------------
Stats:
sequence count: 2
position count: 4
---------------------
ACGT
A-GT
Remember that iloc can always be used to differentiate sequences with duplicate labels.
More advanced slicing patterns are possible with different index types.
Let’s use a pd.MultiIndex:
>>> msa.index = [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Here we will explicitly set the axis that we are slicing by to make things easier to read:
>>> msa.loc(axis='sequence')['a', 0]
DNA
--------------------------
Stats:
length: 4
has gaps: False
has degenerates: False
has definites: True
GC-content: 50.00%
--------------------------
0 ACGT
This selected the first sequence because the complete label was provided. In other words (‘a’, 0) was treated as a scalar for this index.
We can also slice along the component indices of the multi-index:
>>> msa.loc(axis='sequence')[:, 1]
TabularMSA[DNA]
---------------------
Stats:
sequence count: 2
position count: 4
---------------------
A-GT
ACGA
If we were to do that again without the axis argument, it would look like this:
>>> msa.loc[(slice(None), 1), ...]
TabularMSA[DNA]
---------------------
Stats:
sequence count: 2
position count: 4
---------------------
A-GT
ACGA
Notice how we needed to specify the second axis. If we had left that out we would have simply gotten the 2nd column back instead. We also lost the syntactic sugar for slice objects. These are a few of the reasons specifying the axis preemptively can be useful.