
Dtype Brainstorming

Matti Picus edited this page Oct 22, 2018 · 4 revisions

Dtypes in NumPy

A brainstorming session at the SciPy 2018 sprints July 14, 2018

User Stories

Motivations

Replacing subclassing, which is quite fragile.

Dtype requirements

  • Hold data (e.g., categories, datetime64, units)

  • Needs to be able to override dtype-specific functionality:

    • Arithmetic
    • Ufuncs
    • Sorting
    • Coercion rules
  • Handle life cycle (e.g. GEOS/shapely)

  • Push API up to the ndarray

    • For example, can a unit dtype push a method .to up to the ndarray class to convert to a different unit?
    • Or can a datetime dtype push a .year or .dayofweek up to the ndarray class?
    • This can be done -- but should it be done?
  • Two use cases: writing high-level dtypes in Python and low-level dtypes in C:

    • We need new capabilities for C dtypes:
      • At the C level, the current interface is quite cumbersome. It would be nice to have something easier for use with C/C++/Cython.
      • At a low level, ufunc loops need access to dtype metadata (e.g., this is why we don't have ufuncs for strings in NumPy)
      • A new primitive data type for pointers would be broadly useful (e.g., for managing strings or geometric objects).
    • We need to be able to write custom dtypes in Python
      • This would be particularly useful for high-level dtypes like units or categorical, which can be written in terms of a primitive data type plus some metadata.
      • Ideally, custom dtypes would reuse existing protocols for duck arrays, e.g., __array_ufunc__ and __array_function__.
  • Mechanism for extended dtypes to go from strings to dtypes

    • Parse dtype='my_dtype[options]' into the dtype constructor somehow.
    • DSL? Are we parsing?
    • Handle conflicting names by convention (maybe raise a warning)
    • Possibly need a registration mechanism (so np.array([1, 2, 3], dtype='my_dtype') would work)
  • Scalar types should not need to be NumPy scalars?

  • Should it allow for mix-in like paradigms (say have mydtype based off of np.float64)?

  • Should we have something like isinstance(dtype, (np.float64, np.float32))?

  • Should not require every .dtype attribute to be a NumPy dtype (e.g., pandas_series.dtype == np.dtype(np.float64) currently breaks)
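To make the "push API up to the ndarray" idea above concrete, here is a rough sketch of one possible mechanism: the dtype declares methods that a wrapper forwards into the array namespace. All names here (`UnitDtype`, `ArrayProxy`, `ndarray_methods`, `.to`) are invented for illustration; nothing like this exists in NumPy today.

```python
import numpy as np

# Hypothetical sketch: a dtype-like object declares methods to be
# "pushed up" to the array, and a thin proxy forwards them.

class UnitDtype:
    ndarray_methods = ('to',)          # methods the dtype wants exposed

    def __init__(self, unit, scale):
        self.unit = unit
        self.scale = scale             # scale relative to a base unit

    def to(self, array, unit, scale):
        # convert array values from self.unit to the target unit
        return array * (self.scale / scale)

class ArrayProxy:
    def __init__(self, values, dtype):
        self.values = np.asarray(values, dtype=float)
        self.dtype = dtype

    def __getattr__(self, name):
        # forward declared methods to the dtype, binding this array's data
        if name in self.dtype.ndarray_methods:
            method = getattr(self.dtype, name)
            return lambda *args, **kwargs: method(self.values, *args, **kwargs)
        raise AttributeError(name)

a = ArrayProxy([1.0, 2.0], UnitDtype('km', scale=1000.0))
b = a.to('m', scale=1.0)               # array([1000., 2000.])
```

Whether this forwarding should happen inside ndarray itself is exactly the "can be done -- but should it be done?" question above.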
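The string-to-dtype registration idea above could look something like the following sketch. The registry, the `register_dtype`/`parse_dtype` names, and the `'name[options]'` convention are all assumptions for illustration, not an existing NumPy mechanism.

```python
import re
import warnings

# Hypothetical registry mapping dtype-string names to constructors.
_DTYPE_REGISTRY = {}

def register_dtype(name, constructor):
    # handle conflicting names by convention: warn, then override
    if name in _DTYPE_REGISTRY:
        warnings.warn(f"dtype name {name!r} already registered; overriding")
    _DTYPE_REGISTRY[name] = constructor

def parse_dtype(spec):
    # split 'my_dtype[options]' into a name and an optional options string
    match = re.fullmatch(r'(\w+)(?:\[(.*)\])?', spec)
    if match is None or match.group(1) not in _DTYPE_REGISTRY:
        raise TypeError(f"unknown dtype string: {spec!r}")
    name, options = match.groups()
    constructor = _DTYPE_REGISTRY[name]
    return constructor(options) if options else constructor()

register_dtype('unit', lambda opts=None: ('unit', opts))
parse_dtype('unit[meters]')   # -> ('unit', 'meters')
```

With such a registry in place, np.array([1, 2, 3], dtype='my_dtype') could resolve the string through the same lookup.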

Suggestion: strawman proposal for what writing a dtype should look like.

Straw implementations

Units

From Nathan, based on unyt_array

import numpy as np


class float64_with_unit(np.dtype):
    array_dtype = np.float64
    unit = None

    def __init__(self, unit):
        self.unit = unit

    def __array_ufunc_proxy__(self, ufunc, method, *input_dtypes, **kwargs):
        # do ufunc dispatch
        raise NotImplementedError

    def __array_function_proxy__(self, function, *input_dtypes, **kwargs):
        # do function dispatch
        raise NotImplementedError

    def __setstate__(self, state):
        # do pickle serialization
        raise NotImplementedError

Comments

  • Do we want to give dtypes the ability to change all functions, or just ufuncs?
  • __array_function__ should call dtype.coerce

From Ryan, out of thin air:

import numpy as np

class UnitDType(np.dtype):
    _ndarray_api = ['convert']

    def __init__(self, unit, baseType=np.float64):
        self._unit = unit
        self._base = baseType

    def convert(self, unit):
        # astype()?
        pass

    def __add__(self, other):
        self._check(other)
        return self._base.add(self, other)

    def __mul__(self, other):
        result = self._base.mul(self, other)
        self._update_dimensionality(other)
        return result

    def _check(self, other):
        if self._dimensions != other._dimensions:
            raise UnitError

    def _update_dimensionality(self, other):
        # combine the dimensions of self and other
        self._dimensions[...]


a = np.ones((5,), dtype=UnitDType('meters'))

b = np.ones((5,), dtype=UnitDType('seconds'))

a + b  # UnitError

a * b == np.ones((5,), dtype=UnitDType('meters/seconds'))

Comments

  • .astype('units[ft]') could work, but it would be nice to specify just .convert('ft')
  • __add__ etc should be handled by __array_ufunc__
  • Units here might be a specific case of something more general
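To make the comment about routing __add__ through __array_ufunc__ concrete, here is a minimal working sketch. It is not a dtype at all, just a wrapper class with invented names (`UnitArray`), and it only handles add and multiply, but it shows unit checking and unit propagation happening in one dispatch point instead of per-operator methods:

```python
import numpy as np

class UnitArray:
    # Minimal wrapper (not a real dtype): carries a unit string and routes
    # arithmetic through the __array_ufunc__ protocol (NEP 13), which NumPy
    # honors for any input object that defines it.
    def __init__(self, values, unit):
        self.values = np.asarray(values, dtype=float)
        self.unit = unit

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        units = [x.unit for x in inputs if isinstance(x, UnitArray)]
        raw = [x.values if isinstance(x, UnitArray) else x for x in inputs]
        if ufunc is np.add:
            # addition requires matching units
            if len(set(units)) != 1:
                raise TypeError(f"unit mismatch: {units}")
            return UnitArray(getattr(ufunc, method)(*raw, **kwargs), units[0])
        if ufunc is np.multiply:
            # multiplication combines units (naively, for illustration)
            return UnitArray(getattr(ufunc, method)(*raw, **kwargs),
                             '*'.join(units))
        return NotImplemented

    def __add__(self, other):
        return np.add(self, other)

    def __mul__(self, other):
        return np.multiply(self, other)

a = UnitArray([1.0, 2.0], 'm')
b = UnitArray([3.0, 4.0], 's')
(a * b).unit   # 'm*s'
```

A real dtype-level hook would presumably let NumPy call into the dtype the same way, without a wrapper class at all.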

Strings

From Stephan:

class VariableLengthString(np.LogicalDtype):
    physical_dtype = np.object
    name = 'String'

    def __array_ufunc__(self, ufunc, method, args, **kwargs):
        if any(not isinstance(a.dtype, VariableLengthString)
               for a in args):
            return NotImplemented
        physical_args = tuple(a.astype(object) for a in args)
        result = getattr(ufunc, method)(*physical_args, **kwargs)
        return result.astype(VariableLengthString)

    def __array_function__(self, func, types, args, kwargs):
        # can't do it! types only exposes type information, not dtype
        return NotImplemented

    def __dtype_promote__(self, dtypes):
        if all(d in [VariableLengthString, np.unicode_, np.string_]
               for d in dtypes):
            return VariableLengthString()
        return NotImplemented

    def __array_coerce__(self, array, casting):
        if array.dtype.kind == 'U':
            result = array.astype(object)
            result.dtype = VariableLengthString()
        elif array.dtype.kind == 'S':
            # decode as ascii? raise?
            raise NotImplementedError
        elif array.dtype.kind == 'O':
            # check that every element is a string
            raise NotImplementedError
        else:
            raise TypeError
        return result

I used LogicalDtype above to say that this dtype is based on another dtype, so that NumPy knows how to handle it; I just want to implement a little bit on top of that.

The __array_function__ protocol doesn't work that well here because the dtype isn't explicitly provided for all of the arrays.

Categorical

From Joris

import numpy as np

class CategoricalDtype:

    def __init__(self, categories=None, ordered=False):
        self.categories = categories
        self.ordered = ordered

    @classmethod
    def _construct_dtype_from_string(cls, string):
        ...

    def _array_constructor(self, values):
        # convert values to codes
        codes = ...
        # update self to reflect values
        return np.array(codes, dtype=self)

    def _array_repr(self):
        # override the repr of the array with this dtype
        ...

    def _validate_scalar(self, scalar):
        # validate that the scalar can be stored in the array
        ...


np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype())
np.array(['Red', 'Green', 'Blue', 'Red'], dtype=CategoricalDtype(categories=['Red', 'Green', 'Blue', 'Yellow']))
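The "convert values to codes" step in _array_constructor could be done with plain Python and NumPy, independent of any dtype machinery. A sketch (the `to_codes` name and the code -1 for unknown values are assumptions, loosely following the pandas Categorical convention):

```python
import numpy as np

def to_codes(values, categories=None):
    # Map values to integer codes into the (possibly inferred) categories.
    values = np.asarray(values, dtype=object)
    if categories is None:
        # infer categories in order of first appearance
        categories = list(dict.fromkeys(values.tolist()))
    index = {c: i for i, c in enumerate(categories)}
    # values not in the categories get code -1
    codes = np.array([index.get(v, -1) for v in values], dtype=np.intp)
    return codes, categories

codes, cats = to_codes(['Red', 'Green', 'Blue', 'Red'])
# codes -> array([0, 1, 2, 0]), cats -> ['Red', 'Green', 'Blue']
```

The integer codes array is what would actually be stored; the categories list lives on the dtype instance as metadata.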

Comments

  • Don't need to implement __add__, etc., due to __array_ufunc__
  • Should we limit the functions that can go into __array_function__ for dtypes? Do we need __array_function__?
  • Mixins for units -- don't want to write a separate dtype for each variation.
  • Should dtypes specify width?

Protocols, Inheritance, Duck Dtypes

  • protocols -- __array_ufunc__, __array_function__
  • inheritance -- subclassing dtype. Probably not a good idea.
  • duck dtype -- what is the minimum viable set of methods and attributes a dtype needs?
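One way to make the duck-dtype question concrete is to sketch a candidate minimal attribute set. The attributes below mirror ones on np.dtype that code commonly inspects (name, kind, itemsize, type); whether this subset is actually sufficient is exactly the open question, and the `DuckDtype` class itself is invented for illustration.

```python
# Hypothetical minimal "duck dtype": not a np.dtype subclass, just an
# object exposing the attributes that dtype-consuming code inspects most.

class DuckDtype:
    name = 'duck'
    kind = 'O'                 # one-character type code, like np.dtype.kind
    itemsize = 8               # bytes per element
    type = object              # associated scalar type

    def __eq__(self, other):
        # allow comparison against the dtype's string name, like np.dtype
        return isinstance(other, DuckDtype) or other == self.name

    def __hash__(self):
        return hash(self.name)

d = DuckDtype()
d == 'duck'    # True
```

Anything beyond this (promotion, coercion, ufunc loops) would come from the protocols listed above rather than from attributes.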

Misc

Themes

  • It would be great to get past subclassing
  • It would be nice to write something in Python
  • It would be nice to be able to interoperate between different array duck types using the same dtype