- """
- The :mod:`sklearn.compose._column_transformer` module implements utilities
- to work with heterogeneous data and to apply different transformers to
- different columns.
- """
- # Author: Andreas Mueller
- # Joris Van den Bossche
- # License: BSD
- from collections import Counter
- from itertools import chain
- from numbers import Integral, Real
- import numpy as np
- from scipy import sparse
- from ..base import TransformerMixin, _fit_context, clone
- from ..pipeline import _fit_transform_one, _name_estimators, _transform_one
- from ..preprocessing import FunctionTransformer
- from ..utils import Bunch, _get_column_indices, _safe_indexing, check_pandas_support
- from ..utils._estimator_html_repr import _VisualBlock
- from ..utils._param_validation import HasMethods, Hidden, Interval, StrOptions
- from ..utils._set_output import _get_output_config, _safe_set_output
- from ..utils.metaestimators import _BaseComposition
- from ..utils.parallel import Parallel, delayed
- from ..utils.validation import (
- _check_feature_names_in,
- _num_samples,
- check_array,
- check_is_fitted,
- )
- __all__ = ["ColumnTransformer", "make_column_transformer", "make_column_selector"]
- _ERR_MSG_1DCOLUMN = (
- "1D data passed to a transformer that expects 2D data. "
- "Try to specify the column selection as a list of one "
- "item instead of a scalar."
- )
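The message above is raised when a scalar column selection hands a 1D vector to a transformer that needs 2D input. A minimal sketch of the difference between a scalar and a one-item list selection follows; the NumPy input and the choice of `StandardScaler` as a 2D-only transformer are illustrative assumptions, not part of this module.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = np.array([[0.0, 10.0], [1.0, 20.0], [2.0, 30.0]])

# A one-item list selects a 2D slice of shape (n_samples, 1).
ct_ok = ColumnTransformer([("scale", StandardScaler(), [0])])
X_ok = ct_ok.fit_transform(X)

# A scalar selects a 1D vector, which StandardScaler rejects.
ct_bad = ColumnTransformer([("scale", StandardScaler(), 0)])
try:
    ct_bad.fit_transform(X)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Transformers that work on a single sequence of items (for example text vectorizers) are the case where the scalar form is the appropriate one.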
- class ColumnTransformer(TransformerMixin, _BaseComposition):
- """Applies transformers to columns of an array or pandas DataFrame.
- This estimator allows different columns or column subsets of the input
- to be transformed separately and the features generated by each transformer
- will be concatenated to form a single feature space.
- This is useful for heterogeneous or columnar data, to combine several
- feature extraction mechanisms or transformations into a single transformer.
- Read more in the :ref:`User Guide <column_transformer>`.
- .. versionadded:: 0.20
- Parameters
- ----------
- transformers : list of tuples
- List of (name, transformer, columns) tuples specifying the
- transformer objects to be applied to subsets of the data.
- name : str
- Like in Pipeline and FeatureUnion, this allows the transformer and
- its parameters to be set using ``set_params`` and searched in grid
- search.
- transformer : {'drop', 'passthrough'} or estimator
- Estimator must support :term:`fit` and :term:`transform`.
- Special-cased strings 'drop' and 'passthrough' are accepted as
- well, to indicate to drop the columns or to pass them through
- untransformed, respectively.
- columns : str, array-like of str, int, array-like of int, \
- array-like of bool, slice or callable
- Indexes the data on its second axis. Integers are interpreted as
- positional columns, while strings can reference DataFrame columns
- by name. A scalar string or int should be used where
- ``transformer`` expects X to be a 1d array-like (vector),
- otherwise a 2d array will be passed to the transformer.
- A callable is passed the input data `X` and can return any of the
- above. To select multiple columns by name or dtype, you can use
- :obj:`make_column_selector`.
- remainder : {'drop', 'passthrough'} or estimator, default='drop'
- By default, only the specified columns in `transformers` are
- transformed and combined in the output, and the non-specified
- columns are dropped (the default, ``'drop'``).
- By specifying ``remainder='passthrough'``, all remaining columns that
- were not specified in `transformers`, but present in the data passed
- to `fit` will be automatically passed through. This subset of columns
- is concatenated with the output of the transformers. For dataframes,
- extra columns not seen during `fit` will be excluded from the output
- of `transform`.
- By setting ``remainder`` to be an estimator, the remaining
- non-specified columns will use the ``remainder`` estimator. The
- estimator must support :term:`fit` and :term:`transform`.
- Note that using this feature requires that the DataFrame columns
- input at :term:`fit` and :term:`transform` have identical order.
- sparse_threshold : float, default=0.3
- If the output of the different transformers contains sparse matrices,
- these will be stacked as a sparse matrix if the overall density is
- lower than this value. Use ``sparse_threshold=0`` to always return
- dense. When the transformed output consists of all dense data, the
- stacked result will be dense, and this keyword will be ignored.
- n_jobs : int, default=None
- Number of jobs to run in parallel.
- ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
- ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
- for more details.
- transformer_weights : dict, default=None
- Multiplicative weights for features per transformer. The output of the
- transformer is multiplied by these weights. Keys are transformer names,
- values the weights.
- verbose : bool, default=False
- If True, the time elapsed while fitting each transformer will be
- printed as it is completed.
- verbose_feature_names_out : bool, default=True
- If True, :meth:`ColumnTransformer.get_feature_names_out` will prefix
- all feature names with the name of the transformer that generated that
- feature.
- If False, :meth:`ColumnTransformer.get_feature_names_out` will not
- prefix any feature names and will error if feature names are not
- unique.
- .. versionadded:: 1.0
- Attributes
- ----------
- transformers_ : list
- The collection of fitted transformers as tuples of
- (name, fitted_transformer, column). `fitted_transformer` can be an
- estimator, 'drop', or 'passthrough'. In case there were no columns
- selected, this will be the unfitted transformer.
- If there are remaining columns, the final element is a tuple of the
- form:
- ('remainder', transformer, remaining_columns) corresponding to the
- ``remainder`` parameter. If there are remaining columns, then
- ``len(transformers_)==len(transformers)+1``, otherwise
- ``len(transformers_)==len(transformers)``.
- named_transformers_ : :class:`~sklearn.utils.Bunch`
- Read-only attribute to access any transformer by given name.
- Keys are transformer names and values are the fitted transformer
- objects.
- sparse_output_ : bool
- Boolean flag indicating whether the output of ``transform`` is a
- sparse matrix or a dense numpy array, which depends on the output
- of the individual transformers and the `sparse_threshold` keyword.
- output_indices_ : dict
- A dictionary from each transformer name to a slice, where the slice
- corresponds to indices in the transformed output. This is useful to
- inspect which transformer is responsible for which transformed
- feature(s).
- .. versionadded:: 1.0
- n_features_in_ : int
- Number of features seen during :term:`fit`. Only defined if the
- underlying transformers expose such an attribute when fit.
- .. versionadded:: 0.24
- feature_names_in_ : ndarray of shape (`n_features_in_`,)
- Names of features seen during :term:`fit`. Defined only when `X`
- has feature names that are all strings.
- .. versionadded:: 1.0
- See Also
- --------
- make_column_transformer : Convenience function for
- combining the outputs of multiple transformer objects applied to
- column subsets of the original feature space.
- make_column_selector : Convenience function for selecting
- columns based on datatype or on column names matched with a regex pattern.
- Notes
- -----
- The order of the columns in the transformed feature matrix follows the
- order in which the columns are specified in the `transformers` list.
- Columns of the original feature matrix that are not specified are
- dropped from the resulting transformed feature matrix, unless the
- `remainder` parameter is set to ``'passthrough'`` or to an estimator.
- The columns handled by `remainder` are added at the right of the
- output of the transformers.
- Examples
- --------
- >>> import numpy as np
- >>> from sklearn.compose import ColumnTransformer
- >>> from sklearn.preprocessing import Normalizer
- >>> ct = ColumnTransformer(
- ... [("norm1", Normalizer(norm='l1'), [0, 1]),
- ... ("norm2", Normalizer(norm='l1'), slice(2, 4))])
- >>> X = np.array([[0., 1., 2., 2.],
- ... [1., 1., 0., 1.]])
- >>> # Normalizer scales each row of X to unit norm. A separate scaling
- >>> # is applied for the two first and two last elements of each
- >>> # row independently.
- >>> ct.fit_transform(X)
- array([[0. , 1. , 0.5, 0.5],
- [0.5, 0.5, 0. , 1. ]])
- :class:`ColumnTransformer` can be configured with a transformer that requires
- a 1d array by setting the column to a string:
- >>> from sklearn.feature_extraction import FeatureHasher
- >>> from sklearn.preprocessing import MinMaxScaler
- >>> import pandas as pd # doctest: +SKIP
- >>> X = pd.DataFrame({
- ... "documents": ["First item", "second one here", "Is this the last?"],
- ... "width": [3, 4, 5],
- ... }) # doctest: +SKIP
- >>> # "documents" is a string which configures ColumnTransformer to
- >>> # pass the documents column as a 1d array to the FeatureHasher
- >>> ct = ColumnTransformer(
- ... [("text_preprocess", FeatureHasher(input_type="string"), "documents"),
- ... ("num_preprocess", MinMaxScaler(), ["width"])])
- >>> X_trans = ct.fit_transform(X) # doctest: +SKIP
- For a more detailed example of usage, see
- :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`.
- """
- _required_parameters = ["transformers"]
- _parameter_constraints: dict = {
- "transformers": [list, Hidden(tuple)],
- "remainder": [
- StrOptions({"drop", "passthrough"}),
- HasMethods(["fit", "transform"]),
- HasMethods(["fit_transform", "transform"]),
- ],
- "sparse_threshold": [Interval(Real, 0, 1, closed="both")],
- "n_jobs": [Integral, None],
- "transformer_weights": [dict, None],
- "verbose": ["verbose"],
- "verbose_feature_names_out": ["boolean"],
- }
- def __init__(
- self,
- transformers,
- *,
- remainder="drop",
- sparse_threshold=0.3,
- n_jobs=None,
- transformer_weights=None,
- verbose=False,
- verbose_feature_names_out=True,
- ):
- self.transformers = transformers
- self.remainder = remainder
- self.sparse_threshold = sparse_threshold
- self.n_jobs = n_jobs
- self.transformer_weights = transformer_weights
- self.verbose = verbose
- self.verbose_feature_names_out = verbose_feature_names_out
- @property
- def _transformers(self):
- """
- Internal list of transformers containing only the names and the
- transformers, dropping the columns. This is for the implementation
- of get_params via _BaseComposition._get_params, which expects lists
- of tuples of length 2.
- """
- try:
- return [(name, trans) for name, trans, _ in self.transformers]
- except (TypeError, ValueError):
- return self.transformers
- @_transformers.setter
- def _transformers(self, value):
- try:
- self.transformers = [
- (name, trans, col)
- for ((name, trans), (_, _, col)) in zip(value, self.transformers)
- ]
- except (TypeError, ValueError):
- self.transformers = value
- def set_output(self, *, transform=None):
- """Set the output container when `"transform"` and `"fit_transform"` are called.
- Calling `set_output` will set the output of all estimators in `transformers`
- and `transformers_`.
- Parameters
- ----------
- transform : {"default", "pandas"}, default=None
- Configure output of `transform` and `fit_transform`.
- - `"default"`: Default output format of a transformer
- - `"pandas"`: DataFrame output
- - `None`: Transform configuration is unchanged
- Returns
- -------
- self : estimator instance
- Estimator instance.
- """
- super().set_output(transform=transform)
- transformers = (
- trans
- for _, trans, _ in chain(
- self.transformers, getattr(self, "transformers_", [])
- )
- if trans not in {"passthrough", "drop"}
- )
- for trans in transformers:
- _safe_set_output(trans, transform=transform)
- if self.remainder not in {"passthrough", "drop"}:
- _safe_set_output(self.remainder, transform=transform)
- return self
- def get_params(self, deep=True):
- """Get parameters for this estimator.
- Returns the parameters given in the constructor as well as the
- estimators contained within the `transformers` of the
- `ColumnTransformer`.
- Parameters
- ----------
- deep : bool, default=True
- If True, will return the parameters for this estimator and
- contained subobjects that are estimators.
- Returns
- -------
- params : dict
- Parameter names mapped to their values.
- """
- return self._get_params("_transformers", deep=deep)
- def set_params(self, **kwargs):
- """Set the parameters of this estimator.
- Valid parameter keys can be listed with ``get_params()``. Note that you
- can directly set the parameters of the estimators contained in
- `transformers` of `ColumnTransformer`.
- Parameters
- ----------
- **kwargs : dict
- Estimator parameters.
- Returns
- -------
- self : ColumnTransformer
- This estimator.
- """
- self._set_params("_transformers", **kwargs)
- return self
- def _iter(self, fitted=False, replace_strings=False, column_as_strings=False):
- """
- Generate (name, trans, column, weight) tuples.
- If fitted=True, use the fitted transformers, else use the
- user specified transformers updated with converted column names
- and potentially appended with transformer for remainder.
- """
- if fitted:
- if replace_strings:
- # Replace "passthrough" with the fitted version in
- # _name_to_fitted_passthrough
- def replace_passthrough(name, trans, columns):
- if name not in self._name_to_fitted_passthrough:
- return name, trans, columns
- return name, self._name_to_fitted_passthrough[name], columns
- transformers = [
- replace_passthrough(*trans) for trans in self.transformers_
- ]
- else:
- transformers = self.transformers_
- else:
- # interleave the validated column specifiers
- transformers = [
- (name, trans, column)
- for (name, trans, _), column in zip(self.transformers, self._columns)
- ]
- # add transformer tuple for remainder
- if self._remainder[2]:
- transformers = chain(transformers, [self._remainder])
- get_weight = (self.transformer_weights or {}).get
- output_config = _get_output_config("transform", self)
- for name, trans, columns in transformers:
- if replace_strings:
- # replace 'passthrough' with identity transformer and
- # skip in case of 'drop'
- if trans == "passthrough":
- trans = FunctionTransformer(
- accept_sparse=True,
- check_inverse=False,
- feature_names_out="one-to-one",
- ).set_output(transform=output_config["dense"])
- elif trans == "drop":
- continue
- elif _is_empty_column_selection(columns):
- continue
- if column_as_strings:
- # Convert all columns to using their string labels
- columns_is_scalar = np.isscalar(columns)
- indices = self._transformer_to_input_indices[name]
- columns = self.feature_names_in_[indices]
- if columns_is_scalar:
- # selection is done with one dimension
- columns = columns[0]
- yield (name, trans, columns, get_weight(name))
- def _validate_transformers(self):
- if not self.transformers:
- return
- names, transformers, _ = zip(*self.transformers)
- # validate names
- self._validate_names(names)
- # validate estimators
- for t in transformers:
- if t in ("drop", "passthrough"):
- continue
- if not (hasattr(t, "fit") or hasattr(t, "fit_transform")) or not hasattr(
- t, "transform"
- ):
- # Used to validate the transformers in the `transformers` list
- raise TypeError(
- "All estimators should implement fit and "
- "transform, or can be 'drop' or 'passthrough' "
- "specifiers. '%s' (type %s) doesn't." % (t, type(t))
- )
- def _validate_column_callables(self, X):
- """
- Converts callable column specifications.
- """
- all_columns = []
- transformer_to_input_indices = {}
- for name, _, columns in self.transformers:
- if callable(columns):
- columns = columns(X)
- all_columns.append(columns)
- transformer_to_input_indices[name] = _get_column_indices(X, columns)
- self._columns = all_columns
- self._transformer_to_input_indices = transformer_to_input_indices
- def _validate_remainder(self, X):
- """
- Validates ``remainder`` and defines ``_remainder`` targeting
- the remaining columns.
- """
- self._n_features = X.shape[1]
- cols = set(chain(*self._transformer_to_input_indices.values()))
- remaining = sorted(set(range(self._n_features)) - cols)
- self._remainder = ("remainder", self.remainder, remaining)
- self._transformer_to_input_indices["remainder"] = remaining
- @property
- def named_transformers_(self):
- """Access the fitted transformer by name.
- Read-only attribute to access any transformer by given name.
- Keys are transformer names and values are the fitted transformer
- objects.
- """
- # Use Bunch object to improve autocomplete
- return Bunch(**{name: trans for name, trans, _ in self.transformers_})
- def _get_feature_name_out_for_transformer(
- self, name, trans, column, feature_names_in
- ):
- """Gets feature names of transformer.
- Used in conjunction with self._iter(fitted=True) in get_feature_names_out.
- """
- column_indices = self._transformer_to_input_indices[name]
- names = feature_names_in[column_indices]
- if trans == "drop" or _is_empty_column_selection(column):
- return
- elif trans == "passthrough":
- return names
- # An actual transformer
- if not hasattr(trans, "get_feature_names_out"):
- raise AttributeError(
- f"Transformer {name} (type {type(trans).__name__}) does "
- "not provide get_feature_names_out."
- )
- return trans.get_feature_names_out(names)
- def get_feature_names_out(self, input_features=None):
- """Get output feature names for transformation.
- Parameters
- ----------
- input_features : array-like of str or None, default=None
- Input features.
- - If `input_features` is `None`, then `feature_names_in_` is
- used as feature names in. If `feature_names_in_` is not defined,
- then the following input feature names are generated:
- `["x0", "x1", ..., "x(n_features_in_ - 1)"]`.
- - If `input_features` is an array-like, then `input_features` must
- match `feature_names_in_` if `feature_names_in_` is defined.
- Returns
- -------
- feature_names_out : ndarray of str objects
- Transformed feature names.
- """
- check_is_fitted(self)
- input_features = _check_feature_names_in(self, input_features)
- # List of tuples (name, feature_names_out)
- transformer_with_feature_names_out = []
- for name, trans, column, _ in self._iter(fitted=True):
- feature_names_out = self._get_feature_name_out_for_transformer(
- name, trans, column, input_features
- )
- if feature_names_out is None:
- continue
- transformer_with_feature_names_out.append((name, feature_names_out))
- if not transformer_with_feature_names_out:
- # No feature names
- return np.array([], dtype=object)
- return self._add_prefix_for_feature_names_out(
- transformer_with_feature_names_out
- )
- def _add_prefix_for_feature_names_out(self, transformer_with_feature_names_out):
- """Add prefix for feature names out that includes the transformer names.
- Parameters
- ----------
- transformer_with_feature_names_out : list of tuples of (str, array-like of str)
- Each tuple consists of the transformer's name and its feature names out.
- Returns
- -------
- feature_names_out : ndarray of shape (n_features,), dtype=str
- Transformed feature names.
- """
- if self.verbose_feature_names_out:
- # Prefix the feature names out with the transformer's name
- names = list(
- chain.from_iterable(
- (f"{name}__{i}" for i in feature_names_out)
- for name, feature_names_out in transformer_with_feature_names_out
- )
- )
- return np.asarray(names, dtype=object)
- # verbose_feature_names_out is False
- # Check that names are all unique without a prefix
- feature_names_count = Counter(
- chain.from_iterable(s for _, s in transformer_with_feature_names_out)
- )
- top_6_overlap = [
- name for name, count in feature_names_count.most_common(6) if count > 1
- ]
- top_6_overlap.sort()
- if top_6_overlap:
- if len(top_6_overlap) == 6:
- # There are more than 5 overlapping names, so we only show the
- # first 5 feature names
- names_repr = str(top_6_overlap[:5])[:-1] + ", ...]"
- else:
- names_repr = str(top_6_overlap)
- raise ValueError(
- f"Output feature names: {names_repr} are not unique. Please set "
- "verbose_feature_names_out=True to add prefixes to feature names"
- )
- return np.concatenate(
- [name for _, name in transformer_with_feature_names_out],
- )
- def _update_fitted_transformers(self, transformers):
- # transformers are fitted; excludes 'drop' cases
- fitted_transformers = iter(transformers)
- transformers_ = []
- self._name_to_fitted_passthrough = {}
- for name, old, column, _ in self._iter():
- if old == "drop":
- trans = "drop"
- elif old == "passthrough":
- # FunctionTransformer is present in list of transformers,
- # so get next transformer, but save original string
- func_transformer = next(fitted_transformers)
- trans = "passthrough"
- # The fitted FunctionTransformer is saved in another attribute,
- # so it can be used during transform for set_output.
- self._name_to_fitted_passthrough[name] = func_transformer
- elif _is_empty_column_selection(column):
- trans = old
- else:
- trans = next(fitted_transformers)
- transformers_.append((name, trans, column))
- # sanity check that transformers is exhausted
- assert not list(fitted_transformers)
- self.transformers_ = transformers_
- def _validate_output(self, result):
- """
- Ensure that the output of each transformer is 2D. Otherwise
- hstack can raise an error or produce incorrect results.
- """
- names = [
- name for name, _, _, _ in self._iter(fitted=True, replace_strings=True)
- ]
- for Xs, name in zip(result, names):
- if not getattr(Xs, "ndim", 0) == 2:
- raise ValueError(
- "The output of the '{0}' transformer should be 2D (scipy "
- "matrix, array, or pandas DataFrame).".format(name)
- )
- def _record_output_indices(self, Xs):
- """
- Record which transformer produced which column.
- """
- idx = 0
- self.output_indices_ = {}
- for transformer_idx, (name, _, _, _) in enumerate(
- self._iter(fitted=True, replace_strings=True)
- ):
- n_columns = Xs[transformer_idx].shape[1]
- self.output_indices_[name] = slice(idx, idx + n_columns)
- idx += n_columns
- # `_iter` only generates transformers that have a non empty
- # selection. Here we set empty slices for transformers that
- # generate no output, which are safe for indexing
- all_names = [t[0] for t in self.transformers] + ["remainder"]
- for name in all_names:
- if name not in self.output_indices_:
- self.output_indices_[name] = slice(0, 0)
- def _log_message(self, name, idx, total):
- if not self.verbose:
- return None
- return "(%d of %d) Processing %s" % (idx, total, name)
- def _fit_transform(self, X, y, func, fitted=False, column_as_strings=False):
- """
- Private function to fit and/or transform on demand.
- Return value (transformers and/or transformed X data) depends
- on the passed function.
- ``fitted=True`` ensures the fitted transformers are used.
- """
- transformers = list(
- self._iter(
- fitted=fitted, replace_strings=True, column_as_strings=column_as_strings
- )
- )
- try:
- return Parallel(n_jobs=self.n_jobs)(
- delayed(func)(
- transformer=clone(trans) if not fitted else trans,
- X=_safe_indexing(X, column, axis=1),
- y=y,
- weight=weight,
- message_clsname="ColumnTransformer",
- message=self._log_message(name, idx, len(transformers)),
- )
- for idx, (name, trans, column, weight) in enumerate(transformers, 1)
- )
- except ValueError as e:
- if "Expected 2D array, got 1D array instead" in str(e):
- raise ValueError(_ERR_MSG_1DCOLUMN) from e
- else:
- raise
- def fit(self, X, y=None):
- """Fit all transformers using X.
- Parameters
- ----------
- X : {array-like, dataframe} of shape (n_samples, n_features)
- Input data, of which specified subsets are used to fit the
- transformers.
- y : array-like of shape (n_samples,...), default=None
- Targets for supervised learning.
- Returns
- -------
- self : ColumnTransformer
- This estimator.
- """
- # we use fit_transform to make sure to set sparse_output_ (for which we
- # need the transformed data) to have consistent output type in transform
- self.fit_transform(X, y=y)
- return self
- @_fit_context(
- # estimators in ColumnTransformer.transformers are not validated yet
- prefer_skip_nested_validation=False
- )
- def fit_transform(self, X, y=None):
- """Fit all transformers, transform the data and concatenate results.
- Parameters
- ----------
- X : {array-like, dataframe} of shape (n_samples, n_features)
- Input data, of which specified subsets are used to fit the
- transformers.
- y : array-like of shape (n_samples,), default=None
- Targets for supervised learning.
- Returns
- -------
- X_t : {array-like, sparse matrix} of \
- shape (n_samples, sum_n_components)
- Horizontally stacked results of transformers. sum_n_components is the
- sum of n_components (output dimension) over transformers. If
- any result is a sparse matrix, everything will be converted to
- sparse matrices.
- """
- self._check_feature_names(X, reset=True)
- X = _check_X(X)
- # set n_features_in_ attribute
- self._check_n_features(X, reset=True)
- self._validate_transformers()
- self._validate_column_callables(X)
- self._validate_remainder(X)
- result = self._fit_transform(X, y, _fit_transform_one)
- if not result:
- self._update_fitted_transformers([])
- # No transformer produced output (all dropped or empty selections)
- return np.zeros((X.shape[0], 0))
- Xs, transformers = zip(*result)
- # determine if concatenated output will be sparse or not
- if any(sparse.issparse(X) for X in Xs):
- nnz = sum(X.nnz if sparse.issparse(X) else X.size for X in Xs)
- total = sum(
- X.shape[0] * X.shape[1] if sparse.issparse(X) else X.size for X in Xs
- )
- density = nnz / total
- self.sparse_output_ = density < self.sparse_threshold
- else:
- self.sparse_output_ = False
- self._update_fitted_transformers(transformers)
- self._validate_output(Xs)
- self._record_output_indices(Xs)
- return self._hstack(list(Xs))
- def transform(self, X):
- """Transform X separately by each transformer, concatenate results.
- Parameters
- ----------
- X : {array-like, dataframe} of shape (n_samples, n_features)
- The data to be transformed by subset.
- Returns
- -------
- X_t : {array-like, sparse matrix} of \
- shape (n_samples, sum_n_components)
- Horizontally stacked results of transformers. sum_n_components is the
- sum of n_components (output dimension) over transformers. If
- any result is a sparse matrix, everything will be converted to
- sparse matrices.
- """
- check_is_fitted(self)
- X = _check_X(X)
- fit_dataframe_and_transform_dataframe = hasattr(
- self, "feature_names_in_"
- ) and hasattr(X, "columns")
- if fit_dataframe_and_transform_dataframe:
- named_transformers = self.named_transformers_
- # check that all names seen in fit are in transform, unless
- # they were dropped
- non_dropped_indices = [
- ind
- for name, ind in self._transformer_to_input_indices.items()
- if name in named_transformers
- and isinstance(named_transformers[name], str)
- and named_transformers[name] != "drop"
- ]
- all_indices = set(chain(*non_dropped_indices))
- all_names = set(self.feature_names_in_[ind] for ind in all_indices)
- diff = all_names - set(X.columns)
- if diff:
- raise ValueError(f"columns are missing: {diff}")
- else:
- # ndarray was used for fitting or transforming, thus we only
- # check that n_features_in_ is consistent
- self._check_n_features(X, reset=False)
- Xs = self._fit_transform(
- X,
- None,
- _transform_one,
- fitted=True,
- column_as_strings=fit_dataframe_and_transform_dataframe,
- )
- self._validate_output(Xs)
- if not Xs:
- # No transformer produced output (all dropped or empty selections)
- return np.zeros((X.shape[0], 0))
- return self._hstack(list(Xs))
- def _hstack(self, Xs):
- """Stacks Xs horizontally.
- This allows subclasses to control the stacking behavior, while reusing
- everything else from ColumnTransformer.
- Parameters
- ----------
- Xs : list of {array-like, sparse matrix, dataframe}
- """
- if self.sparse_output_:
- try:
- # since all columns should be numeric before stacking them
- # in a sparse matrix, `check_array` is used for the
- # dtype conversion if necessary.
- converted_Xs = [
- check_array(X, accept_sparse=True, force_all_finite=False)
- for X in Xs
- ]
- except ValueError as e:
- raise ValueError(
- "For a sparse output, all columns should "
- "be a numeric or convertible to a numeric."
- ) from e
- return sparse.hstack(converted_Xs).tocsr()
- else:
- Xs = [f.toarray() if sparse.issparse(f) else f for f in Xs]
- config = _get_output_config("transform", self)
- if config["dense"] == "pandas" and all(hasattr(X, "iloc") for X in Xs):
- pd = check_pandas_support("transform")
- output = pd.concat(Xs, axis=1)
- output_samples = output.shape[0]
- if any(_num_samples(X) != output_samples for X in Xs):
- raise ValueError(
- "Concatenating DataFrames from the transformer's output lead to"
- " an inconsistent number of samples. The output may have Pandas"
- " Indexes that do not match."
- )
- # If all transformers define `get_feature_names_out`, then transform
- # will adjust the column names to be consistent with
- # verbose_feature_names_out. Here we prefix the feature names if
- # verbose_feature_names_out=True.
- if not self.verbose_feature_names_out:
- return output
- transformer_names = [
- t[0] for t in self._iter(fitted=True, replace_strings=True)
- ]
- # Selection of columns might be empty.
- # Hence feature names are filtered for non-emptiness.
- feature_names_outs = [X.columns for X in Xs if X.shape[1] != 0]
- names_out = self._add_prefix_for_feature_names_out(
- list(zip(transformer_names, feature_names_outs))
- )
- output.columns = names_out
- return output
- return np.hstack(Xs)
- def _sk_visual_block_(self):
- if isinstance(self.remainder, str) and self.remainder == "drop":
- transformers = self.transformers
- elif hasattr(self, "_remainder"):
- remainder_columns = self._remainder[2]
- if (
- hasattr(self, "feature_names_in_")
- and remainder_columns
- and not all(isinstance(col, str) for col in remainder_columns)
- ):
- remainder_columns = self.feature_names_in_[remainder_columns].tolist()
- transformers = chain(
- self.transformers, [("remainder", self.remainder, remainder_columns)]
- )
- else:
- transformers = chain(self.transformers, [("remainder", self.remainder, "")])
- names, transformers, name_details = zip(*transformers)
- return _VisualBlock(
- "parallel", transformers, names=names, name_details=name_details
- )
- def _check_X(X):
- """Use check_array only on inputs that are neither array-like nor sparse."""
- if hasattr(X, "__array__") or sparse.issparse(X):
- return X
- return check_array(X, force_all_finite="allow-nan", dtype=object)
- def _is_empty_column_selection(column):
- """
- Return True if the column selection is empty (empty list or all-False
- boolean array).
- """
- if hasattr(column, "dtype") and np.issubdtype(column.dtype, np.bool_):
- return not column.any()
- elif hasattr(column, "__len__"):
- return (
- len(column) == 0
- or all(isinstance(col, bool) for col in column)
- and not any(column)
- )
- else:
- return False
- def _get_transformer_list(estimators):
- """
- Construct (name, transformer, columns) tuples from a list of
- (transformer, columns) tuples.
- """
- transformers, columns = zip(*estimators)
- names, _ = zip(*_name_estimators(transformers))
- transformer_list = list(zip(names, transformers, columns))
- return transformer_list
- # This function is not validated using validate_params because
- # it's just a factory for ColumnTransformer.
- def make_column_transformer(
- *transformers,
- remainder="drop",
- sparse_threshold=0.3,
- n_jobs=None,
- verbose=False,
- verbose_feature_names_out=True,
- ):
- """Construct a ColumnTransformer from the given transformers.
- This is a shorthand for the ColumnTransformer constructor; it does not
- require, and does not permit, naming the transformers. Instead, they will
- be given names automatically based on their types. It also does not allow
- weighting with ``transformer_weights``.
- Read more in the :ref:`User Guide <make_column_transformer>`.
- Parameters
- ----------
- *transformers : tuples
- Tuples of the form (transformer, columns) specifying the
- transformer objects to be applied to subsets of the data.
- transformer : {'drop', 'passthrough'} or estimator
- Estimator must support :term:`fit` and :term:`transform`.
- Special-cased strings 'drop' and 'passthrough' are accepted as
- well, to indicate to drop the columns or to pass them through
- untransformed, respectively.
- columns : str, array-like of str, int, array-like of int, slice, \
- array-like of bool or callable
- Indexes the data on its second axis. Integers are interpreted as
- positional columns, while strings can reference DataFrame columns
- by name. A scalar string or int should be used where
- ``transformer`` expects X to be a 1d array-like (vector),
- otherwise a 2d array will be passed to the transformer.
- A callable is passed the input data `X` and can return any of the
- above. To select multiple columns by name or dtype, you can use
- :obj:`make_column_selector`.
- remainder : {'drop', 'passthrough'} or estimator, default='drop'
- By default (``'drop'``), only the columns specified in `transformers`
- are transformed and combined in the output, and the non-specified
- columns are dropped.
- By specifying ``remainder='passthrough'``, all remaining columns that
- were not specified in `transformers` will be automatically passed
- through. This subset of columns is concatenated with the output of
- the transformers.
- By setting ``remainder`` to be an estimator, the remaining
- non-specified columns will use the ``remainder`` estimator. The
- estimator must support :term:`fit` and :term:`transform`.
- sparse_threshold : float, default=0.3
- If the transformed output consists of a mix of sparse and dense data,
- it will be stacked as a sparse matrix if the density is lower than this
- value. Use ``sparse_threshold=0`` to always return dense.
- When the transformed output consists of all sparse or all dense data,
- the stacked result will be sparse or dense, respectively, and this
- keyword will be ignored.
- n_jobs : int, default=None
- Number of jobs to run in parallel.
- ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
- ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
- for more details.
- verbose : bool, default=False
- If True, the time elapsed while fitting each transformer will be
- printed as it is completed.
- verbose_feature_names_out : bool, default=True
- If True, :meth:`ColumnTransformer.get_feature_names_out` will prefix
- all feature names with the name of the transformer that generated that
- feature.
- If False, :meth:`ColumnTransformer.get_feature_names_out` will not
- prefix any feature names and will error if feature names are not
- unique.
- .. versionadded:: 1.0
- Returns
- -------
- ct : ColumnTransformer
- Returns a :class:`ColumnTransformer` object.
- See Also
- --------
- ColumnTransformer : Class that allows combining the
- outputs of multiple transformer objects used on column subsets
- of the data into a single feature space.
- Examples
- --------
- >>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
- >>> from sklearn.compose import make_column_transformer
- >>> make_column_transformer(
- ... (StandardScaler(), ['numerical_column']),
- ... (OneHotEncoder(), ['categorical_column']))
- ColumnTransformer(transformers=[('standardscaler', StandardScaler(...),
- ['numerical_column']),
- ('onehotencoder', OneHotEncoder(...),
- ['categorical_column'])])
- """
- # transformer_weights keyword is not passed through because the user
- # would need to know the automatically generated names of the transformers
- transformer_list = _get_transformer_list(transformers)
- return ColumnTransformer(
- transformer_list,
- n_jobs=n_jobs,
- remainder=remainder,
- sparse_threshold=sparse_threshold,
- verbose=verbose,
- verbose_feature_names_out=verbose_feature_names_out,
- )
- class make_column_selector:
- """Create a callable to select columns to be used with
- :class:`ColumnTransformer`.
- :func:`make_column_selector` can select columns based on datatype or the
- column names with a regex. When using multiple selection criteria, **all**
- criteria must match for a column to be selected.
- For an example of how to use :func:`make_column_selector` within a
- :class:`ColumnTransformer` to select columns based on data type (i.e.
- `dtype`), refer to
- :ref:`sphx_glr_auto_examples_compose_plot_column_transformer_mixed_types.py`.
- Parameters
- ----------
- pattern : str, default=None
- Columns whose names contain this regex pattern will be included. If
- None, columns will not be selected based on pattern.
- dtype_include : column dtype or list of column dtypes, default=None
- A selection of dtypes to include. For more details, see
- :meth:`pandas.DataFrame.select_dtypes`.
- dtype_exclude : column dtype or list of column dtypes, default=None
- A selection of dtypes to exclude. For more details, see
- :meth:`pandas.DataFrame.select_dtypes`.
- Returns
- -------
- selector : callable
- Callable for column selection to be used by a
- :class:`ColumnTransformer`.
- See Also
- --------
- ColumnTransformer : Class that allows combining the
- outputs of multiple transformer objects used on column subsets
- of the data into a single feature space.
- Examples
- --------
- >>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
- >>> from sklearn.compose import make_column_transformer
- >>> from sklearn.compose import make_column_selector
- >>> import numpy as np
- >>> import pandas as pd # doctest: +SKIP
- >>> X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
- ... 'rating': [5, 3, 4, 5]}) # doctest: +SKIP
- >>> ct = make_column_transformer(
- ... (StandardScaler(),
- ... make_column_selector(dtype_include=np.number)), # rating
- ... (OneHotEncoder(),
- ... make_column_selector(dtype_include=object))) # city
- >>> ct.fit_transform(X) # doctest: +SKIP
- array([[ 0.90453403, 1. , 0. , 0. ],
- [-1.50755672, 1. , 0. , 0. ],
- [-0.30151134, 0. , 1. , 0. ],
- [ 0.90453403, 0. , 0. , 1. ]])
- """
- def __init__(self, pattern=None, *, dtype_include=None, dtype_exclude=None):
- self.pattern = pattern
- self.dtype_include = dtype_include
- self.dtype_exclude = dtype_exclude
- def __call__(self, df):
- """Callable for column selection to be used by a
- :class:`ColumnTransformer`.
- Parameters
- ----------
- df : dataframe of shape (n_samples, n_features)
- DataFrame to select columns from.
- """
- if not hasattr(df, "iloc"):
- raise ValueError(
- "make_column_selector can only be applied to pandas dataframes"
- )
- df_row = df.iloc[:1]
- if self.dtype_include is not None or self.dtype_exclude is not None:
- df_row = df_row.select_dtypes(
- include=self.dtype_include, exclude=self.dtype_exclude
- )
- cols = df_row.columns
- if self.pattern is not None:
- cols = cols[cols.str.contains(self.pattern, regex=True)]
- return cols.tolist()
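
The `make_column_transformer` docstring above says transformers are "given names automatically based on their types", and `_get_transformer_list` delegates that step to `_name_estimators`, which is not shown in this section. The following is a minimal standard-library sketch of how such naming could work. The helper `name_estimators`, its suffixing scheme for duplicate types, and the placeholder classes `Scaler` and `Encoder` are assumptions made for illustration, not the library's implementation.

```python
from collections import Counter


def name_estimators(estimators):
    """Hypothetical helper: derive a unique name for each estimator from its type.

    Strings such as 'drop' or 'passthrough' are used as their own name.
    Names that occur more than once get a numeric suffix (an assumed scheme).
    """
    names = [
        est if isinstance(est, str) else type(est).__name__.lower()
        for est in estimators
    ]
    totals = Counter(names)  # how often each base name occurs overall
    seen = Counter()         # how often each base name has occurred so far
    unique_names = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique_names.append(f"{name}-{seen[name]}")
        else:
            unique_names.append(name)
    return list(zip(unique_names, estimators))


def get_transformer_list(estimators):
    """Build (name, transformer, columns) tuples from (transformer, columns) pairs."""
    transformers, columns = zip(*estimators)
    names, _ = zip(*name_estimators(transformers))
    return list(zip(names, transformers, columns))


# Placeholder classes standing in for real transformers.
class Scaler:
    pass


class Encoder:
    pass


pairs = [(Scaler(), ["age"]), (Encoder(), ["city"]), (Scaler(), ["income"])]
named = get_transformer_list(pairs)
names = [name for name, _, _ in named]
print(names)  # ['scaler-1', 'encoder', 'scaler-2']
```

Because the names are derived rather than chosen by the caller, this also illustrates why the factory does not forward `transformer_weights`: the caller would need to know the generated names in advance.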