A Scala feature transformation library for data science and machine learning
APACHE-2.0 License
Featran, also known as Featran77 or F77 (get it?), is a Scala library for feature transformation. It aims to simplify the time-consuming task of feature engineering in data science and machine learning processes. It supports various collection types for feature extraction and various output formats for feature representation.
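Most feature transformation logic requires two steps: a global aggregation to summarize the data, followed by an element-wise mapping to transform each record. For example, min-max scaling first aggregates the global [min, max] of a numeric field and then rescales each value against that range, and one-hot encoding first collects the distinct labels and then maps each label to a binary vector.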
We can implement this in a naive way using reduce and map.
case class Point(score: Double, label: String)
val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))
val a = data
  .map(p => (p.score, p.score, Set(p.label)))
  .reduce((x, y) => (math.min(x._1, y._1), math.max(x._2, y._2), x._3 ++ y._3))
val features = data.map { p =>
  (p.score - a._1) / (a._2 - a._1) :: a._3.toList.sorted.map(s => if (s == p.label) 1.0 else 0.0)
}
But this quickly becomes unmanageable for complex feature sets. The same logic can be expressed concisely in Featran.
import com.spotify.featran._
import com.spotify.featran.transformers._
val fs = FeatureSpec.of[Point]
  .required(_.score)(MinMaxScaler("min-max"))
  .required(_.label)(OneHotEncoder("one-hot"))
val fe = fs.extract(data)
val names = fe.featureNames
val features = fe.featureValues[Seq[Double]]
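The two-step pattern described earlier (aggregate globally, then map each element) can also be sketched in plain Scala without any library. The `SimpleTransformer` trait and the `MinMax`, `OneHot`, and `run` names below are hypothetical and exist only for illustration; they are not Featran's API and make no claim about how Featran is implemented.

```scala
// Hypothetical sketch, not Featran's API: a transformer pairs a global
// aggregation (prepare + combine) with an element-wise mapping (transform).
object TwoStepSketch {
  trait SimpleTransformer[A, S] {
    def prepare(a: A): S                   // lift one input into a summary
    def combine(x: S, y: S): S             // merge two summaries
    def transform(a: A, s: S): Seq[Double] // map one input using the global summary
  }

  // Min-max scaling: the summary is the global (min, max) pair.
  object MinMax extends SimpleTransformer[Double, (Double, Double)] {
    def prepare(a: Double): (Double, Double) = (a, a)
    def combine(x: (Double, Double), y: (Double, Double)): (Double, Double) =
      (math.min(x._1, y._1), math.max(x._2, y._2))
    def transform(a: Double, s: (Double, Double)): Seq[Double] =
      Seq((a - s._1) / (s._2 - s._1))
  }

  // One-hot encoding: the summary is the set of distinct labels.
  object OneHot extends SimpleTransformer[String, Set[String]] {
    def prepare(a: String): Set[String] = Set(a)
    def combine(x: Set[String], y: Set[String]): Set[String] = x ++ y
    def transform(a: String, s: Set[String]): Seq[Double] =
      s.toSeq.sorted.map(l => if (l == a) 1.0 else 0.0)
  }

  // Step 1: reduce all inputs into one summary. Step 2: map every input with it.
  def run[T, A, S](data: Seq[T])(f: T => A)(t: SimpleTransformer[A, S]): Seq[Seq[Double]] = {
    val summary = data.map(x => t.prepare(f(x))).reduce((x, y) => t.combine(x, y))
    data.map(x => t.transform(f(x), summary))
  }

  case class Point(score: Double, label: String)
  val data = Seq(Point(1.0, "a"), Point(2.0, "b"), Point(3.0, "c"))

  // Concatenate the per-field outputs into one feature vector per record.
  val features: Seq[Seq[Double]] =
    run(data)(_.score)(MinMax).zip(run(data)(_.label)(OneHot)).map { case (x, y) => x ++ y }
}
```

In this sketch each field is aggregated in its own pass, which keeps the code short. The naive version above folds both summaries in a single reduce, which avoids traversing the data more than once.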
Featran also supports these additional features.
- Extraction from Scala collections, Flink DataSets, Scalding TypedPipes, Scio SCollections and Spark RDDs
- Output as Scala collections, Breeze dense and sparse vectors, TensorFlow Example Protobuf, XGBoost LabeledPoint and NumPy .npy file
See Examples (source) for detailed examples. See transformers package for a complete list of available feature transformers.
See ScalaDocs for current API documentation.
Featran includes the following artifacts:
- featran-core - core library, support for extraction from Scala collections and output as Scala collections, Breeze dense and sparse vectors
- featran-java - Java interface, see JavaExample.java
- featran-flink - support for extraction from Flink DataSet
- featran-scalding - support for extraction from Scalding TypedPipe
- featran-scio - support for extraction from Scio SCollection
- featran-spark - support for extraction from Spark RDD
- featran-tensorflow - support for output as TensorFlow Example Protobuf
- featran-xgboost - support for output as XGBoost LabeledPoint
- featran-numpy - support for output as NumPy .npy file
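As a rough sketch, the artifacts can be pulled into an sbt build along the following lines. The `com.spotify` group ID is an assumption inferred from the `com.spotify.featran` package name, and the version string is a placeholder; check the published artifacts for the actual coordinates and current release.

```scala
// build.sbt (sketch): group ID is assumed from the package name,
// and the version below is a placeholder, not a real release.
val featranVersion = "x.y.z"

libraryDependencies ++= Seq(
  "com.spotify" %% "featran-core" % featranVersion,
  // Add only the integrations you need, for example Spark RDD extraction:
  "com.spotify" %% "featran-spark" % featranVersion
)
```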
Copyright 2016-2017 Spotify AB.
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0