parquet-go is a pure-go implementation of reading and writing the parquet format file.
Add the parquet-go library to your $GOPATH/src and install dependencies:
go get github.com/xitongsys/parquet-go
The example/
directory contains several examples.
The local_flat.go
example creates some data and writes it out to the example/output/flat.parquet
file.
cd $GOPATH/src/github.com/xitongsys/parquet-go/example
go run local_flat.go
The local_flat.go
code shows how it's easy to output structs
from Go programs to Parquet files.
There are two types in Parquet: Primitive Type and Logical Type. Logical types are stored as primitive types.
Primitive Type | Go Type |
---|---|
BOOLEAN | bool |
INT32 | int32 |
INT64 | int64 |
INT96(deprecated) | string |
FLOAT | float32 |
DOUBLE | float64 |
BYTE_ARRAY | string |
FIXED_LEN_BYTE_ARRAY | string |
Logical Type | Primitive Type | Go Type |
---|---|---|
UTF8 | BYTE_ARRAY | string |
INT_8 | INT32 | int32 |
INT_16 | INT32 | int32 |
INT_32 | INT32 | int32 |
INT_64 | INT64 | int64 |
UINT_8 | INT32 | int32 |
UINT_16 | INT32 | int32 |
UINT_32 | INT32 | int32 |
UINT_64 | INT64 | int64 |
DATE | INT32 | int32 |
TIME_MILLIS | INT32 | int32 |
TIME_MICROS | INT64 | int64 |
TIMESTAMP_MILLIS | INT64 | int64 |
TIMESTAMP_MICROS | INT64 | int64 |
INTERVAL | FIXED_LEN_BYTE_ARRAY | string |
DECIMAL | INT32,INT64,FIXED_LEN_BYTE_ARRAY,BYTE_ARRAY | int32,int64,string,string |
LIST | - | slice |
MAP | - | map |
Parquet-go supports type alias such type MyString string
. But the base type must follow the table instructions.
Some type convert functions: converter.go
All types
All types
INT32, INT64, INT_8, INT_16, INT_32, INT_64, UINT_8, UINT_16, UINT_32, UINT_64, TIME_MILLIS, TIME_MICROS, TIMESTAMP_MILLIS, TIMESTAMP_MICROS
BYTE_ARRAY, UTF8
BYTE_ARRAY, UTF8
omitstats=true
to a field tag.There are three repetition types in Parquet: REQUIRED, OPTIONAL, REPEATED.
Repetition Type | Example | Description |
---|---|---|
REQUIRED | V1 int32 `parquet:"name=v1, type=INT32"` |
No extra description |
OPTIONAL | V1 *int32 `parquet:"name=v1, type=INT32"` |
Declare as pointer |
REPEATED | V1 []int32 `parquet:"name=v1, type=INT32, repetitiontype=REPEATED"` |
Add 'repetitiontype=REPEATED' in tags |
Bool bool `parquet:"name=bool, type=BOOLEAN"`
Int32 int32 `parquet:"name=int32, type=INT32"`
Int64 int64 `parquet:"name=int64, type=INT64"`
Int96 string `parquet:"name=int96, type=INT96"`
Float float32 `parquet:"name=float, type=FLOAT"`
Double float64 `parquet:"name=double, type=DOUBLE"`
ByteArray string `parquet:"name=bytearray, type=BYTE_ARRAY"`
FixedLenByteArray string `parquet:"name=FixedLenByteArray, type=FIXED_LEN_BYTE_ARRAY, length=10"`
Utf8 string `parquet:"name=utf8, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
Int_8 int32 `parquet:"name=int_8, type=INT32, convertedtype=INT32, convertedtype=INT_8"`
Int_16 int32 `parquet:"name=int_16, type=INT32, convertedtype=INT_16"`
Int_32 int32 `parquet:"name=int_32, type=INT32, convertedtype=INT_32"`
Int_64 int64 `parquet:"name=int_64, type=INT64, convertedtype=INT_64"`
Uint_8 int32 `parquet:"name=uint_8, type=INT32, convertedtype=UINT_8"`
Uint_16 int32 `parquet:"name=uint_16, type=INT32, convertedtype=UINT_16"`
Uint_32 int32 `parquet:"name=uint_32, type=INT32, convertedtype=UINT_32"`
Uint_64 int64 `parquet:"name=uint_64, type=INT64, convertedtype=UINT_64"`
Date int32 `parquet:"name=date, type=INT32, convertedtype=DATE"`
Date2 int32 `parquet:"name=date2, type=INT32, convertedtype=DATE, logicaltype=DATE"`
TimeMillis int32 `parquet:"name=timemillis, type=INT32, convertedtype=TIME_MILLIS"`
TimeMillis2 int32 `parquet:"name=timemillis2, type=INT32, logicaltype=TIME, logicaltype.isadjustedtoutc=true, logicaltype.unit=MILLIS"`
TimeMicros int64 `parquet:"name=timemicros, type=INT64, convertedtype=TIME_MICROS"`
TimeMicros2 int64 `parquet:"name=timemicros2, type=INT64, logicaltype=TIME, logicaltype.isadjustedtoutc=false, logicaltype.unit=MICROS"`
TimestampMillis int64 `parquet:"name=timestampmillis, type=INT64, convertedtype=TIMESTAMP_MILLIS"`
TimestampMillis2 int64 `parquet:"name=timestampmillis2, type=INT64, logicaltype=TIMESTAMP, logicaltype.isadjustedtoutc=true, logicaltype.unit=MILLIS"`
TimestampMicros int64 `parquet:"name=timestampmicros, type=INT64, convertedtype=TIMESTAMP_MICROS"`
TimestampMicros2 int64 `parquet:"name=timestampmicros2, type=INT64, logicaltype=TIMESTAMP, logicaltype.isadjustedtoutc=false, logicaltype.unit=MICROS"`
Interval string `parquet:"name=interval, type=BYTE_ARRAY, convertedtype=INTERVAL"`
Decimal1 int32 `parquet:"name=decimal1, type=INT32, convertedtype=DECIMAL, scale=2, precision=9"`
Decimal2 int64 `parquet:"name=decimal2, type=INT64, convertedtype=DECIMAL, scale=2, precision=18"`
Decimal3 string `parquet:"name=decimal3, type=FIXED_LEN_BYTE_ARRAY, convertedtype=DECIMAL, scale=2, precision=10, length=12"`
Decimal4 string `parquet:"name=decimal4, type=BYTE_ARRAY, convertedtype=DECIMAL, scale=2, precision=20"`
Decimal5 int32 `parquet:"name=decimal5, type=INT32, logicaltype=DECIMAL, logicaltype.precision=10, logicaltype.scale=2"`
Map map[string]int32 `parquet:"name=map, type=MAP, convertedtype=MAP, keytype=BYTE_ARRAY, keyconvertedtype=UTF8, valuetype=INT32"`
List []string `parquet:"name=list, type=MAP, convertedtype=LIST, valuetype=BYTE_ARRAY, valueconvertedtype=UTF8"`
Repeated []int32 `parquet:"name=repeated, type=INT32, repetitiontype=REPEATED"`
Type | Support |
---|---|
CompressionCodec_UNCOMPRESSED | YES |
CompressionCodec_SNAPPY | YES |
CompressionCodec_GZIP | YES |
CompressionCodec_LZO | NO |
CompressionCodec_BROTLI | NO |
CompressionCodec_LZ4 | YES |
CompressionCodec_ZSTD | YES |
Read/Write a parquet file need a ParquetFile interface implemented
type ParquetFile interface {
io.Seeker
io.Reader
io.Writer
io.Closer
Open(name string) (ParquetFile, error)
Create(name string) (ParquetFile, error)
}
Using this interface, parquet-go can read/write parquet file on different platforms. All the file sources are at parquet-go-source. Now it supports(local/hdfs/s3/gcs/memory).
Four Writers are supported: ParquetWriter, JSONWriter, CSVWriter, ArrowWriter.
ParquetWriter is used to write predefined Golang structs. Example of ParquetWriter
JSONWriter is used to write JSON strings Example of JSONWriter
CSVWriter is used to write data format similar with CSV(not nested) Example of CSVWriter
ArrowWriter is used to write parquet files using Arrow Schemas Example of ArrowWriter
Two Readers are supported: ParquetReader, ColumnReader
ParquetReader is used to read predefined Golang structs Example of ParquetReader
ColumnReader is used to read raw column data. The read function return 3 slices([value], [RepetitionLevel], [DefinitionLevel]) of the records. Example of ColumnReader
If the parquet file is very big (even the size of parquet file is small, the uncompressed size may be very large), please don't read all rows at one time, which may induce the OOM. You can read a small portion of the data at a time like a stream-oriented file.
RowGroupSize
and PageSize
may influence the final parquet file size. You can find the details from here. You can reset them in ParquetWriter
pw.RowGroupSize = 128 * 1024 * 1024 // default 128M
pw.PageSize = 8 * 1024 // default 8K
There are four methods to define the schema: go struct tags, Json, CSV, Arrow metadata. Only items in schema will be written and others will be ignored.
type Student struct {
Name string `parquet:"name=name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY"`
Age int32 `parquet:"name=age, type=INT32, encoding=PLAIN"`
Id int64 `parquet:"name=id, type=INT64"`
Weight float32 `parquet:"name=weight, type=FLOAT"`
Sex bool `parquet:"name=sex, type=BOOLEAN"`
Day int32 `parquet:"name=day, type=INT32, convertedtype=DATE"`
Ignored int32 //without parquet tag and won't write
}
JSON schema can be used to define some complicated schema, which can't be defined by tag.
type Student struct {
NameIn string
Age int32
Id int64
Weight float32
Sex bool
Classes []string
Scores map[string][]float32
Ignored string
Friends []struct {
Name string
Id int64
}
Teachers []struct {
Name string
Id int64
}
}
var jsonSchema string = `
{
"Tag": "name=parquet_go_root, repetitiontype=REQUIRED",
"Fields": [
{"Tag": "name=name, inname=NameIn, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
{"Tag": "name=age, inname=Age, type=INT32, repetitiontype=REQUIRED"},
{"Tag": "name=id, inname=Id, type=INT64, repetitiontype=REQUIRED"},
{"Tag": "name=weight, inname=Weight, type=FLOAT, repetitiontype=REQUIRED"},
{"Tag": "name=sex, inname=Sex, type=BOOLEAN, repetitiontype=REQUIRED"},
{"Tag": "name=classes, inname=Classes, type=LIST, repetitiontype=REQUIRED",
"Fields": [{"Tag": "name=element, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"}]
},
{
"Tag": "name=scores, inname=Scores, type=MAP, repetitiontype=REQUIRED",
"Fields": [
{"Tag": "name=key, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
{"Tag": "name=value, type=LIST, repetitiontype=REQUIRED",
"Fields": [{"Tag": "name=element, type=FLOAT, repetitiontype=REQUIRED"}]
}
]
},
{
"Tag": "name=friends, inname=Friends, type=LIST, repetitiontype=REQUIRED",
"Fields": [
{"Tag": "name=element, repetitiontype=REQUIRED",
"Fields": [
{"Tag": "name=name, inname=Name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
{"Tag": "name=id, inname=Id, type=INT64, repetitiontype=REQUIRED"}
]}
]
},
{
"Tag": "name=teachers, inname=Teachers, repetitiontype=REPEATED",
"Fields": [
{"Tag": "name=name, inname=Name, type=BYTE_ARRAY, convertedtype=UTF8, repetitiontype=REQUIRED"},
{"Tag": "name=id, inname=Id, type=INT64, repetitiontype=REQUIRED"}
]
}
]
}
`
md := []string{
"name=Name, type=BYTE_ARRAY, convertedtype=UTF8, encoding=PLAIN_DICTIONARY",
"name=Age, type=INT32",
"name=Id, type=INT64",
"name=Weight, type=FLOAT",
"name=Sex, type=BOOLEAN",
}
schema := arrow.NewSchema(
[]arrow.Field{
{Name: "int64", Type: arrow.PrimitiveTypes.Int64},
{Name: "float64", Type: arrow.PrimitiveTypes.Float64},
{Name: "str", Type: arrow.BinaryTypes.String},
},
nil,
)
InName
. Field name in parquet file we call it ExName
. Function common.HeadToUpper
converts ExName
to InName
. There are some restriction:name
and Name
.PARGO_PREFIX_
is a reserved string, which you'd better not use it as a name prefix. (#294)\x01
as the delimiter of fields to support .
in some field name.(dot_in_name.go, #349)Marshal/Unmarshal is the most time consuming process in writing/reading. To improve the performance, parquet-go can use multiple goroutines to marshal/unmarshal the objects. You can set the concurrent number parameter np
in the Read/Write initial functions.
func NewParquetReader(pFile ParquetFile.ParquetFile, obj interface{}, np int64) (*ParquetReader, error)
func NewParquetWriter(pFile ParquetFile.ParquetFile, obj interface{}, np int64) (*ParquetWriter, error)
func NewJSONWriter(jsonSchema string, pfile ParquetFile.ParquetFile, np int64) (*JSONWriter, error)
func NewCSVWriter(md []string, pfile ParquetFile.ParquetFile, np int64) (*CSVWriter, error)
func NewArrowWriter(arrowSchema *arrow.Schema, pfile source.ParquetFile, np int64) (*ArrowWriter error)
Example file | Descriptions |
---|---|
local_flat.go | write/read parquet file with no nested struct |
local_nested.go | write/read parquet file with nested struct |
read_partial.go | read partial fields from a parquet file |
read_partial2.go | read sub-struct from a parquet file |
read_without_schema_predefined.go | read a parquet file and no struct/schema predefined needed |
read_partial_without_schema_predefined.go | read sub-struct from a parquet file and no struct/schema predefined needed |
json_schema.go | define schema using json string |
json_write.go | convert json to parquet |
convert_to_json.go | convert parquet to json |
csv_write.go | special csv writer |
column_read.go | read raw column data and return value,repetitionLevel,definitionLevel |
type.go | example for schema of types |
type_alias.go | example for type alias |
writer.go | create ParquetWriter from io.Writer |
keyvalue_metadata.go | write keyvalue metadata |
dot_in_name.go |
. in filed name |
arrow_to_parquet.go | write/read parquet file using arrow definition |
Please start to use it and give feedback or just star it! Help is needed and anything is welcome.