Data Processing Using Multiple Transforms
The transformation framework is designed to operate on arbitrary arrays of bytes, with a specialization for columnar data, generally contained in parquet files and read as pyarrow tables.
In general, transforms can be written to process any type of binary data, whose interpretation is left to the transform itself.
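The shape of such a transform can be sketched as follows. The class and method names here are illustrative assumptions, not the framework's actual API: a generic transform maps bytes to bytes, while the columnar specialization decodes parquet bytes into a pyarrow table, lets subclasses work at the table level, and re-encodes the result.

```python
import io
from typing import Any

import pyarrow as pa
import pyarrow.parquet as pq


class BinaryTransform:
    """Hypothetical base: maps input bytes to output bytes plus metadata."""

    def transform_binary(self, data: bytes) -> tuple[bytes, dict[str, Any]]:
        raise NotImplementedError


class TableTransform(BinaryTransform):
    """Columnar specialization: subclasses implement transform() on tables."""

    def transform_binary(self, data: bytes) -> tuple[bytes, dict[str, Any]]:
        # Decode parquet bytes to a pyarrow Table, transform, re-encode.
        table = pq.read_table(io.BytesIO(data))
        out_table, metadata = self.transform(table)
        sink = io.BytesIO()
        pq.write_table(out_table, sink)
        return sink.getvalue(), metadata

    def transform(self, table: pa.Table) -> tuple[pa.Table, dict[str, Any]]:
        raise NotImplementedError
```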
The table-specific transforms are written to process a table to, for example:
- Remove duplicate or unsupported documents, for example via exact and fuzzy deduplication, or language selection
- Transform files, for example by coalescing or splitting them, or by adding/removing columns
- Annotate the tables with additional data such as a document quality score, language, etc.
- Derive statistics on data in one or more columns
- Filter the table to remove or edit rows and/or columns, for example to remove rows originating from blocked domains (see the sketch after this list).
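As an example of the filtering case above, here is a minimal sketch of a table-level filter that drops rows whose URL falls in a blocked-domain list. The `url` column name and the blocklist contents are assumptions for illustration:

```python
from urllib.parse import urlparse

import pyarrow as pa

# Hypothetical blocklist; in practice this would come from configuration.
BLOCKED_DOMAINS = {"spam.example.com", "ads.example.org"}


def filter_blocked_domains(table: pa.Table, url_column: str = "url") -> pa.Table:
    # Extract the host of each row's URL and keep rows whose host is allowed.
    # (Python-level loop for clarity; a real transform might vectorize this.)
    hosts = (urlparse(u or "").netloc for u in table[url_column].to_pylist())
    keep = pa.array([h not in BLOCKED_DOMAINS for h in hosts])
    return table.filter(keep)
```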
The table is generally expected to have something like the following minimal set of columns:
- URL source of the document (can be used for domain blocklisting)
- Document id
- Contents of the actual document to be used for LLM training
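For illustration, a table with this minimal column set might be constructed as follows. The column names `url`, `document_id`, and `contents` are assumptions; actual names vary by pipeline:

```python
import pyarrow as pa

# Illustrative minimal schema: source URL, document id, document contents.
table = pa.table({
    "url": ["https://example.com/doc1", "https://example.com/doc2"],
    "document_id": ["doc-0001", "doc-0002"],
    "contents": ["First document text...", "Second document text..."],
})
```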
An example sequence of transforms, applied to tables (though the model is equivalent for binary data transformations), might be: exact deduplication, then fuzzy deduplication, then language selection, then quality annotation, and finally filtering.
The ordering of the transforms can change depending on the requirements on the training data.
Each transform is generally run on a set of input files to produce a set of output files. Transforms are then sequenced/stitched together, each accepting the completed set of output files from the previous transform in the sequence, as sketched below.
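A minimal sketch of this sequencing model, assuming each stage reads every parquet file in an input directory and writes its output to a directory that the next stage then consumes. The paths, stage names, and toy dedup function are hypothetical, and `filter_blocked_domains` is the sketch from earlier:

```python
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq


def drop_duplicate_ids(table: pa.Table) -> pa.Table:
    # Toy stand-in for exact dedup: keep the first row per document_id.
    seen: set = set()
    keep = []
    for doc_id in table["document_id"].to_pylist():
        keep.append(doc_id not in seen)
        seen.add(doc_id)
    return table.filter(pa.array(keep))


def run_stage(transform, input_dir: Path, output_dir: Path) -> None:
    # Apply one transform to the complete set of input files, producing
    # the complete set of output files for the next stage.
    output_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(input_dir.glob("*.parquet")):
        pq.write_table(transform(pq.read_table(path)), output_dir / path.name)


# Each stage consumes the previous stage's completed output.
current = Path("data/input")
for name, fn in [("dedup", drop_duplicate_ids), ("filter", filter_blocked_domains)]:
    out = Path(f"data/{name}_output")
    run_stage(fn, current, out)
    current = out
```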
Transforms are generally executed in one of the available "runtimes" with scalability provided by the Ray and Spark runtimes.
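The runtimes hide the orchestration details, but conceptually the per-file parallelism a Ray runtime provides can be sketched like this. The file list and the reuse of `filter_blocked_domains` are assumptions; this is not the framework's actual runtime code:

```python
import ray

ray.init()


@ray.remote
def process_file(in_path: str, out_path: str) -> str:
    # Each file becomes an independent Ray task, scheduled across the cluster.
    import pyarrow.parquet as pq
    pq.write_table(filter_blocked_domains(pq.read_table(in_path)), out_path)
    return out_path


# Hypothetical input/output file pairs.
pairs = [(f"in/part{i}.parquet", f"out/part{i}.parquet") for i in range(4)]
futures = [process_file.remote(i, o) for i, o in pairs]
ray.get(futures)  # wait for all files to be processed
```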