SageMaker provides algorithms tailored to analyzing text documents for natural language processing tasks: document classification or summarization, topic modeling or classification, and language transcription or translation.
- BlazingText algorithm—a highly optimized implementation of the Word2vec and text classification algorithms that scale to large datasets easily. It is useful for many downstream natural language processing (NLP) tasks.
- Latent Dirichlet Allocation (LDA) Algorithm—an algorithm suitable for determining topics in a set of documents. It is an unsupervised algorithm, which means that it doesn't use example data with answers during training.
- Neural Topic Model (NTM) Algorithm—another unsupervised technique for determining topics in a set of documents, using a neural network approach.
- Object2Vec Algorithm—a general-purpose neural embedding algorithm that can be used for recommendation systems, document classification, and sentence embeddings.
- Sequence-to-Sequence Algorithm—a supervised algorithm commonly used for neural machine translation.
- Text Classification - TensorFlow—a supervised algorithm that supports transfer learning with available pretrained models for text classification.
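The following table lists the input channels, training input modes, file types, and instance classes that each algorithm supports, and whether training can be parallelized.

Algorithm name | Channel name | Training input mode | File type | Instance class | Parallelizable |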
---|---|---|---|---|---|
BlazingText | train | File or Pipe | Text file (one sentence per line with space-separated tokens) | GPU (single instance only) or CPU | No |
LDA | train and (optionally) test | File or Pipe | recordIO-protobuf or CSV | CPU (single instance only) | No |
Neural Topic Model | train and (optionally) validation, test, or both | File or Pipe | recordIO-protobuf or CSV | GPU or CPU | Yes |
Object2Vec | train and (optionally) validation, test, or both | File | JSON Lines | GPU or CPU (single instance only) | No |
Sequence-to-Sequence (Seq2Seq) | train, validation, and vocab | File | recordIO-protobuf | GPU (single instance only) | No |
Text Classification - TensorFlow | training and validation | File | CSV | CPU or GPU | Yes (only across multiple GPUs on a single instance) |
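
The BlazingText row above calls for a text file with one sentence per line and space-separated tokens. The following is a minimal sketch of producing such a file with the Python standard library. The helper names `to_blazingtext_lines` and `write_training_file` are hypothetical, and the lowercasing and punctuation splitting are illustrative preprocessing choices, not requirements stated in the table.

```python
import re
from pathlib import Path


def to_blazingtext_lines(sentences):
    """Return one line per sentence, with tokens separated by single spaces."""
    lines = []
    for sentence in sentences:
        # Assumption: lowercase the text and split words from punctuation.
        # Any tokenizer works as long as tokens end up space-separated.
        tokens = re.findall(r"\w+|[^\w\s]", sentence.lower())
        if tokens:  # skip empty or whitespace-only sentences
            lines.append(" ".join(tokens))
    return lines


def write_training_file(sentences, path):
    """Write the formatted sentences as UTF-8 text; return the number of lines."""
    lines = to_blazingtext_lines(sentences)
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


if __name__ == "__main__":
    sample = ["SageMaker trains models.", "Topic models group documents!"]
    for line in to_blazingtext_lines(sample):
        print(line)
```

The resulting file would then be uploaded to the location that backs the `train` channel. The other algorithms in the table expect different formats (recordIO-protobuf, CSV, or JSON Lines), so this sketch applies to BlazingText only.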