class: center, middle

# Natural Language Processing (NLP) and Deep Learning

## A simple introduction with Keras

### Based on a real dataset from [Kaggle](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)

#### Use case: Toxic comments classification

.right[**Thomas Chauvet**]
.right[Data Scientist]
.right[Twitter: [@ChauvetThomas](https://twitter.com/ChauvetThomas)]
.right[Linkedin: [in/thomaschauvet/](https://www.linkedin.com/in/thomaschauvet/en/)]

---

class: split-50

## First things first, what is Machine Learning?

.center[**Learn a representation from the data.**]

With ML, we give data to a program and it learns a model from that data to predict an output.

Based on: mathematics, statistics and computer science.

Mainly two kinds of ML: supervised and unsupervised.

.column[

**Supervised**

Model a target known in advance:

- discrete target -> classification
- continuous target -> regression

]

.column[
**Unsupervised**

Determine groups or "clusters" of data (individuals, objects, images, ...) defined by the algorithm without any *a priori* assumptions.

]
*Note: unsupervised learning is often referred to as clustering.*

---

## What is Deep Learning?

---

## What is NLP?

NLP stands for **Natural Language Processing**.

It lies at the intersection of several fields: mainly machine learning and linguistics.
.center[An interdisciplinary field focused on the interactions between computers and human natural languages.]

---

## Which use cases for NLP?

NLP is everywhere around us:
- Mail: detect spam,
- Translation: Google Translate, DeepL,
- Personal assistants: Siri, Alexa, Google, etc.,
- Search: Google Search, website search,
- Etc.
It also helps with text analysis:
- Sentiment analysis,
- Text summarization,
- Keyword extraction,
- Named entity recognition,
- Etc.
---

## Toxic comments classification: a Kaggle challenge

The objective is to **detect different categories of "toxic" comments** on Wikipedia forums.

- just under 100,000 comments in the train set.
- no metadata: only plain text.
- **6 labels to predict**: "toxic", "severely toxic", "threat", "obscene", "insults" and "identity hate".

.center[![proportion_each_label](proportion_each_label.png "proportion_each_label")]

---

# A multi-label classification problem

- Multi-label classification problem: *less common than the multi-class classification problem*.
- Each comment can **belong to one or more target classes**.
- Classes are not mutually exclusive.

.center[![class_proportion](class_proportion.png "class_proportion")]

*Note: we will try different neural networks on this dataset.*

---

# Text pre-processing: tokenization

Use **tokenization** to manipulate text with Keras.

Transform text into sequences, in order to have an **indexed list of words**.

**Example**

*"This is a sentence."*

Becomes a list of tokens: `["This", "is", "a", "sentence"]`.

Then becomes an indexed list: `[1, 2, 3, 4]`.

Keras code:

```python
from keras.preprocessing import text, sequence

tokenizer = text.Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(X_train))
list_tokenized_train = tokenizer.texts_to_sequences(X_train)
X_train = sequence.pad_sequences(list_tokenized_train, maxlen=maxlen)
```

*Note: you can also pad the sequences, i.e. add 0s at the beginning or end so that all sequences have the same size.*

---

# Text pre-processing: next steps

#### Term Frequencies (TF)

Count each word's occurrences in a document (*c*) and divide by the total number of words in the document (*n*): *tf = c / n*.

#### Term Frequencies – Inverse Document Frequencies (TF-IDF)

Weight *TF* in order to give more importance to less frequent words.

.center[![tfidf](tfidf.png "tfidf")]

#### .center[But we end up with an extremely sparse, very high-dimensional matrix]

.right[.caption[[*https://www.quentinfily.fr*](https://www.quentinfily.fr/tf-idf-pertinence-lexicale/)]]

---

class: split-30

# Text pre-processing: embedding

.column[
.center[![one_hot_encoded_sparse](one_hot_encoded_sparse.png "one_hot_encoded_sparse")]
.center[![word_embedding](word_embedding.png "word_embedding")]
]

.column[

**Objective: associate a dense vector with each word in the vocabulary.**

It can be learned from the training data:

- It is a layer of the neural network that is randomly initialized.
- Its weights are learned as the training progresses.
- Once the model has been trained, each word in the vocabulary is associated with a vector.

Or **use a pre-trained embedding** like GloVe, word2vec, etc., usually trained on a lot of data.

]

---

# Neural networks basics

.center[![nn_basic](nn_basic.png "nn_basic")]

.center[*Neural networks basic concepts.*]

---

# A first simple neural network

```python
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_features = 2000
embedding_size = 50
maxlen = 100

model = Sequential()
model.add(Embedding(max_features, embedding_size, input_length=maxlen))
model.add(Flatten())
model.add(Dense(6, activation='sigmoid'))

# Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Fit model
model.fit(
    X_train, y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_data=(X_val, y_val))
```

- `max_features`: size of the vocabulary.
- `maxlen`: number of words in the sequence/sentence (can be padded or truncated).

#### .center[As easy as 1, 2, 3!]
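
---

# Aside: scoring new comments (illustrative sketch)

A minimal sketch, not part of the original material (`new_comments` is just an illustrative name), of how the trained network could score unseen text, reusing the tokenizer and `maxlen` from the tokenization slide:

```python
# Hypothetical new comments to score (illustrative strings only).
new_comments = ["some example comment", "another example comment"]

# Same pre-processing as for training: tokenize, then pad to maxlen.
new_sequences = tokenizer.texts_to_sequences(new_comments)
new_padded = sequence.pad_sequences(new_sequences, maxlen=maxlen)

# One probability per label: shape (2, 6) here.
probabilities = model.predict(new_padded)
```

Because the output layer is a sigmoid, the 6 probabilities are independent: a comment can be flagged with several labels at once, which matches the multi-label setup described earlier.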
---

class: split-60

# A first simple neural network: without pre-trained embedding

.column[
.center[![nn_without_pretrained_embedding](nn_without_pretrained_embedding.png "nn_without_pretrained_embedding")].caption[*Simple neural network dimensions and layers.*]
]

.column[

A few remarks:

- Why a `Flatten` layer? `Embedding` returns a 3D tensor; `Flatten` turns it into a classic 2D matrix.
- Keras infers tensor dimensions across the layers of the network.
- `None` is displayed because we do not specify the number of samples (comments) entering the network. `None` could be the batch size or the size of the whole dataset.

]

---

# A first simple neural network: with pre-trained embedding

```python
# Define model
model = Sequential()
model.add(Embedding(max_features, embedding_matrix.shape[1],
                    input_length=maxlen, weights=[embedding_matrix]))
model.add(Flatten())
model.add(Dense(6, activation='sigmoid'))
```

We add `embedding_matrix` as the `weights` parameter. It contains the pre-trained vectors (from GloVe, word2vec, etc.) of the words in our vocabulary.

**Embedding matrix**: a matrix of size *(vocabulary size) x (embedding size)*. Each row is the projection of one word in the embedding space.

---

# Go deeper: "deep" neural network

We can easily **add layers** to the neural network to create a "deep" neural network:

```python
model = Sequential()
model.add(Embedding(max_features, embedding_matrix.shape[1],
                    input_length=maxlen, weights=[embedding_matrix]))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(6, activation='sigmoid'))
```

#### .center[Congrats! This is a deep neural network!]

---

# Summary: for now

- **Very easy to transform text** into a matrix to "feed" a neural network with tokenization.
- Possibility to **use pre-trained embeddings** to get a representation of words in a continuous space. Especially useful when the dataset is relatively small.
- A "simple" neural network with Keras only requires a **few lines of code**.
- Very easy to **stack layers** to make a deep neural network (welcome to the **Deep Learning** world!).

#### But...

- A Dense network has **no memory**.
- We want to process sequences: therefore we must be **aware of what happened before** (and even after).

---

class: split-50

# Neural network with memory: Recurrent Neural Network (RNN)

.column[
.center[![simple_rnn](simple_rnn.png "simple_rnn")]
]

.column[

- A recurrent neural network keeps in memory a state corresponding to its previous output.
- Each input is processed sequentially: the output for a word in a sequence depends on the preceding words.
- Warning: the state is reset to 0 for each new sequence. The notion of memory is only valid within the sequence being processed!

]

---

# Neural network with memory: RNN

.center[![unrolled_rnn](unrolled_rnn.png "unrolled_rnn")]

.center[**output(t) = activation(W x input(t) + U x state(t) + b)**]

With:

- *state(t) = output(t-1)*,
- for example: *activation = tanh*.

---

# RNN with Keras

```python
model = Sequential()
model.add(Embedding(max_features, embedding_matrix.shape[1], weights=[embedding_matrix]))
model.add(SimpleRNN(32))
model.add(Dense(6, activation='sigmoid'))
```

- Still very few lines of code for a more complex neural network.
- `logloss` with a simple neural network: **`0.0692`** vs. **`0.0662`** with an RNN.
- A little bit better, but could we achieve a lower `logloss`?
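
---

# Aside: computing the `logloss` (illustrative sketch)

The scores quoted in this deck are `logloss` values. Below is a hedged sketch of one way to compute such a score for this multi-label problem (mean column-wise binary log loss on the validation set, assuming `y_val` is a NumPy array of shape `(n_samples, 6)`); this is not necessarily how the quoted numbers were produced:

```python
import numpy as np
from sklearn.metrics import log_loss

# Predicted probabilities for the 6 labels on the validation set.
y_pred = model.predict(X_val)  # shape: (n_samples, 6)

# One binary log loss per label, then averaged over the 6 columns.
column_losses = [log_loss(y_val[:, i], y_pred[:, i], labels=[0, 1]) for i in range(6)]
print(np.mean(column_losses))
```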
---

# From RNN to Long Short-Term Memory (LSTM)

- RNNs have difficulty retaining states over long sequences.
- LSTMs add an extra track that carries information from state to state.
- LSTMs are very similar to RNNs, but with a more complex architecture.
- In summary, LSTMs can re-inject, later in the sequence, information seen at an earlier time step t.
- If you want to go further: [http://colah.github.io/posts/2015-08-Understanding-LSTMs/](http://colah.github.io/posts/2015-08-Understanding-LSTMs/)

---

# Schematic LSTM architecture

.center[![simple_lstm](simple_lstm.png "simple_lstm")]

.center[Schematic LSTM architecture with a carry track to memorize important information over long sequences.]

---

# LSTM in Keras

```python
model = Sequential()
model.add(Embedding(max_features, embedding_matrix.shape[1], weights=[embedding_matrix]))
model.add(LSTM(32))
model.add(Dense(6, activation='sigmoid'))
```

- Still easy as 1, 2, 3!
- And we achieve a much better `logloss`: **0.0497** vs. **0.0662** with the RNN.

---

class: split-60

# Bi-directional LSTM

- The sequences are processed from left to right (as if we were reading a sentence).
- But sequences can also be processed from left to right **AND** from right to left.
- The network then has information from the past but also from the future.

.center[**This is the idea of bi-directional methods.**]

.column[
![bidirectional](bidirectional.png "bidirectional")
]

.column[
*Warning*: although very useful in NLP, bi-directional methods must be handled with care on time series. Learning from future data can be a very bad idea (*data leakage*).

]

---

# Bi-directional LSTM in Keras

```python
model = Sequential()
model.add(Embedding(max_features, embedding_matrix.shape[1], weights=[embedding_matrix]))
model.add(Bidirectional(LSTM(32)))
model.add(Dense(6, activation='sigmoid'))
```
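
*Illustrative note (not from the original deck)*: `Bidirectional` concatenates the forward and backward outputs by default (`merge_mode='concat'`), so `LSTM(32)` yields a 64-dimensional representation per comment. A quick check with made-up dimensions:

```python
from keras.models import Model
from keras.layers import Input, LSTM, Bidirectional

inp = Input(shape=(100, 300))        # (sequence length, embedding size) -- illustrative values
out = Bidirectional(LSTM(32))(inp)   # default merge_mode='concat'
print(Model(inp, out).output_shape)  # (None, 64)
```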
- Still easy, really **thank you Keras**!
- `logloss` is slightly improved again: **0.0494** vs. **0.0497** with the uni-directional LSTM.
- Note that we did not tune the model: no complex hyperparameter tuning here.

---

# In summary

The theory can be a little complex (*hello LSTM*)...

... but **Keras is simple!**

- Many pre-implemented building blocks with sensible default settings.
- Only a few lines of code are enough to build a first decent neural network.
- When working with **sequences** (time series, text, etc.), **recurrent neural networks** (RNN, LSTM, GRU) are more appropriate.
- Bi-directional methods can be very effective on NLP models.

---

# Go further with neural networks and Keras

- Add regularization (a sketch is given on the appendix slide at the end):
  * Dropout,
  * L1 or L2 regularization.
- Combine different models (stacking).
- Add external features to the neural network. For example, in NLP, we could count the number of uppercase words, punctuation marks, etc.
- Add pooling layers to process the sequences.
- Combine a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN): useful for processing long sequences.
- Add an "attention" mechanism: [https://distill.pub/2016/augmented-rnns/](https://distill.pub/2016/augmented-rnns/).

---

# Resources

- [Keras](https://keras.io/)
- [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python) (by Keras creator François Chollet). *Some images of this presentation come from this book.*
- [GloVe embeddings](https://nlp.stanford.edu/projects/glove/)

*Last but not least:*

- My repository for this project is [here](https://github.com/thomas-chauvet/kaggle_toxic_comment_classification).
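
---

# Appendix: a regularization sketch

A hedged sketch, not from the original project, of the regularization ideas mentioned on the "Go further" slide, applied to the bi-directional LSTM. It reuses `max_features` and `embedding_matrix` from the earlier slides, and the values (0.2, 0.3, 1e-5) are arbitrary illustrative choices:

```python
from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from keras.regularizers import l2

model = Sequential()
model.add(Embedding(max_features, embedding_matrix.shape[1], weights=[embedding_matrix]))
# Dropout on the LSTM inputs and on its recurrent connections.
model.add(Bidirectional(LSTM(32, dropout=0.2, recurrent_dropout=0.2)))
# Dropout between the recurrent layer and the output layer.
model.add(Dropout(0.3))
# L2 penalty on the output layer weights.
model.add(Dense(6, activation='sigmoid', kernel_regularizer=l2(1e-5)))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
```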