
Fundamentals of Natural Language Processing
Summer Camp

Renato Hermoza Aragonés


● BSc in Computer and Systems Engineering: Universidad San Martín de Porres.
● MSc in Informatics, concentration in Computer Science: Pontificia Universidad Católica del Perú.
● PhD in Medical Image Analysis, Computer Vision and Machine Learning: The University of Adelaide.
● Current: Chief Scientist at Kashin

Email: renato.hermoza@pucp.edu.pe
Web: https://renato145.github.io

Contents
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

0. Some tools
- https://aistudio.google.com

- https://console.anthropic.com

- https://web2md.answer.ai/
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

1.1 Questions
Experience with:
● NLP?
● CV?
● DL?
● ML?
● Using LLMs?

1.2 Setup
- https://colab.google/

- https://github.com/renato145/pucp_bootcamp_202401

1.3 What is NLP (Natural Language Processing)?

Natural Language Processing (NLP) is a broad field that covers a variety of tasks, including:
● Part-of-speech tagging: labeling nouns, verbs, adjectives.
● Named entity recognition (NER): identifying names of people, organizations, locations.
● Question answering.
● Speech recognition.
● Topic modeling: identifying the main topics in a set of documents.
● Sentiment classification: determining whether a comment is positive, negative, or neutral.
● Language modeling: predicting the next word.
● Translation.

1.4 NLP: a changing field

Case: spell checkers
Historically, spell checkers have required thousands of lines of code to express rules (Whitelaw et al., 2009).
Using statistical methods, a spell checker can be written in far fewer lines of code (norvig-spell-correct).
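To make that concrete, here is a minimal sketch in the spirit of Norvig's spell corrector: score candidate corrections by their frequency in a large corpus. The corpus file name is an illustrative assumption, and this is a simplification, not the original code.

```python
import re
from collections import Counter

# Word frequencies from any large plain-text corpus (file name is an assumption).
WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

def edits1(word):
    """All strings one edit away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Pick the most frequent known candidate; fall back to the word itself."""
    candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
    return max(candidates, key=lambda w: WORDS[w])

print(correct("speling"))  # likely "spelling", given a reasonable corpus
```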

1.5 NLP: a changing field

Case: best practices
link-to-tweet

1.6 NLP: complexity

"She killed the man with the tie."
● Was the man wearing the tie?
● Or was the tie the murder weapon?

1.7 NLP: the present

● Large language models (LLMs)
● ChatGPT

1.8 NLP: the future

● Accessibility
● Open models
● Regulation
● Ethical issues

1.9 Timeline
1997: LSTM (Hochreiter and Schmidhuber, 1997)
2007: Google Translate: SMT
2011: IMDB dataset (Maas et al., 2011)
2015: ImageNet (Russakovsky et al., 2015)
2016: Google Translate: GNMT
2017: ULMFiT (Howard and Ruder, 2018), Transformer architecture (Vaswani et al., 2017)
2018: ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018)
2019: GPT-2 (Solaiman et al., 2019)
2020: GPT-3 (Brown et al., 2020)
2023: GPT-4

1.10 Libraries

● scikit-learn
● Hugging Face
● PyTorch
● fastai
● Axolotl
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

2.1 IMDB challenge

Large Movie Review Dataset (Maas et al., 2011)

2.2 IMDB challenge


1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

3.1 Classic methods for NLP

?

3.2 Classic methods for NLP

Document-term count matrix:

             Word 1   Word 2   Word 3   Word 4   Word 5   …
Document 1     5        4
Document 2     1        2        1
Document 3     2        3        2
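A matrix like this can be built with scikit-learn's CountVectorizer; a minimal sketch with toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats play",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)            # sparse matrix: [documents x vocabulary]
print(vec.get_feature_names_out())     # the vocabulary (one column per word)
print(X.toarray())                     # one row of word counts per document
```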

3.3 Classic methods for NLP: TF-IDF

● TF: term frequency.
● IDF: inverse document frequency.
Example:
● The term "playa" appears 10 times in a document.
● The document contains a total of 100 terms.
● There are a total of 5,000 documents.
● The term "playa" appears in 50 documents.

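Working through those numbers (using the textbook form tf-idf = tf × log(N / df); real libraries such as scikit-learn add smoothing, so their exact values differ):

```latex
\mathrm{tf} = \frac{10}{100} = 0.1, \qquad
\mathrm{idf} = \log\frac{5000}{50} = \log 100 \approx 4.6, \qquad
\text{tf-idf} = 0.1 \times 4.6 \approx 0.46
```

A term that is frequent in a document but rare across the collection gets a high score.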

3.4 Classic methods for NLP: stemming and lemmatization

Reducing words to their root:
● Lemmatization: uses the rules of the language; the results are real words.
● Stemming (poor man's lemmatization): chops off word endings to approximate the root; the results may not be real words.


3.5 Classic methods for NLP: stemming and lemmatization

Libraries:
● NLTK (Natural Language Toolkit)
● spaCy
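A minimal NLTK sketch contrasting the two (the example words are illustrative):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)           # data needed by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "running", "better"]:
    print(word,
          stemmer.stem(word),                  # may not be a real word ("studi")
          lemmatizer.lemmatize(word, pos="v")) # always a real word ("study")
```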

3.6 Classic methods for NLP

Notebook 1
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

4.1 Deep learning for NLP: embeddings

?

4.2 Deep learning for NLP: embeddings

           x1     x2     x3     x4     x5    …
Token 1   0.86   0.41   0.49   0.13   0.72
Token 2   1.03   0.34   0.31   0.25   0.69
Token 3   0.77   0.13   0.05   0.31   0.64
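In PyTorch, a table like this is an nn.Embedding layer: each token id indexes a learned row of floats (the sizes below are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 10_000, 5           # illustrative sizes
emb = nn.Embedding(vocab_size, emb_dim)   # learnable [vocab_size, emb_dim] table

token_ids = torch.tensor([1, 2, 3])       # e.g. Token 1, Token 2, Token 3
vectors = emb(token_ids)                  # shape [3, emb_dim]: one row per token
print(vectors)
```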

4.3 Deep learning for NLP

Data x → Model f(x) → Output → Loss function (compares the output against the desired y) → Optimization → adjust the model
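That loop, as a minimal PyTorch sketch (toy data and a linear model, purely for illustration):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 10)                        # data x
y = torch.randn(32, 1)                         # desired output y

model = nn.Linear(10, 1)                       # the model f(x)
loss_fn = nn.MSELoss()                         # the loss function
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):
    loss = loss_fn(model(x), y)                # compare f(x) with the desired y
    opt.zero_grad()
    loss.backward()                            # optimization: compute gradients...
    opt.step()                                 # ...and adjust the model
```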

4.4 Deep learning for NLP: RNN

Recurrent neural networks
(figure from "Understanding LSTMs" by Chris Olah)

4.5 Deep learning for NLP: LSTM

Long Short-Term Memory (Hochreiter and Schmidhuber, 1997)
(figure from "Understanding LSTMs" by Chris Olah)

4.6 Deep learning for NLP: LSTM

Text input → Tokens → Embeddings → RNN layers (encoder) → Features → Model head → Output
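That pipeline as a minimal PyTorch sketch (all sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class LstmClassifier(nn.Module):
    """Tokens -> Embeddings -> LSTM encoder -> Features -> Model head."""
    def __init__(self, vocab_size=10_000, emb_dim=50, hidden=128, n_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, token_ids):             # [batch, seq_len]
        x = self.emb(token_ids)               # [batch, seq_len, emb_dim]
        features, _ = self.encoder(x)         # [batch, seq_len, hidden]
        return self.head(features[:, -1])     # classify from the last state

model = LstmClassifier()
out = model(torch.randint(0, 10_000, (4, 12)))  # 4 texts, 12 tokens each
print(out.shape)                                # torch.Size([4, 2])
```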

4.7 Timeline
1997: LSTM (Hochreiter and Schmidhuber, 1997)
2007: Google Translate: SMT
2011: IMDB dataset (Maas et al., 2011)
2015: ImageNet (Russakovsky et al., 2015)
2016: Google Translate: GNMT
2017: ULMFiT (Howard and Ruder, 2018), Transformer architecture (Vaswani et al., 2017)
2018: ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018)
2019: GPT-2 (Solaiman et al., 2019)
2020: GPT-3 (Brown et al., 2020)
2023: GPT-4

4.8 Deep learning for NLP

Notebook 2
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

5.1 Language models

Task: predict the next word

The cat →
The cat scratched →
The cat scratched the →
The cat scratched the furniture →

5.2 Language models

Text input → Tokens → Embeddings → AWD_LSTM (encoder) → Features → Model head → Next word
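A minimal fastai sketch of this setup on IMDB text (a sketch of the standard fastai workflow, not the exact notebook code; the encoder name "ft_enc" is an illustrative choice):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)                        # the IMDB reviews dataset
dls_lm = TextDataLoaders.from_folder(path, is_lm=True, valid_pct=0.1)

learn = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn.fine_tune(1)                                  # adapt the pretrained LM to IMDB text
learn.save_encoder("ft_enc")                        # keep the encoder for later fine-tuning

print(learn.predict("I liked this movie because", n_words=10))
```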

5.3 Language models

Notebook 3
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

6.1 Fine-tuning
ImageNet (Russakovsky et al., 2015)

6.2 Fine-tuning

The convolutions of a pre-trained network produce features that can feed many different heads:
● Classification
● Segmentation
● Detection
● Survival
● …

6.3 Fine-tuning

Text input → Tokens → Embeddings → AWD_LSTM (encoder) → Features → Model head → Next word

6.3 Fine-tuning

New task: keep the pretrained encoder, swap in a new model head.

Text input → Tokens → Embeddings → AWD_LSTM (encoder) → Features → New model head → New task output
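A minimal fastai sketch of that head swap for IMDB sentiment classification (assumes the encoder saved as "ft_enc" in the earlier sketch; names are illustrative):

```python
from fastai.text.all import *

path = untar_data(URLs.IMDB)
dls_clas = TextDataLoaders.from_folder(path, valid="test")

# Same AWD_LSTM encoder, new classification head.
learn = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn = learn.load_encoder("ft_enc")   # reuse the language-model encoder
learn.fine_tune(4)                     # train the new head, then the whole network
```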

6.4 Timeline
1997: LSTM (Hochreiter and Schmidhuber, 1997)
2007: Google Translate: SMT
2011: IMDB dataset (Maas et al., 2011)
2015: ImageNet (Russakovsky et al., 2015)
2016: Google Translate: GNMT
2017: ULMFiT (Howard and Ruder, 2018), Transformer architecture (Vaswani et al., 2017)
2018: ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018)
2019: GPT-2 (Solaiman et al., 2019)
2020: GPT-3 (Brown et al., 2020)
2023: GPT-4

6.5 Fine-tuning

Notebook 4
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

7.1 Transformers
Attention Is All You Need (Vaswani et al., 2017).
(figure source: The annotated transformer)

7.2 Transformers: attention

Each word is projected through linear layers into three vectors of d dims: a query Q, a key K, and a value V.

word-1 → Q1, K1, V1
word-2 → Q2, K2, V2

The dot product of one word's query with another word's key gives a scalar attention score:

Q1 K2ᵀ = scalar

7.3 Transformers: attention

All query-key scores together form the matrix QKᵀ [4×4]:

          word-1  word-2  word-3  word-4
word-1     x11     x12     x13     x14
word-2     x21     x22     x23     x24
word-3     x31     x32     x33     x34
word-4     x41     x42     x43     x44

7.4 Transformers: attention

The score matrix QKᵀ [4×4] is normalized with a softmax and multiplied by the values V [4×d]:

softmax(QKᵀ) [4×4] × V [4×d] = Attention [4×d]

7.5 Transformers: attention

From Vaswani et al., 2017:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
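That formula as a minimal PyTorch sketch (an illustration with random tensors, not the paper's code):

```python
import torch
import torch.nn.functional as F

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k**0.5   # [n, n] word-to-word scores
    weights = F.softmax(scores, dim=-1)           # each row sums to 1
    return weights @ v                            # [n, d] weighted mix of values

q = k = v = torch.randn(4, 64)                    # n = 4 words, d_k = 64
print(attention(q, k, v).shape)                   # torch.Size([4, 64])
```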

7.6 Transformers
Input:
word-1 word-2 word-3 word-4 word-5 … word-n

Embeddings (e.g. 50 dims), shape [50, n]:
embs-1 embs-2 embs-3 embs-4 embs-5 … embs-n

+ Positional encoding, shape [50, n]:
embs+pos-1 embs+pos-2 embs+pos-3 embs+pos-4 embs+pos-5 … embs+pos-n

Attention, shape [50, n]:
x-1 x-2 x-3 x-4 x-5 … x-n
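A sketch of the sinusoidal positional encoding added to the embeddings above (following Vaswani et al., 2017; note the code uses [n, d] layout, the transpose of the slide's [50, n]):

```python
import torch

def positional_encoding(n, d):
    """Sinusoidal positional encodings, shape [n, d]."""
    pos = torch.arange(n, dtype=torch.float).unsqueeze(1)   # [n, 1]
    i = torch.arange(0, d, 2, dtype=torch.float)            # even dimensions
    angles = pos / 10_000 ** (i / d)                        # [n, d/2]
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

embs = torch.randn(8, 50)                  # n = 8 words, 50-dim embeddings
x = embs + positional_encoding(8, 50)      # "embs + pos", as in the slide
```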

7.7 Transformers
Attention Is All You Need (Vaswani et al., 2017).

7.8 Timeline
1997: LSTM (Hochreiter and Schmidhuber, 1997)
2007: Google Translate: SMT
2011: IMDB dataset (Maas et al., 2011)
2015: ImageNet (Russakovsky et al., 2015)
2016: Google Translate: GNMT
2017: ULMFiT (Howard and Ruder, 2018), Transformer architecture (Vaswani et al., 2017)
2018: ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018)
2019: GPT-2 (Solaiman et al., 2019)
2020: GPT-3 (Brown et al., 2020)
2023: GPT-4

7.9 Transformers

From Hugging Face.

7.10 Transformers

Notebook 5

7.11 ModernBERT

https://huggingface.co/blog/modernbert

Main improvements:
- RoPE: Rotary Positional Embeddings (details).
- Alternating Attention: layers alternate between global attention and local sliding-window attention.
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

8.1 Large language models

GPT-3 (175 billion parameters)
GPT-4 (1.76 trillion parameters)

From Hugging Face.

8.2 Large language models


Instruction tuning datasets:
● https://huggingface.co/datasets/Open-Orca/OpenOrca
● https://huggingface.co/datasets/Anthropic/hh-rlhf
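Such datasets can be loaded with the Hugging Face datasets library; a minimal sketch (the field names vary per dataset, so inspect them first):

```python
from datasets import load_dataset

ds = load_dataset("Open-Orca/OpenOrca", split="train")
print(ds.column_names)   # inspect the instruction/response fields
print(ds[0])             # one instruction-tuning example
```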

8.3 Large language models


RLHF: Reinforcement Learning from Human Feedback (Christiano et al., 2017).

From OpenAI.

8.4 Large language models

DPO: Direct Preference Optimization (Rafailov et al., 2024).

https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K

8.5 Timeline
1997: LSTM (Hochreiter and Schmidhuber, 1997)
2007: Google Translate: SMT
2011: IMDB dataset (Maas et al., 2011)
2015: ImageNet (Russakovsky et al., 2015)
2016: Google Translate: GNMT
2017: ULMFiT (Howard and Ruder, 2018), Transformer architecture (Vaswani et al., 2017)
2018: ELMo (Peters et al., 2018), GPT-1 (Radford et al., 2018)
2019: GPT-2 (Solaiman et al., 2019)
2020: GPT-3 (Brown et al., 2020)
2023: GPT-4

8.6 Large language models

Notebook 6

8.7 Large language models


https://github.com/axolotl-ai-cloud/axolotl
1. Intro
2. IMDB challenge
3. Classic methods for NLP
4. Deep learning for NLP
5. Language models
6. Fine-tuning
7. Transformer
8. Large language models
9. Future

9.1 Future
https://www.fast.ai/posts/2023-09-04-learning-jumps/

9.2 Future
Open source LLMs:
● https://huggingface.co/
● https://ai.meta.com/llama/
● https://mistral.ai/

9.3 Future
https://www.nytimes.com/es/2023/12/27/espanol/new-york-times-demanda-openai-microsoft.html

9.4 Future
"A framework for understanding unintended consequences of machine learning." (Suresh
and Guttag, 2019).


9.7 Future
Ethics for Data Science
https://www.youtube.com/watch?v=krIVOb23EH8

References
● Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.
● Maas, Andrew, et al. "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the
association for computational linguistics: Human language technologies. 2011.
● Whitelaw, Casey, et al. "Using the web for language independent spellchecking and autocorrection." (2009).
● Russakovsky, Olga, et al. "Imagenet large scale visual recognition challenge." International journal of computer vision 115
(2015): 211-252.
● Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
● Christiano, Paul F., et al. "Deep reinforcement learning from human preferences." Advances in neural information processing
systems 30 (2017).
● Howard, Jeremy, and Sebastian Ruder. "Universal language model fine-tuning for text classification." arXiv preprint
arXiv:1801.06146 (2018).
● Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).
● Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
● Solaiman, Irene, et al. "Release strategies and the social impacts of language models." arXiv preprint arXiv:1908.09203 (2019).
● Suresh, Harini, and John V. Guttag. "A framework for understanding unintended consequences of machine learning." arXiv
preprint arXiv:1901.10002 2.8 (2019).
● Brown, Tom, et al. "Language models are few-shot learners." Advances in neural information processing systems 33 (2020):
1877-1901.
● Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv preprint arXiv:2312.00752
(2023).
● Rafailov, Rafael, et al. "Direct preference optimization: Your language model is secretly a reward model." Advances in Neural
Information Processing Systems 36 (2024).
