What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It combines computational linguistics with machine learning to process and analyze large amounts of natural language data.
NLP powers many applications we use daily, including virtual assistants, translation services, spam filters, and sentiment analysis tools. Understanding NLP is crucial for building intelligent systems that can interact with humans naturally.
The NLP Pipeline
A typical NLP pipeline takes raw text through preprocessing (cleaning and tokenization), feature extraction (for example TF-IDF or embeddings), model training, and evaluation; the steps later in this guide follow the same flow.
Key Insight: NLP doesn't just process words as strings—it understands context, sentiment, relationships, and meaning, making it one of the most complex and exciting fields in AI.
Essential Python Libraries for NLP
NLTK
Natural Language Toolkit - The classic library for NLP in Python
- Tokenization
- Stemming & Lemmatization
- POS Tagging
- Named Entity Recognition
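As a quick illustration (a minimal sketch, assuming the NLTK data packages such as punkt and averaged_perceptron_tagger from Step 1 below have been downloaded), a few of these features look like this:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import pos_tag

# Requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
sentence = "The children were running quickly through the parks"
tokens = word_tokenize(sentence)                    # ['The', 'children', 'were', ...]
stems = [PorterStemmer().stem(t) for t in tokens]   # e.g. 'running' -> 'run', 'parks' -> 'park'
tags = pos_tag(tokens)                              # [('The', 'DT'), ('children', 'NNS'), ...]
print(tokens, stems, tags, sep="\n")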
spaCy
Industrial-strength NLP with excellent performance
- Fast processing
- Pre-trained models
- Dependency parsing
- Entity recognition
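A minimal sketch of spaCy's dependency parsing and entity recognition, assuming the en_core_web_sm model from Step 1 below has been downloaded:
import spacy

nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

# Dependency parse: each token knows its syntactic relation and head
for token in doc:
    print(token.text, token.dep_, token.head.text)

# Named entities found by the pre-trained model
print([(ent.text, ent.label_) for ent in doc.ents])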
Transformers
State-of-the-art models like BERT and GPT
- Pre-trained models
- Transfer learning
- Multiple languages
- Fine-tuning capabilities
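As a rough sketch of how the Hugging Face transformers library is typically used, the pipeline helper downloads a default pre-trained model on first run:
from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")
result = classifier("NLP with transformers is surprisingly easy to get started with.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]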
Gensim
Topic modeling and document similarity
- Word2Vec
- Doc2Vec
- LDA
- Similarity detection
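A minimal Word2Vec sketch with Gensim; the corpus here is a toy example, and real embeddings need far more text:
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
sentences = [
    ["nlp", "is", "fun"],
    ["word", "embeddings", "capture", "meaning"],
    ["gensim", "trains", "word", "embeddings"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["word"][:5])                    # first few dimensions of one vector
print(model.wv.most_similar("word", topn=2))   # nearest neighbours in the toy space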
Step-by-Step NLP Implementation
Step 1: Install Required Libraries
Start by installing the essential NLP libraries for Python.
# Install essential NLP libraries
pip install nltk spacy gensim transformers textblob
pip install scikit-learn pandas numpy matplotlib
# Download spaCy English model
python -m spacy download en_core_web_sm
# Download NLTK data
python -c "import nltk; nltk.download('popular')"
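If the installation succeeded, a quick sanity check like the following should run without errors (a minimal verification sketch, not part of the setup commands above):
# Verify that the core libraries and the spaCy model are available
import nltk, spacy, gensim, sklearn

nlp = spacy.load("en_core_web_sm")
print("spaCy model loaded:", nlp.meta["name"])
print("NLTK version:", nltk.__version__)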
Step 2: Basic Text Preprocessing
Clean and prepare text data for analysis using NLTK.
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
def preprocess_text(text):
    # Lowercase and strip anything that isn't a letter or whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Split into word tokens
    tokens = word_tokenize(text)
    # Drop common English stopwords (the, is, and, ...)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Reduce each word to its dictionary form (e.g. "cars" -> "car")
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)
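For example, calling the function on a short review might look like this (the exact output can vary slightly across NLTK versions):
sample = "The movie's plot was AMAZING, but the 2nd half dragged on!!!"
print(preprocess_text(sample))
# -> roughly: "movie plot amazing nd half dragged"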
Preprocessing Steps: Lowercasing, removing special characters, tokenization, stopword removal, and lemmatization are essential for preparing text data for analysis.
Step 3: Sentiment Analysis with TextBlob
Perform sentiment analysis on text data. spaCy's standard models do not include a sentiment component, so this example uses TextBlob for simple polarity and subjectivity scores.
from textblob import TextBlob
def analyze_sentiment(text):
    # TextBlob returns polarity in [-1, 1] and subjectivity in [0, 1]
    blob = TextBlob(text)
    sentiment = blob.sentiment
    return {
        'polarity': sentiment.polarity,
        'subjectivity': sentiment.subjectivity,
        'sentiment': 'positive' if sentiment.polarity > 0 else 'negative' if sentiment.polarity < 0 else 'neutral'
    }
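A quick usage check (TextBlob is installed in Step 1; the exact scores below are illustrative):
print(analyze_sentiment("I absolutely love this phone, the camera is fantastic."))
# -> positive polarity, e.g. {'polarity': 0.5, 'subjectivity': 0.7, 'sentiment': 'positive'}
print(analyze_sentiment("This is the worst service I have ever used."))
# -> negative polarity, 'sentiment': 'negative'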
Step 4: Named Entity Recognition
Extract entities like people, organizations, and locations from text.
import spacy
nlp = spacy.load("en_core_web_sm")
def extract_entities(text):
    doc = nlp(text)
    entities = {}
    # Group entity texts by their spaCy label (PERSON, ORG, GPE, ...)
    for ent in doc.ents:
        entities.setdefault(ent.label_, []).append(ent.text)
    return entities
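For example, running it on a short news-style sentence typically produces something like the following (the exact labels depend on the loaded model):
text = "Tim Cook announced that Apple will open a new office in London in 2025."
print(extract_entities(text))
# Typically: {'PERSON': ['Tim Cook'], 'ORG': ['Apple'], 'GPE': ['London'], 'DATE': ['2025']}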
Entity Types: Common spaCy entity labels include PERSON, ORG (organizations), GPE (geopolitical entities such as countries and cities), DATE, TIME, and MONEY.
Real-World NLP Applications
Chatbots & Virtual Assistants
AI-powered conversational agents that understand and respond to human queries naturally.
Machine Translation
Systems like Google Translate that convert text between different languages automatically.
Sentiment Analysis
Analyzing customer reviews and social media posts to determine public opinion and sentiment.
Text Classification
Categorizing emails as spam/ham, news articles by topic, or support tickets by urgency.
Complete NLP Project Example
Here's a complete example of building a simple text classifier:
#!/usr/bin/env python3
"""
Complete NLP classifier example:
- small sample dataset
- preprocessing (lowercase, remove non-alpha, tokenize, stopwords, lemmatize)
- TF-IDF vectorization
- Multinomial Naive Bayes training
- evaluation and example predictions
- save/load model with joblib

Requirements: pip install numpy pandas scikit-learn nltk joblib
"""
import re
import pathlib
import joblib
import numpy as np
import pandas as pd
from typing import List, Dict
# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# Ensure required NLTK data is available
def ensure_nltk_data():
try:
nltk.data.find("tokenizers/punkt")
except LookupError:
nltk.download("punkt", quiet=True)
try:
nltk.data.find("corpora/stopwords")
except LookupError:
nltk.download("stopwords", quiet=True)
try:
nltk.data.find("corpora/wordnet")
except LookupError:
nltk.download("wordnet", quiet=True)
try:
nltk.data.find("taggers/averaged_perceptron_tagger")
except LookupError:
nltk.download("averaged_perceptron_tagger", quiet=True)
ensure_nltk_data()
# Preprocessing function
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
def preprocess_text(text: str) -> str:
"""
Lowercase, remove non-alphabetic chars, tokenize, remove stopwords, lemmatize.
Returns cleaned single string.
"""
if not isinstance(text, str):
text = str(text)
# Lowercase
text = text.lower()
# Remove URLs, emails
text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
text = re.sub(r"\S+@\S+", " ", text)
# Keep letters and spaces
text = re.sub(r"[^a-z\s]", " ", text)
# Collapse spaces
text = re.sub(r"\s+", " ", text).strip()
# Tokenize
tokens = word_tokenize(text)
# Remove stopwords and short tokens, lemmatize
tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words and len(tok) > 1]
return " ".join(tokens)
# Build a small sample dataset (for demonstration)
def build_sample_dataset() -> pd.DataFrame:
data = {
"text": [
"I love this product, it's amazing and works great!",
"Terrible experience. The product broke after two days. Worst purchase ever.",
"Fast shipping and excellent quality. Highly recommend.",
"Not satisfied with the quality. Very disappointed by customer service.",
"Great value for money. I'm very happy with this.",
"Waste of money. It didn't work as advertised.",
"The item arrived on time and matches the description.",
"Battery life is poor and the screen has dead pixels.",
"Superb performance! Exceeded my expectations.",
"Bad packaging, product was damaged when it arrived.",
"Okay product, not great but does the job.",
"I will buy this again — fantastic!",
"Do not buy this. Cheap materials and awful build.",
"Customer support helped me quickly and resolved my issue.",
"The features are exactly what I needed. Very pleased.",
"Awful — stopped working after a week.",
"Decent product for the price, acceptable build quality.",
"Excellent craftsmanship, premium feel and look.",
"The app frequently crashes. Unusable on my phone.",
"Five stars. Love it!",
],
# labels: positive / negative / neutral (we'll map to positive/negative for classifier demo)
"label": [
"positive", "negative", "positive", "negative", "positive",
"negative", "positive", "negative", "positive", "negative",
"neutral", "positive", "negative", "positive", "positive",
"negative", "neutral", "positive", "negative", "positive"
]
}
return pd.DataFrame(data)
# Map neutral to positive/negative for binary classification (optional)
def relabel_binary(df: pd.DataFrame) -> pd.DataFrame:
# Convert 'neutral' to 'positive' or 'negative' depending on simple rule or keep as neutral.
# For demo we'll keep binary: treat 'neutral' as 'positive' (simplest)
df = df.copy()
df["label"] = df["label"].replace({"neutral": "positive"})
return df
# Train pipeline
def train_pipeline(df: pd.DataFrame):
# Preprocess texts
df["processed"] = df["text"].apply(preprocess_text)
X = df["processed"].values
y = df["label"].values
# Train/test split (stratify to keep label balance)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# Vectorize
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Model
model = MultinomialNB(alpha=1.0)
model.fit(X_train_tfidf, y_train)
# Evaluate
y_pred = model.predict(X_test_tfidf)
acc = accuracy_score(y_test, y_pred)
clf_report = classification_report(y_test, y_pred, digits=4)
cm = confusion_matrix(y_test, y_pred, labels=np.unique(y))
print("=== EVALUATION ===")
print(f"Accuracy: {acc:.4f}")
print("\nClassification Report:")
print(clf_report)
print("Confusion Matrix:")
print(cm)
# Return objects for later use
return vectorizer, model
# Save and load helpers
def save_artifacts(vectorizer, model, folder="models", prefix="nlp"):
p = pathlib.Path(folder)
p.mkdir(parents=True, exist_ok=True)
vec_path = p / f"{prefix}_vectorizer.joblib"
model_path = p / f"{prefix}_model.joblib"
joblib.dump(vectorizer, vec_path)
joblib.dump(model, model_path)
print(f"Saved vectorizer -> {vec_path}")
print(f"Saved model -> {model_path}")
return vec_path, model_path
def load_artifacts(folder="models", prefix="nlp"):
p = pathlib.Path(folder)
vec_path = p / f"{prefix}_vectorizer.joblib"
model_path = p / f"{prefix}_model.joblib"
vectorizer = joblib.load(vec_path)
model = joblib.load(model_path)
return vectorizer, model
# Predict helper
def predict_texts(texts: List[str], vectorizer: TfidfVectorizer, model: MultinomialNB) -> List[Dict]:
processed = [preprocess_text(t) for t in texts]
X = vectorizer.transform(processed)
preds = model.predict(X)
probs = model.predict_proba(X) if hasattr(model, "predict_proba") else None
results = []
for i, t in enumerate(texts):
r = {"text": t, "processed": processed[i], "prediction": preds[i]}
if probs is not None:
# attach probability for predicted class
class_index = list(model.classes_).index(preds[i])
r["probability"] = float(probs[i, class_index])
results.append(r)
return results
# Main runnable
if __name__ == "__main__":
print("Building sample dataset...")
df = build_sample_dataset()
df = relabel_binary(df)
print("Dataset distribution:\n", df["label"].value_counts(), "\n")
print("Training pipeline...")
vectorizer, model = train_pipeline(df)
# Save artifacts
save_artifacts(vectorizer, model, folder="models", prefix="nlp_example")
# Example predictions
examples = [
"This product is fantastic and the battery lasts forever!",
"Absolute junk. It died in two days and support ignored me.",
"Delivery was okay, product works as expected."
]
print("\nExample predictions:")
vec, mdl = load_artifacts(folder="models", prefix="nlp_example")
results = predict_texts(examples, vec, mdl)
for r in results:
print("-" * 60)
print("Original:", r["text"])
print("Processed:", r["processed"])
print("Predicted:", r["prediction"], f"(prob {r.get('probability', 'n/a')})")
Next Steps in Your NLP Journey
Congratulations! You've taken your first steps into the fascinating world of Natural Language Processing. Continue practicing with different datasets and gradually tackle more complex NLP challenges.