What is Natural Language Processing?
Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and generate human language. It combines computational linguistics with machine learning to process and analyze large amounts of natural language data.
NLP powers many applications we use daily, including virtual assistants, translation services, spam filters, and sentiment analysis tools. Understanding NLP is crucial for building intelligent systems that can interact with humans naturally.
The NLP Pipeline
A typical NLP pipeline takes raw text through preprocessing (cleaning and tokenization), feature extraction (for example TF-IDF or embeddings), model training, and evaluation; the steps later in this guide follow the same flow.
Key Insight: NLP doesn't just process words as strings—it understands context, sentiment, relationships, and meaning, making it one of the most complex and exciting fields in AI.
Essential Python Libraries for NLP
NLTK
Natural Language Toolkit - The classic library for NLP in Python
- Tokenization
- Stemming & Lemmatization
- POS Tagging
- Named Entity Recognition
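As a quick illustration (a minimal sketch, assuming the NLTK data packages such as punkt and averaged_perceptron_tagger from Step 1 below have been downloaded), a few of these features look like this:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk import pos_tag

# Requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')
sentence = "The children were running quickly through the parks"
tokens = word_tokenize(sentence)                    # ['The', 'children', 'were', ...]
stems = [PorterStemmer().stem(t) for t in tokens]   # e.g. 'running' -> 'run', 'parks' -> 'park'
tags = pos_tag(tokens)                              # [('The', 'DT'), ('children', 'NNS'), ...]
print(tokens, stems, tags, sep="\n")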
spaCy
Industrial-strength NLP with excellent performance
- Fast processing
- Pre-trained models
- Dependency parsing
- Entity recognition
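A minimal sketch of spaCy's dependency parsing and entity recognition, assuming the en_core_web_sm model from Step 1 below has been downloaded:
import spacy

nlp = spacy.load("en_core_web_sm")  # small pre-trained English pipeline
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion")

# Dependency parse: each token knows its syntactic relation and head
for token in doc:
    print(token.text, token.dep_, token.head.text)

# Named entities found by the pre-trained model
print([(ent.text, ent.label_) for ent in doc.ents])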
Transformers
State-of-the-art models like BERT and GPT
- Pre-trained models
- Transfer learning
- Multiple languages
- Fine-tuning capabilities
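As a rough sketch of how the Hugging Face transformers library is typically used, the pipeline helper downloads a default pre-trained model on first run:
from transformers import pipeline

# Downloads a default pre-trained sentiment model on first use
classifier = pipeline("sentiment-analysis")
result = classifier("NLP with transformers is surprisingly easy to get started with.")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]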
Gensim
Topic modeling and document similarity
- Word2Vec
- Doc2Vec
- LDA
- Similarity detection
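A minimal Word2Vec sketch with Gensim; the corpus here is a toy example, and real embeddings need far more text:
from gensim.models import Word2Vec

# Toy corpus: each document is a list of tokens
sentences = [
    ["nlp", "is", "fun"],
    ["word", "embeddings", "capture", "meaning"],
    ["gensim", "trains", "word", "embeddings"],
]
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)
print(model.wv["word"][:5])                    # first few dimensions of one vector
print(model.wv.most_similar("word", topn=2))   # nearest neighbours in the toy space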
Step-by-Step NLP Implementation
Step 1: Install Required Libraries
Start by installing the essential NLP libraries for Python.
# Install essential NLP libraries
pip install nltk spacy gensim transformers textblob
pip install scikit-learn pandas numpy matplotlib
# Download spaCy English model
python -m spacy download en_core_web_sm
# Download NLTK data
python -c "import nltk; nltk.download('popular')"
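If the installation succeeded, a quick sanity check like the following should run without errors (a minimal verification sketch, not part of the setup commands above):
# Verify that the core libraries and the spaCy model are available
import nltk, spacy, gensim, sklearn

nlp = spacy.load("en_core_web_sm")
print("spaCy model loaded:", nlp.meta["name"])
print("NLTK version:", nltk.__version__)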
Step 2: Basic Text Preprocessing
Clean and prepare text data for analysis using NLTK.
import nltk
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
def preprocess_text(text):
    # Lowercase and strip anything that isn't a letter or whitespace
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Split into word tokens
    tokens = word_tokenize(text)
    # Drop common English stopwords (the, is, and, ...)
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    # Reduce each word to its dictionary form (e.g. "cars" -> "car")
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return ' '.join(tokens)
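For example, calling the function on a short review might look like this (the exact output can vary slightly across NLTK versions):
sample = "The movie's plot was AMAZING, but the 2nd half dragged on!!!"
print(preprocess_text(sample))
# -> roughly: "movie plot amazing nd half dragged"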
Preprocessing Steps: Lowercasing, removing special characters, tokenization, stopword removal, and lemmatization are essential for preparing text data for analysis.
Step 3: Sentiment Analysis with TextBlob
Perform sentiment analysis on text data. spaCy's standard models do not include a sentiment component, so this example uses TextBlob for simple polarity and subjectivity scores.
from textblob import TextBlob
def analyze_sentiment(text):
    # TextBlob returns polarity in [-1, 1] and subjectivity in [0, 1]
    blob = TextBlob(text)
    sentiment = blob.sentiment
    return {
        'polarity': sentiment.polarity,
        'subjectivity': sentiment.subjectivity,
        'sentiment': 'positive' if sentiment.polarity > 0 else 'negative' if sentiment.polarity < 0 else 'neutral'
    }
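A quick usage check (TextBlob is installed in Step 1; the exact scores below are illustrative):
print(analyze_sentiment("I absolutely love this phone, the camera is fantastic."))
# -> positive polarity, e.g. {'polarity': 0.5, 'subjectivity': 0.7, 'sentiment': 'positive'}
print(analyze_sentiment("This is the worst service I have ever used."))
# -> negative polarity, 'sentiment': 'negative'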
Step 4: Named Entity Recognition
Extract entities like people, organizations, and locations from text.
import spacy
nlp = spacy.load("en_core_web_sm")
def extract_entities(text):
    doc = nlp(text)
    entities = {}
    # Group entity texts by their spaCy label (PERSON, ORG, GPE, ...)
    for ent in doc.ents:
        entities.setdefault(ent.label_, []).append(ent.text)
    return entities
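For example, running it on a short news-style sentence typically produces something like the following (the exact labels depend on the loaded model):
text = "Tim Cook announced that Apple will open a new office in London in 2025."
print(extract_entities(text))
# Typically: {'PERSON': ['Tim Cook'], 'ORG': ['Apple'], 'GPE': ['London'], 'DATE': ['2025']}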
Entity Types: Common spaCy entity labels include PERSON, ORG (organizations), GPE (geopolitical entities such as countries and cities), DATE, TIME, and MONEY.
Real-World NLP Applications
Chatbots & Virtual Assistants
AI-powered conversational agents that understand and respond to human queries naturally.
Machine Translation
Systems like Google Translate that convert text between different languages automatically.
Sentiment Analysis
Analyzing customer reviews and social media posts to determine public opinion and sentiment.
Text Classification
Categorizing emails as spam/ham, news articles by topic, or support tickets by urgency.
Complete NLP Project Example
Here's a complete example of building a simple text classifier:
#!/usr/bin/env python3
"""
Complete NLP classifier example:
- small sample dataset
- preprocessing (lowercase, remove non-alpha, tokenize, stopwords, lemmatize)
- TF-IDF vectorization
- Multinomial Naive Bayes training
- evaluation and example predictions
- save/load model with joblib

Requirements: pip install numpy pandas scikit-learn nltk joblib
"""
import re
import pathlib
import joblib
import numpy as np
import pandas as pd
from typing import List, Dict
# sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
# NLTK
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
# Ensure required NLTK data is available
def ensure_nltk_data():
try:
nltk.data.find("tokenizers/punkt")
except LookupError:
nltk.download("punkt", quiet=True)
try:
nltk.data.find("corpora/stopwords")
except LookupError:
nltk.download("stopwords", quiet=True)
try:
nltk.data.find("corpora/wordnet")
except LookupError:
nltk.download("wordnet", quiet=True)
try:
nltk.data.find("taggers/averaged_perceptron_tagger")
except LookupError:
nltk.download("averaged_perceptron_tagger", quiet=True)
ensure_nltk_data()
# Preprocessing function
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
def preprocess_text(text: str) -> str:
"""
Lowercase, remove non-alphabetic chars, tokenize, remove stopwords, lemmatize.
Returns cleaned single string.
"""
if not isinstance(text, str):
text = str(text)
# Lowercase
text = text.lower()
# Remove URLs, emails
text = re.sub(r"http\S+|www\S+|https\S+", " ", text)
text = re.sub(r"\S+@\S+", " ", text)
# Keep letters and spaces
text = re.sub(r"[^a-z\s]", " ", text)
# Collapse spaces
text = re.sub(r"\s+", " ", text).strip()
# Tokenize
tokens = word_tokenize(text)
# Remove stopwords and short tokens, lemmatize
tokens = [lemmatizer.lemmatize(tok) for tok in tokens if tok not in stop_words and len(tok) > 1]
return " ".join(tokens)
# Build a small sample dataset (for demonstration)
def build_sample_dataset() -> pd.DataFrame:
data = {
"text": [
"I love this product, it's amazing and works great!",
"Terrible experience. The product broke after two days. Worst purchase ever.",
"Fast shipping and excellent quality. Highly recommend.",
"Not satisfied with the quality. Very disappointed by customer service.",
"Great value for money. I'm very happy with this.",
"Waste of money. It didn't work as advertised.",
"The item arrived on time and matches the description.",
"Battery life is poor and the screen has dead pixels.",
"Superb performance! Exceeded my expectations.",
"Bad packaging, product was damaged when it arrived.",
"Okay product, not great but does the job.",
"I will buy this again — fantastic!",
"Do not buy this. Cheap materials and awful build.",
"Customer support helped me quickly and resolved my issue.",
"The features are exactly what I needed. Very pleased.",
"Awful — stopped working after a week.",
"Decent product for the price, acceptable build quality.",
"Excellent craftsmanship, premium feel and look.",
"The app frequently crashes. Unusable on my phone.",
"Five stars. Love it!",
],
# labels: positive / negative / neutral (we'll map to positive/negative for classifier demo)
"label": [
"positive", "negative", "positive", "negative", "positive",
"negative", "positive", "negative", "positive", "negative",
"neutral", "positive", "negative", "positive", "positive",
"negative", "neutral", "positive", "negative", "positive"
]
}
return pd.DataFrame(data)
# Map neutral to positive/negative for binary classification (optional)
def relabel_binary(df: pd.DataFrame) -> pd.DataFrame:
# Convert 'neutral' to 'positive' or 'negative' depending on simple rule or keep as neutral.
# For demo we'll keep binary: treat 'neutral' as 'positive' (simplest)
df = df.copy()
df["label"] = df["label"].replace({"neutral": "positive"})
return df
# Train pipeline
def train_pipeline(df: pd.DataFrame):
# Preprocess texts
df["processed"] = df["text"].apply(preprocess_text)
X = df["processed"].values
y = df["label"].values
# Train/test split (stratify to keep label balance)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# Vectorize
vectorizer = TfidfVectorizer(max_features=1000, ngram_range=(1,2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)
# Model
model = MultinomialNB(alpha=1.0)
model.fit(X_train_tfidf, y_train)
# Evaluate
y_pred = model.predict(X_test_tfidf)
acc = accuracy_score(y_test, y_pred)
clf_report = classification_report(y_test, y_pred, digits=4)
cm = confusion_matrix(y_test, y_pred, labels=np.unique(y))
print("=== EVALUATION ===")
print(f"Accuracy: {acc:.4f}")
print("\nClassification Report:")
print(clf_report)
print("Confusion Matrix:")
print(cm)
# Return objects for later use
return vectorizer, model
# Save and load helpers
def save_artifacts(vectorizer, model, folder="models", prefix="nlp"):
p = pathlib.Path(folder)
p.mkdir(parents=True, exist_ok=True)
vec_path = p / f"{prefix}_vectorizer.joblib"
model_path = p / f"{prefix}_model.joblib"
joblib.dump(vectorizer, vec_path)
joblib.dump(model, model_path)
print(f"Saved vectorizer -> {vec_path}")
print(f"Saved model -> {model_path}")
return vec_path, model_path
def load_artifacts(folder="models", prefix="nlp"):
p = pathlib.Path(folder)
vec_path = p / f"{prefix}_vectorizer.joblib"
model_path = p / f"{prefix}_model.joblib"
vectorizer = joblib.load(vec_path)
model = joblib.load(model_path)
return vectorizer, model
# Predict helper
def predict_texts(texts: List[str], vectorizer: TfidfVectorizer, model: MultinomialNB) -> List[Dict]:
processed = [preprocess_text(t) for t in texts]
X = vectorizer.transform(processed)
preds = model.predict(X)
probs = model.predict_proba(X) if hasattr(model, "predict_proba") else None
results = []
for i, t in enumerate(texts):
r = {"text": t, "processed": processed[i], "prediction": preds[i]}
if probs is not None:
# attach probability for predicted class
class_index = list(model.classes_).index(preds[i])
r["probability"] = float(probs[i, class_index])
results.append(r)
return results
# Main runnable
if __name__ == "__main__":
print("Building sample dataset...")
df = build_sample_dataset()
df = relabel_binary(df)
print("Dataset distribution:\n", df["label"].value_counts(), "\n")
print("Training pipeline...")
vectorizer, model = train_pipeline(df)
# Save artifacts
save_artifacts(vectorizer, model, folder="models", prefix="nlp_example")
# Example predictions
examples = [
"This product is fantastic and the battery lasts forever!",
"Absolute junk. It died in two days and support ignored me.",
"Delivery was okay, product works as expected."
]
print("\nExample predictions:")
vec, mdl = load_artifacts(folder="models", prefix="nlp_example")
results = predict_texts(examples, vec, mdl)
for r in results:
print("-" * 60)
print("Original:", r["text"])
print("Processed:", r["processed"])
print("Predicted:", r["prediction"], f"(prob {r.get('probability', 'n/a')})")
Next Steps in Your NLP Journey
Congratulations! You've taken your first steps into the fascinating world of Natural Language Processing. Continue practicing with different datasets and gradually tackle more complex NLP challenges.