In this tutorial, I’ll walk you through how to build a machine learning system that automatically classifies GitHub README files by programming language. This practical application of natural language processing demonstrates how to use TensorFlow to analyze and categorize technical documentation.
To provide a simple development environment we’re going to use a Jupyter Notebook. Each of the major code sections should be placed into individual cells to walk through the process. View the final notebook on GitHub.
README files make great practical examples as they contain essential information about a project’s purpose, installation instructions, and technical requirements. Automatically classifying these files has several applications:
- Organizing and categorizing repositories
- Identifying technology stacks
- Finding similar projects
- Improving search and discovery
Our goal is to build a classifier that can predict a repository’s programming language based solely on its README content. This is challenging because README files contain natural language mixed with code snippets, markdown formatting, and other elements. The processing we’re going to perform can be used for other documents as well.
Project Overview
We’ll create a complete classification pipeline:
- Data Collection: Gather README files from popular GitHub repositories across different programming languages
- Text Preprocessing: Clean and normalize text for machine learning
- Feature Engineering: Convert text to numerical vectors using TensorFlow’s text vectorization
- Model Creation: Build a deep learning classifier using bidirectional LSTMs
- Training and Evaluation: Train the model and assess its performance
- Deployment: Save the model for future use
1. Setting Up Our Environment
First, we’ll install the necessary libraries for our project. Create the first cell in your notebook and paste the following. By starting the first line with !, we’re telling the notebook to execute the command from the command line rather than the Python interpreter.
!pip install tensorflow pandas numpy matplotlib scikit-learn seaborn requests beautifulsoup4 tqdm
import tensorflow as tf
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tqdm import tqdm
print(f'TensorFlow version: {tf.__version__}')
Each package plays a role in our machine learning pipeline:
- TensorFlow: The core ML framework we’ll use to build, train, and deploy our deep learning model. It provides high-level APIs for building neural networks and managing the entire machine learning workflow.
- NumPy and Pandas: Essential data manipulation libraries. NumPy handles numerical operations efficiently, while Pandas provides the DataFrame structure that will store our README dataset.
- Matplotlib: For data visualization, particularly useful when analyzing our dataset distribution and model performance metrics.
- scikit-learn: Provides machine learning utilities like train_test_split for dataset preparation and evaluation metrics we’ll use later to measure our model’s performance.
- Requests: For making HTTP requests to the GitHub API to collect our README files.
- BeautifulSoup4: A powerful HTML/XML parsing library that will help us clean and extract text from README files that might contain HTML elements.
- tqdm: A smart progress bar library that gives us visual feedback during long-running operations like data collection and model training. This is especially helpful when processing large datasets.
When running this in a Jupyter notebook, you might see some warning messages during installation, but as long as the imports succeed, you’re ready to proceed with the next steps of building our classification system.
2. Data Collection from GitHub
We’ll create a GitHubDataCollector class to handle API requests and collect README files from popular repositories in our target languages. This class includes error handling and rate limit management, which is crucial since the GitHub API imposes rate limits (60 requests/hour for unauthenticated users, 5000 requests/hour with authentication). Depending on the number of repositories you’re collecting, this section may take some time to execute without a GitHub API key.
import requests
import base64
import time
from typing import List, Dict, Any

class GitHubDataCollector:
    def __init__(self, token: str = None):
        self.headers = {}
        if token:
            self.headers['Authorization'] = f'token {token}'
        self.base_url = 'https://api.github.com'
        self.rate_limit_remaining = 5000  # Default GitHub API limit
        self.rate_limit_reset = 0

    def _check_rate_limit(self):
        """Check and handle API rate limits"""
        if self.rate_limit_remaining <= 1:
            wait_time = max(0, self.rate_limit_reset - time.time())
            if wait_time > 0:
                print(f'Rate limit reached. Waiting {wait_time:.0f} seconds...')
                time.sleep(wait_time)

    def _update_rate_limit(self, response: requests.Response):
        """Update rate limit info from response headers"""
        self.rate_limit_remaining = int(response.headers.get('X-RateLimit-Remaining', 0))
        self.rate_limit_reset = int(response.headers.get('X-RateLimit-Reset', 0))

    def get_popular_repos(self, languages: List[str], stars: int = 1000) -> List[Dict[str, Any]]:
        """Get popular repositories for given languages with error handling"""
        repos = []
        for lang in tqdm(languages, desc='Fetching repositories'):
            try:
                self._check_rate_limit()
                query = f'language:{lang} stars:>{stars}'
                url = f'{self.base_url}/search/repositories?q={query}&sort=stars'
                response = requests.get(url, headers=self.headers)
                response.raise_for_status()
                self._update_rate_limit(response)
                data = response.json()
                repos.extend(data.get('items', []))
            except requests.exceptions.RequestException as e:
                print(f'Error fetching {lang} repositories: {e}')
        return repos

    def get_readme(self, owner: str, repo: str) -> str:
        """Get README content with error handling"""
        try:
            self._check_rate_limit()
            url = f'{self.base_url}/repos/{owner}/{repo}/readme'
            response = requests.get(url, headers=self.headers)
            response.raise_for_status()
            self._update_rate_limit(response)
            content = response.json()['content']
            return base64.b64decode(content).decode('utf-8')
        except requests.exceptions.RequestException as e:
            print(f'Error fetching README for {owner}/{repo}: {e}')
        except (KeyError, UnicodeDecodeError) as e:
            print(f'Error processing README for {owner}/{repo}: {e}')
        return None
Understanding the GitHubDataCollector Class
This class handles interaction with the GitHub API and includes several key components:
- Initialization: The __init__ method sets up the headers for API requests, optionally adding an authentication token. Using a token significantly increases the rate limit from 60 to 5000 requests per hour.
- Rate Limit Management: The _check_rate_limit and _update_rate_limit methods work together to track our remaining API requests and pause execution if we approach the limit. This prevents our script from failing with HTTP 429 errors (Too Many Requests). (See the small quota-check sketch after this list.)
- Repository Collection: The get_popular_repos method retrieves popular repositories for each target language, filtering by minimum star count. The progress bar from tqdm provides visual feedback during this potentially long-running operation.
- README Content Retrieval: The get_readme method gets the README content for a specific repository, handling base64 decoding (the GitHub API returns encoded content) and potential errors like missing files or encoding issues.
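Before kicking off a long collection run, you can check how much quota you currently have: the GitHub API exposes a /rate_limit endpoint for exactly this. Here’s a minimal sketch that assumes a personal access token stored in a GITHUB_TOKEN environment variable (the variable name is just a convention I’m using here):

# Minimal sketch: check your current GitHub API quota before collecting data.
# Without a token you'll see the unauthenticated limit of 60 requests/hour.
import os
import requests

token = os.environ.get('GITHUB_TOKEN')
headers = {'Authorization': f'token {token}'} if token else {}

resp = requests.get('https://api.github.com/rate_limit', headers=headers)
resp.raise_for_status()
core = resp.json()['resources']['core']
print(f"Remaining requests: {core['remaining']} (resets at epoch {core['reset']})")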
Now, let’s use this class to collect our dataset:
# Initialize collector
collector = GitHubDataCollector()  # Add token='your_token' for authenticated requests

# Define target languages and minimum stars
languages = ['python', 'javascript', 'java', 'go', 'rust']
min_stars = 5000

# Collect repositories
print(f'Collecting repositories with at least {min_stars} stars...')
repos = collector.get_popular_repos(languages, stars=min_stars)
print(f'Found {len(repos)} repositories')

# Collect READMEs and create dataset
data = []
for repo in tqdm(repos, desc='Collecting READMEs'):
    try:
        readme = collector.get_readme(repo['owner']['login'], repo['name'])
        if readme:
            data.append({
                'name': repo['name'],
                'language': repo['language'],
                'stars': repo['stargazers_count'],
                'text': readme
            })
    except Exception as e:
        print(f'Error processing {repo["name"]}: {e}')

# Create DataFrame
df = pd.DataFrame(data)

# Display dataset statistics
print('\nDataset Statistics:')
print(f'Total samples: {len(df)}')
print('\nSamples per language:')
print(df['language'].value_counts())

Executing the Data Collection
This code performs several important steps:
- API Authentication: We initialize our collector without a token here, but you can add your GitHub token to increase the rate limit. Without the API token it will take multiple hours to retrieve all the README files for training.
- Target Definition: We specify our five target programming languages (Python, JavaScript, Java, Go, and Rust) and a minimum star threshold of 5000 to focus on popular repositories with substantial documentation.
- Repository Discovery: We fetch repositories matching our criteria and display the total count found.
- README Collection: For each repository, we retrieve its README file and store it along with metadata in our dataset.
- Dataset Creation: We convert the collected data into a Pandas DataFrame, making it easy to manipulate and analyze.
- Dataset Exploration: We display basic statistics about our dataset, including the total number of samples and distribution across programming languages.
This data collection approach ensures we get high-quality examples from established projects while handling potential API errors and limitations. The result is a balanced dataset of README files that we can use to train our classifier.
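For a quick visual check of the class balance, beyond the printed counts, you can plot the distribution with the matplotlib import from earlier. A short sketch in its own cell:

# Optional: visualize the per-language sample counts printed above.
df['language'].value_counts().plot(kind='bar', title='README samples per language')
plt.xlabel('Language')
plt.ylabel('Number of READMEs')
plt.tight_layout()
plt.show()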
3. Text Preprocessing
Before feeding text into our model, we need to clean and normalize it. README files contain markdown formatting, code blocks, URLs, and other elements that could confuse our model. These elements introduce noise that can distract from the core language patterns we want to learn. For example, code blocks might contain similar syntax across different languages, while HTML tags and URLs are common across all programming communities and don’t help with classification.
import re
from bs4 import BeautifulSoup

def preprocess_readme(text: str) -> str:
    if not text:
        return ''
    # Remove code blocks
    text = re.sub(r'```[^`]*```', '', text)
    # Remove URLs
    text = re.sub(r'http[s]?://\S+', '', text)
    # Remove markdown links
    text = re.sub(r'\[([^\[]+)\]\([^\)]+\)', r'\1', text)
    # Remove HTML tags
    text = BeautifulSoup(text, 'html.parser').get_text()
    # Remove special characters and extra whitespace
    text = re.sub(r'[^\w\s]', ' ', text)
    text = ' '.join(text.split())
    return text.lower()

# Preprocess all README texts
print('Preprocessing README texts...')
df['processed_text'] = df['text'].apply(preprocess_readme)

# Display example of preprocessed text
print('Example of preprocessed text:')
print('Original:', df['text'].iloc[0][:200], '...')
print('Preprocessed:', df['processed_text'].iloc[0][:200], '...')
Understanding the Preprocessing Function
Our preprocess_readme function performs several critical cleaning operations:
- Code Block Removal: We strip out code blocks (text between triple backticks) because they often contain language-specific syntax that might give away the answer too easily or mislead the model with similar syntax patterns across languages.
- URL Removal: URLs typically don’t contain useful information for language classification and are common across all types of repositories.
- Markdown Link Conversion: We convert markdown links like [text](url) to just text, preserving the descriptive content while removing the URL.
- HTML Tag Removal: BeautifulSoup helps us strip any HTML tags that might be embedded in the README, ensuring we’re working with pure text.
- Special Character Removal: We remove punctuation and special characters that aren’t relevant to the language classification task.
- Text Normalization: We convert everything to lowercase and normalize whitespace to standardize the input.
This preprocessing pipeline significantly reduces noise in our data and helps the model focus on the natural language patterns that differentiate between programming communities. The final step shows a before-and-after example so we can verify our preprocessing is working correctly.
After preprocessing, our text data is much cleaner and more consistent, making it easier for the model to learn meaningful patterns rather than being distracted by formatting or structural elements that don’t indicate the programming language.
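To sanity-check the function, you can run it on a small hand-written snippet in its own cell. The sample string below is made up purely for illustration:

# Quick sanity check of preprocess_readme on a made-up markdown snippet.
sample = """# My Project
Install with `pip`:
```bash
pip install myproject
```
See the [docs](https://example.com/docs) for details!"""

print(preprocess_readme(sample))
# -> 'my project install with pip see the docs for details'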
4. Text Vectorization with TensorFlow
Now we’ll convert the preprocessed text into numerical features that our neural network can understand. Deep learning models can’t work with raw text directly – they require numerical input data. TensorFlow’s TextVectorization layer provides a convenient way to transform our text into sequences of integers that represent words or tokens in our vocabulary:
# Create the text vectorization layer
max_features = 10000  # Maximum number of words to keep
sequence_length = 500  # Length of each sequence

vectorize_layer = tf.keras.layers.TextVectorization(
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length
)

# Adapt the layer to the text data
print('Adapting vectorization layer to the text data...')
vectorize_layer.adapt(df['processed_text'].values)

# Create training and validation sets
X = df['processed_text'].values
y_onehot = pd.get_dummies(df['language'])  # One-hot encode the labels
# get_dummies orders its columns alphabetically, so keep that column order as
# the canonical label list for the model output, reports, and predictions
languages = list(y_onehot.columns)
y = y_onehot.values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'Training samples: {len(X_train)}')
print(f'Validation samples: {len(X_val)}')

Understanding Text Vectorization
The code above performs several important operations:
- Vocabulary Size Definition: We set max_features=10000 to limit our vocabulary to the 10,000 most frequent words in our dataset. This helps prevent overfitting and reduces the model’s memory requirements. Words not in this vocabulary will be treated as out-of-vocabulary tokens.
- Sequence Length Standardization: We fix sequence_length=500 to ensure all input texts have the same length. Shorter texts will be padded with zeros, while longer texts will be truncated. This standardization is necessary because neural networks require fixed-size inputs.
- Layer Creation and Adaptation: The TextVectorization layer first analyzes our corpus during the adapt() call to build a vocabulary mapping words to integer indices based on their frequency. This is a crucial step where the layer learns the vocabulary specific to our README dataset.
- Output Mode Selection: We use output_mode='int' to convert each token to an integer index based on the vocabulary. Alternative modes include multi-hot or tf-idf weighting, but integer sequences work best for our deep learning approach.
- Dataset Preparation: After vectorization, we create our features (X) from the processed text and prepare our target variables (y) using Pandas’ get_dummies() function. This function converts our categorical language labels (like ‘Python’, ‘JavaScript’) into a binary matrix format called “one-hot encoding” where each language is represented by a column of 0s and 1s. Because get_dummies() orders its columns alphabetically, we capture that column order as our canonical languages list; with the labels sorted as [Go, Java, JavaScript, Python, Rust], a Python repository is encoded as [0, 0, 0, 1, 0], a Go repository as [1, 0, 0, 0, 0], and so on. This format is required for multi-class classification, as neural networks need numerical values rather than string labels to learn patterns.
- Train-Validation Split: Finally, we split our data into training (80%) and validation (20%) sets using scikit-learn’s train_test_split(). The random state ensures reproducible results by fixing the random number generator seed.
This vectorization approach allows us to transform variable-length text documents into standardized numerical sequences that our TensorFlow model can efficiently process. The layer will be integrated directly into our model, allowing it to perform text processing as part of the inference pipeline.
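Before moving on, it can be reassuring to see what the layer actually produces. A short sketch like this inspects the learned vocabulary and vectorizes one sample from our DataFrame:

# Peek at the learned vocabulary and at one vectorized example.
vocab = vectorize_layer.get_vocabulary()
print(f'Vocabulary size: {len(vocab)}')
print('Most frequent tokens:', vocab[2:12])  # index 0 is padding, index 1 is the OOV token

# Vectorize a single processed README and look at the first few token ids.
example = vectorize_layer(tf.constant([df['processed_text'].iloc[0]]))
print('Vectorized shape:', example.shape)        # (1, sequence_length)
print('First 20 token ids:', example[0, :20].numpy())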
5. Building the Deep Learning Model
We’ll create a neural network that combines embedding layers with bidirectional LSTMs to effectively understand the semantic content of README files. This architecture is particularly well-suited for text classification because it can capture both the meaning of individual words and their relationships within the document:
# Define the model
model = tf.keras.Sequential([
    vectorize_layer,
    tf.keras.layers.Embedding(max_features + 1, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(len(languages), activation='softmax')
])

# Compile the model
model.compile(
    loss='categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

# Model summary
model.summary()

print('Model Architecture Explanation:')
print('1. TextVectorization: Converts text to sequences of integer tokens')
print('2. Embedding: Converts tokens to dense vectors of size 64')
print('3. Bidirectional LSTM: Processes sequences in both directions')
print('4. Dense + Dropout: Final classification layers with regularization')
print('5. Output: Probability distribution over programming languages')

Understanding Our Model Architecture
This sequential model consists of several specialized layers that each serve a specific purpose:
- TextVectorization Layer: This is the same layer we defined earlier. It’s included directly in the model so that raw text can be fed in during inference without separate preprocessing steps. The layer converts text strings into sequences of token indices.
- Embedding Layer: This layer transforms our integer tokens into dense vector representations (64 dimensions per token). Unlike one-hot encoding, which would create sparse vectors, embeddings capture semantic relationships between words. Similar words will have similar vectors in this embedding space. We use max_features + 1 as the input dimension to account for all tokens plus the out-of-vocabulary token.
- Bidirectional LSTM Layers: Long Short-Term Memory (LSTM) networks are specialized recurrent neural networks that can remember patterns over long sequences. The bidirectional wrapper processes the text in both forward and backward directions, capturing context from both past and future tokens. The first LSTM layer returns sequences (one output per token) while the second aggregates this information.
  - The first layer has 64 units and returns sequences for the next LSTM layer
  - The second layer has 32 units and returns a single vector representing the entire document
- Dense Layer with ReLU: This fully connected layer with 64 neurons learns higher-level features from the LSTM’s output. The ReLU (Rectified Linear Unit) activation function introduces non-linearity, allowing the model to learn complex patterns.
- Dropout Layer: This important regularization technique randomly sets 50% of the inputs to zero during training, which prevents the model from becoming too dependent on any single feature and reduces overfitting.
- Output Layer with Softmax: The final dense layer has one neuron per programming language (matching the length of our languages list). The softmax activation function converts these raw scores into probabilities that sum to 1, making it easy to interpret which language the model predicts.
For training, we use:
- Categorical Cross-Entropy Loss: The standard loss function for multi-class classification problems
- Adam Optimizer: An adaptive learning rate optimization algorithm that efficiently handles sparse gradients
- Accuracy Metric: A straightforward way to monitor the percentage of correctly classified examples
This architecture balances complexity and performance, making it suitable for text classification tasks like ours without requiring excessive computational resources.
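To make the output layer and loss concrete, here is a tiny standalone sketch with made-up numbers (not from our model) showing how softmax turns raw scores into probabilities and how categorical cross-entropy scores them against a one-hot label:

# Toy illustration of softmax and categorical cross-entropy, using made-up logits.
logits = tf.constant([[2.0, 0.5, 0.1, 1.0, 0.2]])        # raw scores for 5 languages
probs = tf.nn.softmax(logits)                             # probabilities summing to 1
print(probs.numpy().round(3), probs.numpy().sum())

true_label = tf.constant([[0.0, 0.0, 0.0, 1.0, 0.0]])     # one-hot: the 4th class is correct
loss = tf.keras.losses.categorical_crossentropy(true_label, probs)
print('Cross-entropy loss:', float(loss[0]))              # lower when probs favor the true class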
6. Training the Classifier
Now we’ll train our model, using callbacks for early stopping and model checkpointing. This step is where our neural network actually learns to recognize patterns in the README files that distinguish between programming languages. Training is an iterative process where the model makes predictions, measures errors, and adjusts its internal parameters to improve performance over time:
# Training parameters
epochs = 10
batch_size = 32

# Callbacks for training
callbacks = [
    tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=2,
        restore_best_weights=True
    ),
    tf.keras.callbacks.ModelCheckpoint(
        'best_model.keras',
        monitor='val_accuracy',
        save_best_only=True
    ),
    tf.keras.callbacks.TensorBoard(
        log_dir='./logs',
        histogram_freq=1
    )
]

# Train the model
print('Training the model...')
history = model.fit(
    X_train,
    y_train,
    epochs=epochs,
    batch_size=batch_size,
    validation_data=(X_val, y_val),
    callbacks=callbacks
)

Understanding the Training Process
The training code above implements several important concepts:
- Training Parameters: epochs=10 defines the maximum number of complete passes through the training dataset. Each epoch gives the model another opportunity to learn from all examples. batch_size=32 means we’ll process 32 examples at a time before updating the model’s weights. This batch processing makes training more efficient and helps with generalization.
- Callbacks: These functions automatically execute during training to monitor progress and take actions:
  - Early Stopping: This prevents overfitting by monitoring validation loss and stopping training when it stops improving. The patience=2 parameter means we’ll wait for 2 epochs without improvement before stopping. The restore_best_weights=True option ensures we keep the model version with the best performance, not necessarily the last one.
  - Model Checkpoint: This saves the best model to disk whenever validation accuracy improves. The save_best_only=True parameter ensures we only overwrite previous checkpoints when the model improves.
  - TensorBoard: This enables visualization of training metrics, model graphs, and weight distributions in TensorBoard, making it easier to analyze training progress and debug issues.
- Model Fitting: The model.fit() call performs the actual training:
  - It takes our training data (X_train and y_train)
  - The validation data (X_val and y_val) is used to evaluate performance on unseen data after each epoch
  - The callbacks we defined are executed at appropriate points in the training cycle
  - The history object captures metrics from each epoch for later analysis
This training approach implements best practices like monitoring validation metrics and early stopping to prevent overfitting. By using callbacks, we ensure that even if training is interrupted, we’ll have saved the best model for later use.
During training, you’ll see progress output showing metrics for each epoch, including loss and accuracy values for both training and validation data. The training process typically takes 5-15 minutes depending on your hardware and the size of the dataset.
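Once training finishes, the history object makes it easy to visualize the learning curves. A short sketch using the variable names from the training cell above:

# Plot training vs. validation accuracy and loss from the history object.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(history.history['accuracy'], label='train')
ax1.plot(history.history['val_accuracy'], label='validation')
ax1.set_title('Accuracy per epoch')
ax1.set_xlabel('Epoch')
ax1.legend()

ax2.plot(history.history['loss'], label='train')
ax2.plot(history.history['val_loss'], label='validation')
ax2.set_title('Loss per epoch')
ax2.set_xlabel('Epoch')
ax2.legend()

plt.tight_layout()
plt.show()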
7. Model Evaluation
Let’s evaluate our model’s performance using a confusion matrix and classification report. These evaluation techniques will give us detailed insights into how well our model recognizes each programming language and where it might be making mistakes:
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns

# Get predictions
y_pred = model.predict(X_val)
y_pred_classes = np.argmax(y_pred, axis=1)
y_val_classes = np.argmax(y_val, axis=1)

# Print classification report
print('Classification Report:')
print(classification_report(y_val_classes, y_pred_classes, target_names=languages))

# Plot confusion matrix
plt.figure(figsize=(10, 8))
cm = confusion_matrix(y_val_classes, y_pred_classes)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=languages, yticklabels=languages)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.show()

# Calculate per-class accuracy
class_accuracy = cm.diagonal() / cm.sum(axis=1)
print('Per-class Accuracy:')
for lang, acc in zip(languages, class_accuracy):
    print(f'{lang}: {acc:.2%}')

Understanding the Evaluation Metrics
The evaluation code performs several important analyses:
- Prediction Generation: We start by using our trained model to predict the language of each README in the validation set. Since the model outputs probabilities (from the softmax activation), we use np.argmax() to convert these to class indices for both predictions and ground truth labels.
- Classification Report: This comprehensive report from scikit-learn provides several metrics for each language:
- Precision: The percentage of positive predictions that were correct. High precision means the model rarely labels a README as a particular language when it isn’t.
- Recall: The percentage of actual positive cases that were correctly identified. High recall means the model rarely misses READMEs of a particular language.
- F1-score: The harmonic mean of precision and recall, providing a single metric that balances both concerns.
- Support: The number of occurrences of each language in the validation dataset.
- Confusion Matrix Visualization: This heatmap shows the relationship between true and predicted labels. Each cell (i,j) contains the count of examples from class i that were predicted as class j. The diagonal represents correct predictions, while off-diagonal cells show misclassifications:
- Perfect classification would show high numbers only on the diagonal
- Common confusion patterns appear as brighter off-diagonal cells
- The visualization helps identify which languages are most commonly confused with each other
- Per-class Accuracy: We calculate the accuracy for each language by dividing the number of correct predictions (diagonal values) by the total number of examples for that language. This gives us a clear picture of which languages our model recognizes most accurately.
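As a complement to the per-class accuracy above (which is computed over each true class, i.e. recall), you can also compute per-class precision from the confusion matrix columns. A short sketch using the cm and languages variables from the evaluation cell:

# Per-class precision: of all READMEs predicted as a language, how many were correct?
# (A column sum of zero means that language was never predicted and yields nan.)
class_precision = cm.diagonal() / cm.sum(axis=0)
print('Per-class Precision:')
for lang, prec in zip(languages, class_precision):
    print(f'{lang}: {prec:.2%}')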
This evaluation approach provides much more insight than a simple accuracy score. Looking at our initial evaluation, we can tell the model isn’t very accurate yet. Let’s finish walking through the code, then we’ll work on improving its accuracy.
8. Testing with Real Examples
Let’s test our model on some real repositories to see how it performs in practice. This step helps us validate that our classifier works on new data beyond our training and validation sets:
def predict_language(text):
    # Preprocess the text
    processed_text = preprocess_readme(text)
    # Convert to tf.data.Dataset
    input_ds = tf.data.Dataset.from_tensor_slices([processed_text]).batch(1)
    # Get prediction
    pred = model.predict(input_ds)[0]
    # Return probabilities for each language
    return {lang: float(prob) for lang, prob in zip(languages, pred)}

# Test with some popular repositories
test_repos = [
    ('tensorflow/tensorflow', 'Python'),
    ('vuejs/vue', 'JavaScript'),
    ('golang/go', 'Go'),
    ('rust-lang/rust', 'Rust'),
    ('spring-projects/spring-boot', 'Java')
]

for repo, expected_lang in test_repos:
    owner, name = repo.split('/')
    readme = collector.get_readme(owner, name)
    if readme:
        predictions = predict_language(readme)
        print(f'Repository: {repo}')
        print(f'Expected: {expected_lang}')
        print('Predictions:')
        for lang, prob in sorted(predictions.items(), key=lambda x: x[1], reverse=True)[:3]:
            print(f'{lang}: {prob:.2%}')

The output is further proof that we’re far from accurate at this point.
Understanding the Testing Function
This real-world testing approach consists of two main components:
- Prediction Helper Function: Our predict_language() function takes raw README text and:
  - Applies the same preprocessing steps we used for training
  - Creates a TensorFlow dataset with a batch size of 1
  - Gets predictions from our model
  - Returns a dictionary mapping each language to its prediction probability
- Test Set Definition: We define a list of well-known repositories across different programming languages to test our model:
- TensorFlow (primarily Python)
- Vue.js (primarily JavaScript)
- Go (the Go programming language)
- Rust (the Rust programming language)
- Spring Boot (primarily Java)
- Testing Loop: For each repository, we:
- Fetch its README content using our collector
- Make a prediction using our helper function
- Print the expected language
- Show the top 3 predicted languages with their probabilities
This approach gives us a qualitative assessment of how well our model performs on real-world examples that weren’t part of our training or validation datasets. It’s particularly interesting to see how the model handles repositories with mixed language content or those that discuss multiple programming languages in their documentation.
The output shows both the confidence of the model’s predictions (through probability scores) and whether it correctly identifies the primary language of each repository. If the model consistently performs well on these diverse examples, it suggests good generalization ability.
9. Saving and Deploying the Model
Finally, let’s save our model and vectorization layer for future use:
# Save the complete model
model.save('readme_classifier_model.keras')

# Save the vectorization configuration
import pickle
with open('vectorizer_config.pkl', 'wb') as f:
    pickle.dump({
        'config': vectorize_layer.get_config(),
        'weights': vectorize_layer.get_weights()
    }, f)

print('Model and vectorizer saved successfully!')

# Example of loading and using the saved model
print('Loading and testing saved model...')
loaded_model = tf.keras.models.load_model('readme_classifier_model.keras')

# Point predict_language() at the loaded model so we actually exercise the
# reloaded weights (the helper references the global `model` variable)
model = loaded_model

# Test with a simple Python code snippet
sample_text = 'import tensorflow as tf'
predictions = predict_language(sample_text)
print('Test prediction with loaded model:')
for lang, prob in sorted(predictions.items(), key=lambda x: x[1], reverse=True)[:3]:
    print(f'{lang}: {prob:.2%}')
Understanding Model Saving and Deployment
This final part of our pipeline focuses on making our model ready for production use:
- Saving the Complete Model: The model.save() function saves the entire model architecture, weights, and optimizer state to disk. With the .keras extension, TensorFlow writes the model in its native Keras format as a single file, which preserves everything needed to use the model later. This includes:
  - Model architecture (the layer structure we defined)
  - Weights (the learned parameters)
  - Optimizer state (helpful for resuming training)
  - Metadata about the model
- Preserving the Vectorization Layer: Since our vectorization layer is part of the model, it’s automatically saved with it. However, we also save its configuration and weights separately for flexibility. This allows us to:
  - Use the same vocabulary mapping in other models
  - Apply preprocessing separately if needed
  - Inspect the vocabulary that the model learned
- Loading and Testing: We immediately verify our saved model by:
  - Loading it back from disk using tf.keras.models.load_model()
  - Testing it with a simple Python code snippet
  - Displaying the prediction results to confirm functionality
The saved model can now be used in various deployment scenarios:
- Web applications using TensorFlow.js
- Mobile applications using TensorFlow Lite
- Server-side API endpoints using TensorFlow Serving
- Batch processing systems for analyzing multiple repositories
This approach to model saving ensures that all the preprocessing steps, vocabulary mapping, and model architecture stay together, which makes deployment much simpler. Anyone using this model can provide raw README text and get predictions without needing to implement the preprocessing steps separately.
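If you ever need to rebuild the standalone vectorization layer from the pickled configuration, a commonly used pattern looks roughly like the sketch below. It is not guaranteed across every Keras version; the throwaway adapt() call is only there to build the layer before its saved vocabulary weights are restored:

# Sketch: restore a standalone TextVectorization layer from the pickled config.
import pickle
import tensorflow as tf

with open('vectorizer_config.pkl', 'rb') as f:
    saved = pickle.load(f)

restored_layer = tf.keras.layers.TextVectorization.from_config(saved['config'])
# Build the layer's internal lookup table with a throwaway adapt() call,
# then overwrite it with the saved vocabulary weights.
restored_layer.adapt(tf.constant(['placeholder text']))
restored_layer.set_weights(saved['weights'])

print(restored_layer(tf.constant(['import tensorflow as tf'])))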
Improving Accuracy
Now that the notebook is complete, let’s work on improving accuracy. Since early stopping is ending training before we complete our full number of epochs, simply increasing the epoch count won’t help. Let’s first verify we’re not being impatient: increase the patience value and see if accuracy improves.

Well that changed things, but didn’t help. Let’s look at changing the batch size:

Again we get something different, and a little better this time, but still nowhere near the diagonal line we’re looking for.
If massaging the model isn’t sufficient, what if we use a larger training dataset? Let’s update our GitHub retrieval to 100 README files per language. By default, the GitHub search query returns 30 repositories; by appending per_page=100 to our query we get 100 repositories per language.
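The only change needed is the search URL inside get_popular_repos:

# Inside GitHubDataCollector.get_popular_repos, request 100 results per page:
url = f'{self.base_url}/search/repositories?q={query}&sort=stars&per_page=100'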

Now let’s see what we get after setting our patience and batch size back to the original values.

I’ll save you walking through the tuning. Set:
- epochs=20
- batch_size=16
- patience=6
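Applied to the training cell, those settings look like this (only the changed values are shown):

# Updated training parameters after tuning
epochs = 20
batch_size = 16

# Replace the EarlyStopping entry in the callbacks list with a more patient one:
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=6,
    restore_best_weights=True
)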

We’re getting close to the expected diagonal, and we’re starting to see some accurate predictions. While not perfect, this shows the feasibility of the approach and that increasing the training data increases accuracy.

Conclusion and Future Improvements
We’ve successfully built a deep learning system that can identify programming languages based on README content. This demonstrates how TensorFlow can be applied to technical documentation classification problems.
To improve this model further, you could:
- Collect more data: Increase the number and diversity of repositories
- Fine-tune the architecture: Experiment with different model architectures like Transformers
- Add more features: Incorporate repository metadata or code snippets
- Extend to more languages: Add support for additional programming languages
- Improve text preprocessing: Use more sophisticated NLP techniques like lemmatization
This project provides a foundation for document classification that can be extended to various technical documentation analysis tasks. The techniques we’ve covered—from data collection to model deployment—can be applied to many other NLP classification problems beyond README files.