    How to Perform Basic Text Analysis without a Training Dataset


    In this modern digital era, a large amount of information is generated every second. Much of the data humans generate through WhatsApp messages, tweets, blogs, news articles, product recommendations, and reviews is unstructured. We need to first convert this highly unstructured data into a structured and normalized form to gain useful insights.

    Natural language processing (NLP) is a branch of Artificial Intelligence that applies a series of processes to this unstructured data to extract meaningful insights. Language processing is inherently ambiguous: the same text can have different interpretations, and a reading that suits one person may not suit another. Moreover, colloquial language, acronyms, hashtags with attached words, and emoticons all pose challenges for preprocessing.

    Introduction to NLP

    Natural language processing is the subfield of Artificial Intelligence that comprises systematic processes for converting unstructured data into meaningful information and extracting useful insights from it. NLP is further classified into two broad categories: Rule-based NLP and Statistical NLP.

    Rule-based NLP uses hand-crafted rules and basic reasoning for processing tasks; it requires manual effort but little or no training data. Statistical NLP, on the other hand, trains on large amounts of data and learns patterns from it, typically using Machine Learning algorithms. In this article, we will focus on Rule-based NLP.

    Applications of NLP:

    1. Text Summarization
    2. Machine Translation
    3. Question and Answering Systems
    4. Spell Checks
    5. AutoComplete
    6. Sentiment Analysis
    7. Speech Recognition
    8. Topic Segmentation

    NLP Pipeline:

    The NLP pipeline is divided into five sub-tasks:

    1. Lexical Analysis: Lexical analysis is the process of analyzing the structure of the words and phrases present in the text. A lexeme is the smallest identifiable unit of the text; it could be a word, a phrase, etc. This stage involves identifying and dividing the whole text into sentences, paragraphs, and words (a short spaCy sketch after this list illustrates the first stages).
    2. Syntactic Analysis: Syntactic analysis is the process of analyzing how words are arranged and how they relate to one another, checking the text against the rules of grammar. For example, the sentence “The college goes to the girl.” is rejected by the syntactic analyzer.
    3. Semantic Analysis: Semantic analysis is the process of analyzing the text for meaningfulness. It maps syntactic structures onto objects in the task domain. For example, the sentence “He wants to eat hot ice cream” is rejected by the semantic analyzer.
    4. Discourse Integration: Discourse integration is the process of studying the context of the text. Sentences are organized in a meaningful order to form a paragraph, so the meaning of a sentence often depends on the sentences that come before it, and the sentence that follows depends on it in turn.
    5. Pragmatic Analysis: Pragmatic analysis is the process of confirming that the meaning derived from the text matches what the text actually intended.
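
    As a rough sketch of the first few stages, the snippet below uses spaCy (installed and introduced later in this article) to split a short text into sentences and tokens (lexical analysis) and to inspect each token’s part-of-speech tag and syntactic dependency (syntactic analysis). The sample text is made up purely for illustration.

    import spacy

    # Load spaCy's small English model (installation is covered below)
    nlp = spacy.load('en_core_web_sm')

    # A made-up sample text, used only for illustration
    doc = nlp("The girl goes to the college. She wants to eat ice cream.")

    # Lexical analysis: split the text into sentences and tokens
    print([sent.text for sent in doc.sents])
    print([token.text for token in doc])

    # Syntactic analysis: part-of-speech tags and dependency relations
    for token in doc:
        print(token.text, token.pos_, token.dep_)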

    Reading the Text File

    filename = "C:\Users\Dell\Desktop\example.txt" 
    text = open(filename, "r").read()

    Printing the Text

    print(text)

    Installing the Library for NLP

    We will use the spaCy library for this tutorial. spaCy is an open-source software library for advanced NLP written in the programming languages Python and Cython. The library is published under an MIT license. Unlike NLTK, which is widely used for teaching and research, spaCy focuses on providing software for production usage. spaCy also supports deep learning workflows that allow connecting statistical models trained by popular machine learning libraries like TensorFlow and PyTorch through its own machine learning library Thinc.

    pip install -U pip setuptools wheel 
    pip install -U spacy

    Since we are dealing with the English language, we need to install the en_core_web_sm package for it.

    python -m spacy download en_core_web_sm

    Verifying the Download and Importing the spaCy Package

    import spacy
    nlp = spacy.load('en_core_web_sm')

    After successfully creating the NLP object, we can move on to preprocessing.

    Tokenization

    Tokenization is the process of splitting the entire text into smaller units known as tokens, typically words and punctuation marks. This is the first step in any NLP process: it divides the text into meaningful units.

    text_doc = nlp(text)
    print([token.text for token in text_doc])

    As we can observe, many of the tokens, such as whitespace, commas, and stopwords, are of no use from an analysis perspective.

    Sentence Identification

    Identifying the sentences in the text is useful when we want to work with meaningful parts of the text that occur together, so the next step is to split the text into sentences.

    sentences = list(text_doc.sents)
    print(sentences)

    Stopwords Removal

    Stopwords are words that appear very frequently in a language, for example “the”, “an”, “a”, “or”, and “and”. They carry little meaning for text analysis and they skew frequency distribution analysis, so they must be removed to get a clearer picture of the text.

    normalized_text = [token for token in text_doc if not token.is_stop]
    print(normalized_text)

    Punctuation Removal

    As we can see from the above output, there are punctuation marks that are of no use to us. So let us remove them.

    clean_text = [token for token in normalized_text if not token.is_punct]
    print(clean_text)

    Lemmatization

    Lemmatization is the process of reducing a word to its base form. A lemma is the canonical form that represents a group of related word forms. For example: participates, participating, and participated all get reduced to the common lemma, i.e., participate.

    for token in clean_text:
        print(token, token.lemma_)

    Word Frequency Count

    Let us now perform some statistical analysis on the text. We will find the top ten words according to their frequencies in the text.

    from collections import Counter 
    words = [token.text for token in clean_text if not token.is_stop and not token.is_punct]
    word_freq = Counter(words)
    # 10 commonly occurring words with their frequencies
    common_words = word_freq.most_common(10)
    print(common_words)

    Sentiment Analysis

    Sentiment Analysis is the process of analyzing the sentiment of the text. One way of doing this is through the polarity of words, whether they are positive or negative.

    VADER (Valence Aware Dictionary and Sentiment Reasoner) is a lexicon- and rule-based sentiment analysis library in Python. It relies on a sentiment lexicon: a list of words, each assigned a polarity (positive, negative, or neutral) according to its semantic meaning.

    For example:

    1. Words like “good”, “great”, “awesome”, “fantastic” are of positive polarity.
    2. Words like “bad”, “worse”, “pathetic” are of negative polarity.

    The VADER sentiment analyzer computes the proportion of the text that falls into each polarity and returns a score for each. The positive, negative, and neutral scores range from 0 to 1 and can be read as percentages, while the compound score ranges from -1 (most negative) to +1 (most positive). So the analyzer not only tells whether a text is positive or negative, but also how positive or negative it is.

    Let us first download the package using pip.

    pip install vaderSentiment

    Then analyze the sentiment scores.

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
    analyzer = SentimentIntensityAnalyzer()
    vs = analyzer.polarity_scores(text)
    print(vs)
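
    The dictionary printed above contains the pos, neg, and neu proportions along with a normalized compound score. As a minimal sketch of how it can be interpreted, the convention suggested in the VADER documentation is to treat a compound score of at least 0.05 as positive, at most -0.05 as negative, and anything in between as neutral; these cut-offs are a convention, not a requirement.

    # Sketch: derive an overall label from the compound score.
    # The +/-0.05 cut-offs follow the convention in the VADER documentation
    # and can be tuned for your own data.
    compound = vs['compound']
    if compound >= 0.05:
        overall = 'Positive'
    elif compound <= -0.05:
        overall = 'Negative'
    else:
        overall = 'Neutral'
    print(f"Overall sentiment: {overall} (compound = {compound:.3f})")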

    Text Analyzer Dashboard

    The above steps can be combined to create a dashboard for the text analyzer. It includes the following key components:

    1. Word Count: The total number of words in the text.
    2. Character Count: The total number of characters in the text.
    3. Numeric Count: The number of numeric characters present in the text.
    4. Top N-Words: The top N most frequently occurring words in the text, along with their frequencies.
    5. Intent of the Text: The overall intent or theme of the text, based on the most common words and phrases.
    6. Overall Sentiment: The general sentiment of the text, classified as positive, negative, or neutral.
    7. Positive Sentiment Score: The percentage of positive sentiment in the text.
    8. Negative Sentiment Score: The percentage of negative sentiment in the text.
    9. Neutral Sentiment Score: The percentage of neutral sentiment in the text.
    10. Sentiment-Wise Word Count: The number of words classified as positive, negative, or neutral based on the sentiment analysis.

    Here’s an example implementation of the Text Analyzer Dashboard using Python:

    import spacy
    from collections import Counter
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    
    # Load the spaCy model
    nlp = spacy.load('en_core_web_sm')
    
    # Read the text file
    with open('example.txt', 'r') as file:
        text = file.read()
    
    # Create the text document
    text_doc = nlp(text)
    
    # Tokenization and stopwords removal
    normalized_text = [token for token in text_doc if not token.is_stop]
    
    # Punctuation removal
    clean_text = [token for token in normalized_text if not token.is_punct]
    
    # Word frequency count
    words = [token.text for token in clean_text if not token.is_stop and not token.is_punct]
    word_freq = Counter(words)
    top_words = word_freq.most_common(10)
    
    # Sentiment analysis
    analyzer = SentimentIntensityAnalyzer()
    sentiment_scores = analyzer.polarity_scores(text)
    
    # Dashboard
    print("Text Analyzer Dashboard:")
    print(f"Word Count: {len(words)}")
    print(f"Character Count: {len(text)}")
    print(f"Numeric Count: {sum(1 for char in text if char.isdigit())}")
    print("Top 10 Words:")
    for word, count in top_words:
        print(f"{word}: {count}")
    print(f"Overall Sentiment: {'Positive' if sentiment_scores['pos'] > sentiment_scores['neg'] else 'Negative' if sentiment_scores['neg'] > sentiment_scores['pos'] else 'Neutral'}")
    print(f"Positive Sentiment Score: {sentiment_scores['pos']*100:.2f}%")
    print(f"Negative Sentiment Score: {sentiment_scores['neg']*100:.2f}%")
    print(f"Neutral Sentiment Score: {sentiment_scores['neu']*100:.2f}%")
    
    # Sentiment-wise word count
    pos_words = [word for word in words if analyzer.polarity_scores(word)['pos'] > analyzer.polarity_scores(word)['neg']]
    neg_words = [word for word in words if analyzer.polarity_scores(word)['neg'] > analyzer.polarity_scores(word)['pos']]
    neu_words = [word for word in words if analyzer.polarity_scores(word)['pos'] == analyzer.polarity_scores(word)['neg']]
    print(f"Positive Word Count: {len(pos_words)}")
    print(f"Negative Word Count: {len(neg_words)}")
    print(f"Neutral Word Count: {len(neu_words)}")

    This dashboard provides a comprehensive overview of the text analysis, including the key statistics, word frequencies, and sentiment analysis. You can further customize and extend this dashboard to suit your specific needs, such as adding visualizations, exporting the data, or integrating it into a larger application.
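
    As one possible extension, the computed statistics can be collected into a dictionary and written to a JSON file so that other tools (or a visualization layer) can consume them. This is only a sketch: the field names and the output path dashboard.json are illustrative choices, and the variables refer to the dashboard example above.

    import json

    # Collect the previously computed statistics into a single dictionary
    dashboard = {
        "word_count": len(words),
        "character_count": len(text),
        "numeric_count": sum(1 for char in text if char.isdigit()),
        "top_words": dict(top_words),
        "positive_score": sentiment_scores['pos'],
        "negative_score": sentiment_scores['neg'],
        "neutral_score": sentiment_scores['neu'],
    }

    # Write the dashboard data to a JSON file (illustrative filename)
    with open("dashboard.json", "w") as f:
        json.dump(dashboard, f, indent=2)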

    Remember, this is a basic example of text analysis without a training dataset. For more advanced use cases, you may need to explore machine learning-based approaches or utilize pre-trained models.
