<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://anna-christina-mikr.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://anna-christina-mikr.github.io/" rel="alternate" type="text/html" /><updated>2026-02-25T15:22:17+00:00</updated><id>https://anna-christina-mikr.github.io/feed.xml</id><title type="html">Anna Micros</title><subtitle>Data Science &amp; Analytics Portfolio</subtitle><author><name>Anna Micros</name></author><entry><title type="html">AI assistant chrome extension for Google Docs</title><link href="https://anna-christina-mikr.github.io/project/2026/02/02/AI_assistanct_chrome_extension.html" rel="alternate" type="text/html" title="AI assistant chrome extension for Google Docs" /><published>2026-02-02T00:00:00+00:00</published><updated>2026-02-02T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2026/02/02/AI_assistanct_chrome_extension</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2026/02/02/AI_assistanct_chrome_extension.html"><![CDATA[<p>Building a Grammarly-style AI writing assistant.</p>

<h3 id="motivation">Motivation</h3>

<p>I built this for personal use: my work involves writing up results for genomic research, and I often want real-time feedback on the Google Doc I am working on.
I also wanted the model to be context aware: since it is working on research documents, it needs to understand the terminology and the scope of the project.</p>

<p><strong>Source Code:</strong> <a href="https://github.com/anna-christina-mikr/chrome-extension">GitHub Repository</a></p>

<hr />]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="Agentic AI" /><category term="RAG" /><summary type="html"><![CDATA[Building a Grammarly-style AI writing assistant.]]></summary></entry><entry><title type="html">RAG Chatbot using OPENAI, FASTAPI and BM25 retrieval</title><link href="https://anna-christina-mikr.github.io/project/2025/12/02/GreekRAGChatbot.html" rel="alternate" type="text/html" title="RAG Chatbot using OPENAI, FASTAPI and BM25 retrieval" /><published>2025-12-02T00:00:00+00:00</published><updated>2025-12-02T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2025/12/02/GreekRAGChatbot</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2025/12/02/GreekRAGChatbot.html"><![CDATA[<h1 id="greek-rag-vocabulary-app">Greek RAG Vocabulary App</h1>

<p><strong>Live Demo:</strong> <a href="https://greek-rag-5.onrender.com/">Visit the App</a>
<strong>Source Code:</strong> <a href="https://github.com/anna-christina-mikr/greek-rag">GitHub Repository</a></p>

<p>This project is an interactive <strong>Greek vocabulary assistant</strong> built with <strong>FastAPI</strong> and a <strong>retrieval-augmented generation (RAG)</strong> approach. It leverages a Greek dictionary dataset and <strong>BM25 indexing</strong> to retrieve relevant definitions and examples, then uses <strong>OpenAI’s GPT-4.1-mini</strong> model to generate concise explanations and example sentences in Greek.</p>

<h2 id="key-features">Key Features</h2>
<ul>
  <li>Search for Greek words or phrases and get detailed explanations.</li>
  <li>Combines exact matches with BM25 similarity for robust retrieval.</li>
  <li>Lightweight web interface with a chat-style interaction.</li>
  <li>Fully deployable online via Render for live demonstration.</li>
</ul>
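The exact-match-plus-BM25 retrieval described above can be sketched with a hand-rolled BM25 scorer. This is a minimal stand-in for a retrieval library, not the app's actual code; the toy "dictionary entries" and variable names are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each term
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy stand-ins for tokenized dictionary entries; the real app indexes Greek definitions.
docs = [["kalimera", "good", "morning", "greeting"],
        ["kalinixta", "good", "night", "greeting"],
        ["efharisto", "thank", "you"]]
best = max(range(len(docs)), key=lambda i: bm25_scores(["good", "morning"], docs)[i])  # -> 0
```

The top-scoring entries (plus any exact matches) are then passed to the LLM as context for generating the explanation.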

<h2 id="tech-stack">Tech Stack</h2>
<p>Python, FastAPI, OpenAI API, BM25, Jinja2, Render</p>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="RAG" /><category term="data science" /><category term="chatbot" /><category term="AI" /><summary type="html"><![CDATA[Greek RAG Vocabulary App]]></summary></entry><entry><title type="html">Predicting Positive vs Negative Movie Reviews</title><link href="https://anna-christina-mikr.github.io/project/2024/05/26/IMDBSentiment.html" rel="alternate" type="text/html" title="Predicting Positive vs Negative Movie Reviews" /><published>2024-05-26T00:00:00+00:00</published><updated>2024-05-26T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2024/05/26/IMDBSentiment</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2024/05/26/IMDBSentiment.html"><![CDATA[<p>This is a step-by-step walk through of sentiment analysis on IMDb movie reviews. The goal is to develop[ a model that accurately predicts whether a reciew is positive or negative by uncovering laanguage patterns that drive sentiment. This is a <strong>Text Classification</strong> task.</p>

<h2 id="the-data">The Data</h2>

<p>The dataset consists of 50,000 film reviews scraped from IMDb, each labelled as either positive or negative. It can be found on <a href="https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews">Kaggle</a>. The goal is to develop an optimal model that accurately predicts sentiment and identifies patterns in review content.</p>

<hr />

<h2 id="data-processing">Data Processing</h2>

<p>We cleaned the text by:</p>
<ul>
  <li>Removing special characters, URLs, and HTML tags</li>
  <li>Converting text to lowercase</li>
  <li>Removing stopwords using NLTK</li>
  <li>Tokenizing text into unigrams, bigrams, and trigrams</li>
  <li>Adding <code class="language-plaintext highlighter-rouge">START</code> and <code class="language-plaintext highlighter-rouge">STOP</code> markers to preserve context</li>
</ul>
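The steps above can be sketched roughly as follows. The stopword set is abbreviated so the snippet is self-contained (the project used NLTK's full list), and the helper names are illustrative:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to"}  # abbreviated; NLTK's list is much larger

def clean(review):
    review = re.sub(r"<[^>]+>", " ", review)       # strip HTML tags
    review = re.sub(r"https?://\S+", " ", review)  # strip URLs
    review = re.sub(r"[^a-zA-Z\s]", " ", review)   # strip special characters
    tokens = [t for t in review.lower().split() if t not in STOPWORDS]
    return ["START"] + tokens + ["STOP"]           # markers to preserve context

def ngrams(tokens, n):
    """Slide a window of size n to build unigrams, bigrams, or trigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean("<br/>The movie is EXCELLENT! See https://imdb.com")
# -> ['START', 'movie', 'excellent', 'see', 'STOP']
```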

<p>These steps reduced noise and made the linguistic structure suitable for modeling.</p>

<hr />

<h2 id="exploratory-data-analysis">Exploratory Data Analysis</h2>

<p>Before modeling, we visualized word frequency patterns.<br />
Positive reviews frequently used terms like <em>“excellent”</em> and <em>“fun,”</em> while negative reviews often used <em>“worst”</em> and <em>“poor.”</em><br />
These clear distinctions highlighted the potential of text-based sentiment modeling.</p>
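Frequency patterns like these can be surfaced with a plain word counter before any modeling. The toy reviews below are illustrative; the actual analysis ran over the full corpus:

```python
from collections import Counter

positive = ["excellent fun excellent", "fun film excellent"]
negative = ["worst poor film", "worst plot poor"]

def top_words(reviews, n=3):
    """Count word occurrences across a list of (already cleaned) reviews."""
    return Counter(w for r in reviews for w in r.split()).most_common(n)

pos_top = top_words(positive)  # -> [('excellent', 3), ('fun', 2), ('film', 1)]
```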

<p><img height="200" alt="Screenshot 2025-11-06 at 1 28 55 PM" src="/images/imbd/wordcloud.png" /></p>

<hr />

<h2 id="methodology">Methodology</h2>
<p>In order to capture relationships in text, we need to transform it into embeddings. Embeddings are numerical representations of words; words with similar meanings have similar representations.</p>

<p>There are different types of text embedding methods, some more complex than others. This <a href="https://www.geeksforgeeks.org/nlp/word-embeddings-in-nlp/">schematic</a> captures the 3 general categories.</p>

<p><img width="1220" height="601" alt="Screenshot 2025-11-07 at 10 07 44 AM" src="https://github.com/user-attachments/assets/bc3f0bc5-7c91-48ed-8cfe-30f3e2a002f5" /></p>

<p>We evaluated the following:</p>
<ul>
  <li><strong>Bag of Words (BoW)</strong></li>
  <li><strong>TF-IDF</strong></li>
  <li><strong>GloVe</strong></li>
  <li><strong>BERT embeddings</strong></li>
</ul>

<p>Each embedding was used across several <strong>machine learning models</strong>: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, a Deep Neural Network (DNN), and a fine-tuned BERT Transformer model.</p>

<hr />

<h3 id="logistic-regression">Logistic Regression</h3>

<p>The <strong>TF-IDF Logistic Regression</strong> model performed best, achieving <strong>89.83% accuracy</strong> and <strong>AUC = 0.96</strong>, followed closely by BoW (89.4% accuracy).<br />
Regularization experiments showed that <strong>L2 penalties</strong> (Ridge regression) were consistently optimal, suggesting that allowing all features to contribute led to more accurate predictions.</p>
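A minimal scikit-learn sketch of this winning configuration. The toy reviews and exact hyperparameter values here are illustrative, not the project's tuned settings; the real model was trained on the full 50K dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the 50K labelled reviews
reviews = ["excellent fun film", "worst poor acting",
           "fun and excellent", "poor worst plot"]
labels = ["positive", "negative", "positive", "negative"]

# L2 penalty (Ridge-style) was consistently optimal in the experiments
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(penalty="l2", C=1.0))
model.fit(reviews, labels)
pred = model.predict(["excellent fun"])[0]
```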

<hr />

<h3 id="k-nearest-neighbors">K-Nearest Neighbors</h3>

<p>KNN models were optimized through grid search for distance metrics and weighting schemes.<br />
The best-performing configuration used <strong>TF-IDF</strong> embeddings with distance-based weighting, achieving <strong>79.92% accuracy</strong> and <strong>AUC = 0.88</strong>.<br />
Performance decreased significantly for GloVe embeddings (54.66% accuracy), indicating poorer feature representation.</p>

<hr />

<h3 id="random-forest">Random Forest</h3>

<p>Random Forest models captured complex word interactions.<br />
With <strong>TF-IDF embeddings</strong>, accuracy reached <strong>84.02%</strong>, while <strong>BERT embeddings</strong> followed closely at <strong>83.62%</strong>.<br />
BoW and GloVe lagged behind due to their limited ability to represent nuanced sentiment context.</p>

<hr />

<h3 id="deep-neural-network-dnn">Deep Neural Network (DNN)</h3>

<p>A 3-layer DNN trained with TF-IDF embeddings achieved <strong>87.23% accuracy</strong> and <strong>AUC = 0.95</strong>, outperforming simpler models but not BERT.<br />
The model used dropout regularization and ReLU activations to prevent overfitting.</p>

<hr />

<h3 id="fine-tuned-bert-transformer">Fine-tuned BERT Transformer</h3>

<p>The pre-trained <strong>BERT</strong> model achieved the highest overall performance, with <strong>91.74% accuracy</strong> and <strong>AUC = 0.97</strong>.<br />
Fine-tuning for two epochs with a learning rate of 2e-5 and batch size of 16 yielded highly contextual sentiment understanding.</p>

<hr />

<h2 id="results-summary">Results Summary</h2>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Model</th>
      <th style="text-align: left">Embedding</th>
      <th style="text-align: left">Accuracy</th>
      <th style="text-align: left">AUC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Logistic Regression</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">89.83%</td>
      <td style="text-align: left">0.96</td>
    </tr>
    <tr>
      <td style="text-align: left">KNN</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">79.92%</td>
      <td style="text-align: left">0.88</td>
    </tr>
    <tr>
      <td style="text-align: left">Random Forest</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">84.02%</td>
      <td style="text-align: left">0.91</td>
    </tr>
    <tr>
      <td style="text-align: left">DNN</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">87.23%</td>
      <td style="text-align: left">0.95</td>
    </tr>
    <tr>
      <td style="text-align: left">BERT Transformer</td>
      <td style="text-align: left">BERT</td>
      <td style="text-align: left"><strong>91.74%</strong></td>
      <td style="text-align: left"><strong>0.97</strong></td>
    </tr>
  </tbody>
</table>

<p>TF-IDF consistently yielded the most reliable results across traditional models, while BERT surpassed them with its ability to capture deeper contextual meaning.</p>

<hr />

<h2 id="conclusion--future-work">Conclusion &amp; Future Work</h2>

<p>This analysis demonstrates that while traditional models like Logistic Regression and DNN perform well with TF-IDF features, <strong>contextual embeddings like BERT</strong> significantly enhance accuracy for nuanced text.</p>

<p>Future improvements may include:</p>
<ul>
  <li>Comparing performance with <strong>XLNet</strong> or <strong>GPT-based</strong> models</li>
  <li><strong>Data augmentation</strong> via synonym replacement or paraphrasing</li>
  <li>Combining ensemble models to enhance robustness</li>
</ul>

<p>Overall, the project highlights how effective feature representation is central to sentiment prediction performance.</p>

<hr />

<p><strong>Repository:</strong> <a href="https://github.com/anna-christina-mikr/sentiment-analysis">GitHub Project Link</a>
<strong>Dataset:</strong> IMDb 50K Movie Reviews</p>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="nlp" /><category term="data science" /><category term="sentiment analysis" /><category term="text classification" /><summary type="html"><![CDATA[Using NLP methods to predict sentiment of IMDb movie reviews]]></summary></entry><entry><title type="html">What Drives the Conversation? Exploring Energy News Through Topic Modeling</title><link href="https://anna-christina-mikr.github.io/project/2024/05/26/FERCtopicModelling.html" rel="alternate" type="text/html" title="What Drives the Conversation? Exploring Energy News Through Topic Modeling" /><published>2024-05-26T00:00:00+00:00</published><updated>2024-05-26T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2024/05/26/FERCtopicModelling</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2024/05/26/FERCtopicModelling.html"><![CDATA[<style>
    .container {
        display: flex;
        justify-content: center;
        align-items: center;
    }
    img {
        margin: 10px;
        max-width: 130%;
        height: auto;
    }

</style>

<h3 id="context">Context</h3>

<p>Today marks the longest government shutdown in U.S. history.
This shutdown has hit energy hard, with soaring prices and delays on energy production projects. At the same time, there is a huge conversation around meeting the electricity demands of power-hungry AI by building more data centers, and the impact that has on local communities. Dozens of other conversations are happening in the energy space, and this is what my group was interested in exploring: how the recent change in political power has shifted these conversations.</p>

<p>The data for this project is almost 5,000 articles from the same time window in 2024 and 2025, spanning February to July. We applied standard text preprocessing: removing stopwords and punctuation, and tokenizing the data.</p>

<h3 id="extract-articles-from-txt-files">Extract articles from txt files.</h3>

<p>Each raw news text file was compiled by concatenating multiple energy articles scraped from the web on the same day. As a result, many of these files have inconsistent spacing and formatting, as well as relics of the web interface. Luckily I had a list of all article titles at hand that I could utilize, but they were not always an exact match to the text. In order to segment the articles accurately, I implemented fuzzy string matching (Levenshtein distance) to align detected titles with known article headers, successfully linking about 2,859 of 4,751 titles. After segmentation, articles underwent standard NLP preprocessing: tokenization, stopword and punctuation removal, and exclusion of filler journalistic terms (e.g., “said,” “reported”), resulting in a clean, comparable corpus suitable for topic modeling.</p>
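The title-alignment step can be sketched with the standard library's difflib ratio, a close cousin of normalized Levenshtein similarity (the real pipeline may have used a dedicated Levenshtein package; the threshold and sample titles below are illustrative):

```python
from difflib import SequenceMatcher

def best_title_match(detected, known_titles, threshold=0.8):
    """Return the known title most similar to a detected header, or None."""
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(known_titles, key=lambda t: ratio(detected, t))
    return best if ratio(detected, best) >= threshold else None

titles = ["FERC approves new transmission rule", "Data centers strain the grid"]
match = best_title_match("FERC Approves New Transmission Rule.", titles)
# matches despite casing and punctuation differences
```

Matched titles then serve as segmentation boundaries between articles within each day's file.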

<div class="container">
  <img src="/images/article_covers/trigram.png" alt="Trigram Frequency in news data" width="600" height="400" />
</div>

<p>Here some of the top word trigrams are visualized. This shows which three-word phrases typically appear in these articles, which helps contextualize our topic analysis later.</p>

<h3 id="using-lda-as-a-baseline">Using LDA as a baseline</h3>
<p>Latent Dirichlet Allocation (not to be confused with Linear Discriminant Analysis) is a topic modelling technique that uses Bayesian modelling to cluster words into topics. In other words, it assumes that given a corpus of words, each word can be allocated to a discrete set of topics. As you can imagine, the number of topics chosen is important; it affects how cohesive the topics are.</p>

<p>A standard method for topic number selection is to pick a key metric, plot its value over a range of candidate topic numbers, and see where the value plateaus, i.e. where adding topics offers no further benefit. I chose <strong>Topic Coherence</strong>, a measure of how similar words within a topic are.</p>
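The topic-number sweep can be illustrated with a coherence function. gensim's <code>CoherenceModel</code> computes C_V directly; the stdlib sketch below uses the simpler UMass variant (a log co-occurrence ratio) just to show the idea, on a made-up toy corpus of word sets:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass coherence: a co-occurrence score (a simpler relative of C_V).
    Values closer to 0 indicate a more coherent topic."""
    def doc_count(*words):
        return sum(all(w in d for w in words) for d in docs)
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += math.log((doc_count(wi, wj) + 1) / max(doc_count(wj), 1))
    return score

# Toy corpus: each document reduced to its set of tokens
docs = [{"grid", "transmission", "capacity"},
        {"grid", "transmission", "ferc"},
        {"tariff", "trade", "oil"}]
coherent = umass_coherence(["grid", "transmission"], docs)    # words co-occur
incoherent = umass_coherence(["grid", "oil"], docs)           # words never co-occur
```

In practice one would compute this for every candidate K and plot the average coherence across topics, as in the figure below.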

<div class="container">
  <img width="600" height="400" alt="Screenshot 2025-11-05 at 3 26 56 PM" src="https://github.com/user-attachments/assets/3c37cc77-97f5-4fde-a6f0-6626def89925" />
</div>

<p>Interestingly, the coherence score peaks at 12 topics and then steadily declines. The values are not dramatically different (a C_V score in the range of 0.5-0.59 is considered decent), but this is not the behavior we expect. The sharp decrease could be due to the fact that our corpus concerns energy news exclusively, so the topics discussed overlap heavily. In a different topic modelling scenario, e.g. modelling research papers, we would get very distinct topics such as History, Physics, Anthropology, and Political Science. Here, we reach a limit where no more specificity can be achieved: each additional topic is a redundant subtopic that offers no extra nuance. Our energy news corpus is lexically and conceptually dense, with few truly distinct themes; after 12 topics the model keeps slicing coherent topics into smaller, overlapping pieces, which coherence penalizes harshly.</p>

<p>Another way to evaluate our choice of topic number, however, is to inspect the topics visually:</p>

<iframe src="/images/ferc/lda-viz.html" width="200%" height="700" style="border:none; border-radius:10px;">
</iframe>

<p>This approach uncovers another aspect of LDA: topics may each be internally coherent, yielding a high coherence score, yet still overlap heavily with one another. This graph comes from the <strong>pyLDAvis</strong> package (which integrates with gensim); it’s a great tool to visualize topic similarity!
It uses the relevance of a word <em>w</em> to a topic <em>t</em>, calculated as:</p>

\[r(w, t \mid \lambda) = \lambda \, p(w \mid t) + (1 - \lambda) \, \frac{p(w \mid t)}{p(w)}\]

<p>Here, λ (<em>lambda</em>) controls the balance between how <strong>frequent</strong> and how <strong>distinctive</strong> a term is:</p>

<ul>
  <li>λ = 1 → ranks words by how common they are within the topic (general terms).</li>
  <li>λ = 0 → ranks words by how unique they are to that topic (specific terms).</li>
  <li>λ ≈ 0.6 (default) → balances both views for interpretability.</li>
</ul>
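The relevance formula is simple enough to compute directly; the probabilities below are made-up illustrations, not values from the fitted model:

```python
def relevance(p_w_given_t, p_w, lam=0.6):
    """pyLDAvis term relevance: lam * p(w|t) + (1 - lam) * p(w|t) / p(w)."""
    return lam * p_w_given_t + (1 - lam) * p_w_given_t / p_w

# At lam = 0, ranking reduces to "lift": p(w|t) / p(w).
common = relevance(p_w_given_t=0.02, p_w=0.02, lam=0.0)     # frequent everywhere -> lift 1
specific = relevance(p_w_given_t=0.02, p_w=0.001, lam=0.0)  # rare outside topic -> lift 20
```

At λ = 1 the same word in both cases would rank identically (both have p(w|t) = 0.02), which is exactly the frequency-vs-distinctiveness trade-off the slider controls.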

<p>We see here that a 12-topic LDA is insufficient: there is a lot of topic overlap; topic 4 and topic 5 are virtually the same, covering renewable energy projects and climate-change-related words. So, in order to balance high coherence with topic uniqueness, I chose 9 topics and plotted this instead:</p>

<iframe src="/images/ferc/ldaviz9.html" width="200%" height="750" style="border:none; border-radius:10px;">
</iframe>

<p>With nine topics we get much better separated clusters, so we are sticking with 9. Here are some words from the topics:</p>

<table>
  <thead>
    <tr>
      <th><strong>Topic ID</strong></th>
      <th><strong>Top Terms</strong></th>
      <th><strong>Likely Theme / Interpretation</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>0</strong></td>
      <td>bill, house, republican, senate, tax, trump, committee, credit, democrat, budget</td>
      <td>Federal policy and legislation</td>
    </tr>
    <tr>
      <td><strong>1</strong></td>
      <td>data, plant, demand, coal, nuclear, center, price, july, generation, reactor</td>
      <td>Data centers, power generation, and nuclear demand</td>
    </tr>
    <tr>
      <td><strong>2</strong></td>
      <td>agency, rule, court, epa, environmental, trump, ferc, climate, order, law</td>
      <td>Regulation, courts, and environmental policy</td>
    </tr>
    <tr>
      <td><strong>3</strong></td>
      <td>ferc, transmission, order, utility, rate, miso, market, proposal, pjm, process</td>
      <td>Transmission and FERC regulatory orders</td>
    </tr>
    <tr>
      <td><strong>4</strong></td>
      <td>dam, water, river, pipeline, county, safety, area, lake, land, million</td>
      <td>Infrastructure, water resources, and safety</td>
    </tr>
    <tr>
      <td><strong>5</strong></td>
      <td>capacity, market, gw, mw, pjm, demand, load, price, generation, solar</td>
      <td>Grid capacity, markets, and generation</td>
    </tr>
    <tr>
      <td><strong>6</strong></td>
      <td>utility, transmission, electricity, solar, demand, line, system, wind, data, center</td>
      <td>Utilities, electrification, and renewables</td>
    </tr>
    <tr>
      <td><strong>7</strong></td>
      <td>lng, export, pipeline, wind, facility, global, offshore, million, terminal, permit</td>
      <td>LNG exports and offshore energy infrastructure</td>
    </tr>
    <tr>
      <td><strong>8</strong></td>
      <td>tariff, trump, lng, market, trade, oil, april, global, supply, import</td>
      <td>Trade, tariffs, and global energy markets</td>
    </tr>
  </tbody>
</table>

<p>What stands out in these topics is how the vocabulary of energy policy reflects both technical and political worlds.
The model doesn’t just surface buzzwords, it sketches the infrastructure of how the U.S. talks about power itself.</p>

<p>On one side, we see the hard engineering lexicon: words like transmission, capacity, load, generation, pipeline, and LNG anchor discussions in the physical systems that move electrons and fuel.
These are the operational conversations, the ones about keeping the grid stable, expanding transmission lines, or meeting demand from new data centers.</p>

<p>On the other side sits the language of governance and law: bill, house, committee, court, rule, EPA, order.
This vocabulary reveals that energy is never just a technical challenge; it’s also a legislative and judicial process.
Policy and legal terms intermingle with industry keywords, reminding us that decisions about pipelines or nuclear reactors are made as much in congressional committees and courtrooms as in control rooms.</p>

<p>That overlap — between ferc, trump, epa, rule, and market — captures a particularly modern dynamic:
the way energy debates intersect with climate regulation, trade policy, and even global supply chains.
In other words, the “energy transition” isn’t just happening in the grid — it’s happening in the rhetoric that shapes how we understand and argue about it.</p>

<p>Beyond static interpretation, these topics also open the door to temporal comparison.</p>

<iframe src="https://public.tableau.com/views/newsarticletopicdistribution&#47;Dashboard1?:showVizHome=no&amp;:embed=true" width="200%" height="800" style="border:none; border-radius:10px;">
</iframe>

<p>The most prominent theme, our federal and legislative energy policy topic, shows a noticeable decline in frequency over time, dropping from nearly 0.20 to about 0.15 in share. This pattern makes sense when viewed in the context of the 2025 administration change. In the months leading up to January, media attention was dominated by speculation around how new leadership might reshape national energy strategy. After the transition, however, discussion slowed, reflecting a pause as the administration’s policies began to take shape.</p>

<p>Meanwhile, coverage related to data centers and nuclear generation increased significantly between 2024 and 2025. This rise closely follows the surge in AI adoption and the corresponding spike in electricity demand. As questions about how to power the next wave of AI dominated headlines, nuclear energy began to reemerge as a central part of that conversation. With the new administration openly backing expanded nuclear production and new projects nationwide, the media narrative shifted toward emphasizing capacity growth, energy security, and technological reliability.</p>

<p>In contrast, topics focused on environmental policy declined sharply. This suggests that climate and sustainability have taken a quieter role in the national energy dialogue, likely displaced by more immediate economic and industrial priorities. Similarly, mentions of utilities, renewables, LNG, and offshore development all decreased, reflecting a broader pivot away from decarbonization and toward supply-side concerns such as grid stability and generation capacity.</p>

<p>Other areas, however, have become more prominent. Coverage surrounding FERC orders rose in intensity, underscoring the growing regulatory attention to managing grid reliability and adapting to new forms of generation. Water infrastructure also gained visibility, likely reflecting growing concern about the resource strain that large-scale data centers and energy production place on local ecosystems.</p>

<p>Finally, discussions of tariffs and trade policy saw renewed activity, particularly around Trump-era tariffs and their implications for energy technology imports and domestic production.</p>

<p>Taken together, these shifts tell a clear story: the public and media conversation has turned from long-term environmental ambitions toward short-term energy security and infrastructure resilience. The period from 2024 to 2025 marks not just a political transition, but a fundamental reframing of the energy narrative — one now centered on power capacity, regulation, and the balance between technological ambition and sustainable supply.</p>

<h3 id="final-thoughts">Final Thoughts</h3>

<p>This analysis is still in progress; some next steps will include:</p>
<ul>
  <li>Performing sentiment analysis: There are a lot of overlapping topics in 2024 and 2025, it would be interesting to see if they are spoken about in a different light.</li>
  <li>Creating a web app: a dashboard with more information about the energy space for a broader audience, potentially including a RAG chat bot that answers frequent energy related questions.</li>
</ul>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="topic modelling" /><category term="data science" /><category term="analysis" /><summary type="html"><![CDATA[How does the media shape the conversation around energy?]]></summary></entry><entry><title type="html">US Education Trends</title><link href="https://anna-christina-mikr.github.io/project/2024/03/15/USEducation.html" rel="alternate" type="text/html" title="US Education Trends" /><published>2024-03-15T00:00:00+00:00</published><updated>2024-03-15T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2024/03/15/USEducation</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2024/03/15/USEducation.html"><![CDATA[]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="education" /><category term="data-visualization" /><category term="R" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Meteor Observation by Forward Scatter Radar</title><link href="https://anna-christina-mikr.github.io/project/2023/04/23/MeteorDetection.html" rel="alternate" type="text/html" title="Meteor Observation by Forward Scatter Radar" /><published>2023-04-23T00:00:00+00:00</published><updated>2023-04-23T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2023/04/23/MeteorDetection</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2023/04/23/MeteorDetection.html"><![CDATA[<h3 id="overview">Overview</h3>
<p>This project explores how meteors can be detected using <strong>forward scatter radar</strong>, a method that captures radio reflections from ionized trails left by meteors entering Earth’s atmosphere. The system was built using a <strong>Yagi antenna</strong> and an <strong>SDR (Software Defined Radio)</strong> receiver tuned to the GRAVES transmitter frequency at 143.05 MHz.</p>

<h3 id="motivation">Motivation</h3>
<p>After radar’s use in World War II, researchers discovered that meteors produced “false echoes” on radar screens. These reflections inspired a new way of observing meteor activity, by detecting their ionized trails through radio signals rather than optical telescopes. The project aimed to recreate this phenomenon with affordable, open-source tools.</p>

<p>This was my final project for my bachelor’s degree in Physics back in 2023! I really enjoyed it and wanted to revisit it; at the time I did not have much coding experience, but I think this project could be streamlined using machine learning. Although sadly I no longer have the data, I want to go through how it could have been improved had computer vision been incorporated. But first, here are the design and results:</p>

<hr />

<h3 id="system-design">System Design</h3>

<h4 id="antenna-construction">Antenna Construction</h4>
<p>The antenna used was a <strong>Yagi-Uda</strong> design optimized for 143.05 MHz (λ = 2.097 m). Its copper elements were cut and mounted on a wooden boom to ensure non-conductivity and stability.<br />
The setup included:</p>
<ul>
  <li><strong>Reflector</strong>, <strong>Dipole</strong>, and <strong>Director</strong> elements to enhance forward gain.</li>
  <li><strong>Coaxial connection</strong> to the SDR dongle for signal capture.</li>
  <li>
    <p>Orientation toward <strong>GRAVES</strong> in France, with calculated elevation for optimal reflection angles.</p>

    <p><img height="700" alt="Screenshot 2025-10-28 at 1 04 07 PM" src="https://github.com/user-attachments/assets/18bf384b-0338-4543-956d-3fe1200b7157" /></p>
  </li>
</ul>

<h4 id="software-setup">Software Setup</h4>
<p>Signal processing was performed in <strong>SDRsharp</strong>, displaying intensity over time as a <em>waterfall plot</em>. Automated screen captures every 15 seconds enabled continuous observation from November 5–17. Meteor echoes were later classified manually as <strong>underdense</strong> or <strong>overdense</strong> based on signal shape and duration.</p>

<hr />

<h3 id="results">Results</h3>
<p>This is what the signal looks like. The measurements are centered around the GRAVES frequency. These waterfall plots show the signal evolution over time, which allows us to extract useful information about the meteor: its composition, velocity, etc.
<img width="500" alt="meteor scatter" src="/images/meteor/waterfall1.png" /></p>

<h4 id="echo-classification">Echo Classification</h4>
<ul>
  <li><strong>Underdense trails</strong>: Faint, short-lived reflections (&lt;1 s), often appearing as thin horizontal lines on the waterfall diagram.</li>
  <li><strong>Overdense trails</strong>: Brighter and longer-lasting (up to 2 s), with diffraction-like signal fluctuations due to interference between reflection points.
Here is an example of an underdense vs. an overdense signal:</li>
</ul>

<p float="left">
  <img src="/images/meteor/waterfall2.png" alt="meteor scatter" width="300" />
  <img src="/images/meteor/waterfall3.png" alt="meteor scatter" width="300" />
</p>

<p>Head echoes — reflections from the plasma surrounding the meteor head — were occasionally detected, allowing direct velocity calculations. A Doppler shift of 800 Hz, for example, corresponded to a <strong>radial velocity of 838.8 m/s</strong>.</p>
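The quoted velocity follows from the Doppler relation v = f_d·λ/2 (treating the geometry as monostatic, which is the approximation the numbers imply; λ is the GRAVES wavelength from the antenna section). A quick check:

```python
c = 3.0e8             # speed of light, m/s (approximate)
f_graves = 143.05e6   # GRAVES transmitter frequency, Hz

wavelength = c / f_graves               # ~2.097 m, matching the Yagi design
f_doppler = 800.0                       # observed head-echo shift, Hz
v_radial = f_doppler * wavelength / 2   # ~838.8 m/s, as reported above
```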

<h4 id="detection-patterns">Detection Patterns</h4>
<p>Detection rates peaked around <strong>November 11–12</strong>, coinciding with the <strong>Taurid meteor shower</strong>. Most echoes appeared in the early morning hours, consistent with the Earth’s rotation — when the observation region faces the orbital direction, increasing meteor entry rates.</p>

<h3 id="hourly-rates">Hourly Rates</h3>
<p><img width="1389" height="590" alt="download" src="https://github.com/user-attachments/assets/c18c72c2-3d2a-4e67-9ba7-14b1aa3d072a" /></p>

<hr />

<h3 id="discussion">Discussion</h3>
<p>The results confirm that forward scatter radar can successfully detect meteor trails with relatively simple equipment.<br />
Signal characteristics matched theoretical expectations, with observed Doppler shifts corresponding to typical meteor velocities (11–72 km/s). Limitations were mainly due to SDRsharp’s resolution and manual analysis — future improvements could include <strong>automated image detection</strong> and <strong>machine learning–based signal classification</strong>.</p>

<hr />

<h3 id="conclusion">Conclusion</h3>
<p>This experiment demonstrates that meaningful meteor observations are achievable using low-cost SDR hardware. Beyond its technical outcomes, it shows how data-driven methods can reveal invisible celestial phenomena — translating radio noise into evidence of high-speed interplanetary travel.</p>

<hr />

<p><strong>Keywords:</strong> SDR, Radio Astronomy, Meteor Detection, Doppler Shift, Antenna Design<br />
<strong>Tools:</strong> SDRsharp, Auto Screen Capture, Python (for data post-processing)</p>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="physics" /><category term="signal-processing" /><category term="sdr" /><category term="radar" /><summary type="html"><![CDATA[A hands-on exploration of meteor detection using a custom-built Yagi antenna and a software-defined radio system.]]></summary></entry></feed>