<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://anna-christina-mikr.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://anna-christina-mikr.github.io/" rel="alternate" type="text/html" /><updated>2026-02-25T15:22:17+00:00</updated><id>https://anna-christina-mikr.github.io/feed.xml</id><title type="html">Anna Micros</title><subtitle>Data Science &amp; Analytics Portfolio</subtitle><author><name>Anna Micros</name></author><entry><title type="html">AI assistant chrome extension for Google Docs</title><link href="https://anna-christina-mikr.github.io/project/2026/02/02/AI_assistanct_chrome_extension.html" rel="alternate" type="text/html" title="AI assistant chrome extension for Google Docs" /><published>2026-02-02T00:00:00+00:00</published><updated>2026-02-02T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2026/02/02/AI_assistanct_chrome_extension</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2026/02/02/AI_assistanct_chrome_extension.html"><![CDATA[<p>Building a Grammarly-style AI writing assistant.</p>

<h3 id="motivation">Motivation</h3>

<p>I built this for personal use: my work involves writing up results for genomic research, and I often want real-time feedback on the Google Doc I am working on.
I also wanted the model to be context aware: since it is working on research documents, it needs to understand the terminology and the scope of the project.</p>

<p><strong>Source Code:</strong> <a href="https://github.com/anna-christina-mikr/chrome-extension">GitHub Repository</a></p>

<hr />]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="Agentic AI" /><category term="RAG" /><summary type="html"><![CDATA[Building a Grammarly-style AI writing assistant.]]></summary></entry><entry><title type="html">RAG Chatbot using OPENAI, FASTAPI and BM25 retrieval</title><link href="https://anna-christina-mikr.github.io/project/2025/12/02/GreekRAGChatbot.html" rel="alternate" type="text/html" title="RAG Chatbot using OPENAI, FASTAPI and BM25 retrieval" /><published>2025-12-02T00:00:00+00:00</published><updated>2025-12-02T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2025/12/02/GreekRAGChatbot</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2025/12/02/GreekRAGChatbot.html"><![CDATA[<h1 id="greek-rag-vocabulary-app">Greek RAG Vocabulary App</h1>

<p><strong>Live Demo:</strong> <a href="https://greek-rag-5.onrender.com/">Visit the App</a>
<strong>Source Code:</strong> <a href="https://github.com/anna-christina-mikr/greek-rag">GitHub Repository</a></p>

<p>This project is an interactive <strong>Greek vocabulary assistant</strong> built with <strong>FastAPI</strong> and a <strong>retrieval-augmented generation (RAG)</strong> approach. It leverages a Greek dictionary dataset and <strong>BM25 indexing</strong> to retrieve relevant definitions and examples, then uses <strong>OpenAI’s GPT-4.1-mini</strong> model to generate concise explanations and example sentences in Greek.</p>

<h2 id="key-features">Key Features</h2>
<ul>
  <li>Search for Greek words or phrases and get detailed explanations.</li>
  <li>Combines exact matches with BM25 similarity for robust retrieval.</li>
  <li>Lightweight web interface with a chat-style interaction.</li>
  <li>Fully deployable online via Render for live demonstration.</li>
</ul>
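The exact-match-plus-BM25 retrieval described above can be sketched with a hand-rolled BM25 scorer. This is a minimal stand-in for a retrieval library, not the app's actual code; the toy "dictionary entries" and variable names are illustrative:

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency of each term
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

# Toy stand-ins for tokenized dictionary entries; the real app indexes Greek definitions.
docs = [["kalimera", "good", "morning", "greeting"],
        ["kalinixta", "good", "night", "greeting"],
        ["efharisto", "thank", "you"]]
best = max(range(len(docs)), key=lambda i: bm25_scores(["good", "morning"], docs)[i])  # -> 0
```

The top-scoring entries (plus any exact matches) are then passed to the LLM as context for generating the explanation.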

<h2 id="tech-stack">Tech Stack</h2>
<p>Python, FastAPI, OpenAI API, BM25, Jinja2, Render</p>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="RAG" /><category term="data science" /><category term="chatbot" /><category term="AI" /><summary type="html"><![CDATA[Greek RAG Vocabulary App]]></summary></entry><entry><title type="html">Predicting Positive vs Negative Movie Reviews</title><link href="https://anna-christina-mikr.github.io/project/2024/05/26/IMDBSentiment.html" rel="alternate" type="text/html" title="Predicting Positive vs Negative Movie Reviews" /><published>2024-05-26T00:00:00+00:00</published><updated>2024-05-26T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2024/05/26/IMDBSentiment</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2024/05/26/IMDBSentiment.html"><![CDATA[<p>This is a step-by-step walk through of sentiment analysis on IMDb movie reviews. The goal is to develop[ a model that accurately predicts whether a reciew is positive or negative by uncovering laanguage patterns that drive sentiment. This is a <strong>Text Classification</strong> task.</p>

<h2 id="the-data">The Data</h2>

<p>The dataset consists of 50,000 film reviews scraped from IMDb, each labelled as either positive or negative. It can be found on <a href="https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews">Kaggle</a>. The goal is to develop an optimal model that accurately predicts sentiment and identifies patterns in review content.</p>

<hr />

<h2 id="data-processing">Data Processing</h2>

<p>We cleaned the text by:</p>
<ul>
  <li>Removing special characters, URLs, and HTML tags</li>
  <li>Converting text to lowercase</li>
  <li>Removing stopwords using NLTK</li>
  <li>Tokenizing text into unigrams, bigrams, and trigrams</li>
  <li>Adding <code class="language-plaintext highlighter-rouge">START</code> and <code class="language-plaintext highlighter-rouge">STOP</code> markers to preserve context</li>
</ul>
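The steps above can be sketched roughly as follows. The stopword set is abbreviated so the snippet is self-contained (the project used NLTK's full list), and the helper names are illustrative:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to"}  # abbreviated; NLTK's list is much larger

def clean(review):
    review = re.sub(r"<[^>]+>", " ", review)       # strip HTML tags
    review = re.sub(r"https?://\S+", " ", review)  # strip URLs
    review = re.sub(r"[^a-zA-Z\s]", " ", review)   # strip special characters
    tokens = [t for t in review.lower().split() if t not in STOPWORDS]
    return ["START"] + tokens + ["STOP"]           # markers to preserve context

def ngrams(tokens, n):
    """Slide a window of size n to build unigrams, bigrams, or trigrams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = clean("<br/>The movie is EXCELLENT! See https://imdb.com")
# -> ['START', 'movie', 'excellent', 'see', 'STOP']
```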

<p>These steps reduced noise and made the linguistic structure suitable for modeling.</p>

<hr />

<h2 id="exploratory-data-analysis">Exploratory Data Analysis</h2>

<p>Before modeling, we visualized word frequency patterns.<br />
Positive reviews frequently used terms like <em>“excellent”</em> and <em>“fun,”</em> while negative reviews often used <em>“worst”</em> and <em>“poor.”</em><br />
These clear distinctions highlighted the potential of text-based sentiment modeling.</p>
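Frequency patterns like these can be surfaced with a plain word counter before any modeling. The toy reviews below are illustrative; the actual analysis ran over the full corpus:

```python
from collections import Counter

positive = ["excellent fun excellent", "fun film excellent"]
negative = ["worst poor film", "worst plot poor"]

def top_words(reviews, n=3):
    """Count word occurrences across a list of (already cleaned) reviews."""
    return Counter(w for r in reviews for w in r.split()).most_common(n)

pos_top = top_words(positive)  # -> [('excellent', 3), ('fun', 2), ('film', 1)]
```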

<p><img height="200" alt="Screenshot 2025-11-06 at 1 28 55 PM" src="/images/imbd/wordcloud.png" /></p>

<hr />

<h2 id="methodology">Methodology</h2>
<p>In order to capture relationships in text, we need to transform it into embeddings. Embeddings are numerical representations of words; words with similar meanings have similar representations.</p>

<p>There are different types of text embedding methods, some more complex than others. This <a href="https://www.geeksforgeeks.org/nlp/word-embeddings-in-nlp/">schematic</a> captures the 3 general categories.</p>

<p><img width="1220" height="601" alt="Screenshot 2025-11-07 at 10 07 44 AM" src="https://github.com/user-attachments/assets/bc3f0bc5-7c91-48ed-8cfe-30f3e2a002f5" /></p>

<p>We evaluated the following:</p>
<ul>
  <li><strong>Bag of Words (BoW)</strong></li>
  <li><strong>TF-IDF</strong></li>
  <li><strong>GloVe</strong></li>
  <li><strong>BERT embeddings</strong></li>
</ul>

<p>Each embedding was used across several <strong>machine learning models</strong>: Logistic Regression, K-Nearest Neighbors (KNN), Random Forest, a Deep Neural Network (DNN), and a fine-tuned BERT Transformer model.</p>

<hr />

<h3 id="logistic-regression">Logistic Regression</h3>

<p>The <strong>TF-IDF Logistic Regression</strong> model performed best, achieving <strong>89.83% accuracy</strong> and <strong>AUC = 0.96</strong>, followed closely by BoW (89.4% accuracy).<br />
Regularization experiments showed that <strong>L2 penalties</strong> (Ridge regression) were consistently optimal, suggesting that allowing all features to contribute led to more accurate predictions.</p>
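A minimal scikit-learn sketch of this winning configuration. The toy reviews and exact hyperparameter values here are illustrative, not the project's tuned settings; the real model was trained on the full 50K dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the 50K labelled reviews
reviews = ["excellent fun film", "worst poor acting",
           "fun and excellent", "poor worst plot"]
labels = ["positive", "negative", "positive", "negative"]

# L2 penalty (Ridge-style) was consistently optimal in the experiments
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(penalty="l2", C=1.0))
model.fit(reviews, labels)
pred = model.predict(["excellent fun"])[0]
```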

<hr />

<h3 id="k-nearest-neighbors">K-Nearest Neighbors</h3>

<p>KNN models were optimized through grid search for distance metrics and weighting schemes.<br />
The best-performing configuration used <strong>TF-IDF</strong> embeddings with distance-based weighting, achieving <strong>79.92% accuracy</strong> and <strong>AUC = 0.88</strong>.<br />
Performance decreased significantly for GloVe embeddings (54.66% accuracy), indicating poorer feature representation.</p>

<hr />

<h3 id="random-forest">Random Forest</h3>

<p>Random Forest models captured complex word interactions.<br />
With <strong>TF-IDF embeddings</strong>, accuracy reached <strong>84.02%</strong>, while <strong>BERT embeddings</strong> followed closely at <strong>83.62%</strong>.<br />
BoW and GloVe lagged behind due to their limited ability to represent nuanced sentiment context.</p>

<hr />

<h3 id="deep-neural-network-dnn">Deep Neural Network (DNN)</h3>

<p>A 3-layer DNN trained with TF-IDF embeddings achieved <strong>87.23% accuracy</strong> and <strong>AUC = 0.95</strong>, outperforming simpler models but not BERT.<br />
The model used dropout regularization and ReLU activations to prevent overfitting.</p>

<hr />

<h3 id="fine-tuned-bert-transformer">Fine-tuned BERT Transformer</h3>

<p>The pre-trained <strong>BERT</strong> model achieved the highest overall performance, with <strong>91.74% accuracy</strong> and <strong>AUC = 0.97</strong>.<br />
Fine-tuning for two epochs with a learning rate of 2e-5 and batch size of 16 yielded highly contextual sentiment understanding.</p>

<hr />

<h2 id="results-summary">Results Summary</h2>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Model</th>
      <th style="text-align: left">Embedding</th>
      <th style="text-align: left">Accuracy</th>
      <th style="text-align: left">AUC</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left">Logistic Regression</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">89.83%</td>
      <td style="text-align: left">0.96</td>
    </tr>
    <tr>
      <td style="text-align: left">KNN</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">79.92%</td>
      <td style="text-align: left">0.88</td>
    </tr>
    <tr>
      <td style="text-align: left">Random Forest</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">84.02%</td>
      <td style="text-align: left">0.91</td>
    </tr>
    <tr>
      <td style="text-align: left">DNN</td>
      <td style="text-align: left">TF-IDF</td>
      <td style="text-align: left">87.23%</td>
      <td style="text-align: left">0.95</td>
    </tr>
    <tr>
      <td style="text-align: left">BERT Transformer</td>
      <td style="text-align: left">BERT</td>
      <td style="text-align: left"><strong>91.74%</strong></td>
      <td style="text-align: left"><strong>0.97</strong></td>
    </tr>
  </tbody>
</table>

<p>TF-IDF consistently yielded the most reliable results across traditional models, while BERT surpassed them with its ability to capture deeper contextual meaning.</p>

<hr />

<h2 id="conclusion--future-work">Conclusion &amp; Future Work</h2>

<p>This analysis demonstrates that while traditional models like Logistic Regression and DNN perform well with TF-IDF features, <strong>contextual embeddings like BERT</strong> significantly enhance accuracy for nuanced text.</p>

<p>Future improvements may include:</p>
<ul>
  <li>Comparing performance with <strong>XLNet</strong> or <strong>GPT-based</strong> models</li>
  <li><strong>Data augmentation</strong> via synonym replacement or paraphrasing</li>
  <li>Combining ensemble models to enhance robustness</li>
</ul>

<p>Overall, the project highlights how effective feature representation is central to sentiment prediction performance.</p>

<hr />

<p><strong>Repository:</strong> <a href="https://github.com/anna-christina-mikr/sentiment-analysis">GitHub Project Link</a>
<strong>Dataset:</strong> IMDb 50K Movie Reviews</p>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="nlp" /><category term="data science" /><category term="sentiment analysis" /><category term="text classification" /><summary type="html"><![CDATA[Using NLP methods to predict sentiment of IMDb movie reviews]]></summary></entry><entry><title type="html">What Drives the Conversation? Exploring Energy News Through Topic Modeling</title><link href="https://anna-christina-mikr.github.io/project/2024/05/26/FERCtopicModelling.html" rel="alternate" type="text/html" title="What Drives the Conversation? Exploring Energy News Through Topic Modeling" /><published>2024-05-26T00:00:00+00:00</published><updated>2024-05-26T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2024/05/26/FERCtopicModelling</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2024/05/26/FERCtopicModelling.html"><![CDATA[<style>
    .container {
        display: flex;
        justify-content: center;
        align-items: center;
    }
    img {
        margin: 10px;
        max-width: 130%;
        height: auto;
    }

</style>

<h3 id="context">Context</h3>

<p>Today marks the longest government shutdown in U.S. history.
This shutdown has hit energy hard, with soaring prices and delays on energy production projects. At the same time, there is a huge conversation around meeting the electricity demands of power-hungry AI by building more data centers, and the impact that has on local communities. Dozens of other conversations are happening in the energy space, and this is what my group was interested in exploring: how the recent change in political power has shifted these conversations.</p>

<p>The data for this project is almost 5,000 articles from the same time window in 2024 and 2025, spanning February to July. We applied standard text preprocessing: removing stopwords and punctuation, and tokenizing the data.</p>

<h3 id="extract-articles-from-txt-files">Extract articles from txt files.</h3>

<p>Each raw news text file was compiled by concatenating multiple energy articles scraped from the web on the same day. As a result, many of these files have inconsistent spacing and formatting, as well as relics of the web interface. Luckily I had a list of all article titles at hand that I could utilize, but they were not always an exact match to the text. In order to segment the articles accurately, I implemented fuzzy string matching (Levenshtein distance) to align detected titles with known article headers, successfully linking about 2,859 of 4,751 titles. After segmentation, articles underwent standard NLP preprocessing: tokenization, stopword and punctuation removal, and exclusion of filler journalistic terms (e.g., “said,” “reported”), resulting in a clean, comparable corpus suitable for topic modeling.</p>
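The title-alignment step can be sketched with the standard library's difflib ratio, a close cousin of normalized Levenshtein similarity (the real pipeline may have used a dedicated Levenshtein package; the threshold and sample titles below are illustrative):

```python
from difflib import SequenceMatcher

def best_title_match(detected, known_titles, threshold=0.8):
    """Return the known title most similar to a detected header, or None."""
    def ratio(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    best = max(known_titles, key=lambda t: ratio(detected, t))
    return best if ratio(detected, best) >= threshold else None

titles = ["FERC approves new transmission rule", "Data centers strain the grid"]
match = best_title_match("FERC Approves New Transmission Rule.", titles)
# matches despite casing and punctuation differences
```

Matched titles then serve as segmentation boundaries between articles within each day's file.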

<div class="container">
  <img src="/images/article_covers/trigram.png" alt="Trigram Frequency in news data" width="600" height="400" />
</div>

<p>Here some of the top word trigrams are visualized. This shows which three-word phrases typically appear in these articles, which helps contextualize our topic analysis later.</p>

<h3 id="using-lda-as-a-baseline">Using LDA as a baseline</h3>
<p>Latent Dirichlet Allocation (not to be confused with Linear Discriminant Analysis) is a topic modelling technique that uses Bayesian modelling to cluster words into topics. In other words, it assumes that given a corpus of words, each word can be allocated to a discrete set of topics. As you can imagine, the number of topics chosen is important; it affects how cohesive the topics are.</p>

<p>A standard method for topic number selection is to pick a key metric, plot its value over a range of candidate topic numbers, and see where the value plateaus, i.e. where adding topics offers no further benefit. I chose <strong>Topic Coherence</strong>, a measure of how similar words within a topic are.</p>
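The topic-number sweep can be illustrated with a coherence function. gensim's <code>CoherenceModel</code> computes C_V directly; the stdlib sketch below uses the simpler UMass variant (a log co-occurrence ratio) just to show the idea, on a made-up toy corpus of word sets:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, docs):
    """UMass coherence: a co-occurrence score (a simpler relative of C_V).
    Values closer to 0 indicate a more coherent topic."""
    def doc_count(*words):
        return sum(all(w in d for w in words) for d in docs)
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += math.log((doc_count(wi, wj) + 1) / max(doc_count(wj), 1))
    return score

# Toy corpus: each document reduced to its set of tokens
docs = [{"grid", "transmission", "capacity"},
        {"grid", "transmission", "ferc"},
        {"tariff", "trade", "oil"}]
coherent = umass_coherence(["grid", "transmission"], docs)    # words co-occur
incoherent = umass_coherence(["grid", "oil"], docs)           # words never co-occur
```

In practice one would compute this for every candidate K and plot the average coherence across topics, as in the figure below.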

<div class="container">
  <img width="600" height="400" alt="Screenshot 2025-11-05 at 3 26 56 PM" src="https://github.com/user-attachments/assets/3c37cc77-97f5-4fde-a6f0-6626def89925" />
</div>

<p>Interestingly, the coherence score peaks at 12 topics and then steadily declines. The values are not dramatically different (a C_V score in the range of 0.5-0.59 is considered decent), but this is not the behavior we expect. The sharp decrease could be due to the fact that our corpus concerns energy news exclusively, so the topics discussed overlap heavily. In a different topic modelling scenario, e.g. modelling research papers, we would get very distinct topics such as History, Physics, Anthropology, and Political Science. Here, we reach a limit where no more specificity can be achieved: each additional topic is a redundant subtopic that offers no extra nuance. Our energy news corpus is lexically and conceptually dense, with few truly distinct themes; after 12 topics the model keeps slicing coherent topics into smaller, overlapping pieces, which coherence penalizes harshly.</p>

<p>Another way to evaluate our choice of topic number, however, is to inspect the topics visually:</p>

<iframe src="/images/ferc/lda-viz.html" width="200%" height="700" style="border:none; border-radius:10px;">
</iframe>

<p>This approach uncovers another aspect of LDA: topics may each be internally coherent, yielding a high coherence score, yet still overlap heavily with one another. This graph comes from the <strong>pyLDAvis</strong> package (which integrates with gensim); it’s a great tool to visualize topic similarity!
It uses the relevance of a word <em>w</em> to a topic <em>t</em>, calculated as:</p>

\[r(w, t \mid \lambda) = \lambda \, p(w \mid t) + (1 - \lambda) \, \frac{p(w \mid t)}{p(w)}\]

<p>Here, λ (<em>lambda</em>) controls the balance between how <strong>frequent</strong> and how <strong>distinctive</strong> a term is:</p>

<ul>
  <li>λ = 1 → ranks words by how common they are within the topic (general terms).</li>
  <li>λ = 0 → ranks words by how unique they are to that topic (specific terms).</li>
  <li>λ ≈ 0.6 (default) → balances both views for interpretability.</li>
</ul>
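The relevance formula is simple enough to compute directly; the probabilities below are made-up illustrations, not values from the fitted model:

```python
def relevance(p_w_given_t, p_w, lam=0.6):
    """pyLDAvis term relevance: lam * p(w|t) + (1 - lam) * p(w|t) / p(w)."""
    return lam * p_w_given_t + (1 - lam) * p_w_given_t / p_w

# At lam = 0, ranking reduces to "lift": p(w|t) / p(w).
common = relevance(p_w_given_t=0.02, p_w=0.02, lam=0.0)     # frequent everywhere -> lift 1
specific = relevance(p_w_given_t=0.02, p_w=0.001, lam=0.0)  # rare outside topic -> lift 20
```

At λ = 1 the same word in both cases would rank identically (both have p(w|t) = 0.02), which is exactly the frequency-vs-distinctiveness trade-off the slider controls.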

<p>We see here that a 12-topic LDA is insufficient: there is a lot of topic overlap; topic 4 and topic 5 are virtually the same, covering renewable energy projects and climate-change-related words. So, in order to balance high coherence with topic uniqueness, I chose 9 topics and plotted this instead:</p>

<iframe src="/images/ferc/ldaviz9.html" width="200%" height="750" style="border:none; border-radius:10px;">
</iframe>

<p>With nine topics we get much better separated clusters, so we are sticking with 9. Here are some words from the topics:</p>

<table>
  <thead>
    <tr>
      <th><strong>Topic ID</strong></th>
      <th><strong>Top Terms</strong></th>
      <th><strong>Likely Theme / Interpretation</strong></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>0</strong></td>
      <td>bill, house, republican, senate, tax, trump, committee, credit, democrat, budget</td>
      <td>Federal policy and legislation</td>
    </tr>
    <tr>
      <td><strong>1</strong></td>
      <td>data, plant, demand, coal, nuclear, center, price, july, generation, reactor</td>
      <td>Data centers, power generation, and nuclear demand</td>
    </tr>
    <tr>
      <td><strong>2</strong></td>
      <td>agency, rule, court, epa, environmental, trump, ferc, climate, order, law</td>
      <td>Regulation, courts, and environmental policy</td>
    </tr>
    <tr>
      <td><strong>3</strong></td>
      <td>ferc, transmission, order, utility, rate, miso, market, proposal, pjm, process</td>
      <td>Transmission and FERC regulatory orders</td>
    </tr>
    <tr>
      <td><strong>4</strong></td>
      <td>dam, water, river, pipeline, county, safety, area, lake, land, million</td>
      <td>Infrastructure, water resources, and safety</td>
    </tr>
    <tr>
      <td><strong>5</strong></td>
      <td>capacity, market, gw, mw, pjm, demand, load, price, generation, solar</td>
      <td>Grid capacity, markets, and generation</td>
    </tr>
    <tr>
      <td><strong>6</strong></td>
      <td>utility, transmission, electricity, solar, demand, line, system, wind, data, center</td>
      <td>Utilities, electrification, and renewables</td>
    </tr>
    <tr>
      <td><strong>7</strong></td>
      <td>lng, export, pipeline, wind, facility, global, offshore, million, terminal, permit</td>
      <td>LNG exports and offshore energy infrastructure</td>
    </tr>
    <tr>
      <td><strong>8</strong></td>
      <td>tariff, trump, lng, market, trade, oil, april, global, supply, import</td>
      <td>Trade, tariffs, and global energy markets</td>
    </tr>
  </tbody>
</table>

<p>What stands out in these topics is how the vocabulary of energy policy reflects both technical and political worlds.
The model doesn’t just surface buzzwords, it sketches the infrastructure of how the U.S. talks about power itself.</p>

<p>On one side, we see the hard engineering lexicon: words like transmission, capacity, load, generation, pipeline, and LNG anchor discussions in the physical systems that move electrons and fuel.
These are the operational conversations, the ones about keeping the grid stable, expanding transmission lines, or meeting demand from new data centers.</p>

<p>On the other side sits the language of governance and law: bill, house, committee, court, rule, EPA, order.
This vocabulary reveals that energy is never just a technical challenge; it’s also a legislative and judicial process.
Policy and legal terms intermingle with industry keywords, reminding us that decisions about pipelines or nuclear reactors are made as much in congressional committees and courtrooms as in control rooms.</p>

<p>That overlap — between ferc, trump, epa, rule, and market — captures a particularly modern dynamic:
the way energy debates intersect with climate regulation, trade policy, and even global supply chains.
In other words, the “energy transition” isn’t just happening in the grid — it’s happening in the rhetoric that shapes how we understand and argue about it.</p>

<p>Beyond static interpretation, these topics also open the door to temporal comparison.</p>

<iframe src="https://public.tableau.com/views/newsarticletopicdistribution&#47;Dashboard1?:showVizHome=no&amp;:embed=true" width="200%" height="800" style="border:none; border-radius:10px;">
</iframe>

<p>The most prominent theme, our federal and legislative energy policy topic, shows a noticeable decline in frequency over time, dropping from nearly 0.20 to about 0.15 in share. This pattern makes sense when viewed in the context of the 2025 administration change. In the months leading up to January, media attention was dominated by speculation around how new leadership might reshape national energy strategy. After the transition, however, discussion slowed, reflecting a pause as the administration’s policies began to take shape.</p>

<p>Meanwhile, coverage related to data centers and nuclear generation increased significantly between 2024 and 2025. This rise closely follows the surge in AI adoption and the corresponding spike in electricity demand. As questions about how to power the next wave of AI dominated headlines, nuclear energy began to reemerge as a central part of that conversation. With the new administration openly backing expanded nuclear production and new projects nationwide, the media narrative shifted toward emphasizing capacity growth, energy security, and technological reliability.</p>

<p>In contrast, topics focused on environmental policy declined sharply. This suggests that climate and sustainability have taken a quieter role in the national energy dialogue, likely displaced by more immediate economic and industrial priorities. Similarly, mentions of utilities, renewables, LNG, and offshore development all decreased, reflecting a broader pivot away from decarbonization and toward supply-side concerns such as grid stability and generation capacity.</p>

<p>Other areas, however, have become more prominent. Coverage surrounding FERC orders rose in intensity, underscoring the growing regulatory attention to managing grid reliability and adapting to new forms of generation. Water infrastructure also gained visibility, likely reflecting growing concern about the resource strain that large-scale data centers and energy production place on local ecosystems.</p>

<p>Finally, discussions of tariffs and trade policy saw renewed activity, particularly around Trump-era tariffs and their implications for energy technology imports and domestic production.</p>

<p>Taken together, these shifts tell a clear story: the public and media conversation has turned from long-term environmental ambitions toward short-term energy security and infrastructure resilience. The period from 2024 to 2025 marks not just a political transition, but a fundamental reframing of the energy narrative — one now centered on power capacity, regulation, and the balance between technological ambition and sustainable supply.</p>

<h3 id="final-thoughts">Final Thoughts</h3>

<p>This analysis is still in progress; some next steps will include:</p>
<ul>
  <li>Performing sentiment analysis: There are a lot of overlapping topics in 2024 and 2025, it would be interesting to see if they are spoken about in a different light.</li>
  <li>Creating a web app: a dashboard with more information about the energy space for a broader audience, potentially including a RAG chat bot that answers frequent energy related questions.</li>
</ul>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="topic modelling" /><category term="data science" /><category term="analysis" /><summary type="html"><![CDATA[How does the media shape the conversation around energy?]]></summary></entry><entry><title type="html">US Education Trends</title><link href="https://anna-christina-mikr.github.io/project/2024/03/15/USEducation.html" rel="alternate" type="text/html" title="US Education Trends" /><published>2024-03-15T00:00:00+00:00</published><updated>2024-03-15T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2024/03/15/USEducation</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2024/03/15/USEducation.html"><![CDATA[]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="education" /><category term="data-visualization" /><category term="R" /><summary type="html"><![CDATA[]]></summary></entry><entry><title type="html">Meteor Observation by Forward Scatter Radar</title><link href="https://anna-christina-mikr.github.io/project/2023/04/23/MeteorDetection.html" rel="alternate" type="text/html" title="Meteor Observation by Forward Scatter Radar" /><published>2023-04-23T00:00:00+00:00</published><updated>2023-04-23T00:00:00+00:00</updated><id>https://anna-christina-mikr.github.io/project/2023/04/23/MeteorDetection</id><content type="html" xml:base="https://anna-christina-mikr.github.io/project/2023/04/23/MeteorDetection.html"><![CDATA[<h3 id="overview">Overview</h3>
<p>This project explores how meteors can be detected using <strong>forward scatter radar</strong>, a method that captures radio reflections from ionized trails left by meteors entering Earth’s atmosphere. The system was built using a <strong>Yagi antenna</strong> and an <strong>SDR (Software Defined Radio)</strong> receiver tuned to the GRAVES transmitter frequency at 143.05 MHz.</p>

<h3 id="motivation">Motivation</h3>
<p>After radar’s use in World War II, researchers discovered that meteors produced “false echoes” on radar screens. These reflections inspired a new way of observing meteor activity, by detecting their ionized trails through radio signals rather than optical telescopes. The project aimed to recreate this phenomenon with affordable, open-source tools.</p>

<p>This was my final project for my bachelor’s degree in Physics back in 2023! I really enjoyed it and wanted to revisit it; at the time I did not have much coding experience, but I think this project could be streamlined using machine learning. Although sadly I no longer have the data, I want to go through how it could have been improved had computer vision been incorporated. But first, here are the design and results:</p>

<hr />

<h3 id="system-design">System Design</h3>

<h4 id="antenna-construction">Antenna Construction</h4>
<p>The antenna used was a <strong>Yagi-Uda</strong> design optimized for 143.05 MHz (λ = 2.097 m). Its copper elements were cut and mounted on a wooden boom to ensure non-conductivity and stability.<br />
The setup included:</p>
<ul>
  <li><strong>Reflector</strong>, <strong>Dipole</strong>, and <strong>Director</strong> elements to enhance forward gain.</li>
  <li><strong>Coaxial connection</strong> to the SDR dongle for signal capture.</li>
  <li>
    <p>Orientation toward <strong>GRAVES</strong> in France, with calculated elevation for optimal reflection angles.</p>

    <p><img height="700" alt="Screenshot 2025-10-28 at 1 04 07 PM" src="https://github.com/user-attachments/assets/18bf384b-0338-4543-956d-3fe1200b7157" /></p>
  </li>
</ul>

<h4 id="software-setup">Software Setup</h4>
<p>Signal processing was performed in <strong>SDRsharp</strong>, displaying intensity over time as a <em>waterfall plot</em>. Automated screen captures every 15 seconds enabled continuous observation from November 5–17. Meteor echoes were later classified manually as <strong>underdense</strong> or <strong>overdense</strong> based on signal shape and duration.</p>

<hr />

<h3 id="results">Results</h3>
<p>This is what the signal looks like. The measurements are centered around the GRAVES frequency. These waterfall plots show the signal evolution over time, which allows us to extract useful information about the meteor: its composition, velocity, etc.
<img width="500" alt="meteor scatter" src="/images/meteor/waterfall1.png" /></p>

<h4 id="echo-classification">Echo Classification</h4>
<ul>
  <li><strong>Underdense trails</strong>: Faint, short-lived reflections (&lt;1 s), often appearing as thin horizontal lines on the waterfall diagram.</li>
  <li><strong>Overdense trails</strong>: Brighter and longer-lasting (up to 2 s), with diffraction-like signal fluctuations due to interference between reflection points.
Here is an example of an underdense vs. an overdense signal:</li>
</ul>

<p float="left">
  <img src="/images/meteor/waterfall2.png" alt="meteor scatter" width="300" />
  <img src="/images/meteor/waterfall3.png" alt="meteor scatter" width="300" />
</p>

<p>Head echoes — reflections from the plasma surrounding the meteor head — were occasionally detected, allowing direct velocity calculations. A Doppler shift of 800 Hz, for example, corresponded to a <strong>radial velocity of 838.8 m/s</strong>.</p>
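The quoted velocity follows from the Doppler relation v = f_d·λ/2 (treating the geometry as monostatic, which is the approximation the numbers imply; λ is the GRAVES wavelength from the antenna section). A quick check:

```python
c = 3.0e8             # speed of light, m/s (approximate)
f_graves = 143.05e6   # GRAVES transmitter frequency, Hz

wavelength = c / f_graves               # ~2.097 m, matching the Yagi design
f_doppler = 800.0                       # observed head-echo shift, Hz
v_radial = f_doppler * wavelength / 2   # ~838.8 m/s, as reported above
```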

<h4 id="detection-patterns">Detection Patterns</h4>
<p>Detection rates peaked around <strong>November 11–12</strong>, coinciding with the <strong>Taurid meteor shower</strong>. Most echoes appeared in the early morning hours, consistent with the Earth’s rotation — when the observation region faces the orbital direction, increasing meteor entry rates.</p>

<h3 id="hourly-rates">Hourly Rates</h3>
<p><img width="1389" height="590" alt="download" src="https://github.com/user-attachments/assets/c18c72c2-3d2a-4e67-9ba7-14b1aa3d072a" /></p>

<hr />

<h3 id="discussion">Discussion</h3>
<p>The results confirm that forward scatter radar can successfully detect meteor trails with relatively simple equipment.<br />
Signal characteristics matched theoretical expectations, with observed Doppler shifts corresponding to typical meteor velocities (11–72 km/s). Limitations were mainly due to SDRsharp’s resolution and manual analysis — future improvements could include <strong>automated image detection</strong> and <strong>machine learning–based signal classification</strong>.</p>

<hr />

<h3 id="conclusion">Conclusion</h3>
<p>This experiment demonstrates that meaningful meteor observations are achievable using low-cost SDR hardware. Beyond its technical outcomes, it shows how data-driven methods can reveal invisible celestial phenomena — translating radio noise into evidence of high-speed interplanetary travel.</p>

<hr />

<p><strong>Keywords:</strong> SDR, Radio Astronomy, Meteor Detection, Doppler Shift, Antenna Design<br />
<strong>Tools:</strong> SDRsharp, Auto Screen Capture, Python (for data post-processing)</p>]]></content><author><name>Anna Micros</name></author><category term="project" /><category term="physics" /><category term="signal-processing" /><category term="sdr" /><category term="radar" /><summary type="html"><![CDATA[A hands-on exploration of meteor detection using a custom-built Yagi antenna and a software-defined radio system.]]></summary></entry></feed>