Adding Data Sources

The bedrock of projectpiglet.com is the data. Nothing on the site can exist without it, and it takes quite a bit of it (although not as much as you might think). Today, Piglet tracks roughly one million separate subjects and has ingested 18.5 million comments on 1.5 million stories from 800,000 authors. When projectpiglet.com started re-collecting data a year ago, it started with a mere 4 million comments.

Four Million, That’s A Lot!

You’re not wrong. Four million comments is quite a bit. However, when you’re trying to track more than one million subjects, four million is only a drop in the bucket. In fact, the 18.5 million comments currently ingested are still far too few. For instance, on the site today, we can only accurately track popular discussion of around 10% of the S&P 500 and the top 25 or so cryptocurrencies. That’s just too small.

So what would be enough data? Similar to many things in life:

Everything in excess! To enjoy the flavor of life, take big bites. Moderation is for monks.

– Robert A. Heinlein via Jubal Harshaw in Stranger in a Strange Land

But in all seriousness, it depends on whether the comments cover the subject matter we want to track. For example, if the comments only ever discuss “Trump”, then projectpiglet.com won’t be able to track anything related to “Bernie” or “Clinton.”

How to Add More Data

The neat part about projectpiglet.com is that it was designed from the start so that you simply point a stream of comments at an ingestion point and BAM, new data source.

It really does only take a few minutes. The ingestion point simply takes in API calls with the following fields:

  • Author – Required
  • Statement/Comment – Required
  • Source – Required
  • Related_Story – Optional
  • Timestamp – Optional

That’s it. From there, Piglet can do everything it needs to ingest the data and provide useful insights. It doesn’t even need a related story, as Piglet will parse the comment itself for stories as well.
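To make the field list concrete, here is a minimal sketch in Python of how a single comment might be packaged before being sent to the ingestion point. The real API is not documented in this post, so the function name, the lowercase JSON keys, and the ISO 8601 timestamp format are all assumptions made for illustration. Only the five fields and their required/optional status come from the list above.

```python
import json
from datetime import datetime, timezone
from typing import Optional

# The three fields the post lists as required. The key names are assumed.
REQUIRED_FIELDS = ("author", "statement", "source")


def build_comment_payload(
    author: str,
    statement: str,
    source: str,
    related_story: Optional[str] = None,
    timestamp: Optional[datetime] = None,
) -> dict:
    """Build a dict for one comment (hypothetical helper, not Piglet's real API)."""
    payload = {"author": author, "statement": statement, "source": source}

    # Reject the payload if any required field is missing or blank.
    missing = [name for name in REQUIRED_FIELDS if not str(payload[name] or "").strip()]
    if missing:
        raise ValueError("missing required field(s): " + ", ".join(missing))

    # Optional fields are only included when the caller supplies them.
    if related_story is not None:
        payload["related_story"] = related_story
    if timestamp is not None:
        # Assumed wire format: ISO 8601 in UTC.
        payload["timestamp"] = timestamp.astimezone(timezone.utc).isoformat()
    return payload


if __name__ == "__main__":
    example = build_comment_payload(
        author="alice",
        statement="Bitcoin looks strong this week",
        source="example-forum",
        timestamp=datetime(2018, 1, 1, tzinfo=timezone.utc),
    )
    # The JSON body that would be sent to the (unspecified) ingestion endpoint.
    print(json.dumps(example, indent=2))
```

The sketch leaves out the network call on purpose, since the endpoint URL and any authentication scheme are not given here. A scraper or stream consumer would serialize each payload and send it to whatever endpoint Piglet exposes.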

The hard part has really just been keeping the data limited at this point.

Wait, you’re limiting data?

Yes. Unfortunately, it does cost money to add data sources. Either there needs to be a web scraper that feeds data into the system, or we need to pay for data streams (such as Twitter). Further, Piglet needs to process the data once it’s sent and then store it, both of which cost money, albeit not too much.

To date, we’ve kept data ingestion to a minimum, but enough to prove Piglet’s worth. As we grow the user base, we intend to grow the data sources. There’s really no reason we can’t crawl the web in the end. However, until we prove its worth, we don’t have the funds to waste.

That being said, Piglet is looking to grow!

If you’re interested in using ProjectPiglet.com, use the coupon code: pigletblog2018

It’s 25% off for 6 months!
