<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>paidContent &#187; data-science</title>
	<atom:link href="http://paidcontent.org/tag/data-science/feed/" rel="self" type="application/rss+xml" />
	<link>http://paidcontent.org</link>
	<description>The economics of digital content</description>
	<lastBuildDate>Sun, 19 May 2013 22:28:22 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='paidcontent.org' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://0.gravatar.com/blavatar/89ee7e1250b4095eefb87d28e6e64947?s=96&#038;d=http%3A%2F%2Fs2.wp.com%2Fi%2Fbuttonw-com.png</url>
		<title>paidContent &#187; data-science</title>
		<link>http://paidcontent.org</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://paidcontent.org/osd.xml" title="paidContent" />
	<atom:link rel='hub' href='http://paidcontent.org/?pushpress=hub'/>
		<item>
		<title>MIT researcher says he can predict Twitter trends</title>
		<link>http://gigaom.com/2012/11/01/mit-researcher-says-he-can-predict-twitter-trends/</link>
		<comments>http://gigaom.com/2012/11/01/mit-researcher-says-he-can-predict-twitter-trends/#comments</comments>
		<pubDate>Thu, 01 Nov 2012 18:06:11 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[algorithms]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[machine-learning]]></category>
		<category><![CDATA[predictive analytics]]></category>
		<category><![CDATA[social-media]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=579682</guid>
		<description><![CDATA[An MIT researcher says he has created an algorithm that can identify Twitter trends hours before the service can itself. If the algorithm works as he says, it could help Twitter -- and many more companies -- make a lot of money.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=paidcontent.org&#038;blog=33319749&#038;post=220031&#038;subd=gigaompaidcontent&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>A researcher at MIT claims to have developed an algorithm that can accurately predict what topics will trend on Twitter. But Twitter being a relatively minor business in the grand scheme of things, the algorithm might end up being more useful elsewhere, predicting stock prices, ticket sales and other dynamically changing quantities.</p>
<p>According to <a href="http://web.mit.edu/press/2012/predicting-twitter-trending-topics.html">a release from the MIT News Office</a>, Associate Professor Devavrat Shah says his model has been 95 percent accurate during testing and has been predicting trends hours before they appear on Twitter&#8217;s list. The algorithm incorporates a new approach to machine learning that compares real-time data with historical data and predicts outcomes based on past events that most closely align with the current situation. So, rather than analyzing a topic&#8217;s chances of trending equally against the entire historical corpus of topics, it will assign more weight to topics whose paths followed similar trajectories up the ranks of top trends.</p>
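<p>The weighting idea described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Shah&#8217;s published algorithm: the squared-distance measure, the exponential weighting, the <code>gamma</code> parameter and the toy activity counts are all hypothetical choices made for the example.</p>

```python
import math
from typing import Sequence, Tuple

Trajectory = Sequence[float]  # activity counts per time step (made-up units)
History = Sequence[Tuple[Trajectory, bool]]  # (past trajectory, did it trend?)


def distance(a: Trajectory, b: Trajectory) -> float:
    """Squared Euclidean distance between two equal-length trajectories."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def trend_score(observed: Trajectory, history: History, gamma: float = 0.1) -> float:
    """Weighted share of the vote held by past topics that trended.

    Past topics whose trajectories sit closer to the observed one get
    exponentially more weight (an assumed weighting, chosen for simplicity).
    """
    trended_weight = 0.0
    total_weight = 0.0
    for past, trended in history:
        weight = math.exp(-gamma * distance(observed, past))
        total_weight += weight
        if trended:
            trended_weight += weight
    if total_weight == 0.0:
        return 0.5  # every weight underflowed, so there is no evidence either way
    return trended_weight / total_weight


# Hypothetical history: two topics that trended, two that did not.
history = [
    ([1, 2, 4, 8], True),
    ([1, 3, 5, 9], True),
    ([2, 2, 2, 2], False),
    ([3, 2, 2, 1], False),
]

rising_score = trend_score([1, 2, 5, 8], history)
flat_score = trend_score([2, 2, 2, 1], history)
print(rising_score, flat_score)
```

<p>A score near 1 means the observed topic looks most like past topics that went on to trend, and a score near 0 means it looks like topics that fizzled. A production system would also need a decision threshold and far more than four historical examples.</p>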
<p>And Twitter is certainly interested in the research. A company spokesperson emailed me to point out that Shah&#8217;s graduate research assistant, Stanislav Nikolov, is a Twitter employee.</p>
<div id="attachment_579769" class="wp-caption alignleft" style="width: 310px"><a href="http://gigaom2.files.wordpress.com/2012/11/trends.jpg"><img  title="trends" alt="" src="http://gigaom2.files.wordpress.com/2012/11/trends.jpg?w=300&#038;h=217" height="217" width="300" class="size-medium wp-image-579769" /></a><p class="wp-caption-text">Imagine knowing these topics before Twitter does.</p></div>
<p>However, the algorithm&#8217;s level of accuracy and speed would have to translate to a much larger and more complex stage &#8212; Twitter&#8217;s real-life firehose and stockpile of historical tweets &#8212; if the company were to use its predictions to charge premiums for ads associated with certain topics, as Shah suggests. Advertisers might not be happy to pay premium rates for topics that fizzle out before ever becoming top trends (although a tiered rate system based on the model&#8217;s confidence or, perhaps, projected ranking among top trends could work). Thus far, the algorithm has been trained using a set of 400 topics, half of which trended and half of which did not.</p>
<p>Shah thinks it&#8217;s a great fit for Twitter data because the data is relatively clean and he has found a strong correlation between past and future activity. Other historical data sets might be messier or noisier than Twitter&#8217;s, which would make it much more difficult to filter out extraneous data and discern the real factors that lead to a particular result. However, even Twitter has presented research showing, in the case of its search engine at least, how the sheer volume of data it receives and the speed at which it comes in <a href="http://gigaom.com/cloud/twitter-shows-when-we-tweet-and-explains-why-its-search-sucks/">can make it difficult to accurately predict what someone wants to see</a>.</p>
<p>The good news, though, for anyone willing to give Shah&#8217;s algorithm a try is that it&#8217;s designed to process data in parallel across scale-out systems like those used by large web companies. Therefore, training it and then running it in production across a voluminous data set <a href="http://gigaom.com/cloud/skytree-intros-machine-learning-for-the-masses/">won&#8217;t run into the same obstacles traditionally faced by machine learning algorithms</a> as data sizes increase. And there are potentially more lucrative and rewarding endeavors that could benefit from this type of predictive power: Shah suggests stock markets, movie ticket sales and public transportation as possibilities, but others might include combating cybercrime by identifying threats earlier or predicting the severity of disease outbreaks.</p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-932215p1.html">Shutterstock user turtleteeth</a>.</em></p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=paidcontent.org&#038;blog=33319749&#038;post=220031&#038;subd=gigaompaidcontent&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/PaidContent_RSS_300x250&#038;sz=300x250&#038;c=503340"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/PaidContent_RSS_300x250&#038;sz=300x250&#038;c=503340" /></a></p>]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/11/01/mit-researcher-says-he-can-predict-twitter-trends/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/11/twitter-network-data.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/11/twitter-network-data.jpg?w=150" medium="image">
			<media:title type="html">twitter network data</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/11/trends.jpg?w=300" medium="image">
			<media:title type="html">trends</media:title>
		</media:content>
	</item>
		<item>
		<title>Forget your fancy data science, try overkill analytics</title>
		<link>http://gigaom.com/2012/09/21/forget-your-fancy-data-science-try-overkill-analytics/</link>
		<comments>http://gigaom.com/2012/09/21/forget-your-fancy-data-science-try-overkill-analytics/#comments</comments>
		<pubDate>Fri, 21 Sep 2012 17:00:24 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[big-data]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[kaggle]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=565355</guid>
		<description><![CDATA[Carter S. won his first-ever Kaggle competition -- our own GigaOM WordPress Challenge -- using a brute force method of data science he calls overkill analytics. Rather than spend untold hours perfecting complex models, Carter used simple algorithms and let powerful microprocessors do the rest.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=paidcontent.org&#038;blog=33319749&#038;post=218093&#038;subd=gigaompaidcontent&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Meet Carter S. He used to be a lawyer, but now he writes predictive models for an insurance company. Admittedly green in certain new or advanced modeling methods, he prefers to use simple algorithms and throw as much computing power as possible at problems. He <a href="http://www.overkillanalytics.net/about-overkill-analytics/">calls the technique &#8220;overkill analytics,&#8221;</a> and it just won him his first contest on Kaggle, defeating more than 80 other competitors in the <a href="http://www.kaggle.com/c/predict-wordpress-likes">GigaOM WordPress Challenge: Splunk Innovation Prospect</a> <em>(see disclosure)</em>.</p>
<p>Not only was this Carter&#8217;s first win, it was also his first contest. You can <a href="http://www.overkillanalytics.net/kaggles-wordpress-challenge-the-like-graph/">read the detailed explanation of his victory</a> on his blog, but the gist is that he didn&#8217;t get too involved with complex social graphing to determine relationships or natural language processing to determine topics readers liked. He figured out that most of what people liked came from blogs they&#8217;ve already read, and that the vast majority of posts people liked fell within a three-node radius on a simple social graph.</p>
<p>Statistically speaking, he fit a <a href="http://en.wikipedia.org/wiki/Generalized_linear_model">generalized linear model</a>, followed by a <a href="http://en.wikipedia.org/wiki/Random_forest">random forest model</a>, and averaged the results. &#8220;I&#8217;m not sure it&#8217;s a very unique technique,&#8221; he told me, &#8220;but it&#8217;s certainly a very powerful one.&#8221;</p>
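<p>Averaging two models&#8217; predictions, as described above, might look something like the following minimal sketch. It is not Carter&#8217;s code: <code>glm_stand_in</code> and <code>forest_stand_in</code> are hypothetical placeholders with made-up weights and rules, standing in for a fitted generalized linear model and a fitted random forest.</p>

```python
import math
from statistics import fmean
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], float]  # feature vector -> P(user likes post)


def glm_stand_in(features: Sequence[float]) -> float:
    """Logistic model with made-up weights (placeholder for a fitted GLM)."""
    weights, bias = (1.5, 0.8), -1.0
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def forest_stand_in(features: Sequence[float]) -> float:
    """Share of made-up threshold rules voting 'like' (placeholder for a random forest)."""
    votes = [
        features[0] > 0.5,
        features[1] > 0.2,
        features[0] + features[1] > 1.0,
    ]
    return sum(votes) / len(votes)


def blend(models: Sequence[Model], features: Sequence[float]) -> float:
    """Unweighted average of the probability each model assigns to one example."""
    return fmean(model(features) for model in models)


example = (1.0, 0.0)  # hypothetical two-feature example
blended = blend([glm_stand_in, forest_stand_in], example)
print(round(blended, 3))
```

<p>The appeal of a simple average is that neither model needs much tuning: errors the two models make in different places tend to partly cancel out.</p>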
<div id="attachment_565426" class="wp-caption aligncenter" style="width: 590px"><a href="http://gigaom2.files.wordpress.com/2012/09/blog-wordpress-centralitylift-580x295.jpg"><img  title="blog-wordpress-centralitylift-580x295" src="http://gigaom2.files.wordpress.com/2012/09/blog-wordpress-centralitylift-580x295.jpg?w=708" alt=""   class="size-full wp-image-565426" /></a><p class="wp-caption-text">Source: Overkill Analytics</p></div>
<p>And therein lies the beauty of overkill analytics, a term that Carter might have coined, but that appears to be catching on &#8212; especially in the world of web companies and big data. Carter says he doesn&#8217;t want to spend a lot of time fine-tuning models, writing complex algorithms or pre-analyzing data to make it work for his purposes. Rather, he wants to utilize some simple models, reduce things to numbers and process the heck out of the data set on as much hardware as is possible.</p>
<p>It&#8217;s not about big data so much as it is about big computing power, he said. There&#8217;s still work to be done on smaller data sets like those most of the world deals with, but Hadoop clusters and other architectural advances let you do more to that data in less time than was previously possible. Now, Carter said, as long as you account for the effects of overprocessing data, you can create a black-box-like system and run every combination of simple techniques on data until you get the most accurate answer.</p>
<p>I <a href="http://gigaom.com/data/5-ideas-to-help-everyone-make-the-most-of-big-data/">wrote about the same general theory recently</a> in explaining why Sparked.com&#8217;s Daniel Wiesenthal believes that big data (i.e., lots and lots of data combined with new storage and processing technologies) improves the practice of data science (i.e., the application of statistical techniques to data). The gist of his theory is that although complex models are great for small data sets, simple models can close the accuracy gap when applied to large data sets. Combine that with infrastructure that can process a lot of data relatively fast and support a wide variety of jobs, and you have a simpler, faster and equally effective method.</p>
<p>Still, Carter said he didn&#8217;t get involved in Kaggle just to prove the effectiveness of overkill analytics. He does hope to get exposed to new data science techniques that haven&#8217;t yet caught on in the insurance industry, and he also wants to make a name for himself. When you work for a company with little turnover, he said, your professional network doesn&#8217;t grow too much, but doing Kaggle competitions is a great way to meet other data scientists &#8212; and <a href="http://gigaom.com/data/can-kaggle-make-data-science-a-spectator-sport/">winning is a great way to earn respect</a>.</p>
<p>Ali Ahmad (username Xali) won the separate Splunk Innovation portion of the contest. According to a statement from Splunk, he &#8220;used Splunk&#8217;s built in statistical and visualization features to map out the relationship between blogs containing YouTube videos with those that are most likely to be viral, as measured by likes and shares. As a bonus, he fed the data into an app to view the YouTube videos most commonly liked and shared via WordPress blogs!&#8221;</p>
<p><em><strong>Disclosure</strong>: Automattic, maker of WordPress.com, is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, GigaOM. Om Malik, founder of GigaOM, is also a venture partner at True.</em></p>
<p><em>Feature image courtesy of <a href="http://www.shutterstock.com/gallery-674152p1.html">Shutterstock user nasirkhan</a>.</em></p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=paidcontent.org&#038;blog=33319749&#038;post=218093&#038;subd=gigaompaidcontent&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/PaidContent_RSS_300x250&#038;sz=300x250&#038;c=600517"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/PaidContent_RSS_300x250&#038;sz=300x250&#038;c=600517" /></a></p>]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/2012/09/21/forget-your-fancy-data-science-try-overkill-analytics/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/09/shutterstock_86909912.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/09/shutterstock_86909912.jpg?w=150" medium="image">
			<media:title type="html">workflow</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/09/blog-wordpress-centralitylift-580x295.jpg" medium="image">
			<media:title type="html">blog-wordpress-centralitylift-580x295</media:title>
		</media:content>
	</item>
		<item>
		<title>GigaOM Data Challenge: Predict which stories get read, win $10K</title>
		<link>http://gigaom.com/cloud/gigaom-meets-kaggle-predict-wholl-read-what-win-10k/</link>
		<comments>http://gigaom.com/cloud/gigaom-meets-kaggle-predict-wholl-read-what-win-10k/#comments</comments>
		<pubDate>Wed, 20 Jun 2012 16:30:18 +0000</pubDate>
		<dc:creator>Derrick Harris</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[big-data]]></category>
		<category><![CDATA[data-science]]></category>
		<category><![CDATA[kaggle]]></category>
		<category><![CDATA[splunk]]></category>

		<guid isPermaLink="false">http://gigaom.com/?p=534422</guid>
		<description><![CDATA[In publishing, there's a constant struggle to determine who'll read what posts, what the ideal headline might be and when the best time to publish is. GigaOM is teaming with Splunk to find some answers via a Kaggle competition worth a total of $25,000 in prizes.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=paidcontent.org&#038;blog=33319749&#038;post=211988&#038;subd=gigaompaidcontent&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p><a href="http://gigaom2.files.wordpress.com/2012/06/shutterstock_53433448.jpg"><img title="shutterstock_53433448" src="http://gigaom2.files.wordpress.com/2012/06/shutterstock_53433448.jpg?w=300&#038;h=200" alt="" width="300" height="200" class="alignleft size-medium wp-image-534438"></a>In publishing, analytics matter a lot. There’s a constant struggle to determine who will read what posts or articles, what the ideal headline might be and when publishing makes the most sense. That’s why GigaOM is teaming with <a href="http://www.splunk.com/">Splunk</a> to help find some answers.</p>
<p>We’re <a href="https://www.kaggle.com/c/predict-wordpress-likes">hosting a competition</a> on <a href="http://kaggle.com">Kaggle’s data science platform</a> to find the best models for predicting likely readership across the WordPress <em>(see disclosure)</em> ecosystem of blogs. Here are the details:</p>
<blockquote><p>The challenge is to predict whether a particular user will like a particular WordPress blog post.  The data consists of eight weeks of posts collected by WordPress, along with anonymized user responses to each post.  This challenge is an interesting mix of natural language processing (the raw blog posts) and metadata on the blogs and users. Contestants can download the data and submit prediction through the Kaggle platform, but a <strong>new feature for this competition</strong> is that they will also have free access to a Splunk server containing all the data, which they can employ for data exploration, visualization, feature extraction and modeling.</p></blockquote>
<p>Aside from offering resources to work on the data, Splunk is also putting up $25,000 in prize money. The winning model will receive $10,000, second place $5,000, third place $3,000 and fourth place $2,000.</p>
<p>There’s also a $5,000 Splunk Innovation Prize for the most innovative use of data science, whether that comes in the form of a visualization, app, business model, you name it. Entries for the Innovation Prize can be submitted through <a href="http://gigaom.com/cloud/kaggle-is-now-crowdsourcing-data-science-creativity/">Kaggle’s new Prospect platform</a>. Winners for both competitions will be announced at <a href="http://event.gigaom.com/mobilize/?utm_source=cloud&amp;utm_medium=editorial&amp;utm_campaign=intext&amp;utm_term=211988+gigaom-meets-kaggle-predict-wholl-read-what-win-10k&amp;utm_content=dharrisstructure">GigaOM Mobilize</a> in September.</p>
<p>You can find out <a href="https://www.kaggle.com/c/predict-wordpress-likes">more about the competition here</a>. Good luck!</p>
<p><em><strong>Disclosure:</strong> Automattic, maker of WordPress.com, is backed by True Ventures, a venture capital firm that is an investor in the parent company of this blog, GigaOM. Om Malik, founder of GigaOM, is also a venture partner at True.</em></p>
<p><em>Image courtesy of <a href="http://www.shutterstock.com/gallery-421981p1.html">Shutterstock user sukiyaki</a>.</em></p>
<br />  <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=paidcontent.org&#038;blog=33319749&#038;post=211988&#038;subd=gigaompaidcontent&#038;ref=&#038;feed=1" width="1" height="1" /><p><a href="http://pubads.g.doubleclick.net/gampad/jump?iu=/1008864/PaidContent_RSS_300x250&#038;sz=300x250&#038;c=478056"><img src="http://pubads.g.doubleclick.net/gampad/ad?iu=/1008864/PaidContent_RSS_300x250&#038;sz=300x250&#038;c=478056" /></a></p>]]></content:encoded>
			<wfw:commentRss>http://gigaom.com/cloud/gigaom-meets-kaggle-predict-wholl-read-what-win-10k/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:thumbnail url="http://gigaom2.files.wordpress.com/2012/06/shutterstock_53433448.jpg?w=150" />
		<media:content url="http://gigaom2.files.wordpress.com/2012/06/shutterstock_53433448.jpg?w=150" medium="image">
			<media:title type="html">shutterstock_53433448</media:title>
		</media:content>

		<media:content url="http://0.gravatar.com/avatar/9e48ffa0913f65c577727457dd63023f?s=96&#38;d=retro&#38;r=PG" medium="image">
			<media:title type="html">dharrisstructure</media:title>
		</media:content>

		<media:content url="http://gigaom2.files.wordpress.com/2012/06/shutterstock_53433448.jpg?w=300" medium="image">
			<media:title type="html">shutterstock_53433448</media:title>
		</media:content>
	</item>
	</channel>
</rss>
